Creating Visualizations of Intake-ESM Catalogs

A common first task when working with a new dataset is figuring out what data is available. This is especially true for climate ensembles with several components and output time frequencies (e.g., the Community Earth System Model Large Ensemble, CESM-LE). Here, we will examine different methods of investigating such a catalog.

Imports

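Here, we will use intake-esm and graphviz. On conda-forge, the package named graphviz provides only the Graphviz executables, while python-graphviz provides the Python module imported below, so both are listed in the install command (which includes jupyterlab too!):

conda install -c conda-forge jupyterlab intake-esm graphviz python-graphviz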

Once you install these packages, open jupyterlab!

import intake
from graphviz import Digraph

Read in intake-esm catalog

col = intake.open_esm_datastore(
    'https://raw.githubusercontent.com/NCAR/cesm-lens-aws/master/intake-catalogs/aws-cesm1-le.json'
)

Typically, the first step is to read in the dataframe containing the metadata, but from the table alone it can be tough to see what data is actually there.

col.df
     variable  long_name                                component  experiment  frequency  vertical_levels  spatial_domain  units         start_time           end_time             path
0    FLNS      net longwave flux at surface             atm        20C         daily      1.0              global          W/m2          1920-01-01 12:00:00  2005-12-31 12:00:00  s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS....
1    FLNSC     clearsky net longwave flux at surface    atm        20C         daily      1.0              global          W/m2          1920-01-01 12:00:00  2005-12-31 12:00:00  s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNSC...
2    FLUT      upwelling longwave flux at top of model  atm        20C         daily      1.0              global          W/m2          1920-01-01 12:00:00  2005-12-31 12:00:00  s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLUT....
3    FSNS      net solar flux at surface                atm        20C         daily      1.0              global          W/m2          1920-01-01 12:00:00  2005-12-31 12:00:00  s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNS....
4    FSNSC     clearsky net solar flux at surface       atm        20C         daily      1.0              global          W/m2          1920-01-01 12:00:00  2005-12-31 12:00:00  s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNSC...
...  ...       ...                                      ...        ...         ...        ...              ...             ...           ...                  ...                  ...
430  WVEL      vertical velocity                        ocn        RCP85       monthly    60.0             global_ocean    centimeter/s  2006-01-16 12:00:00  2100-12-16 12:00:00  s3://ncar-cesm-lens/ocn/monthly/cesmLE-RCP85-W...
431  NaN       NaN                                      ocn        CTRL        static     NaN              global_ocean    NaN           NaN                  NaN                  s3://ncar-cesm-lens/ocn/static/grid.zarr
432  NaN       NaN                                      ocn        HIST        static     NaN              global_ocean    NaN           NaN                  NaN                  s3://ncar-cesm-lens/ocn/static/grid.zarr
433  NaN       NaN                                      ocn        RCP85       static     NaN              global_ocean    NaN           NaN                  NaN                  s3://ncar-cesm-lens/ocn/static/grid.zarr
434  NaN       NaN                                      ocn        20C         static     NaN              global_ocean    NaN           NaN                  NaN                  s3://ncar-cesm-lens/ocn/static/grid.zarr

435 rows × 11 columns
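
One way to condense a table like this is to count the catalog entries for each combination of experiment, component, and frequency with pandas. The sketch below uses a small hand-written stand-in dataframe (its rows and the name summary are illustrative, not taken from the real catalog) so that it runs without network access; with the real catalog, col.df would take the place of df.

```python
import pandas as pd

# Hypothetical stand-in for col.df, with only the columns needed here
df = pd.DataFrame(
    {
        'variable': ['FLNS', 'FLNS', 'FLUT', 'WVEL', 'WVEL'],
        'component': ['atm', 'atm', 'atm', 'ocn', 'ocn'],
        'experiment': ['20C', '20C', '20C', '20C', 'RCP85'],
        'frequency': ['daily', 'monthly', 'monthly', 'monthly', 'monthly'],
    }
)

# Count the catalog entries for each experiment/component/frequency combination
summary = df.groupby(['experiment', 'component', 'frequency']).size()
print(summary)
```

The result is a much shorter table with one row per combination, which is the same hierarchy that the graph later in this post draws.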

You can search the catalog via intake-esm, using the following syntax:

cat = col.search(experiment='20C', frequency='monthly')

Here again, it is tough to see everything that is available. The search also requires knowing in advance which experiments are in the dataset and which frequency you are looking for.

cat.df
     variable  long_name                                component  experiment  frequency  vertical_levels  spatial_domain  units            start_time           end_time             path
0    FLNS      net longwave flux at surface             atm        20C         monthly    1.0              global          W/m2             1920-01-16 12:00:00  2005-12-16 12:00:00  s3://ncar-cesm-lens/atm/monthly/cesmLE-20C-FLN...
1    FLNSC     clearsky net longwave flux at surface    atm        20C         monthly    1.0              global          W/m2             1920-01-16 12:00:00  2005-12-16 12:00:00  s3://ncar-cesm-lens/atm/monthly/cesmLE-20C-FLN...
2    FLUT      upwelling longwave flux at top of model  atm        20C         monthly    1.0              global          W/m2             1920-01-16 12:00:00  2005-12-16 12:00:00  s3://ncar-cesm-lens/atm/monthly/cesmLE-20C-FLU...
3    FSNS      net solar flux at surface                atm        20C         monthly    1.0              global          W/m2             1920-01-16 12:00:00  2005-12-16 12:00:00  s3://ncar-cesm-lens/atm/monthly/cesmLE-20C-FSN...
4    FSNSC     clearsky net solar flux at surface       atm        20C         monthly    1.0              global          W/m2             1920-01-16 12:00:00  2005-12-16 12:00:00  s3://ncar-cesm-lens/atm/monthly/cesmLE-20C-FSN...
...  ...       ...                                      ...        ...         ...        ...              ...             ...              ...                  ...                  ...
60   VNT       flux of heat in grid-y direction         ocn        20C         monthly    60.0             global_ocean    degC/s           1920-01-16 12:00:00  2005-12-16 12:00:00  s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-VNT...
61   VVEL      velocity in grid-y direction             ocn        20C         monthly    60.0             global_ocean    centimeter/s     1920-01-16 12:00:00  2005-12-16 12:00:00  s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-VVE...
62   WTS       salt flux across top face                ocn        20C         monthly    60.0             global_ocean    gram/kilogram/s  1920-01-16 12:00:00  2005-12-16 12:00:00  s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-WTS...
63   WTT       heat flux across top face                ocn        20C         monthly    60.0             global_ocean    degC/s           1920-01-16 12:00:00  2005-12-16 12:00:00  s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-WTT...
64   WVEL      vertical velocity                        ocn        20C         monthly    60.0             global_ocean    centimeter/s     1920-01-16 12:00:00  2005-12-16 12:00:00  s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-WVE...

65 rows × 11 columns
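
Because a search needs exact values, it can help to list the values each column takes before searching. The following is a minimal sketch with plain pandas on a hypothetical stand-in for col.df (the rows and the name search_keys are illustrative); with the real catalog, col.df would replace df.

```python
import pandas as pd

# Hypothetical stand-in for col.df
df = pd.DataFrame(
    {
        'component': ['atm', 'atm', 'ocn', 'ocn', 'ocn'],
        'experiment': ['20C', 'RCP85', 'CTRL', 'HIST', '20C'],
        'frequency': ['daily', 'monthly', 'static', 'static', 'monthly'],
    }
)

# Collect the sorted unique values of each column that could serve as a search key
search_keys = {
    column: sorted(df[column].dropna().unique())
    for column in ['experiment', 'component', 'frequency']
}

for column, values in search_keys.items():
    print(column, values)
```

Any of the printed values can then be passed to col.search, as in the experiment='20C', frequency='monthly' query shown earlier.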

Using Graphviz in a Jupyter Notebook

Graphviz offers an interface for creating network graphs.

Main “components” of Graphviz

  • Digraph class

    • This is the main class used to build the visualization. It is typically assigned to a variable named dot, but you can use any name you like!

  • Node

    • The “bubbles”, each of which has a name (ex. ‘1’) and a displayed label (ex. ‘HIST’)

    • These can be connected together. The name must be unique within the graph, while labels may repeat; this example uses integer strings as names, but any unique string works

  • Edge

    • Edges connect the different nodes, referring to them by name (ex. .edge('1', '3') would connect the first and third nodes)

Example of case visualization

# Create Digraph object
dot = Digraph()

# Create the first node which serves as the main parent
dot.node('1', label='HIST')

# Add an ocn component node and connect to experiment parent
dot.node('2', label='ocn')
dot.edge('1', '2')

# Add a monthly child from the ocn component parent
dot.node('3', label='monthly')
dot.edge('2', '3')

# Add a daily child from the ocn component parent
dot.node('4', label='daily')
dot.edge('2', '4')

# Add an atm component node and connect to experiment parent
dot.node('5', label='atm')
dot.edge('1', '5')

# Add a monthly child from the atm component parent
dot.node('6', label='monthly')
dot.edge('5', '6')

# Add a weekly child from the atm component parent
dot.node('7', label='weekly')
dot.edge('5', '7')

# Visualize the graph
dot
../../_images/e696afcdb2fcbec7e9268399b9609587a5f0f51f6c5249afe3b3a0fcd46f0376.svg
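
To check what was built without rendering an image, the Digraph object exposes the generated DOT text through its source attribute. The following is a minimal sketch on a two-node graph. Writing an image file with render additionally requires the Graphviz executables, so that line is left commented out, and the file name in it is a placeholder.

```python
from graphviz import Digraph

# Build a small graph with one parent and one child
dot = Digraph()
dot.node('1', label='HIST')
dot.node('2', label='ocn')
dot.edge('1', '2')

# The DOT source is plain text describing the nodes and edges
print(dot.source)

# Writing an image requires the Graphviz executables on the system path
# dot.render('catalog_graph', format='svg')  # 'catalog_graph' is a placeholder name
```

Printing the source is a quick way to confirm that each edge connects the node names you intended.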

Looping through the CESM-LE catalog

Let’s apply this to our data catalog, assigning the dataframe of dataset attributes to df.

df = col.df
# Create Digraph object - use the left to right orientation instead of vertical
dot = Digraph(graph_attr={'rankdir': 'LR'})

# Start counting at one for node numbers
num_node = 1

# Loop through the different experiments
for experiment in df.experiment.unique():
    exp_i = num_node
    dot.node(str(exp_i), label=experiment)
    num_node += 1

    # Loop through the different components in each experiment
    for component in df.loc[df.experiment == experiment].component.unique():
        comp_i = num_node
        dot.node(str(comp_i), label=component)
        dot.edge(str(exp_i), str(comp_i))
        num_node += 1

        # Loop through the frequency in each component within each experiment
        for frequency in df.loc[
            (df.experiment == experiment) & (df.component == component)
        ].frequency.unique():
            freq_i = num_node
            dot.node(str(freq_i), label=frequency)
            dot.edge(str(comp_i), str(freq_i))
            num_node += 1

# Visualize the graph
dot
../../_images/943c5e4d80e0ce8f02ecfac7b6ec89c7db19f3e5e45efe9d570e3860a84633d4.svg
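
The counter in the loop exists only to give each node a unique name. An alternative sketch, assuming the same experiment, component, frequency hierarchy, builds each name from the node's path through the hierarchy, so no counter is needed. The function name catalog_to_graph and the stand-in dataframe are hypothetical; with the real catalog, col.df would be passed instead.

```python
import pandas as pd
from graphviz import Digraph


def catalog_to_graph(df, levels=('experiment', 'component', 'frequency')):
    """Build a left-to-right graph of the unique value combinations in `levels`."""
    dot = Digraph(graph_attr={'rankdir': 'LR'})
    seen = set()

    # Each unique combination of the level columns is one path through the graph
    for row in df[list(levels)].drop_duplicates().itertuples(index=False):
        parent = None
        for depth in range(len(levels)):
            # The node name is the path so far, e.g. '20C/atm/monthly'
            # (assumes the values contain no ':' characters, which DOT reserves for ports)
            name = '/'.join(str(value) for value in row[: depth + 1])
            if name not in seen:
                seen.add(name)
                dot.node(name, label=str(row[depth]))
                if parent is not None:
                    dot.edge(parent, name)
            parent = name
    return dot


# Hypothetical stand-in for col.df
df = pd.DataFrame(
    {
        'experiment': ['20C', '20C', '20C', 'RCP85'],
        'component': ['atm', 'atm', 'ocn', 'ocn'],
        'frequency': ['daily', 'monthly', 'monthly', 'monthly'],
    }
)

dot = catalog_to_graph(df)
print(dot.source)
```

Path-based names keep two nodes with the same label (such as monthly under atm and monthly under ocn) distinct, and the levels argument makes it easy to try a different column order.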

Conclusion

Graphviz can be a helpful tool for visualizing what data is in your data catalog. I hope this provides a good starting point for using it with intake-esm catalogs!