{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Reimagining Diagnostics Through the Use of the Jupyter Ecosystem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Typically, diagnostics packages are written with following structure, using script-based workflows\n", "* Read files and do some preprocessing\n", "* Compute some average or derived quantity\n", "* Visualize the output\n", "\n", "An example of this workflow is the Model Diagnostic Task Force (MDTF) package, which contains a collection of Process Oriented Diagnostics (PODs). Within the this workflow, files are pre-processed, then read into the various diagnostics using similar syntax.\n", "\n", "These workflows are typically executed using a collection of scripts, with Jupyter Notebooks primarily being used as a tool for ***exploratory analysis***. Here, we investigate what diagnostics package built using **Jupyter Notebooks** would look like." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why Jupyter\n", "\n", "The Jupyter ecocystem (notebooks, hub, books, etc.) offers an alternative to traditional scripting and packages. In this example, investigate how using this interactive interface be used to write a parameterizable diagnostics package, capable of providing comparisons of CESM model output. The main reasons for using Jupyter include:\n", "* Provide a launch point for exploratory data analysis\n", "* Interactive plots\n", "* Ability to compile a website using JupyterBook\n", "\n", "In this example, we will walk through how we built a diagnostics package (CESM2 MARBL Diagnostics) prototype using a Jupyter focused workflow.\n", "\n", "* [Link to the rendered JupyterBook](https://marbl-ecosys.github.io/marbl-bgc-diagnostics/intro.html)\n", "* [Link to the project repository](https://github.com/marbl-ecosys/marbl-bgc-diagnostics)\n", "\n", "![cesm-marbl-book-overview](../images/cesm_marbl_book_overview.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting up an Analysis Configuration File\n", "A basic requirement of a diagnostics package is that it be configurable to different cases, allowing the user to specify which files to use, which diagnostics to plot, and where to place the output. 
The configuration file used in this example is given below:\n", "\n", "```yaml\n", "reference_case_name: b1850.f19_g17.validation_mct.004\n", "\n", "compare_case_name:\n", "  - b1850.f19_g17.validation_mct.002\n", "  - b1850.f19_g17.validation_nuopc.004\n", "\n", "case_data_paths:\n", "  - /glade/scratch/hannay/archive/b1850.f19_g17.validation_mct.004/ocn/hist\n", "  - /glade/scratch/hannay/archive/b1850.f19_g17.validation_mct.002/ocn/hist\n", "  - /glade/scratch/hannay/archive/b1850.f19_g17.validation_nuopc.004_copy2/ocn/hist\n", "\n", "cesm_file_format:\n", "  - hist\n", "\n", "catalog_csv: ../data/cesm-validation-catalog.csv\n", "\n", "catalog_json: ../data/cesm-validation-catalog.json\n", "\n", "variables:\n", "  physics:\n", "    - TEMP\n", "    - SALT\n", "    - HMXL\n", "    - BSF\n", "    - IFRAC\n", "    - SHF\n", "    - SFWF\n", "    - SHF_QSW\n", "  biogeochemistry:\n", "    - FG_CO2\n", "    - PO4\n", "    - photoC_TOT_zint\n", "    - photoC_sp_zint\n", "    - photoC_diat_zint\n", "    - photoC_diaz_zint\n", "    - POC_FLUX_100m\n", "    - diaz_Nfix\n", "    - SiO2_PROD\n", "    - CaCO3_FLUX_100m\n", "    - SiO2_FLUX_100m\n", "    - ALK\n", "    - DIC\n", "    - O2\n", "    - DOC\n", "    - DOCr\n", "    - NH4\n", "    - NO3\n", "    - SiO3\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting up the \"Preprocessing\" Notebooks\n", "\n", "The first two bullet points from the \"traditional diagnostics\" workflow can be accomplished using two primary notebooks: one for creating a data catalog (`00_Build_Catalog.ipynb`), and another for accessing data from that catalog and applying a computation (`01_Compute_20yr_mean.ipynb`). A third notebook (`intro.ipynb`) plots a visualization of the data catalog used in the analysis.\n", "\n", "If you are interested in how we built the data catalog, I encourage you to check out the [\"Building Intake-ESM Catalogs from History Files\" blog post](https://ncar.github.io/esds/posts/2021/ecgtools-history-files-example/).\n", "\n", "If you are interested in how we computed the average over some time period using Dask, Xarray, and Intake-ESM, I encourage you to check out our [\"Examining Diagnostics Using Intake-ESM and hvPlot\" blog post](https://ncar.github.io/esds/posts/2021/intake-esm-holoviews-diagnostics/).\n", "\n", "If you are interested in how we built the catalog visualization, I encourage you to check out the [\"Creating Model Documentation Using Jupyterbook and Intake-ESM\" blog post](https://ncar.github.io/esds/posts/2021/model_documentation_jupyterbook/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using our Configuration File in the Notebooks\n", "\n", "At the top of every notebook, we call\n", "\n", "```python\n", "from config import analysis_config\n", "```\n", "\n", "which accesses the `analysis_config` variable from the `config.py` script:\n", "\n", "```python\n", "import yaml\n", "\n", "# Load the analysis configuration into a dictionary\n", "with open(\"analysis_config.yml\", mode=\"r\") as fptr:\n", "    analysis_config = yaml.safe_load(fptr)\n", "\n", "# Add convenience entries combining the variable lists and the case names\n", "analysis_config['all_variables'] = analysis_config['variables']['physics'] + analysis_config['variables']['biogeochemistry']\n", "analysis_config['all_cases'] = [analysis_config['reference_case_name']] + analysis_config['compare_case_name']\n", "```\n", "\n", "This allows us to access a dictionary with the configuration set in our `analysis_config.yml` file."
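, "\n", "For instance, any notebook can then reference these derived entries directly. A minimal sketch, assuming the `config.py` shown above:\n", "\n", "```python\n", "from config import analysis_config\n", "\n", "# Every variable from both the physics and biogeochemistry lists\n", "for variable in analysis_config['all_variables']:\n", "    print(variable)\n", "\n", "# The reference case followed by all comparison cases\n", "print(analysis_config['all_cases'])\n", "```"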
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to Parameterize Notebooks\n", "An important requirement for a diagnostic package is that it be \"parameterizable\" such that a user can input a list of variables, and the package is able to \"inject\" those variables into the analysis. Here, we demonstrate a bare-minimum example of how this might work, specifying a list of `variables` in the `analysis_config.yaml` file:\n", "\n", "```yaml\n", "variables:\n", " physics:\n", " - TEMP\n", " - SALT\n", " - HMXL\n", " - BSF\n", " - IFRAC\n", " - SHF\n", " - SFWF\n", " - SHF_QSW\n", " biogeochemistry:\n", " - FG_CO2\n", " - PO4\n", " - photoC_TOT_zint\n", " - photoC_sp_zint\n", " - photoC_diat_zint\n", " - photoC_diaz_zint\n", " - POC_FLUX_100m\n", " - diaz_Nfix\n", " - SiO2_PROD\n", " - CaCO3_FLUX_100m\n", " - SiO2_FLUX_100m\n", " - ALK\n", " - DIC\n", " - O2\n", " - DOC\n", " - DOCr\n", " - NH4\n", " - NO3\n", " - SiO3\n", "```\n", "These variables can then be \"substituted\" into notebooks using [papermill](https://papermill.readthedocs.io/en/latest/index.html) or [nbformat](https://nbformat.readthedocs.io/en/latest/).\n", "\n", "**Make sure you install papermill before adding these tags!**\n", "\n", "```bash\n", "conda install -c conda-forge papermill\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parameterizing Code Cells\n", "We use [papermill](https://papermill.readthedocs.io/en/latest/index.html) here to substitute variables within code blocks. You add a cell with the desired variable name (ex. `variable`), then add a cell tag `parameters`. An image of what this looks like the Jupyter Lab interface is shown below:\n", "\n", "![papermill_cell_tags](../images/papermill_cell_tags.png)\n", "\n", "In addition to the `parameters` tag, we add `hide-input` which will \"hide\" the cell on a JupyterBook page, with the code still visible via expanding the cell. \n", "\n", "The code block then executed to generate the resultant notebooks is provided below (where the screenshot above is the `interactive_plot_template.ipnyb` notebook:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import papermill as pm\n", "from config import analysis_config\n", "\n", "for variable_type in analysis_config['variables']:\n", " for variable in analysis_config['variables'][variable_type]:\n", " out_notebook_name = f\"{variable_type}_{variable}.ipynb\"\n", " pm.execute_notebook(\n", " \"interactive_plot_template.ipynb\",\n", " out_notebook_name,\n", " parameters=dict(variable=variable),\n", " kernel_name='python3',\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An example of that is shown below, with the resultant `physics_SALT.ipynb` notebook:\n", "\n", "![papermill_output_salt](../images/papermill_output_salt.png)\n", "\n", "We only see the output from papermill, and the plotting cell. If we expand the \"Click to Show\" section, we can see the original input.\n", "\n", "![papermill_output_salt_expaned](../images/papermill_output_salt_expanded.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Taking a Look at the Interactive Plots\n", "Below these first few cells are interactive plots with the specified variables. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Taking a Look at the Interactive Plots\n", "Below these first few cells are interactive plots of the specified variables. An example of the `SALT` diagnostics can be viewed in the [rendered JupyterBook](https://marbl-ecosys.github.io/marbl-bgc-diagnostics/intro.html) (if you are interested in generating similar plots, be sure to check out the [\"Examining Diagnostics Using Intake-ESM and hvPlot\" blog post](https://ncar.github.io/esds/posts/2021/intake-esm-holoviews-diagnostics/))." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parameterizing Markdown Cells\n", "You'll notice that the title of the template notebook is `Variable`, while the output notebooks have their respective variable names as titles. This is accomplished using [nbformat](https://nbformat.readthedocs.io/en/latest/), with the code block motivated by a thread on the [Jupyter Discourse](https://discourse.jupyter.org/t/possible-to-send-markdown-text-to-a-markdown-cell-in-a-new-notebook-via-papermill-other-options/748/5). A function that goes through the notebook and substitutes the variable name is given below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import nbformat as nbf\n", "\n", "from config import analysis_config\n", "\n", "\n", "def modify_markdown_header(notebook_name, variable):\n", "    # Read the notebook, replace the 'Variable' placeholder in every\n", "    # markdown cell, then write the modified notebook back in place\n", "    notebook = nbf.read(notebook_name, nbf.NO_CONVERT)\n", "    for cell in notebook.cells:\n", "        if cell.cell_type == 'markdown':\n", "            cell['source'] = cell['source'].replace('Variable', variable)\n", "    nbf.write(notebook, notebook_name, version=nbf.NO_CONVERT)\n", "    print(f'Modified {notebook_name} with {variable} header')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then call this function after executing each notebook with papermill, with the combined workflow shown below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for variable_type in analysis_config['variables']:\n", "    for variable in analysis_config['variables'][variable_type]:\n", "        out_notebook_name = f\"{variable_type}_{variable}.ipynb\"\n", "        pm.execute_notebook(\n", "            \"interactive_plot_template.ipynb\",\n", "            out_notebook_name,\n", "            parameters=dict(variable=variable),\n", "            kernel_name='python3',\n", "        )\n", "        # Substitute the variable name into the markdown header of the new notebook\n", "        modify_markdown_header(out_notebook_name, variable)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting up our Jupyter Book Table of Contents\n", "When setting up a Jupyter Book, the two required configuration files are:\n", "* `_config.yml`\n", "* `_toc.yml`\n", "\n", "If you are interested in building a Jupyter Book, I encourage you to check out their [documentation](https://jupyterbook.org/intro.html).\n", "\n", "For our table of contents file (`_toc.yml`), we use the following:\n", "\n", "```yaml\n", "format: jb-book\n", "root: intro\n", "parts:\n", "  - caption: Build Catalog and Compute\n", "    chapters:\n", "      - file: 00_Build_Catalog\n", "      - file: 01_Compute_20yr_mean\n", "  - caption: Physics Plots\n", "    chapters:\n", "      - glob: physics_*\n", "  - caption: Biogeochemistry Plots\n", "    chapters:\n", "      - glob: biogeochemistry_*\n", "```\n", "\n", "Notice how we can use `glob` instead of `file` for the plotting notebooks, which includes any notebooks whose names start with either `physics` or `biogeochemistry` and places them in their respective sections of the table of contents." ] },
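{ "cell_type": "markdown", "metadata": {}, "source": [ "With the table of contents in place, the book can be built and previewed locally before automating anything. A minimal sketch using the standard Jupyter Book CLI, assuming the notebooks and `_toc.yml` live in a `notebooks/` directory (matching the GitHub Actions workflow below):\n", "\n", "```bash\n", "# Build the HTML version of the book\n", "jupyter-book build notebooks\n", "\n", "# The rendered site lands in notebooks/_build/html/index.html\n", "```" ] },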
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Automating the Book Build using GitHub Actions\n", "\n", "The Jupyter Book in this project is built automatically using GitHub Actions, with the web page rendered through GitHub Pages. The file used for this process (`deploy.yml`) is provided below, with the build triggered by pushes to the `main` branch:\n", "\n", "**Make sure that this file is saved in the `{repo_root_directory}/.github/workflows/` directory!**\n", "\n", "```yaml\n", "name: deploy\n", "\n", "on:\n", "  # Trigger the workflow on push to main branch\n", "  push:\n", "    branches:\n", "      - main\n", "\n", "# This job installs dependencies, builds the book, and pushes it to `gh-pages`\n", "jobs:\n", "  build-and-deploy-book:\n", "    runs-on: ${{ matrix.os }}\n", "    strategy:\n", "      matrix:\n", "        os: [ubuntu-latest]\n", "        python-version: [3.8]\n", "    steps:\n", "      - uses: actions/checkout@v2\n", "\n", "      # Install dependencies\n", "      - name: Set up Python ${{ matrix.python-version }}\n", "        uses: actions/setup-python@v1\n", "        with:\n", "          python-version: ${{ matrix.python-version }}\n", "      - name: Install dependencies\n", "        run: |\n", "          pip install -r requirements.txt\n", "\n", "      # Build the book\n", "      - name: Build the book\n", "        run: |\n", "          jupyter-book build notebooks\n", "\n", "      # Deploy the book's HTML to gh-pages branch\n", "      - name: GitHub Pages action\n", "        uses: peaceiris/actions-gh-pages@v3.6.1\n", "        with:\n", "          github_token: ${{ secrets.GITHUB_TOKEN }}\n", "          publish_dir: notebooks/_build/html\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusions\n", "\n", "In this example, we covered how the Jupyter ecosystem provides the tools necessary to build a diagnostics package. The combination of Jupyter Notebooks, Jupyter Book, and various open source packages made to modify these notebooks offers an alternative diagnostics workflow. Not only does this offer the ability to synthesize curiosity-driven analysis, but it also provides a means of automatically generating an interactive website to share with collaborators. The outputs from this workflow include:\n", "* Interactive notebooks\n", "* A data catalog\n", "* An interactive webpage which can be shared with others\n", "\n", "More effort should go into exploring how to better parameterize these calculations and data visualizations, in addition to how one would curate a collection of diagnostic Jupyter Books." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "author": "Max Grover", "date": "2021-09-24", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "tags": "diagnostics,jupyter,intake,cesm,visualization", "title": "Reimagining Diagnostics Through the Use of the Jupyter Ecosystem" }, "nbformat": 4, "nbformat_minor": 4 }