{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Sparse arrays and the CESM land model component\n", "\n", "An underappreciated feature of Xarray + Dask is the ability to plug in [different array types](https://docs.xarray.dev/en/stable/user-guide/duckarrays.html). Usually we work with Xarray wrapping a Dask array which in turn uses *NumPy* arrays for each block; or just Xarray wrapping NumPy arrays directly. NumPy arrays are dense in-memory arrays. Other array types exist:\n", "- [sparse](https://sparse.pydata.org) for sparse arrays\n", "- [pint](https://pint.readthedocs.io) for units\n", "- [cupy](https://cupy.dev/) for GPU arrays\n", "\n", "Over the past few years, significant effort has been made to make these array types speak a common protocol so that higher-level packages like Xarray can easily wrap all of them. The latest (and hopefully last) version of these efforts is described at [data-apis](https://data-apis.org/) if you are interested.\n", "\n", "\n", "This notebook explores using sparse arrays with dask and xarray motivated by some Zulip conversations around representing \"Plant Functional Types\" from the land model component. A preliminary version of this notebook is [here](https://nbviewer.jupyter.org/github/NCAR/ctsm_python_gallery/blob/master/notebooks/sparse-PFT-gridding.ipynb); and the work builds on [PFT-Gridding.ipynb](https://nbviewer.org/github/NCAR/ctsm_python_gallery/blob/master/notebooks/PFT-Gridding.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Importing Libraries" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import cartopy.crs as ccrs\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import sparse\n", "import xarray as xr\n", "\n", "# some nice plotting settings\n", "xr.set_options(cmap_sequential=mpl.cm.YlGn, keep_attrs=True)\n", "plt.rcParams[\"figure.dpi\"] = 120\n", "cbar_kwargs = {\"orientation\": \"horizontal\", \"shrink\": 0.8, \"aspect\": 30}\n", "\n", "\n", "def setup_axes():\n", " ax = plt.axes(projection=ccrs.PlateCarree())\n", " ax.coastlines()\n", " return ax" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Community Land Model (CLM) output\n", "\n", "Lets read a dataset.\n", "\n", "This dataset represents *monthly mean* output from the Community Land Model (CLM), the land model component of CESM. The particular variable here is gross primary production (GPP), which is essentially photosynthesis or carbon uptake by plants. Because CLM represents different [plant functional types (PFTs)](https://escomp.github.io/ctsm-docs/versions/release-clm5.0/html/tech_note/Crop_Irrigation/CLM50_Tech_Note_Crop_Irrigation.html#crop-plant-functional-types) within a single model grid cell (up to 79 different types including natural vegetation and crops), this dataset includes information on how GPP varies by PFT. This way we can examine photosynthesis and how it varies across different plant types. To visualize this output, we often need to remap the information to a latitude/longitude grid, preserving plant type." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<xarray.Dataset>\n", "Dimensions: (levgrnd: 25, levlak: 10, levdcmp: 25, lon: 288, lat: 192, gridcell: 21013, landunit: 48359, column: 111429, pft: 166408, time: 3828, hist_interval: 2)\n", "Coordinates:\n", " * levgrnd (levgrnd) float32 0.01 0.04 0.09 ... 19.48 28.87 42.0\n", " * levlak (levlak) float32 0.05 0.6 2.1 4.6 ... 25.6 34.33 44.78\n", " * levdcmp (levdcmp) float32 0.01 0.04 0.09 ... 19.48 28.87 42.0\n", " * lon (lon) float32 0.0 1.25 2.5 3.75 ... 356.2 357.5 358.8\n", " * lat (lat) float32 -90.0 -89.06 -88.12 ... 88.12 89.06 90.0\n", " * time (time) object 1700-02-01 00:00:00 ... 2019-01-01 00:0...\n", "Dimensions without coordinates: gridcell, landunit, column, pft, hist_interval\n", "Data variables: (12/51)\n", " area (lat, lon) float32 dask.array<chunksize=(192, 288), meta=np.ndarray>\n", " landfrac (lat, lon) float32 dask.array<chunksize=(192, 288), meta=np.ndarray>\n", " landmask (lat, lon) float64 dask.array<chunksize=(192, 288), meta=np.ndarray>\n", " pftmask (lat, lon) float64 dask.array<chunksize=(192, 288), meta=np.ndarray>\n", " nbedrock (lat, lon) float64 dask.array<chunksize=(192, 288), meta=np.ndarray>\n", " grid1d_lon (gridcell) float64 dask.array<chunksize=(21013,), meta=np.ndarray>\n", " ... ...\n", " mscur (time) float64 dask.array<chunksize=(100,), meta=np.ndarray>\n", " nstep (time) float64 dask.array<chunksize=(100,), meta=np.ndarray>\n", " time_bounds (time, hist_interval) object dask.array<chunksize=(100, 2), meta=np.ndarray>\n", " date_written (time) object dask.array<chunksize=(100,), meta=np.ndarray>\n", " time_written (time) object dask.array<chunksize=(100,), meta=np.ndarray>\n", " GPP (time, pft) float32 dask.array<chunksize=(100, 166408), meta=np.ndarray>\n", "Attributes: (12/102)\n", " title: CLM History file information\n", " comment: NOTE: None of the variables ar...\n", " Conventions: CF-1.0\n", " history: created on 09/27/19 16:25:57\n", " source: Community Terrestrial Systems ...\n", " hostname: cheyenne\n", " ... ...\n", " cft_irrigated_tropical_corn: 62\n", " cft_tropical_soybean: 63\n", " cft_irrigated_tropical_soybean: 64\n", " time_period_freq: month_1\n", " Time_constant_3Dvars_filename: ./TRENDY2019_S0_constant_v2.cl...\n", " Time_constant_3Dvars: ZSOI:DZSOI:WATSAT:SUCSAT:BSW:H...
<xarray.DataArray 'GPP' (time: 3828, pft: 166408)>\n", "dask.array<open_dataset-e503263520abd161067be1f5e311c2e1GPP, shape=(3828, 166408), dtype=float32, chunksize=(100, 166408), chunktype=numpy.ndarray>\n", "Coordinates:\n", " * time (time) object 1700-02-01 00:00:00 ... 2019-01-01 00:00:00\n", "Dimensions without coordinates: pft\n", "Attributes:\n", " long_name: gross primary production\n", " units: gC/m^2/s\n", " cell_methods: time: mean
<xarray.DataArray 'GPP' (time: 3828, vegtype: 79, lat: 192, lon: 288)>\n", "dask.array<transpose, shape=(3828, 79, 192, 288), dtype=float32, chunksize=(100, 79, 192, 288), chunktype=sparse.COO>\n", "Coordinates:\n", " * lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 87.17 88.12 89.06 90.0\n", " * lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8\n", " * time (time) object 1700-02-01 00:00:00 ... 2019-01-01 00:00:00\n", " * vegtype (vegtype) |S40 b'not_vegetated ' ... b...\n", "Attributes:\n", " long_name: gross primary production\n", " units: gC/m^2/s\n", " cell_methods: time: mean
Format | coo |
---|---|
Data Type | int64 |
Shape | (3, 3) |
nnz | 3 |
Density | 0.3333333333333333 |
Read-only | True |
Size | 72 |
Storage ratio | 1.0 |
Format | coo |
---|---|
Data Type | float32 |
Shape | (4, 4) |
nnz | 3 |
Density | 0.1875 |
Read-only | True |
Size | 60 |
Storage ratio | 0.9 |
\n",
"
| \n",
" \n", " \n", " | \n", "
<xarray.DataArray (x: 4, y: 4)>\n", "<COO: shape=(4, 4), dtype=float32, nnz=3, fill_value=nan>\n", "Coordinates:\n", " * x (x) int64 0 1 2 3\n", " * y (y) int64 0 1 2 3
Format | coo |
---|---|
Data Type | float32 |
Shape | (4, 4) |
nnz | 3 |
Density | 0.1875 |
Read-only | True |
Size | 60 |
Storage ratio | 0.9 |
<xarray.DataArray (x: 4, y: 4)>\n", "array([[ 1., nan, nan, nan],\n", " [nan, 1., nan, nan],\n", " [nan, nan, 1., nan],\n", " [nan, nan, nan, nan]], dtype=float32)\n", "Coordinates:\n", " * x (x) int64 0 1 2 3\n", " * y (y) int64 0 1 2 3
<xarray.DataArray (y: 4)>\n", "<COO: shape=(4,), dtype=float64, nnz=3, fill_value=nan>\n", "Coordinates:\n", " * y (y) int64 0 1 2 3
<xarray.DataArray 'array-8e28f4e6653ecaa445c49b8638c8f808' (x: 4, y: 4)>\n", "dask.array<array, shape=(4, 4), dtype=float32, chunksize=(2, 2), chunktype=sparse.COO>\n", "Coordinates:\n", " * x (x) int64 1 2 3 4\n", " * y (y) int64 1 2 3 4
<xarray.DataArray 'GPP' (pft: 166408)>\n", "array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)\n", "Coordinates:\n", " time object 1700-02-01 00:00:00\n", "Dimensions without coordinates: pft\n", "Attributes:\n", " long_name: gross primary production\n", " units: gC/m^2/s\n", " cell_methods: time: mean
<xarray.DataArray 'pfts1d_itype_veg' (pft: 166408)>\n", "array([0, 0, 0, ..., 0, 0, 0])\n", "Dimensions without coordinates: pft\n", "Attributes:\n", " long_name: pft vegetation type
<xarray.DataArray 'pfts1d_ixy' (pft: 166408)>\n", "array([ 1, 1, 2, ..., 265, 265, 265])\n", "Dimensions without coordinates: pft\n", "Attributes:\n", " long_name: 2d longitude index of corresponding pft
Format | coo |
---|---|
Data Type | float32 |
Shape | (78, 186, 288) |
nnz | 104414 |
Density | 0.024989565144135036 |
Read-only | True |
Size | 2.8M |
Storage ratio | 0.2 |
Format | coo |
---|---|
Data Type | float32 |
Shape | (79, 192, 288) |
nnz | 104414 |
Density | 0.023902202736755744 |
Read-only | True |
Size | 2.8M |
Storage ratio | 0.2 |
<xarray.DataArray (vegtype: 79, lat: 192, lon: 288)>\n", "<COO: shape=(79, 192, 288), dtype=float32, nnz=104414, fill_value=nan>\n", "Coordinates:\n", " * vegtype (vegtype) |S40 b'not_vegetated ' ... b...\n", " * lat (lat) float32 -90.0 -89.06 -88.12 -87.17 ... 87.17 88.12 89.06 90.0\n", " * lon (lon) float32 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8
Format | coo |
---|---|
Data Type | float32 |
Shape | (23, 192, 288) |
nnz | 104414 |
Density | 0.08209887026972625 |
Read-only | True |
Size | 2.8M |
Storage ratio | 0.6 |
<xarray.DataArray 'GPP' (time: 4, pft: 166408)>\n", "array([[0., 0., 0., ..., 0., 0., 0.],\n", " [0., 0., 0., ..., 0., 0., 0.],\n", " [0., 0., 0., ..., 0., 0., 0.],\n", " [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)\n", "Coordinates:\n", " * time (time) object 1700-02-01 00:00:00 ... 1700-05-01 00:00:00\n", "Dimensions without coordinates: pft\n", "Attributes:\n", " long_name: gross primary production\n", " units: gC/m^2/s\n", " cell_methods: time: mean
Format | coo |
---|---|
Data Type | float32 |
Shape | (4, 79, 192, 288) |
nnz | 417656 |
Density | 0.023902202736755744 |
Read-only | True |
Size | 14.3M |
Storage ratio | 0.2 |
<xarray.DataArray (time: 4, vegtype: 79, lat: 192, lon: 288)>\n", "<COO: shape=(4, 79, 192, 288), dtype=float32, nnz=417656, fill_value=nan>\n", "Coordinates:\n", " * time (time) object 1700-02-01 00:00:00 ... 1700-05-01 00:00:00\n", " * vegtype (vegtype) |S40 b'not_vegetated ' ... b...\n", " * lat (lat) float32 -90.0 -89.06 -88.12 -87.17 ... 87.17 88.12 89.06 90.0\n", " * lon (lon) float32 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8
<xarray.DataArray 'GPP' (time: 4, vegtype: 79, lat: 192, lon: 288)>\n", "<COO: shape=(4, 79, 192, 288), dtype=float32, nnz=417656, fill_value=nan>\n", "Coordinates:\n", " * time (time) object 1700-02-01 00:00:00 ... 1700-05-01 00:00:00\n", "Dimensions without coordinates: vegtype, lat, lon\n", "Attributes:\n", " long_name: gross primary production\n", " units: gC/m^2/s\n", " cell_methods: time: mean
<xarray.DataArray 'GPP' (time: 3828, vegtype: 79, lat: 192, lon: 288)>\n", "dask.array<transpose, shape=(3828, 79, 192, 288), dtype=float32, chunksize=(100, 79, 192, 288), chunktype=numpy.ndarray>\n", "Coordinates:\n", " * time (time) object 1700-02-01 00:00:00 ... 2019-01-01 00:00:00\n", "Dimensions without coordinates: vegtype, lat, lon\n", "Attributes:\n", " long_name: gross primary production\n", " units: gC/m^2/s\n", " cell_methods: time: mean
<xarray.DataArray 'GPP' (time: 3828, vegtype: 79, lat: 192, lon: 288)>\n", "dask.array<transpose, shape=(3828, 79, 192, 288), dtype=float32, chunksize=(100, 79, 192, 288), chunktype=sparse.COO>\n", "Coordinates:\n", " * time (time) object 1700-02-01 00:00:00 ... 2019-01-01 00:00:00\n", "Dimensions without coordinates: vegtype, lat, lon\n", "Attributes:\n", " long_name: gross primary production\n", " units: gC/m^2/s\n", " cell_methods: time: mean
<xarray.Dataset>\n", "Dimensions: (lat: 192, lon: 288, vegtype: 79, time: 3828)\n", "Coordinates:\n", " * lat (lat) float64 -90.0 -89.06 -88.12 ... 88.12 89.06 90.0\n", " * lon (lon) float64 0.0 1.25 2.5 3.75 ... 356.2 357.5 358.8\n", " * time (time) object 1700-02-01 00:00:00 ... 2019-01-01 00:0...\n", " * vegtype (vegtype) |S40 b'not_vegetated ...\n", "Data variables: (12/15)\n", " pfts1d_lon (vegtype, lat, lon) float64 dask.array<chunksize=(79, 192, 288), meta=np.ndarray>\n", " pfts1d_lat (vegtype, lat, lon) float64 dask.array<chunksize=(79, 192, 288), meta=np.ndarray>\n", " pfts1d_ixy (vegtype, lat, lon) float64 <COO: nnz=104414, fill_value=nan>\n", " pfts1d_jxy (vegtype, lat, lon) float64 <COO: nnz=104414, fill_value=nan>\n", " pfts1d_gi (vegtype, lat, lon) float64 dask.array<chunksize=(79, 192, 288), meta=np.ndarray>\n", " pfts1d_li (vegtype, lat, lon) float64 dask.array<chunksize=(79, 192, 288), meta=np.ndarray>\n", " ... ...\n", " pfts1d_wtcol (vegtype, lat, lon) float64 dask.array<chunksize=(79, 192, 288), meta=np.ndarray>\n", " pfts1d_itype_veg (vegtype, lat, lon) float64 <COO: nnz=104414, fill_value=nan>\n", " pfts1d_itype_col (vegtype, lat, lon) float64 dask.array<chunksize=(79, 192, 288), meta=np.ndarray>\n", " pfts1d_itype_lunit (vegtype, lat, lon) float64 dask.array<chunksize=(79, 192, 288), meta=np.ndarray>\n", " pfts1d_active (vegtype, lat, lon) float64 dask.array<chunksize=(79, 192, 288), meta=np.ndarray>\n", " GPP (time, vegtype, lat, lon) float32 dask.array<chunksize=(100, 79, 192, 288), meta=sparse.COO>\n", "Attributes: (12/102)\n", " title: CLM History file information\n", " comment: NOTE: None of the variables ar...\n", " Conventions: CF-1.0\n", " history: created on 09/27/19 16:25:57\n", " source: Community Terrestrial Systems ...\n", " hostname: cheyenne\n", " ... ...\n", " cft_irrigated_tropical_corn: 62\n", " cft_tropical_soybean: 63\n", " cft_irrigated_tropical_soybean: 64\n", " time_period_freq: month_1\n", " Time_constant_3Dvars_filename: ./TRENDY2019_S0_constant_v2.cl...\n", " Time_constant_3Dvars: ZSOI:DZSOI:WATSAT:SUCSAT:BSW:H...