Cloud-optimized access to CESM Timeseries netCDF files in the Cloud with kerchunk#
Summary#
We benchmark reading a subset of the CESM2 Large Ensemble, stored as a collection of netCDF files on the cloud (Amazon / AWS), from Casper. We use a single ensemble member of the historical experiment with daily data from 1850 to 2009, a total dataset size of 600+ GB spread across 13 netCDF4 files.
We read the data in two ways:

- accessing the netCDF files directly
- using kerchunk as an intermediate layer to enable cloud-optimized access
File Access Method | Dataset Lazy Read-In Wall Time | Visualization Wall Time
---|---|---
s3 netCDF | 8-9 s | 6 min 21 s
kerchunk | 460 ms | 43.4 s
The “s3 netCDF” row refers to accessing the data remotely using the traditional S3 API, the interface to Amazon’s cloud object storage; Google and Microsoft offer equivalent services.
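As a minimal sketch of this direct-access approach (the bucket path below is a placeholder, not the actual CESM2-LE location used later in this post), each remote netCDF4 file is opened as a file-like object and handed to xarray:

```python
import s3fs
import xarray as xr

# Anonymous S3 filesystem; every read turns into one or more GET requests
# against the object store.
fs = s3fs.S3FileSystem(anon=True)

# Placeholder glob pattern -- substitute the real CESM2-LE bucket prefix.
paths = fs.glob("s3://some-bucket/cesm2-le/*.nc")

# Open each netCDF4/HDF5 object as a file-like handle and let xarray build
# a single lazy dataset; all metadata is pulled over the network at open time.
files = [fs.open(p) for p in paths]
ds = xr.open_mfdataset(files, engine="h5netcdf", combine="by_coords", parallel=True)
```

Because the netCDF4/HDF5 metadata is scattered throughout each file, this open step alone requires many small remote reads, which is what drives the slower timings in the table above.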
kerchunk is a new package, developed within the fsspec Python community, which aims to improve I/O performance when reading cloud-hosted datasets that are not necessarily in a “cloud-optimized” format (e.g. netCDF vs. Zarr); netCDF was engineered to be performant on regular POSIX filesystems rather than object storage.
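A rough sketch of the kerchunk workflow (again with a placeholder path) scans a netCDF4/HDF5 file once, records the byte ranges of its chunks in a JSON reference file, and then reads the data through fsspec’s reference filesystem as if it were a Zarr store:

```python
import fsspec
import ujson
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

# Storage options for anonymous S3 access while scanning the file.
so = dict(anon=True, default_fill_cache=False, default_cache_type="first")

url = "s3://some-bucket/cesm2-le/file.nc"  # placeholder path

# Scan the remote HDF5/netCDF4 file and emit a reference set that maps
# Zarr-style chunk keys to byte ranges inside the original file.
with fsspec.open(url, mode="rb", **so) as f:
    refs = SingleHdf5ToZarr(f, url, inline_threshold=300).translate()

with open("file.json", "w") as out:
    out.write(ujson.dumps(refs))

# Later, open the references as a virtual Zarr store; the data chunks are
# still read from the original netCDF file, but metadata comes from the
# local JSON, so far fewer remote requests are needed.
fs = fsspec.filesystem(
    "reference",
    fo="file.json",
    remote_protocol="s3",
    remote_options={"anon": True},
)
ds = xr.open_dataset(
    fs.get_mapper(""),
    engine="zarr",
    backend_kwargs={"consolidated": False},
    chunks={},
)
```

Multiple per-file reference sets can then be merged into one virtual dataset with `MultiZarrToZarr` (imported below), which is how the 13 CESM2-LE files are combined later in this post.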
Note that you would usually execute benchmarks and analysis like these on a cloud compute instance. The performance benefits shown here extend to that use case, though the gap will be smaller (but still significant). For more, see this post and this post.
Tip
The Project Pythia Cookbook on kerchunk is a great resource!
Imports#
%load_ext watermark

import glob
import os

import cftime
import dask
import fsspec
import holoviews as hv
import hvplot.xarray
import kerchunk
import s3fs
import ujson  # fast json
import xarray as xr
from distributed import Client
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr
from ncar_jobqueue import NCARCluster

hv.extension('bokeh')

%watermark -iv