The Significance of Time#

Time is relative, as my coworker Anderson likes to remind me every time manipulating this coordinate proves to be relatively difficult.

This post is to be the first in a series of my struggles coming from an atmospheric science background and transitioning into software. My hope is that if I detail pain points and headaches I encounter along my journey, you won’t have to.

The Task#

Lately, I’ve been working on a project that compares modeled and measured data, which is pretty standard. The measured data is on a Gregorian calendar (meaning there are leap years), and the modeled data is on a 365-day no-leap calendar. This is a pretty standard occurrence, so there are tools available in pandas and xarray to help. But finding the correct steps across the many platforms that host documentation can be difficult – so let me add one more platform: this blog.

First tip: You can find out the calendar your data is on using xarray.dataset.time.encoding.

The Temptation#

At several moments I seriously considered writing my own functions to deal with date-time. When I talked to some of my friends (who are scientists) about it, they confessed to me that they had each written their own function for dealing with this. This is a problem. One of the tenants of the Pangeo community is making it so scientists aren’t all repeating the same work when encountering similar problems. I couldn’t write my own function for time, it was my job to demonstrate how to use the tools already approved by the Pangeo community. But it felt like all of the tools worked differently in different scenarios. I had to navigate the documentation to figure out the when and why of time.

The Tribulation#

Essentially my data is in two distinct conventional ways of formatting time: datetime64 and cftime. Both of these methods have different benefits.

All of the data is initally in float32 or float64, which xarray can automatically decode for me based on the time units and bounds supplied. If you did not want this to happen, then in xarray.open_dataset() specify the keyword argument decode_cf=False. In this case your time will be a series of float values that cannot be understood without your units, in this example ("days since 1900-01-01 00:00:00"):

<xarray.DataArray 'time' (time: 12784)>
array([28854.75, 28855.75, 28856.75, ..., 41635.75, 41636.75, 41637.75],
  * time     (time) float32 28854.75 28855.75 28856.75 ... 41636.75 41637.75
    axis:           T
    standard_name:  time
    long_name:      time
    calendar:       gregorian
    units:          days since 1900-01-01 00:00:00

I have never specified this option and think decoding data according to CF conventions is typically preferred.

For the measured data, on a Gregorian calendar, time is converted to datetime64 format which takes the following form:

<xarray.DataArray 'time' (time: 12784)>
array(['1979-01-01T18:00:00.000000000', '1979-01-02T18:00:00.000000000',
       '1979-01-03T18:00:00.000000000', ..., '2013-12-29T18:00:00.000000000',
       '2013-12-30T18:00:00.000000000', '2013-12-31T18:00:00.000000000'],
  * time     (time) datetime64[ns] 1979-01-01T18:00:00 ... 2013-12-31T18:00:00

The datetime64 type does not yet have support for the no-leap calendar, and since the modeled data is on a no-leap calendar, time is returned in cftime.DatetimeNoLeap. This looks as follows:

<xarray.DataArray 'time' (time: 20440)>
array([cftime.DatetimeNoLeap(1950, 1, 1, 12, 0, 0, 0, 4, 1),
       cftime.DatetimeNoLeap(1950, 1, 2, 12, 0, 0, 0, 5, 2),
       cftime.DatetimeNoLeap(1950, 1, 3, 12, 0, 0, 0, 6, 3), ...,
       cftime.DatetimeNoLeap(2005, 12, 29, 12, 0, 0, 0, 1, 363),
       cftime.DatetimeNoLeap(2005, 12, 30, 12, 0, 0, 0, 2, 364),
       cftime.DatetimeNoLeap(2005, 12, 31, 12, 0, 0, 0, 3, 365)], dtype=object)
  * time     (time) object 1950-01-01 12:00:00 ... 2005-12-31 12:00:00

So I need my data to be in the same format, but which one to pick?

The Case for datetime64#

Sometimes I want datetime. At one point I want to split my modeled dataset into two groups: an early piece that matches the time bounds from the measured data, and a future piece that includes all of the model yet to occur and to be bias-corrected. This is done with the xarray.dataset.sel(time = slice(a, b)) where a and b are the first and last time point from the measured dataset as yyy-mm-dd which is in datetime64 formatting!

The Case for cftime.DatetimeNoLeap#

Converting a non-leap cftime variable to datetime64 leads to subtle errors, which are avoided when I use cftime formatting.

Some of these errors occur in the xarray.dataset.groupby('time.dayofyear') functionality which I use to retrieve average values for each day of the year, across all years in the dataset. If I am in datetime64 formatting, but with February 29th missing, this function spans 366 days for every year, instead of 365, and this error grows with every additional year of data (suddenly March 1st is averaged with March 2nd from the next year, for example). So I need my time coordinate to be in cftime.DatetimeNoLeap formatting.

The best of both worlds#

The usual method for converting between cftime and datetime64 is not supported for the non-standard 365 day no-leap calendar. The usual method being pandas.to_datetime().

One workaround is to use xarray.dataset.indexes[...].to_datetimeindex():

datetimeindex = da.indexes['time'].to_datetimeindex()
da['time']= ('time', datetimeindex)

Which always returns this warning (remember the subtle errors I mentioned):

RuntimeWarning: Converting a CFTimeIndex with dates from a non-standard calendar, 'noleap', to a pandas.DatetimeIndex, which uses dates from the standard calendar.  This may lead to subtle errors in operations that depend on the length of time between dates.

But if I want to be able to use both cftime.DatetimeNoLeap and datetime64, I had to write my own function to do this conversion:

def cfnoleap_to_datetime(da):
    datetimeindex = da.indexes['time'].to_datetimeindex()
    ds = da.to_dataset()
    ds['time_dt']= ('time', datetimeindex)
    ds = ds.swap_dims({'time': 'time_dt'})
    assert len(da.time) == len(ds.time_dt)
    return ds

In order to maintain the information contained in cftime I need to upgrade my DataArray to a Dataset, then add a new dimension for the datetime64 formatting of the time coordinate. Then I am able to simply use xarray.dataset.swap_dims to pick which formatting I want depending on the functionality I am using.

I guess that wasn’t that hard, but I couldn’t find an example of this workflow anywhere. I hope sharing it here helps anyone with similar issues!