Xarray and gridded data#

# initialization
import pandas as pd

Oceanographic data are often gridded: for example, we may record ocean temperature over a range of latitudes and longitudes, as well as a range of depths. Such data are inherently 3-dimensional.

Naively, we may store such data in a 3-dimensional numpy array. However, in doing so we’ll lose information about the coordinates of the grid, e.g., the value of depth that corresponds to an index along the third axis. What we need is a higher-dimensional equivalent of pandas, where information about the coordinates is stored alongside the data.

The third-party xarray module provides such an extension. As an added benefit, xarray also provides an interface to load and save netCDF files, a common external file format for gridded data.

To import xarray, run the line below (again, the as xr part is optional but standard).

import xarray as xr
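As a toy illustration of the idea (a sketch with made-up numbers, not real ocean data), here is a small 3-dimensional temperature array wrapped in an xr.DataArray so that the coordinate values travel with the data:

```python
import numpy as np
import xarray as xr

# a made-up 2 x 3 x 4 temperature grid (depth x lat x lon)
temp = np.arange(24, dtype=float).reshape(2, 3, 4)

temp_da = xr.DataArray(
    temp,
    dims=["depth", "lat", "lon"],
    coords={
        "depth": [5.0, 50.0],                 # meters
        "lat": [-60.0, -50.0, -40.0],         # degrees north
        "lon": [90.0, 100.0, 110.0, 120.0],   # degrees east
    },
    name="temperature",
)

# coordinate labels are preserved: we can select by value, not by position
print(temp_da.sel(depth=50.0, lat=-50.0, lon=110.0).item())  # 18.0
```

Compare this with plain numpy, where the same lookup would require us to remember separately that depth 50.0 m lives at index 1 of the first axis.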

Loading and inspecting a netCDF file#

As an example of external netCDF data, we will work with a subset of the B-SOSE (Biogeochemical Southern Ocean State Estimate) model output. The data is stored in the netCDF file bsose_monthly_velocities.nc accessible here.

To load the data, we call the xr.open_dataset() function, and assign the result to a variable:

bsose = xr.open_dataset("data/bsose_monthly_velocities.nc")

We can inspect the data using the display() function. The output is an interactive HTML snippet with parts that you can expand or collapse:

display(bsose)
<xarray.Dataset> Size: 29MB
Dimensions:  (time: 12, depth: 10, lat: 147, lon: 135)
Coordinates:
  * time     (time) datetime64[ns] 96B 2012-01-30T20:00:00 ... 2012-12-30T12:...
  * lat      (lat) float32 588B -77.93 -77.79 -77.65 ... -31.09 -30.52 -29.94
  * lon      (lon) float32 540B 90.17 90.83 91.5 92.17 ... 178.2 178.8 179.5
  * depth    (depth) float32 40B 2.1 26.25 65.0 105.0 ... 450.0 700.0 1.1e+03
Data variables:
    U        (time, depth, lat, lon) float32 10MB ...
    V        (time, depth, lat, lon) float64 19MB ...
Attributes:
    name:     B-SOSE (Southern Ocean State Estimate) model output

Notice that bsose is an xarray Dataset object. Note also that multiple sets of data (“Data variables”) are stored alongside the labels of the coordinates (“Coordinates”) in a single object. In addition, the dimensions (a.k.a. shape in numpy lingo) of the data are listed in the “Dimensions” section at the top. Furthermore, there are metadata (“Attributes”) associated with the whole dataset, such as its name.

By clicking on the document-looking icon, you can see that each coordinate and data variable has metadata associated with it. For example, we see that depth is measured in m while U and V are measured in m/s. Finally, you can preview the actual data by clicking on the cylinder icon.
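This per-variable metadata is also accessible programmatically through the .attrs dictionary of each variable. A minimal sketch (built on a made-up Dataset, since the exact attribute names stored in a given netCDF file depend on how it was written):

```python
import numpy as np
import xarray as xr

# build a tiny dataset with "units" attributes, mimicking what
# xr.open_dataset() would read from a well-formed netCDF file
ds = xr.Dataset(
    {"U": (("depth",), np.zeros(3))},
    coords={"depth": [2.1, 26.25, 65.0]},
)
ds["U"].attrs["units"] = "m/s"
ds["depth"].attrs["units"] = "m"

print(ds["U"].attrs["units"])      # m/s
print(ds["depth"].attrs["units"])  # m
```

Dataset-level metadata lives in the analogous ds.attrs dictionary.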

From pandas DataFrame to xarray Dataset#

Occasionally, you may need to convert a pandas DataFrame into an xarray Dataset and vice versa. To give a concrete example, consider the CalCOFI data we examined in week 6. One may argue that the data is grid-like when you consider time and depth as coordinates, and there may be advantages to turning it into an xarray Dataset.

To see how this may work, we load a version of the CalCOFI subset in which the depth is binned (you can download a copy of the file here):

CalCOFI2 = pd.read_csv("data/CalCOFI_binned.csv", parse_dates = ["Datetime"])
display(CalCOFI2)
Cast_Count Station_ID Datetime Depth_bin_m T_degC Salinity SigmaTheta
0 992 090.0 070.0 1950-02-06 19:54:00 5 14.040 33.1700 24.76600
1 992 090.0 070.0 1950-02-06 19:54:00 15 13.950 33.2100 24.81500
2 992 090.0 070.0 1950-02-06 19:54:00 25 13.900 33.2100 24.82600
3 992 090.0 070.0 1950-02-06 19:54:00 35 13.810 33.2180 24.85100
4 992 090.0 070.0 1950-02-06 19:54:00 55 13.250 33.1500 24.91200
... ... ... ... ... ... ... ...
6402 35578 090.0 070.0 2021-01-21 13:36:00 205 8.518 34.0402 26.44858
6403 35578 090.0 070.0 2021-01-21 13:36:00 255 8.104 34.1405 26.59119
6404 35578 090.0 070.0 2021-01-21 13:36:00 275 8.012 34.1498 26.61270
6405 35578 090.0 070.0 2021-01-21 13:36:00 305 7.692 34.1712 26.67697
6406 35578 090.0 070.0 2021-01-21 13:36:00 385 7.144 34.2443 26.81386

6407 rows × 7 columns

To convert the pandas DataFrame to an xarray Dataset, we need to tell xarray which column(s) (“variable(s)” from the xarray perspective) are coordinates. We do so by converting such columns into a row (multi-)index. In our case, the relevant columns are Datetime and Depth_bin_m, and we use the .set_index() method to turn these into the row index.

Next, let’s say we want the resulting xarray Dataset to contain (and only contain) T_degC, Salinity, and SigmaTheta as data variables. Then we should make sure these are the only remaining columns after the index is set. To do so we can use the .loc[] method.

Combining, we transform our pandas DataFrame like so:

CalCOFI3 = CalCOFI2.set_index(["Datetime", "Depth_bin_m"])
CalCOFI3 = CalCOFI3.loc[:, ["T_degC", "Salinity", "SigmaTheta"]]
display(CalCOFI3)
T_degC Salinity SigmaTheta
Datetime Depth_bin_m
1950-02-06 19:54:00 5 14.040 33.1700 24.76600
15 13.950 33.2100 24.81500
25 13.900 33.2100 24.82600
35 13.810 33.2180 24.85100
55 13.250 33.1500 24.91200
... ... ... ... ...
2021-01-21 13:36:00 205 8.518 34.0402 26.44858
255 8.104 34.1405 26.59119
275 8.012 34.1498 26.61270
305 7.692 34.1712 26.67697
385 7.144 34.2443 26.81386

6407 rows × 3 columns

We are now ready to convert this DataFrame into an xarray Dataset, and all it takes is an xr.Dataset.from_dataframe() call:

CalCOFI_xr = xr.Dataset.from_dataframe(CalCOFI3)
display(CalCOFI_xr)
<xarray.Dataset> Size: 333kB
Dimensions:      (Datetime: 344, Depth_bin_m: 40)
Coordinates:
  * Datetime     (Datetime) datetime64[ns] 3kB 1950-02-06T19:54:00 ... 2021-0...
  * Depth_bin_m  (Depth_bin_m) int64 320B 5 15 25 35 45 ... 355 365 375 385 395
Data variables:
    T_degC       (Datetime, Depth_bin_m) float64 110kB 14.04 13.95 ... 7.144 nan
    Salinity     (Datetime, Depth_bin_m) float64 110kB 33.17 33.21 ... 34.24 nan
    SigmaTheta   (Datetime, Depth_bin_m) float64 110kB 24.77 24.82 ... 26.81 nan
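As an aside, the same conversion is also available from the pandas side: a DataFrame with an appropriate (multi-)index has a .to_xarray() method that produces an equivalent Dataset. A sketch on a toy frame (made-up values, shaped like a miniature CalCOFI subset):

```python
import numpy as np
import pandas as pd

# a tiny long-format frame: two dates, two depth bins, one combination missing
df = pd.DataFrame({
    "Datetime": pd.to_datetime(["2000-01-01", "2000-01-01", "2000-01-02"]),
    "Depth_bin_m": [5, 15, 5],
    "T_degC": [14.0, 13.9, 14.2],
}).set_index(["Datetime", "Depth_bin_m"])

ds = df.to_xarray()
print(ds.sizes)  # grid is completed: 2 dates x 2 depth bins
```

Note that the (2000-01-02, 15 m) combination, absent from the frame, shows up as NaN in the Dataset, just as in the CalCOFI example below.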

Note that xarray automatically “completes” the grid and fills in some missing values for us. For example, there is no measurement near 385 m on 1950-02-06: this entry is implicitly missing in the pandas DataFrame, but becomes explicitly missing in the xarray Dataset, since measurements near 385 m did happen on other dates:

CalCOFI2.loc[
    (CalCOFI2["Datetime"] == pd.to_datetime("1950-02-06 19:54")) & 
    (CalCOFI2["Depth_bin_m"] == 385)
]
Cast_Count Station_ID Datetime Depth_bin_m T_degC Salinity SigmaTheta
CalCOFI_xr.sel(Datetime=pd.to_datetime("1950-02-06 19:54"), Depth_bin_m = 385)
<xarray.Dataset> Size: 40B
Dimensions:      ()
Coordinates:
    Datetime     datetime64[ns] 8B 1950-02-06T19:54:00
    Depth_bin_m  int64 8B 385
Data variables:
    T_degC       float64 8B nan
    Salinity     float64 8B nan
    SigmaTheta   float64 8B nan

From xarray Dataset to pandas DataFrame#

For the converse (from xarray to pandas), consider the tide gauge measurements near Key West, FL, courtesy of the University of Hawaii Sea Level Center (you can download a copy of the netCDF file here):

gauge_xr = xr.open_dataset("data/tide_gauges.nc")
display(gauge_xr)
<xarray.Dataset> Size: 24MB
Dimensions:          (record_id: 4, time: 979146)
Coordinates:
  * time             (time) datetime64[ns] 8MB 1913-01-19T06:00:00 ... 2024-0...
  * record_id        (record_id) int16 8B 2570 7550 7620 2420
Data variables:
    sea_level        (record_id, time) float32 16MB ...
    lat              (record_id) float32 16B ...
    lon              (record_id) float32 16B ...
    station_name     (record_id) <U16 256B ...
    station_country  (record_id) <U30 480B ...
Attributes:
    title:                  UHSLC Fast Delivery Tide Gauge Data (hourly)
    ncei_template_version:  NCEI_NetCDF_TimeSeries_Orthogonal_Template_v2.0
    featureType:            timeSeries
    Conventions:            CF-1.6, ACDD-1.3
    date_created:           2024-11-07T14:27:39Z
    publisher_name:         University of Hawaii Sea Level Center (UHSLC)
    publisher_email:        philiprt@hawaii.edu, markm@soest.hawaii.edu
    publisher_url:          http://uhslc.soest.hawaii.edu
    summary:                The UHSLC assembles and distributes the Fast Deli...
    processing_level:       Fast Delivery (FD) data undergo a level 1 quality...
    acknowledgment:         The UHSLC Fast Delivery database is supported by ...

Observe that the record_id dimension only has 4 coordinate values, which essentially identify the stations. Thus, the data are essentially one-dimensional (along time), and it makes sense to present them in tabular form.

To convert an xarray Dataset to a pandas DataFrame, all you need is to call the .to_dataframe() method of the Dataset:

gauge_pd = gauge_xr.to_dataframe()
display(gauge_pd)
sea_level lat lon station_name station_country
record_id time
2570 1913-01-19 06:00:00.000000 NaN 26.690001 281.016998 Settlement Point Bahamas (the)
1913-01-19 07:00:00.028800 NaN 26.690001 281.016998 Settlement Point Bahamas (the)
1913-01-19 07:59:59.971200 NaN 26.690001 281.016998 Settlement Point Bahamas (the)
1913-01-19 09:00:00.000000 NaN 26.690001 281.016998 Settlement Point Bahamas (the)
1913-01-19 10:00:00.028800 NaN 26.690001 281.016998 Settlement Point Bahamas (the)
... ... ... ... ... ... ...
2420 2024-09-30 19:00:00.028800 1744.0 24.552999 278.191986 Key West, FL United States of America (the)
2024-09-30 19:59:59.971200 1740.0 24.552999 278.191986 Key West, FL United States of America (the)
2024-09-30 21:00:00.000000 1801.0 24.552999 278.191986 Key West, FL United States of America (the)
2024-09-30 22:00:00.028800 1893.0 24.552999 278.191986 Key West, FL United States of America (the)
2024-09-30 22:59:59.971200 1978.0 24.552999 278.191986 Key West, FL United States of America (the)

3916584 rows × 5 columns

Notice that the coordinates (time and record_id) of the Dataset have become the (multi-)index (row labels) of the DataFrame. To convert the index back to regular columns, we apply the .reset_index() method:

gauge_pd = gauge_pd.reset_index()
display(gauge_pd)
record_id time sea_level lat lon station_name station_country
0 2570 1913-01-19 06:00:00.000000 NaN 26.690001 281.016998 Settlement Point Bahamas (the)
1 2570 1913-01-19 07:00:00.028800 NaN 26.690001 281.016998 Settlement Point Bahamas (the)
2 2570 1913-01-19 07:59:59.971200 NaN 26.690001 281.016998 Settlement Point Bahamas (the)
3 2570 1913-01-19 09:00:00.000000 NaN 26.690001 281.016998 Settlement Point Bahamas (the)
4 2570 1913-01-19 10:00:00.028800 NaN 26.690001 281.016998 Settlement Point Bahamas (the)
... ... ... ... ... ... ... ...
3916579 2420 2024-09-30 19:00:00.028800 1744.0 24.552999 278.191986 Key West, FL United States of America (the)
3916580 2420 2024-09-30 19:59:59.971200 1740.0 24.552999 278.191986 Key West, FL United States of America (the)
3916581 2420 2024-09-30 21:00:00.000000 1801.0 24.552999 278.191986 Key West, FL United States of America (the)
3916582 2420 2024-09-30 22:00:00.028800 1893.0 24.552999 278.191986 Key West, FL United States of America (the)
3916583 2420 2024-09-30 22:59:59.971200 1978.0 24.552999 278.191986 Key West, FL United States of America (the)

3916584 rows × 7 columns

We can now manipulate this DataFrame using the usual DataFrame methods, export the results as a csv, and so on.
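For instance, a per-station summary via groupby (sketched on a toy frame standing in for gauge_pd, with made-up sea levels, since the real file is not reproduced here):

```python
import pandas as pd

# stand-in for gauge_pd after reset_index(): one row per (station, time)
toy = pd.DataFrame({
    "record_id": [2570, 2570, 2420, 2420],
    "sea_level": [1000.0, 1010.0, 1740.0, 1744.0],
})

# mean sea level per station
mean_sl = toy.groupby("record_id")["sea_level"].mean()
print(mean_sl.loc[2420])  # 1742.0

# export, e.g.: mean_sl.to_csv("mean_sea_level.csv")
```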

Combine multiple xarray Datasets#

As in the case of csv files, sometimes you can only download a netCDF file for a subset of the data you want, and before your analysis you’ll need to combine multiple xarray Datasets into one. As an example, the three .nc files below contain ocean surface temperature from OISST in 2025 on Jan 15, Feb 15, and Mar 15.

oisst_20250115 = xr.open_dataset("Data/oisst-20250115.nc")
oisst_20250215 = xr.open_dataset("Data/oisst-20250215.nc")
oisst_20250315 = xr.open_dataset("Data/oisst-20250315.nc")
display(oisst_20250115)
<xarray.Dataset> Size: 17MB
Dimensions:  (time: 1, zlev: 1, lat: 720, lon: 1440)
Coordinates:
  * time     (time) datetime64[ns] 8B 2025-01-15T12:00:00
  * zlev     (zlev) float32 4B 0.0
  * lat      (lat) float32 3kB -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float32 6kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
Data variables:
    sst      (time, zlev, lat, lon) float32 4MB ...
    anom     (time, zlev, lat, lon) float32 4MB ...
    err      (time, zlev, lat, lon) float32 4MB ...
    ice      (time, zlev, lat, lon) float32 4MB ...
Attributes: (12/37)
    Conventions:                CF-1.6, ACDD-1.3
    title:                      NOAA/NCEI 1/4 Degree Daily Optimum Interpolat...
    references:                 Reynolds, et al.(2007) Daily High-Resolution-...
    source:                     ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfin...
    id:                         oisst-avhrr-v02r01.20250115.nc
    naming_authority:           gov.noaa.ncei
    ...                         ...
    time_coverage_start:        2025-01-15T00:00:00Z
    time_coverage_end:          2025-01-15T23:59:59Z
    metadata_link:              https://doi.org/10.25921/RE9P-PT57
    ncei_template_version:      NCEI_NetCDF_Grid_Template_v2.0
    comment:                    Data was converted from NetCDF-3 to NetCDF-4 ...
    sensor:                     Thermometer, AVHRR

We can merge these xarray Datasets into one using xr.concat(), passing dim="time" to tell xarray that we want to concatenate along the time dimension:

oisst_all = xr.concat([oisst_20250115, oisst_20250215, oisst_20250315], dim="time")
display(oisst_all)
<xarray.Dataset> Size: 50MB
Dimensions:  (time: 3, zlev: 1, lat: 720, lon: 1440)
Coordinates:
  * time     (time) datetime64[ns] 24B 2025-01-15T12:00:00 ... 2025-03-15T12:...
  * zlev     (zlev) float32 4B 0.0
  * lat      (lat) float32 3kB -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float32 6kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
Data variables:
    sst      (time, zlev, lat, lon) float32 12MB nan nan nan ... -1.8 -1.8 -1.8
    anom     (time, zlev, lat, lon) float32 12MB nan nan nan nan ... 0.0 0.0 0.0
    err      (time, zlev, lat, lon) float32 12MB nan nan nan nan ... 0.3 0.3 0.3
    ice      (time, zlev, lat, lon) float32 12MB nan nan nan ... 0.98 0.98 0.98
Attributes: (12/37)
    Conventions:                CF-1.6, ACDD-1.3
    title:                      NOAA/NCEI 1/4 Degree Daily Optimum Interpolat...
    references:                 Reynolds, et al.(2007) Daily High-Resolution-...
    source:                     ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfin...
    id:                         oisst-avhrr-v02r01.20250115.nc
    naming_authority:           gov.noaa.ncei
    ...                         ...
    time_coverage_start:        2025-01-15T00:00:00Z
    time_coverage_end:          2025-01-15T23:59:59Z
    metadata_link:              https://doi.org/10.25921/RE9P-PT57
    ncei_template_version:      NCEI_NetCDF_Grid_Template_v2.0
    comment:                    Data was converted from NetCDF-3 to NetCDF-4 ...
    sensor:                     Thermometer, AVHRR
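The same pattern can be checked on toy Datasets (made-up values, just to confirm that xr.concat stacks along the named dimension while leaving the others alone):

```python
import numpy as np
import xarray as xr

def month_snapshot(date):
    """A tiny one-time-step Dataset mimicking a single OISST file."""
    return xr.Dataset(
        {"sst": (("time", "lat"), np.zeros((1, 3)))},
        coords={"time": [np.datetime64(date)], "lat": [-10.0, 0.0, 10.0]},
    )

parts = [month_snapshot(d) for d in ("2025-01-15", "2025-02-15", "2025-03-15")]
combined = xr.concat(parts, dim="time")
print(combined.sizes["time"])  # 3
```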

Occasionally you’ll need to create an extra dimension so that your individual Datasets can be merged along this new dimension. To do so we can use the .expand_dims() method of both Dataset and DataArray (which creates the dimension), followed by the .assign_coords() method (which assigns coordinates to the new dimension).

Taking oisst_20250115 as an example, suppose we want to create a new dimension called month; we may do:

oisst_20250115.expand_dims(dim={"month": 1}).assign_coords({"month": [1]})
<xarray.Dataset> Size: 17MB
Dimensions:  (month: 1, time: 1, zlev: 1, lat: 720, lon: 1440)
Coordinates:
  * time     (time) datetime64[ns] 8B 2025-01-15T12:00:00
  * zlev     (zlev) float32 4B 0.0
  * lat      (lat) float32 3kB -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float32 6kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
  * month    (month) int32 4B 1
Data variables:
    sst      (month, time, zlev, lat, lon) float32 4MB nan nan nan ... -1.8 -1.8
    anom     (month, time, zlev, lat, lon) float32 4MB nan nan nan ... 0.0 0.0
    err      (month, time, zlev, lat, lon) float32 4MB nan nan nan ... 0.3 0.3
    ice      (month, time, zlev, lat, lon) float32 4MB nan nan nan ... 0.98 0.98
Attributes: (12/37)
    Conventions:                CF-1.6, ACDD-1.3
    title:                      NOAA/NCEI 1/4 Degree Daily Optimum Interpolat...
    references:                 Reynolds, et al.(2007) Daily High-Resolution-...
    source:                     ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfin...
    id:                         oisst-avhrr-v02r01.20250115.nc
    naming_authority:           gov.noaa.ncei
    ...                         ...
    time_coverage_start:        2025-01-15T00:00:00Z
    time_coverage_end:          2025-01-15T23:59:59Z
    metadata_link:              https://doi.org/10.25921/RE9P-PT57
    ncei_template_version:      NCEI_NetCDF_Grid_Template_v2.0
    comment:                    Data was converted from NetCDF-3 to NetCDF-4 ...
    sensor:                     Thermometer, AVHRR
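As a shortcut, .expand_dims() also accepts a sequence of coordinate values directly, which creates the dimension and assigns its coordinates in one step. A sketch on a toy Dataset (so it runs without the OISST file):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"sst": (("lat",), np.zeros(2))}, coords={"lat": [0.0, 10.0]})

# one step: a new "month" dimension of length 1 with coordinate value 1
ds_month = ds.expand_dims(month=[1])
print(ds_month.sizes)            # now includes month: 1
print(ds_month["month"].values)  # [1]
```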

Export xarray Datasets as netCDF file#

As in the case of pandas DataFrames, sometimes we want to save the Dataset obtained after some manipulations into a new netCDF file. We can do so using the .to_netcdf() method of Dataset and DataArray. Importantly, you may want to make sure that each data variable is compressed so that your file does not become exceedingly large. The way to specify compression is to supply a nested dictionary to the encoding argument, where each data variable is a key and the corresponding value is itself a dictionary of compression options. As an example, to save the oisst_all Dataset we created earlier:

# NOTE: the output folder has to already exist
oisst_all.to_netcdf("output/oisst_all.nc", encoding = {
    "sst": {"zlib": True, "complevel": 9},
    "anom": {"zlib": True, "complevel": 9},
    "err": {"zlib": True, "complevel": 9},
    "ice": {"zlib": True, "complevel": 9}
})
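When a Dataset has many data variables, the encoding dictionary can be built with a comprehension over .data_vars instead of being spelled out by hand (a sketch on a toy Dataset; the compression settings mirror the call above):

```python
import numpy as np
import xarray as xr

# a toy Dataset standing in for oisst_all
ds = xr.Dataset({
    "sst": (("lat",), np.zeros(4)),
    "anom": (("lat",), np.zeros(4)),
})

# same zlib settings for every data variable
encoding = {var: {"zlib": True, "complevel": 9} for var in ds.data_vars}
print(sorted(encoding))  # ['anom', 'sst']

# ds.to_netcdf("output/toy.nc", encoding=encoding)  # output/ must already exist
```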