Xarray and gridded data#
# initialization
import pandas as pd
Oceanographic data are often gridded: for example, we may record ocean temperature over a range of latitudes and longitudes, as well as a range of depths. Such data are inherently 3-dimensional.
Naively, we may store such data in a 3-dimensional numpy array. However, in doing so we lose information about the coordinates of the grid, e.g., the value of depth that corresponds to an index along the third axis. What we need is a higher-dimensional equivalent of pandas, where information about the coordinates is stored alongside the data.
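To see the bookkeeping burden, here is a minimal numpy-only sketch (the array shapes mirror the dataset we will load below, but the coordinate values are made up):

import numpy as np

# a 3-D temperature grid with axes (depth, lat, lon)
temp = np.zeros((10, 147, 135))

# the coordinate values must be carried around in separate arrays...
depth = np.linspace(0.0, 1100.0, 10)
lat = np.linspace(-78.0, -30.0, 147)
lon = np.linspace(90.0, 179.5, 135)

# ...and it is on us to remember which axis is which
temp_at_deepest = temp[-1, :, :]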
The third-party xarray module provides such an extension. As an added benefit, xarray also provides an interface to load and save netCDF files, a common external file format for gridded data.
To import xarray, run the line below (again, the as xr part is optional but standard).
import xarray as xr
Loading and inspecting a netCDF file#
As an example of external netCDF data, we will work with a subset of the B-SOSE (Biogeochemical Southern Ocean State Estimate) model output. The data is stored in the netCDF file bsose_monthly_velocities.nc accessible here.
To load the data, we call the xr.open_dataset() function, and assign the result to a variable:
bsose = xr.open_dataset("data/bsose_monthly_velocities.nc")
We can inspect the data using the display() function. The output is an interactive HTML snippet with parts that you can expand or hide:
display(bsose)
<xarray.Dataset> Size: 29MB
Dimensions: (time: 12, depth: 10, lat: 147, lon: 135)
Coordinates:
* time (time) datetime64[ns] 96B 2012-01-30T20:00:00 ... 2012-12-30T12:...
* lat (lat) float32 588B -77.93 -77.79 -77.65 ... -31.09 -30.52 -29.94
* lon (lon) float32 540B 90.17 90.83 91.5 92.17 ... 178.2 178.8 179.5
* depth (depth) float32 40B 2.1 26.25 65.0 105.0 ... 450.0 700.0 1.1e+03
Data variables:
U (time, depth, lat, lon) float32 10MB ...
V (time, depth, lat, lon) float64 19MB ...
Attributes:
name: B-SOSE (Southern Ocean State Estimate) model output

Notice that bsose is an xarray Dataset object. Note also that multiple sets of data ("Data variables") are stored alongside the labels of the coordinates ("Coordinates") in a single object. In addition, the dimensions of the data (its shape, in numpy lingo) are listed in the "Dimensions" section at the top. Furthermore, there is metadata ("Attributes") associated with the whole dataset, such as its name.
By clicking on the document-shaped icon, you can see that each coordinate and data variable has metadata associated with it. For example, we see that depth is measured in m while U and V are measured in m/s. Finally, you can preview the actual data by clicking on the cylinder icon.
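The same information is also available programmatically. A quick sketch (assuming the file stores units under the conventional "units" attribute key):

print(bsose.dims)                         # mapping of dimension name to length
print(bsose["depth"].attrs.get("units"))  # expected: "m"
print(bsose["U"].attrs.get("units"))      # expected: "m/s"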
From pandas DataFrame to xarray Dataset#
Occasionally, you may need to convert a pandas DataFrame into an xarray Dataset, or vice versa. To give a concrete example, consider the CalCOFI data we examined in week 6. One may argue that the data is grid-like when you consider time and depth as coordinates, and there may be advantages to turning it into an xarray Dataset.
To see how this may work, we load a version of the CalCOFI subset in which the depth is binned (you can download a copy of the file here):
CalCOFI2 = pd.read_csv("data/CalCOFI_binned.csv", parse_dates = ["Datetime"])
display(CalCOFI2)
| | Cast_Count | Station_ID | Datetime | Depth_bin_m | T_degC | Salinity | SigmaTheta |
|---|---|---|---|---|---|---|---|
| 0 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 5 | 14.040 | 33.1700 | 24.76600 |
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 15 | 13.950 | 33.2100 | 24.81500 |
| 2 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 25 | 13.900 | 33.2100 | 24.82600 |
| 3 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 35 | 13.810 | 33.2180 | 24.85100 |
| 4 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 55 | 13.250 | 33.1500 | 24.91200 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 6402 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 205 | 8.518 | 34.0402 | 26.44858 |
| 6403 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 255 | 8.104 | 34.1405 | 26.59119 |
| 6404 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 275 | 8.012 | 34.1498 | 26.61270 |
| 6405 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 305 | 7.692 | 34.1712 | 26.67697 |
| 6406 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 385 | 7.144 | 34.2443 | 26.81386 |
6407 rows × 7 columns
To convert the pandas DataFrame to an xarray Dataset, we need to tell xarray which column(s) ("variable(s)", from the xarray perspective) are coordinates. We do so by converting such columns into the row (multi-)index. In our case, the relevant columns are Datetime and Depth_bin_m, and we use the .set_index() method to turn these into the row index.
Next, let's say we want the resulting xarray Dataset to contain (and contain only) T_degC, Salinity, and SigmaTheta as data variables. Then we should make sure these are the only remaining columns after the index is set. To do so we can use the .loc[] method.
Combining, we transform our pandas DataFrame like so:
CalCOFI3 = CalCOFI2.set_index(["Datetime", "Depth_bin_m"])
CalCOFI3 = CalCOFI3.loc[:, ["T_degC", "Salinity", "SigmaTheta"]]
display(CalCOFI3)
| | | T_degC | Salinity | SigmaTheta |
|---|---|---|---|---|
| Datetime | Depth_bin_m | | | |
| 1950-02-06 19:54:00 | 5 | 14.040 | 33.1700 | 24.76600 |
| | 15 | 13.950 | 33.2100 | 24.81500 |
| | 25 | 13.900 | 33.2100 | 24.82600 |
| | 35 | 13.810 | 33.2180 | 24.85100 |
| | 55 | 13.250 | 33.1500 | 24.91200 |
| ... | ... | ... | ... | ... |
| 2021-01-21 13:36:00 | 205 | 8.518 | 34.0402 | 26.44858 |
| | 255 | 8.104 | 34.1405 | 26.59119 |
| | 275 | 8.012 | 34.1498 | 26.61270 |
| | 305 | 7.692 | 34.1712 | 26.67697 |
| | 385 | 7.144 | 34.2443 | 26.81386 |
6407 rows × 3 columns
We are now ready to convert this DataFrame into an xarray Dataset, and all it takes is an xr.Dataset.from_dataframe() call:
CalCOFI_xr = xr.Dataset.from_dataframe(CalCOFI3)
display(CalCOFI_xr)
<xarray.Dataset> Size: 333kB
Dimensions: (Datetime: 344, Depth_bin_m: 40)
Coordinates:
* Datetime (Datetime) datetime64[ns] 3kB 1950-02-06T19:54:00 ... 2021-0...
* Depth_bin_m (Depth_bin_m) int64 320B 5 15 25 35 45 ... 355 365 375 385 395
Data variables:
T_degC (Datetime, Depth_bin_m) float64 110kB 14.04 13.95 ... 7.144 nan
Salinity (Datetime, Depth_bin_m) float64 110kB 33.17 33.21 ... 34.24 nan
SigmaTheta (Datetime, Depth_bin_m) float64 110kB 24.77 24.82 ... 26.81 nan

Note that xarray automatically "completes" the grid, filling in missing values for us. For example, there is no measurement near 385 m on 1950-02-06; this combination is implicitly missing in the pandas DataFrame, but becomes explicitly missing in the xarray Dataset, since measurements near 385 m did happen on other dates:
CalCOFI2.loc[
(CalCOFI2["Datetime"] == pd.to_datetime("1950-02-06 19:54")) &
(CalCOFI2["Depth_bin_m"] == 385)
]
| Cast_Count | Station_ID | Datetime | Depth_bin_m | T_degC | Salinity | SigmaTheta |
|---|---|---|---|---|---|---|
CalCOFI_xr.sel(Datetime=pd.to_datetime("1950-02-06 19:54"), Depth_bin_m=385)
<xarray.Dataset> Size: 40B
Dimensions: ()
Coordinates:
Datetime datetime64[ns] 8B 1950-02-06T19:54:00
Depth_bin_m int64 8B 385
Data variables:
T_degC float64 8B nan
Salinity float64 8B nan
SigmaTheta float64 8B nan
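As an aside, .sel() also supports inexact matching via its method argument; for example, a nearest-neighbor look-up (the query depth of 380 here is arbitrary):

CalCOFI_xr.sel(Depth_bin_m=380, method="nearest")

From xarray Dataset to pandas DataFrame#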
For the converse conversion (from xarray to pandas), consider tide gauge measurements near Key West, FL, courtesy of the University of Hawaii Sea Level Center (you can download a copy of the netCDF file here).
gauge_xr = xr.open_dataset("data/tide_gauges.nc")
display(gauge_xr)
<xarray.Dataset> Size: 24MB
Dimensions: (record_id: 4, time: 979146)
Coordinates:
* time (time) datetime64[ns] 8MB 1913-01-19T06:00:00 ... 2024-0...
* record_id (record_id) int16 8B 2570 7550 7620 2420
Data variables:
sea_level (record_id, time) float32 16MB ...
lat (record_id) float32 16B ...
lon (record_id) float32 16B ...
station_name (record_id) <U16 256B ...
station_country (record_id) <U30 480B ...
Attributes:
title: UHSLC Fast Delivery Tide Gauge Data (hourly)
ncei_template_version: NCEI_NetCDF_TimeSeries_Orthogonal_Template_v2.0
featureType: timeSeries
Conventions: CF-1.6, ACDD-1.3
date_created: 2024-11-07T14:27:39Z
publisher_name: University of Hawaii Sea Level Center (UHSLC)
publisher_email: philiprt@hawaii.edu, markm@soest.hawaii.edu
publisher_url: http://uhslc.soest.hawaii.edu
summary: The UHSLC assembles and distributes the Fast Deli...
processing_level: Fast Delivery (FD) data undergo a level 1 quality...
acknowledgment: The UHSLC Fast Delivery database is supported by ...

Observe that the record_id dimension has only 4 coordinate values, which essentially identify the station. Thus the data is close to one-dimensional, and it makes sense to present it in tabular form.
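To check which stations these are, we can print the station_name variable (output not shown):

print(gauge_xr["station_name"].values)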
To convert an xarray Dataset to a pandas DataFrame, all you need to do is call the .to_dataframe() method of the Dataset:
gauge_pd = gauge_xr.to_dataframe()
display(gauge_pd)
| | | sea_level | lat | lon | station_name | station_country |
|---|---|---|---|---|---|---|
| record_id | time | | | | | |
| 2570 | 1913-01-19 06:00:00.000000 | NaN | 26.690001 | 281.016998 | Settlement Point | Bahamas (the) |
| | 1913-01-19 07:00:00.028800 | NaN | 26.690001 | 281.016998 | Settlement Point | Bahamas (the) |
| | 1913-01-19 07:59:59.971200 | NaN | 26.690001 | 281.016998 | Settlement Point | Bahamas (the) |
| | 1913-01-19 09:00:00.000000 | NaN | 26.690001 | 281.016998 | Settlement Point | Bahamas (the) |
| | 1913-01-19 10:00:00.028800 | NaN | 26.690001 | 281.016998 | Settlement Point | Bahamas (the) |
| ... | ... | ... | ... | ... | ... | ... |
| 2420 | 2024-09-30 19:00:00.028800 | 1744.0 | 24.552999 | 278.191986 | Key West, FL | United States of America (the) |
| | 2024-09-30 19:59:59.971200 | 1740.0 | 24.552999 | 278.191986 | Key West, FL | United States of America (the) |
| | 2024-09-30 21:00:00.000000 | 1801.0 | 24.552999 | 278.191986 | Key West, FL | United States of America (the) |
| | 2024-09-30 22:00:00.028800 | 1893.0 | 24.552999 | 278.191986 | Key West, FL | United States of America (the) |
| | 2024-09-30 22:59:59.971200 | 1978.0 | 24.552999 | 278.191986 | Key West, FL | United States of America (the) |
3916584 rows × 5 columns
Notice that the coordinates (time and record_id) of the Dataset have become the (multi-)index (row labels) of the DataFrame. To convert the index back to regular columns, we apply the .reset_index() method:
gauge_pd = gauge_pd.reset_index()
display(gauge_pd)
| | record_id | time | sea_level | lat | lon | station_name | station_country |
|---|---|---|---|---|---|---|---|
| 0 | 2570 | 1913-01-19 06:00:00.000000 | NaN | 26.690001 | 281.016998 | Settlement Point | Bahamas (the) |
| 1 | 2570 | 1913-01-19 07:00:00.028800 | NaN | 26.690001 | 281.016998 | Settlement Point | Bahamas (the) |
| 2 | 2570 | 1913-01-19 07:59:59.971200 | NaN | 26.690001 | 281.016998 | Settlement Point | Bahamas (the) |
| 3 | 2570 | 1913-01-19 09:00:00.000000 | NaN | 26.690001 | 281.016998 | Settlement Point | Bahamas (the) |
| 4 | 2570 | 1913-01-19 10:00:00.028800 | NaN | 26.690001 | 281.016998 | Settlement Point | Bahamas (the) |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3916579 | 2420 | 2024-09-30 19:00:00.028800 | 1744.0 | 24.552999 | 278.191986 | Key West, FL | United States of America (the) |
| 3916580 | 2420 | 2024-09-30 19:59:59.971200 | 1740.0 | 24.552999 | 278.191986 | Key West, FL | United States of America (the) |
| 3916581 | 2420 | 2024-09-30 21:00:00.000000 | 1801.0 | 24.552999 | 278.191986 | Key West, FL | United States of America (the) |
| 3916582 | 2420 | 2024-09-30 22:00:00.028800 | 1893.0 | 24.552999 | 278.191986 | Key West, FL | United States of America (the) |
| 3916583 | 2420 | 2024-09-30 22:59:59.971200 | 1978.0 | 24.552999 | 278.191986 | Key West, FL | United States of America (the) |
3916584 rows × 7 columns
We can now manipulate this DataFrame using the usual DataFrame methods, export the results as a CSV, and so on.
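For example, a quick sketch of such a workflow (the output folder is assumed to exist, and the file name is our own):

# average sea level per station, then export the result as a CSV
gauge_pd.groupby("station_name")["sea_level"].mean().to_csv("output/mean_sea_level.csv")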
Combining multiple xarray Datasets#
As in the case of CSV files, sometimes you can only download a netCDF file for a subset of the data you want, and before your analysis you'll need to combine multiple xarray Datasets into one. As an example, the 3 .nc files below contain sea surface temperature from OISST on Jan 15, Feb 15, and Mar 15 of 2025.
oisst_20250115 = xr.open_dataset("data/oisst-20250115.nc")
oisst_20250215 = xr.open_dataset("data/oisst-20250215.nc")
oisst_20250315 = xr.open_dataset("data/oisst-20250315.nc")
display(oisst_20250115)
<xarray.Dataset> Size: 17MB
Dimensions: (time: 1, zlev: 1, lat: 720, lon: 1440)
Coordinates:
* time (time) datetime64[ns] 8B 2025-01-15T12:00:00
* zlev (zlev) float32 4B 0.0
* lat (lat) float32 3kB -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
* lon (lon) float32 6kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
Data variables:
sst (time, zlev, lat, lon) float32 4MB ...
anom (time, zlev, lat, lon) float32 4MB ...
err (time, zlev, lat, lon) float32 4MB ...
ice (time, zlev, lat, lon) float32 4MB ...
Attributes: (12/37)
Conventions: CF-1.6, ACDD-1.3
title: NOAA/NCEI 1/4 Degree Daily Optimum Interpolat...
references: Reynolds, et al.(2007) Daily High-Resolution-...
source: ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfin...
id: oisst-avhrr-v02r01.20250115.nc
naming_authority: gov.noaa.ncei
... ...
time_coverage_start: 2025-01-15T00:00:00Z
time_coverage_end: 2025-01-15T23:59:59Z
metadata_link: https://doi.org/10.25921/RE9P-PT57
ncei_template_version: NCEI_NetCDF_Grid_Template_v2.0
comment: Data was converted from NetCDF-3 to NetCDF-4 ...
sensor: Thermometer, AVHRR

We can merge these xarray Datasets into one using xr.concat(), passing dim="time" to tell xarray that we want to merge along the time dimension:
oisst_all = xr.concat([oisst_20250115, oisst_20250215, oisst_20250315], dim="time")
display(oisst_all)
<xarray.Dataset> Size: 50MB
Dimensions: (time: 3, zlev: 1, lat: 720, lon: 1440)
Coordinates:
* time (time) datetime64[ns] 24B 2025-01-15T12:00:00 ... 2025-03-15T12:...
* zlev (zlev) float32 4B 0.0
* lat (lat) float32 3kB -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
* lon (lon) float32 6kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
Data variables:
sst (time, zlev, lat, lon) float32 12MB nan nan nan ... -1.8 -1.8 -1.8
anom (time, zlev, lat, lon) float32 12MB nan nan nan nan ... 0.0 0.0 0.0
err (time, zlev, lat, lon) float32 12MB nan nan nan nan ... 0.3 0.3 0.3
ice (time, zlev, lat, lon) float32 12MB nan nan nan ... 0.98 0.98 0.98
Attributes: (12/37)
Conventions: CF-1.6, ACDD-1.3
title: NOAA/NCEI 1/4 Degree Daily Optimum Interpolat...
references: Reynolds, et al.(2007) Daily High-Resolution-...
source: ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfin...
id: oisst-avhrr-v02r01.20250115.nc
naming_authority: gov.noaa.ncei
... ...
time_coverage_start: 2025-01-15T00:00:00Z
time_coverage_end: 2025-01-15T23:59:59Z
metadata_link: https://doi.org/10.25921/RE9P-PT57
ncei_template_version: NCEI_NetCDF_Grid_Template_v2.0
comment: Data was converted from NetCDF-3 to NetCDF-4 ...
sensor: Thermometer, AVHRR

Note that the attributes of the merged Dataset are taken from the first Dataset in the list, which is why the id above still refers to January 15.

Occasionally you'll need to create an extra dimension so that your individual Datasets can be merged along this new dimension. To do so we can use the .expand_dims() method of both Dataset and DataArray (which creates the dimension), followed by the .assign_coords() method (which assigns coordinates to the new dimension).
Taking oisst_20250115 as an example, suppose we want to create a new dimension called month. We may do:
oisst_20250115.expand_dims(dim={"month": 1}).assign_coords({"month": [1]})
<xarray.Dataset> Size: 17MB
Dimensions: (month: 1, time: 1, zlev: 1, lat: 720, lon: 1440)
Coordinates:
* time (time) datetime64[ns] 8B 2025-01-15T12:00:00
* zlev (zlev) float32 4B 0.0
* lat (lat) float32 3kB -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
* lon (lon) float32 6kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
* month (month) int32 4B 1
Data variables:
sst (month, time, zlev, lat, lon) float32 4MB nan nan nan ... -1.8 -1.8
anom (month, time, zlev, lat, lon) float32 4MB nan nan nan ... 0.0 0.0
err (month, time, zlev, lat, lon) float32 4MB nan nan nan ... 0.3 0.3
ice (month, time, zlev, lat, lon) float32 4MB nan nan nan ... 0.98 0.98
Attributes: (12/37)
Conventions: CF-1.6, ACDD-1.3
title: NOAA/NCEI 1/4 Degree Daily Optimum Interpolat...
references: Reynolds, et al.(2007) Daily High-Resolution-...
source: ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfin...
id: oisst-avhrr-v02r01.20250115.nc
naming_authority: gov.noaa.ncei
... ...
time_coverage_start: 2025-01-15T00:00:00Z
time_coverage_end: 2025-01-15T23:59:59Z
metadata_link: https://doi.org/10.25921/RE9P-PT57
ncei_template_version: NCEI_NetCDF_Grid_Template_v2.0
comment: Data was converted from NetCDF-3 to NetCDF-4 ...
sensor: Thermometer, AVHRR
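Putting the pieces together, below is a sketch of how the three snapshots could be stacked along the new month dimension. This is our own illustration rather than part of the original workflow: since each snapshot carries a different time value, we first drop the length-1 time axis so that the concatenation does not pad the time coordinate with missing values.

datasets = [oisst_20250115, oisst_20250215, oisst_20250315]
monthly = [
    ds.isel(time=0, drop=True)          # drop the length-1 time axis
      .expand_dims(dim={"month": 1})    # create the new month dimension
      .assign_coords({"month": [m]})    # label it with the month number
    for m, ds in zip([1, 2, 3], datasets)
]
oisst_by_month = xr.concat(monthly, dim="month")

Exporting xarray Datasets as netCDF files#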
As in the case of pandas DataFrames, sometimes we want to save a Dataset obtained after some manipulation to a new netCDF file. We can do so using the .to_netcdf() method of Dataset and DataArray. Importantly, you may want to make sure that each data variable is compressed so that your file does not become exceedingly large. To specify compression, supply a nested dictionary to the encoding argument, where each data variable name is a key and each value is itself a dictionary of compression options. As an example, here is how we can save the oisst_all Dataset we created:
# NOTE: the output folder has to already exist
oisst_all.to_netcdf("output/oisst_all.nc", encoding = {
"sst": {"zlib": True, "complevel": 9},
"anom": {"zlib": True, "complevel": 9},
"err": {"zlib": True, "complevel": 9},
"ice": {"zlib": True, "complevel": 9}
})
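Rather than spelling out the same options for every variable, you can also build the encoding dictionary programmatically. A small sketch, equivalent to the call above:

# one compression entry per data variable, via a dict comprehension
encoding = {var: {"zlib": True, "complevel": 9} for var in oisst_all.data_vars}
oisst_all.to_netcdf("output/oisst_all.nc", encoding=encoding)

# reopen the file to confirm the round trip worked
display(xr.open_dataset("output/oisst_all.nc"))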