Xarray and gridded data#

# initialization
import pandas as pd

Oceanographic data are often gridded: for example, we may record ocean temperature over a range of latitudes and longitudes, as well as a range of depths. Such data are inherently 3-dimensional.

Naively, we may store such data in a 3-dimensional numpy array. However, in doing so we’ll lose information about the coordinates of the grid, e.g., the value of depth that corresponds to an index along the third axis. What we need is a higher-dimensional equivalent of pandas, where information about the coordinates is stored alongside the data.

The third-party xarray module provides such an extension. As an added benefit, xarray also provides an interface to load and save netCDF files, a common external file format for gridded data.

To import xarray, run the line below (again, the as xr part is optional but standard).

import xarray as xr
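As a toy illustration of the idea (a sketch with made-up numbers, not real ocean data), here is a small 3-dimensional temperature array wrapped in an xr.DataArray so that the coordinate values travel with the data:

```python
import numpy as np
import xarray as xr

# a made-up 2 x 3 x 4 temperature grid (depth x lat x lon)
temp = np.arange(24, dtype=float).reshape(2, 3, 4)

temp_da = xr.DataArray(
    temp,
    dims=["depth", "lat", "lon"],
    coords={
        "depth": [5.0, 50.0],                 # meters
        "lat": [-60.0, -50.0, -40.0],         # degrees north
        "lon": [90.0, 100.0, 110.0, 120.0],   # degrees east
    },
    name="temperature",
)

# coordinate labels are preserved: we can select by value, not by position
print(temp_da.sel(depth=50.0, lat=-50.0, lon=110.0).item())  # 18.0
```

Compare this with plain numpy, where the same lookup would require us to remember separately that depth 50.0 m lives at index 1 of the first axis.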

Loading and inspecting a netCDF file#

As an example of external netCDF data, we will work with a subset of the B-SOSE (Biogeochemical Southern Ocean State Estimate) model output. The data is stored in the netCDF file bsose_monthly_velocities.nc accessible here.

To load the data, we call the xr.open_dataset() function, and assign the result to a variable:

bsose = xr.open_dataset("data/bsose_monthly_velocities.nc")

We can inspect the data using the display() function. The output is an interactive HTML snippet with parts that you can expand or collapse:

display(bsose)
<xarray.Dataset> Size: 29MB
Dimensions:  (time: 12, depth: 10, lat: 147, lon: 135)
Coordinates:
  * time     (time) datetime64[ns] 96B 2012-01-30T20:00:00 ... 2012-12-30T12:...
  * lat      (lat) float32 588B -77.93 -77.79 -77.65 ... -31.09 -30.52 -29.94
  * lon      (lon) float32 540B 90.17 90.83 91.5 92.17 ... 178.2 178.8 179.5
  * depth    (depth) float32 40B 2.1 26.25 65.0 105.0 ... 450.0 700.0 1.1e+03
Data variables:
    U        (time, depth, lat, lon) float32 10MB ...
    V        (time, depth, lat, lon) float64 19MB ...
Attributes:
    name:     B-SOSE (Southern Ocean State Estimate) model output

Notice that bsose is an xarray Dataset object. Note also that multiple sets of data (“Data variables”) are stored alongside the labels of the coordinates (“Coordinates”) in a single object. In addition, the dimensions (a.k.a. shape in numpy lingo) of the data are listed in the “Dimensions” section at the top. Furthermore, there are metadata (“Attributes”) associated with the whole dataset, such as its name.

By clicking on the document-looking icon, you can see that each coordinate and data variable has metadata associated with it. For example, we see that depth is measured in m while U and V are measured in m/s. Finally, you can preview the actual data by clicking on the cylinder icon.
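This per-variable metadata is also accessible programmatically through the .attrs dictionary of each variable. A minimal sketch (built on a made-up Dataset, since the exact attribute names stored in a given netCDF file depend on how it was written):

```python
import numpy as np
import xarray as xr

# build a tiny dataset with "units" attributes, mimicking what
# xr.open_dataset() would read from a well-formed netCDF file
ds = xr.Dataset(
    {"U": (("depth",), np.zeros(3))},
    coords={"depth": [2.1, 26.25, 65.0]},
)
ds["U"].attrs["units"] = "m/s"
ds["depth"].attrs["units"] = "m"

print(ds["U"].attrs["units"])      # m/s
print(ds["depth"].attrs["units"])  # m
```

Dataset-level metadata lives in the analogous ds.attrs dictionary.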

From pandas DataFrame to xarray Dataset#

Occasionally, you may need to convert a pandas DataFrame into an xarray Dataset and vice versa. To give a concrete example, consider the CalCOFI data we examined in week 6. One may argue that the data is grid-like when you consider time and depth as coordinates, and there may be advantages to turning it into an xarray Dataset.

To see how this may work, we load a version of the CalCOFI subset in which the depth is binned (you can download a copy of the file here):

CalCOFI2 = pd.read_csv("data/CalCOFI_binned.csv", parse_dates = ["Datetime"])
display(CalCOFI2)
Cast_Count Station_ID Datetime Depth_bin_m T_degC Salinity SigmaTheta
0 992 090.0 070.0 1950-02-06 19:54:00 5 14.040 33.1700 24.76600
1 992 090.0 070.0 1950-02-06 19:54:00 15 13.950 33.2100 24.81500
2 992 090.0 070.0 1950-02-06 19:54:00 25 13.900 33.2100 24.82600
3 992 090.0 070.0 1950-02-06 19:54:00 35 13.810 33.2180 24.85100
4 992 090.0 070.0 1950-02-06 19:54:00 55 13.250 33.1500 24.91200
... ... ... ... ... ... ... ...
6402 35578 090.0 070.0 2021-01-21 13:36:00 205 8.518 34.0402 26.44858
6403 35578 090.0 070.0 2021-01-21 13:36:00 255 8.104 34.1405 26.59119
6404 35578 090.0 070.0 2021-01-21 13:36:00 275 8.012 34.1498 26.61270
6405 35578 090.0 070.0 2021-01-21 13:36:00 305 7.692 34.1712 26.67697
6406 35578 090.0 070.0 2021-01-21 13:36:00 385 7.144 34.2443 26.81386

6407 rows × 7 columns

To convert the pandas DataFrame to an xarray Dataset, we need to tell xarray which column(s) (“variable(s)” from the xarray perspective) are coordinates. We do so by converting such columns into a row (multi-)index. In our case, the relevant columns are Datetime and Depth_bin_m, and we use the .set_index() method to turn these into the row index.

Next, let’s say we want the resulting xarray Dataset to contain (and only contain) T_degC, Salinity, and SigmaTheta as data variables. Then we should make sure these are the only remaining columns after the index is set. To do so we can use the .loc[] method.

Combining, we transform our pandas DataFrame like so:

CalCOFI3 = CalCOFI2.set_index(["Datetime", "Depth_bin_m"])
CalCOFI3 = CalCOFI3.loc[:, ["T_degC", "Salinity", "SigmaTheta"]]
display(CalCOFI3)
T_degC Salinity SigmaTheta
Datetime Depth_bin_m
1950-02-06 19:54:00 5 14.040 33.1700 24.76600
15 13.950 33.2100 24.81500
25 13.900 33.2100 24.82600
35 13.810 33.2180 24.85100
55 13.250 33.1500 24.91200
... ... ... ... ...
2021-01-21 13:36:00 205 8.518 34.0402 26.44858
255 8.104 34.1405 26.59119
275 8.012 34.1498 26.61270
305 7.692 34.1712 26.67697
385 7.144 34.2443 26.81386

6407 rows × 3 columns

We are now ready to convert this DataFrame into an xarray Dataset, and all it takes is an xr.Dataset.from_dataframe() call:

CalCOFI_xr = xr.Dataset.from_dataframe(CalCOFI3)
display(CalCOFI_xr)
<xarray.Dataset> Size: 333kB
Dimensions:      (Datetime: 344, Depth_bin_m: 40)
Coordinates:
  * Datetime     (Datetime) datetime64[ns] 3kB 1950-02-06T19:54:00 ... 2021-0...
  * Depth_bin_m  (Depth_bin_m) int64 320B 5 15 25 35 45 ... 355 365 375 385 395
Data variables:
    T_degC       (Datetime, Depth_bin_m) float64 110kB 14.04 13.95 ... 7.144 nan
    Salinity     (Datetime, Depth_bin_m) float64 110kB 33.17 33.21 ... 34.24 nan
    SigmaTheta   (Datetime, Depth_bin_m) float64 110kB 24.77 24.82 ... 26.81 nan
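As an aside, the same conversion is also available from the pandas side: a DataFrame with an appropriate (multi-)index has a .to_xarray() method that produces an equivalent Dataset. A sketch on a toy frame (made-up values, shaped like a miniature CalCOFI subset):

```python
import numpy as np
import pandas as pd

# a tiny long-format frame: two dates, two depth bins, one combination missing
df = pd.DataFrame({
    "Datetime": pd.to_datetime(["2000-01-01", "2000-01-01", "2000-01-02"]),
    "Depth_bin_m": [5, 15, 5],
    "T_degC": [14.0, 13.9, 14.2],
}).set_index(["Datetime", "Depth_bin_m"])

ds = df.to_xarray()
print(ds.sizes)  # grid is completed: 2 dates x 2 depth bins
```

Note that the (2000-01-02, 15 m) combination, absent from the frame, shows up as NaN in the Dataset, just as in the CalCOFI example below.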

Note that xarray automatically “completes” the grid and fills in some missing values for us. For example, there is no measurement near 385 m on 1950-02-06: this entry is implicitly missing in the pandas DataFrame, but becomes explicitly missing in the xarray Dataset, since measurements near 385 m did happen on other dates:

CalCOFI2.loc[
    (CalCOFI2["Datetime"] == pd.to_datetime("1950-02-06 19:54")) & 
    (CalCOFI2["Depth_bin_m"] == 385)
]
Cast_Count Station_ID Datetime Depth_bin_m T_degC Salinity SigmaTheta
CalCOFI_xr.sel(Datetime=pd.to_datetime("1950-02-06 19:54"), Depth_bin_m = 385)
<xarray.Dataset> Size: 40B
Dimensions:      ()
Coordinates:
    Datetime     datetime64[ns] 8B 1950-02-06T19:54:00
    Depth_bin_m  int64 8B 385
Data variables:
    T_degC       float64 8B nan
    Salinity     float64 8B nan
    SigmaTheta   float64 8B nan

From xarray Dataset to pandas DataFrame#

For the converse (from xarray to pandas), consider the tide gauge measurements near Key West, FL, courtesy of the University of Hawaii Sea Level Center (you can download a copy of the netCDF file here):

gauge_xr = xr.open_dataset("data/tide_gauges.nc")
display(gauge_xr)
<xarray.Dataset> Size: 24MB
Dimensions:          (record_id: 4, time: 979146)
Coordinates:
  * time             (time) datetime64[ns] 8MB 1913-01-19T06:00:00 ... 2024-0...
  * record_id        (record_id) int16 8B 2570 7550 7620 2420
Data variables:
    sea_level        (record_id, time) float32 16MB ...
    lat              (record_id) float32 16B ...
    lon              (record_id) float32 16B ...
    station_name     (record_id) <U16 256B ...
    station_country  (record_id) <U30 480B ...
Attributes:
    title:                  UHSLC Fast Delivery Tide Gauge Data (hourly)
    ncei_template_version:  NCEI_NetCDF_TimeSeries_Orthogonal_Template_v2.0
    featureType:            timeSeries
    Conventions:            CF-1.6, ACDD-1.3
    date_created:           2024-11-07T14:27:39Z
    publisher_name:         University of Hawaii Sea Level Center (UHSLC)
    publisher_email:        philiprt@hawaii.edu, markm@soest.hawaii.edu
    publisher_url:          http://uhslc.soest.hawaii.edu
    summary:                The UHSLC assembles and distributes the Fast Deli...
    processing_level:       Fast Delivery (FD) data undergo a level 1 quality...
    acknowledgment:         The UHSLC Fast Delivery database is supported by ...

Observe that the record_id dimension only has 4 coordinate values, which essentially identify the stations. Thus, the data are essentially one-dimensional (along time), and it makes sense to present them in tabular form.

To convert an xarray Dataset to a pandas DataFrame, all you need is to call the .to_dataframe() method of the Dataset:

gauge_pd = gauge_xr.to_dataframe()
display(gauge_pd)
sea_level lat lon station_name station_country
record_id time
2570 1913-01-19 06:00:00.000000 NaN 26.690001 281.016998 Settlement Point Bahamas (the)
1913-01-19 07:00:00.028800 NaN 26.690001 281.016998 Settlement Point Bahamas (the)
1913-01-19 07:59:59.971200 NaN 26.690001 281.016998 Settlement Point Bahamas (the)
1913-01-19 09:00:00.000000 NaN 26.690001 281.016998 Settlement Point Bahamas (the)
1913-01-19 10:00:00.028800 NaN 26.690001 281.016998 Settlement Point Bahamas (the)
... ... ... ... ... ... ...
2420 2024-09-30 19:00:00.028800 1744.0 24.552999 278.191986 Key West, FL United States of America (the)
2024-09-30 19:59:59.971200 1740.0 24.552999 278.191986 Key West, FL United States of America (the)
2024-09-30 21:00:00.000000 1801.0 24.552999 278.191986 Key West, FL United States of America (the)
2024-09-30 22:00:00.028800 1893.0 24.552999 278.191986 Key West, FL United States of America (the)
2024-09-30 22:59:59.971200 1978.0 24.552999 278.191986 Key West, FL United States of America (the)

3916584 rows × 5 columns

Notice that the coordinates (time and record_id) of the Dataset have become the (multi-)index (row labels) of the DataFrame. To convert the index back to regular columns, we apply the .reset_index() method:

gauge_pd = gauge_pd.reset_index()
display(gauge_pd)
record_id time sea_level lat lon station_name station_country
0 2570 1913-01-19 06:00:00.000000 NaN 26.690001 281.016998 Settlement Point Bahamas (the)
1 2570 1913-01-19 07:00:00.028800 NaN 26.690001 281.016998 Settlement Point Bahamas (the)
2 2570 1913-01-19 07:59:59.971200 NaN 26.690001 281.016998 Settlement Point Bahamas (the)
3 2570 1913-01-19 09:00:00.000000 NaN 26.690001 281.016998 Settlement Point Bahamas (the)
4 2570 1913-01-19 10:00:00.028800 NaN 26.690001 281.016998 Settlement Point Bahamas (the)
... ... ... ... ... ... ... ...
3916579 2420 2024-09-30 19:00:00.028800 1744.0 24.552999 278.191986 Key West, FL United States of America (the)
3916580 2420 2024-09-30 19:59:59.971200 1740.0 24.552999 278.191986 Key West, FL United States of America (the)
3916581 2420 2024-09-30 21:00:00.000000 1801.0 24.552999 278.191986 Key West, FL United States of America (the)
3916582 2420 2024-09-30 22:00:00.028800 1893.0 24.552999 278.191986 Key West, FL United States of America (the)
3916583 2420 2024-09-30 22:59:59.971200 1978.0 24.552999 278.191986 Key West, FL United States of America (the)

3916584 rows × 7 columns

We can now manipulate this DataFrame using the usual DataFrame methods, export the results as a csv, and so on.
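For instance, a per-station summary via groupby (sketched on a toy frame standing in for gauge_pd, with made-up sea levels, since the real file is not reproduced here):

```python
import pandas as pd

# stand-in for gauge_pd after reset_index(): one row per (station, time)
toy = pd.DataFrame({
    "record_id": [2570, 2570, 2420, 2420],
    "sea_level": [1000.0, 1010.0, 1740.0, 1744.0],
})

# mean sea level per station
mean_sl = toy.groupby("record_id")["sea_level"].mean()
print(mean_sl.loc[2420])  # 1742.0

# export, e.g.: mean_sl.to_csv("mean_sea_level.csv")
```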

Combine multiple xarray Datasets#

As in the case of csv files, sometimes you can only download a netCDF file for a subset of the data you want, and before your analysis you’ll need to combine multiple xarray Datasets into one. As an example, the three .nc files below contain ocean surface temperature from OISST in 2025 on Jan 15, Feb 15, and Mar 15.

oisst_20250115 = xr.open_dataset("Data/oisst-20250115.nc")
oisst_20250215 = xr.open_dataset("Data/oisst-20250215.nc")
oisst_20250315 = xr.open_dataset("Data/oisst-20250315.nc")
display(oisst_20250115)
<xarray.Dataset> Size: 17MB
Dimensions:  (time: 1, zlev: 1, lat: 720, lon: 1440)
Coordinates:
  * time     (time) datetime64[ns] 8B 2025-01-15T12:00:00
  * zlev     (zlev) float32 4B 0.0
  * lat      (lat) float32 3kB -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float32 6kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
Data variables:
    sst      (time, zlev, lat, lon) float32 4MB ...
    anom     (time, zlev, lat, lon) float32 4MB ...
    err      (time, zlev, lat, lon) float32 4MB ...
    ice      (time, zlev, lat, lon) float32 4MB ...
Attributes: (12/37)
    Conventions:                CF-1.6, ACDD-1.3
    title:                      NOAA/NCEI 1/4 Degree Daily Optimum Interpolat...
    references:                 Reynolds, et al.(2007) Daily High-Resolution-...
    source:                     ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfin...
    id:                         oisst-avhrr-v02r01.20250115.nc
    naming_authority:           gov.noaa.ncei
    ...                         ...
    time_coverage_start:        2025-01-15T00:00:00Z
    time_coverage_end:          2025-01-15T23:59:59Z
    metadata_link:              https://doi.org/10.25921/RE9P-PT57
    ncei_template_version:      NCEI_NetCDF_Grid_Template_v2.0
    comment:                    Data was converted from NetCDF-3 to NetCDF-4 ...
    sensor:                     Thermometer, AVHRR

We can merge these xarray Datasets into one using xr.concat(), passing dim="time" to tell xarray that we want to concatenate along the time dimension:

oisst_all = xr.concat([oisst_20250115, oisst_20250215, oisst_20250315], dim="time")
display(oisst_all)
<xarray.Dataset> Size: 50MB
Dimensions:  (time: 3, zlev: 1, lat: 720, lon: 1440)
Coordinates:
  * time     (time) datetime64[ns] 24B 2025-01-15T12:00:00 ... 2025-03-15T12:...
  * zlev     (zlev) float32 4B 0.0
  * lat      (lat) float32 3kB -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float32 6kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
Data variables:
    sst      (time, zlev, lat, lon) float32 12MB nan nan nan ... -1.8 -1.8 -1.8
    anom     (time, zlev, lat, lon) float32 12MB nan nan nan nan ... 0.0 0.0 0.0
    err      (time, zlev, lat, lon) float32 12MB nan nan nan nan ... 0.3 0.3 0.3
    ice      (time, zlev, lat, lon) float32 12MB nan nan nan ... 0.98 0.98 0.98
Attributes: (12/37)
    Conventions:                CF-1.6, ACDD-1.3
    title:                      NOAA/NCEI 1/4 Degree Daily Optimum Interpolat...
    references:                 Reynolds, et al.(2007) Daily High-Resolution-...
    source:                     ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfin...
    id:                         oisst-avhrr-v02r01.20250115.nc
    naming_authority:           gov.noaa.ncei
    ...                         ...
    time_coverage_start:        2025-01-15T00:00:00Z
    time_coverage_end:          2025-01-15T23:59:59Z
    metadata_link:              https://doi.org/10.25921/RE9P-PT57
    ncei_template_version:      NCEI_NetCDF_Grid_Template_v2.0
    comment:                    Data was converted from NetCDF-3 to NetCDF-4 ...
    sensor:                     Thermometer, AVHRR
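The same pattern can be checked on toy Datasets (made-up values, just to confirm that xr.concat stacks along the named dimension while leaving the others alone):

```python
import numpy as np
import xarray as xr

def month_snapshot(date):
    """A tiny one-time-step Dataset mimicking a single OISST file."""
    return xr.Dataset(
        {"sst": (("time", "lat"), np.zeros((1, 3)))},
        coords={"time": [np.datetime64(date)], "lat": [-10.0, 0.0, 10.0]},
    )

parts = [month_snapshot(d) for d in ("2025-01-15", "2025-02-15", "2025-03-15")]
combined = xr.concat(parts, dim="time")
print(combined.sizes["time"])  # 3
```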

Occasionally you’ll need to create an extra dimension so that your individual Datasets can be merged along this new dimension. To do so we can use the .expand_dims() method of both Dataset and DataArray (which creates the dimension), followed by the .assign_coords() method (which assigns coordinates to the new dimension).

Taking oisst_20250115 as an example, suppose we want to create a new dimension called month; we may do:

oisst_20250115.expand_dims(dim={"month": 1}).assign_coords({"month": [1]})
<xarray.Dataset> Size: 17MB
Dimensions:  (month: 1, time: 1, zlev: 1, lat: 720, lon: 1440)
Coordinates:
  * time     (time) datetime64[ns] 8B 2025-01-15T12:00:00
  * zlev     (zlev) float32 4B 0.0
  * lat      (lat) float32 3kB -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float32 6kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
  * month    (month) int32 4B 1
Data variables:
    sst      (month, time, zlev, lat, lon) float32 4MB nan nan nan ... -1.8 -1.8
    anom     (month, time, zlev, lat, lon) float32 4MB nan nan nan ... 0.0 0.0
    err      (month, time, zlev, lat, lon) float32 4MB nan nan nan ... 0.3 0.3
    ice      (month, time, zlev, lat, lon) float32 4MB nan nan nan ... 0.98 0.98
Attributes: (12/37)
    Conventions:                CF-1.6, ACDD-1.3
    title:                      NOAA/NCEI 1/4 Degree Daily Optimum Interpolat...
    references:                 Reynolds, et al.(2007) Daily High-Resolution-...
    source:                     ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfin...
    id:                         oisst-avhrr-v02r01.20250115.nc
    naming_authority:           gov.noaa.ncei
    ...                         ...
    time_coverage_start:        2025-01-15T00:00:00Z
    time_coverage_end:          2025-01-15T23:59:59Z
    metadata_link:              https://doi.org/10.25921/RE9P-PT57
    ncei_template_version:      NCEI_NetCDF_Grid_Template_v2.0
    comment:                    Data was converted from NetCDF-3 to NetCDF-4 ...
    sensor:                     Thermometer, AVHRR
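As a shortcut, .expand_dims() also accepts a sequence of coordinate values directly, which creates the dimension and assigns its coordinates in one step. A sketch on a toy Dataset (so it runs without the OISST file):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"sst": (("lat",), np.zeros(2))}, coords={"lat": [0.0, 10.0]})

# one step: a new "month" dimension of length 1 with coordinate value 1
ds_month = ds.expand_dims(month=[1])
print(ds_month.sizes)            # now includes month: 1
print(ds_month["month"].values)  # [1]
```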

Export xarray Datasets as netCDF file#

As in the case of pandas DataFrames, sometimes we want to save the Dataset obtained after some manipulations into a new netCDF file. We can do so using the .to_netcdf() method of Dataset and DataArray. Importantly, you may want to make sure that each data variable is compressed so that your file does not become exceedingly large. The way to specify compression is to supply a nested dictionary to the encoding argument, where each data variable is a key and the corresponding value is itself a dictionary of compression options. As an example, to save the oisst_all Dataset we created earlier:

# NOTE: the output folder has to already exist
oisst_all.to_netcdf("output/oisst_all.nc", encoding = {
    "sst": {"zlib": True, "complevel": 9},
    "anom": {"zlib": True, "complevel": 9},
    "err": {"zlib": True, "complevel": 9},
    "ice": {"zlib": True, "complevel": 9}
})
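When a Dataset has many data variables, the encoding dictionary can be built with a comprehension over .data_vars instead of being spelled out by hand (a sketch on a toy Dataset; the compression settings mirror the call above):

```python
import numpy as np
import xarray as xr

# a toy Dataset standing in for oisst_all
ds = xr.Dataset({
    "sst": (("lat",), np.zeros(4)),
    "anom": (("lat",), np.zeros(4)),
})

# same zlib settings for every data variable
encoding = {var: {"zlib": True, "complevel": 9} for var in ds.data_vars}
print(sorted(encoding))  # ['anom', 'sst']

# ds.to_netcdf("output/toy.nc", encoding=encoding)  # output/ must already exist
```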