DataFrame operations#

# initialization
import numpy as np
import pandas as pd

Inspecting a DataFrame#

Let’s load the CalCOFI dataset again and inspect it further:

CalCOFI = pd.read_csv("data/CalCOFI_subset.csv")
display(CalCOFI)
Cast_Count Station_ID Datetime Depth_m T_degC Salinity SigmaTheta
0 992 090.0 070.0 1950-02-06 19:54:00 0 14.040 33.1700 24.76600
1 992 090.0 070.0 1950-02-06 19:54:00 10 13.950 33.2100 24.81500
2 992 090.0 070.0 1950-02-06 19:54:00 20 13.900 33.2100 24.82600
3 992 090.0 070.0 1950-02-06 19:54:00 23 13.880 33.2100 24.83000
4 992 090.0 070.0 1950-02-06 19:54:00 30 13.810 33.2180 24.85100
... ... ... ... ... ... ... ...
10052 35578 090.0 070.0 2021-01-21 13:36:00 300 7.692 34.1712 26.67697
10053 35578 090.0 070.0 2021-01-21 13:36:00 381 7.144 34.2443 26.81386
10054 35578 090.0 070.0 2021-01-21 13:36:00 400 7.031 34.2746 26.85372
10055 35578 090.0 070.0 2021-01-21 13:36:00 500 6.293 34.3126 26.98372
10056 35578 090.0 070.0 2021-01-21 13:36:00 515 6.155 34.2903 26.98398

10057 rows × 7 columns

While display() gives us a good glimpse of the data, the datatype of each column is not exactly clear. A good way to examine the data type is to use the .info() method of the DataFrame:

CalCOFI.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10057 entries, 0 to 10056
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Cast_Count  10057 non-null  int64  
 1   Station_ID  10057 non-null  object 
 2   Datetime    10057 non-null  object 
 3   Depth_m     10057 non-null  int64  
 4   T_degC      10026 non-null  float64
 5   Salinity    10057 non-null  float64
 6   SigmaTheta  10008 non-null  float64
dtypes: float64(3), int64(2), object(2)
memory usage: 550.1+ KB

While it may not be obvious, columns with datatype (“Dtype”) object are most often (but not always!) string columns. So we have 3 columns of floats, 2 columns of integers, and 2 columns of strings.

Just like numpy arrays, a pandas DataFrame has the .size, .shape, and .ndim attributes:

CalCOFI.size # total number of entries
70399
CalCOFI.shape # number of rows and columns
(10057, 7)
CalCOFI.ndim # number of dimensions
2
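These three attributes are related: .size always equals the product of the entries of .shape, and .ndim is always 2 for a DataFrame. A minimal sketch with a small, made-up DataFrame (not the actual CalCOFI data):

```python
import pandas as pd

# A tiny stand-in DataFrame; column names chosen to mimic CalCOFI
toy = pd.DataFrame({"Depth_m": [0, 10, 20], "T_degC": [14.0, 13.9, 13.8]})

print(toy.shape)  # (3, 2): 3 rows, 2 columns
print(toy.size)   # 6: equals shape[0] * shape[1]
print(toy.ndim)   # 2: a DataFrame is always two-dimensional
```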

Extracting parts of a DataFrame#

The column names of a pandas DataFrame can be extracted using the .columns attribute, and its row index can be extracted using the .index attribute.

CalCOFI.columns # column labels
Index(['Cast_Count', 'Station_ID', 'Datetime', 'Depth_m', 'T_degC', 'Salinity',
       'SigmaTheta'],
      dtype='object')
CalCOFI.index # row labels
RangeIndex(start=0, stop=10057, step=1)

To extract the pandas Series that corresponds to a specific column, we can use the square bracket [] syntax. For example:

CalCOFI["Depth_m"]
0          0
1         10
2         20
3         23
4         30
        ... 
10052    300
10053    381
10054    400
10055    500
10056    515
Name: Depth_m, Length: 10057, dtype: int64

Furthermore, we can extract the data contained within the pandas Series using the .values attribute:

CalCOFI["Depth_m"].values
array([  0,  10,  20, ..., 400, 500, 515], dtype=int64)

Similarly, we can extract the data contained in column labels and row labels of a DataFrame using the .values attribute:

CalCOFI.columns.values
array(['Cast_Count', 'Station_ID', 'Datetime', 'Depth_m', 'T_degC',
       'Salinity', 'SigmaTheta'], dtype=object)
CalCOFI.index.values
array([    0,     1,     2, ..., 10054, 10055, 10056], dtype=int64)
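Since .values returns a plain numpy array, the result can be fed directly into any numpy function. A small sketch with a made-up DataFrame (not the actual CalCOFI data):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the CalCOFI data
toy = pd.DataFrame({"Depth_m": [0, 10, 20], "T_degC": [14.0, 13.9, 13.8]})

depths = toy["Depth_m"].values  # plain numpy array
print(type(depths))             # <class 'numpy.ndarray'>
print(np.max(depths))           # 20 -- any numpy function applies
```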

Subsetting a DataFrame#

There are two main methods to subset a DataFrame in pandas. The first is .iloc[], which subsets a DataFrame by positional row and column index. For example, if we want to extract rows 1 (positional index 0) to 10 (positional index 9), we’ll do (note that like other slicing operations we’ve seen thus far, .iloc[] is endpoint exclusive):

CalCOFI.iloc[0:10]
Cast_Count Station_ID Datetime Depth_m T_degC Salinity SigmaTheta
0 992 090.0 070.0 1950-02-06 19:54:00 0 14.04 33.170 24.766
1 992 090.0 070.0 1950-02-06 19:54:00 10 13.95 33.210 24.815
2 992 090.0 070.0 1950-02-06 19:54:00 20 13.90 33.210 24.826
3 992 090.0 070.0 1950-02-06 19:54:00 23 13.88 33.210 24.830
4 992 090.0 070.0 1950-02-06 19:54:00 30 13.81 33.218 24.851
5 992 090.0 070.0 1950-02-06 19:54:00 50 13.25 33.150 24.912
6 992 090.0 070.0 1950-02-06 19:54:00 72 11.83 33.130 25.171
7 992 090.0 070.0 1950-02-06 19:54:00 75 11.70 33.142 25.204
8 992 090.0 070.0 1950-02-06 19:54:00 99 10.86 33.280 25.464
9 992 090.0 070.0 1950-02-06 19:54:00 100 10.83 33.288 25.475

If, in addition, we need only the last 4 columns of the DataFrame, and we want their order reversed, we may do:

CalCOFI.iloc[0:10, -1:-5:-1]
SigmaTheta Salinity T_degC Depth_m
0 24.766 33.170 14.04 0
1 24.815 33.210 13.95 10
2 24.826 33.210 13.90 20
3 24.830 33.210 13.88 23
4 24.851 33.218 13.81 30
5 24.912 33.150 13.25 50
6 25.171 33.130 11.83 72
7 25.204 33.142 11.70 75
8 25.464 33.280 10.86 99
9 25.475 33.288 10.83 100

And if we want the last 4 columns of the DataFrame but retain all rows, we can do:

CalCOFI.iloc[:, -4:]
Depth_m T_degC Salinity SigmaTheta
0 0 14.040 33.1700 24.76600
1 10 13.950 33.2100 24.81500
2 20 13.900 33.2100 24.82600
3 23 13.880 33.2100 24.83000
4 30 13.810 33.2180 24.85100
... ... ... ... ...
10052 300 7.692 34.1712 26.67697
10053 381 7.144 34.2443 26.81386
10054 400 7.031 34.2746 26.85372
10055 500 6.293 34.3126 26.98372
10056 515 6.155 34.2903 26.98398

10057 rows × 4 columns
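.iloc[] supports the full numpy slicing syntax, including negative steps. A quick sketch on a toy DataFrame (column names made up):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]})

first_two = toy.iloc[0:2]          # rows at positions 0 and 1 (endpoint exclusive)
reversed_cols = toy.iloc[:, ::-1]  # all rows, columns in reverse order
print(list(reversed_cols.columns))  # ['c', 'b', 'a']
```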

Alternatively, since a pandas DataFrame also stores row and column labels, these can be used to subset the DataFrame. To do so, we use the method .loc[] instead of .iloc[]. For example, suppose we want to retain only the datetime, depth, and temperature information of the DataFrame; we may do:

CalCOFI.loc[:, ["Datetime", "Depth_m", "T_degC"]]
Datetime Depth_m T_degC
0 1950-02-06 19:54:00 0 14.040
1 1950-02-06 19:54:00 10 13.950
2 1950-02-06 19:54:00 20 13.900
3 1950-02-06 19:54:00 23 13.880
4 1950-02-06 19:54:00 30 13.810
... ... ... ...
10052 2021-01-21 13:36:00 300 7.692
10053 2021-01-21 13:36:00 381 7.144
10054 2021-01-21 13:36:00 400 7.031
10055 2021-01-21 13:36:00 500 6.293
10056 2021-01-21 13:36:00 515 6.155

10057 rows × 3 columns

Alternatively, noting that “Datetime”, “Depth_m”, and “T_degC” are consecutive columns, we can also use the slicing syntax. Note that unlike .iloc[], slices in .loc[] are endpoint inclusive:

CalCOFI.loc[:, "Datetime":"T_degC"]
Datetime Depth_m T_degC
0 1950-02-06 19:54:00 0 14.040
1 1950-02-06 19:54:00 10 13.950
2 1950-02-06 19:54:00 20 13.900
3 1950-02-06 19:54:00 23 13.880
4 1950-02-06 19:54:00 30 13.810
... ... ... ...
10052 2021-01-21 13:36:00 300 7.692
10053 2021-01-21 13:36:00 381 7.144
10054 2021-01-21 13:36:00 400 7.031
10055 2021-01-21 13:36:00 500 6.293
10056 2021-01-21 13:36:00 515 6.155

10057 rows × 3 columns
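The inclusive/exclusive difference between .loc[] and .iloc[] is worth verifying side by side on a toy DataFrame (column names made up):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": [4]})

by_label = toy.loc[:, "a":"c"]    # label slice: endpoint INCLUSIVE -> a, b, c
by_position = toy.iloc[:, 0:2]    # positional slice: endpoint EXCLUSIVE -> a, b
print(list(by_label.columns))     # ['a', 'b', 'c']
print(list(by_position.columns))  # ['a', 'b']
```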

For subsetting rows, in addition to accepting row label lists and slices (which are not particularly useful when the row labels are dummies), .loc[] also accepts boolean arrays. For example, to extract all observations (rows) at depth = 10 m, we can do:

CalCOFI.loc[CalCOFI["Depth_m"] == 10]
Cast_Count Station_ID Datetime Depth_m T_degC Salinity SigmaTheta
1 992 090.0 070.0 1950-02-06 19:54:00 10 13.950 33.2100 24.81500
28 1267 090.0 070.0 1950-04-08 17:30:00 10 14.980 33.4000 24.74400
62 1387 090.0 070.0 1950-05-07 17:36:00 10 14.170 33.3080 24.84500
98 1477 090.0 070.0 1950-06-13 00:42:00 10 14.510 33.1110 24.62200
130 1589 090.0 070.0 1950-07-13 00:54:00 10 16.550 33.1020 24.16400
... ... ... ... ... ... ... ...
9913 35250 090.0 070.0 2019-11-09 01:24:00 10 16.983 33.3950 24.28840
9944 35354 090.0 070.0 2020-01-09 18:13:00 10 15.054 33.3427 24.68417
9978 35429 090.0 070.0 2020-07-16 23:40:00 10 17.114 33.6188 24.42921
10015 35500 090.0 070.0 2020-10-14 23:52:00 10 17.373 33.3346 24.14988
10037 35578 090.0 070.0 2021-01-21 13:36:00 10 15.092 33.4561 24.76328

348 rows × 7 columns
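Boolean masks can also be combined with & (and) and | (or); each condition must be wrapped in parentheses because of Python’s operator precedence. A sketch on a small, made-up stand-in for CalCOFI:

```python
import pandas as pd

# Made-up data with the same column names as CalCOFI
toy = pd.DataFrame({"Depth_m": [0, 10, 10, 300], "T_degC": [14.0, 13.9, 16.5, 7.7]})

# Combine conditions with & / |; the parentheses are required
warm_shallow = toy.loc[(toy["Depth_m"] <= 10) & (toy["T_degC"] > 13.95)]
print(len(warm_shallow))  # 2 rows match
```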

Note that pandas has a special function pd.isna() that performs a vectorized check on whether values are coded as missing (e.g., None or np.nan). So, to extract all observations (rows) for which the temperature measurement is missing, we may do:

CalCOFI.loc[pd.isna(CalCOFI["T_degC"])]
Cast_Count Station_ID Datetime Depth_m T_degC Salinity SigmaTheta
269 3097 090.0 070.0 1951-07-05 12:06:00 0 NaN 33.300 NaN
273 3097 090.0 070.0 1951-07-05 12:06:00 27 NaN 33.300 NaN
276 3097 090.0 070.0 1951-07-05 12:06:00 54 NaN 33.310 NaN
278 3097 090.0 070.0 1951-07-05 12:06:00 80 NaN 33.140 NaN
311 3484 090.0 070.0 1951-10-01 02:36:00 75 NaN 33.210 NaN
369 3873 090.0 070.0 1952-02-18 09:24:00 0 NaN 33.150 NaN
372 3873 090.0 070.0 1952-02-18 09:24:00 19 NaN 33.070 NaN
375 3873 090.0 070.0 1952-02-18 09:24:00 39 NaN 33.130 NaN
377 3873 090.0 070.0 1952-02-18 09:24:00 57 NaN 33.120 NaN
384 4037 090.0 070.0 1952-03-09 20:54:00 0 NaN 33.190 NaN
387 4037 090.0 070.0 1952-03-09 20:54:00 16 NaN 33.220 NaN
390 4037 090.0 070.0 1952-03-09 20:54:00 35 NaN 33.210 NaN
391 4037 090.0 070.0 1952-03-09 20:54:00 50 NaN 33.210 NaN
414 4136 090.0 070.0 1952-04-03 06:30:00 0 NaN 33.300 NaN
418 4136 090.0 070.0 1952-04-03 06:30:00 21 NaN 33.310 NaN
420 4136 090.0 070.0 1952-04-03 06:30:00 44 NaN 33.330 NaN
422 4136 090.0 070.0 1952-04-03 06:30:00 63 NaN 33.310 NaN
424 4136 090.0 070.0 1952-04-03 06:30:00 84 NaN 33.220 NaN
426 4136 090.0 070.0 1952-04-03 06:30:00 115 NaN 33.170 NaN
436 4136 090.0 070.0 1952-04-03 06:30:00 323 NaN 34.150 NaN
449 4359 090.0 070.0 1952-05-11 16:12:00 19 NaN 33.150 NaN
452 4359 090.0 070.0 1952-05-11 16:12:00 45 NaN 33.100 NaN
454 4359 090.0 070.0 1952-05-11 16:12:00 66 NaN 33.080 NaN
456 4359 090.0 070.0 1952-05-11 16:12:00 90 NaN 33.080 NaN
463 4359 090.0 070.0 1952-05-11 16:12:00 217 NaN 33.910 NaN
485 4524 090.0 070.0 1952-06-05 22:48:00 50 NaN 32.950 NaN
1427 8006 090.0 070.0 1955-12-05 12:24:00 57 NaN 33.420 NaN
1430 8006 090.0 070.0 1955-12-05 12:24:00 76 NaN 33.350 NaN
1587 9017 090.0 070.0 1957-07-14 08:42:00 1025 NaN 34.470 NaN
1609 9018 090.0 070.0 1957-07-14 10:42:00 63 NaN 33.300 NaN
5244 24355 090.0 070.0 1986-02-14 15:49:00 11 NaN 33.301 NaN

If you want to drop all rows for which some columns have missing values, you can use the .dropna() method of the DataFrame:

CalCOFI.dropna()
Cast_Count Station_ID Datetime Depth_m T_degC Salinity SigmaTheta
0 992 090.0 070.0 1950-02-06 19:54:00 0 14.040 33.1700 24.76600
1 992 090.0 070.0 1950-02-06 19:54:00 10 13.950 33.2100 24.81500
2 992 090.0 070.0 1950-02-06 19:54:00 20 13.900 33.2100 24.82600
3 992 090.0 070.0 1950-02-06 19:54:00 23 13.880 33.2100 24.83000
4 992 090.0 070.0 1950-02-06 19:54:00 30 13.810 33.2180 24.85100
... ... ... ... ... ... ... ...
10052 35578 090.0 070.0 2021-01-21 13:36:00 300 7.692 34.1712 26.67697
10053 35578 090.0 070.0 2021-01-21 13:36:00 381 7.144 34.2443 26.81386
10054 35578 090.0 070.0 2021-01-21 13:36:00 400 7.031 34.2746 26.85372
10055 35578 090.0 070.0 2021-01-21 13:36:00 500 6.293 34.3126 26.98372
10056 35578 090.0 070.0 2021-01-21 13:36:00 515 6.155 34.2903 26.98398

10008 rows × 7 columns

And if you want .dropna() to consider only certain columns when deciding if a row should be dropped, use its subset argument:

# no data is dropped since the Depth_m column has no missing values
CalCOFI.dropna(subset=["Depth_m"])
Cast_Count Station_ID Datetime Depth_m T_degC Salinity SigmaTheta
0 992 090.0 070.0 1950-02-06 19:54:00 0 14.040 33.1700 24.76600
1 992 090.0 070.0 1950-02-06 19:54:00 10 13.950 33.2100 24.81500
2 992 090.0 070.0 1950-02-06 19:54:00 20 13.900 33.2100 24.82600
3 992 090.0 070.0 1950-02-06 19:54:00 23 13.880 33.2100 24.83000
4 992 090.0 070.0 1950-02-06 19:54:00 30 13.810 33.2180 24.85100
... ... ... ... ... ... ... ...
10052 35578 090.0 070.0 2021-01-21 13:36:00 300 7.692 34.1712 26.67697
10053 35578 090.0 070.0 2021-01-21 13:36:00 381 7.144 34.2443 26.81386
10054 35578 090.0 070.0 2021-01-21 13:36:00 400 7.031 34.2746 26.85372
10055 35578 090.0 070.0 2021-01-21 13:36:00 500 6.293 34.3126 26.98372
10056 35578 090.0 070.0 2021-01-21 13:36:00 515 6.155 34.2903 26.98398

10057 rows × 7 columns
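.dropna() also accepts a how argument: with how="all", a row is dropped only when all of the considered columns are missing. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Depth_m": [0, 10, 20],
    "T_degC": [14.0, np.nan, np.nan],
    "Salinity": [33.1, 33.2, np.nan],
})

print(len(toy.dropna()))                     # 1: any missing value drops the row
print(len(toy.dropna(subset=["Salinity"])))  # 2: only Salinity is checked
print(len(toy.dropna(how="all", subset=["T_degC", "Salinity"])))  # 2: drop only if BOTH are missing
```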

Sorting the rows of a DataFrame#

We can use the .sort_values() method to sort a pandas DataFrame. For example, to sort the rows of CalCOFI by temperature from high to low, we can do:

CalCOFI_sorted = CalCOFI.sort_values("T_degC", ascending=False)
display(CalCOFI_sorted)
Cast_Count Station_ID Datetime Depth_m T_degC Salinity SigmaTheta
4276 22468 090.0 070.0 1983-09-27 15:15:00 10 20.59 33.532 23.488
4275 22468 090.0 070.0 1983-09-27 15:15:00 0 20.59 33.533 23.489
4277 22468 090.0 070.0 1983-09-27 15:15:00 20 20.58 33.519 23.482
6809 28613 090.0 070.0 1997-09-25 11:46:00 15 20.21 33.635 23.668
6808 28613 090.0 070.0 1997-09-25 11:46:00 10 20.20 33.634 23.667
... ... ... ... ... ... ... ...
1427 8006 090.0 070.0 1955-12-05 12:24:00 57 NaN 33.420 NaN
1430 8006 090.0 070.0 1955-12-05 12:24:00 76 NaN 33.350 NaN
1587 9017 090.0 070.0 1957-07-14 08:42:00 1025 NaN 34.470 NaN
1609 9018 090.0 070.0 1957-07-14 10:42:00 63 NaN 33.300 NaN
5244 24355 090.0 070.0 1986-02-14 15:49:00 11 NaN 33.301 NaN

10057 rows × 7 columns

Note that missing values are sorted to the end. Also note that the index of the DataFrame is rearranged too. If we want to rebuild the index (so that it starts from 0 and increments by 1 in the sorted DataFrame), we can use the .reset_index() method with the drop=True argument (which removes the old index instead of retaining it as a regular column):

CalCOFI_sorted.reset_index(drop=True)
Cast_Count Station_ID Datetime Depth_m T_degC Salinity SigmaTheta
0 22468 090.0 070.0 1983-09-27 15:15:00 10 20.59 33.532 23.488
1 22468 090.0 070.0 1983-09-27 15:15:00 0 20.59 33.533 23.489
2 22468 090.0 070.0 1983-09-27 15:15:00 20 20.58 33.519 23.482
3 28613 090.0 070.0 1997-09-25 11:46:00 15 20.21 33.635 23.668
4 28613 090.0 070.0 1997-09-25 11:46:00 10 20.20 33.634 23.667
... ... ... ... ... ... ... ...
10052 8006 090.0 070.0 1955-12-05 12:24:00 57 NaN 33.420 NaN
10053 8006 090.0 070.0 1955-12-05 12:24:00 76 NaN 33.350 NaN
10054 9017 090.0 070.0 1957-07-14 08:42:00 1025 NaN 34.470 NaN
10055 9018 090.0 070.0 1957-07-14 10:42:00 63 NaN 33.300 NaN
10056 24355 090.0 070.0 1986-02-14 15:49:00 11 NaN 33.301 NaN

10057 rows × 7 columns
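.sort_values() also accepts a list of columns, with a matching list of ascending flags, and chains naturally with .reset_index(). A sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({"Depth_m": [10, 0, 10, 0], "T_degC": [13.9, 14.0, 16.5, 12.0]})

# Sort by depth ascending, then temperature descending, and rebuild the index
out = toy.sort_values(["Depth_m", "T_degC"], ascending=[True, False]).reset_index(drop=True)
print(out["T_degC"].tolist())  # [14.0, 12.0, 16.5, 13.9]
```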

Modifying a DataFrame#

The subsetting syntax can also be used to overwrite data in a DataFrame. For example, suppose we want to set the SigmaTheta value to missing (np.nan) whenever the depth exceeds 300 m; we can do:

CalCOFI.loc[CalCOFI["Depth_m"] > 300, "SigmaTheta"] = np.nan
display(CalCOFI)
Cast_Count Station_ID Datetime Depth_m T_degC Salinity SigmaTheta
0 992 090.0 070.0 1950-02-06 19:54:00 0 14.040 33.1700 24.76600
1 992 090.0 070.0 1950-02-06 19:54:00 10 13.950 33.2100 24.81500
2 992 090.0 070.0 1950-02-06 19:54:00 20 13.900 33.2100 24.82600
3 992 090.0 070.0 1950-02-06 19:54:00 23 13.880 33.2100 24.83000
4 992 090.0 070.0 1950-02-06 19:54:00 30 13.810 33.2180 24.85100
... ... ... ... ... ... ... ...
10052 35578 090.0 070.0 2021-01-21 13:36:00 300 7.692 34.1712 26.67697
10053 35578 090.0 070.0 2021-01-21 13:36:00 381 7.144 34.2443 NaN
10054 35578 090.0 070.0 2021-01-21 13:36:00 400 7.031 34.2746 NaN
10055 35578 090.0 070.0 2021-01-21 13:36:00 500 6.293 34.3126 NaN
10056 35578 090.0 070.0 2021-01-21 13:36:00 515 6.155 34.2903 NaN

10057 rows × 7 columns

Note that unlike most pandas operations, which return a new DataFrame, this assignment modifies CalCOFI in place.
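If the original DataFrame should stay untouched, one option is to modify an explicit .copy() instead; a sketch on made-up data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Depth_m": [100, 400], "SigmaTheta": [25.0, 27.0]})

# Work on an explicit copy so the original stays intact
modified = toy.copy()
modified.loc[modified["Depth_m"] > 300, "SigmaTheta"] = np.nan

print(toy["SigmaTheta"].tolist())           # [25.0, 27.0] -- original unchanged
print(modified["SigmaTheta"].isna().sum())  # 1
```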

More often, we would like to derive a new column using data from existing columns. Just as with numpy arrays, we can do vectorized computations on a pandas Series (recall that each column of a pandas DataFrame is a pandas Series). Moreover, we can assign the result to a new column of a DataFrame. For example, suppose we want to create a log_Depth_m column and a T_degF column. We can do so with:

CalCOFI["log_Depth_m"] = np.log(CalCOFI["Depth_m"] + 1)
CalCOFI["T_degF"] = CalCOFI["T_degC"] * (9/5) + 32

display(CalCOFI)
Cast_Count Station_ID Datetime Depth_m T_degC Salinity SigmaTheta log_Depth_m T_degF
0 992 090.0 070.0 1950-02-06 19:54:00 0 14.040 33.1700 24.76600 0.000000 57.2720
1 992 090.0 070.0 1950-02-06 19:54:00 10 13.950 33.2100 24.81500 2.397895 57.1100
2 992 090.0 070.0 1950-02-06 19:54:00 20 13.900 33.2100 24.82600 3.044522 57.0200
3 992 090.0 070.0 1950-02-06 19:54:00 23 13.880 33.2100 24.83000 3.178054 56.9840
4 992 090.0 070.0 1950-02-06 19:54:00 30 13.810 33.2180 24.85100 3.433987 56.8580
... ... ... ... ... ... ... ... ... ...
10052 35578 090.0 070.0 2021-01-21 13:36:00 300 7.692 34.1712 26.67697 5.707110 45.8456
10053 35578 090.0 070.0 2021-01-21 13:36:00 381 7.144 34.2443 NaN 5.945421 44.8592
10054 35578 090.0 070.0 2021-01-21 13:36:00 400 7.031 34.2746 NaN 5.993961 44.6558
10055 35578 090.0 070.0 2021-01-21 13:36:00 500 6.293 34.3126 NaN 6.216606 43.3274
10056 35578 090.0 070.0 2021-01-21 13:36:00 515 6.155 34.2903 NaN 6.246107 43.0790

10057 rows × 9 columns

Notice that the new columns are inserted to the right of the existing columns.
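Derived columns can also be conditional: np.where() acts as a vectorized if/else. A sketch on made-up data (the zone labels are purely illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Depth_m": [5, 50, 500]})

# A conditional derived column via np.where (vectorized if/else)
toy["zone"] = np.where(toy["Depth_m"] <= 200, "epipelagic", "deeper")
print(toy["zone"].tolist())  # ['epipelagic', 'epipelagic', 'deeper']
```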

Remark: dummy versus semantic row index#

In the examples above, the row index is really just an arbitrary number that has no meaning in and of itself, and our row operations have been based on the actual data contained in the columns. For most data, this setup is a good default.

However, sometimes your tabular data has a special column that can serve as a unique identifier of each row. In other words, the value of this column uniquely identifies the row of data it comes from. A good example is time series data, where the timestamp uniquely identifies a row of observations. In such cases, it can be beneficial to set this specific column as the index of the DataFrame.

Our CalCOFI subset has no natural unique identifier. However, suppose we are interested only in observations at depth = 10 m. Then, the Cast_Count column uniquely identifies an observation.

CalCOFI = pd.read_csv("data/CalCOFI_subset.csv")
CalCOFI_10m = CalCOFI.loc[CalCOFI["Depth_m"] == 10]
display(CalCOFI_10m)
Cast_Count Station_ID Datetime Depth_m T_degC Salinity SigmaTheta
1 992 090.0 070.0 1950-02-06 19:54:00 10 13.950 33.2100 24.81500
28 1267 090.0 070.0 1950-04-08 17:30:00 10 14.980 33.4000 24.74400
62 1387 090.0 070.0 1950-05-07 17:36:00 10 14.170 33.3080 24.84500
98 1477 090.0 070.0 1950-06-13 00:42:00 10 14.510 33.1110 24.62200
130 1589 090.0 070.0 1950-07-13 00:54:00 10 16.550 33.1020 24.16400
... ... ... ... ... ... ... ...
9913 35250 090.0 070.0 2019-11-09 01:24:00 10 16.983 33.3950 24.28840
9944 35354 090.0 070.0 2020-01-09 18:13:00 10 15.054 33.3427 24.68417
9978 35429 090.0 070.0 2020-07-16 23:40:00 10 17.114 33.6188 24.42921
10015 35500 090.0 070.0 2020-10-14 23:52:00 10 17.373 33.3346 24.14988
10037 35578 090.0 070.0 2021-01-21 13:36:00 10 15.092 33.4561 24.76328

348 rows × 7 columns

Thus, we may set the Cast_Count column as the row index of this subsetted DataFrame. To do so, we use the .set_index() method of the DataFrame:

CalCOFI_10m = CalCOFI_10m.set_index("Cast_Count")
display(CalCOFI_10m)
Station_ID Datetime Depth_m T_degC Salinity SigmaTheta
Cast_Count
992 090.0 070.0 1950-02-06 19:54:00 10 13.950 33.2100 24.81500
1267 090.0 070.0 1950-04-08 17:30:00 10 14.980 33.4000 24.74400
1387 090.0 070.0 1950-05-07 17:36:00 10 14.170 33.3080 24.84500
1477 090.0 070.0 1950-06-13 00:42:00 10 14.510 33.1110 24.62200
1589 090.0 070.0 1950-07-13 00:54:00 10 16.550 33.1020 24.16400
... ... ... ... ... ... ...
35250 090.0 070.0 2019-11-09 01:24:00 10 16.983 33.3950 24.28840
35354 090.0 070.0 2020-01-09 18:13:00 10 15.054 33.3427 24.68417
35429 090.0 070.0 2020-07-16 23:40:00 10 17.114 33.6188 24.42921
35500 090.0 070.0 2020-10-14 23:52:00 10 17.373 33.3346 24.14988
35578 090.0 070.0 2021-01-21 13:36:00 10 15.092 33.4561 24.76328

348 rows × 6 columns

With a semantic row index such as Cast_Count, we can use .loc[] to meaningfully subset our data, e.g.,

CalCOFI_10m_mid = CalCOFI_10m.loc[8000:10000]
display(CalCOFI_10m_mid)
Station_ID Datetime Depth_m T_degC Salinity SigmaTheta
Cast_Count
8006 090.0 070.0 1955-12-05 12:24:00 10 13.44 33.400 25.066
8141 090.0 070.0 1956-02-08 08:24:00 10 12.79 33.270 25.095
8330 090.0 070.0 1956-04-19 21:18:00 10 12.53 33.330 25.192
8437 090.0 070.0 1956-06-09 14:24:00 10 13.68 33.220 24.879
8755 090.0 070.0 1957-04-15 19:00:00 10 13.98 33.440 24.987
9018 090.0 070.0 1957-07-14 10:42:00 10 17.95 33.417 24.074
9203 090.0 070.0 1957-10-29 06:54:00 10 18.97 33.710 24.045
9327 090.0 070.0 1957-12-17 14:00:00 10 16.45 33.440 24.446
9527 090.0 070.0 1958-03-12 02:48:00 10 15.38 33.580 24.795
9993 090.0 070.0 1958-09-11 20:24:00 10 18.82 33.275 23.750

Notice that even though there is no row with Cast_Count exactly equal to 8000 or 10000, pandas knows how to use these values as bounds, because our index is sorted (or equivalently, monotone). Moreover, recall that .loc[] is endpoint inclusive. Thus, knowing our result, the above is equivalent to:

CalCOFI_10m_mid = CalCOFI_10m.loc[8006:9993]
display(CalCOFI_10m_mid)
Station_ID Datetime Depth_m T_degC Salinity SigmaTheta
Cast_Count
8006 090.0 070.0 1955-12-05 12:24:00 10 13.44 33.400 25.066
8141 090.0 070.0 1956-02-08 08:24:00 10 12.79 33.270 25.095
8330 090.0 070.0 1956-04-19 21:18:00 10 12.53 33.330 25.192
8437 090.0 070.0 1956-06-09 14:24:00 10 13.68 33.220 24.879
8755 090.0 070.0 1957-04-15 19:00:00 10 13.98 33.440 24.987
9018 090.0 070.0 1957-07-14 10:42:00 10 17.95 33.417 24.074
9203 090.0 070.0 1957-10-29 06:54:00 10 18.97 33.710 24.045
9327 090.0 070.0 1957-12-17 14:00:00 10 16.45 33.440 24.446
9527 090.0 070.0 1958-03-12 02:48:00 10 15.38 33.580 24.795
9993 090.0 070.0 1958-09-11 20:24:00 10 18.82 33.275 23.750
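The label-slicing behavior of a sorted semantic index can be reproduced on a toy DataFrame (index values made up to resemble cast counts):

```python
import pandas as pd

toy = pd.DataFrame(
    {"T_degC": [13.4, 12.8, 18.0, 18.8]},
    index=pd.Index([8006, 8141, 9018, 9993], name="Cast_Count"),
)

# With a sorted semantic index, .loc slices by label (endpoint inclusive);
# the bounds need not be actual index values
mid = toy.loc[8100:9500]
print(mid.index.tolist())  # [8141, 9018]
```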

As it turns out, pandas allows the index to be non-unique and non-monotone. However, some row operations may no longer work, or may behave unexpectedly, in such cases.

If the index of your DataFrame is not monotone, you can sort the data along the index using the .sort_index() method, which works similarly to the .sort_values() method. For example:

CalCOFI_sal = CalCOFI.set_index("Salinity").sort_index()
display(CalCOFI_sal)
Cast_Count Station_ID Datetime Depth_m T_degC SigmaTheta
Salinity
32.624 30396 090.0 070.0 2003-10-25 05:31:00 59 12.07 24.733
32.641 30396 090.0 070.0 2003-10-25 05:31:00 70 11.42 24.866
32.653 30396 090.0 070.0 2003-10-25 05:31:00 75 11.15 24.924
32.674 30396 090.0 070.0 2003-10-25 05:31:00 50 12.85 24.623
32.693 30396 090.0 070.0 2003-10-25 05:31:00 85 10.66 25.041
... ... ... ... ... ... ...
34.677 13987 090.0 070.0 1964-04-18 02:18:00 3600 1.57 27.766
34.679 13987 090.0 070.0 1964-04-18 02:18:00 3689 1.57 27.769
34.680 13987 090.0 070.0 1964-04-18 02:18:00 3789 1.58 27.769
34.690 2378 090.0 070.0 1951-02-20 03:00:00 2366 1.86 27.747
34.720 6634 090.0 070.0 1954-02-10 14:18:00 1753 2.34 27.730

10057 rows × 6 columns

Note that even after sorting, Salinity is still not a good row index, since it is not unique.

Statistics functions#

If you want to obtain summary statistics of a DataFrame, pandas has a convenient .describe() method that computes a bunch of them for you in one go:

CalCOFI = pd.read_csv("data/CalCOFI_subset.csv")
CalCOFI.describe()
Cast_Count Depth_m T_degC Salinity SigmaTheta
count 10057.000000 10057.000000 10026.000000 10057.00000 10008.000000
mean 21507.085214 221.678035 10.420719 33.74584 25.833045
std 9920.178631 279.990210 3.912421 0.41843 0.987026
min 992.000000 0.000000 1.570000 32.62400 23.482000
25% 12480.000000 50.000000 7.290000 33.38200 24.924000
50% 23672.000000 126.000000 9.600000 33.70400 26.005500
75% 29689.000000 300.000000 13.870000 34.11100 26.682250
max 35578.000000 3789.000000 20.590000 34.72000 27.769000

Note that by default, .describe() works only on numerical columns.

Importantly, the output of .describe() is another DataFrame, so you can extract particular rows and columns using the previous tricks. For example, the median (50%) of temperature can be extracted as:

CalSummary = CalCOFI.describe()
CalSummary.loc["50%", "T_degC"]
9.6

And if you want to find the inter-quartile range for all the columns, you may do:

CalSummary.loc["75%"] - CalSummary.loc["25%"]
Cast_Count    17209.00000
Depth_m         250.00000
T_degC            6.58000
Salinity          0.72900
SigmaTheta        1.75825
dtype: float64
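The same inter-quartile range can be computed directly with the .quantile() method, without going through .describe(); a sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({"Depth_m": [0, 10, 20, 30, 40],
                    "T_degC": [14.0, 13.9, 13.8, 13.0, 12.0]})

# Inter-quartile range, column by column, via .quantile()
iqr = toy.quantile(0.75) - toy.quantile(0.25)
print(iqr["Depth_m"])  # 20.0
```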

In addition, there are individual methods that you can apply to the DataFrame to obtain a particular statistic. Some of these methods include:

  • .mean() for the mean

  • .median() for the median

  • .quantile() for quantiles

  • .var() for the variance

  • .std() for the standard deviation

  • .sem() for the standard error of the mean

For example, to compute the standard error in the above numerical variables, we can do:

CalCOFI.sem(numeric_only=True)
Cast_Count    98.920264
Depth_m        2.791956
T_degC         0.039073
Salinity       0.004172
SigmaTheta     0.009866
dtype: float64

Note that for methods other than .describe(), we need to explicitly pass numeric_only=True when the DataFrame contains non-numeric columns.
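A minimal sketch of the numeric_only behavior, on made-up data with a string column:

```python
import pandas as pd

toy = pd.DataFrame({"Station_ID": ["A", "B"], "Depth_m": [10, 30]})

# With a string column present, pass numeric_only=True so that
# only the numeric columns enter the computation
print(toy.mean(numeric_only=True)["Depth_m"])  # 20.0
```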