DataFrame operations#
# initialization
import numpy as np
import pandas as pd
Inspecting a DataFrame#
Let’s load the CalCOFI dataset again and inspect it further:
CalCOFI = pd.read_csv("data/CalCOFI_subset.csv")
display(CalCOFI)
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 0 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 0 | 14.040 | 33.1700 | 24.76600 |
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.950 | 33.2100 | 24.81500 |
| 2 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 20 | 13.900 | 33.2100 | 24.82600 |
| 3 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 23 | 13.880 | 33.2100 | 24.83000 |
| 4 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 30 | 13.810 | 33.2180 | 24.85100 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10052 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 300 | 7.692 | 34.1712 | 26.67697 |
| 10053 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 381 | 7.144 | 34.2443 | 26.81386 |
| 10054 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 400 | 7.031 | 34.2746 | 26.85372 |
| 10055 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 500 | 6.293 | 34.3126 | 26.98372 |
| 10056 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 515 | 6.155 | 34.2903 | 26.98398 |
10057 rows × 7 columns
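In addition to display(), the .head() and .tail() methods are handy when you only want to peek at the first or last few rows. A minimal sketch using a tiny made-up DataFrame (not the CalCOFI file):

```python
import pandas as pd

# a small stand-in DataFrame for illustration
df = pd.DataFrame({"Depth_m": [0, 10, 20, 30, 50],
                   "T_degC": [14.0, 13.9, 13.9, 13.8, 13.2]})

print(df.head(2))   # first 2 rows
print(df.tail(2))   # last 2 rows
```

Both methods default to 5 rows when called without an argument.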
While display() gives us a good glimpse of the data, the datatype of each column is not exactly clear. A good way to examine the data type is to use the .info() method of the DataFrame:
CalCOFI.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10057 entries, 0 to 10056
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Cast_Count 10057 non-null int64
1 Station_ID 10057 non-null object
2 Datetime 10057 non-null object
3 Depth_m 10057 non-null int64
4 T_degC 10026 non-null float64
5 Salinity 10057 non-null float64
6 SigmaTheta 10008 non-null float64
dtypes: float64(3), int64(2), object(2)
memory usage: 550.1+ KB
While it may not be obvious, columns with datatype (“Dtype”) object are most often (but not always!) string columns. So we have 3 columns of floats, 2 columns of integers, and 2 columns of strings.
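One way to confirm that an object column actually holds strings is to inspect the Python type of an individual element. A quick sketch on a made-up Series (a real column such as Station_ID can be checked the same way):

```python
import pandas as pd

# object-dtype Series holding strings, like Station_ID
s = pd.Series(["090.0 070.0", "090.0 070.0"])

print(s.dtype)          # object
print(type(s.iloc[0]))  # <class 'str'>
```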
Just like numpy arrays, a pandas DataFrame has the .size, .shape, and .ndim attributes:
CalCOFI.size # total number of entries
70399
CalCOFI.shape # number of rows and columns
(10057, 7)
CalCOFI.ndim # number of dimensions
2
Extracting parts of a DataFrame#
The column names of a pandas DataFrame can be extracted using the .columns attribute, and its row index can be extracted using the .index attribute:
CalCOFI.columns # column labels
Index(['Cast_Count', 'Station_ID', 'Datetime', 'Depth_m', 'T_degC', 'Salinity',
'SigmaTheta'],
dtype='object')
CalCOFI.index # row labels
RangeIndex(start=0, stop=10057, step=1)
To extract the pandas Series that corresponds to a specific column, we can use the square brackets [] syntax. For example:
CalCOFI["Depth_m"]
0 0
1 10
2 20
3 23
4 30
...
10052 300
10053 381
10054 400
10055 500
10056 515
Name: Depth_m, Length: 10057, dtype: int64
Furthermore, we can extract the data contained within the pandas Series using the .values attribute:
CalCOFI["Depth_m"].values
array([ 0, 10, 20, ..., 400, 500, 515], dtype=int64)
Similarly, we can extract the data contained in column labels and row labels of a DataFrame using the .values attribute:
CalCOFI.columns.values
array(['Cast_Count', 'Station_ID', 'Datetime', 'Depth_m', 'T_degC',
'Salinity', 'SigmaTheta'], dtype=object)
CalCOFI.index.values
array([ 0, 1, 2, ..., 10054, 10055, 10056], dtype=int64)
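As a side note, the pandas documentation now recommends the .to_numpy() method over the .values attribute for extracting the underlying array; the two typically return the same thing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Depth_m": [0, 10, 20]})

arr = df["Depth_m"].to_numpy()  # preferred over .values in recent pandas
print(arr)
```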
Subsetting a DataFrame#
There are two main methods to subset a DataFrame in pandas. The first is .iloc[], which subsets a DataFrame by positional row and column indices. For example, if we want to extract rows 1 (positional index 0) to 10 (positional index 9), we can do (note that, like other slicing operations we’ve seen thus far, .iloc[] is endpoint exclusive):
CalCOFI.iloc[0:10]
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 0 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 0 | 14.04 | 33.170 | 24.766 |
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.95 | 33.210 | 24.815 |
| 2 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 20 | 13.90 | 33.210 | 24.826 |
| 3 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 23 | 13.88 | 33.210 | 24.830 |
| 4 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 30 | 13.81 | 33.218 | 24.851 |
| 5 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 50 | 13.25 | 33.150 | 24.912 |
| 6 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 72 | 11.83 | 33.130 | 25.171 |
| 7 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 75 | 11.70 | 33.142 | 25.204 |
| 8 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 99 | 10.86 | 33.280 | 25.464 |
| 9 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 100 | 10.83 | 33.288 | 25.475 |
If, in addition, we need only the last 4 columns of the DataFrame, and we want their order reversed, we may do:
CalCOFI.iloc[0:10, -1:-5:-1]
| SigmaTheta | Salinity | T_degC | Depth_m | |
|---|---|---|---|---|
| 0 | 24.766 | 33.170 | 14.04 | 0 |
| 1 | 24.815 | 33.210 | 13.95 | 10 |
| 2 | 24.826 | 33.210 | 13.90 | 20 |
| 3 | 24.830 | 33.210 | 13.88 | 23 |
| 4 | 24.851 | 33.218 | 13.81 | 30 |
| 5 | 24.912 | 33.150 | 13.25 | 50 |
| 6 | 25.171 | 33.130 | 11.83 | 72 |
| 7 | 25.204 | 33.142 | 11.70 | 75 |
| 8 | 25.464 | 33.280 | 10.86 | 99 |
| 9 | 25.475 | 33.288 | 10.83 | 100 |
And if we want the last 4 columns of the DataFrame but retain all rows, we can do:
CalCOFI.iloc[:, -4:]
| Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|
| 0 | 0 | 14.040 | 33.1700 | 24.76600 |
| 1 | 10 | 13.950 | 33.2100 | 24.81500 |
| 2 | 20 | 13.900 | 33.2100 | 24.82600 |
| 3 | 23 | 13.880 | 33.2100 | 24.83000 |
| 4 | 30 | 13.810 | 33.2180 | 24.85100 |
| ... | ... | ... | ... | ... |
| 10052 | 300 | 7.692 | 34.1712 | 26.67697 |
| 10053 | 381 | 7.144 | 34.2443 | 26.81386 |
| 10054 | 400 | 7.031 | 34.2746 | 26.85372 |
| 10055 | 500 | 6.293 | 34.3126 | 26.98372 |
| 10056 | 515 | 6.155 | 34.2903 | 26.98398 |
10057 rows × 4 columns
Alternatively, since a pandas DataFrame also stores row and column labels, these can be used to subset the DataFrame. To do so, we use the method .loc[] instead of .iloc[]. For example, suppose we want to retain only the datetime, depth, and temperature information of the DataFrame; we may do:
CalCOFI.loc[:, ["Datetime", "Depth_m", "T_degC"]]
| Datetime | Depth_m | T_degC | |
|---|---|---|---|
| 0 | 1950-02-06 19:54:00 | 0 | 14.040 |
| 1 | 1950-02-06 19:54:00 | 10 | 13.950 |
| 2 | 1950-02-06 19:54:00 | 20 | 13.900 |
| 3 | 1950-02-06 19:54:00 | 23 | 13.880 |
| 4 | 1950-02-06 19:54:00 | 30 | 13.810 |
| ... | ... | ... | ... |
| 10052 | 2021-01-21 13:36:00 | 300 | 7.692 |
| 10053 | 2021-01-21 13:36:00 | 381 | 7.144 |
| 10054 | 2021-01-21 13:36:00 | 400 | 7.031 |
| 10055 | 2021-01-21 13:36:00 | 500 | 6.293 |
| 10056 | 2021-01-21 13:36:00 | 515 | 6.155 |
10057 rows × 3 columns
Alternatively, noting that “Datetime”, “Depth_m”, and “T_degC” are consecutive columns, we can also use the slicing syntax. Note that unlike .iloc[], slices in .loc[] are endpoint inclusive:
CalCOFI.loc[:, "Datetime":"T_degC"]
| Datetime | Depth_m | T_degC | |
|---|---|---|---|
| 0 | 1950-02-06 19:54:00 | 0 | 14.040 |
| 1 | 1950-02-06 19:54:00 | 10 | 13.950 |
| 2 | 1950-02-06 19:54:00 | 20 | 13.900 |
| 3 | 1950-02-06 19:54:00 | 23 | 13.880 |
| 4 | 1950-02-06 19:54:00 | 30 | 13.810 |
| ... | ... | ... | ... |
| 10052 | 2021-01-21 13:36:00 | 300 | 7.692 |
| 10053 | 2021-01-21 13:36:00 | 381 | 7.144 |
| 10054 | 2021-01-21 13:36:00 | 400 | 7.031 |
| 10055 | 2021-01-21 13:36:00 | 500 | 6.293 |
| 10056 | 2021-01-21 13:36:00 | 515 | 6.155 |
10057 rows × 3 columns
For subsetting rows, in addition to accepting row label lists and slices (which are not particularly useful when the row labels are dummies), .loc[] also accepts boolean arrays. For example, let’s say we want to extract all observations (rows) at depth = 10 m, we can do:
CalCOFI.loc[CalCOFI["Depth_m"] == 10]
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.950 | 33.2100 | 24.81500 |
| 28 | 1267 | 090.0 070.0 | 1950-04-08 17:30:00 | 10 | 14.980 | 33.4000 | 24.74400 |
| 62 | 1387 | 090.0 070.0 | 1950-05-07 17:36:00 | 10 | 14.170 | 33.3080 | 24.84500 |
| 98 | 1477 | 090.0 070.0 | 1950-06-13 00:42:00 | 10 | 14.510 | 33.1110 | 24.62200 |
| 130 | 1589 | 090.0 070.0 | 1950-07-13 00:54:00 | 10 | 16.550 | 33.1020 | 24.16400 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9913 | 35250 | 090.0 070.0 | 2019-11-09 01:24:00 | 10 | 16.983 | 33.3950 | 24.28840 |
| 9944 | 35354 | 090.0 070.0 | 2020-01-09 18:13:00 | 10 | 15.054 | 33.3427 | 24.68417 |
| 9978 | 35429 | 090.0 070.0 | 2020-07-16 23:40:00 | 10 | 17.114 | 33.6188 | 24.42921 |
| 10015 | 35500 | 090.0 070.0 | 2020-10-14 23:52:00 | 10 | 17.373 | 33.3346 | 24.14988 |
| 10037 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 10 | 15.092 | 33.4561 | 24.76328 |
348 rows × 7 columns
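Boolean masks can also be combined with & (and), | (or), and ~ (not). Because of operator precedence, each condition must be wrapped in parentheses. A sketch on a tiny made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Depth_m": [0, 10, 10, 500],
                   "T_degC": [14.0, 13.9, 6.0, 5.0]})

# rows at 10 m depth AND warmer than 10 degC
subset = df.loc[(df["Depth_m"] == 10) & (df["T_degC"] > 10)]
print(subset)
```

Forgetting the parentheses (e.g., `df["Depth_m"] == 10 & df["T_degC"] > 10`) raises an error or gives the wrong result, since & binds more tightly than ==.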
Note that pandas has a special function pd.isna() that performs a vectorized check on whether values are coded as missing (e.g., None or np.nan). So, to extract all observations (rows) for which the temperature measurement is missing, we may do:
CalCOFI.loc[pd.isna(CalCOFI["T_degC"])]
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 269 | 3097 | 090.0 070.0 | 1951-07-05 12:06:00 | 0 | NaN | 33.300 | NaN |
| 273 | 3097 | 090.0 070.0 | 1951-07-05 12:06:00 | 27 | NaN | 33.300 | NaN |
| 276 | 3097 | 090.0 070.0 | 1951-07-05 12:06:00 | 54 | NaN | 33.310 | NaN |
| 278 | 3097 | 090.0 070.0 | 1951-07-05 12:06:00 | 80 | NaN | 33.140 | NaN |
| 311 | 3484 | 090.0 070.0 | 1951-10-01 02:36:00 | 75 | NaN | 33.210 | NaN |
| 369 | 3873 | 090.0 070.0 | 1952-02-18 09:24:00 | 0 | NaN | 33.150 | NaN |
| 372 | 3873 | 090.0 070.0 | 1952-02-18 09:24:00 | 19 | NaN | 33.070 | NaN |
| 375 | 3873 | 090.0 070.0 | 1952-02-18 09:24:00 | 39 | NaN | 33.130 | NaN |
| 377 | 3873 | 090.0 070.0 | 1952-02-18 09:24:00 | 57 | NaN | 33.120 | NaN |
| 384 | 4037 | 090.0 070.0 | 1952-03-09 20:54:00 | 0 | NaN | 33.190 | NaN |
| 387 | 4037 | 090.0 070.0 | 1952-03-09 20:54:00 | 16 | NaN | 33.220 | NaN |
| 390 | 4037 | 090.0 070.0 | 1952-03-09 20:54:00 | 35 | NaN | 33.210 | NaN |
| 391 | 4037 | 090.0 070.0 | 1952-03-09 20:54:00 | 50 | NaN | 33.210 | NaN |
| 414 | 4136 | 090.0 070.0 | 1952-04-03 06:30:00 | 0 | NaN | 33.300 | NaN |
| 418 | 4136 | 090.0 070.0 | 1952-04-03 06:30:00 | 21 | NaN | 33.310 | NaN |
| 420 | 4136 | 090.0 070.0 | 1952-04-03 06:30:00 | 44 | NaN | 33.330 | NaN |
| 422 | 4136 | 090.0 070.0 | 1952-04-03 06:30:00 | 63 | NaN | 33.310 | NaN |
| 424 | 4136 | 090.0 070.0 | 1952-04-03 06:30:00 | 84 | NaN | 33.220 | NaN |
| 426 | 4136 | 090.0 070.0 | 1952-04-03 06:30:00 | 115 | NaN | 33.170 | NaN |
| 436 | 4136 | 090.0 070.0 | 1952-04-03 06:30:00 | 323 | NaN | 34.150 | NaN |
| 449 | 4359 | 090.0 070.0 | 1952-05-11 16:12:00 | 19 | NaN | 33.150 | NaN |
| 452 | 4359 | 090.0 070.0 | 1952-05-11 16:12:00 | 45 | NaN | 33.100 | NaN |
| 454 | 4359 | 090.0 070.0 | 1952-05-11 16:12:00 | 66 | NaN | 33.080 | NaN |
| 456 | 4359 | 090.0 070.0 | 1952-05-11 16:12:00 | 90 | NaN | 33.080 | NaN |
| 463 | 4359 | 090.0 070.0 | 1952-05-11 16:12:00 | 217 | NaN | 33.910 | NaN |
| 485 | 4524 | 090.0 070.0 | 1952-06-05 22:48:00 | 50 | NaN | 32.950 | NaN |
| 1427 | 8006 | 090.0 070.0 | 1955-12-05 12:24:00 | 57 | NaN | 33.420 | NaN |
| 1430 | 8006 | 090.0 070.0 | 1955-12-05 12:24:00 | 76 | NaN | 33.350 | NaN |
| 1587 | 9017 | 090.0 070.0 | 1957-07-14 08:42:00 | 1025 | NaN | 34.470 | NaN |
| 1609 | 9018 | 090.0 070.0 | 1957-07-14 10:42:00 | 63 | NaN | 33.300 | NaN |
| 5244 | 24355 | 090.0 070.0 | 1986-02-14 15:49:00 | 11 | NaN | 33.301 | NaN |
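If you only need to count the missing values per column (rather than extract the rows), a common idiom is to chain .isna() with .sum(), since True counts as 1 when summed. A sketch on a small made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"T_degC": [14.0, np.nan, 6.0, np.nan],
                   "Salinity": [33.1, 33.2, 33.3, 33.4]})

# number of missing values in each column
print(df.isna().sum())
```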
If you want to drop all rows that have missing values in any column, you can use the .dropna() method of the DataFrame:
CalCOFI.dropna()
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 0 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 0 | 14.040 | 33.1700 | 24.76600 |
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.950 | 33.2100 | 24.81500 |
| 2 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 20 | 13.900 | 33.2100 | 24.82600 |
| 3 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 23 | 13.880 | 33.2100 | 24.83000 |
| 4 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 30 | 13.810 | 33.2180 | 24.85100 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10052 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 300 | 7.692 | 34.1712 | 26.67697 |
| 10053 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 381 | 7.144 | 34.2443 | 26.81386 |
| 10054 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 400 | 7.031 | 34.2746 | 26.85372 |
| 10055 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 500 | 6.293 | 34.3126 | 26.98372 |
| 10056 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 515 | 6.155 | 34.2903 | 26.98398 |
10008 rows × 7 columns
And if you want .dropna() to consider only certain columns when deciding if a row should be dropped, use its subset argument:
# no data is dropped since the Depth_m column has no missing values
CalCOFI.dropna(subset=["Depth_m"])
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 0 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 0 | 14.040 | 33.1700 | 24.76600 |
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.950 | 33.2100 | 24.81500 |
| 2 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 20 | 13.900 | 33.2100 | 24.82600 |
| 3 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 23 | 13.880 | 33.2100 | 24.83000 |
| 4 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 30 | 13.810 | 33.2180 | 24.85100 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10052 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 300 | 7.692 | 34.1712 | 26.67697 |
| 10053 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 381 | 7.144 | 34.2443 | 26.81386 |
| 10054 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 400 | 7.031 | 34.2746 | 26.85372 |
| 10055 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 500 | 6.293 | 34.3126 | 26.98372 |
| 10056 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 515 | 6.155 | 34.2903 | 26.98398 |
10057 rows × 7 columns
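Beyond subset, .dropna() also accepts a how argument ("any" vs. "all") and a thresh argument (the minimum number of non-missing values a row must have to be kept). A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, np.nan],
                   "b": [2.0, 3.0, np.nan]})

print(df.dropna(how="any"))  # drop rows with any missing value
print(df.dropna(how="all"))  # drop rows where every value is missing
print(df.dropna(thresh=1))   # keep rows with at least 1 non-missing value
```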
Sorting the rows of a DataFrame#
We can use the .sort_values() method to sort a pandas DataFrame. For example, to sort the rows of CalCOFI by temperature, from high to low, we can do:
CalCOFI_sorted = CalCOFI.sort_values("T_degC", ascending=False)
display(CalCOFI_sorted)
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 4276 | 22468 | 090.0 070.0 | 1983-09-27 15:15:00 | 10 | 20.59 | 33.532 | 23.488 |
| 4275 | 22468 | 090.0 070.0 | 1983-09-27 15:15:00 | 0 | 20.59 | 33.533 | 23.489 |
| 4277 | 22468 | 090.0 070.0 | 1983-09-27 15:15:00 | 20 | 20.58 | 33.519 | 23.482 |
| 6809 | 28613 | 090.0 070.0 | 1997-09-25 11:46:00 | 15 | 20.21 | 33.635 | 23.668 |
| 6808 | 28613 | 090.0 070.0 | 1997-09-25 11:46:00 | 10 | 20.20 | 33.634 | 23.667 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1427 | 8006 | 090.0 070.0 | 1955-12-05 12:24:00 | 57 | NaN | 33.420 | NaN |
| 1430 | 8006 | 090.0 070.0 | 1955-12-05 12:24:00 | 76 | NaN | 33.350 | NaN |
| 1587 | 9017 | 090.0 070.0 | 1957-07-14 08:42:00 | 1025 | NaN | 34.470 | NaN |
| 1609 | 9018 | 090.0 070.0 | 1957-07-14 10:42:00 | 63 | NaN | 33.300 | NaN |
| 5244 | 24355 | 090.0 070.0 | 1986-02-14 15:49:00 | 11 | NaN | 33.301 | NaN |
10057 rows × 7 columns
Note that missing values are sorted to the end. Also note that the index of the DataFrame is rearranged too. If we want to rebuild the index (so that it starts from 0 and increments by 1 in the sorted DataFrame), we can use the .reset_index() method with the drop=True argument (which removes the old index instead of retaining it as a regular column):
CalCOFI_sorted.reset_index(drop=True)
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 0 | 22468 | 090.0 070.0 | 1983-09-27 15:15:00 | 10 | 20.59 | 33.532 | 23.488 |
| 1 | 22468 | 090.0 070.0 | 1983-09-27 15:15:00 | 0 | 20.59 | 33.533 | 23.489 |
| 2 | 22468 | 090.0 070.0 | 1983-09-27 15:15:00 | 20 | 20.58 | 33.519 | 23.482 |
| 3 | 28613 | 090.0 070.0 | 1997-09-25 11:46:00 | 15 | 20.21 | 33.635 | 23.668 |
| 4 | 28613 | 090.0 070.0 | 1997-09-25 11:46:00 | 10 | 20.20 | 33.634 | 23.667 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10052 | 8006 | 090.0 070.0 | 1955-12-05 12:24:00 | 57 | NaN | 33.420 | NaN |
| 10053 | 8006 | 090.0 070.0 | 1955-12-05 12:24:00 | 76 | NaN | 33.350 | NaN |
| 10054 | 9017 | 090.0 070.0 | 1957-07-14 08:42:00 | 1025 | NaN | 34.470 | NaN |
| 10055 | 9018 | 090.0 070.0 | 1957-07-14 10:42:00 | 63 | NaN | 33.300 | NaN |
| 10056 | 24355 | 090.0 070.0 | 1986-02-14 15:49:00 | 11 | NaN | 33.301 | NaN |
10057 rows × 7 columns
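Note that .sort_values() also accepts a list of columns (with a matching list for ascending) when you want to sort by multiple keys. A sketch on a tiny made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Cast_Count": [2, 1, 1],
                   "Depth_m": [0, 20, 10]})

# sort by Cast_Count ascending, then by Depth_m descending within ties
out = df.sort_values(["Cast_Count", "Depth_m"], ascending=[True, False])
print(out)
```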
Modifying a DataFrame#
The subsetting syntax can also be used to overwrite data in a DataFrame. For example, suppose we want to set the SigmaTheta value to missing (np.nan) whenever the depth is greater than 300 m; we can do:
CalCOFI.loc[CalCOFI["Depth_m"] > 300, "SigmaTheta"] = np.nan
display(CalCOFI)
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 0 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 0 | 14.040 | 33.1700 | 24.76600 |
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.950 | 33.2100 | 24.81500 |
| 2 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 20 | 13.900 | 33.2100 | 24.82600 |
| 3 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 23 | 13.880 | 33.2100 | 24.83000 |
| 4 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 30 | 13.810 | 33.2180 | 24.85100 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10052 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 300 | 7.692 | 34.1712 | 26.67697 |
| 10053 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 381 | 7.144 | 34.2443 | NaN |
| 10054 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 400 | 7.031 | 34.2746 | NaN |
| 10055 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 500 | 6.293 | 34.3126 | NaN |
| 10056 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 515 | 6.155 | 34.2903 | NaN |
10057 rows × 7 columns
Note that unlike most pandas operations, the changes here happen in place.
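Because such assignments modify the DataFrame in place, it is often prudent to work on an explicit copy when you want to keep the original intact. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Depth_m": [100, 400],
                   "SigmaTheta": [25.0, 27.0]})

df_mod = df.copy()  # an independent copy; df itself stays untouched
df_mod.loc[df_mod["Depth_m"] > 300, "SigmaTheta"] = np.nan

print(df["SigmaTheta"].tolist())      # original values preserved
print(df_mod["SigmaTheta"].tolist())  # second value replaced by NaN
```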
More often, we would like to derive a new column from existing columns. Just as with ndarrays, we can do vectorized computation on a pandas Series (recall that each column of a pandas DataFrame is a pandas Series). Moreover, we can assign the result to a new column of a DataFrame. For example, suppose we want to create a log_Depth_m column and a T_degF column. We can do so with:
CalCOFI["log_Depth_m"] = np.log(CalCOFI["Depth_m"] + 1)
CalCOFI["T_degF"] = CalCOFI["T_degC"] * (9/5) + 32
display(CalCOFI)
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | log_Depth_m | T_degF | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 0 | 14.040 | 33.1700 | 24.76600 | 0.000000 | 57.2720 |
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.950 | 33.2100 | 24.81500 | 2.397895 | 57.1100 |
| 2 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 20 | 13.900 | 33.2100 | 24.82600 | 3.044522 | 57.0200 |
| 3 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 23 | 13.880 | 33.2100 | 24.83000 | 3.178054 | 56.9840 |
| 4 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 30 | 13.810 | 33.2180 | 24.85100 | 3.433987 | 56.8580 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10052 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 300 | 7.692 | 34.1712 | 26.67697 | 5.707110 | 45.8456 |
| 10053 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 381 | 7.144 | 34.2443 | NaN | 5.945421 | 44.8592 |
| 10054 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 400 | 7.031 | 34.2746 | NaN | 5.993961 | 44.6558 |
| 10055 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 500 | 6.293 | 34.3126 | NaN | 6.216606 | 43.3274 |
| 10056 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 515 | 6.155 | 34.2903 | NaN | 6.246107 | 43.0790 |
10057 rows × 9 columns
Notice that the new columns are inserted to the right of the existing columns.
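An alternative that creates the new columns without mutating the original DataFrame is the .assign() method, which returns a new DataFrame. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Depth_m": [0, 10], "T_degC": [14.0, 13.9]})

# returns a new DataFrame; df itself is unchanged
df2 = df.assign(log_Depth_m=np.log(df["Depth_m"] + 1),
                T_degF=df["T_degC"] * (9 / 5) + 32)
print(df2.columns.tolist())
```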
Remark: dummy versus semantic row index#
In the examples above, the row index is really just an arbitrary number that has no meaning in and of itself, and our row operations have been based on the actual data contained in the columns. For most data, this setup is a good default.
However, sometimes your tabular data has a special column that can serve as a unique identifier of each row. In other words, the value of this column uniquely identifies the row of data it comes from. A good example is time series data, where the timestamp uniquely identifies a row of observations. In such cases, it can be beneficial to set this specific column as the index of the DataFrame.
Our CalCOFI subset has no natural unique identifier. However, suppose we are interested only in observations at depth = 10 m. Then the Cast_Count column uniquely identifies an observation.
CalCOFI = pd.read_csv("data/CalCOFI_subset.csv")
CalCOFI_10m = CalCOFI.loc[CalCOFI["Depth_m"] == 10]
display(CalCOFI_10m)
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.950 | 33.2100 | 24.81500 |
| 28 | 1267 | 090.0 070.0 | 1950-04-08 17:30:00 | 10 | 14.980 | 33.4000 | 24.74400 |
| 62 | 1387 | 090.0 070.0 | 1950-05-07 17:36:00 | 10 | 14.170 | 33.3080 | 24.84500 |
| 98 | 1477 | 090.0 070.0 | 1950-06-13 00:42:00 | 10 | 14.510 | 33.1110 | 24.62200 |
| 130 | 1589 | 090.0 070.0 | 1950-07-13 00:54:00 | 10 | 16.550 | 33.1020 | 24.16400 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9913 | 35250 | 090.0 070.0 | 2019-11-09 01:24:00 | 10 | 16.983 | 33.3950 | 24.28840 |
| 9944 | 35354 | 090.0 070.0 | 2020-01-09 18:13:00 | 10 | 15.054 | 33.3427 | 24.68417 |
| 9978 | 35429 | 090.0 070.0 | 2020-07-16 23:40:00 | 10 | 17.114 | 33.6188 | 24.42921 |
| 10015 | 35500 | 090.0 070.0 | 2020-10-14 23:52:00 | 10 | 17.373 | 33.3346 | 24.14988 |
| 10037 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 10 | 15.092 | 33.4561 | 24.76328 |
348 rows × 7 columns
Thus, we may set the Cast_Count column as the row index of this subsetted DataFrame. To do so, we use the .set_index() method of the DataFrame:
CalCOFI_10m = CalCOFI_10m.set_index("Cast_Count")
display(CalCOFI_10m)
| Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|
| Cast_Count | ||||||
| 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.950 | 33.2100 | 24.81500 |
| 1267 | 090.0 070.0 | 1950-04-08 17:30:00 | 10 | 14.980 | 33.4000 | 24.74400 |
| 1387 | 090.0 070.0 | 1950-05-07 17:36:00 | 10 | 14.170 | 33.3080 | 24.84500 |
| 1477 | 090.0 070.0 | 1950-06-13 00:42:00 | 10 | 14.510 | 33.1110 | 24.62200 |
| 1589 | 090.0 070.0 | 1950-07-13 00:54:00 | 10 | 16.550 | 33.1020 | 24.16400 |
| ... | ... | ... | ... | ... | ... | ... |
| 35250 | 090.0 070.0 | 2019-11-09 01:24:00 | 10 | 16.983 | 33.3950 | 24.28840 |
| 35354 | 090.0 070.0 | 2020-01-09 18:13:00 | 10 | 15.054 | 33.3427 | 24.68417 |
| 35429 | 090.0 070.0 | 2020-07-16 23:40:00 | 10 | 17.114 | 33.6188 | 24.42921 |
| 35500 | 090.0 070.0 | 2020-10-14 23:52:00 | 10 | 17.373 | 33.3346 | 24.14988 |
| 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 10 | 15.092 | 33.4561 | 24.76328 |
348 rows × 6 columns
With a semantic row index such as Cast_Count, we can use .loc[] to meaningfully subset our data, e.g.,
CalCOFI_10m_mid = CalCOFI_10m.loc[8000:10000]
display(CalCOFI_10m_mid)
| Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|
| Cast_Count | ||||||
| 8006 | 090.0 070.0 | 1955-12-05 12:24:00 | 10 | 13.44 | 33.400 | 25.066 |
| 8141 | 090.0 070.0 | 1956-02-08 08:24:00 | 10 | 12.79 | 33.270 | 25.095 |
| 8330 | 090.0 070.0 | 1956-04-19 21:18:00 | 10 | 12.53 | 33.330 | 25.192 |
| 8437 | 090.0 070.0 | 1956-06-09 14:24:00 | 10 | 13.68 | 33.220 | 24.879 |
| 8755 | 090.0 070.0 | 1957-04-15 19:00:00 | 10 | 13.98 | 33.440 | 24.987 |
| 9018 | 090.0 070.0 | 1957-07-14 10:42:00 | 10 | 17.95 | 33.417 | 24.074 |
| 9203 | 090.0 070.0 | 1957-10-29 06:54:00 | 10 | 18.97 | 33.710 | 24.045 |
| 9327 | 090.0 070.0 | 1957-12-17 14:00:00 | 10 | 16.45 | 33.440 | 24.446 |
| 9527 | 090.0 070.0 | 1958-03-12 02:48:00 | 10 | 15.38 | 33.580 | 24.795 |
| 9993 | 090.0 070.0 | 1958-09-11 20:24:00 | 10 | 18.82 | 33.275 | 23.750 |
Notice that even though there is no row with Cast_Count exactly equal to 8000 or 10000, pandas knows how to use these values as bounds, because our index is sorted (or equivalently, monotone). Moreover, recall that .loc[] is endpoint inclusive. Thus, knowing our result, the above is equivalent to:
CalCOFI_10m_mid = CalCOFI_10m.loc[8006:9993]
display(CalCOFI_10m_mid)
| Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|
| Cast_Count | ||||||
| 8006 | 090.0 070.0 | 1955-12-05 12:24:00 | 10 | 13.44 | 33.400 | 25.066 |
| 8141 | 090.0 070.0 | 1956-02-08 08:24:00 | 10 | 12.79 | 33.270 | 25.095 |
| 8330 | 090.0 070.0 | 1956-04-19 21:18:00 | 10 | 12.53 | 33.330 | 25.192 |
| 8437 | 090.0 070.0 | 1956-06-09 14:24:00 | 10 | 13.68 | 33.220 | 24.879 |
| 8755 | 090.0 070.0 | 1957-04-15 19:00:00 | 10 | 13.98 | 33.440 | 24.987 |
| 9018 | 090.0 070.0 | 1957-07-14 10:42:00 | 10 | 17.95 | 33.417 | 24.074 |
| 9203 | 090.0 070.0 | 1957-10-29 06:54:00 | 10 | 18.97 | 33.710 | 24.045 |
| 9327 | 090.0 070.0 | 1957-12-17 14:00:00 | 10 | 16.45 | 33.440 | 24.446 |
| 9527 | 090.0 070.0 | 1958-03-12 02:48:00 | 10 | 15.38 | 33.580 | 24.795 |
| 9993 | 090.0 070.0 | 1958-09-11 20:24:00 | 10 | 18.82 | 33.275 | 23.750 |
As it turns out, pandas allows an index to be non-unique and non-monotone. However, some row operations may no longer work or may have unexpected behaviors in such cases.
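If you are unsure whether an index supports such operations, you can check its .is_unique and .is_monotonic_increasing attributes. A sketch on a tiny made-up DataFrame:

```python
import pandas as pd

# unique but unsorted index
df = pd.DataFrame({"x": [1, 2, 3]}, index=[10, 30, 20])

print(df.index.is_unique)                             # True
print(df.index.is_monotonic_increasing)               # False
print(df.sort_index().index.is_monotonic_increasing)  # True
```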
If the index of your DataFrame is not monotone, you can sort the data along the index using the .sort_index() method, which works similarly to the .sort_values() method. For example:
CalCOFI_sal = CalCOFI.set_index("Salinity").sort_index()
display(CalCOFI_sal)
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | SigmaTheta | |
|---|---|---|---|---|---|---|
| Salinity | ||||||
| 32.624 | 30396 | 090.0 070.0 | 2003-10-25 05:31:00 | 59 | 12.07 | 24.733 |
| 32.641 | 30396 | 090.0 070.0 | 2003-10-25 05:31:00 | 70 | 11.42 | 24.866 |
| 32.653 | 30396 | 090.0 070.0 | 2003-10-25 05:31:00 | 75 | 11.15 | 24.924 |
| 32.674 | 30396 | 090.0 070.0 | 2003-10-25 05:31:00 | 50 | 12.85 | 24.623 |
| 32.693 | 30396 | 090.0 070.0 | 2003-10-25 05:31:00 | 85 | 10.66 | 25.041 |
| ... | ... | ... | ... | ... | ... | ... |
| 34.677 | 13987 | 090.0 070.0 | 1964-04-18 02:18:00 | 3600 | 1.57 | 27.766 |
| 34.679 | 13987 | 090.0 070.0 | 1964-04-18 02:18:00 | 3689 | 1.57 | 27.769 |
| 34.680 | 13987 | 090.0 070.0 | 1964-04-18 02:18:00 | 3789 | 1.58 | 27.769 |
| 34.690 | 2378 | 090.0 070.0 | 1951-02-20 03:00:00 | 2366 | 1.86 | 27.747 |
| 34.720 | 6634 | 090.0 070.0 | 1954-02-10 14:18:00 | 1753 | 2.34 | 27.730 |
10057 rows × 6 columns
Note that even after sorting, Salinity is still not a good row index, since it is not unique.
Statistics functions#
If you want to obtain summary statistics of a DataFrame, pandas has a convenient .describe() method that computes a bunch of them for you in one go:
CalCOFI = pd.read_csv("data/CalCOFI_subset.csv")
CalCOFI.describe()
| Cast_Count | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|
| count | 10057.000000 | 10057.000000 | 10026.000000 | 10057.00000 | 10008.000000 |
| mean | 21507.085214 | 221.678035 | 10.420719 | 33.74584 | 25.833045 |
| std | 9920.178631 | 279.990210 | 3.912421 | 0.41843 | 0.987026 |
| min | 992.000000 | 0.000000 | 1.570000 | 32.62400 | 23.482000 |
| 25% | 12480.000000 | 50.000000 | 7.290000 | 33.38200 | 24.924000 |
| 50% | 23672.000000 | 126.000000 | 9.600000 | 33.70400 | 26.005500 |
| 75% | 29689.000000 | 300.000000 | 13.870000 | 34.11100 | 26.682250 |
| max | 35578.000000 | 3789.000000 | 20.590000 | 34.72000 | 27.769000 |
Note that by default, .describe() works only on numerical columns.
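To summarize non-numerical columns as well, you can pass the include argument to .describe(); for object columns it reports the count, the number of unique values, the most frequent value (top), and its frequency (freq). A sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"Station_ID": ["A", "A", "B"],
                   "Depth_m": [0, 10, 0]})

# summarize only the object (string) columns
summary = df.describe(include="object")
print(summary)
```

Passing `include="all"` would instead summarize every column at once.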
Importantly, the output of .describe() is another DataFrame, so you can extract particular rows and columns using the previous tricks. For example, the median (50%) of temperature can be extracted as:
CalSummary = CalCOFI.describe()
CalSummary.loc["50%", "T_degC"]
9.6
And if you want to find the inter-quartile range for all the columns, you may do:
CalSummary.loc["75%"] - CalSummary.loc["25%"]
Cast_Count 17209.00000
Depth_m 250.00000
T_degC 6.58000
Salinity 0.72900
SigmaTheta 1.75825
dtype: float64
In addition, there are individual methods that you can apply to a DataFrame to obtain a particular statistic. Some of these methods include:
- .mean() for the mean
- .median() for the median
- .quantile() for quantiles
- .var() for the variance
- .std() for the standard deviation
- .sem() for the standard error
For example, to compute the standard error in the above numerical variables, we can do:
CalCOFI.sem(numeric_only=True)
Cast_Count 98.920264
Depth_m 2.791956
T_degC 0.039073
Salinity 0.004172
SigmaTheta 0.009866
dtype: float64
Note that for functions other than .describe(), we need to explicitly pass numeric_only=True to restrict the computation to the numerical columns.
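For example, .quantile() can compute several quantiles at once, which gives another way to obtain the inter-quartile range. A sketch on a tiny made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Depth_m": [0, 10, 20, 30, 40]})

# 25th and 75th percentiles of each numerical column
q = df.quantile([0.25, 0.75], numeric_only=True)
iqr = q.loc[0.75] - q.loc[0.25]
print(iqr)
```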