DataFrame operations#
# initialization
import numpy as np
import pandas as pd
Inspecting a DataFrame#
Let’s load the CalCOFI dataset again and inspect it further:
CalCOFI = pd.read_csv("data/CalCOFI_subset.csv")
display(CalCOFI)
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 0 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 0 | 14.040 | 33.1700 | 24.76600 |
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.950 | 33.2100 | 24.81500 |
| 2 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 20 | 13.900 | 33.2100 | 24.82600 |
| 3 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 23 | 13.880 | 33.2100 | 24.83000 |
| 4 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 30 | 13.810 | 33.2180 | 24.85100 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10052 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 300 | 7.692 | 34.1712 | 26.67697 |
| 10053 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 381 | 7.144 | 34.2443 | 26.81386 |
| 10054 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 400 | 7.031 | 34.2746 | 26.85372 |
| 10055 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 500 | 6.293 | 34.3126 | 26.98372 |
| 10056 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 515 | 6.155 | 34.2903 | 26.98398 |
10057 rows × 7 columns
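In addition to display(), the .head() and .tail() methods are handy when you only want to peek at the first or last few rows. A minimal sketch using a tiny made-up DataFrame (not the CalCOFI file):

```python
import pandas as pd

# a small stand-in DataFrame for illustration
df = pd.DataFrame({"Depth_m": [0, 10, 20, 30, 50],
                   "T_degC": [14.0, 13.9, 13.9, 13.8, 13.2]})

print(df.head(2))   # first 2 rows
print(df.tail(2))   # last 2 rows
```

Both methods default to 5 rows when called without an argument.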
While display() gives us a good glimpse of the data, the datatype of each column is not exactly clear. A good way to examine the data type is to use the .info() method of the DataFrame:
CalCOFI.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10057 entries, 0 to 10056
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Cast_Count 10057 non-null int64
1 Station_ID 10057 non-null object
2 Datetime 10057 non-null object
3 Depth_m 10057 non-null int64
4 T_degC 10026 non-null float64
5 Salinity 10057 non-null float64
6 SigmaTheta 10008 non-null float64
dtypes: float64(3), int64(2), object(2)
memory usage: 550.1+ KB
While it may not be obvious, columns with datatype (“Dtype”) object are most often (but not always!) string columns. So we have 3 columns of floats, 2 columns of integers, and 2 columns of strings.
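One way to confirm that an object column actually holds strings is to inspect the Python type of an individual element. A quick sketch on a made-up Series (a real column such as Station_ID can be checked the same way):

```python
import pandas as pd

# object-dtype Series holding strings, like Station_ID
s = pd.Series(["090.0 070.0", "090.0 070.0"])

print(s.dtype)          # object
print(type(s.iloc[0]))  # <class 'str'>
```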
Just like numpy arrays, a pandas DataFrame has the .size, .shape, and .ndim attributes:
CalCOFI.size # total number of entries
70399
CalCOFI.shape # number of rows and columns
(10057, 7)
CalCOFI.ndim # number of dimensions
2
Extracting parts of a DataFrame#
The column names of a pandas DataFrame can be extracted using the .columns attribute, and its row index can be extracted using the .index attribute:
CalCOFI.columns # column labels
Index(['Cast_Count', 'Station_ID', 'Datetime', 'Depth_m', 'T_degC', 'Salinity',
'SigmaTheta'],
dtype='object')
CalCOFI.index # row labels
RangeIndex(start=0, stop=10057, step=1)
To extract the pandas Series that corresponds to a specific column, we can use the square brackets [] syntax. For example:
CalCOFI["Depth_m"]
0 0
1 10
2 20
3 23
4 30
...
10052 300
10053 381
10054 400
10055 500
10056 515
Name: Depth_m, Length: 10057, dtype: int64
Furthermore, we can extract the data contained within the pandas Series using the .values attribute:
CalCOFI["Depth_m"].values
array([ 0, 10, 20, ..., 400, 500, 515], dtype=int64)
Similarly, we can extract the data contained in column labels and row labels of a DataFrame using the .values attribute:
CalCOFI.columns.values
array(['Cast_Count', 'Station_ID', 'Datetime', 'Depth_m', 'T_degC',
'Salinity', 'SigmaTheta'], dtype=object)
CalCOFI.index.values
array([ 0, 1, 2, ..., 10054, 10055, 10056], dtype=int64)
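As a side note, the pandas documentation now recommends the .to_numpy() method over the .values attribute for extracting the underlying array; the two typically return the same thing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Depth_m": [0, 10, 20]})

arr = df["Depth_m"].to_numpy()  # preferred over .values in recent pandas
print(arr)
```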
Subsetting a DataFrame#
There are two main methods to subset a DataFrame in pandas. The first is .iloc[], which subsets a DataFrame by positional row and column indices. For example, if we want to extract rows 1 (positional index 0) to 10 (positional index 9), we can do (note that, like other slicing operations we’ve seen thus far, .iloc[] is endpoint exclusive):
CalCOFI.iloc[0:10]
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 0 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 0 | 14.04 | 33.170 | 24.766 |
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.95 | 33.210 | 24.815 |
| 2 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 20 | 13.90 | 33.210 | 24.826 |
| 3 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 23 | 13.88 | 33.210 | 24.830 |
| 4 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 30 | 13.81 | 33.218 | 24.851 |
| 5 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 50 | 13.25 | 33.150 | 24.912 |
| 6 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 72 | 11.83 | 33.130 | 25.171 |
| 7 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 75 | 11.70 | 33.142 | 25.204 |
| 8 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 99 | 10.86 | 33.280 | 25.464 |
| 9 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 100 | 10.83 | 33.288 | 25.475 |
If, in addition, we need only the last 4 columns of the DataFrame, and we want their order reversed, we may do:
CalCOFI.iloc[0:10, -1:-5:-1]
| SigmaTheta | Salinity | T_degC | Depth_m | |
|---|---|---|---|---|
| 0 | 24.766 | 33.170 | 14.04 | 0 |
| 1 | 24.815 | 33.210 | 13.95 | 10 |
| 2 | 24.826 | 33.210 | 13.90 | 20 |
| 3 | 24.830 | 33.210 | 13.88 | 23 |
| 4 | 24.851 | 33.218 | 13.81 | 30 |
| 5 | 24.912 | 33.150 | 13.25 | 50 |
| 6 | 25.171 | 33.130 | 11.83 | 72 |
| 7 | 25.204 | 33.142 | 11.70 | 75 |
| 8 | 25.464 | 33.280 | 10.86 | 99 |
| 9 | 25.475 | 33.288 | 10.83 | 100 |
And if we want the last 4 columns of the DataFrame but retain all rows, we can do:
CalCOFI.iloc[:, -4:]
| Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|
| 0 | 0 | 14.040 | 33.1700 | 24.76600 |
| 1 | 10 | 13.950 | 33.2100 | 24.81500 |
| 2 | 20 | 13.900 | 33.2100 | 24.82600 |
| 3 | 23 | 13.880 | 33.2100 | 24.83000 |
| 4 | 30 | 13.810 | 33.2180 | 24.85100 |
| ... | ... | ... | ... | ... |
| 10052 | 300 | 7.692 | 34.1712 | 26.67697 |
| 10053 | 381 | 7.144 | 34.2443 | 26.81386 |
| 10054 | 400 | 7.031 | 34.2746 | 26.85372 |
| 10055 | 500 | 6.293 | 34.3126 | 26.98372 |
| 10056 | 515 | 6.155 | 34.2903 | 26.98398 |
10057 rows × 4 columns
Alternatively, since a pandas DataFrame also stores row and column labels, these can be used to subset the DataFrame. To do so, we use the method .loc[] instead of .iloc[]. For example, suppose we want to retain only the datetime, depth, and temperature information of the DataFrame; we may do:
CalCOFI.loc[:, ["Datetime", "Depth_m", "T_degC"]]
| Datetime | Depth_m | T_degC | |
|---|---|---|---|
| 0 | 1950-02-06 19:54:00 | 0 | 14.040 |
| 1 | 1950-02-06 19:54:00 | 10 | 13.950 |
| 2 | 1950-02-06 19:54:00 | 20 | 13.900 |
| 3 | 1950-02-06 19:54:00 | 23 | 13.880 |
| 4 | 1950-02-06 19:54:00 | 30 | 13.810 |
| ... | ... | ... | ... |
| 10052 | 2021-01-21 13:36:00 | 300 | 7.692 |
| 10053 | 2021-01-21 13:36:00 | 381 | 7.144 |
| 10054 | 2021-01-21 13:36:00 | 400 | 7.031 |
| 10055 | 2021-01-21 13:36:00 | 500 | 6.293 |
| 10056 | 2021-01-21 13:36:00 | 515 | 6.155 |
10057 rows × 3 columns
Alternatively, noting that “Datetime”, “Depth_m”, and “T_degC” are consecutive columns, we can also use the slicing syntax. Note that unlike .iloc[], slices in .loc[] are endpoint inclusive:
CalCOFI.loc[:, "Datetime":"T_degC"]
| Datetime | Depth_m | T_degC | |
|---|---|---|---|
| 0 | 1950-02-06 19:54:00 | 0 | 14.040 |
| 1 | 1950-02-06 19:54:00 | 10 | 13.950 |
| 2 | 1950-02-06 19:54:00 | 20 | 13.900 |
| 3 | 1950-02-06 19:54:00 | 23 | 13.880 |
| 4 | 1950-02-06 19:54:00 | 30 | 13.810 |
| ... | ... | ... | ... |
| 10052 | 2021-01-21 13:36:00 | 300 | 7.692 |
| 10053 | 2021-01-21 13:36:00 | 381 | 7.144 |
| 10054 | 2021-01-21 13:36:00 | 400 | 7.031 |
| 10055 | 2021-01-21 13:36:00 | 500 | 6.293 |
| 10056 | 2021-01-21 13:36:00 | 515 | 6.155 |
10057 rows × 3 columns
For subsetting rows, in addition to accepting row label lists and slices (which are not particularly useful when the row labels are dummies), .loc[] also accepts boolean arrays. For example, let’s say we want to extract all observations (rows) at depth = 10 m, we can do:
CalCOFI.loc[CalCOFI["Depth_m"] == 10]
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.950 | 33.2100 | 24.81500 |
| 28 | 1267 | 090.0 070.0 | 1950-04-08 17:30:00 | 10 | 14.980 | 33.4000 | 24.74400 |
| 62 | 1387 | 090.0 070.0 | 1950-05-07 17:36:00 | 10 | 14.170 | 33.3080 | 24.84500 |
| 98 | 1477 | 090.0 070.0 | 1950-06-13 00:42:00 | 10 | 14.510 | 33.1110 | 24.62200 |
| 130 | 1589 | 090.0 070.0 | 1950-07-13 00:54:00 | 10 | 16.550 | 33.1020 | 24.16400 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9913 | 35250 | 090.0 070.0 | 2019-11-09 01:24:00 | 10 | 16.983 | 33.3950 | 24.28840 |
| 9944 | 35354 | 090.0 070.0 | 2020-01-09 18:13:00 | 10 | 15.054 | 33.3427 | 24.68417 |
| 9978 | 35429 | 090.0 070.0 | 2020-07-16 23:40:00 | 10 | 17.114 | 33.6188 | 24.42921 |
| 10015 | 35500 | 090.0 070.0 | 2020-10-14 23:52:00 | 10 | 17.373 | 33.3346 | 24.14988 |
| 10037 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 10 | 15.092 | 33.4561 | 24.76328 |
348 rows × 7 columns
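Boolean masks can also be combined with & (and), | (or), and ~ (not). Because of operator precedence, each condition must be wrapped in parentheses. A sketch on a tiny made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Depth_m": [0, 10, 10, 500],
                   "T_degC": [14.0, 13.9, 6.0, 5.0]})

# rows at 10 m depth AND warmer than 10 degC
subset = df.loc[(df["Depth_m"] == 10) & (df["T_degC"] > 10)]
print(subset)
```

Forgetting the parentheses (e.g., `df["Depth_m"] == 10 & df["T_degC"] > 10`) raises an error or gives the wrong result, since & binds more tightly than ==.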
Note that pandas has a special function pd.isna() that performs a vectorized check on whether values are coded as missing (e.g., None or np.nan). So, to extract all observations (rows) for which the temperature measurement is missing, we may do:
CalCOFI.loc[pd.isna(CalCOFI["T_degC"])]
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 269 | 3097 | 090.0 070.0 | 1951-07-05 12:06:00 | 0 | NaN | 33.300 | NaN |
| 273 | 3097 | 090.0 070.0 | 1951-07-05 12:06:00 | 27 | NaN | 33.300 | NaN |
| 276 | 3097 | 090.0 070.0 | 1951-07-05 12:06:00 | 54 | NaN | 33.310 | NaN |
| 278 | 3097 | 090.0 070.0 | 1951-07-05 12:06:00 | 80 | NaN | 33.140 | NaN |
| 311 | 3484 | 090.0 070.0 | 1951-10-01 02:36:00 | 75 | NaN | 33.210 | NaN |
| 369 | 3873 | 090.0 070.0 | 1952-02-18 09:24:00 | 0 | NaN | 33.150 | NaN |
| 372 | 3873 | 090.0 070.0 | 1952-02-18 09:24:00 | 19 | NaN | 33.070 | NaN |
| 375 | 3873 | 090.0 070.0 | 1952-02-18 09:24:00 | 39 | NaN | 33.130 | NaN |
| 377 | 3873 | 090.0 070.0 | 1952-02-18 09:24:00 | 57 | NaN | 33.120 | NaN |
| 384 | 4037 | 090.0 070.0 | 1952-03-09 20:54:00 | 0 | NaN | 33.190 | NaN |
| 387 | 4037 | 090.0 070.0 | 1952-03-09 20:54:00 | 16 | NaN | 33.220 | NaN |
| 390 | 4037 | 090.0 070.0 | 1952-03-09 20:54:00 | 35 | NaN | 33.210 | NaN |
| 391 | 4037 | 090.0 070.0 | 1952-03-09 20:54:00 | 50 | NaN | 33.210 | NaN |
| 414 | 4136 | 090.0 070.0 | 1952-04-03 06:30:00 | 0 | NaN | 33.300 | NaN |
| 418 | 4136 | 090.0 070.0 | 1952-04-03 06:30:00 | 21 | NaN | 33.310 | NaN |
| 420 | 4136 | 090.0 070.0 | 1952-04-03 06:30:00 | 44 | NaN | 33.330 | NaN |
| 422 | 4136 | 090.0 070.0 | 1952-04-03 06:30:00 | 63 | NaN | 33.310 | NaN |
| 424 | 4136 | 090.0 070.0 | 1952-04-03 06:30:00 | 84 | NaN | 33.220 | NaN |
| 426 | 4136 | 090.0 070.0 | 1952-04-03 06:30:00 | 115 | NaN | 33.170 | NaN |
| 436 | 4136 | 090.0 070.0 | 1952-04-03 06:30:00 | 323 | NaN | 34.150 | NaN |
| 449 | 4359 | 090.0 070.0 | 1952-05-11 16:12:00 | 19 | NaN | 33.150 | NaN |
| 452 | 4359 | 090.0 070.0 | 1952-05-11 16:12:00 | 45 | NaN | 33.100 | NaN |
| 454 | 4359 | 090.0 070.0 | 1952-05-11 16:12:00 | 66 | NaN | 33.080 | NaN |
| 456 | 4359 | 090.0 070.0 | 1952-05-11 16:12:00 | 90 | NaN | 33.080 | NaN |
| 463 | 4359 | 090.0 070.0 | 1952-05-11 16:12:00 | 217 | NaN | 33.910 | NaN |
| 485 | 4524 | 090.0 070.0 | 1952-06-05 22:48:00 | 50 | NaN | 32.950 | NaN |
| 1427 | 8006 | 090.0 070.0 | 1955-12-05 12:24:00 | 57 | NaN | 33.420 | NaN |
| 1430 | 8006 | 090.0 070.0 | 1955-12-05 12:24:00 | 76 | NaN | 33.350 | NaN |
| 1587 | 9017 | 090.0 070.0 | 1957-07-14 08:42:00 | 1025 | NaN | 34.470 | NaN |
| 1609 | 9018 | 090.0 070.0 | 1957-07-14 10:42:00 | 63 | NaN | 33.300 | NaN |
| 5244 | 24355 | 090.0 070.0 | 1986-02-14 15:49:00 | 11 | NaN | 33.301 | NaN |
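If you only need to count the missing values per column (rather than extract the rows), a common idiom is to chain .isna() with .sum(), since True counts as 1 when summed. A sketch on a small made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"T_degC": [14.0, np.nan, 6.0, np.nan],
                   "Salinity": [33.1, 33.2, 33.3, 33.4]})

# number of missing values in each column
print(df.isna().sum())
```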
If you want to drop all rows that have missing values in any column, you can use the .dropna() method of the DataFrame:
CalCOFI.dropna()
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 0 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 0 | 14.040 | 33.1700 | 24.76600 |
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.950 | 33.2100 | 24.81500 |
| 2 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 20 | 13.900 | 33.2100 | 24.82600 |
| 3 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 23 | 13.880 | 33.2100 | 24.83000 |
| 4 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 30 | 13.810 | 33.2180 | 24.85100 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10052 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 300 | 7.692 | 34.1712 | 26.67697 |
| 10053 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 381 | 7.144 | 34.2443 | 26.81386 |
| 10054 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 400 | 7.031 | 34.2746 | 26.85372 |
| 10055 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 500 | 6.293 | 34.3126 | 26.98372 |
| 10056 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 515 | 6.155 | 34.2903 | 26.98398 |
10008 rows × 7 columns
And if you want .dropna() to consider only certain columns when deciding if a row should be dropped, use its subset argument:
# no data is dropped since the Depth_m column has no missing values
CalCOFI.dropna(subset=["Depth_m"])
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 0 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 0 | 14.040 | 33.1700 | 24.76600 |
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.950 | 33.2100 | 24.81500 |
| 2 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 20 | 13.900 | 33.2100 | 24.82600 |
| 3 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 23 | 13.880 | 33.2100 | 24.83000 |
| 4 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 30 | 13.810 | 33.2180 | 24.85100 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10052 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 300 | 7.692 | 34.1712 | 26.67697 |
| 10053 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 381 | 7.144 | 34.2443 | 26.81386 |
| 10054 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 400 | 7.031 | 34.2746 | 26.85372 |
| 10055 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 500 | 6.293 | 34.3126 | 26.98372 |
| 10056 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 515 | 6.155 | 34.2903 | 26.98398 |
10057 rows × 7 columns
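Beyond subset, .dropna() also accepts a how argument ("any" vs. "all") and a thresh argument (the minimum number of non-missing values a row must have to be kept). A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, np.nan],
                   "b": [2.0, 3.0, np.nan]})

print(df.dropna(how="any"))  # drop rows with any missing value
print(df.dropna(how="all"))  # drop rows where every value is missing
print(df.dropna(thresh=1))   # keep rows with at least 1 non-missing value
```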
Sorting the rows of a DataFrame#
We can use the .sort_values() method to sort a pandas DataFrame. For example, to sort the rows of CalCOFI by temperature, from high to low, we can do:
CalCOFI_sorted = CalCOFI.sort_values("T_degC", ascending=False)
display(CalCOFI_sorted)
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 4276 | 22468 | 090.0 070.0 | 1983-09-27 15:15:00 | 10 | 20.59 | 33.532 | 23.488 |
| 4275 | 22468 | 090.0 070.0 | 1983-09-27 15:15:00 | 0 | 20.59 | 33.533 | 23.489 |
| 4277 | 22468 | 090.0 070.0 | 1983-09-27 15:15:00 | 20 | 20.58 | 33.519 | 23.482 |
| 6809 | 28613 | 090.0 070.0 | 1997-09-25 11:46:00 | 15 | 20.21 | 33.635 | 23.668 |
| 6808 | 28613 | 090.0 070.0 | 1997-09-25 11:46:00 | 10 | 20.20 | 33.634 | 23.667 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1427 | 8006 | 090.0 070.0 | 1955-12-05 12:24:00 | 57 | NaN | 33.420 | NaN |
| 1430 | 8006 | 090.0 070.0 | 1955-12-05 12:24:00 | 76 | NaN | 33.350 | NaN |
| 1587 | 9017 | 090.0 070.0 | 1957-07-14 08:42:00 | 1025 | NaN | 34.470 | NaN |
| 1609 | 9018 | 090.0 070.0 | 1957-07-14 10:42:00 | 63 | NaN | 33.300 | NaN |
| 5244 | 24355 | 090.0 070.0 | 1986-02-14 15:49:00 | 11 | NaN | 33.301 | NaN |
10057 rows × 7 columns
Note that missing values are sorted to the end. Also note that the index of the DataFrame is rearranged too. If we want to rebuild the index (so that it starts from 0 and increments by 1 in the sorted DataFrame), we can use the .reset_index() method with the drop=True argument (which removes the old index instead of retaining it as a regular column):
CalCOFI_sorted.reset_index(drop=True)
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 0 | 22468 | 090.0 070.0 | 1983-09-27 15:15:00 | 10 | 20.59 | 33.532 | 23.488 |
| 1 | 22468 | 090.0 070.0 | 1983-09-27 15:15:00 | 0 | 20.59 | 33.533 | 23.489 |
| 2 | 22468 | 090.0 070.0 | 1983-09-27 15:15:00 | 20 | 20.58 | 33.519 | 23.482 |
| 3 | 28613 | 090.0 070.0 | 1997-09-25 11:46:00 | 15 | 20.21 | 33.635 | 23.668 |
| 4 | 28613 | 090.0 070.0 | 1997-09-25 11:46:00 | 10 | 20.20 | 33.634 | 23.667 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10052 | 8006 | 090.0 070.0 | 1955-12-05 12:24:00 | 57 | NaN | 33.420 | NaN |
| 10053 | 8006 | 090.0 070.0 | 1955-12-05 12:24:00 | 76 | NaN | 33.350 | NaN |
| 10054 | 9017 | 090.0 070.0 | 1957-07-14 08:42:00 | 1025 | NaN | 34.470 | NaN |
| 10055 | 9018 | 090.0 070.0 | 1957-07-14 10:42:00 | 63 | NaN | 33.300 | NaN |
| 10056 | 24355 | 090.0 070.0 | 1986-02-14 15:49:00 | 11 | NaN | 33.301 | NaN |
10057 rows × 7 columns
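Note that .sort_values() also accepts a list of columns (with a matching list for ascending) when you want to sort by multiple keys. A sketch on a tiny made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Cast_Count": [2, 1, 1],
                   "Depth_m": [0, 20, 10]})

# sort by Cast_Count ascending, then by Depth_m descending within ties
out = df.sort_values(["Cast_Count", "Depth_m"], ascending=[True, False])
print(out)
```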
Modifying a DataFrame#
The subsetting syntax can also be used to overwrite data in a DataFrame. For example, suppose we want to set the SigmaTheta value to missing (np.nan) whenever the depth is greater than 300 m; we can do:
CalCOFI.loc[CalCOFI["Depth_m"] > 300, "SigmaTheta"] = np.nan
display(CalCOFI)
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 0 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 0 | 14.040 | 33.1700 | 24.76600 |
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.950 | 33.2100 | 24.81500 |
| 2 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 20 | 13.900 | 33.2100 | 24.82600 |
| 3 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 23 | 13.880 | 33.2100 | 24.83000 |
| 4 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 30 | 13.810 | 33.2180 | 24.85100 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10052 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 300 | 7.692 | 34.1712 | 26.67697 |
| 10053 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 381 | 7.144 | 34.2443 | NaN |
| 10054 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 400 | 7.031 | 34.2746 | NaN |
| 10055 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 500 | 6.293 | 34.3126 | NaN |
| 10056 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 515 | 6.155 | 34.2903 | NaN |
10057 rows × 7 columns
Note that unlike most pandas operations, the changes here happen in place.
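Because such assignments modify the DataFrame in place, it is often prudent to work on an explicit copy when you want to keep the original intact. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Depth_m": [100, 400],
                   "SigmaTheta": [25.0, 27.0]})

df_mod = df.copy()  # an independent copy; df itself stays untouched
df_mod.loc[df_mod["Depth_m"] > 300, "SigmaTheta"] = np.nan

print(df["SigmaTheta"].tolist())      # original values preserved
print(df_mod["SigmaTheta"].tolist())  # second value replaced by NaN
```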
More often, we would like to derive a new column from existing columns. Just as with ndarrays, we can do vectorized computation on a pandas Series (recall that each column of a pandas DataFrame is a pandas Series). Moreover, we can assign the result to a new column of a DataFrame. For example, suppose we want to create a log_Depth_m column and a T_degF column. We can do so with:
CalCOFI["log_Depth_m"] = np.log(CalCOFI["Depth_m"] + 1)
CalCOFI["T_degF"] = CalCOFI["T_degC"] * (9/5) + 32
display(CalCOFI)
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | log_Depth_m | T_degF | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 0 | 14.040 | 33.1700 | 24.76600 | 0.000000 | 57.2720 |
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.950 | 33.2100 | 24.81500 | 2.397895 | 57.1100 |
| 2 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 20 | 13.900 | 33.2100 | 24.82600 | 3.044522 | 57.0200 |
| 3 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 23 | 13.880 | 33.2100 | 24.83000 | 3.178054 | 56.9840 |
| 4 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 30 | 13.810 | 33.2180 | 24.85100 | 3.433987 | 56.8580 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10052 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 300 | 7.692 | 34.1712 | 26.67697 | 5.707110 | 45.8456 |
| 10053 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 381 | 7.144 | 34.2443 | NaN | 5.945421 | 44.8592 |
| 10054 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 400 | 7.031 | 34.2746 | NaN | 5.993961 | 44.6558 |
| 10055 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 500 | 6.293 | 34.3126 | NaN | 6.216606 | 43.3274 |
| 10056 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 515 | 6.155 | 34.2903 | NaN | 6.246107 | 43.0790 |
10057 rows × 9 columns
Notice that the new columns are inserted to the right of the existing columns.
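An alternative that creates the new columns without mutating the original DataFrame is the .assign() method, which returns a new DataFrame. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Depth_m": [0, 10], "T_degC": [14.0, 13.9]})

# returns a new DataFrame; df itself is unchanged
df2 = df.assign(log_Depth_m=np.log(df["Depth_m"] + 1),
                T_degF=df["T_degC"] * (9 / 5) + 32)
print(df2.columns.tolist())
```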
Remark: dummy versus semantic row index#
In the examples above, the row index is really just an arbitrary number that has no meaning in and of itself, and our row operations have been based on the actual data contained in the columns. For most data, this setup is a good default.
However, sometimes your tabular data has a special column that can serve as a unique identifier of each row. In other words, the value of this column uniquely identifies the row of data it comes from. A good example is time series data, where the timestamp uniquely identifies a row of observations. In such cases, it can be beneficial to set this specific column as the index of the DataFrame.
Our CalCOFI subset has no natural unique identifier. However, suppose we are interested only in observations at depth = 10 m. Then the Cast_Count column uniquely identifies an observation.
CalCOFI = pd.read_csv("data/CalCOFI_subset.csv")
CalCOFI_10m = CalCOFI.loc[CalCOFI["Depth_m"] == 10]
display(CalCOFI_10m)
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.950 | 33.2100 | 24.81500 |
| 28 | 1267 | 090.0 070.0 | 1950-04-08 17:30:00 | 10 | 14.980 | 33.4000 | 24.74400 |
| 62 | 1387 | 090.0 070.0 | 1950-05-07 17:36:00 | 10 | 14.170 | 33.3080 | 24.84500 |
| 98 | 1477 | 090.0 070.0 | 1950-06-13 00:42:00 | 10 | 14.510 | 33.1110 | 24.62200 |
| 130 | 1589 | 090.0 070.0 | 1950-07-13 00:54:00 | 10 | 16.550 | 33.1020 | 24.16400 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9913 | 35250 | 090.0 070.0 | 2019-11-09 01:24:00 | 10 | 16.983 | 33.3950 | 24.28840 |
| 9944 | 35354 | 090.0 070.0 | 2020-01-09 18:13:00 | 10 | 15.054 | 33.3427 | 24.68417 |
| 9978 | 35429 | 090.0 070.0 | 2020-07-16 23:40:00 | 10 | 17.114 | 33.6188 | 24.42921 |
| 10015 | 35500 | 090.0 070.0 | 2020-10-14 23:52:00 | 10 | 17.373 | 33.3346 | 24.14988 |
| 10037 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 10 | 15.092 | 33.4561 | 24.76328 |
348 rows × 7 columns
Thus, we may set the Cast_Count column as the row index of this subsetted DataFrame. To do so, we use the .set_index() method of the DataFrame:
CalCOFI_10m = CalCOFI_10m.set_index("Cast_Count")
display(CalCOFI_10m)
| Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|
| Cast_Count | ||||||
| 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.950 | 33.2100 | 24.81500 |
| 1267 | 090.0 070.0 | 1950-04-08 17:30:00 | 10 | 14.980 | 33.4000 | 24.74400 |
| 1387 | 090.0 070.0 | 1950-05-07 17:36:00 | 10 | 14.170 | 33.3080 | 24.84500 |
| 1477 | 090.0 070.0 | 1950-06-13 00:42:00 | 10 | 14.510 | 33.1110 | 24.62200 |
| 1589 | 090.0 070.0 | 1950-07-13 00:54:00 | 10 | 16.550 | 33.1020 | 24.16400 |
| ... | ... | ... | ... | ... | ... | ... |
| 35250 | 090.0 070.0 | 2019-11-09 01:24:00 | 10 | 16.983 | 33.3950 | 24.28840 |
| 35354 | 090.0 070.0 | 2020-01-09 18:13:00 | 10 | 15.054 | 33.3427 | 24.68417 |
| 35429 | 090.0 070.0 | 2020-07-16 23:40:00 | 10 | 17.114 | 33.6188 | 24.42921 |
| 35500 | 090.0 070.0 | 2020-10-14 23:52:00 | 10 | 17.373 | 33.3346 | 24.14988 |
| 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 10 | 15.092 | 33.4561 | 24.76328 |
348 rows × 6 columns
With a semantic row index such as Cast_Count, we can use .loc[] to meaningfully subset our data, e.g.,
CalCOFI_10m_mid = CalCOFI_10m.loc[8000:10000]
display(CalCOFI_10m_mid)
| Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|
| Cast_Count | ||||||
| 8006 | 090.0 070.0 | 1955-12-05 12:24:00 | 10 | 13.44 | 33.400 | 25.066 |
| 8141 | 090.0 070.0 | 1956-02-08 08:24:00 | 10 | 12.79 | 33.270 | 25.095 |
| 8330 | 090.0 070.0 | 1956-04-19 21:18:00 | 10 | 12.53 | 33.330 | 25.192 |
| 8437 | 090.0 070.0 | 1956-06-09 14:24:00 | 10 | 13.68 | 33.220 | 24.879 |
| 8755 | 090.0 070.0 | 1957-04-15 19:00:00 | 10 | 13.98 | 33.440 | 24.987 |
| 9018 | 090.0 070.0 | 1957-07-14 10:42:00 | 10 | 17.95 | 33.417 | 24.074 |
| 9203 | 090.0 070.0 | 1957-10-29 06:54:00 | 10 | 18.97 | 33.710 | 24.045 |
| 9327 | 090.0 070.0 | 1957-12-17 14:00:00 | 10 | 16.45 | 33.440 | 24.446 |
| 9527 | 090.0 070.0 | 1958-03-12 02:48:00 | 10 | 15.38 | 33.580 | 24.795 |
| 9993 | 090.0 070.0 | 1958-09-11 20:24:00 | 10 | 18.82 | 33.275 | 23.750 |
Notice that even though there is no row with Cast_Count exactly equal to 8000 or 10000, pandas knows how to use these values as bounds, because our index is sorted (or equivalently, monotone). Moreover, recall that .loc[] is endpoint inclusive. Thus, knowing our result, the above is equivalent to:
CalCOFI_10m_mid = CalCOFI_10m.loc[8006:9993]
display(CalCOFI_10m_mid)
| Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|
| Cast_Count | ||||||
| 8006 | 090.0 070.0 | 1955-12-05 12:24:00 | 10 | 13.44 | 33.400 | 25.066 |
| 8141 | 090.0 070.0 | 1956-02-08 08:24:00 | 10 | 12.79 | 33.270 | 25.095 |
| 8330 | 090.0 070.0 | 1956-04-19 21:18:00 | 10 | 12.53 | 33.330 | 25.192 |
| 8437 | 090.0 070.0 | 1956-06-09 14:24:00 | 10 | 13.68 | 33.220 | 24.879 |
| 8755 | 090.0 070.0 | 1957-04-15 19:00:00 | 10 | 13.98 | 33.440 | 24.987 |
| 9018 | 090.0 070.0 | 1957-07-14 10:42:00 | 10 | 17.95 | 33.417 | 24.074 |
| 9203 | 090.0 070.0 | 1957-10-29 06:54:00 | 10 | 18.97 | 33.710 | 24.045 |
| 9327 | 090.0 070.0 | 1957-12-17 14:00:00 | 10 | 16.45 | 33.440 | 24.446 |
| 9527 | 090.0 070.0 | 1958-03-12 02:48:00 | 10 | 15.38 | 33.580 | 24.795 |
| 9993 | 090.0 070.0 | 1958-09-11 20:24:00 | 10 | 18.82 | 33.275 | 23.750 |
As it turns out, pandas allows an index to be non-unique and non-monotone. However, some row operations may no longer work or may have unexpected behaviors in such cases.
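If you are unsure whether an index supports such operations, you can check its .is_unique and .is_monotonic_increasing attributes. A sketch on a tiny made-up DataFrame:

```python
import pandas as pd

# unique but unsorted index
df = pd.DataFrame({"x": [1, 2, 3]}, index=[10, 30, 20])

print(df.index.is_unique)                             # True
print(df.index.is_monotonic_increasing)               # False
print(df.sort_index().index.is_monotonic_increasing)  # True
```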
If the index of your DataFrame is not monotone, you can sort the data along the index using the .sort_index() method, which works similarly to the .sort_values() method. For example:
CalCOFI_sal = CalCOFI.set_index("Salinity").sort_index()
display(CalCOFI_sal)
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | SigmaTheta | |
|---|---|---|---|---|---|---|
| Salinity | ||||||
| 32.624 | 30396 | 090.0 070.0 | 2003-10-25 05:31:00 | 59 | 12.07 | 24.733 |
| 32.641 | 30396 | 090.0 070.0 | 2003-10-25 05:31:00 | 70 | 11.42 | 24.866 |
| 32.653 | 30396 | 090.0 070.0 | 2003-10-25 05:31:00 | 75 | 11.15 | 24.924 |
| 32.674 | 30396 | 090.0 070.0 | 2003-10-25 05:31:00 | 50 | 12.85 | 24.623 |
| 32.693 | 30396 | 090.0 070.0 | 2003-10-25 05:31:00 | 85 | 10.66 | 25.041 |
| ... | ... | ... | ... | ... | ... | ... |
| 34.677 | 13987 | 090.0 070.0 | 1964-04-18 02:18:00 | 3600 | 1.57 | 27.766 |
| 34.679 | 13987 | 090.0 070.0 | 1964-04-18 02:18:00 | 3689 | 1.57 | 27.769 |
| 34.680 | 13987 | 090.0 070.0 | 1964-04-18 02:18:00 | 3789 | 1.58 | 27.769 |
| 34.690 | 2378 | 090.0 070.0 | 1951-02-20 03:00:00 | 2366 | 1.86 | 27.747 |
| 34.720 | 6634 | 090.0 070.0 | 1954-02-10 14:18:00 | 1753 | 2.34 | 27.730 |
10057 rows × 6 columns
Note that even after sorting, Salinity is still not a good row index, since it is not unique.
Statistics functions#
If you want to obtain summary statistics of a DataFrame, pandas has a convenient .describe() method that computes a bunch of them for you in one go:
CalCOFI = pd.read_csv("data/CalCOFI_subset.csv")
CalCOFI.describe()
| Cast_Count | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|
| count | 10057.000000 | 10057.000000 | 10026.000000 | 10057.00000 | 10008.000000 |
| mean | 21507.085214 | 221.678035 | 10.420719 | 33.74584 | 25.833045 |
| std | 9920.178631 | 279.990210 | 3.912421 | 0.41843 | 0.987026 |
| min | 992.000000 | 0.000000 | 1.570000 | 32.62400 | 23.482000 |
| 25% | 12480.000000 | 50.000000 | 7.290000 | 33.38200 | 24.924000 |
| 50% | 23672.000000 | 126.000000 | 9.600000 | 33.70400 | 26.005500 |
| 75% | 29689.000000 | 300.000000 | 13.870000 | 34.11100 | 26.682250 |
| max | 35578.000000 | 3789.000000 | 20.590000 | 34.72000 | 27.769000 |
Note that by default, .describe() works only on numerical columns.
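To summarize non-numerical columns as well, you can pass the include argument to .describe(); for object columns it reports the count, the number of unique values, the most frequent value (top), and its frequency (freq). A sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"Station_ID": ["A", "A", "B"],
                   "Depth_m": [0, 10, 0]})

# summarize only the object (string) columns
summary = df.describe(include="object")
print(summary)
```

Passing `include="all"` would instead summarize every column at once.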
Importantly, the output of .describe() is another DataFrame, so you can extract particular rows and columns using the previous tricks. For example, the median (50%) of temperature can be extracted as:
CalSummary = CalCOFI.describe()
CalSummary.loc["50%", "T_degC"]
9.6
And if you want to find the inter-quartile range for all the columns, you may do:
CalSummary.loc["75%"] - CalSummary.loc["25%"]
Cast_Count 17209.00000
Depth_m 250.00000
T_degC 6.58000
Salinity 0.72900
SigmaTheta 1.75825
dtype: float64
In addition, there are individual methods that you can apply to a DataFrame to obtain a particular statistic. Some of these methods include:
- .mean() for the mean
- .median() for the median
- .quantile() for quantiles
- .var() for the variance
- .std() for the standard deviation
- .sem() for the standard error
For example, to compute the standard error in the above numerical variables, we can do:
CalCOFI.sem(numeric_only=True)
Cast_Count 98.920264
Depth_m 2.791956
T_degC 0.039073
Salinity 0.004172
SigmaTheta 0.009866
dtype: float64
Note that for functions other than .describe(), we need to explicitly pass numeric_only=True to restrict the computation to the numerical columns.
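For example, .quantile() can compute several quantiles at once, which gives another way to obtain the inter-quartile range. A sketch on a tiny made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Depth_m": [0, 10, 20, 30, 40]})

# 25th and 75th percentiles of each numerical column
q = df.quantile([0.25, 0.75], numeric_only=True)
iqr = q.loc[0.75] - q.loc[0.25]
print(iqr)
```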