Pandas DataFrame and tabular data#
Pandas and pandas DataFrames#
In data science we often encounter data that are organized in tables. Our main tool to handle such data is the third-party module pandas, which we import as:
import pandas as pd
As before, the as pd part is optional but relatively standard. It defines a shorthand pd for the pandas package.
Just like how ndarray is the central object of numpy, pandas also has a central object, called DataFrame. Structurally, pandas DataFrames are row homogeneous (i.e., each row is similar in structure to the next row) but column heterogeneous (i.e., one column may differ from the next one). This makes a pandas DataFrame a good representation of tabular data, since tabular data tend to be row homogeneous and column heterogeneous too.
Moreover, unlike a 2D numpy array, a pandas DataFrame also stores column names and a row index, and these can be used to subset the DataFrame. Again, this makes a pandas DataFrame a better choice than a numpy array for representing tabular data, since tabular data often come with descriptive column headers.
Loading data as pandas DataFrame#
Most likely, your DataFrame will be created from external tabular data. The most portable format for such data is a CSV (comma-separated values) file. We can use the pd.read_csv() function to load such an external file as a DataFrame.
For example, a subset of the CalCOFI dataset (https://calcofi.org/) can be loaded as follows (note: our subset can be found here):
CalCOFI = pd.read_csv("data/CalCOFI_subset.csv")
display(CalCOFI)
| | Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta |
|---|---|---|---|---|---|---|---|
| 0 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 0 | 14.040 | 33.1700 | 24.76600 |
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.950 | 33.2100 | 24.81500 |
| 2 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 20 | 13.900 | 33.2100 | 24.82600 |
| 3 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 23 | 13.880 | 33.2100 | 24.83000 |
| 4 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 30 | 13.810 | 33.2180 | 24.85100 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10052 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 300 | 7.692 | 34.1712 | 26.67697 |
| 10053 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 381 | 7.144 | 34.2443 | 26.81386 |
| 10054 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 400 | 7.031 | 34.2746 | 26.85372 |
| 10055 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 500 | 6.293 | 34.3126 | 26.98372 |
| 10056 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 515 | 6.155 | 34.2903 | 26.98398 |
10057 rows × 7 columns
(Note that in the above we used the display() function to show a pandas DataFrame. The display() function is built into the JupyterHub and displays information using webpage (HTML) technology, which tends to be richer than the plain-text interface you get from using print().)
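Since a DataFrame can have thousands of rows, it is often handy to peek at just the first few, or to pull out a single column by its header. A quick sketch (we will see much more of this kind of subsetting later):

# Show only the first five rows of the DataFrame
display(CalCOFI.head())

# Select a single column by its header; the result is a pandas Series
display(CalCOFI["T_degC"])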
You can find the official documentation of pd.read_csv() at https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html. The function has plenty of (keyword-only, optional) arguments. Some highlights are:
- `names`: the names to use as column names. If not supplied, the first line being read is treated as the header row (see the sketch after this list).
- `skiprows`: if an `int`, the number of rows to skip before contents are read in; if a python `list` of `int`, the line indices to skip.
- `usecols`: the columns to read into the DataFrame. Can be a list of column indices or column names.
- `index_col`: the column to use as the row index. Can be an `int` or a name.
- `na_values`: which values are to be treated as indications of missing data.
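As a quick illustration of the names argument, suppose a CSV file has no header row at all; we can then supply the column names ourselves. (This is a sketch only: data/no_header.csv is a hypothetical file, and the column names are made up.)

# Hypothetical file with no header row; header=None tells pandas not to
# treat the first line as column names, and names supplies our own
df_no_header = pd.read_csv(
    "data/no_header.csv",
    header=None,
    names=["site", "depth", "temperature"]
)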
As a second example, here is the content of a simple CSV file (don’t worry about the details of the code; we don’t expect you to read a file as plain text in this course):
with open("data/header_example.csv") as infile:
    print(infile.read())
# Two lines of metadata followed by an empty line
# In this file -999 is the code for missing values
A,B,C,D
m,g,cm,L
1,3.2,4,2
2,7.9,7,5
2,-999,5,3
Suppose we want to skip the metadata lines (indices 0, 1), the empty line (index 2), and the units line (index 4), and read only the columns “A”, “B”, and “D”. Suppose we also want to convert the -999 code into actual np.nan. Then, we may do:
small_df = pd.read_csv(
    "data/header_example.csv",
    skiprows=[0, 1, 2, 4],
    usecols=["A", "B", "D"],
    na_values="-999"
)
display(small_df)
| | A | B | D |
|---|---|---|---|
| 0 | 1 | 3.2 | 2 |
| 1 | 2 | 7.9 | 5 |
| 2 | 2 | NaN | 3 |
Alternatively, suppose we want to read in all columns, and use column D as the row index. The code is then:
small_df = pd.read_csv(
    "data/header_example.csv",
    skiprows=[0, 1, 2, 4],
    index_col="D",
    na_values="-999"
)
display(small_df)
| | A | B | C |
|---|---|---|---|
| D | | | |
| 2 | 1 | 3.2 | 4 |
| 5 | 2 | 7.9 | 7 |
| 3 | 2 | NaN | 5 |
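One payoff of setting a column as the index is that rows can then be looked up by index value. For instance, using the .loc accessor (which later readings will cover in more detail):

# Look up the row whose index (i.e., whose value of D) equals 5
display(small_df.loc[5])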
As a third example, consider the Seattle tide prediction data from Jan 1, 2025 to Jan 31, 2025, obtained from NOAA (a copy of the file can be found here). If you open the file in JupyterHub, you’ll find that it is a text file with 13 lines of metadata, followed by a header on line 14, and data from that point on. Note also that the entries are separated by whitespace (actually tabs). We can load this file as a pandas DataFrame like so:
tides = pd.read_csv(
    "data/tide_prediction_2025-01.txt",
    sep=r"\s+", skiprows=13
)
display(tides)
| | Date | Day | Time | Pred | High/Low |
|---|---|---|---|---|---|
| 2025/01/01 | Wed | 06:58 | AM | 3.80 | H |
| 2025/01/01 | Wed | 12:17 | PM | 2.47 | L |
| 2025/01/01 | Wed | 04:41 | PM | 3.19 | H |
| 2025/01/01 | Wed | 11:54 | PM | -0.73 | L |
| 2025/01/02 | Thu | 07:28 | AM | 3.83 | H |
| ... | ... | ... | ... | ... | ... |
| 2025/01/30 | Thu | 04:44 | PM | 3.28 | H |
| 2025/01/30 | Thu | 11:36 | PM | -0.57 | L |
| 2025/01/31 | Fri | 06:45 | AM | 3.84 | H |
| 2025/01/31 | Fri | 12:29 | PM | 1.68 | L |
| 2025/01/31 | Fri | 05:38 | PM | 3.22 | H |
119 rows × 5 columns
In the above, sep=r"\s+" tells pandas that the entries are separated by one or more “white-space like characters” (another possible choice is sep="\t", where \t is python’s way of writing the tab character). Unfortunately, each data row in this file has one more whitespace-separated field than the header row does (the time splits into two fields, e.g., 06:58 and AM), and when that happens pandas turns the first column into the index of the DataFrame. To deal with this, we use .reset_index() to turn the index back into a regular column:
tides = tides.reset_index()
display(tides)
| | index | Date | Day | Time | Pred | High/Low |
|---|---|---|---|---|---|---|
| 0 | 2025/01/01 | Wed | 06:58 | AM | 3.80 | H |
| 1 | 2025/01/01 | Wed | 12:17 | PM | 2.47 | L |
| 2 | 2025/01/01 | Wed | 04:41 | PM | 3.19 | H |
| 3 | 2025/01/01 | Wed | 11:54 | PM | -0.73 | L |
| 4 | 2025/01/02 | Thu | 07:28 | AM | 3.83 | H |
| ... | ... | ... | ... | ... | ... | ... |
| 114 | 2025/01/30 | Thu | 04:44 | PM | 3.28 | H |
| 115 | 2025/01/30 | Thu | 11:36 | PM | -0.57 | L |
| 116 | 2025/01/31 | Fri | 06:45 | AM | 3.84 | H |
| 117 | 2025/01/31 | Fri | 12:29 | PM | 1.68 | L |
| 118 | 2025/01/31 | Fri | 05:38 | PM | 3.22 | H |
119 rows × 6 columns
Additional cleanup will be needed to fix the column names and so on, and you will learn some of these techniques in the later parts of this week’s readings.
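For instance, one possible fix (a sketch only; the label AM_PM is a name of our own choosing) is to overwrite the column labels so that they line up with the actual contents:

# After .reset_index(), the labels sit one column to the left of the data:
# "index" actually holds the dates, "Date" holds the weekday, and so on,
# because the time was split into two fields (e.g., "06:58" and "AM").
# Overwriting .columns realigns the labels with the contents.
tides.columns = ["Date", "Day", "Time", "AM_PM", "Pred", "High/Low"]
display(tides.head())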
In addition to local files, pandas can also read a CSV file hosted on the internet. For example, consider the year-long chlorophyll-a dataset hosted by the OOI Ocean Data Labs Project at Rutgers University. A direct URL to their data in CSV format is https://datalab.marine.rutgers.edu/explorations/productivity/data/productivity2.csv, which we can supply as the first argument to pd.read_csv() to load the data into a pandas DataFrame:
url = "https://datalab.marine.rutgers.edu/explorations/productivity/data/productivity2.csv"
pd.read_csv(url)
| | date | Chlorophyll-a |
|---|---|---|
| 0 | 2015-06-01 00:00:00 | 6.1105 |
| 1 | 2015-06-01 01:00:00 | 7.9981 |
| 2 | 2015-06-01 02:00:00 | 11.3498 |
| 3 | 2015-06-01 03:00:00 | 12.2573 |
| 4 | 2015-06-01 04:00:00 | 9.7526 |
| ... | ... | ... |
| 8780 | 2016-05-31 20:00:00 | 1.4399 |
| 8781 | 2016-05-31 21:00:00 | 1.0043 |
| 8782 | 2016-05-31 22:00:00 | 0.8954 |
| 8783 | 2016-05-31 23:00:00 | 0.9438 |
| 8784 | 2016-06-01 00:00:00 | 1.1616 |
8785 rows × 2 columns
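One aside: the date column above is read in as plain strings. If we want genuine datetime values (often more convenient for time-series work), pd.read_csv() accepts a parse_dates argument; a sketch (chl is a name we pick for illustration):

# Re-read the same file, asking pandas to parse the "date" column
# into datetime objects rather than plain strings
chl = pd.read_csv(url, parse_dates=["date"])
print(chl.dtypes)  # the "date" column should now be datetime64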
Incidentally, pandas can also be used to read Microsoft Excel files. The relevant function is pd.read_excel(), documented at https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html. Most of its arguments are the same as those of pd.read_csv(), with the notable exception of the sheet_name argument, which specifies the sheet to load the data from.
As an example, the CalCOFI data is also saved as an Excel .xlsx document (which you can find here), and we can load it as such:
CalCOFI_2 = pd.read_excel("data/CalCOFI_subset.xlsx", sheet_name="Sheet1")
display(CalCOFI_2)
| | Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta |
|---|---|---|---|---|---|---|---|
| 0 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 0 | 14.040 | 33.1700 | 24.76600 |
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.950 | 33.2100 | 24.81500 |
| 2 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 20 | 13.900 | 33.2100 | 24.82600 |
| 3 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 23 | 13.880 | 33.2100 | 24.83000 |
| 4 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 30 | 13.810 | 33.2180 | 24.85100 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10052 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 300 | 7.692 | 34.1712 | 26.67697 |
| 10053 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 381 | 7.144 | 34.2443 | 26.81386 |
| 10054 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 400 | 7.031 | 34.2746 | 26.85372 |
| 10055 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 500 | 6.293 | 34.3126 | 26.98372 |
| 10056 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 515 | 6.155 | 34.2903 | 26.98398 |
10057 rows × 7 columns
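A useful detail about sheet_name: passing sheet_name=None asks pd.read_excel() to read every sheet at once, returning a dictionary that maps sheet names to DataFrames. A sketch:

# Read all sheets in the workbook; the result is a dict of
# {sheet name: DataFrame}
all_sheets = pd.read_excel("data/CalCOFI_subset.xlsx", sheet_name=None)
print(all_sheets.keys())  # e.g., dict_keys(['Sheet1'])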
Creating pandas DataFrames by hand#
Internally, each column of a pandas DataFrame is a pandas Series, and we can create a Series directly using the pd.Series() function:
pd.Series(["A", "B", "C", "D"])
0 A
1 B
2 C
3 D
dtype: object
Optionally, we can supply an index and a name to a pandas Series. For example:
pd.Series(["A", "B", "C", "D"], index=range(65, 69), name="letters")
65 A
66 B
67 C
68 D
Name: letters, dtype: object
We can combine multiple pandas Series of the same length into a DataFrame using pd.concat() with axis=1:
letters = pd.Series(["A", "B", "C", "D"], name="letters")
integers = pd.Series([7, 2, 5, 3], name="integer")
df = pd.concat([letters, integers], axis=1)
display(df)
| | letters | integer |
|---|---|---|
| 0 | A | 7 |
| 1 | B | 2 |
| 2 | C | 5 |
| 3 | D | 3 |
Notice that when combining pandas Series into a DataFrame, the index of each Series becomes the index of the DataFrame, and the name of each Series becomes the name of the corresponding column.
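One caution: pd.concat() with axis=1 matches the Series up by their index labels, not by position, so Series with mismatched indices produce missing values (NaN). A minimal sketch:

# s1 has index 0..3 (the default) while s2 has index 2..5, so concat
# aligns on the index labels and fills the non-overlapping rows with NaN
s1 = pd.Series([10, 20, 30, 40], name="s1")
s2 = pd.Series([1, 2, 3, 4], index=range(2, 6), name="s2")
display(pd.concat([s1, s2], axis=1))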
Alternatively, we can also create an entire DataFrame by hand using the pd.DataFrame() function. When we call pd.DataFrame(), we supply key-value pairs enclosed in curly braces {}, with : separating each key from its value. In the resulting DataFrame, the keys become the column names, while the values (usually a python list or a numpy array) become the columns themselves. For example, to reproduce the above DataFrame, we may do:
df = pd.DataFrame({
    "letters": ["A", "B", "C", "D"],
    "integer": [7, 2, 5, 3]
})
display(df)
| | letters | integer |
|---|---|---|
| 0 | A | 7 |
| 1 | B | 2 |
| 2 | C | 5 |
| 3 | D | 3 |
(Note: the syntax {key1: value1, key2: value2, ...} defines a python dictionary. It is a useful data structure from core python, but we won’t make much use of it in this course other than to supply it as an argument to functions.)
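For the curious, here is a minimal stand-alone example of a dictionary (the keys and values are made up for illustration):

# A dictionary maps keys to values; look values up with square brackets
units = {"Depth_m": "meters", "T_degC": "degrees Celsius"}
print(units["T_degC"])  # prints: degrees Celsius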
Just as in the case of pd.Series(), we can optionally provide an index to the DataFrame using the index argument:
df = pd.DataFrame(
    {
        "letters": ["A", "B", "C", "D"],
        "integer": [7, 2, 5, 3]
    },
    index=range(2017, 2021)
)
display(df)
| | letters | integer |
|---|---|---|
| 2017 | A | 7 |
| 2018 | B | 2 |
| 2019 | C | 5 |
| 2020 | D | 3 |
Combining multiple DataFrames#
Sometimes it is necessary to first save data in separate files and then combine them during analysis. For example, suppose we are interested in tide measurements. The NOAA interface only lets us download 31 days of data at a time, so if you want to analyze tide patterns from January to March of 2025, your data will consist of three files:
tides_Jan = pd.read_csv(
    "data/tide_prediction_2025-01.txt",
    sep=r"\s+", skiprows=13
).reset_index()
tides_Feb = pd.read_csv(
    "data/tide_prediction_2025-02.txt",
    sep=r"\s+", skiprows=13
).reset_index()
tides_Mar = pd.read_csv(
    "data/tide_prediction_2025-03.txt",
    sep=r"\s+", skiprows=13
).reset_index()
(As in the previous section, these will need some further manipulation to be useful. Nevertheless, since the three DataFrames are at least consistent in their column names, we can proceed to combine them.)
We can combine the three DataFrames into a single one using pd.concat(), this time with the default axis=0:
pd.concat([tides_Jan, tides_Feb, tides_Mar])
| | index | Date | Day | Time | Pred | High/Low |
|---|---|---|---|---|---|---|
| 0 | 2025/01/01 | Wed | 06:58 | AM | 3.80 | H |
| 1 | 2025/01/01 | Wed | 12:17 | PM | 2.47 | L |
| 2 | 2025/01/01 | Wed | 04:41 | PM | 3.19 | H |
| 3 | 2025/01/01 | Wed | 11:54 | PM | -0.73 | L |
| 4 | 2025/01/02 | Thu | 07:28 | AM | 3.83 | H |
| ... | ... | ... | ... | ... | ... | ... |
| 114 | 2025/03/30 | Sun | 06:52 | PM | 3.46 | H |
| 115 | 2025/03/31 | Mon | 12:37 | AM | 1.10 | L |
| 116 | 2025/03/31 | Mon | 06:25 | AM | 3.70 | H |
| 117 | 2025/03/31 | Mon | 01:09 | PM | -0.48 | L |
| 118 | 2025/03/31 | Mon | 07:52 | PM | 3.48 | H |
347 rows × 6 columns
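Notice that the row index above restarts from 0 for each month’s DataFrame, because each one kept its own index. If a fresh 0, 1, 2, ... index is preferred, pd.concat() accepts an ignore_index argument:

# ignore_index=True discards the per-file indices and builds a
# new 0..346 index for the combined DataFrame
tides_all = pd.concat([tides_Jan, tides_Feb, tides_Mar], ignore_index=True)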
Writing a DataFrame to file#
Sometimes it is useful to save your intermediate data to a new file (e.g., so that other people can access it outside of python, or so that you don’t have to carry out the same data-cleaning steps every time you reanalyze the data). To export your DataFrame to a CSV file, all you need to do is call the .to_csv() method on it. For example:
# generate a new dataframe
df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4],
        "B": ["this", "that", "here", "there"],
        "C": [1.3, 2.4, 7.5, 8.1]
    },
    index=["2017", "2018", "2019", "2020"]
)
# export to a csv file named "new_data.csv"
# NOTE: the output folder needs to already exist
df.to_csv("output/new_data.csv")
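One detail worth knowing: by default, .to_csv() also writes the row index as the first column of the file. If the index carries no useful information it can be suppressed, and conversely a saved index can be recovered on re-reading with index_col. A sketch (the file names are our own):

# Suppress the row index when writing
df.to_csv("output/new_data_no_index.csv", index=False)

# Read the earlier file back, recovering the saved index from column 0
df_again = pd.read_csv("output/new_data.csv", index_col=0)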