Pandas DataFrame and tabular data#

Pandas and pandas DataFrames#

In data science we often encounter data that are organized in tables. Our main tool to handle such data is the third-party module pandas, which we import as:

import pandas as pd

As before, the as pd part is optional but is relatively standard. It defines a shorthand pd for the pandas package.

Just like ndarray is the central object of numpy, pandas also has a central object, called DataFrame. Structurally, a pandas DataFrame is row homogeneous (i.e., each row is similar to the next row) but column heterogeneous (i.e., one column may differ from the next). This makes a pandas DataFrame a good representation of tabular data, since tabular data tends to be row homogeneous and column heterogeneous too.

Moreover, unlike a 2D numpy array, a pandas DataFrame also stores column names and a row index, and these can be used to subset the DataFrame. Again this makes a pandas DataFrame a better choice than a numpy array for representing tabular data, since tabular data often come with descriptive column headers.
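For instance, a DataFrame can hold a text column alongside numeric columns, with each column keeping its own data type (a minimal sketch; the column names here are made up, and the pd.DataFrame() constructor is covered later in this reading):

```python
import pandas as pd

# Each column has its own data type, but every row follows the same pattern
df = pd.DataFrame({
    "station": ["A", "B", "C"],      # text column
    "depth_m": [0, 10, 20],          # integer column
    "temp_degC": [14.0, 13.9, 13.8]  # float column
})

print(df.dtypes)  # one dtype per column: object, int64, float64
```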

Loading data as pandas DataFrame#

Most likely, your DataFrame will be created from external tabular data. The most portable format for such data is a CSV (comma-separated values) file. We can use the pd.read_csv() function to load such an external file as a DataFrame.

For example, a subset of the CalCOFI dataset (https://calcofi.org/) can be loaded as follows (note: our subset can be found here)

CalCOFI = pd.read_csv("data/CalCOFI_subset.csv")
display(CalCOFI)
Cast_Count Station_ID Datetime Depth_m T_degC Salinity SigmaTheta
0 992 090.0 070.0 1950-02-06 19:54:00 0 14.040 33.1700 24.76600
1 992 090.0 070.0 1950-02-06 19:54:00 10 13.950 33.2100 24.81500
2 992 090.0 070.0 1950-02-06 19:54:00 20 13.900 33.2100 24.82600
3 992 090.0 070.0 1950-02-06 19:54:00 23 13.880 33.2100 24.83000
4 992 090.0 070.0 1950-02-06 19:54:00 30 13.810 33.2180 24.85100
... ... ... ... ... ... ... ...
10052 35578 090.0 070.0 2021-01-21 13:36:00 300 7.692 34.1712 26.67697
10053 35578 090.0 070.0 2021-01-21 13:36:00 381 7.144 34.2443 26.81386
10054 35578 090.0 070.0 2021-01-21 13:36:00 400 7.031 34.2746 26.85372
10055 35578 090.0 070.0 2021-01-21 13:36:00 500 6.293 34.3126 26.98372
10056 35578 090.0 070.0 2021-01-21 13:36:00 515 6.155 34.2903 26.98398

10057 rows × 7 columns

(Note that in the above we used the display() function to display the pandas DataFrame. The display() function is built into the Jupyter environment (such as the JupyterHub) and displays information using webpage (HTML) technology, which tends to be richer than the plain-text output you get from print().)

You can find the official documentation of pd.read_csv() at https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html. The function has plenty of optional keyword arguments. Some highlights are:

  • names: the names to use as column names. If not supplied, the first line read is treated as the header row.

  • skiprows: if an int, the number of rows to skip before contents are read in; if a python list of ints, the line indices to skip.

  • usecols: the columns to read into the DataFrame. Can be a list of column indices or column names.

  • index_col: the column to use as row index. Can be an int or a name.

  • na_values: which values are to be treated as indicating missing data.
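To illustrate the names argument, here is a small sketch that reads a headerless CSV (io.StringIO stands in for a real file so the example is self-contained; the data values are made up):

```python
import io
import pandas as pd

# A CSV with no header row; io.StringIO lets us treat a string as a file
raw = io.StringIO("1,3.2,4\n2,7.9,7\n")

# Since there is no header line, we supply the column names ourselves
df = pd.read_csv(raw, names=["A", "B", "C"])
print(df)
```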

As a second example, here is the content of a simple CSV file (don’t worry about the details of the code; we don’t expect you to read a file as plain text in this course):

with open("data/header_example.csv") as infile:
    print(infile.read())
# Two lines of metadata followed by an empty line
# In this file -999 is the code for missing values

A,B,C,D
m,g,cm,L
1,3.2,4,2
2,7.9,7,5
2,-999,5,3

Suppose we want to skip the metadata lines (indices 0, 1), the empty line (index 2), and the units line (index 4), and read only the columns “A”, “B”, and “D”. Suppose we also want to convert the -999 code into actual np.nan. Then, we may do:

small_df = pd.read_csv(
    "data/header_example.csv", 
    skiprows=[0, 1, 2, 4], 
    usecols=["A", "B", "D"], 
    na_values="-999"
)

display(small_df)
A B D
0 1 3.2 2
1 2 7.9 5
2 2 NaN 3

Alternatively, suppose we want to read in all columns, and use column D as row index. The code is then:

small_df = pd.read_csv(
    "data/header_example.csv", 
    skiprows=[0, 1, 2, 4], 
    index_col="D",
    na_values="-999"
)

display(small_df)
A B C
D
2 1 3.2 4
5 2 7.9 7
3 2 NaN 5

As a third example, consider the Seattle tide prediction data from Jan 1, 2025 to Jan 31, 2025, obtained from NOAA (a copy of the file can be found here). If you open the file in JupyterHub, you’ll find that this is a text file with 13 lines of metadata, followed by a header on line 14, and data from that point on. Note also that the data are separated by white space (actually tabs). We can load this file as a pandas DataFrame like so:

tides = pd.read_csv(
    "data/tide_prediction_2025-01.txt", 
    sep="\\s+", skiprows=13
)

display(tides)
Date Day Time Pred High/Low
2025/01/01 Wed 06:58 AM 3.80 H
2025/01/01 Wed 12:17 PM 2.47 L
2025/01/01 Wed 04:41 PM 3.19 H
2025/01/01 Wed 11:54 PM -0.73 L
2025/01/02 Thu 07:28 AM 3.83 H
... ... ... ... ... ...
2025/01/30 Thu 04:44 PM 3.28 H
2025/01/30 Thu 11:36 PM -0.57 L
2025/01/31 Fri 06:45 AM 3.84 H
2025/01/31 Fri 12:29 PM 1.68 L
2025/01/31 Fri 05:38 PM 3.22 H

119 rows × 5 columns

In the above, sep = "\\s+" tells pandas that the entries are separated by one or more “white-space like characters” (another possible choice is sep = "\t", where \t is python’s way of writing the tab character). Unfortunately, the precise formatting of this file causes pandas to turn the first column into the index of the DataFrame. To deal with this, we use .reset_index() to turn the index back into a regular column:

tides = tides.reset_index()
display(tides)
index Date Day Time Pred High/Low
0 2025/01/01 Wed 06:58 AM 3.80 H
1 2025/01/01 Wed 12:17 PM 2.47 L
2 2025/01/01 Wed 04:41 PM 3.19 H
3 2025/01/01 Wed 11:54 PM -0.73 L
4 2025/01/02 Thu 07:28 AM 3.83 H
... ... ... ... ... ... ...
114 2025/01/30 Thu 04:44 PM 3.28 H
115 2025/01/30 Thu 11:36 PM -0.57 L
116 2025/01/31 Fri 06:45 AM 3.84 H
117 2025/01/31 Fri 12:29 PM 1.68 L
118 2025/01/31 Fri 05:38 PM 3.22 H

119 rows × 6 columns

Additional cleanup will be needed to fix the header row, etc., and you will learn some of these steps in later parts of this week’s readings.

In addition to local files, pandas can also read a CSV file hosted on the internet. For example, consider the dataset of chlorophyll-a measurements across a year hosted by the OOI Ocean Data Labs Project at Rutgers University. A direct URL to their data in CSV format is https://datalab.marine.rutgers.edu/explorations/productivity/data/productivity2.csv, which we can supply as the first argument to pd.read_csv() to load the data into a pandas DataFrame:

url = "https://datalab.marine.rutgers.edu/explorations/productivity/data/productivity2.csv"
pd.read_csv(url)
date Chlorophyll-a
0 2015-06-01 00:00:00 6.1105
1 2015-06-01 01:00:00 7.9981
2 2015-06-01 02:00:00 11.3498
3 2015-06-01 03:00:00 12.2573
4 2015-06-01 04:00:00 9.7526
... ... ...
8780 2016-05-31 20:00:00 1.4399
8781 2016-05-31 21:00:00 1.0043
8782 2016-05-31 22:00:00 0.8954
8783 2016-05-31 23:00:00 0.9438
8784 2016-06-01 00:00:00 1.1616

8785 rows × 2 columns

Incidentally, pandas can also read Microsoft Excel files. The relevant function is pd.read_excel(), which is documented at https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html. Most of its arguments are the same as for pd.read_csv(), with the notable exception of the sheet_name argument, which specifies the sheet to load the data from.

As an example, the CalCOFI data is also saved as an Excel .xlsx document (which you can find here), and we can load it as such:

CalCOFI_2 = pd.read_excel("data/CalCOFI_subset.xlsx", sheet_name="Sheet1")
display(CalCOFI_2)
Cast_Count Station_ID Datetime Depth_m T_degC Salinity SigmaTheta
0 992 090.0 070.0 1950-02-06 19:54:00 0 14.040 33.1700 24.76600
1 992 090.0 070.0 1950-02-06 19:54:00 10 13.950 33.2100 24.81500
2 992 090.0 070.0 1950-02-06 19:54:00 20 13.900 33.2100 24.82600
3 992 090.0 070.0 1950-02-06 19:54:00 23 13.880 33.2100 24.83000
4 992 090.0 070.0 1950-02-06 19:54:00 30 13.810 33.2180 24.85100
... ... ... ... ... ... ... ...
10052 35578 090.0 070.0 2021-01-21 13:36:00 300 7.692 34.1712 26.67697
10053 35578 090.0 070.0 2021-01-21 13:36:00 381 7.144 34.2443 26.81386
10054 35578 090.0 070.0 2021-01-21 13:36:00 400 7.031 34.2746 26.85372
10055 35578 090.0 070.0 2021-01-21 13:36:00 500 6.293 34.3126 26.98372
10056 35578 090.0 070.0 2021-01-21 13:36:00 515 6.155 34.2903 26.98398

10057 rows × 7 columns

Creating pandas DataFrames by hand#

Internally, each column of a pandas DataFrame is a pandas Series, and we can create a Series using the pd.Series() function:

pd.Series(["A", "B", "C", "D"])
0    A
1    B
2    C
3    D
dtype: object

Optionally, we can supply an index and a name to a pandas Series. For example:

pd.Series(["A", "B", "C", "D"], index=range(65, 69), name="letters")
65    A
66    B
67    C
68    D
Name: letters, dtype: object

We can combine multiple pandas Series of the same length into a DataFrame using pd.concat() with axis=1:

letters = pd.Series(["A", "B", "C", "D"], name="letters")
integers = pd.Series([7, 2, 5, 3], name="integer")

df = pd.concat([letters, integers], axis=1)
display(df)
letters integer
0 A 7
1 B 2
2 C 5
3 D 3

Notice that when combining pandas Series into a DataFrame, the index of each Series becomes the index of the DataFrame, and the name of each Series becomes the name of the corresponding column.

Alternatively, we can also create an entire DataFrame by hand using the pd.DataFrame() function. When we call pd.DataFrame(), we supply key-value pairs enclosed by curly braces {}, using : to separate the keys from the values. In the resulting DataFrame, the keys become the column names, while the values (usually a python list or a numpy array) become the columns themselves. For example, to reproduce the above DataFrame, we may do:

df = pd.DataFrame({
    "letters": ["A", "B", "C", "D"],
    "integer": [7, 2, 5, 3]
})

display(df)
letters integer
0 A 7
1 B 2
2 C 5
3 D 3

(Note: the syntax {key1: value1, key2: value2, ...} defines a python dictionary. It is a useful data structure from core python but we won’t be making much use of it in this course other than to supply it as arguments to functions.)
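For reference, here is a small dictionary on its own (the keys and values are made up for illustration):

```python
# A dictionary maps keys to values; look up a value with square brackets
units = {"Depth_m": "meters", "T_degC": "degrees Celsius"}
print(units["Depth_m"])  # prints "meters"
```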

Just as in the case of pd.Series(), we can optionally provide an index to the DataFrame using the index argument:

df = pd.DataFrame(
    {
        "letters": ["A", "B", "C", "D"],
        "integer": [7, 2, 5, 3]
    }, 
    index=range(2017, 2021)
)

display(df)
letters integer
2017 A 7
2018 B 2
2019 C 5
2020 D 3

Combining multiple DataFrames#

Sometimes it is necessary to first save data in separate files and then combine them during analysis. For example, suppose we are interested in tide measurements. The NOAA interface only lets us download 31 days of data at a time, so if you want to analyze tide patterns from January to March of 2025, your data will consist of 3 files:

tides_Jan = pd.read_csv(
    "data/tide_prediction_2025-01.txt", 
    sep="\\s+", skiprows=13
).reset_index()

tides_Feb = pd.read_csv(
    "data/tide_prediction_2025-02.txt", 
    sep="\\s+", skiprows=13
).reset_index()

tides_Mar = pd.read_csv(
    "data/tide_prediction_2025-03.txt", 
    sep="\\s+", skiprows=13
).reset_index()

(As in the previous section, these DataFrames need some further manipulation to be useful. Nevertheless, since the 3 DataFrames are at least consistent in their column names, we can proceed to combine them.)

We can combine the three DataFrames into a single one using pd.concat(), this time with the default axis=0:

pd.concat([tides_Jan, tides_Feb, tides_Mar])
index Date Day Time Pred High/Low
0 2025/01/01 Wed 06:58 AM 3.80 H
1 2025/01/01 Wed 12:17 PM 2.47 L
2 2025/01/01 Wed 04:41 PM 3.19 H
3 2025/01/01 Wed 11:54 PM -0.73 L
4 2025/01/02 Thu 07:28 AM 3.83 H
... ... ... ... ... ... ...
114 2025/03/30 Sun 06:52 PM 3.46 H
115 2025/03/31 Mon 12:37 AM 1.10 L
116 2025/03/31 Mon 06:25 AM 3.70 H
117 2025/03/31 Mon 01:09 PM -0.48 L
118 2025/03/31 Mon 07:52 PM 3.48 H

347 rows × 6 columns
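Note that in the combined DataFrame above, the row labels 0 through 118 repeat for each month. If a fresh, consecutive index is preferred, pd.concat() accepts an ignore_index=True argument. Here is a sketch with two small stand-in DataFrames (in place of the real tide files):

```python
import pandas as pd

# Stand-ins for the monthly tide DataFrames
df1 = pd.DataFrame({"Pred": [3.80, 2.47]})
df2 = pd.DataFrame({"Pred": [3.83, 3.28]})

# ignore_index=True renumbers the rows of the combined DataFrame 0, 1, 2, ...
combined = pd.concat([df1, df2], ignore_index=True)
print(combined.index.tolist())  # [0, 1, 2, 3]
```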

Writing a DataFrame to file#

Sometimes it is useful to save your intermediate data into a new file (e.g., so that other people can access it outside of python, or so that you don’t have to carry out the same data cleaning step every time you reanalyze the data). To export your DataFrame into a csv file, all you need to do is to call the .to_csv() method on it. For example:

# generate a new dataframe

df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4],
        "B": ["this", "that", "here", "there"],
        "C": [1.3, 2.4, 7.5, 8.1]
    }, index=["2017", "2018", "2019", "2020"]
)

# export to a csv file named "new_data.csv"
# NOTE: the output folder needs to already exist
df.to_csv("output/new_data.csv")
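By default, .to_csv() also writes the row index as the first column of the file. If that is not wanted, pass index=False. As a sketch of the effect, calling .to_csv() with no file path returns the CSV text as a string, which makes the result easy to inspect:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})

# With no file path, .to_csv() returns the CSV text instead of writing a file;
# index=False omits the row index column
text = df.to_csv(index=False)
print(text)
```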