Pandas DataFrame and tabular data#
Pandas and pandas DataFrames#
In data science we often encounter data that are organized in tables. Our main tool to handle such data is the third-party module pandas, which we import as:
import pandas as pd
As before, the as pd part is optional but relatively standard. It defines a shorthand pd for the pandas package.
Just like how ndarray is the central object of numpy, pandas also has a central object, called DataFrame. Structurally, pandas DataFrames are row homogeneous (i.e., each row is similar in structure to the next row) but column heterogeneous (i.e., one column may differ from the next one). This makes a pandas DataFrame a good representation of tabular data, since tabular data tend to be row homogeneous and column heterogeneous too.
Moreover, unlike a 2D numpy array, a pandas DataFrame also stores column names and a row index, and these can be used to subset the DataFrame. Again, this makes a pandas DataFrame a better choice than a numpy array for representing tabular data, since tabular data often come with descriptive column headers.
Loading data as pandas DataFrame#
Most likely, your DataFrame will be created from external tabular data. The most portable format for such data is a CSV (comma-separated values) file. We can use the pd.read_csv() function to load such an external file as a DataFrame.
For example, a subset of the CalCOFI dataset (https://calcofi.org/) can be loaded as follows (note: our subset can be found here):
CalCOFI = pd.read_csv("data/CalCOFI_subset.csv")
display(CalCOFI)
| | Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta |
|---|---|---|---|---|---|---|---|
| 0 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 0 | 14.040 | 33.1700 | 24.76600 |
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.950 | 33.2100 | 24.81500 |
| 2 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 20 | 13.900 | 33.2100 | 24.82600 |
| 3 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 23 | 13.880 | 33.2100 | 24.83000 |
| 4 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 30 | 13.810 | 33.2180 | 24.85100 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10052 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 300 | 7.692 | 34.1712 | 26.67697 |
| 10053 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 381 | 7.144 | 34.2443 | 26.81386 |
| 10054 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 400 | 7.031 | 34.2746 | 26.85372 |
| 10055 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 500 | 6.293 | 34.3126 | 26.98372 |
| 10056 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 515 | 6.155 | 34.2903 | 26.98398 |
10057 rows × 7 columns
(Note that in the above we used the display() function to show a pandas DataFrame. The display() function is built into the JupyterHub and displays information using webpage (HTML) technology, which tends to be richer than the plain-text interface you get from using print().)
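Since a DataFrame can have thousands of rows, it is often handy to peek at just the first few, or to pull out a single column by its header. A quick sketch (we will see much more of this kind of subsetting later):

# Show only the first five rows of the DataFrame
display(CalCOFI.head())

# Select a single column by its header; the result is a pandas Series
display(CalCOFI["T_degC"])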
You can find the official documentation of pd.read_csv() at https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html. The function has plenty of (keyword-only, optional) arguments. Some highlights are:
- `names`: the names to use as column names. If not supplied, the first line being read is treated as the header row (see the sketch after this list).
- `skiprows`: if an `int`, the number of rows to skip before contents are read in; if a python `list` of `int`, the line indices to skip.
- `usecols`: the columns to read into the DataFrame. Can be a list of column indices or column names.
- `index_col`: the column to use as the row index. Can be an `int` or a name.
- `na_values`: which values are to be treated as indications of missing data.
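As a quick illustration of the names argument, suppose a CSV file has no header row at all; we can then supply the column names ourselves. (This is a sketch only: data/no_header.csv is a hypothetical file, and the column names are made up.)

# Hypothetical file with no header row; header=None tells pandas not to
# treat the first line as column names, and names supplies our own
df_no_header = pd.read_csv(
    "data/no_header.csv",
    header=None,
    names=["site", "depth", "temperature"]
)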
As a second example, here is the content of a simple CSV file (don’t worry about the details of the code; we don’t expect you to read a file as plain text in this course):
with open("data/header_example.csv") as infile:
    print(infile.read())
# Two lines of metadata followed by an empty line
# In this file -999 is the code for missing values
A,B,C,D
m,g,cm,L
1,3.2,4,2
2,7.9,7,5
2,-999,5,3
Suppose we want to skip the metadata lines (indices 0, 1), the empty line (index 2), and the units line (index 4), and read only the columns “A”, “B”, and “D”. Suppose we also want to convert the -999 code into actual np.nan. Then, we may do:
small_df = pd.read_csv(
    "data/header_example.csv",
    skiprows=[0, 1, 2, 4],
    usecols=["A", "B", "D"],
    na_values="-999"
)
display(small_df)
| | A | B | D |
|---|---|---|---|
| 0 | 1 | 3.2 | 2 |
| 1 | 2 | 7.9 | 5 |
| 2 | 2 | NaN | 3 |
Alternatively, suppose we want to read in all columns, and use column D as the row index. The code is then:
small_df = pd.read_csv(
    "data/header_example.csv",
    skiprows=[0, 1, 2, 4],
    index_col="D",
    na_values="-999"
)
display(small_df)
| | A | B | C |
|---|---|---|---|
| D | | | |
| 2 | 1 | 3.2 | 4 |
| 5 | 2 | 7.9 | 7 |
| 3 | 2 | NaN | 5 |
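One payoff of setting a column as the index is that rows can then be looked up by index value. For instance, using the .loc accessor (which later readings will cover in more detail):

# Look up the row whose index (i.e., whose value of D) equals 5
display(small_df.loc[5])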
As a third example, consider the Seattle tide prediction data from Jan 1, 2025 to Jan 31, 2025, obtained from NOAA (a copy of the file can be found here). If you open the file in JupyterHub, you’ll find that it is a text file with 13 lines of metadata, followed by a header on line 14, and data from that point on. Note also that the entries are separated by whitespace (actually tabs). We can load this file as a pandas DataFrame like so:
tides = pd.read_csv(
    "data/tide_prediction_2025-01.txt",
    sep=r"\s+", skiprows=13
)
display(tides)
| | Date | Day | Time | Pred | High/Low |
|---|---|---|---|---|---|
| 2025/01/01 | Wed | 06:58 | AM | 3.80 | H |
| 2025/01/01 | Wed | 12:17 | PM | 2.47 | L |
| 2025/01/01 | Wed | 04:41 | PM | 3.19 | H |
| 2025/01/01 | Wed | 11:54 | PM | -0.73 | L |
| 2025/01/02 | Thu | 07:28 | AM | 3.83 | H |
| ... | ... | ... | ... | ... | ... |
| 2025/01/30 | Thu | 04:44 | PM | 3.28 | H |
| 2025/01/30 | Thu | 11:36 | PM | -0.57 | L |
| 2025/01/31 | Fri | 06:45 | AM | 3.84 | H |
| 2025/01/31 | Fri | 12:29 | PM | 1.68 | L |
| 2025/01/31 | Fri | 05:38 | PM | 3.22 | H |
119 rows × 5 columns
In the above, sep=r"\s+" tells pandas that the entries are separated by one or more “white-space like characters” (another possible choice is sep="\t", where \t is python’s way of writing the tab character). Unfortunately, each data row in this file has one more whitespace-separated field than the header row does (the time splits into two fields, e.g., 06:58 and AM), and when that happens pandas turns the first column into the index of the DataFrame. To deal with this, we use .reset_index() to turn the index back into a regular column:
tides = tides.reset_index()
display(tides)
| | index | Date | Day | Time | Pred | High/Low |
|---|---|---|---|---|---|---|
| 0 | 2025/01/01 | Wed | 06:58 | AM | 3.80 | H |
| 1 | 2025/01/01 | Wed | 12:17 | PM | 2.47 | L |
| 2 | 2025/01/01 | Wed | 04:41 | PM | 3.19 | H |
| 3 | 2025/01/01 | Wed | 11:54 | PM | -0.73 | L |
| 4 | 2025/01/02 | Thu | 07:28 | AM | 3.83 | H |
| ... | ... | ... | ... | ... | ... | ... |
| 114 | 2025/01/30 | Thu | 04:44 | PM | 3.28 | H |
| 115 | 2025/01/30 | Thu | 11:36 | PM | -0.57 | L |
| 116 | 2025/01/31 | Fri | 06:45 | AM | 3.84 | H |
| 117 | 2025/01/31 | Fri | 12:29 | PM | 1.68 | L |
| 118 | 2025/01/31 | Fri | 05:38 | PM | 3.22 | H |
119 rows × 6 columns
Additional cleanup will be needed to fix the column names and so on, and you will learn some of these techniques in the later parts of this week’s readings.
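For instance, one possible fix (a sketch only; the label AM_PM is a name of our own choosing) is to overwrite the column labels so that they line up with the actual contents:

# After .reset_index(), the labels sit one column to the left of the data:
# "index" actually holds the dates, "Date" holds the weekday, and so on,
# because the time was split into two fields (e.g., "06:58" and "AM").
# Overwriting .columns realigns the labels with the contents.
tides.columns = ["Date", "Day", "Time", "AM_PM", "Pred", "High/Low"]
display(tides.head())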
In addition to local files, pandas can also read a CSV file hosted on the internet. For example, consider the year-long chlorophyll-a dataset hosted by the OOI Ocean Data Labs Project at Rutgers University. A direct URL to their data in CSV format is https://datalab.marine.rutgers.edu/explorations/productivity/data/productivity2.csv, which we can supply as the first argument to pd.read_csv() to load the data into a pandas DataFrame:
url = "https://datalab.marine.rutgers.edu/explorations/productivity/data/productivity2.csv"
pd.read_csv(url)
| | date | Chlorophyll-a |
|---|---|---|
| 0 | 2015-06-01 00:00:00 | 6.1105 |
| 1 | 2015-06-01 01:00:00 | 7.9981 |
| 2 | 2015-06-01 02:00:00 | 11.3498 |
| 3 | 2015-06-01 03:00:00 | 12.2573 |
| 4 | 2015-06-01 04:00:00 | 9.7526 |
| ... | ... | ... |
| 8780 | 2016-05-31 20:00:00 | 1.4399 |
| 8781 | 2016-05-31 21:00:00 | 1.0043 |
| 8782 | 2016-05-31 22:00:00 | 0.8954 |
| 8783 | 2016-05-31 23:00:00 | 0.9438 |
| 8784 | 2016-06-01 00:00:00 | 1.1616 |
8785 rows × 2 columns
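One aside: the date column above is read in as plain strings. If we want genuine datetime values (often more convenient for time-series work), pd.read_csv() accepts a parse_dates argument; a sketch (chl is a name we pick for illustration):

# Re-read the same file, asking pandas to parse the "date" column
# into datetime objects rather than plain strings
chl = pd.read_csv(url, parse_dates=["date"])
print(chl.dtypes)  # the "date" column should now be datetime64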
Incidentally, pandas can also be used to read Microsoft Excel files. The relevant function is pd.read_excel(), documented at https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html. Most of its arguments are the same as those of pd.read_csv(), with the notable exception of the sheet_name argument, which specifies the sheet to load the data from.
As an example, the CalCOFI data is also saved as an Excel .xlsx document (which you can find here), and we can load it as such:
CalCOFI_2 = pd.read_excel("data/CalCOFI_subset.xlsx", sheet_name="Sheet1")
display(CalCOFI_2)
| | Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta |
|---|---|---|---|---|---|---|---|
| 0 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 0 | 14.040 | 33.1700 | 24.76600 |
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.950 | 33.2100 | 24.81500 |
| 2 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 20 | 13.900 | 33.2100 | 24.82600 |
| 3 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 23 | 13.880 | 33.2100 | 24.83000 |
| 4 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 30 | 13.810 | 33.2180 | 24.85100 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10052 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 300 | 7.692 | 34.1712 | 26.67697 |
| 10053 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 381 | 7.144 | 34.2443 | 26.81386 |
| 10054 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 400 | 7.031 | 34.2746 | 26.85372 |
| 10055 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 500 | 6.293 | 34.3126 | 26.98372 |
| 10056 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 515 | 6.155 | 34.2903 | 26.98398 |
10057 rows × 7 columns
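A useful detail about sheet_name: passing sheet_name=None asks pd.read_excel() to read every sheet at once, returning a dictionary that maps sheet names to DataFrames. A sketch:

# Read all sheets in the workbook; the result is a dict of
# {sheet name: DataFrame}
all_sheets = pd.read_excel("data/CalCOFI_subset.xlsx", sheet_name=None)
print(all_sheets.keys())  # e.g., dict_keys(['Sheet1'])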
Creating pandas DataFrames by hand#
Internally, each column of a pandas DataFrame is a pandas Series, and we can create a Series directly using the pd.Series() function:
pd.Series(["A", "B", "C", "D"])
0 A
1 B
2 C
3 D
dtype: object
Optionally, we can supply an index and a name to a pandas Series. For example:
pd.Series(["A", "B", "C", "D"], index=range(65, 69), name="letters")
65 A
66 B
67 C
68 D
Name: letters, dtype: object
We can combine multiple pandas Series of the same length into a DataFrame using pd.concat() with axis=1:
letters = pd.Series(["A", "B", "C", "D"], name="letters")
integers = pd.Series([7, 2, 5, 3], name="integer")
df = pd.concat([letters, integers], axis=1)
display(df)
| | letters | integer |
|---|---|---|
| 0 | A | 7 |
| 1 | B | 2 |
| 2 | C | 5 |
| 3 | D | 3 |
Notice that when combining pandas Series into a DataFrame, the index of each Series becomes the index of the DataFrame, and the name of each Series becomes the name of the corresponding column.
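One caution: pd.concat() with axis=1 matches the Series up by their index labels, not by position, so Series with mismatched indices produce missing values (NaN). A minimal sketch:

# s1 has index 0..3 (the default) while s2 has index 2..5, so concat
# aligns on the index labels and fills the non-overlapping rows with NaN
s1 = pd.Series([10, 20, 30, 40], name="s1")
s2 = pd.Series([1, 2, 3, 4], index=range(2, 6), name="s2")
display(pd.concat([s1, s2], axis=1))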
Alternatively, we can also create an entire DataFrame by hand using the pd.DataFrame() function. When we call pd.DataFrame(), we supply key-value pairs enclosed in curly braces {}, with : separating each key from its value. In the resulting DataFrame, the keys become the column names, while the values (usually a python list or a numpy array) become the columns themselves. For example, to reproduce the above DataFrame, we may do:
df = pd.DataFrame({
    "letters": ["A", "B", "C", "D"],
    "integer": [7, 2, 5, 3]
})
display(df)
| | letters | integer |
|---|---|---|
| 0 | A | 7 |
| 1 | B | 2 |
| 2 | C | 5 |
| 3 | D | 3 |
(Note: the syntax {key1: value1, key2: value2, ...} defines a python dictionary. It is a useful data structure from core python, but we won’t make much use of it in this course other than to supply it as an argument to functions.)
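For the curious, here is a minimal stand-alone example of a dictionary (the keys and values are made up for illustration):

# A dictionary maps keys to values; look values up with square brackets
units = {"Depth_m": "meters", "T_degC": "degrees Celsius"}
print(units["T_degC"])  # prints: degrees Celsius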
Just as in the case of pd.Series(), we can optionally provide an index to the DataFrame using the index argument:
df = pd.DataFrame(
    {
        "letters": ["A", "B", "C", "D"],
        "integer": [7, 2, 5, 3]
    },
    index=range(2017, 2021)
)
display(df)
| | letters | integer |
|---|---|---|
| 2017 | A | 7 |
| 2018 | B | 2 |
| 2019 | C | 5 |
| 2020 | D | 3 |
Combining multiple DataFrames#
Sometimes it is necessary to first save data in separate files and then combine them during analysis. For example, suppose we are interested in tide measurements. The NOAA interface only lets us download 31 days of data at a time, so if you want to analyze tide patterns from January to March of 2025, your data will consist of three files:
tides_Jan = pd.read_csv(
    "data/tide_prediction_2025-01.txt",
    sep=r"\s+", skiprows=13
).reset_index()
tides_Feb = pd.read_csv(
    "data/tide_prediction_2025-02.txt",
    sep=r"\s+", skiprows=13
).reset_index()
tides_Mar = pd.read_csv(
    "data/tide_prediction_2025-03.txt",
    sep=r"\s+", skiprows=13
).reset_index()
(As in the previous section, these will need some further manipulation to be useful. Nevertheless, since the three DataFrames are at least consistent in their column names, we can proceed to combine them.)
We can combine the three DataFrames into a single one using pd.concat(), this time with the default axis=0:
pd.concat([tides_Jan, tides_Feb, tides_Mar])
| | index | Date | Day | Time | Pred | High/Low |
|---|---|---|---|---|---|---|
| 0 | 2025/01/01 | Wed | 06:58 | AM | 3.80 | H |
| 1 | 2025/01/01 | Wed | 12:17 | PM | 2.47 | L |
| 2 | 2025/01/01 | Wed | 04:41 | PM | 3.19 | H |
| 3 | 2025/01/01 | Wed | 11:54 | PM | -0.73 | L |
| 4 | 2025/01/02 | Thu | 07:28 | AM | 3.83 | H |
| ... | ... | ... | ... | ... | ... | ... |
| 114 | 2025/03/30 | Sun | 06:52 | PM | 3.46 | H |
| 115 | 2025/03/31 | Mon | 12:37 | AM | 1.10 | L |
| 116 | 2025/03/31 | Mon | 06:25 | AM | 3.70 | H |
| 117 | 2025/03/31 | Mon | 01:09 | PM | -0.48 | L |
| 118 | 2025/03/31 | Mon | 07:52 | PM | 3.48 | H |
347 rows × 6 columns
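Notice that the row index above restarts from 0 for each month’s DataFrame, because each one kept its own index. If a fresh 0, 1, 2, ... index is preferred, pd.concat() accepts an ignore_index argument:

# ignore_index=True discards the per-file indices and builds a
# new 0..346 index for the combined DataFrame
tides_all = pd.concat([tides_Jan, tides_Feb, tides_Mar], ignore_index=True)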
Writing a DataFrame to file#
Sometimes it is useful to save your intermediate data to a new file (e.g., so that other people can access it outside of python, or so that you don’t have to carry out the same data-cleaning steps every time you reanalyze the data). To export your DataFrame to a CSV file, all you need to do is call the .to_csv() method on it. For example:
# generate a new dataframe
df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4],
        "B": ["this", "that", "here", "there"],
        "C": [1.3, 2.4, 7.5, 8.1]
    },
    index=["2017", "2018", "2019", "2020"]
)
# export to a csv file named "new_data.csv"
# NOTE: the output folder needs to already exist
df.to_csv("output/new_data.csv")
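One detail worth knowing: by default, .to_csv() also writes the row index as the first column of the file. If the index carries no useful information it can be suppressed, and conversely a saved index can be recovered on re-reading with index_col. A sketch (the file names are our own):

# Suppress the row index when writing
df.to_csv("output/new_data_no_index.csv", index=False)

# Read the earlier file back, recovering the saved index from column 0
df_again = pd.read_csv("output/new_data.csv", index_col=0)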