Handling datetime in pandas#
# initialization
import numpy as np
import pandas as pd
Create pandas datetime objects#
One reason to use pandas to handle tabular data is that pandas has good support of datetime values. Suppose we have a python list of dates, in the form of string:
dates = ["2024-01-15", "2024-02-15", "2024-03-15"]
We can convert these strings to a pandas DatetimeIndex using pd.to_datetime():
dates_pd = pd.to_datetime(dates)
display(dates_pd)
DatetimeIndex(['2024-01-15', '2024-02-15', '2024-03-15'], dtype='datetime64[ns]', freq=None)
The same can be done if we have timestamps that also include hour, minute, and (optionally) second information:
timestamps = ["2024-01-15 14:20", "2024-02-15 7:35", "2024-03-15 18:06"]
timestamps_pd = pd.to_datetime(timestamps)
display(timestamps_pd)
DatetimeIndex(['2024-01-15 14:20:00', '2024-02-15 07:35:00',
'2024-03-15 18:06:00'],
dtype='datetime64[ns]', freq=None)
Note that when you use pd.to_datetime() on a sequence of strings, the output is not a DataFrame nor a Series. To convert the results to a pandas Series, just wrap it around a pd.Series() call:
time_series = pd.Series(timestamps_pd)
display(time_series)
0 2024-01-15 14:20:00
1 2024-02-15 07:35:00
2 2024-03-15 18:06:00
dtype: datetime64[ns]
We can also assign the result of pd.to_datetime() to a column of a DataFrame. For the CalCOFI dateset, we may do:
CalCOFI = pd.read_csv("data/CalCOFI_subset.csv")
CalCOFI["Datetime"] = pd.to_datetime(CalCOFI["Datetime"])
CalCOFI.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10057 entries, 0 to 10056
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Cast_Count 10057 non-null int64
1 Station_ID 10057 non-null object
2 Datetime 10057 non-null datetime64[ns]
3 Depth_m 10057 non-null int64
4 T_degC 10026 non-null float64
5 Salinity 10057 non-null float64
6 SigmaTheta 10008 non-null float64
dtypes: datetime64[ns](1), float64(3), int64(2), object(1)
memory usage: 550.1+ KB
CalCOFI
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 0 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 0 | 14.040 | 33.1700 | 24.76600 |
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.950 | 33.2100 | 24.81500 |
| 2 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 20 | 13.900 | 33.2100 | 24.82600 |
| 3 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 23 | 13.880 | 33.2100 | 24.83000 |
| 4 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 30 | 13.810 | 33.2180 | 24.85100 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10052 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 300 | 7.692 | 34.1712 | 26.67697 |
| 10053 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 381 | 7.144 | 34.2443 | 26.81386 |
| 10054 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 400 | 7.031 | 34.2746 | 26.85372 |
| 10055 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 500 | 6.293 | 34.3126 | 26.98372 |
| 10056 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 515 | 6.155 | 34.2903 | 26.98398 |
10057 rows × 7 columns
In addition, when we use .read_csv(), we can use the parse_dates argument to tell pandas to read certain column(s) as datetimes:
CalCOFI2 = pd.read_csv("data/CalCOFI_subset.csv", parse_dates=["Datetime"])
CalCOFI2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10057 entries, 0 to 10056
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Cast_Count 10057 non-null int64
1 Station_ID 10057 non-null object
2 Datetime 10057 non-null datetime64[ns]
3 Depth_m 10057 non-null int64
4 T_degC 10026 non-null float64
5 Salinity 10057 non-null float64
6 SigmaTheta 10008 non-null float64
dtypes: datetime64[ns](1), float64(3), int64(2), object(1)
memory usage: 550.1+ KB
Datetime formatting#
Of course, textual representations of datetime come in many different formats. For example, the following all specifies the same date:
date_1 = "2024-08-05 3:08 PM"
date_2 = "August 5, 2024, 15:08"
date_3 = "5 Aug 2024 - 3:08:00 pm"
date_4 = "5/8/2024 15:08:00"
date_5 = "8/5/2024 15:08:00"
To deal with the different formats, pd.to_datetime() accepts a format argument, which uses different format codes to specify how months, days, etc. are encoded. The documentation for format code can be found in https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior. Note that when we supply a format code, the non-informational characters (e.g., , / or ,) need to be included too.
As examples, to read the above strings into datetime, we’ll need:
tstamp_1 = pd.to_datetime(date_1, format="%Y-%m-%d %I:%M %p")
display(tstamp_1)
Timestamp('2024-08-05 15:08:00')
tstamp_2 = pd.to_datetime(date_2, format="%B %d, %Y, %H:%M")
display(tstamp_2)
Timestamp('2024-08-05 15:08:00')
tstamp_3 = pd.to_datetime(date_3, format="%d %b %Y - %I:%M:%S %p")
display(tstamp_3)
Timestamp('2024-08-05 15:08:00')
tstamp_4 = pd.to_datetime(date_4, format="%d/%m/%Y %H:%M:%S")
display(tstamp_4)
Timestamp('2024-08-05 15:08:00')
tstamp_5 = pd.to_datetime(date_5, format="%m/%d/%Y %H:%M:%S")
display(tstamp_5)
Timestamp('2024-08-05 15:08:00')
Conversely, the format code can also be used to print out time:
# create a pandas Series of time
time_series = pd.Series(pd.to_datetime(
["2024-01-15 14:20", "2024-02-15 7:35", "2024-03-15 18:06"]
))
# use strftime (string-from-time) to create the corresponding strings
time_series.dt.strftime("%b %d, %Y - %I:%M %p")
0 Jan 15, 2024 - 02:20 PM
1 Feb 15, 2024 - 07:35 AM
2 Mar 15, 2024 - 06:06 PM
dtype: object
Datetime comparisons#
To perform logical comparison between datetime, we can create a datetime from a scalar string, and use logical operator in the expected way:
time_series = pd.Series(pd.to_datetime(
["2024-01-15 14:20", "2024-02-15 7:35", "2024-03-15 18:06"]
))
instant = pd.to_datetime("2024-02-21")
time_series > instant
0 False
1 False
2 True
dtype: bool
As an application, we can combine datetime comparison and .loc[] to extract data that are collected in the new millennium (which we defined as year ≥ 2000)
CalCOFI.loc[
(CalCOFI["Datetime"] >= pd.to_datetime("2000-01-01"))
]
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 7364 | 29359 | 090.0 070.0 | 2000-01-12 18:23:00 | 0 | 14.310 | 33.4100 | 24.89000 |
| 7365 | 29359 | 090.0 070.0 | 2000-01-12 18:23:00 | 2 | 14.310 | 33.4130 | 24.89700 |
| 7366 | 29359 | 090.0 070.0 | 2000-01-12 18:23:00 | 2 | 14.310 | 33.4130 | NaN |
| 7367 | 29359 | 090.0 070.0 | 2000-01-12 18:23:00 | 10 | 14.290 | 33.4130 | 24.90100 |
| 7368 | 29359 | 090.0 070.0 | 2000-01-12 18:23:00 | 20 | 14.280 | 33.4130 | 24.90400 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10052 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 300 | 7.692 | 34.1712 | 26.67697 |
| 10053 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 381 | 7.144 | 34.2443 | 26.81386 |
| 10054 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 400 | 7.031 | 34.2746 | 26.85372 |
| 10055 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 500 | 6.293 | 34.3126 | 26.98372 |
| 10056 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 515 | 6.155 | 34.2903 | 26.98398 |
2693 rows × 7 columns
Extracting parts of a datetime#
Sometimes you are interested in parts of the datetime and not the whole datetime (e.g., in climatology studies you may be interested in the day of year but not the year itself). You are access these as attributes of the datetime accessor .dt of the pandas Series (again, a column of a DataFrame is a Series). Examples include:
.dt.year: year of the datetimes.dt.month: month of the datetimes.dt.day: day of the datetimes.dt.hour: hour of the datetimes.dt.minute: minute of the datetimes.dt.dayofyear: day of year of the datetimes
For a full list, see https://pandas.pydata.org/docs/user_guide/timeseries.html#time-date-components
As an example, let’s extract the day of year from the datetimes in CalCOFI:
# select one single depth to look at, to reduce duplication of dates
CalCOFI_10 = CalCOFI.loc[CalCOFI["Depth_m"]==10]
display(CalCOFI_10)
| Cast_Count | Station_ID | Datetime | Depth_m | T_degC | Salinity | SigmaTheta | |
|---|---|---|---|---|---|---|---|
| 1 | 992 | 090.0 070.0 | 1950-02-06 19:54:00 | 10 | 13.950 | 33.2100 | 24.81500 |
| 28 | 1267 | 090.0 070.0 | 1950-04-08 17:30:00 | 10 | 14.980 | 33.4000 | 24.74400 |
| 62 | 1387 | 090.0 070.0 | 1950-05-07 17:36:00 | 10 | 14.170 | 33.3080 | 24.84500 |
| 98 | 1477 | 090.0 070.0 | 1950-06-13 00:42:00 | 10 | 14.510 | 33.1110 | 24.62200 |
| 130 | 1589 | 090.0 070.0 | 1950-07-13 00:54:00 | 10 | 16.550 | 33.1020 | 24.16400 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9913 | 35250 | 090.0 070.0 | 2019-11-09 01:24:00 | 10 | 16.983 | 33.3950 | 24.28840 |
| 9944 | 35354 | 090.0 070.0 | 2020-01-09 18:13:00 | 10 | 15.054 | 33.3427 | 24.68417 |
| 9978 | 35429 | 090.0 070.0 | 2020-07-16 23:40:00 | 10 | 17.114 | 33.6188 | 24.42921 |
| 10015 | 35500 | 090.0 070.0 | 2020-10-14 23:52:00 | 10 | 17.373 | 33.3346 | 24.14988 |
| 10037 | 35578 | 090.0 070.0 | 2021-01-21 13:36:00 | 10 | 15.092 | 33.4561 | 24.76328 |
348 rows × 7 columns
# extract day of year
CalCOFI_10["Datetime"].dt.dayofyear
1 37
28 98
62 127
98 164
130 194
...
9913 313
9944 9
9978 198
10015 288
10037 21
Name: Datetime, Length: 348, dtype: int32