Handling datetime in pandas#

# initialization
import numpy as np
import pandas as pd

Create pandas datetime objects#

One reason to use pandas to handle tabular data is that pandas has good support of datetime values. Suppose we have a python list of dates, in the form of string:

dates = ["2024-01-15", "2024-02-15", "2024-03-15"]

We can convert these strings to a pandas DatetimeIndex using pd.to_datetime():

dates_pd = pd.to_datetime(dates)
display(dates_pd)
DatetimeIndex(['2024-01-15', '2024-02-15', '2024-03-15'], dtype='datetime64[ns]', freq=None)

The same can be done if we have timestamps that also include hour, minute, and (optionally) second information:

timestamps = ["2024-01-15 14:20", "2024-02-15 7:35", "2024-03-15 18:06"]
timestamps_pd = pd.to_datetime(timestamps)
display(timestamps_pd)
DatetimeIndex(['2024-01-15 14:20:00', '2024-02-15 07:35:00',
               '2024-03-15 18:06:00'],
              dtype='datetime64[ns]', freq=None)

Note that when you use pd.to_datetime() on a sequence of strings, the output is not a DataFrame nor a Series. To convert the results to a pandas Series, just wrap it around a pd.Series() call:

time_series = pd.Series(timestamps_pd)
display(time_series)
0   2024-01-15 14:20:00
1   2024-02-15 07:35:00
2   2024-03-15 18:06:00
dtype: datetime64[ns]

We can also assign the result of pd.to_datetime() to a column of a DataFrame. For the CalCOFI dateset, we may do:

CalCOFI = pd.read_csv("data/CalCOFI_subset.csv")

CalCOFI["Datetime"] = pd.to_datetime(CalCOFI["Datetime"])
CalCOFI.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10057 entries, 0 to 10056
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Cast_Count  10057 non-null  int64         
 1   Station_ID  10057 non-null  object        
 2   Datetime    10057 non-null  datetime64[ns]
 3   Depth_m     10057 non-null  int64         
 4   T_degC      10026 non-null  float64       
 5   Salinity    10057 non-null  float64       
 6   SigmaTheta  10008 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(2), object(1)
memory usage: 550.1+ KB
CalCOFI
Cast_Count Station_ID Datetime Depth_m T_degC Salinity SigmaTheta
0 992 090.0 070.0 1950-02-06 19:54:00 0 14.040 33.1700 24.76600
1 992 090.0 070.0 1950-02-06 19:54:00 10 13.950 33.2100 24.81500
2 992 090.0 070.0 1950-02-06 19:54:00 20 13.900 33.2100 24.82600
3 992 090.0 070.0 1950-02-06 19:54:00 23 13.880 33.2100 24.83000
4 992 090.0 070.0 1950-02-06 19:54:00 30 13.810 33.2180 24.85100
... ... ... ... ... ... ... ...
10052 35578 090.0 070.0 2021-01-21 13:36:00 300 7.692 34.1712 26.67697
10053 35578 090.0 070.0 2021-01-21 13:36:00 381 7.144 34.2443 26.81386
10054 35578 090.0 070.0 2021-01-21 13:36:00 400 7.031 34.2746 26.85372
10055 35578 090.0 070.0 2021-01-21 13:36:00 500 6.293 34.3126 26.98372
10056 35578 090.0 070.0 2021-01-21 13:36:00 515 6.155 34.2903 26.98398

10057 rows × 7 columns

In addition, when we use .read_csv(), we can use the parse_dates argument to tell pandas to read certain column(s) as datetimes:

CalCOFI2 = pd.read_csv("data/CalCOFI_subset.csv", parse_dates=["Datetime"])
CalCOFI2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10057 entries, 0 to 10056
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Cast_Count  10057 non-null  int64         
 1   Station_ID  10057 non-null  object        
 2   Datetime    10057 non-null  datetime64[ns]
 3   Depth_m     10057 non-null  int64         
 4   T_degC      10026 non-null  float64       
 5   Salinity    10057 non-null  float64       
 6   SigmaTheta  10008 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(2), object(1)
memory usage: 550.1+ KB

Datetime formatting#

Of course, textual representations of datetime come in many different formats. For example, the following all specifies the same date:

date_1 = "2024-08-05 3:08 PM"
date_2 = "August 5, 2024, 15:08"
date_3 = "5 Aug 2024 - 3:08:00 pm"
date_4 = "5/8/2024 15:08:00"
date_5 = "8/5/2024 15:08:00"

To deal with the different formats, pd.to_datetime() accepts a format argument, which uses different format codes to specify how months, days, etc. are encoded. The documentation for format code can be found in https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior. Note that when we supply a format code, the non-informational characters (e.g., , / or ,) need to be included too.

As examples, to read the above strings into datetime, we’ll need:

tstamp_1 = pd.to_datetime(date_1, format="%Y-%m-%d %I:%M %p")
display(tstamp_1)
Timestamp('2024-08-05 15:08:00')
tstamp_2 = pd.to_datetime(date_2, format="%B %d, %Y, %H:%M")
display(tstamp_2)
Timestamp('2024-08-05 15:08:00')
tstamp_3 = pd.to_datetime(date_3, format="%d %b %Y - %I:%M:%S %p")
display(tstamp_3)
Timestamp('2024-08-05 15:08:00')
tstamp_4 = pd.to_datetime(date_4, format="%d/%m/%Y %H:%M:%S")
display(tstamp_4)
Timestamp('2024-08-05 15:08:00')
tstamp_5 = pd.to_datetime(date_5, format="%m/%d/%Y %H:%M:%S")
display(tstamp_5)
Timestamp('2024-08-05 15:08:00')

Conversely, the format code can also be used to print out time:

# create a pandas Series of time
time_series = pd.Series(pd.to_datetime(
    ["2024-01-15 14:20", "2024-02-15 7:35", "2024-03-15 18:06"]
))
# use strftime (string-from-time) to create the corresponding strings
time_series.dt.strftime("%b %d, %Y - %I:%M %p")
0    Jan 15, 2024 - 02:20 PM
1    Feb 15, 2024 - 07:35 AM
2    Mar 15, 2024 - 06:06 PM
dtype: object

Datetime comparisons#

To perform logical comparison between datetime, we can create a datetime from a scalar string, and use logical operator in the expected way:

time_series = pd.Series(pd.to_datetime(
    ["2024-01-15 14:20", "2024-02-15 7:35", "2024-03-15 18:06"]
))
instant = pd.to_datetime("2024-02-21")
time_series > instant
0    False
1    False
2     True
dtype: bool

As an application, we can combine datetime comparison and .loc[] to extract data that are collected in the new millennium (which we defined as year ≥ 2000)

CalCOFI.loc[
    (CalCOFI["Datetime"] >= pd.to_datetime("2000-01-01"))
]
Cast_Count Station_ID Datetime Depth_m T_degC Salinity SigmaTheta
7364 29359 090.0 070.0 2000-01-12 18:23:00 0 14.310 33.4100 24.89000
7365 29359 090.0 070.0 2000-01-12 18:23:00 2 14.310 33.4130 24.89700
7366 29359 090.0 070.0 2000-01-12 18:23:00 2 14.310 33.4130 NaN
7367 29359 090.0 070.0 2000-01-12 18:23:00 10 14.290 33.4130 24.90100
7368 29359 090.0 070.0 2000-01-12 18:23:00 20 14.280 33.4130 24.90400
... ... ... ... ... ... ... ...
10052 35578 090.0 070.0 2021-01-21 13:36:00 300 7.692 34.1712 26.67697
10053 35578 090.0 070.0 2021-01-21 13:36:00 381 7.144 34.2443 26.81386
10054 35578 090.0 070.0 2021-01-21 13:36:00 400 7.031 34.2746 26.85372
10055 35578 090.0 070.0 2021-01-21 13:36:00 500 6.293 34.3126 26.98372
10056 35578 090.0 070.0 2021-01-21 13:36:00 515 6.155 34.2903 26.98398

2693 rows × 7 columns

Extracting parts of a datetime#

Sometimes you are interested in parts of the datetime and not the whole datetime (e.g., in climatology studies you may be interested in the day of year but not the year itself). You are access these as attributes of the datetime accessor .dt of the pandas Series (again, a column of a DataFrame is a Series). Examples include:

  • .dt.year: year of the datetimes

  • .dt.month: month of the datetimes

  • .dt.day: day of the datetimes

  • .dt.hour: hour of the datetimes

  • .dt.minute: minute of the datetimes

  • .dt.dayofyear: day of year of the datetimes

For a full list, see https://pandas.pydata.org/docs/user_guide/timeseries.html#time-date-components

As an example, let’s extract the day of year from the datetimes in CalCOFI:

# select one single depth to look at, to reduce duplication of dates
CalCOFI_10 = CalCOFI.loc[CalCOFI["Depth_m"]==10]
display(CalCOFI_10)
Cast_Count Station_ID Datetime Depth_m T_degC Salinity SigmaTheta
1 992 090.0 070.0 1950-02-06 19:54:00 10 13.950 33.2100 24.81500
28 1267 090.0 070.0 1950-04-08 17:30:00 10 14.980 33.4000 24.74400
62 1387 090.0 070.0 1950-05-07 17:36:00 10 14.170 33.3080 24.84500
98 1477 090.0 070.0 1950-06-13 00:42:00 10 14.510 33.1110 24.62200
130 1589 090.0 070.0 1950-07-13 00:54:00 10 16.550 33.1020 24.16400
... ... ... ... ... ... ... ...
9913 35250 090.0 070.0 2019-11-09 01:24:00 10 16.983 33.3950 24.28840
9944 35354 090.0 070.0 2020-01-09 18:13:00 10 15.054 33.3427 24.68417
9978 35429 090.0 070.0 2020-07-16 23:40:00 10 17.114 33.6188 24.42921
10015 35500 090.0 070.0 2020-10-14 23:52:00 10 17.373 33.3346 24.14988
10037 35578 090.0 070.0 2021-01-21 13:36:00 10 15.092 33.4561 24.76328

348 rows × 7 columns

# extract day of year
CalCOFI_10["Datetime"].dt.dayofyear
1         37
28        98
62       127
98       164
130      194
        ... 
9913     313
9944       9
9978     198
10015    288
10037     21
Name: Datetime, Length: 348, dtype: int32