Handling datetime in pandas

Handling datetime in pandas#

# initialization
import numpy as np
import pandas as pd

Create pandas datetime objects#

One reason to use pandas to handle tabular data is that pandas has good support of datetime values. Suppose we have a python list of dates, in the form of string:

dates = ["2024-01-15", "2024-02-15", "2024-03-15"]

We can convert these strings to a pandas DatetimeIndex using pd.to_datetime():

dates_pd = pd.to_datetime(dates)
display(dates_pd)

DatetimeIndex(['2024-01-15', '2024-02-15', '2024-03-15'], dtype='datetime64[ns]', freq=None)

The same can be done if we have timestamps that also include hour, minute, and (optionally) second information:

timestamps = ["2024-01-15 14:20", "2024-02-15 7:35", "2024-03-15 18:06"]

timestamps_pd = pd.to_datetime(timestamps)
display(timestamps_pd)

DatetimeIndex(['2024-01-15 14:20:00', '2024-02-15 07:35:00',
               '2024-03-15 18:06:00'],
              dtype='datetime64[ns]', freq=None)

Note that when you use pd.to_datetime() on a sequence of strings, the output is not a DataFrame nor a Series. To convert the results to a pandas Series, just wrap it around a pd.Series() call:

time_series = pd.Series(timestamps_pd)
display(time_series)

 2024-01-15 14:20:00
 2024-02-15 07:35:00
 2024-03-15 18:06:00
dtype: datetime64[ns]

We can also assign the result of pd.to_datetime() to a column of a DataFrame. For the CalCOFI dateset, we may do:

CalCOFI = pd.read_csv("data/CalCOFI_subset.csv")

CalCOFI["Datetime"] = pd.to_datetime(CalCOFI["Datetime"])

CalCOFI.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10057 entries, 0 to 10056
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Cast_Count  10057 non-null  int64         
 1   Station_ID  10057 non-null  object        
 2   Datetime    10057 non-null  datetime64[ns]
 3   Depth_m     10057 non-null  int64         
 4   T_degC      10026 non-null  float64       
 5   Salinity    10057 non-null  float64       
 6   SigmaTheta  10008 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(2), object(1)
memory usage: 550.1+ KB

CalCOFI

	Cast_Count	Station_ID	Datetime	Depth_m	T_degC	Salinity	SigmaTheta
0	992	090.0 070.0	1950-02-06 19:54:00	0	14.040	33.1700	24.76600
1	992	090.0 070.0	1950-02-06 19:54:00	10	13.950	33.2100	24.81500
2	992	090.0 070.0	1950-02-06 19:54:00	20	13.900	33.2100	24.82600
3	992	090.0 070.0	1950-02-06 19:54:00	23	13.880	33.2100	24.83000
4	992	090.0 070.0	1950-02-06 19:54:00	30	13.810	33.2180	24.85100
...	...	...	...	...	...	...	...
10052	35578	090.0 070.0	2021-01-21 13:36:00	300	7.692	34.1712	26.67697
10053	35578	090.0 070.0	2021-01-21 13:36:00	381	7.144	34.2443	26.81386
10054	35578	090.0 070.0	2021-01-21 13:36:00	400	7.031	34.2746	26.85372
10055	35578	090.0 070.0	2021-01-21 13:36:00	500	6.293	34.3126	26.98372
10056	35578	090.0 070.0	2021-01-21 13:36:00	515	6.155	34.2903	26.98398

10057 rows × 7 columns

In addition, when we use .read_csv(), we can use the parse_dates argument to tell pandas to read certain column(s) as datetimes:

CalCOFI2 = pd.read_csv("data/CalCOFI_subset.csv", parse_dates=["Datetime"])
CalCOFI2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10057 entries, 0 to 10056
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Cast_Count  10057 non-null  int64         
 1   Station_ID  10057 non-null  object        
 2   Datetime    10057 non-null  datetime64[ns]
 3   Depth_m     10057 non-null  int64         
 4   T_degC      10026 non-null  float64       
 5   Salinity    10057 non-null  float64       
 6   SigmaTheta  10008 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(2), object(1)
memory usage: 550.1+ KB

Datetime formatting#

Of course, textual representations of datetime come in many different formats. For example, the following all specifies the same date:

date_1 = "2024-08-05 3:08 PM"
date_2 = "August 5, 2024, 15:08"
date_3 = "5 Aug 2024 - 3:08:00 pm"
date_4 = "5/8/2024 15:08:00"
date_5 = "8/5/2024 15:08:00"

To deal with the different formats, pd.to_datetime() accepts a format argument, which uses different format codes to specify how months, days, etc. are encoded. The documentation for format code can be found in https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior. Note that when we supply a format code, the non-informational characters (e.g., , / or ,) need to be included too.

As examples, to read the above strings into datetime, we’ll need:

tstamp_1 = pd.to_datetime(date_1, format="%Y-%m-%d %I:%M %p")
display(tstamp_1)

Timestamp('2024-08-05 15:08:00')

tstamp_2 = pd.to_datetime(date_2, format="%B %d, %Y, %H:%M")
display(tstamp_2)

Timestamp('2024-08-05 15:08:00')

tstamp_3 = pd.to_datetime(date_3, format="%d %b %Y - %I:%M:%S %p")
display(tstamp_3)

Timestamp('2024-08-05 15:08:00')

tstamp_4 = pd.to_datetime(date_4, format="%d/%m/%Y %H:%M:%S")
display(tstamp_4)

Timestamp('2024-08-05 15:08:00')

tstamp_5 = pd.to_datetime(date_5, format="%m/%d/%Y %H:%M:%S")
display(tstamp_5)

Timestamp('2024-08-05 15:08:00')

Conversely, the format code can also be used to print out time:

# create a pandas Series of time
time_series = pd.Series(pd.to_datetime(
    ["2024-01-15 14:20", "2024-02-15 7:35", "2024-03-15 18:06"]
))

# use strftime (string-from-time) to create the corresponding strings
time_series.dt.strftime("%b %d, %Y - %I:%M %p")

  Jan 15, 2024 - 02:20 PM
  Feb 15, 2024 - 07:35 AM
  Mar 15, 2024 - 06:06 PM
dtype: object

Datetime comparisons#

To perform logical comparison between datetime, we can create a datetime from a scalar string, and use logical operator in the expected way:

time_series = pd.Series(pd.to_datetime(
    ["2024-01-15 14:20", "2024-02-15 7:35", "2024-03-15 18:06"]
))
instant = pd.to_datetime("2024-02-21")
time_series > instant

  False
  False
   True
dtype: bool

As an application, we can combine datetime comparison and .loc[] to extract data that are collected in the new millennium (which we defined as year ≥ 2000)

CalCOFI.loc[
    (CalCOFI["Datetime"] >= pd.to_datetime("2000-01-01"))
]

	Cast_Count	Station_ID	Datetime	Depth_m	T_degC	Salinity	SigmaTheta
7364	29359	090.0 070.0	2000-01-12 18:23:00	0	14.310	33.4100	24.89000
7365	29359	090.0 070.0	2000-01-12 18:23:00	2	14.310	33.4130	24.89700
7366	29359	090.0 070.0	2000-01-12 18:23:00	2	14.310	33.4130	NaN
7367	29359	090.0 070.0	2000-01-12 18:23:00	10	14.290	33.4130	24.90100
7368	29359	090.0 070.0	2000-01-12 18:23:00	20	14.280	33.4130	24.90400
...	...	...	...	...	...	...	...
10052	35578	090.0 070.0	2021-01-21 13:36:00	300	7.692	34.1712	26.67697
10053	35578	090.0 070.0	2021-01-21 13:36:00	381	7.144	34.2443	26.81386
10054	35578	090.0 070.0	2021-01-21 13:36:00	400	7.031	34.2746	26.85372
10055	35578	090.0 070.0	2021-01-21 13:36:00	500	6.293	34.3126	26.98372
10056	35578	090.0 070.0	2021-01-21 13:36:00	515	6.155	34.2903	26.98398

2693 rows × 7 columns

Extracting parts of a datetime#

Sometimes you are interested in parts of the datetime and not the whole datetime (e.g., in climatology studies you may be interested in the day of year but not the year itself). You are access these as attributes of the datetime accessor .dt of the pandas Series (again, a column of a DataFrame is a Series). Examples include:

.dt.year: year of the datetimes
.dt.month: month of the datetimes
.dt.day: day of the datetimes
.dt.hour: hour of the datetimes
.dt.minute: minute of the datetimes
.dt.dayofyear: day of year of the datetimes

For a full list, see https://pandas.pydata.org/docs/user_guide/timeseries.html#time-date-components

As an example, let’s extract the day of year from the datetimes in CalCOFI:

# select one single depth to look at, to reduce duplication of dates
CalCOFI_10 = CalCOFI.loc[CalCOFI["Depth_m"]==10]
display(CalCOFI_10)

	Cast_Count	Station_ID	Datetime	Depth_m	T_degC	Salinity	SigmaTheta
1	992	090.0 070.0	1950-02-06 19:54:00	10	13.950	33.2100	24.81500
28	1267	090.0 070.0	1950-04-08 17:30:00	10	14.980	33.4000	24.74400
62	1387	090.0 070.0	1950-05-07 17:36:00	10	14.170	33.3080	24.84500
98	1477	090.0 070.0	1950-06-13 00:42:00	10	14.510	33.1110	24.62200
130	1589	090.0 070.0	1950-07-13 00:54:00	10	16.550	33.1020	24.16400
...	...	...	...	...	...	...	...
9913	35250	090.0 070.0	2019-11-09 01:24:00	10	16.983	33.3950	24.28840
9944	35354	090.0 070.0	2020-01-09 18:13:00	10	15.054	33.3427	24.68417
9978	35429	090.0 070.0	2020-07-16 23:40:00	10	17.114	33.6188	24.42921
10015	35500	090.0 070.0	2020-10-14 23:52:00	10	17.373	33.3346	24.14988
10037	35578	090.0 070.0	2021-01-21 13:36:00	10	15.092	33.4561	24.76328

348 rows × 7 columns

# extract day of year
CalCOFI_10["Datetime"].dt.dayofyear

       37
      98
     127
     164
    194
        ... 
   313
     9
   198
  288
   21
Name: Datetime, Length: 348, dtype: int32