Database access via python

Database access via python#

As mentioned in week 6, sometimes a database may provide a dedicated python package for accessing its data. The packages include in our JupyterHub environment for such purpose includes:

  • earthaccess: for downloading NASA arth science data.

  • sciencebasepy: for downloading certain data provided by the U.S. Geological Survey (USGS).

  • ndbc-api: for downloading data from the National Data Buoy Center (NDBC).

  • ooipy: for downloading Ocean Observatories Initiative (OOI) data.

  • cdsapi: for downloading data from the Climate Data Store (CDS).

  • argopy: for Argo data access.

  • owslib: for accessing data provided by the EMODnet OGC Web Services.

  • rda-apps-clients: for downloading data from the NSF NCAR Research Data Archive (RDA).

  • pyncei: for accessing data from NOAA’s Climate Data Online Web Services v2 API

  • pycmap: for accessing data from Simons Collaborative Marine Atlas Project (Simons CMAP).

Each of these packages has its own idiosyncrasy and it is impossible to cover all of them within reasonable space, so we will focus on pycmap to illustrate some of the steps you may need to perform to utilizes this kind of packages.

Example: Simons CMAP#

In order to access data programatically from the Simons CMAP, we first need to obtain an API key. To do so, we go to the Simons CMAP site and register as a user. Once registered, we can return to the starting page and click on “API Access”. You will then be prompted to login, after which you’ll be directed to a page where you can generate your own API key. Once generated, save the API key to somewhere safe (in my case, I stored the API key in a file called api_key.txt under the secrets subfolder).

Now we read the API key into the notebook (you could do so by pasting the string of API key to the notebook directly, but if you do so you should not make your notebook publicly accessible!). In my case, what I can do is (the details of this code block is not important, all it does it to read in the text stored in the api_key.txt file):

with open("secrets/api_key.txt", "r") as keyfile:
    api_key = keyfile.read().strip()

Now we import the pycmap and create an API instance using the API key already loaded:

import pycmap

api = pycmap.API(api_key)

The api object can then be used to get data from the Simons CMAP database. For example, suppose I am interested in the CTD data from the SCOPE Gradient 5 cruise. On the public webpage I see that the database table name is tblTN412_Gradients5_CTD, so I may execute:

ctd = api.get_dataset('tblTN412_Gradients5_CTD')
C:\Users\wingho\miniforge3\envs\learn\Lib\site-packages\pycmap\rest.py:334: FutureWarning: Passing literal json to 'read_json' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  df = pd.read_json(df['JSON_stats'][0])
C:\Users\wingho\miniforge3\envs\learn\Lib\site-packages\pycmap\rest.py:335: FutureWarning: Calling int on a single element Series is deprecated and will raise a TypeError in the future. Use int(ser.iloc[0]) instead
  rows = int(df.loc[['count'], 'lat'])

We note that the resulting ctd object is a pandas DataFrame:

type(ctd)
pandas.core.frame.DataFrame

In addition, the pycmap modules provide additional functionalities for subsetting data. For example, the Chlorophyll_REP dataset in Simons CMAP is a huge dataset with hundreds of billions of rows. Suppose we only need the data between Jun 1, 2015 and Jun 7, 2015, we can use the api.space_time() function to subset the dataset using two timestamps:

chl = api.space_time(
    table="tblCHL_REP", variable="chl", 
    dt1="2015-06-01 00:00:00", dt2="2015-06-07 23:59:59",
    lat1=-90, lat2=90, lon1=-180, lon2=180, 
    depth1=0, depth2=10000
)
chl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1036800 entries, 0 to 1036799
Data columns (total 4 columns):
 #   Column  Non-Null Count    Dtype  
---  ------  --------------    -----  
 0   time    1036800 non-null  object 
 1   lat     1036800 non-null  float64
 2   lon     1036800 non-null  float64
 3   chl     490223 non-null   float64
dtypes: float64(3), object(1)
memory usage: 31.6+ MB
display(chl)
time lat lon chl
0 2015-06-02T00:00:00 -89.875 -179.875 NaN
1 2015-06-02T00:00:00 -89.875 -179.625 NaN
2 2015-06-02T00:00:00 -89.875 -179.375 NaN
3 2015-06-02T00:00:00 -89.875 -179.125 NaN
4 2015-06-02T00:00:00 -89.875 -178.875 NaN
... ... ... ... ...
1036795 2015-06-02T00:00:00 89.875 178.875 NaN
1036796 2015-06-02T00:00:00 89.875 179.125 NaN
1036797 2015-06-02T00:00:00 89.875 179.375 NaN
1036798 2015-06-02T00:00:00 89.875 179.625 NaN
1036799 2015-06-02T00:00:00 89.875 179.875 NaN

1036800 rows × 4 columns