Earth System Grid Federation Data Access
The simplest way to search datasets programmatically within the Earth System Grid Federation (ESGF) is to use intake-esgf. It works like intake-esm, but instead of loading a static catalog, intake-esgf populates its catalog by querying ESGF index nodes. There is also a new graphical search interface called Metagrid, but it does not yet federate data from all ESGF nodes.
ESGF provides two different indexing mechanisms for locating data: the legacy Solr index and the new Globus index. ESGF is decommissioning the old Solr-based indexing service, so methods that rely on it might break. intake-esgf works with both, so it should allow for a graceful transition.
intake-esgf sends search queries to a list of index nodes and aggregates the results. You can configure which index nodes it queries, which is useful because some nodes are very slow to respond. For this notebook, we’ll query only Globus nodes to speed things up. To make sure you find all the available data, call intake_esgf.conf.set(all_indices=True) to search all nodes.
# NBVAL_IGNORE_OUTPUT
import intake_esgf
# Use only Globus index nodes to speed things up
intake_esgf.conf["solr_indices"] = {}
# Show the configuration
print(intake_esgf.conf)
# Initialize an empty ESGF catalog
cat = intake_esgf.ESGFCatalog()
# Launch a search query.
# Here we're looking for monthly near-surface relative humidity (`hurs`) in four CMIP6 SSP experiments.
# Results will be stored in a dictionary with keys defined by the `facets` argument.
cat.search(
project="CMIP6",
variable_id=["hurs"],
table_id="Amon",
experiment_id=["ssp126", "ssp245", "ssp370", "ssp585"],
)
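The query-and-aggregate behaviour described above can be pictured with a small, self-contained sketch. This is not intake-esgf's implementation: the node names, the `FAKE_NODES` records and the `federated_search` helper are all hypothetical, and real index nodes are queried over the network. It only illustrates the idea of fanning a query out to several nodes and merging the answers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for index nodes: each name maps to the dataset
# records that node would return. Real nodes are queried over the network.
FAKE_NODES = {
    "globus-node": [
        {"dataset_id": "CMIP6.CanESM5.ssp245.hurs", "data_node": "a.example.org"},
        {"dataset_id": "CMIP6.CanESM5.ssp585.hurs", "data_node": "a.example.org"},
    ],
    "solr-node": [
        {"dataset_id": "CMIP6.CanESM5.ssp245.hurs", "data_node": "b.example.org"},
        {"dataset_id": "CMIP6.MIROC6.ssp245.hurs", "data_node": "b.example.org"},
    ],
}


def query_node(name):
    """Pretend to query one index node and return its records."""
    return FAKE_NODES[name]


def federated_search(node_names):
    """Query every node concurrently and merge the results by dataset_id."""
    with ThreadPoolExecutor(max_workers=len(node_names)) as pool:
        per_node = list(pool.map(query_node, node_names))
    merged = {}
    for records in per_node:
        for rec in records:
            # Keep every data node that hosts a copy of this dataset.
            merged.setdefault(rec["dataset_id"], set()).add(rec["data_node"])
    return merged


results = federated_search(list(FAKE_NODES))
for dataset_id in sorted(results):
    print(dataset_id, sorted(results[dataset_id]))
```

In this toy setup, dropping the slow node from the list makes the search faster but hides the dataset that only that node knows about, which is the trade-off behind querying all indices.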
You can get a sense of what datasets are available with the model_groups method, which counts the number of unique combinations of source_id, member_id and grid_label.
Other useful methods are remove_ensembles, which keeps only one member_id per model group (the one with the smallest values for the four integers of its ripf code), and remove_incomplete, which filters model groups according to criteria you can define. See the docs for details.
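To make the "smallest ripf code" rule concrete, here is a minimal sketch of how such a selection could work. It assumes that the four integers are compared in order (r, then i, then p, then f) and that member IDs are plain `r<N>i<N>p<N>f<N>` strings; the helpers `ripf_key` and `smallest_member` are hypothetical and the library's own rule may differ in its details.

```python
import re

# Matches plain CMIP6 variant labels such as "r1i1p1f1".
RIPF = re.compile(r"r(\d+)i(\d+)p(\d+)f(\d+)")


def ripf_key(member_id):
    """Turn 'r10i1p2f1' into the tuple (10, 1, 2, 1) for numeric comparison."""
    match = RIPF.fullmatch(member_id)
    if match is None:
        raise ValueError(f"not a ripf code: {member_id!r}")
    return tuple(int(n) for n in match.groups())


def smallest_member(member_ids):
    """Pick the member whose (r, i, p, f) integers are smallest (assumed order)."""
    return min(member_ids, key=ripf_key)


members = ["r10i1p1f1", "r2i1p1f1", "r1i1p2f1", "r1i1p1f2"]
print(smallest_member(members))  # numeric comparison
print(min(members))              # plain string comparison, for contrast
```

Parsing the integers matters: a plain string comparison would rank `r10i1p1f1` ahead of `r2i1p1f1`.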
# NBVAL_IGNORE_OUTPUT
# Keep only one member per model, experiment and grid.
cat.remove_ensembles()
# Remove model groups that don't have the four SSPs.
cat.remove_incomplete(lambda df: len(df) == 4)
cat.model_groups()
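The completeness filter used above can be illustrated without any ESGF access. The sketch below uses plain dictionaries instead of the DataFrame the library passes to the callback, and the `filter_complete_groups` helper and the records are hypothetical. It only shows the grouping logic: rows are grouped by source_id, member_id and grid_label, and a group survives only if the user-supplied test accepts it.

```python
from collections import defaultdict

# Hypothetical search results: ModelA has all four SSPs, ModelB only two.
records = [
    {"source_id": "ModelA", "member_id": "r1i1p1f1", "grid_label": "gn", "experiment_id": exp}
    for exp in ("ssp126", "ssp245", "ssp370", "ssp585")
] + [
    {"source_id": "ModelB", "member_id": "r1i1p1f1", "grid_label": "gn", "experiment_id": exp}
    for exp in ("ssp245", "ssp585")
]


def filter_complete_groups(rows, is_complete):
    """Group rows by model group and keep only the groups that pass the test."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["source_id"], row["member_id"], row["grid_label"])
        groups[key].append(row)
    kept = []
    for group_rows in groups.values():
        if is_complete(group_rows):
            kept.extend(group_rows)
    return kept


# Same criterion as in the notebook: a group must contain four entries.
kept = filter_complete_groups(records, lambda group: len(group) == 4)
print(sorted({row["source_id"] for row in kept}))
```

A stricter test could compare the set of experiment_id values against the four expected SSPs instead of only counting rows.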
Now we’ll try to access some data. For small queries, a good approach is to stream the data rather than download entire files. Here we’ll just ask for simulations from CanESM5, and try to stream some data. Getting the file information can take some time.
# NBVAL_IGNORE_OUTPUT
# Let's focus the search on one single model to speed up the rest of the notebook
cat.search(
project="CMIP6",
source_id="CanESM5",
variable_id=["hurs"],
table_id="Amon",
experiment_id=["ssp126", "ssp245", "ssp370", "ssp585"],
)
cat.remove_ensembles()
# The `prefer_streaming` argument specifies that we'd rather not download entire files.
# When True, the `add_measures` argument triggers a search for variables that are referenced
# in the `cell_measures` attribute, such as `areacella` or `areacello`.
dsd = cat.to_dataset_dict(prefer_streaming=True, add_measures=False)
# NBVAL_IGNORE_OUTPUT
# Here the result is keyed by experiment_id.
dsd["ssp370"]["hurs"]
By default, the to_dataset_dict method downloads files locally. If you already hold datasets locally, you can specify the esg_dataroot in the configuration. You can also specify the local_cache where missing datasets will be downloaded.
Please check the documentation for more details on how to use to_dataset_dict.
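As a rough mental model of how esg_dataroot and local_cache interact, the sketch below resolves a file path against both kinds of location. It is an illustration under assumptions, not the library's code: the `resolve` helper, the file names and the lookup order (existing data roots first, then caches, with downloads going to the first cache) are all hypothetical.

```python
import tempfile
from pathlib import Path


def resolve(rel_path, esg_dataroot, local_cache):
    """Return (path, needs_download) for a dataset file.

    Assumed behaviour: data roots and caches are both checked for an
    existing copy; a missing file would be downloaded into the first cache.
    """
    for root in [*esg_dataroot, *local_cache]:
        candidate = Path(root) / rel_path
        if candidate.is_file():
            return candidate, False
    return Path(local_cache[0]) / rel_path, True


with tempfile.TemporaryDirectory() as tmp:
    dataroot = Path(tmp) / "dataroot"
    cache = Path(tmp) / "cache"
    cache.mkdir()

    # Pretend one file is already held locally under the data root.
    held = dataroot / "CMIP6" / "hurs_held.nc"
    held.parent.mkdir(parents=True)
    held.touch()

    found = resolve("CMIP6/hurs_held.nc", [dataroot], [cache])
    missing = resolve("CMIP6/hurs_missing.nc", [dataroot], [cache])
    print("held file needs download:", found[1])
    print("missing file needs download:", missing[1])
```

The point of the two settings is to avoid fetching again what is already on disk, while giving missing files a predictable place to land.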