climakitae.new_core.data_access package#

Submodules#

climakitae.new_core.data_access.boundaries module#

Lazy-loading boundaries module for ClimakitAE.

This module provides efficient access to geospatial boundary data for climate data subsetting and analysis. The Boundaries class implements lazy loading to minimize memory usage and improve startup performance by only loading datasets when they are first accessed.

The module supports several types of geographical boundaries:

  • US western states

  • California counties

  • California watersheds (HUC8 level)

  • California electric utilities (IOUs and POUs)

  • California electricity demand forecast zones

  • California electric balancing authority areas

All boundary data is sourced from S3-stored parquet files accessed through intake catalogs, providing fast and efficient data retrieval.
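The lazy-loading idea described above can be sketched in a few lines. This is a minimal illustration rather than the actual Boundaries implementation: `LazyTable`, its `get` method, and the loader callables are hypothetical stand-ins for the class and its catalog reads.

```python
from typing import Any, Callable, Dict


class LazyTable:
    """Hypothetical stand-in: loads each named dataset on first access only."""

    def __init__(self, loaders: Dict[str, Callable[[], Any]]) -> None:
        self._loaders = loaders            # name -> function that fetches the data
        self._cache: Dict[str, Any] = {}   # filled lazily, one entry per dataset
        self.load_count = 0                # counts real loads, for demonstration

    def get(self, name: str) -> Any:
        if name not in self._cache:        # first access: run the expensive load
            self._cache[name] = self._loaders[name]()
            self.load_count += 1
        return self._cache[name]           # later accesses reuse the cached object


tables = LazyTable({"counties": lambda: ["Alameda", "Alpine"]})
first = tables.get("counties")    # triggers the load
second = tables.get("counties")   # served from the cache
```

Both calls return the same object, and the loader runs only once. Nothing is fetched for datasets that are never requested.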

Classes#

Boundaries

Lazy-loading class for managing geospatial polygon data from S3 stored parquet catalogs. Provides cached lookup dictionaries and memory-efficient access to boundary datasets for geographic subsetting operations.

Examples

>>> import intake
>>> catalog = intake.open_catalog('boundaries.yaml')
>>> boundaries = Boundaries(catalog)
>>>
>>> # Get all boundary options for UI population
>>> boundary_options = boundaries.boundary_dict()
>>>
>>> # Access specific boundary data (loaded lazily)
>>> ca_counties = boundaries._ca_counties
>>>
>>> # Preload all data for performance-critical scenarios
>>> boundaries.preload_all()

class climakitae.new_core.data_access.boundaries.Boundaries(boundary_catalog: Catalog)#

Bases: object

Lazy-loading geospatial polygon data manager for ClimakitAE.

This class provides efficient access to various boundary datasets stored in S3 parquet catalogs. Data is loaded only when first accessed, improving memory usage and initialization performance. All lookup dictionaries are cached to avoid recomputation.

The class supports geographic subsetting for climate data analysis by providing access to administrative and utility boundaries in California and the western United States.

Parameters:

boundary_catalog (intake.catalog.Catalog) – Intake catalog instance for accessing boundary parquet files from S3

_cat#

Reference to the boundary catalog instance used for data access

Type:

intake.catalog.Catalog

Properties#
_us_states#

US western states with names, abbreviations, and geometries (lazy-loaded)

Type:

DataFrame

_ca_counties#

California counties with names and geometries, sorted alphabetically (lazy-loaded)

Type:

DataFrame

_ca_watersheds#

California HUC8 watersheds with names and geometries, sorted alphabetically (lazy-loaded)

Type:

DataFrame

_ca_utilities#

California electric utilities (IOUs and POUs) with names and geometries (lazy-loaded)

Type:

DataFrame

_ca_forecast_zones#

California electricity demand forecast zones with processed names (lazy-loaded)

Type:

DataFrame

_ca_electric_balancing_areas#

Electric balancing authority areas with filtered geometries (lazy-loaded)

Type:

DataFrame

boundary_dict() Dict[str, Dict[str, int]]#

Return dictionary of all boundary lookup dictionaries for UI population

preload_all() None#

Preload all boundary data for performance-critical scenarios

clear_cache() None#

Clear all cached data and lookup dictionaries to free memory

validate_catalog() None#

Validate that required catalog entries exist and are accessible

get_memory_usage() Dict[str, int | str]#

Get detailed memory usage information for loaded boundary datasets

load() None#

Deprecated method for backward compatibility - use preload_all() instead

Examples

Basic usage with lazy loading:

>>> import intake
>>> catalog = intake.open_catalog('boundaries.yaml')
>>> boundaries = Boundaries(catalog)
>>>
>>> # Data loads automatically when accessed
>>> counties = boundaries._ca_counties
>>> watersheds = boundaries._ca_watersheds

Getting boundary options for UI components:

>>> boundary_options = boundaries.boundary_dict()
>>> state_options = boundary_options['states']
>>> county_options = boundary_options['CA counties']

Performance optimization:

>>> # Preload all data if you know you'll need it
>>> boundaries.preload_all()
>>>
>>> # Check memory usage
>>> usage = boundaries.get_memory_usage()
>>> print(f"Total memory: {usage['total_human']}")

Memory management:

>>> # Clear cache to free memory
>>> boundaries.clear_cache()
>>>
>>> # Data will be reloaded on next access
>>> counties = boundaries._ca_counties

Notes

  • All boundary data is cached after first access for performance

  • The class automatically validates catalog structure on initialization

  • Processing includes sorting, filtering, and name standardization

  • Memory usage can be monitored and managed through provided methods

  • Western states are ordered according to WESTERN_STATES_LIST constant

  • Utilities are ordered with priority utilities first, then alphabetically

boundary_dict() Dict[str, Dict[str, int]]#

Return dictionary of all boundary lookup dictionaries for UI population.

Creates a comprehensive dictionary of all available boundary datasets with their corresponding lookup dictionaries. This is primarily used to populate user interface components that allow boundary selection for geographic subsetting of climate data.

The returned dictionary maps boundary category names to lookup dictionaries that map specific boundary names to their DataFrame indices. This enables efficient boundary selection and data subsetting operations.

Returns:

Dict[str, Dict[str, int]] – Nested dictionary structure:

  • Outer keys: boundary category names (e.g., ‘states’, ‘CA counties’)

  • Inner dictionaries: map boundary names to DataFrame indices

Available categories:

  • ‘none’: No geographic subsetting

  • ‘lat/lon’: Custom coordinate-based selection

  • ‘states’: Western US states

  • ‘CA counties’: California counties (alphabetical)

  • ‘CA watersheds’: California HUC8 watersheds (alphabetical)

  • ‘CA Electric Load Serving Entities (IOU & POU)’: Electric utilities

  • ‘CA Electricity Demand Forecast Zones’: Forecast zones

  • ‘CA Electric Balancing Authority Areas’: Balancing areas

Examples

>>> boundaries = Boundaries(catalog)
>>> boundary_options = boundaries.boundary_dict()
>>>
>>> # Get available states
>>> states = boundary_options['states']
>>> print(list(states.keys()))  # e.g. ['CA', 'OR', 'WA', ...]
>>>
>>> # Get available counties
>>> counties = boundary_options['CA counties']
>>> alameda_idx = counties['Alameda']
>>>
>>> # Use in UI dropdown population
>>> for category, options in boundary_options.items():
...     populate_dropdown(category, options.keys())

Notes

  • Lookup dictionaries are cached for performance

  • Western states follow ordering in WESTERN_STATES_LIST

  • Utilities are ordered with priority utilities first

  • All other boundaries are sorted alphabetically
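The nested structure returned by boundary_dict() can be pictured with a small sketch. The helper `build_lookup` and the abbreviated name lists are hypothetical; the list position stands in for the DataFrame index used by the real class.

```python
from typing import Dict, List


def build_lookup(names: List[str]) -> Dict[str, int]:
    """Map each boundary name to its row position (a stand-in for a DataFrame index)."""
    return {name: idx for idx, name in enumerate(names)}


# Hypothetical, abbreviated data; the real categories hold many more entries.
boundary_options = {
    "states": build_lookup(["CA", "NV", "OR"]),
    "CA counties": build_lookup(sorted(["Alpine", "Alameda", "Amador"])),
}

# Selecting a boundary is then a two-level dictionary lookup.
alameda_idx = boundary_options["CA counties"]["Alameda"]
```

Because the county names are sorted before indexing, 'Alameda' resolves to position 0 in this sketch.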

clear_cache() None#

Clear all cached data and lookup dictionaries to free memory.

Removes all loaded boundary DataFrames and lookup dictionaries from memory, returning the Boundaries instance to its initial state. Data will be reloaded on next access through the lazy loading mechanism.

This is useful for:

  • Memory management in long-running applications

  • Forcing fresh data loads after catalog updates

  • Resetting state during testing or debugging

Examples

>>> boundaries = Boundaries(catalog)
>>> boundaries.preload_all()
>>> usage_before = boundaries.get_memory_usage()
>>> print(f"Memory before: {usage_before['total_human']}")
>>>
>>> boundaries.clear_cache()
>>> usage_after = boundaries.get_memory_usage()
>>> print(f"Memory after: {usage_after['total_human']}")  # Much lower
>>>
>>> # Data loads again on next access
>>> counties = boundaries._ca_counties  # Triggers reload

Notes

  • All subsequent data access will trigger fresh loads from catalog

  • Lookup dictionaries will be rebuilt as needed

  • Does not affect the underlying catalog or data sources

  • Memory savings are immediate and substantial for loaded datasets
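The clear-then-reload cycle can be sketched as follows. `CachedBoundaries` is a hypothetical miniature with a single dataset, written only to show the pattern; it is not the library's code.

```python
from typing import Callable, Dict, List, Optional


class CachedBoundaries:
    """Hypothetical miniature of a lazy loader whose cache can be cleared."""

    def __init__(self, loader: Callable[[], List[str]]) -> None:
        self._loader = loader
        self._counties: Optional[List[str]] = None   # None marks "not loaded yet"
        self._lookups: Dict[str, Dict[str, int]] = {}
        self.loads = 0                               # counts real loads

    @property
    def counties(self) -> List[str]:
        if self._counties is None:
            self._counties = self._loader()
            self.loads += 1
        return self._counties

    def clear_cache(self) -> None:
        # Drop the data and any derived lookups; the loader itself is untouched.
        self._counties = None
        self._lookups.clear()


b = CachedBoundaries(lambda: ["Alameda", "Alpine"])
_ = b.counties     # first load
b.clear_cache()    # back to the initial, empty state
_ = b.counties     # second load, triggered by the next access
```

After the sequence above the loader has run twice, once before and once after the cache was cleared.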

get_memory_usage() Dict[str, int | str]#

Get detailed memory usage information for loaded boundary datasets.

Analyzes memory consumption of all loaded boundary DataFrames and provides both detailed per-dataset usage and summary statistics. Useful for memory monitoring and optimization decisions.

Returns:

Dict[str, Union[int, str]] – Comprehensive memory usage information:

Per-dataset usage (bytes):

  • ‘us_states’: Memory used by US states DataFrame (0 if not loaded)

  • ‘ca_counties’: Memory used by CA counties DataFrame (0 if not loaded)

  • ‘ca_watersheds’: Memory used by CA watersheds DataFrame (0 if not loaded)

  • ‘ca_utilities’: Memory used by CA utilities DataFrame (0 if not loaded)

  • ‘ca_forecast_zones’: Memory used by forecast zones DataFrame (0 if not loaded)

  • ‘ca_electric_balancing_areas’: Memory used by balancing areas DataFrame (0 if not loaded)

Summary statistics:

  • ‘total_bytes’: Total memory usage in bytes

  • ‘total_human’: Human-readable total memory usage (e.g., ‘15.2 MB’)

  • ‘loaded_datasets’: Count of currently loaded datasets

  • ‘cached_lookups’: Count of cached lookup dictionaries

Examples

>>> boundaries = Boundaries(catalog)
>>> boundaries.preload_all()
>>> usage = boundaries.get_memory_usage()
>>>
>>> print(f"Total memory: {usage['total_human']}")
>>> print(f"Loaded datasets: {usage['loaded_datasets']}/6")
>>> print(f"Larger of states/counties: {max(usage['us_states'], usage['ca_counties'])} bytes")
>>>
>>> # Check if specific dataset is loaded
>>> if usage['ca_counties'] > 0:
...     print("Counties data is loaded")
>>>
>>> # Monitor memory before/after operations
>>> usage_before = boundaries.get_memory_usage()
>>> boundaries.clear_cache()
>>> usage_after = boundaries.get_memory_usage()
>>> saved = usage_before['total_bytes'] - usage_after['total_bytes']
>>> print(f"Memory freed: {boundaries._format_bytes(saved)}")

Notes

  • Memory usage includes deep analysis of DataFrame contents

  • Unloaded datasets report 0 bytes usage

  • Lookup dictionary cache usage is counted separately

  • Total includes all loaded DataFrames but not lookup dictionaries
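As an illustration of how such a usage report might be assembled, here is a sketch under stated assumptions: `format_bytes` is a hypothetical helper that uses 1024-based units (the library's own formatting may differ), and the byte counts are made up.

```python
def format_bytes(num_bytes: int) -> str:
    """Render a byte count with binary (1024-based) units, one decimal place."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TB"


# Hypothetical per-dataset byte counts; 0 means "not loaded".
per_dataset = {"us_states": 2048, "ca_counties": 1_572_864, "ca_watersheds": 0}

total = sum(per_dataset.values())
usage = {
    **per_dataset,
    "total_bytes": total,
    "total_human": format_bytes(total),
    "loaded_datasets": sum(1 for n in per_dataset.values() if n > 0),
}
```

Here the total is 1,574,912 bytes, which the helper renders as '1.5 MB', and two of the three datasets count as loaded.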

load() None#

Preload all boundary data (deprecated - data loads automatically when accessed).

This method is kept for backward compatibility. Data now loads automatically when first accessed through the property system.

Deprecated#

This method is deprecated as of version X.X.X. Use preload_all() instead for explicit preloading, or simply access data normally for automatic lazy loading.

preload_all() None#

Preload all boundary data for performance-critical scenarios.

Forces immediate loading of all boundary datasets and builds all lookup caches. This eliminates lazy loading delays for subsequent data access operations, making it ideal for performance-critical scenarios or when you know all boundary data will be needed.

The method loads all six boundary datasets:

  • US western states

  • California counties

  • California watersheds

  • California utilities

  • California forecast zones

  • California electric balancing areas

It also builds all corresponding lookup dictionaries for fast boundary selection operations.

Examples

>>> boundaries = Boundaries(catalog)
>>>
>>> # Preload for performance-critical batch processing
>>> boundaries.preload_all()
>>>
>>> # All subsequent access is now immediate
>>> for county in boundaries._ca_counties.itertuples():
...     process_county_data(county)

Notes

  • Increases initial memory usage but eliminates loading delays

  • Useful for batch processing or repeated boundary access

  • Data remains cached until clear_cache() is called

  • Memory usage can be monitored with get_memory_usage()

validate_catalog() None#

Validate that required catalog entries exist and are accessible.

Checks for the presence of all required boundary datasets in the catalog. This ensures that the boundary data can be loaded when requested by the user.

Raises:

ValueError – If any required catalog entries are missing. The error message will list all missing entries.

Notes

Required catalog entries:

  • ‘states’: US state boundaries

  • ‘counties’: California county boundaries

  • ‘huc8’: California watershed boundaries (HUC8 level)

  • ‘utilities’: California electric utility boundaries

  • ‘dfz’: California demand forecast zones

  • ‘eba’: Electric balancing authority areas
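A validation of this kind can be sketched as below. The entry names come from the list above; the functions `find_missing` and `validate` and the wording of the error message are assumptions made for illustration.

```python
from typing import Iterable, List

REQUIRED_ENTRIES = ["states", "counties", "huc8", "utilities", "dfz", "eba"]


def find_missing(available: Iterable[str]) -> List[str]:
    """Return the required entries that are absent, in their documented order."""
    available_set = set(available)
    return [name for name in REQUIRED_ENTRIES if name not in available_set]


def validate(available: Iterable[str]) -> None:
    """Raise ValueError naming every missing entry, not just the first one."""
    missing = find_missing(available)
    if missing:
        raise ValueError(f"Missing required catalog entries: {missing}")


try:
    validate(["states", "counties", "huc8"])   # three entries are absent
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
```

Collecting all missing names before raising lets a user fix the catalog in one pass instead of discovering the gaps one error at a time.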

climakitae.new_core.data_access.data_access module#

Data access module for ClimakitAE.

This module provides a singleton DataCatalog class for managing connections to various climate data catalogs including boundary, renewables, and general climate datasets. The DataCatalog class offers a unified interface for accessing and querying multiple intake catalogs with support for method chaining and dynamic catalog management.

Classes#

DataCatalog

Singleton class that inherits from dict and manages catalog connections. Provides properties for accessing specific catalogs and methods for querying and retrieving climate datasets.

class climakitae.new_core.data_access.data_access.DataCatalog#

Bases: dict

Singleton class for managing catalog connections to climate data sources.

This class implements the singleton pattern and inherits from dict to provide a unified interface for accessing multiple climate data catalogs. It manages connections to boundary, renewables, and general climate datasets through intake and intake-esm catalogs, offering convenient properties and methods for data querying and retrieval.

The class automatically initializes connections to predefined catalogs and supports dynamic addition of new catalogs. Method chaining is supported for fluent API usage.

catalog_key#

The currently selected catalog key for data operations. Defaults to UNSET until explicitly set via set_catalog_key().

Type:

str or UNSET

Properties#
data#

Access to the main climate data catalog.

Type:

intake_esm.core.esm_datastore

boundary#

Access to the boundary catalog.

Type:

intake.catalog.Catalog

boundaries#

Access to the lazy-loading boundaries data manager.

Type:

Boundaries

renewables#

Access to the renewables data catalog.

Type:

intake_esm.core.esm_datastore

set_catalog_key(key)#

Set the active catalog for subsequent operations.

set_catalog(name, catalog)#

Add a new catalog to the collection.

get_data(query)#

Retrieve data from the active catalog using query parameters.

Notes

This class implements the singleton pattern, ensuring only one instance exists throughout the application lifecycle. Multiple calls to DataCatalog() will return the same instance.

The class automatically handles catalog initialization and provides sensible defaults when invalid catalog keys are specified.

static __new__(cls) DataCatalog#

Override __new__ to implement singleton pattern.

Returns:

DataCatalog – The singleton instance of DataCatalog.
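A common way to combine a dict subclass with the singleton pattern is sketched below. `SingletonDict` is a hypothetical class written for illustration; the actual DataCatalog may differ in detail.

```python
class SingletonDict(dict):
    """Hypothetical dict subclass of which only one instance is ever created."""

    _instance = None

    def __new__(cls):
        # Create the instance on the first call, then keep handing it back.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance


a = SingletonDict()
a["data"] = "catalog-handle"
b = SingletonDict()   # same object as `a`, contents preserved
```

Note that Python still calls `__init__` on every `SingletonDict()` call, so any one-time setup in a real implementation needs its own guard against re-running.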

property boundaries: Boundaries#

Access boundaries data with lazy loading.

Returns:

Boundaries – The lazy-loading boundaries data manager.

property boundary: Catalog#

Access boundary catalog.

Returns:

intake.catalog.Catalog – The boundary catalog.

property data: esm_datastore#

Access data catalog.

Returns:

intake_esm.core.esm_datastore – The main climate data catalog.

get_data(query: Dict[str, Any]) Dict[str, Dataset]#

Get data from the catalog.

This method queries the active catalog using the provided parameters and returns the matching datasets as a dictionary.

Parameters:

query (dict) – Query parameters for filtering data. The available parameters depend on the active catalog and may include keys such as ‘variable’, ‘scenario’, and ‘model’.

Returns:

dict[str, xr.Dataset] – The requested dataset(s) from the catalog, keyed by dataset identifiers.

Notes

The catalog_key must be set before calling this method. If not set, this will raise an error.

list_clip_boundaries() dict[str, list[str]]#

List all available boundary options for clipping operations.

This method populates the available_boundaries attribute with a dictionary of boundary categories and their available options. It’s a convenience method that provides direct access to boundary options without needing to instantiate a Clip processor.

Notes

After calling this method, the available boundaries can be accessed via the available_boundaries attribute.

Examples

>>> catalog = DataCatalog()
>>> catalog.list_clip_boundaries()
>>> print(catalog.available_boundaries["states"])
['AZ', 'CA', 'CO', 'ID', 'MT', 'NV', 'NM', 'OR', 'UT', 'WA', 'WY']

merge_catalogs() DataFrame#

Merge the intake catalogs for data and renewables into a single DataFrame.

This method combines the data and renewables catalogs into a unified DataFrame for easier searching and querying across all available datasets.

Returns:

DataFrame – A DataFrame containing the merged data from both catalogs with an additional ‘catalog’ column identifying the source catalog.
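The tagging step can be illustrated without any dependencies. The real method returns a DataFrame; this sketch uses plain lists of dictionaries instead, and the function `merge_rows`, the column name `variable_id`, and the sample values are all hypothetical.

```python
from typing import Dict, List

Row = Dict[str, str]


def merge_rows(catalogs: Dict[str, List[Row]]) -> List[Row]:
    """Concatenate rows from several catalogs, tagging each with its source."""
    merged: List[Row] = []
    for catalog_name, rows in catalogs.items():
        for row in rows:
            # Copy the row and add the 'catalog' column identifying its origin.
            merged.append({**row, "catalog": catalog_name})
    return merged


merged = merge_rows({
    "data": [{"variable_id": "tas"}],
    "renewables": [{"variable_id": "cf"}],
})
```

Each merged row keeps its original fields and gains a 'catalog' value, so a single search can span both sources while still showing where a match came from.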

print_clip_boundaries() None#

Print all available boundary options for clipping in a user-friendly format.

This method provides a nicely formatted output showing all boundary categories and their available options for clipping operations. The output is formatted to be readable and includes summarized counts for categories with many options.

Examples

>>> catalog = DataCatalog()
>>> catalog.print_clip_boundaries()
Available Boundary Options for Clipping:
========================================
states:
  • AZ, CA, CO, ID, MT … and 6 more options
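The truncated line in the output above can be reproduced with a short helper. `summarize` and its cutoff of five items are assumptions inferred from the sample output, not the library's code.

```python
from typing import List


def summarize(options: List[str], max_shown: int = 5) -> str:
    """Join options, truncating long lists with a count of the remainder."""
    if len(options) <= max_shown:
        return ", ".join(options)
    shown = ", ".join(options[:max_shown])
    return f"{shown} … and {len(options) - max_shown} more options"


states = ["AZ", "CA", "CO", "ID", "MT", "NV", "NM", "OR", "UT", "WA", "WY"]
line = summarize(states)
```

With the eleven western states as input, five are shown and the remaining six are summarized as a count.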

property renewables: esm_datastore#

Access renewables catalog.

Returns:

intake_esm.core.esm_datastore – The renewables data catalog.

reset() None#

Reset the DataCatalog instance to its initial state.

This method clears the catalog key and resets the instance to its original state. The catalogs themselves remain loaded and available.

set_catalog(name: str, catalog: str) DataCatalog#

Set a named catalog.

Parameters:
  • name (str) – Name of the catalog to set.

  • catalog (str) – URL or path to the catalog file.

Returns:

DataCatalog – The current instance of DataCatalog allowing method chaining.

set_catalog_key(key: str) DataCatalog#

Set the catalog key for accessing a specific catalog.

Parameters:

key (str) – Key of the catalog to set. Must be one of the available catalog keys.

Returns:

DataCatalog – The current instance of DataCatalog allowing method chaining.

Warns:

UserWarning – If the catalog key is not found in the available catalogs. Defaults to ‘data’ catalog in this case.
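The fallback-with-warning behaviour and the method chaining it enables can be sketched as follows. `MiniCatalog` is a hypothetical stand-in: its stored values, its warning text, its parameterless `get_data`, and the use of `None` in place of the UNSET sentinel are assumptions.

```python
import warnings


class MiniCatalog(dict):
    """Hypothetical stand-in showing the fallback and chaining described above."""

    def __init__(self) -> None:
        super().__init__(data="main", renewables="renewables")
        self.catalog_key = None   # stand-in for the UNSET sentinel

    def set_catalog_key(self, key: str) -> "MiniCatalog":
        if key not in self:
            warnings.warn(
                f"Catalog key {key!r} not found; defaulting to 'data'.",
                UserWarning,
            )
            key = "data"
        self.catalog_key = key
        return self               # returning self is what enables chaining

    def get_data(self) -> str:
        return self[self.catalog_key]


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = MiniCatalog().set_catalog_key("no-such-catalog").get_data()
```

The unknown key produces one UserWarning and the chained call proceeds against the 'data' entry.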

Module contents#

Data access module for the new core functionality.