climakitae.new_core.data_access package#
Submodules#
climakitae.new_core.data_access.boundaries module#
Lazy-loading boundaries module for ClimakitAE.
This module provides efficient access to geospatial boundary data for climate data subsetting and analysis. The Boundaries class implements lazy loading to minimize memory usage and improve startup performance by only loading datasets when they are first accessed.
The module supports various types of geographical boundaries, including:
- US western states
- California counties
- California watersheds (HUC8 level)
- California electric utilities (IOUs and POUs)
- California electricity demand forecast zones
- California electric balancing authority areas
All boundary data is sourced from S3-stored parquet files accessed through intake catalogs, providing fast and efficient data retrieval.
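The lazy-loading behavior described above can be illustrated with a minimal, dependency-free sketch. `LazyBoundaries`, `fake_loader`, and the in-memory table are hypothetical stand-ins, not the ClimakitAE implementation, which reads parquet files through an intake catalog.

```python
from typing import Callable, Dict, List, Optional


class LazyBoundaries:
    """Hypothetical stand-in that loads a dataset only on first access."""

    def __init__(self, loader: Callable[[str], List[str]]) -> None:
        self._loader = loader  # called at most once per dataset
        self._counties: Optional[List[str]] = None
        self.load_calls = 0  # bookkeeping for the demonstration only

    @property
    def counties(self) -> List[str]:
        # Load on first access, then serve the cached copy.
        if self._counties is None:
            self.load_calls += 1
            self._counties = sorted(self._loader("counties"))
        return self._counties


def fake_loader(name: str) -> List[str]:
    # Assumption: stands in for reading a parquet file from S3.
    tables: Dict[str, List[str]] = {"counties": ["Yolo", "Alameda", "Kern"]}
    return tables[name]


lazy = LazyBoundaries(fake_loader)
calls_before = lazy.load_calls  # nothing is loaded at construction
first = lazy.counties           # triggers the single load
second = lazy.counties          # served from the cache
```

Constructing the object costs nothing beyond storing the loader; the cost of reading data is paid once, at the first property access.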
Classes#
- Boundaries
Lazy-loading class for managing geospatial polygon data from S3 stored parquet catalogs. Provides cached lookup dictionaries and memory-efficient access to boundary datasets for geographic subsetting operations.
Examples
>>> import intake
>>> from climakitae.new_core.data_access.boundaries import Boundaries
>>> catalog = intake.open_catalog('boundaries.yaml')
>>> boundaries = Boundaries(catalog)
>>>
>>> # Get all boundary options for UI population
>>> boundary_options = boundaries.boundary_dict()
>>>
>>> # Access specific boundary data (loaded lazily)
>>> ca_counties = boundaries._ca_counties
>>>
>>> # Preload all data for performance-critical scenarios
>>> boundaries.preload_all()
- class climakitae.new_core.data_access.boundaries.Boundaries(boundary_catalog: Catalog)#
Bases: object
Lazy-loading geospatial polygon data manager for ClimakitAE.
This class provides efficient access to various boundary datasets stored in S3 parquet catalogs. Data is loaded only when first accessed, improving memory usage and initialization performance. All lookup dictionaries are cached to avoid recomputation.
The class supports geographic subsetting for climate data analysis by providing access to various administrative and utility boundaries in California and the western United States. All data access is optimized for memory efficiency through lazy loading and intelligent caching.
- Parameters:
boundary_catalog (intake.catalog.Catalog) – Intake catalog instance for accessing boundary parquet files from S3
- _cat#
Reference to the boundary catalog instance used for data access
- Type:
intake.catalog.Catalog
Properties
- _us_states#
US western states with names, abbreviations, and geometries (lazy-loaded)
- Type:
- _ca_counties#
California counties with names and geometries, sorted alphabetically (lazy-loaded)
- Type:
- _ca_watersheds#
California HUC8 watersheds with names and geometries, sorted alphabetically (lazy-loaded)
- Type:
- _ca_utilities#
California electric utilities (IOUs and POUs) with names and geometries (lazy-loaded)
- Type:
- _ca_forecast_zones#
California electricity demand forecast zones with processed names (lazy-loaded)
- Type:
- _ca_electric_balancing_areas#
Electric balancing authority areas with filtered geometries (lazy-loaded)
- Type:
- boundary_dict() → Dict[str, Dict[str, int]]#
Return dictionary of all boundary lookup dictionaries for UI population
- get_memory_usage() → Dict[str, int | str]#
Get detailed memory usage information for loaded boundary datasets
Examples
Basic usage with lazy loading:
>>> import intake
>>> from climakitae.new_core.data_access.boundaries import Boundaries
>>> catalog = intake.open_catalog('boundaries.yaml')
>>> boundaries = Boundaries(catalog)
>>>
>>> # Data loads automatically when accessed
>>> counties = boundaries._ca_counties
>>> watersheds = boundaries._ca_watersheds
Getting boundary options for UI components:
>>> boundary_options = boundaries.boundary_dict()
>>> state_options = boundary_options['states']
>>> county_options = boundary_options['CA counties']
Performance optimization:
>>> # Preload all data if you know you'll need it
>>> boundaries.preload_all()
>>>
>>> # Check memory usage
>>> usage = boundaries.get_memory_usage()
>>> print(f"Total memory: {usage['total_human']}")
Memory management:
>>> # Clear cache to free memory
>>> boundaries.clear_cache()
>>>
>>> # Data will be reloaded on next access
>>> counties = boundaries._ca_counties
Notes
All boundary data is cached after first access for performance
The class automatically validates catalog structure on initialization
Processing includes sorting, filtering, and name standardization
Memory usage can be monitored and managed through provided methods
Western states are ordered according to WESTERN_STATES_LIST constant
Utilities are ordered with priority utilities first, then alphabetically
- boundary_dict() → Dict[str, Dict[str, int]]#
Return dictionary of all boundary lookup dictionaries for UI population.
Creates a comprehensive dictionary of all available boundary datasets with their corresponding lookup dictionaries. This is primarily used to populate user interface components that allow boundary selection for geographic subsetting of climate data.
The returned dictionary maps boundary category names to lookup dictionaries that map specific boundary names to their DataFrame indices. This enables efficient boundary selection and data subsetting operations.
- Returns:
Dict[str, Dict[str, int]] – Nested dictionary structure:
- Outer keys: boundary category names (e.g., ‘states’, ‘CA counties’)
- Inner dictionaries: map boundary names to DataFrame indices
Available categories:
- ‘none’: No geographic subsetting
- ‘lat/lon’: Custom coordinate-based selection
- ‘states’: Western US states
- ‘CA counties’: California counties (alphabetical)
- ‘CA watersheds’: California HUC8 watersheds (alphabetical)
- ‘CA Electric Load Serving Entities (IOU & POU)’: Electric utilities
- ‘CA Electricity Demand Forecast Zones’: Forecast zones
- ‘CA Electric Balancing Authority Areas’: Balancing areas
Examples
>>> boundaries = Boundaries(catalog)
>>> boundary_options = boundaries.boundary_dict()
>>>
>>> # Get available states
>>> states = boundary_options['states']
>>> print(list(states.keys()))  # e.g. ['CA', 'OR', 'WA', ...]
>>>
>>> # Get available counties
>>> counties = boundary_options['CA counties']
>>> alameda_idx = counties['Alameda']
>>>
>>> # Use in UI dropdown population (populate_dropdown is a placeholder for your own UI code)
>>> for category, options in boundary_options.items():
...     populate_dropdown(category, options.keys())
Notes
Lookup dictionaries are cached for performance
Western states follow ordering in WESTERN_STATES_LIST
Utilities are ordered with priority utilities first
All other boundaries are sorted alphabetically
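The ordering rules in these notes can be sketched without any geospatial dependencies. The constants `STATE_ORDER` and `PRIORITY_UTILITIES`, the utility names, and the use of positional indices are illustrative assumptions; the real `WESTERN_STATES_LIST`, the priority utilities, and the DataFrame indices are defined inside ClimakitAE.

```python
from typing import Dict, List

# Assumption: illustrative values only, not the ClimakitAE constants.
STATE_ORDER: List[str] = ["CA", "NV", "OR"]
PRIORITY_UTILITIES: List[str] = ["Utility B"]


def build_lookup(names: List[str]) -> Dict[str, int]:
    """Map each boundary name to its row position, preserving order."""
    return {name: idx for idx, name in enumerate(names)}


def order_states(names: List[str]) -> List[str]:
    """Keep the listed states in the order given by STATE_ORDER."""
    return [state for state in STATE_ORDER if state in names]


def order_utilities(names: List[str]) -> List[str]:
    """Place priority utilities first, then the rest alphabetically."""
    priority = [u for u in PRIORITY_UTILITIES if u in names]
    rest = sorted(u for u in names if u not in PRIORITY_UTILITIES)
    return priority + rest


boundary_options = {
    "states": build_lookup(order_states(["OR", "CA", "NV"])),
    "CA counties": build_lookup(sorted(["Yolo", "Alameda"])),
    "CA Electric Load Serving Entities (IOU & POU)": build_lookup(
        order_utilities(["Utility C", "Utility A", "Utility B"])
    ),
}
```

Because Python dictionaries preserve insertion order, iterating over each inner dictionary yields the names in the order a dropdown would display them.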
- clear_cache() → None#
Clear all cached data and lookup dictionaries to free memory.
Removes all loaded boundary DataFrames and lookup dictionaries from memory, returning the Boundaries instance to its initial state. Data will be reloaded on next access through the lazy loading mechanism.
This is useful for:
- Memory management in long-running applications
- Forcing fresh data loads after catalog updates
- Resetting state during testing or debugging
Examples
>>> boundaries = Boundaries(catalog)
>>> boundaries.preload_all()
>>> usage_before = boundaries.get_memory_usage()
>>> print(f"Memory before: {usage_before['total_human']}")
>>>
>>> boundaries.clear_cache()
>>> usage_after = boundaries.get_memory_usage()
>>> print(f"Memory after: {usage_after['total_human']}")  # Much lower
>>>
>>> # Data loads again on next access
>>> counties = boundaries._ca_counties  # Triggers reload
Notes
All subsequent data access will trigger fresh loads from catalog
Lookup dictionaries will be rebuilt as needed
Does not affect the underlying catalog or data sources
Memory savings are immediate and substantial for loaded datasets
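A minimal sketch of the clear-and-reload cycle follows, assuming the cached datasets and lookup dictionaries live in two plain dictionaries. `CachedStore` and its attribute names are hypothetical; the source does not describe how `Boundaries` stores its cache internally.

```python
from typing import Dict, List


class CachedStore:
    """Hypothetical cache holder; not the ClimakitAE Boundaries class."""

    def __init__(self) -> None:
        self._data: Dict[str, List[str]] = {}          # loaded datasets
        self._lookups: Dict[str, Dict[str, int]] = {}  # derived lookups

    def get(self, name: str) -> List[str]:
        # Load on demand; the list literal stands in for a catalog read.
        if name not in self._data:
            self._data[name] = ["Alameda", "Kern", "Yolo"]
        return self._data[name]

    def lookup(self, name: str) -> Dict[str, int]:
        # Build the name-to-index mapping once and reuse it afterwards.
        if name not in self._lookups:
            self._lookups[name] = {n: i for i, n in enumerate(self.get(name))}
        return self._lookups[name]

    def clear_cache(self) -> None:
        # Drop both caches; the underlying data source is untouched.
        self._data.clear()
        self._lookups.clear()


cache = CachedStore()
cache.lookup("counties")
loaded_before = len(cache._data)
cache.clear_cache()
loaded_after = len(cache._data)
reloaded = cache.get("counties")  # the next access repopulates the cache
```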
- get_memory_usage() → Dict[str, int | str]#
Get detailed memory usage information for loaded boundary datasets.
Analyzes memory consumption of all loaded boundary DataFrames and provides both detailed per-dataset usage and summary statistics. Useful for memory monitoring and optimization decisions.
- Returns:
Dict[str, Union[int, str]] – Comprehensive memory usage information:
Per-dataset usage (bytes):
- ‘us_states’: Memory used by US states DataFrame (0 if not loaded)
- ‘ca_counties’: Memory used by CA counties DataFrame (0 if not loaded)
- ‘ca_watersheds’: Memory used by CA watersheds DataFrame (0 if not loaded)
- ‘ca_utilities’: Memory used by CA utilities DataFrame (0 if not loaded)
- ‘ca_forecast_zones’: Memory used by forecast zones DataFrame (0 if not loaded)
- ‘ca_electric_balancing_areas’: Memory used by balancing areas DataFrame (0 if not loaded)
Summary statistics:
- ‘total_bytes’: Total memory usage in bytes
- ‘total_human’: Human-readable total memory usage (e.g., ‘15.2 MB’)
- ‘loaded_datasets’: Count of currently loaded datasets
- ‘cached_lookups’: Count of cached lookup dictionaries
Examples
>>> boundaries = Boundaries(catalog)
>>> boundaries.preload_all()
>>> usage = boundaries.get_memory_usage()
>>>
>>> print(f"Total memory: {usage['total_human']}")
>>> print(f"Loaded datasets: {usage['loaded_datasets']}/6")
>>> print(f"Largest dataset: {max(usage['us_states'], usage['ca_counties'])}")
>>>
>>> # Check if specific dataset is loaded
>>> if usage['ca_counties'] > 0:
...     print("Counties data is loaded")
>>>
>>> # Monitor memory before/after operations
>>> usage_before = boundaries.get_memory_usage()
>>> boundaries.clear_cache()
>>> usage_after = boundaries.get_memory_usage()
>>> saved = usage_before['total_bytes'] - usage_after['total_bytes']
>>> print(f"Memory freed: {boundaries._format_bytes(saved)}")
Notes
Memory usage includes deep analysis of DataFrame contents
Unloaded datasets report 0 bytes usage
Lookup dictionary cache usage is counted separately
Total includes all loaded DataFrames but not lookup dictionaries
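The shape of the returned report can be sketched as follows. This is an illustration under stated assumptions: `format_bytes` and `memory_report` are hypothetical helpers, byte strings stand in for DataFrames, and `sys.getsizeof` is swapped in for whatever deep DataFrame measurement the library performs.

```python
import sys
from typing import Dict, Optional, Union


def format_bytes(num_bytes: float) -> str:
    """Render a byte count as a short human-readable string."""
    for unit in ("B", "KB", "MB", "GB"):
        if num_bytes < 1024:
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024
    return f"{num_bytes:.1f} TB"


def memory_report(
    datasets: Dict[str, Optional[bytes]],
) -> Dict[str, Union[int, str]]:
    """Report per-dataset bytes (0 if not loaded) plus summary fields."""
    report: Dict[str, Union[int, str]] = {}
    total = 0
    loaded = 0
    for name, payload in datasets.items():
        # None marks a dataset that has not been loaded yet.
        size = sys.getsizeof(payload) if payload is not None else 0
        report[name] = size
        total += size
        loaded += payload is not None
    report["total_bytes"] = total
    report["total_human"] = format_bytes(total)
    report["loaded_datasets"] = loaded
    return report


usage = memory_report({"us_states": b"x" * 2048, "ca_counties": None})
```

Reporting 0 for unloaded datasets lets callers test `usage[name] > 0` to find out whether a dataset is resident without triggering a load.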
- load() → None#
Preload all boundary data (deprecated - data loads automatically when accessed).
This method is kept for backward compatibility. Data now loads automatically when first accessed through the property system.
Deprecated#
This method is deprecated as of version X.X.X. Use preload_all() instead for explicit preloading, or simply access data normally for automatic lazy loading.
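One common way to keep a deprecated method working is to emit a warning and delegate to its replacement. The sketch below shows that pattern with a hypothetical `LegacyStore` class; whether ClimakitAE's `load()` emits a `DeprecationWarning` is an assumption, not something the source states.

```python
import warnings


class LegacyStore:
    """Hypothetical class showing a deprecated alias for a newer method."""

    def __init__(self) -> None:
        self.preloaded = False

    def preload_all(self) -> None:
        # Stands in for loading every dataset up front.
        self.preloaded = True

    def load(self) -> None:
        # Assumption: warn the caller, then delegate to the replacement.
        warnings.warn(
            "load() is deprecated; use preload_all() instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        self.preload_all()


legacy = LegacyStore()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is recorded
    legacy.load()
```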
- preload_all() → None#
Preload all boundary data for performance-critical scenarios.
Forces immediate loading of all boundary datasets and builds all lookup caches. This eliminates lazy loading delays for subsequent data access operations, making it ideal for performance-critical scenarios or when you know all boundary data will be needed.
The method loads all six boundary datasets:
- US western states
- California counties
- California watersheds
- California utilities
- California forecast zones
- California electric balancing areas
It also builds all corresponding lookup dictionaries for fast boundary selection operations.
Examples
>>> boundaries = Boundaries(catalog)
>>>
>>> # Preload for performance-critical batch processing
>>> boundaries.preload_all()
>>>
>>> # All subsequent access is now immediate (process_county_data is a placeholder for your own code)
>>> for county in boundaries._ca_counties.itertuples():
...     process_county_data(county)
Notes
Increases initial memory usage but eliminates loading delays
Useful for batch processing or repeated boundary access
Data remains cached until clear_cache() is called
Memory usage can be monitored with get_memory_usage()
- validate_catalog() → None#
Validate that required catalog entries exist and are accessible.
Checks for the presence of all required boundary datasets in the catalog. This ensures that the boundary data can be loaded when requested by the user.
- Raises:
ValueError – If any required catalog entries are missing. The error message will list all missing entries.
Notes
Required catalog entries:
- ‘states’: US state boundaries
- ‘counties’: California county boundaries
- ‘huc8’: California watershed boundaries (HUC8 level)
- ‘utilities’: California electric utility boundaries
- ‘dfz’: California demand forecast zones
- ‘eba’: Electric balancing authority areas
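The validation rule above (collect every missing entry, then raise once) can be sketched as a standalone function. The function below takes a plain iterable of entry names instead of an intake catalog, and the error message wording is an assumption.

```python
from typing import Iterable, List

# Entry names taken from the list of required catalog entries above.
REQUIRED_ENTRIES: List[str] = ["states", "counties", "huc8", "utilities", "dfz", "eba"]


def validate_catalog(available: Iterable[str]) -> None:
    """Raise ValueError naming every required entry that is absent."""
    present = set(available)
    missing = [name for name in REQUIRED_ENTRIES if name not in present]
    if missing:
        # Report all missing entries at once rather than one per failure.
        raise ValueError(f"Missing required catalog entries: {missing}")


validate_catalog(REQUIRED_ENTRIES)  # a complete catalog passes silently

error_message = ""
try:
    validate_catalog(["states", "counties", "huc8", "utilities"])
except ValueError as exc:
    error_message = str(exc)
```

Listing every missing entry in one message saves the user from fixing the catalog one entry at a time.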
climakitae.new_core.data_access.data_access module#
Data access module for ClimakitAE.
This module provides a singleton DataCatalog class for managing connections to various climate data catalogs including boundary, renewables, and general climate datasets. The DataCatalog class offers a unified interface for accessing and querying multiple intake catalogs with support for method chaining and dynamic catalog management.
Classes#
- DataCatalog
Singleton class that inherits from dict and manages catalog connections. Provides properties for accessing specific catalogs and methods for querying and retrieving climate datasets.
- class climakitae.new_core.data_access.data_access.DataCatalog#
Bases: dict
Singleton class for managing catalog connections to climate data sources.
This class implements the singleton pattern and inherits from dict to provide a unified interface for accessing multiple climate data catalogs. It manages connections to boundary, renewables, and general climate datasets through intake and intake-esm catalogs, offering convenient properties and methods for data querying and retrieval.
The class automatically initializes connections to predefined catalogs and supports dynamic addition of new catalogs. Method chaining is supported for fluent API usage.
- catalog_key#
The currently selected catalog key for data operations. Defaults to UNSET until explicitly set via set_catalog_key().
- Type:
str or UNSET
Properties
- data#
Access to the main climate data catalog.
- Type:
intake_esm.core.esm_datastore
- boundary#
Access to the boundary conditions catalog.
- Type:
intake.catalog.Catalog
- boundaries#
Access to the lazy-loading boundaries data manager.
- Type:
Boundaries
- renewables#
Access to the renewables data catalog.
- Type:
intake_esm.core.esm_datastore
- set_catalog_key(key)#
Set the active catalog for subsequent operations.
- set_catalog(name, catalog)#
Add a new catalog to the collection.
- get_data(query)#
Retrieve data from the active catalog using query parameters.
Notes
This class implements the singleton pattern, ensuring only one instance exists throughout the application lifecycle. Multiple calls to DataCatalog() will return the same instance.
The class automatically handles catalog initialization and provides sensible defaults when invalid catalog keys are specified.
- static __new__(cls) → DataCatalog#
Override __new__ to implement singleton pattern.
- Returns:
DataCatalog – The singleton instance of DataCatalog.
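A dict-based singleton implemented through `__new__` can be sketched in a few lines. `CatalogRegistry` is a hypothetical class used only to show the pattern; the actual `DataCatalog` also initializes its catalog connections, which is omitted here.

```python
from typing import Optional


class CatalogRegistry(dict):
    """Hypothetical dict-based singleton; not the actual DataCatalog."""

    _instance: Optional["CatalogRegistry"] = None

    def __new__(cls) -> "CatalogRegistry":
        # Create the instance once; later calls return the same object.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance


first = CatalogRegistry()
first["data"] = "main catalog placeholder"
second = CatalogRegistry()  # same object, so the entry above is visible
```

Note that `dict.__init__` runs on every call but does not clear existing entries when called without arguments, so state stored on the first instance survives later constructions.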
- property boundaries: Boundaries#
Access boundaries data with lazy loading.
- Returns:
Boundaries – The lazy-loading boundaries data manager.
- property boundary: Catalog#
Access boundary catalog.
- Returns:
intake.catalog.Catalog – The boundary conditions catalog.
- property data: esm_datastore#
Access data catalog.
- Returns:
intake_esm.core.esm_datastore – The main climate data catalog.
- get_data(query: Dict[str, Any]) → Dict[str, Dataset]#
Get data from the catalog.
This method queries the active catalog using the provided parameters and returns the matching datasets as a dictionary.
- Parameters:
query (dict) – Query parameters for filtering data. The available parameters depend on the active catalog and may include items like ‘variable’, ‘scenario’, ‘model’, etc.
- Returns:
dict[str, xr.Dataset] – The requested dataset(s) from the catalog, keyed by dataset identifiers.
Notes
The catalog_key must be set before calling this method. If not set, this will raise an error.
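The requirement that a key be set before querying can be sketched with a sentinel object. Everything below is hypothetical: the `UNSET` sentinel, the `QueryableCatalogs` class, the in-memory records, the `ValueError` type, and the key format are assumptions chosen for illustration; the source only says that an error is raised.

```python
from typing import Any, Dict, List

UNSET = object()  # hypothetical sentinel; the real UNSET lives in ClimakitAE


class QueryableCatalogs:
    """Hypothetical holder that filters in-memory records by query."""

    def __init__(self) -> None:
        self.catalog_key: Any = UNSET
        self._records: Dict[str, List[Dict[str, str]]] = {
            "data": [
                {"variable": "tas", "scenario": "ssp370"},
                {"variable": "pr", "scenario": "ssp370"},
            ]
        }

    def get_data(self, query: Dict[str, Any]) -> Dict[str, Dict[str, str]]:
        if self.catalog_key is UNSET:
            # Assumption: the concrete exception type is not stated in the source.
            raise ValueError("catalog_key must be set before calling get_data()")
        matches = [
            record
            for record in self._records[self.catalog_key]
            if all(record.get(k) == v for k, v in query.items())
        ]
        # Key each match by an identifier built from its fields.
        return {f"{r['variable']}.{r['scenario']}": r for r in matches}


catalogs = QueryableCatalogs()
unset_error = ""
try:
    catalogs.get_data({"variable": "tas"})
except ValueError as exc:
    unset_error = str(exc)

catalogs.catalog_key = "data"
result = catalogs.get_data({"variable": "tas"})
```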
- list_clip_boundaries() → dict[str, list[str]]#
List all available boundary options for clipping operations.
This method populates the available_boundaries attribute with a dictionary of boundary categories and their available options. It’s a convenience method that provides direct access to boundary options without needing to instantiate a Clip processor.
Notes
After calling this method, the available boundaries can be accessed via the available_boundaries attribute.
Examples
>>> catalog = DataCatalog()
>>> catalog.list_clip_boundaries()
>>> print(catalog.available_boundaries["states"])
['AZ', 'CA', 'CO', 'ID', 'MT', 'NV', 'NM', 'OR', 'UT', 'WA', 'WY']
- merge_catalogs() → DataFrame#
Merge the intake catalogs for data and renewables into a single DataFrame.
This method combines the data and renewables catalogs into a unified DataFrame for easier searching and querying across all available datasets.
- Returns:
DataFrame – A DataFrame containing the merged data from both catalogs with an additional ‘catalog’ column identifying the source catalog.
- print_clip_boundaries() → None#
Print all available boundary options for clipping in a user-friendly format.
This method provides a nicely formatted output showing all boundary categories and their available options for clipping operations. The output is formatted to be readable and includes summarized counts for categories with many options.
Examples
>>> catalog = DataCatalog()
>>> catalog.print_clip_boundaries()
Available Boundary Options for Clipping:
========================================
- states:
AZ, CA, CO, ID, MT … and 6 more options
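The summarized output shown above (a few options followed by a count of the rest) can be reproduced with a small helper. `summarize_options` and its `max_shown` threshold of 5 are assumptions inferred from the example output, not the library's actual function or setting, and a plain three-dot ellipsis is used in the string.

```python
from typing import List


def summarize_options(options: List[str], max_shown: int = 5) -> str:
    """Join options, collapsing the tail into a count when the list is long."""
    if len(options) <= max_shown:
        return ", ".join(options)
    shown = ", ".join(options[:max_shown])
    return f"{shown} ... and {len(options) - max_shown} more options"


states = ["AZ", "CA", "CO", "ID", "MT", "NV", "NM", "OR", "UT", "WA", "WY"]
line = summarize_options(states)
```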
- property renewables: esm_datastore#
Access renewables catalog.
- Returns:
intake_esm.core.esm_datastore – The renewables data catalog.
- reset() → None#
Reset the DataCatalog instance to its initial state.
This method clears the catalog key and resets the instance to its original state. The catalogs themselves remain loaded and available.
- set_catalog(name: str, catalog: str) → DataCatalog#
Set a named catalog.
- Parameters:
- Returns:
DataCatalog – The current instance of DataCatalog allowing method chaining.
- set_catalog_key(key: str) → DataCatalog#
Set the catalog key for accessing a specific catalog.
- Parameters:
key (str) – Key of the catalog to set. Must be one of the available catalog keys.
- Returns:
DataCatalog – The current instance of DataCatalog allowing method chaining.
- Warns:
UserWarning – If the catalog key is not found in the available catalogs. Defaults to ‘data’ catalog in this case.
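The combination of method chaining and a warned fallback to ‘data’ can be sketched as below. `KeyedCatalogs`, the catalog file names, and the warning text are hypothetical; only the behavior (return the instance, warn with `UserWarning`, default to ‘data’) comes from the description above.

```python
import warnings
from typing import Dict, Optional


class KeyedCatalogs:
    """Hypothetical holder showing key validation and method chaining."""

    def __init__(self, catalogs: Dict[str, str]) -> None:
        self._catalogs = dict(catalogs)
        self.catalog_key: Optional[str] = None  # stands in for UNSET

    def set_catalog(self, name: str, catalog: str) -> "KeyedCatalogs":
        self._catalogs[name] = catalog
        return self  # returning self enables chaining

    def set_catalog_key(self, key: str) -> "KeyedCatalogs":
        if key not in self._catalogs:
            # Unknown key: warn and fall back to the 'data' catalog.
            warnings.warn(
                f"Catalog key {key!r} not found; defaulting to 'data'.",
                UserWarning,
                stacklevel=2,
            )
            key = "data"
        self.catalog_key = key
        return self


holder = KeyedCatalogs({"data": "data.json", "renewables": "renewables.json"})
chained_key = (
    holder.set_catalog("extra", "extra.json").set_catalog_key("extra").catalog_key
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is recorded
    fallback_key = holder.set_catalog_key("missing").catalog_key
```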
Module contents#
Data access module for the new core functionality.