climakitae.new_core package#

Subpackages#

Submodules#

climakitae.new_core.dataset module#

Dataset Processing Pipeline Module

This module provides the core Dataset class that implements a flexible, pipeline-based approach for climate data processing. The Dataset class serves as a central orchestrator that coordinates data access, parameter validation, and a series of processing steps.

Classes#

Dataset: A pipeline-based data processing class that supports method chaining for building complex data workflows.

Key Features#

Pipeline Architecture: Execute sequential processing steps on climate data
Method Chaining: Fluent interface for building complex data workflows
Parameter Validation: Integrated validation system for query parameters
Data Access Integration: Pluggable data catalog system for various data sources
Error Handling: Comprehensive error handling with meaningful error messages

Usage Example#

```python
from climakitae.new_core.dataset import Dataset
from climakitae.new_core.data_access import DataCatalog
from climakitae.new_core.param_validation import ParameterValidator
from climakitae.new_core.processors import TimeSliceProcessor, ClipProcessor

# my_data_catalog and my_validator are assumed to be existing
# DataCatalog and ParameterValidator instances.

# Create a dataset processing pipeline
dataset = (
    Dataset()
    .with_catalog(my_data_catalog)
    .with_param_validator(my_validator)
    .with_processing_step(TimeSliceProcessor("2010-01-01", "2020-12-31"))
    .with_processing_step(ClipProcessor(bounds=((32, 42), (-125, -115))))
)

# Execute the pipeline
result = dataset.execute({"variable": "temperature", "grid_label": "d03"})
```

Pipeline Processing#

The Dataset class executes processing in the following order:

Parameter Validation: Validates input parameters using the configured validator
Data Access: Retrieves raw data using the configured data catalog
Processing Steps: Applies each processing step in sequence
Result Return: Returns the final processed xarray.Dataset

Each processing step receives the output of the previous step, allowing for complex data transformations and filtering operations.
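The four stages above can be sketched without climakitae installed. This is a minimal illustration rather than the library's implementation: `validate`, `get_data`, and `run_pipeline` are hypothetical stand-ins, and a plain list takes the place of an xarray.Dataset.

```python
from typing import Any, Callable, Dict, List, Optional

Step = Callable[[List[int], Dict[str, Any]], List[int]]


def validate(params: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Hypothetical validator: require a 'variable' key, else signal failure."""
    return dict(params) if "variable" in params else None


def get_data(params: Dict[str, Any]) -> List[int]:
    """Hypothetical data access: return ten fake samples."""
    return list(range(10))


def run_pipeline(params: Dict[str, Any], steps: List[Step]) -> List[int]:
    validated = validate(params)      # 1. parameter validation
    if validated is None:
        return []                     # validation failure yields an empty result
    data = get_data(validated)        # 2. data access
    context = dict(validated)         # shared context handed to every step
    for step in steps:                # 3. processing steps, in insertion order
        data = step(data, context)
    return data                       # 4. result return


steps: List[Step] = [
    lambda data, ctx: data[2:8],              # stand-in for a time slice
    lambda data, ctx: [x * 2 for x in data],  # stand-in for a transformation
]
result = run_pipeline({"variable": "temperature"}, steps)
empty = run_pipeline({}, steps)
```

Because each step consumes the previous step's output, swapping the two lambdas would change the result, which is why insertion order matters.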

Error Handling#

The class provides comprehensive error handling:

TypeError: Raised for incorrect component types (validators, catalogs, processors)
ValueError: Raised for missing required components
AttributeError: Raised for components missing required methods
RuntimeError: Raised for pipeline execution failures

Notes

Processing steps are executed in the order they are added to the pipeline
The context dictionary is passed through all processing steps and may be modified
Steps that require data access can set needs_catalog = True to receive the data accessor
Validation failures return an empty xarray.Dataset rather than raising exceptions
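The notes on the shared context and on needs_catalog can be illustrated with a small sketch. Everything here is hypothetical (`FakeCatalog`, `ScaleStep`, `AppendReferenceStep`, `run_steps`); it only assumes that a step advertising `needs_catalog = True` is handed the accessor through a `set_data_accessor` method before it runs.

```python
from typing import Any, Dict, List


class FakeCatalog:
    """Hypothetical data accessor with a single lookup method."""

    def get_data(self, params: Dict[str, Any]) -> List[int]:
        return [1, 2, 3]


class ScaleStep:
    """A step that works only on the data it is given."""

    needs_catalog = False

    def execute(self, data: List[int], context: Dict[str, Any]) -> List[int]:
        context.setdefault("history", []).append("scale")
        return [x * 10 for x in data]


class AppendReferenceStep:
    """A step that asks for the accessor so it can fetch extra data."""

    needs_catalog = True

    def __init__(self) -> None:
        self.accessor = None

    def set_data_accessor(self, accessor: FakeCatalog) -> None:
        self.accessor = accessor

    def execute(self, data: List[int], context: Dict[str, Any]) -> List[int]:
        context.setdefault("history", []).append("append_reference")
        return data + self.accessor.get_data(context)


def run_steps(data, steps, catalog, context):
    for step in steps:
        # Only steps that opt in receive the data accessor.
        if getattr(step, "needs_catalog", False):
            step.set_data_accessor(catalog)
        data = step.execute(data, context)
    return data


context: Dict[str, Any] = {"variable": "tasmax"}
out = run_steps([1, 2], [ScaleStep(), AppendReferenceStep()], FakeCatalog(), context)
```

After the run, `context["history"]` records both steps, showing how a context dictionary that is passed by reference can be modified along the way.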

class climakitae.new_core.dataset.Dataset#

Bases: object

A pipeline-based data processing class for climate data workflows.

The Dataset class serves as a central orchestrator that coordinates data access, parameter validation, and sequential processing steps. It implements a fluent interface pattern allowing method chaining for building complex data workflows.

Parameters:: None

data_access#

The data catalog instance used for retrieving raw data from various sources.

Type:: DataCatalog or UNSET

parameter_validator#

The parameter validator instance used for validating query parameters.

Type:: ParameterValidator or UNSET

processing_pipeline#

A list of processing steps to be executed sequentially on the data.

Type:: list of DataProcessor or UNSET

execute(parameters=UNSET)#: Execute the complete data processing pipeline and return the result.

with_param_validator(parameter_validator)#: Set the parameter validator for the dataset (method chaining).

with_catalog(catalog)#: Set the data catalog for the dataset (method chaining).

with_processing_step(step)#: Add a processing step to the pipeline (method chaining).

Raises:

TypeError – If provided components don’t match expected types.
AttributeError – If provided components lack required methods.
ValueError – If required components are missing during execution.
RuntimeError – If the processing pipeline encounters execution errors.

Notes

Processing steps are executed in the order they are added to the pipeline
The context dictionary is passed through all processing steps and may be modified
Steps that require data access can set needs_catalog = True to receive the data accessor
Validation failures return an empty xarray.Dataset rather than raising exceptions
All components (validator, catalog, processors) must implement their respective interfaces

See also

DataCatalog: Interface for data access components
ParameterValidator: Interface for parameter validation components
DataProcessor: Interface for data processing components

execute(parameters: Dict[str, Any] = UNSET) → Dataset#

Execute the dataset processing pipeline.

Parameters:: parameters (Dict[str, Any], optional) – Parameters to pass to the processing pipeline
Returns:: Dataset – Result of the processing pipeline

with_catalog(catalog: DataCatalog) → Dataset#

Set a new data catalog.

Parameters:

catalog (DataCatalog) – Data catalog to set for the dataset.

Returns:

Dataset – The current instance of Dataset allowing method chaining.

Raises:

TypeError – If the catalog is not an instance of DataCatalog.
AttributeError – If the catalog does not have a ‘get_data’ method.
TypeError – If the ‘get_data’ method is not callable.

with_param_validator(parameter_validator: ParameterValidator) → Dataset#

Set a new parameter validator.

Parameters:: parameter_validator (ParameterValidator) – Parameter validator to set for the dataset.
Returns:: Dataset – The current instance of Dataset allowing method chaining.
Raises:: TypeError – If the parameter validator is not an instance of ParameterValidator.

with_processing_step(step: DataProcessor) → Dataset#

Add a new processing step to the pipeline.

Parameters:

step (DataProcessor) – Processing step to add to the pipeline. Must have ‘execute’ and ‘update_context’ methods.

Returns:

Dataset – The current instance of Dataset allowing method chaining.

Raises:

TypeError – If the step is not an instance of DataProcessor.
AttributeError – If the step does not have ‘execute’, ‘update_context’, or ‘set_data_accessor’ methods.
TypeError – If the step is not callable.
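The checks described for with_processing_step can be approximated with duck typing. The sketch below is an assumption about how such checks might look, not the library's code: `MiniPipeline`, `GoodStep`, and `IncompleteStep` are hypothetical, and the isinstance check against DataProcessor is omitted so the example stays self-contained.

```python
class MiniPipeline:
    """Hypothetical pipeline that checks each step before accepting it."""

    REQUIRED_METHODS = ("execute", "update_context")

    def __init__(self) -> None:
        self.steps = []

    def with_processing_step(self, step) -> "MiniPipeline":
        for name in self.REQUIRED_METHODS:
            if not hasattr(step, name):
                raise AttributeError(f"step lacks required method '{name}'")
            if not callable(getattr(step, name)):
                raise TypeError(f"step attribute '{name}' is not callable")
        self.steps.append(step)
        return self  # returning self is what enables chaining


class GoodStep:
    def execute(self, data, context):
        return data

    def update_context(self, context):
        context["good_step"] = True


class IncompleteStep:
    """Deliberately lacks update_context."""

    def execute(self, data, context):
        return data


pipeline = MiniPipeline().with_processing_step(GoodStep())

try:
    pipeline.with_processing_step(IncompleteStep())
    outcome = "accepted"
except AttributeError as err:
    outcome = str(err)
```

Rejecting a malformed step at registration time means the failure surfaces where the pipeline is built rather than midway through execution.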

climakitae.new_core.dataset_factory module#

DatasetFactory Module.

This module provides a factory class for creating climate data processing components and complete datasets with appropriate validation and processing pipelines. It serves as the central orchestrator for constructing validators, processors, and data access objects based on data type, analytical approach, and user requirements.

The factory pattern implemented here simplifies the instantiation of complex component combinations while maintaining flexibility for different climate data scenarios including gridded versus station-based observations, time-based versus warming-level analysis approaches, and different data catalogs and processing requirements.

Key Features#

Dynamic component registration and discovery
Automatic processing pipeline construction
Catalog-based data source management
Extensible validator and processor registries

See also

climakitae.new_core.dataset.Dataset: Dataset container class
climakitae.new_core.data_access.DataCatalog: Data catalog management
climakitae.new_core.param_validation.abc_param_validator: Parameter validation framework
climakitae.new_core.processors.abc_data_processor: Data processing framework

Notes

This module follows the factory design pattern to encapsulate the complex logic of creating appropriate combinations of data access, validation, and processing components based on user queries from the ClimateData UI.

class climakitae.new_core.dataset_factory.DatasetFactory#

Bases: object

Factory for creating Dataset objects with appropriate catalogs, validators, and processors.

This factory translates UI queries from the ClimateData interface into fully configured Dataset objects with the correct combination of data catalogs for accessing climate data, parameter validators for query validation, and processing steps for data transformation.

The factory uses registries to maintain extensible collections of components and automatically determines the appropriate combination based on query parameters.

Parameters:: catalog_path (str, optional) – Path to the catalog configuration CSV file. Default is ‘climakitae/data/catalogs.csv’.

catalog_path#

Path to the catalog configuration CSV file.

Type:: str

_catalog#

Dictionary mapping catalog keys to DataCatalog instances.

Type:: dict

_catalog_df#

DataFrame containing catalog metadata loaded from CSV.

Type:: pandas.DataFrame

_validator_registry#

Registry mapping validator keys to ParameterValidator classes.

Type:: dict

_processing_step_registry#

Registry mapping processing step names to DataProcessor classes.

Type:: dict

register_catalog(key, catalog)#: Register a data catalog with the factory.

register_validator(key, validator_class)#: Register a parameter validator with the factory.

register_processing_step(step_type, step_class)#: Register a processing step with the factory.

create_validator(val_reg_key)#: Create a parameter validator based on registry key.

create_dataset(ui_query)#: Create a Dataset based on a UI query from ClimateData.

get_catalog_options(key, query=None)#: Get available options for a specific catalog.

get_validators()#: Get a list of available validators.

get_processors()#: Get a list of available processors.

Examples

Creating a basic dataset:

>>> factory = DatasetFactory()
>>> query = {'data_type': 'gridded', 'variable': 'precipitation'}
>>> dataset = factory.create_dataset(query)

Registering custom components:

>>> factory = DatasetFactory()
>>> factory.register_validator('custom_type', CustomValidator)
>>> factory.register_processing_step('custom_process', CustomProcessor)

Notes

The factory automatically handles the selection of appropriate processing steps based on the query parameters. Some processing steps are mandatory and will be added automatically even if not explicitly requested.

See also

Dataset: The main dataset container class
DataCatalog: Data access abstraction
ParameterValidator: Base class for parameter validation
DataProcessor: Base class for data processing steps

create_dataset(ui_query: Dict[str, Any]) → Dataset#

Create a Dataset based on a UI query from ClimateData.

This method orchestrates the creation of a complete Dataset by:

1. Determining the appropriate catalog based on query parameters
2. Creating and configuring the parameter validator
3. Adding the necessary processing steps in the correct order

Parameters:

ui_query (dict) – Query dictionary from the ClimateData UI. It must contain at minimum:

‘data_type’ : str, type of climate data

Additional keys depend on the specific data type and analysis.

Returns:

Dataset – Properly configured Dataset instance ready for data retrieval and processing.

Raises:

ValueError – If required query parameters are missing, invalid, or if no appropriate catalog can be determined.
RuntimeError – If dataset creation fails due to internal errors.

Notes

The method automatically adds mandatory processing steps such as concatenation and attribute updates even if not specified in the query.

Processing steps are applied in priority order, with preprocessing steps (like bias correction) applied before postprocessing steps.
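One way to picture the two notes above is a planner that appends the mandatory steps and then sorts by priority. The step names, priority numbers, and the `plan_steps` helper are all hypothetical; the source does not state the actual names or ordering values.

```python
from typing import Dict, List

# Hypothetical step names and priorities; lower numbers run first.
PRIORITY: Dict[str, int] = {
    "bias_correction": 10,
    "time_slice": 50,
    "concat": 80,
    "update_attributes": 90,
}
MANDATORY = ("concat", "update_attributes")


def plan_steps(requested: List[str]) -> List[str]:
    """Add any missing mandatory steps, then order everything by priority."""
    names = list(requested)
    for name in MANDATORY:
        if name not in names:
            names.append(name)
    return sorted(names, key=lambda name: PRIORITY[name])


plan = plan_steps(["time_slice", "bias_correction"])
```

In this sketch the caller's ordering is irrelevant: the priority table alone decides that bias correction runs before the time slice.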

See also

Dataset: The returned dataset class
create_validator: Method for creating parameter validators

create_validator(val_reg_key: str) → ParameterValidator#

Create a parameter validator based on data_type and approach.

Parameters:: val_reg_key (str) – Key for the validator (data_type_approach)
Returns:: ParameterValidator – An appropriate parameter validator
Raises:: ValueError – If no validator is registered for the given combination
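The registry lookup behind create_validator can be sketched as a dictionary from key to class. This is an illustrative assumption, not the factory's source: `MiniFactory`, `BaseValidator`, and `GriddedTimeValidator` are hypothetical, and the key `"gridded_time"` is only an example of the data_type_approach naming.

```python
from typing import Dict, List, Type


class BaseValidator:
    """Hypothetical base class standing in for ParameterValidator."""


class GriddedTimeValidator(BaseValidator):
    """Hypothetical validator for gridded, time-based queries."""


class MiniFactory:
    def __init__(self) -> None:
        self._validator_registry: Dict[str, Type[BaseValidator]] = {}

    def register_validator(self, key: str, validator_class: Type[BaseValidator]) -> None:
        self._validator_registry[key] = validator_class

    def create_validator(self, val_reg_key: str) -> BaseValidator:
        try:
            validator_class = self._validator_registry[val_reg_key]
        except KeyError:
            # Surface an unknown key as ValueError, as the docs describe.
            raise ValueError(f"no validator registered for '{val_reg_key}'") from None
        return validator_class()

    def get_validators(self) -> List[str]:
        return sorted(self._validator_registry)


factory = MiniFactory()
factory.register_validator("gridded_time", GriddedTimeValidator)
validator = factory.create_validator("gridded_time")

try:
    factory.create_validator("station_warming_level")
    lookup_error = ""
except ValueError as err:
    lookup_error = str(err)
```

Storing classes rather than instances lets every call to create_validator return a fresh validator.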

get_boundaries(boundary_type: str) → List[str]#

Get a list of available boundary datasets.

Parameters:: boundary_type (str) – The type of boundary datasets to retrieve. If the type is not found in the cache, returns all available boundary types.
Returns:: List[str] – List of available boundary datasets for the specified type, or all available boundary types if the specified type is not found.
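The fallback behavior described above can be sketched as follows. The cache contents and the module-level `get_boundaries` function are hypothetical; real boundary types and names may differ.

```python
from typing import Dict, List

# Hypothetical cache contents, keyed by boundary type.
BOUNDARY_CACHE: Dict[str, List[str]] = {
    "states": ["CA", "NV", "OR"],
    "counties": ["Alameda", "Kern"],
}


def get_boundaries(boundary_type: str) -> List[str]:
    if boundary_type in BOUNDARY_CACHE:
        return list(BOUNDARY_CACHE[boundary_type])
    # Unknown type: return the available boundary types instead.
    return list(BOUNDARY_CACHE)


known = get_boundaries("states")
fallback = get_boundaries("watersheds")
```

Note that the return value means different things in the two cases, so a caller cannot tell a list of boundaries from a list of types without checking the input.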

get_catalog_options(key: str, query: dict[str, Any] | object = <object object>) → List[str]#

Get available options for a specific catalog.

Parameters:

key (str) – Key of the catalog to query.
query (dict, optional) – A dictionary to filter the catalog options. The keys of the dictionary should correspond to columns in the catalog, and the values are the values to filter by.

Returns:

List[str] – List of available options for the specified catalog.
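Filtering by a query dictionary can be pictured with plain Python rows. This sketch assumes `key` names a catalog column and uses made-up rows; `ROWS` and `catalog_options` are hypothetical, and the real method may work on a different data structure.

```python
from typing import Any, Dict, List, Optional

# Made-up catalog rows for illustration only.
ROWS: List[Dict[str, str]] = [
    {"activity_id": "WRF", "table_id": "1hr", "variable_id": "t2"},
    {"activity_id": "WRF", "table_id": "day", "variable_id": "prec"},
    {"activity_id": "LOCA2", "table_id": "day", "variable_id": "tasmax"},
]


def catalog_options(key: str, query: Optional[Dict[str, Any]] = None) -> List[str]:
    """Return the distinct values of `key` among rows matching every query item."""
    query = query or {}
    matching = [
        row for row in ROWS
        if all(row.get(column) == value for column, value in query.items())
    ]
    return sorted({row[key] for row in matching})


all_tables = catalog_options("table_id")
wrf_variables = catalog_options("variable_id", {"activity_id": "WRF"})
```

An empty or omitted query matches every row, so the unfiltered call lists all distinct values.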

get_processors() → List[str]#

Get a list of available processors.

Returns:: List[str] – List of available processors.

get_stations() → List[str]#

Get a list of available station datasets.

Returns:: List[str] – List of available station datasets.

get_validators() → List[str]#

Get a list of available validators.

Returns:: List[str] – List of available validators.

register_catalog(key: str, catalog: DataCatalog)#

Parameters:

key (str) – Identifier for the catalog. Should correspond to data_type, installation, or other distinguishing characteristics.
catalog (DataCatalog) – Catalog implementation to register for the given key.

Raises:

TypeError – If catalog is not an instance of DataCatalog.
ValueError – If key is empty or None.

Examples

>>> factory = DatasetFactory()
>>> custom_catalog = DataCatalog()
>>> factory.register_catalog('wind_data', custom_catalog)

See also

DataCatalog: Base catalog class

register_processing_step(step_type: str, step_class)#

Parameters:

step_type (str) – Identifier for the processing step
step_class (class) – Processing step class to register

register_validator(key: str, validator_class: Type[ParameterValidator])#

Parameters:

key (str) – Identifier for the validator (approach, data_type combination)
validator_class (Type[ParameterValidator]) – Validator class to register

reset()#

Reset the factory state, clearing all registered catalogs, validators, and processors.

This method is useful for reinitializing the factory without creating a new instance.

climakitae.new_core.user_interface module#

Climate Data Interface Module for Accessing Climate Data.

This module provides a high-level interface for accessing climate data through the ClimateData class. It implements a fluent interface pattern that allows users to chain method calls to configure data queries.

The module facilitates retrieving climate data with various parameters such as catalogs, installations, activities, institutions, sources, experiments, variables, and processing options. It implements a factory pattern for creating appropriate datasets and validators based on specified parameters.

Example Usage:

>>> data = ClimateData()
>>> result = (data.catalog("renewables")
...               .installation("pv_utility")
...               .activity_id("CMIP6")
...               .variable("tasmax")
...               .table_id("day")
...               .grid_label("d03")
...               .get())

class climakitae.new_core.user_interface.ClimateData#

Bases: object

A fluent interface for accessing climate data.

This class provides a chainable interface for setting parameters and retrieving climate data. It uses a factory pattern to create datasets and validators based on the specified parameters. The class is designed to be chainable, allowing users to set multiple parameters in a single expression.

The interface supports various climate data sources and allows for flexible querying with different combinations of parameters. All methods return the instance itself to enable method chaining.

Parameters supported in queries:

catalog: The data catalog to use (e.g., “renewable energy generation”, “cadcat”)
installation: The installation type (e.g., “pv_utility”, “wind_offshore”)
activity_id: The activity identifier (e.g., “WRF”, “LOCA2”)
institution_id: The institution identifier (e.g., “CNRM”, “DWD”)
source_id: The source identifier (e.g., “GCM”, “RCM”, “Station”)
experiment_id: The experiment identifier (e.g., “historical”, “ssp245”)
table_id: The temporal resolution (e.g., “1hr”, “day”, “mon”)
grid_label: The spatial resolution (e.g., “d01”, “d02”, “d03”)
variable_id: The climate variable (e.g., “tasmax”, “pr”, “cf”)
processes: Dictionary of data processing operations to apply

catalog(catalog: str) → ClimateData#: Set the data catalog to use.

installation(installation: str) → ClimateData#: Set the installation type.

activity_id(activity_id: str) → ClimateData#: Set the activity identifier.

institution_id(institution_id: str) → ClimateData#: Set the institution identifier.

source_id(source_id: str) → ClimateData#: Set the source identifier.

experiment_id(experiment_id: str | list[str]) → ClimateData#: Set the experiment identifier(s).

table_id(table_id: str) → ClimateData#: Set the temporal resolution.

grid_label(grid_label: str) → ClimateData#: Set the spatial resolution.

variable(variable: str) → ClimateData#: Set the climate variable to retrieve.

processes(processes: Dict[str, str | Iterable]) → ClimateData#: Set processing operations to apply to the data.

get() → xr.DataArray | None#: Execute the query and retrieve the climate data.

Utility methods for exploring available options:

show_*_options() methods display available values for each parameter.

show_query() displays the current query configuration.

show_all_options() displays all available options for exploration.

Returns:

DataArray or None – The retrieved climate data as a lazy-loaded xarray DataArray, or None if the query fails or required parameters are missing.

Raises:

ValueError – If required parameters are missing or invalid during validation.
Exception – If there is an error during data retrieval or processing.

Examples

Basic usage with method chaining:

>>> cd = ClimateData()
>>> data = (cd
...     .catalog("cadcat")
...     .activity_id("WRF")
...     .experiment_id("historical")
...     .table_id("1hr")
...     .grid_label("d02")
...     .variable("prec")
...     .get()
...    )

Exploring available options:

>>> cd = ClimateData()
>>> cd.show_catalog_options()
>>> cd.catalog("cadcat").show_variable_options()

Using with processing:

>>> processes = {"spatial_avg": "region", "temporal_avg": "monthly"}
>>> data = (ClimateData()
...         .catalog("climate")
...         .variable("pr")
...         .processes(processes)
...         .get())

activity_id(activity_id: str) → ClimateData#

Set the activity identifier for the query.

Parameters:: activity_id (str) – The activity ID (e.g., “CMIP6”, “CORDEX”).
Returns:: ClimateData – The current instance for method chaining.

catalog(catalog: str) → ClimateData#

Set the data catalog to use for the query.

Parameters:: catalog (str) – The name of the catalog (e.g., “renewables”, “climate”).
Returns:: ClimateData – The current instance for method chaining.

copy_query() → Dict[str, Any]#

Get a copy of the current query parameters.

Returns:: Dict[str, Any] – A copy of the current query parameters.

experiment_id(experiment_id: str | list[str]) → ClimateData#

Set the experiment identifier for the query.

Parameters:: experiment_id (str or list of str) – The experiment ID or list of experiment IDs (e.g., “historical”, “ssp245”).
Returns:: ClimateData – The current instance for method chaining.

get() → Any | None#

Execute the configured query and retrieve climate data.

Validates required parameters, creates the appropriate dataset using the factory pattern, executes the query, and resets the query state for the next use.

Returns:

Optional[xr.DataArray] – The retrieved climate data as a lazy-loaded xarray DataArray, or None if the query fails or validation errors occur.

Raises:

ValueError – If required parameters are missing during validation.
Exception – If there are errors during dataset creation or execution.
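The chain-then-reset behavior of get() can be sketched with a small builder. `MiniClimateQuery` is hypothetical and not the ClimateData implementation: the required keys are assumed, and returning a copy of the query stands in for creating and executing a dataset.

```python
from typing import Any, Dict, Optional


class MiniClimateQuery:
    """Hypothetical fluent builder illustrating chaining and reset-on-get."""

    REQUIRED = ("catalog", "variable")  # assumed required keys

    def __init__(self) -> None:
        self._query: Dict[str, Any] = {}

    def catalog(self, catalog: str) -> "MiniClimateQuery":
        self._query["catalog"] = catalog
        return self

    def variable(self, variable: str) -> "MiniClimateQuery":
        self._query["variable"] = variable
        return self

    def copy_query(self) -> Dict[str, Any]:
        return dict(self._query)

    def get(self) -> Optional[Dict[str, Any]]:
        try:
            if any(key not in self._query for key in self.REQUIRED):
                return None            # missing parameters: no result
            return dict(self._query)   # stands in for executing the dataset
        finally:
            self._query = {}           # reset state for the next query


cd = MiniClimateQuery()
data = cd.catalog("cadcat").variable("prec").get()
after_get = cd.copy_query()
incomplete = cd.catalog("cadcat").get()
```

Resetting in a `finally` block means the builder is clean after both successful and failed queries, so one instance can be reused safely.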

grid_label(grid_label: str) → ClimateData#

Set the spatial resolution identifier for the query.

Parameters:: grid_label (str) – The spatial resolution (e.g., “d01”, “d02”, “d03”).
Returns:: ClimateData – The current instance for method chaining.

installation(installation: str) → ClimateData#

Set the installation type for the query.

Parameters:: installation (str) – The installation type (e.g., “pv_utility”, “wind_offshore”).
Returns:: ClimateData – The current instance for method chaining.

institution_id(institution_id: str) → ClimateData#

Set the institution identifier for the query.

Parameters:: institution_id (str) – The institution ID (e.g., “CNRM”, “DWD”).
Returns:: ClimateData – The current instance for method chaining.

load_query(query_params: Dict[str, Any]) → ClimateData#

Load query parameters from a dictionary.

Parameters:: query_params (Dict[str, Any]) – Dictionary of query parameters to load.
Returns:: ClimateData – The current instance with loaded parameters.
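Together with copy_query, load_query allows a query to be saved and replayed. The sketch below shows one plausible round trip; `MiniQueryStore` is hypothetical, the `"time_slice"` process key is made up, and the use of deep copies is an assumption about how isolation could be achieved.

```python
import copy
from typing import Any, Dict


class MiniQueryStore:
    """Hypothetical holder for query parameters."""

    def __init__(self) -> None:
        self._query: Dict[str, Any] = {}

    def processes(self, processes: Dict[str, Any]) -> "MiniQueryStore":
        self._query["processes"] = processes
        return self

    def copy_query(self) -> Dict[str, Any]:
        # Deep copy so callers cannot mutate nested values in place.
        return copy.deepcopy(self._query)

    def load_query(self, query_params: Dict[str, Any]) -> "MiniQueryStore":
        self._query.update(copy.deepcopy(query_params))
        return self


original = MiniQueryStore().processes({"time_slice": ["2010", "2020"]})
saved = original.copy_query()
saved["processes"]["time_slice"].append("extra")  # mutates the copy only
restored = MiniQueryStore().load_query(original.copy_query())
```

This pattern is useful when get() clears the query state, since a saved copy can be reloaded to rerun the same query.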

processes(processes: Dict[str, str | Iterable]) → ClimateData#

Set processing operations to apply to the retrieved data.

Parameters:: processes (Dict[str, Union[str, Iterable]]) – A dictionary of processing operations and their parameters.
Returns:: ClimateData – The current instance for method chaining.

reset() → ClimateData#

Manually reset the query parameters.

Returns:: ClimateData – The current instance with reset parameters.

show_activity_id_options() → None#: Display available activity ID options.

show_all_options() → None#: Display all available options for exploration.

show_boundary_options(type=<object object>) → None#: Display available boundaries for spatial queries.

show_catalog_options() → None#: Display available catalog options.

show_experiment_id_options() → None#: Display available experiment ID options.

show_grid_label_options() → None#: Display available grid label options (Spatial resolutions).

show_installation_options() → None#: Display available installation options.

show_institution_id_options() → None#: Display available institution ID options.

show_processors() → None#: Display available data processors.

show_query() → None#: Display the current query configuration.

show_source_id_options() → None#: Display available source ID options.

show_station_options() → None#: Display available station options for data retrieval.

show_table_id_options() → None#: Display available table ID options (Temporal resolutions).

show_variable_options() → None#: Display available variable options.

source_id(source_id: str) → ClimateData#

Set the source identifier for the query.

Parameters:: source_id (str) – The source ID (e.g., “GCM”, “RCM”, “Station”).
Returns:: ClimateData – The current instance for method chaining.

table_id(table_id: str) → ClimateData#

Set the temporal resolution identifier for the query.

Parameters:: table_id (str) – The temporal resolution (e.g., “1hr”, “day”, “mon”).
Returns:: ClimateData – The current instance for method chaining.

variable(variable: str) → ClimateData#

Set the climate variable to retrieve.

Parameters:: variable (str) – The variable identifier (e.g., “tasmax”, “pr”, “cf”).
Returns:: ClimateData – The current instance for method chaining.
