plateau.core.dataset module
- class plateau.core.dataset.DatasetMetadata(uuid: str, partitions: Dict[str, Partition] | None = None, metadata: Dict | None = None, indices: Dict[str, IndexBase] | None = None, metadata_version: int = 4, explicit_partitions: bool = True, partition_keys: List[str] | None = None, schema: SchemaWrapper | None = None, table_name: str | None = 'table')[source]
Bases:
DatasetMetadataBase
Container holding all metadata of the dataset.
- static from_dict(dct: Dict, explicit_partitions: bool = True)[source]
Load dataset metadata from a dictionary.
The dictionary must not contain any external references. Otherwise use
load_from_dict
to have them resolved automatically.
- static load_from_buffer(buf, store: KeyValueStore, format: str = 'json') DatasetMetadata [source]
Load a dataset from a (string) buffer.
- Parameters:
buf – Input to be parsed.
store – Object that implements the .get method for file/object loading.
- Returns:
Parsed metadata.
- Return type:
DatasetMetadata
- static load_from_dict(dct: Dict, store: KeyValueStore, load_schema: bool = True) DatasetMetadata [source]
Load dataset metadata from a dictionary and resolve any external includes.
- Parameters:
dct – Dictionary to load the dataset metadata from.
store – Object that implements the .get method for file/object loading.
load_schema – Load table schema
- static load_from_store(uuid: str, store: str | KeyValueStore | Callable[[], KeyValueStore], load_schema: bool = True, load_all_indices: bool = False) DatasetMetadata [source]
Load a dataset from storage.
- Parameters:
uuid – UUID of the dataset.
store – Object that implements the .get method for file/object loading.
load_schema – Load table schema
load_all_indices – Load all registered indices into memory.
- Returns:
dataset_metadata – Parsed metadata.
- Return type:
DatasetMetadata
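The store argument is duck-typed: for loading, any object that exposes a .get method returning bytes is sufficient. A minimal in-memory stand-in can illustrate this contract (a sketch only; the class name and the metadata key shown are illustrative, not part of plateau's API):

```python
class InMemoryStore:
    """Minimal stand-in for a KeyValueStore: only .get is needed for loading."""

    def __init__(self, data=None):
        self._data = dict(data or {})

    def get(self, key: str) -> bytes:
        # Missing keys raise KeyError, mirroring real key-value store semantics.
        return self._data[key]


# Illustrative key name; real datasets use plateau's own naming scheme.
store = InMemoryStore({"my_uuid.by-dataset-metadata.json": b'{"uuid": "my_uuid"}'})
store.get("my_uuid.by-dataset-metadata.json")  # → b'{"uuid": "my_uuid"}'
```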
- class plateau.core.dataset.DatasetMetadataBase(uuid: str, partitions: Dict[str, Partition] | None = None, metadata: Dict | None = None, indices: Dict[str, IndexBase] | None = None, metadata_version: int = 4, explicit_partitions: bool = True, partition_keys: List[str] | None = None, schema: SchemaWrapper | None = None, table_name: str | None = 'table')[source]
Bases:
CopyMixin
- static exists(uuid: str, store: str | KeyValueStore | Callable[[], KeyValueStore]) bool [source]
Check if a dataset exists in a storage.
- Parameters:
uuid – UUID of the dataset.
store – Object that implements the .get method for file/object loading.
- get_indices_as_dataframe(columns: List[str] | None = None, date_as_object: bool = True, predicates: List[List[Tuple[str, str, LiteralValue]]] | None = None)[source]
Convert the dataset indices to a pandas DataFrame and filter the relevant indices by predicates.
For a dataset with indices on columns column_a and column_b and three partitions, the output may look like:

         column_a  column_b
part_1          1         A
part_2          2         B
part_3          3      None
- Parameters:
columns – A subset of columns to be loaded.
predicates (List[List[Tuple[str, str, Any]]]) –
Optional list of predicates, like [[('x', '>', 0), ...]], that are used to filter the resulting DataFrame, possibly using predicate pushdown if supported by the file format. This parameter is not compatible with filter_query.
Predicates are expressed in disjunctive normal form (DNF). This means that the innermost tuple describes a single column predicate. These inner predicates are all combined with a conjunction (AND) into a larger predicate. The most outer list then combines all predicates with a disjunction (OR). By this, we should be able to express all kinds of predicates that are possible using boolean logic.
Available operators are: ==, !=, <=, >=, <, > and in.
Filtering for missing values is supported with the operators ==, != and in, using the values np.nan for float columns and None for string columns.
Categorical data
When using order-sensitive operators on categorical data, we assume that the categories obey a lexicographical ordering. This filtering may be slower than the equivalent evaluation on non-categorical data.
See also Filtering / Predicate pushdown and Efficient Querying
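The DNF semantics described above (inner tuples ANDed into conjunctions, the outer list ORed) can be sketched with a plain-Python evaluator. This is an illustration of the predicate format only, not plateau's actual predicate-pushdown implementation:

```python
import operator

# Map the documented operators onto Python callables.
OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<=": operator.le, ">=": operator.ge,
    "<": operator.lt, ">": operator.gt,
    "in": lambda value, container: value in container,
}


def matches(row, predicates):
    """True if the row satisfies the DNF predicates.

    predicates: [[(column, op, value), ...], ...] -- each inner list is a
    conjunction (AND); the outer list combines them with a disjunction (OR).
    """
    return any(
        all(OPS[op](row[col], val) for col, op, val in conjunction)
        for conjunction in predicates
    )


row = {"x": 5, "y": "a"}
predicates = [[("x", ">", 0), ("y", "in", ["a", "b"])], [("x", "==", -1)]]
matches(row, predicates)  # → True: the first conjunction holds entirely
```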
- load_all_indices(store: str | KeyValueStore | Callable[[], KeyValueStore]) T [source]
Load all registered indices into memory.
Note: External indices need to be preloaded before they can be queried.
- Parameters:
store – Object that implements the .get method for file/object loading.
- Returns:
dataset_metadata – Mutated metadata object with the loaded indices.
- Return type:
T
- load_index(column: str, store: str | KeyValueStore | Callable[[], KeyValueStore]) T [source]
Load an index into memory.
Note: External indices need to be preloaded before they can be queried.
- Parameters:
column – Name of the column for which the index should be loaded.
store – Object that implements the .get method for file/object loading.
- Returns:
dataset_metadata – Mutated metadata object with the loaded index.
- Return type:
T
- load_partition_indices() T [source]
Load all filename-encoded indices into RAM. Filename-encoded indices can be extracted from datasets whose partitions are stored in a format like
`dataset_uuid/table/IndexCol=IndexValue/SecondIndexCol=Value/partition_label.parquet`
which results in an in-memory index holding the information
`{ "IndexCol": { IndexValue: ["partition_label"] }, "SecondIndexCol": { Value: ["partition_label"] } }`
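The filename-encoded scheme above can be illustrated with a small parser that folds `Col=Value` path segments into the in-memory index structure. This is an independent sketch of the mapping, not plateau's internal code:

```python
def build_partition_indices(keys):
    """Fold 'Col=Value' path segments into {column: {value: [partition_label]}}."""
    indices = {}
    for key in keys:
        *segments, filename = key.split("/")
        label = filename.rsplit(".", 1)[0]  # strip the ".parquet" suffix
        for segment in segments:
            if "=" not in segment:
                continue  # skip the dataset uuid and table name segments
            column, value = segment.split("=", 1)
            indices.setdefault(column, {}).setdefault(value, []).append(label)
    return indices


keys = [
    "dataset_uuid/table/IndexCol=IndexValue/SecondIndexCol=Value/partition_label.parquet"
]
build_partition_indices(keys)
# → {"IndexCol": {"IndexValue": ["partition_label"]},
#    "SecondIndexCol": {"Value": ["partition_label"]}}
```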
- query(indices: List[IndexBase] | None = None, **kwargs) List[str] [source]
Query the dataset for partitions that contain specific values. Lookup is performed using the embedded and loaded external indices. Additional indices need to operate on the same partitions that the dataset contains, otherwise an empty list will be returned (the query method only restricts the set of partition keys using the indices).
- Parameters:
indices – List of optional additional indices.
**kwargs – Map of columns and values.
- Returns:
List of keys of partitions that contain the queried values in the respective columns.
- Return type:
List[str]
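Conceptually, the query intersects the partition sets returned by each column's index, so a value missing from any one index yields an empty result. A simplified sketch of that restriction step (an illustration of the semantics, not plateau's implementation):

```python
def query_partitions(indices, **kwargs):
    """Return partition labels whose indexed values match every kwarg.

    indices: {column: {value: [partition_label, ...]}}
    """
    result = None
    for column, value in kwargs.items():
        labels = set(indices.get(column, {}).get(value, []))
        # Intersect with the partitions matched so far (AND across columns).
        result = labels if result is None else result & labels
    return sorted(result or [])


indices = {
    "color": {"red": ["part_1", "part_2"], "blue": ["part_3"]},
    "size": {"L": ["part_2", "part_3"]},
}
query_partitions(indices, color="red", size="L")  # → ["part_2"]
```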
- property secondary_indices: Dict[str, ExplicitSecondaryIndex]
- static storage_keys(uuid: str, store: str | KeyValueStore | Callable[[], KeyValueStore]) List[str] [source]
Retrieve all keys that belong to the given dataset.
- Parameters:
uuid – UUID of the dataset.
store – Object that implements the .iter_keys method for key retrieval.