plateau.core.dataset module
- class plateau.core.dataset.DatasetMetadata(uuid: str, partitions: Dict[str, Partition] | None = None, metadata: Dict | None = None, indices: Dict[str, IndexBase] | None = None, metadata_version: int = 4, explicit_partitions: bool = True, partition_keys: List[str] | None = None, schema: SchemaWrapper | None = None, table_name: str | None = 'table')[source]
Bases:
DatasetMetadataBase
Container holding all metadata of the dataset.
- static from_dict(dct: Dict, explicit_partitions: bool = True)[source]
Load dataset metadata from a dictionary.
The dictionary must not contain any external references. Otherwise use
load_from_dict
to have them resolved automatically.
- static load_from_buffer(buf, store: KeyValueStore, format: str = 'json') DatasetMetadata [source]
Load a dataset from a (string) buffer.
- Parameters:
buf – Input to be parsed.
store – Object that implements the .get method for file/object loading.
- Returns:
Parsed metadata.
- Return type:
DatasetMetadata
- static load_from_dict(dct: Dict, store: KeyValueStore, load_schema: bool = True) DatasetMetadata [source]
Load dataset metadata from a dictionary and resolve any external includes.
- Parameters:
dct – Dictionary to load the dataset metadata from.
store – Object that implements the .get method for file/object loading.
load_schema – Load table schema
- static load_from_store(uuid: str, store: str | KeyValueStore | Callable[[], KeyValueStore], load_schema: bool = True, load_all_indices: bool = False) DatasetMetadata [source]
Load a dataset from storage.
- Parameters:
uuid – UUID of the dataset.
store – Object that implements the .get method for file/object loading.
load_schema – Load table schema
load_all_indices – Load all registered indices into memory.
- Returns:
dataset_metadata – Parsed metadata.
- Return type:
DatasetMetadata
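The store argument is duck-typed: for loading, any object that exposes a .get method returning bytes is sufficient. A minimal in-memory stand-in can illustrate this contract (a sketch only; the class name and the metadata key shown are illustrative, not part of plateau's API):

```python
class InMemoryStore:
    """Minimal stand-in for a KeyValueStore: only .get is needed for loading."""

    def __init__(self, data=None):
        self._data = dict(data or {})

    def get(self, key: str) -> bytes:
        # Missing keys raise KeyError, mirroring real key-value store semantics.
        return self._data[key]


# Illustrative key name; real datasets use plateau's own naming scheme.
store = InMemoryStore({"my_uuid.by-dataset-metadata.json": b'{"uuid": "my_uuid"}'})
store.get("my_uuid.by-dataset-metadata.json")  # → b'{"uuid": "my_uuid"}'
```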
- class plateau.core.dataset.DatasetMetadataBase(uuid: str, partitions: Dict[str, Partition] | None = None, metadata: Dict | None = None, indices: Dict[str, IndexBase] | None = None, metadata_version: int = 4, explicit_partitions: bool = True, partition_keys: List[str] | None = None, schema: SchemaWrapper | None = None, table_name: str | None = 'table')[source]
Bases:
CopyMixin
- static exists(uuid: str, store: str | KeyValueStore | Callable[[], KeyValueStore]) bool [source]
Check if a dataset exists in a storage.
- Parameters:
uuid – UUID of the dataset.
store – Object that implements the .get method for file/object loading.
- get_indices_as_dataframe(columns: List[str] | None = None, date_as_object: bool = True, predicates: List[List[Tuple[str, str, LiteralValue]]] | None = None)[source]
Convert the dataset indices to a pandas DataFrame and filter the relevant indices by predicates.
For a dataset with indices on columns column_a and column_b and three partitions, the output may look like:

         column_a  column_b
part_1          1         A
part_2          2         B
part_3          3      None
- Parameters:
columns – A subset of columns to be loaded.
predicates (List[List[Tuple[str, str, Any]]]) –
Optional list of predicates, like [[('x', '>', 0), ...]], that are used to filter the resulting DataFrame, possibly using predicate pushdown if supported by the file format. This parameter is not compatible with filter_query.
Predicates are expressed in disjunctive normal form (DNF). This means that the innermost tuple describes a single column predicate. These inner predicates are all combined with a conjunction (AND) into a larger predicate. The most outer list then combines all predicates with a disjunction (OR). By this, we should be able to express all kinds of predicates that are possible using boolean logic.
Available operators are: ==, !=, <=, >=, <, > and in.
Filtering for missing values is supported with the operators ==, != and in, using the values np.nan for float columns and None for string columns.
Categorical data
When using order-sensitive operators on categorical data, we assume that the categories obey a lexicographical ordering. This filtering may be slower than the equivalent evaluation on non-categorical data.
See also Filtering / Predicate pushdown and Efficient Querying
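The DNF semantics described above (inner tuples ANDed into conjunctions, the outer list ORed) can be sketched with a plain-Python evaluator. This is an illustration of the predicate format only, not plateau's actual predicate-pushdown implementation:

```python
import operator

# Map the documented operators onto Python callables.
OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<=": operator.le, ">=": operator.ge,
    "<": operator.lt, ">": operator.gt,
    "in": lambda value, container: value in container,
}


def matches(row, predicates):
    """True if the row satisfies the DNF predicates.

    predicates: [[(column, op, value), ...], ...] -- each inner list is a
    conjunction (AND); the outer list combines them with a disjunction (OR).
    """
    return any(
        all(OPS[op](row[col], val) for col, op, val in conjunction)
        for conjunction in predicates
    )


row = {"x": 5, "y": "a"}
predicates = [[("x", ">", 0), ("y", "in", ["a", "b"])], [("x", "==", -1)]]
matches(row, predicates)  # → True: the first conjunction holds entirely
```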
- load_all_indices(store: str | KeyValueStore | Callable[[], KeyValueStore]) T [source]
Load all registered indices into memory.
Note: External indices need to be preloaded before they can be queried.
- Parameters:
store – Object that implements the .get method for file/object loading.
- Returns:
dataset_metadata – Mutated metadata object with the loaded indices.
- Return type:
T
- load_index(column: str, store: str | KeyValueStore | Callable[[], KeyValueStore]) T [source]
Load an index into memory.
Note: External indices need to be preloaded before they can be queried.
- Parameters:
column – Name of the column for which the index should be loaded.
store – Object that implements the .get method for file/object loading.
- Returns:
dataset_metadata – Mutated metadata object with the loaded index.
- Return type:
T
- load_partition_indices() T [source]
Load all filename-encoded indices into RAM. Filename-encoded indices can be extracted from datasets whose partitions are stored in a format like
`dataset_uuid/table/IndexCol=IndexValue/SecondIndexCol=Value/partition_label.parquet`
which results in an in-memory index holding the information
`{ "IndexCol": { IndexValue: ["partition_label"] }, "SecondIndexCol": { Value: ["partition_label"] } }`
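The filename-encoded scheme above can be illustrated with a small parser that folds `Col=Value` path segments into the in-memory index structure. This is an independent sketch of the mapping, not plateau's internal code:

```python
def build_partition_indices(keys):
    """Fold 'Col=Value' path segments into {column: {value: [partition_label]}}."""
    indices = {}
    for key in keys:
        *segments, filename = key.split("/")
        label = filename.rsplit(".", 1)[0]  # strip the ".parquet" suffix
        for segment in segments:
            if "=" not in segment:
                continue  # skip the dataset uuid and table name segments
            column, value = segment.split("=", 1)
            indices.setdefault(column, {}).setdefault(value, []).append(label)
    return indices


keys = [
    "dataset_uuid/table/IndexCol=IndexValue/SecondIndexCol=Value/partition_label.parquet"
]
build_partition_indices(keys)
# → {"IndexCol": {"IndexValue": ["partition_label"]},
#    "SecondIndexCol": {"Value": ["partition_label"]}}
```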
- query(indices: List[IndexBase] | None = None, **kwargs) List[str] [source]
Query the dataset for partitions that contain specific values. Lookup is performed using the embedded and loaded external indices. Additional indices need to operate on the same partitions that the dataset contains, otherwise an empty list will be returned (the query method only restricts the set of partition keys using the indices).
- Parameters:
indices – List of optional additional indices.
**kwargs – Map of columns and values.
- Returns:
List of keys of partitions that contain the queried values in the respective columns.
- Return type:
List[str]
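Conceptually, the query intersects the partition sets returned by each column's index, so a value missing from any one index yields an empty result. A simplified sketch of that restriction step (an illustration of the semantics, not plateau's implementation):

```python
def query_partitions(indices, **kwargs):
    """Return partition labels whose indexed values match every kwarg.

    indices: {column: {value: [partition_label, ...]}}
    """
    result = None
    for column, value in kwargs.items():
        labels = set(indices.get(column, {}).get(value, []))
        # Intersect with the partitions matched so far (AND across columns).
        result = labels if result is None else result & labels
    return sorted(result or [])


indices = {
    "color": {"red": ["part_1", "part_2"], "blue": ["part_3"]},
    "size": {"L": ["part_2", "part_3"]},
}
query_partitions(indices, color="red", size="L")  # → ["part_2"]
```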
- property secondary_indices: Dict[str, ExplicitSecondaryIndex]
- static storage_keys(uuid: str, store: str | KeyValueStore | Callable[[], KeyValueStore]) List[str] [source]
Retrieve all keys that belong to the given dataset.
- Parameters:
uuid – UUID of the dataset.
store – Object that implements the .iter_keys method for key retrieval.