plateau.io.iter module
- plateau.io.iter.read_dataset_as_dataframes__iterator(dataset_uuid=None, store=None, columns=None, predicate_pushdown_to_io=True, categoricals=None, dates_as_object: bool = True, predicates=None, factory=None, dispatch_by=None)[source]
A Python iterator to retrieve a dataset from store where each partition is loaded as a DataFrame.
- Parameters:
dataset_uuid (str) – The dataset UUID
store (Callable or str or minimalkv.KeyValueStore) – The store where we can find or store the dataset. Can be either a minimalkv.KeyValueStore, a minimalkv store URL, or a generic Callable producing a minimalkv.KeyValueStore.
columns – A subset of columns to be loaded.
predicate_pushdown_to_io (bool) – Push predicates through to the I/O layer, default True. Disable this if you see problems with predicate pushdown for the given file even if the file format supports it. Note that this option only hides problems in the storage layer that need to be addressed there.
categoricals – Load the provided subset of columns as a pandas.Categorical.
dates_as_object (bool) – Load pyarrow.date{32,64} columns as object columns in Pandas instead of using np.datetime64, to preserve their type. While this improves type-safety, it comes at a performance cost.
predicates (List[List[Tuple[str, str, Any]]]) –
Optional list of predicates, like [[('x', '>', 0), ...]], that are used to filter the resulting DataFrame, possibly using predicate pushdown if supported by the file format. This parameter is not compatible with filter_query.
Predicates are expressed in disjunctive normal form (DNF). This means that the innermost tuple describes a single column predicate. These inner predicates are all combined with a conjunction (AND) into a larger predicate. The most outer list then combines all predicates with a disjunction (OR). By this, we should be able to express all kinds of predicates that are possible using boolean logic.
Available operators are: ==, !=, <=, >=, <, > and in.
Filtering for missing values is supported with the operators ==, != and in, using the values np.nan and None for float and string columns respectively.
Categorical data
When using order-sensitive operators on categorical data, we assume that the categories obey a lexicographical ordering. This filtering may result in less than optimal performance and may be slower than the equivalent evaluation on non-categorical data.
See also Filtering / Predicate pushdown and Efficient Querying
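The AND/OR semantics of the DNF predicate format can be illustrated with a small sketch. Note that this is not plateau's implementation (plateau evaluates predicates internally and, where possible, pushes them down to the file format); the `row_matches` helper below is a hypothetical illustration of how such a predicate list is interpreted.

```python
import operator

# Map the documented operators onto plain Python comparisons.
OPS = {
    "==": operator.eq, "!=": operator.ne, "<=": operator.le,
    ">=": operator.ge, "<": operator.lt, ">": operator.gt,
    "in": lambda value, allowed: value in allowed,
}

def row_matches(row, predicates):
    # Outer list: OR over conjunctions; inner lists: AND over column predicates.
    return any(
        all(OPS[op](row[col], val) for col, op, val in conjunction)
        for conjunction in predicates
    )

# (x > 0 AND y == "a") OR (x in {-1, -2})
predicates = [
    [("x", ">", 0), ("y", "==", "a")],
    [("x", "in", {-1, -2})],
]
row_matches({"x": 5, "y": "a"}, predicates)   # True: first conjunction holds
row_matches({"x": -1, "y": "b"}, predicates)  # True: second conjunction holds
row_matches({"x": 0, "y": "a"}, predicates)   # False: neither conjunction holds
```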
factory (plateau.core.factory.DatasetFactory) – A DatasetFactory holding the store and UUID to the source dataset.
dispatch_by (Optional[List[str]]) –
List of index columns to group and partition the jobs by. There will be one job created for every observed index value combination. This may result in either many very small partitions or in few very large partitions, depending on the index you are using this on.
Secondary indices
This is also usable in combination with secondary indices where the physical file layout may not be aligned with the logically requested layout. For optimal performance it is recommended to use this for columns which can benefit from predicate pushdown, since the jobs will fetch their data individually and will not shuffle data in memory / over the network.
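The "one job per observed index value combination" behaviour can be sketched with plain Python grouping. This is illustrative only: the index entries and file names below are made up, and plateau performs this dispatch internally.

```python
from itertools import groupby

# Hypothetical secondary-index entries: (index value, physical file) pairs.
index_entries = [
    ("DE", "part-0.parquet"), ("DE", "part-3.parquet"),
    ("FR", "part-1.parquet"), ("US", "part-2.parquet"),
]

# dispatch_by=["country"] conceptually yields one job per observed value,
# each job covering all files that contain that value.
jobs = {
    key: [fname for _, fname in group]
    for key, group in groupby(sorted(index_entries), key=lambda kv: kv[0])
}
# jobs maps 'DE' -> two files, 'FR' and 'US' -> one file each
```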
- Returns:
A list containing a dictionary for each partition. The dictionary's keys are the in-partition file labels and the values are the corresponding dataframes.
- Return type:
Examples
Dataset in store contains two partitions with two files each
>>> import minimalkv
>>> from plateau.io.iter import read_dataset_as_dataframes__iterator
>>> store = minimalkv.get_store_from_url('s3://bucket_with_dataset')
>>> dataframes = read_dataset_as_dataframes__iterator('dataset_uuid', store)
>>> next(dataframes)
[
    # First partition
    {'core': pd.DataFrame, 'lookup': pd.DataFrame}
]
>>> next(dataframes)
[
    # Second partition
    {'core': pd.DataFrame, 'lookup': pd.DataFrame}
]
- plateau.io.iter.store_dataframes_as_dataset__iter(df_generator, store, dataset_uuid=None, metadata=None, partition_on=None, df_serializer=None, overwrite=False, metadata_storage_format='json', metadata_version=4, secondary_indices=None, table_name: str = 'table')[source]
Store pd.DataFrame objects iteratively as a partitioned dataset with multiple tables (files).
Useful for datasets which do not fit into memory.
- Parameters:
df_generator (Iterable[Union[pandas.DataFrame, Dict[str, pandas.DataFrame]]]) – The dataframe(s) to be stored
store (Callable or str or minimalkv.KeyValueStore) – The store where we can find or store the dataset. Can be either a minimalkv.KeyValueStore, a minimalkv store URL, or a generic Callable producing a minimalkv.KeyValueStore.
dataset_uuid (str) – The dataset UUID
metadata (Optional[Dict]) – A dictionary used to update the dataset metadata.
partition_on (List) – Column names by which the dataset should be physically partitioned. These columns may later be used as an index to improve query performance. Partition columns need to be present in all dataset tables. Sensitive to ordering.
df_serializer (Optional[plateau.serialization.DataFrameSerializer]) – A pandas DataFrame serialiser from plateau.serialization
overwrite (Optional[bool]) – If True, allow overwrite of an existing dataset.
metadata_storage_format (str) – Format in which the dataset metadata is stored. Currently supported are json and msgpack.zstd.
metadata_version (Optional[int]) – The dataset metadata version
secondary_indices (List[str]) – A list of columns for which a secondary index should be calculated.
table_name – The table name of the dataset to be loaded. This creates a namespace for the partitioning like dataset_uuid/table_name/*. This is to support legacy workflows; we recommend not using this and keeping the default wherever possible.
- Returns:
dataset – The stored dataset.
- Return type:
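Since df_generator only needs to be iterable, a plain generator keeps memory bounded while the dataset is written partition by partition. The `iter_chunks` helper below is a hypothetical sketch, not part of plateau; the commented-out call shows how such a generator would feed store_dataframes_as_dataset__iter, assuming an in-memory minimalkv store URL.

```python
import pandas as pd

def iter_chunks(df: pd.DataFrame, rows_per_chunk: int):
    """Yield df in row-wise chunks so it never has to be stored in one piece."""
    for start in range(0, len(df), rows_per_chunk):
        yield df.iloc[start:start + rows_per_chunk]

df = pd.DataFrame({"x": range(10), "part": ["a"] * 5 + ["b"] * 5})
chunks = list(iter_chunks(df, 4))  # 3 chunks of 4, 4, and 2 rows

# Passing the generator itself (not a materialized list) keeps memory bounded:
# dm = store_dataframes_as_dataset__iter(
#     iter_chunks(df, 4),
#     store="hmemory://",        # assumed in-memory store URL for illustration
#     dataset_uuid="demo",
#     partition_on=["part"],
# )
```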
- plateau.io.iter.update_dataset_from_dataframes__iter(df_generator, store=None, dataset_uuid=None, delete_scope=None, metadata=None, df_serializer=None, metadata_merger=None, default_metadata_version=4, partition_on=None, sort_partitions_by=None, secondary_indices=None, factory=None, table_name: str = 'table')[source]
Update a plateau dataset in store iteratively, using a generator of dataframes.
Useful for datasets which do not fit into memory.
- Parameters:
df_generator (Iterable[Union[pandas.DataFrame, Dict[str, pandas.DataFrame]]]) – The dataframe(s) to be stored
store (Callable or str or minimalkv.KeyValueStore) –
The store where we can find or store the dataset.
Can be either
minimalkv.KeyValueStore
, a minimalkv store url or a generic Callable producing aminimalkv.KeyValueStore
dataset_uuid (str) – The dataset UUID
delete_scope (List[Dict]) – Defines which partitions are replaced with the input and therefore get deleted. It is a list of query filters for the dataframe in the form of dictionaries, e.g. [{'column_1': 'value_1'}, {'column_1': 'value_2'}]. Each query filter will be given to dataset.query and the returned partitions will be deleted. If no scope is given, nothing will be deleted. For plateau.io.dask.update.update_dataset.* a delayed object resolving to a list of dicts is also accepted.
metadata (Optional[Dict]) – A dictionary used to update the dataset metadata.
df_serializer (Optional[plateau.serialization.DataFrameSerializer]) – A pandas DataFrame serialiser from plateau.serialization
metadata_merger (Optional[Callable]) – By default partition metadata is combined using the combine_metadata() function. You can supply a callable here that implements a custom merge operation on the metadata dictionaries (depending on the matches this might be more than two values).
default_metadata_version (int) – Default metadata version. (Note: only metadata versions greater than 3 are supported.)
partition_on (List) – Column names by which the dataset should be physically partitioned. These columns may later be used as an index to improve query performance. Partition columns need to be present in all dataset tables. Sensitive to ordering.
sort_partitions_by (str) – Provide a column after which the data should be sorted before storage to enable predicate pushdown.
secondary_indices (List[str]) – A list of columns for which a secondary index should be calculated.
factory (plateau.core.factory.DatasetFactory) – A DatasetFactory holding the store and UUID to the source dataset.
table_name – The table name of the dataset to be loaded. This creates a namespace for the partitioning like dataset_uuid/table_name/*. This is to support legacy workflows; we recommend not using this and keeping the default wherever possible.
- Return type:
The dataset metadata object (DatasetMetadata).
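The selection semantics of delete_scope described above can be sketched in plain Python. This is illustrative only, not plateau's implementation: the `in_delete_scope` helper and the column names are hypothetical, and in plateau the matching is performed against the dataset's indices via dataset.query.

```python
# Each dict in delete_scope is one query filter; a partition is replaced
# (deleted) if it matches ANY filter, where matching a filter means
# satisfying ALL of its column == value conditions.
delete_scope = [{"country": "DE"}, {"country": "FR", "year": 2020}]

def in_delete_scope(partition_values, scope):
    # partition_values: indexed column values of one physical partition.
    return any(
        all(partition_values.get(col) == val for col, val in query.items())
        for query in scope
    )

in_delete_scope({"country": "DE", "year": 2021}, delete_scope)  # True
in_delete_scope({"country": "FR", "year": 2019}, delete_scope)  # False
```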
See also