DataFrame Serialization
Serialise Pandas DataFrames to/from bytes
Serialisation to bytes
For serialisation, we need to pick a serializer for the target format. You can either use default_serializer() or explicitly select a serializer, e.g. ParquetSerializer.
from plateau.api.serialization import ParquetSerializer
serializer = ParquetSerializer()
df = ...
serializer.store(store, "storage_key", df)
Deserialisation
For deserialisation, you don’t have to instantiate any serializer as the correct one is determined from the filename.
from plateau.api.serialization import DataFrameSerializer
df = DataFrameSerializer.restore_dataframe(store, "file.parquet")
Supported data types
plateau generally does not impose any restrictions on the data types to be used, as long as they are compatible with the pyarrow pandas integration.
For a detailed explanation about how types are handled, please consult Table type system.
Filtering / Predicate pushdown
You can provide a filter expression in disjunctive normal form (DNF) as a nested list, where every inner list is interpreted as a logical conjunction (AND) and the entire expression is interpreted as one disjunction (OR).
import datetime

predicates = [
    [
        ("ColumnA", "==", 5),
    ],
    [
        ("ColumnA", ">", 5),
        ("ColumnB", "<=", datetime.date(2021, 1, 1)),
    ],
]
The above list of predicates can be interpreted as the following WHERE clause:
ColumnA = 5 OR (ColumnA > 5 AND ColumnB <= '2021-01-01')
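To make the AND/OR semantics concrete, here is a minimal sketch in plain Python that evaluates such a nested list against rows represented as dicts. It is not how plateau evaluates predicates internally; the names OPERATORS and row_matches are hypothetical and exist only for this illustration.

```python
import datetime
import operator

# Hypothetical mapping from operator strings to Python callables.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
    "in": lambda cell, candidates: cell in candidates,
}


def row_matches(row, predicates):
    """Return True if the dict `row` satisfies the DNF `predicates`.

    The outer list is an OR over conjunctions, each inner list is an
    AND over (column, operator, value) literals.
    """
    return any(
        all(OPERATORS[op](row[column], value) for column, op, value in conjunction)
        for conjunction in predicates
    )


predicates = [
    [("ColumnA", "==", 5)],
    [("ColumnA", ">", 5), ("ColumnB", "<=", datetime.date(2021, 1, 1))],
]

rows = [
    {"ColumnA": 5, "ColumnB": datetime.date(2022, 6, 1)},    # matches first conjunction
    {"ColumnA": 7, "ColumnB": datetime.date(2020, 12, 31)},  # matches second conjunction
    {"ColumnA": 7, "ColumnB": datetime.date(2021, 1, 2)},    # matches neither
    {"ColumnA": 3, "ColumnB": datetime.date(2020, 1, 1)},    # matches neither
]

selected = [row_matches(row, predicates) for row in rows]
```

In this sketch a row is kept as soon as any one inner list is fully satisfied, which is the behaviour the nested-list format describes.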
The predicate expression can be provided to the predicates keyword argument of the serializer and/or full dataset read interfaces.
Note
All plateau reading pipelines expose this predicates argument as well, where it is used not only for predicate pushdown but also for partition pruning. See Efficient Querying for details.
Literals, operators and typing
The literals used for building the predicates are tuples with three elements.
(<FieldName: str>, <Operator: str>, <Value: Any>)
FieldName is a str identifying the column this literal describes.
Operator is a string for the logical operation applied to the field. Available operators are ==, !=, <=, >=, <, > and in.
Value is the actual value for the query. The type of this value must always be identical to the field's data type. We apply the same type normalization for the predicates as described in Table type system.
Filtering for missing values (nulls) is supported with the operators ==, != and in, using np.nan as the value for float columns and None for string columns.
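Null literals need special handling because an ordinary comparison with NaN is always False in Python (float("nan") == float("nan") evaluates to False). The following sketch shows one possible way to interpret null-valued literals for the three supported operators. It is an assumption made for illustration, not plateau's implementation; is_null and literal_matches are hypothetical helpers, and float("nan") stands in for np.nan to keep the example free of dependencies.

```python
import math


def is_null(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def literal_matches(cell, op, value):
    """Evaluate one literal against a single cell, with null-aware semantics.

    Assumption for this sketch: a null `value` turns `==` into an
    "is null" test and `!=` into an "is not null" test.
    """
    if op == "==":
        return is_null(cell) if is_null(value) else cell == value
    if op == "!=":
        return not is_null(cell) if is_null(value) else cell != value
    if op == "in":
        # `value` is a collection; a null member matches null cells.
        return any(literal_matches(cell, "==", candidate) for candidate in value)
    raise ValueError(f"operator {op!r} is not supported for null filtering")


nan = float("nan")
float_column = [1.5, nan, 3.0]
string_column = ["a", None, "b"]

# Positions of missing values in each column.
null_floats = [literal_matches(cell, "==", nan) for cell in float_column]
non_null_strings = [literal_matches(cell, "!=", None) for cell in string_column]
a_or_null = [literal_matches(cell, "in", ["a", None]) for cell in string_column]
```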