plateau.core.common_metadata module

class plateau.core.common_metadata.SchemaWrapper(schema, origin: str | Set[str])[source]

Bases: object

Wrapper object for pyarrow.Schema to handle forwards and backwards compatibility.
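
Examples

A minimal construction sketch (the schema and origin values here are hypothetical; per the signature above, origin may be a single string or a set of strings):

>>> import pyarrow as pa
>>> from plateau.core.common_metadata import SchemaWrapper
>>> schema = pa.schema([
...     pa.field('n_legs', pa.int64()),
...     pa.field('animals', pa.string())])
>>> wrapped = SchemaWrapper(schema, origin='partition-0')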

equals(self, Schema other, bool check_metadata=False)[source]

Test if this schema is equal to the other

Parameters:
  • other (pyarrow.Schema)

  • check_metadata (bool, default False) – Key/value metadata must be equal too

Returns:

is_equal

Return type:

bool

Examples

>>> import pyarrow as pa
>>> schema1 = pa.schema([
...     pa.field('n_legs', pa.int64()),
...     pa.field('animals', pa.string())],
...     metadata={"n_legs": "Number of legs per animal"})
>>> schema2 = pa.schema([
...     ('some_int', pa.int32()),
...     ('some_string', pa.string())
... ])

Test two equal schemas:

>>> schema1.equals(schema1)
True

Test two unequal schemas:

>>> schema1.equals(schema2)
False

internal()[source]

property origin: Set[str]

remove(i)[source]

Remove the field at index i from the schema.

Parameters:
  • i (int)

Returns:

schema

Return type:

Schema

Examples

>>> import pyarrow as pa
>>> schema = pa.schema([
...     pa.field('n_legs', pa.int64()),
...     pa.field('animals', pa.string())])

Remove the second field of the schema:

>>> schema.remove(1)
n_legs: int64

remove_metadata(self)[source]

Create new schema without metadata, if any

Returns:

schema

Return type:

pyarrow.Schema

Examples

>>> import pyarrow as pa
>>> schema = pa.schema([
...     pa.field('n_legs', pa.int64()),
...     pa.field('animals', pa.string())],
...     metadata={"n_legs": "Number of legs per animal"})
>>> schema
n_legs: int64
animals: string
-- schema metadata --
n_legs: 'Number of legs per animal'

Create a new schema by removing the metadata from the original:

>>> schema.remove_metadata()
n_legs: int64
animals: string

set(self, int i, Field field)[source]

Replace a field at position i in the schema.

Parameters:
  • i (int)

  • field (Field)

Returns:

schema

Return type:

Schema

Examples

>>> import pyarrow as pa
>>> schema = pa.schema([
...     pa.field('n_legs', pa.int64()),
...     pa.field('animals', pa.string())])

Replace the second field of the schema with a new field 'replaced':

>>> schema.set(1, pa.field('replaced', pa.bool_()))
n_legs: int64
replaced: bool

with_origin(origin: str | Set[str]) → SchemaWrapper[source]

Create new SchemaWrapper with given origin.

Parameters:

origin – New origin.
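
Examples

A hypothetical sketch, reusing the wrapped object from the class example above; a new wrapper is created and the original is left unchanged:

>>> wrapped2 = wrapped.with_origin({'partition-0', 'partition-1'})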

plateau.core.common_metadata.empty_dataframe_from_schema(schema, columns=None, date_as_object=False, coerce_temporal_nanoseconds=True)[source]

Create an empty DataFrame from the provided schema.

Parameters:
  • schema (Schema) – Schema information of the new empty DataFrame.

  • columns (Union[None, List[str]]) – Optional list of columns that should be part of the resulting DataFrame. All columns in that list must also be part of the provided schema.

  • date_as_object (bool) – Cast dates to objects.

  • coerce_temporal_nanoseconds (bool) – Coerce date32, date64, duration and timestamp units to nanoseconds to retain behaviour of pandas 1.x. Only applicable to pandas version >= 2.0 and PyArrow version >= 13.0.0.

Returns:

Empty DataFrame with requested columns and types.

Return type:

DataFrame
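
Examples

A hypothetical doctest based on the documented contract (the resulting dtypes depend on the conversion options above):

>>> import pyarrow as pa
>>> from plateau.core.common_metadata import empty_dataframe_from_schema
>>> schema = pa.schema([
...     pa.field('n_legs', pa.int64()),
...     pa.field('animals', pa.string())])
>>> df = empty_dataframe_from_schema(schema, columns=['n_legs'])
>>> list(df.columns)
['n_legs']
>>> len(df)
0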

plateau.core.common_metadata.make_meta(obj, origin, partition_keys=None)[source]

Create metadata object for DataFrame.

Note

For convenience, this function can also be applied to schema objects, in which case they are returned unchanged.

Warning

Information for categoricals will be stripped!

normalize_type() will be applied to normalize type information and normalize_column_order() will be applied to reorder column information.

Parameters:
  • obj (Union[DataFrame, Schema]) – Object to extract metadata from.

  • origin (str) – Origin of the schema data, used for debugging and error reporting.

  • partition_keys (Union[None, List[str]]) – Partition keys used to split the dataset.

Returns:

schema – Schema information for DataFrame.

Return type:

SchemaWrapper
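
Examples

A minimal sketch with a hypothetical DataFrame and origin label:

>>> import pandas as pd
>>> from plateau.core.common_metadata import make_meta
>>> df = pd.DataFrame({'part': [0], 'b': [1.0], 'a': ['x']})
>>> meta = make_meta(df, origin='partition-0', partition_keys=['part'])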

plateau.core.common_metadata.normalize_column_order(schema, partition_keys=None)[source]

Normalize column order in schema.

Columns are sorted in the following way:

  1. Partition keys (as provided by partition_keys)

  2. DataFrame columns in alphabetic order

  3. Remaining fields as generated by pyarrow, mostly index columns

Parameters:
  • schema (SchemaWrapper) – Schema information for DataFrame.

  • partition_keys (Union[None, List[str]]) – Partition keys used to split the dataset.

Returns:

schema – Schema information for DataFrame.

Return type:

SchemaWrapper
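
Examples

A hypothetical sketch of the ordering described above; with partition key 'part', the normalized schema should list 'part' first, then 'a' and 'b' in alphabetic order, followed by any pyarrow-generated index fields:

>>> import pandas as pd
>>> from plateau.core.common_metadata import make_meta, normalize_column_order
>>> meta = make_meta(
...     pd.DataFrame({'b': [0], 'part': [0], 'a': [0]}),
...     origin='partition-0')
>>> normalized = normalize_column_order(meta, partition_keys=['part'])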

plateau.core.common_metadata.normalize_type(t_pa: DataType, t_pd: str | None, t_np: str | None, metadata: Dict[str, Any] | None) → Tuple[DataType, str | None, str | None, Dict[str, Any] | None][source]

This will normalize types as follows:

  • all signed integers (int8, int16, int32, int64) will be converted to int64

  • all unsigned integers (uint8, uint16, uint32, uint64) will be converted to uint64

  • all floats (float32, float64) will be converted to float64

  • all list value types will be normalized (e.g. list[int16] to list[int64], list[list[uint8]] to list[list[uint64]])

  • all dict value types will be normalized (e.g. dictionary<values=float32, indices=int16, ordered=0> to float64)

Parameters:
  • t_pa – pyarrow type object, e.g. pa.list_(pa.int8()).

  • t_pd – pandas type identifier, e.g. "list[int8]".

  • t_np – numpy type identifier, e.g. "object".

  • metadata – metadata associated with the type, e.g. information about categoricals.
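
Examples

A hypothetical call illustrating the integer widening described above (the printed representation of the returned pyarrow type may differ across versions):

>>> import pyarrow as pa
>>> from plateau.core.common_metadata import normalize_type
>>> t_pa, t_pd, t_np, meta = normalize_type(pa.int8(), 'int8', 'int8', None)
>>> t_pa
DataType(int64)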

plateau.core.common_metadata.validate_compatible(schemas, ignore_pandas=False)[source]

Validate that all schemas in a given list are compatible.

Apart from the pandas version preserved in the schema metadata, schemas must be completely identical. That includes a perfect match of the whole metadata (except the pandas version) and pyarrow types.

Use make_meta() and normalize_column_order() for type and column order normalization.

If none of the schemas contain any pandas metadata, the Arrow schemas are checked directly for compatibility.

Parameters:
  • schemas (List[Schema]) – Schema information from multiple sources, e.g. multiple partitions. List may be empty.

  • ignore_pandas (bool) – Ignore the schema information given by pandas and always use the Arrow schema.

Returns:

schema – The reference schema against which all other schemas were validated.

Return type:

SchemaWrapper

Raises:

ValueError – If at least two schemas are incompatible.
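
Examples

A minimal sketch with two hypothetical partition schemas built via make_meta(); since they are identical apart from their origins, validation passes and the reference schema is returned:

>>> import pandas as pd
>>> from plateau.core.common_metadata import make_meta, validate_compatible
>>> schema1 = make_meta(pd.DataFrame({'a': [1]}), origin='partition-0')
>>> schema2 = make_meta(pd.DataFrame({'a': [1]}), origin='partition-1')
>>> reference = validate_compatible([schema1, schema2])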