Table type system
This document explains the type system of plateau.
Motivation
To understand why and how the type system was designed, we illustrate the requirements and use cases plateau should cover:
simplicity: For the average programmer, it should be possible to understand the semantics of plateau quickly.
optimized data representation: Data providers (humans and software) should be able to pick a memory representation that is sufficient to hold the data in question (e.g. pick an 8-bit unsigned integer if you only have integer up to 255).
lifecycle management: A dataset may be used for many months and can be extended over this time. Also, the software used with the dataset might change.
efficiency: Data written once should not be altered if not really necessary.
stability: plateau is made for production use. It should provide stable, predicable outputs and must not crash or lead other libraries to behave in an expected way.
compatibility: plateau is not the only software in the stack and must play nicely with others (also see Related Type Systems).
Base Type System
Since we think that Apache Arrow is a solid, future-proof choice for in-memory dataframes and will hopefully offer a sane alternative for many Pandas use cases, we opt for Apache Arrow as our type system baseline. In contrast to NumPy and Pandas, it offers better support for nested types, is more strict about type conversions and can also be used when interacting with other languages like R, Julia and Java.
Note
Even though we have chosen Apache Arrow as our type system, we will demonstrate many things using NumPy since it enables us to easily play around with fixed-size scalar values.
Note
Most user interaction with plateau are likely done by using Pandas DataFrames. Please consolidate the pyarrow documentation on Type differences to see how different Pandas constructs map to Apache Arrow.
Type Classes
This section shows how and why certain types are treated as compatible. This is required to enable partition-based memory and performance optimization. The rules for picking compatible type classes (i.e. a set of types which can be mapped to single container type) are:
feasibility: the container type must be able to hold all values of all types in the type class
semantics: types within a type class should be considered to be semantically compatible
Bool
Booleans (short “bool”) are the most basic unit of information. A bool can either be true or false. It is a quite handy bit of information to store things like “is this account active or not”.
Apache Arrow knows a single type here, bool
, which does not require any normalization on its own.
Unsigned Integers
Unsigned integers are used to store whole non-negative numbers with numerical information (often counts like number of cars) or to hold IDs (like a numeric ID generated by a database).
The maximum value of an unsigned integer with a given bitwidth \(b\) is \(2^b - 1\). The minimum value is always 0. All whole numbers between minimum and maximum can be represented. The following table lists the ranges for all supported unsigned integer types.
Type |
Minimum |
Maximum |
---|---|---|
|
0 |
255 |
|
0 |
65,535 |
|
0 |
4,294,967,295 |
|
0 |
18,446,744,073,709,551,615 |
Therefore, all values of all supported unsigned integers types can be represented by uint64
.
Warning
Since many libraries do not guarantee that small unsigned integers (like uint8
) are kept small (Pandas for
example chooses uint64
whenever it detects an overflow), do NOT rely on the exact bitwidth and pattern. If you
want to encode bitstream data, use bytes
!
Signed Integers
Signed integers are used to store whole numbers with numerical information (like delta of number of cars) or can also be used to hold index data like array offsets.
They work similar to unsigned integers, but the range is different. For a given bitwidth \(b\), the minimum is \(-(2^{b - 1})\) and the maximum is \(2^{b - 1} - 1\). The following table listsa all range of all supported signed integer types.
Type |
Minimum |
Maximum |
---|---|---|
|
-128 |
127 |
|
-32,768 |
32,767 |
|
-2,147,483,648 |
2,147,483,647 |
|
-9,223,372,036,854,775,808 |
9,223,372,036,854,775,807 |
Therefore, all values of all supported signed integer types can be represented by int64
.
Floats
Floating point values (often just called “floats”) are used to store quantitative data where precision is often only relatively important. For example, if you want to know the age of the universe in milliseconds (around 4.35e+20), a millisecond more or less might not make the difference.
The floating point types float16
, float32
, and float64
are defined in IEEE 754 and are called
binary{16, 32, 64}
there.
Type |
Fraction Bits |
Exponent Bits |
---|---|---|
|
11 |
5 |
|
24 |
8 |
|
53 |
11 |
Values are stored, depending on the bit-pattern in the exponent bits, as:
normalized: \((-1)^\text{signbit} \times 1.\text{fraction}_2 \times 2^{\text{exponent}_2 - \text{expbias}}\)
subnormals: \((-1)^\text{signbit} \times 0.\text{fraction}_2 \times 2^{1 - \text{expbias}}\)
zero: \((-1)^\text{signbit} \times 0\)
infinity: \((-1)^\text{signbit} \times \infty\)
where \(\text{expbias} = 2^{\#\text{exponentbits}}-1\).
Important
Signaling NaN values are discouraged and should not be used!
Important
NaN payloads are not handled and should not be used. The IEEE 754 declares them as optional and hardware and software may wipe them anyway, so portable code cannot make use this data.
For each of these categories, we can represent all values of float{16, 32}
by using float64
. So we normalize all
floating point types to float64
.
Decimal
Decimals have a given precision and scale and used to store fixed-point floats like money.
There is a single decimal type decimal128[P, S]
where P
measures the total precision in digits and S
measures the scale in digits (therefore \(P \ge S\)).
>>> from decimal import Context, Decimal
>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame({
... "profit_eu": [Decimal("110.12"), Decimal("20.00")],
... "reveneu_eu": [Decimal("20.00"), Decimal("1000.00")],
... "profit_lyd": [Decimal("0.0000"), Decimal("22.1050")],
... "reveneu_lyd": [Decimal("0.0000"), Decimal("200.0000")],
... })
>>> schema = pa.Schema.from_pandas(df)
>>> schema.field("profit_eu").type
Decimal128Type(decimal128(5, 2))
>>> schema.field("reveneu_eu").type
Decimal128Type(decimal128(6, 2))
>>> schema.field("profit_lyd").type
Decimal128Type(decimal128(6, 4))
>>> schema.field("reveneu_lyd").type
Decimal128Type(decimal128(7, 4))
As shown, not only the scale changes for various numbers but also the precision is bound to the largest number. While the scale-handling makes sense (currencies should not be mixed), the precision-handling is unfortunate and may lead to various problems.
We currently do not implement a normalization. This might change in future metadata versions.
Warning
Because no normalization is implemented for different decimal precisions, we strongly advice against using them in plateau.
Date
Dates are normally used to to store “which day it is”.
There are two date types with slightly different semantics:
date32
: 32bit unsigned integer counter for days since UNIX epochdate64
: 64bit unsigned integer counter for milliseconds since UNIX epoch
In theory, we could fit all date32
values into date64
:
>>> import math
>>> n_years_date32 = math.floor(2**32 / 366)
>>> n_years_date64 = math.floor(2**64 / (366 * 24 * 3600 * 1000))
>>> n_years_date32, n_years_date64
(11734883, 583344214)
>>> n_years_date64 > n_years_date32
True
Since date64
is a very rarely used, this normalization is currently NOT implemented. This might change in a future
metadata version.
Note
Date in Pandas can only be used by using an object
column with datetime.date
objects. Since this is
neither backed by NumPy nor has a special implementation in Pandas, this might be too slow and memory intensive
for certain use cases. There are the following known workarounds:
timestamps: Timestamps are backed by NumPy using the
datetime64
type and map directly to integer-like data and arithmetics. Use “midn” as a time (e.g.2019-05-21 00:00:00
) and most features including Pandas support work as expected.extension types: Using Extension Types would make it possible to have proper, fast date types in Pandas. Note that this would also require to either convert them back and forth before/after the plateau interaction or to teach pyarrow about them.
Time
This is the colleague of Date and stores the time at a given day.
The normalization of time32[U]
and time64[U]
(where U
is either "s"
for seconds or "ms"
for
milliseconds) is currently not implemented. This might change in a future metadata version.
Timestamp
A combination of Date and Time and is particularly useful to store when an event occurred without the need to store date and time separately.
There is a single, parametrized timestamp type called timestamp[U, Z]
(where U
is any of "s"
for seconds,
"ms"
for milliseconds, "us"
for microseconds, "ns"
for nanoseconds; and Z
stands for the timezone). It
occupies 64bits.
We cannot treat timestamps for different timezones as the same time because the timezone parameter has important semantic meaning. We also cannot treat timestamps with different unit types as same since they all have very different ranges. So, no normalization is implemented for timestamps.
Note
For compatibility reasons, plateau coerces timestamps to us accuracy, effectively truncating the timestamp. If the timestamp actually has a higher accuracy, arrow raises an exception, rejecting it
In [1]: df = pd.DataFrame({"nano": [pd.Timestamp("2021-01-01 00:00:00.0000001")]})
# nanosecond resolution
In [2]: ser.store(store, "key", df)
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
Cell In[2], line 1
----> 1 ser.store(store, "key", df)
File ~/checkouts/readthedocs.org/user_builds/plateau/conda/latest/lib/python3.12/site-packages/plateau/serialization/_parquet.py:378, in ParquetSerializer.store(self, store, key_prefix, df)
375 table = pa.Table.from_pandas(df)
376 buf = pa.BufferOutputStream()
--> 378 pq.write_table(
379 table,
380 buf,
381 version=PARQUET_VERSION,
382 chunk_size=self.chunk_size,
383 compression=self.compression,
384 coerce_timestamps="us",
385 )
386 store.put(key, buf.getvalue().to_pybytes())
387 return key
File ~/checkouts/readthedocs.org/user_builds/plateau/conda/latest/lib/python3.12/site-packages/pyarrow/parquet/core.py:1908, in write_table(table, where, row_group_size, version, use_dictionary, compression, write_statistics, use_deprecated_int96_timestamps, coerce_timestamps, allow_truncated_timestamps, data_page_size, flavor, filesystem, compression_level, use_byte_stream_split, column_encoding, data_page_version, use_compliant_nested_type, encryption_properties, write_batch_size, dictionary_pagesize_limit, store_schema, write_page_index, write_page_checksum, sorting_columns, **kwargs)
1882 try:
1883 with ParquetWriter(
1884 where, table.schema,
1885 filesystem=filesystem,
(...)
1906 sorting_columns=sorting_columns,
1907 **kwargs) as writer:
-> 1908 writer.write_table(table, row_group_size=row_group_size)
1909 except Exception:
1910 if _is_path_like(where):
File ~/checkouts/readthedocs.org/user_builds/plateau/conda/latest/lib/python3.12/site-packages/pyarrow/parquet/core.py:1104, in ParquetWriter.write_table(self, table, row_group_size)
1099 msg = ('Table schema does not match schema used to create file: '
1100 '\ntable:\n{!s} vs. \nfile:\n{!s}'
1101 .format(table.schema, self.schema))
1102 raise ValueError(msg)
-> 1104 self.writer.write_table(table, row_group_size=row_group_size)
File ~/checkouts/readthedocs.org/user_builds/plateau/conda/latest/lib/python3.12/site-packages/pyarrow/_parquet.pyx:2180, in pyarrow._parquet.ParquetWriter.write_table()
File ~/checkouts/readthedocs.org/user_builds/plateau/conda/latest/lib/python3.12/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 1609459200000000100
One possibility to deal with this is to set the appropriate accuracy using pandas.Timestamp.ceil or pandas.Timestamp.floor
In [3]: df.nano = df.nano.dt.ceil("us")
In [4]: ser.restore_dataframe(store, ser.store(store, "key", df))
Out[4]:
nano
0 2021-01-01 00:00:00.000001
Lists
They are used to store a set of elements in a fixed order, like a list of cities to visit, or a plan how to connect given points to draw a panda.
Lists in Apache Arrow have a homogeneous element type. We can therefore assume that they can be optimized for certain
partitions similar to other data types. We therefore treat lists with compatible element types as compatible, i.e.
list[T1]
and list[T2]
are compatible iff T1
and T2
are compatible.
Structs
Structures (short “structs”) might be the most complex data type. They are used to store a collection of other data types, like all ID card information (containing name, the birthday and a picture). They can even be nested, i.e. a struct can hold another struct.
Normalization for structs is currently not implemented but might be in future releases.
Incompatibilities
This section points out why we treat certain type classes as incompatible, also in contrast to other libraries.
Signed / Unsigned Integer
This section shows why signed and unsigned integers are two distinct type classes.
Let us assume we represent signed and unsigned integers by the largest available types, int64
and uint64
. If
they would be in the same type class, either int64
or uint64
should than be able to represent all values of the
other. This however, does not work for uint64
because it cannot represent negative numbers. For int64
, this also
is not feasible because the range \((9223372036854775807, 18446744073709551615]\) cannot be represented (this is the
range int64
sacrifices to be able to represent negative numbers):
>>> import numpy as np
>>> x = ~np.uint64(0)
>>> y = np.int64(x)
>>> x, y
(18446744073709551615, -1)
Now you could represent uint{8, 16, 32}
(w/o uint64
) with int64
, but making uint64
special would be
confusing and also contradict the illustrated optimization use case.
Important
This is different to Dask and Pandas.
Float / Integer
Looking at the range of float64
, it may be feasible to just pack all integers into a floating point values and
everything is fine. This is what Pandas is doing by default. Since a float64
only has 53
fraction bits, it cannot store all 64 bit integers:
>>> import numpy as np
>>> x = np.int64((1 << 53) + 1)
>>> y = np.int64(np.float64(x))
>>> x, y
(9007199254740993, 9007199254740992)
>>> import numpy as np
>>> x = np.uint64((1 << 53) + 1)
>>> y = np.uint64(np.float64(x))
>>> x, y
(9007199254740993, 9007199254740992)
Integers might hold IDs which are by nature rather categorical than numeric. There, these tiny errors might lead to wrong / unpredictable results or crashes, we decided to treat integers and floats as distinct type classes.
Important
This is different to Dask and Pandas.
String / Binary
Not all binary
values are valid Unicode, e.g.:
>>> b"\xff".decode("utf8")
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Furthermore, the encoding of Unicode strings is not per se defined. It might be UTF-8, UTF-16, UTF-32, or something
completely different. For that reason, we also cannot just represent all string
values with binary
.
This incompatibility is also supported by the semantic meaning that binary
data might be any bitstream (like image
data, crypto keys, Thrift bitstreams) and string
is reserved for text-like data.
Note
This was especially problematic under Python 2, where the content of str
was undefined and unicode
was not
the default choice of many libraries like Pandas. Under Python 3, this is now clarified (str
are always
Unicode), so it is easier for users to produce and consume proper string data.
Bool / Integer
We could encode booleans as signed or unsigned integer (False -> 0
and True -> 1
), but decided against it for
the following reasons:
semantic: Integers and booleans have a different meaning. Also, it is not always clear that
False
andTrue
are mapped to0
and1
.optimization: Booleans are clearly more efficient than integers and we would like to preserve that extreme advantage.
library support: Pandas for example makes a difference depending if a column contains boolean or integer data:
>>> import pandas as pd >>> df = pd.DataFrame({ ... "b": [False, True], ... "i": [0, 1], ... }) >>> df.dtypes b bool i int64 dtype: object >>> ~df["b"] 0 True 1 False Name: b, dtype: bool >>> ~df["i"] 0 -1 1 -2 Name: i, dtype: int64
Null
While null
has a semantic meaning, they can easily occur in production due to the type inference that pyarrow has to
do when working with pandas dataframes:
>>> import pandas as p
>>> import pyarrow as pa
>>> df = pd.DataFrame({
... "single_value": [None, "foo", None],
... "no_value": [None, None, None],
... })
>>> schema = pa.Schema.from_pandas(df)
>>> schema.field("single_value").type
DataType(string)
>>> schema.field("no_value").type
DataType(null)
The reason is that string and also data objects are stored as object
columns in pandas, which can contain arbitrary
python objects. None
acts as a placeholder “missing value”. Apache Arrow requires that values in a columns have
one single type and therefore needs to guess what an object
column should represent (i.e. type inference). If
pyarrow does not find any non-Null object, it treats the column as null
. Sadly, this might be wrong. It could easily
also have meant to be a string
or date32
column, but pyarrow cannot know that.
To keep things pragmatic, we ignore null
during type checks.
Dictionary Encoding
Dictionary encoded data is normally produced by Pandas categoricals:
>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame({
... "s": pd.Series(["foo", "foo", "bar"]).astype("category"),
... })
>>> schema = pa.Schema.from_pandas(df)
>>> schema.field("s").type
DictionaryType(dictionary<values=string, indices=int8, ordered=0>)
They have the form dictionary[T, I, O]
where T
represents the value type, I
the index type (mostly integers)
and O
flags if the index is ordered or not.
Since categoricals are, in our opinion, a pure optimization and do not alter the nature of the data, we treat
dictionary-encoded data like the values they encode. So dictionary[T1, I1, O1]
is compatible with T2
if T1
and T2
are compatible. This also means that it is compatible with dictionary[T2, I2, O2]
. Note that the ordered
flag and the index data type are ignored. So the values in the example shown above are treated like string
.
Normalization
Following the outlined guidelines, we can write down the following normalization rule set:
Type Class |
Normalization |
Examples |
---|---|---|
signed integer |
|
norm(int8) = int64 norm(int64) = int64 |
unsigned integer |
|
norm(uint8) = uint64 norm(uint64) = uint64 |
float |
|
norm(float8) = float64 norm(float64) = float64 |
list |
|
norm(list[int8]) = list[int64] norm(list[int64]) = list[int64] norm(list[list[int8]]) = list[list[int64]] norm(list[string]) = list[string] norm(list[dictionary[int8, int8, 1]]) = list[int64] |
dictionary |
|
norm(dictionary[str, int8, 0]) = str norm(dictionary[int8, int16, 1]) = int64 norm(dictionary[list[int8], int8, 1]) = list[int64] |
Technical Implementation
There are three sources of type information:
partition parquet files: the actual payload data written to the different parquet files
common metadata: the metadata that offers a quick introspection and is also used to recover type information for partition indices since they are stored as strings and are part of the payload storage keys
secondary indices: parquet with secondary index information are typed
The ground truth for type information is the common metadata file. There, the outlined normalization is applied. The
payload data and the secondary indices may have any type that belongs to the correct type class, i.e. where
norm(T_payload)
equals T_common_metadata
.