.. _type_system:

=================
Table type system
=================

This document explains the type system of plateau.

Motivation
----------

To understand why and how the type system was designed, we illustrate the requirements and use cases plateau should
cover:

- **simplicity:** For the average programmer, it should be possible to understand the semantics of plateau quickly.
- **optimized data representation:** Data providers (humans and software) should be able to pick a memory
  representation that is sufficient to hold the data in question (e.g. pick an 8-bit unsigned integer if you only have
  integers up to 255).
- **lifecycle management:** A dataset may be used for many months and can be extended over this time. Also, the
  software used with the dataset might change.
- **efficiency:** Data written once should not be altered if not really necessary.
- **stability:** plateau is made for production use. It should provide stable, predictable outputs and must not crash
  or lead other libraries to behave in an unexpected way.
- **compatibility:** plateau is not the only software in the stack and must play nicely with others (also see
  `Related Type Systems`_).

Base Type System
----------------

Since we think that `Apache Arrow`_ is a solid, future-proof choice for in-memory dataframes and will hopefully offer a
sane alternative for many `Pandas`_ use cases, we opt for `Apache Arrow`_ as our type system baseline. In contrast to
`NumPy`_ and `Pandas`_, it offers better support for nested types, is more strict about type conversions and can also
be used when interacting with other languages like `R`_, `Julia`_ and `Java`_.

.. note::

   Even though we have chosen `Apache Arrow`_ as our type system, we will demonstrate many things using `NumPy`_ since
   it enables us to easily play around with fixed-size scalar values.

.. note::

   Most user interactions with plateau are likely done by using `Pandas`_ DataFrames.
   Please consult the pyarrow documentation on `Type differences `_ to see how different `Pandas`_ constructs map to
   `Apache Arrow`_.

Type Classes
------------

This section shows how and why certain types are treated as compatible. This is required to enable partition-based
memory and performance optimization.

The rules for picking compatible type classes (i.e. a set of types which can be mapped to a single container type) are:

- **feasibility:** the container type must be able to hold all values of all types in the type class
- **semantics:** types within a type class should be considered to be semantically compatible

Bool
~~~~

Booleans (short "bool") are the most basic unit of information. A bool can either be true or false. It is quite a handy
bit of information for storing things like "is this account active or not". `Apache Arrow`_ knows a single type here,
``bool``, which does not require any normalization on its own.

Unsigned Integers
~~~~~~~~~~~~~~~~~

Unsigned integers are used to store whole non-negative numbers with numerical information (often counts like number of
cars) or to hold IDs (like a numeric ID generated by a database).

The maximum value of an unsigned integer with a given bitwidth :math:`b` is :math:`2^b - 1`. The minimum value is
always 0. All whole numbers between minimum and maximum can be represented. The following table lists the ranges for
all supported unsigned integer types.

+------------+---------+----------------------------+
| Type       | Minimum | Maximum                    |
+============+=========+============================+
| ``uint8``  | 0       | 255                        |
+------------+---------+----------------------------+
| ``uint16`` | 0       | 65,535                     |
+------------+---------+----------------------------+
| ``uint32`` | 0       | 4,294,967,295              |
+------------+---------+----------------------------+
| ``uint64`` | 0       | 18,446,744,073,709,551,615 |
+------------+---------+----------------------------+

Therefore, all values of all supported unsigned integer types can be represented by ``uint64``.

.. warning::

   Since many libraries do not guarantee that small unsigned integers (like ``uint8``) are kept small (`Pandas`_ for
   example chooses ``uint64`` whenever it detects an overflow), do NOT rely on the exact bitwidth and pattern. If you
   want to encode bitstream data, use ``bytes``!

Signed Integers
~~~~~~~~~~~~~~~

Signed integers are used to store whole numbers with numerical information (like delta of number of cars) or can also
be used to hold index data like array offsets.

They work similarly to unsigned integers, but the range is different. For a given bitwidth :math:`b`, the minimum is
:math:`-(2^{b - 1})` and the maximum is :math:`2^{b - 1} - 1`. The following table lists the ranges of all supported
signed integer types.

+-----------+----------------------------+---------------------------+
| Type      | Minimum                    | Maximum                   |
+===========+============================+===========================+
| ``int8``  | -128                       | 127                       |
+-----------+----------------------------+---------------------------+
| ``int16`` | -32,768                    | 32,767                    |
+-----------+----------------------------+---------------------------+
| ``int32`` | -2,147,483,648             | 2,147,483,647             |
+-----------+----------------------------+---------------------------+
| ``int64`` | -9,223,372,036,854,775,808 | 9,223,372,036,854,775,807 |
+-----------+----------------------------+---------------------------+

Therefore, all values of all supported signed integer types can be represented by ``int64``.

Floats
~~~~~~

Floating point values (often just called "floats") are used to store quantitative data where precision is often only
relatively important. For example, if you want to know the age of the universe in milliseconds (around 4.35e+20), a
millisecond more or less might not make a difference.

The floating point types ``float16``, ``float32``, and ``float64`` are defined in `IEEE 754`_ and are called
``binary{16, 32, 64}`` there.

+-------------+---------------+---------------+
| Type        | Fraction Bits | Exponent Bits |
+=============+===============+===============+
| ``float16`` | 11            | 5             |
+-------------+---------------+---------------+
| ``float32`` | 24            | 8             |
+-------------+---------------+---------------+
| ``float64`` | 53            | 11            |
+-------------+---------------+---------------+

The fraction bit counts in this table include the implicit leading bit of normalized values.

Values are stored, depending on the bit-pattern in the exponent bits, as:

- **normalized:** :math:`(-1)^\text{signbit} \times 1.\text{fraction}_2 \times 2^{\text{exponent}_2 - \text{expbias}}`
- **subnormals:** :math:`(-1)^\text{signbit} \times 0.\text{fraction}_2 \times 2^{1 - \text{expbias}}`
- **zero:** :math:`(-1)^\text{signbit} \times 0`
- **infinity:** :math:`(-1)^\text{signbit} \times \infty`

where :math:`\text{expbias} = 2^{\#\text{exponentbits} - 1} - 1`.

.. important::

   `Signaling NaN`_ values are discouraged and should not be used!

.. important::

   `NaN payloads`_ are not handled and should not be used. The `IEEE 754`_ standard declares them as optional, and
   hardware and software may wipe them anyway, so portable code cannot make use of this data.

For each of these categories, we can represent all values of ``float{16, 32}`` by using ``float64``. So we normalize
all floating point types to ``float64``.

Decimal
~~~~~~~

Decimals have a given precision and scale and are used to store fixed-point numbers like money. There is a single
decimal type ``decimal128[P, S]`` where ``P`` measures the total precision in digits and ``S`` measures the scale in
digits (therefore :math:`P \ge S`).

>>> from decimal import Context, Decimal
>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame({
...     "profit_eu": [Decimal("110.12"), Decimal("20.00")],
...     "revenue_eu": [Decimal("20.00"), Decimal("1000.00")],
...     "profit_lyd": [Decimal("0.0000"), Decimal("22.1050")],
...     "revenue_lyd": [Decimal("0.0000"), Decimal("200.0000")],
... })
>>> schema = pa.Schema.from_pandas(df)
>>> schema.field("profit_eu").type
Decimal128Type(decimal128(5, 2))
>>> schema.field("revenue_eu").type
Decimal128Type(decimal128(6, 2))
>>> schema.field("profit_lyd").type
Decimal128Type(decimal128(6, 4))
>>> schema.field("revenue_lyd").type
Decimal128Type(decimal128(7, 4))

As shown, the scale differs between columns, and the precision is also bound to the largest number in each column.
While the scale-handling makes sense (currencies should not be mixed), the precision-handling is unfortunate and may
lead to various problems. We currently do not implement a normalization. This might change in future metadata versions.

.. warning::

   Because no normalization is implemented for different decimal precisions, we strongly advise against using them in
   plateau.

Date
~~~~

Dates are normally used to store "which day it is".
There are two date types with slightly different semantics:

- ``date32``: 32-bit signed integer counter for days since `UNIX epoch`_
- ``date64``: 64-bit signed integer counter for milliseconds since `UNIX epoch`_

In theory, we could fit all ``date32`` values into ``date64``:

>>> import math
>>> n_years_date32 = math.floor(2**32 / 366)
>>> n_years_date64 = math.floor(2**64 / (366 * 24 * 3600 * 1000))
>>> n_years_date32, n_years_date64
(11734883, 583344214)
>>> n_years_date64 > n_years_date32
True

Since ``date64`` is very rarely used, this normalization is currently NOT implemented. This might change in a future
metadata version.

.. note::

   Date in `Pandas`_ can only be used by using an ``object`` column with :class:`datetime.date` objects. Since this is
   neither backed by `NumPy`_ nor has a special implementation in `Pandas`_, this might be too slow and memory
   intensive for certain use cases. There are the following known workarounds:

   - **timestamps:** Timestamps are backed by `NumPy`_ using the ``datetime64`` type and map directly to integer-like
     data and arithmetic. Use midnight as the time (e.g. ``2019-05-21 00:00:00``) and most features including
     `Pandas`_ support work as expected.
   - **extension types:** Using `Extension Types`_ would make it possible to have proper, fast date types in
     `Pandas`_. Note that this would also require either converting them back and forth before and after the plateau
     interaction or teaching pyarrow about them.

Time
~~~~

This is the colleague of `Date`_ and stores the time at a given day. The normalization of ``time32[U]`` (where ``U`` is
either ``"s"`` for seconds or ``"ms"`` for milliseconds) and ``time64[U]`` (where ``U`` is either ``"us"`` for
microseconds or ``"ns"`` for nanoseconds) is currently not implemented. This might change in a future metadata version.

.. _timestamp:

Timestamp
~~~~~~~~~

A timestamp is a combination of `Date`_ and `Time`_ and is particularly useful to store when an event occurred without
the need to store date and time separately.
There is a single, parametrized timestamp type called ``timestamp[U, Z]`` (where ``U`` is any of ``"s"`` for seconds,
``"ms"`` for milliseconds, ``"us"`` for microseconds, ``"ns"`` for nanoseconds; and ``Z`` stands for the timezone). It
occupies 64 bits.

We cannot treat timestamps for different timezones as the same time because the timezone parameter has important
semantic meaning. We also cannot treat timestamps with different unit types as the same since they all have very
different ranges. So, no normalization is implemented for timestamps.

.. note::

   For compatibility reasons, plateau coerces timestamps to ``us`` accuracy, which would effectively truncate
   timestamps with a higher accuracy. If a timestamp actually has a higher accuracy, arrow raises an exception and
   rejects it.

.. ipython:: python
   :suppress:
   :okwarning:

   import pandas as pd
   from plateau.api.serialization import ParquetSerializer
   from plateau.api.dataset import ensure_store

   store = ensure_store("memory://")
   ser = ParquetSerializer()

.. ipython:: python
   :okexcept:
   :okwarning:

   df = pd.DataFrame({"nano": [pd.Timestamp("2021-01-01 00:00:00.0000001")]})  # nanosecond resolution
   ser.store(store, "key", df)

One possibility to deal with this is to set the appropriate accuracy using `pandas.Timestamp.ceil`_ or
`pandas.Timestamp.floor`_.

.. ipython:: python
   :okwarning:

   df.nano = df.nano.dt.ceil("us")
   ser.restore_dataframe(store, ser.store(store, "key", df))

Lists
~~~~~

Lists are used to store a sequence of elements in a fixed order, like a list of cities to visit, or a plan how to
connect given points to draw a panda.

Lists in `Apache Arrow`_ have a homogeneous element type. We can therefore assume that they can be optimized for
certain partitions similar to other data types. We therefore treat lists with compatible element types as compatible,
i.e. ``list[T1]`` and ``list[T2]`` are compatible iff ``T1`` and ``T2`` are compatible.

Structs
~~~~~~~

Structures (short "structs") might be the most complex data type.
They are used to store a collection of other data types, like all ID card information (containing the name, the
birthday and a picture). They can even be nested, i.e. a struct can hold another struct.

Normalization for structs is currently not implemented but might be in future releases.

Incompatibilities
-----------------

This section points out why we treat certain type classes as incompatible, also in contrast to other libraries.

Signed / Unsigned Integer
~~~~~~~~~~~~~~~~~~~~~~~~~

This section shows why signed and unsigned integers are two distinct type classes.

Let us assume we represent signed and unsigned integers by the largest available types, ``int64`` and ``uint64``. If
they were in the same type class, either ``int64`` or ``uint64`` should then be able to represent all values of the
other. This, however, does not work for ``uint64`` because it cannot represent negative numbers. For ``int64``, this
is also not feasible because the range :math:`(9223372036854775807, 18446744073709551615]` cannot be represented (this
is the range ``int64`` sacrifices to be able to represent negative numbers):

>>> import numpy as np
>>> x = ~np.uint64(0)
>>> y = np.int64(x)
>>> x, y
(18446744073709551615, -1)

Now you could represent ``uint{8, 16, 32}`` (without ``uint64``) with ``int64``, but making ``uint64`` special would be
confusing and would also contradict the illustrated optimization use case.

.. important::

   This is different from Dask and Pandas.

Float / Integer
~~~~~~~~~~~~~~~

Looking at the range of ``float64``, it may seem feasible to just pack all integers into floating point values and
consider the problem solved. This is what `Pandas is doing by default `_.
Since a ``float64`` only has 53 fraction bits, it cannot store all 64 bit integers:

>>> import numpy as np
>>> x = np.int64((1 << 53) + 1)
>>> y = np.int64(np.float64(x))
>>> x, y
(9007199254740993, 9007199254740992)

>>> import numpy as np
>>> x = np.uint64((1 << 53) + 1)
>>> y = np.uint64(np.float64(x))
>>> x, y
(9007199254740993, 9007199254740992)

Integers might hold IDs, which are by nature categorical rather than numeric. Since these tiny errors might lead to
wrong or unpredictable results or crashes there, we decided to treat integers and floats as distinct type classes.

.. important::

   This is different from Dask and Pandas.

String / Binary
~~~~~~~~~~~~~~~

Not all ``binary`` values are valid `Unicode`_, e.g.:

>>> b"\xff".decode("utf8")
Traceback (most recent call last):
    ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Furthermore, the encoding of `Unicode`_ strings is not per se defined. It might be UTF-8, UTF-16, UTF-32, or something
completely different. For that reason, we also cannot just represent all ``string`` values with ``binary``.

This incompatibility is also supported by the semantic meaning that ``binary`` data might be any bitstream (like image
data, crypto keys, `Thrift`_ bitstreams) and ``string`` is reserved for text-like data.

.. note::

   This was especially problematic under Python 2, where the content of ``str`` was undefined and ``unicode`` was not
   the default choice of many libraries like Pandas. Under Python 3, this is now clarified (``str`` objects are always
   `Unicode`_), so it is easier for users to produce and consume proper string data.

Bool / Integer
~~~~~~~~~~~~~~

We could encode booleans as signed or unsigned integers (``False -> 0`` and ``True -> 1``), but decided against it for
the following reasons:

- **semantic:** Integers and booleans have a different meaning. Also, it is not always clear that ``False`` and
  ``True`` are mapped to ``0`` and ``1``.
- **optimization:** Booleans are clearly more efficient than integers and we would like to preserve that extreme
  advantage.
- **library support:** Pandas for example behaves differently depending on whether a column contains boolean or
  integer data:

  >>> import pandas as pd
  >>> df = pd.DataFrame({
  ...     "b": [False, True],
  ...     "i": [0, 1],
  ... })
  >>> df.dtypes
  b     bool
  i    int64
  dtype: object
  >>> ~df["b"]
  0     True
  1    False
  Name: b, dtype: bool
  >>> ~df["i"]
  0   -1
  1   -2
  Name: i, dtype: int64

Null
----

While ``null`` has a semantic meaning, ``null`` columns can easily occur in production due to the type inference that
pyarrow has to do when working with pandas dataframes:

>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame({
...     "single_value": [None, "foo", None],
...     "no_value": [None, None, None],
... })
>>> schema = pa.Schema.from_pandas(df)
>>> schema.field("single_value").type
DataType(string)
>>> schema.field("no_value").type
DataType(null)

The reason is that string and also date objects are stored as ``object`` columns in pandas, which can contain arbitrary
Python objects. ``None`` acts as a placeholder for a missing value. `Apache Arrow`_ requires that all values in a
column have one single type and therefore needs to guess what an ``object`` column should represent (i.e. type
inference). If pyarrow does not find any non-null object, it treats the column as ``null``. Sadly, this might be wrong.
The column could just as well have been meant to be a ``string`` or ``date32`` column, but pyarrow cannot know that.

To keep things pragmatic, we ignore ``null`` during type checks.

.. _`Dictionary Encoding`:

Dictionary Encoding
-------------------

Dictionary encoded data is normally produced by Pandas categoricals:

>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame({
...     "s": pd.Series(["foo", "foo", "bar"]).astype("category"),
... })
>>> schema = pa.Schema.from_pandas(df)
>>> schema.field("s").type
DictionaryType(dictionary)

They have the form ``dictionary[T, I, O]`` where ``T`` represents the value type, ``I`` the index type (mostly
integers) and ``O`` flags if the index is ordered or not.

Since categoricals are, in our opinion, a pure optimization and do not alter the nature of the data, we treat
dictionary-encoded data like the values they encode. So ``dictionary[T1, I1, O1]`` is compatible with ``T2`` if ``T1``
and ``T2`` are compatible. This also means that it is compatible with ``dictionary[T2, I2, O2]``. Note that the ordered
flag and the index data type are ignored.

So the values in the example shown above are treated like ``string``.

Normalization
-------------

Following the outlined guidelines, we can write down the following normalization rule set:

+------------------+------------------------------------+-----------------------------------------------------------+
| Type Class       | Normalization                      | Examples                                                  |
+==================+====================================+===========================================================+
| signed integer   | ``int{8, 16, 32, 64} -> int64``    | | ``norm(int8) = int64``                                  |
|                  |                                    | | ``norm(int64) = int64``                                 |
+------------------+------------------------------------+-----------------------------------------------------------+
| unsigned integer | ``uint{8, 16, 32, 64} -> uint64``  | | ``norm(uint8) = uint64``                                |
|                  |                                    | | ``norm(uint64) = uint64``                               |
+------------------+------------------------------------+-----------------------------------------------------------+
| float            | ``float{16, 32, 64} -> float64``   | | ``norm(float16) = float64``                             |
|                  |                                    | | ``norm(float64) = float64``                             |
+------------------+------------------------------------+-----------------------------------------------------------+
| list             | ``list[T] -> list[norm(T)]``       | | ``norm(list[int8]) = list[int64]``                      |
|                  |                                    | | ``norm(list[int64]) = list[int64]``                     |
|                  |                                    | | ``norm(list[list[int8]]) = list[list[int64]]``          |
|                  |                                    | | ``norm(list[string]) = list[string]``                   |
|                  |                                    | | ``norm(list[dictionary[int8, int8, 1]]) = list[int64]`` |
+------------------+------------------------------------+-----------------------------------------------------------+
| dictionary       | ``dictionary[T, I, O] -> norm(T)`` | | ``norm(dictionary[string, int8, 0]) = string``          |
|                  |                                    | | ``norm(dictionary[int8, int16, 1]) = int64``            |
|                  |                                    | | ``norm(dictionary[list[int8], int8, 1]) = list[int64]`` |
+------------------+------------------------------------+-----------------------------------------------------------+

Technical Implementation
------------------------

There are three sources of type information:

- **partition parquet files:** the actual payload data written to the different parquet files
- **common metadata:** the metadata that offers a quick introspection and is also used to recover type information for
  partition indices since they are stored as strings and are part of the payload storage keys
- **secondary indices:** parquet files with secondary index information are typed

The ground truth for type information is the common metadata file. There, the outlined normalization is applied. The
payload data and the secondary indices may have any type that belongs to the correct type class, i.e. where
``norm(T_payload)`` equals ``T_common_metadata``.

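
The normalization rules and the compatibility check described above can be sketched in a few lines of plain Python.
This is only an illustration under stated assumptions and not plateau's actual implementation: the helper names
``norm`` and ``is_compatible`` are hypothetical, and types are modelled as plain strings and tuples instead of pyarrow
type objects. Treating ``null`` as compatible with everything is one possible reading of "we ignore ``null`` during
type checks".

.. code-block:: python

   # Hypothetical type model: scalar types are strings, parametrized types are tuples,
   # e.g. ("list", "int8") or ("dictionary", "string", "int8", False).
   SIGNED = {"int8", "int16", "int32", "int64"}
   UNSIGNED = {"uint8", "uint16", "uint32", "uint64"}
   FLOATS = {"float16", "float32", "float64"}


   def norm(t):
       """Return the normalized form of a type description."""
       if isinstance(t, tuple):
           kind = t[0]
           if kind == "list":
               # list[T] -> list[norm(T)]
               return ("list", norm(t[1]))
           if kind == "dictionary":
               # dictionary[T, I, O] -> norm(T); index type and ordered flag are ignored
               return norm(t[1])
           raise ValueError(f"unknown parametrized type: {kind!r}")
       if t in SIGNED:
           return "int64"
       if t in UNSIGNED:
           return "uint64"
       if t in FLOATS:
           return "float64"
       # all other types (bool, string, binary, date32, ...) are left unchanged
       return t


   def is_compatible(t_payload, t_common_metadata):
       """Check a payload type against the type stored in the common metadata."""
       # assumption: null is ignored during type checks
       if t_payload == "null" or t_common_metadata == "null":
           return True
       return norm(t_payload) == t_common_metadata


   print(norm(("list", ("dictionary", "int8", "int8", True))))
   print(is_compatible("uint8", "uint64"), is_compatible("uint8", "int64"))

With this model, a ``uint8`` payload column matches a ``uint64`` entry in the common metadata, while it does not match
``int64``, which mirrors the separation of signed and unsigned integers into distinct type classes.
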

Related Type Systems
--------------------

Python programmers can encounter different type systems, some examples are:

- `Python`_:

  - `Boolean Values `_
  - `Numeric Types `_
  - `Text Sequence Type `_
  - `Binary Sequence Types `_
  - `Null `_
  - `Sequence Types `_
  - `Date `_
  - `DateTime `_
  - `Decimal `_

- `NumPy`_:

  - `Data Types `_
  - `Data Type Objects `_
  - `Datetimes and Timedelta `_

- `Pandas`_:

  - `dtypes `_
  - `Extension Types`_
  - `Categorical Data `_
  - `NA Type Promotion `_
  - `Nullable Integer Data Type `_

- `PyTorch`_:

  - `torch.Tensor `_

- `Turbodbc`_:

  - `Supported Data Types `_
  - `NumPy Interaction `_
  - `Arrow Interaction `_

- `Apache Arrow`_:

  - `Python Bindings - Data Types `_
  - `Specification - Logical Types `_

- `Parquet`_:

  - `Format - Logical Types `_

plateau aims to be as compatible as possible with them.

.. _Apache Arrow: https://arrow.apache.org/
.. _Extension Types: https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types
.. _IEEE 754: https://en.wikipedia.org/wiki/IEEE_754
.. _Java: https://openjdk.java.net/
.. _Julia: https://julialang.org/
.. _NaN Payloads: https://anniecherkaev.com/the-secret-life-of-nan
.. _NumPy: https://www.numpy.org/
.. _Parquet: https://parquet.apache.org/
.. _Pandas: https://pandas.pydata.org/
.. _Python: https://www.python.org/
.. _PyTorch: https://pytorch.org/
.. _R: https://www.r-project.org/
.. _Signaling NaN: https://en.wikipedia.org/wiki/NaN#Signaling_NaN
.. _Thrift: https://thrift.apache.org/
.. _Turbodbc: https://github.com/blue-yonder/turbodbc
.. _Unicode: https://unicode.org/
.. _UNIX Epoch: https://en.wikipedia.org/wiki/Unix_time
.. _pandas.Timestamp.ceil: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.ceil.html#pandas.Timestamp.ceil
.. _pandas.Timestamp.floor: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.floor.html