Specification
Rationale
Storing data distributed over multiple files in an object store (S3, ABS, GCS, etc.) allows for a fast, cost-efficient, and highly scalable data infrastructure. A downside of simply storing data in an object store is that the stores themselves give few to no guarantees beyond the consistency of a single file. In particular, they cannot guarantee the consistency of your dataset. If we demand a consistent state of our dataset at all times, we need to track the state of the dataset ourselves. Explicit state tracking, though, can be more than just a nuisance if done correctly.
Goals
Consistent dataset state at all times
Dataset state is only modified by atomic commits
Strongly typed table schemas
Query planning with O(1) calls to the remote store
Inverted indices for fast query planning
Read access without any locking mechanism
Portable across frameworks and languages
Seamless integration with OSS community software (Apache Arrow, Apache Parquet, pandas, etc.)
Lifecycle management (garbage collection, retention, etc.)
No external service required for state tracking
Dataset state tracking
The dataset state is tracked, along with additional metadata, in a single file, which allows for the implementation of atomic commits using copy-on-write principles. The dataset metadata is stored either as a plain JSON file or as a zstd compressed msgpack file.
The dataset state is fully defined by:
List of all physical partitions
Schema specification for each table (_common_metadata)
Secondary (inverted) indices
Minimal Example:
<UUID>.by-dataset-metadata.{json|msgpack.zstd}
<UUID>/indices/<INDEX_COLUMN_NAME>/<ISOTIMESTAMP>.by-dataset-index.parquet
<UUID>/<TABLE_NAME>/_common_metadata
<UUID>/<TABLE_NAME>/{PARTITION_KEY=PARTITION_VALUE}/<PARTITION_UUID>.<FORMAT_SUFFIX>
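For illustration only (not part of the specification), both supported encodings of the metadata document can be produced in Python, assuming the third-party msgpack and zstandard packages:

import json

import msgpack      # third-party: pip install msgpack
import zstandard    # third-party: pip install zstandard

metadata = {
    "dataset_metadata_version": 4,
    "dataset_uuid": "0d6de3c6-b7a4-11e6-8ed1-08002753cf7b",
}

# <UUID>.by-dataset-metadata.json: a plain JSON document
json_payload = json.dumps(metadata).encode("utf-8")

# <UUID>.by-dataset-metadata.msgpack.zstd: zstd-compressed msgpack
msgpack_payload = zstandard.ZstdCompressor().compress(msgpack.packb(metadata))

# An atomic commit writes a complete new metadata file and swaps it in,
# following the copy-on-write principle described above.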
Filenames
Multiple datasets may reside in a single storage location, but a given dataset always resides in exactly one storage location. The storage location may also contain files that are not part of any dataset. To identify a dataset, the main metadata file must follow the naming specification; all other files belonging to the dataset must be referenced by this file.
A storage location can be one of:
An object store bucket incl. all of its keys (S3, ABS, GCS, etc.)
A directory on a filesystem incl. all files in subdirectories
Warning
Files cannot be shared between datasets; a dataset expects that all files mentioned in its metadata belong exclusively to it. Deleting all files referenced by the metadata therefore deletes the dataset completely. No other dataset will be impacted by this deletion.
General filename rules
The files (described below in detail) all consist of several forward-slash separated components. All components must consist only of the following characters:
Uppercase and lowercase English letters (a-z, A-Z)
Digits 0 to 9
Characters plus, minus and underscore (+ - _)
Main metadata file
The filename of the main metadata file consists of the following components that are separated by a single dot each:
The UUID of the dataset. This should be a string that has not yet been used to prefix any other file in the storage location. One may use the Python function gen_uuid() (UUID type 4) to generate such a UUID, but the only requirement is that no other file in the target location has the same prefix; a minimal sketch follows this list.
The identifier by-dataset-metadata. The character sequence by-dataset-metadata. must not be used in any file other than main metadata files for datasets.
The suffix json or msgpack.zstd to describe the file format (msgpack.zstd denotes a zstd compressed msgpack file in a plateau dataset).
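A minimal sketch of generating such a filename with the Python standard library (any type-4 UUID works; the referenced gen_uuid() helper is merely a convenience):

from uuid import uuid4

dataset_uuid = str(uuid4())  # type-4 UUID, used as a unique prefix in the storage location
metadata_file = f"{dataset_uuid}.by-dataset-metadata.json"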
Example:
0d6de3c6-b7a4-11e6-8ed1-08002753cf7b.by-dataset-metadata.json
Note
All files that form the dataset must also start with the prefix <UUID>.
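Because of this prefix rule, a single prefix scan enumerates every file of a dataset. A minimal sketch, assuming a filesystem storage location (object stores offer the same operation as a key prefix listing):

from pathlib import Path

storage = Path("/data/storage-location")  # hypothetical storage location
dataset_uuid = "0d6de3c6-b7a4-11e6-8ed1-08002753cf7b"

# All files that share the dataset UUID prefix belong to the dataset.
dataset_files = [
    path
    for path in storage.rglob("*")
    if path.is_file() and str(path.relative_to(storage)).startswith(dataset_uuid)
]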
Table schema
The table schema information consists of three components:
The UUID of the dataset. This should be a string that has not yet been used to prefix any other file in the storage location. One may use gen_uuid() (UUID type 4) to generate such a UUID.
The table identifier.
The string _common_metadata.
Example:
0d6de3c6-b7a4-11e6-8ed1-08002753cf7b/core/_common_metadata
The data stored in _common_metadata is supposed to be an empty Parquet file that fully specifies the schema of the table. For more details, see Table type system.
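A minimal sketch of writing such a schema file with pyarrow (the column names and types are hypothetical; Schema.empty_table() requires a reasonably recent pyarrow):

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical table schema; the file carries the schema only and no rows.
schema = pa.schema([
    ("product_id", pa.int64()),
    ("location", pa.string()),
])
pq.write_table(
    schema.empty_table(),
    "0d6de3c6-b7a4-11e6-8ed1-08002753cf7b/core/_common_metadata",
)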
Data files of partitions
These files must consist of the following forward-slash separated components:
The UUID of the dataset. This should be a string that has not yet been used to prefix any other file in the storage location. One may use gen_uuid() (UUID type 4) to generate such a UUID.
The table identifier.
(optional) The partition content encoding.
The partition identifier.
The suffix describing the file format, e.g. parquet, csv, h5, etc. For available serialization formats, see DataFrame Serialization.
Example:
0d6de3c6-b7a4-11e6-8ed1-08002753cf7b/core/partition_key=partition_value/part_1.parquet
Note
Partition content encoding
Just as Dask, Apache Spark, and Apache Hive do, it is possible to encode the content of a particular column in the filename, which allows the construction of an index based on that column. Both the column name and the value are URL-encoded, and the column type is stored in the table schema information. The payload data file itself should no longer include this column; instead, any reading client is expected to reconstruct it type-safely upon loading.
For example, the path
0d6de3c6-b7a4-11e6-8ed1-08002753cf7b/location=123/product=3454/*.parquet
indicates that data with (location == 123 AND product == 3454) is stored in this directory.
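A minimal sketch of the encoding and decoding steps with the Python standard library (the column name and value are hypothetical):

from urllib.parse import quote, unquote

column, value = "location", "123"

# Writing: URL-encode both the column name and the value so that
# characters such as '/' or '=' cannot break the path structure.
component = f"{quote(column, safe='')}={quote(value, safe='')}"  # 'location=123'

# Reading: split, decode, and cast the value back to the column type
# recorded in the table schema information (_common_metadata).
name, _, raw = component.partition("=")
name, raw = unquote(name), unquote(raw)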
Index files
These files must consist of the following components, separated by forward slashes and dots as shown in the example below:
The UUID of the dataset. This should be a string that has not yet been used to prefix any other file in the storage location. One may use gen_uuid() (UUID type 4) to generate such a UUID.
The hard-coded identifier indices.
The name of the field used in the index.
A URL-encoded ISO 8601 timestamp (format YYYY-MM-DDTHH:MM:SS.ffffff), as illustrated below.
The suffix parquet to describe the file format.
Example:
0d6de3c6-b7a4-11e6-8ed1-08002753cf7b/indices/<FIELD_NAME>/<ISOTIMESTAMP>.by-dataset-index.parquet
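A minimal sketch of constructing such an index file key (the field name is hypothetical):

from datetime import datetime
from urllib.parse import quote

dataset_uuid = "0d6de3c6-b7a4-11e6-8ed1-08002753cf7b"
field_name = "product_id"  # hypothetical index column

# ISO 8601 timestamp, URL-encoded so that ':' does not appear in the key
timestamp = quote(datetime.utcnow().isoformat(), safe="")
index_file = f"{dataset_uuid}/indices/{field_name}/{timestamp}.by-dataset-index.parquet"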
Attributes
This section describes the attributes that should be present in the main metadata JSON file. For each attribute, we specify its key and the expected type. The type is mandatory; conversions (e.g. from INT where a STRING is expected) are not performed. The usage of these attributes can be seen in the example below.
dataset_metadata_version (INT) = 4
: The version of the metadata; it needs to be increased on every specification change.
dataset_uuid (STRING)
: Unique identifier of the dataset. This needs to be the same as used in the filename.
metadata (MAP<STRING, STRING>)
: Arbitrary metadata that can be used to annotate a dataset. This may be empty or omitted.
partitions (MAP<STRING, ...>)
: Labeled set of partitions. The key is the partition identifier as used in the file name and in indices.
files (MAP<STRING, STRING>)
: Labeled files contained in a partition. Each filename must end with a known file extension, e.g. .parquet. All partitions shall have the same set of keys. A single file must be part of exactly one dataset.
indices (MAP<STRING, STRING>)
: (Secondary) indices are optional, so this mapping can be empty or omitted completely. Indices provide support for finding the matching partitions for a row selection. In the first iteration, an index can be used to find the set of matching files for a row selection with a constraint on a single column value (e.g. product_id = 12345). For a row selection with multiple constraints, one shall query all 1-column indices and use the intersection of all returned partition sets. The key of the map is the field on which the row selection constraint is defined; this may also be a field that is not contained in the actual data, in the case that the field has the same value for all rows in a partition. The value of the indices map is the name of the Parquet file storing the index. For a description of the indices, see Indexing.
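As a sketch of the query planning described above, intersecting the partition sets returned by several 1-column indices (the index contents are hypothetical):

# Each index maps a column value to the set of partition labels containing it.
indices = {
    "product_id": {12345: {"part_1", "part_2"}},
    "location": {123: {"part_2", "part_3"}},
}

def candidate_partitions(predicates):
    """Return the partitions matching all equality predicates (column -> value)."""
    sets = [indices[column].get(value, set()) for column, value in predicates.items()]
    return set.intersection(*sets) if sets else set()

print(candidate_partitions({"product_id": 12345, "location": 123}))  # {'part_2'}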
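Putting the attributes together, a purely illustrative metadata document (the UUID, partition labels, and index entry are examples constructed from the definitions above, not prescribed values) could look as follows:

import json

metadata = {
    "dataset_metadata_version": 4,
    "dataset_uuid": "0d6de3c6-b7a4-11e6-8ed1-08002753cf7b",
    "metadata": {"creation_time": "2016-11-21T10:00:00"},
    "partitions": {
        "part_1": {
            "files": {
                "core": "0d6de3c6-b7a4-11e6-8ed1-08002753cf7b/core/part_1.parquet",
            },
        },
    },
    "indices": {
        "product_id": "0d6de3c6-b7a4-11e6-8ed1-08002753cf7b/indices/product_id/2016-11-21T10%3A00%3A00.000000.by-dataset-index.parquet",
    },
}
print(json.dumps(metadata, indent=2))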