Examples
--------

Set up a store

.. ipython:: python

    from tempfile import TemporaryDirectory

    # You can, of course, also directly use S3, ABS or anything else
    # supported by :mod:`minimalkv`
    dataset_dir = TemporaryDirectory()
    store_url = f"hfs://{dataset_dir.name}"

.. ipython:: python
    :okwarning:

    import pandas as pd
    from plateau.api.dataset import read_table, store_dataframes_as_dataset

    df = pd.DataFrame({"Name": ["Paul", "Lisa"], "Age": [32, 29]})

    dataset_uuid = "my_list_of_friends"

    metadata = {
        "Name": "My list of friends",
        "Columns": {
            "Name": "First name of my friend",
            "Age": "honest age of my friend in years",
        },
    }

    store_dataframes_as_dataset(
        store=store_url, dataset_uuid=dataset_uuid, dfs=[df], metadata=metadata
    )

    # Load your data
    # By default the single dataframe is stored in the 'core' table
    df_from_store = read_table(store=store_url, dataset_uuid=dataset_uuid)
    df_from_store

Eager
`````

Write
~~~~~

.. ipython:: python
    :okwarning:

    import pandas as pd
    from plateau.api.dataset import store_dataframes_as_dataset

    # Now, define the actual partitions. Most of the time this list will be
    # the intermediate result of a previously executed pipeline which, e.g.,
    # pulls data from an external data source.
    # In our particular case, we'll use manual input and define our partitions
    # explicitly. We'll define two partitions, each a single DataFrame.
    input_list_of_partitions = [
        pd.DataFrame({"A": range(10)}),
        pd.DataFrame({"A": range(10, 20)}),
    ]

    # The pipeline will return a :class:`~plateau.core.dataset.DatasetMetadata`
    # object which refers to the created dataset
    dataset = store_dataframes_as_dataset(
        dfs=input_list_of_partitions,
        store=store_url,
        dataset_uuid="MyFirstDataset",
        metadata={"dataset": "metadata"},  # This is optional dataset metadata
        metadata_version=4,
    )
    dataset

Read
~~~~

.. ipython:: python

    import pandas as pd
    from plateau.api.dataset import read_dataset_as_dataframes

    # Create the pipeline with a minimal set of configs
    list_of_partitions = read_dataset_as_dataframes(
        dataset_uuid="MyFirstDataset", store=store_url
    )

    # In case you were using the dataset created in the Write example
    for d1, d2 in zip(
        list_of_partitions,
        [
            pd.DataFrame({"A": range(10)}),
            pd.DataFrame({"A": range(10, 20)}),
        ],
    ):
        # Iterating over a DataFrame yields its column names
        for k1, k2 in zip(d1, d2):
            assert k1 == k2

Iter
````

Write
~~~~~

.. ipython:: python
    :okwarning:

    import pandas as pd
    from plateau.api.dataset import store_dataframes_as_dataset__iter

    input_list_of_partitions = [
        pd.DataFrame({"A": range(10)}),
        pd.DataFrame({"A": range(10, 20)}),
    ]

    # The pipeline will return a :class:`~plateau.core.dataset.DatasetMetadata`
    # object which refers to the created dataset
    dataset = store_dataframes_as_dataset__iter(
        input_list_of_partitions,
        store=store_url,
        dataset_uuid="MyFirstDatasetIter",
        metadata={"dataset": "metadata"},  # This is optional dataset metadata
        metadata_version=4,
    )
    dataset

Read
~~~~

.. ipython:: python
    :okwarning:

    import pandas as pd
    from plateau.api.dataset import read_dataset_as_dataframes__iterator

    # Create the pipeline with a minimal set of configs
    list_of_partitions = read_dataset_as_dataframes__iterator(
        dataset_uuid="MyFirstDatasetIter", store=store_url
    )
    # The iter backend returns a generator object. In our case we want to
    # look at all partitions at once
    list_of_partitions = list(list_of_partitions)

    # In case you were using the dataset created in the Write example
    for d1, d2 in zip(
        list_of_partitions,
        [
            pd.DataFrame({"A": range(10)}),
            pd.DataFrame({"A": range(10, 20)}),
        ],
    ):
        # Iterating over a DataFrame yields its column names
        for k1, k2 in zip(d1, d2):
            assert k1 == k2

Dask
````

Write
~~~~~

.. ipython:: python
    :okwarning:

    import pandas as pd
    from plateau.api.dataset import store_delayed_as_dataset

    input_list_of_partitions = [
        pd.DataFrame({"A": range(10)}),
        pd.DataFrame({"A": range(10, 20)}),
    ]

    # This will return a :class:`~dask.delayed`. The figure below
    # shows the generated task graph.
    task = store_delayed_as_dataset(
        input_list_of_partitions,
        store=store_url,
        dataset_uuid="MyFirstDatasetDask",
        metadata={"dataset": "metadata"},  # This is optional dataset metadata
        metadata_version=4,
    )
    task.compute()

.. figure:: ./taskgraph.jpeg
    :scale: 40%
    :figclass: align-center

    Task graph for the above dataset store pipeline.

Read
~~~~

.. ipython:: python

    import dask
    import pandas as pd
    from plateau.api.dataset import read_dataset_as_delayed

    tasks = read_dataset_as_delayed(dataset_uuid="MyFirstDatasetDask", store=store_url)
    tasks
    dask.compute(tasks)
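
The Iter read example above notes that the iter backend returns a generator, which has to be consumed (for instance with ``list``) before all partitions are available at once. The following is a minimal sketch of that difference in plain Python, without plateau. The helpers ``read_partitions_eager`` and ``read_partitions_iter`` are hypothetical names made up for this illustration, and plain lists of integers stand in for the DataFrames.

```python
from typing import Iterator, List

# Hypothetical stand-ins for the two partitions used in the examples above:
# plain ranges instead of pandas DataFrames.
PARTITION_RANGES = [range(10), range(10, 20)]


def read_partitions_eager() -> List[List[int]]:
    """Materialize every partition up front (illustrative, not a plateau API)."""
    return [list(r) for r in PARTITION_RANGES]


def read_partitions_iter() -> Iterator[List[int]]:
    """Yield one partition at a time (illustrative, not a plateau API)."""
    for r in PARTITION_RANGES:
        yield list(r)


eager_result = read_partitions_eager()
lazy_result = read_partitions_iter()

# Nothing is produced by the generator until it is consumed.
first_partition = next(lazy_result)

# list() drains whatever the generator has not yielded yet.
remaining_partitions = list(lazy_result)

print(len(eager_result))
print(first_partition[:3])
print(len(remaining_partitions))
```

The eager variant holds all partitions in memory at the same time, while the generator variant only ever holds the partition currently being processed. Once a generator has been drained it yields nothing further.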