##### Stream datasets from storage

This guide walks through streaming datasets from disk or cloud
storage.

 # replace with your username and S3 bucket
 !lamin login testuser1
 !lamin init --storage s3://lamindb-ci/test-arrays

Import lamindb and track this notebook.

 import lamindb as ln
 import numpy as np

 ln.track()

#### DataFrame

A dataframe stored as a set of sharded "parquet" files.

 artifact = ln.Artifact.connect("laminlabs/lamindata").get(key="sharded_parquet")

 artifact.path.view_tree()

 backed = artifact.open()

This returns a pyarrow dataset.

 backed

 backed.head(5).to_pandas()

It is also possible to open a collection of cloud artifacts.

 collection = ln.Collection.connect("laminlabs/lamindata").get(
     key="sharded_parquet_collection"
 )

 backed = collection.open()

 backed

 backed.to_table().to_pandas()

By default, "Artifact.open()" and "Collection.open()" use "pyarrow" to
lazily open dataframes. "polars" can also be used by passing
"engine="polars"". Note that ".open(engine="polars")" returns a
context manager that yields a "LazyFrame".

 with collection.open(engine="polars", use_fsspec=True) as lazy_df:
     display(lazy_df.collect().to_pandas())

Another way to open several parquet files as a single dataset is to
call ".open()" directly on a query set.

 backed = ln.Artifact.filter(suffix=".parquet").open()

 backed

#### AnnData

We'll need some test data:

 ln.Artifact("s3://lamindb-ci/test-arrays/pbmc68k.h5ad").save()
 ln.Artifact("s3://lamindb-ci/test-arrays/testfile.hdf5").save()

An "h5ad" artifact stored on s3:

 artifact = ln.Artifact.get(key="pbmc68k.h5ad")

 artifact.path

 access = artifact.open()

The returned object is an "AnnDataAccessor", an "AnnData"-like object
backed by cloud storage:

 access

Without subsetting, the "AnnDataAccessor" object references underlying
lazy "h5" or "zarr" arrays:

 access.X

You can subset it like a normal "AnnData" object:

 obs_idx = access.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
     access.obs.percent_mito <= 0.05
 )
 access_subset = access[obs_idx]
 access_subset
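
The index used above is an ordinary boolean mask built from two conditions on the ".obs" columns. To make the logic concrete, here is a small sketch with a plain pandas DataFrame standing in for ".obs"; the barcodes and values are invented for illustration.

```python
import pandas as pd

# A stand-in for `.obs`, with hypothetical barcodes and values.
obs = pd.DataFrame(
    {
        "cell_type": [
            "Dendritic cells",
            "CD14+ Monocytes",
            "CD4+ T cells",
            "Dendritic cells",
        ],
        "percent_mito": [0.02, 0.08, 0.01, 0.05],
    },
    index=["cell_0", "cell_1", "cell_2", "cell_3"],
)

# Both conditions must hold: the cell type is in the list AND percent_mito <= 0.05.
mask = obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    obs.percent_mito <= 0.05
)

selected = obs[mask]
print(selected)
```

The parentheses around the second condition matter because "&" binds more tightly than "<=" in Python.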

Subsets load arrays into memory upon direct access:

 access_subset.X

To load the entire subset into memory as an actual "AnnData" object,
use "to_memory()":

 adata_subset = access_subset.to_memory()

 adata_subset

It is also possible to add columns to ".obs" and ".var" of cloud
AnnData objects without downloading them.

Create a new "AnnData" "zarr" artifact.

 adata_subset.write_zarr("adata_subset.zarr")

 artifact = ln.Artifact(
     "adata_subset.zarr", description="test add column to adata"
 ).save()

 artifact

 with artifact.open(mode="r+") as access:
     access.add_column(where="obs", col_name="ones", col=np.ones(access.shape[0]))
     display(access)

The version of the artifact is updated after the modification.

 artifact

 artifact.delete(permanent=True)

#### SpatialData

It is also possible to access "AnnData" objects inside "SpatialData"
"tables":

 artifact = ln.Artifact.connect("laminlabs/lamindata").get(
     key="visium_aligned_guide_min.zarr"
 )

 access = artifact.open()

 access

 access.tables

This gives you the same "AnnDataAccessor" object as for a normal
"AnnData".

 table = access.tables["table"]

 table

You can subset it and read into memory as an actual "AnnData":

 table_subset = table[table.obs["clone"] == "diploid"]

 table_subset

 adata = table_subset.to_memory()

#### Generic HDF5

Let us query a generic HDF5 artifact:

 artifact = ln.Artifact.get(key="testfile.hdf5")

And get a backed accessor:

 backed = artifact.open()

The returned object exposes the underlying connection in ".connection"
and an "h5py.File" or "zarr.Group" in ".storage":

 backed

 backed.storage
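
When ".storage" is an "h5py.File", datasets inside it can be sliced so that only the requested region is read. The following sketch shows this with plain "h5py" on a temporary local file; the file name, dataset name, and contents are made up for illustration and do not reflect the structure of "testfile.hdf5".

```python
import os
import tempfile

import h5py
import numpy as np

# Create a small HDF5 file with one 4x5 dataset (illustrative content).
file_path = os.path.join(tempfile.mkdtemp(), "example.hdf5")
with h5py.File(file_path, "w") as f:
    f.create_dataset("measurements", data=np.arange(20).reshape(4, 5))

# Reopen read-only: the dataset handle exposes metadata without loading values.
with h5py.File(file_path, "r") as f:
    dataset = f["measurements"]
    dataset_shape = dataset.shape
    # Slicing reads only the selected region into a NumPy array.
    first_rows = dataset[:2, :3]

print(dataset_shape)
print(first_rows)
```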

 # clean up test instance
 ln.setup.delete("test-arrays", force=True)