
Slice arrays

We saw how LaminDB allows you to query & search across artifacts & collections using registries: Query & search registries.

Let us now look at the following case:

# get a lookup for labels
ulabels = ln.ULabel.lookup()
# query a parquet file annotated with the "setosa" label
df = ln.Artifact.filter(ulabels=ulabels.setosa, suffix=".parquet").first().load()
# select all observations in the DataFrame matching "setosa"
df_setosa = df.loc[df.iris_organism_name == ulabels.setosa.name]

Because the artifact was validated, querying the DataFrame is guaranteed to succeed!

Such within-collection queries are also possible for cloud-backed collections using DuckDB, TileDB, zarr, HDF5, parquet, and other storage backends.
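
For instance, here is a minimal sketch of the same "setosa" filter expressed as SQL, assuming the duckdb package is installed (DuckDB resolves the local variable df by name):

import duckdb

# the same row filter as above, expressed as SQL over the loaded DataFrame
df_setosa_sql = duckdb.sql(
    "SELECT * FROM df WHERE iris_organism_name = 'setosa'"
).to_df()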

In this notebook, we show how to subset an AnnData object as well as generic HDF5, zarr, and parquet collections accessed in the cloud.

Let us create a remote instance for testing.

!lamin login testuser1
!lamin init --storage s3://lamindb-ci/test-arrays
 logged in with email [email protected] (uid: DzTjkKse)
! updating cloud SQLite 's3://lamindb-ci/test-arrays/.lamindb/lamin.db' of instance 'testuser1/test-arrays'
! locked instance (to unlock and push changes to the cloud SQLite file, call: lamin disconnect)
 initialized lamindb: testuser1/test-arrays

Import lamindb and track this notebook.

import lamindb as ln

ln.track("hsRyWJggf2Ca")
 connected lamindb: testuser1/test-arrays
 created Transform('hsRyWJggf2Ca0000'), started new Run('431HdMGQ...') at 2025-05-08 07:31:39 UTC
 notebook imports: lamindb==1.5.0

We’ll need some test data:

ln.Artifact("s3://lamindb-ci/test-arrays/pbmc68k.h5ad").save()
ln.Artifact("s3://lamindb-ci/test-arrays/testfile.hdf5").save()
Artifact(uid='QuFLeNKkYiZmUYX80000', is_latest=True, key='testfile.hdf5', suffix='.hdf5', size=1400, hash='UCWPjJkhzBjO97rtuo_8Yg', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-05-08 07:31:40 UTC)

Note that it is also possible to register Hugging Face paths. For this, the huggingface_hub package needs to be installed.
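
If it isn't available in your environment, you can install it from the notebook:

!pip install huggingface_hub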

We register a folder of parquet files as a single artifact.

ln.Artifact("hf://datasets/Koncopd/lamindb-test/sharded_parquet").save()
/opt/hostedtoolcache/Python/3.10.17/x64/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
! will manage storage location hf://datasets/Koncopd/lamindb-test with instance testuser1/test-arrays
 due to lack of write access, LaminDB won't manage storage location: hf://datasets/Koncopd/lamindb-test
 deleted storage record on hub e82908a3045a5fecadfe01b36107a2e4
Artifact(uid='PSShrTD9Ms0y4YA90000', is_latest=True, key='sharded_parquet', suffix='', size=42767, hash='oj6I3nNKj_eiX2I1q26qaw', n_files=11, space_id=1, storage_id=2, run_id=1, created_by_id=1, created_at=2025-05-08 07:31:42 UTC)

We also register a collection of individual parquet files.

artifact_shard1 = ln.Artifact(
    "hf://datasets/Koncopd/lamindb-test/sharded_parquet/louvain=0/947eee0b064440c9b9910ca2eb89e608-0.parquet"
).save()
artifact_shard2 = ln.Artifact(
    "hf://datasets/Koncopd/lamindb-test/sharded_parquet/louvain=1/947eee0b064440c9b9910ca2eb89e608-0.parquet"
).save()

ln.Collection(
    [artifact_shard1, artifact_shard2], key="sharded_parquet_collection"
).save()
Collection(uid='2JsYCQQ1MTslQVxe0000', is_latest=True, key='sharded_parquet_collection', hash='XavO_EEZSi-shT6uJGFHHA', space_id=1, created_by_id=1, run_id=1, created_at=2025-05-08 07:31:42 UTC)

AnnData

An h5ad artifact stored on s3:

artifact = ln.Artifact.get(key="pbmc68k.h5ad")
artifact.path
S3QueryPath('s3://lamindb-ci/test-arrays/pbmc68k.h5ad')
adata = artifact.open()

This is an AnnDataAccessor, an AnnData object backed in the cloud:

adata
AnnDataAccessor object with n_obs × n_vars = 70 × 765
  constructed for the AnnData object pbmc68k.h5ad
    obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
    obsm: ['X_pca', 'X_umap']
    obsp: ['connectivities', 'distances']
    uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
    var: ['highly_variable', 'index', 'n_counts']
    varm: ['PCs']

Without subsetting, the AnnDataAccessor object references underlying lazy h5 or zarr arrays:

adata.X
<HDF5 dataset "X": shape (70, 765), type "<f4">
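
Because this is a lazy h5py dataset, slicing it reads only the requested region from the cloud:

# read just the first five rows into memory as a numpy array
adata.X[:5]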

You can subset it like a normal AnnData object:

obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    adata.obs.percent_mito <= 0.05
)
adata_subset = adata[obs_idx]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 35 × 765
  obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
  obsm: ['X_pca', 'X_umap']
  obsp: ['connectivities', 'distances']
  uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
  var: ['highly_variable', 'index', 'n_counts']
  varm: ['PCs']

Subsets load arrays into memory upon direct access:

adata_subset.X
array([[-0.326, -0.191,  0.499, ..., -0.21 , -0.636, -0.49 ],
       [ 0.811, -0.191, -0.728, ..., -0.21 ,  0.604, -0.49 ],
       [-0.326, -0.191,  0.643, ..., -0.21 ,  2.303, -0.49 ],
       ...,
       [-0.326, -0.191, -0.728, ..., -0.21 ,  0.626, -0.49 ],
       [-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ],
       [-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ]],
      shape=(35, 765), dtype=float32)

To load the entire subset into memory as an actual AnnData object, use to_memory():

adata_subset.to_memory()
AnnData object with n_obs × n_vars = 35 × 765
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'
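
If you want to keep working with the subset, you can persist it as a new artifact; a sketch, where the key "pbmc68k_subset.h5ad" is hypothetical:

# save the in-memory subset as a new, tracked artifact
ln.Artifact.from_anndata(
    adata_subset.to_memory(), key="pbmc68k_subset.h5ad"
).save()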

Generic HDF5

Let us query a generic HDF5 artifact:

artifact = ln.Artifact.get(key="testfile.hdf5")

And get a backed accessor:

backed = artifact.open()

The returned object exposes the connection in .connection and the h5py.File or zarr.Group in .storage:

backed
BackedAccessor(connection=<File-like object S3FileSystem, lamindb-ci/test-arrays/testfile.hdf5>, storage=<HDF5 file "testfile.hdf5" (mode r)>)
backed.storage
<HDF5 file "testfile.hdf5" (mode r)>
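
Since .storage is a plain h5py.File, you can traverse it with the regular h5py API, for example:

# list the top-level groups and datasets in the file
list(backed.storage.keys())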

Parquet

A dataframe stored as a sharded parquet dataset:

artifact = ln.Artifact.get(key="sharded_parquet")
artifact.path.view_tree()
11 sub-directories & 11 files with suffixes '.parquet'
hf://datasets/Koncopd/lamindb-test/sharded_parquet
├── louvain=0/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=1/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=10/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=2/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=3/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=4/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=5/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=6/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=7/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=8/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
└── louvain=9/
    └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
backed = artifact.open()

This returns a pyarrow dataset:

backed
<pyarrow._dataset.FileSystemDataset at 0x7fb3df96bca0>
backed.head(5).to_pandas()
cell_type n_genes percent_mito
index
CGTTATACAGTACC-8 CD4+/CD45RO+ Memory 1034 0.010163
AGATATTGACCACA-1 CD4+/CD45RO+ Memory 1078 0.012831
GCAGGGCTGTATGC-8 CD8+/CD45RA+ Naive Cytotoxic 1055 0.012287
TTATGGCTGGCAAG-2 CD4+/CD25 T Reg 1236 0.023963
CACGACCTGGGAGT-7 CD4+/CD25 T Reg 1010 0.016620
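
Since this is a regular pyarrow dataset, column selection and row filters are pushed down to the parquet files; a minimal sketch (the 1200-gene threshold is arbitrary):

import pyarrow.compute as pc

# only the selected columns and matching rows are materialized
backed.to_table(
    columns=["cell_type", "n_genes"],
    filter=pc.field("n_genes") > 1200,
).to_pandas()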

It is also possible to open a collection of cloud artifacts.

collection = ln.Collection.get(key="sharded_parquet_collection")
backed = collection.open()
backed
<pyarrow._dataset.FileSystemDataset at 0x7fb3df8023e0>
backed.to_table().to_pandas()
cell_type n_genes percent_mito
index
CGTTATACAGTACC-8 CD4+/CD45RO+ Memory 1034 0.010163
AGATATTGACCACA-1 CD4+/CD45RO+ Memory 1078 0.012831
GCAGGGCTGTATGC-8 CD8+/CD45RA+ Naive Cytotoxic 1055 0.012287
TTATGGCTGGCAAG-2 CD4+/CD25 T Reg 1236 0.023963
CACGACCTGGGAGT-7 CD4+/CD25 T Reg 1010 0.016620
AATCTCACTCAGTG-3 CD4+/CD45RO+ Memory 1183 0.016056
CTAGTTTGGCTTAG-4 CD4+/CD45RO+ Memory 1002 0.018922
ACGCCGGAAGCCTA-6 CD8+/CD45RA+ Naive Cytotoxic 1292 0.018315
CTGACCACCATGGT-4 CD8+/CD45RA+ Naive Cytotoxic 1559 0.024427
AGTTAAACAAACAG-1 CD19+ B 1005 0.019806
CTACGCACAGGGTG-3 CD4+/CD45RO+ Memory 1053 0.012073
CAGACAACAAAACG-7 CD4+/CD25 T Reg 1109 0.012702
GAGGGTGACCTATT-1 CD4+/CD25 T Reg 1003 0.012971
TGACTGGAACCATG-7 Dendritic cells 1277 0.012961
ACGACCCTGTCTGA-3 Dendritic cells 1074 0.017466
GTTATGCTACCTCC-3 CD14+ Monocytes 1201 0.016839
GTGTCAGATCTACT-6 CD14+ Monocytes 1014 0.025417
AAGAACGAACTCTT-6 CD14+ Monocytes 1067 0.019530
TACTCTGACGTAGT-1 Dendritic cells 1118 0.012069
TAAGCTCTTCTGGA-4 CD14+ Monocytes 1059 0.021497

By default, Artifact.open() and Collection.open() use pyarrow to lazily open dataframes. polars can also be used by passing engine="polars". Note that .open(engine="polars") returns a context manager that yields a polars LazyFrame.

with collection.open(engine="polars") as lazy_df:
    display(lazy_df.collect().to_pandas())
/tmp/ipykernel_3453/1430633675.py:2: CategoricalRemappingWarning: Local categoricals have different encodings, expensive re-encoding is done to perform this merge operation. Consider using a StringCache or an Enum type if the categories are known in advance
  display(lazy_df.collect().to_pandas())
cell_type n_genes percent_mito index
0 CD4+/CD45RO+ Memory 1034 0.010163 CGTTATACAGTACC-8
1 CD4+/CD45RO+ Memory 1078 0.012831 AGATATTGACCACA-1
2 CD8+/CD45RA+ Naive Cytotoxic 1055 0.012287 GCAGGGCTGTATGC-8
3 CD4+/CD25 T Reg 1236 0.023963 TTATGGCTGGCAAG-2
4 CD4+/CD25 T Reg 1010 0.016620 CACGACCTGGGAGT-7
5 CD4+/CD45RO+ Memory 1183 0.016056 AATCTCACTCAGTG-3
6 CD4+/CD45RO+ Memory 1002 0.018922 CTAGTTTGGCTTAG-4
7 CD8+/CD45RA+ Naive Cytotoxic 1292 0.018315 ACGCCGGAAGCCTA-6
8 CD8+/CD45RA+ Naive Cytotoxic 1559 0.024427 CTGACCACCATGGT-4
9 CD19+ B 1005 0.019806 AGTTAAACAAACAG-1
10 CD4+/CD45RO+ Memory 1053 0.012073 CTACGCACAGGGTG-3
11 CD4+/CD25 T Reg 1109 0.012702 CAGACAACAAAACG-7
12 CD4+/CD25 T Reg 1003 0.012971 GAGGGTGACCTATT-1
13 Dendritic cells 1277 0.012961 TGACTGGAACCATG-7
14 Dendritic cells 1074 0.017466 ACGACCCTGTCTGA-3
15 CD14+ Monocytes 1201 0.016839 GTTATGCTACCTCC-3
16 CD14+ Monocytes 1014 0.025417 GTGTCAGATCTACT-6
17 CD14+ Monocytes 1067 0.019530 AAGAACGAACTCTT-6
18 Dendritic cells 1118 0.012069 TACTCTGACGTAGT-1
19 CD14+ Monocytes 1059 0.021497 TAAGCTCTTCTGGA-4
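
The advantage of the LazyFrame is that you can filter and select columns before materializing anything; a minimal sketch using the polars expression API:

import polars as pl

with collection.open(engine="polars") as lazy_df:
    # only matching rows are materialized by collect()
    monocytes = lazy_df.filter(pl.col("cell_type") == "CD14+ Monocytes").collect()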

Yet another way to open several parquet files as a single dataset is to call .open() directly on a query set.

backed = ln.Artifact.filter(suffix=".parquet").open()
! this query set is unordered, consider using `.order_by()` first to avoid opening the artifacts in an arbitrary order
backed
<pyarrow._dataset.FileSystemDataset at 0x7fb3df96ac20>
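
As the warning suggests, you can order the query set first so that the artifacts are opened deterministically, e.g. by creation time:

backed = ln.Artifact.filter(suffix=".parquet").order_by("created_at").open()
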
# clean up test instance
!lamin delete --force test-arrays
 deleting instance testuser1/test-arrays
 deleted storage record on hub 76e5f3b018085f52bcd5ca9b4d7e0ce5
 deleted instance record on hub 587a82023ecb5ea28b3a448cb8240f7f