Slice arrays

We saw how LaminDB lets you query and search across artifacts and collections using registries (see Query & search registries).

Let us now look at the following case:

# get a lookup for labels
ulabels = ln.ULabel.lookup()
# query a parquet file matching "setosa"
df = ln.Artifact.filter(ulabels=ulabels.setosa, suffix=".parquet").first().load()
# query all observations in the DataFrame matching "setosa"
df_setosa = df.loc[df.iris_organism_name == ulabels.setosa.name]

Because the artifact was validated, querying the DataFrame is guaranteed to succeed!
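Selecting observations by label amounts to a row-wise boolean mask. The sketch below illustrates this on a small hand-made DataFrame, so it runs without a LaminDB instance. The column name `iris_organism_name` is taken from the example above; the values and the `sepal_length` column are assumptions made up for illustration.

```python
import pandas as pd

# Toy stand-in for the loaded artifact; the values are illustrative only.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 4.9, 6.3],
        "iris_organism_name": ["setosa", "versicolor", "setosa", "virginica"],
    }
)

# A boolean mask passed as the first indexer of .loc selects rows (observations).
mask = df["iris_organism_name"] == "setosa"
df_setosa = df.loc[mask]

print(df_setosa)
```

Passing the mask as the second indexer (`df.loc[:, mask]`) would instead try to select columns, which is not what a per-observation filter needs.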

Such within-collection queries are also possible for cloud-backed collections using DuckDB, TileDB, zarr, HDF5, parquet, and other storage backends.

In this notebook, we show how to subset an AnnData object as well as generic HDF5 and zarr collections accessed in the cloud.

!lamin login testuser1
!lamin init --storage s3://lamindb-ci/test-array-notebook --name test-array-notebook
 logged in with email testuser1@lamin.ai (uid: DzTjkKse)
! updating local SQLite & locking cloud SQLite (sync back & unlock: lamin disconnect)
import lamindb as ln
 connected lamindb: testuser1/test-array-notebook

We’ll need some test data:

Artifact(uid='KvhmzxGl2cTrNJhL0000', is_latest=True, key='lndb-storage/testfile.hdf5', suffix='.hdf5', size=1400, hash='UCWPjJkhzBjO97rtuo_8Yg', space_id=1, storage_id=2, created_by_id=1, created_at=2025-03-16 21:18:41 UTC)

Note that it is also possible to register Hugging Face paths. This requires the huggingface_hub package to be installed.

We register a folder of parquet files as a single artifact.

Artifact(uid='h8WUljvDSEglVvbd0000', is_latest=True, key='sharded_parquet', suffix='', size=42767, hash='oj6I3nNKj_eiX2I1q26qaw', n_files=11, space_id=1, storage_id=3, created_by_id=1, created_at=2025-03-16 21:18:43 UTC)

We also register a collection of individual parquet files.

artifact_shard1 = ln.Artifact(
artifact_shard2 = ln.Artifact(

    [artifact_shard1, artifact_shard2], key="sharded_parquet_collection"
Collection(uid='H7Pjw3hCO7Sxe6gL0000', is_latest=True, key='sharded_parquet_collection', hash='XavO_EEZSi-shT6uJGFHHA', space_id=1, created_by_id=1, created_at=2025-03-16 21:18:43 UTC)


An h5ad artifact stored on s3:

artifact = ln.Artifact.get(key="lndb-storage/pbmc68k.h5ad")
adata = artifact.open()
This object is an AnnDataAccessor object, an AnnData object backed in the cloud:

AnnDataAccessor object with n_obs × n_vars = 70 × 765
  constructed for the AnnData object pbmc68k.h5ad
    obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
    obsm: ['X_pca', 'X_umap']
    obsp: ['connectivities', 'distances']
    uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
    var: ['highly_variable', 'index', 'n_counts']
    varm: ['PCs']

Without subsetting, the AnnDataAccessor object references underlying lazy h5 or zarr arrays:

<HDF5 dataset "X": shape (70, 765), type "<f4">

You can subset it like a normal AnnData object:

obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    adata.obs.percent_mito <= 0.05
)
adata_subset = adata[obs_idx]
AnnDataAccessorSubset object with n_obs × n_vars = 35 × 765
  obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
  obsm: ['X_pca', 'X_umap']
  obsp: ['connectivities', 'distances']
  uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
  var: ['highly_variable', 'index', 'n_counts']
  varm: ['PCs']
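The mask used for subsetting is an ordinary boolean pandas Series, so its logic can be checked locally. The following is a minimal sketch with a hypothetical `obs` table and matrix `X` (neither is the pbmc68k data); it only illustrates how a membership test and a numeric threshold combine into one row selector.

```python
import numpy as np
import pandas as pd

# Hypothetical annotations and matrix with one row per observation.
obs = pd.DataFrame(
    {
        "cell_type": ["Dendritic cells", "CD19+ B", "CD14+ Monocytes", "Dendritic cells"],
        "percent_mito": [0.01, 0.02, 0.07, 0.03],
    },
    index=["c0", "c1", "c2", "c3"],
)
X = np.arange(12, dtype=np.float32).reshape(4, 3)

# Membership test AND numeric threshold, combined element-wise.
obs_idx = obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    obs.percent_mito <= 0.05
)

# The same mask selects matching rows in the annotations and in the matrix.
obs_subset = obs[obs_idx]
X_subset = X[obs_idx.to_numpy()]

print(obs_subset.index.tolist(), X_subset.shape)
```

The parentheses around the comparison matter: `&` binds more tightly than `<=` in Python, so omitting them changes the meaning of the expression.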

Subsets load arrays into memory upon direct access:

array([[-0.326, -0.191,  0.499, ..., -0.21 , -0.636, -0.49 ],
       [ 0.811, -0.191, -0.728, ..., -0.21 ,  0.604, -0.49 ],
       [-0.326, -0.191,  0.643, ..., -0.21 ,  2.303, -0.49 ],
       [-0.326, -0.191, -0.728, ..., -0.21 ,  0.626, -0.49 ],
       [-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ],
       [-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ]],
      shape=(35, 765), dtype=float32)

To load the entire subset into memory as an actual AnnData object, use to_memory():

AnnData object with n_obs × n_vars = 35 × 765
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

Generic HDF5

Let us query a generic HDF5 artifact:

artifact = ln.Artifact.get(key="lndb-storage/testfile.hdf5")

And get a backed accessor:

backed = artifact.open()
The returned object holds the file connection in .connection and the h5py.File or zarr.Group in .storage:

BackedAccessor(connection=<File-like object S3FileSystem, lamindb-ci/lndb-storage/testfile.hdf5>, storage=<HDF5 file "testfile.hdf5>" (mode r)>)
<HDF5 file "testfile.hdf5>" (mode r)>
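Assuming `.storage` holds an `h5py.File`, its datasets can be sliced lazily like in any other h5py file. The sketch below shows the general h5py pattern on an in-memory file rather than on the cloud artifact; the file name `example.h5` and the dataset name `X` are made up for illustration.

```python
import h5py
import numpy as np

# In-memory HDF5 file (core driver, no backing store), so nothing is written to disk.
with h5py.File("example.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("X", data=np.arange(20, dtype="f4").reshape(5, 4))

    dset = f["X"]          # lazy handle: no data has been read yet
    shape = dset.shape     # metadata is available without reading the array
    block = dset[1:3, :2]  # reads only the requested slice into a NumPy array

print(shape)
print(block)
```

Only the indexed region is read, which is what makes slicing attractive for large arrays stored remotely.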


A dataframe stored as sharded parquet.

artifact = ln.Artifact.get(key="sharded_parquet")
11 sub-directories & 11 files with suffixes '.parquet'
├── louvain=0/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=1/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=10/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=2/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=3/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=4/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=5/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=6/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=7/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=8/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
└── louvain=9/
    └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
backed = artifact.open()
This returns a pyarrow dataset.

<pyarrow._dataset.FileSystemDataset at 0x7fed90c056c0>
cell_type n_genes percent_mito
CGTTATACAGTACC-8 CD4+/CD45RO+ Memory 1034 0.010163
AGATATTGACCACA-1 CD4+/CD45RO+ Memory 1078 0.012831
GCAGGGCTGTATGC-8 CD8+/CD45RA+ Naive Cytotoxic 1055 0.012287
TTATGGCTGGCAAG-2 CD4+/CD25 T Reg 1236 0.023963
CACGACCTGGGAGT-7 CD4+/CD25 T Reg 1010 0.016620

It is also possible to open a collection of cloud artifacts.

collection = ln.Collection.get(key="sharded_parquet_collection")
backed = collection.open()
<pyarrow._dataset.FileSystemDataset at 0x7feda7f89de0>
cell_type n_genes percent_mito
CGTTATACAGTACC-8 CD4+/CD45RO+ Memory 1034 0.010163
AGATATTGACCACA-1 CD4+/CD45RO+ Memory 1078 0.012831
GCAGGGCTGTATGC-8 CD8+/CD45RA+ Naive Cytotoxic 1055 0.012287
TTATGGCTGGCAAG-2 CD4+/CD25 T Reg 1236 0.023963
CACGACCTGGGAGT-7 CD4+/CD25 T Reg 1010 0.016620
AATCTCACTCAGTG-3 CD4+/CD45RO+ Memory 1183 0.016056
CTAGTTTGGCTTAG-4 CD4+/CD45RO+ Memory 1002 0.018922
ACGCCGGAAGCCTA-6 CD8+/CD45RA+ Naive Cytotoxic 1292 0.018315
CTGACCACCATGGT-4 CD8+/CD45RA+ Naive Cytotoxic 1559 0.024427
AGTTAAACAAACAG-1 CD19+ B 1005 0.019806
CTACGCACAGGGTG-3 CD4+/CD45RO+ Memory 1053 0.012073
CAGACAACAAAACG-7 CD4+/CD25 T Reg 1109 0.012702
GAGGGTGACCTATT-1 CD4+/CD25 T Reg 1003 0.012971
TGACTGGAACCATG-7 Dendritic cells 1277 0.012961
ACGACCCTGTCTGA-3 Dendritic cells 1074 0.017466
GTTATGCTACCTCC-3 CD14+ Monocytes 1201 0.016839
GTGTCAGATCTACT-6 CD14+ Monocytes 1014 0.025417
AAGAACGAACTCTT-6 CD14+ Monocytes 1067 0.019530
TACTCTGACGTAGT-1 Dendritic cells 1118 0.012069
TAAGCTCTTCTGGA-4 CD14+ Monocytes 1059 0.021497
# clean up test instance
