
Slice arrays

We saw how LaminDB allows you to query & search across artifacts & collections using registries: Query & search registries.

Let us now look at the following case:

# get a lookup for labels
ulabels = ln.ULabel.lookup()
# query a parquet file annotated with the "setosa" label
df = ln.Artifact.filter(ulabels=ulabels.setosa, suffix=".parquet").first().load()
# select all observations in the DataFrame matching "setosa"
df_setosa = df.loc[df.iris_organism_name == ulabels.setosa.name]

Because the artifact was validated, querying the DataFrame is guaranteed to succeed!

Such within-collection queries are also possible for cloud-backed collections using DuckDB, TileDB, zarr, HDF5, parquet, and other storage backends.
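
For instance, here is a minimal sketch of the same "setosa" filter expressed as SQL, assuming the duckdb package is installed (DuckDB resolves the local variable df by name):

import duckdb

# the same row filter as above, expressed as SQL over the loaded DataFrame
df_setosa_sql = duckdb.sql(
    "SELECT * FROM df WHERE iris_organism_name = 'setosa'"
).to_df()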

In this notebook, we show how to subset an AnnData object as well as generic HDF5, zarr, and parquet collections accessed in the cloud.

Let us create a remote instance for testing.

!lamin login testuser1
!lamin init --storage s3://lamindb-ci/test-arrays
 logged in with email [email protected] (uid: DzTjkKse)
! updating cloud SQLite 's3://lamindb-ci/test-arrays/.lamindb/lamin.db' of instance 'testuser1/test-arrays'
! locked instance (to unlock and push changes to the cloud SQLite file, call: lamin disconnect)
 initialized lamindb: testuser1/test-arrays

Import lamindb and track this notebook.

import lamindb as ln

ln.track("hsRyWJggf2Ca")
 connected lamindb: testuser1/test-arrays
 created Transform('hsRyWJggf2Ca0000'), started new Run('431HdMGQ...') at 2025-05-08 07:31:39 UTC
 notebook imports: lamindb==1.5.0

We’ll need some test data:

ln.Artifact("s3://lamindb-ci/test-arrays/pbmc68k.h5ad").save()
ln.Artifact("s3://lamindb-ci/test-arrays/testfile.hdf5").save()
Artifact(uid='QuFLeNKkYiZmUYX80000', is_latest=True, key='testfile.hdf5', suffix='.hdf5', size=1400, hash='UCWPjJkhzBjO97rtuo_8Yg', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-05-08 07:31:40 UTC)

Note that it is also possible to register Hugging Face paths. For this, the huggingface_hub package needs to be installed.
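
If it isn't available in your environment, you can install it from the notebook:

!pip install huggingface_hub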

We register a folder of parquet files as a single artifact.

ln.Artifact("hf://datasets/Koncopd/lamindb-test/sharded_parquet").save()
/opt/hostedtoolcache/Python/3.10.17/x64/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
! will manage storage location hf://datasets/Koncopd/lamindb-test with instance testuser1/test-arrays
 due to lack of write access, LaminDB won't manage storage location: hf://datasets/Koncopd/lamindb-test
 deleted storage record on hub e82908a3045a5fecadfe01b36107a2e4
Artifact(uid='PSShrTD9Ms0y4YA90000', is_latest=True, key='sharded_parquet', suffix='', size=42767, hash='oj6I3nNKj_eiX2I1q26qaw', n_files=11, space_id=1, storage_id=2, run_id=1, created_by_id=1, created_at=2025-05-08 07:31:42 UTC)

We also register a collection of individual parquet files.

artifact_shard1 = ln.Artifact(
    "hf://datasets/Koncopd/lamindb-test/sharded_parquet/louvain=0/947eee0b064440c9b9910ca2eb89e608-0.parquet"
).save()
artifact_shard2 = ln.Artifact(
    "hf://datasets/Koncopd/lamindb-test/sharded_parquet/louvain=1/947eee0b064440c9b9910ca2eb89e608-0.parquet"
).save()

ln.Collection(
    [artifact_shard1, artifact_shard2], key="sharded_parquet_collection"
).save()
Collection(uid='2JsYCQQ1MTslQVxe0000', is_latest=True, key='sharded_parquet_collection', hash='XavO_EEZSi-shT6uJGFHHA', space_id=1, created_by_id=1, run_id=1, created_at=2025-05-08 07:31:42 UTC)

AnnData

An h5ad artifact stored on s3:

artifact = ln.Artifact.get(key="pbmc68k.h5ad")
artifact.path
S3QueryPath('s3://lamindb-ci/test-arrays/pbmc68k.h5ad')
adata = artifact.open()

This is an AnnDataAccessor, an AnnData object backed in the cloud:

adata
AnnDataAccessor object with n_obs × n_vars = 70 × 765
  constructed for the AnnData object pbmc68k.h5ad
    obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
    obsm: ['X_pca', 'X_umap']
    obsp: ['connectivities', 'distances']
    uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
    var: ['highly_variable', 'index', 'n_counts']
    varm: ['PCs']

Without subsetting, the AnnDataAccessor object references underlying lazy h5 or zarr arrays:

adata.X
<HDF5 dataset "X": shape (70, 765), type "<f4">
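
Because this is a lazy h5py dataset, slicing it reads only the requested region from the cloud:

# read just the first five rows into memory as a numpy array
adata.X[:5]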

You can subset it like a normal AnnData object:

obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    adata.obs.percent_mito <= 0.05
)
adata_subset = adata[obs_idx]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 35 × 765
  obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
  obsm: ['X_pca', 'X_umap']
  obsp: ['connectivities', 'distances']
  uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
  var: ['highly_variable', 'index', 'n_counts']
  varm: ['PCs']

Subsets load arrays into memory upon direct access:

adata_subset.X
array([[-0.326, -0.191,  0.499, ..., -0.21 , -0.636, -0.49 ],
       [ 0.811, -0.191, -0.728, ..., -0.21 ,  0.604, -0.49 ],
       [-0.326, -0.191,  0.643, ..., -0.21 ,  2.303, -0.49 ],
       ...,
       [-0.326, -0.191, -0.728, ..., -0.21 ,  0.626, -0.49 ],
       [-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ],
       [-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ]],
      shape=(35, 765), dtype=float32)

To load the entire subset into memory as an actual AnnData object, use to_memory():

adata_subset.to_memory()
AnnData object with n_obs × n_vars = 35 × 765
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'
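
If you want to keep working with the subset, you can persist it as a new artifact; a sketch, where the key "pbmc68k_subset.h5ad" is hypothetical:

# save the in-memory subset as a new, tracked artifact
ln.Artifact.from_anndata(
    adata_subset.to_memory(), key="pbmc68k_subset.h5ad"
).save()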

Generic HDF5

Let us query a generic HDF5 artifact:

artifact = ln.Artifact.get(key="testfile.hdf5")

And get a backed accessor:

backed = artifact.open()

The returned object exposes the connection in .connection and the h5py.File or zarr.Group in .storage:

backed
BackedAccessor(connection=<File-like object S3FileSystem, lamindb-ci/test-arrays/testfile.hdf5>, storage=<HDF5 file "testfile.hdf5" (mode r)>)
backed.storage
<HDF5 file "testfile.hdf5" (mode r)>
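
Since .storage is a plain h5py.File, you can traverse it with the regular h5py API, for example:

# list the top-level groups and datasets in the file
list(backed.storage.keys())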

Parquet

A dataframe stored as a sharded parquet dataset:

artifact = ln.Artifact.get(key="sharded_parquet")
artifact.path.view_tree()
11 sub-directories & 11 files with suffixes '.parquet'
hf://datasets/Koncopd/lamindb-test/sharded_parquet
├── louvain=0/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=1/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=10/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=2/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=3/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=4/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=5/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=6/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=7/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=8/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
└── louvain=9/
    └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
backed = artifact.open()

This returns a pyarrow dataset:

backed
<pyarrow._dataset.FileSystemDataset at 0x7fb3df96bca0>
backed.head(5).to_pandas()
cell_type n_genes percent_mito
index
CGTTATACAGTACC-8 CD4+/CD45RO+ Memory 1034 0.010163
AGATATTGACCACA-1 CD4+/CD45RO+ Memory 1078 0.012831
GCAGGGCTGTATGC-8 CD8+/CD45RA+ Naive Cytotoxic 1055 0.012287
TTATGGCTGGCAAG-2 CD4+/CD25 T Reg 1236 0.023963
CACGACCTGGGAGT-7 CD4+/CD25 T Reg 1010 0.016620
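
Since this is a regular pyarrow dataset, column selection and row filters are pushed down to the parquet files; a minimal sketch (the 1200-gene threshold is arbitrary):

import pyarrow.compute as pc

# only the selected columns and matching rows are materialized
backed.to_table(
    columns=["cell_type", "n_genes"],
    filter=pc.field("n_genes") > 1200,
).to_pandas()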

It is also possible to open a collection of cloud artifacts.

collection = ln.Collection.get(key="sharded_parquet_collection")
backed = collection.open()
backed
<pyarrow._dataset.FileSystemDataset at 0x7fb3df8023e0>
backed.to_table().to_pandas()
cell_type n_genes percent_mito
index
CGTTATACAGTACC-8 CD4+/CD45RO+ Memory 1034 0.010163
AGATATTGACCACA-1 CD4+/CD45RO+ Memory 1078 0.012831
GCAGGGCTGTATGC-8 CD8+/CD45RA+ Naive Cytotoxic 1055 0.012287
TTATGGCTGGCAAG-2 CD4+/CD25 T Reg 1236 0.023963
CACGACCTGGGAGT-7 CD4+/CD25 T Reg 1010 0.016620
AATCTCACTCAGTG-3 CD4+/CD45RO+ Memory 1183 0.016056
CTAGTTTGGCTTAG-4 CD4+/CD45RO+ Memory 1002 0.018922
ACGCCGGAAGCCTA-6 CD8+/CD45RA+ Naive Cytotoxic 1292 0.018315
CTGACCACCATGGT-4 CD8+/CD45RA+ Naive Cytotoxic 1559 0.024427
AGTTAAACAAACAG-1 CD19+ B 1005 0.019806
CTACGCACAGGGTG-3 CD4+/CD45RO+ Memory 1053 0.012073
CAGACAACAAAACG-7 CD4+/CD25 T Reg 1109 0.012702
GAGGGTGACCTATT-1 CD4+/CD25 T Reg 1003 0.012971
TGACTGGAACCATG-7 Dendritic cells 1277 0.012961
ACGACCCTGTCTGA-3 Dendritic cells 1074 0.017466
GTTATGCTACCTCC-3 CD14+ Monocytes 1201 0.016839
GTGTCAGATCTACT-6 CD14+ Monocytes 1014 0.025417
AAGAACGAACTCTT-6 CD14+ Monocytes 1067 0.019530
TACTCTGACGTAGT-1 Dendritic cells 1118 0.012069
TAAGCTCTTCTGGA-4 CD14+ Monocytes 1059 0.021497

By default, Artifact.open() and Collection.open() use pyarrow to lazily open dataframes. polars can also be used by passing engine="polars". Note that .open(engine="polars") returns a context manager that yields a polars LazyFrame.

with collection.open(engine="polars") as lazy_df:
    display(lazy_df.collect().to_pandas())
/tmp/ipykernel_3453/1430633675.py:2: CategoricalRemappingWarning: Local categoricals have different encodings, expensive re-encoding is done to perform this merge operation. Consider using a StringCache or an Enum type if the categories are known in advance
  display(lazy_df.collect().to_pandas())
cell_type n_genes percent_mito index
0 CD4+/CD45RO+ Memory 1034 0.010163 CGTTATACAGTACC-8
1 CD4+/CD45RO+ Memory 1078 0.012831 AGATATTGACCACA-1
2 CD8+/CD45RA+ Naive Cytotoxic 1055 0.012287 GCAGGGCTGTATGC-8
3 CD4+/CD25 T Reg 1236 0.023963 TTATGGCTGGCAAG-2
4 CD4+/CD25 T Reg 1010 0.016620 CACGACCTGGGAGT-7
5 CD4+/CD45RO+ Memory 1183 0.016056 AATCTCACTCAGTG-3
6 CD4+/CD45RO+ Memory 1002 0.018922 CTAGTTTGGCTTAG-4
7 CD8+/CD45RA+ Naive Cytotoxic 1292 0.018315 ACGCCGGAAGCCTA-6
8 CD8+/CD45RA+ Naive Cytotoxic 1559 0.024427 CTGACCACCATGGT-4
9 CD19+ B 1005 0.019806 AGTTAAACAAACAG-1
10 CD4+/CD45RO+ Memory 1053 0.012073 CTACGCACAGGGTG-3
11 CD4+/CD25 T Reg 1109 0.012702 CAGACAACAAAACG-7
12 CD4+/CD25 T Reg 1003 0.012971 GAGGGTGACCTATT-1
13 Dendritic cells 1277 0.012961 TGACTGGAACCATG-7
14 Dendritic cells 1074 0.017466 ACGACCCTGTCTGA-3
15 CD14+ Monocytes 1201 0.016839 GTTATGCTACCTCC-3
16 CD14+ Monocytes 1014 0.025417 GTGTCAGATCTACT-6
17 CD14+ Monocytes 1067 0.019530 AAGAACGAACTCTT-6
18 Dendritic cells 1118 0.012069 TACTCTGACGTAGT-1
19 CD14+ Monocytes 1059 0.021497 TAAGCTCTTCTGGA-4
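
The advantage of the LazyFrame is that you can filter and select columns before materializing anything; a minimal sketch using the polars expression API:

import polars as pl

with collection.open(engine="polars") as lazy_df:
    # only matching rows are materialized by collect()
    monocytes = lazy_df.filter(pl.col("cell_type") == "CD14+ Monocytes").collect()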

Yet another way to open several parquet files as a single dataset is to call .open() directly on a query set.

backed = ln.Artifact.filter(suffix=".parquet").open()
! this query set is unordered, consider using `.order_by()` first to avoid opening the artifacts in an arbitrary order
backed
<pyarrow._dataset.FileSystemDataset at 0x7fb3df96ac20>
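
As the warning suggests, you can order the query set first so that the artifacts are opened deterministically, e.g. by creation time:

backed = ln.Artifact.filter(suffix=".parquet").order_by("created_at").open()
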
# clean up test instance
!lamin delete --force test-arrays
 deleting instance testuser1/test-arrays
 deleted storage record on hub 76e5f3b018085f52bcd5ca9b4d7e0ce5
 deleted instance record on hub 587a82023ecb5ea28b3a448cb8240f7f