Slice arrays¶
We saw how LaminDB lets you query and search across artifacts and collections using registries: Query & search registries.
Let us now look at the following case:
# get a lookup for labels
ulabels = ln.ULabel.lookup()
# query a parquet file labeled "setosa"
df = ln.Artifact.filter(ulabels=ulabels.setosa, suffix=".parquet").first().load()
# query all observations (rows) in the DataFrame matching "setosa"
df_setosa = df.loc[df.iris_organism_name == ulabels.setosa.name]
Because the artifact was validated, querying the DataFrame is guaranteed to succeed!
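The row-filtering step can be tried without a LaminDB instance. The following is a minimal sketch on a hypothetical toy DataFrame (the values are made up for illustration and are not the contents of any artifact); it shows a boolean mask used as the row indexer of `.loc`.

```python
import pandas as pd

# Hypothetical toy data standing in for a loaded parquet artifact
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 4.9, 6.3],
        "iris_organism_name": ["setosa", "versicolor", "setosa", "virginica"],
    }
)

# Boolean mask with one entry per observation (row)
is_setosa = df["iris_organism_name"] == "setosa"

# Passing the mask as the row indexer keeps only the matching observations
df_setosa = df.loc[is_setosa]
print(df_setosa)
```

The mask has one entry per row, so it belongs in the row position of `.loc`, not the column position.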
Such within-collection queries are also possible for cloud-backed collections using DuckDB, TileDB, zarr, HDF5, parquet, and other storage backends.
For a use case with TileDB, see: CELLxGENE: scRNA-seq
For a use case with DuckDB, see: RxRx: cell imaging
In this notebook, we show how to subset AnnData, generic HDF5, and zarr collections accessed in the cloud.
Let us create a remote instance for testing.
!lamin login testuser1
!lamin init --storage s3://lamindb-ci/test-arrays
✓ logged in with email [email protected] (uid: DzTjkKse)
! updating cloud SQLite 's3://lamindb-ci/test-arrays/.lamindb/lamin.db' of instance 'testuser1/test-arrays'
! locked instance (to unlock and push changes to the cloud SQLite file, call: lamin disconnect)
→ initialized lamindb: testuser1/test-arrays
Import lamindb and track this notebook.
import lamindb as ln
ln.track("hsRyWJggf2Ca")
→ connected lamindb: testuser1/test-arrays
→ created Transform('hsRyWJggf2Ca0000'), started new Run('4iyN1x4z...') at 2025-08-06 17:30:22 UTC
→ notebook imports: lamindb==1.10.0
We’ll need some test data:
ln.Artifact("s3://lamindb-ci/test-arrays/pbmc68k.h5ad").save()
ln.Artifact("s3://lamindb-ci/test-arrays/testfile.hdf5").save()
Artifact(uid='coAUczLQO2uZinxy0000', is_latest=True, key='testfile.hdf5', suffix='.hdf5', size=1400, hash='UCWPjJkhzBjO97rtuo_8Yg', branch_id=1, space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-08-06 17:30:23 UTC)
AnnData¶
An h5ad artifact stored on s3:
artifact = ln.Artifact.get(key="pbmc68k.h5ad")
artifact.path
S3QueryPath('s3://lamindb-ci/test-arrays/pbmc68k.h5ad')
adata = artifact.open()
This object is an AnnDataAccessor object, an AnnData object backed in the cloud:
adata
AnnDataAccessor object with n_obs × n_vars = 70 × 765
constructed for the AnnData object pbmc68k.h5ad
obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
obsm: ['X_pca', 'X_umap']
obsp: ['connectivities', 'distances']
uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
var: ['highly_variable', 'index', 'n_counts']
varm: ['PCs']
Without subsetting, the AnnDataAccessor object references underlying lazy h5 or zarr arrays:
adata.X
<HDF5 dataset "X": shape (70, 765), type "<f4">
You can subset it like a normal AnnData object:
obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    adata.obs.percent_mito <= 0.05
)
adata_subset = adata[obs_idx]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 35 × 765
obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
obsm: ['X_pca', 'X_umap']
obsp: ['connectivities', 'distances']
uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
var: ['highly_variable', 'index', 'n_counts']
varm: ['PCs']
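The mask used for subsetting combines two conditions element-wise with `&`. As a rough illustration of that pattern, here is a small sketch with pandas and NumPy on hypothetical in-memory data (not the pbmc68k artifact and not the accessor itself); it shows how such a mask picks rows out of a matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical per-observation annotations, one row per cell
obs = pd.DataFrame(
    {
        "cell_type": ["Dendritic cells", "CD19+ B", "CD14+ Monocytes", "Dendritic cells"],
        "percent_mito": [0.01, 0.02, 0.08, 0.03],
    }
)
# Hypothetical expression matrix with matching row order (4 cells x 3 genes)
X = np.arange(12, dtype=np.float32).reshape(4, 3)

# Both conditions must hold; parentheses are needed because & binds tighter than <=
obs_idx = obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    obs.percent_mito <= 0.05
)

# Apply the boolean mask along the observation axis
X_subset = X[obs_idx.to_numpy()]
print(X_subset.shape)
```

Only the first and last cells satisfy both conditions here, so two rows remain.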
Subsets load arrays into memory upon direct access:
adata_subset.X
array([[-0.326, -0.191, 0.499, ..., -0.21 , -0.636, -0.49 ],
[ 0.811, -0.191, -0.728, ..., -0.21 , 0.604, -0.49 ],
[-0.326, -0.191, 0.643, ..., -0.21 , 2.303, -0.49 ],
...,
[-0.326, -0.191, -0.728, ..., -0.21 , 0.626, -0.49 ],
[-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ],
[-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ]],
shape=(35, 765), dtype=float32)
To load the entire subset into memory as an actual AnnData object, use to_memory():
adata_subset.to_memory()
AnnData object with n_obs × n_vars = 35 × 765
obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
var: 'n_counts', 'highly_variable'
uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
SpatialData¶
It is also possible to access AnnData objects inside SpatialData tables:
artifact = ln.Artifact.using("laminlabs/lamindata").get(
    key="visium_aligned_guide_min.zarr"
)
access = artifact.open()
→ transferred: Artifact(uid='bjH534dxVi1drmLZ0001'), Storage(uid='D9BilDV2')
access
SpatialDataAccessor object
constructed for the SpatialData object bjH534dxVi1drmLZ.zarr
with tables: ['table']
access.tables
Accessor for the SpatialData attribute tables
with keys: ['table']
This gives you the same AnnDataAccessor object as for a normal AnnData.
table = access.tables["table"]
table
AnnDataAccessor object with n_obs × n_vars = 37 × 18085
constructed for the AnnData object table
obs: ['_index', 'array_col', 'array_row', 'clone', 'dataset', 'in_tissue', 'region', 'spot_id']
obsm: ['spatial']
uns: ['spatial', 'spatialdata_attrs']
var: ['feature_types', 'gene_ids', 'genome', 'symbols']
You can subset it and read it into memory as an actual AnnData:
table_subset = table[table.obs["clone"] == "diploid"]
table_subset
AnnDataAccessorSubset object with n_obs × n_vars = 31 × 18085
obs: ['_index', 'array_col', 'array_row', 'clone', 'dataset', 'in_tissue', 'region', 'spot_id']
obsm: ['spatial']
uns: ['spatial', 'spatialdata_attrs']
var: ['feature_types', 'gene_ids', 'genome', 'symbols']
adata = table_subset.to_memory()
Generic HDF5¶
Let us query a generic HDF5 artifact:
artifact = ln.Artifact.get(key="testfile.hdf5")
And get a backed accessor:
backed = artifact.open()
The returned object holds the connection in .connection and the h5py.File or zarr.Group in .storage:
backed
BackedAccessor(connection=<File-like object S3FileSystem, lamindb-ci/test-arrays/testfile.hdf5>, storage=<HDF5 file "testfile.hdf5>" (mode r)>)
backed.storage
<HDF5 file "testfile.hdf5>" (mode r)>
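Since `.storage` holds an `h5py.File` for HDF5 artifacts, ordinary h5py slicing should apply to it. The contents of `testfile.hdf5` are not shown here, so the sketch below instead writes a hypothetical local file with an assumed dataset name `X` and reads back only a slice of it.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical local file; stands in for the cloud-backed artifact
path = os.path.join(tempfile.mkdtemp(), "example.hdf5")

# Write a small 5 x 4 dataset named "X" (the name is an assumption)
with h5py.File(path, "w") as f:
    f.create_dataset("X", data=np.arange(20, dtype="f4").reshape(5, 4))

with h5py.File(path, "r") as f:
    dset = f["X"]          # lazy handle; no array data is read yet
    shape = dset.shape     # metadata only
    block = dset[1:3, :2]  # reads just this slice into a NumPy array

print(shape)
print(block)
```

Slicing a dataset handle returns a NumPy array that remains usable after the file is closed.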
Parquet¶
A dataframe stored as sharded parquet.
Note that it is also possible to register and access Hugging Face paths. For this, the huggingface_hub package must be installed.
artifact = ln.Artifact.using("laminlabs/lamindata").get(key="sharded_parquet")
artifact.path.view_tree()
11 sub-directories & 11 files with suffixes '.parquet'
hf://datasets/Koncopd/lamindb-test/sharded_parquet
├── louvain=0/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=1/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=10/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=2/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=3/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=4/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=5/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=6/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=7/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=8/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
└── louvain=9/
└── 947eee0b064440c9b9910ca2eb89e608-0.parquet
backed = artifact.open()
→ transferred: Artifact(uid='78XWb8yD09SCgVfl0000'), Storage(uid='5EYyeftHljIs')
This returns a pyarrow dataset.
backed
<pyarrow._dataset.FileSystemDataset at 0x7f7e799e0340>
backed.head(5).to_pandas()
index | cell_type | n_genes | percent_mito |
---|---|---|---|
CGTTATACAGTACC-8 | CD4+/CD45RO+ Memory | 1034 | 0.010163 |
AGATATTGACCACA-1 | CD4+/CD45RO+ Memory | 1078 | 0.012831 |
GCAGGGCTGTATGC-8 | CD8+/CD45RA+ Naive Cytotoxic | 1055 | 0.012287 |
TTATGGCTGGCAAG-2 | CD4+/CD25 T Reg | 1236 | 0.023963 |
CACGACCTGGGAGT-7 | CD4+/CD25 T Reg | 1010 | 0.016620 |
It is also possible to open a collection of cloud artifacts.
collection = ln.Collection.using("laminlabs/lamindata").get(
    key="sharded_parquet_collection"
)
backed = collection.open()
→ transferred: Artifact(uid='yBp5v9RRptoIrIMQ0000')
→ transferred: Artifact(uid='fB33zDQDFb0i3Yxw0000')
→ transferred: Collection(uid='6aWTZ7J2ej1Rj22q0000')
backed
<pyarrow._dataset.FileSystemDataset at 0x7f7e7c94f2e0>
backed.to_table().to_pandas()
index | cell_type | n_genes | percent_mito |
---|---|---|---|
CGTTATACAGTACC-8 | CD4+/CD45RO+ Memory | 1034 | 0.010163 |
AGATATTGACCACA-1 | CD4+/CD45RO+ Memory | 1078 | 0.012831 |
GCAGGGCTGTATGC-8 | CD8+/CD45RA+ Naive Cytotoxic | 1055 | 0.012287 |
TTATGGCTGGCAAG-2 | CD4+/CD25 T Reg | 1236 | 0.023963 |
CACGACCTGGGAGT-7 | CD4+/CD25 T Reg | 1010 | 0.016620 |
AATCTCACTCAGTG-3 | CD4+/CD45RO+ Memory | 1183 | 0.016056 |
CTAGTTTGGCTTAG-4 | CD4+/CD45RO+ Memory | 1002 | 0.018922 |
ACGCCGGAAGCCTA-6 | CD8+/CD45RA+ Naive Cytotoxic | 1292 | 0.018315 |
CTGACCACCATGGT-4 | CD8+/CD45RA+ Naive Cytotoxic | 1559 | 0.024427 |
AGTTAAACAAACAG-1 | CD19+ B | 1005 | 0.019806 |
CTACGCACAGGGTG-3 | CD4+/CD45RO+ Memory | 1053 | 0.012073 |
CAGACAACAAAACG-7 | CD4+/CD25 T Reg | 1109 | 0.012702 |
GAGGGTGACCTATT-1 | CD4+/CD25 T Reg | 1003 | 0.012971 |
TGACTGGAACCATG-7 | Dendritic cells | 1277 | 0.012961 |
ACGACCCTGTCTGA-3 | Dendritic cells | 1074 | 0.017466 |
GTTATGCTACCTCC-3 | CD14+ Monocytes | 1201 | 0.016839 |
GTGTCAGATCTACT-6 | CD14+ Monocytes | 1014 | 0.025417 |
AAGAACGAACTCTT-6 | CD14+ Monocytes | 1067 | 0.019530 |
TACTCTGACGTAGT-1 | Dendritic cells | 1118 | 0.012069 |
TAAGCTCTTCTGGA-4 | CD14+ Monocytes | 1059 | 0.021497 |
By default, Artifact.open() and Collection.open() use pyarrow to lazily open dataframes. polars can also be used by passing engine="polars". Note that .open(engine="polars") returns a context manager that yields a LazyFrame.
with collection.open(engine="polars") as lazy_df:
    display(lazy_df.collect().to_pandas())
 | cell_type | n_genes | percent_mito | index |
---|---|---|---|---|
0 | CD4+/CD45RO+ Memory | 1034 | 0.010163 | CGTTATACAGTACC-8 |
1 | CD4+/CD45RO+ Memory | 1078 | 0.012831 | AGATATTGACCACA-1 |
2 | CD8+/CD45RA+ Naive Cytotoxic | 1055 | 0.012287 | GCAGGGCTGTATGC-8 |
3 | CD4+/CD25 T Reg | 1236 | 0.023963 | TTATGGCTGGCAAG-2 |
4 | CD4+/CD25 T Reg | 1010 | 0.016620 | CACGACCTGGGAGT-7 |
5 | CD4+/CD45RO+ Memory | 1183 | 0.016056 | AATCTCACTCAGTG-3 |
6 | CD4+/CD45RO+ Memory | 1002 | 0.018922 | CTAGTTTGGCTTAG-4 |
7 | CD8+/CD45RA+ Naive Cytotoxic | 1292 | 0.018315 | ACGCCGGAAGCCTA-6 |
8 | CD8+/CD45RA+ Naive Cytotoxic | 1559 | 0.024427 | CTGACCACCATGGT-4 |
9 | CD19+ B | 1005 | 0.019806 | AGTTAAACAAACAG-1 |
10 | CD4+/CD45RO+ Memory | 1053 | 0.012073 | CTACGCACAGGGTG-3 |
11 | CD4+/CD25 T Reg | 1109 | 0.012702 | CAGACAACAAAACG-7 |
12 | CD4+/CD25 T Reg | 1003 | 0.012971 | GAGGGTGACCTATT-1 |
13 | Dendritic cells | 1277 | 0.012961 | TGACTGGAACCATG-7 |
14 | Dendritic cells | 1074 | 0.017466 | ACGACCCTGTCTGA-3 |
15 | CD14+ Monocytes | 1201 | 0.016839 | GTTATGCTACCTCC-3 |
16 | CD14+ Monocytes | 1014 | 0.025417 | GTGTCAGATCTACT-6 |
17 | CD14+ Monocytes | 1067 | 0.019530 | AAGAACGAACTCTT-6 |
18 | Dendritic cells | 1118 | 0.012069 | TACTCTGACGTAGT-1 |
19 | CD14+ Monocytes | 1059 | 0.021497 | TAAGCTCTTCTGGA-4 |
Yet another way to open several parquet files as a single dataset is to call .open() directly on a query set.
backed = ln.Artifact.filter(suffix=".parquet").open()
! this query set is unordered, consider using `.order_by()` first to avoid opening the artifacts in an arbitrary order
backed
<pyarrow._dataset.FileSystemDataset at 0x7f7e70fc1c00>
# clean up test instance
ln.setup.delete("test-arrays", force=True)
→ deleted storage record on hub 76e5f3b018085f52bcd5ca9b4d7e0ce5 | s3://lamindb-ci/test-arrays
→ deleted instance record on hub 587a82023ecb5ea28b3a448cb8240f7f