Slice arrays¶
We saw how LaminDB allows you to query & search across artifacts & collections using registries: Query & search registries.
Let us now look at the following case:
# get a lookup object for ULabel records
ulabels = ln.ULabel.lookup()
# query a parquet artifact annotated with the label "setosa" and load it as a DataFrame
df = ln.Artifact.filter(ulabels=ulabels.setosa, suffix=".parquet").first().load()
# subset the DataFrame to all observations matching "setosa"
df_setosa = df.loc[df.iris_organism_name == ulabels.setosa.name]
Because the artifact was validated, querying the DataFrame is guaranteed to succeed!
Such within-collection queries are also possible for cloud-backed collections using DuckDB, TileDB, zarr, HDF5, parquet, and other storage backends.
For a use case with TileDB, see: CELLxGENE: scRNA-seq
For a use case with DuckDB, see: RxRx: cell imaging
In this notebook, we show how to subset AnnData, generic HDF5, zarr, and parquet datasets accessed in the cloud.
Let us create a remote instance for testing.
!lamin login testuser1
!lamin init --storage s3://lamindb-ci/test-arrays
Show code cell output
✓ logged in with email [email protected] (uid: DzTjkKse)
! updating cloud SQLite 's3://lamindb-ci/test-arrays/.lamindb/lamin.db' of instance 'testuser1/test-arrays'
! locked instance (to unlock and push changes to the cloud SQLite file, call: lamin disconnect)
→ initialized lamindb: testuser1/test-arrays
Import lamindb and track this notebook.
import lamindb as ln
ln.track("hsRyWJggf2Ca")
Show code cell output
→ connected lamindb: testuser1/test-arrays
→ created Transform('hsRyWJggf2Ca0000'), started new Run('431HdMGQ...') at 2025-05-08 07:31:39 UTC
→ notebook imports: lamindb==1.5.0
We’ll need some test data:
ln.Artifact("s3://lamindb-ci/test-arrays/pbmc68k.h5ad").save()
ln.Artifact("s3://lamindb-ci/test-arrays/testfile.hdf5").save()
Show code cell output
Artifact(uid='QuFLeNKkYiZmUYX80000', is_latest=True, key='testfile.hdf5', suffix='.hdf5', size=1400, hash='UCWPjJkhzBjO97rtuo_8Yg', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-05-08 07:31:40 UTC)
Note that it is also possible to register Hugging Face paths. For this, the huggingface_hub package needs to be installed.
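You can install it from a notebook cell, just like the lamin CLI calls above:
!pip install huggingface_hub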
We register a folder of parquet files as a single artifact.
ln.Artifact("hf://datasets/Koncopd/lamindb-test/sharded_parquet").save()
Show code cell output
/opt/hostedtoolcache/Python/3.10.17/x64/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
! will manage storage location hf://datasets/Koncopd/lamindb-test with instance testuser1/test-arrays
→ due to lack of write access, LaminDB won't manage storage location: hf://datasets/Koncopd/lamindb-test
→ deleted storage record on hub e82908a3045a5fecadfe01b36107a2e4
Artifact(uid='PSShrTD9Ms0y4YA90000', is_latest=True, key='sharded_parquet', suffix='', size=42767, hash='oj6I3nNKj_eiX2I1q26qaw', n_files=11, space_id=1, storage_id=2, run_id=1, created_by_id=1, created_at=2025-05-08 07:31:42 UTC)
We also register a collection of individual parquet files.
artifact_shard1 = ln.Artifact(
"hf://datasets/Koncopd/lamindb-test/sharded_parquet/louvain=0/947eee0b064440c9b9910ca2eb89e608-0.parquet"
).save()
artifact_shard2 = ln.Artifact(
"hf://datasets/Koncopd/lamindb-test/sharded_parquet/louvain=1/947eee0b064440c9b9910ca2eb89e608-0.parquet"
).save()
ln.Collection(
[artifact_shard1, artifact_shard2], key="sharded_parquet_collection"
).save()
Show code cell output
Collection(uid='2JsYCQQ1MTslQVxe0000', is_latest=True, key='sharded_parquet_collection', hash='XavO_EEZSi-shT6uJGFHHA', space_id=1, created_by_id=1, run_id=1, created_at=2025-05-08 07:31:42 UTC)
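To double-check what went into the collection, you can inspect its member artifacts. A minimal sketch, assuming the collection's artifacts relation behaves like a regular query set:
collection = ln.Collection.get(key="sharded_parquet_collection")
collection.artifacts.df()  # one row per member artifact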
AnnData¶
An h5ad artifact stored on S3:
artifact = ln.Artifact.get(key="pbmc68k.h5ad")
artifact.path
S3QueryPath('s3://lamindb-ci/test-arrays/pbmc68k.h5ad')
adata = artifact.open()
This object is an AnnDataAccessor object, an AnnData object backed in the cloud:
adata
Show code cell output
AnnDataAccessor object with n_obs × n_vars = 70 × 765
constructed for the AnnData object pbmc68k.h5ad
obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
obsm: ['X_pca', 'X_umap']
obsp: ['connectivities', 'distances']
uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
var: ['highly_variable', 'index', 'n_counts']
varm: ['PCs']
Without subsetting, the AnnDataAccessor object references underlying lazy h5 or zarr arrays:
adata.X
Show code cell output
<HDF5 dataset "X": shape (70, 765), type "<f4">
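The lazy dataset supports numpy-style slicing, which reads only the requested region over the network. A minimal sketch:
# read only the first 10 observations and 5 variables into memory
adata.X[:10, :5]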
You can subset it like a normal AnnData object:
obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
adata.obs.percent_mito <= 0.05
)
adata_subset = adata[obs_idx]
adata_subset
Show code cell output
AnnDataAccessorSubset object with n_obs × n_vars = 35 × 765
obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
obsm: ['X_pca', 'X_umap']
obsp: ['connectivities', 'distances']
uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
var: ['highly_variable', 'index', 'n_counts']
varm: ['PCs']
Subsets load arrays into memory upon direct access:
adata_subset.X
Show code cell output
array([[-0.326, -0.191, 0.499, ..., -0.21 , -0.636, -0.49 ],
[ 0.811, -0.191, -0.728, ..., -0.21 , 0.604, -0.49 ],
[-0.326, -0.191, 0.643, ..., -0.21 , 2.303, -0.49 ],
...,
[-0.326, -0.191, -0.728, ..., -0.21 , 0.626, -0.49 ],
[-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ],
[-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ]],
shape=(35, 765), dtype=float32)
To load the entire subset into memory as an actual AnnData object, use to_memory():
adata_subset.to_memory()
Show code cell output
AnnData object with n_obs × n_vars = 35 × 765
obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
var: 'n_counts', 'highly_variable'
uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
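If you want to persist such a subset, one option is to save the in-memory object back as a new artifact. A sketch; the key below is hypothetical:
# save the in-memory subset as a new, smaller artifact
subset_artifact = ln.Artifact.from_anndata(
    adata_subset.to_memory(), key="pbmc68k_subset.h5ad"
).save()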
Generic HDF5¶
Let us query a generic HDF5 artifact:
artifact = ln.Artifact.get(key="testfile.hdf5")
And get a backed accessor:
backed = artifact.open()
The returned object contains the .connection and the h5py.File or zarr.Group in .storage:
backed
BackedAccessor(connection=<File-like object S3FileSystem, lamindb-ci/test-arrays/testfile.hdf5>, storage=<HDF5 file "testfile.hdf5" (mode r)>)
backed.storage
<HDF5 file "testfile.hdf5" (mode r)>
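Because .storage is a plain h5py.File here, the standard h5py API applies. A minimal sketch; the dataset name below is hypothetical:
# list top-level groups & datasets without downloading the file
list(backed.storage.keys())
# read a slice of one dataset lazily, e.g.:
# backed.storage["some_dataset"][:10]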
Parquet¶
A dataframe stored as sharded parquet:
artifact = ln.Artifact.get(key="sharded_parquet")
artifact.path.view_tree()
Show code cell output
11 sub-directories & 11 files with suffixes '.parquet'
hf://datasets/Koncopd/lamindb-test/sharded_parquet
├── louvain=0/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=1/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=10/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=2/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=3/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=4/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=5/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=6/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=7/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=8/
│ └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
└── louvain=9/
└── 947eee0b064440c9b9910ca2eb89e608-0.parquet
backed = artifact.open()
This returns a pyarrow dataset.
backed
<pyarrow._dataset.FileSystemDataset at 0x7fb3df96bca0>
backed.head(5).to_pandas()
Show code cell output
| index | cell_type | n_genes | percent_mito |
|---|---|---|---|
| CGTTATACAGTACC-8 | CD4+/CD45RO+ Memory | 1034 | 0.010163 |
| AGATATTGACCACA-1 | CD4+/CD45RO+ Memory | 1078 | 0.012831 |
| GCAGGGCTGTATGC-8 | CD8+/CD45RA+ Naive Cytotoxic | 1055 | 0.012287 |
| TTATGGCTGGCAAG-2 | CD4+/CD25 T Reg | 1236 | 0.023963 |
| CACGACCTGGGAGT-7 | CD4+/CD25 T Reg | 1010 | 0.016620 |
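Since this is a standard pyarrow dataset, you can push column selections and row filters down so that only matching data is read from the cloud. A minimal sketch using pyarrow's expression API:
import pyarrow.dataset as ds

# read only two columns, restricted to cells with more than 1200 genes
backed.to_table(
    columns=["cell_type", "n_genes"],
    filter=ds.field("n_genes") > 1200,
).to_pandas()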
It is also possible to open a collection of cloud artifacts.
collection = ln.Collection.get(key="sharded_parquet_collection")
backed = collection.open()
backed
<pyarrow._dataset.FileSystemDataset at 0x7fb3df8023e0>
backed.to_table().to_pandas()
Show code cell output
| index | cell_type | n_genes | percent_mito |
|---|---|---|---|
| CGTTATACAGTACC-8 | CD4+/CD45RO+ Memory | 1034 | 0.010163 |
| AGATATTGACCACA-1 | CD4+/CD45RO+ Memory | 1078 | 0.012831 |
| GCAGGGCTGTATGC-8 | CD8+/CD45RA+ Naive Cytotoxic | 1055 | 0.012287 |
| TTATGGCTGGCAAG-2 | CD4+/CD25 T Reg | 1236 | 0.023963 |
| CACGACCTGGGAGT-7 | CD4+/CD25 T Reg | 1010 | 0.016620 |
| AATCTCACTCAGTG-3 | CD4+/CD45RO+ Memory | 1183 | 0.016056 |
| CTAGTTTGGCTTAG-4 | CD4+/CD45RO+ Memory | 1002 | 0.018922 |
| ACGCCGGAAGCCTA-6 | CD8+/CD45RA+ Naive Cytotoxic | 1292 | 0.018315 |
| CTGACCACCATGGT-4 | CD8+/CD45RA+ Naive Cytotoxic | 1559 | 0.024427 |
| AGTTAAACAAACAG-1 | CD19+ B | 1005 | 0.019806 |
| CTACGCACAGGGTG-3 | CD4+/CD45RO+ Memory | 1053 | 0.012073 |
| CAGACAACAAAACG-7 | CD4+/CD25 T Reg | 1109 | 0.012702 |
| GAGGGTGACCTATT-1 | CD4+/CD25 T Reg | 1003 | 0.012971 |
| TGACTGGAACCATG-7 | Dendritic cells | 1277 | 0.012961 |
| ACGACCCTGTCTGA-3 | Dendritic cells | 1074 | 0.017466 |
| GTTATGCTACCTCC-3 | CD14+ Monocytes | 1201 | 0.016839 |
| GTGTCAGATCTACT-6 | CD14+ Monocytes | 1014 | 0.025417 |
| AAGAACGAACTCTT-6 | CD14+ Monocytes | 1067 | 0.019530 |
| TACTCTGACGTAGT-1 | Dendritic cells | 1118 | 0.012069 |
| TAAGCTCTTCTGGA-4 | CD14+ Monocytes | 1059 | 0.021497 |
By default, Artifact.open() and Collection.open() use pyarrow to lazily open dataframes. polars can also be used by passing engine="polars". Note that .open(engine="polars") returns a context manager yielding a LazyFrame.
with collection.open(engine="polars") as lazy_df:
    display(lazy_df.collect().to_pandas())
Show code cell output
/tmp/ipykernel_3453/1430633675.py:2: CategoricalRemappingWarning: Local categoricals have different encodings, expensive re-encoding is done to perform this merge operation. Consider using a StringCache or an Enum type if the categories are known in advance
display(lazy_df.collect().to_pandas())
| | cell_type | n_genes | percent_mito | index |
|---|---|---|---|---|
| 0 | CD4+/CD45RO+ Memory | 1034 | 0.010163 | CGTTATACAGTACC-8 |
| 1 | CD4+/CD45RO+ Memory | 1078 | 0.012831 | AGATATTGACCACA-1 |
| 2 | CD8+/CD45RA+ Naive Cytotoxic | 1055 | 0.012287 | GCAGGGCTGTATGC-8 |
| 3 | CD4+/CD25 T Reg | 1236 | 0.023963 | TTATGGCTGGCAAG-2 |
| 4 | CD4+/CD25 T Reg | 1010 | 0.016620 | CACGACCTGGGAGT-7 |
| 5 | CD4+/CD45RO+ Memory | 1183 | 0.016056 | AATCTCACTCAGTG-3 |
| 6 | CD4+/CD45RO+ Memory | 1002 | 0.018922 | CTAGTTTGGCTTAG-4 |
| 7 | CD8+/CD45RA+ Naive Cytotoxic | 1292 | 0.018315 | ACGCCGGAAGCCTA-6 |
| 8 | CD8+/CD45RA+ Naive Cytotoxic | 1559 | 0.024427 | CTGACCACCATGGT-4 |
| 9 | CD19+ B | 1005 | 0.019806 | AGTTAAACAAACAG-1 |
| 10 | CD4+/CD45RO+ Memory | 1053 | 0.012073 | CTACGCACAGGGTG-3 |
| 11 | CD4+/CD25 T Reg | 1109 | 0.012702 | CAGACAACAAAACG-7 |
| 12 | CD4+/CD25 T Reg | 1003 | 0.012971 | GAGGGTGACCTATT-1 |
| 13 | Dendritic cells | 1277 | 0.012961 | TGACTGGAACCATG-7 |
| 14 | Dendritic cells | 1074 | 0.017466 | ACGACCCTGTCTGA-3 |
| 15 | CD14+ Monocytes | 1201 | 0.016839 | GTTATGCTACCTCC-3 |
| 16 | CD14+ Monocytes | 1014 | 0.025417 | GTGTCAGATCTACT-6 |
| 17 | CD14+ Monocytes | 1067 | 0.019530 | AAGAACGAACTCTT-6 |
| 18 | Dendritic cells | 1118 | 0.012069 | TACTCTGACGTAGT-1 |
| 19 | CD14+ Monocytes | 1059 | 0.021497 | TAAGCTCTTCTGGA-4 |
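A LazyFrame only executes its query plan when .collect() is called, so you can chain filters and projections before any data is materialized. A minimal sketch:
import polars as pl

with collection.open(engine="polars") as lazy_df:
    # the filter is planned lazily and only executed on collect()
    display(lazy_df.filter(pl.col("n_genes") > 1200).collect().to_pandas())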
Yet another way to open several parquet files as a single dataset is to call .open() directly on a query set.
backed = ln.Artifact.filter(suffix=".parquet").open()
! this query set is unordered, consider using `.order_by()` first to avoid opening the artifacts in an arbitrary order
backed
<pyarrow._dataset.FileSystemDataset at 0x7fb3df96ac20>
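To make the shard order deterministic, follow the hint in the warning and sort the query set before opening it:
backed = ln.Artifact.filter(suffix=".parquet").order_by("key").open()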
Show code cell content
# clean up test instance
!lamin delete --force test-arrays
• deleting instance testuser1/test-arrays
→ deleted storage record on hub 76e5f3b018085f52bcd5ca9b4d7e0ce5
→ deleted instance record on hub 587a82023ecb5ea28b3a448cb8240f7f