scRNA-seq
Here, you’ll learn how to manage a growing number of scRNA-seq datasets as a single queryable collection:
- create a dataset (an Artifact) and seed a Collection
- concatenate the collection to a single tiledbsoma array store
If you’re only interested in using a large curated scRNA-seq collection, see the CELLxGENE guide.
# pip install 'lamindb[jupyter,bionty]'
!lamin init --storage ./test-scrna --modules bionty
→ initialized lamindb: testuser1/test-scrna
import lamindb as ln
import bionty as bt
ln.track()
→ connected lamindb: testuser1/test-scrna
→ created Transform('wtHAFzM33lw70000'), started new Run('ZfsLXQWp...') at 2025-04-15 16:33:29 UTC
→ notebook imports: bionty==1.3.0 lamindb==1.4.0
Populate metadata registries based on an artifact
Let us look at the standardized data of Conde et al., Science (2022), available from CELLxGENE. anndata_human_immune_cells() loads a subsampled version:
adata = ln.core.datasets.anndata_human_immune_cells()
adata
AnnData object with n_obs × n_vars = 1648 × 36503
obs: 'donor', 'tissue', 'cell_type', 'assay'
var: 'feature_is_filtered', 'feature_reference', 'feature_biotype'
uns: 'default_embedding'
obsm: 'X_umap'
Let’s curate this artifact:
curator = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        adata.obs.donor.name: ln.ULabel.name,
        adata.obs.tissue.name: bt.Tissue.name,
        adata.obs.cell_type.name: bt.CellType.name,
        adata.obs.assay.name: bt.ExperimentalFactor.name,
    },
    organism="human",
)
! organism is ignored, define it on the dtype level
! 4 terms are not validated: 'donor', 'tissue', 'cell_type', 'assay'
→ fix typos, remove non-existent values, or save terms via .add_new_from("columns")
# this runs a while, because this instance is still empty
curator.validate()
! 12 terms are not validated: 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', 'A31', '582C'
→ fix typos, remove non-existent values, or save terms via .add_new_from("donor")
! 1 term is not validated: 'animal cell'
→ fix typos, remove non-existent values, or save terms via .add_new_from("cell_type")
! 220 terms are not validated: 'ENSG00000230699', 'ENSG00000241180', 'ENSG00000226849', 'ENSG00000272482', 'ENSG00000264443', 'ENSG00000242396', 'ENSG00000237352', 'ENSG00000269933', 'ENSG00000286863', 'ENSG00000285808', 'ENSG00000261737', 'ENSG00000230427', 'ENSG00000226822', 'ENSG00000273373', 'ENSG00000259834', 'ENSG00000224167', 'ENSG00000256374', 'ENSG00000234283', 'ENSG00000263464', 'ENSG00000203812', ...
→ fix typos, remove non-existent values, or save terms via .add_new_from("var_index")
False
curator.add_new_from("var_index")
curator.add_new_from("donor")
curator.add_new_from("cell_type")
curator.validate()
True
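The validate, add, re-validate loop above can be pictured with a small stand-alone sketch. It is not lamindb's implementation: `find_unvalidated` and the in-memory `registry` set are hypothetical stand-ins for the curator and a registry table, and the donor labels are a few of the values reported in the validation output above.

```python
def find_unvalidated(values, registry):
    """Return the distinct values missing from the registry, in first-seen order."""
    seen = set()
    missing = []
    for value in values:
        if value not in registry and value not in seen:
            seen.add(value)
            missing.append(value)
    return missing


# Hypothetical stand-in for an empty label registry (assumption, not lamindb's API).
registry = set()

# A few donor labels from the dataset; repeats mimic per-cell annotation.
donors = ["D496", "621B", "A29", "D496", "621B"]

# First pass: nothing is registered yet, so every distinct label is reported.
missing_before = find_unvalidated(donors, registry)
print(missing_before)

# Analogue of add_new_from("donor"): register the reported labels.
registry.update(missing_before)

# Second pass: all labels are now known, so nothing is reported.
missing_after = find_unvalidated(donors, registry)
print(missing_after == [])
```

The point of the sketch is only that validation compares observed values against what a registry already holds, which is why the first call on an empty instance reports many terms and the second call passes.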
When we create an Artifact object from an AnnData, we automatically curate it with validated features and labels:
artifact = curator.save_artifact(key="datasets/conde22.h5ad")
It is annotated with rich metadata:
artifact.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'yoNTdfO3SvoxJy110000'
│   ├── .key = 'datasets/conde22.h5ad'
│   ├── .size = 57612943
│   ├── .hash = 't_YJQpYrAyAGhs7Ir68zKj'
│   ├── .n_observations = 1648
│   ├── .path = /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/yoNTdfO3SvoxJy110000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2025-04-15 16:34:12
│   └── .transform = 'scRNA-seq'
├── Dataset features
│   ├── var • 36503 [bionty.Gene]
│   │   MIR1302-2HG float
│   │   FAM138A float
│   │   OR4F5 float
│   │   OR4F29 float
│   │   OR4F16 float
│   │   LINC01409 float
│   │   FAM87B float
│   │   LINC01128 float
│   │   LINC00115 float
│   │   FAM41C float
│   └── obs • 4 [Feature]
│       assay cat[bionty.ExperimentalF… 10x 3' v3, 10x 5' v1, 10x 5' v2
│       cell_type cat[bionty.CellType] CD16-negative, CD56-bright natural kille…
│       donor cat[ULabel] 582C, 621B, 637C, 640C, A29, A31, A35, A…
│       tissue cat[bionty.Tissue] blood, bone marrow, caecum, duodenum, il…
└── Labels
    └── .tissues bionty.Tissue blood, thoracic lymph node, spleen, lung…
        .cell_types bionty.CellType classical monocyte, T follicular helper …
        .experimental_factors bionty.ExperimentalFactor 10x 3' v3, 10x 5' v2, 10x 5' v1
        .ulabels ULabel D496, 621B, A29, A36, A35, 637C, A52, A3…
Seed a collection
Let’s create a first version of a collection that will encompass many h5ad files when more data is ingested.
Note
To see the result of the incremental growth, take a look at the CELLxGENE Census guide for an instance with ~1k h5ads and ~50 million cells.
collection = ln.Collection(artifact, key="My versioned scRNA-seq collection").save()
For this version 1 of the collection, collection and artifact match each other. But they’re independently tracked and queryable through their registries:
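As a conceptual aid, the relationship between a collection version and its artifacts can be sketched with plain Python. This is a toy model under stated assumptions, not how lamindb stores collections: `new_collection` and `append_artifact` are hypothetical helpers, and the second uid is a placeholder rather than a real artifact.

```python
def new_collection(key, artifact_uids):
    """Create version 1 of a toy collection record (hypothetical helper)."""
    return {"key": key, "version": 1, "artifact_uids": tuple(artifact_uids)}


def append_artifact(previous, artifact_uid):
    """Return the next version; the previous version record is left unchanged."""
    return {
        "key": previous["key"],
        "version": previous["version"] + 1,
        "artifact_uids": previous["artifact_uids"] + (artifact_uid,),
    }


# Version 1 wraps exactly one artifact, so collection and artifact "match".
v1 = new_collection(
    "My versioned scRNA-seq collection",
    ["yoNTdfO3SvoxJy110000"],
)

# A later ingestion adds a second artifact (placeholder uid, not a real record).
v2 = append_artifact(v1, "placeholder-uid-of-second-artifact")

print(v1["version"], len(v1["artifact_uids"]))
print(v2["version"], len(v2["artifact_uids"]))
```

In this toy model the collection is a separate record that points at artifacts, which mirrors why the two can be tracked and queried independently even while version 1 contains a single dataset.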
collection.describe()
Collection
└── General
    ├── .uid = 'TO0IYCSd9yoAA0cl0000'
    ├── .key = 'My versioned scRNA-seq collection'
    ├── .hash = 'DuyXxlMxwF92YehyBLbhKg'
    ├── .created_by = testuser1 (Test User1)
    ├── .created_at = 2025-04-15 16:34:15
    └── .transform = 'scRNA-seq'
Access the underlying artifacts like so:
collection.artifacts.df()
id | uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | yoNTdfO3SvoxJy110000 | datasets/conde22.h5ad | None | .h5ad | dataset | AnnData | 57612943 | t_YJQpYrAyAGhs7Ir68zKj | None | 1648 | sha1-fl | True | False | 1 | 1 | None | None | True | 1 | 2025-04-15 16:34:12.076000+00:00 | 1 | None | 1 |
See data lineage:
collection.view_lineage()