scRNA-seq
Here, you’ll learn how to manage a growing number of scRNA-seq datasets as a single queryable collection:
- create a dataset (an Artifact) and seed a Collection
- concatenate the collection to a single tiledbsoma array store
If you’re only interested in using a large curated scRNA-seq collection, see the CELLxGENE guide.
# pip install 'lamindb[jupyter,bionty]'
!lamin init --storage ./test-scrna --modules bionty
→ initialized lamindb: testuser1/test-scrna
import lamindb as ln
import bionty as bt
ln.track("Nv48yAceNSh8")
→ connected lamindb: testuser1/test-scrna
→ created Transform('Nv48yAceNSh80000'), started new Run('A6k3kqG8...') at 2025-05-08 07:32:20 UTC
→ notebook imports: bionty==1.3.2 lamindb==1.5.0
Populate metadata registries based on an artifact
Let us look at the standardized data of Conde et al., Science (2022), available from CELLxGENE. anndata_human_immune_cells() loads a subsampled version:
adata = ln.core.datasets.anndata_human_immune_cells()
adata
AnnData object with n_obs × n_vars = 1648 × 36503
obs: 'donor', 'tissue', 'cell_type', 'assay'
var: 'feature_is_filtered', 'feature_reference', 'feature_biotype'
uns: 'default_embedding'
obsm: 'X_umap'
Before validating & annotating this artifact, we need to define valid features and a schema.
# define valid features
ln.Feature(name="donor", dtype=str).save()
ln.Feature(name="tissue", dtype=bt.Tissue).save()
ln.Feature(name="cell_type", dtype=bt.CellType).save()
ln.Feature(name="assay", dtype=bt.ExperimentalFactor).save()
# define anndata schema
obs_schema = ln.Schema(itype=ln.Feature).save()
varT_schema = ln.Schema(itype=bt.Gene.ensembl_gene_id).save()
schema = ln.Schema(
    name="Flexible AnnData",
    otype="AnnData",
    slots={"obs": obs_schema, "var.T": varT_schema},
).save()
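To build intuition for what a slot-based schema checks, here is a minimal plain-Python sketch. It is not how lamindb implements validation: the registries are toy sets, and `non_validated`, `slot_registries`, and `slot_identifiers` are hypothetical names introduced only for illustration. The idea is that each slot ("obs", "var.T") contributes a list of identifiers, and each identifier is looked up in the registry assigned to that slot.

```python
# Toy stand-ins for the Feature and Gene registries (plain sets, not lamindb objects).
feature_registry = {"donor", "tissue", "cell_type", "assay"}
gene_registry = {"ENSG00000000003", "ENSG00000000005"}

# Which registry each slot is checked against, mirroring the slots dict of the schema.
slot_registries = {"obs": feature_registry, "var.T": gene_registry}

# Identifiers found in each slot of a toy dataset:
# column names for "obs", gene IDs (the var index) for "var.T".
slot_identifiers = {
    "obs": ["donor", "tissue", "cell_type", "assay"],
    "var.T": ["ENSG00000000003", "ENSG00000230699"],
}


def non_validated(identifiers, registry):
    """Return the identifiers that are missing from the registry, in input order."""
    return [identifier for identifier in identifiers if identifier not in registry]


report = {
    slot: non_validated(identifiers, slot_registries[slot])
    for slot, identifiers in slot_identifiers.items()
}
print(report)
```

In this toy setup, all "obs" columns are known features, while one gene ID in "var.T" is missing from the toy gene registry and is reported.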
Let’s curate this artifact:
curator = ln.curators.AnnDataCurator(adata, schema)
try:
    curator.validate()
except ln.errors.ValidationError:
    pass
! 1 term not validated in feature 'cell_type' in slot 'obs': 'animal cell'
→ fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('cell_type')
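The message above offers two remedies: fix typos, or save genuinely new terms. A rough way to tell the two cases apart is to look for a close match among the known names. The sketch below uses Python's standard `difflib` for this and is not lamindb's own logic; `triage`, the toy `registry_names`, and the misspelled value are assumptions made up for illustration.

```python
from difflib import get_close_matches

# Toy stand-in for the names in a CellType registry (not queried from bionty).
registry_names = ["classical monocyte", "T follicular helper cell", "naive B cell"]

# Toy observed values; "classical monocyt" is a deliberate typo.
observed = ["classical monocyte", "animal cell", "classical monocyt"]


def triage(values, names, cutoff=0.8):
    """Map each non-validated value to its closest registry name, or None."""
    known = set(names)
    suggestions = {}
    for value in values:
        if value in known:
            continue  # validated, nothing to do
        matches = get_close_matches(value, names, n=1, cutoff=cutoff)
        suggestions[value] = matches[0] if matches else None
    return suggestions


suggestions = triage(observed, registry_names)
print(suggestions)
```

A value with a close match is likely a typo to correct. A value mapped to `None`, like 'animal cell' here, is a candidate for a new registry record.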
One cell type isn’t validated because it’s not part of the CellType registry. Let’s create it.
bt.CellType(name="animal cell").save()
CellType(uid='2Go5sf8V', name='animal cell', space_id=1, created_by_id=1, run_id=1, created_at=2025-05-08 07:32:25 UTC)
try:
    curator.validate()
except ln.errors.ValidationError:
    pass
! 220 terms not validated in feature 'columns' in slot 'var.T': 'ENSG00000230699', 'ENSG00000241180', 'ENSG00000226849', 'ENSG00000272482', 'ENSG00000264443', 'ENSG00000242396', 'ENSG00000237352', 'ENSG00000269933', 'ENSG00000286863', 'ENSG00000285808', 'ENSG00000261737', 'ENSG00000230427', 'ENSG00000226822', 'ENSG00000273373', 'ENSG00000259834', 'ENSG00000224167', 'ENSG00000256374', 'ENSG00000234283', 'ENSG00000263464', 'ENSG00000203812', ...
→ fix typos, remove non-existent values, or save terms via: curator.slots['var.T'].cat.add_new_from('columns')
Some Ensembl gene IDs are not validated, likely because they stem from an older version of Ensembl. We can create records for them in the Gene registry with the following convenience method.
curator.slots["var.T"].cat.add_new_from("columns")
Alternatively, we could import genes from an old Ensembl version into the Gene registry (see guide).
When we create an Artifact object from an AnnData via the curator, it is automatically annotated with the validated features and labels:
artifact = curator.save_artifact(key="datasets/conde22.h5ad")
→ not annotating with 36503 features for slot var.T as it exceeds 1000 (ln.settings.annotation.n_max_records)
It is annotated with rich metadata:
artifact.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'N1mEcYOuezEzvHsx0000'
│   ├── .key = 'datasets/conde22.h5ad'
│   ├── .size = 57612943
│   ├── .hash = 't_YJQpYrAyAGhs7Ir68zKj'
│   ├── .n_observations = 1648
│   ├── .path = /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/N1mEcYOuezEzvHsx0000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2025-05-08 07:32:58
│   └── .transform = 'scRNA-seq'
├── Dataset features
│   ├── obs • 4 [Feature]
│   │   assay      cat[bionty.ExperimentalF…  10x 3' v3, 10x 5' v1, 10x 5' v2
│   │   cell_type  cat[bionty.CellType]       CD16-negative, CD56-bright natural kille…
│   │   tissue     cat[bionty.Tissue]         blood, bone marrow, caecum, duodenum, il…
│   │   donor      str
│   └── var.T • 36503 [bionty.Gene.ensembl_gen…
└── Labels
    └── .tissues               bionty.Tissue              blood, thoracic lymph node, spleen, lung…
        .cell_types            bionty.CellType            classical monocyte, T follicular helper …
        .experimental_factors  bionty.ExperimentalFactor  10x 3' v3, 10x 5' v2, 10x 5' v1
Seed a collection
Let’s create a first version of a collection that will encompass many h5ad files when more data is ingested.
Note
To see the result of the incremental growth, take a look at the CELLxGENE Census guide for an instance with ~1k h5ads and ~50 million cells.
collection = ln.Collection(artifact, key="scrna/collection1").save()
In this first version, the collection contains exactly one artifact, so the two hold the same data. They are still tracked independently and are queryable through their own registries:
collection.describe()
Collection
└── General
    ├── .uid = '7WtmA1O38aVivq4k0000'
    ├── .key = 'scrna/collection1'
    ├── .hash = 'DuyXxlMxwF92YehyBLbhKg'
    ├── .created_by = testuser1 (Test User1)
    ├── .created_at = 2025-05-08 07:32:58
    └── .transform = 'scRNA-seq'
Access the underlying artifacts like so:
collection.artifacts.df()
id | uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | N1mEcYOuezEzvHsx0000 | datasets/conde22.h5ad | None | .h5ad | dataset | AnnData | 57612943 | t_YJQpYrAyAGhs7Ir68zKj | None | 1648 | sha1-fl | True | False | 1 | 1 | 3 | None | True | 1 | 2025-05-08 07:32:58.528000+00:00 | 1 | None | 1
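The outputs above show that the collection carries its own uid and hash while referencing the artifact rather than copying it. The following toy model sketches one way to think about that relationship. It is purely illustrative: `ToyArtifact`, `ToyCollection`, and the hashing rule are assumptions and do not describe how lamindb computes collection hashes, and the second artifact is a made-up entry.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyArtifact:
    """Hypothetical stand-in for one stored dataset."""
    key: str
    content_hash: str


@dataclass
class ToyCollection:
    """Hypothetical stand-in for a collection: a key plus references to artifacts."""
    key: str
    artifacts: list = field(default_factory=list)

    @property
    def content_hash(self) -> str:
        # Assumption for illustration only: identity is derived from the sorted
        # member hashes, so it is order-independent and changes when members change.
        joined = "\n".join(sorted(a.content_hash for a in self.artifacts))
        return hashlib.sha256(joined.encode()).hexdigest()


first = ToyArtifact("datasets/conde22.h5ad", "t_YJQpYrAyAGhs7Ir68zKj")
second = ToyArtifact("datasets/hypothetical.h5ad", "hypothetical-hash")  # made-up entry

v1 = ToyCollection("scrna/collection1", [first])
v2 = ToyCollection("scrna/collection1", [first, second])

print(v1.content_hash != first.content_hash)  # collection identity differs from its member's
print(v1.content_hash != v2.content_hash)     # adding a dataset yields a new identity
```

In this model, growing the collection never touches the existing artifact. Only the collection's own identity changes, which is why the two can be versioned and queried separately.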
See data lineage:
collection.view_lineage()
Finish the run and save the notebook.
ln.finish()
→ finished Run('A6k3kqG8') after 39s at 2025-05-08 07:33:00 UTC