scRNA-seq¶
Here, you’ll learn how to manage a growing number of scRNA-seq datasets as a single queryable collection:
- create a dataset (an Artifact) and seed a Collection
- concatenate the collection to a single tiledbsoma array store
If you’re only interested in using a large curated scRNA-seq collection, see the CELLxGENE guide.
# pip install 'lamindb[jupyter,bionty]'
!lamin init --storage ./test-scrna --modules bionty
→ initialized lamindb: testuser1/test-scrna
import lamindb as ln
import bionty as bt
ln.track("Nv48yAceNSh8")
→ connected lamindb: testuser1/test-scrna
→ created Transform('Nv48yAceNSh80000'), started new Run('R9fHTH3S...') at 2025-05-29 10:21:55 UTC
→ notebook imports: bionty==1.4a1 lamindb==1.6a1
Populate metadata registries based on an artifact¶
Let us look at the standardized data of Conde et al., Science (2022), available from CELLxGENE. anndata_human_immune_cells() loads a subsampled version:
adata = ln.core.datasets.anndata_human_immune_cells()
adata
AnnData object with n_obs × n_vars = 1648 × 36503
obs: 'donor', 'tissue', 'cell_type', 'assay'
var: 'feature_is_filtered', 'feature_reference', 'feature_biotype'
uns: 'default_embedding'
obsm: 'X_umap'
To validate & annotate a dataset, we need to define valid features and a schema.
# define valid features
ln.Feature(name="donor", dtype=str).save()
ln.Feature(name="tissue", dtype=bt.Tissue).save()
ln.Feature(name="cell_type", dtype=bt.CellType).save()
ln.Feature(name="assay", dtype=bt.ExperimentalFactor).save()
# define a schema (or get via ln.examples.anndata.anndata_ensembl_gene_ids_and_valid_features_in_obs())
obs_schema = ln.Schema(
    itype=ln.Feature
).save()  # validate obs columns against the Feature registry
varT_schema = ln.Schema(
    itype=bt.Gene.ensembl_gene_id
).save()  # validate var.T columns against the Gene registry
schema = ln.Schema(
    name="anndata_ensembl_gene_ids_and_valid_features_in_obs",
    otype="AnnData",
    slots={"obs": obs_schema, "var.T": varT_schema},
).save()
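To make the slot idea concrete, here is a minimal conceptual sketch in plain Python. It is not lamindb code: the sets stand in for registries, the dictionaries stand in for the schema and the dataset, and the function name unvalidated_columns is hypothetical. It only illustrates that each slot pairs one part of the AnnData object with the registry its column names are checked against, and that the columns of var.T are the row labels of var.

```python
# Conceptual sketch only: plain Python stand-ins, not lamindb objects.

# Hypothetical registries: the names each slot validates against.
feature_registry = {"donor", "tissue", "cell_type", "assay"}
gene_registry = {"ENSG00000000003", "ENSG00000000005"}

# A schema maps each slot to the registry its column names must come from.
schema_slots = {"obs": feature_registry, "var.T": gene_registry}

# Hypothetical dataset: column names per slot. The columns of var.T are the
# row labels of var, i.e. the gene identifiers.
dataset_columns = {
    "obs": ["donor", "tissue", "cell_type", "assay"],
    "var.T": ["ENSG00000000003", "ENSG00000000005", "ENSG00000230699"],
}


def unvalidated_columns(columns_by_slot, slots):
    """Return, per slot, the column names missing from that slot's registry."""
    return {
        slot: [name for name in columns_by_slot[slot] if name not in registry]
        for slot, registry in slots.items()
    }


report = unvalidated_columns(dataset_columns, schema_slots)
print(report)
```

In this toy setup all obs columns pass, while one gene identifier in var.T is reported as not validated.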
Let’s attempt saving this dataset as a validated & annotated artifact.
try:
    artifact = ln.Artifact.from_anndata(adata, schema=schema).save()
except ln.errors.ValidationError as error:
    print(error)
! 1 term not validated in feature 'cell_type' in slot 'obs': 'animal cell'
→ fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('cell_type')
1 term not validated in feature 'cell_type' in slot 'obs': 'animal cell'
→ fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('cell_type')
One cell type isn’t validated because it’s not part of the CellType registry. Let’s create it.
bt.CellType(name="animal cell").save()
CellType(uid='2Go5sf8V', name='animal cell', branch_id=1, space_id=1, created_by_id=1, run_id=1, created_at=2025-05-29 10:22:00 UTC)
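The fail, register, retry flow above can be pictured as a set-membership check. The following is a minimal sketch under stated assumptions, not lamindb's implementation: a plain set stands in for the CellType registry, the observed values are made up, and non_validated is a hypothetical helper.

```python
# Conceptual sketch only: a set stands in for the CellType registry.
cell_type_registry = {"classical monocyte", "B cell"}

# Hypothetical values of an obs["cell_type"] column.
observed = ["classical monocyte", "animal cell", "classical monocyte"]


def non_validated(values, registry):
    """Return the distinct values absent from the registry, in first-seen order."""
    missing = []
    for value in values:
        if value not in registry and value not in missing:
            missing.append(value)
    return missing


before = non_validated(observed, cell_type_registry)

# Analogous to saving a new CellType record in the registry.
cell_type_registry.add("animal cell")

after = non_validated(observed, cell_type_registry)
print(before, after)
```

Once the missing term is part of the registry, the same check reports nothing, which is why the second save attempt goes through for the cell_type feature.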
We can now save the dataset.
# runs ~10sec because it imports 40k Ensembl gene IDs from a public ontology
artifact = ln.Artifact.from_anndata(
    adata, key="datasets/conde22.h5ad", schema=schema
).save()
! 220 terms not validated in feature 'columns' in slot 'var.T': 'ENSG00000230699', 'ENSG00000241180', 'ENSG00000226849', 'ENSG00000272482', 'ENSG00000264443', 'ENSG00000242396', 'ENSG00000237352', 'ENSG00000269933', 'ENSG00000286863', 'ENSG00000285808', 'ENSG00000261737', 'ENSG00000230427', 'ENSG00000226822', 'ENSG00000273373', 'ENSG00000259834', 'ENSG00000224167', 'ENSG00000256374', 'ENSG00000234283', 'ENSG00000263464', 'ENSG00000203812', ...
→ fix typos, remove non-existent values, or save terms via: curator.slots['var.T'].cat.add_new_from('columns')
→ not annotating with 36283 features for slot var.T as it exceeds 1000 (ln.settings.annotation.n_max_records)
Some Ensembl gene IDs don’t validate because they stem from an older version of Ensembl. To be 100% sure that all gene identifiers are valid Ensembl IDs, you can import the genes from an old Ensembl version into the Gene registry (see guide). You can also enforce this through the .var.T schema by setting schema.maximal_set=True, which prohibits any non-valid features in the dataframe.
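The difference between the lenient behavior seen above (a warning, then a successful save) and a strict setting can be sketched as follows. This is a conceptual illustration in plain Python, not lamindb's implementation: the ValidationError class, the check_identifiers function, and the small registry are all hypothetical, and the maximal_set argument merely mirrors the name of the schema option.

```python
class ValidationError(Exception):
    """Local stand-in; not lamindb's ln.errors.ValidationError."""


def check_identifiers(identifiers, registry, maximal_set=False):
    """Return identifiers missing from the registry; raise if maximal_set is True."""
    unknown = sorted(set(identifiers) - set(registry))
    if unknown and maximal_set:
        raise ValidationError(
            f"{len(unknown)} identifier(s) not in registry: {unknown}"
        )
    return unknown


# Hypothetical registry and dataset identifiers.
registry = {"ENSG00000000003", "ENSG00000000005"}
ids = ["ENSG00000000003", "ENSG00000230699"]

# Lenient: unknown identifiers are reported, and the caller may proceed.
lenient = check_identifiers(ids, registry)

# Strict: the same input is rejected.
try:
    check_identifiers(ids, registry, maximal_set=True)
    strict_failed = False
except ValidationError:
    strict_failed = True

print(lenient, strict_failed)
```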
artifact.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'GKjaKEfWe5zX7uW00000'
│   ├── .key = 'datasets/conde22.h5ad'
│   ├── .size = 57612943
│   ├── .hash = 't_YJQpYrAyAGhs7Ir68zKj'
│   ├── .n_observations = 1648
│   ├── .path = /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/GKjaKEfWe5zX7uW00000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2025-05-29 10:22:24
│   └── .transform = 'scRNA-seq'
├── Dataset features
│   ├── obs • 4 [Feature]
│   │   assay                  cat[bionty.ExperimentalF…   10x 3' v3, 10x 5' v1, 10x 5' v2
│   │   cell_type              cat[bionty.CellType]        CD16-negative, CD56-bright natural kille…
│   │   tissue                 cat[bionty.Tissue]          blood, bone marrow, caecum, duodenum, il…
│   │   donor                  str
│   └── var.T • 36283 [bionty.Gene.ensembl_gen…
└── Labels
    └── .tissues               bionty.Tissue               blood, thoracic lymph node, spleen, lung…
        .cell_types            bionty.CellType             classical monocyte, T follicular helper …
        .experimental_factors  bionty.ExperimentalFactor   10x 3' v3, 10x 5' v2, 10x 5' v1
Seed a collection¶
Let’s create a first version of a collection that will encompass many h5ad files when more data is ingested.
Note
To see the result of the incremental growth, take a look at the CELLxGENE Census guide for an instance with ~1k h5ads and ~50 million cells.
collection = ln.Collection(artifact, key="scrna/collection1").save()
For this version 1 of the collection, collection and artifact match each other. But they’re independently tracked and queryable through their registries:
collection.describe()
Collection
└── General
    ├── .uid = '9NtDdfR9obBBu9mT0000'
    ├── .key = 'scrna/collection1'
    ├── .hash = 'DuyXxlMxwF92YehyBLbhKg'
    ├── .created_by = testuser1 (Test User1)
    ├── .created_at = 2025-05-29 10:22:24
    └── .transform = 'scRNA-seq'
Access the underlying artifacts like so:
collection.artifacts.df()
id | uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | branch_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | GKjaKEfWe5zX7uW00000 | datasets/conde22.h5ad | None | .h5ad | dataset | AnnData | 57612943 | t_YJQpYrAyAGhs7Ir68zKj | None | 1648 | sha1-fl | True | False | 1 | 1 | 3 | None | True | 1 | 2025-05-29 10:22:24.262000+00:00 | 1 | None | 1 |
See data lineage:
collection.view_lineage()
Finish the run and save the notebook.
ln.finish()
→ finished Run('R9fHTH3S') after 30s at 2025-05-29 10:22:26 UTC