
scRNA-seq

Here, you’ll learn how to manage a growing number of scRNA-seq datasets as a single queryable collection:

  1. create a dataset (an Artifact) and seed a Collection (scrna1/6)

  2. append a new dataset to the collection (scrna2/6)

  3. query & analyze individual datasets (scrna3/6)

  4. load the collection into memory (scrna4/6)

  5. iterate over the collection to train an ML model (scrna5/6)

  6. concatenate the collection to a single tiledbsoma array store (scrna6/6)

If you’re only interested in using a large curated scRNA-seq collection, see the CELLxGENE guide.

# pip install 'lamindb[jupyter,bionty]'
!lamin init --storage ./test-scrna --modules bionty
 initialized lamindb: testuser1/test-scrna
import lamindb as ln
import bionty as bt

ln.track("Nv48yAceNSh8")
 connected lamindb: testuser1/test-scrna
 created Transform('Nv48yAceNSh80000'), started new Run('A6k3kqG8...') at 2025-05-08 07:32:20 UTC
 notebook imports: bionty==1.3.2 lamindb==1.5.0

Populate metadata registries based on an artifact

Let us look at the standardized data of Conde et al., Science (2022), available from CELLxGENE. anndata_human_immune_cells() loads a subsampled version:

adata = ln.core.datasets.anndata_human_immune_cells()
adata
AnnData object with n_obs × n_vars = 1648 × 36503
    obs: 'donor', 'tissue', 'cell_type', 'assay'
    var: 'feature_is_filtered', 'feature_reference', 'feature_biotype'
    uns: 'default_embedding'
    obsm: 'X_umap'

Before validating & annotating this artifact, we need to define valid features and a schema.

# define valid features
ln.Feature(name="donor", dtype=str).save()
ln.Feature(name="tissue", dtype=bt.Tissue).save()
ln.Feature(name="cell_type", dtype=bt.CellType).save()
ln.Feature(name="assay", dtype=bt.ExperimentalFactor).save()

# define anndata schema
obs_schema = ln.Schema(itype=ln.Feature).save()
varT_schema = ln.Schema(itype=bt.Gene.ensembl_gene_id).save()
schema = ln.Schema(
    name="Flexible AnnData",
    otype="AnnData",
    slots={"obs": obs_schema, "var.T": varT_schema},
).save()

Let’s curate this artifact:

curator = ln.curators.AnnDataCurator(adata, schema)
try:
    curator.validate()
except ln.errors.ValidationError:
    pass
! 1 term not validated in feature 'cell_type' in slot 'obs': 'animal cell'
    → fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('cell_type')
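
As a rough mental model of what the message above reports (not lamindb's actual implementation), validation can be pictured as checking each distinct value of a column against the names known to a registry. The sketch below uses a toy set in place of the CellType registry; `find_unvalidated` and `toy_registry` are hypothetical names introduced only for illustration.

```python
def find_unvalidated(values, registry_names):
    """Return the distinct values that are absent from the registry, in first-seen order."""
    known = set(registry_names)
    missing = []
    for value in values:
        if value not in known and value not in missing:
            missing.append(value)
    return missing


# Toy stand-in for a registry of cell type names (assumption: plain name matching only).
toy_registry = {"classical monocyte", "T follicular helper cell"}

# Toy stand-in for the values of an obs column.
obs_cell_types = [
    "classical monocyte",
    "animal cell",
    "T follicular helper cell",
    "animal cell",
]

unvalidated = find_unvalidated(obs_cell_types, toy_registry)
print(unvalidated)

# Once the missing term is registered, the same check passes.
toy_registry.add("animal cell")
remaining = find_unvalidated(obs_cell_types, toy_registry)
print(remaining)
```

In this toy model, a term stops being reported as soon as it exists in the registry, which is the reason for creating the missing record before validating again.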

One cell type isn’t validated because it’s not part of the CellType registry. Let’s create it.

bt.CellType(name="animal cell").save()
CellType(uid='2Go5sf8V', name='animal cell', space_id=1, created_by_id=1, run_id=1, created_at=2025-05-08 07:32:25 UTC)
try:
    curator.validate()
except ln.errors.ValidationError:
    pass
! 220 terms not validated in feature 'columns' in slot 'var.T': 'ENSG00000230699', 'ENSG00000241180', 'ENSG00000226849', 'ENSG00000272482', 'ENSG00000264443', 'ENSG00000242396', 'ENSG00000237352', 'ENSG00000269933', 'ENSG00000286863', 'ENSG00000285808', 'ENSG00000261737', 'ENSG00000230427', 'ENSG00000226822', 'ENSG00000273373', 'ENSG00000259834', 'ENSG00000224167', 'ENSG00000256374', 'ENSG00000234283', 'ENSG00000263464', 'ENSG00000203812', ...
    → fix typos, remove non-existent values, or save terms via: curator.slots['var.T'].cat.add_new_from('columns')

Some Ensembl gene IDs are not validated, likely because they stem from an older version of Ensembl. We create records in the registry through the following convenience method.

curator.slots["var.T"].cat.add_new_from("columns")

Alternatively, we could import genes from an old Ensembl version into the Gene registry (see guide).
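
Whichever route is taken, it can be worth checking first that the flagged IDs are well-formed rather than typos, since bulk-adding terms would register typos as well. A minimal sketch, assuming unversioned human Ensembl gene IDs of the form `ENSG` followed by 11 digits: the first three IDs are copied from the warning above, the last one is a deliberately malformed, hypothetical example, and `split_by_format` is a hypothetical helper.

```python
import re

# Unversioned human Ensembl gene ID: "ENSG" followed by exactly 11 digits.
ENSEMBL_HUMAN_GENE_ID = re.compile(r"ENSG\d{11}")


def split_by_format(ids):
    """Separate IDs that match the expected pattern from those that do not."""
    well_formed, malformed = [], []
    for gene_id in ids:
        if ENSEMBL_HUMAN_GENE_ID.fullmatch(gene_id):
            well_formed.append(gene_id)
        else:
            malformed.append(gene_id)
    return well_formed, malformed


flagged = [
    "ENSG00000230699",  # from the validation warning
    "ENSG00000241180",  # from the validation warning
    "ENSG00000226849",  # from the validation warning
    "ENSG0000022684",   # hypothetical typo: only 10 digits
]

well_formed, malformed = split_by_format(flagged)
print(well_formed)
print(malformed)
```

Well-formed IDs that still fail validation are plausible candidates for retired identifiers from an older Ensembl release. Malformed ones are better fixed in the data than added to the registry. IDs carrying a version suffix would need to be stripped before applying this pattern.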

Saving the curated AnnData as an Artifact automatically annotates it with the validated features and labels:

artifact = curator.save_artifact(key="datasets/conde22.h5ad")
 not annotating with 36503 features for slot var.T as it exceeds 1000 (ln.settings.annotation.n_max_records)

It is annotated with rich metadata:

artifact.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'N1mEcYOuezEzvHsx0000'
│   ├── .key = 'datasets/conde22.h5ad'
│   ├── .size = 57612943
│   ├── .hash = 't_YJQpYrAyAGhs7Ir68zKj'
│   ├── .n_observations = 1648
│   ├── .path = /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/N1mEcYOuezEzvHsx0000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2025-05-08 07:32:58
│   └── .transform = 'scRNA-seq'
├── Dataset features
│   ├── obs • 4                  [Feature]
│   │   assay                       cat[bionty.ExperimentalF…  10x 3' v3, 10x 5' v1, 10x 5' v2          
│   │   cell_type                   cat[bionty.CellType]       CD16-negative, CD56-bright natural kille…
│   │   tissue                      cat[bionty.Tissue]         blood, bone marrow, caecum, duodenum, il…
│   │   donor                       str                                                                 
│   └── var.T • 36503            [bionty.Gene.ensembl_gen…
└── Labels
    └── .tissues                    bionty.Tissue              blood, thoracic lymph node, spleen, lung…
        .cell_types                 bionty.CellType            classical monocyte, T follicular helper …
        .experimental_factors       bionty.ExperimentalFactor  10x 3' v3, 10x 5' v2, 10x 5' v1          

Seed a collection

Let’s create a first version of a collection that will encompass many h5ad files when more data is ingested.

Note

To see the result of the incremental growth, take a look at the CELLxGENE Census guide for an instance with ~1k h5ads and ~50 million cells.

collection = ln.Collection(artifact, key="scrna/collection1").save()

In this first version, the collection contains exactly one artifact, so the two match each other. They are nonetheless tracked independently and are queryable through their own registries:

collection.describe()
Collection 
└── General
    ├── .uid = '7WtmA1O38aVivq4k0000'
    ├── .key = 'scrna/collection1'
    ├── .hash = 'DuyXxlMxwF92YehyBLbhKg'
    ├── .created_by = testuser1 (Test User1)
    ├── .created_at = 2025-05-08 07:32:58
    └── .transform = 'scRNA-seq'

Access the underlying artifacts like so:

collection.artifacts.df()
id                     1
uid                    N1mEcYOuezEzvHsx0000
key                    datasets/conde22.h5ad
description            None
suffix                 .h5ad
kind                   dataset
otype                  AnnData
size                   57612943
hash                   t_YJQpYrAyAGhs7Ir68zKj
n_files                None
n_observations         1648
_hash_type             sha1-fl
_key_is_virtual        True
_overwrite_versions    False
space_id               1
storage_id             1
schema_id              3
version                None
is_latest              True
run_id                 1
created_at             2025-05-08 07:32:58.528000+00:00
created_by_id          1
_aux                   None
_branch_code           1

See data lineage:

collection.view_lineage()
[Figure: data lineage graph of the collection]

Finish the run and save the notebook.

ln.finish()
 finished Run('A6k3kqG8') after 39s at 2025-05-08 07:33:00 UTC