scRNA-seq

Here, you’ll learn how to manage a growing number of scRNA-seq datasets as a single queryable collection:

  1. create a dataset (an Artifact) and seed a Collection (scrna1/6)

  2. append a new dataset to the collection (scrna2/6)

  3. query & analyze individual datasets (scrna3/6)

  4. load the collection into memory (scrna4/6)

  5. iterate over the collection to train an ML model (scrna5/6)

  6. concatenate the collection into a single tiledbsoma array store (scrna6/6)

If you’re only interested in using a large curated scRNA-seq collection, see the CELLxGENE guide.

# pip install 'lamindb[jupyter,bionty]'
!lamin init --storage ./test-scrna --modules bionty
 initialized lamindb: testuser1/test-scrna
import lamindb as ln
import bionty as bt

ln.track()
 connected lamindb: testuser1/test-scrna
 created Transform('wtHAFzM33lw70000'), started new Run('ZfsLXQWp...') at 2025-04-15 16:33:29 UTC
 notebook imports: bionty==1.3.0 lamindb==1.4.0

Populate metadata registries based on an artifact

Let us look at the standardized data of Conde et al., Science (2022), available from CELLxGENE. anndata_human_immune_cells() loads a subsampled version:

adata = ln.core.datasets.anndata_human_immune_cells()
adata
AnnData object with n_obs × n_vars = 1648 × 36503
    obs: 'donor', 'tissue', 'cell_type', 'assay'
    var: 'feature_is_filtered', 'feature_reference', 'feature_biotype'
    uns: 'default_embedding'
    obsm: 'X_umap'

Let’s curate this artifact:

curator = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        adata.obs.donor.name: ln.ULabel.name,
        adata.obs.tissue.name: bt.Tissue.name,
        adata.obs.cell_type.name: bt.CellType.name,
        adata.obs.assay.name: bt.ExperimentalFactor.name,
    },
    organism="human",
)
! organism is ignored, define it on the dtype level
!   4 terms are not validated: 'donor', 'tissue', 'cell_type', 'assay'
    → fix typos, remove non-existent values, or save terms via .add_new_from("columns")
# this takes a while because the instance is still empty
curator.validate()
!   12 terms are not validated: 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', 'A31', '582C'
    → fix typos, remove non-existent values, or save terms via .add_new_from("donor")
!   1 term is not validated: 'animal cell'
    → fix typos, remove non-existent values, or save terms via .add_new_from("cell_type")
!   220 terms are not validated: 'ENSG00000230699', 'ENSG00000241180', 'ENSG00000226849', 'ENSG00000272482', 'ENSG00000264443', 'ENSG00000242396', 'ENSG00000237352', 'ENSG00000269933', 'ENSG00000286863', 'ENSG00000285808', 'ENSG00000261737', 'ENSG00000230427', 'ENSG00000226822', 'ENSG00000273373', 'ENSG00000259834', 'ENSG00000224167', 'ENSG00000256374', 'ENSG00000234283', 'ENSG00000263464', 'ENSG00000203812', ...
    → fix typos, remove non-existent values, or save terms via .add_new_from("var_index")
False
curator.add_new_from("var_index")
curator.add_new_from("donor")
curator.add_new_from("cell_type")
curator.validate()
True
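
The validate, add, and re-validate cycle above can be pictured with a minimal, library-free sketch. It assumes a registry behaves like a set of known terms; `ToyRegistry` and `validate` are hypothetical names invented for illustration and are not part of lamindb or bionty, and the values are toy data loosely modeled on the output above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRegistry:
    """Hypothetical stand-in for a metadata registry (not a lamindb class)."""

    known: set = field(default_factory=set)

    def non_validated(self, values):
        """Return the distinct unknown values in first-seen order."""
        missing = []
        for value in values:
            if value not in self.known and value not in missing:
                missing.append(value)
        return missing

    def add_new_from(self, values):
        """Register the given values as known terms."""
        self.known.update(values)


def validate(columns, registries):
    """Check every column against its registry; return (ok, report)."""
    report = {}
    for name, values in columns.items():
        missing = registries[name].non_validated(values)
        if missing:
            report[name] = missing
    return (not report), report


# Toy data: two annotation columns and one registry per column.
columns = {
    "donor": ["D496", "621B", "D496", "A29"],
    "cell_type": ["classical monocyte", "animal cell"],
}
registries = {
    "donor": ToyRegistry(),  # empty, like a freshly initialized instance
    "cell_type": ToyRegistry({"classical monocyte"}),
}

# First pass: unknown terms are reported and validation fails.
ok_before, report_before = validate(columns, registries)

# Register the reported terms, then validate again.
for name, missing in report_before.items():
    registries[name].add_new_from(missing)
ok_after, report_after = validate(columns, registries)
```

In this sketch the first pass fails because the registries do not yet know the donor identifiers or 'animal cell'; once those terms are registered, the second pass succeeds. The real curator additionally maps terms onto ontology-backed registries, which this toy model does not attempt.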

When we create an Artifact object from an AnnData object, it is automatically annotated with the validated features and labels:

artifact = curator.save_artifact(key="datasets/conde22.h5ad")

It is annotated with rich metadata:

artifact.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'yoNTdfO3SvoxJy110000'
│   ├── .key = 'datasets/conde22.h5ad'
│   ├── .size = 57612943
│   ├── .hash = 't_YJQpYrAyAGhs7Ir68zKj'
│   ├── .n_observations = 1648
│   ├── .path = /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/yoNTdfO3SvoxJy110000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2025-04-15 16:34:12
│   └── .transform = 'scRNA-seq'
├── Dataset features
│   ├── var • 36503                 [bionty.Gene]
│   │   MIR1302-2HG                 float
│   │   FAM138A                     float
│   │   OR4F5                       float
│   │   OR4F29                      float
│   │   OR4F16                      float
│   │   LINC01409                   float
│   │   FAM87B                      float
│   │   LINC01128                   float
│   │   LINC00115                   float
│   │   FAM41C                      float
│   └── obs • 4                     [Feature]
│       assay                       cat[bionty.ExperimentalF…  10x 3' v3, 10x 5' v1, 10x 5' v2
│       cell_type                   cat[bionty.CellType]       CD16-negative, CD56-bright natural kille…
│       donor                       cat[ULabel]                582C, 621B, 637C, 640C, A29, A31, A35, A…
│       tissue                      cat[bionty.Tissue]         blood, bone marrow, caecum, duodenum, il…
└── Labels
    └── .tissues                    bionty.Tissue              blood, thoracic lymph node, spleen, lung…
        .cell_types                 bionty.CellType            classical monocyte, T follicular helper …
        .experimental_factors       bionty.ExperimentalFactor  10x 3' v3, 10x 5' v2, 10x 5' v1          
        .ulabels                    ULabel                     D496, 621B, A29, A36, A35, 637C, A52, A3…

Seed a collection

Let’s create a first version of a collection that will encompass many h5ad files when more data is ingested.

Note

To see the result of the incremental growth, take a look at the CELLxGENE Census guide for an instance with ~1k h5ads and ~50 million cells.

collection = ln.Collection(artifact, key="My versioned scRNA-seq collection").save()

In this first version, the collection contains exactly one artifact, so the two hold the same data. They are nonetheless tracked independently and are queryable through their own registries:

collection.describe()
Collection 
└── General
    ├── .uid = 'TO0IYCSd9yoAA0cl0000'
    ├── .key = 'My versioned scRNA-seq collection'
    ├── .hash = 'DuyXxlMxwF92YehyBLbhKg'
    ├── .created_by = testuser1 (Test User1)
    ├── .created_at = 2025-04-15 16:34:15
    └── .transform = 'scRNA-seq'
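
To make the relationship between a collection and its artifacts more concrete, here is a minimal conceptual sketch. It assumes a collection is a versioned record that references artifacts by uid, and that appending data yields a new version rather than mutating the old one. `ToyArtifact`, `ToyCollection`, the uids, and the second dataset are all hypothetical; this is not how lamindb stores or versions collections internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyArtifact:
    """Hypothetical record for one stored dataset."""

    uid: str
    key: str
    n_observations: int


@dataclass(frozen=True)
class ToyCollection:
    """Hypothetical versioned record that references artifacts by uid."""

    key: str
    version: int
    artifact_uids: tuple

    def append(self, artifact):
        """Return a new version that also references `artifact`."""
        return ToyCollection(
            self.key, self.version + 1, self.artifact_uids + (artifact.uid,)
        )


# Two independent registries: one for artifacts, one for collections.
artifact_registry = {}
collection_registry = {}

# Version 1 references a single artifact (uid made up for illustration).
conde = ToyArtifact("artifact-1", "datasets/conde22.h5ad", 1648)
artifact_registry[conde.uid] = conde
v1 = ToyCollection("My versioned scRNA-seq collection", 1, (conde.uid,))
collection_registry[(v1.key, v1.version)] = v1

# A hypothetical second dataset produces version 2; version 1 is untouched.
other = ToyArtifact("artifact-2", "datasets/hypothetical.h5ad", 100)
artifact_registry[other.uid] = other
v2 = v1.append(other)
collection_registry[(v2.key, v2.version)] = v2


def total_observations(collection):
    """Sum the observation counts of all artifacts a collection references."""
    return sum(
        artifact_registry[uid].n_observations for uid in collection.artifact_uids
    )
```

Here `total_observations(v1)` covers only the first dataset, while `total_observations(v2)` also counts the hypothetical second one. Because the collection stores references only, each artifact remains queryable on its own.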

Access the underlying artifacts like so:

collection.artifacts.df()
id                   1
uid                  yoNTdfO3SvoxJy110000
key                  datasets/conde22.h5ad
description          None
suffix               .h5ad
kind                 dataset
otype                AnnData
size                 57612943
hash                 t_YJQpYrAyAGhs7Ir68zKj
n_files              None
n_observations       1648
_hash_type           sha1-fl
_key_is_virtual      True
_overwrite_versions  False
space_id             1
storage_id           1
schema_id            None
version              None
is_latest            True
run_id               1
created_at           2025-04-15 16:34:12.076000+00:00
created_by_id        1
_aux                 None
_branch_code         1

See data lineage:

collection.view_lineage()
[Figure: lineage graph of the collection]