
scRNA-seq

Here, you’ll learn how to manage a growing number of scRNA-seq datasets as a single queryable collection:

  1. create a dataset (an Artifact) and seed a Collection (scrna1/6)

  2. append a new dataset to the collection (scrna2/6)

  3. query & analyze individual datasets (scrna3/6)

  4. load the collection into memory (scrna4/6)

  5. iterate over the collection to train an ML model (scrna5/6)

  6. concatenate the collection to a single tiledbsoma array store (scrna6/6)

If you’re only interested in using a large curated scRNA-seq collection, see the CELLxGENE guide.

# !pip install 'lamindb[jupyter,aws,bionty]'
!lamin init --storage ./test-scrna --schema bionty
 initialized lamindb: testuser1/test-scrna
import lamindb as ln
import bionty as bt

ln.track("Nv48yAceNSh80003")
 connected lamindb: testuser1/test-scrna
 created Transform('Nv48yAceNSh80003'), started new Run('OeUJxfZS...') at 2025-01-20 07:34:54 UTC
 notebook imports: bionty==1.0.0 lamindb==1.0.2

Populate metadata registries based on an artifact

Let us look at the standardized data of Conde et al., Science (2022), available from CELLxGENE. anndata_human_immune_cells() loads a subsampled version:

adata = ln.core.datasets.anndata_human_immune_cells()
adata
AnnData object with n_obs × n_vars = 1648 × 36503
    obs: 'donor', 'tissue', 'cell_type', 'assay'
    var: 'feature_is_filtered', 'feature_reference', 'feature_biotype'
    uns: 'default_embedding'
    obsm: 'X_umap'

Let’s curate this AnnData object:

curator = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        adata.obs.donor.name: ln.ULabel.name,
        adata.obs.tissue.name: bt.Tissue.name,
        adata.obs.cell_type.name: bt.CellType.name,
        adata.obs.assay.name: bt.ExperimentalFactor.name,
    },
    organism="human",
)
 added 4 records with Feature.name for "columns": 'donor', 'tissue', 'cell_type', 'assay'
# this runs a while, because this instance is still empty
curator.validate()
 saving validated records of 'var_index'
 added 36283 records from public with Gene.ensembl_gene_id for "var_index": 'ENSG00000243485', 'ENSG00000237613', 'ENSG00000186092', 'ENSG00000238009', 'ENSG00000239945', 'ENSG00000239906', 'ENSG00000241860', 'ENSG00000241599', 'ENSG00000286448', 'ENSG00000236601', 'ENSG00000284733', 'ENSG00000235146', 'ENSG00000284662', 'ENSG00000229905', 'ENSG00000237491', 'ENSG00000177757', 'ENSG00000228794', 'ENSG00000225880', 'ENSG00000230368', 'ENSG00000272438', ...
 saving validated records of 'tissue'
 added 17 records from public with Tissue.name for "tissue": 'jejunal epithelium', 'duodenum', 'caecum', 'blood', 'liver', 'mesenteric lymph node', 'spleen', 'omentum', 'bone marrow', 'ileum', 'lung', 'thoracic lymph node', 'transverse colon', 'lamina propria', 'thymus', 'sigmoid colon', 'skeletal muscle tissue'
 saving validated records of 'cell_type'
 added 31 records from public with CellType.name for "cell_type": 'megakaryocyte', 'T follicular helper cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'CD16-negative, CD56-bright natural killer cell, human', 'dendritic cell, human', 'gamma-delta T cell', 'lymphocyte', 'plasma cell', 'progenitor cell', 'alpha-beta T cell', 'CD8-positive, alpha-beta memory T cell, CD45RO-positive', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'naive B cell', 'group 3 innate lymphoid cell', 'plasmacytoid dendritic cell', 'CD8-positive, alpha-beta memory T cell', 'plasmablast', 'CD4-positive helper T cell', 'conventional dendritic cell', 'memory B cell', ...
 saving validated records of 'assay'
 added 3 records from public with ExperimentalFactor.name for "assay": '10x 5' v1', '10x 5' v2', '10x 3' v3'
 mapping "var_index" on Gene.ensembl_gene_id
!   220 terms are not validated: 'ENSG00000230699', 'ENSG00000241180', 'ENSG00000226849', 'ENSG00000272482', 'ENSG00000264443', 'ENSG00000242396', 'ENSG00000237352', 'ENSG00000269933', 'ENSG00000286863', 'ENSG00000285808', 'ENSG00000261737', 'ENSG00000230427', 'ENSG00000226822', 'ENSG00000273373', 'ENSG00000259834', 'ENSG00000224167', 'ENSG00000256374', 'ENSG00000234283', 'ENSG00000263464', 'ENSG00000203812', ...
    → fix typos, remove non-existent values, or save terms via .add_new_from_var_index()
 mapping "donor" on ULabel.name
!   12 terms are not validated: 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', 'A31', '582C'
    → fix typos, remove non-existent values, or save terms via .add_new_from("donor")
 "tissue" is validated against Tissue.name
 mapping "cell_type" on CellType.name
!   1 term is not validated: 'animal cell'
    → fix typos, remove non-existent values, or save terms via .add_new_from("cell_type")
 "assay" is validated against ExperimentalFactor.name
False
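The counts in this output can be cross-checked against the shape of the AnnData object. The short sketch below uses only numbers printed above (36503 genes in total, 36283 added from the public source, 220 not validated); the variable names are illustrative.

```python
# Cross-check of the gene counts reported by curator.validate() above.
n_vars = 36503              # n_vars of the AnnData object
added_from_public = 36283   # Ensembl gene IDs found in the public source
not_validated = 220         # Ensembl gene IDs that could not be validated

# Every gene is either found in the public source or reported as not validated.
accounted_for = added_from_public + not_validated
fraction_not_validated = not_validated / n_vars

print(accounted_for == n_vars)
print(f"{fraction_not_validated:.2%} of the gene IDs are not validated")
```

Only a small fraction of the gene IDs is affected, which is why registering them explicitly in the next step is a reasonable choice here.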
curator.add_new_from_var_index()
 added 220 records with Gene.ensembl_gene_id for "var_index": 'ENSG00000230699', 'ENSG00000241180', 'ENSG00000226849', 'ENSG00000272482', 'ENSG00000264443', 'ENSG00000242396', 'ENSG00000237352', 'ENSG00000269933', 'ENSG00000286863', 'ENSG00000285808', 'ENSG00000261737', 'ENSG00000230427', 'ENSG00000226822', 'ENSG00000273373', 'ENSG00000259834', 'ENSG00000224167', 'ENSG00000256374', 'ENSG00000234283', 'ENSG00000263464', 'ENSG00000203812', ...
curator.add_new_from("donor")
curator.add_new_from("cell_type")
 added 12 records with ULabel.name for "donor": '637C', 'D503', '640C', 'A37', 'A35', 'D496', '621B', '582C', 'A52', 'A31', 'A36', 'A29'
 added 1 record with CellType.name for "cell_type": 'animal cell'
curator.validate()
 "var_index" is validated against Gene.ensembl_gene_id
 "donor" is validated against ULabel.name
 "tissue" is validated against Tissue.name
 "cell_type" is validated against CellType.name
 "assay" is validated against ExperimentalFactor.name
True
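The cycle above (validate, register the missing terms, validate again) can be pictured with a toy registry. This is a minimal sketch and not lamindb code: `ToyRegistry` and its methods `non_validated`, `validate`, and `add_new` are hypothetical names, and the registry holds only a handful of the cell types shown in the output.

```python
# Toy model of the validate -> add new terms -> validate cycle.
# Not lamindb code: ToyRegistry and its methods are hypothetical.


class ToyRegistry:
    """A set of known terms that values can be validated against."""

    def __init__(self, known_terms):
        self.known_terms = set(known_terms)

    def non_validated(self, values):
        """Return the distinct values that are not in the registry, sorted."""
        return sorted(set(values) - self.known_terms)

    def validate(self, values):
        """Return True only if every value is a known term."""
        return not self.non_validated(values)

    def add_new(self, values):
        """Register all values that are not yet known and return them."""
        new_terms = self.non_validated(values)
        self.known_terms.update(new_terms)
        return new_terms


# Hypothetical registry seeded with a few terms from a public ontology.
cell_types = ToyRegistry(["megakaryocyte", "plasma cell", "naive B cell"])
observed = ["plasma cell", "animal cell", "naive B cell", "animal cell"]

first_check = cell_types.validate(observed)   # 'animal cell' is unknown
added = cell_types.add_new(observed)          # register it explicitly
second_check = cell_types.validate(observed)  # now every value is known
print(first_check, added, second_check)
```

Keeping the registration step explicit means a typo never enters the registry silently: validation fails first, and a person decides whether the term is new or wrong.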

When we create an Artifact object from the AnnData through the curator, it is automatically annotated with the validated features and labels:

artifact = curator.save_artifact(description="Human immune cells from Conde22")

It is annotated with rich metadata:

artifact.describe(print_types=True)
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'heySREilI158XLST0000'
│   ├── .size = 57612943
│   ├── .hash = 't_YJQpYrAyAGhs7Ir68zKj'
│   ├── .n_observations = 1648
│   ├── .path = /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/heySREilI158XLST0000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2025-01-20 07:35:53
│   └── .transform = 'scRNA-seq'
├── Dataset features/._schemas_m2m
│   ├── var • 36503                 [bionty.Gene]
│   │   MIR1302-2HG                 float                                                               
│   │   FAM138A                     float                                                               
│   │   OR4F5                       float                                                               
│   │   OR4F29                      float                                                               
│   │   OR4F16                      float                                                               
│   │   LINC01409                   float                                                               
│   │   FAM87B                      float                                                               
│   │   LINC01128                   float                                                               
│   │   LINC00115                   float                                                               
│   │   FAM41C                      float                                                               
│   └── obs • 4                     [Feature]
│       assay                       cat[bionty.ExperimentalF…  10x 3' v3, 10x 5' v1, 10x 5' v2
│       cell_type                   cat[bionty.CellType]       CD16-negative, CD56-bright natural kille…
│       donor                       cat[ULabel]                582C, 621B, 637C, 640C, A29, A31, A35, A…
│       tissue                      cat[bionty.Tissue]         blood, bone marrow, caecum, duodenum, il…
└── Labels
    └── .tissues                    bionty.Tissue              jejunal epithelium, duodenum, caecum, bl…
        .cell_types                 bionty.CellType            megakaryocyte, T follicular helper cell,…
        .experimental_factors       bionty.ExperimentalFactor  10x 5' v1, 10x 5' v2, 10x 3' v3          
        .ulabels                    ULabel                     637C, D503, 640C, A37, A35, D496, 621B, …

Seed a collection

Let’s create a first version of a collection that will encompass many h5ad files when more data is ingested.

Note

To see the result of the incremental growth, take a look at the CELLxGENE Census guide for an instance with ~1k h5ads and ~50 million cells.

collection = ln.Collection(artifact, key="My versioned scRNA-seq collection").save()

In this first version, the collection contains exactly one artifact, so the two match. They are nonetheless tracked independently and are queryable through their own registries:

collection.describe()
Collection 
└── General
    ├── .uid = 'CefCs4oPJxCIkP7o0000'
    ├── .key = 'My versioned scRNA-seq collection'
    ├── .hash = 'DuyXxlMxwF92YehyBLbhKg'
    ├── .created_by = testuser1 (Test User1)
    ├── .created_at = 2025-01-20 07:35:57
    └── .transform = 'scRNA-seq'
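To make the relation between a collection version and its artifacts concrete, here is a toy model of a collection that grows by versions. It is a sketch under stated assumptions, not lamindb code: `ToyArtifact`, `ToyCollection`, and `append` are hypothetical names, and the second artifact with its observation count is made up for illustration.

```python
from dataclasses import dataclass

# Toy model of a versioned collection; not lamindb code.
# ToyArtifact, ToyCollection, and append are hypothetical names.


@dataclass(frozen=True)
class ToyArtifact:
    uid: str
    n_observations: int


@dataclass(frozen=True)
class ToyCollection:
    key: str
    version: int
    artifacts: tuple

    def append(self, artifact):
        """Return a new version that also contains `artifact`; self is unchanged."""
        return ToyCollection(self.key, self.version + 1, self.artifacts + (artifact,))

    @property
    def n_observations(self):
        """Total number of observations across all member artifacts."""
        return sum(a.n_observations for a in self.artifacts)


# Version 1 holds a single artifact, so collection and artifact match.
first = ToyArtifact(uid="artifact-1", n_observations=1648)
v1 = ToyCollection(key="My versioned scRNA-seq collection", version=1, artifacts=(first,))

# Appending yields version 2; the second artifact and its count are made up.
v2 = v1.append(ToyArtifact(uid="artifact-2", n_observations=70))

print(v1.version, v1.n_observations)
print(v2.version, v2.n_observations)
```

Because each version is an immutable record, version 1 stays queryable after version 2 exists, which is the property a versioned collection needs as more datasets are ingested.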

Access the underlying artifacts like so:

collection.artifacts.df()
id                   1
uid                  heySREilI158XLST0000
key                  None
description          Human immune cells from Conde22
suffix               .h5ad
kind                 dataset
otype                AnnData
size                 57612943
hash                 t_YJQpYrAyAGhs7Ir68zKj
n_files              None
n_observations       1648
_hash_type           sha1-fl
_key_is_virtual      True
_overwrite_versions  False
space_id             1
storage_id           1
schema_id            None
version              None
is_latest            True
run_id               1
created_at           2025-01-20 07:35:53.126000+00:00
created_by_id        1
_aux                 None
_branch_code         1

See data lineage:

collection.view_lineage()
[Figure: data lineage graph of the collection]