
scRNA-seq

Here, you’ll learn how to manage a growing number of scRNA-seq datasets as a single queryable collection:

  1. create a dataset (an Artifact) and seed a Collection (scrna1/6)

  2. append a new dataset to the collection (scrna2/6)

  3. query & analyze individual datasets (scrna3/6)

  4. load the collection into memory (scrna4/6)

  5. iterate over the collection to train an ML model (scrna5/6)

  6. concatenate the collection to a single tiledbsoma array store (scrna6/6)

If you’re only interested in using a large curated scRNA-seq collection, see the CELLxGENE guide.

# !pip install 'lamindb[jupyter,aws,bionty]'
!lamin init --storage ./test-scrna --schema bionty
 connected lamindb: testuser1/test-scrna
import lamindb as ln
import bionty as bt

ln.track("Nv48yAceNSh80003")
 connected lamindb: testuser1/test-scrna
 created Transform('Nv48yAce'), started new Run('esANpd5l') at 2024-12-20 15:03:59 UTC
 notebook imports: bionty==0.53.2 lamindb==0.77.3

Populate metadata registries based on an artifact

Let us look at the standardized data of Conde et al., Science (2022), available from CELLxGENE. anndata_human_immune_cells() loads a subsampled version:

adata = ln.core.datasets.anndata_human_immune_cells()
adata
AnnData object with n_obs × n_vars = 1648 × 36503
    obs: 'donor', 'tissue', 'cell_type', 'assay'
    var: 'feature_is_filtered', 'feature_reference', 'feature_biotype'
    uns: 'default_embedding'
    obsm: 'X_umap'

Let’s curate this artifact:

curator = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        adata.obs.donor.name: ln.ULabel.name,
        adata.obs.tissue.name: bt.Tissue.name,
        adata.obs.cell_type.name: bt.CellType.name,
        adata.obs.assay.name: bt.ExperimentalFactor.name,
    },
    organism="human",
)
 added 4 records with Feature.name for "columns": 'donor', 'tissue', 'cell_type', 'assay'
# this runs a while, because this instance is still empty
curator.validate()
 saving validated records of 'var_index'
 added 36283 records from public with Gene.ensembl_gene_id for "var_index": 'ENSG00000243485', 'ENSG00000237613', 'ENSG00000186092', 'ENSG00000238009', 'ENSG00000239945', 'ENSG00000239906', 'ENSG00000241860', 'ENSG00000241599', 'ENSG00000286448', 'ENSG00000236601', 'ENSG00000284733', 'ENSG00000235146', 'ENSG00000284662', 'ENSG00000229905', 'ENSG00000237491', 'ENSG00000177757', 'ENSG00000228794', 'ENSG00000225880', 'ENSG00000230368', 'ENSG00000272438', ...
 saving validated records of 'tissue'
 added 17 records from public with Tissue.name for "tissue": 'bone marrow', 'sigmoid colon', 'skeletal muscle tissue', 'liver', 'duodenum', 'ileum', 'blood', 'spleen', 'omentum', 'transverse colon', 'mesenteric lymph node', 'thoracic lymph node', 'lung', 'lamina propria', 'jejunal epithelium', 'thymus', 'caecum'
 saving validated records of 'cell_type'
 added 31 records from public with CellType.name for "cell_type": 'group 3 innate lymphoid cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'mucosal invariant T cell', 'non-classical monocyte', 'regulatory T cell', 'lymphocyte', 'alveolar macrophage', 'conventional dendritic cell', 'plasma cell', 'effector memory CD4-positive, alpha-beta T cell', 'CD16-negative, CD56-bright natural killer cell, human', 'mast cell', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'memory B cell', 'CD4-positive helper T cell', 'T follicular helper cell', 'germinal center B cell', 'plasmacytoid dendritic cell', 'progenitor cell', 'macrophage', ...
 saving validated records of 'assay'
 added 3 records from public with ExperimentalFactor.name for "assay": '10x 5' v1', '10x 3' v3', '10x 5' v2'
 mapping "var_index" on Gene.ensembl_gene_id
!   220 terms are not validated: 'ENSG00000230699', 'ENSG00000241180', 'ENSG00000226849', 'ENSG00000272482', 'ENSG00000264443', 'ENSG00000242396', 'ENSG00000237352', 'ENSG00000269933', 'ENSG00000286863', 'ENSG00000285808', 'ENSG00000261737', 'ENSG00000230427', 'ENSG00000226822', 'ENSG00000273373', 'ENSG00000259834', 'ENSG00000224167', 'ENSG00000256374', 'ENSG00000234283', 'ENSG00000263464', 'ENSG00000203812', ...
    → fix typos, remove non-existent values, or save terms via .add_new_from_var_index()
 mapping "donor" on ULabel.name
!   12 terms are not validated: 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', 'A31', '582C'
    → fix typos, remove non-existent values, or save terms via .add_new_from("donor")
 "tissue" is validated against Tissue.name
 mapping "cell_type" on CellType.name
!   1 term is not validated: 'animal cell'
    → fix typos, remove non-existent values, or save terms via .add_new_from("cell_type")
 "assay" is validated against ExperimentalFactor.name
False
curator.add_new_from_var_index()
 added 220 records with Gene.ensembl_gene_id for "var_index": 'ENSG00000230699', 'ENSG00000241180', 'ENSG00000226849', 'ENSG00000272482', 'ENSG00000264443', 'ENSG00000242396', 'ENSG00000237352', 'ENSG00000269933', 'ENSG00000286863', 'ENSG00000285808', 'ENSG00000261737', 'ENSG00000230427', 'ENSG00000226822', 'ENSG00000273373', 'ENSG00000259834', 'ENSG00000224167', 'ENSG00000256374', 'ENSG00000234283', 'ENSG00000263464', 'ENSG00000203812', ...
curator.add_new_from("donor")
curator.add_new_from("cell_type")
 added 12 records with ULabel.name for "donor": '640C', 'D503', '637C', 'A36', 'A29', '582C', 'A52', '621B', 'D496', 'A37', 'A31', 'A35'
 added 1 record with CellType.name for "cell_type": 'animal cell'
curator.validate()
 "var_index" is validated against Gene.ensembl_gene_id
 "donor" is validated against ULabel.name
 "tissue" is validated against Tissue.name
 "cell_type" is validated against CellType.name
 "assay" is validated against ExperimentalFactor.name
True
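
The validate, add, re-validate cycle above can be pictured with a toy model. This is a minimal sketch in plain Python, not lamindb's implementation: the helper names `find_unvalidated` and `validate`, the miniature `columns`, and the set-based `registries` are all hypothetical stand-ins for `adata.obs` and the `ULabel` and `bionty` registries.

```python
# Toy model of the validate / add-new / re-validate loop (NOT lamindb's implementation).


def find_unvalidated(values, registry):
    """Return the distinct values missing from the registry, in first-seen order."""
    seen = set()
    missing = []
    for value in values:
        if value not in registry and value not in seen:
            seen.add(value)
            missing.append(value)
    return missing


def validate(columns, registries):
    """Return True only if every value of every column is present in its registry."""
    return all(
        not find_unvalidated(values, registries[name])
        for name, values in columns.items()
    )


# Hypothetical miniature stand-ins for adata.obs columns and their registries.
columns = {
    "donor": ["A29", "A31", "A29"],
    "tissue": ["blood", "spleen", "blood"],
}
registries = {
    "donor": set(),                 # empty, like ULabel in a fresh instance
    "tissue": {"blood", "spleen"},  # already populated from a public ontology
}

first_pass = validate(columns, registries)
missing_donors = find_unvalidated(columns["donor"], registries["donor"])

# Similar in spirit to curator.add_new_from("donor"): register the new terms.
registries["donor"].update(missing_donors)
second_pass = validate(columns, registries)

print(first_pass, missing_donors, second_pass)
```

The first pass fails only because the donor identifiers exist in no public ontology. Once they are registered explicitly, the second pass succeeds, which mirrors the `False` then `True` results above.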

Saving the curated AnnData as an Artifact automatically annotates it with the validated features and labels:

artifact = curator.save_artifact(description="Human immune cells from Conde22")

It is annotated with rich metadata:

artifact.describe(print_types=True)
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'NAxy8G7q3xarNdqt0000'
│   ├── .size = 57612943
│   ├── .hash = 't_YJQpYrAyAGhs7Ir68zKj'
│   ├── .n_observations = 1648
│   ├── .path = /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/NAxy8G7q3xarNdqt0000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2024-12-20 15:04:55
│   └── .transform = 'scRNA-seq'
├── Dataset features/.feature_sets
│   ├── var • 36503                 [bionty.Gene]
│   │   MIR1302-2HG                 float
│   │   FAM138A                     float
│   │   OR4F5                       float
│   │   OR4F29                      float
│   │   OR4F16                      float
│   │   LINC01409                   float
│   │   FAM87B                      float
│   │   LINC01128                   float
│   │   LINC00115                   float
│   │   FAM41C                      float
│   └── obs • 4                     [Feature]
│       assay                       cat[bionty.ExperimentalF…  10x 3' v3, 10x 5' v1, 10x 5' v2
│       cell_type                   cat[bionty.CellType]       CD16-negative, CD56-bright natural kille…
│       donor                       cat[ULabel]                582C, 621B, 637C, 640C, A29, A31, A35, A…
│       tissue                      cat[bionty.Tissue]         blood, bone marrow, caecum, duodenum, il…
└── Labels
    └── .tissues                    bionty.Tissue              bone marrow, sigmoid colon, skeletal mus…
        .cell_types                 bionty.CellType            group 3 innate lymphoid cell, naive thym…
        .experimental_factors       bionty.ExperimentalFactor  10x 5' v1, 10x 3' v3, 10x 5' v2          
        .ulabels                    ULabel                     640C, D503, 637C, A36, A29, 582C, A52, 6…

Seed a collection

Let’s create a first version of a collection that will encompass many h5ad files when more data is ingested.

Note

To see the result of the incremental growth, take a look at the CELLxGENE Census guide for an instance with ~1k h5ads and ~50 million cells.

collection = ln.Collection(artifact, name="My versioned scRNA-seq collection").save()

In this first version, the collection contains exactly one artifact, so the two match. They are still tracked independently and can each be queried through their own registry:

collection.describe()
Collection 
└── General
    ├── .uid = 'vU9vGhm9ozKWE66f0000'
    ├── .hash = 'DuyXxlMxwF92YehyBLbhKg'
    ├── .created_by = testuser1 (Test User1)
    ├── .created_at = 2024-12-20 15:04:59
    └── .transform = 'scRNA-seq'

Access the underlying artifacts like so:

collection.artifacts.df()
id                 1
uid                NAxy8G7q3xarNdqt0000
key                None
description        Human immune cells from Conde22
suffix             .h5ad
type               dataset
size               57612943
hash               t_YJQpYrAyAGhs7Ir68zKj
n_objects          None
n_observations     1648
_hash_type         sha1-fl
_accessor          AnnData
visibility         1
_key_is_virtual    True
storage_id         1
transform_id       1
version            None
is_latest          True
run_id             1
created_at         2024-12-20 15:04:55.120402+00:00
created_by_id      1
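
To make the relationship between the two registries concrete, here is a toy model in plain Python. It is only a sketch under stated assumptions: `ToyArtifact`, `ToyCollection`, `with_artifact`, and `total_observations` are hypothetical names, the second artifact is made up, and none of this reflects how lamindb stores or versions records internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyArtifact:
    uid: str
    n_observations: int


@dataclass(frozen=True)
class ToyCollection:
    name: str
    version: int
    artifact_uids: tuple  # references to artifacts, not copies of them

    def with_artifact(self, artifact):
        """Return the next version: same name, one more artifact reference."""
        return ToyCollection(
            self.name, self.version + 1, (*self.artifact_uids, artifact.uid)
        )


# Two independent registries: artifacts by uid, collections by (name, version).
artifact_registry = {}
collection_registry = {}

# uid and n_observations are taken from the output above.
first = ToyArtifact(uid="NAxy8G7q3xarNdqt0000", n_observations=1648)
artifact_registry[first.uid] = first

v1 = ToyCollection("My versioned scRNA-seq collection", 1, (first.uid,))
collection_registry[(v1.name, v1.version)] = v1

# A made-up second artifact, only to show how a later version could grow.
second = ToyArtifact(uid="hypothetical-uid-0000", n_observations=500)
artifact_registry[second.uid] = second
v2 = v1.with_artifact(second)
collection_registry[(v2.name, v2.version)] = v2


def total_observations(collection):
    """Sum observations by resolving each referenced uid in the artifact registry."""
    return sum(artifact_registry[uid].n_observations for uid in collection.artifact_uids)


print(total_observations(v1), total_observations(v2))
```

In this sketch version 1 stays intact after version 2 is created, and each artifact remains retrievable by its uid without going through any collection.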

See data lineage:

collection.view_lineage()
[Image: data lineage graph rendered by collection.view_lineage()]