
scRNA-seq

Here, you’ll learn how to manage a growing number of scRNA-seq datasets as a single queryable collection:

  1. create a dataset (an Artifact) and seed a Collection (scrna1/6)

  2. append a new dataset to the collection (scrna2/6)

  3. query & analyze individual datasets (scrna3/6)

  4. load the collection into memory (scrna4/6)

  5. iterate over the collection to train an ML model (scrna5/6)

  6. concatenate the collection to a single tiledbsoma array store (scrna6/6)

If you’re only interested in using a large curated scRNA-seq collection, see the CELLxGENE guide.

# !pip install 'lamindb[jupyter,aws,bionty]'
!lamin init --storage ./test-scrna --schema bionty
→ connected lamindb: testuser1/test-scrna
import lamindb as ln
import bionty as bt

ln.track("Nv48yAceNSh80003")
→ connected lamindb: testuser1/test-scrna
→ created Transform('Nv48yAce'), started new Run('pUG4PLm9') at 2024-11-21 06:52:53 UTC
→ notebook imports: bionty==0.53.1 lamindb==0.76.16

Populate metadata registries based on an artifact

Let us look at the standardized data of Conde et al., Science (2022), available from CELLxGENE. anndata_human_immune_cells() loads a subsampled version:

adata = ln.core.datasets.anndata_human_immune_cells()
adata
AnnData object with n_obs × n_vars = 1648 × 36503
    obs: 'donor', 'tissue', 'cell_type', 'assay'
    var: 'feature_is_filtered', 'feature_reference', 'feature_biotype'
    uns: 'default_embedding'
    obsm: 'X_umap'

Let’s curate this AnnData object:

curator = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        adata.obs.donor.name: ln.ULabel.name,
        adata.obs.tissue.name: bt.Tissue.name,
        adata.obs.cell_type.name: bt.CellType.name,
        adata.obs.assay.name: bt.ExperimentalFactor.name,
    },
    organism="human",
)
✓ added 4 records with Feature.name for columns: 'donor', 'tissue', 'cell_type', 'assay'
# this takes a while because the instance is still empty
curator.validate()
• saving validated records of 'var_index'
✓ added 36283 records from public with Gene.ensembl_gene_id for var_index: 'ENSG00000243485', 'ENSG00000237613', 'ENSG00000186092', 'ENSG00000238009', 'ENSG00000239945', 'ENSG00000239906', 'ENSG00000241860', 'ENSG00000241599', 'ENSG00000286448', 'ENSG00000236601', 'ENSG00000284733', 'ENSG00000235146', 'ENSG00000284662', 'ENSG00000229905', 'ENSG00000237491', 'ENSG00000177757', 'ENSG00000228794', 'ENSG00000225880', 'ENSG00000230368', 'ENSG00000272438', ...
• saving validated records of 'tissue'
• saving validated records of 'cell_type'
✓ added 31 records from public with CellType.name for cell_type: 'megakaryocyte', 'effector memory CD4-positive, alpha-beta T cell', 'plasmacytoid dendritic cell', 'alveolar macrophage', 'naive B cell', 'alpha-beta T cell', 'CD8-positive, alpha-beta memory T cell, CD45RO-positive', 'progenitor cell', 'gamma-delta T cell', 'CD4-positive helper T cell', 'regulatory T cell', 'group 3 innate lymphoid cell', 'plasma cell', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'mast cell', 'non-classical monocyte', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'lymphocyte', 'CD16-negative, CD56-bright natural killer cell, human', 'classical monocyte', ...
• saving validated records of 'assay'
• mapping var_index on Gene.ensembl_gene_id
!    220 terms are not validated: 'ENSG00000230699', 'ENSG00000241180', 'ENSG00000226849', 'ENSG00000272482', 'ENSG00000264443', 'ENSG00000242396', 'ENSG00000237352', 'ENSG00000269933', 'ENSG00000286863', 'ENSG00000285808', ...
→ fix typos, remove non-existent values, or save terms via .add_new_from_var_index()
• mapping donor on ULabel.name
!    12 terms are not validated: 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
→ fix typos, remove non-existent values, or save terms via .add_new_from('donor')
✓ 'tissue' is validated against Tissue.name
• mapping cell_type on CellType.name
!    1 term is not validated: 'animal cell'
→ fix typo, remove non-existent value, or save term via .add_new_from('cell_type')
✓ 'assay' is validated against ExperimentalFactor.name
False
curator.add_new_from_var_index()
✓ added 220 records with Gene.ensembl_gene_id for var_index: 'ENSG00000230699', 'ENSG00000241180', 'ENSG00000226849', 'ENSG00000272482', 'ENSG00000264443', 'ENSG00000242396', 'ENSG00000237352', 'ENSG00000269933', 'ENSG00000286863', 'ENSG00000285808', 'ENSG00000261737', 'ENSG00000230427', 'ENSG00000226822', 'ENSG00000273373', 'ENSG00000259834', 'ENSG00000224167', 'ENSG00000256374', 'ENSG00000234283', 'ENSG00000263464', 'ENSG00000203812', ...
curator.add_new_from("donor")
curator.add_new_from("cell_type")
✓ added 12 records with ULabel.name for donor: 'A52', 'A29', '582C', 'D496', 'A35', '640C', 'A36', '637C', '621B', 'A37', 'A31', 'D503'
✓ added 1 record with CellType.name for cell_type: 'animal cell'
curator.validate()
✓ 'var_index' is validated against Gene.ensembl_gene_id
✓ 'donor' is validated against ULabel.name
✓ 'tissue' is validated against Tissue.name
✓ 'cell_type' is validated against CellType.name
✓ 'assay' is validated against ExperimentalFactor.name
True
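
The validate, register, re-validate cycle above can be pictured with a small stand-alone sketch. It does not use lamindb: `TermRegistry`, `validate`, and `add_new` are hypothetical names chosen for illustration, and the donor IDs are copied from the log output above. The only assumption is that validating a column means checking its values against the set of terms already registered.

```python
class TermRegistry:
    """Hypothetical stand-in for a metadata registry: a set of known terms."""

    def __init__(self, known_terms=()):
        self.known_terms = set(known_terms)

    def validate(self, values):
        """Return the sorted list of distinct values that are not registered yet."""
        return sorted(set(values) - self.known_terms)

    def add_new(self, values):
        """Register every value passed in (already-known values are unaffected)."""
        self.known_terms.update(values)


# The registry starts empty, like the fresh instance in this guide.
donors = TermRegistry()
observed = ["D496", "621B", "A29", "D496"]

missing_before = donors.validate(observed)  # all three distinct IDs are unknown
donors.add_new(missing_before)              # explicit opt-in, analogous to add_new_from("donor")
missing_after = donors.validate(observed)   # nothing left to fix

print(missing_before)
print(missing_after)
```

Keeping registration as a separate, explicit step means that a typo in a column is reported as a non-validated term first, instead of silently becoming a new registry entry.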

When we create an Artifact object from an AnnData, we automatically curate it with validated features and labels:

artifact = curator.save_artifact(description="Human immune cells from Conde22")

It is annotated with rich metadata:

artifact.describe(print_types=True)
Artifact(uid='AE158pvBDEQbQeOj0000', is_latest=True, description='Human immune cells from Conde22', suffix='.h5ad', type='dataset', size=57612943, hash='t_YJQpYrAyAGhs7Ir68zKj', n_observations=1648, _hash_type='sha1-fl', _accessor='AnnData', visibility=1, _key_is_virtual=True, created_at=2024-11-21 06:53:36 UTC)
  Provenance
    .storage: Storage = '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna'
    .transform: Transform = 'scRNA-seq'
    .run: Run = 2024-11-21 06:52:53 UTC
    .created_by: User = 'testuser1'
  Labels
    .tissues: bionty.Tissue = 'duodenum', 'lamina propria', 'sigmoid colon', 'jejunal epithelium', 'thymus', 'skeletal muscle tissue', 'caecum', 'mesenteric lymph node', 'spleen', 'omentum', ...
    .cell_types: bionty.CellType = 'megakaryocyte', 'effector memory CD4-positive, alpha-beta T cell', 'plasmacytoid dendritic cell', 'alveolar macrophage', 'naive B cell', 'alpha-beta T cell', 'CD8-positive, alpha-beta memory T cell, CD45RO-positive', 'progenitor cell', 'gamma-delta T cell', 'CD4-positive helper T cell', ...
    .experimental_factors: bionty.ExperimentalFactor = '10x 3' v3', '10x 5' v1', '10x 5' v2'
    .ulabels: ULabel = 'A52', 'A29', '582C', 'D496', 'A35', '640C', 'A36', '637C', '621B', 'A37', ...
  Features
    'assay': cat[bionty.ExperimentalFactor] = '10x 3' v3', '10x 5' v1', '10x 5' v2'
    'cell_type': cat[bionty.CellType] = 'CD16-negative, CD56-bright natural killer cell, human', 'CD16-positive, CD56-dim natural killer cell, human', 'CD4-positive helper T cell', 'CD8-positive, alpha-beta memory T cell', 'CD8-positive, alpha-beta memory T cell, CD45RO-positive', 'T follicular helper cell', 'alpha-beta T cell', 'alveolar macrophage', 'animal cell', 'classical monocyte', ...
    'donor': cat[ULabel] = '582C', '621B', '637C', '640C', 'A29', 'A31', 'A35', 'A36', 'A37', 'A52', ...
    'tissue': cat[bionty.Tissue] = 'blood', 'bone marrow', 'caecum', 'duodenum', 'ileum', 'jejunal epithelium', 'lamina propria', 'liver', 'lung', 'mesenteric lymph node', ...
  Feature sets
    'var': bionty.Gene = 'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'OR4F29', 'OR4F16', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C'
    'obs': Feature = 'donor', 'tissue', 'cell_type', 'assay'

Seed a collection

Let’s create a first version of a collection that will encompass many h5ad files when more data is ingested.

Note

To see the result of the incremental growth, take a look at the CELLxGENE Census guide for an instance with ~1k h5ads and ~50 million cells.

collection = ln.Collection(artifact, name="My versioned scRNA-seq collection").save()

In version 1, the collection holds exactly this one artifact, so the two coincide. They are nevertheless tracked independently and can each be queried through their own registry:

collection.describe()
Collection(uid='6HBgmdrV4xz24AVj0000', is_latest=True, name='My versioned scRNA-seq collection', hash='DuyXxlMxwF92YehyBLbhKg', visibility=1, created_at=2024-11-21 06:53:40 UTC)
  Provenance
    .created_by = 'testuser1'
    .transform = 'scRNA-seq'
    .run = 2024-11-21 06:52:53 UTC
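
One way to picture why a collection and its artifact are tracked separately, even when version 1 contains a single artifact, is the toy model below. It is not lamindb's data model: `ToyArtifact`, `ToyCollection`, and `append` are hypothetical names, and the second artifact's uid and description are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyArtifact:
    uid: str
    description: str


@dataclass(frozen=True)
class ToyCollection:
    name: str
    version: int
    artifacts: tuple

    def append(self, artifact):
        """Return a new version; the existing version stays unchanged."""
        return ToyCollection(self.name, self.version + 1, self.artifacts + (artifact,))


conde = ToyArtifact("AE158pvBDEQbQeOj0000", "Human immune cells from Conde22")
v1 = ToyCollection("My versioned scRNA-seq collection", 1, (conde,))

# Hypothetical second dataset, appended later.
other = ToyArtifact("hypothetical-uid-0000", "Hypothetical second dataset")
v2 = v1.append(other)

print(v1.version, len(v1.artifacts))
print(v2.version, len(v2.artifacts))
```

In this sketch the artifact keeps its own identity across collection versions, while each collection version records which artifacts it contains.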

Access the underlying artifacts like so:

collection.artifacts.df()
id                1
uid               AE158pvBDEQbQeOj0000
version           None
is_latest         True
description       Human immune cells from Conde22
key               None
suffix            .h5ad
type              dataset
size              57612943
hash              t_YJQpYrAyAGhs7Ir68zKj
n_objects         None
n_observations    1648
_hash_type        sha1-fl
_accessor         AnnData
visibility        1
_key_is_virtual   True
storage_id        1
transform_id      1
run_id            1
created_at        2024-11-21 06:53:36.460292+00:00
created_by_id     1

See data lineage:

collection.view_lineage()
[Figure: data lineage graph of the collection]