
Standardize and append a dataset

Here, we’ll learn

  • how to standardize a less well curated dataset

  • how to append it to the growing versioned collection

import lamindb as ln
import bionty as bt

ln.track("ManDYgmftZ8C0003")
 connected lamindb: testuser1/test-scrna
 created Transform('ManDYgmftZ8C0003'), started new Run('TbWhwU1I...') at 2025-01-20 07:36:02 UTC
 notebook imports: bionty==1.0.0 lamindb==1.0.2

Let’s now consider a less well curated dataset:

adata = ln.core.datasets.anndata_pbmc68k_reduced()
# we don't trust the cell type annotation in this dataset
adata.obs.rename(columns={"cell_type": "cell_type_untrusted"}, inplace=True)
adata
AnnData object with n_obs × n_vars = 70 × 765
    obs: 'cell_type_untrusted', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

Create a curator:

curator = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.symbol,
    categoricals={"cell_type_untrusted": bt.CellType.name},
    organism="human",
)
curator.validate()
! indexing datasets with gene symbols can be problematic: https://docs.lamin.ai/faq/symbol-mapping
 added 1 record with Feature.name for "columns": 'cell_type_untrusted'
 saving validated records of 'var_index'
 added 5 records from public with Gene.symbol for "var_index": 'GPX1', 'IGLL5', 'RN7SL1', 'SNORD3B-2', 'SOD2'
 mapping "var_index" on Gene.symbol
!   65 terms are not validated: 'ATPIF1', 'C1orf228', 'CCBL2', 'RP11-782C8.1', 'RP11-277L2.3', 'RP11-156E8.1', 'AC079767.4', 'H1FX', 'SELT', 'ATP5I', 'IGJ', 'CCDC109B', 'FYB', 'H2AFY', 'FAM65B', 'HIST1H4C', 'HIST1H1E', 'ZNRD1', 'C6orf48', 'RP3-467N11.1', ...
    54 synonyms found: "ATPIF1" → "ATP5IF1", "C1orf228" → "ARMH1", "CCBL2" → "KYAT3", "AC079767.4" → "LINC01857", "H1FX" → "H1-10", "SELT" → "SELENOT", "ATP5I" → "ATP5ME", "IGJ" → "JCHAIN", "CCDC109B" → "MCUB", "FYB" → "FYB1", "H2AFY" → "MACROH2A1", "FAM65B" → "RIPOR2", "HIST1H4C" → "H4C3", "HIST1H1E" → "H1-4", "ZNRD1" → "POLR1H", "C6orf48" → "SNHG32", "SEPT7" → "SEPTIN7", "WBSCR22" → "BUD23", "RSBN1L-AS1" → "APTR", "CCDC132" → "VPS50", ...
    → curate synonyms via .standardize("var_index")    for remaining terms:
    → fix typos, remove non-existent values, or save terms via .add_new_from_var_index()
 mapping "cell_type_untrusted" on CellType.name
!   9 terms are not validated: 'Dendritic cells', 'CD19+ B', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD4+/CD25 T Reg', 'CD14+ Monocytes', 'CD56+ NK', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD34+'
    → fix typos, remove non-existent values, or save terms via .add_new_from("cell_type_untrusted")
False
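
The validation report above separates terms into three groups: terms that validate as-is, terms that match a known synonym, and terms that remain unknown. The sketch below illustrates that three-way split in plain Python. It is not how lamindb implements `validate()`; the helper name `classify_terms` and the small reference set are assumptions made for illustration, while the synonym pairs are taken from the log output above.

```python
# Illustrative only: a tiny reference vocabulary and synonym table.
# The synonym pairs come from the validation log above; the reference
# set is a small hand-picked subset, not the full Gene registry.
reference = {"GPX1", "SOD2", "ATP5IF1", "JCHAIN", "H1-10"}
synonyms = {"ATPIF1": "ATP5IF1", "IGJ": "JCHAIN", "H1FX": "H1-10"}


def classify_terms(terms, reference, synonyms):
    """Split terms into validated, synonym-resolvable, and unknown."""
    validated, resolvable, unknown = [], {}, []
    for term in terms:
        if term in reference:
            validated.append(term)
        elif synonyms.get(term) in reference:
            # The term itself is not in the reference, but its synonym target is
            resolvable[term] = synonyms[term]
        else:
            unknown.append(term)
    return validated, resolvable, unknown


validated, resolvable, unknown = classify_terms(
    ["GPX1", "ATPIF1", "IGJ", "RP11-782C8.1"], reference, synonyms
)
print(validated)   # terms found as-is
print(resolvable)  # terms that a standardize step could rename
print(unknown)     # terms that need manual curation
```

In this picture, standardizing handles the second group and manual curation (or dropping the terms) handles the third.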

Standardize & validate genes

Let’s convert gene symbols to Ensembl gene IDs via standardize(). Note that this mapping is not unique: a symbol can match more than one Ensembl ID, and only the first match is kept because the keep parameter of .standardize() defaults to "first":

adata.var["ensembl_gene_id"] = bt.Gene.standardize(
    adata.var.index,
    field=bt.Gene.symbol,
    return_field=bt.Gene.ensembl_gene_id,
    organism="human",
)
# use ensembl_gene_id as the index
adata.var.index.name = "symbol"
adata.var = adata.var.reset_index().set_index("ensembl_gene_id")

# we only want to save data with validated genes
validated = bt.Gene.validate(adata.var.index, bt.Gene.ensembl_gene_id, mute=True)
adata_validated = adata[:, validated].copy()
 standardized 754/765 terms
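
To make the "first match is kept" behavior concrete, here is a minimal plain-Python sketch of a non-unique mapping resolved by keeping the first hit. The records and IDs are made-up placeholders, the helper `map_keep_first` is hypothetical, and passing unknown symbols through unchanged is an assumption of this sketch rather than a statement about bionty's internals.

```python
# Toy (symbol, id) pairs; the IDs are placeholders, not real Ensembl records.
records = [
    ("GENE_A", "ID_001"),
    ("GENE_B", "ID_002"),
    ("GENE_B", "ID_003"),  # a second ID for the same symbol
]


def map_keep_first(symbols, records):
    """Map each symbol to the first ID listed for it; leave unknown symbols unchanged."""
    first_hit = {}
    for symbol, identifier in records:
        first_hit.setdefault(symbol, identifier)  # later duplicates are ignored
    return [first_hit.get(symbol, symbol) for symbol in symbols]


mapped = map_keep_first(["GENE_A", "GENE_B", "GENE_C"], records)
print(mapped)
```

Because `GENE_B` has two candidate IDs, the choice of which one to keep is a curation decision; inspecting such ambiguous symbols before relying on the default can be worthwhile.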

Here, we also subset .raw to the validated genes and give it the same Ensembl-based index:

adata_validated.raw = adata.raw[:, validated].to_adata()
adata_validated.raw.var.index = adata_validated.var.index
curator = ln.Curator.from_anndata(
    adata_validated,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={"cell_type_untrusted": bt.CellType.name},
    organism="human",
)
curator.validate()
 "var_index" is validated against Gene.ensembl_gene_id
 mapping "cell_type_untrusted" on CellType.name
!   9 terms are not validated: 'Dendritic cells', 'CD19+ B', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD4+/CD25 T Reg', 'CD14+ Monocytes', 'CD56+ NK', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD34+'
    → fix typos, remove non-existent values, or save terms via .add_new_from("cell_type_untrusted")
False
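
The subsetting above relies on a boolean mask: `validated` marks which gene identifiers passed validation, and the same mask is applied to both the main matrix and `.raw` so that the two stay aligned. A plain-Python sketch of mask-based column selection, with made-up values and the hypothetical helper `select_columns`:

```python
from itertools import compress

var_names = ["g1", "g2", "g3", "g4"]   # made-up column labels
mask = [True, False, True, True]       # True = identifier validated
matrix = [[1, 2, 3, 4],
          [5, 6, 7, 8]]                # rows = cells, columns = genes
raw_matrix = [[10, 20, 30, 40],
              [50, 60, 70, 80]]


def select_columns(rows, mask):
    """Keep only the columns where mask is True."""
    return [list(compress(row, mask)) for row in rows]


kept_names = list(compress(var_names, mask))
kept = select_columns(matrix, mask)
kept_raw = select_columns(raw_matrix, mask)  # same mask keeps both aligned
print(kept_names, kept, kept_raw)
```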

Standardize & validate cell types

None of the cell type names validate. We’ll now look up a matching term in the public ontology for each non-validated cell type and create a mapping.

curator.non_validated["cell_type_untrusted"]
['Dendritic cells',
 'CD19+ B',
 'CD4+/CD45RO+ Memory',
 'CD8+ Cytotoxic T',
 'CD4+/CD25 T Reg',
 'CD14+ Monocytes',
 'CD56+ NK',
 'CD8+/CD45RA+ Naive Cytotoxic',
 'CD34+']
ct_public_lo = bt.CellType.public().lookup()
name_mapping = {
    "Dendritic cells": ct_public_lo.dendritic_cell.name,
    "CD19+ B": ct_public_lo.b_cell_cd19_positive.name,
    "CD4+/CD45RO+ Memory": ct_public_lo.effector_memory_cd45ra_positive_alpha_beta_t_cell_terminally_differentiated.name,
    "CD8+ Cytotoxic T": ct_public_lo.cd8_positive_alpha_beta_cytotoxic_t_cell.name,
    "CD4+/CD25 T Reg": ct_public_lo.cd4_positive_cd25_positive_alpha_beta_regulatory_t_cell.name,
    "CD14+ Monocytes": ct_public_lo.cd14_positive_monocyte.name,
    "CD56+ NK": ct_public_lo.cd56_positive_cd161_positive_immature_natural_killer_cell_human.name,
    "CD8+/CD45RA+ Naive Cytotoxic": ct_public_lo.cd8_positive_alpha_beta_memory_t_cell_cd45ro_positive.name,
    "CD34+": ct_public_lo.cd34_positive_cd56_positive_cd117_positive_common_innate_lymphoid_precursor_human.name
}

We can now standardize cell type names using the lookup-based mapper:

adata_validated.obs["cell_type_untrusted_original"] = adata_validated.obs["cell_type_untrusted"]  # copy the original annotations
adata_validated.obs["cell_type_untrusted"] = adata_validated.obs["cell_type_untrusted_original"].map(name_mapping)
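
When mapping with a dict, pandas `Series.map` yields `NaN` for any value that has no entry in the dict, so it can help to check coverage before applying the mapping. Below is a small plain-Python sketch of that check. The helper `apply_mapping` is hypothetical, and the three mapping pairs are a subset of the ones that appear in this notebook.

```python
# Subset of the mapping used in this notebook (original label -> ontology name).
name_mapping = {
    "Dendritic cells": "dendritic cell",
    "CD19+ B": "B cell, CD19-positive",
    "CD14+ Monocytes": "CD14-positive monocyte",
}


def apply_mapping(labels, mapping):
    """Map labels, raising if any label has no entry in the mapping."""
    missing = sorted(set(labels) - set(mapping))
    if missing:
        raise KeyError(f"labels without a mapping: {missing}")
    return [mapping[label] for label in labels]


labels = ["CD19+ B", "Dendritic cells", "CD19+ B", "CD14+ Monocytes"]
standardized = apply_mapping(labels, name_mapping)
print(standardized)
```

Failing loudly on a missing label is a design choice of this sketch; silently producing missing values would only surface later, at validation time.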

Now, all cell types are validated:

curator.validate()
 saving validated records of 'cell_type_untrusted'
 added 5 records from public with CellType.name for "cell_type_untrusted": 'CD14-positive monocyte', 'CD34-positive, CD56-positive, CD117-positive common innate lymphoid precursor, human', 'CD4-positive, CD25-positive, alpha-beta regulatory T cell', 'CD56-positive, CD161-positive immature natural killer cell, human', 'CD8-positive, alpha-beta cytotoxic T cell'
 "var_index" is validated against Gene.ensembl_gene_id
 "cell_type_untrusted" is validated against CellType.name
True

Register

artifact = curator.save_artifact(description="10x reference adata")
!    4 unique terms (80.00%) are not validated for name: 'n_genes', 'percent_mito', 'louvain', 'cell_type_untrusted_original'
!    did not create Feature records for 4 non-validated names: 'cell_type_untrusted_original', 'louvain', 'n_genes', 'percent_mito'
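
The 80.00% in the warning above is consistent with four of the five `obs` columns lacking a `Feature` record, since only `cell_type_untrusted` was registered by the curator earlier. The arithmetic as a small sketch (the helper name `percent_non_validated` is hypothetical):

```python
# The five obs columns present at this point in the notebook.
obs_columns = [
    "cell_type_untrusted",
    "n_genes",
    "percent_mito",
    "louvain",
    "cell_type_untrusted_original",
]
validated_features = {"cell_type_untrusted"}  # registered by the curator earlier


def percent_non_validated(columns, validated):
    """Share of column names, in percent, that have no validated record."""
    non_validated = [c for c in columns if c not in validated]
    return 100 * len(non_validated) / len(columns)


share = percent_non_validated(obs_columns, validated_features)
print(f"{share:.2f}%")
```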
artifact.view_lineage()
[Figure: data lineage graph of the artifact]
artifact.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'Xl9wc1xjhWAhOdVJ0000'
│   ├── .size = 859840
│   ├── .hash = '8cSIZsvUrKeGfL64-H9RLw'
│   ├── .n_observations = 70
│   ├── .path = /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/Xl9wc1xjhWAhOdVJ0000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2025-01-20 07:36:12
│   └── .transform = 'Standardize and append a dataset'
├── Dataset features/._schemas_m2m
│   ├── var754                   [bionty.Gene]                                                       
│   │   HES4                        float                                                               
│   │   TNFRSF4                     float                                                               
│   │   SSU72                       float                                                               
│   │   PARK7                       float                                                               
│   │   RBP7                        float                                                               
│   │   SRM                         float                                                               
│   │   MAD2L2                      float                                                               
│   │   AGTRAP                      float                                                               
│   │   TNFRSF1B                    float                                                               
│   │   EFHD2                       float                                                               
│   │   NECAP2                      float                                                               
│   │   HP1BP3                      float                                                               
│   │   C1QA                        float                                                               
│   │   C1QB                        float                                                               
│   │   HNRNPR                      float                                                               
│   │   GALE                        float                                                               
│   │   STMN1                       float                                                               
│   │   CD52                        float                                                               
│   │   FGR                         float                                                               
│   │   ATP5IF1                     float                                                               
│   └── obs1                     [Feature]                                                           
│       cell_type_untrusted         cat[bionty.CellType]       B cell, CD19-positive, CD14-positive mon…
└── Labels
    └── .cell_types                 bionty.CellType            CD8-positive, alpha-beta memory T cell, …

Re-curate

We review the dataset and find all annotations trustworthy, except that there is a 'CD38-positive naive B cell' annotation.

Inspecting the name_mapping in detail tells us that 'CD8+/CD45RA+ Naive Cytotoxic' was erroneously mapped to a B cell.

Let us correct this and create a 'cell_type' feature that we can now trust.

name_mapping['CD38-positive naive B cell'] = 'cytotoxic T cell'
adata_validated.obs["cell_type"] = adata_validated.obs["cell_type_untrusted_original"].map(name_mapping)
adata_validated.obs["cell_type"].unique()
['dendritic cell', 'B cell, CD19-positive', 'effector memory CD45RA-positive, alpha-beta T..., 'CD8-positive, alpha-beta cytotoxic T cell', 'CD4-positive, CD25-positive, alpha-beta regul..., 'CD14-positive monocyte', 'CD56-positive, CD161-positive immature natura..., 'CD8-positive, alpha-beta memory T cell, CD45R..., 'CD34-positive, CD56-positive, CD117-positive ...]
Categories (9, object): ['CD4-positive, CD25-positive, alpha-beta regul..., 'effector memory CD45RA-positive, alpha-beta T..., 'CD8-positive, alpha-beta cytotoxic T cell', 'CD8-positive, alpha-beta memory T cell, CD45R..., ..., 'B cell, CD19-positive', 'CD34-positive, CD56-positive, CD117-positive ..., 'CD56-positive, CD161-positive immature natura..., 'dendritic cell']
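
A general property of dict-based mapping is worth keeping in mind when editing `name_mapping`: the dict is keyed by the original labels, so only an entry stored under a label that actually occurs in the data affects the mapped column. The sketch below shows this with placeholder target names ("old target", "new target") and a hypothetical helper `remap`; it does not use the real ontology terms.

```python
original_labels = ["CD19+ B", "CD8+/CD45RA+ Naive Cytotoxic", "CD19+ B"]
mapping = {
    "CD19+ B": "B cell, CD19-positive",
    "CD8+/CD45RA+ Naive Cytotoxic": "old target",  # placeholder target name
}


def remap(labels, mapping):
    """Look up each original label in the mapping (None if absent)."""
    return [mapping.get(label) for label in labels]


before = remap(original_labels, mapping)

# Adding an entry under a key that never occurs in the data changes nothing.
mapping_unused_key = {**mapping, "label not present in the data": "new target"}
unchanged = remap(original_labels, mapping_unused_key)

# Overriding the entry stored under the original label changes the result.
mapping_corrected = {**mapping, "CD8+/CD45RA+ Naive Cytotoxic": "new target"}
after = remap(original_labels, mapping_corrected)

print(before, unchanged, after, sep="\n")
```

Comparing the set of unique values before and after an edit to the mapping is a quick way to confirm that a correction took effect.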
artifact_trusted = ln.Curator.from_anndata(
    adata_validated,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={"cell_type": bt.CellType.name, "cell_type_untrusted": bt.CellType.name},
    organism="human",
).save_artifact(
    description="10x reference adata, trusted cell type annotation",
    revises=artifact,
)
 "var_index" is validated against Gene.ensembl_gene_id
 "cell_type" is validated against CellType.name
 "cell_type_untrusted" is validated against CellType.name
!    4 unique terms (66.70%) are not validated for name: 'n_genes', 'percent_mito', 'louvain', 'cell_type_untrusted_original'
!    did not create Feature records for 4 non-validated names: 'cell_type_untrusted_original', 'louvain', 'n_genes', 'percent_mito'
artifact_trusted.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'Xl9wc1xjhWAhOdVJ0001'
│   ├── .size = 857336
│   ├── .hash = 'GK721a-L-fGDI8kXefKMtA'
│   ├── .n_observations = 70
│   ├── .path = /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/Xl9wc1xjhWAhOdVJ0001.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2025-01-20 07:36:13
│   └── .transform = 'Standardize and append a dataset'
├── Dataset features/._schemas_m2m
│   ├── var754                   [bionty.Gene]                                                       
│   │   HES4                        float                                                               
│   │   TNFRSF4                     float                                                               
│   │   SSU72                       float                                                               
│   │   PARK7                       float                                                               
│   │   RBP7                        float                                                               
│   │   SRM                         float                                                               
│   │   MAD2L2                      float                                                               
│   │   AGTRAP                      float                                                               
│   │   TNFRSF1B                    float                                                               
│   │   EFHD2                       float                                                               
│   │   NECAP2                      float                                                               
│   │   HP1BP3                      float                                                               
│   │   C1QA                        float                                                               
│   │   C1QB                        float                                                               
│   │   HNRNPR                      float                                                               
│   │   GALE                        float                                                               
│   │   STMN1                       float                                                               
│   │   CD52                        float                                                               
│   │   FGR                         float                                                               
│   │   ATP5IF1                     float                                                               
│   └── obs2                     [Feature]                                                           
│       cell_type                   cat[bionty.CellType]       B cell, CD19-positive, CD14-positive mon…
│       cell_type_untrusted         cat[bionty.CellType]       B cell, CD19-positive, CD14-positive mon…
└── Labels
    └── .cell_types                 bionty.CellType            CD8-positive, alpha-beta memory T cell, …

Append the dataset to the collection

Query the previous collection:

collection_v1 = ln.Collection.get(name="My versioned scRNA-seq collection")

Create a new version of the collection by sharding it across the new artifact and the artifact underlying version 1 of the collection:

collection_v2 = collection_v1.append(artifact_trusted).save()
 adding collection ids [1] as inputs for run 2, adding parent transform 1
 adding collection ids [1] as inputs for run 2, adding parent transform 1
 adding artifact ids [1] as inputs for run 2, adding parent transform 1
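
Conceptually, appending does not modify version 1: it produces a new version that references the previous artifacts plus the new one. The following is a toy model of that idea, not lamindb's implementation. The class `ToyCollection`, its fields, and the integer version counter are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyCollection:
    """Hypothetical stand-in for a versioned collection; not lamindb's class."""

    key: str
    artifacts: Tuple[str, ...]
    version: int = 1
    previous: Optional["ToyCollection"] = None

    def append(self, artifact: str) -> "ToyCollection":
        """Return a new version that references all old artifacts plus one more."""
        return ToyCollection(
            key=self.key,
            artifacts=self.artifacts + (artifact,),
            version=self.version + 1,
            previous=self,
        )


v1 = ToyCollection(key="My versioned scRNA-seq collection", artifacts=("artifact-1",))
v2 = v1.append("artifact-2")
print(v2.version, v2.artifacts)
```

Keeping each version immutable means the first version stays queryable and reproducible after the second one exists.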

If you want, you can label the collection’s version by setting .version.

collection_v2.version = "2"
collection_v2.save()
Collection(uid='CefCs4oPJxCIkP7o0001', version='2', is_latest=True, key='My versioned scRNA-seq collection', hash='luH-jPb6eJLsXvc1TWGpUg', created_by_id=1, space_id=1, run_id=2, created_at=2025-01-20 07:36:13 UTC)

Version 2 of the collection covers significantly more conditions.

collection_v2.describe()
Collection 
└── General
    ├── .uid = 'CefCs4oPJxCIkP7o0001'
    ├── .key = 'My versioned scRNA-seq collection'
    ├── .hash = 'luH-jPb6eJLsXvc1TWGpUg'
    ├── .version = '2'
    ├── .created_by = testuser1 (Test User1)
    ├── .created_at = 2025-01-20 07:36:13
    └── .transform = 'Standardize and append a dataset'

View data lineage:

collection_v2.view_lineage()
[Figure: data lineage graph of the collection]