
Standardize and append a dataset

Here, we’ll learn

  • how to standardize a less well-curated dataset

  • how to append it to the growing versioned collection

import lamindb as ln
import bionty as bt

ln.track("ManDYgmftZ8C0003")
→ connected lamindb: testuser1/test-scrna
→ notebook imports: bionty==0.52.0 lamindb==0.76.14
→ created Transform('ManDYgmf'), started new Run('WY2NktYk') at 2024-10-19 07:28:11 UTC

Let’s now consider a less well-curated dataset:

adata = ln.core.datasets.anndata_pbmc68k_reduced()
# we don't trust the cell type annotation in this dataset
adata.obs.rename(columns={"cell_type": "cell_type_untrusted"}, inplace=True)
adata
AnnData object with n_obs × n_vars = 70 × 765
    obs: 'cell_type_untrusted', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

Create a curator:

curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.symbol,
    categoricals={"cell_type_untrusted": bt.CellType.name},
    organism="human",
)
✓ added 1 record with Feature.name for columns: 'cell_type_untrusted'
3 non-validated values are not saved in Feature.name: ['percent_mito', 'n_genes', 'louvain']!
      → to lookup values, use lookup().columns
      → to save, run add_new_from_columns

Standardize & validate genes

Let’s convert gene symbols to Ensembl IDs via standardize(). Note that this mapping is not unique: a symbol can match more than one Ensembl ID, and the first match is kept because the keep parameter of .standardize() defaults to "first":

adata.var["ensembl_gene_id"] = bt.Gene.standardize(
    adata.var.index,
    field=bt.Gene.symbol,
    return_field=bt.Gene.ensembl_gene_id,
    organism="human",
)
# use ensembl_gene_id as the index
adata.var.index.name = "symbol"
adata.var = adata.var.reset_index().set_index("ensembl_gene_id")

# we only want to save data with validated genes
validated = bt.Gene.validate(adata.var.index, bt.Gene.ensembl_gene_id, mute=True)
adata_validated = adata[:, validated].copy()
• standardized 749/765 terms
! found 5 symbols in Bionty: ['GPX1', 'IGLL5', 'SNORD3B-2', 'RN7SL1', 'SOD2']
   please add corresponding Gene records via `.from_values(['ENSG00000254709', 'ENSG00000262074', 'ENSG00000233276', 'ENSG00000276168', 'ENSG00000291237'])`
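
The two steps above, a non-unique symbol-to-ID mapping where the first match is kept, followed by filtering to validated genes, can be sketched without a database. The lookup table, the `ENSG_TOY_*` identifiers, and both helper functions are hypothetical stand-ins for illustration and are not part of bionty. The sketch also assumes that symbols without a match pass through unchanged.

```python
# Toy lookup table: a symbol may map to several IDs (all IDs here are made up).
SYMBOL_TO_IDS = {
    "GENE_A": ["ENSG_TOY_0001"],
    "GENE_B": ["ENSG_TOY_0002", "ENSG_TOY_0003"],  # ambiguous symbol
}
KNOWN_IDS = {gene_id for ids in SYMBOL_TO_IDS.values() for gene_id in ids}


def standardize_first(symbols, table):
    """Map each symbol to its first listed ID; unknown symbols pass through."""
    return [table[s][0] if s in table else s for s in symbols]


def validate(ids, known):
    """Return one boolean per entry, True if the ID is in the known set."""
    return [gene_id in known for gene_id in ids]


symbols = ["GENE_A", "GENE_B", "GENE_X"]
mapped = standardize_first(symbols, SYMBOL_TO_IDS)
mask = validate(mapped, KNOWN_IDS)

# Keep only entries whose ID validated, mirroring adata[:, validated].
kept = [gene_id for gene_id, ok in zip(mapped, mask) if ok]
print(mapped, mask, kept)
```

The unmatched symbol survives standardization but fails validation, which is why the notebook filters on the validation mask before saving.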

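We also carry over .raw, subset to the validated genes and re-indexed by Ensembl gene ID, and then create a new curator on the validated data: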

adata_validated.raw = adata.raw[:, validated].to_adata()
adata_validated.raw.var.index = adata_validated.var.index
curator = ln.Curator.from_anndata(
    adata_validated,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={"cell_type_untrusted": bt.CellType.name},
    organism="human",
)
3 non-validated values are not saved in Feature.name: ['percent_mito', 'n_genes', 'louvain']!
      → to lookup values, use lookup().columns
      → to save, run add_new_from_columns
curator.validate()
• saving validated records of 'var_index'
• saving validated terms of 'cell_type_untrusted'
! 9 non-validated values are not saved in CellType.name: ['Dendritic cells', 'CD19+ B', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD4+/CD25 T Reg', 'CD14+ Monocytes', 'CD56+ NK', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD34+']!
      → to lookup values, use lookup().cell_type_untrusted
      → to save, run .add_new_from('cell_type_untrusted')
✓ var_index is validated against Gene.ensembl_gene_id
• mapping cell_type_untrusted on CellType.name
!    9 terms are not validated: 'Dendritic cells', 'CD19+ B', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD4+/CD25 T Reg', 'CD14+ Monocytes', 'CD56+ NK', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD34+'
      → fix typos, remove non-existent values, or save terms via .add_new_from('cell_type_untrusted')
False
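
Conceptually, the validation above compares the unique values of each categorical column with the names in a registry and returns False as soon as any value is missing. A minimal sketch of that idea in plain Python follows; the function name, the registry content, and the example labels are assumptions for illustration, not lamindb's implementation.

```python
def validate_columns(columns, registries):
    """Check each column's unique values against its registry of valid names.

    Returns (all_valid, non_validated), where non_validated maps a column
    name to the sorted list of values missing from the registry.
    """
    non_validated = {}
    for column, values in columns.items():
        missing = sorted(set(values) - registries[column])
        if missing:
            non_validated[column] = missing
    return not non_validated, non_validated


# Hypothetical registry content and observation labels.
registries = {"cell_type_untrusted": {"dendritic cell", "cytotoxic T cell"}}
columns = {
    "cell_type_untrusted": ["Dendritic cells", "CD8+ Cytotoxic T", "Dendritic cells"]
}

ok, non_validated = validate_columns(columns, registries)
print(ok, non_validated)
```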

Standardize & validate cell types

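None of the 9 cell type names validate against CellType.name.

We’ll now search the public ontology and map each name found in the dataset to the top match in the public ontology. Optionally, the original name can be saved as a synonym of that match so that it becomes searchable.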

bionty = bt.CellType.public()  # access the public ontology through bionty
name_mapping = {}
for invalid_name in adata_validated.obs["cell_type_untrusted"].unique():
    ontology_id = bionty.search(invalid_name).iloc[0].ontology_id  # top search hit through iloc[0]
    record = bt.CellType.from_source(ontology_id=ontology_id)
    name_mapping[invalid_name] = record.name  # map the original name to standardized name
    record.save()
    # record.add_synonym(invalid_name)  # optionally save the invalid name as synonym so that it becomes searchable
# print the mapping
print(name_mapping)
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0001087'
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0000910'
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0000911'
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0000919'
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0000795'
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0002057'
✓ loaded 1 CellType record matching ontology_id: 'CL:0000860'
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0001054'
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0002101'
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0002051'
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0000952'
{'Dendritic cells': 'dendritic cell', 'CD19+ B': 'B cell, CD19-positive', 'CD4+/CD45RO+ Memory': 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'CD8+ Cytotoxic T': 'cytotoxic T cell', 'CD4+/CD25 T Reg': 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD14+ Monocytes': 'CD14-positive, CD16-negative classical monocyte', 'CD56+ NK': 'CD16-positive, CD56-dim natural killer cell, human', 'CD8+/CD45RA+ Naive Cytotoxic': 'CD38-positive naive B cell', 'CD34+': 'CD38-high pre-BCR positive cell'}
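
Taking the top search hit is convenient but can pick a wrong term when several candidates score equally well. The sketch below illustrates this with a deliberately simple token-overlap score and a three-entry name list. Both are hypothetical and do not reflect how the ontology search actually ranks results. The second return value flags ties that deserve manual review.

```python
def token_overlap(query, candidate):
    """Count lowercase whitespace-separated tokens shared by both strings."""
    return len(set(query.lower().split()) & set(candidate.lower().split()))


def top_hit(query, names):
    """Return (best_name, is_ambiguous) under the toy token-overlap score."""
    ranked = sorted(names, key=lambda n: token_overlap(query, n), reverse=True)
    best = ranked[0]
    tied = len(ranked) > 1 and (
        token_overlap(query, ranked[1]) == token_overlap(query, best)
    )
    return best, tied


# Hypothetical miniature "ontology" of standardized names.
names = ["naive B cell", "cytotoxic T cell", "dendritic cell"]

clear = top_hit("Dendritic cells", names)
risky = top_hit("CD8+/CD45RA+ Naive Cytotoxic", names)
print(clear, risky)
```

In this toy setup, the second query ties between a B cell and a T cell term, and the top hit is decided by list order alone. Reviewing flagged mappings by hand guards against this kind of error.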

We can now standardize cell type names using the search-based mapper:

adata_validated.obs["cell_type_untrusted_original"] = adata_validated.obs["cell_type_untrusted"]  # copy the original annotations
adata_validated.obs["cell_type_untrusted"] = adata_validated.obs["cell_type_untrusted_original"].map(name_mapping)
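
Note that pandas `Series.map` with a dictionary yields NaN for any value that has no key in the dictionary, so it can be worth checking coverage before overwriting a column. A minimal plain-Python sketch of such a check, with a hypothetical helper name and made-up example labels:

```python
def unmapped_labels(labels, mapping):
    """Return the sorted unique labels that have no entry in the mapping."""
    return sorted(set(labels) - set(mapping))


# Hypothetical labels and a partial mapping for illustration.
labels = ["CD19+ B", "CD34+", "CD19+ B", "Unknown"]
mapping = {
    "CD19+ B": "B cell, CD19-positive",
    "CD34+": "CD38-high pre-BCR positive cell",
}

missing = unmapped_labels(labels, mapping)
print(missing)
```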

Now, all cell types are validated:

curator.validate()
• saving validated records of 'var_index'
• saving validated terms of 'cell_type_untrusted'
✓ var_index is validated against Gene.ensembl_gene_id
✓ cell_type_untrusted is validated against CellType.name
True

Register

artifact = curator.save_artifact(description="10x reference adata")
!    4 unique terms (80.00%) are not validated for name: n_genes, percent_mito, louvain, cell_type_untrusted_original
artifact.view_lineage()
[Figure: data lineage graph of the artifact]
artifact.describe()
Artifact(uid='weUxELC0oEXmgLUW0000', is_latest=True, description='10x reference adata', suffix='.h5ad', type='dataset', size=855476, hash='3kPqtHuXea07jsEaQ0yYmQ', n_observations=70, _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, created_at=2024-10-19 07:28:21 UTC)
  Provenance
    .storage = '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna'
    .transform = 'Standardize and append a dataset'
    .run = 2024-10-19 07:28:11 UTC
    .created_by = 'testuser1'
  Labels
    .cell_types = 'CD16-positive, CD56-dim natural killer cell, human', 'dendritic cell', 'B cell, CD19-positive', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD14-positive, CD16-negative classical monocyte', 'CD38-positive naive B cell', 'CD38-high pre-BCR positive cell'
  Features
    'cell_type_untrusted' = 'B cell, CD19-positive', 'CD14-positive, CD16-negative classical monocyte', 'CD16-positive, CD56-dim natural killer cell, human', 'CD38-high pre-BCR positive cell', 'CD38-positive naive B cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'cytotoxic T cell', 'dendritic cell', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated'
  Feature sets
    'var' = 'HES4', 'TNFRSF4', 'SSU72', 'PARK7', 'RBP7', 'SRM', 'MAD2L2', 'AGTRAP', 'TNFRSF1B', 'EFHD2', 'NECAP2', 'HP1BP3', 'C1QA', 'C1QB', 'HNRNPR', 'GALE', 'STMN1', 'CD52', 'FGR', 'ATP5IF1'
    'obs' = 'cell_type_untrusted'

Re-curate

We review the dataset and find all annotations trustworthy except for 'CD38-positive naive B cell'.

Inspecting the name_mapping in detail tells us that 'CD8+/CD45RA+ Naive Cytotoxic' was erroneously mapped to a B cell.

Let us correct this and create a 'cell_type' feature that we can now trust.

# name_mapping is keyed by the original labels, so override the entry for the original label
name_mapping['CD8+/CD45RA+ Naive Cytotoxic'] = 'cytotoxic T cell'
adata_validated.obs["cell_type"] = adata_validated.obs["cell_type_untrusted_original"].map(name_mapping)
adata_validated.obs["cell_type"].unique()
['dendritic cell', 'B cell, CD19-positive', 'effector memory CD4-positive, alpha-beta T ce..., 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regul..., 'CD14-positive, CD16-negative classical monocyte', 'CD16-positive, CD56-dim natural killer cell, ..., 'CD38-positive naive B cell', 'CD38-high pre-BCR positive cell']
Categories (9, object): ['CD8-positive, CD25-positive, alpha-beta regul..., 'effector memory CD4-positive, alpha-beta T ce..., 'cytotoxic T cell', 'CD38-positive naive B cell', ..., 'B cell, CD19-positive', 'CD38-high pre-BCR positive cell', 'CD16-positive, CD56-dim natural killer cell, ..., 'dendritic cell']
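
Because name_mapping is keyed by the original dataset labels, a correction only takes effect when it is assigned under the original label. The sketch below demonstrates this on a two-entry excerpt of the mapping printed earlier; the variable names `by_value` and `by_key` are introduced here for illustration.

```python
# Mapping from original dataset labels to standardized names (two-entry excerpt).
name_mapping = {
    "CD8+ Cytotoxic T": "cytotoxic T cell",
    "CD8+/CD45RA+ Naive Cytotoxic": "CD38-positive naive B cell",  # wrong top hit
}
original = ["CD8+ Cytotoxic T", "CD8+/CD45RA+ Naive Cytotoxic"]

# Keying by the standardized name only adds an unused entry; lookups by
# original label still return the wrong value.
by_value = dict(name_mapping)
by_value["CD38-positive naive B cell"] = "cytotoxic T cell"
still_wrong = [by_value[label] for label in original]

# Keying by the original label replaces the wrong value.
by_key = dict(name_mapping)
by_key["CD8+/CD45RA+ Naive Cytotoxic"] = "cytotoxic T cell"
corrected = [by_key[label] for label in original]

print(still_wrong)
print(corrected)
```

A quick check that the unwanted term no longer appears among the mapped values is a cheap safeguard before saving.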
artifact_trusted = ln.Curator.from_anndata(
    adata_validated,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={"cell_type": bt.CellType.name, "cell_type_untrusted": bt.CellType.name},
    organism="human",
).save_artifact(
    description="10x reference adata, trusted cell type annotation",
    revises=artifact,
)
4 non-validated values are not saved in Feature.name: ['cell_type_untrusted_original', 'percent_mito', 'louvain', 'n_genes']!
      → to lookup values, use lookup().columns
      → to save, run add_new_from_columns
• saving validated records of 'var_index'
• saving validated terms of 'cell_type'
• saving validated terms of 'cell_type_untrusted'
✓ var_index is validated against Gene.ensembl_gene_id
✓ cell_type is validated against CellType.name
✓ cell_type_untrusted is validated against CellType.name
!    4 unique terms (66.70%) are not validated for name: n_genes, percent_mito, louvain, cell_type_untrusted_original
artifact_trusted.describe()
Artifact(uid='weUxELC0oEXmgLUW0001', is_latest=True, description='10x reference adata, trusted cell type annotation', suffix='.h5ad', type='dataset', size=847300, hash='H2rLY3hcV18srupj0xhyCQ', n_observations=70, _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, created_at=2024-10-19 07:28:23 UTC)
  Provenance
    .storage = '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna'
    .transform = 'Standardize and append a dataset'
    .run = 2024-10-19 07:28:11 UTC
    .created_by = 'testuser1'
  Labels
    .cell_types = 'CD16-positive, CD56-dim natural killer cell, human', 'dendritic cell', 'B cell, CD19-positive', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD14-positive, CD16-negative classical monocyte', 'CD38-positive naive B cell', 'CD38-high pre-BCR positive cell'
  Features
    'cell_type' = 'B cell, CD19-positive', 'CD14-positive, CD16-negative classical monocyte', 'CD16-positive, CD56-dim natural killer cell, human', 'CD38-high pre-BCR positive cell', 'CD38-positive naive B cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'cytotoxic T cell', 'dendritic cell', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated'
    'cell_type_untrusted' = 'B cell, CD19-positive', 'CD14-positive, CD16-negative classical monocyte', 'CD16-positive, CD56-dim natural killer cell, human', 'CD38-high pre-BCR positive cell', 'CD38-positive naive B cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'cytotoxic T cell', 'dendritic cell', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated'
  Feature sets
    'var' = 'HES4', 'TNFRSF4', 'SSU72', 'PARK7', 'RBP7', 'SRM', 'MAD2L2', 'AGTRAP', 'TNFRSF1B', 'EFHD2', 'NECAP2', 'HP1BP3', 'C1QA', 'C1QB', 'HNRNPR', 'GALE', 'STMN1', 'CD52', 'FGR', 'ATP5IF1'
    'obs' = 'cell_type', 'cell_type_untrusted'
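
Comparing the two describe() outputs, the uid of the revised artifact shares its first 16 characters with the original and differs only in the last four. The sketch below checks this on the two uids printed above. The 16 + 4 split and the helper name are assumptions read off these outputs, not a documented contract.

```python
def split_uid(uid, suffix_len=4):
    """Split a uid into (stem, version_suffix); the 4-char suffix is an assumption."""
    return uid[:-suffix_len], uid[-suffix_len:]


uid_v1 = "weUxELC0oEXmgLUW0000"  # from artifact.describe()
uid_v2 = "weUxELC0oEXmgLUW0001"  # from artifact_trusted.describe()

stem_v1, suffix_v1 = split_uid(uid_v1)
stem_v2, suffix_v2 = split_uid(uid_v2)

# Two versions of the same artifact are expected to share the stem.
same_lineage = stem_v1 == stem_v2
print(same_lineage, suffix_v1, suffix_v2)
```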

Append the dataset to the collection

Query the previous collection:

collection_v1 = ln.Collection.get(name="My versioned scRNA-seq collection")

Create a new version of the collection by sharding it across the new artifact and the artifact underlying version 1 of the collection:

collection_v2 = collection_v1.append(artifact_trusted).save()
• adding collection ids [1] as inputs for run 2, adding parent transform 1
• adding collection ids [1] as inputs for run 2, adding parent transform 1
• adding artifact ids [1] as inputs for run 2, adding parent transform 1

If you want, you can label the collection’s version by setting .version.

collection_v2.version = "2"
collection_v2.save()
Collection(uid='JDaO8HEGAKlar0yW0001', version='2', is_latest=True, name='My versioned scRNA-seq collection', hash='6FG6rc2asMgsENs5CIx59Q', visibility=1, created_by_id=1, transform_id=2, run_id=2, created_at=2024-10-19 07:28:23 UTC)
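
One way to think about appending is that it yields a new version holding the old artifacts plus the new one, while the previous version stays as it was. The following toy model captures only that idea. `ToyCollection` and the artifact identifiers are hypothetical and say nothing about how lamindb stores collections internally.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyCollection:
    """Hypothetical stand-in for a versioned collection of artifact uids."""

    name: str
    artifacts: Tuple[str, ...]
    version: Optional[str] = None

    def append(self, artifact_uid: str) -> "ToyCollection":
        # Return a new object; the previous version stays untouched.
        return replace(
            self, artifacts=self.artifacts + (artifact_uid,), version=None
        )


v1 = ToyCollection("My versioned scRNA-seq collection", ("artifact-1",), version="1")
v2 = replace(v1.append("artifact-2"), version="2")  # label the new version

print(v1.artifacts, v2.artifacts, v2.version)
```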

Version 2 of the collection covers significantly more conditions.

collection_v2.describe()
Collection(uid='JDaO8HEGAKlar0yW0001', version='2', is_latest=True, name='My versioned scRNA-seq collection', hash='6FG6rc2asMgsENs5CIx59Q', visibility=1, created_at=2024-10-19 07:28:23 UTC)
  Provenance
    .created_by = 'testuser1'
    .transform = 'Standardize and append a dataset'
    .run = 2024-10-19 07:28:11 UTC

View data lineage:

collection_v2.view_lineage()
[Figure: data lineage graph of the collection]