Standardize and append a dataset
Here, we’ll learn how to standardize a less well-curated dataset and how to append it to the growing versioned collection.
import lamindb as ln
import bionty as bt
ln.track("ManDYgmftZ8C0003")
→ connected lamindb: testuser1/test-scrna
→ created Transform('ManDYgmftZ8C0003'), started new Run('TbWhwU1I...') at 2025-01-20 07:36:02 UTC
→ notebook imports: bionty==1.0.0 lamindb==1.0.2
Let’s now consider a less well-curated dataset:
adata = ln.core.datasets.anndata_pbmc68k_reduced()
# we don't trust the cell type annotation in this dataset
adata.obs.rename(columns={"cell_type": "cell_type_untrusted"}, inplace=True)
adata
AnnData object with n_obs × n_vars = 70 × 765
obs: 'cell_type_untrusted', 'n_genes', 'percent_mito', 'louvain'
var: 'n_counts', 'highly_variable'
uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
Create a curator:
curator = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.symbol,
    categoricals={"cell_type_untrusted": bt.CellType.name},
    organism="human",
)
curator.validate()
! indexing datasets with gene symbols can be problematic: https://docs.lamin.ai/faq/symbol-mapping
✓ added 1 record with Feature.name for "columns": 'cell_type_untrusted'
• saving validated records of 'var_index'
✓ added 5 records from public with Gene.symbol for "var_index": 'GPX1', 'IGLL5', 'RN7SL1', 'SNORD3B-2', 'SOD2'
• mapping "var_index" on Gene.symbol
! 65 terms are not validated: 'ATPIF1', 'C1orf228', 'CCBL2', 'RP11-782C8.1', 'RP11-277L2.3', 'RP11-156E8.1', 'AC079767.4', 'H1FX', 'SELT', 'ATP5I', 'IGJ', 'CCDC109B', 'FYB', 'H2AFY', 'FAM65B', 'HIST1H4C', 'HIST1H1E', 'ZNRD1', 'C6orf48', 'RP3-467N11.1', ...
54 synonyms found: "ATPIF1" → "ATP5IF1", "C1orf228" → "ARMH1", "CCBL2" → "KYAT3", "AC079767.4" → "LINC01857", "H1FX" → "H1-10", "SELT" → "SELENOT", "ATP5I" → "ATP5ME", "IGJ" → "JCHAIN", "CCDC109B" → "MCUB", "FYB" → "FYB1", "H2AFY" → "MACROH2A1", "FAM65B" → "RIPOR2", "HIST1H4C" → "H4C3", "HIST1H1E" → "H1-4", "ZNRD1" → "POLR1H", "C6orf48" → "SNHG32", "SEPT7" → "SEPTIN7", "WBSCR22" → "BUD23", "RSBN1L-AS1" → "APTR", "CCDC132" → "VPS50", ...
→ curate synonyms via .standardize("var_index") for remaining terms:
→ fix typos, remove non-existent values, or save terms via .add_new_from_var_index()
• mapping "cell_type_untrusted" on CellType.name
! 9 terms are not validated: 'Dendritic cells', 'CD19+ B', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD4+/CD25 T Reg', 'CD14+ Monocytes', 'CD56+ NK', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD34+'
→ fix typos, remove non-existent values, or save terms via .add_new_from("cell_type_untrusted")
False
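The report above boils down to a membership check: each value is compared against the names known to a registry, and validation passes only if nothing is left over. The following is a minimal sketch of that idea in plain Python, not lamindb's implementation; `KNOWN_NAMES` and `non_validated` are hypothetical names, and the toy registry holds just three ontology names that appear in outputs on this page.

```python
# Toy stand-in for a registry (hypothetical); a real CellType registry is far larger.
KNOWN_NAMES = {
    "dendritic cell",
    "B cell, CD19-positive",
    "CD14-positive monocyte",
}


def non_validated(values, known=KNOWN_NAMES):
    """Return the unique values missing from `known`, in first-seen order."""
    seen = set()
    missing = []
    for value in values:
        if value not in known and value not in seen:
            seen.add(value)
            missing.append(value)
    return missing


labels = ["Dendritic cells", "CD19+ B", "CD14+ Monocytes", "CD19+ B"]
missing = non_validated(labels)
is_valid = not missing  # validation passes only if nothing is missing
print(missing)
print(is_valid)
```

Because the comparison is an exact string match, 'Dendritic cells' fails even though 'dendritic cell' is a known name, which is why the labels need to be mapped before they validate.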
Standardize & validate genes
Let’s convert gene symbols to Ensembl ids via standardize(). Note that this is a non-unique mapping and the first match is kept because the keep parameter in .standardize() defaults to "first":
adata.var["ensembl_gene_id"] = bt.Gene.standardize(
    adata.var.index,
    field=bt.Gene.symbol,
    return_field=bt.Gene.ensembl_gene_id,
    organism="human",
)
# use ensembl_gene_id as the index
adata.var.index.name = "symbol"
adata.var = adata.var.reset_index().set_index("ensembl_gene_id")
# we only want to save data with validated genes
validated = bt.Gene.validate(adata.var.index, bt.Gene.ensembl_gene_id, mute=True)
adata_validated = adata[:, validated].copy()
• standardized 754/765 terms
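To make the first-match behavior concrete, here is a small plain-Python sketch of a non-unique mapping in which the first match wins, followed by the same mask-based subsetting used above. It does not call bionty; `PAIRS`, `standardize_keep_first`, and all symbols and ids are hypothetical placeholders, and passing unmapped symbols through unchanged is an assumption of this sketch.

```python
# Hypothetical (symbol, id) pairs; placeholders, not real Ensembl ids.
PAIRS = [
    ("GENE_A", "ID_A1"),
    ("GENE_A", "ID_A2"),  # a second id for the same symbol
    ("GENE_B", "ID_B1"),
]


def standardize_keep_first(symbols, pairs):
    """Map each symbol to the first id listed for it."""
    first_match = {}
    for symbol, gene_id in pairs:
        # setdefault ignores later ids for a symbol that is already mapped
        first_match.setdefault(symbol, gene_id)
    # Assumption: symbols without a match are returned unchanged
    return [first_match.get(symbol, symbol) for symbol in symbols]


symbols = ["GENE_A", "GENE_B", "GENE_C"]
ids = standardize_keep_first(symbols, PAIRS)

# Keep only entries that are known ids, mirroring adata[:, validated]
known_ids = {gene_id for _, gene_id in PAIRS}
validated = [gene_id in known_ids for gene_id in ids]
kept = [gene_id for gene_id, ok in zip(ids, validated) if ok]
print(ids)
print(kept)
```

In this toy model, the unmapped symbol survives standardization unchanged and is only dropped by the validation mask, which is the reason for validating after standardizing.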
Here, we’ll also use .raw: we subset it to the validated genes and align its var index with adata_validated:
adata_validated.raw = adata.raw[:, validated].to_adata()
adata_validated.raw.var.index = adata_validated.var.index
curator = ln.Curator.from_anndata(
    adata_validated,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={"cell_type_untrusted": bt.CellType.name},
    organism="human",
)
curator.validate()
✓ "var_index" is validated against Gene.ensembl_gene_id
• mapping "cell_type_untrusted" on CellType.name
! 9 terms are not validated: 'Dendritic cells', 'CD19+ B', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD4+/CD25 T Reg', 'CD14+ Monocytes', 'CD56+ NK', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD34+'
→ fix typos, remove non-existent values, or save terms via .add_new_from("cell_type_untrusted")
False
Standardize & validate cell types
None of the cell type names are valid. We’ll now look up the non-validated cell types using the values of the public ontology and create a mapping.
curator.non_validated["cell_type_untrusted"]
['Dendritic cells',
'CD19+ B',
'CD4+/CD45RO+ Memory',
'CD8+ Cytotoxic T',
'CD4+/CD25 T Reg',
'CD14+ Monocytes',
'CD56+ NK',
'CD8+/CD45RA+ Naive Cytotoxic',
'CD34+']
ct_public_lo = bt.CellType.public().lookup()
name_mapping = {
    "Dendritic cells": ct_public_lo.dendritic_cell.name,
    "CD19+ B": ct_public_lo.b_cell_cd19_positive.name,
    "CD4+/CD45RO+ Memory": ct_public_lo.effector_memory_cd45ra_positive_alpha_beta_t_cell_terminally_differentiated.name,
    "CD8+ Cytotoxic T": ct_public_lo.cd8_positive_alpha_beta_cytotoxic_t_cell.name,
    "CD4+/CD25 T Reg": ct_public_lo.cd4_positive_cd25_positive_alpha_beta_regulatory_t_cell.name,
    "CD14+ Monocytes": ct_public_lo.cd14_positive_monocyte.name,
    "CD56+ NK": ct_public_lo.cd56_positive_cd161_positive_immature_natural_killer_cell_human.name,
    "CD8+/CD45RA+ Naive Cytotoxic": ct_public_lo.cd8_positive_alpha_beta_memory_t_cell_cd45ro_positive.name,
    "CD34+": ct_public_lo.cd34_positive_cd56_positive_cd117_positive_common_innate_lymphoid_precursor_human.name,
}
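The attribute names used on the lookup object look like normalized versions of the ontology names. As an illustration only, the sketch below lowercases a name and collapses every run of non-alphanumeric characters into an underscore, which reproduces the keys used in `name_mapping` for the ontology names printed on this page. `to_lookup_key` is a hypothetical helper, and lamindb's actual normalization may differ in edge cases.

```python
import re


def to_lookup_key(name):
    """Lowercase and replace runs of non-alphanumerics with '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")


names = [
    "CD14-positive monocyte",
    "CD8-positive, alpha-beta cytotoxic T cell",
    "CD4-positive, CD25-positive, alpha-beta regulatory T cell",
]
keys = [to_lookup_key(name) for name in names]
for name, key in zip(names, keys):
    print(f"{name!r} -> {key}")
```

Knowing the pattern helps when searching the lookup object with tab completion: typing the lowercased first token of an ontology name is usually enough to find candidates.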
We can now standardize cell type names using the lookup-based mapper:
adata_validated.obs["cell_type_untrusted_original"] = adata_validated.obs["cell_type_untrusted"] # copy the original annotations
adata_validated.obs["cell_type_untrusted"] = adata_validated.obs["cell_type_untrusted_original"].map(name_mapping)
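Because pandas' `Series.map` returns `NaN` for any value missing from the dictionary, it can be worth checking that the mapping covers every original label before overwriting a column. A minimal sketch of such a check in plain Python follows; `unmapped_labels` is a hypothetical helper, and the mapping is deliberately incomplete to show a gap.

```python
def unmapped_labels(labels, mapping):
    """Return the sorted unique labels that have no entry in `mapping`."""
    return sorted(set(labels) - set(mapping))


# Three of the original labels and a deliberately incomplete mapping
original = ["Dendritic cells", "CD19+ B", "CD14+ Monocytes", "CD19+ B"]
partial_mapping = {
    "Dendritic cells": "dendritic cell",
    "CD19+ B": "B cell, CD19-positive",
}

gaps = unmapped_labels(original, partial_mapping)
print(gaps)

# With a plain dict, .get(label) yields None where pandas would yield NaN
mapped = [partial_mapping.get(label) for label in original]
print(mapped)
```

An empty result from the check means every label has a target name, so the mapped column contains no missing values.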
Now, all cell types are validated:
curator.validate()
• saving validated records of 'cell_type_untrusted'
✓ added 5 records from public with CellType.name for "cell_type_untrusted": 'CD14-positive monocyte', 'CD34-positive, CD56-positive, CD117-positive common innate lymphoid precursor, human', 'CD4-positive, CD25-positive, alpha-beta regulatory T cell', 'CD56-positive, CD161-positive immature natural killer cell, human', 'CD8-positive, alpha-beta cytotoxic T cell'
✓ "var_index" is validated against Gene.ensembl_gene_id
✓ "cell_type_untrusted" is validated against CellType.name
True
Register
artifact = curator.save_artifact(description="10x reference adata")
! 4 unique terms (80.00%) are not validated for name: 'n_genes', 'percent_mito', 'louvain', 'cell_type_untrusted_original'
! did not create Feature records for 4 non-validated names: 'cell_type_untrusted_original', 'louvain', 'n_genes', 'percent_mito'
artifact.view_lineage()
artifact.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'Xl9wc1xjhWAhOdVJ0000'
│   ├── .size = 859840
│   ├── .hash = '8cSIZsvUrKeGfL64-H9RLw'
│   ├── .n_observations = 70
│   ├── .path = /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/Xl9wc1xjhWAhOdVJ0000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2025-01-20 07:36:12
│   └── .transform = 'Standardize and append a dataset'
├── Dataset features/._schemas_m2m
│   ├── var • 754 [bionty.Gene]
│   │   HES4 float
│   │   TNFRSF4 float
│   │   SSU72 float
│   │   PARK7 float
│   │   RBP7 float
│   │   SRM float
│   │   MAD2L2 float
│   │   AGTRAP float
│   │   TNFRSF1B float
│   │   EFHD2 float
│   │   NECAP2 float
│   │   HP1BP3 float
│   │   C1QA float
│   │   C1QB float
│   │   HNRNPR float
│   │   GALE float
│   │   STMN1 float
│   │   CD52 float
│   │   FGR float
│   │   ATP5IF1 float
│   └── obs • 1 [Feature]
│       cell_type_untrusted cat[bionty.CellType] B cell, CD19-positive, CD14-positive mon…
└── Labels
    └── .cell_types bionty.CellType CD8-positive, alpha-beta memory T cell, …
Re-curate
We review the dataset and find all annotations trustworthy except for the presence of a 'CD38-positive naive B cell'. Inspecting the name_mapping in detail tells us that 'CD8+/CD45RA+ Naive Cytotoxic' was erroneously mapped to a B cell. Let us correct this and create a 'cell_type' feature that we can now trust.
name_mapping['CD38-positive naive B cell'] = 'cytotoxic T cell'
adata_validated.obs["cell_type"] = adata_validated.obs["cell_type_untrusted_original"].map(name_mapping)
adata_validated.obs["cell_type"].unique()
['dendritic cell', 'B cell, CD19-positive', 'effector memory CD45RA-positive, alpha-beta T..., 'CD8-positive, alpha-beta cytotoxic T cell', 'CD4-positive, CD25-positive, alpha-beta regul..., 'CD14-positive monocyte', 'CD56-positive, CD161-positive immature natura..., 'CD8-positive, alpha-beta memory T cell, CD45R..., 'CD34-positive, CD56-positive, CD117-positive ...]
Categories (9, object): ['CD4-positive, CD25-positive, alpha-beta regul..., 'effector memory CD45RA-positive, alpha-beta T..., 'CD8-positive, alpha-beta cytotoxic T cell', 'CD8-positive, alpha-beta memory T cell, CD45R..., ..., 'B cell, CD19-positive', 'CD34-positive, CD56-positive, CD117-positive ..., 'CD56-positive, CD161-positive immature natura..., 'dendritic cell']
artifact_trusted = ln.Curator.from_anndata(
    adata_validated,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={"cell_type": bt.CellType.name, "cell_type_untrusted": bt.CellType.name},
    organism="human",
).save_artifact(
    description="10x reference adata, trusted cell type annotation",
    revises=artifact,
)
✓ "var_index" is validated against Gene.ensembl_gene_id
✓ "cell_type" is validated against CellType.name
✓ "cell_type_untrusted" is validated against CellType.name
! 4 unique terms (66.70%) are not validated for name: 'n_genes', 'percent_mito', 'louvain', 'cell_type_untrusted_original'
! did not create Feature records for 4 non-validated names: 'cell_type_untrusted_original', 'louvain', 'n_genes', 'percent_mito'
artifact_trusted.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'Xl9wc1xjhWAhOdVJ0001'
│   ├── .size = 857336
│   ├── .hash = 'GK721a-L-fGDI8kXefKMtA'
│   ├── .n_observations = 70
│   ├── .path = /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/Xl9wc1xjhWAhOdVJ0001.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2025-01-20 07:36:13
│   └── .transform = 'Standardize and append a dataset'
├── Dataset features/._schemas_m2m
│   ├── var • 754 [bionty.Gene]
│   │   HES4 float
│   │   TNFRSF4 float
│   │   SSU72 float
│   │   PARK7 float
│   │   RBP7 float
│   │   SRM float
│   │   MAD2L2 float
│   │   AGTRAP float
│   │   TNFRSF1B float
│   │   EFHD2 float
│   │   NECAP2 float
│   │   HP1BP3 float
│   │   C1QA float
│   │   C1QB float
│   │   HNRNPR float
│   │   GALE float
│   │   STMN1 float
│   │   CD52 float
│   │   FGR float
│   │   ATP5IF1 float
│   └── obs • 2 [Feature]
│       cell_type cat[bionty.CellType] B cell, CD19-positive, CD14-positive mon…
│       cell_type_untrusted cat[bionty.CellType] B cell, CD19-positive, CD14-positive mon…
└── Labels
    └── .cell_types bionty.CellType CD8-positive, alpha-beta memory T cell, …
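Comparing the two `describe()` outputs, the uid of the revised artifact ends in `0001` where the original ends in `0000`, while the leading 16 characters are identical. The sketch below splits the two uids accordingly. Reading the trailing four characters as a version suffix is an interpretation of these outputs, and `split_uid` is a hypothetical helper rather than a lamindb function.

```python
def split_uid(uid, suffix_len=4):
    """Split a uid into (stem, suffix), treating the last characters as the suffix."""
    return uid[:-suffix_len], uid[-suffix_len:]


# uids copied from the two describe() outputs above
uid_original = "Xl9wc1xjhWAhOdVJ0000"
uid_revised = "Xl9wc1xjhWAhOdVJ0001"

stem_a, suffix_a = split_uid(uid_original)
stem_b, suffix_b = split_uid(uid_revised)
same_family = stem_a == stem_b  # both uids share the same stem
print(same_family, suffix_a, suffix_b)
```

Under this reading, passing `revises=artifact` keeps the new artifact in the same version family as the one it replaces instead of starting an unrelated record.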
Append the dataset to the collection
Query the previous collection:
collection_v1 = ln.Collection.get(name="My versioned scRNA-seq collection")
Create a new version of the collection by sharding it across the new artifact and the artifact underlying version 1 of the collection:
collection_v2 = collection_v1.append(artifact_trusted).save()
• adding collection ids [1] as inputs for run 2, adding parent transform 1
• adding collection ids [1] as inputs for run 2, adding parent transform 1
• adding artifact ids [1] as inputs for run 2, adding parent transform 1
If you want, you can label the collection’s version by setting .version.
collection_v2.version = "2"
collection_v2.save()
Collection(uid='CefCs4oPJxCIkP7o0001', version='2', is_latest=True, key='My versioned scRNA-seq collection', hash='luH-jPb6eJLsXvc1TWGpUg', created_by_id=1, space_id=1, run_id=2, created_at=2025-01-20 07:36:13 UTC)
Version 2 of the collection covers significantly more conditions.
collection_v2.describe()
Collection
└── General
    ├── .uid = 'CefCs4oPJxCIkP7o0001'
    ├── .key = 'My versioned scRNA-seq collection'
    ├── .hash = 'luH-jPb6eJLsXvc1TWGpUg'
    ├── .version = '2'
    ├── .created_by = testuser1 (Test User1)
    ├── .created_at = 2025-01-20 07:36:13
    └── .transform = 'Standardize and append a dataset'
View data lineage:
collection_v2.view_lineage()