Standardize and append a dataset
Here, we’ll learn:
- how to standardize a less well curated dataset
- how to append it to the growing versioned collection
import lamindb as ln
import bionty as bt
ln.track()
→ connected lamindb: testuser1/test-scrna
→ created Transform('6eKRTJmIJL9h0000', key='scrna2.ipynb'), started new Run('a9L0KH0vr7jAnKNP') at 2025-12-17 19:51:52 UTC
→ notebook imports: bionty==1.10.0 lamindb==1.17.0
• recommendation: to identify the notebook across renames, pass the uid: ln.track("6eKRTJmIJL9h")
Let’s now consider a less well curated dataset:
adata = ln.core.datasets.anndata_pbmc68k_reduced()
# we don't trust the cell type annotation in this dataset
adata.obs.rename(columns={"cell_type": "cell_type_untrusted"}, inplace=True)
# this is our dataset
adata
AnnData object with n_obs × n_vars = 70 × 765
obs: 'cell_type_untrusted', 'n_genes', 'percent_mito', 'louvain'
var: 'n_counts', 'highly_variable'
uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
We can’t yet save it in validated form because it fails validation against the schema:
try:
    ln.Artifact.from_anndata(
        adata,
        key="scrna/dataset2.h5ad",
        schema="ensembl_gene_ids_and_valid_features_in_obs",
    ).save()
except SystemExit as e:
    print("Error captured:", e)
→ writing the in-memory object into cache
→ loading artifact into memory for validation
! 4 terms not validated in feature 'columns' in slot 'obs': 'n_genes', 'louvain', 'cell_type_untrusted', 'percent_mito'
→ fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('columns')
! no values were validated for columns!
Error captured: `organism` is required to get Source record for Gene!
Let’s convert gene symbols to Ensembl IDs via standardize(). Note that this mapping is non-unique: the first match is kept because the keep parameter of .standardize() defaults to "first":
adata.var["ensembl_gene_id"] = bt.Gene.standardize(
    adata.var.index,
    field=bt.Gene.symbol,
    return_field=bt.Gene.ensembl_gene_id,
    organism="human",
)
# use ensembl_gene_id as the index
adata.var.index.name = "symbol"
adata.var = adata.var.reset_index().set_index("ensembl_gene_id")
! found 5 symbols in public source: ['GPX1', 'IGLL5', 'SNORD3B-2', 'RN7SL1', 'SOD2']
please add corresponding Gene records via: `.from_values(['GPX1', 'IGLL5', 'SNORD3B-2', 'RN7SL1', 'SOD2'])`
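To make the "first match is kept" behavior concrete, here is a minimal pure-Python sketch of the semantics. It is not how bionty implements standardize(); the function name map_keep_first, the reference table, and all symbols and IDs in it are hypothetical placeholders. The sketch also assumes that symbols without a match are passed through unchanged.

```python
# Toy reference table of (symbol, gene_id) pairs.
# The symbols and IDs are made-up placeholders, not real Ensembl records.
REFERENCE = [
    ("GENE1", "ID_0001"),
    ("GENE2", "ID_0002"),
    ("GENE2", "ID_0003"),  # a second match for GENE2
]


def map_keep_first(symbols, reference):
    """Map symbols to IDs, keeping the first match for ambiguous symbols."""
    first_match = {}
    for symbol, gene_id in reference:
        # setdefault stores an ID only the first time a symbol is seen
        first_match.setdefault(symbol, gene_id)
    # assumption: symbols without a match are returned unchanged
    return [first_match.get(symbol, symbol) for symbol in symbols]


mapped = map_keep_first(["GENE2", "GENE1", "UNKNOWN"], REFERENCE)
print(mapped)
```

In this toy table, GENE2 resolves to its first listed ID and the unknown symbol is left as is, so ambiguous or unmapped symbols are worth inspecting before the result is used as an index.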
None of the cell type names in cell_type_untrusted are valid ontology terms:
adata.obs["cell_type_untrusted"].unique()
['Dendritic cells', 'CD19+ B', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD4+/CD25 T Reg', 'CD14+ Monocytes', 'CD56+ NK', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD34+']
Categories (9, object): ['CD4+/CD25 T Reg', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD8+/CD45RA+ Naive Cytotoxic', ..., 'CD19+ B', 'CD34+', 'CD56+ NK', 'Dendritic cells']
Let’s look up the non-validated cell types using the values of the public ontology and create a mapping.
cell_types = bt.CellType.public().lookup()
name_mapping = {
    "Dendritic cells": cell_types.dendritic_cell.name,
    "CD19+ B": cell_types.b_cell_cd19_positive.name,
    "CD4+/CD45RO+ Memory": cell_types.effector_memory_cd45ra_positive_alpha_beta_t_cell_terminally_differentiated.name,
    "CD8+ Cytotoxic T": cell_types.cd8_positive_alpha_beta_cytotoxic_t_cell.name,
    "CD4+/CD25 T Reg": cell_types.cd4_positive_cd25_positive_alpha_beta_regulatory_t_cell.name,
    "CD14+ Monocytes": cell_types.cd14_positive_monocyte.name,
    "CD56+ NK": cell_types.cd56_positive_cd161_positive_immature_natural_killer_cell_human.name,
    "CD8+/CD45RA+ Naive Cytotoxic": cell_types.cd8_positive_alpha_beta_memory_t_cell_cd45ro_positive.name,
    "CD34+": cell_types.cd34_positive_cd56_positive_cd117_positive_common_innate_lymphoid_precursor_human.name,
    "CD38-positive naive B cell": cell_types.cytotoxic_t_cell.name,
}
And standardize cell type names using this name mapping:
adata.obs["cell_type"] = adata.obs["cell_type_untrusted"].map(name_mapping)
adata.obs["cell_type"].unique()
['dendritic cell', 'B cell, CD19-positive', 'effector memory CD45RA-positive, alpha-beta T..., 'CD8-positive, alpha-beta cytotoxic T cell', 'CD4-positive, CD25-positive, alpha-beta regul..., 'CD14-positive monocyte', 'CD56-positive, CD161-positive immature natura..., 'CD8-positive, alpha-beta memory T cell, CD45R..., 'CD34-positive, CD56-positive, CD117-positive ...]
Categories (9, object): ['CD4-positive, CD25-positive, alpha-beta regul..., 'effector memory CD45RA-positive, alpha-beta T..., 'CD8-positive, alpha-beta cytotoxic T cell', 'CD8-positive, alpha-beta memory T cell, CD45R..., ..., 'B cell, CD19-positive', 'CD34-positive, CD56-positive, CD117-positive ..., 'CD56-positive, CD161-positive immature natura..., 'dendritic cell']
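Because pandas Series.map() returns NaN for any value that has no key in the mapping dict, it can be worth checking the mapping's coverage before applying it. The following is a small sketch using only the standard library; the label list is copied from the unique() output of cell_type_untrusted shown earlier, and the variable names are illustrative.

```python
# Labels observed in adata.obs["cell_type_untrusted"]
observed_labels = [
    "Dendritic cells",
    "CD19+ B",
    "CD4+/CD45RO+ Memory",
    "CD8+ Cytotoxic T",
    "CD4+/CD25 T Reg",
    "CD14+ Monocytes",
    "CD56+ NK",
    "CD8+/CD45RA+ Naive Cytotoxic",
    "CD34+",
]

# Keys of name_mapping as defined in this guide
mapping_keys = [
    "Dendritic cells",
    "CD19+ B",
    "CD4+/CD45RO+ Memory",
    "CD8+ Cytotoxic T",
    "CD4+/CD25 T Reg",
    "CD14+ Monocytes",
    "CD56+ NK",
    "CD8+/CD45RA+ Naive Cytotoxic",
    "CD34+",
    "CD38-positive naive B cell",
]

# Labels that would become NaN after .map(name_mapping)
unmapped = sorted(set(observed_labels) - set(mapping_keys))
# Mapping keys that match no label in this dataset
unused = sorted(set(mapping_keys) - set(observed_labels))

print("unmapped labels:", unmapped)
print("unused mapping keys:", unused)
```

Here every observed label has a key, so no NaN values are introduced. The key "CD38-positive naive B cell" matches no label in this dataset and therefore has no effect on the result.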
Define the corresponding feature:
ln.Feature(name="cell_type", dtype=bt.CellType).save()
→ returning feature with same name: 'cell_type'
Feature(uid='3HKy4qYQ3to1', name='cell_type', dtype='cat[bionty.CellType]', is_type=None, unit=None, description=None, array_rank=0, array_size=0, array_shape=None, proxy_dtype=None, synonyms=None, branch_id=1, space_id=1, created_by_id=2, run_id=1, type_id=None, created_at=2025-12-17 19:51:21 UTC, is_locked=False)
Save the artifact with cell type and gene annotations:
artifact_trusted = ln.Artifact.from_anndata(
    adata,
    key="scrna/dataset2.h5ad",
    description="10x reference adata, trusted cell type annotation",
    schema="ensembl_gene_ids_and_valid_features_in_obs",
).save()
artifact_trusted.describe()
→ writing the in-memory object into cache
→ loading artifact into memory for validation
! 4 terms not validated in feature 'columns' in slot 'obs': 'n_genes', 'louvain', 'cell_type_untrusted', 'percent_mito'
→ fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('columns')
! 11 terms not validated in feature 'columns' in slot 'var.T': 'RP11-782C8.1', 'RP11-277L2.3', 'RP11-156E8.1', 'RP3-467N11.1', 'RP11-390E23.6', 'RP11-489E7.4', 'RP11-291B21.2', 'RP11-620J15.3', 'TMBIM4-1', 'AC084018.1', 'CTD-3138B18.5'
→ fix typos, remove non-existent values, or save terms via: curator.slots['var.T'].cat.add_new_from('columns')
Artifact: scrna/dataset2.h5ad (0000) | description: 10x reference adata, trusted cell type annotation
├── uid: 63hLvWIbI605RCk30000            run: a9L0KH0 (scrna2.ipynb)
│   kind: dataset                        otype: AnnData
│   hash: 57xpUqmnZeBP6et6xOCcPA         size: 835.8 KB
│   branch: main                         space: all
│   created_at: 2025-12-17 19:51:58 UTC  created_by: testuser1
│   n_observations: 70
├── storage/path:
│   /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/63hLvWIbI605RCk30000.h5ad
├── Dataset features
│   ├── obs (1)
│   │   cell_type    bionty.CellType    B cell, CD19-positive, CD14-positive mo…
│   └── var.T (754 bionty.Gene.ensemb…
│       HES4         num
│       TNFRSF4      num
│       SSU72        num
│       PARK7        num
│       RBP7         num
│       SRM          num
│       MAD2L2       num
│       AGTRAP       num
│       TNFRSF1B     num
│       EFHD2        num
│       NECAP2       num
│       HP1BP3       num
│       C1QA         num
│       C1QB         num
│       HNRNPR       num
│       GALE         num
│       STMN1        num
│       CD52         num
│       FGR          num
│       ATP5IF1      num
└── Labels
    └── .cell_types    bionty.CellType    CD8-positive, alpha-beta memory T cell,…
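The warning for slot 'var.T' lists 11 identifiers that did not validate. They look like gene symbols rather than Ensembl IDs, which suggests that standardize() found no mapping for them and left them unchanged. A rough way to spot such leftovers before saving is to check the identifier prefix, since human Ensembl gene IDs start with "ENSG". The helper name split_by_prefix is hypothetical, and the example index below is illustrative rather than taken from the dataset, apart from the two symbols quoted from the warning.

```python
def split_by_prefix(identifiers, prefix="ENSG"):
    """Split identifiers into those that start with the prefix and the rest."""
    with_prefix = [i for i in identifiers if i.startswith(prefix)]
    without_prefix = [i for i in identifiers if not i.startswith(prefix)]
    return with_prefix, without_prefix


# Illustrative index: one Ensembl ID (ENSG00000141510 is TP53, used only as an
# example value) and two symbols quoted from the validation warning.
var_index = ["ENSG00000141510", "RP11-782C8.1", "TMBIM4-1"]

ensembl_like, leftovers = split_by_prefix(var_index)
print("look like Ensembl IDs:", ensembl_like)
print("left unmapped:", leftovers)
```

A prefix check is only a heuristic; validation against the gene registry remains the authoritative test.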
Query the previous collection:
collection_v1 = ln.Collection.get(key="scrna/collection1")
Create a new version of the collection by sharding it across the new artifact and the artifact underlying version 1 of the collection:
collection_v2 = collection_v1.append(artifact_trusted).save()
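Conceptually, appending produces a new version that references both artifacts while version 1 stays as it was. The toy model below illustrates only that idea. It is not lamindb's implementation: the class ToyCollection, its fields, the integer version counter, and the artifact name "dataset1" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyCollection:
    """Immutable stand-in for a versioned collection (hypothetical model)."""

    key: str
    artifacts: tuple
    version: int = 1

    def append(self, artifact):
        # Return a new version instead of mutating the existing one
        return ToyCollection(
            key=self.key,
            artifacts=self.artifacts + (artifact,),
            version=self.version + 1,
        )


# "dataset1" is a placeholder name for the artifact underlying version 1
v1 = ToyCollection(key="scrna/collection1", artifacts=("dataset1",))
v2 = v1.append("scrna/dataset2.h5ad")

print(v1)
print(v2)
```

Keeping versions immutable means that anything that referenced version 1 still sees the same single artifact after the append.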
View the data lineage:
collection_v2.view_lineage()