Standardize and append a dataset
Here, we’ll learn how to standardize a less well-curated dataset and how to append it to the growing versioned collection.
import lamindb as ln
import bionty as bt
ln.track()
→ connected lamindb: testuser1/test-scrna
→ created Transform('W011HZyyKzV00000', key='scrna2.ipynb'), started new Run('hzx93VMcknsDuH2A') at 2025-11-26 11:06:53 UTC
→ notebook imports: bionty==1.9.1 lamindb==1.16.1
• recommendation: to identify the notebook across renames, pass the uid: ln.track("W011HZyyKzV0")
Let’s now consider a less well-curated dataset:
adata = ln.core.datasets.anndata_pbmc68k_reduced()
# we don't trust the cell type annotation in this dataset
adata.obs.rename(columns={"cell_type": "cell_type_untrusted"}, inplace=True)
# this is our dataset
adata
AnnData object with n_obs × n_vars = 70 × 765
obs: 'cell_type_untrusted', 'n_genes', 'percent_mito', 'louvain'
var: 'n_counts', 'highly_variable'
uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
We can’t save it in validated form because none of the obs columns and none of the gene symbols validate against the schema:
try:
    ln.Artifact.from_anndata(
        adata,
        key="scrna/dataset2.h5ad",
        schema="ensembl_gene_ids_and_valid_features_in_obs",
    ).save()
except ln.errors.ValidationError:
    # expected: the dataset does not yet validate against the schema
    pass
→ writing the in-memory object into cache
→ loading artifact into memory for validation
! 4 terms not validated in feature 'columns' in slot 'obs': 'cell_type_untrusted', 'n_genes', 'percent_mito', 'louvain'
→ fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('columns')
! no values were validated for columns!
! 765 terms not validated in feature 'columns' in slot 'var.T': 'HES4', 'TNFRSF4', 'SSU72', 'PARK7', 'RBP7', 'SRM', 'MAD2L2', 'AGTRAP', 'TNFRSF1B', 'EFHD2', 'NECAP2', 'HP1BP3', 'C1QA', 'C1QB', 'HNRNPR', 'GALE', 'STMN1', 'CD52', 'FGR', 'ATPIF1', ...
→ fix organism 'human', fix typos, remove non-existent values, or save terms via: curator.slots['var.T'].cat.add_new_from('columns')
! no values were validated for columns!
→ returning schema with same hash: Schema(uid='0000000000000000', name='valid_features', description=None, is_type=False, itype='Feature', otype=None, dtype=None, hash='kMi7B_N88uu-YnbTLDU-DA', minimal_set=True, ordered_set=False, maximal_set=False, slot=None, branch_id=1, space_id=1, created_by_id=1, run_id=1, type_id=None, validated_by_id=None, composite_id=None, created_at=2025-11-26 11:06:22 UTC, is_locked=False)
→ returning schema with same hash: Schema(uid='0000000000000001', name='valid_ensembl_gene_ids', description=None, is_type=False, itype='bionty.Gene.ensembl_gene_id', otype=None, dtype='num', hash='1gocc_TJ1RU2bMwDRK-WUA', minimal_set=True, ordered_set=False, maximal_set=False, slot=None, branch_id=1, space_id=1, created_by_id=1, run_id=1, type_id=None, validated_by_id=None, composite_id=None, created_at=2025-11-26 11:06:22 UTC, is_locked=False)
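The warnings above boil down to a membership check: each column name is compared against the registered features, and anything that is missing is reported as not validated. A minimal sketch of that idea in plain Python follows. The helper split_validated and the known_features set are hypothetical stand-ins for illustration, not lamindb API, and the assumption that only "cell_type" is registered is ours.

```python
def split_validated(values, known):
    """Partition values into those found in `known` and those that are not."""
    known = set(known)
    validated = [v for v in values if v in known]
    not_validated = [v for v in values if v not in known]
    return validated, not_validated


# obs columns of the dataset, as printed above
obs_columns = ["cell_type_untrusted", "n_genes", "percent_mito", "louvain"]
# hypothetical registry content: assume only "cell_type" is a registered feature
known_features = {"cell_type"}

validated, not_validated = split_validated(obs_columns, known_features)
print(f"{len(not_validated)} terms not validated: {not_validated}")
```

With no validated column at all, there is nothing to annotate the dataset with, which matches the "no values were validated" message.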
Let’s convert gene symbols to Ensembl IDs via standardize(). Note that this mapping is not unique; the first match is kept because the keep parameter of .standardize() defaults to "first":
adata.var["ensembl_gene_id"] = bt.Gene.standardize(
    adata.var.index,
    field=bt.Gene.symbol,
    return_field=bt.Gene.ensembl_gene_id,
    organism="human",
)
# use ensembl_gene_id as the index
adata.var.index.name = "symbol"
adata.var = adata.var.reset_index().set_index("ensembl_gene_id")
! found 5 symbols in public source: ['GPX1', 'IGLL5', 'SNORD3B-2', 'RN7SL1', 'SOD2']
please add corresponding Gene records via: `.from_values(['GPX1', 'IGLL5', 'SNORD3B-2', 'RN7SL1', 'SOD2'])`
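To make "the first match is kept" concrete, here is a plain-Python sketch of a one-to-many lookup that resolves ties by taking the first hit and leaves symbols without a match unchanged. The lookup table, the IDs, and the standardize_first helper are invented for illustration; this is not how bionty implements standardize().

```python
def standardize_first(symbols, lookup):
    """Map each symbol to its first matching ID; keep unmapped symbols as-is."""
    result = []
    for symbol in symbols:
        matches = lookup.get(symbol, [])
        result.append(matches[0] if matches else symbol)
    return result


# hypothetical lookup: one symbol may point to several IDs (all IDs are invented)
lookup = {
    "GENE_A": ["ID_A1"],
    "GENE_B": ["ID_B1", "ID_B2"],  # non-unique mapping
}

mapped = standardize_first(["GENE_A", "GENE_B", "GENE_C"], lookup)
print(mapped)
```

Here GENE_B resolves to its first ID and GENE_C passes through unchanged, so any symbol that cannot be mapped remains visible in the index for later validation.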
None of the cell type names are valid.
adata.obs["cell_type_untrusted"].unique()
['Dendritic cells', 'CD19+ B', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD4+/CD25 T Reg', 'CD14+ Monocytes', 'CD56+ NK', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD34+']
Categories (9, object): ['CD4+/CD25 T Reg', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD8+/CD45RA+ Naive Cytotoxic', ..., 'CD19+ B', 'CD34+', 'CD56+ NK', 'Dendritic cells']
Let’s look up the non-validated cell types using the values of the public ontology and create a mapping.
cell_types = bt.CellType.public().lookup()
name_mapping = {
    "Dendritic cells": cell_types.dendritic_cell.name,
    "CD19+ B": cell_types.b_cell_cd19_positive.name,
    "CD4+/CD45RO+ Memory": cell_types.effector_memory_cd45ra_positive_alpha_beta_t_cell_terminally_differentiated.name,
    "CD8+ Cytotoxic T": cell_types.cd8_positive_alpha_beta_cytotoxic_t_cell.name,
    "CD4+/CD25 T Reg": cell_types.cd4_positive_cd25_positive_alpha_beta_regulatory_t_cell.name,
    "CD14+ Monocytes": cell_types.cd14_positive_monocyte.name,
    "CD56+ NK": cell_types.cd56_positive_cd161_positive_immature_natural_killer_cell_human.name,
    "CD8+/CD45RA+ Naive Cytotoxic": cell_types.cd8_positive_alpha_beta_memory_t_cell_cd45ro_positive.name,
    "CD34+": cell_types.cd34_positive_cd56_positive_cd117_positive_common_innate_lymphoid_precursor_human.name,
    "CD38-positive naive B cell": cell_types.cytotoxic_t_cell.name,
}
And standardize cell type names using this name mapping:
adata.obs["cell_type"] = adata.obs["cell_type_untrusted"].map(name_mapping)
adata.obs["cell_type"].unique()
['dendritic cell', 'B cell, CD19-positive', 'effector memory CD45RA-positive, alpha-beta T..., 'CD8-positive, alpha-beta cytotoxic T cell', 'CD4-positive, CD25-positive, alpha-beta regul..., 'CD14-positive monocyte', 'CD56-positive, CD161-positive immature natura..., 'CD8-positive, alpha-beta memory T cell, CD45R..., 'CD34-positive, CD56-positive, CD117-positive ...]
Categories (9, object): ['CD4-positive, CD25-positive, alpha-beta regul..., 'effector memory CD45RA-positive, alpha-beta T..., 'CD8-positive, alpha-beta cytotoxic T cell', 'CD8-positive, alpha-beta memory T cell, CD45R..., ..., 'B cell, CD19-positive', 'CD34-positive, CD56-positive, CD117-positive ..., 'CD56-positive, CD161-positive immature natura..., 'dendritic cell']
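Because .map() on a pandas Series yields a missing value for any label that has no key in the dictionary, it can be worth checking coverage before mapping. A minimal sketch using only the standard library follows; the observed list and the shortened mapping are illustrative stand-ins for adata.obs["cell_type_untrusted"].unique() and name_mapping, and unmapped_labels is a hypothetical helper.

```python
def unmapped_labels(observed, mapping):
    """Return observed labels without a mapping entry, in original order."""
    return [label for label in observed if label not in mapping]


# illustrative subset of the labels shown above
observed = ["Dendritic cells", "CD19+ B", "CD14+ Monocytes"]
# shortened stand-in for name_mapping; target names copied from the output above
mapping = {
    "Dendritic cells": "dendritic cell",
    "CD19+ B": "B cell, CD19-positive",
}

missing = unmapped_labels(observed, mapping)
print(missing)
```

An empty result means every observed label has a target name; anything listed would end up as a missing value in the new column.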
Define the corresponding feature:
ln.Feature(name="cell_type", dtype=bt.CellType).save()
→ returning feature with same name: 'cell_type'
Feature(uid='z7usF5ZebJ1W', name='cell_type', dtype='cat[bionty.CellType]', is_type=None, unit=None, description=None, array_rank=0, array_size=0, array_shape=None, proxy_dtype=None, synonyms=None, branch_id=1, space_id=1, created_by_id=1, run_id=1, type_id=None, created_at=2025-11-26 11:06:22 UTC, is_locked=False)
Save the artifact with cell type and gene annotations:
artifact_trusted = ln.Artifact.from_anndata(
    adata,
    key="scrna/dataset2.h5ad",
    description="10x reference adata, trusted cell type annotation",
    schema="ensembl_gene_ids_and_valid_features_in_obs",
).save()
artifact_trusted.describe()
→ writing the in-memory object into cache
→ creating new artifact version for key 'scrna/dataset2.h5ad' in storage '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna'
→ loading artifact into memory for validation
! 4 terms not validated in feature 'columns' in slot 'obs': 'cell_type_untrusted', 'n_genes', 'percent_mito', 'louvain'
→ fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('columns')
! 11 terms not validated in feature 'columns' in slot 'var.T': 'RP11-782C8.1', 'RP11-277L2.3', 'RP11-156E8.1', 'RP3-467N11.1', 'RP11-390E23.6', 'RP11-489E7.4', 'RP11-291B21.2', 'RP11-620J15.3', 'TMBIM4-1', 'AC084018.1', 'CTD-3138B18.5'
→ fix organism 'human', fix typos, remove non-existent values, or save terms via: curator.slots['var.T'].cat.add_new_from('columns')
Artifact: scrna/dataset2.h5ad (0001)
|   description: 10x reference adata, trusted cell type annotation
├── uid: YPzTiZZiULn5ufBM0001            run: hzx93VM (scrna2.ipynb)
│   kind: dataset                        otype: AnnData
│   hash: 57xpUqmnZeBP6et6xOCcPA         size: 835.8 KB
│   branch: main                         space: all
│   created_at: 2025-11-26 11:07:03 UTC  created_by: testuser1
│   n_observations: 70
├── storage/path:
│   /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/YPzTiZZiULn5ufBM0001.h5ad
├── Dataset features
│   ├── obs (1)
│   │   cell_type    bionty.CellType    B cell, CD19-positive, CD14-positive mo…
│   └── var.T (754 bionty.Gene.ensemb…
│       HES4         num
│       TNFRSF4      num
│       SSU72        num
│       PARK7        num
│       RBP7         num
│       SRM          num
│       MAD2L2       num
│       AGTRAP       num
│       TNFRSF1B     num
│       EFHD2        num
│       NECAP2       num
│       HP1BP3       num
│       C1QA         num
│       C1QB         num
│       HNRNPR       num
│       GALE         num
│       STMN1        num
│       CD52         num
│       FGR          num
│       ATP5IF1      num
└── Labels
    └── .cell_types  bionty.CellType    CD8-positive, alpha-beta memory T cell,…
Query the previous collection:
collection_v1 = ln.Collection.get(key="scrna/collection1")
Create a new version of the collection by sharding it across the new artifact and the artifact underlying version 1 of the collection:
collection_v2 = collection_v1.append(artifact_trusted).save()
View the data lineage:
collection_v2.view_lineage()