
Standardize and append a dataset

Here, we’ll learn:

  • how to standardize a less well curated dataset

  • how to append it to the growing versioned collection

import lamindb as ln
import bionty as bt

ln.track()
 connected lamindb: testuser1/test-scrna
 created Transform('W011HZyyKzV00000', key='scrna2.ipynb'), started new Run('hzx93VMcknsDuH2A') at 2025-11-26 11:06:53 UTC
 notebook imports: bionty==1.9.1 lamindb==1.16.1
 recommendation: to identify the notebook across renames, pass the uid: ln.track("W011HZyyKzV0")

Let’s now consider a less-well curated dataset:

adata = ln.core.datasets.anndata_pbmc68k_reduced()
# we don't trust the cell type annotation in this dataset
adata.obs.rename(columns={"cell_type": "cell_type_untrusted"}, inplace=True)
# this is our dataset
adata
AnnData object with n_obs × n_vars = 70 × 765
    obs: 'cell_type_untrusted', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

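We can’t yet save it in validated form. Validation against the schema fails because neither the obs columns nor the gene symbols in var are recognized, so the attempt below raises a ValidationError: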

try:
    ln.Artifact.from_anndata(
        adata,
        key="scrna/dataset2.h5ad",
        schema="ensembl_gene_ids_and_valid_features_in_obs",
    ).save()
except ln.errors.ValidationError:
    pass
 writing the in-memory object into cache
 loading artifact into memory for validation
! 4 terms not validated in feature 'columns' in slot 'obs': 'cell_type_untrusted', 'n_genes', 'percent_mito', 'louvain'
    → fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('columns')
! no values were validated for columns!
! 765 terms not validated in feature 'columns' in slot 'var.T': 'HES4', 'TNFRSF4', 'SSU72', 'PARK7', 'RBP7', 'SRM', 'MAD2L2', 'AGTRAP', 'TNFRSF1B', 'EFHD2', 'NECAP2', 'HP1BP3', 'C1QA', 'C1QB', 'HNRNPR', 'GALE', 'STMN1', 'CD52', 'FGR', 'ATPIF1', ...
    → fix organism 'human', fix typos, remove non-existent values, or save terms via: curator.slots['var.T'].cat.add_new_from('columns')
! no values were validated for columns!
 returning schema with same hash: Schema(uid='0000000000000000', name='valid_features', description=None, is_type=False, itype='Feature', otype=None, dtype=None, hash='kMi7B_N88uu-YnbTLDU-DA', minimal_set=True, ordered_set=False, maximal_set=False, slot=None, branch_id=1, space_id=1, created_by_id=1, run_id=1, type_id=None, validated_by_id=None, composite_id=None, created_at=2025-11-26 11:06:22 UTC, is_locked=False)
 returning schema with same hash: Schema(uid='0000000000000001', name='valid_ensembl_gene_ids', description=None, is_type=False, itype='bionty.Gene.ensembl_gene_id', otype=None, dtype='num', hash='1gocc_TJ1RU2bMwDRK-WUA', minimal_set=True, ordered_set=False, maximal_set=False, slot=None, branch_id=1, space_id=1, created_by_id=1, run_id=1, type_id=None, validated_by_id=None, composite_id=None, created_at=2025-11-26 11:06:22 UTC, is_locked=False)

Let’s convert gene symbols to Ensembl IDs via standardize(). Note that this mapping is not unique: when a symbol matches several Ensembl IDs, the first match is kept because the keep parameter of .standardize() defaults to "first":

adata.var["ensembl_gene_id"] = bt.Gene.standardize(
    adata.var.index,
    field=bt.Gene.symbol,
    return_field=bt.Gene.ensembl_gene_id,
    organism="human",
)
# use ensembl_gene_id as the index
adata.var.index.name = "symbol"
adata.var = adata.var.reset_index().set_index("ensembl_gene_id")
! found 5 symbols in public source: ['GPX1', 'IGLL5', 'SNORD3B-2', 'RN7SL1', 'SOD2']
  please add corresponding Gene records via: `.from_values(['GPX1', 'IGLL5', 'SNORD3B-2', 'RN7SL1', 'SOD2'])`
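
To make the "first match is kept" behavior concrete, here is a minimal plain-Python sketch of the idea. It does not call bionty: the function name standardize_keep_first, the symbols, and the IDs are hypothetical placeholders. Passing unmapped symbols through unchanged is an assumption of the sketch, chosen to match the validation output further down, where a few symbols remain in the index.

```python
# Toy lookup table: one symbol can map to several gene IDs.
# The symbols and IDs are placeholders, not real Ensembl identifiers.
candidates = [
    ("GENE_A", "ENSG_TOY_0001"),
    ("GENE_A", "ENSG_TOY_0002"),  # second match for the same symbol
    ("GENE_B", "ENSG_TOY_0003"),
]


def standardize_keep_first(symbols, candidates):
    """Map each symbol to its first candidate ID.

    Symbols without any candidate are returned unchanged
    (an assumption of this sketch).
    """
    first_match = {}
    for symbol, gene_id in candidates:
        # setdefault stores only the first ID seen for each symbol
        first_match.setdefault(symbol, gene_id)
    return [first_match.get(symbol, symbol) for symbol in symbols]


result = standardize_keep_first(["GENE_A", "GENE_B", "GENE_C"], candidates)
print(result)
```

In this toy run, GENE_A resolves to the first of its two candidate IDs and GENE_C, which has no candidate, stays as it was.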

None of the cell type names in this dataset are valid:

adata.obs["cell_type_untrusted"].unique()
['Dendritic cells', 'CD19+ B', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD4+/CD25 T Reg', 'CD14+ Monocytes', 'CD56+ NK', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD34+']
Categories (9, object): ['CD4+/CD25 T Reg', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD8+/CD45RA+ Naive Cytotoxic', ..., 'CD19+ B', 'CD34+', 'CD56+ NK', 'Dendritic cells']

Let’s look up the non-validated cell types using the values of the public ontology and create a mapping.

cell_types = bt.CellType.public().lookup()
name_mapping = {
    "Dendritic cells": cell_types.dendritic_cell.name,
    "CD19+ B": cell_types.b_cell_cd19_positive.name,
    "CD4+/CD45RO+ Memory": cell_types.effector_memory_cd45ra_positive_alpha_beta_t_cell_terminally_differentiated.name,
    "CD8+ Cytotoxic T": cell_types.cd8_positive_alpha_beta_cytotoxic_t_cell.name,
    "CD4+/CD25 T Reg": cell_types.cd4_positive_cd25_positive_alpha_beta_regulatory_t_cell.name,
    "CD14+ Monocytes": cell_types.cd14_positive_monocyte.name,
    "CD56+ NK": cell_types.cd56_positive_cd161_positive_immature_natural_killer_cell_human.name,
    "CD8+/CD45RA+ Naive Cytotoxic": cell_types.cd8_positive_alpha_beta_memory_t_cell_cd45ro_positive.name,
    "CD34+": cell_types.cd34_positive_cd56_positive_cd117_positive_common_innate_lymphoid_precursor_human.name,
    "CD38-positive naive B cell": cell_types.cytotoxic_t_cell.name,
}

And standardize cell type names using this name mapping:

adata.obs["cell_type"] = adata.obs["cell_type_untrusted"].map(name_mapping)
adata.obs["cell_type"].unique()
['dendritic cell', 'B cell, CD19-positive', 'effector memory CD45RA-positive, alpha-beta T..., 'CD8-positive, alpha-beta cytotoxic T cell', 'CD4-positive, CD25-positive, alpha-beta regul..., 'CD14-positive monocyte', 'CD56-positive, CD161-positive immature natura..., 'CD8-positive, alpha-beta memory T cell, CD45R..., 'CD34-positive, CD56-positive, CD117-positive ...]
Categories (9, object): ['CD4-positive, CD25-positive, alpha-beta regul..., 'effector memory CD45RA-positive, alpha-beta T..., 'CD8-positive, alpha-beta cytotoxic T cell', 'CD8-positive, alpha-beta memory T cell, CD45R..., ..., 'B cell, CD19-positive', 'CD34-positive, CD56-positive, CD117-positive ..., 'CD56-positive, CD161-positive immature natura..., 'dendritic cell']
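
Because pandas' Series.map yields NaN for any value that is missing from the dictionary, it can be worth checking that the mapping covers every label before applying it. The following is a small sketch of such a check in plain Python; the helper name check_mapping_coverage is hypothetical, and the two lists are copied from the label output and the mapping keys shown above.

```python
# Labels as printed by adata.obs["cell_type_untrusted"].unique() above
observed_labels = [
    "Dendritic cells",
    "CD19+ B",
    "CD4+/CD45RO+ Memory",
    "CD8+ Cytotoxic T",
    "CD4+/CD25 T Reg",
    "CD14+ Monocytes",
    "CD56+ NK",
    "CD8+/CD45RA+ Naive Cytotoxic",
    "CD34+",
]

# Keys of name_mapping as defined above
mapping_keys = observed_labels + ["CD38-positive naive B cell"]


def check_mapping_coverage(labels, keys):
    """Return (labels without a mapping entry, mapping keys never observed)."""
    labels, keys = set(labels), set(keys)
    return sorted(labels - keys), sorted(keys - labels)


unmapped, unused = check_mapping_coverage(observed_labels, mapping_keys)
print("labels without a mapping:", unmapped)
print("mapping keys never observed:", unused)
```

For this dataset every observed label has an entry, so nothing would be mapped to NaN. The check also shows that one key, "CD38-positive naive B cell", does not occur among the nine labels and therefore has no effect on the result.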

Define the corresponding feature:

ln.Feature(name="cell_type", dtype=bt.CellType).save()
 returning feature with same name: 'cell_type'
Feature(uid='z7usF5ZebJ1W', name='cell_type', dtype='cat[bionty.CellType]', is_type=None, unit=None, description=None, array_rank=0, array_size=0, array_shape=None, proxy_dtype=None, synonyms=None, branch_id=1, space_id=1, created_by_id=1, run_id=1, type_id=None, created_at=2025-11-26 11:06:22 UTC, is_locked=False)

Save the artifact with cell type and gene annotations:

artifact_trusted = ln.Artifact.from_anndata(
    adata,
    key="scrna/dataset2.h5ad",
    description="10x reference adata, trusted cell type annotation",
    schema="ensembl_gene_ids_and_valid_features_in_obs",
).save()
artifact_trusted.describe()
 writing the in-memory object into cache
 creating new artifact version for key 'scrna/dataset2.h5ad' in storage '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna'
 loading artifact into memory for validation
! 4 terms not validated in feature 'columns' in slot 'obs': 'cell_type_untrusted', 'n_genes', 'percent_mito', 'louvain'
    → fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('columns')
! 11 terms not validated in feature 'columns' in slot 'var.T': 'RP11-782C8.1', 'RP11-277L2.3', 'RP11-156E8.1', 'RP3-467N11.1', 'RP11-390E23.6', 'RP11-489E7.4', 'RP11-291B21.2', 'RP11-620J15.3', 'TMBIM4-1', 'AC084018.1', 'CTD-3138B18.5'
    → fix organism 'human', fix typos, remove non-existent values, or save terms via: curator.slots['var.T'].cat.add_new_from('columns')
Artifact: scrna/dataset2.h5ad (0001)
|   description: 10x reference adata, trusted cell type annotation
├── uid: YPzTiZZiULn5ufBM0001            run: hzx93VM (scrna2.ipynb)
kind: dataset                        otype: AnnData             
hash: 57xpUqmnZeBP6et6xOCcPA         size: 835.8 KB             
branch: main                         space: all                 
created_at: 2025-11-26 11:07:03 UTC  created_by: testuser1      
n_observations: 70                                              
├── storage/path: 
/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/YPzTiZZiULn5ufBM0001.h5ad
├── Dataset features
├── obs (1)                                                                                                    
│   cell_type                       bionty.CellType                    B cell, CD19-positive, CD14-positive mo…
└── var.T (754 bionty.Gene.ensemb…                                                                             
    HES4                            num                                                                        
    TNFRSF4                         num                                                                        
    SSU72                           num                                                                        
    PARK7                           num                                                                        
    RBP7                            num                                                                        
    SRM                             num                                                                        
    MAD2L2                          num                                                                        
    AGTRAP                          num                                                                        
    TNFRSF1B                        num                                                                        
    EFHD2                           num                                                                        
    NECAP2                          num                                                                        
    HP1BP3                          num                                                                        
    C1QA                            num                                                                        
    C1QB                            num                                                                        
    HNRNPR                          num                                                                        
    GALE                            num                                                                        
    STMN1                           num                                                                        
    CD52                            num                                                                        
    FGR                             num                                                                        
    ATP5IF1                         num                                                                        
└── Labels
    └── .cell_types                     bionty.CellType                    CD8-positive, alpha-beta memory T cell,…
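
The output above reports 11 entries in var.T that were not validated, which is why the artifact is annotated with 754 of the 765 genes. These entries are symbols that were left in the index instead of Ensembl IDs. As a rough sketch of how such leftovers could be spotted before saving, the snippet below splits index values by whether they look like human Ensembl gene IDs ("ENSG" followed by 11 digits). The helper name split_by_id_format is hypothetical, the two ENSG values are format-valid placeholders rather than the IDs of this dataset, and the symbols are taken from the warning above.

```python
import re

# Human Ensembl gene IDs have the form "ENSG" followed by 11 digits.
ENSEMBL_HUMAN_GENE = re.compile(r"^ENSG\d{11}$")


def split_by_id_format(index_values):
    """Split values into Ensembl-like IDs and everything else."""
    ids, leftovers = [], []
    for value in index_values:
        if ENSEMBL_HUMAN_GENE.match(value):
            ids.append(value)
        else:
            leftovers.append(value)
    return ids, leftovers


# Placeholder IDs mixed with symbols from the validation warning above
var_index = [
    "ENSG00000000001",  # placeholder, only the format matters here
    "RP11-782C8.1",
    "ENSG00000000002",  # placeholder
    "TMBIM4-1",
    "AC084018.1",
]

ids, leftovers = split_by_id_format(var_index)
print(f"{len(ids)} Ensembl-like IDs, {len(leftovers)} leftovers: {leftovers}")
```

On a real dataset, the same function could be applied to list(adata.var.index). A format check like this only flags values that are not shaped like Ensembl IDs; it does not replace validation against the gene registry.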

Query the previous collection:

collection_v1 = ln.Collection.get(key="scrna/collection1")

Create a new version of the collection by appending the new artifact. The new version is sharded across the new artifact and the artifact underlying version 1 of the collection:

collection_v2 = collection_v1.append(artifact_trusted).save()

View the data lineage:

collection_v2.view_lineage()
_images/4c4ee76b91288637e31a0431369f295b2fab3535c6d32f43c0d147617f729d49.svg