Standardize metadata on-the-fly

This use case runs on a LaminDB instance with populated CellType and Pathway registries. Make sure you run the GO Ontology notebook before executing this use case.

Here, we demonstrate how to standardize the metadata on-the-fly during cell type annotation and pathway enrichment analysis using these two registries.

For more information, see:

!lamin connect use-cases-registries
 connected lamindb: testuser1/use-cases-registries
 to map a local dev directory, call: lamin settings set dev-dir .
import lamindb as ln
import bionty as bt
from lamin_usecases import datasets as ds
import scanpy as sc
import matplotlib.pyplot as plt
import celltypist
import gseapy as gp
 connected lamindb: testuser1/use-cases-registries
/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/celltypist/classifier.py:11: FutureWarning: `__version__` is deprecated, use `importlib.metadata.version('scanpy')` instead
  from scanpy import __version__ as scv
ln.track("hsPU1OENv0LS0000")
 created Transform('hsPU1OENv0LS0000', key='analysis-registries.ipynb'), started new Run('VvoZmBdG5xXzIVOs') at 2025-12-14 22:43:53 UTC
 notebook imports: bionty==1.10.0 celltypist==1.7.1 gseapy==1.1.11 lamin_usecases==0.0.1 lamindb==1.17.0 matplotlib==3.10.8 scanpy==1.11.5

An interferon-beta treated dataset

This is a small peripheral blood mononuclear cell (PBMC) dataset that is split into control and stimulated groups. The stimulated group was treated with interferon beta.

Let’s load the dataset and perform some preprocessing:

adata = ds.anndata_seurat_ifnb(preprocess=False, populate_registries=True)
adata
AnnData object with n_obs × n_vars = 13999 × 9943
    obs: 'stim'
    var: 'symbol'
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata, n_comps=20)
sc.pp.neighbors(adata, n_pcs=10)
sc.tl.umap(adata)
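
The first two preprocessing calls above rescale each cell to a common total count and then apply a log transform. The following is a minimal pure-Python sketch of that idea on a toy count matrix. The helper names and the toy values are illustrative assumptions, not scanpy's implementation, which works on array and sparse data and offers further options.

```python
import math


def normalize_total(counts, target_sum=1e4):
    """Scale each cell (row) so that its counts sum to target_sum."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        # Cells without any counts are left at zero to avoid dividing by zero.
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([value * scale for value in cell])
    return normalized


def log1p_transform(matrix):
    """Apply log(1 + x) to every entry."""
    return [[math.log1p(value) for value in row] for row in matrix]


# Toy matrix (hypothetical): 2 cells x 3 genes of raw counts.
toy_counts = [[1, 3, 0], [10, 0, 10]]
toy_normalized = normalize_total(toy_counts)
toy_logged = log1p_transform(toy_normalized)
print(toy_normalized)
print(toy_logged)
```

After normalization every non-empty cell has the same total, so the log-transformed values are comparable across cells with different sequencing depth.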

Analysis: cell type annotation using CellTypist

model = celltypist.models.Model.load(model="Immune_All_Low.pkl")
🔎 No available models. Downloading...
📜 Retrieving model list from server https://celltypist.cog.sanger.ac.uk/models/models.json
📚 Total models in list: 60
📂 Storing models in /home/runner/.celltypist/data/models
💾 Downloading model [1/60]: Immune_All_Low.pkl
💾 Downloading model [2/60]: Immune_All_High.pkl
💾 Downloading model [3/60]: Adult_COVID19_PBMC.pkl
💾 Downloading model [4/60]: Adult_CynomolgusMacaque_Hippocampus.pkl
💾 Downloading model [5/60]: Adult_Human_MTG.pkl
💾 Downloading model [6/60]: Adult_Human_PancreaticIslet.pkl
💾 Downloading model [7/60]: Adult_Human_PrefrontalCortex.pkl
💾 Downloading model [8/60]: Adult_Human_Skin.pkl
💾 Downloading model [9/60]: Adult_Human_Vascular.pkl
💾 Downloading model [10/60]: Adult_Mouse_Gut.pkl
💾 Downloading model [11/60]: Adult_Mouse_OlfactoryBulb.pkl
💾 Downloading model [12/60]: Adult_Pig_Hippocampus.pkl
💾 Downloading model [13/60]: Adult_RhesusMacaque_Hippocampus.pkl
💾 Downloading model [14/60]: Adult_cHSPCs_Illumina.pkl
💾 Downloading model [15/60]: Adult_cHSPCs_Ultima.pkl
💾 Downloading model [16/60]: Autopsy_COVID19_Lung.pkl
💾 Downloading model [17/60]: COVID19_HumanChallenge_Blood.pkl
💾 Downloading model [18/60]: COVID19_Immune_Landscape.pkl
💾 Downloading model [19/60]: Cells_Adult_Breast.pkl
💾 Downloading model [20/60]: Cells_Fetal_Lung.pkl
💾 Downloading model [21/60]: Cells_Human_Tonsil.pkl
💾 Downloading model [22/60]: Cells_Intestinal_Tract.pkl
💾 Downloading model [23/60]: Cells_Lung_Airway.pkl
💾 Downloading model [24/60]: Developing_Human_Brain.pkl
💾 Downloading model [25/60]: Developing_Human_Gonads.pkl
💾 Downloading model [26/60]: Developing_Human_Hippocampus.pkl
💾 Downloading model [27/60]: Developing_Human_Organs.pkl
💾 Downloading model [28/60]: Developing_Human_Thymus.pkl
💾 Downloading model [29/60]: Developing_Mouse_Brain.pkl
💾 Downloading model [30/60]: Developing_Mouse_Hippocampus.pkl
💾 Downloading model [31/60]: Fetal_Human_AdrenalGlands.pkl
💾 Downloading model [32/60]: Fetal_Human_Pancreas.pkl
💾 Downloading model [33/60]: Fetal_Human_Pituitary.pkl
💾 Downloading model [34/60]: Fetal_Human_Retina.pkl
💾 Downloading model [35/60]: Fetal_Human_Skin.pkl
💾 Downloading model [36/60]: Healthy_Adult_Heart.pkl
💾 Downloading model [37/60]: Healthy_COVID19_PBMC.pkl
💾 Downloading model [38/60]: Healthy_Human_Liver.pkl
💾 Downloading model [39/60]: Healthy_Mouse_Liver.pkl
💾 Downloading model [40/60]: Human_AdultAged_Hippocampus.pkl
💾 Downloading model [41/60]: Human_Colorectal_Cancer.pkl
💾 Downloading model [42/60]: Human_Developmental_Retina.pkl
💾 Downloading model [43/60]: Human_Embryonic_YolkSac.pkl
💾 Downloading model [44/60]: Human_Endometrium_Atlas.pkl
💾 Downloading model [45/60]: Human_IPF_Lung.pkl
💾 Downloading model [46/60]: Human_Longitudinal_Hippocampus.pkl
💾 Downloading model [47/60]: Human_Lung_Atlas.pkl
💾 Downloading model [48/60]: Human_PF_Lung.pkl
💾 Downloading model [49/60]: Human_Placenta_Decidua.pkl
💾 Downloading model [50/60]: Lethal_COVID19_Lung.pkl
💾 Downloading model [51/60]: Mouse_Dendritic_Subtypes.pkl
💾 Downloading model [52/60]: Mouse_Dentate_Gyrus.pkl
💾 Downloading model [53/60]: Mouse_Isocortex_Hippocampus.pkl
💾 Downloading model [54/60]: Mouse_Postnatal_DentateGyrus.pkl
💾 Downloading model [55/60]: Mouse_Whole_Brain.pkl
💾 Downloading model [56/60]: Nuclei_Human_InnerEar.pkl
💾 Downloading model [57/60]: Nuclei_Lung_Airway.pkl
💾 Downloading model [58/60]: PaediatricAdult_COVID19_Airway.pkl
💾 Downloading model [59/60]: PaediatricAdult_COVID19_PBMC.pkl
💾 Downloading model [60/60]: Pan_Fetal_Human.pkl
predictions = celltypist.annotate(
    adata, model="Immune_All_Low.pkl", majority_voting=True
)
adata.obs["cell_type_celltypist"] = predictions.predicted_labels.majority_voting
🔬 Input data has 13999 cells and 9943 genes
🔗 Matching reference genes in the model
🧬 3701 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it
⛓️ Over-clustering input data with resolution set to 10
🗳️ Majority voting the predictions
✅ Majority voting done!
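
The log above shows that the per-cell predictions are refined by majority voting within over-clustered groups. A minimal sketch of that idea, assuming each cell already has a predicted label and a cluster assignment, could look as follows. The function name and toy inputs are hypothetical, and CellTypist's own implementation has additional options that are not modeled here.

```python
from collections import Counter, defaultdict


def majority_vote(predicted_labels, clusters):
    """Assign every cell the most frequent predicted label of its cluster."""
    labels_by_cluster = defaultdict(list)
    for label, cluster in zip(predicted_labels, clusters):
        labels_by_cluster[cluster].append(label)
    # The winning label of each cluster is the one predicted most often in it.
    winner = {
        cluster: Counter(labels).most_common(1)[0][0]
        for cluster, labels in labels_by_cluster.items()
    }
    return [winner[cluster] for cluster in clusters]


# Toy input (hypothetical): six cells in two clusters.
toy_predictions = ["B cells", "B cells", "T cells", "T cells", "T cells", "NK cells"]
toy_clusters = [0, 0, 0, 1, 1, 1]
voted = majority_vote(toy_predictions, toy_clusters)
print(voted)
```

In this toy case the single "T cells" call in cluster 0 and the single "NK cells" call in cluster 1 are overruled by the dominant label of their cluster.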
adata.obs["cell_type_celltypist"] = bt.CellType.standardize(
    adata.obs["cell_type_celltypist"]
)
sc.pl.umap(
    adata,
    color=["cell_type_celltypist", "stim"],
    frameon=False,
    legend_fontsize=10,
    wspace=0.4,
)
... storing 'cell_type_celltypist' as categorical
[Figure: UMAP embedding colored by cell_type_celltypist and stim]
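
Conceptually, the standardization step above maps each predicted label to the canonical name held in the registry. The sketch below illustrates this with a small hypothetical synonym table. It is not the bionty implementation, and the names and synonyms are assumptions chosen for illustration.

```python
# Hypothetical registry content: canonical name -> known synonyms.
CANONICAL_SYNONYMS = {
    "B cell": ["B cells", "B lymphocyte", "B-cell"],
    "natural killer cell": ["NK cells", "NK cell"],
}


def build_mapper(canonical_synonyms):
    """Map each lowercased synonym (and canonical name) to its canonical name."""
    mapper = {}
    for name, synonyms in canonical_synonyms.items():
        mapper[name.lower()] = name
        for synonym in synonyms:
            mapper[synonym.lower()] = name
    return mapper


def standardize(values, mapper):
    """Replace known synonyms with canonical names; keep unknown values as they are."""
    return [mapper.get(value.lower(), value) for value in values]


labels = ["B cells", "NK cells", "B cell", "Unknown type"]
standardized = standardize(labels, build_mapper(CANONICAL_SYNONYMS))
print(standardized)
```

Values that match no synonym pass through unchanged in this sketch, so they stay visible and can be curated separately.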

Analysis: Pathway enrichment analysis using Enrichr

This analysis is based on the GSEApy scRNA-seq Example.

First, we compute differentially expressed genes using a Wilcoxon test between stimulated and control cells.

# compute differentially expressed genes
sc.tl.rank_genes_groups(
    adata,
    groupby="stim",
    use_raw=False,
    method="wilcoxon",
    groups=["STIM"],
    reference="CTRL",
)

rank_genes_groups_df = sc.get.rank_genes_groups_df(adata, "STIM")
rank_genes_groups_df.head()
   names     scores  logfoldchanges  pvals  pvals_adj
0  ISG15  99.456596        7.132836    0.0        0.0
1  ISG20  96.736671        5.074293    0.0        0.0
2   IFI6  94.972832        5.828898    0.0        0.0
3  IFIT3  92.482513        7.432556    0.0        0.0
4  IFIT1  90.699203        8.053666    0.0        0.0
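
To build intuition for the scores column, here is a minimal sketch of a Wilcoxon rank-sum z-score for a single gene, using the normal approximation and average ranks for ties. The helper names and toy values are assumptions, and scanpy's implementation may differ in details such as tie correction.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # average rank of the tie block
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def rank_sum_z(group, reference):
    """Z-score of the rank sum of `group` under the null of no difference."""
    n1, n2 = len(group), len(reference)
    ranks = average_ranks(list(group) + list(reference))
    rank_sum = sum(ranks[:n1])
    expected = n1 * (n1 + n2 + 1) / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (rank_sum - expected) / std


# Toy expression values (hypothetical) for one gene in stimulated vs control cells.
z_up = rank_sum_z([5.0, 6.0, 7.0], [1.0, 2.0, 3.0])
z_down = rank_sum_z([1.0, 2.0, 3.0], [5.0, 6.0, 7.0])
print(z_up, z_down)
```

A large positive score means the gene tends to rank higher in the stimulated group than in the reference, and swapping the two groups flips the sign.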

Next, we select the significantly up- and down-regulated genes (adjusted p-value < 0.05):

degs_up = rank_genes_groups_df[
    (rank_genes_groups_df["logfoldchanges"] > 0)
    & (rank_genes_groups_df["pvals_adj"] < 0.05)
]
degs_dw = rank_genes_groups_df[
    (rank_genes_groups_df["logfoldchanges"] < 0)
    & (rank_genes_groups_df["pvals_adj"] < 0.05)
]

degs_up.shape, degs_dw.shape
((541, 5), (936, 5))

Run pathway enrichment analysis on DEGs and plot top 10 pathways:

enr_up = gp.enrichr(degs_up.names, gene_sets="GO_Biological_Process_2023").res2d
gp.dotplot(enr_up, figsize=(2, 3), title="Up", cmap=plt.cm.autumn_r);
[Figure: dot plot of the top enriched GO Biological Process terms for up-regulated genes, titled "Up"]
enr_dw = gp.enrichr(degs_dw.names, gene_sets="GO_Biological_Process_2023").res2d
gp.dotplot(enr_dw, figsize=(2, 3), title="Down", cmap=plt.cm.winter_r);
[Figure: dot plot of the top enriched GO Biological Process terms for down-regulated genes, titled "Down"]
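
Over-representation analyses of this kind are commonly based on a Fisher exact or hypergeometric test: given a background of genes, how likely is it to see at least the observed overlap between the query list and a pathway's gene set by chance? The following is a minimal sketch of that tail probability. The function name and the numbers are hypothetical, and the Enrichr service may use a different background and additional scoring.

```python
from math import comb


def hypergeom_enrichment_pvalue(overlap, gene_set_size, query_size, background_size):
    """P(X >= overlap) when drawing query_size genes without replacement from
    background_size genes, of which gene_set_size belong to the pathway."""
    max_overlap = min(gene_set_size, query_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(gene_set_size, k) * comb(background_size - gene_set_size, query_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 10 background genes, 5 in the pathway, 5 in the query.
p_full_overlap = hypergeom_enrichment_pvalue(5, 5, 5, 10)
p_no_overlap = hypergeom_enrichment_pvalue(0, 5, 5, 10)
print(p_full_overlap, p_no_overlap)
```

A small p-value indicates that the overlap is larger than expected by chance. With many gene sets tested, such p-values are usually adjusted for multiple testing before ranking pathways.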

Annotate & save dataset

Register new features and labels (check out more details here):

new_features = ln.Feature.from_dataframe(adata.obs)
ln.save(new_features)
new_labels = [ln.ULabel(name=i) for i in adata.obs["stim"].unique()]
ln.save(new_labels)
! You have few permissible values for feature stim, consider dtype 'cat' instead of 'str'
! You have few permissible values for feature cell_type_celltypist, consider dtype 'cat' instead of 'str'
! rather than passing a string 'cat' to dtype, pass a Python object
! rather than passing a string 'cat' to dtype, pass a Python object
features = ln.Feature.lookup()

Register the dataset using an Artifact object:

artifact = ln.Artifact.from_anndata(
    adata,
    description="seurat_ifnb_activated_Bcells",
).save()
 writing the in-memory object into cache
# TODO: rewrite based on ln.Curator.from_anndata()
# artifact.features._add_set_from_anndata(
#     var_field=bt.Gene.symbol,
#     organism="human",  # optionally, globally set organism via bt.settings.organism = "human"
# )
# cell_type_records = bt.CellType.from_values(adata.obs["cell_type_celltypist"])
# artifact.labels.add(cell_type_records, features.cell_type_celltypist)
# stim_records = ln.ULabel.from_values(adata.obs["stim"])
# artifact.labels.add(stim_records, features.stim)

Querying pathways

Query for pathways that contain “interferon-beta” in the name:

bt.Pathway.filter(name__contains="interferon-beta").to_dataframe()
uid name ontology_id abbr synonyms description is_locked created_at branch_id space_id created_by_id run_id source_id
id
4953 3VZq4dMe response to interferon-beta GO:0035456 None response to fiblaferon|response to fibroblast ... Any Process That Results In A Change In State ... False 2025-12-14 22:43:08.909000+00:00 1 1 2 None 58
4334 54R2a0el regulation of interferon-beta production GO:0032648 None regulation of IFN-beta production Any Process That Modulates The Frequency, Rate... False 2025-12-14 22:43:08.854000+00:00 1 1 2 None 58
3127 3x0xmK1y positive regulation of interferon-beta production GO:0032728 None up-regulation of interferon-beta production|up... Any Process That Activates Or Increases The Fr... False 2025-12-14 22:43:08.754000+00:00 1 1 2 None 58
2130 1NzHDJDi negative regulation of interferon-beta production GO:0032688 None down regulation of interferon-beta production|... Any Process That Stops, Prevents, Or Reduces T... False 2025-12-14 22:43:08.669000+00:00 1 1 2 None 58
684 1l4z0v8W cellular response to interferon-beta GO:0035458 None cellular response to fiblaferon|cellular respo... Any Process That Results In A Change In State ... False 2025-12-14 22:43:08.539000+00:00 1 1 2 None 58

Query pathways from a gene:

bt.Pathway.filter(genes__symbol="KIR2DL1").to_dataframe()
uid name ontology_id abbr synonyms description is_locked created_at branch_id space_id created_by_id run_id source_id
id
1346 7S7qlEkG immune response-inhibiting cell surface recept... GO:0002767 None immune response-inhibiting cell surface recept... The Series Of Molecular Signals Initiated By A... False 2025-12-14 22:43:08.592000+00:00 1 1 2 None 58

Query artifacts from a pathway:

ln.Artifact.filter(feature_sets__pathways__name__icontains="interferon-beta").first()
Artifact(uid='PuhSA2U4mw8WfvJy0000', version=None, is_latest=True, key=None, description='seurat_ifnb_activated_Bcells', suffix='.h5ad', kind='dataset', otype='AnnData', size=214912657, hash='sf5XLKBNQ_8vQkKcy_Q04k', n_files=None, n_observations=13999, branch_id=1, space_id=1, storage_id=2, run_id=1, schema_id=None, created_by_id=2, created_at=2025-12-14 22:47:30 UTC, is_locked=False)

Query feature sets from a pathway to learn which gene set this pathway was computed from:

pathway = bt.Pathway.get(ontology_id="GO:0035456")
pathway
Pathway(uid='3VZq4dMe', name='response to interferon-beta', ontology_id='GO:0035456', abbr=None, synonyms='response to fiblaferon|response to fibroblast interferon|response to interferon beta', description='Any Process That Results In A Change In State Or Activity Of A Cell Or An Organism (In Terms Of Movement, Secretion, Enzyme Production, Gene Expression, Etc.) As A Result Of An Interferon-Beta Stimulus. Interferon-Beta Is A Type I Interferon.', branch_id=1, space_id=1, created_by_id=2, run_id=None, source_id=58, created_at=2025-12-14 22:43:08 UTC, is_locked=False)
degs = ln.FeatureSet.get(pathways__ontology_id=pathway.ontology_id)

Now we can get the list of genes that are differentially expressed and belong to this pathway:

contributing_genes = pathway.genes.all() & degs.genes.all()
contributing_genes.to_list("symbol")
['PLSCR1',
 'MNDA',
 'IFITM1',
 'IFITM2',
 'PNPT1',
 'AIM2',
 'IFITM3',
 'IRF1',
 'BST2',
 'OAS1',
 'STAT1',
 'CALM1',
 'XAF1',
 'IFI16',
 'SHFL']
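
The `&` in the query above keeps only the genes present in both query sets. The same logic can be sketched with plain Python sets. The gene memberships below are toy assumptions for illustration and do not reproduce the registry content.

```python
# Hypothetical memberships, for illustration only.
pathway_genes = {"STAT1", "OAS1", "BST2", "GENE_X"}  # genes annotated to the pathway
deg_genes = {"STAT1", "OAS1", "ISG15", "BST2"}  # differentially expressed genes

# Intersection: genes that are both in the pathway and differentially expressed.
contributing = sorted(pathway_genes & deg_genes)
print(contributing)
```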
# clean up test instance
!lamin delete --force use-cases-registries
!rm -r ./use-cases-registries
╭─ Error ──────────────────────────────────────────────────────────────────────╮
 '/home/runner/work/lamin-usecases/lamin-usecases/docs/use-cases-registries/. 
 lamindb' contains 1 objects:                                                 
 /home/runner/work/lamin-usecases/lamin-usecases/docs/use-cases-registries/.l 
 amindb/PuhSA2U4mw8WfvJy0000.h5ad                                             
 delete them prior to deleting the storage location                           
╰──────────────────────────────────────────────────────────────────────────────╯