
CellTypist

Cell types classify cells based on public and private knowledge from studying transcription, morphology, function, and other properties. Established cell types have well-characterized markers and properties; however, cell subtypes and states are continuously being discovered, refined, and better understood.

In this notebook, we register the immune cell type vocabulary from CellTypist, a computational tool used for cell type classification in scRNA-seq data.

In the following Standardize metadata on-the-fly notebook, we’ll demonstrate how to curate datasets analyzed with CellTypist enrichment analysis and track the dataset with LaminDB.

# pip install 'lamindb[jupyter,bionty]'
!lamin load use-cases-registries
Entity has to be a laminhub URL or 'artifact' or 'transform'
# filter warnings from celltypist
import warnings

warnings.filterwarnings("ignore", message=".*The 'nopython' keyword.*")
import lamindb as ln
import bionty as bt
 connected lamindb: testuser1/use-cases-registries

Access CellTypist records

As a first step, we read in CellTypist’s immune cell encyclopedia:

import pandas as pd

description = "CellTypist Pan Immune Atlas v2: basic cell type information"
celltypist_source_v2_url = "https://github.com/Teichlab/celltypist_wiki/raw/main/atlases/Pan_Immune_CellTypist/v2/tables/Basic_celltype_information.xlsx"

celltypist_df = pd.read_excel(celltypist_source_v2_url)

It provides an ontology_id of the public Cell Ontology for the majority of records.

celltypist_df.head()
   High-hierarchy cell types  Low-hierarchy cell types               Description                                         Cell Ontology ID  Curated markers
0  B cells                    B cells                                B lymphocytes with diverse cell surface immuno...   CL:0000236        CD79A, MS4A1, CD19
1  B cells                    Follicular B cells                     resting mature B lymphocytes found in the prim...   CL:0000843        CXCR5, TNFRSF13B, CD22
2  B cells                    Proliferative germinal center B cells  proliferating germinal center B cells               CL:0000844        MKI67, SUGCT, AICDA
3  B cells                    Germinal center B cells                proliferating mature B cells that undergo soma...   CL:0000844        POU2AF1, CD40, SUGCT
4  B cells                    Memory B cells                         long-lived mature B lymphocytes which are form...   CL:0000787        CR2, CD27, MS4A1
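
Before validating against the ontology, a quick format check can catch malformed identifiers. The following is a minimal sketch, assuming Cell Ontology IDs follow the pattern seen in the rows above (`CL:` followed by seven digits). The helper name `is_cl_id` and the two malformed examples are hypothetical and not part of CellTypist or bionty.

```python
import re

# Assumption: Cell Ontology IDs look like "CL:" followed by exactly seven digits,
# as in the rows shown above.
CL_ID_PATTERN = re.compile(r"CL:\d{7}")


def is_cl_id(value):
    """Return True if value is a string shaped like a Cell Ontology ID."""
    return isinstance(value, str) and CL_ID_PATTERN.fullmatch(value) is not None


# Three IDs copied from the table, plus two hypothetical malformed entries.
ids = ["CL:0000236", "CL:0000843", "CL:0000844", "CL_0000787", None]
malformed = [i for i in ids if not is_cl_id(i)]
print(malformed)
```

A check like this only tests the shape of an identifier; whether the ID actually exists in the ontology is what the validation step later in this notebook establishes.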

A single “Cell Ontology ID” can be associated with multiple “Low-hierarchy cell types”:

celltypist_df.set_index(["Cell Ontology ID", "Low-hierarchy cell types"]).head(10)
Cell Ontology ID  Low-hierarchy cell types               High-hierarchy cell types  Description                                         Curated markers
CL:0000236        B cells                                B cells                    B lymphocytes with diverse cell surface immuno...   CD79A, MS4A1, CD19
CL:0000843        Follicular B cells                     B cells                    resting mature B lymphocytes found in the prim...   CXCR5, TNFRSF13B, CD22
CL:0000844        Proliferative germinal center B cells  B cells                    proliferating germinal center B cells               MKI67, SUGCT, AICDA
                  Germinal center B cells                B cells                    proliferating mature B cells that undergo soma...   POU2AF1, CD40, SUGCT
CL:0000787        Memory B cells                         B cells                    long-lived mature B lymphocytes which are form...   CR2, CD27, MS4A1
                  Age-associated B cells                 B cells                    CD11c+ T-bet+ memory B cells associated with a...   FCRL2, ITGAX, TBX21
CL:0000788        Naive B cells                          B cells                    mature B lymphocytes which express cell-surfac...   IGHM, IGHD, TCL1A
CL:0000818        Transitional B cells                   B cells                    immature B cell precursors in the bone marrow ...   CD24, MYO1C, MS4A1
CL:0000817        Large pre-B cells                      B-cell lineage             proliferative B lymphocyte precursors derived ...   MME, CD24, MKI67
                  Small pre-B cells                      B-cell lineage             non-proliferative B lymphocyte precursors deri...   MME, CD24, IGLL5
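
To make the one-to-many relationship explicit, the labels can be grouped by ontology ID. This is a small plain-Python sketch that uses only the ten (ID, label) pairs visible in the table above rather than the full DataFrame.

```python
from collections import defaultdict

# (Cell Ontology ID, low-hierarchy cell type) pairs copied from the table above.
pairs = [
    ("CL:0000236", "B cells"),
    ("CL:0000843", "Follicular B cells"),
    ("CL:0000844", "Proliferative germinal center B cells"),
    ("CL:0000844", "Germinal center B cells"),
    ("CL:0000787", "Memory B cells"),
    ("CL:0000787", "Age-associated B cells"),
    ("CL:0000788", "Naive B cells"),
    ("CL:0000818", "Transitional B cells"),
    ("CL:0000817", "Large pre-B cells"),
    ("CL:0000817", "Small pre-B cells"),
]

names_by_id = defaultdict(list)
for ontology_id, name in pairs:
    names_by_id[ontology_id].append(name)

# Keep only the IDs that are shared by more than one CellTypist label.
shared = {k: v for k, v in names_by_id.items() if len(v) > 1}
for ontology_id, names in shared.items():
    print(ontology_id, names)
```

On the full table, `celltypist_df.groupby("Cell Ontology ID")["Low-hierarchy cell types"].nunique()` should give the corresponding counts per ID.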

Validate CellTypist records

Any cell type record that can be validated against the public Cell Ontology should actually be validated.

This prevents us from referring to the same cell type with different identifiers.

We need a Bionty object for this:

bionty = bt.CellType.public()
bionty
PublicOntology
Entity: CellType
Organism: all
Source: cl, 2024-08-16
#terms: 2959

We can now validate the "Cell Ontology ID" column:

bionty.inspect(celltypist_df["Cell Ontology ID"], bionty.ontology_id);

This looks good!

But when inspecting the names, most of them don’t validate:

bionty.inspect(celltypist_df["Low-hierarchy cell types"], bionty.name);
! 97 unique terms (99.00%) are not validated for name: 'B cells', 'Follicular B cells', 'Proliferative germinal center B cells', 'Germinal center B cells', 'Memory B cells', 'Age-associated B cells', 'Naive B cells', 'Transitional B cells', 'Large pre-B cells', 'Small pre-B cells', ...
   detected 6 unique terms with synonyms: DC1, DC2, ETP, ILC2, ILC3, pDC
→  standardize terms via .standardize()
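
Conceptually, this kind of inspection is a membership test of the unique input values against a reference vocabulary. The sketch below illustrates that idea only; `inspect_terms` is a hypothetical helper, not the bionty implementation, and the reference set is a three-name toy subset of the Cell Ontology.

```python
def inspect_terms(values, reference):
    """Split unique values into validated and non-validated against a reference set."""
    unique = list(dict.fromkeys(values))  # de-duplicate, keep first-seen order
    validated = [v for v in unique if v in reference]
    non_validated = [v for v in unique if v not in reference]
    pct = 100 * len(non_validated) / len(unique) if unique else 0.0
    return validated, non_validated, pct


# Toy reference: three singular names as they appear in the Cell Ontology.
reference_names = {"B cell", "memory B cell", "follicular B cell"}
labels = ["B cells", "Memory B cells", "B cell", "B cells"]

validated, non_validated, pct = inspect_terms(labels, reference_names)
print(validated, non_validated, f"{pct:.2f}%")
```

Because the comparison is an exact string match, plural forms and differently cased names count as non-validated, which is the situation reported above.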

A search tells us that terms named in the plural in CellTypist occur with a singular name in the Cell Ontology:

celltypist_df["Low-hierarchy cell types"][0]
'B cells'
bionty.search(celltypist_df["Low-hierarchy cell types"][0]).head(2)
name definition synonyms parents __agg__
ontology_id
CL:0000156 obsolete antibody secreting cell Obsolete: A Cell Of The Lymphoid Series That C... None [] obsolete antibody secreting cell
CL:0000432 reticular cell A Fibroblast That Synthesizes Collagen And Use... reticulum cell [CL:0000057] reticular cell

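Let’s strip the trailing "s" and check whether more names validate. They do: the share of non-validated terms drops, and many of the remaining mismatches are flagged as casing or synonym differences.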

bionty.inspect(
    [i.rstrip("s") for i in celltypist_df["Low-hierarchy cell types"]],
    bionty.name,
);
! 93 unique terms (94.90%) are not validated for name: 'Follicular B cell', 'Proliferative germinal center B cell', 'Germinal center B cell', 'Memory B cell', 'Age-associated B cell', 'Naive B cell', 'Transitional B cell', 'Large pre-B cell', 'Small pre-B cell', 'Pre-pro-B cell', ...
   detected 35 unique terms with inconsistent casing/synonyms: Follicular B cell, Germinal center B cell, Memory B cell, Naive B cell, Transitional B cell, Small pre-B cell, Pro-B cell, Cycling B cell, Cycling gamma-delta T cell, Cycling monocyte, ...
→  standardize terms via .standardize()
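
The message above points to casing as a remaining source of mismatches. As a rough illustration, the sketch below removes a single trailing "s" and compares names case-insensitively. The helpers `singularize` and `match_name` are hypothetical, the reference set is a toy subset, and in the notebook itself `.standardize()` is the suggested route.

```python
def singularize(label):
    """Drop a single trailing 's' (unlike rstrip, which drops every trailing 's')."""
    return label.removesuffix("s")  # requires Python 3.9+


def match_name(label, reference):
    """Return the reference name that equals the singularized label, ignoring case."""
    by_folded = {name.casefold(): name for name in reference}
    return by_folded.get(singularize(label).casefold())


# Toy reference: three names as they appear in the Cell Ontology.
reference_names = {"B cell", "memory B cell", "follicular B cell"}

print(match_name("Follicular B cells", reference_names))
print(match_name("Memory B cells", reference_names))
print(match_name("Age-associated B cells", reference_names))
```

Labels such as "Age-associated B cells" have no counterpart by name at all, which is why matching on the ontology ID is the more reliable path.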

Every “low-hierarchy cell type” has an ontology ID, and most “high-hierarchy cell types” also appear as “low-hierarchy cell types” in the CellTypist table. Four, however, do not, and therefore have no ontology ID.

high_terms = celltypist_df["High-hierarchy cell types"].unique()
low_terms = celltypist_df["Low-hierarchy cell types"].unique()

high_terms_nonval = set(high_terms).difference(low_terms)
high_terms_nonval
{'B-cell lineage', 'Cycling cells', 'Erythroid', 'T cells'}

Register CellTypist records

Let’s first add the “High-hierarchy cell types” as a column "parent".

This enables LaminDB to populate the parents and children fields, which will enable you to query for hierarchical relationships.

celltypist_df["parent"] = celltypist_df.pop("High-hierarchy cell types")

# if high and low terms are the same, no parents
celltypist_df.loc[
    (celltypist_df["parent"] == celltypist_df["Low-hierarchy cell types"]), "parent"
] = None

# rename columns, drop markers
celltypist_df.drop(columns=["Curated markers"], inplace=True)
celltypist_df.rename(
    columns={"Low-hierarchy cell types": "ct_name", "Cell Ontology ID": "ontology_id"},
    inplace=True,
)
celltypist_df.columns = celltypist_df.columns.str.lower()

# add the standardized name for each ontology_id
celltypist_df["name"] = bionty.df().loc[celltypist_df["ontology_id"]].name.values
celltypist_df.head(2)
   ct_name             description                                         ontology_id  parent   name
0  B cells             B lymphocytes with diverse cell surface immuno...   CL:0000236   None     B cell
1  Follicular B cells  resting mature B lymphocytes found in the prim...   CL:0000843   B cells  follicular B cell
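
The parent assignment above can be sketched in plain Python: a term whose high- and low-hierarchy labels coincide gets no parent, and inverting the mapping yields the children of each parent. This is only an illustration on a few rows copied from the CellTypist table; it does not show how LaminDB stores these relationships.

```python
from collections import defaultdict

# (high-hierarchy, low-hierarchy) pairs copied from the CellTypist table above.
rows = [
    ("B cells", "B cells"),
    ("B cells", "Follicular B cells"),
    ("B cells", "Memory B cells"),
    ("B-cell lineage", "Large pre-B cells"),
    ("B-cell lineage", "Small pre-B cells"),
]

# A term whose high- and low-hierarchy labels coincide is top-level: no parent.
parent_of = {low: (None if high == low else high) for high, low in rows}

# Invert the mapping to look up the children of each parent.
children_of = defaultdict(list)
for child, parent in parent_of.items():
    if parent is not None:
        children_of[parent].append(child)

print(parent_of["B cells"])
print(children_of["B cells"])
```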

Now, let’s create records from the public ontology:

public_records = bt.CellType.from_values(
    celltypist_df.ontology_id, bt.CellType.ontology_id
)
ln.save(public_records)

Let’s now amend the public ontology records so that they keep any additional annotations that CellTypist provides.

from lamindb.core.exceptions import ValidationError

public_records_dict = {r.ontology_id: r for r in public_records}

for _, row in celltypist_df.iterrows():
    record = public_records_dict[row["ontology_id"]]
    try:
        record.add_synonym(row["ct_name"])
    except ValidationError:  # do nothing if the synonym already exists as a record
        pass
✗ input synonyms ['DC2'] already associated with the following records:
created_at created_by_id run_id updated_at _branch_code space_id _aux source_id id uid name ontology_id abbr synonyms description
0 2025-01-20 07:36:39.702000+00:00 1 None 2025-01-20 07:36:39.702000+00:00 1 1 None 32 92 3JO0EdVd plasmacytoid dendritic cell CL:0000784 None plasmacytoid monocyte|interferon-producing cel... A Dendritic Cell Type Of Distinct Morphology, ...
✗ input synonyms ['ILC2'] already associated with the following records:
created_at created_by_id run_id updated_at _branch_code space_id _aux source_id id uid name ontology_id abbr synonyms description
0 2025-01-20 07:36:39.702000+00:00 1 None 2025-01-20 07:36:39.702000+00:00 1 1 None 32 114 4ny4oBnr group 2 innate lymphoid cell CL:0001069 None ILC2|natural helper cell|nuocyte An Innate Lymphoid Cell That Is Capable Of Pro...
✗ input synonyms ['ILC3'] already associated with the following records:
created_at created_by_id run_id updated_at _branch_code space_id _aux source_id id uid name ontology_id abbr synonyms description
0 2025-01-20 07:36:39.702000+00:00 1 None 2025-01-20 07:36:39.702000+00:00 1 1 None 32 115 3tILnbqv group 3 innate lymphoid cell CL:0001071 None ILC3 An Innate Lymphoid Cell That Constituitively E...
✗ input synonyms ['pDC'] already associated with the following records:
created_at created_by_id run_id updated_at _branch_code space_id _aux source_id id uid name ontology_id abbr synonyms description
0 2025-01-20 07:36:39.702000+00:00 1 None 2025-01-20 07:36:39.702000+00:00 1 1 None 32 92 3JO0EdVd plasmacytoid dendritic cell CL:0000784 None plasmacytoid monocyte|interferon-producing cel... A Dendritic Cell Type Of Distinct Morphology, ...
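
The records printed above show synonyms stored as one pipe-delimited string, for example `ILC2|natural helper cell|nuocyte`. The sketch below shows how a label could be merged into such a string without creating duplicates. `merge_synonym` is a hypothetical helper and "ILC2s" is a made-up label; this is not how `add_synonym` is implemented and it does not cover the cross-record check reported in the messages above.

```python
def merge_synonym(existing, new):
    """Append new to a pipe-delimited synonym string unless it is already present."""
    synonyms = existing.split("|") if existing else []
    if new not in synonyms:
        synonyms.append(new)
    return "|".join(synonyms)


current = "ILC2|natural helper cell|nuocyte"  # synonyms string from the output above

print(merge_synonym(current, "ILC2"))     # already present: string is unchanged
print(merge_synonym(current, "ILC2s"))    # hypothetical new label: appended
print(merge_synonym(None, "B cells"))     # no synonyms yet: becomes the only entry
```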

Add parent-child relationships of the records from CellTypist

We still need to add the remaining 4 high-hierarchy terms:

list(high_terms_nonval)
['Erythroid', 'Cycling cells', 'B-cell lineage', 'T cells']

Let’s get the top hits from a search:

for term in list(high_terms_nonval):
    print(f"Term: {term}")
    display(bionty.search(term).head(2))
Term: Erythroid
name definition synonyms parents __agg__
ontology_id
CL:0002000 Kit-positive erythroid progenitor cell An Erythroid Progenitor Cell Is Kit-Positive, ... c- Kit-positive erythroid progenitor cell [CL:0001066] kit-positive erythroid progenitor cell
CL:0000038 erythroid progenitor cell A Progenitor Cell Committed To The Erythroid L... None [CL:0000839, CL:0000764] erythroid progenitor cell
Term: Cycling cells
name definition synonyms parents __agg__
ontology_id
Term: B-cell lineage
name definition synonyms parents __agg__
ontology_id
Term: T cells
name definition synonyms parents __agg__
ontology_id
CL:0000145 professional antigen presenting cell A Cell Capable Of Processing And Presenting Li... None [CL:0000738] professional antigen presenting cell
CL:0000432 reticular cell A Fibroblast That Synthesizes Collagen And Use... reticulum cell [CL:0000057] reticular cell

So we decide to:

  • Add “T cells” to the synonyms of the public “T cell” record

  • Add “Erythroid” to the synonyms of the public “erythroid lineage cell” record

  • Create the remaining 2 terms using only their names (we think “B-cell lineage” shouldn’t be identified with “B cell”)

for name in high_terms_nonval:
    if name == "T cells":
        record = bt.CellType.from_source(name="T cell")
        record.add_synonym(name)
        record.save()
    elif name == "Erythroid":
        record = bt.CellType.from_source(name="erythroid lineage cell")
        record.add_synonym(name)
        record.save()
    else:
        record = bt.CellType(name=name)
        record.save()
high_terms_nonval
{'B-cell lineage', 'Cycling cells', 'Erythroid', 'T cells'}
bt.CellType(name="B-cell lineage").save()
 returning existing CellType record with same name: 'B-cell lineage'
CellType(uid='5gxL2SWr', name='B-cell lineage', created_by_id=1, space_id=1, created_at=2025-01-20 07:36:41 UTC)

Now let’s add the parent records:

celltypist_df["parent"] = bt.CellType.standardize(celltypist_df["parent"])
for _, row in celltypist_df.iterrows():
    record = public_records_dict[row["ontology_id"]]
    if row["parent"] is not None:
        parent_record = bt.CellType.get(name=row["parent"])
        record.parents.add(parent_record)

Access the registry

The CellTypist vocabulary registered above is now available in LaminDB. To retrieve the full registry as a pandas DataFrame, we can use .df():

bt.CellType.df()
uid name ontology_id abbr synonyms description space_id source_id run_id created_at created_by_id _aux _branch_code
id
139 5gxL2SWr B-cell lineage None None None None 1 NaN None 2025-01-20 07:36:41.435000+00:00 1 None 1
138 5jshKSVL Cycling cells None None None None 1 NaN None 2025-01-20 07:36:41.431000+00:00 1 None 1
69 4bKGljt0 cell CL:0000000 None None A Material Entity Of Anatomical Origin (Part O... 1 32.0 None 2025-01-20 07:36:39.702000+00:00 1 None 1
70 4y4o4m6R blood cell CL:0000081 None None A Cell Found Predominately In The Blood. 1 32.0 None 2025-01-20 07:36:39.702000+00:00 1 None 1
71 6Sq9ZVSG professional antigen presenting cell CL:0000145 None None A Cell Capable Of Processing And Presenting Li... 1 32.0 None 2025-01-20 07:36:39.702000+00:00 1 None 1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
25 6YazXirC thymocyte CL:0000893 None ETP|thymic lymphocyte An Immature T Cell Located In The Thymus. 1 32.0 None 2025-01-20 07:36:39.247000+00:00 1 None 1
26 zQ4dyjEs fibroblast CL:0000057 None Fibroblasts A Connective Tissue Cell Which Secretes An Ext... 1 32.0 None 2025-01-20 07:36:39.247000+00:00 1 None 1
27 bgoqqGYM granulocyte CL:0000094 None Granulocytes|polymorphonuclear leukocyte|granu... A Leukocyte With Abundant Granules In The Cyto... 1 32.0 None 2025-01-20 07:36:39.247000+00:00 1 None 1
28 6rfrjhvo neutrophil CL:0000775 None neutrophilic leucocyte|neutrocyte|neutrophil l... Any Of The Immature Or Mature Forms Of A Granu... 1 32.0 None 2025-01-20 07:36:39.247000+00:00 1 None 1
29 1HNi1cpn common myeloid progenitor CL:0000049 None common myeloid precursor|CMP A Progenitor Cell Committed To Myeloid Lineage... 1 32.0 None 2025-01-20 07:36:39.247000+00:00 1 None 1

100 rows × 13 columns

This enables us to look for cell types by creating a lookup object from our new CellType registry.

db_lookup = bt.CellType.lookup()
db_lookup.memory_b_cell
CellType(uid='2cUPBtY8', name='memory B cell', ontology_id='CL:0000787', synonyms='memory B lymphocyte|Age-associated B cells|Memory B cells|memory B-lymphocyte|memory B-cell', description='A Memory B Cell Is A Mature B Cell That Is Long-Lived, Readily Activated Upon Re-Encounter Of Its Antigenic Determinant, And Has Been Selected For Expression Of Higher Affinity Immunoglobulin. This Cell Type Has The Phenotype Cd19-Positive, Cd20-Positive, Mhc Class Ii-Positive, And Cd138-Negative.', created_by_id=1, space_id=1, source_id=32, created_at=2025-01-20 07:36:39 UTC)

See cell type hierarchy:

db_lookup.memory_b_cell.view_parents()
[Figure: parent hierarchy graph for the memory B cell record]

Access parents of a record:

db_lookup.memory_b_cell.parents.list()
[CellType(uid='ryEtgi1y', name='B cell', ontology_id='CL:0000236', synonyms='B cells|B-lymphocyte|Cycling B cells|B-cell|B lymphocyte', description='A Lymphocyte Of B Lineage That Is Capable Of B Cell Mediated Immunity.', created_by_id=1, space_id=1, source_id=32, created_at=2025-01-20 07:36:39 UTC),
 CellType(uid='71xItrKo', name='mature B cell', ontology_id='CL:0000785', synonyms='mature B-cell|mature B lymphocyte|mature B-lymphocyte', description='A B Cell That Is Mature, Having Left The Bone Marrow. Initially, These Cells Are Igm-Positive And Igd-Positive, And They Can Be Activated By Antigen.', created_by_id=1, space_id=1, source_id=32, created_at=2025-01-20 07:36:39 UTC)]
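
Direct parents are only one level of the hierarchy. Collecting all ancestors amounts to a graph traversal over the parent links, sketched below on a toy dictionary. The first entry mirrors the parents listed above; the edge from "mature B cell" to "B cell" is an assumption added to illustrate a shared ancestor, and `ancestors` is a hypothetical helper rather than a LaminDB method.

```python
def ancestors(term, parents_of):
    """Collect every ancestor of term by walking parent links depth-first."""
    seen = []
    stack = list(parents_of.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(parents_of.get(current, []))
    return seen


# Toy graph: the first entry mirrors the output above, the second is an assumption.
parents_of = {
    "memory B cell": ["B cell", "mature B cell"],
    "mature B cell": ["B cell"],
}

print(sorted(ancestors("memory B cell", parents_of)))
```

Tracking visited terms keeps a shared ancestor from being reported twice when it is reachable through more than one path.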

Move on to the next registry: GO pathways