
CellTypist

Cell types classify cells based on public and private knowledge gained from studying transcription, morphology, function, and other properties. Established cell types have well-characterized markers and properties; however, cell subtypes and states are continuously being discovered, refined, and better understood.

In this notebook, we register the immune cell type vocabulary from CellTypist, a computational tool used for cell type classification in scRNA-seq data.

In the following Standardize metadata on-the-fly notebook, we’ll demonstrate how to curate datasets analyzed with CellTypist enrichment analysis and track the dataset with LaminDB.

# pip install 'lamindb[jupyter,bionty]'
!lamin connect use-cases-registries
# filter warnings from celltypist
import warnings

warnings.filterwarnings("ignore", message=".*The 'nopython' keyword.*")
import lamindb as ln
import bionty as bt
→ connected lamindb: testuser1/use-cases-registries

Access CellTypist records

As a first step, we read in CellTypist’s immune cell encyclopedia:

import pandas as pd

description = "CellTypist Pan Immune Atlas v2: basic cell type information"
celltypist_source_v2_url = "https://github.com/Teichlab/celltypist_wiki/raw/main/atlases/Pan_Immune_CellTypist/v2/tables/Basic_celltype_information.xlsx"

celltypist_df = pd.read_excel(celltypist_source_v2_url)

It provides an ontology_id of the public Cell Ontology for the majority of records.

celltypist_df.head()
   High-hierarchy cell types  Low-hierarchy cell types               Description                                        Cell Ontology ID  Curated markers
0  B cells                    B cells                                B lymphocytes with diverse cell surface immuno...  CL:0000236        CD79A, MS4A1, CD19
1  B cells                    Follicular B cells                     resting mature B lymphocytes found in the prim...  CL:0000843        CXCR5, TNFRSF13B, CD22
2  B cells                    Proliferative germinal center B cells  proliferating germinal center B cells              CL:0000844        MKI67, SUGCT, AICDA
3  B cells                    Germinal center B cells                proliferating mature B cells that undergo soma...  CL:0000844        POU2AF1, CD40, SUGCT
4  B cells                    Memory B cells                         long-lived mature B lymphocytes which are form...  CL:0000787        CR2, CD27, MS4A1

A single “Cell Ontology ID” can be associated with multiple “Low-hierarchy cell types”:

celltypist_df.set_index(["Cell Ontology ID", "Low-hierarchy cell types"]).head(10)
Cell Ontology ID  Low-hierarchy cell types               High-hierarchy cell types  Description                                        Curated markers
CL:0000236        B cells                                B cells                    B lymphocytes with diverse cell surface immuno...  CD79A, MS4A1, CD19
CL:0000843        Follicular B cells                     B cells                    resting mature B lymphocytes found in the prim...  CXCR5, TNFRSF13B, CD22
CL:0000844        Proliferative germinal center B cells  B cells                    proliferating germinal center B cells              MKI67, SUGCT, AICDA
                  Germinal center B cells                B cells                    proliferating mature B cells that undergo soma...  POU2AF1, CD40, SUGCT
CL:0000787        Memory B cells                         B cells                    long-lived mature B lymphocytes which are form...  CR2, CD27, MS4A1
                  Age-associated B cells                 B cells                    CD11c+ T-bet+ memory B cells associated with a...  FCRL2, ITGAX, TBX21
CL:0000788        Naive B cells                          B cells                    mature B lymphocytes which express cell-surfac...  IGHM, IGHD, TCL1A
CL:0000818        Transitional B cells                   B cells                    immature B cell precursors in the bone marrow ...  CD24, MYO1C, MS4A1
CL:0000817        Large pre-B cells                      B-cell lineage             proliferative B lymphocyte precursors derived ...  MME, CD24, MKI67
                  Small pre-B cells                      B-cell lineage             non-proliferative B lymphocyte precursors deri...  MME, CD24, IGLL5
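
To make this many-to-one relationship explicit, the sketch below groups cell type names by ontology ID in plain Python. The `rows` list is a hand-copied subset of the table above, and the helper name `names_per_ontology_id` is hypothetical; on the full DataFrame, a pandas `groupby` would likely serve the same purpose.

```python
from collections import defaultdict

# Hand-copied (Cell Ontology ID, low-hierarchy cell type) pairs from the table above.
rows = [
    ("CL:0000236", "B cells"),
    ("CL:0000843", "Follicular B cells"),
    ("CL:0000844", "Proliferative germinal center B cells"),
    ("CL:0000844", "Germinal center B cells"),
    ("CL:0000787", "Memory B cells"),
    ("CL:0000787", "Age-associated B cells"),
    ("CL:0000788", "Naive B cells"),
    ("CL:0000818", "Transitional B cells"),
    ("CL:0000817", "Large pre-B cells"),
    ("CL:0000817", "Small pre-B cells"),
]


def names_per_ontology_id(pairs):
    """Map each ontology ID to the list of names that share it (hypothetical helper)."""
    grouped = defaultdict(list)
    for ontology_id, name in pairs:
        grouped[ontology_id].append(name)
    return dict(grouped)


grouped = names_per_ontology_id(rows)
# Keep only the IDs that are shared by more than one CellTypist name.
shared = {oid: names for oid, names in grouped.items() if len(names) > 1}
for oid, names in shared.items():
    print(oid, "->", names)
```

In this subset, three ontology IDs are each shared by two CellTypist names, which is why names rather than IDs carry the finer CellTypist distinctions.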

Validate CellTypist records

For any cell type record that can be validated against the public Cell Ontology, we’d like to ensure that it’s actually validated.

This prevents us from referring to the same cell type with different identifiers.

We need a Bionty object for this:

bionty = bt.CellType.public()
bionty
PublicOntology
Entity: CellType
Organism: all
Source: cl, 2024-05-15
#terms: 2931

We can now validate the "Cell Ontology ID" column:

bionty.inspect(celltypist_df["Cell Ontology ID"], bionty.ontology_id);

This looks good!

But when inspecting the names, most of them don’t validate:

bionty.inspect(celltypist_df["Low-hierarchy cell types"], bionty.name);
! 97 unique terms (99.00%) are not validated for name: 'B cells', 'Follicular B cells', 'Proliferative germinal center B cells', 'Germinal center B cells', 'Memory B cells', 'Age-associated B cells', 'Naive B cells', 'Transitional B cells', 'Large pre-B cells', 'Small pre-B cells', ...
   detected 6 unique terms with synonyms: DC1, DC2, ETP, ILC2, ILC3, pDC
→  standardize terms via .standardize()
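
Conceptually, this kind of inspection sorts terms into those that match a reference name, those that only match a known synonym, and those that match nothing. The following is a minimal sketch of that idea in plain Python, not the Bionty implementation; the helper `inspect_terms` and the toy vocabulary are assumptions made for illustration, with names and synonyms copied from outputs in this notebook.

```python
def inspect_terms(terms, valid_names, synonym_to_name):
    """Sort unique terms into validated, synonym-matched, and unknown lists."""
    validated, via_synonym, unknown = [], [], []
    for term in dict.fromkeys(terms):  # de-duplicate while keeping input order
        if term in valid_names:
            validated.append(term)
        elif term in synonym_to_name:
            via_synonym.append(term)
        else:
            unknown.append(term)
    return validated, via_synonym, unknown


# Toy reference vocabulary (assumption: a tiny subset, not the full Cell Ontology).
valid_names = {"B cell", "T cell", "plasmacytoid dendritic cell"}
synonym_to_name = {"B-cell": "B cell", "pDC": "plasmacytoid dendritic cell"}

terms = ["B cells", "pDC", "T cell", "B cells"]
validated, via_synonym, unknown = inspect_terms(terms, valid_names, synonym_to_name)
print("validated:", validated)
print("matched via synonym:", via_synonym)
print("not validated:", unknown)
```

An exact-match check like this treats "B cells" and "B cell" as different terms, which is the behavior seen in the output above.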

A search tells us that terms named in the plural in CellTypist occur with a singular name in the Cell Ontology:

celltypist_df["Low-hierarchy cell types"][0]
'B cells'
bionty.search(celltypist_df["Low-hierarchy cell types"][0]).head(2)
ontology_id definition synonyms parents __agg__ __ratio__
name
B cell CL:0000236 A Lymphocyte Of B Lineage That Is Capable Of B... B lymphocyte|B-cell|B-lymphocyte [CL:0000945] b cell 92.307692
B-1 B cell CL:0000819 A B Cell Of Distinct Lineage And Surface Marke... B1 B-cell|B-1 B-lymphocyte|B-1 B-cell|B1 cell|... [CL:0000785] b-1 b cell 85.714286

Let’s strip the trailing "s" and check whether more names validate. A few more do: the number of non-validated names drops from 97 to 93.

bionty.inspect(
    [i.rstrip("s") for i in celltypist_df["Low-hierarchy cell types"]],
    bionty.name,
);
! 93 unique terms (94.90%) are not validated for name: 'Follicular B cell', 'Proliferative germinal center B cell', 'Germinal center B cell', 'Memory B cell', 'Age-associated B cell', 'Naive B cell', 'Transitional B cell', 'Large pre-B cell', 'Small pre-B cell', 'Pre-pro-B cell', ...
   detected 31 unique terms with inconsistent casing/synonyms: Follicular B cell, Germinal center B cell, Memory B cell, Naive B cell, Transitional B cell, Small pre-B cell, Pro-B cell, DC1, DC2, Endothelial cell, ...
→  standardize terms via .standardize()
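
The output above suggests that casing, not only the plural "s", keeps names from validating. The sketch below is a hedged illustration of combining both normalizations in plain Python; `singularize` and `match_reference` are hypothetical helpers, and `reference_names` is a small hand-picked list of Cell Ontology names that appear in this notebook. Note also that `str.rstrip("s")` removes every trailing "s", whereas the helper drops exactly one.

```python
def singularize(name):
    """Drop exactly one trailing "s", if present (hypothetical helper)."""
    return name[:-1] if name.endswith("s") else name


# Reference names as they appear in Cell Ontology outputs in this notebook.
reference_names = ["B cell", "T cell", "follicular B cell", "memory B cell"]
by_lowercase = {name.lower(): name for name in reference_names}


def match_reference(celltypist_name):
    """Return the reference name that matches after singularizing and ignoring case."""
    return by_lowercase.get(singularize(celltypist_name).lower())


for name in ["B cells", "Follicular B cells", "Memory B cells", "Age-associated B cells"]:
    print(f"{name!r} -> {match_reference(name)!r}")

# rstrip removes all trailing "s" characters; the helper removes only one.
print(repr("ss".rstrip("s")), repr(singularize("ss")))
```

Names such as "Age-associated B cells" still find no match, which is consistent with most terms remaining unvalidated by name and motivates validating by ontology ID instead.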

Every “low-hierarchy cell type” has an ontology ID, and most “high-hierarchy cell types” also appear as “low-hierarchy cell types” in the CellTypist table. Four, however, don’t, and therefore have no ontology ID.

high_terms = celltypist_df["High-hierarchy cell types"].unique()
low_terms = celltypist_df["Low-hierarchy cell types"].unique()

high_terms_nonval = set(high_terms).difference(low_terms)
high_terms_nonval
{'B-cell lineage', 'Cycling cells', 'Erythroid', 'T cells'}

Register CellTypist records

Let’s first add the “High-hierarchy cell types” as a column "parent".

This enables LaminDB to populate the parents and children fields, which will enable you to query for hierarchical relationships.

celltypist_df["parent"] = celltypist_df.pop("High-hierarchy cell types")

# if high and low terms are the same, no parents
celltypist_df.loc[
    (celltypist_df["parent"] == celltypist_df["Low-hierarchy cell types"]), "parent"
] = None

# rename columns, drop markers
celltypist_df.drop(columns=["Curated markers"], inplace=True)
celltypist_df.rename(
    columns={"Low-hierarchy cell types": "ct_name", "Cell Ontology ID": "ontology_id"},
    inplace=True,
)
celltypist_df.columns = celltypist_df.columns.str.lower()

# add standardized names for each ontology_id
celltypist_df["name"] = bionty.df().loc[celltypist_df["ontology_id"]].name.values
celltypist_df.head(2)
   ct_name             description                                        ontology_id  parent   name
0  B cells             B lymphocytes with diverse cell surface immuno...  CL:0000236   None     B cell
1  Follicular B cells  resting mature B lymphocytes found in the prim...  CL:0000843   B cells  follicular B cell

Now, let’s create records from the public ontology:

public_records = bt.CellType.from_values(
    celltypist_df.ontology_id, bt.CellType.ontology_id
)
ln.save(public_records)

Let’s now amend the public ontology records so that they keep the additional annotations from CellTypist, here the CellTypist names as synonyms.

from lamindb.core.exceptions import ValidationError

public_records_dict = {r.ontology_id: r for r in public_records}

for _, row in celltypist_df.iterrows():
    record = public_records_dict[row["ontology_id"]]
    try:
        record.add_synonym(row["ct_name"])
    except ValidationError:  # skip synonyms that are already associated with a record
        pass
✗ input synonyms ['DC2'] already associated with the following records:
created_at created_by_id run_id updated_at source_id id uid name ontology_id abbr synonyms description
0 2024-11-21 06:53:38.503683+00:00 1 None 2024-11-21 06:53:38.503691+00:00 32 92 3JO0EdVd plasmacytoid dendritic cell CL:0000784 None plasmacytoid monocyte|T-associated plasma cell... A Dendritic Cell Type Of Distinct Morphology, ...
✗ input synonyms ['ILC2'] already associated with the following records:
created_at created_by_id run_id updated_at source_id id uid name ontology_id abbr synonyms description
0 2024-11-21 06:53:38.504134+00:00 1 None 2024-11-21 06:53:38.504142+00:00 32 114 4ny4oBnr group 2 innate lymphoid cell CL:0001069 None nuocyte|natural helper cell|ILC2 An Innate Lymphoid Cell That Is Capable Of Pro...
✗ input synonyms ['ILC3'] already associated with the following records:
created_at created_by_id run_id updated_at source_id id uid name ontology_id abbr synonyms description
0 2024-11-21 06:53:38.504155+00:00 1 None 2024-11-21 06:53:38.504163+00:00 32 115 3tILnbqv group 3 innate lymphoid cell CL:0001071 None ILC3 An Innate Lymphoid Cell That Constituitively E...
✗ input synonyms ['pDC'] already associated with the following records:
created_at created_by_id run_id updated_at source_id id uid name ontology_id abbr synonyms description
0 2024-11-21 06:53:38.503683+00:00 1 None 2024-11-21 06:53:38.503691+00:00 32 92 3JO0EdVd plasmacytoid dendritic cell CL:0000784 None plasmacytoid monocyte|T-associated plasma cell... A Dendritic Cell Type Of Distinct Morphology, ...
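
The rejected synonyms can be understood with a toy model. The sketch below assumes (this is an assumption for illustration, not the LaminDB implementation) that synonyms are stored as a pipe-delimited string, as the outputs in this notebook suggest, and that a synonym already attached to any record is rejected. The helper `add_synonym`, the `records` dictionary, and the record named "hypothetical target record" are all made up for this example; the synonyms of "plasmacytoid dendritic cell" are reduced to the two named in the output above.

```python
def add_synonym(records, record_name, synonym):
    """Append a synonym to a record's pipe-delimited string, rejecting known ones."""
    for name, synonyms in records.items():
        if synonym in synonyms.split("|"):
            raise ValueError(f"{synonym!r} already associated with {name!r}")
    existing = records[record_name]
    records[record_name] = f"{existing}|{synonym}" if existing else synonym


# Toy registry: record name -> pipe-delimited synonyms.
records = {
    "B cell": "B lymphocyte|B-cell|B-lymphocyte",
    "plasmacytoid dendritic cell": "pDC|DC2",
    "hypothetical target record": "",
}

add_synonym(records, "B cell", "B cells")
print(records["B cell"])

try:
    add_synonym(records, "hypothetical target record", "DC2")
except ValueError as error:  # mirrors the "already associated" messages above
    print("skipped:", error)
```

Skipping on conflict, as the loop above does with `ValidationError`, leaves the existing association untouched rather than attaching one synonym to two records.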

Add parent-child relationships of the records from CellTypist

We still need to add the remaining 4 high-hierarchy terms:

list(high_terms_nonval)
['B-cell lineage', 'T cells', 'Cycling cells', 'Erythroid']

Let’s get the top hits from a search:

for term in list(high_terms_nonval):
    print(f"Term: {term}")
    display(bionty.search(term).head(2))
Term: B-cell lineage
ontology_id definition synonyms parents __agg__ __ratio__
name
obsolete cell by lineage CL:0000220 None None [] obsolete cell by lineage 73.684211
obsolete cell line cell CL:0007014 Obsolete: A Cultured Cell That Has Been Passag... passaged cultured cell [] obsolete cell line cell 64.864865
Term: T cells
ontology_id definition synonyms parents __agg__ __ratio__
name
T cell CL:0000084 A Type Of Lymphocyte Whose Defining Characteri... T lymphocyte|T-lymphocyte|T-cell [CL:0000542] t cell 92.307692
Tc1 cell CL:0000917 A Cd8-Positive, Alpha-Beta Positive T Cell Tha... Tc1 T cell|Th1 non-TFH CD8-positive T cell|CD8... [CL:0000908] tc1 cell 80.000000
Term: Cycling cells
ontology_id definition synonyms parents __agg__ __ratio__
name
circulating cell CL:0000080 A Cell Which Moves Among Different Tissues Of ... None [CL:0000000] circulating cell 75.862069
lining cell CL:0000213 A Cell Within An Epithelial Cell Sheet Whose M... boundary cell [CL:0000215] lining cell 75.000000
Term: Erythroid
ontology_id definition synonyms parents __agg__ __ratio__
name
Kit-positive, CD34-negative megakaryocyte erythroid progenitor cell CL:0002006 A Megakaryocyte Erythroid Progenitor Cell That... None [CL:0000050] kit-positive, cd34-negative megakaryocyte eryt... 90.0
Kit-positive erythroid progenitor cell CL:0002000 An Erythroid Progenitor Cell Is Kit-Positive, ... c- Kit-positive erythroid progenitor cell [CL:0001066] kit-positive erythroid progenitor cell 90.0

So we decide to:

  • Add “T cells” to the synonyms of the public “T cell” record

  • Add “Erythroid” to the synonyms of the public “erythroid lineage cell” record

  • Create the remaining 2 terms using only their names (we think “B-cell lineage” shouldn’t be identified with “B cell”)

for name in high_terms_nonval:
    if name == "T cells":
        record = bt.CellType.from_source(name="T cell")
        record.add_synonym(name)
        record.save()
    elif name == "Erythroid":
        record = bt.CellType.from_source(name="erythroid lineage cell")
        record.add_synonym(name)
        record.save()
    else:
        record = bt.CellType(name=name)
        record.save()
high_terms_nonval
{'B-cell lineage', 'Cycling cells', 'Erythroid', 'T cells'}
bt.CellType(name="B-cell lineage").save()
→ returning existing CellType record with same name: 'B-cell lineage'
CellType(uid='5gxL2SWr', name='B-cell lineage', created_by_id=1, created_at=2024-11-21 06:53:40 UTC)

Now let’s add the parent records:

celltypist_df["parent"] = bt.CellType.standardize(celltypist_df["parent"])
for _, row in celltypist_df.iterrows():
    record = public_records_dict[row["ontology_id"]]
    if row["parent"] is not None:
        parent_record = bt.CellType.get(name=row["parent"])
        record.parents.add(parent_record)

Access the registry

The previously added CellTypist ontology registry is now available in LaminDB. To retrieve the full ontology table as a pandas DataFrame, we can use .df():

bt.CellType.df()
uid name ontology_id abbr synonyms description source_id run_id created_at created_by_id
id
139 5gxL2SWr B-cell lineage None None None None NaN None 2024-11-21 06:53:40.013145+00:00 1
24 2KfvYuU7 erythroid lineage cell CL:0000764 None Erythroid|Late erythroid|Mid erythroid|Early e... A Immature Or Mature Cell In The Lineage Leadi... 32.0 None 2024-11-21 06:53:38.182099+00:00 1
140 5jshKSVL Cycling cells None None None None NaN None 2024-11-21 06:53:40.034799+00:00 1
14 22LvKd01 T cell CL:0000084 None T cells|T-cell|T-lymphocyte|CD8a/a|Cycling T c... A Type Of Lymphocyte Whose Defining Characteri... 32.0 None 2024-11-21 06:53:38.181885+00:00 1
68 7j3YpGzu T-helper 17 cell CL:0000899 None T(H)-17 cell|Th17 T cell|helper T cell type 17... Cd4-Positive, Alpha-Beta T Cell With The Pheno... 32.0 None 2024-11-21 06:53:38.183042+00:00 1
... ... ... ... ... ... ... ... ... ... ...
111 3yMnmkVh hematopoietic oligopotent progenitor cell, lin... CL:0001060 None None A Hematopoietic Oligopotent Progenitor Cell Th... 32.0 None 2024-11-21 06:53:38.504073+00:00 1
110 mya6z15C bone cell CL:0001035 None None A Connective Tissue Cell Found In Bone. 32.0 None 2024-11-21 06:53:38.504053+00:00 1
109 1746leyS CD115-positive monocyte OR common dendritic pr... CL:0001019 None None None 32.0 None 2024-11-21 06:53:38.504031+00:00 1
108 2OdKiFr8 CD7-negative lymphoid progenitor OR granulocyt... CL:0001012 None None None 32.0 None 2024-11-21 06:53:38.504010+00:00 1
107 4Ilrnj9U hematopoietic cell CL:0000988 None haemopoietic cell|haematopoietic cell|hemopoie... A Cell Of A Hematopoietic Lineage. 32.0 None 2024-11-21 06:53:38.503989+00:00 1

100 rows × 10 columns

This enables us to look for cell types by creating a lookup object from our new CellType registry.

db_lookup = bt.CellType.lookup()
db_lookup.memory_b_cell
CellType(uid='2cUPBtY8', name='memory B cell', ontology_id='CL:0000787', synonyms='memory B lymphocyte|memory B-lymphocyte|memory B-cell|Memory B cells|Age-associated B cells', description='A Memory B Cell Is A Mature B Cell That Is Long-Lived, Readily Activated Upon Re-Encounter Of Its Antigenic Determinant, And Has Been Selected For Expression Of Higher Affinity Immunoglobulin. This Cell Type Has The Phenotype Cd19-Positive, Cd20-Positive, Mhc Class Ii-Positive, And Cd138-Negative.', created_by_id=1, source_id=32, created_at=2024-11-21 06:53:38 UTC)

See cell type hierarchy:

db_lookup.memory_b_cell.view_parents()
[Figure: graph of the memory B cell record and its parents]

Access parents of a record:

db_lookup.memory_b_cell.parents.list()
[CellType(uid='ryEtgi1y', name='B cell', ontology_id='CL:0000236', synonyms='B-cell|B cells|Cycling B cells|B-lymphocyte|B lymphocyte', description='A Lymphocyte Of B Lineage That Is Capable Of B Cell Mediated Immunity.', created_by_id=1, source_id=32, created_at=2024-11-21 06:53:38 UTC),
 CellType(uid='71xItrKo', name='mature B cell', ontology_id='CL:0000785', synonyms='mature B-lymphocyte|mature B-cell|mature B lymphocyte', description='A B Cell That Is Mature, Having Left The Bone Marrow. Initially, These Cells Are Igm-Positive And Igd-Positive, And They Can Be Activated By Antigen.', created_by_id=1, source_id=32, created_at=2024-11-21 06:53:38 UTC)]
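
Because parents are themselves records with parents, hierarchical queries amount to walking a graph. The sketch below illustrates that idea on a toy parent map in plain Python; it does not use the LaminDB API. The helper `ancestors` is hypothetical, the first edge list is taken from the output above, and the edge from "mature B cell" to "B cell" is an assumption added for illustration.

```python
def ancestors(name, parents):
    """Collect all direct and indirect parents of a term by walking the parent map."""
    found = set()
    stack = list(parents.get(name, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


# Toy parent map: term -> list of direct parents.
parents = {
    "memory B cell": ["B cell", "mature B cell"],  # from the output above
    "mature B cell": ["B cell"],  # assumed edge for illustration
    "B cell": [],  # further parents omitted in this toy graph
}

print(sorted(ancestors("memory B cell", parents)))
```

Tracking visited terms in `found` keeps the walk from revisiting a term that is reachable through more than one parent.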

Move on to the next registry: GO pathways