Gene Ontology (GO)¶
Pathways represent interconnected molecular networks of signaling cascades that govern critical cellular processes. They help explain the mechanisms of cellular behavior and offer insights into disease progression and treatment responses. In an R&D organization, managing pathways across different datasets is crucial for identifying potential therapeutic targets and intervention strategies.
In this notebook, we manage a pathway registry based on the “2023 GO Biological Process” ontology. We’ll walk you through the steps of registering pathways and linking them to genes.
In the following Standardize metadata on-the-fly notebook, we’ll demonstrate how to perform a pathway enrichment analysis and track the dataset with LaminDB.
# !pip install 'lamindb[jupyter,bionty]'
!lamin init --storage ./use-cases-registries --schema bionty
→ connected lamindb: testuser1/use-cases-registries
import lamindb as ln
import bionty as bt
import gseapy as gp
bt.settings.organism = "human" # globally set organism
→ connected lamindb: testuser1/use-cases-registries
Fetch GO pathways annotated with human genes using Enrichr¶
First, we fetch the “GO_Biological_Process_2023” pathways for humans using GSEApy, which wraps GSEA and Enrichr.
go_bp = gp.get_library(name="GO_Biological_Process_2023", organism="Human")
print(f"Number of pathways {len(go_bp)}")
Number of pathways 5406
go_bp["ATF6-mediated Unfolded Protein Response (GO:0036500)"]
['MBTPS1', 'MBTPS2', 'XBP1', 'ATF6B', 'DDIT3', 'CREBZF']
Parse the ontology ID out of each key and convert the library into the format {ontology_id: (name, genes)}:
def parse_ontology_id_from_keys(key):
    """Parse out the ontology id.

    "ATF6-mediated Unfolded Protein Response (GO:0036500)" -> ("GO:0036500", "ATF6-mediated Unfolded Protein Response")
    """
    # Split on the last " (" so that parentheses inside the name are preserved
    name, ontology_id = key.rsplit(" (", 1)
    ontology_id = ontology_id.rstrip(")")
    return ontology_id, name

go_bp_parsed = {}
for key, genes in go_bp.items():
    ontology_id, name = parse_ontology_id_from_keys(key)
    go_bp_parsed[ontology_id] = (name, genes)
go_bp_parsed["GO:0036500"]
('ATF6-mediated Unfolded Protein Response',
['MBTPS1', 'MBTPS2', 'XBP1', 'ATF6B', 'DDIT3', 'CREBZF'])
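The parsing step above trusts that every key ends in a parenthesized GO identifier. As a minimal sketch of a stricter variant, the snippet below assumes that GO identifiers have the form `GO:` followed by seven digits and raises an error for any key that does not match. The helper name `split_go_key` and the regular expression are illustrative only and are not part of GSEApy or LaminDB.

```python
import re

# Assumed key layout: "<name> (GO:<7 digits>)"
GO_KEY_PATTERN = re.compile(r"^(?P<name>.+) \((?P<ontology_id>GO:\d{7})\)$")


def split_go_key(key: str) -> tuple[str, str]:
    """Return (ontology_id, name), or raise ValueError for unexpected keys."""
    match = GO_KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"Unexpected library key: {key!r}")
    return match["ontology_id"], match["name"]


ontology_id, name = split_go_key("ATF6-mediated Unfolded Protein Response (GO:0036500)")
print(ontology_id, "->", name)
```

Failing loudly on a malformed key may be preferable to silently registering a wrong ontology ID, although whether such keys occur in the library at all is not something this notebook checks.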
Register pathway ontology in LaminDB¶
bionty = bt.Pathway.public()
bionty
PublicOntology
Entity: Pathway
Organism: all
Source: go, 2024-06-17
#terms: 47856
Next, we register all the pathways and genes in LaminDB to finally link pathways to genes.
Register pathway terms¶
To register the pathways, we use .from_values to parse the annotated GO pathway ontology IDs directly into LaminDB.
pathway_records = bt.Pathway.from_values(go_bp_parsed.keys(), bt.Pathway.ontology_id)
ln.save(pathway_records)
Register gene symbols¶
Similarly, we use .from_values to register all pathway-associated genes with LaminDB.
all_genes = bt.Gene.standardize(list({g for genes in go_bp.values() for g in genes}))
gene_records = bt.Gene.from_values(all_genes, bt.Gene.symbol)
ln.save(gene_records);
! found 56 synonyms in Bionty (output truncated): [np.str_('C1ORF68'), np.str_('FAM172BP'), np.str_('SLC9A3R2'), np.str_('C1ORF43'), np.str_('C20ORF173'), np.str_('C1ORF109'), np.str_('C9ORF78'), np.str_('C12ORF29'), np.str_('C3ORF38'), np.str_('C17ORF97'), '...']
please add corresponding Gene records via (output truncated): `.from_values([np.str_('C1ORF68'), np.str_('FAM172BP'), np.str_('SLC9A3R2'), np.str_('C1ORF43'), np.str_('C20ORF173'), np.str_('C1ORF109'), np.str_('C9ORF78'), np.str_('C12ORF29'), np.str_('C3ORF38'), np.str_('C17ORF97'), '...'])`
! ambiguous validation in Bionty for 1104 records: 'TMEM102', 'HAT1', 'RINL', 'NAPEPLD', 'USP17L1', 'RAD17', 'CCDC120', 'TOLLIP', 'HLA-DPA1', 'GPX5', 'OR2T2', 'MEGF11', 'TFE3', 'TBC1D3F', 'RDH13', 'IRF9', 'ZAP70', 'ZBTB9', 'HLA-DOB', 'PRSS1', ...
! did not create Gene records for 38 non-validated symbols: 'AFD1', 'AZF1', 'CCL3L1', 'CCL4L1', 'DGS2', 'DUX3', 'DUX5', 'FOXL3-OT1', 'IGL', 'LOC100653049', 'LOC102723475', 'LOC102723996', 'LOC102724159', 'LOC107984156', 'LOC112268384', 'LOC122319436', 'LOC122513141', 'LOC122539214', 'LOC344967', 'MDRV', ...
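The gene list passed to the registry is built by flattening the per-pathway gene lists and removing symbols that occur in more than one pathway. A small sketch of that flattening step on toy data follows; the dictionary `toy_library` and its pathway names are made up for illustration and stand in for the Enrichr library.

```python
# Toy stand-in for the Enrichr library: pathway key -> list of gene symbols
toy_library = {
    "Pathway A (GO:0000001)": ["XBP1", "ATF6B", "DDIT3"],
    "Pathway B (GO:0000002)": ["DDIT3", "MBTPS1", "XBP1"],
}

# A set comprehension collapses symbols shared between pathways;
# sorting makes the result deterministic
unique_genes = sorted({gene for genes in toy_library.values() for gene in genes})
print(unique_genes)  # ['ATF6B', 'DDIT3', 'MBTPS1', 'XBP1']
```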
Manually register the 38 non-validated symbols:
inspect_result = bt.Gene.inspect(all_genes, bt.Gene.symbol)
nonval_genes = []
for g in inspect_result.non_validated:
    nonval_genes.append(bt.Gene(symbol=g))
ln.save(nonval_genes)
! received 14696 unique terms, 1 empty/duplicated term is ignored
! 38 unique terms (0.30%) are not validated for symbol: 'DUX5', 'MTRNR2L13', 'MTRNR2L12', 'TRL-AAG2-3', 'MTRNR2L8', 'DUX3', 'MDRV', 'LOC102724159', 'TAS2R36', 'MTRNR2L3', ...
couldn't validate 38 terms: 'MDRV', 'LOC107984156', 'LOC122319436', 'DUX3', 'LOC100653049', 'FOXL3-OT1', 'MTRNR2L1', 'TRA', 'SEPTIN14P20', 'MTRNR2L10', 'MTRNR2L8', 'MTRNR2L4', 'MTRNR2L11', 'AFD1', 'IGL', 'MTRNR2L2', 'LOC122513141', 'CCL4L1', 'TAS2R33', 'MTRNR2L13', ...
→ if you are sure, create new records via Gene() and save to your registry
! records with similar symbols exist! did you mean to load one of them?
id | uid | symbol | stable_id | ensembl_gene_id | ncbi_gene_ids | biotype | synonyms | description | source_id | organism_id | run_id | created_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
625 | 34sokaC4zpvA | TAS2R41 | None | ENSG00000221855 | 259287 | protein_coding | T2R59 | taste 2 receptor member 41 | 11 | 1 | None | 2024-12-03 08:31:35.746935+00:00 | 1 |
626 | 7atcLdtPNO4W | TAS2R41 | None | ENSG00000284982 | 259287 | protein_coding | T2R59 | taste 2 receptor member 41 | 11 | 1 | None | 2024-12-03 08:31:35.746958+00:00 | 1 |
1033 | 5OIokZFPxKL5 | TAS2R60 | None | ENSG00000185899 | 338398 | protein_coding | T2R60 | taste 2 receptor member 60 | 11 | 1 | None | 2024-12-03 08:31:35.767510+00:00 | 1 |
! record with similar symbol exists! did you mean to load it?
id | uid | symbol | stable_id | ensembl_gene_id | ncbi_gene_ids | biotype | synonyms | description | source_id | organism_id | run_id | created_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16244 | 19ZTVOfazMYF | SEPTIN14 | None | ENSG00000154997 | 346288 | protein_coding | SEPT14|FLJ44060 | septin 14 | 11 | 1 | None | 2024-12-03 08:31:36.849329+00:00 | 1 |
! record with similar symbol exists! did you mean to load it?
id | uid | symbol | stable_id | ensembl_gene_id | ncbi_gene_ids | biotype | synonyms | description | source_id | organism_id | run_id | created_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13853 | 2Z2LBAj16utl | TRAFD1 | None | ENSG00000135148 | 10906 | protein_coding | FLN29 | TRAF-type zinc finger domain containing 1 | 11 | 1 | None | 2024-12-03 08:31:36.722345+00:00 | 1 |
! records with similar symbols exist! did you mean to load one of them?
id | uid | symbol | stable_id | ensembl_gene_id | ncbi_gene_ids | biotype | synonyms | description | source_id | organism_id | run_id | created_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
625 | 34sokaC4zpvA | TAS2R41 | None | ENSG00000221855 | 259287 | protein_coding | T2R59 | taste 2 receptor member 41 | 11 | 1 | None | 2024-12-03 08:31:35.746935+00:00 | 1 |
626 | 7atcLdtPNO4W | TAS2R41 | None | ENSG00000284982 | 259287 | protein_coding | T2R59 | taste 2 receptor member 41 | 11 | 1 | None | 2024-12-03 08:31:35.746958+00:00 | 1 |
1033 | 5OIokZFPxKL5 | TAS2R60 | None | ENSG00000185899 | 338398 | protein_coding | T2R60 | taste 2 receptor member 60 | 11 | 1 | None | 2024-12-03 08:31:35.767510+00:00 | 1 |
! record with similar symbol exists! did you mean to load it?
id | uid | symbol | stable_id | ensembl_gene_id | ncbi_gene_ids | biotype | synonyms | description | source_id | organism_id | run_id | created_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1415 | 217N4Hgazwqu | JAZF1 | None | ENSG00000153814 | 221895 | protein_coding | TIP27|ZNF802|DKFZP761K2222 | JAZF zinc finger 1 | 11 | 1 | None | 2024-12-03 08:31:35.784770+00:00 | 1 |
! records with similar symbols exist! did you mean to load one of them?
id | uid | symbol | stable_id | ensembl_gene_id | ncbi_gene_ids | biotype | synonyms | description | source_id | organism_id | run_id | created_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
282 | 1dfhC6H3hRcR | TRAF3IP3 | None | ENSG00000009790 | 80342 | protein_coding | T3JAM | TRAF3 interacting protein 3 | 11 | 1 | None | 2024-12-03 08:31:35.731315+00:00 | 1 |
410 | 27ggmkQCDtQ9 | TRABD2B | None | ENSG00000269113 | 388630 | protein_coding | TIKI2 | TraB domain containing 2B | 11 | 1 | None | 2024-12-03 08:31:35.737328+00:00 | 1 |
1637 | 2hSzcmjJmNim | TRAM1 | None | ENSG00000067167 | 23471 | protein_coding | TRAM|TRAMP | translocation associated membrane protein 1 | 11 | 1 | None | 2024-12-03 08:31:35.796753+00:00 | 1 |
! records with similar symbols exist! did you mean to load one of them?
id | uid | symbol | stable_id | ensembl_gene_id | ncbi_gene_ids | biotype | synonyms | description | source_id | organism_id | run_id | created_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10104 | 10LbfkUokHgx | IGLC1 | None | ENSG00000211675 | | IG_C_gene | IGLC | immunoglobulin lambda constant 1 | 11 | 1 | None | 2024-12-03 08:31:36.396594+00:00 | 1 |
10988 | 2qFUCnrrxPlc | IGLC7 | None | ENSG00000211685 | | IG_C_gene | | immunoglobulin lambda constant 7 | 11 | 1 | None | 2024-12-03 08:31:36.442661+00:00 | 1 |
16685 | 7TBpxi4zzmMB | IGLC3 | None | ENSG00000211679 | | IG_C_gene | IGLC | immunoglobulin lambda constant 3 (Kern-Oz+ mar... | 11 | 1 | None | 2024-12-03 08:31:36.872946+00:00 | 1 |
! records with similar symbols exist! did you mean to load one of them?
id | uid | symbol | stable_id | ensembl_gene_id | ncbi_gene_ids | biotype | synonyms | description | source_id | organism_id | run_id | created_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2707 | TfqHfw8FjZxq | CCL4 | None | ENSG00000275302 | 6351 | protein_coding | MIP-1-BETA|SCYA4|AT744.1|LAG1|ACT-2 | C-C motif chemokine ligand 4 | 11 | 1 | None | 2024-12-03 08:31:35.864598+00:00 | 1 |
2708 | 2MLaX8EEZ9sQ | CCL4 | None | ENSG00000275824 | 6351 | protein_coding | MIP-1-BETA|SCYA4|AT744.1|LAG1|ACT-2 | C-C motif chemokine ligand 4 | 11 | 1 | None | 2024-12-03 08:31:35.864700+00:00 | 1 |
2709 | 7P74Aze5sMxb | CCL4 | None | ENSG00000277943 | 6351 | protein_coding | MIP-1-BETA|SCYA4|AT744.1|LAG1|ACT-2 | C-C motif chemokine ligand 4 | 11 | 1 | None | 2024-12-03 08:31:35.864734+00:00 | 1 |
! records with similar symbols exist! did you mean to load one of them?
id | uid | symbol | stable_id | ensembl_gene_id | ncbi_gene_ids | biotype | synonyms | description | source_id | organism_id | run_id | created_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16101 | F36vWNagki41 | CCL3 | None | ENSG00000274221 | 6348 | protein_coding | LD78|G0S19-1|SCYA3|LD78ALPHA|SCI|MIP-1-ALPHA | C-C motif chemokine ligand 3 | 11 | 1 | None | 2024-12-03 08:31:36.841656+00:00 | 1 |
16102 | EBoWJxh70guC | CCL3 | None | ENSG00000277632 | 6348 | protein_coding | LD78|G0S19-1|SCYA3|LD78ALPHA|SCI|MIP-1-ALPHA | C-C motif chemokine ligand 3 | 11 | 1 | None | 2024-12-03 08:31:36.841679+00:00 | 1 |
16103 | 6XmKAksx3Nn4 | CCL3 | None | ENSG00000278567 | 6348 | protein_coding | LD78|G0S19-1|SCYA3|LD78ALPHA|SCI|MIP-1-ALPHA | C-C motif chemokine ligand 3 | 11 | 1 | None | 2024-12-03 08:31:36.841702+00:00 | 1 |
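Conceptually, the inspection step splits the input symbols into those found in a reference and those that are not. The sketch below mimics that split with a plain Python set. It is not how LaminDB or Bionty implement validation (it ignores synonyms and similarity hints, for instance), and both the function name `partition_symbols` and the tiny reference set are hypothetical.

```python
def partition_symbols(symbols, reference):
    """Split symbols into (validated, non_validated), preserving input order."""
    validated, non_validated = [], []
    seen = set()
    for symbol in symbols:
        if symbol in seen:
            continue  # ignore duplicated terms
        seen.add(symbol)
        if symbol in reference:
            validated.append(symbol)
        else:
            non_validated.append(symbol)
    return validated, non_validated


# Hypothetical reference; a real one would come from a public gene ontology
reference_symbols = {"XBP1", "DDIT3", "ATF6B"}
validated, non_validated = partition_symbols(
    ["XBP1", "DUX5", "DDIT3", "XBP1", "MDRV"], reference_symbols
)
print(validated)      # ['XBP1', 'DDIT3']
print(non_validated)  # ['DUX5', 'MDRV']
```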
Link pathway to genes¶
Now that we are tracking all pathway and gene records, we can link them to make the pathways even more queryable.
symbols_gene_records = {record.symbol: record for record in gene_records}
for pathway_record in pathway_records:
    pathway_genes = go_bp_parsed[pathway_record.ontology_id][1]
    # Look up the record for each symbol; skip symbols that have no entry in gene_records
    pathway_genes_records = [
        symbols_gene_records[gene] for gene in pathway_genes if gene in symbols_gene_records
    ]
    pathway_record.genes.set(pathway_genes_records)
Genes are now linked to pathways. For example, for the last pathway processed in the loop:
pathway_record.genes.list("symbol")
['CAST', 'CARD18', 'XIAP', 'CARD8', 'CST7']
pathway_record.genes.list("ensembl_gene_id")
['ENSG00000153113',
'ENSG00000255501',
'ENSG00000101966',
'ENSG00000105483',
'ENSG00000077984']
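One reason to link genes and pathways is that the relation can then be traversed in either direction. As a rough, registry-free illustration, the sketch below inverts a {ontology_id: (name, genes)} mapping into a gene-to-pathways lookup using only the standard library. The toy identifiers and pathway names are invented, and the snippet does not use or reproduce any LaminDB query API.

```python
from collections import defaultdict

# Toy mapping in the same shape as the parsed library: id -> (name, genes)
toy_pathways = {
    "GO:0000001": ("Pathway A", ["XBP1", "DDIT3"]),
    "GO:0000002": ("Pathway B", ["DDIT3", "MBTPS1"]),
}

# Invert the mapping: gene symbol -> list of pathway ontology IDs
genes_to_pathways = defaultdict(list)
for ontology_id, (_name, genes) in toy_pathways.items():
    for gene in genes:
        genes_to_pathways[gene].append(ontology_id)

print(genes_to_pathways["DDIT3"])  # ['GO:0000001', 'GO:0000002']
```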
Move on to the next analysis: Standardize metadata on-the-fly