Why should I not index datasets with gene symbols?
Gene symbols are widely used for readability, particularly for visualization. However, indexing datasets with gene symbols presents challenges:
A single gene may have multiple symbols or aliases.
Gene symbols change over time (e.g., BRCA2 was once FACD) without version tracking.
The same symbol can represent different genes across species.
Symbols may be misinterpreted by software (e.g., SEPT9 as “September 9” in Excel).
Formatting inconsistencies exist (e.g., case sensitivity, special characters).
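As a rough illustration of the spreadsheet problem, the sketch below flags symbols that resemble a month followed by a day number. The month list and the helper name `date_like_symbols` are assumptions made for this example; they are not part of lamindb or bionty, and the pattern is not exhaustive.

```python
import re

# Month-like prefixes that spreadsheet software may interpret as dates.
# Illustrative list only; real-world coverage would need to be broader.
_DATE_LIKE = re.compile(
    r"^(JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC)\d{1,2}$",
    re.IGNORECASE,
)


def date_like_symbols(symbols):
    """Return the symbols that look like '<month><day>' and risk date conversion."""
    return [symbol for symbol in symbols if _DATE_LIKE.match(symbol)]


flagged = date_like_symbols(["SEPT9", "MARCH1", "TP53", "DEC1", "BRCA1"])
print(flagged)
```

A check like this only detects the risk; it does not repair symbols that a spreadsheet has already converted.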
Using unique identifiers like ENSEMBL gene IDs addresses these issues by providing:
A direct, stable mapping to genomic coordinates.
Consistency across databases.
Species-specific prefixes to prevent cross-species confusion.
Unique, permanent identifiers with standardized formatting.
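The standardized format can be checked with a short pattern. The sketch below assumes the common layout of an ENSEMBL gene ID: `ENS`, an optional species prefix (empty for human, `MUS` for mouse), the letter `G` for gene, and 11 digits. The helper name `species_prefix` is hypothetical, and organisms whose IDs follow a different scheme are not covered.

```python
import re

# Assumed layout: "ENS" + optional species letters + "G" (gene) + 11 digits.
ENSEMBL_GENE_ID = re.compile(r"^ENS(?P<species>[A-Z]*)G(?P<number>\d{11})$")


def species_prefix(gene_id):
    """Return the species prefix of an ENSEMBL gene ID, or None if malformed.

    An empty string means the ID carries no species prefix (human-style ID).
    """
    match = ENSEMBL_GENE_ID.match(gene_id)
    if match is None:
        return None
    return match.group("species")


for candidate in ["ENSG00000012048", "ENSMUSG00000059552", "BRCA1"]:
    print(candidate, repr(species_prefix(candidate)))
```

Because the prefix is part of the identifier itself, a human-style and a mouse-style ID cannot collide the way identical symbols can.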
Storing ENSEMBL gene IDs alongside gene symbols offers readability for visualization while maintaining robust data integrity. During curation, validating against ENSEMBL gene IDs ensures accurate mapping.
If only symbols are available for a dataset, you can map them to ENSEMBL IDs using standardize().
# !pip install 'lamindb[bionty]'
!lamin init --storage test-symbols --schema bionty
→ connected lamindb: testuser1/test-symbols
import lamindb as ln
import bionty as bt
import numpy as np
import pandas as pd
import anndata as ad
# create example AnnData object with gene symbols
rng = np.random.default_rng(42)
X = rng.integers(0, 100, size=(5, 10))
var = pd.DataFrame(
    index=pd.Index(
        [
            "BRCA1",
            "TP53",
            "EGFR",
            "KRAS",
            "PTEN",
            "MYC",
            "VEGFA",
            "IL6",
            "TNF",
            "GAPDH",
        ],
        name="symbol",
    )
)
adata = ad.AnnData(X=X, var=var)
adata.var
→ connected lamindb: testuser1/test-symbols
symbol |
---|
BRCA1 |
TP53 |
EGFR |
KRAS |
PTEN |
MYC |
VEGFA |
IL6 |
TNF |
GAPDH |
# map gene symbols to ENSEMBL IDs
gene_mapper = bt.Gene.standardize(
    adata.var.index,
    field=bt.Gene.symbol,
    return_field=bt.Gene.ensembl_gene_id,
    return_mapper=True,
    organism="human",
)
# fall back to the original symbol if no mapping was found
adata.var["ensembl_id"] = adata.var.index.map(
    lambda symbol: gene_mapper.get(symbol, symbol)
)
adata.var
! found 10 symbols in Bionty: ['BRCA1', 'GAPDH', 'VEGFA', 'KRAS', 'IL6', 'MYC', 'TP53', 'EGFR', 'PTEN', 'TNF']
please add corresponding Gene records via: `.from_values(['BRCA1', 'GAPDH', 'VEGFA', 'KRAS', 'IL6', 'MYC', 'TP53', 'EGFR', 'PTEN', 'TNF'])`
symbol | ensembl_id |
---|---|
BRCA1 | ENSG00000012048 |
TP53 | ENSG00000141510 |
EGFR | ENSG00000146648 |
KRAS | ENSG00000133703 |
PTEN | ENSG00000171862 |
MYC | ENSG00000136997 |
VEGFA | ENSG00000112715 |
IL6 | ENSG00000136244 |
TNF | ENSG00000204490 |
GAPDH | ENSG00000111640 |
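The mapping step above falls back to the input value when the mapper has no entry, so an unmapped symbol would end up in the ensembl_id column unnoticed. A minimal sketch of how unmapped symbols could be flagged instead, using a plain dictionary in place of the mapper: the two entries are copied from the table above, and `NOT_A_GENE` is a hypothetical symbol added to trigger the unmapped case.

```python
# Stand-in for the mapper returned by standardize(..., return_mapper=True).
# The two entries are taken from the output table above.
gene_mapper = {
    "BRCA1": "ENSG00000012048",
    "TP53": "ENSG00000141510",
}

# "NOT_A_GENE" is a hypothetical symbol with no entry in the mapper.
symbols = ["BRCA1", "TP53", "NOT_A_GENE"]

# Use None rather than the symbol itself as the fallback, so gaps stay visible.
ensembl_ids = [gene_mapper.get(symbol) for symbol in symbols]
unmapped = [symbol for symbol, gene_id in zip(symbols, ensembl_ids) if gene_id is None]

print(ensembl_ids)
print(unmapped)
```

Whether to drop, keep, or manually curate the unmapped entries depends on the dataset; the point of the sketch is only to keep them from passing as identifiers.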
standardized_genes = bt.Gene.from_values(
    [
        "ENSG00000141510",
        "ENSG00000133703",
        "ENSG00000111640",
        "ENSG00000171862",
        "ENSG00000204490",
        "ENSG00000112715",
        "ENSG00000146648",
        "ENSG00000136997",
        "ENSG00000012048",
        "ENSG00000136244",
    ],
    field=bt.Gene.ensembl_gene_id,
    organism="human",
)
ln.save(standardized_genes)
With the Gene records saved, the ensembl_id column can be validated against the Gene registry using the bt.Gene.ensembl_gene_id field.
bt.Gene.validate(adata.var["ensembl_id"], field=bt.Gene.ensembl_gene_id)
array([ True, True, True, True, True, True, True, True, True,
True])
Note
Gene symbols do not map one-to-one with ENSEMBL IDs. A single gene symbol may correspond to multiple ENSEMBL IDs due to:
Gene Paralogs: Similar symbols can be shared among paralogous genes within the same species, resulting in one symbol linking to multiple ENSEMBL IDs.
Pseudogenes: Some symbols represent both functional genes and their non-functional pseudogenes, each with distinct ENSEMBL IDs.
Transcript Variants: One symbol may map to multiple ENSEMBL transcript IDs, each representing different isoforms or splice variants.
standardize() retrieves the first match in cases of multiple hits, which is generally sufficient but not perfectly accurate.
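When first-match behavior is a concern, ambiguous symbols can be listed before mapping. The sketch below groups (symbol, ID) pairs and reports every symbol linked to more than one ID. The helper name `find_ambiguous_symbols` and all pairs are placeholders invented for illustration; they are not real gene records and not part of the bionty API.

```python
from collections import defaultdict


def find_ambiguous_symbols(pairs):
    """Return {symbol: sorted IDs} for symbols linked to more than one ID."""
    ids_by_symbol = defaultdict(set)
    for symbol, gene_id in pairs:
        ids_by_symbol[symbol].add(gene_id)
    return {
        symbol: sorted(ids)
        for symbol, ids in ids_by_symbol.items()
        if len(ids) > 1
    }


# Placeholder pairs: GENE_A is linked to two IDs, GENE_B to one.
pairs = [
    ("GENE_A", "ID_0001"),
    ("GENE_A", "ID_0002"),
    ("GENE_B", "ID_0003"),
]

ambiguous = find_ambiguous_symbols(pairs)
print(ambiguous)
```

Symbols reported this way are candidates for manual review rather than automatic first-match assignment.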
!lamin delete --force test-symbols
• deleting instance testuser1/test-symbols