Perturbation¶
This guide demonstrates how to curate a complex, real-world perturbation dataset from McFarland et al. 2020 using the wetlab schema.
# !pip install 'lamindb[jupyter,aws,bionty]' wetlab
!lamin init --storage ./test-perturbation --schema bionty,wetlab
→ connected lamindb: testuser1/test-perturbation
import lamindb as ln
import bionty as bt
import wetlab as wl
import pandas as pd
pd.set_option("display.max_columns", None)
ln.context.uid = "K6sInKIQW5nt0003"
ln.context.track()
→ connected lamindb: testuser1/test-perturbation
→ created Transform('K6sInKIQ'), started new Run('D7kGZUCg') at 2024-11-21 06:57:13 UTC
→ notebook imports: bionty==0.53.1 lamindb==0.76.16 pandas==2.2.3 wetlab==0.34.0
# See https://lamin.ai/laminlabs/lamindata/transform/13VINnFk89PE0006 to learn how this dataset was prepared
adata = ln.Artifact.using("laminlabs/lamindata").get(uid="Xk7Qaik9vBLV4PKf0001").load()
adata.obs.head(3)
→ completing transfer to track Artifact('Xk7Qaik9') as input
→ mapped records:
→ transferred records: Artifact(uid='Xk7Qaik9vBLV4PKf0001'), Storage(uid='D9BilDV2')
| depmap_id | cancer | cell_det_rate | cell_line | cell_quality | channel | disease | dose_unit | dose_value | doublet_CL1 | doublet_CL2 | doublet_GMM_prob | doublet_dev_imp | doublet_z_margin | hash_assignment | hash_tag | num_SNPs | organism | perturbation | perturbation_type | sex | singlet_ID | singlet_dev | singlet_dev_z | singlet_margin | singlet_z_margin | time | tissue_type | tot_reads | nperts | ngenes | ncounts | percent_mito | percent_ribo | chembl-ID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AACTGGTGTCTCTCTG | ACH-000390 | True | 0.093159 | LUDLU-1 | normal | nan | lung cancer | µM | 0.1 | LUDLU1_LUNG | TE14_OESOPHAGUS | 2.269468e-10 | 0.009426 | 0.403316 | nan | nan | 481 | human | trametinib | drug | Male | LUDLU1_LUNG | 0.655877 | 14.860933 | 0.462273 | 12.351139 | 24 | cell_line | 787 | 1 | 3045 | 12895.0 | 3.202792 | 24.955409 | CHEMBL2103875 |
ATAGGCTCAGATTTCG | ACH-000444 | True | 0.145728 | LU99 | normal | 2 | lung cancer | µM | 0.5 | LU99_LUNG | MCAS_OVARY | 8.562908e-04 | 0.010173 | 0.188284 | nan | nan | 1003 | human | afatinib | drug | Male | LU99_LUNG | 0.762847 | 10.648094 | 0.474590 | 8.164565 | 24 | cell_line | 1597 | 1 | 4763 | 23161.0 | 7.473771 | 18.051898 | CHEMBL1173655 |
GCCAAATCAAGCCGTC | ACH-000396 | True | 0.117330 | J82 | normal | nan | urinary bladder carcinoma | µM | 0.1 | J82_URINARY_TRACT | IGR1_SKIN | 6.490367e-08 | 0.009686 | 1.185862 | nan | nan | 647 | human | dabrafenib | drug | Male | J82_URINARY_TRACT | 0.651059 | 14.740111 | 0.404508 | 11.188513 | 24 | cell_line | 1159 | 1 | 3834 | 18062.0 | 2.762706 | 22.085040 | CHEMBL2028663 |
curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    organism="human",
    using_key="laminlabs/lamindata",
)
curate.validate()
✓ added 8 records from laminlabs/lamindata with Feature.name for columns: 'tissue_type', 'perturbation_type', 'time', 'disease', 'sex', 'perturbation', 'organism', 'cell_line'
→ validating metadata using registries of instance laminlabs/lamindata
• saving validated records of 'var_index'
✓ added 1864 records from public with Gene.ensembl_gene_id for var_index: 'ENSG00000102316', 'ENSG00000109472', 'ENSG00000080007', 'ENSG00000203926', 'ENSG00000232301', 'ENSG00000127419', 'ENSG00000266869', 'ENSG00000108960', 'ENSG00000261316', 'ENSG00000126870', 'ENSG00000121797', 'ENSG00000243927', 'ENSG00000143473', 'ENSG00000261488', 'ENSG00000115665', 'ENSG00000180613', 'ENSG00000167283', 'ENSG00000160472', 'ENSG00000273387', 'ENSG00000110768', ...
✓ added 7 records from laminlabs/lamindata with Gene.ensembl_gene_id for var_index: 'ENSG00000214970', 'ENSG00000215067', 'ENSG00000255823', 'ENSG00000258301', 'ENSG00000258631', 'ENSG00000263388', 'ENSG00000272370'
✓ 'var_index' is validated against Gene.ensembl_gene_id
True
# The cells were subject to several types of perturbations that we will curate separately
adata.obs.perturbation_type.value_counts()
perturbation_type
drug 855
CRISPR 145
Name: count, dtype: int64
Curate non-perturbation metadata¶
categoricals = {
    "depmap_id": bt.CellLine.ontology_id,
    "cell_line": bt.CellLine.name,
    "disease": bt.Disease.name,
    "organism": bt.Organism.name,
    "perturbation_type": ln.ULabel.name,
    "sex": bt.Phenotype.name,
    "time": ln.ULabel.name,
    "tissue_type": ln.ULabel.name,
}
sources = {
    "depmap_id": bt.Source.using("laminlabs/lamindata").filter(name="depmap").one(),
    "cell_line": bt.Source.using("laminlabs/lamindata").filter(name="depmap").one(),
}
curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals=categoricals,
    organism="human",
    sources=sources,
    using_key="laminlabs/lamindata",
)
curate.validate()
✓ added 1 record with Feature.name for columns: 'depmap_id'
→ validating metadata using registries of instance laminlabs/lamindata
• saving validated records of 'depmap_id'
• saving validated records of 'disease'
✓ 'var_index' is validated against Gene.ensembl_gene_id
✓ 'depmap_id' is validated against CellLine.ontology_id
✓ 'cell_line' is validated against CellLine.name
✓ 'disease' is validated against Disease.name
✓ 'organism' is validated against Organism.name
• mapping perturbation_type on ULabel.name
! 2 terms are not validated: 'drug', 'CRISPR'
→ fix typos, remove non-existent values, or save terms via .add_new_from('perturbation_type')
• mapping sex on Phenotype.name
! 3 terms are not validated: 'Male', 'Female', 'Unknown'
→ fix typos, remove non-existent values, or save terms via .add_new_from('sex')
• mapping time on ULabel.name
! 4 terms are not validated: '24', '72, 96', '3, 6, 12, 24, 48', '6'
→ fix typos, remove non-existent values, or save terms via .add_new_from('time')
• mapping tissue_type on ULabel.name
! 1 term is not validated: 'cell_line'
→ fix typo, remove non-existent value, or save term via .add_new_from('tissue_type')
False
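Conceptually, validation compares the unique values of each categorical column against the names held in a registry and reports the values that are missing. The following is a minimal plain-Python sketch of that idea, not LaminDB's implementation; `validate_column` and the toy registry contents are hypothetical and exist only for illustration.

```python
def validate_column(values, registry_terms):
    """Return (is_valid, sorted list of values absent from the registry)."""
    unique_values = set(values)
    non_validated = sorted(unique_values - set(registry_terms))
    return len(non_validated) == 0, non_validated


# Toy data mirroring the 'perturbation_type' and 'organism' columns above.
obs = {
    "perturbation_type": ["drug", "CRISPR", "drug"],
    "organism": ["human", "human", "human"],
}
# Hypothetical registry contents: no perturbation types registered yet.
registries = {
    "perturbation_type": set(),
    "organism": {"human", "mouse"},
}

report = {
    column: validate_column(values, registries[column])
    for column, values in obs.items()
}
for column, (is_valid, missing) in report.items():
    status = "validated" if is_valid else f"not validated: {missing}"
    print(f"{column}: {status}")
```

In the guide itself, the terms flagged this way are the ones registered with `add_new_from` in the next cell.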
curate.add_new_from("perturbation_type")
curate.add_new_from("sex")
curate.add_new_from("time")
curate.add_new_from("tissue_type")
curate.add_new_from("cell_line")
✓ added 2 records with ULabel.name for perturbation_type: 'CRISPR', 'drug'
✓ added 3 records with Phenotype.name for sex: 'Unknown', 'Female', 'Male'
✓ added 4 records with ULabel.name for time: '72, 96', '6', '24', '3, 6, 12, 24, 48'
✓ added 1 record with ULabel.name for tissue_type: 'cell_line'
Modeling and curating perturbation metadata¶
The dataset has two types of perturbations: CRISPR and compounds. We create their records and associated targets separately.
crispr_metadata = adata.obs[adata.obs["perturbation_type"] == "CRISPR"]
drug_metadata = adata.obs[adata.obs["perturbation_type"] == "drug"]
The wetlab schema has two major components:
wetlab.EnvironmentalTreatment to model perturbations such as heat, wetlab.GeneticTreatment to model perturbations such as CRISPR, and wetlab.CompoundTreatment to model, for example, drugs. Several treatments together can be modeled using wetlab.CombinationTreatment.
Known targets of treatments can be modeled through wetlab.TreatmentTarget, which can be one or several of bionty.Gene, bionty.Protein, or bionty.Pathway records.
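The relationship between treatments, targets, and genes can be pictured with a few plain dataclasses. This is only an illustrative sketch: `ToyGene`, `ToyTarget`, and `ToyGeneticTreatment` are hypothetical stand-ins, not the wetlab or bionty classes, and they carry far fewer fields than the real registries.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGene:
    symbol: str


@dataclass
class ToyTarget:
    name: str
    genes: list = field(default_factory=list)


@dataclass
class ToyGeneticTreatment:
    name: str
    system: str
    targets: list = field(default_factory=list)


gpx4 = ToyGene(symbol="GPX4")
target = ToyTarget(name="Glutathione Peroxidase 4", genes=[gpx4])

# Two guides point at the same target record, which links to one gene.
guides = [
    ToyGeneticTreatment(name=name, system="CRISPR Cas9", targets=[target])
    for name in ("sgGPX4-1", "sgGPX4-2")
]

targeted_symbols = {
    gene.symbol for guide in guides for t in guide.targets for gene in t.genes
}
print(targeted_symbols)
```

The point of the indirection is that several treatments can share a single target record instead of each repeating the gene annotation.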
Genetic perturbations¶
Genetic perturbations can be modeled in two ways, depending on the available information, by populating a:
wetlab.GeneticTreatment record if details of the system, such as the guide RNA name or sequence and the on- and off-target scores, are known.
wetlab.TreatmentTarget record that links to bionty.Gene records.
crispr_metadata.head(3)
| depmap_id | cancer | cell_det_rate | cell_line | cell_quality | channel | disease | dose_unit | dose_value | doublet_CL1 | doublet_CL2 | doublet_GMM_prob | doublet_dev_imp | doublet_z_margin | hash_assignment | hash_tag | num_SNPs | organism | perturbation | perturbation_type | sex | singlet_ID | singlet_dev | singlet_dev_z | singlet_margin | singlet_z_margin | time | tissue_type | tot_reads | nperts | ngenes | ncounts | percent_mito | percent_ribo | chembl-ID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TAGTTGGAGATCGATA | ACH-000723 | True | 0.132708 | YD-10B | low_quality | nan | head and neck cancer | nan | NaN | YD10B_UPPER_AERODIGESTIVE_TRACT | 647V_URINARY_TRACT | NaN | 0.156492 | 1.556214 | nan | nan | 874 | human | sggpx4-2 | CRISPR | Male | YD10B_UPPER_AERODIGESTIVE_TRACT | 0.292802 | 3.272682 | 0.016459 | 0.330120 | 72, 96 | cell_line | 2105 | 1 | 4341 | 20693.0 | 0.695887 | 16.242208 | NaN |
CATCGGGGTTCATGGT | ACH-000219 | True | 0.087860 | A-375 | normal | nan | skin cancer | nan | NaN | A375_SKIN | DAOY_CENTRAL_NERVOUS_SYSTEM | 2.496623e-07 | 0.007701 | 0.088255 | nan | nan | 524 | human | sglacz | CRISPR | Female | A375_SKIN | 0.671925 | 13.649916 | 0.464200 | 11.962996 | 72, 96 | cell_line | 1035 | 1 | 2919 | 13771.0 | 2.730375 | 40.592550 | NaN |
AAATGCCTCGTGGACC-1 | ACH-000762 | True | 0.075085 | YD-38 | normal | nan | head and neck cancer | nan | NaN | YD38_UPPER_AERODIGESTIVE_TRACT | IGR1_SKIN | 2.366136e-02 | 0.032628 | 0.158193 | nan | nan | 407 | human | sggpx4-2 | CRISPR | Male | YD38_UPPER_AERODIGESTIVE_TRACT | 0.571537 | 7.417734 | 0.331906 | 5.552359 | 72, 96 | cell_line | 829 | 1 | 2456 | 10996.0 | 2.528192 | 39.841761 | NaN |
list(crispr_metadata["perturbation"].unique())
['sggpx4-2', 'sglacz', 'sggpx4-1', 'sgor2j2']
What are the associated targets?
The following targets are the direct targets of the perturbations, and while they may affect a pathway, we only curate the direct targets for simplicity.
sgGPX4-1: Gene/Protein - GPX4 (Glutathione Peroxidase 4)
sgGPX4-2: Gene/Protein - GPX4 (Glutathione Peroxidase 4)
sgLACZ: Gene/Protein - LACZ (β-galactosidase)
sgOR2J2: Gene/Protein - OR2J2 (Olfactory receptor family 2 subfamily J member 2)
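The guide names in this dataset appear to follow the pattern `sg<symbol>` with an optional `-<number>` suffix. Assuming that convention holds (it is inferred from the four names above, not documented by the source), the target symbol can be derived from a guide name as in this small sketch; `parse_guide` is a hypothetical helper.

```python
import re

# Assumed convention: "sg" + gene symbol + optional "-<replicate number>".
GUIDE_PATTERN = re.compile(
    r"^sg(?P<symbol>[A-Za-z0-9]+)(?:-(?P<replicate>\d+))?$"
)


def parse_guide(name):
    """Split a guide RNA name into (upper-cased symbol, replicate or None)."""
    match = GUIDE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected guide name: {name!r}")
    symbol = match.group("symbol").upper()
    replicate = match.group("replicate")
    replicate_number = int(replicate) if replicate is not None else None
    return symbol, replicate_number


guide_names = ["sggpx4-2", "sglacz", "sggpx4-1", "sgor2j2"]
parsed = {name: parse_guide(name) for name in guide_names}
for name, (symbol, replicate_number) in parsed.items():
    print(name, symbol, replicate_number)
```

A derived symbol still needs to be checked against a gene registry; LACZ, for example, is an E. coli control and not a human gene.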
Since the perturbation metadata contains the guide RNA names, we model the genetic perturbations using both wetlab.GeneticTreatment and wetlab.TreatmentTarget.
treatments = [
    ("sgGPX4-1", "GPX4", "Glutathione Peroxidase 4"),
    ("sgGPX4-2", "GPX4", "Glutathione Peroxidase 4"),
    ("sgor2j2", "or2j2", "Olfactory receptor family 2 subfamily J member 2"),
    ("sgLACZ", "lacz", "beta-galactosidase control"),  # Control from E. coli
]
organism = bt.Organism.lookup().human
genetic_treatments = []
for name, symbol, target_name in treatments:
    treatment = wl.GeneticTreatment(system="CRISPR Cas9", name=name).save()
    if symbol != "lacz":
        gene_result = bt.Gene.from_source(symbol=symbol, organism=organism)
        gene = gene_result[0] if isinstance(gene_result, list) else gene_result
        gene = gene.save()
    else:
        gene = bt.Gene(symbol=symbol, organism=organism).save()
    target = wl.TreatmentTarget(name=target_name).save()
    target.genes.add(gene)
    treatment.targets.add(target)
    genetic_treatments.append(treatment)
✓ created 1 Gene record from Bionty matching symbol: 'GPX4'
! record with similar name exists! did you mean to load it?
id | uid | name | system | sequence | on_target_score | off_target_score | run_id | created_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|
1 | Us3vqdde3Yz5 | sgGPX4-1 | CRISPR Cas9 | None | None | None | 1 | 2024-11-21 06:57:33.377621+00:00 | 1 |
→ returning existing TreatmentTarget record with same name: 'Glutathione Peroxidase 4'
✓ created 1 Gene record from Bionty matching synonyms: 'or2j2'
! ambiguous validation in Bionty for 1 record: 'OR2J2'
Compound perturbations¶
Although the targets are known for many compounds, we skip annotating them here to keep the guide brief.
What are the compound targets?
AZD5591: Unknown
Afatinib: Proteins - EGFR (Epidermal Growth Factor Receptor), HER2 (Human Epidermal growth factor Receptor 2)
BRD3379: Unknown
Bortezomib: Protein complex - Proteasome (specifically the 26S proteasome subunit)
Dabrafenib: Gene/Protein - BRAF (V600E mutation in the BRAF gene, which codes for a protein kinase)
Everolimus: Protein - mTOR (Mammalian Target of Rapamycin)
Gemcitabine: Pathway/Process - DNA synthesis (inhibition of ribonucleotide reductase and incorporation into DNA)
Idasanutlin: Protein - MDM2 (Mouse Double Minute 2 homolog)
JQ1: Protein - BRD4 (Bromodomain-containing protein 4)
Navitoclax: Proteins - BCL-2, BCL-XL (B-cell lymphoma 2 and B-cell lymphoma-extra large)
Prexasertib: Protein - CHK1 (Checkpoint kinase 1)
Taselisib: Protein/Pathway - PI3K (Phosphoinositide 3-kinase)
Trametinib: Proteins - MEK1/2 (Mitogen-Activated Protein Kinase Kinase 1 and 2)
control: Not applicable
# We are using the chebi/chembl chemistry/drug ontology for the drug perturbations
chebi_source = bt.Source.filter(entity="Drug", name="chebi").one()
wl.Compound.add_source(chebi_source)
compounds = wl.Compound.public()
→ due to lack of write access, LaminDB won't manage storage location: s3://bionty-assets/
• path in storage 's3://bionty-assets' with key 'df_all__chebi__2024-07-27__Drug.parquet'
→ source added!
drug_metadata.head(3)
| depmap_id | cancer | cell_det_rate | cell_line | cell_quality | channel | disease | dose_unit | dose_value | doublet_CL1 | doublet_CL2 | doublet_GMM_prob | doublet_dev_imp | doublet_z_margin | hash_assignment | hash_tag | num_SNPs | organism | perturbation | perturbation_type | sex | singlet_ID | singlet_dev | singlet_dev_z | singlet_margin | singlet_z_margin | time | tissue_type | tot_reads | nperts | ngenes | ncounts | percent_mito | percent_ribo | chembl-ID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AACTGGTGTCTCTCTG | ACH-000390 | True | 0.093159 | LUDLU-1 | normal | nan | lung cancer | µM | 0.1 | LUDLU1_LUNG | TE14_OESOPHAGUS | 2.269468e-10 | 0.009426 | 0.403316 | nan | nan | 481 | human | trametinib | drug | Male | LUDLU1_LUNG | 0.655877 | 14.860933 | 0.462273 | 12.351139 | 24 | cell_line | 787 | 1 | 3045 | 12895.0 | 3.202792 | 24.955409 | CHEMBL2103875 |
ATAGGCTCAGATTTCG | ACH-000444 | True | 0.145728 | LU99 | normal | 2 | lung cancer | µM | 0.5 | LU99_LUNG | MCAS_OVARY | 8.562908e-04 | 0.010173 | 0.188284 | nan | nan | 1003 | human | afatinib | drug | Male | LU99_LUNG | 0.762847 | 10.648094 | 0.474590 | 8.164565 | 24 | cell_line | 1597 | 1 | 4763 | 23161.0 | 7.473771 | 18.051898 | CHEMBL1173655 |
GCCAAATCAAGCCGTC | ACH-000396 | True | 0.117330 | J82 | normal | nan | urinary bladder carcinoma | µM | 0.1 | J82_URINARY_TRACT | IGR1_SKIN | 6.490367e-08 | 0.009686 | 1.185862 | nan | nan | 647 | human | dabrafenib | drug | Male | J82_URINARY_TRACT | 0.651059 | 14.740111 | 0.404508 | 11.188513 | 24 | cell_line | 1159 | 1 | 3834 | 18062.0 | 2.762706 | 22.085040 | CHEMBL2028663 |
compounds = wl.Compound.from_values(drug_metadata["perturbation"], field="name")
✓ created 8 Compound records from Bionty matching name: 'trametinib', 'afatinib', 'dabrafenib', 'gemcitabine', 'navitoclax', 'bortezomib', 'JQ1', 'everolimus'
! did not create Compound records for 6 non-validated names: 'azd5591', 'brd3379', 'control', 'idasanutlin', 'prexasertib', 'taselisib'
# The remaining compounds are not in chebi and we create records for them
for missing in [
    "azd5591",
    "brd3379",
    "control",
    "idasanutlin",
    "prexasertib",
    "taselisib",
]:
    compounds.append(wl.Compound(name=missing))
ln.save(compounds)
unique_treatments = drug_metadata[
    ["perturbation", "dose_unit", "dose_value"]
].drop_duplicates()
compound_treatments = []
for _, row in unique_treatments.iterrows():
    compound = wl.Compound.get(name=row["perturbation"])
    treatment = wl.CompoundTreatment(
        name=compound.name,
        concentration=row["dose_value"],
        concentration_unit=row["dose_unit"],
    )
    compound_treatments.append(treatment)
ln.save(compound_treatments)
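The deduplication above keeps one row per distinct combination of perturbation, dose unit, and dose value, so a compound applied at several doses yields several treatment records. A plain-Python sketch of the same idea follows; the rows are toy values chosen for illustration (the last one is hypothetical and not taken from the dataset).

```python
rows = [
    {"perturbation": "trametinib", "dose_unit": "µM", "dose_value": 0.1},
    {"perturbation": "trametinib", "dose_unit": "µM", "dose_value": 0.1},
    {"perturbation": "afatinib", "dose_unit": "µM", "dose_value": 0.5},
    {"perturbation": "trametinib", "dose_unit": "µM", "dose_value": 0.5},
]
keys = ("perturbation", "dose_unit", "dose_value")

# dict.fromkeys drops repeated tuples while preserving first-seen order.
unique_treatments = list(
    dict.fromkeys(tuple(row[key] for key in keys) for row in rows)
)
for perturbation, dose_unit, dose_value in unique_treatments:
    print(f"{perturbation}: {dose_value} {dose_unit}")
```

Here trametinib appears twice in the result because it occurs at two different doses, while the exact duplicate row is dropped.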
Register curated artifact¶
artifact = curate.save_artifact(description="McFarland AnnData")
→ validating metadata using registries of instance laminlabs/lamindata
✓ 'var_index' is validated against Gene.ensembl_gene_id
✓ 'depmap_id' is validated against CellLine.ontology_id
✓ 'cell_line' is validated against CellLine.name
✓ 'disease' is validated against Disease.name
✓ 'organism' is validated against Organism.name
✓ 'perturbation_type' is validated against ULabel.name
✓ 'sex' is validated against Phenotype.name
✓ 'time' is validated against ULabel.name
✓ 'tissue_type' is validated against ULabel.name
! 26 unique terms (74.30%) are not validated for name: 'cancer', 'cell_det_rate', 'cell_quality', 'channel', 'dose_unit', 'dose_value', 'doublet_CL1', 'doublet_CL2', 'doublet_GMM_prob', 'doublet_dev_imp', ...
artifact.genetic_treatments.set(genetic_treatments)
artifact.compound_treatments.set(compound_treatments)
artifact.describe()
Artifact(uid='GfepgcpSQKrWaqoN0000', is_latest=True, description='McFarland AnnData', suffix='.h5ad', type='dataset', size=2511528, hash='-NmtOcllwkwqByHg6iGdBA', n_observations=1000, _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, created_at=2024-11-21 06:57:41 UTC)
Provenance
.storage = '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-perturbation'
.transform = 'Perturbation'
.run = 2024-11-21 06:57:13 UTC
.created_by = 'testuser1'
Labels
.organisms = 'human'
.diseases = 'liver cancer', 'uterine corpus cancer', 'colorectal cancer', 'malignant pancreatic neoplasm', 'brain cancer', 'urinary bladder carcinoma', 'sarcoma', 'ovarian cancer', 'breast cancer', 'skin cancer', ...
.cell_lines = 'CCF-STTG1', 'ONS-76', 'UM-UC-1', 'YD-38', 'WM-266-4', 'FTC-238', 'MES-SA', 'Daoy', 'PANC-1', 'COLO 668', ...
.phenotypes = 'Unknown', 'Female', 'Male'
.genetic_treatments = 'sgGPX4-1', 'sgGPX4-2', 'sgor2j2', 'sgLACZ'
.compound_treatments = 'trametinib', 'afatinib', 'dabrafenib', 'gemcitabine', 'navitoclax', 'bortezomib', 'brd3379', 'JQ1', 'azd5591', 'control', ...
.ulabels = 'CRISPR', 'drug', '72, 96', '6', '24', '3, 6, 12, 24, 48', 'cell_line'
Features
'cell_line' = '22Rv1', '253J-BV', '42-MG-BA', '639-V', '647-V', '769-P', '786-O', '8505C', 'A-375', 'A2780', ...
'depmap_id' = 'NIH:OVCAR-3', 'Hs 294T', 'NCI-H1581', 'T24', 'NCI-H1693', 'PA-TU-8988S', 'PA-TU-8988T', '253J-BV', 'NCI-H1650', 'S-117', ...
'disease' = 'bile duct cancer', 'bone cancer', 'brain cancer', 'breast cancer', 'colorectal cancer', 'esophageal cancer', 'gastric cancer', 'head and neck cancer', 'kidney cancer', 'liver cancer', ...
'organism' = 'human'
'perturbation_type' = 'CRISPR', 'drug'
'sex' = 'Female', 'Male', 'Unknown'
'time' = '24', '3, 6, 12, 24, 48', '6', '72, 96'
'tissue_type' = 'cell_line'
Feature sets
'var' = 'MAGED2', 'CPE', 'DDX43', 'SPANXA2', 'LNCPRESS1', 'TMEM175', 'None', 'MMD', 'LINC01834', 'DYNC2I1', 'CCRL2', 'MRPS6', 'KCNH1', 'TBILA', 'SLC5A7', 'GSX2', 'ATP5MG', 'TMEM190', 'GTF2H1'
'obs' = 'tissue_type', 'perturbation_type', 'time', 'disease', 'sex', 'perturbation', 'organism', 'cell_line', 'depmap_id'
# clean up test instance
!rm -r test-perturbation
!lamin delete --force test-perturbation
• deleting instance testuser1/test-perturbation