Perturbation

This guide demonstrates how to curate a complex, real-world perturbation dataset from McFarland et al. 2020 using the wetlab schema.

# !pip install 'lamindb[jupyter,aws,bionty]' wetlab
!lamin init --storage ./test-perturbation --schema bionty,wetlab
→ connected lamindb: testuser1/test-perturbation
import lamindb as ln
import bionty as bt
import wetlab as wl
import pandas as pd

pd.set_option("display.max_columns", None)

ln.context.uid = "K6sInKIQW5nt0003"
ln.context.track()
→ connected lamindb: testuser1/test-perturbation
→ created Transform('K6sInKIQ'), started new Run('D7kGZUCg') at 2024-11-21 06:57:13 UTC
→ notebook imports: bionty==0.53.1 lamindb==0.76.16 pandas==2.2.3 wetlab==0.34.0
# See https://lamin.ai/laminlabs/lamindata/transform/13VINnFk89PE0006 to learn how this dataset was prepared
adata = ln.Artifact.using("laminlabs/lamindata").get(uid="Xk7Qaik9vBLV4PKf0001").load()
adata.obs.head(3)
→ completing transfer to track Artifact('Xk7Qaik9') as input
→ mapped records: 
→ transferred records: Artifact(uid='Xk7Qaik9vBLV4PKf0001'), Storage(uid='D9BilDV2')
depmap_id cancer cell_det_rate cell_line cell_quality channel disease dose_unit dose_value doublet_CL1 doublet_CL2 doublet_GMM_prob doublet_dev_imp doublet_z_margin hash_assignment hash_tag num_SNPs organism perturbation perturbation_type sex singlet_ID singlet_dev singlet_dev_z singlet_margin singlet_z_margin time tissue_type tot_reads nperts ngenes ncounts percent_mito percent_ribo chembl-ID
AACTGGTGTCTCTCTG ACH-000390 True 0.093159 LUDLU-1 normal nan lung cancer µM 0.1 LUDLU1_LUNG TE14_OESOPHAGUS 2.269468e-10 0.009426 0.403316 nan nan 481 human trametinib drug Male LUDLU1_LUNG 0.655877 14.860933 0.462273 12.351139 24 cell_line 787 1 3045 12895.0 3.202792 24.955409 CHEMBL2103875
ATAGGCTCAGATTTCG ACH-000444 True 0.145728 LU99 normal 2 lung cancer µM 0.5 LU99_LUNG MCAS_OVARY 8.562908e-04 0.010173 0.188284 nan nan 1003 human afatinib drug Male LU99_LUNG 0.762847 10.648094 0.474590 8.164565 24 cell_line 1597 1 4763 23161.0 7.473771 18.051898 CHEMBL1173655
GCCAAATCAAGCCGTC ACH-000396 True 0.117330 J82 normal nan urinary bladder carcinoma µM 0.1 J82_URINARY_TRACT IGR1_SKIN 6.490367e-08 0.009686 1.185862 nan nan 647 human dabrafenib drug Male J82_URINARY_TRACT 0.651059 14.740111 0.404508 11.188513 24 cell_line 1159 1 3834 18062.0 2.762706 22.085040 CHEMBL2028663
curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    organism="human",
    using_key="laminlabs/lamindata"
)
curate.validate()
✓ added 8 records from laminlabs/lamindata with Feature.name for columns: 'tissue_type', 'perturbation_type', 'time', 'disease', 'sex', 'perturbation', 'organism', 'cell_line'
→ validating metadata using registries of instance laminlabs/lamindata
• saving validated records of 'var_index'
✓ added 1864 records from public with Gene.ensembl_gene_id for var_index: 'ENSG00000102316', 'ENSG00000109472', 'ENSG00000080007', 'ENSG00000203926', 'ENSG00000232301', 'ENSG00000127419', 'ENSG00000266869', 'ENSG00000108960', 'ENSG00000261316', 'ENSG00000126870', 'ENSG00000121797', 'ENSG00000243927', 'ENSG00000143473', 'ENSG00000261488', 'ENSG00000115665', 'ENSG00000180613', 'ENSG00000167283', 'ENSG00000160472', 'ENSG00000273387', 'ENSG00000110768', ...
✓ added 7 records from laminlabs/lamindata with Gene.ensembl_gene_id for var_index: 'ENSG00000214970', 'ENSG00000215067', 'ENSG00000255823', 'ENSG00000258301', 'ENSG00000258631', 'ENSG00000263388', 'ENSG00000272370'
✓ 'var_index' is validated against Gene.ensembl_gene_id
True
# The cells were subjected to two types of perturbations, which we curate separately
adata.obs.perturbation_type.value_counts()
perturbation_type
drug      855
CRISPR    145
Name: count, dtype: int64

Curate non-perturbation metadata

categoricals = {
    "depmap_id": bt.CellLine.ontology_id,
    "cell_line": bt.CellLine.name,
    "disease": bt.Disease.name,
    "organism": bt.Organism.name,
    "perturbation_type": ln.ULabel.name,
    "sex": bt.Phenotype.name,
    "time": ln.ULabel.name,
    "tissue_type": ln.ULabel.name,
}
sources = {
    "depmap_id": bt.Source.using("laminlabs/lamindata").filter(name="depmap").one(),
    "cell_line": bt.Source.using("laminlabs/lamindata").filter(name="depmap").one(),
}

curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals=categoricals,
    organism="human",
    sources=sources,
    using_key="laminlabs/lamindata"
)

curate.validate()
✓ added 1 record with Feature.name for columns: 'depmap_id'
→ validating metadata using registries of instance laminlabs/lamindata
• saving validated records of 'depmap_id'
• saving validated records of 'disease'
✓ 'var_index' is validated against Gene.ensembl_gene_id
✓ 'depmap_id' is validated against CellLine.ontology_id
✓ 'cell_line' is validated against CellLine.name
✓ 'disease' is validated against Disease.name
✓ 'organism' is validated against Organism.name
• mapping perturbation_type on ULabel.name
!    2 terms are not validated: 'drug', 'CRISPR'
→ fix typos, remove non-existent values, or save terms via .add_new_from('perturbation_type')
• mapping sex on Phenotype.name
!    3 terms are not validated: 'Male', 'Female', 'Unknown'
→ fix typos, remove non-existent values, or save terms via .add_new_from('sex')
• mapping time on ULabel.name
!    4 terms are not validated: '24', '72, 96', '3, 6, 12, 24, 48', '6'
→ fix typos, remove non-existent values, or save terms via .add_new_from('time')
• mapping tissue_type on ULabel.name
!    1 term is not validated: 'cell_line'
→ fix typo, remove non-existent value, or save term via .add_new_from('tissue_type')
False
curate.add_new_from("perturbation_type")
curate.add_new_from("sex")
curate.add_new_from("time")
curate.add_new_from("tissue_type")
curate.add_new_from("cell_line")
✓ added 2 records with ULabel.name for perturbation_type: 'CRISPR', 'drug'
✓ added 3 records with Phenotype.name for sex: 'Unknown', 'Female', 'Male'
✓ added 4 records with ULabel.name for time: '72, 96', '6', '24', '3, 6, 12, 24, 48'
✓ added 1 record with ULabel.name for tissue_type: 'cell_line'
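
Conceptually, validate() checks each column's values against a registry, and add_new_from() registers the values that were not found. The following is a minimal in-memory sketch of that idea, not LaminDB's implementation: the functions validate_terms and add_new_terms and the set ulabel_registry are hypothetical stand-ins for the Curator methods and the ULabel registry.

```python
def validate_terms(values, registry):
    """Return the distinct values missing from the registry, in first-seen order."""
    missing = []
    for value in values:
        if value not in registry and value not in missing:
            missing.append(value)
    return missing


def add_new_terms(values, registry):
    """Register every missing value and return the terms that were added."""
    added = validate_terms(values, registry)
    registry.update(added)
    return added


# Hypothetical in-memory registry standing in for the ULabel table
ulabel_registry = set()
perturbation_type_column = ["drug", "CRISPR", "drug", "drug"]

not_validated = validate_terms(perturbation_type_column, ulabel_registry)
added_terms = add_new_terms(perturbation_type_column, ulabel_registry)
still_missing = validate_terms(perturbation_type_column, ulabel_registry)
```

After the new terms are added, a second validation pass reports nothing missing, which mirrors why the columns validate when the artifact is saved later in this guide.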

Modeling and curating perturbation metadata

The dataset has two types of perturbations: CRISPR and compound (drug) perturbations. We will create their records and associated targets separately.

crispr_metadata = adata.obs[adata.obs["perturbation_type"] == "CRISPR"]
drug_metadata = adata.obs[adata.obs["perturbation_type"] == "drug"]

The wetlab schema has two major components:

  1. wetlab.EnvironmentalTreatment to model perturbations such as heat, wetlab.GeneticTreatment to model perturbations such as CRISPR, and wetlab.CompoundTreatment to model, for example, drugs. Several treatments together can be modeled using wetlab.CombinationTreatment.

  2. Known targets of treatments can be modeled through wetlab.TreatmentTarget which can be one or several of bionty.Gene, bionty.Protein, or bionty.Pathway records.
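
The relationship between treatments, targets, and genes described above can be sketched with plain dataclasses. These classes are simplified, hypothetical stand-ins that only borrow the names of the wetlab and bionty registries; the real registries are database-backed and carry more fields.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    # Stand-in for a bionty.Gene record
    symbol: str


@dataclass
class TreatmentTarget:
    # Stand-in for wetlab.TreatmentTarget: a named target linked to genes
    name: str
    genes: list = field(default_factory=list)


@dataclass
class GeneticTreatment:
    # Stand-in for wetlab.GeneticTreatment: a perturbation linked to targets
    name: str
    system: str
    targets: list = field(default_factory=list)


gpx4 = Gene(symbol="GPX4")
target = TreatmentTarget(name="Glutathione Peroxidase 4", genes=[gpx4])
treatment = GeneticTreatment(name="sgGPX4-1", system="CRISPR Cas9", targets=[target])

# Walk treatment -> targets -> genes to list the targeted gene symbols
targeted_symbols = [gene.symbol for t in treatment.targets for gene in t.genes]
```

Keeping targets as their own records lets several treatments share one target, which is the situation for the two GPX4 guides in this dataset.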

Genetic perturbations

Genetic perturbations can be modeled in two ways, depending on the available information, by populating:

  1. A wetlab.GeneticTreatment record, if details of the system such as the guide RNA name or sequence and the on- and off-target scores are known.

  2. A wetlab.TreatmentTarget record that links to bionty.Gene records.

crispr_metadata.head(3)
depmap_id cancer cell_det_rate cell_line cell_quality channel disease dose_unit dose_value doublet_CL1 doublet_CL2 doublet_GMM_prob doublet_dev_imp doublet_z_margin hash_assignment hash_tag num_SNPs organism perturbation perturbation_type sex singlet_ID singlet_dev singlet_dev_z singlet_margin singlet_z_margin time tissue_type tot_reads nperts ngenes ncounts percent_mito percent_ribo chembl-ID
TAGTTGGAGATCGATA ACH-000723 True 0.132708 YD-10B low_quality nan head and neck cancer nan NaN YD10B_UPPER_AERODIGESTIVE_TRACT 647V_URINARY_TRACT NaN 0.156492 1.556214 nan nan 874 human sggpx4-2 CRISPR Male YD10B_UPPER_AERODIGESTIVE_TRACT 0.292802 3.272682 0.016459 0.330120 72, 96 cell_line 2105 1 4341 20693.0 0.695887 16.242208 NaN
CATCGGGGTTCATGGT ACH-000219 True 0.087860 A-375 normal nan skin cancer nan NaN A375_SKIN DAOY_CENTRAL_NERVOUS_SYSTEM 2.496623e-07 0.007701 0.088255 nan nan 524 human sglacz CRISPR Female A375_SKIN 0.671925 13.649916 0.464200 11.962996 72, 96 cell_line 1035 1 2919 13771.0 2.730375 40.592550 NaN
AAATGCCTCGTGGACC-1 ACH-000762 True 0.075085 YD-38 normal nan head and neck cancer nan NaN YD38_UPPER_AERODIGESTIVE_TRACT IGR1_SKIN 2.366136e-02 0.032628 0.158193 nan nan 407 human sggpx4-2 CRISPR Male YD38_UPPER_AERODIGESTIVE_TRACT 0.571537 7.417734 0.331906 5.552359 72, 96 cell_line 829 1 2456 10996.0 2.528192 39.841761 NaN
list(crispr_metadata["perturbation"].unique())
['sggpx4-2', 'sglacz', 'sggpx4-1', 'sgor2j2']
What are the associated targets?

The targets listed below are the direct targets of the perturbations. Although perturbing them may also affect pathways, we curate only the direct targets for simplicity.

  1. sgGPX4-1: Gene/Protein - GPX4 (Glutathione Peroxidase 4)

  2. sgGPX4-2: Gene/Protein - GPX4 (Glutathione Peroxidase 4)

  3. sgLACZ: Gene/Protein - LACZ (β-galactosidase)

  4. sgOR2J2: Gene/Protein - OR2J2 (Olfactory receptor family 2 subfamily J member 2)
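
The mapping from guide names to gene symbols listed above can be derived programmatically. The sketch below assumes a naming convention of "sg" followed by the gene symbol and an optional "-<guide number>" suffix; this convention is inferred from the four values in this dataset, and parse_guide_name is a hypothetical helper, not part of wetlab.

```python
import re

# Assumed convention: "sg" + gene symbol + optional "-<guide number>"
GUIDE_PATTERN = re.compile(r"^sg(?P<symbol>[A-Za-z0-9]+)(?:-(?P<guide>\d+))?$")


def parse_guide_name(guide_name):
    """Split a guide RNA name into (upper-cased gene symbol, guide number or None)."""
    match = GUIDE_PATTERN.match(guide_name)
    if match is None:
        raise ValueError(f"unexpected guide name: {guide_name!r}")
    guide = match.group("guide")
    return match.group("symbol").upper(), int(guide) if guide else None


# The unique values of the perturbation column for CRISPR-perturbed cells
guide_names = ["sggpx4-2", "sglacz", "sggpx4-1", "sgor2j2"]
parsed = {name: parse_guide_name(name) for name in guide_names}
```

Upper-casing the symbol makes it easier to compare against gene symbols, since the obs column stores the guide names in lower case.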

Since the perturbation metadata contains the guide RNA names, we model the genetic perturbations using both wetlab.GeneticTreatment and wetlab.TreatmentTarget.

treatments = [
    ("sgGPX4-1", "GPX4", "Glutathione Peroxidase 4"),
    ("sgGPX4-2", "GPX4", "Glutathione Peroxidase 4"),
    ("sgor2j2", "or2j2", "Olfactory receptor family 2 subfamily J member 2"),
    ("sgLACZ", "lacz", "beta-galactosidase control"),  # Control from E. coli
]
organism = bt.Organism.lookup().human

genetic_treatments = []
for name, symbol, target_name in treatments:
    treatment = wl.GeneticTreatment(system="CRISPR Cas9", name=name).save()
    if symbol != "lacz":
        gene_result = bt.Gene.from_source(symbol=symbol, organism=organism)
        gene = gene_result[0] if isinstance(gene_result, list) else gene_result
        gene = gene.save()
    else:
        gene = bt.Gene(symbol=symbol, organism=organism).save()
    target = wl.TreatmentTarget(name=target_name).save()
    target.genes.add(gene)
    treatment.targets.add(target)
    genetic_treatments.append(treatment)
✓ created 1 Gene record from Bionty matching symbol: 'GPX4'
! record with similar name exists! did you mean to load it?
uid name system sequence on_target_score off_target_score run_id created_at created_by_id
id
1 Us3vqdde3Yz5 sgGPX4-1 CRISPR Cas9 None None None 1 2024-11-21 06:57:33.377621+00:00 1
→ returning existing TreatmentTarget record with same name: 'Glutathione Peroxidase 4'
✓ created 1 Gene record from Bionty matching synonyms: 'or2j2'
! ambiguous validation in Bionty for 1 record: 'OR2J2'

Compound perturbations

Although the targets are known for many compounds, we skip annotating them here to keep the guide brief.

What are the compound targets?

  1. AZD5591: Unknown

  2. Afatinib: Proteins - EGFR (Epidermal Growth Factor Receptor), HER2 (Human Epidermal growth factor Receptor 2)

  3. BRD3379: Unknown

  4. Bortezomib: Protein complex - Proteasome (specifically the 26S proteasome subunit)

  5. Dabrafenib: Gene/Protein - BRAF (V600E mutation in the BRAF gene, which codes for a protein kinase)

  6. Everolimus: Protein - mTOR (Mammalian Target of Rapamycin)

  7. Gemcitabine: Pathway/Process - DNA synthesis (inhibition of ribonucleotide reductase and incorporation into DNA)

  8. Idasanutlin: Protein - MDM2 (Mouse Double Minute 2 homolog)

  9. JQ1: Protein - BRD4 (Bromodomain-containing protein 4)

  10. Navitoclax: Proteins - BCL-2, BCL-XL (B-cell lymphoma 2 and B-cell lymphoma-extra large)

  11. Prexasertib: Protein - CHK1 (Checkpoint kinase 1)

  12. Taselisib: Protein/Pathway - PI3K (Phosphoinositide 3-kinase)

  13. Trametinib: Proteins - MEK1/2 (Mitogen-Activated Protein Kinase Kinase 1 and 2)

  14. control: Not applicable
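
Although this guide does not annotate compound targets, the list above can be captured as a plain lookup table as a starting point for doing so. This is only a sketch: the dictionary compound_targets and the helper compounds_targeting are hypothetical, the keys follow the spelling of the compound names in the dataset, and the target labels are shortened from the list above.

```python
# Targets transcribed from the list above; compounds listed as "Unknown"
# or "Not applicable" map to an empty list.
compound_targets = {
    "azd5591": [],
    "afatinib": ["EGFR", "HER2"],
    "brd3379": [],
    "bortezomib": ["Proteasome"],
    "dabrafenib": ["BRAF"],
    "everolimus": ["mTOR"],
    "gemcitabine": ["DNA synthesis"],
    "idasanutlin": ["MDM2"],
    "JQ1": ["BRD4"],
    "navitoclax": ["BCL-2", "BCL-XL"],
    "prexasertib": ["CHK1"],
    "taselisib": ["PI3K"],
    "trametinib": ["MEK1", "MEK2"],
    "control": [],
}


def compounds_targeting(target_name):
    """Return the compounds whose listed targets include target_name, sorted by name."""
    return sorted(
        name for name, targets in compound_targets.items() if target_name in targets
    )


# Compounds that have at least one listed target
with_known_targets = sorted(
    name for name, targets in compound_targets.items() if targets
)
egfr_compounds = compounds_targeting("EGFR")
```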

# We use the ChEBI chemistry ontology as the public source for compound records
chebi_source = bt.Source.filter(entity="Drug", name="chebi").one()
wl.Compound.add_source(chebi_source)
compounds = wl.Compound.public()
→ due to lack of write access, LaminDB won't manage storage location: s3://bionty-assets/
• path in storage 's3://bionty-assets' with key 'df_all__chebi__2024-07-27__Drug.parquet'
→ source added!
drug_metadata.head(3)
depmap_id cancer cell_det_rate cell_line cell_quality channel disease dose_unit dose_value doublet_CL1 doublet_CL2 doublet_GMM_prob doublet_dev_imp doublet_z_margin hash_assignment hash_tag num_SNPs organism perturbation perturbation_type sex singlet_ID singlet_dev singlet_dev_z singlet_margin singlet_z_margin time tissue_type tot_reads nperts ngenes ncounts percent_mito percent_ribo chembl-ID
AACTGGTGTCTCTCTG ACH-000390 True 0.093159 LUDLU-1 normal nan lung cancer µM 0.1 LUDLU1_LUNG TE14_OESOPHAGUS 2.269468e-10 0.009426 0.403316 nan nan 481 human trametinib drug Male LUDLU1_LUNG 0.655877 14.860933 0.462273 12.351139 24 cell_line 787 1 3045 12895.0 3.202792 24.955409 CHEMBL2103875
ATAGGCTCAGATTTCG ACH-000444 True 0.145728 LU99 normal 2 lung cancer µM 0.5 LU99_LUNG MCAS_OVARY 8.562908e-04 0.010173 0.188284 nan nan 1003 human afatinib drug Male LU99_LUNG 0.762847 10.648094 0.474590 8.164565 24 cell_line 1597 1 4763 23161.0 7.473771 18.051898 CHEMBL1173655
GCCAAATCAAGCCGTC ACH-000396 True 0.117330 J82 normal nan urinary bladder carcinoma µM 0.1 J82_URINARY_TRACT IGR1_SKIN 6.490367e-08 0.009686 1.185862 nan nan 647 human dabrafenib drug Male J82_URINARY_TRACT 0.651059 14.740111 0.404508 11.188513 24 cell_line 1159 1 3834 18062.0 2.762706 22.085040 CHEMBL2028663
compounds = wl.Compound.from_values(drug_metadata["perturbation"], field="name")
✓ created 8 Compound records from Bionty matching name: 'trametinib', 'afatinib', 'dabrafenib', 'gemcitabine', 'navitoclax', 'bortezomib', 'JQ1', 'everolimus'
! did not create Compound records for 6 non-validated names: 'azd5591', 'brd3379', 'control', 'idasanutlin', 'prexasertib', 'taselisib'
# The remaining compounds are not in ChEBI, so we create records for them manually
for missing in [
    "azd5591",
    "brd3379",
    "control",
    "idasanutlin",
    "prexasertib",
    "taselisib",
]:
    compounds.append(wl.Compound(name=missing))
ln.save(compounds)
unique_treatments = drug_metadata[
    ["perturbation", "dose_unit", "dose_value"]
].drop_duplicates()

compound_treatments = []
for _, row in unique_treatments.iterrows():
    compound = wl.Compound.get(name=row["perturbation"])
    treatment = wl.CompoundTreatment(
        name=compound.name,
        concentration=row["dose_value"],
        concentration_unit=row["dose_unit"],
    )
    compound_treatments.append(treatment)

ln.save(compound_treatments)
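
The loop above creates one treatment per unique combination of compound, dose unit, and dose value, so the same compound at two doses yields two treatments. A minimal pure-Python sketch of that deduplication step follows; the rows are toy data loosely based on the first rows of the table shown earlier, with one row repeated on purpose.

```python
# Toy rows: one dict per cell, with the three columns that define a treatment
rows = [
    {"perturbation": "trametinib", "dose_unit": "µM", "dose_value": 0.1},
    {"perturbation": "afatinib", "dose_unit": "µM", "dose_value": 0.5},
    {"perturbation": "trametinib", "dose_unit": "µM", "dose_value": 0.1},
    {"perturbation": "dabrafenib", "dose_unit": "µM", "dose_value": 0.1},
]

# dict.fromkeys keeps the first occurrence of each key and preserves order
unique_treatment_keys = list(
    dict.fromkeys(
        (row["perturbation"], row["dose_unit"], row["dose_value"]) for row in rows
    )
)
```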

Register curated artifact

artifact = curate.save_artifact(description="McFarland AnnData")
→ validating metadata using registries of instance laminlabs/lamindata
✓ 'var_index' is validated against Gene.ensembl_gene_id
✓ 'depmap_id' is validated against CellLine.ontology_id
✓ 'cell_line' is validated against CellLine.name
✓ 'disease' is validated against Disease.name
✓ 'organism' is validated against Organism.name
✓ 'perturbation_type' is validated against ULabel.name
✓ 'sex' is validated against Phenotype.name
✓ 'time' is validated against ULabel.name
✓ 'tissue_type' is validated against ULabel.name
!    26 unique terms (74.30%) are not validated for name: 'cancer', 'cell_det_rate', 'cell_quality', 'channel', 'dose_unit', 'dose_value', 'doublet_CL1', 'doublet_CL2', 'doublet_GMM_prob', 'doublet_dev_imp', ...
artifact.genetic_treatments.set(genetic_treatments)
artifact.compound_treatments.set(compound_treatments)
artifact.describe()
Artifact(uid='GfepgcpSQKrWaqoN0000', is_latest=True, description='McFarland AnnData', suffix='.h5ad', type='dataset', size=2511528, hash='-NmtOcllwkwqByHg6iGdBA', n_observations=1000, _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, created_at=2024-11-21 06:57:41 UTC)
  Provenance
    .storage = '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-perturbation'
    .transform = 'Perturbation'
    .run = 2024-11-21 06:57:13 UTC
    .created_by = 'testuser1'
  Labels
    .organisms = 'human'
    .diseases = 'liver cancer', 'uterine corpus cancer', 'colorectal cancer', 'malignant pancreatic neoplasm', 'brain cancer', 'urinary bladder carcinoma', 'sarcoma', 'ovarian cancer', 'breast cancer', 'skin cancer', ...
    .cell_lines = 'CCF-STTG1', 'ONS-76', 'UM-UC-1', 'YD-38', 'WM-266-4', 'FTC-238', 'MES-SA', 'Daoy', 'PANC-1', 'COLO 668', ...
    .phenotypes = 'Unknown', 'Female', 'Male'
    .genetic_treatments = 'sgGPX4-1', 'sgGPX4-2', 'sgor2j2', 'sgLACZ'
    .compound_treatments = 'trametinib', 'afatinib', 'dabrafenib', 'gemcitabine', 'navitoclax', 'bortezomib', 'brd3379', 'JQ1', 'azd5591', 'control', ...
    .ulabels = 'CRISPR', 'drug', '72, 96', '6', '24', '3, 6, 12, 24, 48', 'cell_line'
  Features
    'cell_line' = '22Rv1', '253J-BV', '42-MG-BA', '639-V', '647-V', '769-P', '786-O', '8505C', 'A-375', 'A2780', ...
    'depmap_id' = 'NIH:OVCAR-3', 'Hs 294T', 'NCI-H1581', 'T24', 'NCI-H1693', 'PA-TU-8988S', 'PA-TU-8988T', '253J-BV', 'NCI-H1650', 'S-117', ...
    'disease' = 'bile duct cancer', 'bone cancer', 'brain cancer', 'breast cancer', 'colorectal cancer', 'esophageal cancer', 'gastric cancer', 'head and neck cancer', 'kidney cancer', 'liver cancer', ...
    'organism' = 'human'
    'perturbation_type' = 'CRISPR', 'drug'
    'sex' = 'Female', 'Male', 'Unknown'
    'time' = '24', '3, 6, 12, 24, 48', '6', '72, 96'
    'tissue_type' = 'cell_line'
  Feature sets
    'var' = 'MAGED2', 'CPE', 'DDX43', 'SPANXA2', 'LNCPRESS1', 'TMEM175', 'None', 'MMD', 'LINC01834', 'DYNC2I1', 'CCRL2', 'MRPS6', 'KCNH1', 'TBILA', 'SLC5A7', 'GSX2', 'ATP5MG', 'TMEM190', 'GTF2H1'
    'obs' = 'tissue_type', 'perturbation_type', 'time', 'disease', 'sex', 'perturbation', 'organism', 'cell_line', 'depmap_id'
# clean up test instance
!rm -r test-perturbation
!lamin delete --force test-perturbation
• deleting instance testuser1/test-perturbation