PertCurator

Here we use PertCurator to curate perturbation-related columns in an AnnData object from McFarland et al. 2020.

# pip install 'lamindb[jupyter,wetlab]' cellxgene-lamin
!lamin init --storage ./test-pert-curator --schema bionty,wetlab,ourprojects
 connected lamindb: anonymous/test-pert-curator
import lamindb as ln
import wetlab as wl
import bionty as bt
import ourprojects as ops
import pandas as pd
import scanpy as sc

ln.track("HIRTYxL3aZc70000")
 connected lamindb: anonymous/test-pert-curator
 created Transform('HIRTYxL3'), started new Run('hIs7cb7a') at 2024-12-20 14:57:33 UTC
 notebook imports: bionty==0.53.2 lamindb==0.77.3 ourprojects==0.1.0 pandas==2.2.3 scanpy==1.10.4 wetlab==0.39.1
adata = ln.Artifact.using("laminlabs/lamindata").get(uid="Xk7Qaik9vBLV4PKf0001").load()
adata.obs.head()
 completing transfer to track Artifact('Xk7Qaik9') as input
 mapped records: 
 transferred records: Artifact(uid='Xk7Qaik9vBLV4PKf0001'), Storage(uid='D9BilDV2')
depmap_id cancer cell_det_rate cell_line cell_quality channel disease dose_unit dose_value doublet_CL1 doublet_CL2 doublet_GMM_prob doublet_dev_imp doublet_z_margin hash_assignment hash_tag num_SNPs organism perturbation perturbation_type sex singlet_ID singlet_dev singlet_dev_z singlet_margin singlet_z_margin time tissue_type tot_reads nperts ngenes ncounts percent_mito percent_ribo chembl-ID
AACTGGTGTCTCTCTG ACH-000390 True 0.093159 LUDLU-1 normal nan lung cancer µM 0.1 LUDLU1_LUNG TE14_OESOPHAGUS 2.269468e-10 0.009426 0.403316 nan nan 481 human trametinib drug Male LUDLU1_LUNG 0.655877 14.860933 0.462273 12.351139 24 cell_line 787 1 3045 12895.0 3.202792 24.955409 CHEMBL2103875
ATAGGCTCAGATTTCG ACH-000444 True 0.145728 LU99 normal 2 lung cancer µM 0.5 LU99_LUNG MCAS_OVARY 8.562908e-04 0.010173 0.188284 nan nan 1003 human afatinib drug Male LU99_LUNG 0.762847 10.648094 0.474590 8.164565 24 cell_line 1597 1 4763 23161.0 7.473771 18.051898 CHEMBL1173655
GCCAAATCAAGCCGTC ACH-000396 True 0.117330 J82 normal nan urinary bladder carcinoma µM 0.1 J82_URINARY_TRACT IGR1_SKIN 6.490367e-08 0.009686 1.185862 nan nan 647 human dabrafenib drug Male J82_URINARY_TRACT 0.651059 14.740111 0.404508 11.188513 24 cell_line 1159 1 3834 18062.0 2.762706 22.085040 CHEMBL2028663
CGGAGAAGTCGCGTCA ACH-000997 True 0.005422 HCT-15 low_quality 7 colorectal cancer µM 0.1 HCT15_LARGE_INTESTINE NCIH322_LUNG NaN 0.029753 0.000794 nan nan 30 human gemcitabine drug Male HCT15_LARGE_INTESTINE 0.970247 2.852338 0.168971 0.833455 24 cell_line 76 1 178 726.0 70.247934 5.785124 CHEMBL888
TAGTTGGAGATCGATA ACH-000723 True 0.132708 YD-10B low_quality nan head and neck cancer nan NaN YD10B_UPPER_AERODIGESTIVE_TRACT 647V_URINARY_TRACT NaN 0.156492 1.556214 nan nan 874 human sggpx4-2 CRISPR Male YD10B_UPPER_AERODIGESTIVE_TRACT 0.292802 3.272682 0.016459 0.330120 72, 96 cell_line 2105 1 4341 20693.0 0.695887 16.242208 NaN
# Calculate an embedding because CELLxGENE requires one
sc.tl.pca(adata)

Curate and register perturbations

Required columns:

  • Either “pert_target” or “pert_name” and “pert_type” (“pert_type” allows: “genetic”, “drug”, “biologic”, “physical”)

  • If pert_dose = True (default), requires “pert_dose” in the form number+unit, e.g. 10.0nM

  • If pert_time = True (default), requires “pert_time” in the form number+unit, e.g. 10.0h
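
As a rough illustration of the number+unit form described above, the following sketch checks strings against a regular expression. The pattern and the helper name `is_number_unit` are assumptions made for this example; the curator's own validation may accept or reject different inputs.

```python
import re

# Hypothetical pattern: a non-negative decimal number directly followed by a unit.
# It mirrors the "number+unit" description only; it is not the curator's own check.
NUMBER_UNIT = re.compile(r"\d+(\.\d+)?[A-Za-zµμ]+")


def is_number_unit(value: str) -> bool:
    """Return True if `value` looks like '10.0nM' or '24h'."""
    return NUMBER_UNIT.fullmatch(value) is not None


examples = ["10.0nM", "0.1µM", "24h", "10.0", "nM", "10 nM"]
results = {value: is_number_unit(value) for value in examples}
print(results)
```

In this sketch, a bare number, a bare unit, or a value with a space between number and unit is rejected.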

# rename the columns to match the expected format
adata.obs["pert_time"] = adata.obs["time"].apply(
    lambda x: str(x).split(", ")[-1] + "h" if pd.notna(x) else x
)  # we only take the last timepoint
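# combine dose value and unit; this reuses the unit of the first row (µM) for every cell,
# which assumes a single dose unit across the dataset
adata.obs["pert_dose"] = adata.obs["dose_value"].map(
    lambda x: f"{x}{adata.obs['dose_unit'].iloc[0]}" if pd.notna(x) else None
)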
adata.obs.rename(
    columns={"perturbation": "pert_name", "perturbation_type": "pert_type"},
    inplace=True,
)
# fix the perturbation type as suggested by the curator
adata.obs["pert_type"] = adata.obs["pert_type"].cat.rename_categories(
    {"CRISPR": "genetic", "drug": "compound"}
)
curator = wl.PertCurator(adata)
 mapped 'pert_name' to 'pert_compound'
 mapped 'pert_name' to 'pert_genetic'
 added default value 'unknown' to the adata.obs['assay']
 added default value 'unknown' to the adata.obs['cell_type']
 added default value 'unknown' to the adata.obs['development_stage']
 added default value 'unknown' to the adata.obs['donor_id']
 added default value 'unknown' to the adata.obs['self_reported_ethnicity']
 added default value 'cell' to the adata.obs['suspension_type']
 added default value 'unknown' to the adata.obs['pert_target']
 added 13 records with Feature.name for "columns": 'assay', 'cell_type', 'development_stage', 'disease', 'donor_id', 'self_reported_ethnicity', 'sex', 'suspension_type', 'tissue_type', 'organism', 'cell_line', 'pert_genetic', 'pert_compound'
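
The column preparation before creating the curator can be illustrated on plain Python values. This is a minimal sketch, not the notebook's pandas code: the helpers `is_missing`, `to_pert_time`, and `to_pert_dose` are hypothetical, and `to_pert_dose` pairs each dose with the unit from its own row, whereas the notebook code reuses the first row's unit.

```python
import math


def is_missing(value) -> bool:
    """Treat None and float NaN as missing (similar in spirit to pandas.isna)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def to_pert_time(time_value):
    """Keep only the last listed timepoint and append the hour unit."""
    if is_missing(time_value):
        return time_value
    return str(time_value).split(", ")[-1] + "h"


def to_pert_dose(dose_value, dose_unit):
    """Join dose value and unit, or return None if either is missing."""
    if is_missing(dose_value) or is_missing(dose_unit):
        return None
    return f"{dose_value}{dose_unit}"


# Two (time, dose_value, dose_unit) rows taken from the obs preview above.
drug_row = (24, 0.1, "µM")
crispr_row = ("72, 96", float("nan"), float("nan"))

drug_time, drug_dose = to_pert_time(drug_row[0]), to_pert_dose(*drug_row[1:])
crispr_time, crispr_dose = to_pert_time(crispr_row[0]), to_pert_dose(*crispr_row[1:])
print(drug_time, drug_dose, crispr_time, crispr_dose)
```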
curator.validate()
 saving validated records of 'var_index'
 added 1869 records from public with Gene.ensembl_gene_id for "var_index": 'ENSG00000102316', 'ENSG00000109472', 'ENSG00000080007', 'ENSG00000203926', 'ENSG00000232301', 'ENSG00000127419', 'ENSG00000266869', 'ENSG00000108960', 'ENSG00000261316', 'ENSG00000126870', 'ENSG00000121797', 'ENSG00000243927', 'ENSG00000143473', 'ENSG00000261488', 'ENSG00000115665', 'ENSG00000180613', 'ENSG00000167283', 'ENSG00000160472', 'ENSG00000273387', 'ENSG00000110768', ...
 saving validated records of 'disease'
 added 21 records from public with Disease.name for "disease": 'thyroid cancer', 'rhabdoid tumor', 'colorectal cancer', 'ovarian cancer', 'prostate cancer', 'neuroblastoma', 'gastric cancer', 'head and neck cancer', 'uterine corpus cancer', 'liver cancer', 'breast cancer', 'esophageal cancer', 'lung cancer', 'bone cancer', 'sarcoma', 'kidney cancer', 'urinary bladder carcinoma', 'brain cancer', 'malignant pancreatic neoplasm', 'bile duct cancer', ...
 saving validated records of 'sex'
 added 2 records from public with Phenotype.name for "sex": 'male', 'female'
 saving validated records of 'cell_line'
 added 183 records from public with CellLine.name for "cell_line": 'LS1034', 'SW48', 'HCC1143', 'SNU-C2A', 'HT-29', 'NCI-H1048', 'KP-4', 'SK-UT-1', 'TEN', 'TE-8', 'DMS 273', 'MSTO-211H', 'TOV-112D', 'NIH:OVCAR-3', 'NUGC-3', 'RH-30', 'RCC10RGB', 'NCI-H1793', 'EFM-192A', 'HCC-1195', ...
 saving validated records of 'pert_compound'
 added 8 records from public with Compound.name for "pert_compound": 'navitoclax', 'bortezomib', 'JQ1', 'trametinib', 'everolimus', 'dabrafenib', 'gemcitabine', 'afatinib'
 mapping "var_index" on Gene.ensembl_gene_id
!   2 terms are not validated: 'ENSG00000255823', 'ENSG00000272370'
    → fix typos, remove non-existent values, or save terms via .add_new_from_var_index()
 mapping "assay" on ExperimentalFactor.name
!   1 term is not validated: 'unknown'
    → fix typos, remove non-existent values, or save terms via .add_new_from("assay")
 mapping "cell_type" on CellType.name
!   1 term is not validated: 'unknown'
    → fix typos, remove non-existent values, or save terms via .add_new_from("cell_type")
 mapping "development_stage" on DevelopmentalStage.name
!   1 term is not validated: 'unknown'
    → fix typos, remove non-existent values, or save terms via .add_new_from("development_stage")
 mapping "disease" on Disease.name
!   1 term is not validated: 'pancreatic cancer'
    1 synonym found: "pancreatic cancer" → "malignant pancreatic neoplasm"
    → curate synonyms via .standardize("disease")
 mapping "donor_id" on ULabel.name
!   1 term is not validated: 'unknown'
    → fix typos, remove non-existent values, or save terms via .add_new_from("donor_id")
 mapping "self_reported_ethnicity" on Ethnicity.name
!   1 term is not validated: 'unknown'
    → fix typos, remove non-existent values, or save terms via .add_new_from("self_reported_ethnicity")
 mapping "sex" on Phenotype.name
!   3 terms are not validated: 'Male', 'Female', 'Unknown'
    2 synonyms found: "Male" → "male", "Female" → "female"
    → curate synonyms via .standardize("sex")    for remaining terms:
    → fix typos, remove non-existent values, or save terms via .add_new_from("sex")
 mapping "suspension_type" on ULabel.name
!   1 term is not validated: 'cell'
    → fix typos, remove non-existent values, or save terms via .add_new_from("suspension_type")
 mapping "tissue_type" on ULabel.name
!   1 term is not validated: 'cell_line'
    → fix typos, remove non-existent values, or save terms via .add_new_from("tissue_type")
 "organism" is validated against Organism.name
 "cell_line" is validated against CellLine.name
 mapping "pert_genetic" on GeneticPerturbation.name
!   4 terms are not validated: 'sggpx4-2', 'sglacz', 'sggpx4-1', 'sgor2j2'
    → fix typos, remove non-existent values, or save terms via .add_new_from("pert_genetic")
 mapping "pert_compound" on Compound.name
!   6 terms are not validated: 'brd3379', 'azd5591', 'control', 'prexasertib', 'taselisib', 'idasanutlin'
    → fix typos, remove non-existent values, or save terms via .add_new_from("pert_compound")
False
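
The validation log distinguishes three outcomes per term: validated, resolvable through a synonym (fix with .standardize), or unknown (fix manually or register via .add_new_from). The sketch below reproduces that triage on a toy vocabulary. The `VOCABULARY` dictionary and `classify_terms` function are assumptions for illustration and do not reflect how lamindb implements validation.

```python
# Hypothetical toy vocabulary: canonical name -> known synonyms.
VOCABULARY = {
    "male": {"Male"},
    "female": {"Female"},
    "malignant pancreatic neoplasm": {"pancreatic cancer"},
}


def classify_terms(terms, vocabulary):
    """Sort terms into validated names, resolvable synonyms, and unknown terms."""
    synonym_to_name = {
        synonym: name
        for name, synonyms in vocabulary.items()
        for synonym in synonyms
    }
    validated, synonyms_found, unknown = [], {}, []
    for term in terms:
        if term in vocabulary:
            validated.append(term)
        elif term in synonym_to_name:
            synonyms_found[term] = synonym_to_name[term]
        else:
            unknown.append(term)
    return validated, synonyms_found, unknown


validated, synonyms_found, unknown = classify_terms(
    ["Male", "Female", "Unknown", "male"], VOCABULARY
)
print(validated, synonyms_found, unknown)
```

With this input, "Male" and "Female" resolve to their canonical names, while "Unknown" has no match and would need to be renamed or registered as a new term.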

Genetic perturbations

# register genetic perturbations with their target genes
pert_target_map = {
    "sggpx4-1": "GPX4",
    "sggpx4-2": "GPX4",
    "sgor2j2": "OR2J2",  # cutting control
}

for sg_name, gene_symbol in pert_target_map.items():
    pert = wl.GeneticPerturbation(
        system="CRISPR-Cas9",
        name=sg_name,
        description="cutting control" if sg_name == "sgor2j2" else None,
    ).save()
    target = wl.PerturbationTarget(name=gene_symbol).save()
    pert.targets.add(target)
    gene = bt.Gene.from_source(symbol=gene_symbol, organism="human").save()
    target.genes.set([gene] if isinstance(gene, bt.Gene) else gene)

adata.obs["pert_target"] = adata.obs["pert_genetic"].map(pert_target_map)

# register the negative control without targets: Non-cutting control
wl.GeneticPerturbation(
    name="sglacz", system="CRISPR-Cas9", description="non-cutting control"
).save();
 created 1 Gene record from Bionty matching symbol: 'GPX4'
! record with similar name exists! did you mean to load it?
uid name system description sequence on_target_score off_target_score run_id created_at created_by_id
id
1 MDGYBKBXcdcu sggpx4-1 CRISPR-Cas9 None None None None 1 2024-12-20 14:58:45.707716+00:00 1
 returning existing PerturbationTarget record with same name: 'GPX4'
 created 1 Gene record from Bionty matching symbol: 'OR2J2'
! ambiguous validation in Bionty for 1 record: 'OR2J2'

Compounds

# the remaining compounds are not in ChEBI, so we create records for them
curator.add_new_from("pert_compound")
 added 6 records with Compound.name for "pert_compound": 'taselisib', 'brd3379', 'control', 'azd5591', 'prexasertib', 'idasanutlin'

Curate non-pert metadata

# manually fix sex and set assay
adata.obs["sex"] = adata.obs["sex"].cat.rename_categories({"Unknown": "unknown"})
adata.obs["assay"] = "10x 3' v3"

# subset the adata to only include the validated genes
adata = adata[:, ~adata.var_names.isin(curator.non_validated["var_index"])].copy()

# standardize disease and sex as suggested
curator.standardize("disease")
curator.standardize("sex")
 standardized 1 synonym in "disease": "pancreatic cancer" → "malignant pancreatic neoplasm"
 standardized 2 synonyms in "sex": "Male" → "male", "Female" → "female"
# Recreate Curator object because we are using a new adata
curator = wl.PertCurator(adata)
curator.validate()
 mapped 'pert_name' to 'pert_compound'
 mapped 'pert_name' to 'pert_genetic'
 added 1 record with Feature.name for "columns": 'pert_target'
 saving validated records of 'assay'
 added 1 record from public with ExperimentalFactor.name for "assay": '10x 3' v3'
 "var_index" is validated against Gene.ensembl_gene_id
 "assay" is validated against ExperimentalFactor.name
 mapping "cell_type" on CellType.name
!   1 term is not validated: 'unknown'
    → fix typos, remove non-existent values, or save terms via .add_new_from("cell_type")
 mapping "development_stage" on DevelopmentalStage.name
!   1 term is not validated: 'unknown'
    → fix typos, remove non-existent values, or save terms via .add_new_from("development_stage")
 mapping "disease" on Disease.name
!   1 term is not validated: 'pancreatic cancer'
    1 synonym found: "pancreatic cancer" → "malignant pancreatic neoplasm"
    → curate synonyms via .standardize("disease")
 mapping "donor_id" on ULabel.name
!   1 term is not validated: 'unknown'
    → fix typos, remove non-existent values, or save terms via .add_new_from("donor_id")
 mapping "self_reported_ethnicity" on Ethnicity.name
!   1 term is not validated: 'unknown'
    → fix typos, remove non-existent values, or save terms via .add_new_from("self_reported_ethnicity")
 mapping "sex" on Phenotype.name
!   3 terms are not validated: 'Male', 'Female', 'unknown'
    2 synonyms found: "Male" → "male", "Female" → "female"
    → curate synonyms via .standardize("sex")    for remaining terms:
    → fix typos, remove non-existent values, or save terms via .add_new_from("sex")
 mapping "suspension_type" on ULabel.name
!   1 term is not validated: 'cell'
    → fix typos, remove non-existent values, or save terms via .add_new_from("suspension_type")
 mapping "tissue_type" on ULabel.name
!   1 term is not validated: 'cell_line'
    → fix typos, remove non-existent values, or save terms via .add_new_from("tissue_type")
 "organism" is validated against Organism.name
 "cell_line" is validated against CellLine.name
 "pert_target" is validated against PerturbationTarget.name
 "pert_genetic" is validated against GeneticPerturbation.name
 "pert_compound" is validated against Compound.name
False
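
This second validation still lists 'Male', 'Female', and 'pancreatic cancer' as not validated, although .standardize() was called earlier. One likely explanation, assuming the curator keeps a reference to the AnnData object it was constructed with, is that the standardization was applied to the original object after `adata` had already been rebound to the subsetted copy. The sketch below shows this rebinding effect with a hypothetical `ToyCurator` class and a plain dictionary in place of AnnData.

```python
class ToyCurator:
    """Hypothetical stand-in that keeps a reference to the data it was given."""

    def __init__(self, data: dict):
        self.data = data

    def standardize(self, key: str, mapping: dict) -> None:
        # Replace values in the referenced object in place.
        self.data[key] = [mapping.get(value, value) for value in self.data[key]]


data = {"sex": ["Male", "Female", "unknown"]}
curator = ToyCurator(data)

# Rebind the name to a copy, comparable to `adata = adata[...].copy()`.
data = {key: list(values) for key, values in data.items()}

# The curator still points at the original object, not at the copy.
curator.standardize("sex", {"Male": "male", "Female": "female"})

print(curator.data["sex"])
print(data["sex"])
```

Under that assumption, standardizing before subsetting, or calling .standardize() on the re-created curator, would carry the standardized values into the saved dataset.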
curator.add_new_from("all")
 added 1 record with CellType.name for "cell_type": 'unknown'
 added 1 record with DevelopmentalStage.name for "development_stage": 'unknown'
 added 1 record with Disease.name for "disease": 'pancreatic cancer'
 added 1 record with ULabel.name for "donor_id": 'unknown'
 added 1 record with Ethnicity.name for "self_reported_ethnicity": 'unknown'
 added 3 records with Phenotype.name for "sex": 'Male', 'Female', 'unknown'
 added 1 record with ULabel.name for "suspension_type": 'cell'
 added 1 record with ULabel.name for "tissue_type": 'cell_line'
curator.validate()
 "var_index" is validated against Gene.ensembl_gene_id
 "assay" is validated against ExperimentalFactor.name
 "cell_type" is validated against CellType.name
 "development_stage" is validated against DevelopmentalStage.name
 "disease" is validated against Disease.name
 "donor_id" is validated against ULabel.name
 "self_reported_ethnicity" is validated against Ethnicity.name
 "sex" is validated against Phenotype.name
 "suspension_type" is validated against ULabel.name
 "tissue_type" is validated against ULabel.name
 "organism" is validated against Organism.name
 "cell_line" is validated against CellLine.name
 "pert_target" is validated against PerturbationTarget.name
 "pert_genetic" is validated against GeneticPerturbation.name
 "pert_compound" is validated against Compound.name
True

References

reference = ops.Reference(
    name="Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action",
    abbr="McFarland 2020",
    url="https://www.nature.com/articles/s41467-020-17440-w",
    doi="10.1038/s41467-020-17440-w",
    text=(
        "Assays to study cancer cell responses to pharmacologic or genetic perturbations are typically "
        "restricted to using simple phenotypic readouts such as proliferation rate. Information-rich assays, "
        "such as gene-expression profiling, have generally not permitted efficient profiling of a given "
        "perturbation across multiple cellular contexts. Here, we develop MIX-Seq, a method for multiplexed "
        "transcriptional profiling of post-perturbation responses across a mixture of samples with single-cell "
        "resolution, using SNP-based computational demultiplexing of single-cell RNA-sequencing data. We show "
        "that MIX-Seq can be used to profile responses to chemical or genetic perturbations across pools of 100 "
        "or more cancer cell lines. We combine it with Cell Hashing to further multiplex additional experimental "
        "conditions, such as post-treatment time points or drug doses. Analyzing the high-content readout of "
        "scRNA-seq reveals both shared and context-specific transcriptional response components that can identify "
        "drug mechanism of action and enable prediction of long-term cell viability from short-term transcriptional "
        "responses to treatment."
    ),
).save()

Register curated artifact

artifact = curator.save_artifact(description="McFarland AnnData")
!    30 unique terms (65.20%) are not validated for name: 'depmap_id', 'cancer', 'cell_det_rate', 'cell_quality', 'channel', 'dose_unit', 'dose_value', 'doublet_CL1', 'doublet_CL2', 'doublet_GMM_prob', ...
!    did not create Feature records for 30 non-validated names: 'cancer', 'cell_det_rate', 'cell_quality', 'channel', 'chembl-ID', 'depmap_id', 'dose_unit', 'dose_value', 'doublet_CL1', 'doublet_CL2', 'doublet_GMM_prob', 'doublet_dev_imp', 'doublet_z_margin', 'hash_assignment', 'hash_tag', 'ncounts', 'ngenes', 'nperts', 'num_SNPs', 'percent_mito', ...
# link the reference to the artifact
artifact.references.add(reference)
artifact.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = '1Er757qaE84Kv1Dx0000'
│   ├── .size = 3500320
│   ├── .hash = 'AQrOlYzdag6efYv2ZgXsuw'
│   ├── .n_observations = 1000
│   ├── .path = /home/runner/work/wetlab/wetlab/docs/guide/test-pert-curator/.lamindb/1Er757qaE84Kv1Dx0000.h5ad
│   ├── .created_by = anonymous
│   ├── .created_at = 2024-12-20 14:59:31
│   └── .transform = 'PertCurator'
├── Dataset features/.feature_sets
│   ├── var • 1869               [bionty.Gene]
│   │   MAGED2                      float                                                               
│   │   CPE                         float                                                               
│   │   DDX43                       float                                                               
│   │   SPANXA2                     float                                                               
│   │   LNCPRESS1                   float                                                               
│   │   TMEM175                     float                                                               
│   │   MMD                         float                                                               
│   │   LINC01834                   float                                                               
│   │   DYNC2I1                     float                                                               
│   │   CCRL2                       float                                                               
│   │   MRPS6                       float                                                               
│   │   KCNH1                       float                                                               
│   │   TBILA                       float                                                               
│   │   SLC5A7                      float                                                               
│   │   GSX2                        float                                                               
│   │   ATP5MG                      float                                                               
│   │   TMEM190                     float                                                               
│   │   GTF2H1                      float                                                               
│   └── obs • 16                 [Feature]
│       assay                       cat[bionty.ExperimentalF…  10x 3' v3
│       cell_line                   cat[bionty.CellLine]       22Rv1, 253J-BV, 42-MG-BA, 639-V, 647-V, …
│       cell_type                   cat[bionty.CellType]       unknown
│       development_stage           cat[bionty.Developmental…  unknown
│       disease                     cat[bionty.Disease]        bile duct cancer, bone cancer, brain can…
│       donor_id                    cat[ULabel]                unknown
│       organism                    cat[bionty.Organism]       human
│       pert_compound               cat[wetlab.Compound]       JQ1, afatinib, azd5591, bortezomib, brd3…
│       pert_genetic                cat[wetlab.GeneticPertur…  sggpx4-1, sggpx4-2, sglacz, sgor2j2
│       pert_target                 cat[wetlab.PerturbationT…  GPX4, OR2J2
│       self_reported_ethnicity     cat[bionty.Ethnicity]      unknown
│       sex                         cat[bionty.Phenotype]      Female, Male, unknown
│       suspension_type             cat[ULabel]                cell
│       tissue_type                 cat[ULabel]                cell_line
│       pert_dose                   str
│       pert_time                   str
└── Labels
    └── .organisms                  bionty.Organism            human                                    
        .cell_types                 bionty.CellType            unknown                                  
        .diseases                   bionty.Disease             thyroid cancer, rhabdoid tumor, colorect…
        .cell_lines                 bionty.CellLine            LS1034, SW48, HCC1143, SNU-C2A, HT-29, N…
        .phenotypes                 bionty.Phenotype           Male, Female, unknown                    
        .experimental_factors       bionty.ExperimentalFactor  10x 3' v3                                
        .developmental_stages       bionty.DevelopmentalStage  unknown                                  
        .ethnicities                bionty.Ethnicity           unknown                                  
        .references                 ourprojects.Reference      Multiplexed single-cell transcriptional …
        .compounds                  wetlab.Compound            JQ1, afatinib, azd5591, bortezomib, brd3…
        .perturbation_targets       wetlab.PerturbationTarget  GPX4, OR2J2                              
        .genetic_perturbations      wetlab.GeneticPerturbati…  sggpx4-1, sggpx4-2, sglacz, sgor2j2      
        .ulabels                    ULabel                     unknown, cell, cell_line