PertCurator
Here we use PertCurator to curate perturbation-related columns in an AnnData object from McFarland et al. 2020.
# pip install 'lamindb[jupyter,wetlab]' cellxgene-lamin
!lamin init --storage ./test-pert-curator --schema bionty,wetlab,ourprojects
→ connected lamindb: anonymous/test-pert-curator
import lamindb as ln
import wetlab as wl
import bionty as bt
import ourprojects as ops
import pandas as pd
import scanpy as sc
ln.track("HIRTYxL3aZc70000")
→ connected lamindb: anonymous/test-pert-curator
→ created Transform('HIRTYxL3'), started new Run('hIs7cb7a') at 2024-12-20 14:57:33 UTC
→ notebook imports: bionty==0.53.2 lamindb==0.77.3 ourprojects==0.1.0 pandas==2.2.3 scanpy==1.10.4 wetlab==0.39.1
adata = ln.Artifact.using("laminlabs/lamindata").get(uid="Xk7Qaik9vBLV4PKf0001").load()
adata.obs.head()
→ completing transfer to track Artifact('Xk7Qaik9') as input
→ mapped records:
→ transferred records: Artifact(uid='Xk7Qaik9vBLV4PKf0001'), Storage(uid='D9BilDV2')
| | depmap_id | cancer | cell_det_rate | cell_line | cell_quality | channel | disease | dose_unit | dose_value | doublet_CL1 | doublet_CL2 | doublet_GMM_prob | doublet_dev_imp | doublet_z_margin | hash_assignment | hash_tag | num_SNPs | organism | perturbation | perturbation_type | sex | singlet_ID | singlet_dev | singlet_dev_z | singlet_margin | singlet_z_margin | time | tissue_type | tot_reads | nperts | ngenes | ncounts | percent_mito | percent_ribo | chembl-ID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AACTGGTGTCTCTCTG | ACH-000390 | True | 0.093159 | LUDLU-1 | normal | nan | lung cancer | µM | 0.1 | LUDLU1_LUNG | TE14_OESOPHAGUS | 2.269468e-10 | 0.009426 | 0.403316 | nan | nan | 481 | human | trametinib | drug | Male | LUDLU1_LUNG | 0.655877 | 14.860933 | 0.462273 | 12.351139 | 24 | cell_line | 787 | 1 | 3045 | 12895.0 | 3.202792 | 24.955409 | CHEMBL2103875 |
ATAGGCTCAGATTTCG | ACH-000444 | True | 0.145728 | LU99 | normal | 2 | lung cancer | µM | 0.5 | LU99_LUNG | MCAS_OVARY | 8.562908e-04 | 0.010173 | 0.188284 | nan | nan | 1003 | human | afatinib | drug | Male | LU99_LUNG | 0.762847 | 10.648094 | 0.474590 | 8.164565 | 24 | cell_line | 1597 | 1 | 4763 | 23161.0 | 7.473771 | 18.051898 | CHEMBL1173655 |
GCCAAATCAAGCCGTC | ACH-000396 | True | 0.117330 | J82 | normal | nan | urinary bladder carcinoma | µM | 0.1 | J82_URINARY_TRACT | IGR1_SKIN | 6.490367e-08 | 0.009686 | 1.185862 | nan | nan | 647 | human | dabrafenib | drug | Male | J82_URINARY_TRACT | 0.651059 | 14.740111 | 0.404508 | 11.188513 | 24 | cell_line | 1159 | 1 | 3834 | 18062.0 | 2.762706 | 22.085040 | CHEMBL2028663 |
CGGAGAAGTCGCGTCA | ACH-000997 | True | 0.005422 | HCT-15 | low_quality | 7 | colorectal cancer | µM | 0.1 | HCT15_LARGE_INTESTINE | NCIH322_LUNG | NaN | 0.029753 | 0.000794 | nan | nan | 30 | human | gemcitabine | drug | Male | HCT15_LARGE_INTESTINE | 0.970247 | 2.852338 | 0.168971 | 0.833455 | 24 | cell_line | 76 | 1 | 178 | 726.0 | 70.247934 | 5.785124 | CHEMBL888 |
TAGTTGGAGATCGATA | ACH-000723 | True | 0.132708 | YD-10B | low_quality | nan | head and neck cancer | nan | NaN | YD10B_UPPER_AERODIGESTIVE_TRACT | 647V_URINARY_TRACT | NaN | 0.156492 | 1.556214 | nan | nan | 874 | human | sggpx4-2 | CRISPR | Male | YD10B_UPPER_AERODIGESTIVE_TRACT | 0.292802 | 3.272682 | 0.016459 | 0.330120 | 72, 96 | cell_line | 2105 | 1 | 4341 | 20693.0 | 0.695887 | 16.242208 | NaN |
# Calculate an embedding because CELLxGENE requires one
sc.tl.pca(adata)
Curate and register perturbations
Required columns:
Either “pert_target” or “pert_name” and “pert_type” (“pert_type” allows: “genetic”, “drug”, “biologic”, “physical”)
If pert_dose=True (default), a “pert_dose” column of the form number+unit is required, e.g. 10.0nM
If pert_time=True (default), a “pert_time” column of the form number+unit is required, e.g. 10.0h
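The number+unit format for “pert_dose” and “pert_time” can be checked before the columns are handed to the curator. The following is a minimal sketch under stated assumptions: the helper `is_number_unit` and its regular expression are hypothetical and not part of wetlab, whose own parser may accept a different set of values.

```python
import re

# Hypothetical pattern: a non-negative number immediately followed by a unit
# made of letters. [^\W\d_] matches any Unicode letter, so "µM" is accepted.
NUMBER_UNIT = re.compile(r"^\d+(?:\.\d+)?[^\W\d_]+$")


def is_number_unit(value):
    """Return True if value is a string such as '10.0nM' or '24h'."""
    return isinstance(value, str) and NUMBER_UNIT.match(value) is not None


# Values with a number and a unit pass; bare numbers, bare units and None do not.
for candidate in ["10.0nM", "0.1µM", "24h", "24", "nM", None]:
    print(candidate, is_number_unit(candidate))
```

Running such a check on the unique values of a column shows early which entries would not fit the expected format.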
# rename the columns to match the expected format
adata.obs["pert_time"] = adata.obs["time"].apply(
    lambda x: str(x).split(", ")[-1] + "h" if pd.notna(x) else x
)  # we only take the last timepoint
# pair each dose with the unit from its own row instead of reusing the first row's unit
adata.obs["pert_dose"] = [
    f"{value}{unit}" if pd.notna(value) and pd.notna(unit) else None
    for value, unit in zip(adata.obs["dose_value"], adata.obs["dose_unit"])
]
adata.obs.rename(
    columns={"perturbation": "pert_name", "perturbation_type": "pert_type"},
    inplace=True,
)
# fix the perturbation type as suggested by the curator
adata.obs["pert_type"] = adata.obs["pert_type"].cat.rename_categories(
    {"CRISPR": "genetic", "drug": "compound"}
)
curator = wl.PertCurator(adata)
→ mapped 'pert_name' to 'pert_compound'
→ mapped 'pert_name' to 'pert_genetic'
→ added default value 'unknown' to the adata.obs['assay']
→ added default value 'unknown' to the adata.obs['cell_type']
→ added default value 'unknown' to the adata.obs['development_stage']
→ added default value 'unknown' to the adata.obs['donor_id']
→ added default value 'unknown' to the adata.obs['self_reported_ethnicity']
→ added default value 'cell' to the adata.obs['suspension_type']
→ added default value 'unknown' to the adata.obs['pert_target']
✓ added 13 records with Feature.name for "columns": 'assay', 'cell_type', 'development_stage', 'disease', 'donor_id', 'self_reported_ethnicity', 'sex', 'suspension_type', 'tissue_type', 'organism', 'cell_line', 'pert_genetic', 'pert_compound'
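The time and dose preparation in the cell above can be illustrated on plain Python values. This is a sketch, not the notebook's code: the helpers `is_missing`, `format_time` and `format_dose` are hypothetical names, and missing values are returned as `None` here for simplicity.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def format_time(value, unit="h"):
    """Keep only the last comma-separated timepoint and append the unit."""
    if is_missing(value):
        return None
    return str(value).split(", ")[-1] + unit


def format_dose(value, unit):
    """Join a dose value with its unit; missing parts give None."""
    if is_missing(value) or is_missing(unit):
        return None
    return f"{value}{unit}"


rows = [
    {"time": "24", "dose_value": 0.1, "dose_unit": "µM"},
    {"time": "72, 96", "dose_value": float("nan"), "dose_unit": None},
]
for row in rows:
    print(format_time(row["time"]), format_dose(row["dose_value"], row["dose_unit"]))
# 24h 0.1µM
# 96h None
```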
curator.validate()
• saving validated records of 'var_index'
✓ added 1869 records from public with Gene.ensembl_gene_id for "var_index": 'ENSG00000102316', 'ENSG00000109472', 'ENSG00000080007', 'ENSG00000203926', 'ENSG00000232301', 'ENSG00000127419', 'ENSG00000266869', 'ENSG00000108960', 'ENSG00000261316', 'ENSG00000126870', 'ENSG00000121797', 'ENSG00000243927', 'ENSG00000143473', 'ENSG00000261488', 'ENSG00000115665', 'ENSG00000180613', 'ENSG00000167283', 'ENSG00000160472', 'ENSG00000273387', 'ENSG00000110768', ...
• saving validated records of 'disease'
✓ added 21 records from public with Disease.name for "disease": 'thyroid cancer', 'rhabdoid tumor', 'colorectal cancer', 'ovarian cancer', 'prostate cancer', 'neuroblastoma', 'gastric cancer', 'head and neck cancer', 'uterine corpus cancer', 'liver cancer', 'breast cancer', 'esophageal cancer', 'lung cancer', 'bone cancer', 'sarcoma', 'kidney cancer', 'urinary bladder carcinoma', 'brain cancer', 'malignant pancreatic neoplasm', 'bile duct cancer', ...
• saving validated records of 'sex'
✓ added 2 records from public with Phenotype.name for "sex": 'male', 'female'
• saving validated records of 'cell_line'
✓ added 183 records from public with CellLine.name for "cell_line": 'LS1034', 'SW48', 'HCC1143', 'SNU-C2A', 'HT-29', 'NCI-H1048', 'KP-4', 'SK-UT-1', 'TEN', 'TE-8', 'DMS 273', 'MSTO-211H', 'TOV-112D', 'NIH:OVCAR-3', 'NUGC-3', 'RH-30', 'RCC10RGB', 'NCI-H1793', 'EFM-192A', 'HCC-1195', ...
• saving validated records of 'pert_compound'
✓ added 8 records from public with Compound.name for "pert_compound": 'navitoclax', 'bortezomib', 'JQ1', 'trametinib', 'everolimus', 'dabrafenib', 'gemcitabine', 'afatinib'
• mapping "var_index" on Gene.ensembl_gene_id
! 2 terms are not validated: 'ENSG00000255823', 'ENSG00000272370'
→ fix typos, remove non-existent values, or save terms via .add_new_from_var_index()
• mapping "assay" on ExperimentalFactor.name
! 1 term is not validated: 'unknown'
→ fix typos, remove non-existent values, or save terms via .add_new_from("assay")
• mapping "cell_type" on CellType.name
! 1 term is not validated: 'unknown'
→ fix typos, remove non-existent values, or save terms via .add_new_from("cell_type")
• mapping "development_stage" on DevelopmentalStage.name
! 1 term is not validated: 'unknown'
→ fix typos, remove non-existent values, or save terms via .add_new_from("development_stage")
• mapping "disease" on Disease.name
! 1 term is not validated: 'pancreatic cancer'
1 synonym found: "pancreatic cancer" → "malignant pancreatic neoplasm"
→ curate synonyms via .standardize("disease")
• mapping "donor_id" on ULabel.name
! 1 term is not validated: 'unknown'
→ fix typos, remove non-existent values, or save terms via .add_new_from("donor_id")
• mapping "self_reported_ethnicity" on Ethnicity.name
! 1 term is not validated: 'unknown'
→ fix typos, remove non-existent values, or save terms via .add_new_from("self_reported_ethnicity")
• mapping "sex" on Phenotype.name
! 3 terms are not validated: 'Male', 'Female', 'Unknown'
2 synonyms found: "Male" → "male", "Female" → "female"
→ curate synonyms via .standardize("sex") for remaining terms:
→ fix typos, remove non-existent values, or save terms via .add_new_from("sex")
• mapping "suspension_type" on ULabel.name
! 1 term is not validated: 'cell'
→ fix typos, remove non-existent values, or save terms via .add_new_from("suspension_type")
• mapping "tissue_type" on ULabel.name
! 1 term is not validated: 'cell_line'
→ fix typos, remove non-existent values, or save terms via .add_new_from("tissue_type")
✓ "organism" is validated against Organism.name
✓ "cell_line" is validated against CellLine.name
• mapping "pert_genetic" on GeneticPerturbation.name
! 4 terms are not validated: 'sggpx4-2', 'sglacz', 'sggpx4-1', 'sgor2j2'
→ fix typos, remove non-existent values, or save terms via .add_new_from("pert_genetic")
• mapping "pert_compound" on Compound.name
! 6 terms are not validated: 'brd3379', 'azd5591', 'control', 'prexasertib', 'taselisib', 'idasanutlin'
→ fix typos, remove non-existent values, or save terms via .add_new_from("pert_compound")
False
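The log above sorts each term into one of three groups: validated, matched through a synonym, or not validated at all. The toy model below mimics that classification for the “sex” column. It is an illustration only: `REGISTRY` and `inspect_terms` are hypothetical and say nothing about how lamindb implements validation internally.

```python
# Toy registry: canonical names mapped to known synonyms (illustrative values).
REGISTRY = {
    "male": {"Male"},
    "female": {"Female"},
}


def inspect_terms(values, registry):
    """Split unique values into validated names, synonym hits and unknown terms."""
    synonym_to_name = {
        synonym: name for name, synonyms in registry.items() for synonym in synonyms
    }
    validated, synonyms, unknown = [], {}, []
    for value in dict.fromkeys(values):  # unique values, original order
        if value in registry:
            validated.append(value)
        elif value in synonym_to_name:
            synonyms[value] = synonym_to_name[value]
        else:
            unknown.append(value)
    return validated, synonyms, unknown


validated, synonyms, unknown = inspect_terms(
    ["Male", "Female", "Unknown", "Male"], REGISTRY
)
print(validated)  # []
print(synonyms)   # {'Male': 'male', 'Female': 'female'}
print(unknown)    # ['Unknown']
```

In this picture, synonym hits are candidates for `.standardize(...)`, while unknown terms have to be fixed by hand or registered via `.add_new_from(...)`.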
Genetic perturbations
# register genetic perturbations with their target genes
pert_target_map = {
    "sggpx4-1": "GPX4",
    "sggpx4-2": "GPX4",
    "sgor2j2": "OR2J2",  # cutting control
}
for sg_name, gene_symbol in pert_target_map.items():
    pert = wl.GeneticPerturbation(
        system="CRISPR-Cas9",
        name=sg_name,
        description="cutting control" if sg_name == "sgor2j2" else None,
    ).save()
    target = wl.PerturbationTarget(name=gene_symbol).save()
    pert.targets.add(target)
    gene = bt.Gene.from_source(symbol=gene_symbol, organism="human").save()
    target.genes.set([gene] if isinstance(gene, bt.Gene) else gene)
adata.obs["pert_target"] = adata.obs["pert_genetic"].map(pert_target_map)
# register the negative control without targets: Non-cutting control
wl.GeneticPerturbation(
    name="sglacz", system="CRISPR-Cas9", description="non-cutting control"
).save();
✓ created 1 Gene record from Bionty matching symbol: 'GPX4'
! record with similar name exists! did you mean to load it?
id | uid | name | system | description | sequence | on_target_score | off_target_score | run_id | created_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|
1 | MDGYBKBXcdcu | sggpx4-1 | CRISPR-Cas9 | None | None | None | None | 1 | 2024-12-20 14:58:45.707716+00:00 | 1 |
→ returning existing PerturbationTarget record with same name: 'GPX4'
✓ created 1 Gene record from Bionty matching symbol: 'OR2J2'
! ambiguous validation in Bionty for 1 record: 'OR2J2'
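Mapping guide names to target genes with a dictionary leaves every name that is not a key without a target. With pandas, `Series.map` called with a dict yields NaN for such entries; the plain-Python sketch below uses `None` instead. The helper `map_targets` is a hypothetical name used only for illustration.

```python
pert_target_map = {
    "sggpx4-1": "GPX4",
    "sggpx4-2": "GPX4",
    "sgor2j2": "OR2J2",
}


def map_targets(names, mapping):
    """Look up each name; names without an entry map to None."""
    return [mapping.get(name) for name in names]


pert_names = ["sggpx4-1", "sggpx4-2", "sgor2j2", "sglacz", "trametinib"]
targets = map_targets(pert_names, pert_target_map)
print(targets)  # ['GPX4', 'GPX4', 'OR2J2', None, None]
```

The non-cutting control `sglacz` and compound names such as `trametinib` therefore carry no target, which is consistent with only GPX4 and OR2J2 being listed under `pert_target`.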
Compounds
# the remaining compounds are not in ChEBI, so we create new records for them
curator.add_new_from("pert_compound")
✓ added 6 records with Compound.name for "pert_compound": 'taselisib', 'brd3379', 'control', 'azd5591', 'prexasertib', 'idasanutlin'
Curate non-pert metadata
# manually fix sex and set assay
adata.obs["sex"] = adata.obs["sex"].cat.rename_categories({"Unknown": "unknown"})
adata.obs["assay"] = "10x 3' v3"
# subset the adata to only include the validated genes
adata = adata[:, ~adata.var_names.isin(curator.non_validated["var_index"])].copy()
# standardize disease and sex as suggested
curator.standardize("disease")
curator.standardize("sex")
✓ standardized 1 synonym in "disease": "pancreatic cancer" → "malignant pancreatic neoplasm"
✓ standardized 2 synonyms in "sex": "Male" → "male", "Female" → "female"
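The gene subsetting in the cell above keeps only the variables whose identifiers were validated. The same boolean-mask logic can be sketched on plain lists; the identifiers are taken from the validation log, while the variable names are illustrative.

```python
var_names = [
    "ENSG00000102316",
    "ENSG00000255823",
    "ENSG00000109472",
    "ENSG00000272370",
]
non_validated = {"ENSG00000255823", "ENSG00000272370"}

# True where the identifier was validated, mirroring ~var_names.isin(non_validated)
keep_mask = [name not in non_validated for name in var_names]
kept = [name for name, keep in zip(var_names, keep_mask) if keep]

print(keep_mask)  # [True, False, True, False]
print(kept)       # ['ENSG00000102316', 'ENSG00000109472']
```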
# Recreate Curator object because we are using a new adata
curator = wl.PertCurator(adata)
curator.validate()
→ mapped 'pert_name' to 'pert_compound'
→ mapped 'pert_name' to 'pert_genetic'
✓ added 1 record with Feature.name for "columns": 'pert_target'
• saving validated records of 'assay'
✓ added 1 record from public with ExperimentalFactor.name for "assay": '10x 3' v3'
✓ "var_index" is validated against Gene.ensembl_gene_id
✓ "assay" is validated against ExperimentalFactor.name
• mapping "cell_type" on CellType.name
! 1 term is not validated: 'unknown'
→ fix typos, remove non-existent values, or save terms via .add_new_from("cell_type")
• mapping "development_stage" on DevelopmentalStage.name
! 1 term is not validated: 'unknown'
→ fix typos, remove non-existent values, or save terms via .add_new_from("development_stage")
• mapping "disease" on Disease.name
! 1 term is not validated: 'pancreatic cancer'
1 synonym found: "pancreatic cancer" → "malignant pancreatic neoplasm"
→ curate synonyms via .standardize("disease")
• mapping "donor_id" on ULabel.name
! 1 term is not validated: 'unknown'
→ fix typos, remove non-existent values, or save terms via .add_new_from("donor_id")
• mapping "self_reported_ethnicity" on Ethnicity.name
! 1 term is not validated: 'unknown'
→ fix typos, remove non-existent values, or save terms via .add_new_from("self_reported_ethnicity")
• mapping "sex" on Phenotype.name
! 3 terms are not validated: 'Male', 'Female', 'unknown'
2 synonyms found: "Male" → "male", "Female" → "female"
→ curate synonyms via .standardize("sex") for remaining terms:
→ fix typos, remove non-existent values, or save terms via .add_new_from("sex")
• mapping "suspension_type" on ULabel.name
! 1 term is not validated: 'cell'
→ fix typos, remove non-existent values, or save terms via .add_new_from("suspension_type")
• mapping "tissue_type" on ULabel.name
! 1 term is not validated: 'cell_line'
→ fix typos, remove non-existent values, or save terms via .add_new_from("tissue_type")
✓ "organism" is validated against Organism.name
✓ "cell_line" is validated against CellLine.name
✓ "pert_target" is validated against PerturbationTarget.name
✓ "pert_genetic" is validated against GeneticPerturbation.name
✓ "pert_compound" is validated against Compound.name
False
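In the output above, 'pancreatic cancer', 'Male' and 'Female' are reported as not validated again, although `standardize()` logged them as standardized in the previous cell. One possible explanation, stated here as an assumption rather than confirmed behavior, is that `standardize()` modified the AnnData held by the first curator, while the new curator received the copy created before standardization. The toy sketch below shows this ordering effect; `ToyCurator` is a hypothetical stand-in and not the wetlab class.

```python
import copy

SYNONYMS = {"Male": "male", "Female": "female"}


class ToyCurator:
    """Stand-in curator that keeps a reference to the table it was given."""

    def __init__(self, table):
        self.table = table  # a reference, not a copy

    def standardize(self, column):
        self.table[column] = [SYNONYMS.get(v, v) for v in self.table[column]]


def copy_then_standardize(table):
    curator = ToyCurator(table)
    subset = copy.deepcopy(table)  # copy taken before standardization
    curator.standardize("sex")     # only the original table changes
    return subset


def standardize_then_copy(table):
    curator = ToyCurator(table)
    curator.standardize("sex")
    return copy.deepcopy(table)    # copy taken after standardization


print(copy_then_standardize({"sex": ["Male", "Female", "unknown"]}))
# {'sex': ['Male', 'Female', 'unknown']}
print(standardize_then_copy({"sex": ["Male", "Female", "unknown"]}))
# {'sex': ['male', 'female', 'unknown']}
```

If this assumption holds, standardizing before subsetting, or calling `standardize()` on the recreated curator, would carry the standardized values into the new object.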
curator.add_new_from("all")
✓ added 1 record with CellType.name for "cell_type": 'unknown'
✓ added 1 record with DevelopmentalStage.name for "development_stage": 'unknown'
✓ added 1 record with Disease.name for "disease": 'pancreatic cancer'
✓ added 1 record with ULabel.name for "donor_id": 'unknown'
✓ added 1 record with Ethnicity.name for "self_reported_ethnicity": 'unknown'
✓ added 3 records with Phenotype.name for "sex": 'Male', 'Female', 'unknown'
✓ added 1 record with ULabel.name for "suspension_type": 'cell'
✓ added 1 record with ULabel.name for "tissue_type": 'cell_line'
curator.validate()
✓ "var_index" is validated against Gene.ensembl_gene_id
✓ "assay" is validated against ExperimentalFactor.name
✓ "cell_type" is validated against CellType.name
✓ "development_stage" is validated against DevelopmentalStage.name
✓ "disease" is validated against Disease.name
✓ "donor_id" is validated against ULabel.name
✓ "self_reported_ethnicity" is validated against Ethnicity.name
✓ "sex" is validated against Phenotype.name
✓ "suspension_type" is validated against ULabel.name
✓ "tissue_type" is validated against ULabel.name
✓ "organism" is validated against Organism.name
✓ "cell_line" is validated against CellLine.name
✓ "pert_target" is validated against PerturbationTarget.name
✓ "pert_genetic" is validated against GeneticPerturbation.name
✓ "pert_compound" is validated against Compound.name
True
References
reference = ops.Reference(
name="Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action",
abbr="McFarland 2020",
url="https://www.nature.com/articles/s41467-020-17440-w",
doi="10.1038/s41467-020-17440-w",
text=(
"Assays to study cancer cell responses to pharmacologic or genetic perturbations are typically "
"restricted to using simple phenotypic readouts such as proliferation rate. Information-rich assays, "
"such as gene-expression profiling, have generally not permitted efficient profiling of a given "
"perturbation across multiple cellular contexts. Here, we develop MIX-Seq, a method for multiplexed "
"transcriptional profiling of post-perturbation responses across a mixture of samples with single-cell "
"resolution, using SNP-based computational demultiplexing of single-cell RNA-sequencing data. We show "
"that MIX-Seq can be used to profile responses to chemical or genetic perturbations across pools of 100 "
"or more cancer cell lines. We combine it with Cell Hashing to further multiplex additional experimental "
"conditions, such as post-treatment time points or drug doses. Analyzing the high-content readout of "
"scRNA-seq reveals both shared and context-specific transcriptional response components that can identify "
"drug mechanism of action and enable prediction of long-term cell viability from short-term transcriptional "
"responses to treatment."
),
).save()
Register curated artifact
artifact = curator.save_artifact(description="McFarland AnnData")
! 30 unique terms (65.20%) are not validated for name: 'depmap_id', 'cancer', 'cell_det_rate', 'cell_quality', 'channel', 'dose_unit', 'dose_value', 'doublet_CL1', 'doublet_CL2', 'doublet_GMM_prob', ...
! did not create Feature records for 30 non-validated names: 'cancer', 'cell_det_rate', 'cell_quality', 'channel', 'chembl-ID', 'depmap_id', 'dose_unit', 'dose_value', 'doublet_CL1', 'doublet_CL2', 'doublet_GMM_prob', 'doublet_dev_imp', 'doublet_z_margin', 'hash_assignment', 'hash_tag', 'ncounts', 'ngenes', 'nperts', 'num_SNPs', 'percent_mito', ...
# link the reference to the artifact
artifact.references.add(reference)
artifact.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = '1Er757qaE84Kv1Dx0000'
│   ├── .size = 3500320
│   ├── .hash = 'AQrOlYzdag6efYv2ZgXsuw'
│   ├── .n_observations = 1000
│   ├── .path = /home/runner/work/wetlab/wetlab/docs/guide/test-pert-curator/.lamindb/1Er757qaE84Kv1Dx0000.h5ad
│   ├── .created_by = anonymous
│   ├── .created_at = 2024-12-20 14:59:31
│   └── .transform = 'PertCurator'
├── Dataset features/.feature_sets
│   ├── var • 1869 [bionty.Gene]
│   │   MAGED2                   float
│   │   CPE                      float
│   │   DDX43                    float
│   │   SPANXA2                  float
│   │   LNCPRESS1                float
│   │   TMEM175                  float
│   │   MMD                      float
│   │   LINC01834                float
│   │   DYNC2I1                  float
│   │   CCRL2                    float
│   │   MRPS6                    float
│   │   KCNH1                    float
│   │   TBILA                    float
│   │   SLC5A7                   float
│   │   GSX2                     float
│   │   ATP5MG                   float
│   │   TMEM190                  float
│   │   GTF2H1                   float
│   └── obs • 16 [Feature]
│       assay                    cat[bionty.ExperimentalF…  10x 3' v3
│       cell_line                cat[bionty.CellLine]       22Rv1, 253J-BV, 42-MG-BA, 639-V, 647-V, …
│       cell_type                cat[bionty.CellType]       unknown
│       development_stage        cat[bionty.Developmental…  unknown
│       disease                  cat[bionty.Disease]        bile duct cancer, bone cancer, brain can…
│       donor_id                 cat[ULabel]                unknown
│       organism                 cat[bionty.Organism]       human
│       pert_compound            cat[wetlab.Compound]       JQ1, afatinib, azd5591, bortezomib, brd3…
│       pert_genetic             cat[wetlab.GeneticPertur…  sggpx4-1, sggpx4-2, sglacz, sgor2j2
│       pert_target              cat[wetlab.PerturbationT…  GPX4, OR2J2
│       self_reported_ethnicity  cat[bionty.Ethnicity]      unknown
│       sex                      cat[bionty.Phenotype]      Female, Male, unknown
│       suspension_type          cat[ULabel]                cell
│       tissue_type              cat[ULabel]                cell_line
│       pert_dose                str
│       pert_time                str
└── Labels
    └── .organisms               bionty.Organism            human
        .cell_types              bionty.CellType            unknown
        .diseases                bionty.Disease             thyroid cancer, rhabdoid tumor, colorect…
        .cell_lines              bionty.CellLine            LS1034, SW48, HCC1143, SNU-C2A, HT-29, N…
        .phenotypes              bionty.Phenotype           Male, Female, unknown
        .experimental_factors    bionty.ExperimentalFactor  10x 3' v3
        .developmental_stages    bionty.DevelopmentalStage  unknown
        .ethnicities             bionty.Ethnicity           unknown
        .references              ourprojects.Reference      Multiplexed single-cell transcriptional …
        .compounds               wetlab.Compound            JQ1, afatinib, azd5591, bortezomib, brd3…
        .perturbation_targets    wetlab.PerturbationTarget  GPX4, OR2J2
        .genetic_perturbations   wetlab.GeneticPerturbati…  sggpx4-1, sggpx4-2, sglacz, sgor2j2
        .ulabels                 ULabel                     unknown, cell, cell_line