
Multi-modal

Here, we'll show how to curate and register ECCITE-seq data from Papalexi21 as a MuData object.

ECCITE-seq measures single-cell transcriptomes together with surface protein markers in the context of CRISPR screens.

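MuData objects build on top of AnnData objects to store multimodal data: each modality (here `rna`, `adt`, `hto` and `gdo`) is kept as its own AnnData object inside one container.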

!lamin init --storage ./test-multimodal --schema bionty
💡 connected lamindb: testuser1/test-multimodal
import lamindb as ln
import bionty as bt
💡 connected lamindb: testuser1/test-multimodal
mdata = ln.core.datasets.mudata_papalexi21_subset()
mdata
MuData object with n_obs × n_vars = 200 × 300
  obs:	'perturbation', 'replicate'
  var:	'name'
  4 modalities
    rna:	200 x 173
      obs:	'nCount_RNA', 'nFeature_RNA', 'percent.mito'
      var:	'name'
    adt:	200 x 4
      obs:	'nCount_ADT', 'nFeature_ADT'
      var:	'name'
    hto:	200 x 12
      obs:	'nCount_HTO', 'nFeature_HTO', 'technique'
      var:	'name'
    gdo:	200 x 111
      obs:	'nCount_GDO'
      var:	'name'
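
The summary above can be sanity-checked by hand: in this subset every modality covers the same 200 cells, and the per-modality variable counts add up to the global `n_vars`. A minimal sketch using only the numbers printed above (the dictionary name is illustrative):

```python
# Shapes copied from the MuData summary printed above: (n_obs, n_vars).
modality_shapes = {
    "rna": (200, 173),
    "adt": (200, 4),
    "hto": (200, 12),
    "gdo": (200, 111),
}

# Distinct observation counts across modalities.
n_obs_values = {n_obs for n_obs, _ in modality_shapes.values()}
# Variables are stacked across modalities, so their counts are summed.
total_n_vars = sum(n_vars for _, n_vars in modality_shapes.values())

print(n_obs_values)   # the set of observation counts
print(total_n_vars)   # 173 + 4 + 12 + 111
```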

Validate annotations

annotate = ln.Annotate.from_mudata(
    mdata,
    var_index={
        "rna": bt.Gene.symbol,  # gene expression
        "adt": bt.CellMarker.name,  # antibody-derived tags reflecting surface proteins
        "hto": ln.Feature.name,  # cell hashing
        "gdo": ln.Feature.name,  # guide RNAs
    },
    categoricals={
        "perturbation": ln.ULabel.name,  # shared categorical
        "replicate": ln.ULabel.name,  # shared categorical
        "hto:technique": bt.ExperimentalFactor.name,  # modality-specific categorical, keyed as "modality:column"
    },
    organism="human",
)
3 non-validated categories are not saved in Feature.name: ['nCount_RNA', 'nFeature_RNA', 'percent.mito']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns
✅ added 1 record with Feature.name for columns: 'technique'
2 non-validated categories are not saved in Feature.name: ['nCount_HTO', 'nFeature_HTO']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns
✅ added 2 records with Feature.name for columns: 'perturbation', 'replicate'
45 non-validated categories are not saved in Feature.name: ['gdo:percent.mito', 'gdo:perturbation', 'hto:nFeature_HTO', 'hto:perturbation', 'gdo:nCount_GDO', 'rna:nFeature_RNA', 'gdo:HTO_classification', 'adt:guide_ID', 'gdo:guide_ID', 'hto:replicate', 'adt:NT', 'adt:replicate', 'gdo:S.Score', 'hto:gene_target', 'gdo:MULTI_ID', 'adt:percent.mito', 'hto:HTO_classification', 'hto:S.Score', 'gdo:orig.ident', 'hto:percent.mito', 'adt:nFeature_ADT', 'adt:perturbation', 'hto:orig.ident', 'adt:gene_target', 'gdo:NT', 'adt:orig.ident', 'adt:Phase', 'hto:MULTI_ID', 'hto:nCount_HTO', 'hto:guide_ID', 'gdo:replicate', 'rna:nCount_RNA', 'adt:G2M.Score', 'adt:MULTI_ID', 'gdo:gene_target', 'adt:nCount_ADT', 'hto:Phase', 'gdo:G2M.Score', 'adt:HTO_classification', 'gdo:Phase', 'adt:S.Score', 'hto:technique', 'rna:percent.mito', 'hto:G2M.Score', 'hto:NT']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns
2 non-validated categories are not saved in Feature.name: ['nCount_ADT', 'nFeature_ADT']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns
1 non-validated categories are not saved in Feature.name: ['nCount_GDO']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns
✅ added 100 records from public with Gene.symbol for var_index: 'SH2D6', 'MEF2C-AS2', 'ARHGAP26-AS1', 'GABRA1', 'H4C12', 'HLA-DQB1-AS1', 'SPACA1', 'VNN1', 'CTAGE15', 'PFKFB1', 'TRPC5', 'RBPMS-AS1', 'CA8', 'CSMD3', 'ZNF483', 'AK8', 'TMEM72-AS1', 'ARAP1-AS2', 'CRYAB', 'DNAI7', ...
84 non-validated categories are not saved in Gene.symbol: ['RP5-827C21.6', 'XX-CR54.1', 'RP11-379B18.5', 'RP11-778D9.12', 'RP11-703G6.1', 'AC005150.1', 'RP11-717H13.1', 'CTC-498J12.1', 'RP11-524H19.2', 'AC006042.7', 'AC002066.1', 'AC073934.6', 'RP11-268G12.1', 'U52111.14', 'RP11-235C23.5', 'RP11-12J10.3', 'RP11-324E6.9', 'RP11-187A9.3', 'RP11-365N19.2', 'RP11-346D14.1', 'RP11-265N6.2', 'CTD-3065B20.2', 'RP11-304L19.11', 'AC026471.6', 'AC091132.1', 'RP11-138C9.1', 'RP11-75C10.9', 'RP11-835E18.5', 'RP11-760N9.1', 'RP11-17J14.2', 'CTD-3193O13.8', 'AC004019.13', 'RP11-465N4.4', 'RP11-434D9.1', 'RP11-325L7.1', 'RP11-134K13.4', 'RP5-855F16.1', 'RP3-327A19.5', 'RP11-546K22.3', 'RP11-473O4.4', 'RP13-582O9.7', 'RP11-12D24.10', 'RP11-120C12.3', 'RP11-80H5.7', 'RP11-496I9.1', 'AP000442.4', 'RP11-867G23.3', 'RP11-113K21.4', 'RP11-745O10.2', 'RP11-335O4.3', 'RP11-408E5.4', 'AE000662.93', 'AL132989.1', 'RP11-973N13.4', 'RP11-982M15.2', 'RP11-32B5.7', 'RP1-1J6.2', 'RP3-337O18.9', 'AC011558.5', 'CTA-373H7.7', 'RP11-415J8.5', 'AC092687.5', 'RP11-532F6.4', 'RP11-146I2.1', 'RP11-624M8.1', 'RP11-219B4.7', 'RP11-9M16.2', 'RP11-247A12.8', 'RP11-536K7.5', 'RP11-186N15.3', 'RP11-152H18.3', 'CTD-3012A18.1', 'CTD-2562J17.2', 'RP11-136I14.5', 'RP11-110I1.14', 'RP11-2H8.2', 'RP11-307N16.6', 'RP11-3D4.2', 'RP11-231C14.4', 'CTB-134F13.1', 'RP11-403P17.5', 'RP11-214C8.2', 'CTB-31O20.9', 'AC092295.4']!
      → to lookup categories, use lookup().var_index
      → to save, run add_new_from_var_index
✅ added 4 records from public with CellMarker.name for var_index: 'CD86', 'PDL1', 'PDL2', 'CD366'
12 non-validated categories are not saved in Feature.name: ['rep1-tx', 'rep1-ctrl', 'rep2-tx', 'rep2-ctrl', 'PDL1g1-tx', 'PDL1g1-ctrl', 'PDL1g2-tx', 'PDL1g2-ctrl', 'rep3-tx', 'rep3-ctrl', 'rep4-tx', 'rep4-ctrl']!
      → to lookup categories, use lookup().var_index
      → to save, run add_new_from_var_index
111 non-validated categories are not saved in Feature.name: ['eGFPg1', 'CUL3g1', 'CUL3g2', 'CUL3g3', 'CMTM6g1', 'CMTM6g2', 'CMTM6g3', 'NTg1', 'NTg2', 'NTg3', 'NTg4', 'NTg5', 'NTg7', 'PDL1g1', 'PDL1g2', 'PDL1g3', 'ATF2g1', 'ATF2g2', 'ATF2g3', 'ATF2g4', 'BRD4g1', 'BRD4g2', 'BRD4g3', 'BRD4g4', 'CAV1g1', 'CAV1g2', 'CAV1g3', 'CAV1g4', 'CD86g1', 'CD86g2', 'CD86g3', 'CD86g4', 'ETV7g1', 'ETV7g2', 'ETV7g3', 'ETV7g4', 'IFNGR1g1', 'IFNGR1g2', 'IFNGR1g3', 'IFNGR1g4', 'IFNGR2g1', 'IFNGR2g2', 'IFNGR2g3', 'IFNGR2g4', 'IRF1g1', 'IRF1g2', 'IRF1g3', 'IRF1g4', 'IRF7g1', 'IRF7g2', 'IRF7g3', 'IRF7g4', 'JAK2g1', 'JAK2g2', 'JAK2g3', 'JAK2g4', 'MARCH8g1', 'MARCH8g2', 'MARCH8g3', 'MARCH8g4', 'MYCg1', 'MYCg2', 'MYCg3', 'MYCg4', 'NFKBIAg1', 'NFKBIAg2', 'NFKBIAg3', 'NFKBIAg4', 'PDCD1LG2g1', 'PDCD1LG2g2', 'PDCD1LG2g3', 'PDCD1LG2g4', 'POU2F2g1', 'POU2F2g2', 'POU2F2g3', 'POU2F2g4', 'SMAD4g1', 'SMAD4g2', 'SMAD4g3', 'SMAD4g4', 'SPI1g1', 'SPI1g2', 'SPI1g3', 'SPI1g4', 'STAT1g1', 'STAT1g2', 'STAT1g3', 'STAT1g4', 'STAT2g1', 'STAT2g2', 'STAT2g3', 'STAT2g4', 'STAT3g1', 'STAT3g2', 'STAT3g3', 'STAT3g4', 'STAT5Ag1', 'STAT5Ag2', 'STAT5Ag3', 'STAT5Ag4', 'TNFRSF14g1', 'TNFRSF14g2', 'TNFRSF14g3', 'TNFRSF14g4', 'UBE2L6g1', 'UBE2L6g2', 'UBE2L6g3', 'UBE2L6g4', 'NTg8', 'NTg9', 'NTg10']!
      → to lookup categories, use lookup().var_index
      → to save, run add_new_from_var_index
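
The `categoricals` mapping above mixes two kinds of keys: plain column names such as `perturbation`, which refer to the shared `obs`, and `modality:column` keys such as `hto:technique`, which refer to a column inside a single modality. The helper below is a hypothetical illustration of that naming convention, not lamindb's own parser:

```python
from typing import Optional, Tuple


def split_categorical_key(key: str) -> Tuple[Optional[str], str]:
    """Split 'modality:column' into its parts; plain keys have no modality.

    Hypothetical helper, for illustration only.
    """
    modality, sep, column = key.partition(":")
    if not sep:
        # No colon: the key names a column of the shared obs.
        return None, key
    return modality, column


for key in ["perturbation", "replicate", "hto:technique"]:
    print(key, "->", split_categorical_key(key))
```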
# add new gene symbols from the ['rna'].var.index
annotate.add_new_from_var_index("rna")

# add new categories from the hto and gdo var.index
annotate.add_new_from_var_index("hto")
annotate.add_new_from_var_index("gdo")

# optional: register additional columns we'd like to annotate
annotate.add_new_from_columns(modality="rna")
annotate.add_new_from_columns(modality="adt")
annotate.add_new_from_columns(modality="hto")
annotate.add_new_from_columns(modality="gdo")
✅ added 84 records with Gene.symbol for var_index: 'RP5-827C21.6', 'XX-CR54.1', 'RP11-379B18.5', 'RP11-778D9.12', 'RP11-703G6.1', 'AC005150.1', 'RP11-717H13.1', 'CTC-498J12.1', 'RP11-524H19.2', 'AC006042.7', 'AC002066.1', 'AC073934.6', 'RP11-268G12.1', 'U52111.14', 'RP11-235C23.5', 'RP11-12J10.3', 'RP11-324E6.9', 'RP11-187A9.3', 'RP11-365N19.2', 'RP11-346D14.1', ...
✅ added 12 records with Feature.name for var_index: 'rep1-tx', 'rep1-ctrl', 'rep2-tx', 'rep2-ctrl', 'PDL1g1-tx', 'PDL1g1-ctrl', 'PDL1g2-tx', 'PDL1g2-ctrl', 'rep3-tx', 'rep3-ctrl', 'rep4-tx', 'rep4-ctrl'
✅ added 111 records with Feature.name for var_index: 'eGFPg1', 'CUL3g1', 'CUL3g2', 'CUL3g3', 'CMTM6g1', 'CMTM6g2', 'CMTM6g3', 'NTg1', 'NTg2', 'NTg3', 'NTg4', 'NTg5', 'NTg7', 'PDL1g1', 'PDL1g2', 'PDL1g3', 'ATF2g1', 'ATF2g2', 'ATF2g3', 'ATF2g4', ...
✅ added 3 records with Feature.name for rna obs columns: 'nCount_RNA', 'nFeature_RNA', 'percent.mito'
✅ added 2 records with Feature.name for adt obs columns: 'nCount_ADT', 'nFeature_ADT'
✅ added 2 records with Feature.name for hto obs columns: 'nCount_HTO', 'nFeature_HTO'
✅ added 1 record with Feature.name for gdo obs columns: 'nCount_GDO'
annotate.validate()
✅ rna_var_index is validated against Gene.symbol
✅ adt_var_index is validated against CellMarker.name
✅ hto_var_index is validated against Feature.name
✅ gdo_var_index is validated against Feature.name
💡 mapping perturbation on ULabel.name
2 terms are not validated: 'Perturbed', 'NT'
      → save terms via .add_new_from('perturbation')
💡 mapping replicate on ULabel.name
3 terms are not validated: 'rep3', 'rep1', 'rep2'
      → save terms via .add_new_from('replicate')
💡 mapping technique on ExperimentalFactor.name
❗    found 1 terms validated terms: ['cell hashing']
      → save terms via .add_validated_from('technique')
✅ technique is validated against ExperimentalFactor.name
False
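
`validate()` returns `False` here because `perturbation` and `replicate` still contain terms without a matching record. Conceptually, a column passes only if every one of its values is found in the registry it is mapped to, and the overall result is the conjunction over all columns. A simplified sketch of that logic, with plain sets standing in for the registries (the function and variable names are assumptions, not lamindb internals):

```python
def non_validated(values, registered):
    """Return the distinct values with no match in `registered`, keeping order."""
    return [v for v in dict.fromkeys(values) if v not in registered]


# Terms taken from the validation log above.
columns = {
    "perturbation": ["Perturbed", "NT"],
    "replicate": ["rep3", "rep1", "rep2"],
    "technique": ["cell hashing"],
}
# Stand-in registries: no ULabel records exist yet, while 'cell hashing'
# is matched by the public ExperimentalFactor ontology.
registries = {
    "perturbation": set(),
    "replicate": set(),
    "technique": {"cell hashing"},
}

missing = {name: non_validated(vals, registries[name]) for name, vals in columns.items()}
is_valid = all(not terms for terms in missing.values())
print(missing)
print(is_valid)

# Registering the missing terms makes the same check pass.
for name, terms in missing.items():
    registries[name].update(terms)
missing_after = {name: non_validated(vals, registries[name]) for name, vals in columns.items()}
is_valid_after = all(not terms for terms in missing_after.values())
print(is_valid_after)
```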
# add validated and new categories
annotate.add_new_from("perturbation")
annotate.add_new_from("replicate")
annotate.add_validated_from("technique", modality="hto")
✅ added 2 records with ULabel.name for perturbation: 'Perturbed', 'NT'
✅ added 3 records with ULabel.name for replicate: 'rep3', 'rep1', 'rep2'
✅ added 1 record from public with ExperimentalFactor.name for technique: 'cell hashing'
annotate.validate()
✅ rna_var_index is validated against Gene.symbol
✅ adt_var_index is validated against CellMarker.name
✅ hto_var_index is validated against Feature.name
✅ gdo_var_index is validated against Feature.name
✅ perturbation is validated against ULabel.name
✅ replicate is validated against ULabel.name
✅ technique is validated against ExperimentalFactor.name
True

Register annotated artifact

artifact = annotate.save_artifact(description="Sub-sampled MuData from Papalexi21")
❗ no run & transform get linked, consider calling ln.track()
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/OQ8lnq7gdrJMsQ8019t5.h5mu')
✅ storing artifact 'OQ8lnq7gdrJMsQ8019t5' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal/.lamindb/OQ8lnq7gdrJMsQ8019t5.h5mu'
💡 you can auto-track these data as a run input by calling `ln.track()`
✅ loaded 2 Feature records matching name: 'perturbation', 'replicate'
did not create Feature records for 45 non-validated names: 'adt:G2M.Score', 'adt:HTO_classification', 'adt:MULTI_ID', 'adt:NT', 'adt:Phase', 'adt:S.Score', 'adt:gene_target', 'adt:guide_ID', 'adt:nCount_ADT', 'adt:nFeature_ADT', 'adt:orig.ident', 'adt:percent.mito', 'adt:perturbation', 'adt:replicate', 'gdo:G2M.Score', 'gdo:HTO_classification', 'gdo:MULTI_ID', 'gdo:NT', 'gdo:Phase', 'gdo:S.Score', ...
💡 parsing feature names of X stored in slot 'var'
161 terms (93.10%) are validated for symbol
12 terms (6.90%) are not validated for symbol: CTC-467M3.1, HIST1H4K, CASC1, LARGE, NBPF16, C1orf65, IBA57-AS1, KIAA1239, TMEM75, AP003419.16, FAM65C, C14orf177
✅    linked: FeatureSet(uid='MdKKFlvwGLlzIN6HiG7E', n=172, dtype='float', registry='bionty.Gene', hash='y1Qo897t3gp9S3it4dz6', created_by_id=1)
💡 parsing feature names of slot 'obs'
3 terms (100.00%) are validated for name
✅    linked: FeatureSet(uid='predvpWnECyc1sY7Arm6', n=3, registry='Feature', hash='ujSdHDX-fNIAoHBWnTOA', created_by_id=1)
💡 parsing feature names of X stored in slot 'var'
4 terms (100.00%) are validated for name
✅    linked: FeatureSet(uid='uUX8P2aLj3KDHB5vAU4G', n=4, dtype='float', registry='bionty.CellMarker', hash='o8EDT805HnP0Fmk4uZ9e', created_by_id=1)
💡 parsing feature names of slot 'obs'
2 terms (100.00%) are validated for name
✅    linked: FeatureSet(uid='IDX2TyfxhevGelPkMcRg', n=2, registry='Feature', hash='wKLR1uC3UiY65RnLcoVm', created_by_id=1)
💡 parsing feature names of X stored in slot 'var'
12 terms (100.00%) are validated for name
✅    linked: FeatureSet(uid='ONTeKYyqhFIs76EGQuhZ', n=12, dtype='float', registry='Feature', hash='jX0L9DnQgqEO-acVDKcz', created_by_id=1)
💡 parsing feature names of slot 'obs'
3 terms (100.00%) are validated for name
✅    linked: FeatureSet(uid='rRJInxI2YLmgRtlID08f', n=3, registry='Feature', hash='9c1hgEpX5n0PLYeQh68s', created_by_id=1)
💡 parsing feature names of X stored in slot 'var'
111 terms (100.00%) are validated for name
✅    linked: FeatureSet(uid='2gNJsnr0W4lzUpdT6OkO', n=111, dtype='float', registry='Feature', hash='QiE6_BuWAw1V0Hj6H0Mp', created_by_id=1)
💡 parsing feature names of slot 'obs'
1 term (100.00%) is validated for name
✅    linked: FeatureSet(uid='LSq5E9MO4Aw7PTP0xCop', n=1, registry='Feature', hash='4gYVJoTqYgmTmopSNU2_', created_by_id=1)
✅ saved 9 feature sets for slots: 'obs','['rna'].var','['rna'].obs','['adt'].var','['adt'].obs','['hto'].var','['hto'].obs','['gdo'].var','['gdo'].obs'
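
The nine feature sets correspond to one slot for the shared `obs` plus a `.var` and an `.obs` slot for each of the four modalities. The snippet below merely reproduces the slot names from the log line above to make that pattern explicit; it is not how lamindb builds them:

```python
modalities = ["rna", "adt", "hto", "gdo"]

# One slot for the shared obs, then var and obs for every modality.
slots = ["obs"]
for modality in modalities:
    slots.append(f"['{modality}'].var")
    slots.append(f"['{modality}'].obs")

print(len(slots))
print(slots)
```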
artifact.describe()
Artifact(uid='OQ8lnq7gdrJMsQ8019t5', description='Sub-sampled MuData from Papalexi21', suffix='.h5mu', type='dataset', accessor='MuData', size=545560, hash='252nP4Nu-pLH37ZgQQ_tOw', hash_type='md5', n_observations=200, visibility=1, key_is_virtual=True, updated_at='2024-06-19 23:21:10 UTC')
  Provenance
    .created_by = 'testuser1'
    .storage = '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal'
  Labels
    .experimental_factors = 'cell hashing'
    .ulabels = 'Perturbed', 'NT', 'rep3', 'rep1', 'rep2'
  Features
    'perturbation' = 'Perturbed', 'NT'
    'replicate' = 'rep3', 'rep1', 'rep2'
    'technique' = 'cell hashing'
  Feature sets
    'obs' = 'perturbation', 'replicate'
    '['rna'].var' = 'SH2D6', 'ARHGAP26-AS1', 'GABRA1', 'HLA-DQB1-AS1', 'SPACA1', 'VNN1', 'CTAGE15', 'PFKFB1', 'TRPC5', 'RBPMS-AS1', 'CA8', 'CSMD3', 'ZNF483'
    '['rna'].obs' = 'nFeature_RNA', 'percent.mito', 'nCount_RNA'
    '['adt'].var' = 'CD86', 'PDL1', 'PDL2', 'CD366'
    '['adt'].obs' = 'nCount_ADT', 'nFeature_ADT'
    '['hto'].var' = 'rep1-tx', 'rep1-ctrl', 'rep2-tx', 'rep2-ctrl', 'PDL1g1-tx', 'PDL1g1-ctrl', 'PDL1g2-tx', 'PDL1g2-ctrl', 'rep3-tx', 'rep3-ctrl', 'rep4-tx', 'rep4-ctrl'
    '['hto'].obs' = 'technique', 'nCount_HTO', 'nFeature_HTO'
    '['gdo'].var' = 'eGFPg1', 'CUL3g1', 'CUL3g2', 'CUL3g3', 'CMTM6g1', 'CMTM6g2', 'CMTM6g3', 'NTg1', 'NTg2', 'NTg3', 'NTg4', 'NTg5', 'NTg7', 'PDL1g1', 'PDL1g2', 'PDL1g3', 'ATF2g1', 'ATF2g2', 'ATF2g3', 'ATF2g4'
    '['gdo'].obs' = 'nCount_GDO'
# clean up test instance
!rm -r test-multimodal
!lamin delete --force test-multimodal
💡 deleting instance testuser1/test-multimodal