
Standardize and append a batch of data

Here, we’ll learn:

  • how to standardize a less well-curated collection

  • how to append it to the growing versioned collection

import lamindb as ln
import bionty as bt

ln.track("ManDYgmftZ8C0000")
→ connected lamindb: testuser1/test-scrna
→ notebook imports: bionty==0.51.2 lamindb==0.76.12
→ created Transform('ManDYgmf'), started new Run('sgBAklas') at 2024-10-11 09:32:59 UTC

Let’s now consider a less well-curated dataset:

adata = ln.core.datasets.anndata_pbmc68k_reduced()
adata
AnnData object with n_obs × n_vars = 70 × 765
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

We are still working with human data, so we can set the organism globally and then create a curator for the dataset:

bt.settings.organism = "human"
curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.symbol,
    categoricals={adata.obs.cell_type.name: bt.CellType.name},
)
3 non-validated values are not saved in Feature.name: ['percent_mito', 'n_genes', 'louvain']!
      → to lookup values, use lookup().columns
      → to save, run add_new_from_columns

Standardize & validate genes

Let’s convert gene symbols to Ensembl IDs via standardize(). Note that this mapping is not unique: a symbol can correspond to several Ensembl IDs, and the first match is kept because the keep parameter of .standardize() defaults to "first":

adata.var["ensembl_gene_id"] = bt.Gene.standardize(
    adata.var.index,
    field=bt.Gene.symbol,
    return_field=bt.Gene.ensembl_gene_id,
)
# use ensembl_gene_id as the index
adata.var.index.name = "symbol"
adata.var = adata.var.reset_index().set_index("ensembl_gene_id")

# we only want to save data with validated genes
validated = bt.Gene.validate(adata.var.index, bt.Gene.ensembl_gene_id, mute=True)
adata_validated = adata[:, validated].copy()
• standardized 749/765 terms
! found 5 symbols in Bionty: ['GPX1', 'IGLL5', 'SNORD3B-2', 'RN7SL1', 'SOD2']
   please add corresponding Gene records via `.from_values(['ENSG00000254709', 'ENSG00000233276', 'ENSG00000291237', 'ENSG00000276168', 'ENSG00000262074'])`
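
The standardize-then-validate logic above can be sketched without a database. This is a minimal illustration in plain Python, not the bionty implementation: the lookup table, the gene symbols, and the `ENSG_TOY_*` identifiers are made up, and the helper `standardize_first` is hypothetical. It assumes that symbols without a match are passed through unchanged, which is why they later fail validation against Ensembl IDs.

```python
# Toy stand-in for a gene registry: one symbol can map to several Ensembl IDs.
# All symbols and IDs below are invented for illustration.
SYMBOL_TO_ENSEMBL = {
    "GENE_A": ["ENSG_TOY_001"],
    "GENE_B": ["ENSG_TOY_002", "ENSG_TOY_003"],  # ambiguous symbol
}
KNOWN_ENSEMBL_IDS = {i for ids in SYMBOL_TO_ENSEMBL.values() for i in ids}


def standardize_first(symbols):
    """Map each symbol to its first Ensembl ID; leave unmapped symbols unchanged."""
    return [SYMBOL_TO_ENSEMBL[s][0] if s in SYMBOL_TO_ENSEMBL else s for s in symbols]


symbols = ["GENE_A", "GENE_B", "GENE_UNKNOWN"]
ensembl_ids = standardize_first(symbols)

# Validation mask: True only where the value is a known Ensembl ID.
validated = [i in KNOWN_ENSEMBL_IDS for i in ensembl_ids]

# Keep only validated entries, analogous to adata[:, validated].
kept = [i for i, ok in zip(ensembl_ids, validated) if ok]

print(ensembl_ids)
print(validated)
print(kept)
```

In this sketch the ambiguous symbol resolves to its first ID, and the unknown symbol survives standardization but is dropped by the validation mask.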

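Here, we’ll also carry over .raw, subset to the validated genes and re-indexed by Ensembl ID so that it matches adata_validated.var: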

adata_validated.raw = adata.raw[:, validated].to_adata()
adata_validated.raw.var.index = adata_validated.var.index
curate = ln.Curator.from_anndata(
    adata_validated,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={"cell_type": bt.CellType.name},
)
3 non-validated values are not saved in Feature.name: ['percent_mito', 'n_genes', 'louvain']!
      → to lookup values, use lookup().columns
      → to save, run add_new_from_columns
curate.validate()
✓ var_index is validated against Gene.ensembl_gene_id
• mapping cell_type on CellType.name
!    9 terms are not validated: 'Dendritic cells', 'CD19+ B', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD4+/CD25 T Reg', 'CD14+ Monocytes', 'CD56+ NK', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD34+'
      → fix typos, remove non-existent values, or save terms via .add_new_from('cell_type')
False
curate.add_validated_from_var_index()

Standardize & validate cell types

Since none of the cell types are validated, let us search for the cell type names in the public ontology and add each name found in the AnnData object as a synonym of its top match.

bionty = bt.CellType.public()  # access the public ontology through bionty
name_mapper = {}
for name in adata_validated.obs.cell_type.unique():
    # search the public ontology and use the ontology id of the top match
    ontology_id = bionty.search(name).iloc[0].ontology_id
    # create a record by loading the top match from bionty
    record = bt.CellType.from_source(ontology_id=ontology_id)
    name_mapper[name] = record.name  # map the original name to standardized name
    record.save()
    record.add_synonym(name)
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0001087'
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0000910'
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0000911'
! CellType records from source (cl, 2024-05-15) are already in the database!
   → pass `update=True` to update the records
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0000919'
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0000795'
! CellType records from source (cl, 2024-05-15) are already in the database!
   → pass `update=True` to update the records
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0002057'
✓ loaded 1 CellType record matching ontology_id: 'CL:0000860'
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0001054'
! CellType records from source (cl, 2024-05-15) are already in the database!
   → pass `update=True` to update the records
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0002101'
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0002051'
✓ created 1 CellType record from Bionty matching ontology_id: 'CL:0000952'
! CellType records from source (cl, 2024-05-15) are already in the database!
   → pass `update=True` to update the records
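
The idea behind the search-based mapper can be sketched with the standard library alone. This is an illustration under stated assumptions, not how bionty ranks search results: `difflib.SequenceMatcher` stands in for the ontology search, `TOY_ONTOLOGY` is a three-entry excerpt, and `search_top_match` and the `synonyms` dictionary are hypothetical names.

```python
from difflib import SequenceMatcher

# Toy excerpt of an ontology: ontology ID -> standardized name.
TOY_ONTOLOGY = {
    "CL:0000451": "dendritic cell",
    "CL:0000576": "monocyte",
    "CL:0000236": "B cell",
}


def search_top_match(query):
    """Return (ontology_id, name) of the ontology entry most similar to query."""

    def score(item):
        # Case-insensitive string similarity between the query and the entry name.
        return SequenceMatcher(None, query.lower(), item[1].lower()).ratio()

    return max(TOY_ONTOLOGY.items(), key=score)


name_mapper = {}  # original name -> standardized name
synonyms = {}     # ontology ID -> original names recorded as synonyms
for name in ["Dendritic cells", "CD14+ Monocytes"]:
    ontology_id, standard_name = search_top_match(name)
    name_mapper[name] = standard_name
    synonyms.setdefault(ontology_id, []).append(name)

# Apply the mapper to a column of labels, analogous to obs.cell_type.map(name_mapper).
labels = ["CD14+ Monocytes", "Dendritic cells", "CD14+ Monocytes"]
standardized = [name_mapper[label] for label in labels]
print(standardized)
```

A top match is only the best-scoring candidate and is not guaranteed to be the right term, so it is worth inspecting the mapper before applying it to the data.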

We can now standardize cell type names using the search-based mapper:

adata_validated.obs.cell_type = adata_validated.obs.cell_type.map(name_mapper)

Now, all cell types are validated:

curate.validate()
✓ var_index is validated against Gene.ensembl_gene_id
✓ cell_type is validated against CellType.name
True

Register

artifact = curate.save_artifact(description="10x reference adata")
• path content will be copied to default storage upon `save()` with key `None` ('.lamindb/H37yo7578CVfxO0N0000.h5ad')
✓ storing artifact 'H37yo7578CVfxO0N0000' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/H37yo7578CVfxO0N0000.h5ad'
• parsing feature names of X stored in slot 'var'
749 unique terms (100.00%) are validated for ensembl_gene_id
✓    linked: FeatureSet(uid='Dd3x3S0eZZs5nz4hpO1v', n=749, dtype='float', registry='bionty.Gene', hash='o70Gw1y_TnH190ggJ4FwgA', created_by_id=1, run_id=2)
• parsing feature names of slot 'obs'
1 unique term (25.00%) is validated for name
!    3 unique terms (75.00%) are not validated for name: n_genes, percent_mito, louvain
✓    linked: FeatureSet(uid='9ELvSzcin2rB6kOwumgr', n=1, registry='Feature', hash='uO4AVlfU0I_16uduyDlI3Q', created_by_id=1, run_id=2)
✓ saved 2 feature sets for slots: 'var','obs'
artifact.view_lineage()
[Figure: lineage graph rendered by artifact.view_lineage()]

Append the dataset to the collection

Query the previous collection:

collection_v1 = ln.Collection.get(name="My versioned scRNA-seq collection")

Create a new version of the collection by sharding it across the new artifact and the artifact underlying version 1 of the collection:

collection_v2 = ln.Collection(
    [artifact, collection_v1.artifacts.all()[0]],
    revises=collection_v1,
).save()
• adding collection ids [1] as inputs for run 2, adding parent transform 1
• adding collection ids [1] as inputs for run 2, adding parent transform 1
• adding artifact ids [1] as inputs for run 2, adding parent transform 1
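
The relationship between the two versions can be pictured with a toy model. This is a conceptual sketch only, not lamindb's data model: `ToyCollection` is a hypothetical class, the artifact names are placeholders, and it assumes that creating a revision marks the revised version as no longer the latest.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyCollection:
    """Hypothetical stand-in for a versioned collection of artifacts."""

    name: str
    artifacts: list
    revises: Optional["ToyCollection"] = None
    version: Optional[str] = None
    is_latest: bool = True

    def __post_init__(self):
        # Assumption: a new revision supersedes the version it revises.
        if self.revises is not None:
            self.revises.is_latest = False


v1 = ToyCollection("My versioned scRNA-seq collection", ["artifact_v1"])

# Version 2 holds the new artifact plus the artifact underlying version 1.
v2 = ToyCollection(v1.name, ["artifact_new", v1.artifacts[0]], revises=v1)

print(v1.is_latest, v2.is_latest, len(v2.artifacts))
```

Both versions remain available in this model; only the flag that marks the latest version moves to the new one.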

If you want, you can label the collection’s version by setting .version:

collection_v2.version = "2"
collection_v2.save()
Collection(uid='HaDja6kov8DMbCsm0001', version='2', is_latest=True, name='My versioned scRNA-seq collection', hash='nrEhls7ejeCFKPtKSr09Wg', visibility=1, created_by_id=1, transform_id=2, run_id=2, created_at=2024-10-11 09:33:17 UTC)

Version 2 of the collection covers significantly more conditions.

collection_v2.describe()
Collection(uid='HaDja6kov8DMbCsm0001', version='2', is_latest=True, name='My versioned scRNA-seq collection', hash='nrEhls7ejeCFKPtKSr09Wg', visibility=1, created_at=2024-10-11 09:33:17 UTC)
  Provenance
    .created_by = 'testuser1'
    .transform = 'Standardize and append a batch of data'
    .run = 2024-10-11 09:32:59 UTC

View data lineage:

collection_v2.view_lineage()
[Figure: lineage graph rendered by collection_v2.view_lineage()]