Curate AnnData based on the CELLxGENE schema
This guide shows how to curate an AnnData object with the help of laminlabs/cellxgene
against the CELLxGENE schema v5.1.0.
Load your instance where you want to register the curated AnnData object:
# !pip install 'lamindb[bionty,jupyter]' cellxgene-lamin
# cellxgene-schema has pinned dependencies. Therefore we recommend installing it into a separate environment using `uv` or `pipx`
# uv tool install cellxgene-schema==5.1.1
!lamin init --storage ./test-cellxgene-curate --name test-cellxgene-curate --schema bionty
→ connected lamindb: testuser1/test-cellxgene-curate
import lamindb as ln
import cellxgene_lamin as cxg
→ connected lamindb: testuser1/test-cellxgene-curate
Let’s start with an AnnData object that we’d like to inspect and curate. We write it to disk so that we can run CZI’s cellxgene-schema CLI tool, which verifies whether an on-disk h5ad dataset adheres to the CELLxGENE schema.
adata = cxg.datasets.anndata_human_immune_cells()
adata.write_h5ad("anndata_human_immune_cells.h5ad")
adata
AnnData object with n_obs × n_vars = 1626 × 36503
obs: 'donor', 'tissue', 'cell_type', 'assay', 'sex_ontology_term_id', 'organism', 'sex'
var: 'feature_is_filtered'
uns: 'default_embedding'
obsm: 'X_umap'
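Initially, the dataset does not pass CZI’s cellxgene-schema validator, so we need to curate it. Note that in the run captured below, the CLI itself aborts with an ImportError before any validation takes place.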
!MPLBACKEND=agg uvx cellxgene-schema validate anndata_human_immune_cells.h5ad
Loading dependencies
Loading validator modules
Traceback (most recent call last):
File "/home/runner/.local/share/uv/tools/cellxgene-schema/bin/cellxgene-schema", line 8, in <module>
sys.exit(schema_cli())
^^^^^^^^^^^^
File "/home/runner/.local/share/uv/tools/cellxgene-schema/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runner/.local/share/uv/tools/cellxgene-schema/lib/python3.11/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/home/runner/.local/share/uv/tools/cellxgene-schema/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runner/.local/share/uv/tools/cellxgene-schema/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runner/.local/share/uv/tools/cellxgene-schema/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runner/.local/share/uv/tools/cellxgene-schema/lib/python3.11/site-packages/cellxgene_schema/cli.py", line 45, in schema_validate
from .validate import validate
File "/home/runner/.local/share/uv/tools/cellxgene-schema/lib/python3.11/site-packages/cellxgene_schema/validate.py", line 14, in <module>
from anndata._core.sparse_dataset import SparseDataset
ImportError: cannot import name 'SparseDataset' from 'anndata._core.sparse_dataset' (/home/runner/.local/share/uv/tools/cellxgene-schema/lib/python3.11/site-packages/anndata/_core/sparse_dataset.py)
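The traceback above ends in an ImportError rather than in a list of schema violations, which suggests the CLI crashed before it validated anything. When calling the tool from a script, it can help to tell the two cases apart. The following is a minimal sketch, assuming a crash is recognizable by Python's standard traceback header in the captured output; `classify_cli_run` is a hypothetical helper and not part of cellxgene-schema.

```python
def classify_cli_run(returncode: int, output: str) -> str:
    """Classify a captured CLI run as 'passed', 'crashed', or 'failed'.

    Assumption: a crash inside the tool prints Python's standard
    traceback header, whereas a genuine validation failure does not.
    """
    if returncode == 0:
        return "passed"
    if "Traceback (most recent call last):" in output:
        return "crashed"
    return "failed"


# Abbreviated copy of the output captured above; the exit code 1 is an
# assumption (it is what Python uses for an uncaught exception).
captured = (
    "Loading dependencies\n"
    "Loading validator modules\n"
    "Traceback (most recent call last):\n"
    "ImportError: cannot import name 'SparseDataset'\n"
)
status = classify_cli_run(1, captured)
print(status)  # crashed
```

A "crashed" result points to an environment problem (here, an incompatible anndata version in the tool's environment) rather than to a problem with the dataset.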
Validate and curate metadata
We create a Curator object that references the AnnData object.
During instantiation, any Feature records are saved.
curator = cxg.Curator(adata, organism="human", schema_version="5.1.0")
✓ added 1 record with Feature.name for columns: 'sex_ontology_term_id'
✓ added 4 records from laminlabs/cellxgene with Feature.name for columns: 'assay', 'cell_type', 'tissue', 'organism'
✓ added 1 record from laminlabs/cellxgene with Feature.name for columns: 'sex'
The donor column is named “donor”, but the schema expects “donor_id”. Let’s rename it:
adata.obs.rename(columns={"donor": "donor_id"}, inplace=True)
validated = curator.validate()
✗ missing required obs columns development_stage, disease, self_reported_ethnicity, suspension_type, tissue_type
• consider initializing a Curate object like 'Curate(adata, defaults=cxg.CellxGeneFields.OBS_FIELD_DEFAULTS)'to automatically add these columns with default values.
For the missing columns, we can pass the default values suggested by CELLxGENE, which are then added to the AnnData object automatically:
cxg.CellxGeneFields.OBS_FIELD_DEFAULTS
{'organism': 'unknown',
'assay': 'unknown',
'cell_type': 'unknown',
'development_stage': 'unknown',
'disease': 'normal',
'donor_id': 'unknown',
'self_reported_ethnicity': 'unknown',
'sex': 'unknown',
'suspension_type': 'cell',
'tissue_type': 'tissue'}
curator = cxg.Curator(adata, defaults=cxg.CellxGeneFields.OBS_FIELD_DEFAULTS, organism="human", schema_version="5.1.0")
→ added defaults to the AnnData object: {'development_stage': 'unknown', 'disease': 'normal', 'self_reported_ethnicity': 'unknown', 'suspension_type': 'cell', 'tissue_type': 'tissue'}
✓ added 6 records from laminlabs/cellxgene with Feature.name for columns: 'development_stage', 'disease', 'donor_id', 'self_reported_ethnicity', 'tissue_type', 'suspension_type'
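The default-filling step reported above can be pictured as "add every default column that is not yet present, and leave existing columns alone". The sketch below illustrates that idea with a plain dict standing in for `adata.obs`; `add_missing_defaults` is a hypothetical helper, and how cellxgene-lamin implements this internally is not shown in this guide.

```python
def add_missing_defaults(obs, defaults):
    """Add each default column that `obs` lacks and return what was added.

    `obs` maps column names to per-cell value lists (a stand-in for
    `adata.obs`). Existing columns are never overwritten.
    """
    n_cells = len(next(iter(obs.values()))) if obs else 0
    added = {}
    for column, default in defaults.items():
        if column not in obs:
            obs[column] = [default] * n_cells
            added[column] = default
    return added


# Copied from the OBS_FIELD_DEFAULTS output shown above
defaults = {
    "organism": "unknown",
    "assay": "unknown",
    "cell_type": "unknown",
    "development_stage": "unknown",
    "disease": "normal",
    "donor_id": "unknown",
    "self_reported_ethnicity": "unknown",
    "sex": "unknown",
    "suspension_type": "cell",
    "tissue_type": "tissue",
}

# Two illustrative cells with the obs columns of the example dataset
obs = {
    "donor_id": ["D496-1", "621B-1"],
    "tissue": ["blood", "thoracic lymph node"],
    "cell_type": ["classical monocyte", "T follicular helper cell"],
    "assay": ["10x 3' v3", "10x 5' v2"],
    "sex_ontology_term_id": ["PATO:0000384", "PATO:0000384"],
    "organism": ["human", "human"],
    "sex": ["unknown", "unknown"],
}

added = add_missing_defaults(obs, defaults)
print(added)
```

With these inputs, `added` contains the same five columns that the log line above reports.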
validated = curator.validate()
validated
→ validating metadata using registries of instance laminlabs/cellxgene
• saving validated records of 'var_index'
✓ added 36390 records from public with Gene.ensembl_gene_id for var_index: 'ENSG00000243485', 'ENSG00000237613', 'ENSG00000186092', 'ENSG00000238009', 'ENSG00000239945', 'ENSG00000239906', 'ENSG00000241860', 'ENSG00000241599', 'ENSG00000286448', 'ENSG00000236601', 'ENSG00000284733', 'ENSG00000235146', 'ENSG00000284662', 'ENSG00000229905', 'ENSG00000237491', 'ENSG00000177757', 'ENSG00000228794', 'ENSG00000225880', 'ENSG00000230368', 'ENSG00000272438', ...
• saving validated records of 'assay'
• saving validated records of 'cell_type'
✓ added 1 record from laminlabs/cellxgene with DevelopmentalStage.name for development_stage: 'unknown'
✓ added 1 record from laminlabs/cellxgene with Disease.name for disease: 'normal'
✓ added 1 record from laminlabs/cellxgene with Ethnicity.name for self_reported_ethnicity: 'unknown'
✓ added 1 record from laminlabs/cellxgene with Phenotype.ontology_id for sex_ontology_term_id: 'PATO:0000384'
✓ added 1 record from laminlabs/cellxgene with ULabel.name for suspension_type: 'cell'
• saving validated records of 'tissue'
✓ added 16 records from public with Tissue.name for tissue: 'sigmoid colon', 'duodenum', 'transverse colon', 'liver', 'jejunal epithelium', 'mesenteric lymph node', 'bone marrow', 'caecum', 'skeletal muscle tissue', 'thymus', 'spleen', 'lamina propria', 'thoracic lymph node', 'omentum', 'blood', 'ileum'
✓ added 1 record from laminlabs/cellxgene with ULabel.name for tissue_type: 'tissue'
• mapping var_index on Gene.ensembl_gene_id
! 113 terms are not validated: 'ENSG00000269933', 'ENSG00000261737', 'ENSG00000259834', 'ENSG00000256374', 'ENSG00000263464', 'ENSG00000203812', 'ENSG00000272196', 'ENSG00000272880', 'ENSG00000270188', 'ENSG00000287116', ...
→ fix typos, remove non-existent values, or save terms via .add_new_from_var_index()
✓ 'assay' is validated against ExperimentalFactor.name
✓ 'cell_type' is validated against CellType.name
✓ 'development_stage' is validated against DevelopmentalStage.name
✓ 'disease' is validated against Disease.name
• mapping donor_id on ULabel.name
! 12 terms are not validated: 'D496-1', '621B-1', 'A29-1', 'A36-1', 'A35-1', '637C-1', 'A52-1', 'A37-1', 'D503-1', '640C-1', ...
→ fix typos, remove non-existent values, or save terms via .add_new_from('donor_id')
✓ 'self_reported_ethnicity' is validated against Ethnicity.name
✓ 'sex_ontology_term_id' is validated against Phenotype.ontology_id
✓ 'suspension_type' is validated against ULabel.name
• mapping tissue on Tissue.name
! 1 term is not validated: 'lungg'
→ fix typo, remove non-existent value, or save term via .add_new_from('tissue')
✓ 'tissue_type' is validated against ULabel.name
✓ 'organism' is validated against Organism.name
False
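Conceptually, each "not validated" message above comes from comparing the values of a column against the names known to a registry. The following is a minimal sketch of that comparison using plain Python sets; the registry list is a small hypothetical subset, whereas the real curator checks against the registries of laminlabs/cellxgene, and `find_non_validated` is not a cellxgene-lamin function.

```python
def find_non_validated(values, registry_names):
    """Return the distinct values absent from the registry, in first-seen order."""
    known = set(registry_names)
    seen = set()
    missing = []
    for value in values:
        if value not in known and value not in seen:
            seen.add(value)
            missing.append(value)
    return missing


# Hypothetical subset of tissue names, for illustration only
tissue_registry = ["blood", "lung", "spleen", "thymus", "liver"]
tissue_values = ["blood", "lungg", "spleen", "lungg", "liver"]

non_validated = find_non_validated(tissue_values, tissue_registry)
print(non_validated)  # ['lungg']
```

A column counts as validated only when this list is empty, which is why a single typo such as 'lungg' makes the overall result False.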
Remove unvalidated values
We remove all unvalidated genes. These genes may exist in a different Ensembl release but are not valid for the Ensembl version used by CELLxGENE schema 5.1.0 (Ensembl release 110).
adata = adata[:, ~adata.var.index.isin(curator.non_validated["var_index"])].copy()
if adata.raw is not None:
    raw_data = adata.raw.to_adata()
    raw_data = raw_data[
        :, ~raw_data.var_names.isin(curator.non_validated["var_index"])
    ].copy()
    adata.raw = raw_data
# We must create the Curator object again to ensure that it references the filtered AnnData object
curator = cxg.Curator(adata, organism="human", schema_version="5.1.0")
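The filtering above relies on a boolean mask over the gene index, so that the expression matrix and the gene annotations stay aligned. The sketch below shows the same idea on plain lists; the tiny matrix is made up for illustration, and the gene IDs are taken from the validation log above.

```python
# Five genes: two of them appear among the non-validated IDs in the log above
var_names = [
    "ENSG00000243485",
    "ENSG00000269933",
    "ENSG00000237613",
    "ENSG00000261737",
    "ENSG00000186092",
]
non_validated = {"ENSG00000269933", "ENSG00000261737"}

# Made-up expression matrix: 2 cells x 5 genes
X = [
    [1, 0, 2, 0, 3],
    [0, 4, 0, 5, 0],
]

# Boolean mask over genes, the list equivalent of `~index.isin(...)`
keep = [name not in non_validated for name in var_names]

# Apply the same mask to the gene names and to every row of the matrix
var_kept = [name for name, k in zip(var_names, keep) if k]
X_kept = [[value for value, k in zip(row, keep) if k] for row in X]

print(var_kept)
print(X_kept)  # [[1, 2, 3], [0, 0, 0]]

# The counts reported in this guide are consistent with such a filter:
# 36503 genes initially, 113 not validated, 36390 remaining.
assert 36503 - 113 == 36390
```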
Register new metadata labels
Following the suggestions above, we register the labels that aren’t present in the current instance.
(Note that our instance is rather empty. Once you have filled up the registries, registering new labels will rarely be needed.)
For donors, we register the new labels:
curator.add_new_from("donor_id")
✓ added 12 records with ULabel.name for donor_id: 'D496-1', 'A52-1', 'D503-1', '637C-1', 'A36-1', '640C-1', 'A29-1', '621B-1', '582C-1', 'A31-1', 'A35-1', 'A37-1'
The validator also flagged the tissue label “lungg”, which is a typo for “lung”. Let’s fix it:
tissues = curator.lookup().tissue
tissues.lung
Tissue(uid='7Tt4iEKc', name='lung', ontology_id='UBERON:0002048', synonyms='pulmo', description='Respiration Organ That Develops As An Outpocketing Of The Esophagus.', created_by_id=1, source_id=47, created_at=2023-11-28 22:50:53 UTC)
adata.obs["tissue"] = adata.obs["tissue"].cat.rename_categories(
    {"lungg": tissues.lung.name}
)
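When the intended term is less obvious than in the “lungg” case, a fuzzy match against the known names can propose a candidate. The following is a sketch using the standard library's `difflib`; it is not a cellxgene-lamin feature, `suggest_term` is a hypothetical helper, and the list of known tissues is a small illustrative subset.

```python
import difflib


def suggest_term(value, known_names, cutoff=0.8):
    """Return the closest known name for `value`, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(value, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Illustrative subset of tissue names that appear in this guide
known_tissues = ["blood", "lung", "liver", "spleen", "thymus", "ileum"]

suggestion = suggest_term("lungg", known_tissues)
print(suggestion)  # lung
```

A suggestion like this should still be reviewed by a person before renaming categories, since a close string match is not necessarily the biologically correct term.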
Let’s validate the object again:
validated = curator.validate()
validated
→ validating metadata using registries of instance laminlabs/cellxgene
• saving validated records of 'tissue'
✓ 'var_index' is validated against Gene.ensembl_gene_id
✓ 'assay' is validated against ExperimentalFactor.name
✓ 'cell_type' is validated against CellType.name
✓ 'development_stage' is validated against DevelopmentalStage.name
✓ 'disease' is validated against Disease.name
✓ 'donor_id' is validated against ULabel.name
✓ 'self_reported_ethnicity' is validated against Ethnicity.name
✓ 'sex_ontology_term_id' is validated against Phenotype.ontology_id
✓ 'suspension_type' is validated against ULabel.name
✓ 'tissue' is validated against Tissue.name
✓ 'tissue_type' is validated against ULabel.name
✓ 'organism' is validated against Organism.name
True
adata.obs.head()
| | donor_id | tissue | cell_type | assay | sex_ontology_term_id | organism | sex | development_stage | disease | self_reported_ethnicity | suspension_type | tissue_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CZINY-0109_CTGGTCTAGTCTGTAC | D496-1 | blood | classical monocyte | 10x 3' v3 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
| CZI-IA10244332+CZI-IA10244434_CCTTCGACATACTCTT | 621B-1 | thoracic lymph node | T follicular helper cell | 10x 5' v2 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
| Pan_T7935491_CTGGTCTGTACATGTC | A29-1 | spleen | memory B cell | 10x 5' v1 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
| Pan_T7980367_GGGCATCCAGGTGGAT | A36-1 | lung | alveolar macrophage | 10x 5' v1 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
| Pan_T7935494_ATCATGGTCTACCTGC | A29-1 | mesenteric lymph node | naive thymus-derived CD4-positive, alpha-beta ... | 10x 5' v1 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
Save artifact
artifact = curator.save_artifact(description=f"dataset curated against cellxgene schema {curator.schema_version}")
! no run & transform got linked, call `ln.track()` & re-run
! run input wasn't tracked, call `ln.track()` and re-run
artifact.describe()
Artifact(uid='WHdY7U8HcKDIREte0000', is_latest=True, description='dataset curated against cellxgene schema 5.1.0', suffix='.h5ad', type='dataset', size=54670616, hash='VYhEnkViOhtD-7kN2odUGw', n_observations=1626, _hash_type='sha1-fl', _accessor='AnnData', visibility=1, _key_is_virtual=True, created_at=2024-11-21 05:39:49 UTC)
Provenance
.storage = '/home/runner/work/cellxgene-lamin/cellxgene-lamin/docs/test-cellxgene-curate'
.created_by = 'testuser1'
Labels
.organisms = 'human'
.tissues = 'sigmoid colon', 'duodenum', 'transverse colon', 'liver', 'jejunal epithelium', 'mesenteric lymph node', 'bone marrow', 'caecum', 'skeletal muscle tissue', 'thymus', ...
.cell_types = 'megakaryocyte', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'non-classical monocyte', 'plasmacytoid dendritic cell', 'germinal center B cell', 'conventional dendritic cell', 'group 3 innate lymphoid cell', 'memory B cell', ...
.diseases = 'normal'
.phenotypes = 'male'
.experimental_factors = '10x 5' v2', '10x 5' v1', '10x 3' v3'
.developmental_stages = 'unknown'
.ethnicities = 'unknown'
.ulabels = 'cell', 'tissue', 'D496-1', 'A52-1', 'D503-1', '637C-1', 'A36-1', '640C-1', 'A29-1', '621B-1', ...
Features
'assay' = '10x 3' v3', '10x 5' v1', '10x 5' v2'
'cell_type' = 'CD16-negative, CD56-bright natural killer cell, human', 'CD16-positive, CD56-dim natural killer cell, human', 'CD4-positive helper T cell', 'CD8-positive, alpha-beta memory T cell', 'CD8-positive, alpha-beta memory T cell, CD45RO-positive', 'T follicular helper cell', 'alpha-beta T cell', 'alveolar macrophage', 'classical monocyte', 'conventional dendritic cell', ...
'development_stage' = 'unknown'
'disease' = 'normal'
'donor_id' = '582C-1', '621B-1', '637C-1', '640C-1', 'A29-1', 'A31-1', 'A35-1', 'A36-1', 'A37-1', 'A52-1', ...
'organism' = 'human'
'self_reported_ethnicity' = 'unknown'
'sex_ontology_term_id' = 'male'
'suspension_type' = 'cell'
'tissue' = 'blood', 'bone marrow', 'caecum', 'duodenum', 'ileum', 'jejunal epithelium', 'lamina propria', 'liver', 'lung', 'mesenteric lymph node', ...
'tissue_type' = 'tissue'
Feature sets
'var' = 'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'OR4F29', 'OR4F16', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C'
'obs' = 'assay', 'cell_type', 'tissue', 'organism', 'sex_ontology_term_id', 'sex', 'development_stage', 'disease', 'donor_id', 'self_reported_ethnicity', 'tissue_type', 'suspension_type'
The following step is optional: it mimics the way CELLxGENE groups AnnData objects into collections in order to link them to studies.
# register a new collection
title = "Cross-tissue immune cell analysis reveals tissue-specific features in humans (for test demo only)"
collection = ln.Collection(
    [artifact],  # registered artifact above, can also pass a list of artifacts
    name=title,  # title of the publication
    description="10.1126/science.abl5197",  # DOI of the publication
    reference="E-MTAB-11536",  # accession number (e.g. GSE#, E-MTAB#, etc.)
    reference_type="ArrayExpress",  # source type (e.g. GEO, ArrayExpress, SRA, etc.)
).save()
! no run & transform got linked, call `ln.track()` & re-run
! run input wasn't tracked, call `ln.track()` and re-run
Return an input h5ad file for cellxgene-schema
adata_cxg = curator.to_cellxgene_anndata(is_primary_data=True, title=title)
adata_cxg
AnnData object with n_obs × n_vars = 1626 × 36390
obs: 'donor_id', 'sex_ontology_term_id', 'suspension_type', 'tissue_type', 'tissue_ontology_term_id', 'cell_type_ontology_term_id', 'assay_ontology_term_id', 'organism_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'is_primary_data'
var: 'feature_is_filtered'
uns: 'default_embedding', 'title', 'cxg_lamin_schema_reference', 'cxg_lamin_schema_version'
obsm: 'X_umap'
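Comparing the obs columns above with the earlier ones suggests that name columns such as tissue or cell_type are replaced by columns named `<field>_ontology_term_id`. How `to_cellxgene_anndata` does this internally is not shown in this guide; the sketch below only illustrates the idea with a plain dict standing in for `adata.obs`. The helper `to_ontology_term_ids` is hypothetical, and the lookup holds only the one name-to-ID pair that appears in the Tissue record printed earlier.

```python
def to_ontology_term_ids(obs, field, name_to_id):
    """Replace the name column `field` with `<field>_ontology_term_id`.

    Raises KeyError if a name has no known ontology ID, so that
    unmapped labels are never silently dropped.
    """
    names = obs.pop(field)
    obs[f"{field}_ontology_term_id"] = [name_to_id[name] for name in names]
    return obs


# Name-to-ID pair taken from the Tissue record for 'lung' shown earlier
tissue_ids = {"lung": "UBERON:0002048"}

# One illustrative cell (donor A36-1, tissue lung)
obs = {"donor_id": ["A36-1"], "tissue": ["lung"]}

to_ontology_term_ids(obs, "tissue", tissue_ids)
print(obs)
```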
adata_cxg.write_h5ad("anndata_human_immune_cells_cxg.h5ad")
!MPLBACKEND=agg uvx cellxgene-schema validate anndata_human_immune_cells_cxg.h5ad
Loading dependencies
Loading validator modules
Traceback (most recent call last):
File "/home/runner/.local/share/uv/tools/cellxgene-schema/bin/cellxgene-schema", line 8, in <module>
sys.exit(schema_cli())
^^^^^^^^^^^^
File "/home/runner/.local/share/uv/tools/cellxgene-schema/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runner/.local/share/uv/tools/cellxgene-schema/lib/python3.11/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/home/runner/.local/share/uv/tools/cellxgene-schema/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runner/.local/share/uv/tools/cellxgene-schema/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runner/.local/share/uv/tools/cellxgene-schema/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runner/.local/share/uv/tools/cellxgene-schema/lib/python3.11/site-packages/cellxgene_schema/cli.py", line 45, in schema_validate
from .validate import validate
File "/home/runner/.local/share/uv/tools/cellxgene-schema/lib/python3.11/site-packages/cellxgene_schema/validate.py", line 14, in <module>
from anndata._core.sparse_dataset import SparseDataset
ImportError: cannot import name 'SparseDataset' from 'anndata._core.sparse_dataset' (/home/runner/.local/share/uv/tools/cellxgene-schema/lib/python3.11/site-packages/anndata/_core/sparse_dataset.py)
Note
The Curator class is designed to validate all metadata for adherence to ontologies. It does not reimplement all rules of the CELLxGENE schema, so we recommend additionally running the cellxgene-schema validator if full adherence beyond metadata is required.