Curate AnnData based on the CELLxGENE schema¶
This guide shows how to curate an AnnData object against the CELLxGENE schema v5.1.0 with the help of laminlabs/cellxgene.
Load your instance where you want to register the curated AnnData object:
# pip install 'lamindb[bionty,jupyter]' cellxgene-lamin
# cellxgene-schema has pinned dependencies. Therefore we recommend installing it into a separate environment using `uv` or `pipx`
# uv tool install cellxgene-schema==5.1.0
!lamin init --storage ./test-cellxgene-curate --modules bionty
→ initialized lamindb: testuser1/test-cellxgene-curate
import lamindb as ln
import bionty as bt
def get_semi_curated_dataset():
    adata = ln.core.datasets.anndata_human_immune_cells()
    adata.obs["sex_ontology_term_id"] = "PATO:0000384"
    adata.obs["organism"] = "human"
    adata.obs["sex"] = "unknown"
    # create some typos in the metadata
    adata.obs["tissue"] = adata.obs["tissue"].cat.rename_categories({"lung": "lungg"})
    # new donor ids
    adata.obs["donor"] = adata.obs["donor"].astype(str) + "-1"
    # drop animal cell
    adata = adata[adata.obs["cell_type"] != "animal cell", :]
    # remove columns that are reserved in the cellxgene schema
    adata.var.drop(columns=["feature_reference", "feature_biotype"], inplace=True)
    adata.raw.var.drop(
        columns=["feature_name", "feature_reference", "feature_biotype"], inplace=True
    )
    return adata
→ connected lamindb: testuser1/test-cellxgene-curate
Let’s start with an AnnData object that we’d like to inspect and curate. We write it to disk so that we can run CZI’s cellxgene-schema CLI tool, which verifies whether an on-disk h5ad dataset adheres to the CELLxGENE schema.
adata = get_semi_curated_dataset()
adata.write_h5ad("anndata_human_immune_cells.h5ad")
adata
AnnData object with n_obs × n_vars = 1626 × 36503
obs: 'donor', 'tissue', 'cell_type', 'assay', 'sex_ontology_term_id', 'organism', 'sex'
var: 'feature_is_filtered'
uns: 'default_embedding'
obsm: 'X_umap'
Initially, the cellxgene-schema validator of CZI does not pass and we need to curate the dataset.
!MPLBACKEND=agg uvx cellxgene-schema validate anndata_human_immune_cells.h5ad || exit 1
Loading dependencies
Loading validator modules
Starting validation...
Unable to open 'anndata_human_immune_cells.h5ad' with AnnData
Validate and curate metadata¶
We create a Curate object that references the AnnData object.
During instantiation, any lamindb.Feature records are saved.
curator = ln.curators.CellxGeneAnnDataCatManager(
    adata, organism="human", schema_version="5.1.0"
)
→ Creating control labels in the CellxGene schema.
! record with similar name exists! did you mean to load it?
id | uid | name | is_type | description | reference | reference_type | space_id | type_id | run_id | created_at | created_by_id | _aux | _branch_code |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | lOxH289J | TissueType | True | From CellxGene schema. Is "tissue", "organoid"... | None | None | 1 | None | None | 2025-03-20 07:49:51.457000+00:00 | 1 | None | 1 |
! record with similar name exists! did you mean to load it?
id | uid | name | is_type | description | reference | reference_type | space_id | type_id | run_id | created_at | created_by_id | _aux | _branch_code |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | r8IcV6Hh | cell culture | False | From CellxGene schema. | None | None | 1 | 1 | None | 2025-03-20 07:49:51.483000+00:00 | 1 | None | 1 |
✓ added 5 records with Feature.name for "columns": 'assay', 'cell_type', 'sex_ontology_term_id', 'tissue', 'organism'
validated = curator.validate()
✗ missing required obs columns 'development_stage', 'disease', 'donor_id', 'self_reported_ethnicity', 'suspension_type', 'tissue_type'
→ consider initializing a Curate object with `defaults=cxg.CellxGeneAnnDataCatManager.cxg_categoricals_defaults` to automatically add these columns with default values
Let’s fix the “donor_id” column name:
adata.obs.rename(columns={"donor": "donor_id"}, inplace=True)
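The "missing required obs columns" message boils down to a comparison between the columns present in adata.obs and the columns the schema requires. A minimal sketch in plain Python, assuming the required list consists only of the six names from the validator message above (the full schema defines more). REQUIRED_OBS_COLUMNS and missing_columns are hypothetical names for illustration, not part of lamindb:

```python
# Columns of adata.obs after renaming "donor" to "donor_id".
obs_columns = [
    "donor_id", "tissue", "cell_type", "assay",
    "sex_ontology_term_id", "organism", "sex",
]

# Assumption: only the six names from the validator message, not the full schema.
REQUIRED_OBS_COLUMNS = [
    "development_stage", "disease", "donor_id",
    "self_reported_ethnicity", "suspension_type", "tissue_type",
]


def missing_columns(present, required):
    """Return the required column names that are absent, keeping their order."""
    present_set = set(present)
    return [name for name in required if name not in present_set]


still_missing = missing_columns(obs_columns, REQUIRED_OBS_COLUMNS)
print(still_missing)
```

After the rename, five columns are still missing; these are the ones that the default values fill in.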
For the missing columns, we can pass the default values suggested by CELLxGENE; the curator then adds these columns to the AnnData object automatically:
ln.curators.CellxGeneAnnDataCatManager.cxg_categoricals_defaults
{'cell_type': 'unknown',
'development_stage': 'unknown',
'disease': 'normal',
'donor_id': 'unknown',
'self_reported_ethnicity': 'unknown',
'sex': 'unknown',
'suspension_type': 'cell',
'tissue_type': 'tissue'}
Note
CELLxGENE requires the columns tissue, organism, and assay to have existing values from the ontologies.
Therefore, these columns need to be added and populated manually.
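What passing defaults does to the obs table can be sketched without AnnData. The example below models obs as a plain dict mapping column names to value lists and fills only the columns that are absent. apply_defaults is a hypothetical helper, not the lamindb implementation; the default values are copied from the output above.

```python
CXG_DEFAULTS = {
    "cell_type": "unknown",
    "development_stage": "unknown",
    "disease": "normal",
    "donor_id": "unknown",
    "self_reported_ethnicity": "unknown",
    "sex": "unknown",
    "suspension_type": "cell",
    "tissue_type": "tissue",
}


def apply_defaults(obs, defaults, n_obs):
    """Add each missing column filled with its default; return what was added."""
    added = {}
    for column, value in defaults.items():
        if column not in obs:  # existing columns are never overwritten
            obs[column] = [value] * n_obs
            added[column] = value
    return added


n_obs = 3  # toy size; the dataset in this guide has 1626 observations
existing = [
    "donor_id", "tissue", "cell_type", "assay",
    "sex_ontology_term_id", "organism", "sex",
]
obs = {name: ["placeholder"] * n_obs for name in existing}

added = apply_defaults(obs, CXG_DEFAULTS, n_obs)
for column, value in added.items():
    print(f"added default value {value!r} to obs[{column!r}]")
```

Columns that already exist, such as cell_type, donor_id, and sex, keep their values; only the five absent columns receive a default.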
curator = ln.curators.CellxGeneAnnDataCatManager(
    adata,
    defaults=ln.curators.CellxGeneAnnDataCatManager.cxg_categoricals_defaults,
    organism="human",
    schema_version="5.1.0",
)
→ added default value 'unknown' to the adata.obs['development_stage']
→ added default value 'normal' to the adata.obs['disease']
→ added default value 'unknown' to the adata.obs['self_reported_ethnicity']
→ added default value 'cell' to the adata.obs['suspension_type']
→ added default value 'tissue' to the adata.obs['tissue_type']
✓ added 5 records with Feature.name for "columns": 'development_stage', 'disease', 'self_reported_ethnicity', 'suspension_type', 'tissue_type'
validated = curator.validate()
validated
✓ created 1 Organism record from Bionty matching name: 'human'
• saving validated records of 'var_index'
✓ added 36390 records from public with Gene.ensembl_gene_id for "var_index": 'ENSG00000243485', 'ENSG00000237613', 'ENSG00000186092', 'ENSG00000238009', 'ENSG00000239945', 'ENSG00000239906', 'ENSG00000241860', 'ENSG00000241599', 'ENSG00000286448', 'ENSG00000236601', 'ENSG00000284733', 'ENSG00000235146', 'ENSG00000284662', 'ENSG00000229905', 'ENSG00000237491', 'ENSG00000177757', 'ENSG00000228794', 'ENSG00000225880', 'ENSG00000230368', 'ENSG00000272438', ...
• mapping "var_index" on Gene.ensembl_gene_id
! 113 terms are not validated: 'ENSG00000269933', 'ENSG00000261737', 'ENSG00000259834', 'ENSG00000256374', 'ENSG00000263464', 'ENSG00000203812', 'ENSG00000272196', 'ENSG00000272880', 'ENSG00000270188', 'ENSG00000287116', 'ENSG00000237133', 'ENSG00000224739', 'ENSG00000227902', 'ENSG00000239467', 'ENSG00000272551', 'ENSG00000280374', 'ENSG00000236886', 'ENSG00000229352', 'ENSG00000286601', 'ENSG00000227021', ...
→ fix typos, remove non-existent values, or save terms via .add_new_from_var_index()
• saving validated records of 'assay'
✓ added 3 records from public with ExperimentalFactor.name for "assay": '10x 5' v2', '10x 3' v3', '10x 5' v1'
• saving validated records of 'cell_type'
✓ added 31 records from public with CellType.name for "cell_type": 'germinal center B cell', 'CD16-positive, CD56-dim natural killer cell, human', 'conventional dendritic cell', 'plasmacytoid dendritic cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'T follicular helper cell', 'naive B cell', 'group 3 innate lymphoid cell', 'CD4-positive helper T cell', 'progenitor cell', 'classical monocyte', 'memory B cell', 'alpha-beta T cell', 'alveolar macrophage', 'gamma-delta T cell', 'megakaryocyte', 'effector memory CD4-positive, alpha-beta T cell', 'macrophage', 'non-classical monocyte', 'regulatory T cell', ...
• saving validated records of 'sex_ontology_term_id'
✓ added 1 record from public with Phenotype.ontology_id for "sex_ontology_term_id": 'PATO:0000384'
• saving validated records of 'tissue'
✓ added 16 records from public with Tissue.name for "tissue": 'sigmoid colon', 'caecum', 'jejunal epithelium', 'omentum', 'thymus', 'bone marrow', 'spleen', 'skeletal muscle tissue', 'lamina propria', 'ileum', 'liver', 'transverse colon', 'duodenum', 'thoracic lymph node', 'mesenteric lymph node', 'blood'
✓ "assay" is validated against ExperimentalFactor.name
✓ "cell_type" is validated against CellType.name
✓ "development_stage" is validated against DevelopmentalStage.name
✓ "disease" is validated against Disease.name
✓ "self_reported_ethnicity" is validated against Ethnicity.name
✓ "sex_ontology_term_id" is validated against Phenotype.ontology_id
✓ "suspension_type" is validated against ULabel.name
• mapping "tissue" on Tissue.name
! 1 term is not validated: 'lungg'
→ fix typos, remove non-existent values, or save terms via .add_new_from("tissue")
✓ "tissue_type" is validated against ULabel.name
✓ "organism" is validated against Organism.name
False
Remove unvalidated values¶
We remove all unvalidated genes. These genes may exist in a different Ensembl release but are not valid for the Ensembl version used by CELLxGENE schema 5.1.0 (Ensembl release 110).
curator.non_validated
{'tissue': ['lungg'],
'var_index': ['ENSG00000269933',
'ENSG00000261737',
'ENSG00000259834',
'ENSG00000256374',
'ENSG00000263464',
'ENSG00000203812',
'ENSG00000272196',
'ENSG00000272880',
'ENSG00000270188',
'ENSG00000287116',
'ENSG00000237133',
'ENSG00000224739',
'ENSG00000227902',
'ENSG00000239467',
'ENSG00000272551',
'ENSG00000280374',
'ENSG00000236886',
'ENSG00000229352',
'ENSG00000286601',
'ENSG00000227021',
'ENSG00000259855',
'ENSG00000273301',
'ENSG00000271870',
'ENSG00000237838',
'ENSG00000286996',
'ENSG00000269028',
'ENSG00000286699',
'ENSG00000273370',
'ENSG00000261490',
'ENSG00000272567',
'ENSG00000270394',
'ENSG00000272370',
'ENSG00000272354',
'ENSG00000251044',
'ENSG00000272040',
'ENSG00000182230',
'ENSG00000204092',
'ENSG00000261068',
'ENSG00000236740',
'ENSG00000236996',
'ENSG00000232295',
'ENSG00000271734',
'ENSG00000236673',
'ENSG00000227220',
'ENSG00000236166',
'ENSG00000112096',
'ENSG00000285162',
'ENSG00000286228',
'ENSG00000237513',
'ENSG00000285106',
'ENSG00000226380',
'ENSG00000270672',
'ENSG00000225932',
'ENSG00000244693',
'ENSG00000268955',
'ENSG00000272267',
'ENSG00000253878',
'ENSG00000259820',
'ENSG00000226403',
'ENSG00000233776',
'ENSG00000269900',
'ENSG00000261534',
'ENSG00000237548',
'ENSG00000239665',
'ENSG00000256892',
'ENSG00000249860',
'ENSG00000271409',
'ENSG00000224745',
'ENSG00000261438',
'ENSG00000231575',
'ENSG00000260461',
'ENSG00000255823',
'ENSG00000254740',
'ENSG00000254561',
'ENSG00000282080',
'ENSG00000256427',
'ENSG00000287388',
'ENSG00000276814',
'ENSG00000280710',
'ENSG00000215271',
'ENSG00000258414',
'ENSG00000258808',
'ENSG00000277050',
'ENSG00000273888',
'ENSG00000258861',
'ENSG00000259444',
'ENSG00000244952',
'ENSG00000273923',
'ENSG00000262668',
'ENSG00000232196',
'ENSG00000256618',
'ENSG00000221995',
'ENSG00000226377',
'ENSG00000273576',
'ENSG00000267637',
'ENSG00000282965',
'ENSG00000273837',
'ENSG00000286949',
'ENSG00000256222',
'ENSG00000280095',
'ENSG00000278927',
'ENSG00000278955',
'ENSG00000277352',
'ENSG00000239446',
'ENSG00000256045',
'ENSG00000228906',
'ENSG00000228139',
'ENSG00000261773',
'ENSG00000278198',
'ENSG00000273496',
'ENSG00000277666',
'ENSG00000278782',
'ENSG00000277761']}
adata = adata[:, ~adata.var.index.isin(curator.non_validated["var_index"])].copy()
if adata.raw is not None:
    raw_data = adata.raw.to_adata()
    raw_data = raw_data[
        :, ~raw_data.var_names.isin(curator.non_validated["var_index"])
    ].copy()
    adata.raw = raw_data
curator = ln.curators.CellxGeneAnnDataCatManager(
    adata, organism="human", schema_version="5.1.0"
)
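The filtering step keeps every gene whose Ensembl ID is not in the non-validated list. The same logic in plain Python, using four IDs taken from the validation output above (keep_validated is a hypothetical helper name used only for this sketch):

```python
def keep_validated(var_index, non_validated):
    """Return the IDs from var_index that are not flagged as non-validated."""
    flagged = set(non_validated)  # set membership keeps the filter linear
    return [gene_id for gene_id in var_index if gene_id not in flagged]


var_index = [
    "ENSG00000243485",  # validated
    "ENSG00000269933",  # not validated
    "ENSG00000237613",  # validated
    "ENSG00000261737",  # not validated
]
non_validated = ["ENSG00000269933", "ENSG00000261737"]

kept = keep_validated(var_index, non_validated)
print(kept)

# Sanity check on the counts reported in this guide:
# 36503 genes before filtering, 113 of them not validated.
n_expected = 36503 - 113
print(n_expected)
```

Comparing adata.n_vars with the expected count after filtering is a cheap way to confirm that exactly the flagged genes were dropped.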
Register new metadata labels¶
Following the suggestions above, we register genes and labels that aren’t yet present in the current instance.
(Note that our instance is rather empty. Once you have filled up the registries, registering new labels will rarely be needed.)
The tissue label “lungg” fails validation because it is a typo of “lung”. Let’s fix it:
tissues = curator.lookup(public=True).tissue
tissues.lung
Tissue(ontology_id='UBERON:0002048', name='lung', definition='Respiration Organ That Develops As An Outpocketing Of The Esophagus.', synonyms='pulmo', parents=array(['UBERON:0015212', 'UBERON:0004119', 'UBERON:0005178',
'UBERON:0000171'], dtype=object))
adata.obs["tissue"] = adata.obs["tissue"].cat.rename_categories(
    {"lungg": tissues.lung.name}
)
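For less obvious typos, a fuzzy match against known tissue names can propose a candidate before you look it up. The sketch below uses difflib from the Python standard library; it is not how lamindb resolves terms. suggest_term is a hypothetical helper, and the candidate names are a handful copied from the validation output above plus "lung".

```python
import difflib

# Assumption: a small hand-picked subset of public tissue names.
KNOWN_TISSUES = ["lung", "liver", "spleen", "blood", "thymus", "ileum", "caecum"]


def suggest_term(value, known, cutoff=0.8):
    """Return the closest known term, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(value, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


suggestion = suggest_term("lungg", KNOWN_TISSUES)
print(suggestion)

# Apply the correction to a toy list of tissue labels.
tissues = ["blood", "lungg", "spleen", "lungg"]
fixed = [suggestion if label == "lungg" else label for label in tissues]
print(fixed)
```

A suggestion like this is only a hint; confirming the term through the public lookup, as done above, remains the authoritative step.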
Let’s validate the object again:
validated = curator.validate()
validated
✓ "var_index" is validated against Gene.ensembl_gene_id
• saving validated records of 'tissue'
✓ added 1 record from public with Tissue.name for "tissue": 'lung'
✓ "assay" is validated against ExperimentalFactor.name
✓ "cell_type" is validated against CellType.name
✓ "development_stage" is validated against DevelopmentalStage.name
✓ "disease" is validated against Disease.name
✓ "self_reported_ethnicity" is validated against Ethnicity.name
✓ "sex_ontology_term_id" is validated against Phenotype.ontology_id
✓ "suspension_type" is validated against ULabel.name
✓ "tissue" is validated against Tissue.name
✓ "tissue_type" is validated against ULabel.name
✓ "organism" is validated against Organism.name
True
adata.obs.head()
index | donor_id | tissue | cell_type | assay | sex_ontology_term_id | organism | sex | development_stage | disease | self_reported_ethnicity | suspension_type | tissue_type |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CZINY-0109_CTGGTCTAGTCTGTAC | D496-1 | blood | classical monocyte | 10x 3' v3 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
CZI-IA10244332+CZI-IA10244434_CCTTCGACATACTCTT | 621B-1 | thoracic lymph node | T follicular helper cell | 10x 5' v2 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
Pan_T7935491_CTGGTCTGTACATGTC | A29-1 | spleen | memory B cell | 10x 5' v1 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
Pan_T7980367_GGGCATCCAGGTGGAT | A36-1 | lung | alveolar macrophage | 10x 5' v1 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
Pan_T7935494_ATCATGGTCTACCTGC | A29-1 | mesenteric lymph node | naive thymus-derived CD4-positive, alpha-beta ... | 10x 5' v1 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
Save artifact¶
artifact = curator.save_artifact(
    key=f"my_datasets/dataset-curated-against-cxg-{curator.schema_version}.h5ad"
)
! no run & transform got linked, call `ln.track()` & re-run
• key has more than one suffix (path.suffixes), using only last suffix: '.h5ad' - if you want your composite suffix to be recognized add it to lamindb.core.storage.VALID_SIMPLE_SUFFIXES.add()
• path content will be copied to default storage upon `save()` with key 'my_datasets/dataset-curated-against-cxg-5.1.0.h5ad'
✓ storing artifact 'vAkneHT8B0r5dPFI0000' at '/home/runner/work/cellxgene-lamin/cellxgene-lamin/docs/test-cellxgene-curate/.lamindb/vAkneHT8B0r5dPFI0000.h5ad'
! run input wasn't tracked, call `ln.track()` and re-run
✓ 36390 unique terms (100.00%) are validated for ensembl_gene_id
✓ 10 unique terms (83.30%) are validated for name
! 2 unique terms (16.70%) are not validated for name: 'donor_id', 'sex'
✓ loaded 10 Feature records matching name: 'tissue', 'cell_type', 'assay', 'sex_ontology_term_id', 'organism', 'development_stage', 'disease', 'self_reported_ethnicity', 'suspension_type', 'tissue_type'
! did not create Feature records for 2 non-validated names: 'donor_id', 'sex'
✓ saved 2 feature sets for slots: 'var','obs'
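The percentages in the log follow from comparing the 12 obs column names against the 10 registered Feature names. A short sketch that reproduces the arithmetic in plain Python, with names copied from the output above; it mirrors the reported numbers, not the library's internal code:

```python
obs_columns = [
    "donor_id", "tissue", "cell_type", "assay", "sex_ontology_term_id",
    "organism", "sex", "development_stage", "disease",
    "self_reported_ethnicity", "suspension_type", "tissue_type",
]

# Feature names that exist in the registry, as listed in the log.
registered_features = {
    "tissue", "cell_type", "assay", "sex_ontology_term_id", "organism",
    "development_stage", "disease", "self_reported_ethnicity",
    "suspension_type", "tissue_type",
}

validated = [name for name in obs_columns if name in registered_features]
not_validated = [name for name in obs_columns if name not in registered_features]

pct_validated = round(100 * len(validated) / len(obs_columns), 1)
pct_not_validated = round(100 * len(not_validated) / len(obs_columns), 1)

print(len(validated), pct_validated)
print(not_validated, pct_not_validated)
```

The two unmatched names, donor_id and sex, are exactly the columns for which no Feature record was created.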
artifact.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'vAkneHT8B0r5dPFI0000'
│   ├── .key = 'my_datasets/dataset-curated-against-cxg-5.1.0.h5ad'
│   ├── .size = 54670616
│   ├── .hash = 'VYhEnkViOhtD-7kN2odUGw'
│   ├── .n_observations = 1626
│   ├── .path = /home/runner/work/cellxgene-lamin/cellxgene-lamin/docs/test-cellxgene-curate/.lamindb/vAkneHT8B0r5dPFI0000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2025-03-20 07:50:33
├── Dataset features/.feature_sets
│   ├── var • 36390 [bionty.Gene]
│   │   MIR1302-2HG               float
│   │   FAM138A                   float
│   │   OR4F5                     float
│   │   OR4F29                    float
│   │   OR4F16                    float
│   │   LINC01409                 float
│   │   FAM87B                    float
│   │   LINC01128                 float
│   │   LINC00115                 float
│   │   FAM41C                    float
│   └── obs • 10 [Feature]
│       assay                     cat[bionty.ExperimentalF…   10x 3' v3, 10x 5' v1, 10x 5' v2
│       cell_type                 cat[bionty.CellType]        CD16-negative, CD56-bright natural kille…
│       development_stage         cat[bionty.Developmental…   unknown
│       disease                   cat[bionty.Disease]         normal
│       organism                  cat[bionty.Organism]        human
│       self_reported_ethnicity   cat[bionty.Ethnicity]       unknown
│       sex_ontology_term_id      cat[bionty.Phenotype]       male
│       suspension_type           cat[ULabel]                 cell
│       tissue                    cat[bionty.Tissue]          blood, bone marrow, caecum, duodenum, il…
│       tissue_type               cat[ULabel]                 tissue
└── Labels
    └── .organisms                bionty.Organism             human
        .tissues                  bionty.Tissue               sigmoid colon, caecum, jejunal epitheliu…
        .cell_types               bionty.CellType             germinal center B cell, CD16-positive, C…
        .diseases                 bionty.Disease              normal
        .phenotypes               bionty.Phenotype            male
        .experimental_factors     bionty.ExperimentalFactor   10x 5' v2, 10x 3' v3, 10x 5' v1
        .developmental_stages     bionty.DevelopmentalStage   unknown
        .ethnicities              bionty.Ethnicity            unknown
        .ulabels                  ULabel                      tissue, cell
Return an input h5ad file for cellxgene-schema¶
title = "Cross-tissue immune cell analysis reveals tissue-specific features in humans (for test demo only)"
adata_cxg = curator.to_cellxgene_anndata(is_primary_data=True, title=title)
adata_cxg
AnnData object with n_obs × n_vars = 1626 × 36390
obs: 'donor_id', 'sex_ontology_term_id', 'suspension_type', 'tissue_type', 'tissue_ontology_term_id', 'cell_type_ontology_term_id', 'assay_ontology_term_id', 'organism_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'is_primary_data'
var: 'feature_is_filtered'
uns: 'default_embedding', 'title', 'cxg_lamin_schema_reference', 'cxg_lamin_schema_version'
obsm: 'X_umap'
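The exported object carries *_ontology_term_id columns such as tissue_ontology_term_id in place of name columns such as tissue. The idea can be sketched as a name-to-ID lookup. to_ontology_term_ids is a hypothetical helper, and the lookup table holds only "lung" (ID taken from the lookup output earlier in this guide) and "blood" (UBERON:0000178, added here as an assumption for illustration); in practice the mapping comes from the registries.

```python
# Assumption: a two-entry stand-in for the Tissue registry.
TISSUE_IDS = {
    "lung": "UBERON:0002048",
    "blood": "UBERON:0000178",
}


def to_ontology_term_ids(names, lookup):
    """Translate names to ontology IDs; fail loudly on anything unmapped."""
    unmapped = sorted({name for name in names if name not in lookup})
    if unmapped:
        raise KeyError(f"no ontology ID for: {unmapped}")
    return [lookup[name] for name in names]


tissue_ontology_term_id = to_ontology_term_ids(["blood", "lung", "blood"], TISSUE_IDS)
print(tissue_ontology_term_id)
```

Raising on unmapped names reflects why validation has to pass first: a typo such as "lungg" has no ontology ID to translate to.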
adata_cxg.write_h5ad("anndata_human_immune_cells_cxg.h5ad")
!MPLBACKEND=agg uvx cellxgene-schema validate anndata_human_immune_cells_cxg.h5ad || exit 1
Loading dependencies
Loading validator modules
Starting validation...
Unable to open 'anndata_human_immune_cells_cxg.h5ad' with AnnData
Note
The Curate class is designed to validate all metadata for adherence to ontologies. It does not reimplement every rule of the CELLxGENE schema, so we recommend also running the cellxgene-schema validator if you need full adherence beyond metadata.