Curate AnnData based on the CELLxGENE schema
This guide shows how to curate an AnnData object against the latest CELLxGENE schema.
Summary
To ingest validated and annotated datasets that adhere to a CELLxGENE schema, run
!cellxgene-schema --version # should print latest version
!cellxgene-schema validate small_cxg_curated.h5ad # validation
using a shell, and then
schema = ln.examples.cellxgene.create_cellxgene_schema()
ln.Artifact("…", schema=schema).save() # annotation (re-validates ontologies, but not some other details)
# pip install lamindb pronto
# cellxgene-schema has pinned dependencies. Therefore we recommend installing it into a separate environment using `uv` or `pipx`
# uv tool install cellxgene-schema==5.2.3
!lamin init --storage ./test-cellxgene-curate --modules bionty
! using anonymous user (to identify, call: lamin login)
→ initialized lamindb: anonymous/test-cellxgene-curate
import lamindb as ln
import bionty as bt
import re
ln.track()
→ connected lamindb: anonymous/test-cellxgene-curate
→ created Transform('6e7TWd66Ql0i0000', key='cellxgene-curate.ipynb'), started new Run('ZtABrNfazRVHkka7') at <django.db.models.expressions.DatabaseDefault object at 0x7fc4d67cf260>
→ notebook imports: bionty==2.3.1 lamindb-core==2.5.1
• recommendation: to identify the notebook across renames, pass the uid: ln.track("6e7TWd66Ql0i")
The CELLxGENE schema
As a first step, we create the CELLxGENE schema. Creating it also adds any missing sources to the instance:
cxg_schema = ln.examples.cellxgene.create_cellxgene_schema()
cxg_schema.describe()
Schema: CELLxGENE AnnData of ontology_id
├── uid: 1zh3bbVZanBIrBEJ  run: ZtABrNf (cellxgene-curate.ipynb)
│   itype: None  otype: AnnData
│   hash: JQx88Qmaty5OtoqMzYVnrQ  ordered_set: False
│   maximal_set: False  minimal_set: True
│   branch: main  space: all
│   created_at: 2026-06-04 11:26:56 UTC  created_by: anonymous
├── var: var of CELLxGENE
│   ├── uid: i2kVNumbCmavZcyO  run: ZtABrNf (cellxgene-curate.ipynb)
│   │   itype: Feature  otype: None
│   │   hash: smp1Myp5o2dinPxpwzV_yQ  ordered_set: False
│   │   maximal_set: False  minimal_set: True
│   │   branch: main  space: all
│   │   created_at: 2026-06-04 11:26:56 UTC  created_by: anonymous
│   └── Features (2)
│       └── name  dtype  optio…  nulla…  coe…  default_va…
│           var_index  bionty.Gene.ensembl_gene_id[source__uid='2w43l1Y…  ✗  ✓  ✓  unset
│           feature_is_filte…  bool  ✗  ✓  ✓  unset
├── obs: obs of CELLxGENE of ontology_id
│   ├── uid: b07Ss1o4ARBgKCab  run: ZtABrNf (cellxgene-curate.ipynb)
│   │   itype: Feature  otype: DataFrame
│   │   hash: yH7S_ThvrvVAIQaaW22rTA  ordered_set: False
│   │   maximal_set: False  minimal_set: True
│   │   branch: main  space: all
│   │   created_at: 2026-06-04 11:26:56 UTC  created_by: anonymous
│   └── Features (11)
│       └── name  dtype  …  …  defau…
│           assay_ontology_term_id  bionty.ExperimentalFactor.ontology_id  ✗  ✓  unset
│           cell_type_ontology_term_id  bionty.CellType.ontology_id  ✗  ✓  unset
│           development_stage_ontology_term_…  bionty.DevelopmentalStage.ontology_id[source__uid='7J…  ✗  ✓  unset
│           disease_ontology_term_id  bionty.Disease.ontology_id  ✗  ✓  unset
│           self_reported_ethnicity_ontology…  bionty.Ethnicity.ontology_id  ✗  ✓  unset
│           sex_ontology_term_id  bionty.Phenotype.ontology_id  ✗  ✓  unset
│           tissue_ontology_term_id  bionty.Tissue.ontology_id|bionty.CellType.ontology_id  ✗  ✓  unset
│           donor_id  str  ✗  ✓  unkno…
│           is_primary_data  ULabel  ✗  ✓  unset
│           suspension_type  ULabel  ✗  ✓  unset
│           tissue_type  ULabel  ✗  ✓  unset
└── uns: uns of CELLxGENE version
    ├── uid: XScK0C2lydmfkCV7  run: ZtABrNf (cellxgene-curate.ipynb)
    │   itype: Feature  otype: DataFrame
    │   hash: FuoqSv5mtnyQ2eZMXNtRHQ  ordered_set: False
    │   maximal_set: False  minimal_set: True
    │   branch: main  space: all
    │   created_at: 2026-06-04 11:26:56 UTC  created_by: anonymous
    └── Features (1)
        └── name  dtype  optional  nullable  coerce  default_value
            organism_ontology_term_id  bionty.Organism.ontology_id  ✗  ✓  ✓  unset
The schema has three components, one per AnnData slot (var, obs, and uns):
cxg_schema.slots["var"].describe()
Schema: var of CELLxGENE
├── uid: i2kVNumbCmavZcyO  run: ZtABrNf (cellxgene-curate.ipynb)
│   itype: Feature  otype: None
│   hash: smp1Myp5o2dinPxpwzV_yQ  ordered_set: False
│   maximal_set: False  minimal_set: True
│   branch: main  space: all
│   created_at: 2026-06-04 11:26:56 UTC  created_by: anonymous
└── Features (2)
    └── name  dtype  optio…  nullab…  coe…  default_val…
        var_index  bionty.Gene.ensembl_gene_id[source__uid='2w43l1YS…  ✗  ✓  ✓  unset
        feature_is_filter…  bool  ✗  ✓  ✓  unset
cxg_schema.slots["obs"].describe()
Schema: obs of CELLxGENE of ontology_id
├── uid: b07Ss1o4ARBgKCab  run: ZtABrNf (cellxgene-curate.ipynb)
│   itype: Feature  otype: DataFrame
│   hash: yH7S_ThvrvVAIQaaW22rTA  ordered_set: False
│   maximal_set: False  minimal_set: True
│   branch: main  space: all
│   created_at: 2026-06-04 11:26:56 UTC  created_by: anonymous
└── Features (11)
    └── name  dtype  o…  …  defau…
        assay_ontology_term_id  bionty.ExperimentalFactor.ontology_id  ✗  ✓  unset
        cell_type_ontology_term_id  bionty.CellType.ontology_id  ✗  ✓  unset
        development_stage_ontology_term_id  bionty.DevelopmentalStage.ontology_id[source__uid='7J…  ✗  ✓  unset
        disease_ontology_term_id  bionty.Disease.ontology_id  ✗  ✓  unset
        self_reported_ethnicity_ontology_…  bionty.Ethnicity.ontology_id  ✗  ✓  unset
        sex_ontology_term_id  bionty.Phenotype.ontology_id  ✗  ✓  unset
        tissue_ontology_term_id  bionty.Tissue.ontology_id|bionty.CellType.ontology_id  ✗  ✓  unset
        donor_id  str  ✗  ✓  unkno…
        is_primary_data  ULabel  ✗  ✓  unset
        suspension_type  ULabel  ✗  ✓  unset
        tissue_type  ULabel  ✗  ✓  unset
cxg_schema.slots["uns"].describe()
Schema: uns of CELLxGENE version
├── uid: XScK0C2lydmfkCV7  run: ZtABrNf (cellxgene-curate.ipynb)
│   itype: Feature  otype: DataFrame
│   hash: FuoqSv5mtnyQ2eZMXNtRHQ  ordered_set: False
│   maximal_set: False  minimal_set: True
│   branch: main  space: all
│   created_at: 2026-06-04 11:26:56 UTC  created_by: anonymous
└── Features (1)
    └── name  dtype  optional  nullable  coerce  default_value
        organism_ontology_term_id  bionty.Organism.ontology_id  ✗  ✓  ✓  unset
In the following, we will validate a dataset against the CELLxGENE schema and curate it.
Validate and curate metadata
Let’s start with an AnnData object that we would like to curate. We write it to disk so that we can run CZI’s cellxgene-schema CLI tool, which verifies whether an on-disk h5ad dataset adheres to all CELLxGENE requirements, including the CELLxGENE schema.
adata = ln.examples.datasets.small_dataset3_cellxgene(
    with_obs_typo=True, with_var_typo=True
)
adata.uns["organism_ontology_term_id"] = adata.obs["organism_ontology_term_id"].iloc[0]
adata.obs = adata.obs.drop(columns=["organism_ontology_term_id"])
adata.write_h5ad("small_cxg.h5ad")
adata
AnnData object with n_obs × n_vars = 3 × 3
obs: 'disease_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'self_reported_ethnicity', 'donor_id', 'is_primary_data', 'suspension_type', 'tissue_type'
var: 'feature_is_filtered'
uns: 'title', 'organism_ontology_term_id'
obsm: 'X_pca'
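Comparing these obs columns with the 11 obs features of the schema already hints at what needs curating. The following is a minimal, library-free sketch of that comparison. Both lists are copied by hand from the outputs shown in this guide, and the variable names are illustrative only.

```python
# Column names copied from the `obs` slot of the schema described above.
required_obs_columns = [
    "assay_ontology_term_id",
    "cell_type_ontology_term_id",
    "development_stage_ontology_term_id",
    "disease_ontology_term_id",
    "self_reported_ethnicity_ontology_term_id",
    "sex_ontology_term_id",
    "tissue_ontology_term_id",
    "donor_id",
    "is_primary_data",
    "suspension_type",
    "tissue_type",
]

# Column names copied from the AnnData summary printed above.
present_obs_columns = [
    "disease_ontology_term_id",
    "development_stage_ontology_term_id",
    "sex_ontology_term_id",
    "tissue_ontology_term_id",
    "cell_type",
    "self_reported_ethnicity",
    "donor_id",
    "is_primary_data",
    "suspension_type",
    "tissue_type",
]

required = set(required_obs_columns)
present = set(present_obs_columns)

# Columns the schema expects but the dataset lacks.
missing = [c for c in required_obs_columns if c not in present]
# Columns the dataset has but the schema does not list.
extra = [c for c in present_obs_columns if c not in required]

print("missing:", missing)
print("not in schema:", extra)
```

The two name-based columns that are not part of the schema are the natural candidates for translation into the missing ontology ID columns.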
Initially, CZI’s cellxgene-schema validator does not pass, so we need to curate the dataset.
!MPLBACKEND=agg uvx cellxgene-schema validate small_cxg.h5ad
INFO:cellxgene_schema:Loading dependencies
INFO:cellxgene_schema:Loading validator modules
INFO:cellxgene_schema.validate:Starting validation...
/home/runner/.local/share/uv/tools/cellxgene-schema/lib/python3.12/site-packages/cellxgene_schema/validate.py:693: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
is_filtered = column[i]
WARNING:cellxgene_schema.validate:WARNING: Dataframe 'var' only has 3 rows. Features SHOULD NOT be filtered from expression matrix.
WARNING:cellxgene_schema.validate:WARNING: Data contains assay(s) that are not represented in the 'suspension_type' schema definition table. Ensure you have selected the most appropriate value for the assay(s) between 'cell', 'nucleus', and 'na'. Please contact [email protected] during submission so that the assay(s) can be added to the schema definition document.
WARNING:cellxgene_schema.validate:WARNING: Validation of raw layer was not performed due to current errors, try again after fixing current errors.
ERROR:cellxgene_schema.validate:ERROR: Add labels error: Column 'cell_type' is a reserved column name of 'obs'. Remove it from h5ad and try again.
ERROR:cellxgene_schema.validate:ERROR: Add labels error: Column 'self_reported_ethnicity' is a reserved column name of 'obs'. Remove it from h5ad and try again.
ERROR:cellxgene_schema.validate:ERROR: Could not infer organism from feature ID 'invalid_ensembl_id' in 'var', make sure it is a valid ID.
ERROR:cellxgene_schema.validate:ERROR: Could not infer organism from feature ID 'invalid_ensembl_id' in 'raw.var', make sure it is a valid ID.
ERROR:cellxgene_schema.validate:ERROR: Dataframe 'obs' is missing column 'assay_ontology_term_id'.
ERROR:cellxgene_schema.validate:ERROR: 'UBERON:0002048XXX' in 'tissue_ontology_term_id' is not a valid ontology term id of 'UBERON, ZFA, FBbt, WBbt'. When 'tissue_type' is 'tissue', 'tissue_ontology_term_id' must be a valid UBERON, ZFA, FBbt, or WBbt term.
ERROR:cellxgene_schema.validate:ERROR: Dataframe 'obs' is missing column 'self_reported_ethnicity_ontology_term_id'.
INFO:cellxgene_schema.validate:Validation complete in 0:00:11.583047 with status is_valid=False
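When the validator output is long, it can help to separate errors from warnings. The sketch below assumes the log format shown above (lines prefixed with ERROR: or WARNING:, and a final line containing is_valid=...). The function name summarize_validation_log is hypothetical and not part of cellxgene-schema.

```python
import re


def summarize_validation_log(log_text: str) -> dict:
    """Collect errors, warnings, and the final status from validator output."""
    errors, warnings = [], []
    is_valid = None
    for line in log_text.splitlines():
        if line.startswith("ERROR:"):
            # Keep only the human-readable message after the second "ERROR: ".
            errors.append(line.split("ERROR: ", 1)[-1])
        elif line.startswith("WARNING:"):
            warnings.append(line.split("WARNING: ", 1)[-1])
        elif match := re.search(r"is_valid=(True|False)", line):
            is_valid = match.group(1) == "True"
    return {"errors": errors, "warnings": warnings, "is_valid": is_valid}


# A few lines copied from the output above.
sample_log = """\
INFO:cellxgene_schema.validate:Starting validation...
WARNING:cellxgene_schema.validate:WARNING: Dataframe 'var' only has 3 rows. Features SHOULD NOT be filtered from expression matrix.
ERROR:cellxgene_schema.validate:ERROR: Dataframe 'obs' is missing column 'assay_ontology_term_id'.
INFO:cellxgene_schema.validate:Validation complete in 0:00:11.583047 with status is_valid=False
"""

summary = summarize_validation_log(sample_log)
print(summary["is_valid"], len(summary["errors"]), len(summary["warnings"]))
```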
CELLxGENE requires all observations to be annotated.
If information for a specific column like disease_ontology_term_id is not available, CELLxGENE requires falling back to default values such as “normal” or “unknown”.
Let’s save these defaults to the instance using lamindb.examples.cellxgene.save_cellxgene_defaults():
ln.examples.cellxgene.save_cellxgene_defaults()
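Conceptually, falling back to a default means replacing every missing annotation with an agreed placeholder. The library-free sketch below illustrates the idea on plain dictionaries. The DEFAULTS mapping, its values, and the fill_defaults helper are assumptions made for illustration, not the records that save_cellxgene_defaults() creates.

```python
# Hypothetical defaults for illustration only; consult the CELLxGENE schema
# for the value each column actually requires.
DEFAULTS = {
    "disease_ontology_term_id": "normal",
    "development_stage_ontology_term_id": "unknown",
}


def fill_defaults(rows: list[dict], defaults: dict[str, str]) -> list[dict]:
    """Return copies of the rows where absent or None values use the default."""
    filled = []
    for row in rows:
        new_row = dict(row)  # do not mutate the caller's data
        for column, default in defaults.items():
            if new_row.get(column) is None:
                new_row[column] = default
        filled.append(new_row)
    return filled


rows = [
    {"disease_ontology_term_id": "MONDO:0004975", "development_stage_ontology_term_id": None},
    {"disease_ontology_term_id": None},
]
filled = fill_defaults(rows, DEFAULTS)
print(filled)
```

On a pandas DataFrame such as adata.obs, the same idea is usually expressed with fillna on the affected column.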
Now we can start curating the dataset:
curator = ln.curators.AnnDataCurator(adata, cxg_schema)
try:
    curator.validate()
except ln.errors.ValidationError:
    pass
The error shows invalid genes are present in the dataset.
Let’s remove them from both the adata and adata.raw objects:
adata = adata[
    :, ~adata.var.index.isin(curator.slots["var"].cat.non_validated["index"])
].copy()
if adata.raw is not None:
    raw_data = adata.raw.to_adata()
    raw_data = raw_data[
        :, ~raw_data.var.index.isin(curator.slots["var"].cat.non_validated["index"])
    ].copy()
    adata.raw = raw_data
As we’ve subsetted the AnnData object, we have to recreate the AnnDataCurator to validate again:
curator = ln.curators.AnnDataCurator(adata, cxg_schema)
try:
    curator.validate()
except ln.errors.ValidationError as e:
    print(e)
{
"SCHEMA": {
"COLUMN_NOT_IN_DATAFRAME": [
{
"schema": null,
"column": "assay_ontology_term_id",
"check": "column_in_dataframe",
"error": "column 'assay_ontology_term_id' not in dataframe. Columns in dataframe: ['disease_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'self_reported_ethnicity', 'donor_id', 'is_primary_data', 'suspension_type', 'tissue_type']"
},
{
"schema": null,
"column": "cell_type_ontology_term_id",
"check": "column_in_dataframe",
"error": "column 'cell_type_ontology_term_id' not in dataframe. Columns in dataframe: ['disease_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'self_reported_ethnicity', 'donor_id', 'is_primary_data', 'suspension_type', 'tissue_type']"
},
{
"schema": null,
"column": "self_reported_ethnicity_ontology_term_id",
"check": "column_in_dataframe",
"error": "column 'self_reported_ethnicity_ontology_term_id' not in dataframe. Columns in dataframe: ['disease_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'self_reported_ethnicity', 'donor_id', 'is_primary_data', 'suspension_type', 'tissue_type']"
}
]
}
}
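The printed error is structured, so the affected columns can be extracted programmatically. The sketch below assumes that the error text is valid JSON with the layout shown above. The payload is abbreviated by hand (the long "error" messages are left out) and the variable names are illustrative.

```python
import json

# Abbreviated copy of the payload printed above; "error" messages are omitted.
error_text = """
{
  "SCHEMA": {
    "COLUMN_NOT_IN_DATAFRAME": [
      {"schema": null, "column": "assay_ontology_term_id", "check": "column_in_dataframe"},
      {"schema": null, "column": "cell_type_ontology_term_id", "check": "column_in_dataframe"},
      {"schema": null, "column": "self_reported_ethnicity_ontology_term_id", "check": "column_in_dataframe"}
    ]
  }
}
"""

report = json.loads(error_text)

# Collect the name of every column reported as absent from the dataframe.
missing_columns = [
    entry["column"] for entry in report["SCHEMA"]["COLUMN_NOT_IN_DATAFRAME"]
]
print(missing_columns)
```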
The validation error tells us that we’re missing several columns.
The reason is simple:
CELLxGENE requires all obs metadata to be stored as ontology IDs in entity_ontology_term_id columns.
Therefore, we first translate the name-based obs columns into the required format.
adata.obs
| | disease_ontology_term_id | development_stage_ontology_term_id | sex_ontology_term_id | tissue_ontology_term_id | cell_type | self_reported_ethnicity | donor_id | is_primary_data | suspension_type | tissue_type |
|---|---|---|---|---|---|---|---|---|---|---|
| barcode1 | MONDO:0004975 | unknown | PATO:0000383 | UBERON:0002048XXX | T cell | South Asian | -1 | False | cell | tissue |
| barcode2 | MONDO:0004980 | unknown | PATO:0000384 | UBERON:0002048XXX | B cell | South Asian | 1 | False | cell | tissue |
| barcode3 | MONDO:0004980 | unknown | unknown | UBERON:0000948 | B cell | South Asian | 2 | False | cell | tissue |
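Conceptually, translating a name-based column is a lookup from a label to its ontology ID. The sketch below hard-codes such a lookup using the labels from the table above and the IDs that appear in the artifact description at the end of this guide. The NAME_TO_ONTOLOGY_ID table and the translate helper are hypothetical, and the guide itself resolves names through bionty rather than a hand-written dictionary.

```python
# Hand-written lookup tables for illustration only.
NAME_TO_ONTOLOGY_ID = {
    "cell_type": {"T cell": "CL:0000084", "B cell": "CL:0000236"},
    "self_reported_ethnicity": {"South Asian": "HANCESTRO:0848"},
}


def translate(column: str, names: list[str]) -> list[str]:
    """Map each name in a column to its ontology ID, failing on unknown names."""
    lookup = NAME_TO_ONTOLOGY_ID[column]
    unknown = sorted(set(names) - set(lookup))
    if unknown:
        raise KeyError(f"no ontology ID known for {unknown} in column {column!r}")
    return [lookup[name] for name in names]


cell_type_ids = translate("cell_type", ["T cell", "B cell", "B cell"])
ethnicity_ids = translate("self_reported_ethnicity", ["South Asian"] * 3)
print(cell_type_ids, ethnicity_ids)
```

Failing loudly on unknown names avoids silently writing an invalid value into an ontology ID column.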
# Add missing assay column
adata.obs["assay_ontology_term_id"] = "EFO:0005684"
def get_source_from_feature(feature: ln.Feature) -> bt.Source | None:
    if match := re.search(r"source__uid='([^']+)'", feature.dtype_as_str):
        return bt.Source.get(uid=match.group(1))
    return None
# Add `entity_ontology_term_id` columns by translating names to ontology IDs
standardization_map = {
    "self_reported_ethnicity": (
        bt.Ethnicity,
        "self_reported_ethnicity_ontology_term_id",
    ),
    "cell_type": (bt.CellType, "cell_type_ontology_term_id"),
}
for col, (bt_class, new_col) in standardization_map.items():
    feature = cxg_schema.slots["obs"].features.filter(name=new_col).one()
    source = get_source_from_feature(feature)
    adata.obs[new_col] = bt_class.standardize(
        adata.obs[col], field="name", return_field="ontology_id", source=source
    )
# Drop the name columns because CELLxGENE disallows them
adata.obs = adata.obs.drop(columns=list(standardization_map.keys()))
! found 1 name in public source: ['South Asian']
please add corresponding Ethnicity records via: `.from_values(['South Asian'])`
! found 2 names in public source: ['T cell', 'B cell']
please add corresponding CellType records via: `.from_values(['T cell', 'B cell'])`
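The helper get_source_from_feature relies on a regular expression to detect whether a feature's dtype string pins a specific ontology source. The pattern can be tried in isolation, without a database. In the sketch below, the function name extract_source_uid and the uid "abc123XY" are made up, because the real uids are truncated in the outputs above.

```python
import re
from typing import Optional

# Same pattern as in get_source_from_feature above.
SOURCE_UID_PATTERN = re.compile(r"source__uid='([^']+)'")


def extract_source_uid(dtype_str: str) -> Optional[str]:
    """Return the pinned source uid from a dtype string, or None if absent."""
    match = SOURCE_UID_PATTERN.search(dtype_str)
    return match.group(1) if match else None


# "abc123XY" is a placeholder uid used only for this demonstration.
pinned = "bionty.DevelopmentalStage.ontology_id[source__uid='abc123XY']"
unpinned = "bionty.CellType.ontology_id"

print(extract_source_uid(pinned))
print(extract_source_uid(unpinned))
```

Features without a pinned source yield None, which is why the helper in the guide returns None in that case.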
try:
    curator.validate()
except ln.errors.ValidationError:
    pass
! ontology ID BFO:0000020 not found in DataFrame
An error is shown for the tissue label “UBERON:0002048XXX”: the trailing “XXX” is a typo.
Let’s fix it:
adata.obs["tissue_ontology_term_id"] = adata.obs[
    "tissue_ontology_term_id"
].cat.rename_categories({"UBERON:0002048XXX": "UBERON:0002048"})
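Typos of this kind can often be spotted with a simple format check before running the full validation. The sketch below uses a heuristic assumption: an ontology ID consists of a letter-only prefix, a colon, and digits. It also treats "unknown" as an allowed placeholder, as it appears in the obs table of this guide. The helper find_malformed_ids is hypothetical, and such a check cannot tell whether a well-formed ID actually exists in the ontology.

```python
import re

# Heuristic assumption: PREFIX:digits, for example UBERON:0002048.
ONTOLOGY_ID_PATTERN = re.compile(r"[A-Za-z]+:\d+")
# Placeholder values that are allowed even though they are not ontology IDs.
ALLOWED_PLACEHOLDERS = {"unknown"}


def find_malformed_ids(values):
    """Return the sorted unique values that do not look like ontology IDs."""
    return sorted(
        {
            value
            for value in values
            if value not in ALLOWED_PLACEHOLDERS
            and not ONTOLOGY_ID_PATTERN.fullmatch(value)
        }
    )


# Values copied from the obs table shown earlier in this guide.
tissue_ids = ["UBERON:0002048XXX", "UBERON:0002048XXX", "UBERON:0000948"]
sex_ids = ["PATO:0000383", "PATO:0000384", "unknown"]

print(find_malformed_ids(tissue_ids))
print(find_malformed_ids(sex_ids))
```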
Now validation should pass.
# recreate the AnnDataCurator to refresh cached categoricals
curator = ln.curators.AnnDataCurator(adata, cxg_schema)
curator.validate()
Save artifact
We can now save the curated artifact:
artifact = curator.save_artifact(key="examples/dataset-curated-against-cxg.h5ad")
→ returning schema with same hash: Schema(uid='b07Ss1o4ARBgKCab', is_type=False, name='obs of CELLxGENE of ontology_id', description=None, n_members=11, coerce=True, flexible=False, itype='Feature', otype='DataFrame', hash='yH7S_ThvrvVAIQaaW22rTA', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, created_on_id=1, space_id=1, created_by_id=1, run_id=1, type_id=None, created_at=2026-06-04 11:26:56 UTC, is_locked=False)
→ returning schema with same hash: Schema(uid='XScK0C2lydmfkCV7', is_type=False, name='uns of CELLxGENE version', description=None, n_members=1, coerce=True, flexible=False, itype='Feature', otype='DataFrame', hash='FuoqSv5mtnyQ2eZMXNtRHQ', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, created_on_id=1, space_id=1, created_by_id=1, run_id=1, type_id=None, created_at=2026-06-04 11:26:56 UTC, is_locked=False)
artifact.describe()
Artifact: examples/dataset-curated-against-cxg.h5ad (0000)
├── uid: U97fYHacIwDQGkxA0000  run: ZtABrNf (cellxgene-curate.ipynb)
│   kind: dataset  otype: AnnData
│   hash: 0OJQXQlQ3k-2FEhfHNwgIQ  size: 41.8 KB
│   branch: main  space: all
│   created_at: 2026-06-04 11:30:17 UTC  created_by: anonymous
│   n_observations: 3  schema: CELLxGENE AnnData of ontology_id
├── storage/path:
│   /home/runner/work/cellxgene-lamin/cellxgene-lamin/docs/test-cellxgene-curate/.lamindb/U97fYHacIwDQGkxA0000.h5ad
├── Dataset features
│   ├── var (2)
│   │   feature_is_filtered  bool
│   │   var_index  bionty.Gene.ensembl_gene_id[source…  ENSG00000000419, ENSG00000139618
│   ├── obs (11)
│   │   assay_ontology_term_id  bionty.ExperimentalFactor.ontology…  EFO:0005684
│   │   cell_type_ontology_term_id  bionty.CellType.ontology_id  CL:0000084, CL:0000236
│   │   development_stage_ontology_t…  bionty.DevelopmentalStage.ontology…  unknown
│   │   disease_ontology_term_id  bionty.Disease.ontology_id  MONDO:0004975, MONDO:0004980
│   │   donor_id  str
│   │   is_primary_data  ULabel
│   │   self_reported_ethnicity_onto…  bionty.Ethnicity.ontology_id  HANCESTRO:0848
│   │   sex_ontology_term_id  bionty.Phenotype.ontology_id  PATO:0000383, PATO:0000384, unknown
│   │   suspension_type  ULabel  cell
│   │   tissue_ontology_term_id  bionty.Tissue.ontology_id|bionty.C…  UBERON:0000948, UBERON:0002048
│   │   tissue_type  ULabel  tissue
│   └── uns (1)
│       organism_ontology_term_id  bionty.Organism.ontology_id  NCBITaxon:9606
└── Labels
    └── .ulabels  ULabel  tissue, cell
        .organisms  bionty.Organism  human
        .genes  bionty.Gene  DPM1, BRCA2
        .tissues  bionty.Tissue  heart, lung
        .cell_types  bionty.CellType  T cell, B cell
        .diseases  bionty.Disease  Alzheimer disease, atopic eczema
        .phenotypes  bionty.Phenotype  unknown, female, male
        .experimental_factors  bionty.ExperimentalFactor  RNA-seq of coding RNA from single cells
        .developmental_stages  bionty.DevelopmentalStage  unknown
        .ethnicities  bionty.Ethnicity  South Asian
Validating using cellxgene-schema
To validate the now curated AnnData object using CZI’s cellxgene-schema CLI tool, we need to write the AnnData object to disk.
adata.write("small_cxg_curated.h5ad")
# %%bash -e
!MPLBACKEND=agg uvx cellxgene-schema validate small_cxg_curated.h5ad
INFO:cellxgene_schema:Loading dependencies
INFO:cellxgene_schema:Loading validator modules
INFO:cellxgene_schema.validate:Starting validation...
/home/runner/.local/share/uv/tools/cellxgene-schema/lib/python3.12/site-packages/cellxgene_schema/validate.py:693: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
is_filtered = column[i]
WARNING:cellxgene_schema.validate:WARNING: Dataframe 'var' only has 2 rows. Features SHOULD NOT be filtered from expression matrix.
WARNING:cellxgene_schema.validate:WARNING: Data contains assay(s) that are not represented in the 'suspension_type' schema definition table. Ensure you have selected the most appropriate value for the assay(s) between 'cell', 'nucleus', and 'na'. Please contact [email protected] during submission so that the assay(s) can be added to the schema definition document.
INFO:cellxgene_schema.validate:Validation complete in 0:00:10.762258 with status is_valid=True
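Outside a notebook, the same check can be scripted. The sketch below mirrors the shell call used above and decides success by looking for is_valid=True in the captured output, as seen in the final log line. The helper names build_validate_command and run_validator are hypothetical, and the sketch assumes that uvx is available on the PATH.

```python
import os
import subprocess


def build_validate_command(h5ad_path: str) -> list[str]:
    """Mirror the shell call used in this guide."""
    return ["uvx", "cellxgene-schema", "validate", h5ad_path]


def run_validator(h5ad_path: str) -> bool:
    """Run the validator and return True if the log reports is_valid=True."""
    # Same matplotlib backend override as in the shell call above.
    env = {**os.environ, "MPLBACKEND": "agg"}
    result = subprocess.run(
        build_validate_command(h5ad_path),
        env=env,
        capture_output=True,
        text=True,
    )
    # Inspect both streams, since log output may go to either one.
    output = result.stdout + result.stderr
    print(output)
    return "is_valid=True" in output


command = build_validate_command("small_cxg_curated.h5ad")
print(" ".join(command))
```

Calling run_validator("small_cxg_curated.h5ad") would then return a boolean that a pipeline can act on.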
Note
The CELLxGENE schema used in this guide is designed to validate that all metadata adheres to the required ontologies. It does not reimplement every rule of CZI’s cellxgene-schema validator, so we recommend also running the cellxgene-schema CLI when full compliance beyond metadata is required.