Curate AnnData based on the CELLxGENE schema
This guide shows how to curate an AnnData object against the CELLxGENE schema v5.2.0.
Summary
To ingest validated and annotated datasets that adhere to a CELLxGENE schema, run
!cellxgene-schema --version # should print 5.2.0
!cellxgene-schema validate small_cxg_curated.h5ad # validation
in a shell, and then run in Python:
schema = ln.examples.cellxgene.get_cxg_schema(version="5.2.0")
ln.Artifact("…", schema=schema).save() # annotation (re-validates ontologies, but not some other details)
# pip install 'lamindb[bionty,jupyter]' pronto
# cellxgene-schema has pinned dependencies. Therefore we recommend installing it into a separate environment using `uv` or `pipx`
# uv tool install cellxgene-schema==5.2.3
!lamin init --storage ./test-cellxgene-curate --modules bionty
→ initialized lamindb: testuser1/test-cellxgene-curate
import lamindb as ln
import bionty as bt
ln.track()
→ connected lamindb: testuser1/test-cellxgene-curate
→ created Transform('pO0gxiTft69k0000', key='cellxgene-curate.ipynb'), started new Run('JYFlJsgSyDk30Nbg') at 2025-10-27 08:28:08 UTC
→ notebook imports: bionty==1.8.1 lamindb==1.14a1
• recommendation: to identify the notebook across renames, pass the uid: ln.track("pO0gxiTft69k")
The CELLxGENE schema
As a first step, we create the CELLxGENE schema for version 5.2.0, which also adds any missing sources to the instance:
cxg_schema = ln.examples.cellxgene.get_cxg_schema("5.2.0")
/tmp/ipykernel_3802/2806544410.py:1: DeprecationWarning: Use create_cellxgene_schema instead of get_cxg_schema, get_cxg_schema will be removed in the future.
cxg_schema = ln.examples.cellxgene.get_cxg_schema("5.2.0")
→ writing the in-memory object into cache
→ writing the in-memory object into cache
→ writing the in-memory object into cache
→ writing the in-memory object into cache
→ writing the in-memory object into cache
→ referenced read-only storage location at s3://bionty-assets, is managed by instance with uid 2WgKqzPc1eW3
cxg_schema.describe()
Schema: AnnData of CELLxGENE version 5.2.0 for human of ontology_id
├── uid: 1t7iWOosjdFwsYcX                run: JYFlJsg (cellxgene-curate.ipynb)
│   itype: Composite                     otype: AnnData
│   hash: m9RPCyeF9wiHUq6TkKzFuw         ordered_set: False
│   maximal_set: False                   minimal_set: True
│   branch: main                         space: all
│   created_at: 2025-10-27 08:28:16 UTC  created_by: testuser1
├── var: var of CELLxGENE version 5.2.0
│   ├── uid: uACnhaxsZtm1LXFW                run: JYFlJsg (cellxgene-curate.ipynb)
│   │   itype: Feature                       otype: None
│   │   hash: 3QBevwcTQ6LpiBZ95PjbPQ         ordered_set: False
│   │   maximal_set: False                   minimal_set: True
│   │   branch: main                         space: all
│   │   created_at: 2025-10-27 08:28:16 UTC  created_by: testuser1
│   └── Features (2)
│       └── name              dtype                                             opti…  null…  coerce_d…  default_v…
│           var_index         bionty.Gene.ensembl_gene_id[source__uid='5dmX95…  ✗      ✓      ✓          unset
│           feature_is_filt…  bool                                              ✗      ✓      ✓          unset
└── obs: obs of CELLxGENE version 5.2.0 for human of ontology_id
    ├── uid: 9IGAsduNztzeHcT2                run: JYFlJsg (cellxgene-curate.ipynb)
    │   itype: Feature                       otype: DataFrame
    │   hash: 9UencaGxx1vVapZbW7rfLA         ordered_set: False
    │   maximal_set: False                   minimal_set: True
    │   branch: main                         space: all
    │   created_at: 2025-10-27 08:28:16 UTC  created_by: testuser1
    └── Features (12)
        └── name                              dtype                                                  coe…  def…
            assay_ontology_term_id            bionty.ExperimentalFactor.ontology_id[source__uid='2…  ✓     uns…
            cell_type_ontology_term_id        bionty.CellType.ontology_id[source__uid='3Uw2Va7a']    ✓     uns…
            development_stage_ontology_term…  bionty.DevelopmentalStage.ontology_id[source__uid='1…  ✓     uns…
            disease_ontology_term_id          bionty.Disease.ontology_id[source__uid='4a3ejKuf']     ✓     uns…
            self_reported_ethnicity_ontolog…  bionty.Ethnicity.ontology_id[source__uid='MJRqduf9']   ✓     uns…
            sex_ontology_term_id              bionty.Phenotype.ontology_id[source__uid='3ox8Ekgl']   ✓     uns…
            tissue_ontology_term_id           bionty.Tissue.ontology_id[source__uid='MUtAGdL4']      ✓     uns…
            organism_ontology_term_id         bionty.Organism.ontology_id[source__uid='4tsksCMX']    ✓     uns…
            donor_id                          str                                                    ✓     unk…
            is_primary_data                   ULabel                                                 ✓     uns…
            suspension_type                   ULabel                                                 ✓     uns…
            tissue_type                       ULabel                                                 ✓     uns…
The schema has two components, one for the var slot and one for the obs slot:
cxg_schema.slots["var"].describe()
Schema: var of CELLxGENE version 5.2.0
├── uid: uACnhaxsZtm1LXFW                run: JYFlJsg (cellxgene-curate.ipynb)
│   itype: Feature                       otype: None
│   hash: 3QBevwcTQ6LpiBZ95PjbPQ         ordered_set: False
│   maximal_set: False                   minimal_set: True
│   branch: main                         space: all
│   created_at: 2025-10-27 08:28:16 UTC  created_by: testuser1
└── Features (2)
    └── name               dtype                                              optio…  null…  coerce_dt…  default_v…
        var_index          bionty.Gene.ensembl_gene_id[source__uid='5dmX950…  ✗       ✓      ✓           unset
        feature_is_filte…  bool                                               ✗       ✓      ✓           unset
cxg_schema.slots["obs"].describe()
Schema: obs of CELLxGENE version 5.2.0 for human of ontology_id
├── uid: 9IGAsduNztzeHcT2                run: JYFlJsg (cellxgene-curate.ipynb)
│   itype: Feature                       otype: DataFrame
│   hash: 9UencaGxx1vVapZbW7rfLA         ordered_set: False
│   maximal_set: False                   minimal_set: True
│   branch: main                         space: all
│   created_at: 2025-10-27 08:28:16 UTC  created_by: testuser1
└── Features (12)
    └── name                              dtype                                                   …  coe…  defau…
        assay_ontology_term_id            bionty.ExperimentalFactor.ontology_id[source__uid='2v…  ✓  ✓     unset
        cell_type_ontology_term_id        bionty.CellType.ontology_id[source__uid='3Uw2Va7a']     ✓  ✓     unset
        development_stage_ontology_term…  bionty.DevelopmentalStage.ontology_id[source__uid='1G…  ✓  ✓     unset
        disease_ontology_term_id          bionty.Disease.ontology_id[source__uid='4a3ejKuf']      ✓  ✓     unset
        self_reported_ethnicity_ontolog…  bionty.Ethnicity.ontology_id[source__uid='MJRqduf9']    ✓  ✓     unset
        sex_ontology_term_id              bionty.Phenotype.ontology_id[source__uid='3ox8Ekgl']    ✓  ✓     unset
        tissue_ontology_term_id           bionty.Tissue.ontology_id[source__uid='MUtAGdL4']       ✓  ✓     unset
        organism_ontology_term_id         bionty.Organism.ontology_id[source__uid='4tsksCMX']     ✓  ✓     unset
        donor_id                          str                                                     ✓  ✓     unkno…
        is_primary_data                   ULabel                                                  ✓  ✓     unset
        suspension_type                   ULabel                                                  ✓  ✓     unset
        tissue_type                       ULabel                                                  ✓  ✓     unset
In the following, we will validate a dataset against the CELLxGENE schema and curate it.
Validate and curate metadata
Let’s start with an AnnData object that we would like to curate. We write it to disk so that we can run CZI’s cellxgene-schema CLI tool, which verifies whether an on-disk h5ad dataset adheres to all requirements of CELLxGENE, including the CELLxGENE schema.
adata = ln.examples.datasets.small_dataset3_cellxgene(
    with_obs_typo=True, with_var_typo=True
)
adata.write_h5ad("small_cxg.h5ad")
adata
AnnData object with n_obs × n_vars = 3 × 3
obs: 'disease_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'self_reported_ethnicity', 'donor_id', 'is_primary_data', 'suspension_type', 'tissue_type', 'organism_ontology_term_id'
var: 'feature_is_filtered'
uns: 'title'
obsm: 'X_pca'
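Comparing the obs columns printed above with the 12 obs features of the schema already hints at what is missing. The following is a minimal sketch in plain Python, not part of lamindb: the helper `missing_columns` is a hypothetical name, and the column lists are copied by hand from the outputs on this page.

```python
# The 12 obs features listed in the obs slot of the schema (copied by hand).
REQUIRED_OBS_COLUMNS = [
    "assay_ontology_term_id",
    "cell_type_ontology_term_id",
    "development_stage_ontology_term_id",
    "disease_ontology_term_id",
    "self_reported_ethnicity_ontology_term_id",
    "sex_ontology_term_id",
    "tissue_ontology_term_id",
    "organism_ontology_term_id",
    "donor_id",
    "is_primary_data",
    "suspension_type",
    "tissue_type",
]


def missing_columns(present, required=REQUIRED_OBS_COLUMNS):
    """Return the required columns that are absent, in schema order."""
    present_set = set(present)
    return [column for column in required if column not in present_set]


# obs columns of the example AnnData object, as printed above
obs_columns = [
    "disease_ontology_term_id",
    "development_stage_ontology_term_id",
    "sex_ontology_term_id",
    "tissue_ontology_term_id",
    "cell_type",
    "self_reported_ethnicity",
    "donor_id",
    "is_primary_data",
    "suspension_type",
    "tissue_type",
    "organism_ontology_term_id",
]

missing = missing_columns(obs_columns)
print(missing)
```

This only compares column names. It says nothing about whether the values in those columns are valid ontology terms, which is what the curator checks.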
Initially, CZI’s cellxgene-schema validator fails, so we need to curate the dataset.
!MPLBACKEND=agg uvx cellxgene-schema validate small_cxg.h5ad
Loading dependencies
Loading validator modules
Starting validation...
WARNING: Dataframe 'var' only has 3 rows. Features SHOULD NOT be filtered from expression matrix.
WARNING: Validation of raw layer was not performed due to current errors, try again after fixing current errors.
ERROR: Add labels error: Column 'cell_type' is a reserved column name of 'obs'. Remove it from h5ad and try again.
ERROR: Add labels error: Column 'self_reported_ethnicity' is a reserved column name of 'obs'. Remove it from h5ad and try again.
ERROR: Could not infer organism from feature ID 'invalid_ensembl_id' in 'var', make sure it is a valid ID.
ERROR: Could not infer organism from feature ID 'invalid_ensembl_id' in 'raw.var', make sure it is a valid ID.
ERROR: Dataframe 'obs' is missing column 'cell_type_ontology_term_id'.
ERROR: Dataframe 'obs' is missing column 'assay_ontology_term_id'.
ERROR: 'UBERON:0002048XXX' in 'tissue_ontology_term_id' is not a valid ontology term id of 'UBERON'. When 'tissue_type' is 'tissue' or 'organoid', 'tissue_ontology_term_id' MUST be a descendant term id of 'UBERON:0001062' (anatomical entity).
ERROR: Dataframe 'obs' is missing column 'self_reported_ethnicity_ontology_term_id'.
ERROR: Checking values with dependencies failed for adata.obs['suspension_type'], this is likely due to missing dependent column in adata.obs.
Validation complete in 0:00:02.419974 with status is_valid=False
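When the validator prints many messages, it can help to triage them programmatically. The sketch below assumes the output format shown above (lines prefixed with `WARNING:` or `ERROR:` and a final `is_valid=...` status). The function name `summarize_validation` is hypothetical, and the parsing is a heuristic over the printed text, not an interface offered by cellxgene-schema.

```python
import re


def summarize_validation(output: str) -> dict:
    """Collect WARNING/ERROR messages and the final is_valid flag from CLI text."""
    warnings, errors, is_valid = [], [], None
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("WARNING:"):
            warnings.append(line[len("WARNING:"):].strip())
        elif line.startswith("ERROR:"):
            errors.append(line[len("ERROR:"):].strip())
        else:
            # The status line ends with "is_valid=True" or "is_valid=False".
            match = re.search(r"is_valid=(True|False)", line)
            if match:
                is_valid = match.group(1) == "True"
    return {"warnings": warnings, "errors": errors, "is_valid": is_valid}


# A shortened excerpt of the validator output shown above
sample_output = """\
Starting validation...
WARNING: Dataframe 'var' only has 3 rows. Features SHOULD NOT be filtered from expression matrix.
ERROR: Dataframe 'obs' is missing column 'cell_type_ontology_term_id'.
ERROR: Dataframe 'obs' is missing column 'assay_ontology_term_id'.
Validation complete in 0:00:02.419974 with status is_valid=False
"""

summary = summarize_validation(sample_output)
print(summary["is_valid"], len(summary["errors"]), len(summary["warnings"]))
```

In a script, the text could come from capturing the output of the CLI call. Here it is a literal string so that the sketch runs without cellxgene-schema installed.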
CELLxGENE requires all observations to be annotated.
If information for a specific column like disease_ontology_term_id is not available, CELLxGENE requires falling back to default values like “normal” or “unknown”.
Let’s save these defaults to the instance using lamindb.examples.cellxgene.save_cellxgene_defaults():
ln.examples.cellxgene.save_cellxgene_defaults()
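The fallback rule itself can be sketched in plain Python. This is an illustration only: `apply_defaults` and `ILLUSTRATIVE_DEFAULTS` are hypothetical names, and the mapping lists just two columns for which "unknown" appears in the example data on this page. It is not the full set of defaults that `save_cellxgene_defaults()` registers.

```python
# Assumption: an illustrative, incomplete mapping of column -> fallback value.
ILLUSTRATIVE_DEFAULTS = {
    "development_stage_ontology_term_id": "unknown",
    "sex_ontology_term_id": "unknown",
}


def apply_defaults(rows, defaults):
    """Return copies of the rows with missing or None values replaced by defaults."""
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so that the input rows stay untouched
        for column, default in defaults.items():
            if new_row.get(column) is None:
                new_row[column] = default
        filled.append(new_row)
    return filled


rows = [
    {"sex_ontology_term_id": "PATO:0000383"},
    {"sex_ontology_term_id": None},
]
filled = apply_defaults(rows, ILLUSTRATIVE_DEFAULTS)
for row in filled:
    print(row)
```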
Now we can start curating the dataset:
curator = ln.curators.AnnDataCurator(adata, cxg_schema)
try:
    curator.validate()
except ln.errors.ValidationError:
    pass
! 1 term not validated in feature 'index' in slot 'var': 'invalid_ensembl_id'
→ fix organism 'human', fix typos, remove non-existent values, or save terms via: curator.slots['var'].cat.add_new_from('index')
The error shows that the var index contains an invalid gene ID, 'invalid_ensembl_id'.
Let’s remove them from both the adata and adata.raw objects:
adata = adata[
    :, ~adata.var.index.isin(curator.slots["var"].cat.non_validated["index"])
].copy()
if adata.raw is not None:
    raw_data = adata.raw.to_adata()
    raw_data = raw_data[
        :, ~raw_data.var.index.isin(curator.slots["var"].cat.non_validated["index"])
    ].copy()
    adata.raw = raw_data
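The subsetting above removes the invalid features from the var index and the matching columns from the expression matrix in one step. A minimal sketch of that idea with plain lists follows. `drop_features` is a hypothetical helper, and both the order of the IDs and the matrix values are made up for illustration.

```python
def drop_features(var_index, matrix, non_validated):
    """Drop invalid feature IDs and the matching matrix columns together."""
    invalid = set(non_validated)
    keep = [i for i, feature_id in enumerate(var_index) if feature_id not in invalid]
    new_index = [var_index[i] for i in keep]
    new_matrix = [[row[i] for i in keep] for row in matrix]
    return new_index, new_matrix


# Assumption: a toy 3 x 3 dataset; the values are arbitrary.
var_index = ["ENSG00000000419", "ENSG00000139618", "invalid_ensembl_id"]
matrix = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

new_index, new_matrix = drop_features(var_index, matrix, ["invalid_ensembl_id"])
print(new_index)
print(new_matrix)
```

Keeping the index and the matrix columns in sync is the reason the guide subsets the AnnData object itself instead of editing `adata.var` alone.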
As we’ve subsetted the AnnData object, we have to recreate the AnnDataCurator to validate again:
curator = ln.curators.AnnDataCurator(adata, cxg_schema)
try:
    curator.validate()
except ln.errors.ValidationError as e:
    print(e)
{
"SCHEMA": {
"COLUMN_NOT_IN_DATAFRAME": [
{
"schema": null,
"column": null,
"check": "column_in_dataframe",
"error": "column 'assay_ontology_term_id' not in dataframe. Columns in dataframe: ['disease_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'self_reported_ethnicity', 'donor_id', 'is_primary_data', 'suspension_type', 'tissue_type', 'organism_ontology_term_id']"
},
{
"schema": null,
"column": null,
"check": "column_in_dataframe",
"error": "column 'cell_type_ontology_term_id' not in dataframe. Columns in dataframe: ['disease_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'self_reported_ethnicity', 'donor_id', 'is_primary_data', 'suspension_type', 'tissue_type', 'organism_ontology_term_id']"
},
{
"schema": null,
"column": null,
"check": "column_in_dataframe",
"error": "column 'self_reported_ethnicity_ontology_term_id' not in dataframe. Columns in dataframe: ['disease_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'self_reported_ethnicity', 'donor_id', 'is_primary_data', 'suspension_type', 'tissue_type', 'organism_ontology_term_id']"
}
]
}
}
The validation error tells us that we’re missing several columns.
The reason is simple:
CELLxGENE requires ontology-based obs metadata to be stored as ontology IDs in columns named after the pattern entity_ontology_term_id, for example cell_type_ontology_term_id.
Therefore, we first translate the name-based obs columns into the required format.
adata.obs
| | disease_ontology_term_id | development_stage_ontology_term_id | sex_ontology_term_id | tissue_ontology_term_id | cell_type | self_reported_ethnicity | donor_id | is_primary_data | suspension_type | tissue_type | organism_ontology_term_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| barcode1 | MONDO:0004975 | unknown | PATO:0000383 | UBERON:0002048XXX | T cell | South Asian | -1 | False | cell | tissue | NCBITaxon:9606 |
| barcode2 | MONDO:0004980 | unknown | PATO:0000384 | UBERON:0002048XXX | B cell | South Asian | 1 | False | cell | tissue | NCBITaxon:9606 |
| barcode3 | MONDO:0004980 | unknown | unknown | UBERON:0000948 | B cell | South Asian | 2 | False | cell | tissue | NCBITaxon:9606 |
# Add missing assay column
adata.obs["assay_ontology_term_id"] = "EFO:0005684"
# Add `entity_ontology_term_id` columns by translating names to ontology IDs
standardization_map = {
    "self_reported_ethnicity": (
        bt.Ethnicity,
        "self_reported_ethnicity_ontology_term_id",
    ),
    "cell_type": (bt.CellType, "cell_type_ontology_term_id"),
}
for col, (bt_class, new_col) in standardization_map.items():
    adata.obs[new_col] = bt_class.standardize(
        adata.obs[col], field="name", return_field="ontology_id"
    )
# Drop the name columns because CELLxGENE disallows them
adata.obs = adata.obs.drop(columns=list(standardization_map.keys()))
! found 1 name in public source: ['South Asian']
please add corresponding Ethnicity records via: `.from_values(['South Asian'])`
! found 2 names in public source: ['T cell', 'B cell']
please add corresponding CellType records via: `.from_values(['T cell', 'B cell'])`
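Conceptually, the translation step maps each name to its ontology ID. The sketch below mimics it with a hand-written lookup instead of an ontology query, so it runs without bionty. The function `translate` is hypothetical, and the three IDs are the ones that appear for these names in the saved artifact later in this guide.

```python
# Assumption: a tiny hand-written lookup standing in for an ontology query.
NAME_TO_ONTOLOGY_ID = {
    "cell_type": {"T cell": "CL:0000084", "B cell": "CL:0000236"},
    "self_reported_ethnicity": {"South Asian": "HANCESTRO:0006"},
}


def translate(values, lookup):
    """Map names to ontology IDs and fail loudly on names without a mapping."""
    unmapped = sorted({value for value in values if value not in lookup})
    if unmapped:
        raise KeyError(f"no ontology ID known for: {unmapped}")
    return [lookup[value] for value in values]


cell_types = ["T cell", "B cell", "B cell"]
cell_type_ids = translate(cell_types, NAME_TO_ONTOLOGY_ID["cell_type"])
print(cell_type_ids)
```

Raising on unmapped names is a deliberate choice in this sketch: a silent pass-through would leave names in a column that is supposed to contain only ontology IDs.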
try:
    curator.validate()
except ln.errors.ValidationError:
    pass
! 2 terms not validated in feature 'columns' in slot 'obs': 'cell_type', 'self_reported_ethnicity'
→ fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('columns')
! ontology ID http://www.ebi.ac.uk/efo/EFO_0008913 not found in DataFrame
! ontology ID http://www.ebi.ac.uk/efo/EFO_0003738 not found in DataFrame
! 1 term not validated in feature 'tissue_ontology_term_id' in slot 'obs': 'UBERON:0002048XXX'
→ fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('tissue_ontology_term_id')
Validation flags the tissue ID “UBERON:0002048XXX” because it carries three trailing X characters, which is a typo.
Let’s fix it:
adata.obs["tissue_ontology_term_id"] = adata.obs[
    "tissue_ontology_term_id"
].cat.rename_categories({"UBERON:0002048XXX": "UBERON:0002048"})
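Typos of this kind can often be caught before any ontology lookup with a purely syntactic check. The sketch below is a heuristic under stated assumptions: the pattern `PREFIX:digits` and the allowed literal "unknown" are inferred from the values on this page, and `malformed_ids` is a hypothetical helper. It does not replace validation against the ontology, because a well-formed ID can still be a non-existent term.

```python
import re

# Assumption: IDs look like "UBERON:0002048", a letter prefix, a colon, then digits.
CURIE_PATTERN = re.compile(r"[A-Za-z][A-Za-z_]*:\d+")
ALLOWED_LITERALS = {"unknown"}


def malformed_ids(values):
    """Return the distinct values that are neither well-formed IDs nor allowed literals."""
    return sorted(
        {
            value
            for value in values
            if value not in ALLOWED_LITERALS and not CURIE_PATTERN.fullmatch(value)
        }
    )


tissues = ["UBERON:0002048XXX", "UBERON:0002048XXX", "UBERON:0000948"]
print(malformed_ids(tissues))

# Apply the same correction as the rename above, here on a plain list.
corrections = {"UBERON:0002048XXX": "UBERON:0002048"}
fixed = [corrections.get(value, value) for value in tissues]
print(malformed_ids(fixed))
```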
Now validation should pass.
# recreate the AnnDataCurator to refresh cached categoricals
curator = ln.curators.AnnDataCurator(adata, cxg_schema)
curator.validate()
Save artifact
We can now save the curated artifact:
artifact = curator.save_artifact(key="examples/dataset-curated-against-cxg.h5ad")
→ writing the in-memory object into cache
→ returning schema with same hash: Schema(uid='9IGAsduNztzeHcT2', name='obs of CELLxGENE version 5.2.0 for human of ontology_id', description=None, n=12, is_type=False, itype='Feature', otype='DataFrame', dtype=None, hash='9UencaGxx1vVapZbW7rfLA', minimal_set=True, ordered_set=False, maximal_set=False, slot=None, branch_id=1, space_id=1, created_by_id=1, run_id=1, type_id=None, validated_by_id=None, composite_id=None, created_at=2025-10-27 08:28:16 UTC, is_locked=False)
artifact.describe()
Artifact: examples/dataset-curated-against-cxg.h5ad (0000)
├── uid: BkGDMgcMpoLMgxTv0000            run: JYFlJsg (cellxgene-curate.ipynb)
│   kind: dataset                        otype: AnnData
│   hash: 2L22Zlwons3ahouHdI44RQ         size: 43.4 KB
│   branch: main                         space: all
│   created_at: 2025-10-27 08:30:34 UTC  created_by: testuser1
│   n_observations: 3
├── storage/path:
│   /home/runner/work/cellxgene-lamin/cellxgene-lamin/docs/test-cellxgene-curate/.lamindb/BkGDMgcMpoLMgxTv0000.h5ad
├── Dataset features
│   ├── var (2)
│   │   var_index                       bionty.Gene.ensembl_gene_id[sour…  ENSG00000000419, ENSG00000139618
│   │   feature_is_filtered             bool
│   └── obs (12)
│       assay_ontology_term_id          bionty.ExperimentalFactor.ontolo…  EFO:0005684
│       cell_type_ontology_term_id      bionty.CellType.ontology_id[sour…  CL:0000084, CL:0000236
│       development_stage_ontology_te…  bionty.DevelopmentalStage.ontolo…  unknown
│       disease_ontology_term_id        bionty.Disease.ontology_id[sourc…  MONDO:0004975, MONDO:0004980
│       organism_ontology_term_id       bionty.Organism.ontology_id[sour…  NCBITaxon:9606
│       self_reported_ethnicity_ontol…  bionty.Ethnicity.ontology_id[sou…  HANCESTRO:0006
│       sex_ontology_term_id            bionty.Phenotype.ontology_id[sou…  PATO:0000383, PATO:0000384, unknown
│       suspension_type                 ULabel                             cell
│       tissue_ontology_term_id         bionty.Tissue.ontology_id[source…  UBERON:0000948, UBERON:0002048
│       tissue_type                     ULabel                             tissue
│       donor_id                        str
│       is_primary_data                 ULabel
└── Labels
    └── .organisms             bionty.Organism            human
        .genes                 bionty.Gene                DPM1, BRCA2
        .tissues               bionty.Tissue              heart, lung
        .cell_types            bionty.CellType            T cell, B cell
        .diseases              bionty.Disease             Alzheimer disease, atopic eczema
        .phenotypes            bionty.Phenotype           unknown, female, male
        .experimental_factors  bionty.ExperimentalFactor  RNA-seq of coding RNA from single cells
        .developmental_stages  bionty.DevelopmentalStage  unknown
        .ethnicities           bionty.Ethnicity           South Asian
        .ulabels               ULabel                     tissue, cell
Validating using cellxgene-schema
To validate the now curated AnnData object using CZI’s cellxgene-schema CLI tool, we need to write the AnnData object to disk.
adata.write("small_cxg_curated.h5ad")
# %%bash -e
!MPLBACKEND=agg uvx cellxgene-schema validate small_cxg_curated.h5ad
Loading dependencies
Loading validator modules
Starting validation...
WARNING: Dataframe 'var' only has 2 rows. Features SHOULD NOT be filtered from expression matrix.
WARNING: Data contains assay(s) that are not represented in the 'suspension_type' schema definition table. Ensure you have selected the most appropriate value for the assay(s) between 'cell', 'nucleus', and 'na'. Please contact [email protected] during submission so that the assay(s) can be added to the schema definition document.
Validation complete in 0:00:02.386614 with status is_valid=True
Note
The lamindb CELLxGENE schema is designed to validate that all metadata adhere to the required ontologies. It does not reimplement every rule of the CELLxGENE schema, so we recommend also running the cellxgene-schema validator if full adherence beyond metadata is required.