Curate AnnData based on the CELLxGENE schema

This guide shows how to curate an AnnData object against the CELLxGENE schema v5.2.0.

Summary

To validate and annotate datasets that adhere to a CELLxGENE schema, call

!cellxgene-schema --version # should print 5.2.0
!cellxgene-schema validate small_cxg_curated.h5ad  # validation

from a shell, and then run

schema = ln.examples.cellxgene.get_cxg_schema(version="5.2.0")
ln.Artifact("…", schema=schema).save()  # annotation (re-validates ontologies, but not some other details)
# pip install 'lamindb[bionty,jupyter]' pronto
# cellxgene-schema has pinned dependencies. Therefore we recommend installing it into a separate environment using `uv` or `pipx`
# uv tool install cellxgene-schema==5.2.3

!lamin init --storage ./test-cellxgene-curate --modules bionty
 initialized lamindb: testuser1/test-cellxgene-curate
import lamindb as ln
import bionty as bt

ln.track()
 connected lamindb: testuser1/test-cellxgene-curate
 created Transform('pO0gxiTft69k0000', key='cellxgene-curate.ipynb'), started new Run('JYFlJsgSyDk30Nbg') at 2025-10-27 08:28:08 UTC
 notebook imports: bionty==1.8.1 lamindb==1.14a1
 recommendation: to identify the notebook across renames, pass the uid: ln.track("pO0gxiTft69k")

The CELLxGENE schema

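As a first step, we generate the CELLxGENE schema for version 5.2.0, which also adds any missing sources to the instance. The call emits a DeprecationWarning because get_cxg_schema is being replaced by create_cellxgene_schema: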

cxg_schema = ln.examples.cellxgene.get_cxg_schema("5.2.0")
/tmp/ipykernel_3802/2806544410.py:1: DeprecationWarning: Use create_cellxgene_schema instead of get_cxg_schema, get_cxg_schema will be removed in the future.
  cxg_schema = ln.examples.cellxgene.get_cxg_schema("5.2.0")
 writing the in-memory object into cache
 writing the in-memory object into cache
 writing the in-memory object into cache
 writing the in-memory object into cache
 writing the in-memory object into cache
 referenced read-only storage location at s3://bionty-assets, is managed by instance with uid 2WgKqzPc1eW3
cxg_schema.describe()
Schema: AnnData of CELLxGENE version 5.2.0 for human of ontology_id
├── uid: 1t7iWOosjdFwsYcX                run: JYFlJsg (cellxgene-curate.ipynb)
│   itype: Composite                     otype: AnnData
│   hash: m9RPCyeF9wiHUq6TkKzFuw         ordered_set: False
│   maximal_set: False                   minimal_set: True
│   branch: main                         space: all
│   created_at: 2025-10-27 08:28:16 UTC  created_by: testuser1
├── var: var of CELLxGENE version 5.2.0
│   ├── uid: uACnhaxsZtm1LXFW                run: JYFlJsg (cellxgene-curate.ipynb)
│   │   itype: Feature                       otype: None                          
│   │   hash: 3QBevwcTQ6LpiBZ95PjbPQ         ordered_set: False                   
│   │   maximal_set: False                   minimal_set: True                    
│   │   branch: main                         space: all                           
│   │   created_at: 2025-10-27 08:28:16 UTC  created_by: testuser1                
│   └── Features (2)
│       └── name              dtype                                             opti…  null…  coerce_d…  default_v…
│           var_index         bionty.Gene.ensembl_gene_id[source__uid='5dmX95…  ✗      ✓      ✓          unset
│           feature_is_filt…  bool                                              ✗      ✓      ✓          unset
└── obs: obs of CELLxGENE version 5.2.0 for human of ontology_id
    ├── uid: 9IGAsduNztzeHcT2                run: JYFlJsg (cellxgene-curate.ipynb)
    │   itype: Feature                       otype: DataFrame
    │   hash: 9UencaGxx1vVapZbW7rfLA         ordered_set: False
    │   maximal_set: False                   minimal_set: True
    │   branch: main                         space: all
    │   created_at: 2025-10-27 08:28:16 UTC  created_by: testuser1
    └── Features (12)
        └── name                              dtype                                                      coe…  def…
            assay_ontology_term_id            bionty.ExperimentalFactor.ontology_id[source__uid='2…      ✓     uns…
            cell_type_ontology_term_id        bionty.CellType.ontology_id[source__uid='3Uw2Va7a']        ✓     uns…
            development_stage_ontology_term…  bionty.DevelopmentalStage.ontology_id[source__uid='1…      ✓     uns…
            disease_ontology_term_id          bionty.Disease.ontology_id[source__uid='4a3ejKuf']         ✓     uns…
            self_reported_ethnicity_ontolog…  bionty.Ethnicity.ontology_id[source__uid='MJRqduf9']       ✓     uns…
            sex_ontology_term_id              bionty.Phenotype.ontology_id[source__uid='3ox8Ekgl']       ✓     uns…
            tissue_ontology_term_id           bionty.Tissue.ontology_id[source__uid='MUtAGdL4']          ✓     uns…
            organism_ontology_term_id         bionty.Organism.ontology_id[source__uid='4tsksCMX']        ✓     uns…
            donor_id                          str                                                        ✓     unk…
            is_primary_data                   ULabel                                                     ✓     uns…
            suspension_type                   ULabel                                                     ✓     uns…
            tissue_type                       ULabel                                                     ✓     uns…

The schema has two components, one for the var slot and one for the obs slot:

cxg_schema.slots["var"].describe()
Schema: var of CELLxGENE version 5.2.0
├── uid: uACnhaxsZtm1LXFW                run: JYFlJsg (cellxgene-curate.ipynb)
│   itype: Feature                       otype: None
│   hash: 3QBevwcTQ6LpiBZ95PjbPQ         ordered_set: False
│   maximal_set: False                   minimal_set: True
│   branch: main                         space: all
│   created_at: 2025-10-27 08:28:16 UTC  created_by: testuser1
└── Features (2)
    └── name               dtype                                              optio…  null…  coerce_dt…  default_v…
        var_index          bionty.Gene.ensembl_gene_id[source__uid='5dmX950…  ✗       ✓      ✓           unset     
        feature_is_filte…  bool                                               ✗       ✓      ✓           unset     
cxg_schema.slots["obs"].describe()
Schema: obs of CELLxGENE version 5.2.0 for human of ontology_id
├── uid: 9IGAsduNztzeHcT2                run: JYFlJsg (cellxgene-curate.ipynb)
│   itype: Feature                       otype: DataFrame
│   hash: 9UencaGxx1vVapZbW7rfLA         ordered_set: False
│   maximal_set: False                   minimal_set: True
│   branch: main                         space: all
│   created_at: 2025-10-27 08:28:16 UTC  created_by: testuser1
└── Features (12)
    └── name                              dtype                                                     …  coe…  defau…
        assay_ontology_term_id            bionty.ExperimentalFactor.ontology_id[source__uid='2v…    ✓  ✓     unset 
        cell_type_ontology_term_id        bionty.CellType.ontology_id[source__uid='3Uw2Va7a']       ✓  ✓     unset 
        development_stage_ontology_term…  bionty.DevelopmentalStage.ontology_id[source__uid='1G…    ✓  ✓     unset 
        disease_ontology_term_id          bionty.Disease.ontology_id[source__uid='4a3ejKuf']        ✓  ✓     unset 
        self_reported_ethnicity_ontolog…  bionty.Ethnicity.ontology_id[source__uid='MJRqduf9']      ✓  ✓     unset 
        sex_ontology_term_id              bionty.Phenotype.ontology_id[source__uid='3ox8Ekgl']      ✓  ✓     unset 
        tissue_ontology_term_id           bionty.Tissue.ontology_id[source__uid='MUtAGdL4']         ✓  ✓     unset 
        organism_ontology_term_id         bionty.Organism.ontology_id[source__uid='4tsksCMX']       ✓  ✓     unset 
        donor_id                          str                                                       ✓  ✓     unkno…
        is_primary_data                   ULabel                                                    ✓  ✓     unset 
        suspension_type                   ULabel                                                    ✓  ✓     unset 
        tissue_type                       ULabel                                                    ✓  ✓     unset 

In the following, we validate a dataset against the CELLxGENE schema and curate it.

Validate and curate metadata

Let’s start with an AnnData object that we would like to curate. We write it to disk so that we can run CZI’s cellxgene-schema CLI tool, which verifies whether an on-disk h5ad dataset adheres to all requirements of CELLxGENE, including the CELLxGENE schema.

adata = ln.examples.datasets.small_dataset3_cellxgene(
    with_obs_typo=True, with_var_typo=True
)
adata.write_h5ad("small_cxg.h5ad")
adata
AnnData object with n_obs × n_vars = 3 × 3
    obs: 'disease_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'self_reported_ethnicity', 'donor_id', 'is_primary_data', 'suspension_type', 'tissue_type', 'organism_ontology_term_id'
    var: 'feature_is_filtered'
    uns: 'title'
    obsm: 'X_pca'

Initially, CZI’s cellxgene-schema validator fails, so we need to curate the dataset.

!MPLBACKEND=agg uvx cellxgene-schema validate small_cxg.h5ad
Loading dependencies
Loading validator modules
Starting validation...
WARNING: Dataframe 'var' only has 3 rows. Features SHOULD NOT be filtered from expression matrix.
WARNING: Validation of raw layer was not performed due to current errors, try again after fixing current errors.
ERROR: Add labels error: Column 'cell_type' is a reserved column name of 'obs'. Remove it from h5ad and try again.
ERROR: Add labels error: Column 'self_reported_ethnicity' is a reserved column name of 'obs'. Remove it from h5ad and try again.
ERROR: Could not infer organism from feature ID 'invalid_ensembl_id' in 'var', make sure it is a valid ID.
ERROR: Could not infer organism from feature ID 'invalid_ensembl_id' in 'raw.var', make sure it is a valid ID.
ERROR: Dataframe 'obs' is missing column 'cell_type_ontology_term_id'.
ERROR: Dataframe 'obs' is missing column 'assay_ontology_term_id'.
ERROR: 'UBERON:0002048XXX' in 'tissue_ontology_term_id' is not a valid ontology term id of 'UBERON'. When 'tissue_type' is 'tissue' or 'organoid', 'tissue_ontology_term_id' MUST be a descendant term id of 'UBERON:0001062' (anatomical entity).
ERROR: Dataframe 'obs' is missing column 'self_reported_ethnicity_ontology_term_id'.
ERROR: Checking values with dependencies failed for adata.obs['suspension_type'], this is likely due to missing dependent column in adata.obs.
Validation complete in 0:00:02.419974 with status is_valid=False
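
When the validator reports many problems, it can help to group its output before deciding what to fix first. The following is a minimal sketch, not part of cellxgene-schema or lamindb: `summarize_validator_output` is a hypothetical helper that assumes the line prefixes (`ERROR:`, `WARNING:`) and the `is_valid=` status text look like the output shown above.

```python
import re


def summarize_validator_output(text):
    """Split captured validator output into errors, warnings, and a status flag.

    Assumption: lines start with 'ERROR:' or 'WARNING:' and the final status
    line contains 'is_valid=True' or 'is_valid=False', as in the output above.
    """
    errors, warnings, is_valid = [], [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("ERROR:"):
            errors.append(line[len("ERROR:"):].strip())
        elif line.startswith("WARNING:"):
            warnings.append(line[len("WARNING:"):].strip())
        else:
            match = re.search(r"is_valid=(True|False)", line)
            if match:
                is_valid = match.group(1) == "True"
    return {"errors": errors, "warnings": warnings, "is_valid": is_valid}


# A shortened excerpt of the output above, used as sample input
sample = """\
Starting validation...
WARNING: Dataframe 'var' only has 3 rows. Features SHOULD NOT be filtered from expression matrix.
ERROR: Dataframe 'obs' is missing column 'cell_type_ontology_term_id'.
ERROR: Dataframe 'obs' is missing column 'assay_ontology_term_id'.
Validation complete in 0:00:02.419974 with status is_valid=False
"""
report = summarize_validator_output(sample)
for message in report["errors"]:
    print("error:", message)
```

The helper only reorganizes text that the CLI already printed. It makes no claim about the tool's exit code or output stream, so capture the output however suits your environment.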

CELLxGENE requires all observations to be annotated. If information for a specific column such as disease_ontology_term_id is not available, CELLxGENE requires falling back to default values like “normal” or “unknown”. Let’s save these defaults to the instance using lamindb.examples.cellxgene.save_cellxgene_defaults():

ln.examples.cellxgene.save_cellxgene_defaults()

Now we can start curating the dataset:

curator = ln.curators.AnnDataCurator(adata, cxg_schema)
try:
    curator.validate()
except ln.errors.ValidationError:
    pass
! 1 term not validated in feature 'index' in slot 'var': 'invalid_ensembl_id'
    → fix organism 'human', fix typos, remove non-existent values, or save terms via: curator.slots['var'].cat.add_new_from('index')

The message shows that the dataset contains an invalid gene ID. Let’s remove it from both the adata and adata.raw objects:

adata = adata[
    :, ~adata.var.index.isin(curator.slots["var"].cat.non_validated["index"])
].copy()
if adata.raw is not None:
    raw_data = adata.raw.to_adata()
    raw_data = raw_data[
        :, ~raw_data.var.index.isin(curator.slots["var"].cat.non_validated["index"])
    ].copy()
    adata.raw = raw_data
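
Before validating against the gene registry, a cheap format check can already flag obviously malformed identifiers. The sketch below is a rough pre-filter under a stated assumption: human Ensembl gene IDs consist of `ENSG` followed by 11 digits, without a version suffix. `split_by_format` is a hypothetical helper, and passing this check does not mean an ID exists in the Ensembl release that the schema pins.

```python
import re

# Assumption: human Ensembl gene IDs are "ENSG" plus 11 digits, no version suffix
HUMAN_ENSEMBL_GENE_ID = re.compile(r"ENSG\d{11}")


def split_by_format(var_names):
    """Return (well_formed, malformed) lists, preserving the input order."""
    well_formed, malformed = [], []
    for name in var_names:
        if HUMAN_ENSEMBL_GENE_ID.fullmatch(name):
            well_formed.append(name)
        else:
            malformed.append(name)
    return well_formed, malformed


well_formed, malformed = split_by_format(
    ["ENSG00000000419", "ENSG00000139618", "invalid_ensembl_id"]
)
print("malformed:", malformed)
```

The curator remains the authority on which genes are valid. The regular expression only catches identifiers that cannot be human Ensembl gene IDs at all.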

As we’ve subsetted the AnnData object, we have to recreate the AnnDataCurator to validate again:

curator = ln.curators.AnnDataCurator(adata, cxg_schema)
try:
    curator.validate()
except ln.errors.ValidationError as e:
    print(e)
{
    "SCHEMA": {
        "COLUMN_NOT_IN_DATAFRAME": [
            {
                "schema": null,
                "column": null,
                "check": "column_in_dataframe",
                "error": "column 'assay_ontology_term_id' not in dataframe. Columns in dataframe: ['disease_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'self_reported_ethnicity', 'donor_id', 'is_primary_data', 'suspension_type', 'tissue_type', 'organism_ontology_term_id']"
            },
            {
                "schema": null,
                "column": null,
                "check": "column_in_dataframe",
                "error": "column 'cell_type_ontology_term_id' not in dataframe. Columns in dataframe: ['disease_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'self_reported_ethnicity', 'donor_id', 'is_primary_data', 'suspension_type', 'tissue_type', 'organism_ontology_term_id']"
            },
            {
                "schema": null,
                "column": null,
                "check": "column_in_dataframe",
                "error": "column 'self_reported_ethnicity_ontology_term_id' not in dataframe. Columns in dataframe: ['disease_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'self_reported_ethnicity', 'donor_id', 'is_primary_data', 'suspension_type', 'tissue_type', 'organism_ontology_term_id']"
            }
        ]
    }
}

The validation error tells us that several columns are missing. The reason is simple: CELLxGENE requires all obs metadata to be stored as ontology IDs in entity_ontology_term_id columns. Therefore, we first translate the name-based obs columns into the required format.

adata.obs
          disease_ontology_term_id  development_stage_ontology_term_id  sex_ontology_term_id  tissue_ontology_term_id  cell_type  self_reported_ethnicity  donor_id  is_primary_data  suspension_type  tissue_type  organism_ontology_term_id
barcode1  MONDO:0004975             unknown                             PATO:0000383          UBERON:0002048XXX        T cell     South Asian              -1        False            cell             tissue       NCBITaxon:9606
barcode2  MONDO:0004980             unknown                             PATO:0000384          UBERON:0002048XXX        B cell     South Asian              1         False            cell             tissue       NCBITaxon:9606
barcode3  MONDO:0004980             unknown                             unknown               UBERON:0000948           B cell     South Asian              2         False            cell             tissue       NCBITaxon:9606
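
The missing columns can be cross-checked with a plain comparison of column names. This is a minimal sketch, assuming the 12 obs feature names listed in the schema description above are the required set. `missing_columns` and `reserved_name_columns` are hypothetical helpers, not lamindb API.

```python
# The 12 obs features listed by cxg_schema.slots["obs"].describe() above
REQUIRED_OBS_COLUMNS = [
    "assay_ontology_term_id",
    "cell_type_ontology_term_id",
    "development_stage_ontology_term_id",
    "disease_ontology_term_id",
    "self_reported_ethnicity_ontology_term_id",
    "sex_ontology_term_id",
    "tissue_ontology_term_id",
    "organism_ontology_term_id",
    "donor_id",
    "is_primary_data",
    "suspension_type",
    "tissue_type",
]


def missing_columns(present, required=REQUIRED_OBS_COLUMNS):
    """Required columns that are absent, in the order of the required list."""
    present = set(present)
    return [col for col in required if col not in present]


def reserved_name_columns(present, required=REQUIRED_OBS_COLUMNS):
    """Name-based columns whose '<name>_ontology_term_id' form is required."""
    required = set(required)
    return [col for col in present if f"{col}_ontology_term_id" in required]


# Column names of adata.obs as printed above
obs_columns = [
    "disease_ontology_term_id", "development_stage_ontology_term_id",
    "sex_ontology_term_id", "tissue_ontology_term_id", "cell_type",
    "self_reported_ethnicity", "donor_id", "is_primary_data",
    "suspension_type", "tissue_type", "organism_ontology_term_id",
]
print("missing:", missing_columns(obs_columns))
print("to translate:", reserved_name_columns(obs_columns))
```

The second helper picks out the name-based columns that the CZI validator rejects as reserved, which are also the columns translated in the next step.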
# Add missing assay column
adata.obs["assay_ontology_term_id"] = "EFO:0005684"
# Add `entity_ontology_term_id` columns by translating names to ontology IDs
standardization_map = {
    "self_reported_ethnicity": (
        bt.Ethnicity,
        "self_reported_ethnicity_ontology_term_id",
    ),
    "cell_type": (bt.CellType, "cell_type_ontology_term_id"),
}

for col, (bt_class, new_col) in standardization_map.items():
    adata.obs[new_col] = bt_class.standardize(
        adata.obs[col], field="name", return_field="ontology_id"
    )
# Drop the name columns because CELLxGENE disallows them
adata.obs = adata.obs.drop(columns=list(standardization_map.keys()))
! found 1 name in public source: ['South Asian']
  please add corresponding Ethnicity records via: `.from_values(['South Asian'])`
! found 2 names in public source: ['T cell', 'B cell']
  please add corresponding CellType records via: `.from_values(['T cell', 'B cell'])`
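
Conceptually, the translation step is a lookup from names to ontology IDs. The toy sketch below illustrates that idea with a hand-written dictionary whose three entries are taken from the labels and IDs that appear in this guide. It does not use bionty, and `to_ontology_ids` is a hypothetical helper; in particular, leaving unknown names unchanged is a design choice of this sketch.

```python
# Toy lookup table; entries taken from the names and IDs shown in this guide
NAME_TO_ONTOLOGY_ID = {
    "T cell": "CL:0000084",
    "B cell": "CL:0000236",
    "South Asian": "HANCESTRO:0006",
}


def to_ontology_ids(names, lookup=NAME_TO_ONTOLOGY_ID):
    """Translate names to ontology IDs; names without a match stay unchanged."""
    return [lookup.get(name, name) for name in names]


cell_type_ids = to_ontology_ids(["T cell", "B cell", "B cell"])
ethnicity_ids = to_ontology_ids(["South Asian"] * 3)
print(cell_type_ids)
print(ethnicity_ids)
```

Keeping unmatched names visible in the result, instead of dropping them, means a later validation step can still report them.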
try:
    curator.validate()
except ln.errors.ValidationError:
    pass
! 2 terms not validated in feature 'columns' in slot 'obs': 'cell_type', 'self_reported_ethnicity'
    → fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('columns')
! ontology ID http://www.ebi.ac.uk/efo/EFO_0008913 not found in DataFrame
! ontology ID http://www.ebi.ac.uk/efo/EFO_0003738 not found in DataFrame
! 1 term not validated in feature 'tissue_ontology_term_id' in slot 'obs': 'UBERON:0002048XXX'
    → fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('tissue_ontology_term_id')

The tissue label “UBERON:0002048XXX” fails validation because it contains a typo: a few extra “X” characters at the end. Let’s fix it:

adata.obs["tissue_ontology_term_id"] = adata.obs[
    "tissue_ontology_term_id"
].cat.rename_categories({"UBERON:0002048XXX": "UBERON:0002048"})
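
For typos like this one, fuzzy string matching can propose a likely correction. The sketch below uses `difflib.get_close_matches` from the Python standard library against a toy list of valid terms. `suggest_term` and the `cutoff` of 0.8 are illustrative assumptions, and a suggestion still needs human confirmation because a similar string is not necessarily the intended term.

```python
from difflib import get_close_matches


def suggest_term(value, valid_terms, cutoff=0.8):
    """Return the closest valid term, or None if nothing is similar enough."""
    matches = get_close_matches(value, valid_terms, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Toy list: the two tissue IDs that appear in the curated dataset of this guide
valid_tissue_ids = ["UBERON:0002048", "UBERON:0000948"]

suggestion = suggest_term("UBERON:0002048XXX", valid_tissue_ids)
print(suggestion)
```

In practice the candidate list would come from the ontology registry, and the accepted mapping would be applied with rename_categories as shown above.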

Validation should now pass.

# recreate the AnnDataCurator to refresh cached categoricals
curator = ln.curators.AnnDataCurator(adata, cxg_schema)
curator.validate()

Save artifact

We can now save the curated artifact:

artifact = curator.save_artifact(key="examples/dataset-curated-against-cxg.h5ad")
 writing the in-memory object into cache
 returning schema with same hash: Schema(uid='9IGAsduNztzeHcT2', name='obs of CELLxGENE version 5.2.0 for human of ontology_id', description=None, n=12, is_type=False, itype='Feature', otype='DataFrame', dtype=None, hash='9UencaGxx1vVapZbW7rfLA', minimal_set=True, ordered_set=False, maximal_set=False, slot=None, branch_id=1, space_id=1, created_by_id=1, run_id=1, type_id=None, validated_by_id=None, composite_id=None, created_at=2025-10-27 08:28:16 UTC, is_locked=False)
artifact.describe()
Artifact: examples/dataset-curated-against-cxg.h5ad (0000)
├── uid: BkGDMgcMpoLMgxTv0000            run: JYFlJsg (cellxgene-curate.ipynb)
│   kind: dataset                        otype: AnnData
│   hash: 2L22Zlwons3ahouHdI44RQ         size: 43.4 KB
│   branch: main                         space: all
│   created_at: 2025-10-27 08:30:34 UTC  created_by: testuser1
│   n_observations: 3
├── storage/path: /home/runner/work/cellxgene-lamin/cellxgene-lamin/docs/test-cellxgene-curate/.lamindb/BkGDMgcMpoLMgxTv0000.h5ad
├── Dataset features
│   ├── var (2)
│   │   var_index                       bionty.Gene.ensembl_gene_id[sour…  ENSG00000000419, ENSG00000139618
│   │   feature_is_filtered             bool
│   └── obs (12)
│       assay_ontology_term_id          bionty.ExperimentalFactor.ontolo…  EFO:0005684
│       cell_type_ontology_term_id      bionty.CellType.ontology_id[sour…  CL:0000084, CL:0000236
│       development_stage_ontology_te…  bionty.DevelopmentalStage.ontolo…  unknown
│       disease_ontology_term_id        bionty.Disease.ontology_id[sourc…  MONDO:0004975, MONDO:0004980
│       organism_ontology_term_id       bionty.Organism.ontology_id[sour…  NCBITaxon:9606
│       self_reported_ethnicity_ontol…  bionty.Ethnicity.ontology_id[sou…  HANCESTRO:0006
│       sex_ontology_term_id            bionty.Phenotype.ontology_id[sou…  PATO:0000383, PATO:0000384, unknown
│       suspension_type                 ULabel                             cell
│       tissue_ontology_term_id         bionty.Tissue.ontology_id[source…  UBERON:0000948, UBERON:0002048
│       tissue_type                     ULabel                             tissue
│       donor_id                        str
│       is_primary_data                 ULabel
└── Labels
    └── .organisms                      bionty.Organism                    human                                   
        .genes                          bionty.Gene                        DPM1, BRCA2                             
        .tissues                        bionty.Tissue                      heart, lung                             
        .cell_types                     bionty.CellType                    T cell, B cell                          
        .diseases                       bionty.Disease                     Alzheimer disease, atopic eczema        
        .phenotypes                     bionty.Phenotype                   unknown, female, male                   
        .experimental_factors           bionty.ExperimentalFactor          RNA-seq of coding RNA from single cells 
        .developmental_stages           bionty.DevelopmentalStage          unknown                                 
        .ethnicities                    bionty.Ethnicity                   South Asian                             
        .ulabels                        ULabel                             tissue, cell                            

Validating using cellxgene-schema

To validate the now curated AnnData object using CZI’s cellxgene-schema CLI tool, we need to write the AnnData object to disk.

adata.write("small_cxg_curated.h5ad")
# %%bash -e
!MPLBACKEND=agg uvx cellxgene-schema validate small_cxg_curated.h5ad
Loading dependencies
Loading validator modules
Starting validation...
WARNING: Dataframe 'var' only has 2 rows. Features SHOULD NOT be filtered from expression matrix.
WARNING: Data contains assay(s) that are not represented in the 'suspension_type' schema definition table. Ensure you have selected the most appropriate value for the assay(s) between 'cell', 'nucleus', and 'na'. Please contact [email protected] during submission so that the assay(s) can be added to the schema definition document.
Validation complete in 0:00:02.386614 with status is_valid=True

Note

The CELLxGENE schema in lamindb is designed to validate metadata for adherence to ontologies. It does not reimplement every rule of the CELLxGENE schema, so we recommend also running cellxgene-schema when full adherence beyond metadata is required.