Curate AnnData based on the CELLxGENE schema

This guide shows how to curate an AnnData object against the latest CELLxGENE schema.

Summary

To ingest validated and annotated datasets that adhere to a CELLxGENE schema, call

!cellxgene-schema --version # should print latest version
!cellxgene-schema validate small_cxg_curated.h5ad  # validation

in a shell, and then

schema = ln.examples.cellxgene.create_cellxgene_schema()
ln.Artifact("…", schema=schema).save()  # annotation (re-validates ontologies, but not some other details)
# pip install lamindb pronto
# cellxgene-schema has pinned dependencies. Therefore we recommend installing it into a separate environment using `uv` or `pipx`
# uv tool install cellxgene-schema==5.2.3

!lamin init --storage ./test-cellxgene-curate --modules bionty
! using anonymous user (to identify, call: lamin login)
 initialized lamindb: anonymous/test-cellxgene-curate
import lamindb as ln
import bionty as bt
import re

ln.track()
 connected lamindb: anonymous/test-cellxgene-curate
 created Transform('6e7TWd66Ql0i0000', key='cellxgene-curate.ipynb'), started new Run('ZtABrNfazRVHkka7') at <django.db.models.expressions.DatabaseDefault object at 0x7fc4d67cf260>
 notebook imports: bionty==2.3.1 lamindb-core==2.5.1
 recommendation: to identify the notebook across renames, pass the uid: ln.track("6e7TWd66Ql0i")

The CELLxGENE schema

As a first step, we create the CELLxGENE schema, which also adds any missing ontology sources to the instance:

cxg_schema = ln.examples.cellxgene.create_cellxgene_schema()
cxg_schema.describe()
Schema: CELLxGENE AnnData of ontology_id
├── uid: 1zh3bbVZanBIrBEJ                run: ZtABrNf (cellxgene-curate.ipynb)
itype: None                          otype: AnnData                       
hash: JQx88Qmaty5OtoqMzYVnrQ         ordered_set: False                   
maximal_set: False                   minimal_set: True                    
branch: main                         space: all                           
created_at: 2026-06-04 11:26:56 UTC  created_by: anonymous                
├── var: var of CELLxGENE
│   ├── uid: i2kVNumbCmavZcyO                run: ZtABrNf (cellxgene-curate.ipynb)
│   │   itype: Feature                       otype: None                          
│   │   hash: smp1Myp5o2dinPxpwzV_yQ         ordered_set: False                   
│   │   maximal_set: False                   minimal_set: True                    
│   │   branch: main                         space: all                           
│   │   created_at: 2026-06-04 11:26:56 UTC  created_by: anonymous                
│   └── Features (2)
│       └── name               dtype                                              optio…  nulla…  coe…  default_va…
var_index          bionty.Gene.ensembl_gene_id[source__uid='2w43l1Y…  ✗       ✓       ✓     unset      
feature_is_filte…  bool                                               ✗       ✓       ✓     unset      
├── obs: obs of CELLxGENE of ontology_id
│   ├── uid: b07Ss1o4ARBgKCab                run: ZtABrNf (cellxgene-curate.ipynb)
│   │   itype: Feature                       otype: DataFrame                     
│   │   hash: yH7S_ThvrvVAIQaaW22rTA         ordered_set: False                   
│   │   maximal_set: False                   minimal_set: True                    
│   │   branch: main                         space: all                           
│   │   created_at: 2026-06-04 11:26:56 UTC  created_by: anonymous                
│   └── Features (11)
│       └── name                               dtype                                                   …  …  defau…
assay_ontology_term_id             bionty.ExperimentalFactor.ontology_id                   ✗  ✓  unset 
cell_type_ontology_term_id         bionty.CellType.ontology_id                             ✗  ✓  unset 
development_stage_ontology_term_…  bionty.DevelopmentalStage.ontology_id[source__uid='7J…  ✗  ✓  unset 
disease_ontology_term_id           bionty.Disease.ontology_id                              ✗  ✓  unset 
self_reported_ethnicity_ontology…  bionty.Ethnicity.ontology_id                            ✗  ✓  unset 
sex_ontology_term_id               bionty.Phenotype.ontology_id                            ✗  ✓  unset 
tissue_ontology_term_id            bionty.Tissue.ontology_id|bionty.CellType.ontology_id   ✗  ✓  unset 
donor_id                           str                                                     ✗  ✓  unkno…
is_primary_data                    ULabel                                                  ✗  ✓  unset 
suspension_type                    ULabel                                                  ✗  ✓  unset 
tissue_type                        ULabel                                                  ✗  ✓  unset 
└── uns: uns of CELLxGENE version
    ├── uid: XScK0C2lydmfkCV7                run: ZtABrNf (cellxgene-curate.ipynb)
itype: Feature                       otype: DataFrame                     
hash: FuoqSv5mtnyQ2eZMXNtRHQ         ordered_set: False                   
maximal_set: False                   minimal_set: True                    
branch: main                         space: all                           
created_at: 2026-06-04 11:26:56 UTC  created_by: anonymous                
    └── Features (1)
        └── name                       dtype                        optional  nullable  coerce  default_value
            organism_ontology_term_id  bionty.Organism.ontology_id  ✗         ✓         ✓       unset        

The schema has three components, one for each of the AnnData slots var, obs, and uns:

cxg_schema.slots["var"].describe()
Schema: var of CELLxGENE
├── uid: i2kVNumbCmavZcyO                run: ZtABrNf (cellxgene-curate.ipynb)
itype: Feature                       otype: None                          
hash: smp1Myp5o2dinPxpwzV_yQ         ordered_set: False                   
maximal_set: False                   minimal_set: True                    
branch: main                         space: all                           
created_at: 2026-06-04 11:26:56 UTC  created_by: anonymous                
└── Features (2)
    └── name                dtype                                               optio…  nullab…  coe…  default_val…
        var_index           bionty.Gene.ensembl_gene_id[source__uid='2w43l1YS…  ✗       ✓        ✓     unset       
        feature_is_filter…  bool                                                ✗       ✓        ✓     unset       
cxg_schema.slots["obs"].describe()
Schema: obs of CELLxGENE of ontology_id
├── uid: b07Ss1o4ARBgKCab                run: ZtABrNf (cellxgene-curate.ipynb)
itype: Feature                       otype: DataFrame                     
hash: yH7S_ThvrvVAIQaaW22rTA         ordered_set: False                   
maximal_set: False                   minimal_set: True                    
branch: main                         space: all                           
created_at: 2026-06-04 11:26:56 UTC  created_by: anonymous                
└── Features (11)
    └── name                                dtype                                                   o…  …    defau…
        assay_ontology_term_id              bionty.ExperimentalFactor.ontology_id                   ✗   ✓    unset 
        cell_type_ontology_term_id          bionty.CellType.ontology_id                             ✗   ✓    unset 
        development_stage_ontology_term_id  bionty.DevelopmentalStage.ontology_id[source__uid='7J…  ✗   ✓    unset 
        disease_ontology_term_id            bionty.Disease.ontology_id                              ✗   ✓    unset 
        self_reported_ethnicity_ontology_…  bionty.Ethnicity.ontology_id                            ✗   ✓    unset 
        sex_ontology_term_id                bionty.Phenotype.ontology_id                            ✗   ✓    unset 
        tissue_ontology_term_id             bionty.Tissue.ontology_id|bionty.CellType.ontology_id   ✗   ✓    unset 
        donor_id                            str                                                     ✗   ✓    unkno…
        is_primary_data                     ULabel                                                  ✗   ✓    unset 
        suspension_type                     ULabel                                                  ✗   ✓    unset 
        tissue_type                         ULabel                                                  ✗   ✓    unset 
cxg_schema.slots["uns"].describe()
Schema: uns of CELLxGENE version
├── uid: XScK0C2lydmfkCV7                run: ZtABrNf (cellxgene-curate.ipynb)
itype: Feature                       otype: DataFrame                     
hash: FuoqSv5mtnyQ2eZMXNtRHQ         ordered_set: False                   
maximal_set: False                   minimal_set: True                    
branch: main                         space: all                           
created_at: 2026-06-04 11:26:56 UTC  created_by: anonymous                
└── Features (1)
    └── name                       dtype                        optional  nullable  coerce  default_value
        organism_ontology_term_id  bionty.Organism.ontology_id  ✗         ✓         ✓       unset        

In the following, we will validate a dataset against the CELLxGENE schema and curate it.

Validate and curate metadata

Let’s start with an AnnData object that we would like to curate. We write it to disk so that we can run CZI’s cellxgene-schema CLI tool, which verifies whether an on-disk h5ad dataset adheres to all requirements of CELLxGENE, including the CELLxGENE schema.

adata = ln.examples.datasets.small_dataset3_cellxgene(
    with_obs_typo=True, with_var_typo=True
)
adata.uns["organism_ontology_term_id"] = adata.obs["organism_ontology_term_id"].iloc[0]
adata.obs = adata.obs.drop(columns=["organism_ontology_term_id"])
adata.write_h5ad("small_cxg.h5ad")
adata
AnnData object with n_obs × n_vars = 3 × 3
    obs: 'disease_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'self_reported_ethnicity', 'donor_id', 'is_primary_data', 'suspension_type', 'tissue_type'
    var: 'feature_is_filtered'
    uns: 'title', 'organism_ontology_term_id'
    obsm: 'X_pca'

Initially, the dataset does not pass CZI’s cellxgene-schema validator, so we need to curate it.

!MPLBACKEND=agg uvx cellxgene-schema validate small_cxg.h5ad
INFO:cellxgene_schema:Loading dependencies
INFO:cellxgene_schema:Loading validator modules
INFO:cellxgene_schema.validate:Starting validation...
/home/runner/.local/share/uv/tools/cellxgene-schema/lib/python3.12/site-packages/cellxgene_schema/validate.py:693: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  is_filtered = column[i]
WARNING:cellxgene_schema.validate:WARNING: Dataframe 'var' only has 3 rows. Features SHOULD NOT be filtered from expression matrix.
WARNING:cellxgene_schema.validate:WARNING: Data contains assay(s) that are not represented in the 'suspension_type' schema definition table. Ensure you have selected the most appropriate value for the assay(s) between 'cell', 'nucleus', and 'na'. Please contact [email protected] during submission so that the assay(s) can be added to the schema definition document.
WARNING:cellxgene_schema.validate:WARNING: Validation of raw layer was not performed due to current errors, try again after fixing current errors.
ERROR:cellxgene_schema.validate:ERROR: Add labels error: Column 'cell_type' is a reserved column name of 'obs'. Remove it from h5ad and try again.
ERROR:cellxgene_schema.validate:ERROR: Add labels error: Column 'self_reported_ethnicity' is a reserved column name of 'obs'. Remove it from h5ad and try again.
ERROR:cellxgene_schema.validate:ERROR: Could not infer organism from feature ID 'invalid_ensembl_id' in 'var', make sure it is a valid ID.
ERROR:cellxgene_schema.validate:ERROR: Could not infer organism from feature ID 'invalid_ensembl_id' in 'raw.var', make sure it is a valid ID.
ERROR:cellxgene_schema.validate:ERROR: Dataframe 'obs' is missing column 'assay_ontology_term_id'.
ERROR:cellxgene_schema.validate:ERROR: 'UBERON:0002048XXX' in 'tissue_ontology_term_id' is not a valid ontology term id of 'UBERON, ZFA, FBbt, WBbt'. When 'tissue_type' is 'tissue', 'tissue_ontology_term_id' must be a valid UBERON, ZFA, FBbt, or WBbt term.
ERROR:cellxgene_schema.validate:ERROR: Dataframe 'obs' is missing column 'self_reported_ethnicity_ontology_term_id'.
INFO:cellxgene_schema.validate:Validation complete in 0:00:11.583047 with status is_valid=False
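
When the validator output is long, it can help to pull out only the error lines. The following is a minimal sketch and not part of cellxgene-schema or lamindb: it assumes the log format shown above, where each error line starts with `ERROR:cellxgene_schema.validate:ERROR: `, and the helper name `extract_errors` is hypothetical.

```python
# Hypothetical helper: collect error messages from cellxgene-schema log text.
# Assumes the line prefix seen in the output above; adjust if the format differs.
ERROR_PREFIX = "ERROR:cellxgene_schema.validate:ERROR: "


def extract_errors(log_text: str) -> list:
    """Return the message part of every line that starts with ERROR_PREFIX."""
    return [
        line[len(ERROR_PREFIX):]
        for line in log_text.splitlines()
        if line.startswith(ERROR_PREFIX)
    ]


# Two error lines copied from the output above, plus one INFO line that is ignored.
sample_log = """\
INFO:cellxgene_schema.validate:Starting validation...
ERROR:cellxgene_schema.validate:ERROR: Dataframe 'obs' is missing column 'assay_ontology_term_id'.
ERROR:cellxgene_schema.validate:ERROR: Dataframe 'obs' is missing column 'self_reported_ethnicity_ontology_term_id'.
"""

for message in extract_errors(sample_log):
    print(message)
```

Grouping the messages this way makes it easier to see that the errors fall into a few classes: reserved column names, invalid gene IDs, missing columns, and an invalid tissue term.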

CELLxGENE requires all observations to be annotated. If information for a specific column like disease_ontology_term_id is not available, CELLxGENE requires falling back to default values like “normal” or “unknown”. Let’s save these defaults to the instance using lamindb.examples.cellxgene.save_cellxgene_defaults():

ln.examples.cellxgene.save_cellxgene_defaults()

Now we can start curating the dataset:

curator = ln.curators.AnnDataCurator(adata, cxg_schema)
try:
    curator.validate()
except ln.errors.ValidationError:
    pass

The error shows that invalid genes are present in the dataset. Let’s remove them from both adata and adata.raw:

adata = adata[
    :, ~adata.var.index.isin(curator.slots["var"].cat.non_validated["index"])
].copy()
if adata.raw is not None:
    raw_data = adata.raw.to_adata()
    raw_data = raw_data[
        :, ~raw_data.var.index.isin(curator.slots["var"].cat.non_validated["index"])
    ].copy()
    adata.raw = raw_data

As we’ve subsetted the AnnData object, we have to recreate the AnnDataCurator to validate again:

curator = ln.curators.AnnDataCurator(adata, cxg_schema)
try:
    curator.validate()
except ln.errors.ValidationError as e:
    print(e)
{
    "SCHEMA": {
        "COLUMN_NOT_IN_DATAFRAME": [
            {
                "schema": null,
                "column": "assay_ontology_term_id",
                "check": "column_in_dataframe",
                "error": "column 'assay_ontology_term_id' not in dataframe. Columns in dataframe: ['disease_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'self_reported_ethnicity', 'donor_id', 'is_primary_data', 'suspension_type', 'tissue_type']"
            },
            {
                "schema": null,
                "column": "cell_type_ontology_term_id",
                "check": "column_in_dataframe",
                "error": "column 'cell_type_ontology_term_id' not in dataframe. Columns in dataframe: ['disease_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'self_reported_ethnicity', 'donor_id', 'is_primary_data', 'suspension_type', 'tissue_type']"
            },
            {
                "schema": null,
                "column": "self_reported_ethnicity_ontology_term_id",
                "check": "column_in_dataframe",
                "error": "column 'self_reported_ethnicity_ontology_term_id' not in dataframe. Columns in dataframe: ['disease_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'cell_type', 'self_reported_ethnicity', 'donor_id', 'is_primary_data', 'suspension_type', 'tissue_type']"
            }
        ]
    }
}
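
Because the printed error is structured, the missing columns can be listed programmatically. A minimal sketch, assuming the message is JSON text with the `SCHEMA` / `COLUMN_NOT_IN_DATAFRAME` layout shown above (that `str(e)` is always valid JSON is an assumption, and `missing_columns` is a hypothetical helper name). The long `error` strings are left out of the sample for brevity.

```python
import json


def missing_columns(error_text: str) -> list:
    """List the columns reported under SCHEMA -> COLUMN_NOT_IN_DATAFRAME."""
    payload = json.loads(error_text)
    entries = payload.get("SCHEMA", {}).get("COLUMN_NOT_IN_DATAFRAME", [])
    return [entry["column"] for entry in entries]


# Trimmed copy of the output above (the "error" fields are omitted).
sample_error = """
{
    "SCHEMA": {
        "COLUMN_NOT_IN_DATAFRAME": [
            {"schema": null, "column": "assay_ontology_term_id", "check": "column_in_dataframe"},
            {"schema": null, "column": "cell_type_ontology_term_id", "check": "column_in_dataframe"},
            {"schema": null, "column": "self_reported_ethnicity_ontology_term_id", "check": "column_in_dataframe"}
        ]
    }
}
"""

print(missing_columns(sample_error))
```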

The validation error tells us that we’re missing several columns. The reason is simple: CELLxGENE requires all obs metadata to be stored as ontology IDs in entity_ontology_term_id columns. Therefore, we first translate the name-based obs columns into the required format.
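
Conceptually, this translation is a lookup from names to ontology IDs. The sketch below illustrates the idea with a plain dictionary rather than bionty: the mapping reuses the three name/ID pairs that appear in this guide's artifact description, and `to_ontology_ids` is a hypothetical helper. The guide itself performs this step with bionty's `standardize()`.

```python
# Name -> ontology ID pairs taken from the artifact description in this guide.
NAME_TO_ONTOLOGY_ID = {
    "T cell": "CL:0000084",
    "B cell": "CL:0000236",
    "South Asian": "HANCESTRO:0848",
}


def to_ontology_ids(names):
    """Translate names to ontology IDs, keeping the input order and length.

    Names without a mapping are passed through unchanged and also reported
    separately, so they can be inspected instead of silently dropped.
    """
    translated = [NAME_TO_ONTOLOGY_ID.get(name, name) for name in names]
    unmapped = sorted({name for name in names if name not in NAME_TO_ONTOLOGY_ID})
    return translated, unmapped


ids, unmapped_names = to_ontology_ids(["T cell", "B cell", "B cell"])
print(ids)
print(unmapped_names)
```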

adata.obs
          disease_ontology_term_id  development_stage_ontology_term_id  sex_ontology_term_id  tissue_ontology_term_id  cell_type  self_reported_ethnicity  donor_id  is_primary_data  suspension_type  tissue_type
barcode1  MONDO:0004975             unknown                             PATO:0000383          UBERON:0002048XXX        T cell     South Asian              -1        False            cell             tissue
barcode2  MONDO:0004980             unknown                             PATO:0000384          UBERON:0002048XXX        B cell     South Asian              1         False            cell             tissue
barcode3  MONDO:0004980             unknown                             unknown               UBERON:0000948           B cell     South Asian              2         False            cell             tissue
# Add missing assay column
adata.obs["assay_ontology_term_id"] = "EFO:0005684"

def get_source_from_feature(feature: ln.Feature) -> bt.Source | None:
    if match := re.search(r"source__uid='([^']+)'", feature.dtype_as_str):
        return bt.Source.get(uid=match.group(1))
    return None

# Add `entity_ontology_term_id` columns by translating names to ontology IDs
standardization_map = {
    "self_reported_ethnicity": (
        bt.Ethnicity,
        "self_reported_ethnicity_ontology_term_id",
    ),
    "cell_type": (bt.CellType, "cell_type_ontology_term_id"),
}

for col, (bt_class, new_col) in standardization_map.items():
    feature = cxg_schema.slots["obs"].features.filter(name=new_col).one()
    source = get_source_from_feature(feature)

    adata.obs[new_col] = bt_class.standardize(
        adata.obs[col], field="name", return_field="ontology_id", source=source
    )
# Drop the name columns because CELLxGENE disallows them
adata.obs = adata.obs.drop(columns=list(standardization_map.keys()))
! found 1 name in public source: ['South Asian']
  please add corresponding Ethnicity records via: `.from_values(['South Asian'])`
! found 2 names in public source: ['T cell', 'B cell']
  please add corresponding CellType records via: `.from_values(['T cell', 'B cell'])`
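
The `get_source_from_feature` helper above relies on a regular expression to read a source uid out of a feature's dtype string. The following is a standalone sketch of just that parsing step, with no database access. The uid `abc123` is a made-up placeholder (the real uids are truncated in the output above), and `extract_source_uid` is a hypothetical name.

```python
import re
from typing import Optional

# Same pattern as in get_source_from_feature: capture the text between the quotes.
SOURCE_UID_PATTERN = re.compile(r"source__uid='([^']+)'")


def extract_source_uid(dtype_str: str) -> Optional[str]:
    """Return the uid from a source__uid='...' filter, or None if there is none."""
    match = SOURCE_UID_PATTERN.search(dtype_str)
    return match.group(1) if match else None


# A dtype pinned to a source (placeholder uid) and a dtype without a filter.
print(extract_source_uid("bionty.DevelopmentalStage.ontology_id[source__uid='abc123']"))
print(extract_source_uid("bionty.CellType.ontology_id"))
```

Returning `None` for dtypes without a source filter mirrors the helper above, which then lets `standardize()` fall back to whatever source it uses when none is passed.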
try:
    curator.validate()
except ln.errors.ValidationError:
    pass
! ontology ID BFO:0000020 not found in DataFrame

An error is shown for the tissue label “UBERON:0002048XXX” because it ends with three extra “X” characters, which is a typo. Let’s fix it:

adata.obs["tissue_ontology_term_id"] = adata.obs[
    "tissue_ontology_term_id"
].cat.rename_categories({"UBERON:0002048XXX": "UBERON:0002048"})
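
Typos like this one can often be caught before validation with a simple shape check. The following is a rough sketch, assuming term IDs look like `PREFIX:digits` (as all IDs in this guide do) and treating the literal `unknown` as allowed. It only checks the form of an ID, not whether the term exists in the ontology, and `find_malformed_term_ids` is a hypothetical helper.

```python
import re

# Assumed shape of an ontology term ID: letters, a colon, then digits only.
TERM_ID_PATTERN = re.compile(r"[A-Za-z]+:\d+")
# Literal values that are accepted in place of a term ID in this guide's data.
ALLOWED_LITERALS = {"unknown"}


def find_malformed_term_ids(values):
    """Return the sorted, de-duplicated values that do not look like term IDs."""
    return sorted(
        {
            value
            for value in values
            if value not in ALLOWED_LITERALS
            and TERM_ID_PATTERN.fullmatch(value) is None
        }
    )


# The tissue column before the fix, as shown in the obs table above.
tissue_ids = ["UBERON:0002048XXX", "UBERON:0002048XXX", "UBERON:0000948"]
print(find_malformed_term_ids(tissue_ids))
```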

Now validation should pass.

# recreate the AnnDataCurator to refresh cached categoricals
curator = ln.curators.AnnDataCurator(adata, cxg_schema)
curator.validate()

Save artifact

We can now save the curated artifact:

artifact = curator.save_artifact(key="examples/dataset-curated-against-cxg.h5ad")
 returning schema with same hash: Schema(uid='b07Ss1o4ARBgKCab', is_type=False, name='obs of CELLxGENE of ontology_id', description=None, n_members=11, coerce=True, flexible=False, itype='Feature', otype='DataFrame', hash='yH7S_ThvrvVAIQaaW22rTA', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, created_on_id=1, space_id=1, created_by_id=1, run_id=1, type_id=None, created_at=2026-06-04 11:26:56 UTC, is_locked=False)
 returning schema with same hash: Schema(uid='XScK0C2lydmfkCV7', is_type=False, name='uns of CELLxGENE version', description=None, n_members=1, coerce=True, flexible=False, itype='Feature', otype='DataFrame', hash='FuoqSv5mtnyQ2eZMXNtRHQ', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, created_on_id=1, space_id=1, created_by_id=1, run_id=1, type_id=None, created_at=2026-06-04 11:26:56 UTC, is_locked=False)
artifact.describe()
Artifact: examples/dataset-curated-against-cxg.h5ad (0000)
├── uid: U97fYHacIwDQGkxA0000            run: ZtABrNf (cellxgene-curate.ipynb)   
kind: dataset                        otype: AnnData                          
hash: 0OJQXQlQ3k-2FEhfHNwgIQ         size: 41.8 KB                           
branch: main                         space: all                              
created_at: 2026-06-04 11:30:17 UTC  created_by: anonymous                   
n_observations: 3                    schema: CELLxGENE AnnData of ontology_id
├── storage/path: 
/home/runner/work/cellxgene-lamin/cellxgene-lamin/docs/test-cellxgene-curate/.lamindb/U97fYHacIwDQGkxA0000.h5ad
├── Dataset features
├── var (2)                                                                                                    
│   feature_is_filtered            bool                                                                        
│   var_index                      bionty.Gene.ensembl_gene_id[source…  ENSG00000000419, ENSG00000139618       
├── obs (11)                                                                                                   
│   assay_ontology_term_id         bionty.ExperimentalFactor.ontology…  EFO:0005684                            
│   cell_type_ontology_term_id     bionty.CellType.ontology_id          CL:0000084, CL:0000236                 
│   development_stage_ontology_t…  bionty.DevelopmentalStage.ontology…  unknown                                
│   disease_ontology_term_id       bionty.Disease.ontology_id           MONDO:0004975, MONDO:0004980           
│   donor_id                       str                                                                         
│   is_primary_data                ULabel                                                                      
│   self_reported_ethnicity_onto…  bionty.Ethnicity.ontology_id         HANCESTRO:0848                         
│   sex_ontology_term_id           bionty.Phenotype.ontology_id         PATO:0000383, PATO:0000384, unknown    
│   suspension_type                ULabel                               cell                                   
│   tissue_ontology_term_id        bionty.Tissue.ontology_id|bionty.C…  UBERON:0000948, UBERON:0002048         
│   tissue_type                    ULabel                               tissue                                 
└── uns (1)                                                                                                    
    organism_ontology_term_id      bionty.Organism.ontology_id          NCBITaxon:9606                         
└── Labels
    └── .ulabels                       ULabel                               tissue, cell                           
        .organisms                     bionty.Organism                      human                                  
        .genes                         bionty.Gene                          DPM1, BRCA2                            
        .tissues                       bionty.Tissue                        heart, lung                            
        .cell_types                    bionty.CellType                      T cell, B cell                         
        .diseases                      bionty.Disease                       Alzheimer disease, atopic eczema       
        .phenotypes                    bionty.Phenotype                     unknown, female, male                  
        .experimental_factors          bionty.ExperimentalFactor            RNA-seq of coding RNA from single cells
        .developmental_stages          bionty.DevelopmentalStage            unknown                                
        .ethnicities                   bionty.Ethnicity                     South Asian                            

Validating using cellxgene-schema

To validate the now curated AnnData object using CZI’s cellxgene-schema CLI tool, we need to write the AnnData object to disk.

adata.write("small_cxg_curated.h5ad")
# %%bash -e
!MPLBACKEND=agg uvx cellxgene-schema validate small_cxg_curated.h5ad
INFO:cellxgene_schema:Loading dependencies
INFO:cellxgene_schema:Loading validator modules
INFO:cellxgene_schema.validate:Starting validation...
/home/runner/.local/share/uv/tools/cellxgene-schema/lib/python3.12/site-packages/cellxgene_schema/validate.py:693: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  is_filtered = column[i]
WARNING:cellxgene_schema.validate:WARNING: Dataframe 'var' only has 2 rows. Features SHOULD NOT be filtered from expression matrix.
WARNING:cellxgene_schema.validate:WARNING: Data contains assay(s) that are not represented in the 'suspension_type' schema definition table. Ensure you have selected the most appropriate value for the assay(s) between 'cell', 'nucleus', and 'na'. Please contact [email protected] during submission so that the assay(s) can be added to the schema definition document.
INFO:cellxgene_schema.validate:Validation complete in 0:00:10.762258 with status is_valid=True
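
To run the same check from a Python script rather than a notebook cell, the CLI call can be wrapped with `subprocess`. A minimal sketch, assuming `uvx` is on the PATH and that the final log line contains `is_valid=True` or `is_valid=False` as in the output above. `run_cellxgene_validate` and `parse_is_valid` are hypothetical helper names, not part of cellxgene-schema.

```python
import os
import re
import shutil
import subprocess
from typing import Optional


def parse_is_valid(log_text: str) -> Optional[bool]:
    """Read the is_valid flag from validator output; None if it is absent."""
    match = re.search(r"is_valid=(True|False)", log_text)
    return None if match is None else match.group(1) == "True"


def run_cellxgene_validate(h5ad_path: str) -> Optional[bool]:
    """Run `uvx cellxgene-schema validate` and return the parsed is_valid flag."""
    if shutil.which("uvx") is None:
        raise RuntimeError("uvx not found on PATH")
    result = subprocess.run(
        ["uvx", "cellxgene-schema", "validate", h5ad_path],
        env={**os.environ, "MPLBACKEND": "agg"},  # same backend setting as above
        capture_output=True,
        text=True,
    )
    # Log output may be written to stderr, so inspect both streams.
    return parse_is_valid(result.stdout + result.stderr)


# Parsing the last line of the output above:
print(
    parse_is_valid(
        "INFO:cellxgene_schema.validate:Validation complete in 0:00:10.762258 "
        "with status is_valid=True"
    )
)
```

Under these assumptions, `run_cellxgene_validate("small_cxg_curated.h5ad")` would report whether the curated file passed. Parsing the status line avoids relying on the tool's exit code, which this guide does not document.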

Note

The CELLxGENE Schema created above is designed to validate all metadata for adherence to ontologies. It does not reimplement all rules of the CELLxGENE schema, so we recommend also running the cellxgene-schema CLI tool if full adherence beyond metadata is required.