Curate `AnnData` based on the CELLxGENE schema¶

This guide shows how to curate an AnnData object against the CELLxGENE schema v5.2.0.

Summary

To validate and annotate datasets that adhere to a CELLxGENE schema, call

!cellxgene-schema --version # should print 5.2.0
!cellxgene-schema validate small_cxg_curated.h5ad  # validation

using a shell, and then

schema = ln.examples.cellxgene.get_cxg_schema(version="5.2.0")
ln.Artifact("…", schema=schema).save()  # annotation (re-validates ontologies, but not some other details)

# pip install lamindb pronto
# cellxgene-schema has pinned dependencies. Therefore we recommend installing it into a separate environment using `uv` or `pipx`
# uv tool install cellxgene-schema==5.2.3

!lamin init --storage ./test-cellxgene-curate --modules bionty

import lamindb as ln
import bionty as bt

ln.track()

The CELLxGENE schema¶

As a first step, we generate the CELLxGENE schema for the chosen version, which also adds any missing sources to the instance:

cxg_schema = ln.examples.cellxgene.get_cxg_schema("5.2.0")

cxg_schema.describe()

Schema: AnnData of CELLxGENE version 5.2.0 for human of ontology_id
├── uid: ZWWdLt2c3ymsGye2                run: knj8qVU (cellxgene-curate.ipynb)
│   itype: None                          otype: AnnData                       
│   hash: 7VzLvJr9Xw3xqHvEyYj7Dg         ordered_set: False                   
│   maximal_set: False                   minimal_set: True                    
│   branch: main                         space: all                           
│   created_at: 2025-12-03 17:15:19 UTC  created_by: testuser1                
├── var: var of CELLxGENE version 5.2.0
│   ├── uid: wH8OZ6EfQSnNG7dX                run: knj8qVU (cellxgene-curate.ipynb)
│   │   itype: Feature                       otype: None                          
│   │   hash: saVD59g52s98U-UcCSfE9A         ordered_set: False                   
│   │   maximal_set: False                   minimal_set: True                    
│   │   branch: main                         space: all                           
│   │   created_at: 2025-12-03 17:15:18 UTC  created_by: testuser1                
│   └── Features (2)
│       └── name              dtype                                             opti…  null…  coerce_d…  default_v…
│           var_index         bionty.Gene.ensembl_gene_id[source__uid='5dmX95…  ✗      ✓      ✓          unset     
│           feature_is_filt…  bool                                              ✗      ✓      ✓          unset     
└── obs: obs of CELLxGENE version 5.2.0 for human of ontology_id
    ├── uid: F2tCMdg4oD6HB2tl                run: knj8qVU (cellxgene-curate.ipynb)
    │   itype: Feature                       otype: DataFrame                     
    │   hash: Lar6-yl92gHdXslQTqKDAA         ordered_set: False                   
    │   maximal_set: False                   minimal_set: True                    
    │   branch: main                         space: all                           
    │   created_at: 2025-12-03 17:15:19 UTC  created_by: testuser1                
    └── Features (12)
        └── name                              dtype                                                      coe…  def…
            assay_ontology_term_id            bionty.ExperimentalFactor.ontology_id[source__uid='2…      ✓     uns…
            cell_type_ontology_term_id        bionty.CellType.ontology_id[source__uid='3Uw2Va7a']        ✓     uns…
            development_stage_ontology_term…  bionty.DevelopmentalStage.ontology_id[source__uid='1…      ✓     uns…
            disease_ontology_term_id          bionty.Disease.ontology_id[source__uid='4a3ejKuf']         ✓     uns…
            self_reported_ethnicity_ontolog…  bionty.Ethnicity.ontology_id[source__uid='MJRqduf9']       ✓     uns…
            sex_ontology_term_id              bionty.Phenotype.ontology_id[source__uid='3ox8Ekgl']       ✓     uns…
            tissue_ontology_term_id           bionty.Tissue.ontology_id[source__uid='MUtAGdL4']          ✓     uns…
            organism_ontology_term_id         bionty.Organism.ontology_id[source__uid='4tsksCMX']        ✓     uns…
            donor_id                          str                                                        ✓     unk…
            is_primary_data                   ULabel                                                     ✓     uns…
            suspension_type                   ULabel                                                     ✓     uns…
            tissue_type                       ULabel                                                     ✓     uns…

The schema has two components:

cxg_schema.slots["var"].describe()

cxg_schema.slots["obs"].describe()

Schema: obs of CELLxGENE version 5.2.0 for human of ontology_id
├── uid: F2tCMdg4oD6HB2tl                run: knj8qVU (cellxgene-curate.ipynb)
│   itype: Feature                       otype: DataFrame                     
│   hash: Lar6-yl92gHdXslQTqKDAA         ordered_set: False                   
│   maximal_set: False                   minimal_set: True                    
│   branch: main                         space: all                           
│   created_at: 2025-12-03 17:15:19 UTC  created_by: testuser1                
└── Features (12)
    └── name                              dtype                                                     …  coe…  defau…
        assay_ontology_term_id            bionty.ExperimentalFactor.ontology_id[source__uid='2v…    ✓  ✓     unset 
        cell_type_ontology_term_id        bionty.CellType.ontology_id[source__uid='3Uw2Va7a']       ✓  ✓     unset 
        development_stage_ontology_term…  bionty.DevelopmentalStage.ontology_id[source__uid='1G…    ✓  ✓     unset 
        disease_ontology_term_id          bionty.Disease.ontology_id[source__uid='4a3ejKuf']        ✓  ✓     unset 
        self_reported_ethnicity_ontolog…  bionty.Ethnicity.ontology_id[source__uid='MJRqduf9']      ✓  ✓     unset 
        sex_ontology_term_id              bionty.Phenotype.ontology_id[source__uid='3ox8Ekgl']      ✓  ✓     unset 
        tissue_ontology_term_id           bionty.Tissue.ontology_id[source__uid='MUtAGdL4']         ✓  ✓     unset 
        organism_ontology_term_id         bionty.Organism.ontology_id[source__uid='4tsksCMX']       ✓  ✓     unset 
        donor_id                          str                                                       ✓  ✓     unkno…
        is_primary_data                   ULabel                                                    ✓  ✓     unset 
        suspension_type                   ULabel                                                    ✓  ✓     unset 
        tissue_type                       ULabel                                                    ✓  ✓     unset

In the following, we will validate a dataset against the CELLxGENE schema and curate it.

Validate and curate metadata¶

Let’s start with an AnnData object that we would like to curate. We write it to disk so that we can run CZI’s cellxgene-schema CLI tool, which verifies whether an on-disk h5ad dataset adheres to all requirements of CELLxGENE, including the CELLxGENE schema.

adata = ln.examples.datasets.small_dataset3_cellxgene(
    with_obs_typo=True, with_var_typo=True
)
adata.write_h5ad("small_cxg.h5ad")
adata

Initially, CZI’s cellxgene-schema validator fails, so we need to curate the dataset.

!MPLBACKEND=agg uvx cellxgene-schema validate small_cxg.h5ad

CELLxGENE requires all observations to be annotated. If information for a specific column like disease_ontology_term_id is not available, CELLxGENE requires falling back to default values like “normal” or “unknown”. Let’s save these defaults to the instance using lamindb.examples.cellxgene.save_cellxgene_defaults():

ln.examples.cellxgene.save_cellxgene_defaults()
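
To illustrate what falling back to defaults means at the level of the obs table, here is a minimal pure-Python sketch. The `DEFAULTS` mapping and the `fill_defaults` helper are hypothetical and only illustrate the idea; the actual default values are defined by the CELLxGENE schema and registered by `save_cellxgene_defaults()`.

```python
# Hypothetical defaults for illustration only; the real values are defined
# by the CELLxGENE schema, not by this sketch.
DEFAULTS = {
    "disease_ontology_term_id": "normal",
    "sex_ontology_term_id": "unknown",
}


def fill_defaults(rows, defaults=DEFAULTS):
    """Return copies of `rows` with missing or empty values replaced by defaults."""
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        for column, default in defaults.items():
            if new_row.get(column) in (None, ""):
                new_row[column] = default
        filled.append(new_row)
    return filled


# Two observations with incomplete annotations
obs_rows = [
    {"disease_ontology_term_id": "MONDO:0004975", "sex_ontology_term_id": None},
    {"sex_ontology_term_id": "PATO:0000383"},
]
filled_rows = fill_defaults(obs_rows)
```

With a real AnnData object, the same idea would be applied to the columns of `adata.obs` rather than to a list of dictionaries.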

Now we can start curating the dataset:

curator = ln.curators.AnnDataCurator(adata, cxg_schema)
try:
    curator.validate()
except ln.errors.ValidationError:
    pass

The error shows invalid genes are present in the dataset. Let’s remove them from both the adata and adata.raw objects:

adata = adata[
    :, ~adata.var.index.isin(curator.slots["var"].cat.non_validated["index"])
].copy()
if adata.raw is not None:
    raw_data = adata.raw.to_adata()
    raw_data = raw_data[
        :, ~raw_data.var.index.isin(curator.slots["var"].cat.non_validated["index"])
    ].copy()
    adata.raw = raw_data
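
The subsetting above boils down to building a boolean mask over the gene index and applying the same mask to the columns of the count matrix. A minimal pure-Python sketch of that logic, assuming a hypothetical invalid identifier `NOT_A_GENE_ID` and a toy count matrix:

```python
# Gene identifiers in column order; the last one is a hypothetical invalid ID
gene_ids = ["ENSG00000000419", "ENSG00000139618", "NOT_A_GENE_ID"]
# Identifiers that failed validation (stand-in for the curator's non-validated values)
non_validated = {"NOT_A_GENE_ID"}
# Toy count matrix: one row per observation, one column per gene
counts = [
    [1, 2, 3],
    [4, 5, 6],
]

# True for genes to keep; mirrors `~adata.var.index.isin(...)`
keep = [gene not in non_validated for gene in gene_ids]

kept_genes = [gene for gene, flag in zip(gene_ids, keep) if flag]
kept_counts = [[value for value, flag in zip(row, keep) if flag] for row in counts]
```

Because `adata.raw` carries its own var index, the same mask has to be computed and applied there separately, which is what the `if adata.raw is not None` branch does.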

As we’ve subsetted the AnnData object, we have to recreate the AnnDataCurator to validate again:

curator = ln.curators.AnnDataCurator(adata, cxg_schema)
try:
    curator.validate()
except ln.errors.ValidationError as e:
    print(e)

The validation error tells us that we’re missing several columns. The reason is simple: CELLxGENE requires all obs metadata to be stored as ontology IDs in entity_ontology_term_id columns. Therefore, we first translate the name-based obs columns into the required format.

adata.obs

	disease_ontology_term_id	development_stage_ontology_term_id	sex_ontology_term_id	tissue_ontology_term_id	cell_type	self_reported_ethnicity	donor_id	is_primary_data	suspension_type	tissue_type	organism_ontology_term_id
barcode1	MONDO:0004975	unknown	PATO:0000383	UBERON:0002048XXX	T cell	South Asian	-1	False	cell	tissue	NCBITaxon:9606
barcode2	MONDO:0004980	unknown	PATO:0000384	UBERON:0002048XXX	B cell	South Asian	1	False	cell	tissue	NCBITaxon:9606
barcode3	MONDO:0004980	unknown	unknown	UBERON:0000948	B cell	South Asian	2	False	cell	tissue	NCBITaxon:9606
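
Which columns are missing can be worked out by comparing the 12 obs features of the schema with the columns present in the table above. The following sketch does this with plain sets; the variable names are illustrative and not part of the lamindb API.

```python
# The 12 obs features required by the schema
required_columns = {
    "assay_ontology_term_id",
    "cell_type_ontology_term_id",
    "development_stage_ontology_term_id",
    "disease_ontology_term_id",
    "self_reported_ethnicity_ontology_term_id",
    "sex_ontology_term_id",
    "tissue_ontology_term_id",
    "organism_ontology_term_id",
    "donor_id",
    "is_primary_data",
    "suspension_type",
    "tissue_type",
}

# Columns currently present in adata.obs
present_columns = {
    "disease_ontology_term_id",
    "development_stage_ontology_term_id",
    "sex_ontology_term_id",
    "tissue_ontology_term_id",
    "cell_type",
    "self_reported_ethnicity",
    "donor_id",
    "is_primary_data",
    "suspension_type",
    "tissue_type",
    "organism_ontology_term_id",
}

# Required by the schema but absent from the dataset
missing_columns = sorted(required_columns - present_columns)
# Present in the dataset but not part of the schema (the name-based columns)
extra_columns = sorted(present_columns - required_columns)
```

The two name-based columns in `extra_columns` are the ones to translate into ontology IDs, while the assay column has to be added from scratch.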

# Add missing assay column
adata.obs["assay_ontology_term_id"] = "EFO:0005684"
# Add `entity_ontology_term_id` columns by translating names to ontology IDs
standardization_map = {
    "self_reported_ethnicity": (
        bt.Ethnicity,
        "self_reported_ethnicity_ontology_term_id",
    ),
    "cell_type": (bt.CellType, "cell_type_ontology_term_id"),
}

for col, (bt_class, new_col) in standardization_map.items():
    adata.obs[new_col] = bt_class.standardize(
        adata.obs[col], field="name", return_field="ontology_id"
    )
# Drop the name columns because CELLxGENE disallows them
adata.obs = adata.obs.drop(columns=list(standardization_map.keys()))

try:
    curator.validate()
except ln.errors.ValidationError:
    pass
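
Conceptually, the name-to-ID translation performed above is a lookup from a label to its ontology ID. The sketch below shows the idea with a hard-coded dictionary; it is not how bionty implements `standardize`, and leaving unmatched names unchanged is a choice made for this sketch.

```python
# Cell type names and their Cell Ontology IDs for the two labels in this dataset
CELL_TYPE_NAME_TO_ID = {
    "T cell": "CL:0000084",
    "B cell": "CL:0000236",
}


def names_to_ontology_ids(names, lookup):
    """Map each name to its ontology ID; names without a match are kept as-is."""
    return [lookup.get(name, name) for name in names]


cell_type_names = ["T cell", "B cell", "B cell"]
cell_type_ids = names_to_ontology_ids(cell_type_names, CELL_TYPE_NAME_TO_ID)
```

In the guide itself, the lookup table comes from the ontology source registered in the instance rather than from a hand-written dictionary.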

An error is shown for the tissue label “UBERON:0002048XXX” because it contains a trailing “XXX”, which is a typo. Let’s fix it:

adata.obs["tissue_ontology_term_id"] = adata.obs[
    "tissue_ontology_term_id"
].cat.rename_categories({"UBERON:0002048XXX": "UBERON:0002048"})
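
Typos of this kind can often be spotted before running the full validation with a simple shape check on the identifiers. The following sketch assumes that ontology IDs follow a `PREFIX:digits` pattern and that `unknown` is an allowed literal; both are simplifying assumptions, and a shape check does not replace validation against the ontology itself.

```python
import re

# Assumed shape of an ontology ID: letters (or underscores), a colon, then digits
ONTOLOGY_ID_PATTERN = re.compile(r"[A-Za-z_]+:\d+")
# Literal values accepted without matching the pattern (an assumption of this sketch)
ALLOWED_LITERALS = {"unknown"}


def find_malformed_ids(values):
    """Return the sorted unique values that do not look like ontology IDs."""
    return sorted(
        {
            value
            for value in values
            if value not in ALLOWED_LITERALS
            and not ONTOLOGY_ID_PATTERN.fullmatch(value)
        }
    )


tissue_ids = ["UBERON:0002048XXX", "UBERON:0002048XXX", "UBERON:0000948"]
malformed = find_malformed_ids(tissue_ids)

# Apply the same correction as above using a plain mapping
corrections = {"UBERON:0002048XXX": "UBERON:0002048"}
fixed_tissue_ids = [corrections.get(value, value) for value in tissue_ids]
```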

Now validation should pass.

# recreate the AnnDataCurator to refresh cached categoricals
curator = ln.curators.AnnDataCurator(adata, cxg_schema)
curator.validate()

Save artifact¶

We can now save the curated artifact:

artifact = curator.save_artifact(key="examples/dataset-curated-against-cxg.h5ad")

artifact.describe()

Artifact: examples/dataset-curated-against-cxg.h5ad (0000)
├── uid: 9nuz4tLKg72C7poR0000            run: knj8qVU (cellxgene-curate.ipynb)
│   kind: dataset                        otype: AnnData                       
│   hash: 2L22Zlwons3ahouHdI44RQ         size: 43.4 KB                        
│   branch: main                         space: all                           
│   created_at: 2025-12-03 17:17:37 UTC  created_by: testuser1                
│   n_observations: 3                                                         
├── storage/path: 
│   /home/runner/work/cellxgene-lamin/cellxgene-lamin/docs/test-cellxgene-curate/.lamindb/9nuz4tLKg72C7poR0000.h5ad
├── Dataset features
│   ├── var (2)                                                                                                    
│   │   var_index                       bionty.Gene.ensembl_gene_id[sour…  ENSG00000000419, ENSG00000139618        
│   │   feature_is_filtered             bool                                                                       
│   └── obs (12)                                                                                                   
│       assay_ontology_term_id          bionty.ExperimentalFactor.ontolo…  EFO:0005684                             
│       cell_type_ontology_term_id      bionty.CellType.ontology_id[sour…  CL:0000084, CL:0000236                  
│       development_stage_ontology_te…  bionty.DevelopmentalStage.ontolo…  unknown                                 
│       disease_ontology_term_id        bionty.Disease.ontology_id[sourc…  MONDO:0004975, MONDO:0004980            
│       organism_ontology_term_id       bionty.Organism.ontology_id[sour…  NCBITaxon:9606                          
│       self_reported_ethnicity_ontol…  bionty.Ethnicity.ontology_id[sou…  HANCESTRO:0006                          
│       sex_ontology_term_id            bionty.Phenotype.ontology_id[sou…  PATO:0000383, PATO:0000384, unknown     
│       suspension_type                 ULabel                             cell                                    
│       tissue_ontology_term_id         bionty.Tissue.ontology_id[source…  UBERON:0000948, UBERON:0002048          
│       tissue_type                     ULabel                             tissue                                  
│       donor_id                        str                                                                        
│       is_primary_data                 ULabel                                                                     
└── Labels
    └── .ulabels                        ULabel                             tissue, cell                            
        .organisms                      bionty.Organism                    human                                   
        .genes                          bionty.Gene                        DPM1, BRCA2                             
        .tissues                        bionty.Tissue                      heart, lung                             
        .cell_types                     bionty.CellType                    T cell, B cell                          
        .diseases                       bionty.Disease                     Alzheimer disease, atopic eczema        
        .phenotypes                     bionty.Phenotype                   unknown, female, male                   
        .experimental_factors           bionty.ExperimentalFactor          RNA-seq of coding RNA from single cells 
        .developmental_stages           bionty.DevelopmentalStage          unknown                                 
        .ethnicities                    bionty.Ethnicity                   South Asian

Validating using cellxgene-schema¶

To validate the now curated AnnData object using CZI’s cellxgene-schema CLI tool, we need to write the AnnData object to disk.

adata.write("small_cxg_curated.h5ad")

# %%bash -e
!MPLBACKEND=agg uvx cellxgene-schema validate small_cxg_curated.h5ad
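
Outside a notebook, the same command can be run from a Python script. The sketch below wraps it with `subprocess`; the helper names `build_validate_command` and `run_validation` are hypothetical, and treating a non-zero exit code as a failed validation is an assumption about the CLI rather than something this guide states.

```python
import os
import shutil
import subprocess


def build_validate_command(h5ad_path, use_uvx=True):
    """Build the argument list for `cellxgene-schema validate <path>`."""
    launcher = ["uvx", "cellxgene-schema"] if use_uvx else ["cellxgene-schema"]
    return launcher + ["validate", str(h5ad_path)]


def run_validation(h5ad_path):
    """Run the validator and return the completed process for inspection."""
    # Prefer uvx if it is on PATH, otherwise call the CLI directly
    use_uvx = shutil.which("uvx") is not None
    # Same headless matplotlib backend as in the shell command above
    env = {**os.environ, "MPLBACKEND": "agg"}
    return subprocess.run(
        build_validate_command(h5ad_path, use_uvx=use_uvx),
        env=env,
        capture_output=True,
        text=True,
        check=False,  # inspect the return code instead of raising
    )


command = build_validate_command("small_cxg_curated.h5ad")
```

Calling `run_validation("small_cxg_curated.h5ad")` returns a `CompletedProcess` whose `returncode`, `stdout`, and `stderr` can be inspected. If neither `uvx` nor `cellxgene-schema` is installed, `subprocess.run` raises `FileNotFoundError`.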

Note

The CELLxGENE schema used in this guide is designed to validate all metadata for adherence to ontologies. It does not reimplement every rule of the CELLxGENE schema, so we recommend also running the cellxgene-schema CLI if full adherence beyond metadata is required.
