##### Validate & standardize datasets

Data curation with LaminDB ensures your datasets are **validated** and
**queryable** through **annotation**.

Curating a dataset with LaminDB means three things:

* **Validate** that the dataset matches a desired schema.

* **Standardize** the dataset (e.g., by fixing typos, mapping
  synonyms) or update registries if validation fails.

* **Annotate** the dataset by linking it against metadata entities so
  that it becomes queryable.
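
To make the three steps concrete before touching LaminDB, here is a minimal plain-Python sketch. The registry contents, the synonym table, and the function names are hypothetical; they only mimic the idea and are not LaminDB's implementation.

```python
# Toy registry of valid cell type names and known synonyms (hypothetical data).
REGISTRY = {"B cell", "T cell"}
SYNONYMS = {"B-cell": "B cell", "T-cell": "T cell"}


def validate(values):
    """Return the values that are not found in the registry, sorted."""
    return sorted(set(values) - REGISTRY)


def standardize(values):
    """Map known synonyms to their registry name; leave other values untouched."""
    return [SYNONYMS.get(value, value) for value in values]


def annotate(values):
    """Collect the registry entries that a dataset can be linked against."""
    return sorted(set(values) & REGISTRY)


column = ["B-cell", "T cell", "B cell"]
non_validated = validate(column)  # the synonym does not validate
column = standardize(column)      # fix it by mapping the synonym
labels = annotate(column)         # labels the dataset can be queried by
```

After standardization, `validate(column)` returns an empty list and the dataset can be annotated with both labels.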

In this guide we'll curate common data structures. Here is a guide for
the underlying low-level API.

Note: If you know either "pydantic" or "pandera", here is an FAQ that
compares LaminDB with both of these tools.

 # pip install lamindb
 !lamin init --storage ./test-curate --modules bionty

 import lamindb as ln

 ln.track()

#### Schema design patterns

A "Schema" in LaminDB is a specification that defines the expected
structure, data types, and validation rules for a dataset. It is
similar to "pydantic.Model" for dictionaries and to "pandera.Schema"
and "pyarrow.lib.Schema" for tables, but supports more complicated
data structures.

Schemas ensure data consistency by defining:

* What "Feature"s (dimensions) exist in your dataset

* What data types those features should have

* What values are valid for categorical features

* Which "Feature"s are required vs optional
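
The four checks listed above can be sketched in plain Python. `ToyFeature` and `check_record` are hypothetical names used for illustration only; LaminDB's `Feature` and `Schema` classes work against registries and are considerably richer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyFeature:
    name: str
    dtype: type
    required: bool = True
    categories: Optional[frozenset] = None  # allowed values, if categorical


def check_record(record, features):
    """Return a list of human-readable problems found in one record (a dict)."""
    problems = []
    for feature in features:
        if feature.name not in record:
            if feature.required:
                problems.append(f"missing required feature '{feature.name}'")
            continue
        value = record[feature.name]
        if not isinstance(value, feature.dtype):
            problems.append(f"'{feature.name}' should be {feature.dtype.__name__}")
        elif feature.categories is not None and value not in feature.categories:
            problems.append(f"'{value}' is not a valid value for '{feature.name}'")
    return problems


features = [
    ToyFeature("cell_type", str, categories=frozenset({"B cell", "T cell"})),
    ToyFeature("treatment", str),
    ToyFeature("donor", str, required=False),  # optional feature
]
problems = check_record({"cell_type": "B-cell", "treatment": 5}, features)
```

Here `problems` reports an invalid category for `cell_type` and a wrong type for `treatment`, while the missing optional `donor` is accepted.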

An example schema:

 import bionty as bt
 import lamindb as ln

 schema = ln.Schema(
     name="experiment_schema",  # human-readable name
     features=[  # required features
         ln.Feature(name="cell_type", dtype=bt.CellType),
         ln.Feature(name="treatment", dtype=str),
     ],
     otype="DataFrame",  # object type (DataFrame, AnnData, etc.)
 )

For composite data structures using slots:

-[ What are slots? ]-

For composite data structures, you need to specify which component
contains which schema, for example, to validate both cell metadata in
".obs" and gene metadata in ".var" within the same schema. Each slot
is a key like "obs" for AnnData observations, "rna:var" for MuData
modalities, or "attrs:nested:key" for SpatialData annotations.

 # AnnData with multiple "slots"
 adata_schema = ln.Schema(
 otype="AnnData",
 slots={
 "obs": cell_metadata_schema, # cell annotations
 "var.T": gene_id_schema # gene-derived features
 }
 )
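
One way to picture a slot key is as a path into the composite object. The sketch below treats the object as nested dictionaries and splits the key on ":"; this is an assumption made for illustration, `resolve_slot` is a hypothetical helper, and the sketch ignores the ".T" transposition suffix.

```python
def resolve_slot(obj, slot):
    """Walk a nested dict along the colon-separated parts of a slot key."""
    current = obj
    for part in slot.split(":"):
        current = current[part]
    return current


# Toy stand-in for a composite object (hypothetical content).
toy_object = {
    "obs": ["cell1", "cell2"],
    "rna": {"var": ["ENSG00000153563"]},
    "attrs": {"nested": {"key": {"assay": "scRNA-seq"}}},
}

obs = resolve_slot(toy_object, "obs")
rna_var = resolve_slot(toy_object, "rna:var")
nested = resolve_slot(toy_object, "attrs:nested:key")
```

Each slot key selects exactly one component, which is then validated against the schema registered for that slot.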

Before diving into curation, let's understand the different schema
approaches and when to use each one. Think of schemas as rules that
define what valid data should look like.

###### Flexible schema

Use when: You want to validate those columns whose names match feature
names in your "Feature" registry.

 import lamindb as ln

 schema = ln.Schema(name="valid_features", itype=ln.Feature).save()

###### Minimal required schema

Use when: You need certain columns but want flexibility for additional
metadata.

 import lamindb as ln

 schema = ln.Schema(
 name="Mini immuno schema",
 features=[
 ln.Feature.get(name="perturbation"),
 ln.Feature.get(name="cell_type_by_model"),
 ln.Feature.get(name="assay_oid"),
 ln.Feature.get(name="donor"),
 ln.Feature.get(name="concentration"),
 ln.Feature.get(name="treatment_time_h"),
 ],
 flexible=True,  # _additional_ columns in a dataframe are validated & annotated
 ).save()

###### Strict schema

Use when: You need complete control over data structure and values.

 # Only allows specified columns
 schema = ln.Schema(
     features=[...],
     minimal_set=True,  # all passed features are required
     maximal_set=True,  # no features beyond the passed ones are allowed
 )
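
The difference between the three patterns comes down to how the set of dataset columns is compared with the set of schema features. The following plain-Python sketch shows that comparison; `require_all` and `allow_extra` are hypothetical flag names chosen for readability, not LaminDB parameters.

```python
def check_columns(columns, features, require_all=True, allow_extra=True):
    """Compare dataset columns with schema features and report mismatches."""
    columns, features = set(columns), set(features)
    missing = sorted(features - columns) if require_all else []
    unexpected = sorted(columns - features) if not allow_extra else []
    return {"missing": missing, "unexpected": unexpected}


columns = ["perturbation", "donor", "batch"]
features = ["perturbation", "donor", "concentration"]

# Flexible: nothing is required and extra columns are fine.
flexible = check_columns(columns, features, require_all=False, allow_extra=True)
# Minimal required: all features must be present, extra columns are fine.
minimal = check_columns(columns, features, require_all=True, allow_extra=True)
# Strict: all features must be present and no other columns may exist.
strict = check_columns(columns, features, require_all=True, allow_extra=False)
```

Only the strict comparison flags the extra `batch` column, and both the minimal and strict comparisons flag the missing `concentration` column.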

#### DataFrame

###### Step 1: Load and examine your data

We'll be working with the mini immuno dataset:

 df = ln.examples.datasets.mini_immuno.get_dataset1(
 with_cell_type_synonym=True, with_cell_type_typo=True
 )
 df

###### Step 2: Set up your metadata registries

Before creating a schema, ensure your registries have the right
features and labels:

 import bionty as bt

 import lamindb as ln

 # define valid labels
 perturbation_type = ln.Record(name="Perturbation", is_type=True).save()
 ln.Record(name="DMSO", type=perturbation_type).save()
 ln.Record(name="IFNG", type=perturbation_type).save()
 bt.CellType.from_source(name="B cell").save()
 bt.CellType.from_source(name="T cell").save()

 # define valid features
 ln.Feature(name="perturbation", dtype=perturbation_type).save()
 ln.Feature(name="cell_type_by_expert", dtype=bt.CellType).save()
 ln.Feature(name="cell_type_by_model", dtype=bt.CellType).save()
 ln.Feature(name="assay_oid", dtype=bt.ExperimentalFactor.ontology_id).save()
 ln.Feature(name="concentration", dtype=str).save()
 ln.Feature(name="treatment_time_h", dtype="num", coerce=True).save()
 ln.Feature(name="donor", dtype=str, nullable=True).save()
 ln.Feature(name="donor_ethnicity", dtype=list[bt.Ethnicity]).save()

###### Step 3: Create your schema

 schema = ln.examples.datasets.mini_immuno.define_mini_immuno_schema_flexible()
 schema.describe()

###### Step 4: Initialize Curator and first validation

If you expect the validation to pass, you can directly register an
artifact by providing the schema:

 artifact = ln.Artifact.from_dataframe(df, key="examples/my_curated_dataset.parquet", schema=schema).save()

The "validate()" method validates that your dataset adheres to the
criteria defined by the "schema". It identifies which values are
already validated (exist in the registries) and which are potentially
problematic (do not yet exist in our registries).

 try:
     curator = ln.curators.DataFrameCurator(df, schema)
     curator.validate()
 except ln.errors.ValidationError as error:
     print(error)

###### Step 5: Fix validation issues

 # check the non-validated terms
 curator.cat.non_validated

For "cell_type_by_expert", we saw that 2 terms are not validated.

First, let's standardize the synonym "B-cell" as suggested:

 curator.cat.standardize("cell_type_by_expert")

 # now we have only one non-validated cell type left
 curator.cat.non_validated

For "CD8-pos alpha-beta T cell", let's understand which cell type in
the public ontology might be the actual match.

 # to check the correct spelling of categories, pass `public=True` to get a lookup object from public ontologies
 # use `lookup = curator.cat.lookup()` to get a lookup object of existing records in your instance
 lookup = curator.cat.lookup(public=True)
 lookup

 # here is an example for the "cell_type_by_expert" column
 cell_types = lookup["cell_type_by_expert"]
 cell_types.cd8_positive_alpha_beta_t_cell

 # fix the cell type name
 df["cell_type_by_expert"] = df["cell_type_by_expert"].cat.rename_categories(
     {"CD8-pos alpha-beta T cell": cell_types.cd8_positive_alpha_beta_t_cell.name}
 )
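
A lookup object exposes names as attributes so that tab completion can find them. The sketch below shows one plausible way such attribute keys could be derived from free-text names; the normalization rule and the helper names are assumptions, and unlike the real lookup, the attributes here return plain strings rather than records with a `.name`.

```python
import re
from types import SimpleNamespace


def to_lookup_key(name):
    """Turn a free-text name into a lowercase, underscore-separated identifier."""
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")


def build_lookup(names):
    """Expose each name as an attribute of a simple namespace object."""
    return SimpleNamespace(**{to_lookup_key(name): name for name in names})


cell_types = build_lookup(["B cell", "T cell", "CD8-positive, alpha-beta T cell"])
correct_name = cell_types.cd8_positive_alpha_beta_t_cell
```

With such an object, a misspelled category can be replaced by the exact registry spelling instead of retyping it by hand.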

For perturbation, we want to add the new values: "DMSO", "IFNG"

 # this adds perturbations that were _not_ validated
 curator.cat.add_new_from("perturbation")

 ln.Feature.get(name="perturbation")

 # validate again
 curator.validate()

###### Step 6: Save your curated dataset

 artifact = curator.save_artifact(key="examples/my_curated_dataset.parquet")

 artifact.describe()

#### Common fixes

This section covers the most frequent curation issues and their
solutions. Use this as a reference when validation fails.

###### Feature validation issues

**Issue**: "Column not in dataframe"

 "column 'treatment' not in dataframe. Columns in dataframe: ['drug', 'timepoint', ...]"

**Solutions**:

 # Solution 1: Rename columns to match schema (keys are current names, values are schema names)
 df = df.rename(columns={
     'drug': 'treatment',
     'timepoint': 'time',
     # ...
 })

 # Solution 2: Create missing columns
 df['treatment'] = 'unknown'  # Add with default value (or define Feature.default_value)

 # Solution 3: Modify schema to match your data
 schema = ln.Schema(
 features=[
 ln.Feature.get(name="drug"),  # Use actual column name
 ln.Feature.get(name="timepoint"),
 ],
 ...
 )

###### Value validation issues

**Issue**: "Terms not validated in feature 'cell_type'"

 2 terms not validated in feature 'cell_type': 'B-cell', 'CD8-pos alpha-beta T cell'
 1 synonym found: "B-cell" → "B cell"
 → curate synonyms via: .standardize("cell_type")
 for remaining terms:
 → fix typos, remove non-existent values, or save terms via: curator.cat.add_new_from('cell_type')

**Solutions**:

 # Solution 1: Use automatic standardization if given a hint (handles synonyms)
 curator.cat.standardize('cell_type')

 # Solution 2: Manual mapping for complex cases
 value_mapping = {
 'T-cells': 'T cell',
 'B-cells': 'B cell',
 }
 df['cell_type'] = df['cell_type'].map(value_mapping).fillna(df['cell_type'])

 # Solution 3: Use public ontology lookup for correct names
 lookup = curator.cat.lookup(public=True)
 cell_types = lookup["cell_type"]
 df['cell_type'] = df['cell_type'].cat.rename_categories({
 'CD8-pos T cell': cell_types.cd8_positive_alpha_beta_t_cell.name
 })

 # Solution 4: Add new legitimate terms
 curator.cat.add_new_from("cell_type")

###### Data type issues

**Issue**: "Expected categorical data, got object"

 TypeError: Expected categorical data for cell_type, got object

**Solutions**:

 # Solution 1: Convert to categorical
 df['cell_type'] = df['cell_type'].astype('category')

 # Solution 2: Use coercion in feature definition
 ln.Feature(name="cell_type", dtype=bt.CellType, coerce=True).save()

###### Organism-specific ontology issues

**Issue**: "Terms not validated" for organism-specific ontologies like
developmental stages

 2 terms not validated in feature 'developmental_stage_ontology_id': 'MmusDv:0000142', 'MmusDv:0000022'

**Solution**: Specify organism-specific source in feature definition
using "cat_filters":

 # When defining the schema, specify the organism-specific source
 mouse_source = bt.Source.filter(
 entity="bionty.DevelopmentalStage",
 organism="mouse"
 ).one()

 schema = ln.Schema(
 features=[
 ln.Feature(
 name="developmental_stage_ontology_id",
 dtype=bt.DevelopmentalStage.ontology_id,
 cat_filters={"source": mouse_source}  # Specify organism-specific source
 )
 ],
 ...
 )

This pattern applies to any ontology where the same registry serves
multiple organisms (e.g., "DevelopmentalStage", "Phenotype", ...).
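
The reason for the failure is easy to see in a toy model: if values are checked against the wrong organism's source, none of them can validate. In this sketch the `SOURCES` dictionary is hypothetical and heavily truncated, and the human IDs are placeholders.

```python
# Hypothetical, truncated sources keyed by organism (human IDs are placeholders).
SOURCES = {
    "human": {"HsapDv:0000001", "HsapDv:0000002"},
    "mouse": {"MmusDv:0000142", "MmusDv:0000022"},
}


def non_validated(values, organism):
    """Return the values that are absent from the source of the given organism."""
    valid_ids = SOURCES[organism]
    return [value for value in values if value not in valid_ids]


values = ["MmusDv:0000142", "MmusDv:0000022"]
against_human = non_validated(values, "human")  # wrong source: nothing validates
against_mouse = non_validated(values, "mouse")  # matching source: all validate
```

Restricting the feature to the mouse source plays the role of the second call.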

#### External data validation

Since not all metadata is always stored within the dataset itself, it
is also possible to validate external metadata.

curate_dataframe_external_features.py

 import lamindb as ln
 from datetime import date

 df = ln.examples.datasets.mini_immuno.get_dataset1(otype="DataFrame")

 temperature = ln.Feature(name="temperature", dtype=float).save()
 date_of_study = ln.Feature(name="date_of_study", dtype=date).save()
 external_schema = ln.Schema(features=[temperature, date_of_study]).save()

 concentration = ln.Feature(name="concentration", dtype=str).save()
 donor = ln.Feature(name="donor", dtype=str, nullable=True).save()
 schema = ln.Schema(
 features=[concentration, donor],
 slots={"__external__": external_schema},
 otype="DataFrame",
 ).save()

 artifact = ln.Artifact.from_dataframe(
 df,
 key="examples/dataset1.parquet",
 features={"temperature": 21.6, "date_of_study": date(2024, 10, 1)},
 schema=schema,
 ).save()
 artifact.describe()

 !python scripts/curate_dataframe_external_features.py

#### Union dtypes

Some metadata columns might validate against several registries.

curate_dataframe_union_features.py

 import lamindb as ln
 import pandas as pd

 union_feature = ln.Feature(
 name="mixed_feature",
 dtype="cat[bionty.Tissue.ontology_id | bionty.CellType.ontology_id]",
 ).save()

 df_mixed = pd.DataFrame({"mixed_feature": ["UBERON:0000178", "CL:0000540"]})

 schema = ln.Schema(features=[union_feature], coerce=True).save()

 curator = ln.curators.DataFrameCurator(df_mixed, schema)
 curator.validate()

 !python scripts/curate_dataframe_union_features.py
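
Conceptually, a union dtype accepts a value if it validates against any of the listed registries. Here is a minimal sketch of that rule with two tiny stand-in registries; `validate_union` is a hypothetical helper and the ID sets contain only the two values from the example.

```python
# Tiny stand-ins for two ontology registries (only the IDs used in the example).
TISSUE_IDS = {"UBERON:0000178"}
CELL_TYPE_IDS = {"CL:0000540"}


def validate_union(values, *registries):
    """Return the values that are not found in any of the given registries."""
    valid = set().union(*registries)
    return [value for value in values if value not in valid]


values = ["UBERON:0000178", "CL:0000540"]
with_union = validate_union(values, TISSUE_IDS, CELL_TYPE_IDS)
tissue_only = validate_union(values, TISSUE_IDS)
```

Against the union both values pass, whereas validating against the tissue registry alone leaves the cell type ID unvalidated.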

#### AnnData

"AnnData", like all other data structures that follow, is a composite
structure that stores different arrays in different "slots".

###### Allow a flexible schema

We can also allow a flexible schema for an "AnnData" and only require
that it's indexed with Ensembl gene IDs.

curate_anndata_flexible.py

 import lamindb as ln

 ln.examples.datasets.mini_immuno.define_features_labels()
 adata = ln.examples.datasets.mini_immuno.get_dataset1(otype="AnnData")
 artifact = ln.Artifact.from_anndata(
 adata,
 key="examples/mini_immuno.h5ad",
 schema="ensembl_gene_ids_and_valid_features_in_obs",
 ).save()
 artifact.describe()

Let's run the script.

 !python scripts/curate_anndata_flexible.py

Under the hood, this uses the following built-in schema
("anndata_ensembl_gene_ids_and_valid_features_in_obs()"):

 import bionty as bt

 import lamindb as ln

 obs_schema = ln.examples.schemas.valid_features()
 varT_schema = ln.Schema(
 name="valid_ensembl_gene_ids", itype=bt.Gene.ensembl_gene_id
 ).save()
 schema = ln.Schema(
 name="anndata_ensembl_gene_ids_and_valid_features_in_obs",
 otype="AnnData",
 slots={"obs": obs_schema, "var.T": varT_schema},
 ).save()

This schema transposes the "var" DataFrame during curation, so that one
validates and annotates the columns of "var.T", i.e.,
"[ENSG00000153563, ENSG00000010610, ENSG00000170458]". Without the
transpose, one would annotate the columns of "var", i.e.,
"[gene_symbol, gene_type]".

[image]
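
The effect of the transpose on the column names can be reproduced with a toy table. The sketch below stores "var" as a nested dictionary with one row per gene ID; the helper names are hypothetical and the annotation values are illustrative.

```python
# Toy `var` table: one row per gene ID, one column per annotation (illustrative values).
var = {
    "ENSG00000153563": {"gene_symbol": "CD8A", "gene_type": "protein_coding"},
    "ENSG00000010610": {"gene_symbol": "CD4", "gene_type": "protein_coding"},
    "ENSG00000170458": {"gene_symbol": "CD14", "gene_type": "protein_coding"},
}


def columns(table):
    """Column names of a row-oriented table: the keys of its first row."""
    first_row = next(iter(table.values()))
    return list(first_row)


def transpose(table):
    """Swap rows and columns of a nested-dict table."""
    transposed = {}
    for row_name, row in table.items():
        for column_name, value in row.items():
            transposed.setdefault(column_name, {})[row_name] = value
    return transposed


var_columns = columns(var)               # what the "var" slot would annotate
varT_columns = columns(transpose(var))   # what the "var.T" slot annotates
```

Only the transposed table has gene IDs as columns, which is why the gene schema is attached to the "var.T" slot.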

###### Fix validation issues

 adata = ln.examples.datasets.mini_immuno.get_dataset1(
 with_gene_typo=True, with_cell_type_typo=True, otype="AnnData"
 )
 adata

 schema = ln.examples.schemas.anndata_ensembl_gene_ids_and_valid_features_in_obs()
 schema.describe()

Check the slots of a schema:

 schema.slots

 curator = ln.curators.AnnDataCurator(adata, schema)
 try:
     curator.validate()
 except ln.errors.ValidationError as error:
     print(error)

As above, we leverage a lookup object with valid cell types to find
the correct name.

 valid_cell_types = curator.slots["obs"].cat.lookup()["cell_type_by_expert"]
 adata.obs["cell_type_by_expert"] = adata.obs[
 "cell_type_by_expert"
 ].cat.rename_categories(
 {"CD8-pos alpha-beta T cell": valid_cell_types.cd8_positive_alpha_beta_t_cell.name}
 )

The validated "AnnData" can be subsequently saved as an "Artifact":

 adata.obs.columns

 curator.slots["var.T"].cat.add_new_from("columns")

 curator.validate()

 artifact = curator.save_artifact(key="examples/my_curated_anndata.h5ad")

Access the schema for each slot:

 artifact.features.slots

The saved artifact has been annotated with validated features and
labels:

 artifact.describe()

#### Unstructured dictionaries

Most data structures support unstructured metadata stored as
dictionaries:

* Pandas DataFrames: ".attrs"

* AnnData: ".uns"

* MuData: ".uns" and "modality:uns"

* SpatialData: ".attrs"

Here, we show by example how to curate such metadata for AnnData:

define_schema_anndata_uns.py

 import lamindb as ln

 from define_schema_df_metadata import study_metadata_schema

 anndata_uns_schema = ln.Schema(
 otype="AnnData",
 slots={
 "uns:study_metadata": study_metadata_schema,
 },
 ).save()

 !python scripts/define_schema_anndata_uns.py

curate_anndata_uns.py

 import lamindb as ln

 ln.examples.datasets.mini_immuno.define_features_labels()
 adata = ln.examples.datasets.mini_immuno.get_dataset1(otype="AnnData")
 schema = ln.Schema.get(name="Study metadata schema")
 artifact = ln.Artifact.from_anndata(
 adata, schema=schema, key="examples/mini_immuno_uns.h5ad"
 )
 artifact.describe()

 !python scripts/curate_anndata_uns.py

#### MuData

curate_mudata.py

 import lamindb as ln
 import bionty as bt

 from docs.scripts.define_schema_df_metadata import study_metadata_schema

 # define labels
 perturbation = ln.Record(name="Perturbation", is_type=True).save()
 ln.Record(name="Perturbed", type=perturbation).save()
 ln.Record(name="NT", type=perturbation).save()

 replicate = ln.Record(name="Replicate", is_type=True).save()
 ln.Record(name="rep1", type=replicate).save()
 ln.Record(name="rep2", type=replicate).save()
 ln.Record(name="rep3", type=replicate).save()

 # define the global obs schema
 obs_schema = ln.Schema(
 name="mudata_papalexi21_subset_obs_schema",
 features=[
 ln.Feature(name="perturbation", dtype="cat[Record[Perturbation]]").save(),
 ln.Feature(name="replicate", dtype="cat[Record[Replicate]]").save(),
 ],
 ).save()

 # define the ['rna'].obs schema
 obs_schema_rna = ln.Schema(
 name="mudata_papalexi21_subset_rna_obs_schema",
 features=[
 ln.Feature(name="nCount_RNA", dtype=int).save(),
 ln.Feature(name="nFeature_RNA", dtype=int).save(),
 ln.Feature(name="percent.mito", dtype=float).save(),
 ],
 ).save()

 # define the ['hto'].obs schema
 obs_schema_hto = ln.Schema(
 name="mudata_papalexi21_subset_hto_obs_schema",
 features=[
 ln.Feature(name="nCount_HTO", dtype=float).save(),
 ln.Feature(name="nFeature_HTO", dtype=int).save(),
 ln.Feature(name="technique", dtype=bt.ExperimentalFactor).save(),
 ],
 ).save()

 # define ['rna'].var schema
 var_schema_rna = ln.Schema(
 name="mudata_papalexi21_subset_rna_var_schema",
 itype=bt.Gene.symbol,
 dtype=float,
 ).save()

 # define composite schema
 mudata_schema = ln.Schema(
 name="mudata_papalexi21_subset_mudata_schema",
 otype="MuData",
 slots={
 "obs": obs_schema,
 "rna:obs": obs_schema_rna,
 "hto:obs": obs_schema_hto,
 "rna:var": var_schema_rna,
 "uns:study_metadata": study_metadata_schema,
 },
 ).save()

 # curate a MuData
 mdata = ln.examples.datasets.mudata_papalexi21_subset(with_uns=True)
 bt.settings.organism = "human"  # set the organism to map gene symbols
 curator = ln.curators.MuDataCurator(mdata, mudata_schema)
 artifact = curator.save_artifact(key="examples/mudata_papalexi21_subset.h5mu")
 assert artifact.schema == mudata_schema

 !python scripts/curate_mudata.py

#### SpatialData

define_schema_spatialdata.py

 import lamindb as ln
 import bionty as bt

 # a very comprehensive schema for different slots of a SpatialData object

 # define or query features
 bio_dict = ln.Feature(name="bio", dtype=dict).save()
 tech_dict = ln.Feature(name="tech", dtype=dict).save()
 disease = ln.Feature(name="disease", dtype=bt.Disease, coerce=True).save()
 developmental_stage = ln.Feature(
 name="developmental_stage",
 dtype=bt.DevelopmentalStage,
 coerce=True,
 ).save()
 assay = ln.Feature(name="assay", dtype=bt.ExperimentalFactor, coerce=True).save()
 sample_region = ln.Feature(name="sample_region", dtype=str).save()
 analysis = ln.Feature(name="analysis", dtype=str).save()

 # define or query schema components
 attrs_schema = ln.Schema([bio_dict, tech_dict]).save()
 sample_schema = ln.Schema([disease, developmental_stage]).save()
 tech_schema = ln.Schema([assay]).save()
 obs_schema = ln.Schema([sample_region]).save()
 uns_schema = ln.Schema([analysis]).save()
 # enforces only registered Ensembl Gene IDs pass validation (maximal_set=True)
 varT_schema = ln.Schema(itype=bt.Gene.ensembl_gene_id, maximal_set=True).save()

 # compose the SpatialData schema
 sdata_schema = ln.Schema(
 name="spatialdata_blobs_schema",
 otype="SpatialData",
 slots={
 "attrs:bio": sample_schema,
 "attrs:tech": tech_schema,
 "attrs": attrs_schema,
 "tables:table:obs": obs_schema,
 "tables:table:var.T": varT_schema,
 },
 ).save()

 !python scripts/define_schema_spatialdata.py

curate_spatialdata.py

 import lamindb as ln

 spatialdata = ln.examples.datasets.spatialdata_blobs()
 sdata_schema = ln.Schema.get(name="spatialdata_blobs_schema")
 curator = ln.curators.SpatialDataCurator(spatialdata, sdata_schema)
 try:
     curator.validate()
 except ln.errors.ValidationError:
     pass

 spatialdata.tables["table"].var.drop(index="ENSG00000999999", inplace=True)

 # validate again (must pass now) and save artifact
 artifact = ln.Artifact.from_spatialdata(
 spatialdata, key="examples/spatialdata1.zarr", schema=sdata_schema
 ).save()
 artifact.describe()

 !python scripts/curate_spatialdata.py

#### TiledbsomaExperiment

curate_soma_experiment.py

 import lamindb as ln
 import bionty as bt
 import tiledbsoma as soma
 import tiledbsoma.io

 adata = ln.examples.datasets.mini_immuno.get_dataset1(otype="AnnData")
 tiledbsoma.io.from_anndata("small_dataset.tiledbsoma", adata, measurement_name="RNA")

 obs_schema = ln.Schema(
 name="soma_obs_schema",
 features=[
 ln.Feature(name="cell_type_by_expert", dtype=bt.CellType).save(),
 ln.Feature(name="cell_type_by_model", dtype=bt.CellType).save(),
 ],
 ).save()

 var_schema = ln.Schema(
 name="soma_var_schema",
 features=[
 ln.Feature(name="var_id", dtype=bt.Gene.ensembl_gene_id).save(),
 ],
 coerce=True,
 ).save()

 soma_schema = ln.Schema(
 name="soma_experiment_schema",
 otype="tiledbsoma",
 slots={
 "obs": obs_schema,
 "ms:RNA.T": var_schema,
 },
 ).save()

 with soma.Experiment.open("small_dataset.tiledbsoma") as experiment:
     curator = ln.curators.TiledbsomaExperimentCurator(experiment, soma_schema)
     curator.validate()
     artifact = curator.save_artifact(
         key="examples/soma_experiment.tiledbsoma",
         description="SOMA experiment with schema validation",
     )
 assert artifact.schema == soma_schema
 artifact.describe()

 !python scripts/curate_soma_experiment.py

#### Other data structures

If you have other data structures, read: How do I validate & annotate
arbitrary data structures?

 !rm -rf ./test-curate
 !rm -rf ./small_dataset.tiledbsoma
 !lamin delete --force test-curate