Pydantic & Pandera vs. LaminDB

This doc explains conceptual differences between data validation with pydantic, pandera, and LaminDB.

!lamin init --storage test-pydantic-pandera --modules bionty
 resetting django module variables
 initialized lamindb: testuser1/test-pydantic-pandera

Let us work with a test dataframe.

import pandas as pd
import pydantic
import lamindb as ln
import bionty as bt
import pandera.pandas as pandera
import pprint

from typing import Literal, Any

df = ln.core.datasets.small_dataset1()
df
 connected lamindb: testuser1/test-pydantic-pandera
ENSG00000153563 ENSG00000010610 ENSG00000170458 perturbation sample_note cell_type_by_expert cell_type_by_model assay_oid concentration treatment_time_h donor donor_ethnicity
sample1 1 3 5 DMSO was ok B cell B cell EFO:0008913 0.1% 24 D0001 [Chinese, Singaporean Chinese]
sample2 2 4 6 IFNG looks naah CD8-positive, alpha-beta T cell T cell EFO:0008913 200 nM 24 D0002 [Chinese, Han Chinese]
sample3 3 5 7 DMSO pretty! 🤩 CD8-positive, alpha-beta T cell T cell EFO:0008913 0.1% 6 None [Chinese]

Define a schema

pydantic

Perturbation = Literal["DMSO", "IFNG"]
CellType = Literal["T cell", "B cell"]
OntologyID = Literal["EFO:0008913"]


class ImmunoSchema(pydantic.BaseModel):
    perturbation: Perturbation
    cell_type_by_model: CellType
    cell_type_by_expert: CellType
    assay_oid: OntologyID
    concentration: str
    treatment_time_h: int
    donor: str | None

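    # pydantic v2 style configuration; the class-based `Config` is deprecated in v2
    model_config = pydantic.ConfigDict(title="My immuno schema")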

pandera

pandera_schema = pandera.DataFrameSchema(
    {
        "perturbation": pandera.Column(
            str, checks=pandera.Check.isin(["DMSO", "IFNG"])
        ),
        "cell_type_by_model": pandera.Column(
            str, checks=pandera.Check.isin(["T cell", "B cell"])
        ),
        "cell_type_by_expert": pandera.Column(
            str, checks=pandera.Check.isin(["T cell", "B cell"])
        ),
        "assay_oid": pandera.Column(str, checks=pandera.Check.isin(["EFO:0008913"])),
        "concentration": pandera.Column(str),
        "treatment_time_h": pandera.Column(int),
        "donor": pandera.Column(str, nullable=True),
    },
    name="My immuno schema",
)

LaminDB

Features & labels are defined at the level of the database instance. You can either define a schema that lists required (and optional) columns.

ln.ULabel(name="DMSO").save()
ln.ULabel(name="IFNG").save()

# leverage ontologies through types ln.ULabel, bt.CellType, bt.ExperimentalFactor
lamindb_schema = ln.Schema(
    name="My immuno schema",
    features=[
        ln.Feature(name="perturbation", dtype=ln.ULabel).save(),
        ln.Feature(name="cell_type_by_model", dtype=bt.CellType).save(),
        ln.Feature(name="cell_type_by_expert", dtype=bt.CellType).save(),
        ln.Feature(name="assay_oid", dtype=bt.ExperimentalFactor.ontology_id).save(),
        ln.Feature(name="concentration", dtype=str).save(),
        ln.Feature(name="treatment_time_h", dtype=int).save(),
        ln.Feature(name="donor", dtype=str, nullable=True).save(),
    ],
).save()

Or define only a constraint on the feature identifier, without listing individual features.

lamindb_schema_only_itype = ln.Schema(
    name="Allow any valid features & labels", itype=ln.Feature
)
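
To illustrate what a constraint on the feature identifier means, here is a minimal sketch in plain Python. It is not LaminDB's implementation: the `KNOWN_FEATURES` set and the `split_columns` helper are hypothetical stand-ins for a feature registry and its lookup. The point is that no column is listed in the schema itself; each column name is simply checked against the registry.

```python
# Hypothetical stand-in for a feature registry (in LaminDB this lives in the database).
KNOWN_FEATURES = {"perturbation", "cell_type_by_model", "donor"}


def split_columns(columns: list[str], registry: set[str]) -> tuple[list[str], list[str]]:
    """Partition column names into validated and non-validated identifiers."""
    validated = [c for c in columns if c in registry]
    not_validated = [c for c in columns if c not in registry]
    return validated, not_validated


validated, not_validated = split_columns(
    ["perturbation", "donor", "sample_note"], KNOWN_FEATURES
)
print(validated)      # ['perturbation', 'donor']
print(not_validated)  # ['sample_note']
```

Because the constraint is expressed over the registry rather than over a list of names, the same check scales to many thousands of columns without changing the schema definition.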

Validate a dataframe

pydantic

class DataFrameValidationError(Exception):
    pass


def validate_dataframe(df: pd.DataFrame, model: type[pydantic.BaseModel]):
    errors = []

    for i, row in enumerate(df.to_dict(orient="records")):
        try:
            model(**row)
        except pydantic.ValidationError as e:
            errors.append(f"row {i} failed validation: {e}")

    if errors:
        error_message = "\n".join(errors)
        raise DataFrameValidationError(
            f"DataFrame validation failed with the following errors:\n{error_message}"
        )


try:
    validate_dataframe(df, ImmunoSchema)
except DataFrameValidationError as e:
    print(e)
DataFrame validation failed with the following errors:
row 1 failed validation: 1 validation error for My immuno schema
cell_type_by_expert
  Input should be 'T cell' or 'B cell' [type=literal_error, input_value='CD8-positive, alpha-beta T cell', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/literal_error
row 2 failed validation: 1 validation error for My immuno schema
cell_type_by_expert
  Input should be 'T cell' or 'B cell' [type=literal_error, input_value='CD8-positive, alpha-beta T cell', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/literal_error

To fix the validation error, we need to update the Literal and re-run the model definition.

Perturbation = Literal["DMSO", "IFNG"]
CellType = Literal[
    "T cell", "B cell", "CD8-positive, alpha-beta T cell"  # <-- updated
]
OntologyID = Literal["EFO:0008913"]


class ImmunoSchema(pydantic.BaseModel):
    perturbation: Perturbation
    cell_type_by_model: CellType
    cell_type_by_expert: CellType
    assay_oid: OntologyID
    concentration: str
    treatment_time_h: int
    donor: str | None

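    # pydantic v2 style configuration; the class-based `Config` is deprecated in v2
    model_config = pydantic.ConfigDict(title="My immuno schema")


validate_dataframe(df, ImmunoSchema)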

pandera

try:
    pandera_schema.validate(df)
except pandera.errors.SchemaError as e:
    print(e)
Column 'cell_type_by_expert' failed element-wise validator number 0: isin(['T cell', 'B cell']) failure cases: CD8-positive, alpha-beta T cell, CD8-positive, alpha-beta T cell

LaminDB

Because the term "CD8-positive, alpha-beta T cell" is part of the public CellType ontology, validation passes the first time.

If validation had not passed, we could have resolved the issue by adding a new term to the CellType registry rather than editing the code. This also enables downstream data scientists to update ontologies themselves.

curator = ln.curators.DataFrameCurator(df, lamindb_schema)
curator.validate()
! 5 terms not validated in feature 'columns': 'ENSG00000153563', 'ENSG00000010610', 'ENSG00000170458', 'sample_note', 'donor_ethnicity'
    → fix typos, remove non-existent values, or save terms via: curator.cat.add_new_from('columns')
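
The idea of fixing a validation failure by adding a term to a registry, rather than editing code, can be sketched with the standard library alone. This is a conceptual illustration under stated assumptions, not how LaminDB stores its registries: the `cell_type` table and the `non_validated` helper are hypothetical.

```python
import sqlite3

# Hypothetical in-memory registry of cell type labels (not LaminDB's actual schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cell_type (name TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO cell_type VALUES (?)", [("T cell",), ("B cell",)])


def non_validated(values: list[str]) -> list[str]:
    """Return the unique values that are missing from the registry, in input order."""
    known = {row[0] for row in conn.execute("SELECT name FROM cell_type")}
    return [v for v in dict.fromkeys(values) if v not in known]


column = ["B cell", "CD8-positive, alpha-beta T cell", "CD8-positive, alpha-beta T cell"]
before = non_validated(column)

# Resolve the failure by adding a term to the registry; the validation code is unchanged.
conn.execute("INSERT INTO cell_type VALUES (?)", ("CD8-positive, alpha-beta T cell",))
after = non_validated(column)

print(before)  # ['CD8-positive, alpha-beta T cell']
print(after)   # []
```

The set of allowed labels is data, not code, so anyone with write access to the registry can extend it without touching the validation logic.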

What was the cell type validation based on? Let’s inspect the CellType registry.

bt.CellType.to_dataframe()
uid name ontology_id abbr synonyms description space_id source_id run_id created_at created_by_id _aux branch_id
id
14 6By01L04 alpha-beta T cell CL:0000789 None alpha-beta T-cell|alpha-beta T-lymphocyte|alph... A T Cell That Expresses An Alpha-Beta T Cell R... 1 16 None 2025-09-14 14:13:22.638000+00:00 1 None 1
15 4BEwsp1Q mature alpha-beta T cell CL:0000791 None mature alpha-beta T-cell|mature alpha-beta T l... A Alpha-Beta T Cell That Has A Mature Phenotype. 1 16 None 2025-09-14 14:13:22.638000+00:00 1 None 1
16 2OTzqBTM mature T cell CL:0002419 None mature T-cell|CD3e-positive T cell A T Cell That Expresses A T Cell Receptor Comp... 1 16 None 2025-09-14 14:13:22.638000+00:00 1 None 1
13 6IC9NGJE CD8-positive, alpha-beta T cell CL:0000625 None CD8-positive, alpha-beta T-lymphocyte|CD8-posi... A T Cell Expressing An Alpha-Beta T Cell Recep... 1 16 None 2025-09-14 14:13:22.382000+00:00 1 None 1
3 4bKGljt0 cell CL:0000000 None None A Material Entity Of Anatomical Origin (Part O... 1 16 None 2025-09-14 14:13:22.079000+00:00 1 None 1
4 2K93w3xO motile cell CL:0000219 None None A Cell That Moves By Its Own Activities. 1 16 None 2025-09-14 14:13:22.079000+00:00 1 None 1
5 2cXC7cgF single nucleate cell CL:0000226 None None A Cell With A Single Nucleus. 1 16 None 2025-09-14 14:13:22.079000+00:00 1 None 1
6 4WnpvUTH eukaryotic cell CL:0000255 None None Any Cell That In Taxon Some Eukaryota. 1 16 None 2025-09-14 14:13:22.079000+00:00 1 None 1
7 X6c7osZ5 lymphocyte CL:0000542 None None A Lymphocyte Is A Leukocyte Commonly Found In ... 1 16 None 2025-09-14 14:13:22.079000+00:00 1 None 1
8 3VEAlFdi leukocyte CL:0000738 None leucocyte|white blood cell An Achromatic Cell Of The Myeloid Or Lymphoid ... 1 16 None 2025-09-14 14:13:22.079000+00:00 1 None 1
9 2Jgr5Xx4 mononuclear leukocyte CL:0000842 None mononuclear cell A Leukocyte With A Single Non-Segmented Nucleu... 1 16 None 2025-09-14 14:13:22.079000+00:00 1 None 1
10 7GpphKmr lymphocyte of B lineage CL:0000945 None None A Lymphocyte Of B Lineage With The Commitment ... 1 16 None 2025-09-14 14:13:22.079000+00:00 1 None 1
11 4Ilrnj9U hematopoietic cell CL:0000988 None haemopoietic cell|haematopoietic cell|hemopoie... A Cell Of A Hematopoietic Lineage. 1 16 None 2025-09-14 14:13:22.079000+00:00 1 None 1
12 u3sr1Gdf nucleate cell CL:0002242 None None A Cell Containing At Least One Nucleus. 1 16 None 2025-09-14 14:13:22.079000+00:00 1 None 1
1 ryEtgi1y B cell CL:0000236 None B-cell|B-lymphocyte|B lymphocyte A Lymphocyte Of B Lineage That Is Capable Of B... 1 16 None 2025-09-14 14:13:21.660000+00:00 1 None 1
2 22LvKd01 T cell CL:0000084 None T lymphocyte|T-lymphocyte|T-cell A Type Of Lymphocyte Whose Defining Characteri... 1 16 None 2025-09-14 14:13:21.660000+00:00 1 None 1

The CellType registry is hierarchical because it contains the Cell Ontology.

bt.CellType.get(name="CD8-positive, alpha-beta T cell").view_parents()
../_images/a79ede3b1898189af834bf1a1fab8e1eab4ca24cf000216f978d4ccb381561bd.svg
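
What a hierarchical registry makes possible can be sketched with a toy parent mapping. The `PARENTS` dictionary below is a deliberately simplified, illustrative subset that omits intermediate ontology terms, and the `ancestors` helper is hypothetical; neither reflects how bionty stores or traverses the Cell Ontology.

```python
# Toy child -> parents mapping (simplified and illustrative, not the full Cell Ontology).
PARENTS: dict[str, list[str]] = {
    "CD8-positive, alpha-beta T cell": ["alpha-beta T cell"],
    "alpha-beta T cell": ["T cell"],
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte of B lineage"],
    "lymphocyte of B lineage": ["lymphocyte"],
    "lymphocyte": [],
}


def ancestors(term: str, parents: dict[str, list[str]]) -> list[str]:
    """Collect all ancestors of a term breadth-first, nearest ancestors first."""
    seen: list[str] = []
    queue = list(parents.get(term, []))
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(parents.get(current, []))
    return seen


lineage = ancestors("CD8-positive, alpha-beta T cell", PARENTS)
print(lineage)              # ['alpha-beta T cell', 'T cell', 'lymphocyte']
print("T cell" in lineage)  # True
```

With such a hierarchy, a query for a broad term like "T cell" can also match records annotated with a more specific descendant term.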

Overview of validation properties

Importantly, LaminDB offers not only a DataFrameCurator, but also an AnnDataCurator, MuDataCurator, SpatialDataCurator, and TiledbsomaCurator.

The overview below only concerns validating dataframes.

Experience of data engineer

| property | pydantic | pandera | lamindb |
|---|---|---|---|
| define schema as code | yes, in form of a pydantic.BaseModel | yes, in form of a pandera.DataFrameSchema | yes, in form of a lamindb.Schema |
| define schema as a set of constraints without the need of listing fields/columns/features; e.g. useful if validating 60k genes | no | no | yes |
| update labels independent of code | not possible because labels are enums/literals | not possible because labels are hard-coded in Check | possible by adding new terms to a registry |
| built-in validation from public ontologies | no | no | yes |
| sync labels with ELN/LIMS registries without code change | no | no | yes |
| can re-use fields/columns/features across schemas | limited via subclass | only in same Python session | yes because persisted in database |
| schema modifications can invalidate previously validated datasets | yes | yes | no because LaminDB allows querying datasets that were validated with a schema version |
| can use columnar organization of dataframe | no, need to iterate over potentially millions of rows | yes | yes |
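
The difference between row-wise and columnar validation in the last row of this overview can be sketched with pandas. The small dataframe and the `allowed` list are made up for illustration, and the plain membership test in the row-wise loop stands in for constructing one model instance per record.

```python
import pandas as pd

# Made-up example data; "NK cell" is intentionally outside the allowed labels.
example = pd.DataFrame(
    {
        "cell_type": ["B cell", "T cell", "NK cell"],
        "treatment_time_h": [24, 24, 6],
    }
)
allowed = ["T cell", "B cell"]

# Row-wise: one Python-level check per record, as when building one model per row.
row_failures = [
    i
    for i, record in enumerate(example.to_dict(orient="records"))
    if record["cell_type"] not in allowed
]

# Columnar: a single vectorized membership test over the whole column.
mask = ~example["cell_type"].isin(allowed)
col_failures = example.loc[mask].index.tolist()

print(row_failures)  # [2]
print(col_failures)  # [2]
```

Both approaches find the same failing record, but the columnar check avoids materializing a Python dictionary per row.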

Experience of data consumer

| property | pydantic | pandera | lamindb |
|---|---|---|---|
| dataset is queryable / findable | no | no | yes, by querying for labels & features |
| dataset is annotated | no | no | yes |
| user knows what validation constraints were | no, because might not have access to code and doesn’t know which code was run | no (same as pydantic) | yes, via artifact.schema |

Annotation & queryability

Engineer: annotate the dataset

Either use the Curator object:

artifact = curator.save_artifact(key="our_datasets/dataset1.parquet")
! no run & transform got linked, call `ln.track()` & re-run
 returning existing schema with same hash: Schema(uid='2DdYiRC03UXtZ68p', name='My immuno schema', n=7, is_type=False, itype='Feature', hash='uNBrvtukbcdE9RM2C3QlqA', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=1, created_at=2025-09-14 14:13:20 UTC)

If you don’t expect a need for Curator functionality for updating ontologies and standardization, you can also use the Artifact constructor.

artifact = ln.Artifact.from_dataframe(
    df, key="our_datasets/dataset1.parquet", schema=lamindb_schema
).save()
! no run & transform got linked, call `ln.track()` & re-run
 returning existing artifact with same hash: Artifact(uid='nJ3BUJBKlwmlIzsq0000', is_latest=True, key='our_datasets/dataset1.parquet', suffix='.parquet', kind='dataset', otype='DataFrame', size=9868, hash='aUPgpQYtBZ_zvZvl637r9g', n_observations=3, branch_id=1, space_id=1, storage_id=1, schema_id=1, created_by_id=1, created_at=2025-09-14 14:13:24 UTC); to track this artifact as an input, use: ln.Artifact.get()
! 5 terms not validated in feature 'columns': 'ENSG00000153563', 'ENSG00000010610', 'ENSG00000170458', 'sample_note', 'donor_ethnicity'
    → fix typos, remove non-existent values, or save terms via: curator.cat.add_new_from('columns')
 returning existing schema with same hash: Schema(uid='2DdYiRC03UXtZ68p', name='My immuno schema', n=7, is_type=False, itype='Feature', hash='uNBrvtukbcdE9RM2C3QlqA', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=1, created_at=2025-09-14 14:13:20 UTC)

Consumer: see annotations

artifact.describe()
Artifact .parquet · DataFrame · dataset
├── General
│   ├── key: our_datasets/dataset1.parquet
│   ├── uid: nJ3BUJBKlwmlIzsq0000          hash: aUPgpQYtBZ_zvZvl637r9g
│   ├── size: 9.6 KB                       transform: None
│   ├── space: all                         branch: all
│   ├── created_by: testuser1              created_at: 2025-09-14 14:13:24
│   ├── n_observations: 3
│   └── storage path: 
/home/runner/work/lamindb/lamindb/docs/faq/test-pydantic-pandera/our_datasets/dataset1.parquet
├── Dataset features
│   └── columns • 7                  [Feature]                                                                  
assay_oid                       cat[bionty.ExperimentalFactor.on…  single-cell RNA sequencing              
cell_type_by_expert             cat[bionty.CellType]               B cell, CD8-positive, alpha-beta T cell 
cell_type_by_model              cat[bionty.CellType]               B cell, T cell                          
perturbation                    cat[ULabel]                        DMSO, IFNG                              
concentration                   str                                                                        
treatment_time_h                int                                                                        
donor                           str                                                                        
└── Labels
    └── .cell_types                     bionty.CellType                    B cell, T cell, CD8-positive, alpha-bet…
        .experimental_factors           bionty.ExperimentalFactor          single-cell RNA sequencing              
        .ulabels                        ULabel                             DMSO, IFNG                              

Consumer: query the dataset

ln.Artifact.filter(perturbation="IFNG").to_dataframe()
uid key description suffix kind otype size hash n_files n_observations _hash_type _key_is_virtual _overwrite_versions space_id storage_id schema_id version is_latest run_id created_at created_by_id _aux branch_id
id
1 nJ3BUJBKlwmlIzsq0000 our_datasets/dataset1.parquet None .parquet dataset DataFrame 9868 aUPgpQYtBZ_zvZvl637r9g None 3 md5 True False 1 1 1 None True None 2025-09-14 14:13:24.726000+00:00 1 {'af': {'0': True}} 1

Consumer: understand validation

By accessing artifact.schema, the consumer can understand how the dataset was validated.

artifact.schema
Schema(uid='2DdYiRC03UXtZ68p', name='My immuno schema', n=7, is_type=False, itype='Feature', hash='uNBrvtukbcdE9RM2C3QlqA', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=1, created_at=2025-09-14 14:13:20 UTC)

artifact.schema.features.to_dataframe()
uid name dtype is_type unit description array_rank array_size array_shape proxy_dtype synonyms _expect_many _curation space_id type_id run_id created_at created_by_id _aux branch_id
id
1 CigI933a31cZ perturbation cat[ULabel] None None None 0 0 None None None None None 1 None None 2025-09-14 14:13:20.937000+00:00 1 {'af': {'0': None, '1': True, '2': False}} 1
2 papKWyFCQtDx cell_type_by_model cat[bionty.CellType] None None None 0 0 None None None None None 1 None None 2025-09-14 14:13:20.942000+00:00 1 {'af': {'0': None, '1': True, '2': False}} 1
3 BwJDh5eO6IcG cell_type_by_expert cat[bionty.CellType] None None None 0 0 None None None None None 1 None None 2025-09-14 14:13:20.947000+00:00 1 {'af': {'0': None, '1': True, '2': False}} 1
4 0H0gWsMRX8lT assay_oid cat[bionty.ExperimentalFactor.ontology_id] None None None 0 0 None None None None None 1 None None 2025-09-14 14:13:20.952000+00:00 1 {'af': {'0': None, '1': True, '2': False}} 1
5 w12D5titNbAJ concentration str None None None 0 0 None None None None None 1 None None 2025-09-14 14:13:20.957000+00:00 1 {'af': {'0': None, '1': True, '2': False}} 1
6 5eGFhCXlbRpX treatment_time_h int None None None 0 0 None None None None None 1 None None 2025-09-14 14:13:20.963000+00:00 1 {'af': {'0': None, '1': True, '2': False}} 1
7 cpVqBUDCn492 donor str None None None 0 0 None None None None None 1 None None 2025-09-14 14:13:20.968000+00:00 1 {'af': {'0': None, '1': True, '2': False}} 1

Nested data with dynamic keys

We will now examine another more complex example where data is nested with potentially arbitrary (dynamic) keys. The example is inspired by the CELLxGENE schema where annotations are stored as dictionaries in the AnnData .uns slot.

uns_dict = ln.core.datasets.dict_cellxgene_uns()
pprint.pprint(uns_dict)
{'organism_ontology_term_id': 'NCBITaxon:9606',
 'spatial': {'is_single': True,
             'library_1': {'images': {'fullres': 'path/to/fullres.jpg',
                                      'hires': 'path/to/hires.jpg'},
                           'scalefactors': {'spot_diameter_fullres': 89.43,
                                            'tissue_hires_scalef': 0.177}},
             'library_2': {'images': {'fullres': 'path/to/fullres_2.jpg',
                                      'hires': 'path/to/hires_2.jpg'},
                           'scalefactors': {'spot_diameter_fullres': 120.34,
                                            'tissue_hires_scalef': 0.355}}}}

pydantic

Pydantic is primed to deal with nested data.

class Images(pydantic.BaseModel):
    fullres: str
    hires: str


class Scalefactors(pydantic.BaseModel):
    spot_diameter_fullres: float
    tissue_hires_scalef: float


class Library(pydantic.BaseModel):
    images: Images
    scalefactors: Scalefactors


class Spatial(pydantic.BaseModel):
    is_single: bool
    model_config = {"extra": "allow"}

    def __init__(self, **data):
        libraries = {}
        other_fields = {}

        # store all libraries under a single key for validation
        for key, value in data.items():
            if key.startswith("library_"):
                libraries[key] = Library(**value)
            else:
                other_fields[key] = value

        other_fields["libraries"] = libraries
        super().__init__(**other_fields)


class SpatialDataSchema(pydantic.BaseModel):
    organism_ontology_term_id: str
    spatial: Spatial


validated_data = SpatialDataSchema(**uns_dict)

However, pydantic requires either that all dictionary keys are known beforehand when constructing the model classes, or a workaround that collects the dynamically named keys under a single field, as in the Spatial model above.
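
One alternative form of this workaround, sketched here under the assumption of pydantic v2, declares a typed `libraries: dict[str, Library]` field and gathers the dynamic keys in a `model_validator(mode="before")` instead of overriding `__init__`. The validator name `collect_libraries` is hypothetical, and the sketch assumes the input only uses `library_*` keys for libraries.

```python
from typing import Any

import pydantic


class Images(pydantic.BaseModel):
    fullres: str
    hires: str


class Scalefactors(pydantic.BaseModel):
    spot_diameter_fullres: float
    tissue_hires_scalef: float


class Library(pydantic.BaseModel):
    images: Images
    scalefactors: Scalefactors


class Spatial(pydantic.BaseModel):
    is_single: bool
    # all dynamically named "library_*" entries end up in this typed mapping
    libraries: dict[str, Library]

    @pydantic.model_validator(mode="before")
    @classmethod
    def collect_libraries(cls, data: Any) -> Any:
        """Move every 'library_*' key of the raw input under 'libraries'."""
        if not isinstance(data, dict):
            return data
        libraries = {k: v for k, v in data.items() if k.startswith("library_")}
        rest = {k: v for k, v in data.items() if not k.startswith("library_")}
        return {**rest, "libraries": libraries}


spatial = Spatial(
    is_single=True,
    library_1={
        "images": {"fullres": "path/to/fullres.jpg", "hires": "path/to/hires.jpg"},
        "scalefactors": {"spot_diameter_fullres": 89.43, "tissue_hires_scalef": 0.177},
    },
)
print(sorted(spatial.libraries))  # ['library_1']
```

The dynamic keys still have to be recognized by a naming convention, so this remains a workaround rather than native support for arbitrary keys.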

pandera

Pandera cannot validate dictionaries because it is designed for structured dataframe data. Therefore, we need to flatten the dictionary to transform it into a DataFrame:

def _flatten_dict(d: dict[Any, Any], parent_key: str = "", sep: str = "_"):
    items = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(_flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)


def create_dynamic_schema(flattened_data: dict[str, Any]):
    schema_dict = {
        "organism_ontology_term_id": pandera.Column(str),
        "spatial_is_single": pandera.Column(bool),
    }

    for key in flattened_data.keys():
        if key.startswith("spatial_library_") and key.endswith("_images_fullres"):
            lib_prefix = key.replace("_images_fullres", "")
            schema_dict.update(
                {
                    f"{lib_prefix}_images_fullres": pandera.Column(str),
                    f"{lib_prefix}_images_hires": pandera.Column(str),
                    f"{lib_prefix}_scalefactors_spot_diameter_fullres": pandera.Column(
                        float
                    ),
                    f"{lib_prefix}_scalefactors_tissue_hires_scalef": pandera.Column(
                        float
                    ),
                }
            )

    return pandera.DataFrameSchema(schema_dict)


flattened = _flatten_dict(uns_dict)
df = pd.DataFrame([flattened])
spatial_schema = create_dynamic_schema(flattened)
validated_df = spatial_schema.validate(df)

Analogously to pydantic, pandera does not have out-of-the-box support for dynamically named keys. Therefore, it is necessary to dynamically construct a pandera schema.
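
To make the flattening step concrete, the following sketch re-implements the helper from above under the name `flatten_dict` and applies it to a small made-up dictionary. It also shows a caveat of joining keys with `_`: two different nested paths can flatten to the same column name, in which case the later value silently wins.

```python
from typing import Any


def flatten_dict(d: dict[str, Any], parent_key: str = "", sep: str = "_") -> dict[str, Any]:
    """Flatten nested dictionaries by joining the keys along each path with `sep`."""
    items: list[tuple[str, Any]] = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)


nested = {"spatial": {"is_single": True, "library_1": {"images": {"hires": "a.jpg"}}}}
flat = flatten_dict(nested)
print(flat)
# {'spatial_is_single': True, 'spatial_library_1_images_hires': 'a.jpg'}

# Caveat: distinct paths can collide after flattening; the later value overwrites.
collision = flatten_dict({"a_b": 1, "a": {"b": 2}})
print(collision)  # {'a_b': 2}
```

Choosing a separator that cannot occur in the keys avoids such collisions.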

LaminDB

Similarly, LaminDB currently requires constructing flattened dataframes to dynamically create features for the schema, which can then be used for validation with the DataFrameCurator. Future improvements are expected, including support for a dictionary-specific curator.

def create_dynamic_schema(flattened_data: dict[str, Any]) -> ln.Schema:
    features = []

    for key, value in flattened_data.items():
        if key == "organism_ontology_term_id":
            features.append(ln.Feature(name=key, dtype=bt.Organism.ontology_id).save())
        elif isinstance(value, bool):
            features.append(ln.Feature(name=key, dtype=bool).save())
        elif isinstance(value, (int, float)):
            features.append(ln.Feature(name=key, dtype=float).save())
        else:
            features.append(ln.Feature(name=key, dtype=str).save())

    return ln.Schema(
        name="Spatial data schema", features=features, coerce_dtype=True
    ).save()


flattened = _flatten_dict(uns_dict)
flattened_df = pd.DataFrame([flattened])
spatial_schema = create_dynamic_schema(flattened)
curator = ln.curators.DataFrameCurator(flattened_df, spatial_schema)
curator.validate()
! you are trying to create a record with name='spatial_library_1_images_hires' but a record with similar name exists: 'spatial_library_1_images_fullres'. Did you mean to load it?
! you are trying to create a record with name='spatial_library_2_images_hires' but a record with similar name exists: 'spatial_library_2_images_fullres'. Did you mean to load it?

Note

Curators for scverse data structures allow for the specification of schema slots that access and validate dataframes in nested dictionary attributes like .attrs or .uns. These schema slots use colon-separated paths like 'attrs:sample' or 'uns:spatial:images' to target specific dataframes for validation.
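
How a colon-separated path can address a nested entry is sketched below with plain dictionaries. This is only an illustration of the path convention: the `resolve_slot` helper is hypothetical, the example data is made up, and the actual curators operate on attributes of scverse objects rather than on dictionaries.

```python
from typing import Any


def resolve_slot(container: dict[str, Any], slot: str, sep: str = ":") -> Any:
    """Walk a nested dictionary along a separator-delimited path such as 'uns:spatial'."""
    current: Any = container
    for part in slot.split(sep):
        if not isinstance(current, dict) or part not in current:
            raise KeyError(f"slot {slot!r} has no component {part!r}")
        current = current[part]
    return current


# Made-up stand-in for an object with a nested `.uns` attribute.
adata_like = {"uns": {"spatial": {"images": {"hires": "path/to/hires.jpg"}}}}

target = resolve_slot(adata_like, "uns:spatial:images")
print(target)  # {'hires': 'path/to/hires.jpg'}
```

A missing path component raises a `KeyError` that names the offending part, which makes a mistyped slot easy to diagnose.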