Pydantic & Pandera vs. LaminDB¶

This doc explains conceptual differences between data validation with pydantic, pandera, and LaminDB.

!lamin init --storage test-pydantic-pandera --modules bionty
→ initialized lamindb: testuser1/test-pydantic-pandera

Let us work with a test dataframe.

import pandas as pd
import pydantic
import lamindb as ln
import bionty as bt
import pandera.pandas as pandera
import pprint

from typing import Literal, Any

df = ln.core.datasets.small_dataset1()
df
→ connected lamindb: testuser1/test-pydantic-pandera
| | ENSG00000153563 | ENSG00000010610 | ENSG00000170458 | perturbation | sample_note | cell_type_by_expert | cell_type_by_model | assay_oid | concentration | treatment_time_h | donor | donor_ethnicity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sample1 | 1 | 3 | 5 | DMSO | was ok | B cell | B cell | EFO:0008913 | 0.1% | 24 | D0001 | [Chinese, Singaporean Chinese] |
| sample2 | 2 | 4 | 6 | IFNG | looks naah | CD8-positive, alpha-beta T cell | T cell | EFO:0008913 | 200 nM | 24 | D0002 | [Chinese, Han Chinese] |
| sample3 | 3 | 5 | 7 | DMSO | pretty! 🤩 | CD8-positive, alpha-beta T cell | T cell | EFO:0008913 | 0.1% | 6 | None | [Chinese] |

Define a schema¶

pydantic¶

Perturbation = Literal["DMSO", "IFNG"]
CellType = Literal["T cell", "B cell"]
OntologyID = Literal["EFO:0008913"]


class ImmunoSchema(pydantic.BaseModel):
    perturbation: Perturbation
    cell_type_by_model: CellType
    cell_type_by_expert: CellType
    assay_oid: OntologyID
    concentration: str
    treatment_time_h: int
    donor: str | None

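    # pydantic v2 style configuration (replaces the deprecated inner `class Config`)
    model_config = pydantic.ConfigDict(title="My immuno schema")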

pandera¶

pandera_schema = pandera.DataFrameSchema(
    {
        "perturbation": pandera.Column(
            str, checks=pandera.Check.isin(["DMSO", "IFNG"])
        ),
        "cell_type_by_model": pandera.Column(
            str, checks=pandera.Check.isin(["T cell", "B cell"])
        ),
        "cell_type_by_expert": pandera.Column(
            str, checks=pandera.Check.isin(["T cell", "B cell"])
        ),
        "assay_oid": pandera.Column(str, checks=pandera.Check.isin(["EFO:0008913"])),
        "concentration": pandera.Column(str),
        "treatment_time_h": pandera.Column(int),
        "donor": pandera.Column(str, nullable=True),
    },
    name="My immuno schema",
)

LaminDB¶

Features & labels are defined at the level of the database instance. You can either define a schema with required (and optional) columns:

ln.ULabel(name="DMSO").save()
ln.ULabel(name="IFNG").save()

# leverage ontologies through types ln.ULabel, bt.CellType, bt.ExperimentalFactor
lamindb_schema = ln.Schema(
    name="My immuno schema",
    features=[
        ln.Feature(name="perturbation", dtype=ln.ULabel).save(),
        ln.Feature(name="cell_type_by_model", dtype=bt.CellType).save(),
        ln.Feature(name="cell_type_by_expert", dtype=bt.CellType).save(),
        ln.Feature(name="assay_oid", dtype=bt.ExperimentalFactor.ontology_id).save(),
        ln.Feature(name="concentration", dtype=str).save(),
        ln.Feature(name="treatment_time_h", dtype=int).save(),
        ln.Feature(name="donor", dtype=str, nullable=True).save(),
    ],
).save()

Or merely define a constraint on the feature identifier type (`itype`), without listing individual columns:

lamindb_schema_only_itype = ln.Schema(
    name="Allow any valid features & labels", itype=ln.Feature
)

Validate a dataframe¶

pydantic¶

class DataFrameValidationError(Exception):
    pass


def validate_dataframe(df: pd.DataFrame, model: type[pydantic.BaseModel]):
    errors = []

    for i, row in enumerate(df.to_dict(orient="records")):
        try:
            model(**row)
        except pydantic.ValidationError as e:
            errors.append(f"row {i} failed validation: {e}")

    if errors:
        error_message = "\n".join(errors)
        raise DataFrameValidationError(
            f"DataFrame validation failed with the following errors:\n{error_message}"
        )


try:
    validate_dataframe(df, ImmunoSchema)
except DataFrameValidationError as e:
    print(e)
DataFrame validation failed with the following errors:
row 1 failed validation: 1 validation error for My immuno schema
cell_type_by_expert
  Input should be 'T cell' or 'B cell' [type=literal_error, input_value='CD8-positive, alpha-beta T cell', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/literal_error
row 2 failed validation: 1 validation error for My immuno schema
cell_type_by_expert
  Input should be 'T cell' or 'B cell' [type=literal_error, input_value='CD8-positive, alpha-beta T cell', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/literal_error

To fix the validation error, we need to update the Literal and re-run the model definition.

Perturbation = Literal["DMSO", "IFNG"]
CellType = Literal[
    "T cell", "B cell", "CD8-positive, alpha-beta T cell"  # <-- updated
]
OntologyID = Literal["EFO:0008913"]


class ImmunoSchema(pydantic.BaseModel):
    perturbation: Perturbation
    cell_type_by_model: CellType
    cell_type_by_expert: CellType
    assay_oid: OntologyID
    concentration: str
    treatment_time_h: int
    donor: str | None

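    # pydantic v2 style configuration (replaces the deprecated inner `class Config`)
    model_config = pydantic.ConfigDict(title="My immuno schema")


validate_dataframe(df, ImmunoSchema)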

pandera¶

try:
    pandera_schema.validate(df)
except pandera.errors.SchemaError as e:
    print(e)
Column 'cell_type_by_expert' failed element-wise validator number 0: isin(['T cell', 'B cell']) failure cases: CD8-positive, alpha-beta T cell, CD8-positive, alpha-beta T cell

LaminDB¶

Because the term "CD8-positive, alpha-beta T cell" is part of the public CellType ontology, validation passes the first time.

If validation had not passed, we could have resolved the issue simply by adding a new term to the CellType registry rather than editing the code. This also puts downstream data scientists in a position to update ontologies.

curator = ln.curators.DataFrameCurator(df, lamindb_schema)
curator.validate()
! 5 terms not validated in feature 'columns': 'ENSG00000153563', 'ENSG00000010610', 'ENSG00000170458', 'sample_note', 'donor_ethnicity'
    → fix typos, remove non-existent values, or save terms via: curator.cat.add_new_from('columns')
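
The idea of resolving a failed validation by adding a term to a registry, rather than editing code, can be illustrated without LaminDB. The following is a minimal sketch that uses an in-memory SQLite table as a stand-in registry. The table name `cell_type` and the helpers `registered_terms` and `find_unvalidated` are hypothetical and are not part of LaminDB's API.

```python
import sqlite3

# An in-memory SQLite table stands in for a label registry
# (hypothetical; this is not how LaminDB stores its registries).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cell_type (name TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO cell_type VALUES (?)", [("T cell",), ("B cell",)])


def registered_terms(connection: sqlite3.Connection) -> set[str]:
    """Read the currently allowed labels from the registry table."""
    return {row[0] for row in connection.execute("SELECT name FROM cell_type")}


def find_unvalidated(values: list[str], connection: sqlite3.Connection) -> list[str]:
    """Return the distinct values missing from the registry, in order of appearance."""
    allowed = registered_terms(connection)
    return [v for v in dict.fromkeys(values) if v not in allowed]


column = ["B cell", "CD8-positive, alpha-beta T cell", "CD8-positive, alpha-beta T cell"]

before = find_unvalidated(column, conn)

# Resolve the failure by adding a term to the registry; the validation code is unchanged.
conn.execute("INSERT INTO cell_type VALUES (?)", ("CD8-positive, alpha-beta T cell",))
after = find_unvalidated(column, conn)

print(before)
print(after)
```

Because the allowed labels live in data rather than in a `Literal` or a `Check`, anyone with write access to the registry can extend them without touching the validation code.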

What was the cell type validation based on? Let’s inspect the CellType registry.

bt.CellType.df()
uid name ontology_id abbr synonyms description space_id source_id run_id created_at created_by_id _aux branch_id
id
14 6By01L04 alpha-beta T cell CL:0000789 None alpha-beta T-cell|alpha-beta T lymphocyte|alph... A T Cell That Expresses An Alpha-Beta T Cell R... 1 16 None 2025-08-06 17:32:58.318000+00:00 1 None 1
15 4BEwsp1Q mature alpha-beta T cell CL:0000791 None mature alpha-beta T-lymphocyte|mature alpha-be... A Alpha-Beta T Cell That Has A Mature Phenotype. 1 16 None 2025-08-06 17:32:58.318000+00:00 1 None 1
16 2OTzqBTM mature T cell CL:0002419 None CD3e-positive T cell|mature T-cell A T Cell That Expresses A T Cell Receptor Comp... 1 16 None 2025-08-06 17:32:58.318000+00:00 1 None 1
13 6IC9NGJE CD8-positive, alpha-beta T cell CL:0000625 None CD8-positive, alpha-beta T-cell|CD8-positive, ... A T Cell Expressing An Alpha-Beta T Cell Recep... 1 16 None 2025-08-06 17:32:58.052000+00:00 1 None 1
3 4bKGljt0 cell CL:0000000 None None A Material Entity Of Anatomical Origin (Part O... 1 16 None 2025-08-06 17:32:57.773000+00:00 1 None 1
4 2K93w3xO motile cell CL:0000219 None None A Cell That Moves By Its Own Activities. 1 16 None 2025-08-06 17:32:57.773000+00:00 1 None 1
5 2cXC7cgF single nucleate cell CL:0000226 None None A Cell With A Single Nucleus. 1 16 None 2025-08-06 17:32:57.773000+00:00 1 None 1
6 4WnpvUTH eukaryotic cell CL:0000255 None None Any Cell That Only Exists In Eukaryota. 1 16 None 2025-08-06 17:32:57.773000+00:00 1 None 1
7 X6c7osZ5 lymphocyte CL:0000542 None None A Lymphocyte Is A Leukocyte Commonly Found In ... 1 16 None 2025-08-06 17:32:57.773000+00:00 1 None 1
8 3VEAlFdi leukocyte CL:0000738 None white blood cell|leucocyte An Achromatic Cell Of The Myeloid Or Lymphoid ... 1 16 None 2025-08-06 17:32:57.773000+00:00 1 None 1
9 2Jgr5Xx4 mononuclear cell CL:0000842 None mononuclear leukocyte A Leukocyte With A Single Non-Segmented Nucleu... 1 16 None 2025-08-06 17:32:57.773000+00:00 1 None 1
10 7GpphKmr lymphocyte of B lineage CL:0000945 None None A Lymphocyte Of B Lineage With The Commitment ... 1 16 None 2025-08-06 17:32:57.773000+00:00 1 None 1
11 4Ilrnj9U hematopoietic cell CL:0000988 None haematopoietic cell|hemopoietic cell|haemopoie... A Cell Of A Hematopoietic Lineage. 1 16 None 2025-08-06 17:32:57.773000+00:00 1 None 1
12 u3sr1Gdf nucleate cell CL:0002242 None None A Cell Containing At Least One Nucleus. 1 16 None 2025-08-06 17:32:57.773000+00:00 1 None 1
1 ryEtgi1y B cell CL:0000236 None B lymphocyte|B-lymphocyte|B-cell A Lymphocyte Of B Lineage That Is Capable Of B... 1 16 None 2025-08-06 17:32:57.423000+00:00 1 None 1
2 22LvKd01 T cell CL:0000084 None T-cell|T-lymphocyte|T lymphocyte A Type Of Lymphocyte Whose Defining Characteri... 1 16 None 2025-08-06 17:32:57.423000+00:00 1 None 1

The CellType registry is hierarchical as it contains the Cell Ontology.

bt.CellType.get(name="CD8-positive, alpha-beta T cell").view_parents()
[Figure: parent terms of "CD8-positive, alpha-beta T cell" in the CellType registry]

Overview of validation properties¶

Importantly, LaminDB offers not only a DataFrameCurator, but also an AnnDataCurator, MuDataCurator, SpatialDataCurator, and TiledbsomaCurator.

The below overview only concerns validating dataframes.

Experience of data engineer¶

| property | pydantic | pandera | lamindb |
|---|---|---|---|
| define schema as code | yes, in form of a pydantic.BaseModel | yes, in form of a pandera.DataFrameSchema | yes, in form of a lamindb.Schema |
| define schema as a set of constraints without the need of listing fields/columns/features; e.g. useful if validating 60k genes | no | no | yes |
| update labels independent of code | not possible because labels are enums/literals | not possible because labels are hard-coded in Check | possible by adding new terms to a registry |
| built-in validation from public ontologies | no | no | yes |
| sync labels with ELN/LIMS registries without code change | no | no | yes |
| can re-use fields/columns/features across schemas | limited via subclass | only in same Python session | yes because persisted in database |
| schema modifications can invalidate previously validated datasets | yes | yes | no because LaminDB allows querying datasets that were validated with a given schema version |
| can use columnar organization of dataframe | no, need to iterate over potentially millions of rows | yes | yes |
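
The "columnar organization" property can be made concrete with plain pandas. The sketch below contrasts a row-wise check, as in the pydantic loop, with a single vectorized membership test over a column, which is the style of check that `isin` performs. The toy dataframe and the variable names are hypothetical, and the code calls neither pandera nor LaminDB.

```python
import pandas as pd

# Hypothetical toy data, not the dataset used above.
toy = pd.DataFrame(
    {
        "perturbation": ["DMSO", "IFNG", "DMSO", "XYZ"],
        "treatment_time_h": [24, 24, 6, 6],
    }
)
allowed = {"DMSO", "IFNG"}

# Row-wise: one Python-level check per record.
row_failures = [
    i
    for i, row in enumerate(toy.to_dict(orient="records"))
    if row["perturbation"] not in allowed
]

# Columnar: one vectorized membership test over the whole column.
mask = ~toy["perturbation"].isin(allowed)
col_failures = toy.index[mask].tolist()

print(row_failures, col_failures)
```

Both approaches flag the same records; the columnar variant avoids materializing one dictionary per row.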

Experience of data consumer¶

| property | pydantic | pandera | lamindb |
|---|---|---|---|
| dataset is queryable / findable | no | no | yes, by querying for labels & features |
| dataset is annotated | no | no | yes |
| user knows what validation constraints were | no, because might not have access to code and doesn’t know which code was run | no (same as pydantic) | yes, via artifact.schema |

Annotation & queryability¶

Engineer: annotate the dataset¶

Either use the Curator object:

artifact = curator.save_artifact(key="our_datasets/dataset1.parquet")
! no run & transform got linked, call `ln.track()` & re-run
→ returning existing schema with same hash: Schema(uid='mXXNUBXa0dcIpjAh', name='My immuno schema', n=7, is_type=False, itype='Feature', hash='7jMK9hCpchb193u8fxYjhQ', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=1, created_at=2025-08-06 17:32:56 UTC)

If you don’t expect a need for Curator functionality for updating ontologies and standardization, you can also use the Artifact constructor.

artifact = ln.Artifact.from_df(
    df, key="our_datasets/dataset1.parquet", schema=lamindb_schema
).save()
! no run & transform got linked, call `ln.track()` & re-run
→ returning existing artifact with same hash: Artifact(uid='oe3o2LLA3YRVap830000', is_latest=True, key='our_datasets/dataset1.parquet', suffix='.parquet', kind='dataset', otype='DataFrame', size=9868, hash='9p8ssBs4kmrjbP6cPCeeaQ', n_observations=3, branch_id=1, space_id=1, storage_id=1, schema_id=1, created_by_id=1, created_at=2025-08-06 17:32:59 UTC); to track this artifact as an input, use: ln.Artifact.get()
! 5 terms not validated in feature 'columns': 'ENSG00000153563', 'ENSG00000010610', 'ENSG00000170458', 'sample_note', 'donor_ethnicity'
    → fix typos, remove non-existent values, or save terms via: curator.cat.add_new_from('columns')
→ returning existing schema with same hash: Schema(uid='mXXNUBXa0dcIpjAh', name='My immuno schema', n=7, is_type=False, itype='Feature', hash='7jMK9hCpchb193u8fxYjhQ', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=1, created_at=2025-08-06 17:32:56 UTC)

Consumer: see annotations¶

artifact.describe()
Artifact .parquet · DataFrame · dataset
├── General
│   ├── key: our_datasets/dataset1.parquet
│   ├── uid: oe3o2LLA3YRVap830000          hash: 9p8ssBs4kmrjbP6cPCeeaQ
│   ├── size: 9.6 KB                       transform: None
│   ├── space: all                         branch: all
│   ├── created_by: testuser1              created_at: 2025-08-06 17:32:59
│   ├── n_observations: 3
│   └── storage path: 
│       /home/runner/work/lamindb/lamindb/docs/faq/test-pydantic-pandera/our_datasets/dataset1.parquet
├── Dataset features
│   └── columns • 7                     [Feature]                                                                  
│       assay_oid                       cat[bionty.ExperimentalFactor.on…  single-cell RNA sequencing              
│       cell_type_by_expert             cat[bionty.CellType]               B cell, CD8-positive, alpha-beta T cell 
│       cell_type_by_model              cat[bionty.CellType]               B cell, T cell                          
│       perturbation                    cat[ULabel]                        DMSO, IFNG                              
│       concentration                   str                                                                        
│       treatment_time_h                int                                                                        
│       donor                           str                                                                        
└── Labels
    └── .cell_types                     bionty.CellType                    B cell, T cell, CD8-positive, alpha-bet…
        .experimental_factors           bionty.ExperimentalFactor          single-cell RNA sequencing              
        .ulabels                        ULabel                             DMSO, IFNG                              

Consumer: query the dataset¶

ln.Artifact.filter(perturbation="IFNG").df()
uid key description suffix kind otype size hash n_files n_observations _hash_type _key_is_virtual _overwrite_versions space_id storage_id schema_id version is_latest run_id created_at created_by_id _aux branch_id
id
1 oe3o2LLA3YRVap830000 our_datasets/dataset1.parquet None .parquet dataset DataFrame 9868 9p8ssBs4kmrjbP6cPCeeaQ None 3 md5 True False 1 1 1 None True None 2025-08-06 17:32:59.712000+00:00 1 {'af': {'0': True}} 1

Consumer: understand validation¶

By accessing artifact.schema, the consumer can understand how the dataset was validated.

artifact.schema
Schema(uid='mXXNUBXa0dcIpjAh', name='My immuno schema', n=7, is_type=False, itype='Feature', hash='7jMK9hCpchb193u8fxYjhQ', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=1, created_at=2025-08-06 17:32:56 UTC)
artifact.schema.features.df()
uid name dtype is_type unit description array_rank array_size array_shape proxy_dtype synonyms _expect_many _curation space_id type_id run_id created_at created_by_id _aux branch_id
id
1 kyowrwUJ5W14 perturbation cat[ULabel] None None None 0 0 None None None None None 1 None None 2025-08-06 17:32:56.728000+00:00 1 {'af': {'0': None, '1': True, '2': False}} 1
2 C7iiZwuXe0bo cell_type_by_model cat[bionty.CellType] None None None 0 0 None None None None None 1 None None 2025-08-06 17:32:56.784000+00:00 1 {'af': {'0': None, '1': True, '2': False}} 1
3 3XB796ZHwlmX cell_type_by_expert cat[bionty.CellType] None None None 0 0 None None None None None 1 None None 2025-08-06 17:32:56.789000+00:00 1 {'af': {'0': None, '1': True, '2': False}} 1
4 EGVHhO5HBvTY assay_oid cat[bionty.ExperimentalFactor.ontology_id] None None None 0 0 None None None None None 1 None None 2025-08-06 17:32:56.794000+00:00 1 {'af': {'0': None, '1': True, '2': False}} 1
5 C3Ou3ZRxjr0Q concentration str None None None 0 0 None None None None None 1 None None 2025-08-06 17:32:56.799000+00:00 1 {'af': {'0': None, '1': True, '2': False}} 1
6 IKN89zSOwGnt treatment_time_h int None None None 0 0 None None None None None 1 None None 2025-08-06 17:32:56.804000+00:00 1 {'af': {'0': None, '1': True, '2': False}} 1
7 oIEcTBWewxCJ donor str None None None 0 0 None None None None None 1 None None 2025-08-06 17:32:56.809000+00:00 1 {'af': {'0': None, '1': True, '2': False}} 1

Nested data with dynamic keys¶

We will now examine another, more complex example in which data is nested with potentially arbitrary (dynamic) keys. The example is inspired by the CELLxGENE schema, where annotations are stored as dictionaries in the AnnData .uns slot.

uns_dict = ln.core.datasets.dict_cxg_uns()
pprint.pprint(uns_dict)
{'organism_ontology_term_id': 'NCBITaxon:9606',
 'spatial': {'is_single': True,
             'library_1': {'images': {'fullres': 'path/to/fullres.jpg',
                                      'hires': 'path/to/hires.jpg'},
                           'scalefactors': {'spot_diameter_fullres': 89.43,
                                            'tissue_hires_scalef': 0.177}},
             'library_2': {'images': {'fullres': 'path/to/fullres_2.jpg',
                                      'hires': 'path/to/hires_2.jpg'},
                           'scalefactors': {'spot_diameter_fullres': 120.34,
                                            'tissue_hires_scalef': 0.355}}}}

pydantic¶

Pydantic is well suited to nested data.

class Images(pydantic.BaseModel):
    fullres: str
    hires: str


class Scalefactors(pydantic.BaseModel):
    spot_diameter_fullres: float
    tissue_hires_scalef: float


class Library(pydantic.BaseModel):
    images: Images
    scalefactors: Scalefactors


class Spatial(pydantic.BaseModel):
    is_single: bool
    model_config = {"extra": "allow"}

    def __init__(self, **data):
        libraries = {}
        other_fields = {}

        # store all libraries under a single key for validation
        for key, value in data.items():
            if key.startswith("library_"):
                libraries[key] = Library(**value)
            else:
                other_fields[key] = value

        other_fields["libraries"] = libraries
        super().__init__(**other_fields)


class SpatialDataSchema(pydantic.BaseModel):
    organism_ontology_term_id: str
    spatial: Spatial


validated_data = SpatialDataSchema(**uns_dict)

However, pydantic requires either that all dictionary keys are known beforehand so that the model classes can be written out, or a workaround that collects the dynamically named keys under a single field, as in the custom `__init__` above.
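
Another form of this workaround, sketched here under the assumption of pydantic v2, is a `model_validator(mode="before")` that moves every `library_*` key into one typed `dict[str, Library]` field before validation runs. The field name `libraries` and the validator name `collect_libraries` are illustrative choices, not part of any schema referenced above.

```python
from typing import Any

import pydantic


class Images(pydantic.BaseModel):
    fullres: str
    hires: str


class Scalefactors(pydantic.BaseModel):
    spot_diameter_fullres: float
    tissue_hires_scalef: float


class Library(pydantic.BaseModel):
    images: Images
    scalefactors: Scalefactors


class Spatial(pydantic.BaseModel):
    is_single: bool
    # all dynamically named "library_*" entries are collected here
    libraries: dict[str, Library] = pydantic.Field(default_factory=dict)

    @pydantic.model_validator(mode="before")
    @classmethod
    def collect_libraries(cls, data: Any) -> Any:
        # only restructure plain dictionaries; pass anything else through
        if not isinstance(data, dict):
            return data
        libraries = {k: v for k, v in data.items() if k.startswith("library_")}
        rest = {k: v for k, v in data.items() if not k.startswith("library_")}
        return {**rest, "libraries": libraries}


spatial = Spatial.model_validate(
    {
        "is_single": True,
        "library_1": {
            "images": {"fullres": "path/to/fullres.jpg", "hires": "path/to/hires.jpg"},
            "scalefactors": {"spot_diameter_fullres": 89.43, "tissue_hires_scalef": 0.177},
        },
    }
)
print(sorted(spatial.libraries))
```

This keeps each library fully typed, but it is still a workaround: the `library_` prefix convention has to be encoded by hand.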

pandera¶

Pandera cannot validate dictionaries because it is designed for structured dataframe data. Therefore, we need to flatten the dictionary to transform it into a DataFrame:

def _flatten_dict(d: dict[Any, Any], parent_key: str = "", sep: str = "_"):
    items = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(_flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)


def create_dynamic_schema(flattened_data: dict[str, Any]):
    schema_dict = {
        "organism_ontology_term_id": pandera.Column(str),
        "spatial_is_single": pandera.Column(bool),
    }

    for key in flattened_data.keys():
        if key.startswith("spatial_library_") and key.endswith("_images_fullres"):
            lib_prefix = key.replace("_images_fullres", "")
            schema_dict.update(
                {
                    f"{lib_prefix}_images_fullres": pandera.Column(str),
                    f"{lib_prefix}_images_hires": pandera.Column(str),
                    f"{lib_prefix}_scalefactors_spot_diameter_fullres": pandera.Column(
                        float
                    ),
                    f"{lib_prefix}_scalefactors_tissue_hires_scalef": pandera.Column(
                        float
                    ),
                }
            )

    return pandera.DataFrameSchema(schema_dict)


flattened = _flatten_dict(uns_dict)
df = pd.DataFrame([flattened])
spatial_schema = create_dynamic_schema(flattened)
validated_df = spatial_schema.validate(df)

Analogously to pydantic, pandera does not have out-of-the-box support for dynamically named keys. Therefore, it is necessary to dynamically construct a pandera schema.

LaminDB¶

Similarly, LaminDB currently requires constructing flattened dataframes to dynamically create features for the schema, which can then be used for validation with the DataFrameCurator. Future improvements are expected, including support for a dictionary-specific curator.

def create_dynamic_schema(flattened_data: dict[str, Any]) -> ln.Schema:
    features = []

    for key, value in flattened_data.items():
        if key == "organism_ontology_term_id":
            features.append(ln.Feature(name=key, dtype=bt.Organism.ontology_id).save())
        elif isinstance(value, bool):
            features.append(ln.Feature(name=key, dtype=bool).save())
        elif isinstance(value, (int, float)):
            features.append(ln.Feature(name=key, dtype=float).save())
        else:
            features.append(ln.Feature(name=key, dtype=str).save())

    return ln.Schema(
        name="Spatial data schema", features=features, coerce_dtype=True
    ).save()


flattened = _flatten_dict(uns_dict)
flattened_df = pd.DataFrame([flattened])
spatial_schema = create_dynamic_schema(flattened)
curator = ln.curators.DataFrameCurator(flattened_df, spatial_schema)
curator.validate()
! you are trying to create a record with name='spatial_library_1_images_hires' but a record with similar name exists: 'spatial_library_1_images_fullres'. Did you mean to load it?
! you are trying to create a record with name='spatial_library_2_images_hires' but a record with similar name exists: 'spatial_library_2_images_fullres'. Did you mean to load it?

Note

Curators for scverse data structures allow for the specification of schema slots that access and validate dataframes in nested dictionary attributes like .attrs or .uns. These schema slots use colon-separated paths like 'attrs:sample' or 'uns:spatial:images' to target specific dataframes for validation.
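
How a colon-separated path can address a nested entry is easy to sketch in plain Python. The helper `resolve_slot` below is hypothetical and only illustrates the addressing idea on nested dictionaries; it is not LaminDB's implementation, and the `obj` dictionary is a stand-in for an object with an `uns` attribute.

```python
from typing import Any


def resolve_slot(container: dict[str, Any], slot: str, sep: str = ":") -> Any:
    """Walk a nested dictionary along a separator-delimited path."""
    current: Any = container
    for part in slot.split(sep):
        if not isinstance(current, dict) or part not in current:
            raise KeyError(f"slot {slot!r} could not be resolved at {part!r}")
        current = current[part]
    return current


# Hypothetical stand-in for an object whose `uns` attribute holds nested dictionaries.
obj = {"uns": {"spatial": {"images": {"fullres": "path/to/fullres.jpg"}}}}

print(resolve_slot(obj, "uns:spatial:images"))
```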