Pydantic & Pandera vs. LaminDB¶

This doc explains conceptual differences between data validation with pydantic, pandera, and LaminDB.

!lamin init --storage test-pydantic-pandera --modules bionty

Let us work with a test dataframe.

import pandas as pd
import pydantic
import lamindb as ln
import bionty as bt
import pandera.pandas as pandera
import pprint

from typing import Literal, Any

df = ln.core.datasets.small_dataset1()
df

→ connected lamindb: testuser1/test-pydantic-pandera

	ENSG00000153563	ENSG00000010610	ENSG00000170458	perturbation	sample_note	cell_type_by_expert	cell_type_by_model	assay_oid	concentration	treatment_time_h	donor	donor_ethnicity
sample1	1	3	5	DMSO	was ok	B cell	B cell	EFO:0008913	0.1%	24	D0001	[Chinese, Singaporean Chinese]
sample2	2	4	6	IFNG	looks naah	CD8-positive, alpha-beta T cell	T cell	EFO:0008913	200 nM	24	D0002	[Chinese, Han Chinese]
sample3	3	5	7	DMSO	pretty! 🤩	CD8-positive, alpha-beta T cell	T cell	EFO:0008913	0.1%	6	None	[Chinese]

Define a schema¶

pydantic¶

Perturbation = Literal["DMSO", "IFNG"]
CellType = Literal["T cell", "B cell"]
OntologyID = Literal["EFO:0008913"]


class ImmunoSchema(pydantic.BaseModel):
    perturbation: Perturbation
    cell_type_by_model: CellType
    cell_type_by_expert: CellType
    assay_oid: OntologyID
    concentration: str
    treatment_time_h: int
    donor: str | None

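    # pydantic v2 configuration style, consistent with the `model_config` usage further below
    model_config = pydantic.ConfigDict(title="My immuno schema")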

pandera¶

pandera_schema = pandera.DataFrameSchema(
    {
        "perturbation": pandera.Column(
            str, checks=pandera.Check.isin(["DMSO", "IFNG"])
        ),
        "cell_type_by_model": pandera.Column(
            str, checks=pandera.Check.isin(["T cell", "B cell"])
        ),
        "cell_type_by_expert": pandera.Column(
            str, checks=pandera.Check.isin(["T cell", "B cell"])
        ),
        "assay_oid": pandera.Column(str, checks=pandera.Check.isin(["EFO:0008913"])),
        "concentration": pandera.Column(str),
        "treatment_time_h": pandera.Column(int),
        "donor": pandera.Column(str, nullable=True),
    },
    name="My immuno schema",
)

LaminDB¶

Features & labels are defined on the level of the database instance. You can either define a schema with required (and optional) columns.

ln.ULabel(name="DMSO").save()
ln.ULabel(name="IFNG").save()

# leverage ontologies through types ln.ULabel, bt.CellType, bt.ExperimentalFactor
lamindb_schema = ln.Schema(
    name="My immuno schema",
    features=[
        ln.Feature(name="perturbation", dtype=ln.ULabel).save(),
        ln.Feature(name="cell_type_by_model", dtype=bt.CellType).save(),
        ln.Feature(name="cell_type_by_expert", dtype=bt.CellType).save(),
        ln.Feature(name="assay_oid", dtype=bt.ExperimentalFactor.ontology_id).save(),
        ln.Feature(name="concentration", dtype=str).save(),
        ln.Feature(name="treatment_time_h", dtype=int).save(),
        ln.Feature(name="donor", dtype=str, nullable=True).save(),
    ],
).save()

Or merely define a constraint on the feature identifier.

lamindb_schema_only_itype = ln.Schema(
    name="Allow any valid features & labels", itype=ln.Feature
)

Validate a dataframe¶

pydantic¶

class DataFrameValidationError(Exception):
    pass


def validate_dataframe(df: pd.DataFrame, model: type[pydantic.BaseModel]):
    errors = []

    for i, row in enumerate(df.to_dict(orient="records")):
        try:
            model(**row)
        except pydantic.ValidationError as e:
            errors.append(f"row {i} failed validation: {e}")

    if errors:
        error_message = "\n".join(errors)
        raise DataFrameValidationError(
            f"DataFrame validation failed with the following errors:\n{error_message}"
        )

try:
    validate_dataframe(df, ImmunoSchema)
except DataFrameValidationError as e:
    print(e)
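The helper above joins each `ValidationError` into a single string. As a minimal sketch (the reduced `Row` model and the sample records are hypothetical and only mirror two columns of the dataset), the structured entries returned by `ValidationError.errors()` can be used to record which column failed in which row:

```python
from typing import Literal

import pydantic

CellType = Literal["T cell", "B cell"]


class Row(pydantic.BaseModel):  # hypothetical, reduced version of ImmunoSchema
    cell_type_by_expert: CellType
    treatment_time_h: int


records = [
    {"cell_type_by_expert": "B cell", "treatment_time_h": 24},
    {"cell_type_by_expert": "CD8-positive, alpha-beta T cell", "treatment_time_h": 24},
]

failures = []
for i, record in enumerate(records):
    try:
        Row(**record)
    except pydantic.ValidationError as e:
        # each entry carries a "loc" tuple naming the offending field
        for err in e.errors():
            failures.append((i, err["loc"][0]))

print(failures)
```

Only the second record is reported, because its cell type is not part of the `Literal`.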

To fix the validation error, we need to update the Literal and re-run the model definition.

Perturbation = Literal["DMSO", "IFNG"]
CellType = Literal[
    "T cell", "B cell", "CD8-positive, alpha-beta T cell"  # <-- updated
]
OntologyID = Literal["EFO:0008913"]


class ImmunoSchema(pydantic.BaseModel):
    perturbation: Perturbation
    cell_type_by_model: CellType
    cell_type_by_expert: CellType
    assay_oid: OntologyID
    concentration: str
    treatment_time_h: int
    donor: str | None

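    # pydantic v2 configuration style, consistent with the `model_config` usage further below
    model_config = pydantic.ConfigDict(title="My immuno schema")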

validate_dataframe(df, ImmunoSchema)

pandera¶

try:
    pandera_schema.validate(df)
except pandera.errors.SchemaError as e:
    print(e)

LaminDB¶

Because the term "CD8-positive, alpha-beta T cell" is part of the public CellType ontology, validation passes the first time.

If validation had not passed, we could have resolved the issue by adding a new term to the CellType registry rather than editing the code. This also enables downstream data scientists to update ontologies themselves.

curator = ln.curators.DataFrameCurator(df, lamindb_schema)
curator.validate()
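The idea of validating against a registry instead of a hard-coded list can be illustrated with a toy example. The sketch below is not how LaminDB is implemented; the in-memory SQLite table `cell_type` and the helper `non_validated` are assumptions made purely for illustration. It shows that registering a term changes the validation result while the validation code stays untouched:

```python
import sqlite3

# hypothetical stand-in for a label registry
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cell_type (name TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO cell_type VALUES (?)", [("T cell",), ("B cell",)])


def non_validated(values, conn):
    """Return the values that are missing from the registry table."""
    known = {row[0] for row in conn.execute("SELECT name FROM cell_type")}
    return sorted(set(values) - known)


column = ["B cell", "CD8-positive, alpha-beta T cell"]
before = non_validated(column, conn)

# registering the missing term is a data operation, not a code change
conn.execute(
    "INSERT INTO cell_type VALUES (?)", ("CD8-positive, alpha-beta T cell",)
)
after = non_validated(column, conn)

print(before, after)
```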

What was the cell type validation based on? Let’s inspect the CellType registry.

bt.CellType.df()


	uid	name	ontology_id	abbr	synonyms	description	space_id	source_id	run_id	created_at	created_by_id	_aux	branch_id
id
14	6By01L04	alpha-beta T cell	CL:0000789	None	alpha-beta T-cell\|alpha-beta T lymphocyte\|alph...	A T Cell That Expresses An Alpha-Beta T Cell R...	1	16	None	2025-08-06 17:32:58.318000+00:00	1	None	1
15	4BEwsp1Q	mature alpha-beta T cell	CL:0000791	None	mature alpha-beta T-lymphocyte\|mature alpha-be...	A Alpha-Beta T Cell That Has A Mature Phenotype.	1	16	None	2025-08-06 17:32:58.318000+00:00	1	None	1
16	2OTzqBTM	mature T cell	CL:0002419	None	CD3e-positive T cell\|mature T-cell	A T Cell That Expresses A T Cell Receptor Comp...	1	16	None	2025-08-06 17:32:58.318000+00:00	1	None	1
13	6IC9NGJE	CD8-positive, alpha-beta T cell	CL:0000625	None	CD8-positive, alpha-beta T-cell\|CD8-positive, ...	A T Cell Expressing An Alpha-Beta T Cell Recep...	1	16	None	2025-08-06 17:32:58.052000+00:00	1	None	1
3	4bKGljt0	cell	CL:0000000	None	None	A Material Entity Of Anatomical Origin (Part O...	1	16	None	2025-08-06 17:32:57.773000+00:00	1	None	1
4	2K93w3xO	motile cell	CL:0000219	None	None	A Cell That Moves By Its Own Activities.	1	16	None	2025-08-06 17:32:57.773000+00:00	1	None	1
5	2cXC7cgF	single nucleate cell	CL:0000226	None	None	A Cell With A Single Nucleus.	1	16	None	2025-08-06 17:32:57.773000+00:00	1	None	1
6	4WnpvUTH	eukaryotic cell	CL:0000255	None	None	Any Cell That Only Exists In Eukaryota.	1	16	None	2025-08-06 17:32:57.773000+00:00	1	None	1
7	X6c7osZ5	lymphocyte	CL:0000542	None	None	A Lymphocyte Is A Leukocyte Commonly Found In ...	1	16	None	2025-08-06 17:32:57.773000+00:00	1	None	1
8	3VEAlFdi	leukocyte	CL:0000738	None	white blood cell\|leucocyte	An Achromatic Cell Of The Myeloid Or Lymphoid ...	1	16	None	2025-08-06 17:32:57.773000+00:00	1	None	1
9	2Jgr5Xx4	mononuclear cell	CL:0000842	None	mononuclear leukocyte	A Leukocyte With A Single Non-Segmented Nucleu...	1	16	None	2025-08-06 17:32:57.773000+00:00	1	None	1
10	7GpphKmr	lymphocyte of B lineage	CL:0000945	None	None	A Lymphocyte Of B Lineage With The Commitment ...	1	16	None	2025-08-06 17:32:57.773000+00:00	1	None	1
11	4Ilrnj9U	hematopoietic cell	CL:0000988	None	haematopoietic cell\|hemopoietic cell\|haemopoie...	A Cell Of A Hematopoietic Lineage.	1	16	None	2025-08-06 17:32:57.773000+00:00	1	None	1
12	u3sr1Gdf	nucleate cell	CL:0002242	None	None	A Cell Containing At Least One Nucleus.	1	16	None	2025-08-06 17:32:57.773000+00:00	1	None	1
1	ryEtgi1y	B cell	CL:0000236	None	B lymphocyte\|B-lymphocyte\|B-cell	A Lymphocyte Of B Lineage That Is Capable Of B...	1	16	None	2025-08-06 17:32:57.423000+00:00	1	None	1
2	22LvKd01	T cell	CL:0000084	None	T-cell\|T-lymphocyte\|T lymphocyte	A Type Of Lymphocyte Whose Defining Characteri...	1	16	None	2025-08-06 17:32:57.423000+00:00	1	None	1

The CellType registry is hierarchical as it contains the Cell Ontology.

bt.CellType.get(name="CD8-positive, alpha-beta T cell").view_parents()

Overview of validation properties¶

Importantly, LaminDB offers not only a DataFrameCurator, but also an AnnDataCurator, MuDataCurator, SpatialDataCurator, and TiledbsomaCurator.

The below overview only concerns validating dataframes.

Experience of data engineer¶

property	`pydantic`	`pandera`	`lamindb`
define schema as code	yes, in form of a `pydantic.BaseModel`	yes, in form of a `pandera.DataFrameSchema`	yes, in form of a `lamindb.Schema`
define schema as a set of constraints without the need of listing fields/columns/features; e.g. useful if validating 60k genes	no	no	yes
update labels independent of code	not possible because labels are enums/literals	not possible because labels are hard-coded in `Check`	possible by adding new terms to a registry
built-in validation from public ontologies	no	no	yes
sync labels with ELN/LIMS registries without code change	no	no	yes
can re-use fields/columns/features across schemas	limited via subclass	only in same Python session	yes because persisted in database
schema modifications can invalidate previously validated datasets	yes	yes	no, because LaminDB lets you query datasets that were validated with a given schema version
can use columnar organization of dataframe	no, need to iterate over potentially millions of rows	yes	yes
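The last row of the table contrasts row-wise with columnar validation. A minimal sketch with plain pandas (the dataframe `example_df` and the label `"XYZ"` are made up for illustration, and this is not pandera's or LaminDB's internal code) shows the two access patterns producing the same result:

```python
import pandas as pd

example_df = pd.DataFrame({"perturbation": ["DMSO", "IFNG", "DMSO", "XYZ"]})
allowed = ["DMSO", "IFNG"]

# row-wise: one Python-level check per record, as in the pydantic helper
row_wise_invalid = [
    i
    for i, record in enumerate(example_df.to_dict(orient="records"))
    if record["perturbation"] not in allowed
]

# columnar: a single vectorized membership test over the whole column
is_valid = example_df["perturbation"].isin(allowed)
columnar_invalid = example_df.index[~is_valid].tolist()

print(row_wise_invalid, columnar_invalid)
```

Both approaches flag row 3, but the columnar check avoids materializing a Python dictionary for every row.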

Experience of data consumer¶

property	`pydantic`	`pandera`	`lamindb`
dataset is queryable / findable	no	no	yes, by querying for labels & features
dataset is annotated	no	no	yes
user knows what validation constraints were	no, because might not have access to code and doesn’t know which code was run	no (same as pydantic)	yes, via `artifact.schema`

Annotation & queryability¶

Engineer: annotate the dataset¶

Either use the Curator object:

artifact = curator.save_artifact(key="our_datasets/dataset1.parquet")

If you don’t expect a need for Curator functionality for updating ontologies and standardization, you can also use the Artifact constructor.

artifact = ln.Artifact.from_df(
    df, key="our_datasets/dataset1.parquet", schema=lamindb_schema
).save()

Consumer: see annotations¶

artifact.describe()

Consumer: query the dataset¶

ln.Artifact.filter(perturbation="IFNG").df()

	uid	key	description	suffix	kind	otype	size	hash	n_files	n_observations	_hash_type	_key_is_virtual	_overwrite_versions	space_id	storage_id	schema_id	version	is_latest	run_id	created_at	created_by_id	_aux	branch_id
id
1	oe3o2LLA3YRVap830000	our_datasets/dataset1.parquet	None	.parquet	dataset	DataFrame	9868	9p8ssBs4kmrjbP6cPCeeaQ	None	3	md5	True	False	1	1	1	None	True	None	2025-08-06 17:32:59.712000+00:00	1	{'af': {'0': True}}	1

Consumer: understand validation¶

By accessing artifact.schema, the consumer can understand how the dataset was validated.

artifact.schema

artifact.schema.features.df()


	uid	name	dtype	is_type	unit	description	array_rank	array_size	array_shape	proxy_dtype	synonyms	_expect_many	_curation	space_id	type_id	run_id	created_at	created_by_id	_aux	branch_id
id
1	kyowrwUJ5W14	perturbation	cat[ULabel]	None	None	None	0	0	None	None	None	None	None	1	None	None	2025-08-06 17:32:56.728000+00:00	1	{'af': {'0': None, '1': True, '2': False}}	1
2	C7iiZwuXe0bo	cell_type_by_model	cat[bionty.CellType]	None	None	None	0	0	None	None	None	None	None	1	None	None	2025-08-06 17:32:56.784000+00:00	1	{'af': {'0': None, '1': True, '2': False}}	1
3	3XB796ZHwlmX	cell_type_by_expert	cat[bionty.CellType]	None	None	None	0	0	None	None	None	None	None	1	None	None	2025-08-06 17:32:56.789000+00:00	1	{'af': {'0': None, '1': True, '2': False}}	1
4	EGVHhO5HBvTY	assay_oid	cat[bionty.ExperimentalFactor.ontology_id]	None	None	None	0	0	None	None	None	None	None	1	None	None	2025-08-06 17:32:56.794000+00:00	1	{'af': {'0': None, '1': True, '2': False}}	1
5	C3Ou3ZRxjr0Q	concentration	str	None	None	None	0	0	None	None	None	None	None	1	None	None	2025-08-06 17:32:56.799000+00:00	1	{'af': {'0': None, '1': True, '2': False}}	1
6	IKN89zSOwGnt	treatment_time_h	int	None	None	None	0	0	None	None	None	None	None	1	None	None	2025-08-06 17:32:56.804000+00:00	1	{'af': {'0': None, '1': True, '2': False}}	1
7	oIEcTBWewxCJ	donor	str	None	None	None	0	0	None	None	None	None	None	1	None	None	2025-08-06 17:32:56.809000+00:00	1	{'af': {'0': None, '1': True, '2': False}}	1

Nested data with dynamic keys¶

We will now examine another more complex example where data is nested with potentially arbitrary (dynamic) keys. The example is inspired by the CELLxGENE schema where annotations are stored as dictionaries in the AnnData .uns slot.

uns_dict = ln.core.datasets.dict_cxg_uns()
pprint.pprint(uns_dict)

{'organism_ontology_term_id': 'NCBITaxon:9606',
 'spatial': {'is_single': True,
             'library_1': {'images': {'fullres': 'path/to/fullres.jpg',
                                      'hires': 'path/to/hires.jpg'},
                           'scalefactors': {'spot_diameter_fullres': 89.43,
                                            'tissue_hires_scalef': 0.177}},
             'library_2': {'images': {'fullres': 'path/to/fullres_2.jpg',
                                      'hires': 'path/to/hires_2.jpg'},
                           'scalefactors': {'spot_diameter_fullres': 120.34,
                                            'tissue_hires_scalef': 0.355}}}}

pydantic¶

Pydantic is primed to deal with nested data.

class Images(pydantic.BaseModel):
    fullres: str
    hires: str


class Scalefactors(pydantic.BaseModel):
    spot_diameter_fullres: float
    tissue_hires_scalef: float


class Library(pydantic.BaseModel):
    images: Images
    scalefactors: Scalefactors


class Spatial(pydantic.BaseModel):
    is_single: bool
    model_config = {"extra": "allow"}

    def __init__(self, **data):
        libraries = {}
        other_fields = {}

        # store all libraries under a single key for validation
        for key, value in data.items():
            if key.startswith("library_"):
                libraries[key] = Library(**value)
            else:
                other_fields[key] = value

        other_fields["libraries"] = libraries
        super().__init__(**other_fields)


class SpatialDataSchema(pydantic.BaseModel):
    organism_ontology_term_id: str
    spatial: Spatial


validated_data = SpatialDataSchema(**uns_dict)

However, pydantic either requires all dictionary keys to be known beforehand so that they can be declared as model fields, or it requires a workaround, such as the custom `__init__` above, that collects the dynamic keys under a single field.
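As an alternative to overriding `__init__`, the dynamic keys can be gathered in a pydantic v2 `model_validator` that runs before field validation. This is a sketch under simplifying assumptions: `Library` uses plain `dict` fields instead of the `Images` and `Scalefactors` models, and the validator name `collect_libraries` is hypothetical.

```python
from typing import Any

import pydantic


class Library(pydantic.BaseModel):
    images: dict[str, str]
    scalefactors: dict[str, float]


class Spatial(pydantic.BaseModel):
    is_single: bool
    libraries: dict[str, Library]

    @pydantic.model_validator(mode="before")
    @classmethod
    def collect_libraries(cls, data: Any) -> Any:
        # move every "library_*" key under the declared "libraries" field
        if not isinstance(data, dict):
            return data
        libraries = {k: v for k, v in data.items() if k.startswith("library_")}
        rest = {k: v for k, v in data.items() if not k.startswith("library_")}
        rest["libraries"] = libraries
        return rest


spatial = Spatial.model_validate(
    {
        "is_single": True,
        "library_1": {
            "images": {"fullres": "path/to/fullres.jpg", "hires": "path/to/hires.jpg"},
            "scalefactors": {"spot_diameter_fullres": 89.43, "tissue_hires_scalef": 0.177},
        },
        "library_2": {
            "images": {"fullres": "path/to/fullres_2.jpg", "hires": "path/to/hires_2.jpg"},
            "scalefactors": {"spot_diameter_fullres": 120.34, "tissue_hires_scalef": 0.355},
        },
    }
)

print(sorted(spatial.libraries))
```

Declaring `libraries` as a typed field means each library is validated by pydantic itself and shows up in the model's schema, rather than being stored as an extra attribute.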

pandera¶

Pandera cannot validate dictionaries because it is designed for structured dataframe data. Therefore, we need to flatten the dictionary to transform it into a DataFrame:

def _flatten_dict(d: dict[Any, Any], parent_key: str = "", sep: str = "_"):
    items = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(_flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

def create_dynamic_schema(flattened_data: dict[str, Any]):
    schema_dict = {
        "organism_ontology_term_id": pandera.Column(str),
        "spatial_is_single": pandera.Column(bool),
    }

    for key in flattened_data.keys():
        if key.startswith("spatial_library_") and key.endswith("_images_fullres"):
            lib_prefix = key.replace("_images_fullres", "")
            schema_dict.update(
                {
                    f"{lib_prefix}_images_fullres": pandera.Column(str),
                    f"{lib_prefix}_images_hires": pandera.Column(str),
                    f"{lib_prefix}_scalefactors_spot_diameter_fullres": pandera.Column(
                        float
                    ),
                    f"{lib_prefix}_scalefactors_tissue_hires_scalef": pandera.Column(
                        float
                    ),
                }
            )

    return pandera.DataFrameSchema(schema_dict)


flattened = _flatten_dict(uns_dict)
df = pd.DataFrame([flattened])
spatial_schema = create_dynamic_schema(flattened)
validated_df = spatial_schema.validate(df)

Like pydantic, pandera has no out-of-the-box support for dynamically named keys. Therefore, it is necessary to construct the pandera schema dynamically.
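To make the flattening step concrete, the sketch below applies the same recursive logic to a small nested dictionary. The function `flatten_dict` is a self-contained re-implementation for illustration, and the reduced input is an assumption rather than the full dataset.

```python
from typing import Any


def flatten_dict(d: dict[str, Any], parent_key: str = "", sep: str = "_") -> dict[str, Any]:
    """Join nested keys with `sep` until only leaf values remain."""
    flat: dict[str, Any] = {}
    for key, value in d.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_dict(value, new_key, sep=sep))
        else:
            flat[new_key] = value
    return flat


nested = {
    "spatial": {
        "is_single": True,
        "library_1": {"images": {"hires": "path/to/hires.jpg"}},
    }
}

flat = flatten_dict(nested)
flat_colon = flatten_dict(nested, sep=":")

print(flat)
print(flat_colon)
```

With the default separator, the key `spatial_library_1_images_hires` no longer reveals where one nesting level ends, because `library_1` itself contains an underscore. A separator that does not occur in the keys, such as `:`, keeps the levels recoverable.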

LaminDB¶

Similarly, LaminDB currently requires constructing flattened dataframes to dynamically create features for the schema, which can then be used for validation with the DataFrameCurator. Future improvements are expected, including support for a dictionary-specific curator.

def create_dynamic_schema(flattened_data: dict[str, Any]) -> ln.Schema:
    features = []

    for key, value in flattened_data.items():
        if key == "organism_ontology_term_id":
            features.append(ln.Feature(name=key, dtype=bt.Organism.ontology_id).save())
        elif isinstance(value, bool):
            features.append(ln.Feature(name=key, dtype=bool).save())
        elif isinstance(value, (int, float)):
            features.append(ln.Feature(name=key, dtype=float).save())
        else:
            features.append(ln.Feature(name=key, dtype=str).save())

    return ln.Schema(
        name="Spatial data schema", features=features, coerce_dtype=True
    ).save()


flattened = _flatten_dict(uns_dict)
flattened_df = pd.DataFrame([flattened])
spatial_schema = create_dynamic_schema(flattened)
curator = ln.curators.DataFrameCurator(flattened_df, spatial_schema)
curator.validate()

Note

Curators for scverse data structures allow for the specification of schema slots that access and validate dataframes in nested dictionary attributes like .attrs or .uns. These schema slots use colon-separated paths like 'attrs:sample' or 'uns:spatial:images' to target specific dataframes for validation.
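How a colon-separated path addresses a nested entry can be sketched with plain dictionaries. The helper `resolve_slot` below is hypothetical and is not part of the LaminDB API; it only illustrates the path convention on a made-up container.

```python
from functools import reduce
from typing import Any


def resolve_slot(container: dict[str, Any], path: str) -> Any:
    """Walk a nested dictionary along a colon-separated path."""
    return reduce(lambda node, key: node[key], path.split(":"), container)


# made-up stand-in for an object with a nested `.uns`-like attribute
container = {
    "uns": {"spatial": {"images": {"hires": "path/to/hires.jpg"}}},
}

target = resolve_slot(container, "uns:spatial:images")
print(target)
```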