Pydantic & Pandera vs. LaminDB¶

This doc explains conceptual differences between data validation with pydantic, pandera, and lamindb.

!lamin init --storage test-pydantic-pandera --modules bionty
→ initialized lamindb: testuser1/test-pydantic-pandera

Let us work with a test dataframe.

import pandas as pd
import pydantic
from typing import Literal
import lamindb as ln
import bionty as bt
import pandera

df = ln.core.datasets.small_dataset1()
df
→ connected lamindb: testuser1/test-pydantic-pandera
| | ENSG00000153563 | ENSG00000010610 | ENSG00000170458 | perturbation | sample_note | cell_type_by_expert | cell_type_by_model | assay_oid | concentration | treatment_time_h | donor |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sample1 | 1 | 3 | 5 | DMSO | was ok | B cell | B cell | EFO:0008913 | 0.1% | 24 | D0001 |
| sample2 | 2 | 4 | 6 | IFNG | looks naah | CD8-positive, alpha-beta T cell | T cell | EFO:0008913 | 200 nM | 24 | D0002 |
| sample3 | 3 | 5 | 7 | DMSO | pretty! 🤩 | CD8-positive, alpha-beta T cell | T cell | EFO:0008913 | 0.1% | 6 | None |

Define a schema¶

pydantic¶

Perturbation = Literal["DMSO", "IFNG"]
CellType = Literal["T cell", "B cell"]
OntologyID = Literal["EFO:0008913"]


class ImmunoSchema(pydantic.BaseModel):
    perturbation: Perturbation
    cell_type_by_model: CellType
    cell_type_by_expert: CellType
    assay_oid: OntologyID
    concentration: str
    treatment_time_h: int
    donor: str | None

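    # pydantic v2 style configuration (replaces the deprecated inner `Config` class)
    model_config = pydantic.ConfigDict(title="My immuno schema")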

pandera¶

# Define the Pandera schema using DataFrameSchema
pandera_schema = pandera.DataFrameSchema(
    {
        "perturbation": pandera.Column(
            str, checks=pandera.Check.isin(["DMSO", "IFNG"])
        ),
        "cell_type_by_model": pandera.Column(
            str, checks=pandera.Check.isin(["T cell", "B cell"])
        ),
        "cell_type_by_expert": pandera.Column(
            str, checks=pandera.Check.isin(["T cell", "B cell"])
        ),
        "assay_oid": pandera.Column(str, checks=pandera.Check.isin(["EFO:0008913"])),
        "concentration": pandera.Column(str),
        "treatment_time_h": pandera.Column(int),
        "donor": pandera.Column(str, nullable=True),
    },
    name="My immuno schema",
)

lamindb¶

Features & labels are defined on the level of the database instance. You can either define a schema with required (and optional) columns.

ln.ULabel(name="DMSO").save()  # define a DMSO label
ln.ULabel(name="IFNG").save()  # define an IFNG label

# leverage ontologies through types ln.ULabel, bt.CellType, bt.ExperimentalFactor
lamindb_schema = ln.Schema(
    name="My immuno schema",
    features=[
        ln.Feature(name="perturbation", dtype=ln.ULabel).save(),
        ln.Feature(name="cell_type_by_model", dtype=bt.CellType).save(),
        ln.Feature(name="cell_type_by_expert", dtype=bt.CellType).save(),
        ln.Feature(name="assay_oid", dtype=bt.ExperimentalFactor.ontology_id).save(),
        ln.Feature(name="concentration", dtype=str).save(),
        ln.Feature(name="treatment_time_h", dtype=int).save(),
        ln.Feature(name="donor", dtype=str, nullable=True).save(),
    ],
).save()

Or merely define a constraint on the feature identifier.

lamindb_schema_only_itype = ln.Schema(
    name="Allow any valid features & labels", itype=ln.Feature
)
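
To make the second style more concrete, here is a minimal, stdlib-only sketch of a schema that does not list expected columns but instead checks every column name against a set of known identifiers. The names `KNOWN_FEATURES` and `partition_columns` are hypothetical and only illustrate the idea; this is not how lamindb implements `itype`.

```python
# Toy illustration: validate column *names* against a registry of known
# identifiers instead of enumerating the expected columns in the schema.
# KNOWN_FEATURES stands in for a feature registry; it is a hypothetical
# placeholder, not lamindb's data model.
KNOWN_FEATURES = {
    "perturbation",
    "cell_type_by_model",
    "cell_type_by_expert",
    "assay_oid",
    "concentration",
    "treatment_time_h",
    "donor",
}


def partition_columns(columns, known):
    """Split column names into (validated, not_validated), preserving order."""
    validated = [c for c in columns if c in known]
    not_validated = [c for c in columns if c not in known]
    return validated, not_validated


columns = [
    "ENSG00000153563",
    "ENSG00000010610",
    "ENSG00000170458",
    "perturbation",
    "sample_note",
    "cell_type_by_expert",
]
validated, not_validated = partition_columns(columns, KNOWN_FEATURES)
print(validated)
print(not_validated)
```

Because the constraint lives in the registry rather than in the schema definition, the same check would work unchanged for a dataframe with tens of thousands of columns.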

Validate a dataframe¶

pydantic¶

class DataFrameValidationError(Exception):
    pass


def validate_dataframe(df: pd.DataFrame, model: type[pydantic.BaseModel]) -> None:
    """Validate every row of `df` against `model`, collecting all row-level errors."""
    errors = []

    for i, row in enumerate(df.to_dict(orient="records")):
        try:
            model(**row)
        except pydantic.ValidationError as e:
            errors.append(f"row {i} failed validation: {e}")

    if errors:
        error_message = "\n".join(errors)
        raise DataFrameValidationError(
            f"DataFrame validation failed with the following errors:\n{error_message}"
        )


try:
    validate_dataframe(df, ImmunoSchema)
except DataFrameValidationError as e:
    print(e)
DataFrame validation failed with the following errors:
row 1 failed validation: 1 validation error for My immuno schema
cell_type_by_expert
  Input should be 'T cell' or 'B cell' [type=literal_error, input_value='CD8-positive, alpha-beta T cell', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/literal_error
row 2 failed validation: 1 validation error for My immuno schema
cell_type_by_expert
  Input should be 'T cell' or 'B cell' [type=literal_error, input_value='CD8-positive, alpha-beta T cell', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/literal_error

To fix the validation error, we need to update the Literal and re-run the model definition.

Perturbation = Literal["DMSO", "IFNG"]
CellType = Literal[
    "T cell", "B cell", "CD8-positive, alpha-beta T cell"  # <-- updated
]
OntologyID = Literal["EFO:0008913"]


class ImmunoSchema(pydantic.BaseModel):
    perturbation: Perturbation
    cell_type_by_model: CellType
    cell_type_by_expert: CellType
    assay_oid: OntologyID
    concentration: str
    treatment_time_h: int
    donor: str | None

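    # pydantic v2 style configuration (replaces the deprecated inner `Config` class)
    model_config = pydantic.ConfigDict(title="My immuno schema")


validate_dataframe(df, ImmunoSchema)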
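
The helper above instantiates one model per row. As a rough sketch of the column-oriented alternative, the snippet below checks each column's distinct values once against the values allowed by a `Literal`. It uses only the standard library; `invalid_values` and the plain dict of columns are illustrative assumptions, not part of pydantic or pandas.

```python
from typing import Literal, get_args

CellType = Literal["T cell", "B cell"]

# A dataframe reduced to a plain dict of columns, for illustration only.
columns = {
    "cell_type_by_model": ["B cell", "T cell", "T cell"],
    "cell_type_by_expert": [
        "B cell",
        "CD8-positive, alpha-beta T cell",
        "CD8-positive, alpha-beta T cell",
    ],
}


def invalid_values(values, literal_type):
    """Return the distinct values not allowed by the Literal, sorted."""
    allowed = set(get_args(literal_type))
    # Deduplicate first so each distinct value is compared only once.
    return sorted(set(values) - allowed)


report = {name: invalid_values(vals, CellType) for name, vals in columns.items()}
print(report)
```

The report names each offending value once per column, which is also the information needed to decide what to add to the `Literal`.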

pandera¶

try:
    pandera_schema.validate(df)
except pandera.errors.SchemaError as e:
    print(e)
Column 'cell_type_by_expert' failed element-wise validator number 0: isin(['T cell', 'B cell']) failure cases: CD8-positive, alpha-beta T cell, CD8-positive, alpha-beta T cell

lamindb¶

Because the term "CD8-positive, alpha-beta T cell" is part of the public CellType ontology, validation passes the first time.

If validation had not passed, we could have resolved the issue simply by adding a new term to the CellType registry rather than editing the code. This also puts downstream data scientists in a position to update ontologies.

curator = ln.curators.DataFrameCurator(df, lamindb_schema)
curator.validate()
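
The idea of resolving a failure by adding a term rather than editing code can be sketched with a toy registry. The example below uses an in-memory SQLite table as a stand-in for a label registry; the table name `cell_type` and the helper `non_validated` are hypothetical and do not reflect lamindb's actual schema or API.

```python
import sqlite3

# In-memory database standing in for a label registry (hypothetical layout).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cell_type (name TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO cell_type (name) VALUES (?)", [("T cell",), ("B cell",)]
)


def non_validated(values, conn):
    """Return the distinct values that are missing from the registry."""
    known = {row[0] for row in conn.execute("SELECT name FROM cell_type")}
    return sorted(set(values) - known)


observed = ["B cell", "CD8-positive, alpha-beta T cell", "T cell"]

before = non_validated(observed, conn)
print(before)

# Fix the failure by adding a term to the registry; the validation code
# itself is unchanged.
conn.execute(
    "INSERT INTO cell_type (name) VALUES (?)",
    ("CD8-positive, alpha-beta T cell",),
)
after = non_validated(observed, conn)
print(after)
```

The validation function is identical before and after the fix; only the registry content changed, which is the property the comparison with hard-coded literals is meant to highlight.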

What was the cell type validation based on? Let’s inspect the CellType registry.

bt.CellType.df()
uid name ontology_id abbr synonyms description space_id source_id run_id created_at created_by_id _aux _branch_code
id
14 6By01L04 alpha-beta T cell CL:0000789 None alpha-beta T-cell|alpha-beta T lymphocyte|alph... A T Cell That Expresses An Alpha-Beta T Cell R... 1 32 None 2025-04-15 16:34:59.703000+00:00 1 None 1
15 4BEwsp1Q mature alpha-beta T cell CL:0000791 None mature alpha-beta T-lymphocyte|mature alpha-be... A Alpha-Beta T Cell That Has A Mature Phenotype. 1 32 None 2025-04-15 16:34:59.703000+00:00 1 None 1
16 2OTzqBTM mature T cell CL:0002419 None CD3e-positive T cell|mature T-cell A T Cell That Expresses A T Cell Receptor Comp... 1 32 None 2025-04-15 16:34:59.703000+00:00 1 None 1
13 6IC9NGJE CD8-positive, alpha-beta T cell CL:0000625 None CD8-positive, alpha-beta T-cell|CD8-positive, ... A T Cell Expressing An Alpha-Beta T Cell Recep... 1 32 None 2025-04-15 16:34:59.248000+00:00 1 None 1
3 4bKGljt0 cell CL:0000000 None None A Material Entity Of Anatomical Origin (Part O... 1 32 None 2025-04-15 16:34:58.799000+00:00 1 None 1
4 2K93w3xO motile cell CL:0000219 None None A Cell That Moves By Its Own Activities. 1 32 None 2025-04-15 16:34:58.799000+00:00 1 None 1
5 2cXC7cgF single nucleate cell CL:0000226 None None A Cell With A Single Nucleus. 1 32 None 2025-04-15 16:34:58.799000+00:00 1 None 1
6 4WnpvUTH eukaryotic cell CL:0000255 None None Any Cell That Only Exists In Eukaryota. 1 32 None 2025-04-15 16:34:58.799000+00:00 1 None 1
7 X6c7osZ5 lymphocyte CL:0000542 None None A Lymphocyte Is A Leukocyte Commonly Found In ... 1 32 None 2025-04-15 16:34:58.799000+00:00 1 None 1
8 3VEAlFdi leukocyte CL:0000738 None white blood cell|leucocyte An Achromatic Cell Of The Myeloid Or Lymphoid ... 1 32 None 2025-04-15 16:34:58.799000+00:00 1 None 1
9 2Jgr5Xx4 mononuclear cell CL:0000842 None mononuclear leukocyte A Leukocyte With A Single Non-Segmented Nucleu... 1 32 None 2025-04-15 16:34:58.799000+00:00 1 None 1
10 7GpphKmr lymphocyte of B lineage CL:0000945 None None A Lymphocyte Of B Lineage With The Commitment ... 1 32 None 2025-04-15 16:34:58.799000+00:00 1 None 1
11 4Ilrnj9U hematopoietic cell CL:0000988 None haematopoietic cell|hemopoietic cell|haemopoie... A Cell Of A Hematopoietic Lineage. 1 32 None 2025-04-15 16:34:58.799000+00:00 1 None 1
12 u3sr1Gdf nucleate cell CL:0002242 None None A Cell Containing At Least One Nucleus. 1 32 None 2025-04-15 16:34:58.799000+00:00 1 None 1
1 ryEtgi1y B cell CL:0000236 None B lymphocyte|B-lymphocyte|B-cell A Lymphocyte Of B Lineage That Is Capable Of B... 1 32 None 2025-04-15 16:34:58.273000+00:00 1 None 1
2 22LvKd01 T cell CL:0000084 None T-cell|T-lymphocyte|T lymphocyte A Type Of Lymphocyte Whose Defining Characteri... 1 32 None 2025-04-15 16:34:58.273000+00:00 1 None 1

The CellType registry is hierarchical as it contains the Cell Ontology.

bt.CellType.get(name="CD8-positive, alpha-beta T cell").view_parents()
[Figure: parent hierarchy of the term "CD8-positive, alpha-beta T cell" as rendered by view_parents()]

Overview of validation properties¶

Importantly, LaminDB offers not only a DataFrameCurator, but also an AnnDataCurator, MuDataCurator, SpatialDataCurator, and TiledbsomaCurator.

The overview below only concerns validating dataframes.

Experience of data engineer¶

| property | pydantic | pandera | lamindb |
| --- | --- | --- | --- |
| define schema as code | yes, in the form of a pydantic.BaseModel | yes, in the form of a pandera.DataFrameSchema | yes, in the form of a lamindb.Schema |
| define schema as a set of constraints without needing to list fields/columns/features; e.g. useful when validating 60k genes | no | no | yes |
| update labels independently of code | not possible because labels are enums/literals | not possible because labels are hard-coded in Check | possible by adding new terms to a registry |
| built-in validation from public ontologies | no | no | yes |
| sync labels with ELN/LIMS registries without code change | no | no | yes |
| can re-use fields/columns/features across schemas | limited, via subclassing | only in the same Python session | yes, because they are persisted in the database |
| schema modifications can invalidate previously validated datasets | yes | yes | no, because LaminDB allows querying datasets that were validated with a given schema version |
| can use columnar organization of dataframe | no, need to iterate over potentially millions of rows | yes | yes |
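
One way to picture the "schema modifications" row: if each dataset records an identifier of the exact schema it was validated against, later edits to the schema produce a new identifier instead of silently changing what "validated" meant. The sketch below derives such an identifier from a content hash using the standard library; `schema_hash` and the dict layout are assumptions for illustration, and lamindb's own hashing and versioning may work differently.

```python
import hashlib
import json


def schema_hash(features):
    """Hash a mapping of feature name -> dtype, independent of key order."""
    canonical = json.dumps(sorted(features.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


schema_v1 = {"perturbation": "cat", "treatment_time_h": "int"}
schema_v1_reordered = {"treatment_time_h": "int", "perturbation": "cat"}
schema_v2 = {**schema_v1, "donor": "str"}

# Each dataset stores the hash of the schema it was validated against.
datasets = {"dataset1.parquet": schema_hash(schema_v1)}

same = schema_hash(schema_v1) == schema_hash(schema_v1_reordered)
changed = schema_hash(schema_v1) != schema_hash(schema_v2)
print(same, changed)
```

With this bookkeeping, adding `donor` in `schema_v2` leaves `dataset1.parquet` pointing at the original definition, so its validation status stays well defined.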

Experience of data consumer¶

| property | pydantic | pandera | lamindb |
| --- | --- | --- | --- |
| dataset is queryable / findable | no | no | yes, by querying for labels & features |
| dataset is annotated | no | no | yes |
| user knows what the validation constraints were | no, because the user might not have access to the code and doesn't know which code was run | no (same as pydantic) | yes, via artifact.schema |

Annotation & queryability¶

Engineer: annotate the dataset¶

Either use the Curator object:

artifact = curator.save_artifact(key="our_datasets/dataset1.parquet")
! no run & transform got linked, call `ln.track()` & re-run
! 4 unique terms (36.40%) are not validated for name: 'ENSG00000153563', 'ENSG00000010610', 'ENSG00000170458', 'sample_note'
→ returning existing schema with same hash: Schema(uid='4TwOICz9LEhoKFSpkFe1', name='My immuno schema', n=7, itype='Feature', is_type=False, hash='vDiiZc9g0y226dQ7TeJLTQ', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-04-15 16:34:57 UTC)

If you don't expect to need Curator functionality for updating ontologies and standardization, you can also use the Artifact constructor.

artifact = ln.Artifact.from_df(
    df, key="our_datasets/dataset1.parquet", schema=lamindb_schema
).save()
! no run & transform got linked, call `ln.track()` & re-run
→ returning existing artifact with same hash: Artifact(uid='mxIXVF1kq395L4dK0000', is_latest=True, key='our_datasets/dataset1.parquet', suffix='.parquet', kind='dataset', otype='DataFrame', size=8997, hash='tcuE7mvTOGmtda83Qxf82Q', n_observations=3, space_id=1, storage_id=1, schema_id=1, created_by_id=1, created_at=2025-04-15 16:35:01 UTC); to track this artifact as an input, use: ln.Artifact.get()
! run input wasn't tracked, call `ln.track()` and re-run
! 4 unique terms (36.40%) are not validated for name: 'ENSG00000153563', 'ENSG00000010610', 'ENSG00000170458', 'sample_note'
→ returning existing schema with same hash: Schema(uid='4TwOICz9LEhoKFSpkFe1', name='My immuno schema', n=7, itype='Feature', is_type=False, hash='vDiiZc9g0y226dQ7TeJLTQ', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-04-15 16:34:57 UTC)

Consumer: see annotations¶

artifact.describe()
Artifact .parquet/DataFrame
├── General
│   ├── .uid = 'mxIXVF1kq395L4dK0000'
│   ├── .key = 'our_datasets/dataset1.parquet'
│   ├── .size = 8997
│   ├── .hash = 'tcuE7mvTOGmtda83Qxf82Q'
│   ├── .n_observations = 3
│   ├── .path = 
│   │   /home/runner/work/lamindb/lamindb/docs/faq/test-pydantic-pandera/.lamindb/mxIXVF1kq395L4dK0000.parquet
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2025-04-15 16:35:01
├── Dataset features
│   └── columns • 7                 [Feature]                                                           
│       assay_oid                   cat[bionty.ExperimentalF…  single-cell RNA sequencing               
│       cell_type_by_expert         cat[bionty.CellType]       B cell, CD8-positive, alpha-beta T cell  
│       cell_type_by_model          cat[bionty.CellType]       B cell, T cell                           
│       perturbation                cat[ULabel]                DMSO, IFNG                               
│       concentration               str                                                                 
│       treatment_time_h            int                                                                 
│       donor                       str                                                                 
└── Labels
    └── .cell_types                 bionty.CellType            B cell, T cell, CD8-positive, alpha-beta…
        .experimental_factors       bionty.ExperimentalFactor  single-cell RNA sequencing               
        .ulabels                    ULabel                     DMSO, IFNG                               

Consumer: query the dataset¶

ln.Artifact.filter(perturbation="IFNG").df()
uid key description suffix kind otype size hash n_files n_observations _hash_type _key_is_virtual _overwrite_versions space_id storage_id schema_id version is_latest run_id created_at created_by_id _aux _branch_code
id
1 mxIXVF1kq395L4dK0000 our_datasets/dataset1.parquet None .parquet dataset DataFrame 8997 tcuE7mvTOGmtda83Qxf82Q None 3 md5 True False 1 1 1 None True None 2025-04-15 16:35:01.583000+00:00 1 None 1
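
Conceptually, a query like the one above is possible because annotations are stored alongside the dataset. As a loose illustration only, the following stdlib sketch builds an inverted index from (feature, label) pairs to dataset keys; the `annotations` dict, including the second dataset, is hypothetical example data and this is not how lamindb stores or queries annotations.

```python
from collections import defaultdict

# Hypothetical annotations: dataset key -> {feature: set of labels}.
annotations = {
    "our_datasets/dataset1.parquet": {
        "perturbation": {"DMSO", "IFNG"},
        "cell_type_by_model": {"B cell", "T cell"},
    },
    "our_datasets/dataset2.parquet": {
        "perturbation": {"DMSO"},
    },
}

# Build an inverted index: (feature, label) -> dataset keys.
index = defaultdict(set)
for key, features in annotations.items():
    for feature, labels in features.items():
        for label in labels:
            index[(feature, label)].add(key)

hits = sorted(index[("perturbation", "IFNG")])
print(hits)
```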

Consumer: understand validation¶

By accessing artifact.schema, the consumer can understand how the dataset was validated.

artifact.schema
Schema(uid='4TwOICz9LEhoKFSpkFe1', name='My immuno schema', n=7, itype='Feature', is_type=False, hash='vDiiZc9g0y226dQ7TeJLTQ', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-04-15 16:34:57 UTC)
artifact.schema.features.df()
uid name dtype is_type unit description array_rank array_size array_shape proxy_dtype synonyms _expect_many _curation space_id type_id run_id created_at created_by_id _aux _branch_code
id
1 KZNjCQQYWLa6 perturbation cat[ULabel] None None None 0 0 None None None True None 1 None None 2025-04-15 16:34:57.261000+00:00 1 {'af': {'0': None, '1': True, '2': False}} 1
2 1VU4xStNeu45 cell_type_by_model cat[bionty.CellType] None None None 0 0 None None None True None 1 None None 2025-04-15 16:34:57.301000+00:00 1 {'af': {'0': None, '1': True, '2': False}} 1
3 eybBW5tGYmag cell_type_by_expert cat[bionty.CellType] None None None 0 0 None None None True None 1 None None 2025-04-15 16:34:57.306000+00:00 1 {'af': {'0': None, '1': True, '2': False}} 1
4 msLZyZoJfkxV assay_oid cat[bionty.ExperimentalFactor.ontology_id] None None None 0 0 None None None True None 1 None None 2025-04-15 16:34:57.311000+00:00 1 {'af': {'0': None, '1': True, '2': False}} 1
5 nc88HUhZTazp concentration str None None None 0 0 None None None True None 1 None None 2025-04-15 16:34:57.316000+00:00 1 {'af': {'0': None, '1': True, '2': False}} 1
6 AFYBsSFyOkQB treatment_time_h int None None None 0 0 None None None True None 1 None None 2025-04-15 16:34:57.322000+00:00 1 {'af': {'0': None, '1': True, '2': False}} 1
7 gJQm1Lkx4uIR donor str None None None 0 0 None None None True None 1 None None 2025-04-15 16:34:57.327000+00:00 1 {'af': {'0': None, '1': True, '2': False}} 1