Pydantic & Pandera vs. LaminDB¶
This doc explains conceptual differences between data validation with pydantic, pandera, and lamindb.
!lamin init --storage test-pydantic-pandera --modules bionty
→ initialized lamindb: testuser1/test-pydantic-pandera
Let us work with a test dataframe.
import pandas as pd
import pydantic
from typing import Literal
import lamindb as ln
import bionty as bt
import pandera
df = ln.core.datasets.small_dataset1()
df
→ connected lamindb: testuser1/test-pydantic-pandera
ENSG00000153563 | ENSG00000010610 | ENSG00000170458 | perturbation | sample_note | cell_type_by_expert | cell_type_by_model | assay_oid | concentration | treatment_time_h | donor | |
---|---|---|---|---|---|---|---|---|---|---|---|
sample1 | 1 | 3 | 5 | DMSO | was ok | B cell | B cell | EFO:0008913 | 0.1% | 24 | D0001 |
sample2 | 2 | 4 | 6 | IFNG | looks naah | CD8-positive, alpha-beta T cell | T cell | EFO:0008913 | 200 nM | 24 | D0002 |
sample3 | 3 | 5 | 7 | DMSO | pretty! 🤩 | CD8-positive, alpha-beta T cell | T cell | EFO:0008913 | 0.1% | 6 | None |
Define a schema¶
pydantic¶
Perturbation = Literal["DMSO", "IFNG"]
CellType = Literal["T cell", "B cell"]
OntologyID = Literal["EFO:0008913"]
class ImmunoSchema(pydantic.BaseModel):
    perturbation: Perturbation
    cell_type_by_model: CellType
    cell_type_by_expert: CellType
    assay_oid: OntologyID
    concentration: str
    treatment_time_h: int
    donor: str | None

    class Config:
        title = "My immuno schema"
pandera¶
# Define the Pandera schema using DataFrameSchema
pandera_schema = pandera.DataFrameSchema(
    {
        "perturbation": pandera.Column(
            str, checks=pandera.Check.isin(["DMSO", "IFNG"])
        ),
        "cell_type_by_model": pandera.Column(
            str, checks=pandera.Check.isin(["T cell", "B cell"])
        ),
        "cell_type_by_expert": pandera.Column(
            str, checks=pandera.Check.isin(["T cell", "B cell"])
        ),
        "assay_oid": pandera.Column(str, checks=pandera.Check.isin(["EFO:0008913"])),
        "concentration": pandera.Column(str),
        "treatment_time_h": pandera.Column(int),
        "donor": pandera.Column(str, nullable=True),
    },
    name="My immuno schema",
)
lamindb¶
Features & labels are defined at the level of the database instance. You can either define a schema that lists required (and optional) columns.
ln.ULabel(name="DMSO").save() # define a DMSO label
ln.ULabel(name="IFNG").save() # define an IFNG label
# leverage ontologies through types ln.ULabel, bt.CellType, bt.ExperimentalFactor
lamindb_schema = ln.Schema(
    name="My immuno schema",
    features=[
        ln.Feature(name="perturbation", dtype=ln.ULabel).save(),
        ln.Feature(name="cell_type_by_model", dtype=bt.CellType).save(),
        ln.Feature(name="cell_type_by_expert", dtype=bt.CellType).save(),
        ln.Feature(name="assay_oid", dtype=bt.ExperimentalFactor.ontology_id).save(),
        ln.Feature(name="concentration", dtype=str).save(),
        ln.Feature(name="treatment_time_h", dtype=int).save(),
        ln.Feature(name="donor", dtype=str, nullable=True).save(),
    ],
).save()
Or merely define a constraint on the feature identifier.
lamindb_schema_only_itype = ln.Schema(
    name="Allow any valid features & labels", itype=ln.Feature
)
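To make the idea of a constraint-only schema more concrete, here is a minimal conceptual sketch. It is not how lamindb implements `itype`; a plain Python set stands in for the feature registry, and the helper `split_columns` is a hypothetical name introduced for illustration. The point is that no columns are listed in the schema itself: column names are simply checked against whatever the registry currently contains.

```python
# Conceptual sketch only: a plain set stands in for a feature registry.
# In a real setup the registry would live in a database, not in code.
registered_features = {
    "perturbation",
    "cell_type_by_model",
    "cell_type_by_expert",
    "assay_oid",
    "concentration",
    "treatment_time_h",
    "donor",
}


def split_columns(columns, registry):
    """Partition column names into registered and unregistered ones."""
    validated = [c for c in columns if c in registry]
    non_validated = [c for c in columns if c not in registry]
    return validated, non_validated


columns = ["ENSG00000153563", "perturbation", "sample_note", "donor"]
validated, non_validated = split_columns(columns, registered_features)
print("validated:", validated)
print("not validated:", non_validated)
```

Because the constraint is "any name found in the registry", the same check scales to tens of thousands of identifiers without enumerating them in a schema definition.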
Validate a dataframe¶
pydantic¶
class DataFrameValidationError(Exception):
    pass


def validate_dataframe(df: pd.DataFrame, model: type[pydantic.BaseModel]):
    errors = []
    for i, row in enumerate(df.to_dict(orient="records")):
        try:
            model(**row)
        except pydantic.ValidationError as e:
            errors.append(f"row {i} failed validation: {e}")
    if errors:
        error_message = "\n".join(errors)
        raise DataFrameValidationError(
            f"DataFrame validation failed with the following errors:\n{error_message}"
        )


try:
    validate_dataframe(df, ImmunoSchema)
except DataFrameValidationError as e:
    print(e)
DataFrame validation failed with the following errors:
row 1 failed validation: 1 validation error for My immuno schema
cell_type_by_expert
Input should be 'T cell' or 'B cell' [type=literal_error, input_value='CD8-positive, alpha-beta T cell', input_type=str]
For further information visit https://errors.pydantic.dev/2.11/v/literal_error
row 2 failed validation: 1 validation error for My immuno schema
cell_type_by_expert
Input should be 'T cell' or 'B cell' [type=literal_error, input_value='CD8-positive, alpha-beta T cell', input_type=str]
For further information visit https://errors.pydantic.dev/2.11/v/literal_error
To fix the validation error, we need to update the Literal and re-run the model definition.
Perturbation = Literal["DMSO", "IFNG"]
CellType = Literal[
    "T cell", "B cell", "CD8-positive, alpha-beta T cell"  # <-- updated
]
OntologyID = Literal["EFO:0008913"]


class ImmunoSchema(pydantic.BaseModel):
    perturbation: Perturbation
    cell_type_by_model: CellType
    cell_type_by_expert: CellType
    assay_oid: OntologyID
    concentration: str
    treatment_time_h: int
    donor: str | None

    class Config:
        title = "My immuno schema"
validate_dataframe(df, ImmunoSchema)
pandera¶
try:
    pandera_schema.validate(df)
except pandera.errors.SchemaError as e:
    print(e)
Column 'cell_type_by_expert' failed element-wise validator number 0: isin(['T cell', 'B cell']) failure cases: CD8-positive, alpha-beta T cell, CD8-positive, alpha-beta T cell
lamindb¶
Because the term "CD8-positive, alpha-beta T cell" is part of the public CellType ontology, validation passes the first time.
If validation had not passed, we could have resolved the issue simply by adding a new term to the CellType registry rather than editing the code. This also puts downstream data scientists in a position to update ontologies.
curator = ln.curators.DataFrameCurator(df, lamindb_schema)
curator.validate()
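The difference between hard-coded literals and a registry can be illustrated with a small, self-contained sketch. It uses an in-memory SQLite table as a hypothetical stand-in for a label registry; the table name `cell_type` and the helper `non_validated` are assumptions made for this example and do not reflect lamindb internals. Registering a term is a data change, so the validation function itself never needs to be edited.

```python
import sqlite3

# Hypothetical stand-in for a label registry: one table of allowed names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cell_type (name TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO cell_type VALUES (?)", [("T cell",), ("B cell",)])


def non_validated(values, conn):
    """Return the distinct values that have no entry in the registry, sorted."""
    known = {row[0] for row in conn.execute("SELECT name FROM cell_type")}
    return sorted(set(values) - known)


column = [
    "B cell",
    "CD8-positive, alpha-beta T cell",
    "CD8-positive, alpha-beta T cell",
]
before = non_validated(column, conn)

# Registering the term changes data only; the validation code is untouched.
conn.execute(
    "INSERT INTO cell_type VALUES (?)", ("CD8-positive, alpha-beta T cell",)
)
after = non_validated(column, conn)

print("before:", before)
print("after:", after)
```

With the pydantic and pandera schemas above, the equivalent fix requires editing and re-running the schema definition.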
What was the cell type validation based on? Let’s inspect the CellType registry.
bt.CellType.df()
id | uid | name | ontology_id | abbr | synonyms | description | space_id | source_id | run_id | created_at | created_by_id | _aux | _branch_code |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14 | 6By01L04 | alpha-beta T cell | CL:0000789 | None | alpha-beta T-cell|alpha-beta T lymphocyte|alph... | A T Cell That Expresses An Alpha-Beta T Cell R... | 1 | 32 | None | 2025-04-15 16:34:59.703000+00:00 | 1 | None | 1 |
15 | 4BEwsp1Q | mature alpha-beta T cell | CL:0000791 | None | mature alpha-beta T-lymphocyte|mature alpha-be... | A Alpha-Beta T Cell That Has A Mature Phenotype. | 1 | 32 | None | 2025-04-15 16:34:59.703000+00:00 | 1 | None | 1 |
16 | 2OTzqBTM | mature T cell | CL:0002419 | None | CD3e-positive T cell|mature T-cell | A T Cell That Expresses A T Cell Receptor Comp... | 1 | 32 | None | 2025-04-15 16:34:59.703000+00:00 | 1 | None | 1 |
13 | 6IC9NGJE | CD8-positive, alpha-beta T cell | CL:0000625 | None | CD8-positive, alpha-beta T-cell|CD8-positive, ... | A T Cell Expressing An Alpha-Beta T Cell Recep... | 1 | 32 | None | 2025-04-15 16:34:59.248000+00:00 | 1 | None | 1 |
3 | 4bKGljt0 | cell | CL:0000000 | None | None | A Material Entity Of Anatomical Origin (Part O... | 1 | 32 | None | 2025-04-15 16:34:58.799000+00:00 | 1 | None | 1 |
4 | 2K93w3xO | motile cell | CL:0000219 | None | None | A Cell That Moves By Its Own Activities. | 1 | 32 | None | 2025-04-15 16:34:58.799000+00:00 | 1 | None | 1 |
5 | 2cXC7cgF | single nucleate cell | CL:0000226 | None | None | A Cell With A Single Nucleus. | 1 | 32 | None | 2025-04-15 16:34:58.799000+00:00 | 1 | None | 1 |
6 | 4WnpvUTH | eukaryotic cell | CL:0000255 | None | None | Any Cell That Only Exists In Eukaryota. | 1 | 32 | None | 2025-04-15 16:34:58.799000+00:00 | 1 | None | 1 |
7 | X6c7osZ5 | lymphocyte | CL:0000542 | None | None | A Lymphocyte Is A Leukocyte Commonly Found In ... | 1 | 32 | None | 2025-04-15 16:34:58.799000+00:00 | 1 | None | 1 |
8 | 3VEAlFdi | leukocyte | CL:0000738 | None | white blood cell|leucocyte | An Achromatic Cell Of The Myeloid Or Lymphoid ... | 1 | 32 | None | 2025-04-15 16:34:58.799000+00:00 | 1 | None | 1 |
9 | 2Jgr5Xx4 | mononuclear cell | CL:0000842 | None | mononuclear leukocyte | A Leukocyte With A Single Non-Segmented Nucleu... | 1 | 32 | None | 2025-04-15 16:34:58.799000+00:00 | 1 | None | 1 |
10 | 7GpphKmr | lymphocyte of B lineage | CL:0000945 | None | None | A Lymphocyte Of B Lineage With The Commitment ... | 1 | 32 | None | 2025-04-15 16:34:58.799000+00:00 | 1 | None | 1 |
11 | 4Ilrnj9U | hematopoietic cell | CL:0000988 | None | haematopoietic cell|hemopoietic cell|haemopoie... | A Cell Of A Hematopoietic Lineage. | 1 | 32 | None | 2025-04-15 16:34:58.799000+00:00 | 1 | None | 1 |
12 | u3sr1Gdf | nucleate cell | CL:0002242 | None | None | A Cell Containing At Least One Nucleus. | 1 | 32 | None | 2025-04-15 16:34:58.799000+00:00 | 1 | None | 1 |
1 | ryEtgi1y | B cell | CL:0000236 | None | B lymphocyte|B-lymphocyte|B-cell | A Lymphocyte Of B Lineage That Is Capable Of B... | 1 | 32 | None | 2025-04-15 16:34:58.273000+00:00 | 1 | None | 1 |
2 | 22LvKd01 | T cell | CL:0000084 | None | T-cell|T-lymphocyte|T lymphocyte | A Type Of Lymphocyte Whose Defining Characteri... | 1 | 32 | None | 2025-04-15 16:34:58.273000+00:00 | 1 | None | 1 |
The CellType registry is hierarchical as it contains the Cell Ontology.
bt.CellType.get(name="CD8-positive, alpha-beta T cell").view_parents()
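A hierarchy makes it possible to reason about labels of different granularity, for example whether the expert annotation "CD8-positive, alpha-beta T cell" is compatible with the model annotation "T cell". The sketch below is a simplified illustration: the child-to-parent edges are a hand-written subset chosen for this example rather than the full Cell Ontology, each term has a single parent (real ontologies allow several), and the function names are hypothetical.

```python
# Simplified, illustrative child -> parent edges (not the full Cell Ontology).
PARENT = {
    "CD8-positive, alpha-beta T cell": "alpha-beta T cell",
    "alpha-beta T cell": "T cell",
    "T cell": "lymphocyte",
    "B cell": "lymphocyte",
    "lymphocyte": "leukocyte",
}


def ancestors(term):
    """Walk child -> parent links and return all ancestors, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def is_consistent(specific, general):
    """True if `specific` equals `general` or is one of its descendants."""
    return specific == general or general in ancestors(specific)


print(ancestors("CD8-positive, alpha-beta T cell"))
print(is_consistent("CD8-positive, alpha-beta T cell", "T cell"))
print(is_consistent("B cell", "T cell"))
```

A flat list of allowed strings, as in the Literal or isin checks above, cannot express this kind of relationship between terms.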
Overview of validation properties¶
Importantly, LaminDB offers not only a DataFrameCurator, but also an AnnDataCurator, MuDataCurator, SpatialDataCurator, and TiledbsomaCurator.
The overview below only concerns validating dataframes.
Experience of data engineer¶
property | pydantic | pandera | lamindb |
---|---|---|---|
define schema as code | yes, in form of a pydantic.BaseModel | yes, in form of a pandera.DataFrameSchema | yes, in form of a lamindb.Schema |
define schema as a set of constraints without the need of listing fields/columns/features; e.g. useful if validating 60k genes | no | no | yes |
update labels independent of code | not possible because labels are enums/literals | not possible because labels are hard-coded in pandera.Check.isin | possible by adding new terms to a registry |
built-in validation from public ontologies | no | no | yes |
sync labels with ELN/LIMS registries without code change | no | no | yes |
can re-use fields/columns/features across schemas | limited via subclass | only in same Python session | yes, because persisted in database |
schema modifications can invalidate previously validated datasets | yes | yes | no, because LaminDB allows querying datasets that were validated with a given schema version |
can use columnar organization of dataframe | no, need to iterate over potentially millions of rows | yes | yes |
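The row on schema modifications can be illustrated with a small sketch of content-addressed schema versions. This is an assumption-laden illustration and not how LaminDB computes its schema hash: the function `schema_hash`, the choice of SHA-256, and the dictionary `validated_with` are all hypothetical. The idea is that each dataset records the identity of the exact schema it was validated against, so a later schema change yields a new identity instead of silently altering what "validated" meant for older datasets.

```python
import hashlib
import json


def schema_hash(features):
    """Hash a schema's content; `features` maps feature name -> dtype string."""
    # Sorting makes the hash independent of the order features were listed in.
    canonical = json.dumps(sorted(features.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()


v1 = {"perturbation": "cat[ULabel]", "treatment_time_h": "int"}
v1_reordered = {"treatment_time_h": "int", "perturbation": "cat[ULabel]"}
v2 = {**v1, "donor": "str"}  # a modified schema gets a different hash

# Each dataset records the hash of the schema it was validated against.
validated_with = {"dataset1.parquet": schema_hash(v1)}

# Find datasets validated with a given schema version.
matches = [k for k, h in validated_with.items() if h == schema_hash(v1)]
print(matches)
print(schema_hash(v1) == schema_hash(v1_reordered))
print(schema_hash(v1) == schema_hash(v2))
```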
Experience of data consumer¶
property | pydantic | pandera | lamindb |
---|---|---|---|
dataset is queryable / findable | no | no | yes, by querying for labels & features |
dataset is annotated | no | no | yes |
user knows what validation constraints were | no, because might not have access to code and doesn’t know which code was run | no (same as pydantic) | yes, via artifact.schema |
Annotation & queryability¶
Engineer: annotate the dataset¶
Either use the Curator object:
artifact = curator.save_artifact(key="our_datasets/dataset1.parquet")
! no run & transform got linked, call `ln.track()` & re-run
! 4 unique terms (36.40%) are not validated for name: 'ENSG00000153563', 'ENSG00000010610', 'ENSG00000170458', 'sample_note'
→ returning existing schema with same hash: Schema(uid='4TwOICz9LEhoKFSpkFe1', name='My immuno schema', n=7, itype='Feature', is_type=False, hash='vDiiZc9g0y226dQ7TeJLTQ', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-04-15 16:34:57 UTC)
If you don’t expect to need Curator functionality for updating ontologies and standardization, you can also use the Artifact constructor.
artifact = ln.Artifact.from_df(
    df, key="our_datasets/dataset1.parquet", schema=lamindb_schema
).save()
! no run & transform got linked, call `ln.track()` & re-run
→ returning existing artifact with same hash: Artifact(uid='mxIXVF1kq395L4dK0000', is_latest=True, key='our_datasets/dataset1.parquet', suffix='.parquet', kind='dataset', otype='DataFrame', size=8997, hash='tcuE7mvTOGmtda83Qxf82Q', n_observations=3, space_id=1, storage_id=1, schema_id=1, created_by_id=1, created_at=2025-04-15 16:35:01 UTC); to track this artifact as an input, use: ln.Artifact.get()
! run input wasn't tracked, call `ln.track()` and re-run
! 4 unique terms (36.40%) are not validated for name: 'ENSG00000153563', 'ENSG00000010610', 'ENSG00000170458', 'sample_note'
→ returning existing schema with same hash: Schema(uid='4TwOICz9LEhoKFSpkFe1', name='My immuno schema', n=7, itype='Feature', is_type=False, hash='vDiiZc9g0y226dQ7TeJLTQ', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-04-15 16:34:57 UTC)
Consumer: see annotations¶
artifact.describe()
Artifact .parquet/DataFrame
├── General
│   ├── .uid = 'mxIXVF1kq395L4dK0000'
│   ├── .key = 'our_datasets/dataset1.parquet'
│   ├── .size = 8997
│   ├── .hash = 'tcuE7mvTOGmtda83Qxf82Q'
│   ├── .n_observations = 3
│   ├── .path =
│   │   /home/runner/work/lamindb/lamindb/docs/faq/test-pydantic-pandera/.lamindb/mxIXVF1kq395L4dK0000.parquet
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2025-04-15 16:35:01
├── Dataset features
│   └── columns • 7 [Feature]
│       assay_oid              cat[bionty.ExperimentalF…  single-cell RNA sequencing
│       cell_type_by_expert    cat[bionty.CellType]       B cell, CD8-positive, alpha-beta T cell
│       cell_type_by_model     cat[bionty.CellType]       B cell, T cell
│       perturbation           cat[ULabel]                DMSO, IFNG
│       concentration          str
│       treatment_time_h       int
│       donor                  str
└── Labels
    └── .cell_types            bionty.CellType            B cell, T cell, CD8-positive, alpha-beta…
        .experimental_factors  bionty.ExperimentalFactor  single-cell RNA sequencing
        .ulabels               ULabel                     DMSO, IFNG
Consumer: query the dataset¶
ln.Artifact.filter(perturbation="IFNG").df()
id | uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | mxIXVF1kq395L4dK0000 | our_datasets/dataset1.parquet | None | .parquet | dataset | DataFrame | 8997 | tcuE7mvTOGmtda83Qxf82Q | None | 3 | md5 | True | False | 1 | 1 | 1 | None | True | None | 2025-04-15 16:35:01.583000+00:00 | 1 | None | 1 |
Consumer: understand validation¶
By accessing artifact.schema, the consumer can understand how the dataset was validated.
artifact.schema
Schema(uid='4TwOICz9LEhoKFSpkFe1', name='My immuno schema', n=7, itype='Feature', is_type=False, hash='vDiiZc9g0y226dQ7TeJLTQ', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-04-15 16:34:57 UTC)
artifact.schema.features.df()
id | uid | name | dtype | is_type | unit | description | array_rank | array_size | array_shape | proxy_dtype | synonyms | _expect_many | _curation | space_id | type_id | run_id | created_at | created_by_id | _aux | _branch_code |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | KZNjCQQYWLa6 | perturbation | cat[ULabel] | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | None | 2025-04-15 16:34:57.261000+00:00 | 1 | {'af': {'0': None, '1': True, '2': False}} | 1 |
2 | 1VU4xStNeu45 | cell_type_by_model | cat[bionty.CellType] | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | None | 2025-04-15 16:34:57.301000+00:00 | 1 | {'af': {'0': None, '1': True, '2': False}} | 1 |
3 | eybBW5tGYmag | cell_type_by_expert | cat[bionty.CellType] | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | None | 2025-04-15 16:34:57.306000+00:00 | 1 | {'af': {'0': None, '1': True, '2': False}} | 1 |
4 | msLZyZoJfkxV | assay_oid | cat[bionty.ExperimentalFactor.ontology_id] | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | None | 2025-04-15 16:34:57.311000+00:00 | 1 | {'af': {'0': None, '1': True, '2': False}} | 1 |
5 | nc88HUhZTazp | concentration | str | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | None | 2025-04-15 16:34:57.316000+00:00 | 1 | {'af': {'0': None, '1': True, '2': False}} | 1 |
6 | AFYBsSFyOkQB | treatment_time_h | int | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | None | 2025-04-15 16:34:57.322000+00:00 | 1 | {'af': {'0': None, '1': True, '2': False}} | 1 |
7 | gJQm1Lkx4uIR | donor | str | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | None | 2025-04-15 16:34:57.327000+00:00 | 1 | {'af': {'0': None, '1': True, '2': False}} | 1 |