Pydantic & Pandera vs. LaminDB¶
This doc explains conceptual differences between data validation with pydantic, pandera, and LaminDB.
!lamin init --storage test-pydantic-pandera --modules bionty
→ initialized lamindb: testuser1/test-pydantic-pandera
Let us work with a test dataframe.
import pandas as pd
import pydantic
import lamindb as ln
import bionty as bt
import pandera.pandas as pandera
import pprint
from typing import Literal, Any
df = ln.core.datasets.small_dataset1()
df
→ connected lamindb: testuser1/test-pydantic-pandera
ENSG00000153563 | ENSG00000010610 | ENSG00000170458 | perturbation | sample_note | cell_type_by_expert | cell_type_by_model | assay_oid | concentration | treatment_time_h | donor | donor_ethnicity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
sample1 | 1 | 3 | 5 | DMSO | was ok | B cell | B cell | EFO:0008913 | 0.1% | 24 | D0001 | [Chinese, Singaporean Chinese] |
sample2 | 2 | 4 | 6 | IFNG | looks naah | CD8-positive, alpha-beta T cell | T cell | EFO:0008913 | 200 nM | 24 | D0002 | [Chinese, Han Chinese] |
sample3 | 3 | 5 | 7 | DMSO | pretty! 🤩 | CD8-positive, alpha-beta T cell | T cell | EFO:0008913 | 0.1% | 6 | None | [Chinese] |
Define a schema¶
pydantic¶
Perturbation = Literal["DMSO", "IFNG"]
CellType = Literal["T cell", "B cell"]
OntologyID = Literal["EFO:0008913"]
class ImmunoSchema(pydantic.BaseModel):
    perturbation: Perturbation
    cell_type_by_model: CellType
    cell_type_by_expert: CellType
    assay_oid: OntologyID
    concentration: str
    treatment_time_h: int
    donor: str | None

    class Config:
        title = "My immuno schema"
pandera¶
pandera_schema = pandera.DataFrameSchema(
    {
        "perturbation": pandera.Column(
            str, checks=pandera.Check.isin(["DMSO", "IFNG"])
        ),
        "cell_type_by_model": pandera.Column(
            str, checks=pandera.Check.isin(["T cell", "B cell"])
        ),
        "cell_type_by_expert": pandera.Column(
            str, checks=pandera.Check.isin(["T cell", "B cell"])
        ),
        "assay_oid": pandera.Column(str, checks=pandera.Check.isin(["EFO:0008913"])),
        "concentration": pandera.Column(str),
        "treatment_time_h": pandera.Column(int),
        "donor": pandera.Column(str, nullable=True),
    },
    name="My immuno schema",
)
LaminDB¶
Features & labels are defined at the level of the database instance. You can either define a schema with required (and optional) columns.
ln.ULabel(name="DMSO").save()
ln.ULabel(name="IFNG").save()
# leverage ontologies through types ln.ULabel, bt.CellType, bt.ExperimentalFactor
lamindb_schema = ln.Schema(
    name="My immuno schema",
    features=[
        ln.Feature(name="perturbation", dtype=ln.ULabel).save(),
        ln.Feature(name="cell_type_by_model", dtype=bt.CellType).save(),
        ln.Feature(name="cell_type_by_expert", dtype=bt.CellType).save(),
        ln.Feature(name="assay_oid", dtype=bt.ExperimentalFactor.ontology_id).save(),
        ln.Feature(name="concentration", dtype=str).save(),
        ln.Feature(name="treatment_time_h", dtype=int).save(),
        ln.Feature(name="donor", dtype=str, nullable=True).save(),
    ],
).save()
Or merely define a constraint on the feature identifier type (itype), without listing individual features.
lamindb_schema_only_itype = ln.Schema(
    name="Allow any valid features & labels", itype=ln.Feature
)
Validate a dataframe¶
pydantic¶
class DataFrameValidationError(Exception):
    pass


def validate_dataframe(df: pd.DataFrame, model: type[pydantic.BaseModel]):
    errors = []
    # validate row by row: each record is passed to the model as keyword arguments
    for i, row in enumerate(df.to_dict(orient="records")):
        try:
            model(**row)
        except pydantic.ValidationError as e:
            errors.append(f"row {i} failed validation: {e}")
    if errors:
        error_message = "\n".join(errors)
        raise DataFrameValidationError(
            f"DataFrame validation failed with the following errors:\n{error_message}"
        )


try:
    validate_dataframe(df, ImmunoSchema)
except DataFrameValidationError as e:
    print(e)
DataFrame validation failed with the following errors:
row 1 failed validation: 1 validation error for My immuno schema
cell_type_by_expert
Input should be 'T cell' or 'B cell' [type=literal_error, input_value='CD8-positive, alpha-beta T cell', input_type=str]
For further information visit https://errors.pydantic.dev/2.11/v/literal_error
row 2 failed validation: 1 validation error for My immuno schema
cell_type_by_expert
Input should be 'T cell' or 'B cell' [type=literal_error, input_value='CD8-positive, alpha-beta T cell', input_type=str]
For further information visit https://errors.pydantic.dev/2.11/v/literal_error
To fix the validation error, we need to update the Literal and re-run the model definition.
Perturbation = Literal["DMSO", "IFNG"]
CellType = Literal[
    "T cell", "B cell", "CD8-positive, alpha-beta T cell"  # <-- updated
]
OntologyID = Literal["EFO:0008913"]


class ImmunoSchema(pydantic.BaseModel):
    perturbation: Perturbation
    cell_type_by_model: CellType
    cell_type_by_expert: CellType
    assay_oid: OntologyID
    concentration: str
    treatment_time_h: int
    donor: str | None

    class Config:
        title = "My immuno schema"


validate_dataframe(df, ImmunoSchema)
pandera¶
try:
    pandera_schema.validate(df)
except pandera.errors.SchemaError as e:
    print(e)
Column 'cell_type_by_expert' failed element-wise validator number 0: isin(['T cell', 'B cell']) failure cases: CD8-positive, alpha-beta T cell, CD8-positive, alpha-beta T cell
LaminDB¶
Because the term "CD8-positive, alpha-beta T cell" is part of the public CellType ontology, validation passes the first time.
If validation had not passed, we could have resolved the issue simply by adding a new term to the CellType registry rather than editing the code.
This also puts downstream data scientists in a position to update ontologies.
curator = ln.curators.DataFrameCurator(df, lamindb_schema)
curator.validate()
! 5 terms not validated in feature 'columns': 'ENSG00000153563', 'ENSG00000010610', 'ENSG00000170458', 'sample_note', 'donor_ethnicity'
→ fix typos, remove non-existent values, or save terms via: curator.cat.add_new_from('columns')
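The difference between hard-coded labels and registry-backed labels can be illustrated with a minimal, standard-library-only sketch. It is not how LaminDB implements validation: `cell_type_registry` is a hypothetical stand-in (a plain Python set) for a registry such as CellType, and `non_validated` is a helper name introduced only for this illustration.

```python
from typing import Literal, get_args

# Hard-coded labels: the allowed values are frozen into the type definition.
CellTypeLiteral = Literal["T cell", "B cell"]

# Hypothetical stand-in for a registry: mutable data that lives outside the schema code.
cell_type_registry = {"T cell", "B cell"}


def non_validated(values, allowed):
    """Return the sorted unique values that are not part of `allowed`."""
    return sorted(set(values) - set(allowed))


observed = [
    "B cell",
    "CD8-positive, alpha-beta T cell",
    "CD8-positive, alpha-beta T cell",
]

# With a Literal, fixing a failure means editing the type and re-running the definition.
literal_failures = non_validated(observed, get_args(CellTypeLiteral))

# With a registry, fixing a failure is a data change: add the term, validate again.
registry_failures_before = non_validated(observed, cell_type_registry)
cell_type_registry.add("CD8-positive, alpha-beta T cell")
registry_failures_after = non_validated(observed, cell_type_registry)

print(literal_failures)
print(registry_failures_before)
print(registry_failures_after)
```

The validation function itself never changes in the registry case; only the data it consults does, which is why people who do not own the schema code can still extend the set of valid labels.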
What was the cell type validation based on? Let’s inspect the CellType registry.
bt.CellType.df()
uid | name | ontology_id | abbr | synonyms | description | space_id | source_id | run_id | created_at | created_by_id | _aux | branch_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||
14 | 6By01L04 | alpha-beta T cell | CL:0000789 | None | alpha-beta T-cell|alpha-beta T lymphocyte|alph... | A T Cell That Expresses An Alpha-Beta T Cell R... | 1 | 16 | None | 2025-08-06 17:32:58.318000+00:00 | 1 | None | 1 |
15 | 4BEwsp1Q | mature alpha-beta T cell | CL:0000791 | None | mature alpha-beta T-lymphocyte|mature alpha-be... | A Alpha-Beta T Cell That Has A Mature Phenotype. | 1 | 16 | None | 2025-08-06 17:32:58.318000+00:00 | 1 | None | 1 |
16 | 2OTzqBTM | mature T cell | CL:0002419 | None | CD3e-positive T cell|mature T-cell | A T Cell That Expresses A T Cell Receptor Comp... | 1 | 16 | None | 2025-08-06 17:32:58.318000+00:00 | 1 | None | 1 |
13 | 6IC9NGJE | CD8-positive, alpha-beta T cell | CL:0000625 | None | CD8-positive, alpha-beta T-cell|CD8-positive, ... | A T Cell Expressing An Alpha-Beta T Cell Recep... | 1 | 16 | None | 2025-08-06 17:32:58.052000+00:00 | 1 | None | 1 |
3 | 4bKGljt0 | cell | CL:0000000 | None | None | A Material Entity Of Anatomical Origin (Part O... | 1 | 16 | None | 2025-08-06 17:32:57.773000+00:00 | 1 | None | 1 |
4 | 2K93w3xO | motile cell | CL:0000219 | None | None | A Cell That Moves By Its Own Activities. | 1 | 16 | None | 2025-08-06 17:32:57.773000+00:00 | 1 | None | 1 |
5 | 2cXC7cgF | single nucleate cell | CL:0000226 | None | None | A Cell With A Single Nucleus. | 1 | 16 | None | 2025-08-06 17:32:57.773000+00:00 | 1 | None | 1 |
6 | 4WnpvUTH | eukaryotic cell | CL:0000255 | None | None | Any Cell That Only Exists In Eukaryota. | 1 | 16 | None | 2025-08-06 17:32:57.773000+00:00 | 1 | None | 1 |
7 | X6c7osZ5 | lymphocyte | CL:0000542 | None | None | A Lymphocyte Is A Leukocyte Commonly Found In ... | 1 | 16 | None | 2025-08-06 17:32:57.773000+00:00 | 1 | None | 1 |
8 | 3VEAlFdi | leukocyte | CL:0000738 | None | white blood cell|leucocyte | An Achromatic Cell Of The Myeloid Or Lymphoid ... | 1 | 16 | None | 2025-08-06 17:32:57.773000+00:00 | 1 | None | 1 |
9 | 2Jgr5Xx4 | mononuclear cell | CL:0000842 | None | mononuclear leukocyte | A Leukocyte With A Single Non-Segmented Nucleu... | 1 | 16 | None | 2025-08-06 17:32:57.773000+00:00 | 1 | None | 1 |
10 | 7GpphKmr | lymphocyte of B lineage | CL:0000945 | None | None | A Lymphocyte Of B Lineage With The Commitment ... | 1 | 16 | None | 2025-08-06 17:32:57.773000+00:00 | 1 | None | 1 |
11 | 4Ilrnj9U | hematopoietic cell | CL:0000988 | None | haematopoietic cell|hemopoietic cell|haemopoie... | A Cell Of A Hematopoietic Lineage. | 1 | 16 | None | 2025-08-06 17:32:57.773000+00:00 | 1 | None | 1 |
12 | u3sr1Gdf | nucleate cell | CL:0002242 | None | None | A Cell Containing At Least One Nucleus. | 1 | 16 | None | 2025-08-06 17:32:57.773000+00:00 | 1 | None | 1 |
1 | ryEtgi1y | B cell | CL:0000236 | None | B lymphocyte|B-lymphocyte|B-cell | A Lymphocyte Of B Lineage That Is Capable Of B... | 1 | 16 | None | 2025-08-06 17:32:57.423000+00:00 | 1 | None | 1 |
2 | 22LvKd01 | T cell | CL:0000084 | None | T-cell|T-lymphocyte|T lymphocyte | A Type Of Lymphocyte Whose Defining Characteri... | 1 | 16 | None | 2025-08-06 17:32:57.423000+00:00 | 1 | None | 1 |
The CellType registry is hierarchical as it contains the Cell Ontology.
bt.CellType.get(name="CD8-positive, alpha-beta T cell").view_parents()
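What a hierarchical registry makes possible can be sketched with plain Python. The mapping below is a hypothetical, heavily simplified excerpt of child-to-parent relations, and `ancestors` and `is_consistent` are illustrative helpers, not part of the bionty API.

```python
from collections import deque

# Toy child -> parents mapping (hypothetical and simplified, for illustration only).
toy_parents = {
    "CD8-positive, alpha-beta T cell": ["alpha-beta T cell"],
    "alpha-beta T cell": ["T cell"],
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte"],
    "lymphocyte": ["cell"],
    "cell": [],
}


def ancestors(term, parents):
    """Return all ancestors of `term` in breadth-first order, without duplicates."""
    found = []
    queue = deque(parents.get(term, []))
    while queue:
        current = queue.popleft()
        if current in found:
            continue
        found.append(current)
        queue.extend(parents.get(current, []))
    return found


def is_consistent(specific, general, parents):
    """True if `general` is the same term as `specific` or one of its ancestors."""
    return specific == general or general in ancestors(specific, parents)


print(ancestors("CD8-positive, alpha-beta T cell", toy_parents))
print(is_consistent("CD8-positive, alpha-beta T cell", "T cell", toy_parents))
```

In the test dataframe, the expert annotation "CD8-positive, alpha-beta T cell" and the model annotation "T cell" differ as strings; a hierarchy lets you recognize that the first is a more specific form of the second.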
Overview of validation properties¶
Importantly, LaminDB offers not only a DataFrameCurator, but also an AnnDataCurator, MuDataCurator, SpatialDataCurator, and TiledbsomaCurator.
The overview below only concerns validating dataframes.
Experience of data engineer¶
property | pydantic | pandera | LaminDB |
---|---|---|---|
define schema as code | yes, in form of a pydantic.BaseModel | yes, in form of a pandera.DataFrameSchema | yes, in form of a lamindb.Schema |
define schema as a set of constraints without the need of listing fields/columns/features; e.g. useful if validating 60k genes | no | no | yes |
update labels independent of code | not possible because labels are enums/literals | not possible because labels are hard-coded in the schema's checks | possible by adding new terms to a registry |
built-in validation from public ontologies | no | no | yes |
sync labels with ELN/LIMS registries without code change | no | no | yes |
can re-use fields/columns/features across schemas | limited via subclass | only in same Python session | yes because persisted in database |
schema modifications can invalidate previously validated datasets | yes | yes | no, because LaminDB allows querying datasets that were validated with a given schema version |
can use columnar organization of dataframe | no, need to iterate over potentially millions of rows | yes | yes |
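The last row of the table can be made concrete with a small standard-library sketch. It only illustrates the idea behind row-wise versus column-wise validation of a categorical column; it does not reproduce how pydantic, pandera, or LaminDB work internally, and both function names are hypothetical.

```python
def failures_rowwise(values, allowed):
    """Test each row's value separately: one membership test per row."""
    n_tests = 0
    failures = set()
    for value in values:
        n_tests += 1
        if value not in allowed:
            failures.add(value)
    return failures, n_tests


def failures_columnwise(values, allowed):
    """Deduplicate the column once, then test each unique value only once."""
    unique_values = set(values)  # still a single pass over the column
    return unique_values - allowed, len(unique_values)


allowed = {"DMSO", "IFNG"}
# A long column with few distinct labels, plus one hypothetical typo at the end.
column = ["DMSO", "IFNG", "DMSO"] * 100_000 + ["ifng"]

row_failures, row_tests = failures_rowwise(column, allowed)
col_failures, col_tests = failures_columnwise(column, allowed)

print(row_failures, row_tests)
print(col_failures, col_tests)
```

Both approaches find the same invalid label, but the column-wise variant runs the validation logic once per unique value rather than once per row, which matters when each check is expensive (for example, a registry lookup).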
Experience of data consumer¶
property | pydantic | pandera | LaminDB |
---|---|---|---|
dataset is queryable / findable | no | no | yes, by querying for labels & features |
dataset is annotated | no | no | yes |
user knows what validation constraints were | no, because might not have access to code and doesn’t know which code was run | no (same as pydantic) | yes, via artifact.schema |
Annotation & queryability¶
Engineer: annotate the dataset¶
Either use the Curator object:
artifact = curator.save_artifact(key="our_datasets/dataset1.parquet")
! no run & transform got linked, call `ln.track()` & re-run
→ returning existing schema with same hash: Schema(uid='mXXNUBXa0dcIpjAh', name='My immuno schema', n=7, is_type=False, itype='Feature', hash='7jMK9hCpchb193u8fxYjhQ', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=1, created_at=2025-08-06 17:32:56 UTC)
If you don’t expect to need Curator functionality for updating ontologies and standardization, you can also use the Artifact constructor.
artifact = ln.Artifact.from_df(
    df, key="our_datasets/dataset1.parquet", schema=lamindb_schema
).save()
! no run & transform got linked, call `ln.track()` & re-run
→ returning existing artifact with same hash: Artifact(uid='oe3o2LLA3YRVap830000', is_latest=True, key='our_datasets/dataset1.parquet', suffix='.parquet', kind='dataset', otype='DataFrame', size=9868, hash='9p8ssBs4kmrjbP6cPCeeaQ', n_observations=3, branch_id=1, space_id=1, storage_id=1, schema_id=1, created_by_id=1, created_at=2025-08-06 17:32:59 UTC); to track this artifact as an input, use: ln.Artifact.get()
! 5 terms not validated in feature 'columns': 'ENSG00000153563', 'ENSG00000010610', 'ENSG00000170458', 'sample_note', 'donor_ethnicity'
→ fix typos, remove non-existent values, or save terms via: curator.cat.add_new_from('columns')
→ returning existing schema with same hash: Schema(uid='mXXNUBXa0dcIpjAh', name='My immuno schema', n=7, is_type=False, itype='Feature', hash='7jMK9hCpchb193u8fxYjhQ', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=1, created_at=2025-08-06 17:32:56 UTC)
Consumer: see annotations¶
artifact.describe()
Artifact .parquet · DataFrame · dataset
├── General
│   ├── key: our_datasets/dataset1.parquet
│   ├── uid: oe3o2LLA3YRVap830000          hash: 9p8ssBs4kmrjbP6cPCeeaQ
│   ├── size: 9.6 KB                       transform: None
│   ├── space: all                         branch: all
│   ├── created_by: testuser1              created_at: 2025-08-06 17:32:59
│   ├── n_observations: 3
│   └── storage path:
│       /home/runner/work/lamindb/lamindb/docs/faq/test-pydantic-pandera/our_datasets/dataset1.parquet
├── Dataset features
│   └── columns • 7 [Feature]
│       assay_oid            cat[bionty.ExperimentalFactor.on…  single-cell RNA sequencing
│       cell_type_by_expert  cat[bionty.CellType]               B cell, CD8-positive, alpha-beta T cell
│       cell_type_by_model   cat[bionty.CellType]               B cell, T cell
│       perturbation         cat[ULabel]                        DMSO, IFNG
│       concentration        str
│       treatment_time_h     int
│       donor                str
└── Labels
    └── .cell_types            bionty.CellType            B cell, T cell, CD8-positive, alpha-bet…
        .experimental_factors  bionty.ExperimentalFactor  single-cell RNA sequencing
        .ulabels               ULabel                     DMSO, IFNG
Consumer: query the dataset¶
ln.Artifact.filter(perturbation="IFNG").df()
uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | branch_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||||
1 | oe3o2LLA3YRVap830000 | our_datasets/dataset1.parquet | None | .parquet | dataset | DataFrame | 9868 | 9p8ssBs4kmrjbP6cPCeeaQ | None | 3 | md5 | True | False | 1 | 1 | 1 | None | True | None | 2025-08-06 17:32:59.712000+00:00 | 1 | {'af': {'0': True}} | 1 |
Consumer: understand validation¶
By accessing artifact.schema, the consumer can understand how the dataset was validated.
artifact.schema
Schema(uid='mXXNUBXa0dcIpjAh', name='My immuno schema', n=7, is_type=False, itype='Feature', hash='7jMK9hCpchb193u8fxYjhQ', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=1, created_at=2025-08-06 17:32:56 UTC)
artifact.schema.features.df()
uid | name | dtype | is_type | unit | description | array_rank | array_size | array_shape | proxy_dtype | synonyms | _expect_many | _curation | space_id | type_id | run_id | created_at | created_by_id | _aux | branch_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||||||||
1 | kyowrwUJ5W14 | perturbation | cat[ULabel] | None | None | None | 0 | 0 | None | None | None | None | None | 1 | None | None | 2025-08-06 17:32:56.728000+00:00 | 1 | {'af': {'0': None, '1': True, '2': False}} | 1 |
2 | C7iiZwuXe0bo | cell_type_by_model | cat[bionty.CellType] | None | None | None | 0 | 0 | None | None | None | None | None | 1 | None | None | 2025-08-06 17:32:56.784000+00:00 | 1 | {'af': {'0': None, '1': True, '2': False}} | 1 |
3 | 3XB796ZHwlmX | cell_type_by_expert | cat[bionty.CellType] | None | None | None | 0 | 0 | None | None | None | None | None | 1 | None | None | 2025-08-06 17:32:56.789000+00:00 | 1 | {'af': {'0': None, '1': True, '2': False}} | 1 |
4 | EGVHhO5HBvTY | assay_oid | cat[bionty.ExperimentalFactor.ontology_id] | None | None | None | 0 | 0 | None | None | None | None | None | 1 | None | None | 2025-08-06 17:32:56.794000+00:00 | 1 | {'af': {'0': None, '1': True, '2': False}} | 1 |
5 | C3Ou3ZRxjr0Q | concentration | str | None | None | None | 0 | 0 | None | None | None | None | None | 1 | None | None | 2025-08-06 17:32:56.799000+00:00 | 1 | {'af': {'0': None, '1': True, '2': False}} | 1 |
6 | IKN89zSOwGnt | treatment_time_h | int | None | None | None | 0 | 0 | None | None | None | None | None | 1 | None | None | 2025-08-06 17:32:56.804000+00:00 | 1 | {'af': {'0': None, '1': True, '2': False}} | 1 |
7 | oIEcTBWewxCJ | donor | str | None | None | None | 0 | 0 | None | None | None | None | None | 1 | None | None | 2025-08-06 17:32:56.809000+00:00 | 1 | {'af': {'0': None, '1': True, '2': False}} | 1 |
Nested data with dynamic keys¶
We will now examine a more complex example in which data is nested with potentially arbitrary (dynamic) keys.
The example is inspired by the CELLxGENE schema, where annotations are stored as dictionaries in the AnnData .uns slot.
uns_dict = ln.core.datasets.dict_cxg_uns()
pprint.pprint(uns_dict)
{'organism_ontology_term_id': 'NCBITaxon:9606',
 'spatial': {'is_single': True,
             'library_1': {'images': {'fullres': 'path/to/fullres.jpg',
                                      'hires': 'path/to/hires.jpg'},
                           'scalefactors': {'spot_diameter_fullres': 89.43,
                                            'tissue_hires_scalef': 0.177}},
             'library_2': {'images': {'fullres': 'path/to/fullres_2.jpg',
                                      'hires': 'path/to/hires_2.jpg'},
                           'scalefactors': {'spot_diameter_fullres': 120.34,
                                            'tissue_hires_scalef': 0.355}}}}
pydantic¶
Pydantic is primed to deal with nested data.
class Images(pydantic.BaseModel):
    fullres: str
    hires: str


class Scalefactors(pydantic.BaseModel):
    spot_diameter_fullres: float
    tissue_hires_scalef: float


class Library(pydantic.BaseModel):
    images: Images
    scalefactors: Scalefactors


class Spatial(pydantic.BaseModel):
    is_single: bool
    model_config = {"extra": "allow"}

    def __init__(self, **data):
        libraries = {}
        other_fields = {}
        # store all libraries under a single key for validation
        for key, value in data.items():
            if key.startswith("library_"):
                libraries[key] = Library(**value)
            else:
                other_fields[key] = value
        other_fields["libraries"] = libraries
        super().__init__(**other_fields)


class SpatialDataSchema(pydantic.BaseModel):
    organism_ontology_term_id: str
    spatial: Spatial


validated_data = SpatialDataSchema(**uns_dict)
However, pydantic requires either that all dictionary keys are known beforehand to construct the model classes, or workarounds that collect all dynamic keys under a single model.
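The workaround amounts to separating the statically known keys from the dynamically named ones and then validating every dynamic entry against the same shape. A dependency-free sketch of that step, with hypothetical helper names and a deliberately incomplete `library_3` entry added for illustration, could look like this:

```python
# Keys that every library entry is expected to have (taken from the example above).
REQUIRED_LIBRARY_KEYS = {"images", "scalefactors"}


def split_dynamic_keys(data, prefix):
    """Separate keys starting with `prefix` from the statically known keys."""
    static, dynamic = {}, {}
    for key, value in data.items():
        if key.startswith(prefix):
            dynamic[key] = value
        else:
            static[key] = value
    return static, dynamic


def missing_library_keys(libraries):
    """Map each library name to the sorted list of required keys it lacks."""
    problems = {}
    for name, library in libraries.items():
        missing = sorted(REQUIRED_LIBRARY_KEYS - set(library))
        if missing:
            problems[name] = missing
    return problems


spatial = {
    "is_single": True,
    "library_1": {"images": {}, "scalefactors": {}},
    "library_2": {"images": {}, "scalefactors": {}},
    "library_3": {"images": {}},  # hypothetical incomplete entry
}

static, libraries = split_dynamic_keys(spatial, "library_")
print(static)
print(sorted(libraries))
print(missing_library_keys(libraries))
```

The number of libraries never has to be known in advance; only the prefix and the shape of a single library are fixed.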
pandera¶
Pandera cannot validate dictionaries because it is designed for structured dataframe data. Therefore, we need to flatten the dictionary to transform it into a DataFrame:
def _flatten_dict(d: dict[Any, Any], parent_key: str = "", sep: str = "_"):
    items = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            # recurse into nested dictionaries, prefixing keys with the parent key
            items.extend(_flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)


def create_dynamic_schema(flattened_data: dict[str, Any]):
    schema_dict = {
        "organism_ontology_term_id": pandera.Column(str),
        "spatial_is_single": pandera.Column(bool),
    }
    # add one group of columns per library found in the flattened keys
    for key in flattened_data.keys():
        if key.startswith("spatial_library_") and key.endswith("_images_fullres"):
            lib_prefix = key.replace("_images_fullres", "")
            schema_dict.update(
                {
                    f"{lib_prefix}_images_fullres": pandera.Column(str),
                    f"{lib_prefix}_images_hires": pandera.Column(str),
                    f"{lib_prefix}_scalefactors_spot_diameter_fullres": pandera.Column(
                        float
                    ),
                    f"{lib_prefix}_scalefactors_tissue_hires_scalef": pandera.Column(
                        float
                    ),
                }
            )
    return pandera.DataFrameSchema(schema_dict)


flattened = _flatten_dict(uns_dict)
df = pd.DataFrame([flattened])
spatial_schema = create_dynamic_schema(flattened)
validated_df = spatial_schema.validate(df)
Analogously to pydantic, pandera does not have out-of-the-box support for dynamically named keys. Therefore, it is necessary to construct the pandera schema dynamically.
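To see what the flattening step produces, here is a self-contained sketch that uses the same recursion as `_flatten_dict` on a small excerpt of the example dictionary. The function is renamed `flatten_dict` here only to keep the sketch independent of the code above.

```python
from typing import Any


def flatten_dict(d: dict[str, Any], parent_key: str = "", sep: str = "_") -> dict[str, Any]:
    """Flatten nested dictionaries by joining the key path with `sep`."""
    flat: dict[str, Any] = {}
    for key, value in d.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_dict(value, new_key, sep=sep))
        else:
            flat[new_key] = value
    return flat


nested = {
    "organism_ontology_term_id": "NCBITaxon:9606",
    "spatial": {
        "is_single": True,
        "library_1": {"scalefactors": {"tissue_hires_scalef": 0.177}},
    },
}

flat_underscore = flatten_dict(nested)
flat_dot = flatten_dict(nested, sep=".")

print(list(flat_underscore))
print(list(flat_dot))
```

With the default separator, the key path cannot be recovered from a flattened name because the original keys themselves contain underscores. A separator that does not occur in any key, such as "." in this example, keeps the path recoverable via `str.split`.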
LaminDB¶
Similarly, LaminDB currently requires constructing flattened dataframes to dynamically create features for the schema, which can then be used for validation with the DataFrameCurator. Future improvements are expected, including support for a dictionary-specific curator.
def create_dynamic_schema(flattened_data: dict[str, Any]) -> ln.Schema:
    features = []
    for key, value in flattened_data.items():
        if key == "organism_ontology_term_id":
            features.append(ln.Feature(name=key, dtype=bt.Organism.ontology_id).save())
        # check bool before int because bool is a subclass of int
        elif isinstance(value, bool):
            features.append(ln.Feature(name=key, dtype=bool).save())
        elif isinstance(value, (int, float)):
            features.append(ln.Feature(name=key, dtype=float).save())
        else:
            features.append(ln.Feature(name=key, dtype=str).save())
    return ln.Schema(
        name="Spatial data schema", features=features, coerce_dtype=True
    ).save()


flattened = _flatten_dict(uns_dict)
flattened_df = pd.DataFrame([flattened])
spatial_schema = create_dynamic_schema(flattened)
curator = ln.curators.DataFrameCurator(flattened_df, spatial_schema)
curator.validate()
! you are trying to create a record with name='spatial_library_1_images_hires' but a record with similar name exists: 'spatial_library_1_images_fullres'. Did you mean to load it?
! you are trying to create a record with name='spatial_library_2_images_hires' but a record with similar name exists: 'spatial_library_2_images_fullres'. Did you mean to load it?
Note
Curators for scverse data structures allow for the specification of schema slots that access and validate dataframes in nested dictionary attributes like .attrs or .uns.
These schema slots use colon-separated paths like 'attrs:sample' or 'uns:spatial:images' to target specific dataframes for validation.
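How a colon-separated path can address a nested entry is sketched below with plain dictionaries. This is only an illustration of the addressing idea, not LaminDB's implementation: `resolve_slot` is a hypothetical helper, and the outer dictionary stands in for an object whose first path component (such as uns) would normally be an attribute.

```python
def resolve_slot(container: dict, slot: str):
    """Follow a colon-separated path such as 'uns:spatial:images' through nested dicts."""
    current = container
    for part in slot.split(":"):
        if not isinstance(current, dict) or part not in current:
            raise KeyError(f"slot {slot!r} has no component {part!r}")
        current = current[part]
    return current


# Toy stand-in for a data object with nested dictionary attributes.
toy_object = {
    "uns": {
        "spatial": {
            "images": {"fullres": "path/to/fullres.jpg", "hires": "path/to/hires.jpg"},
        }
    }
}

images = resolve_slot(toy_object, "uns:spatial:images")
print(images)
```

Each resolved entry can then be validated with its own schema, so a single curator can cover several nested locations of one object.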