Introduction

LaminDB is an open-source data framework for biology. It makes your data queryable, traceable, reproducible, and FAIR. One API gives you a lakehouse, lineage, a feature store, ontologies, a LIMS, and an ELN.

LaminHub is a data collaboration hub built on LaminDB, similar to how GitHub is built on Git.

💡 Tip: Copy this summary.md into an LLM chat and let AI explain.

Who uses it?

Scientists & engineers in pharma, biotech, and academia, including:

Pfizer – A global BigPharma company with headquarters in the US
Ensocell Therapeutics – A BioTech with offices in Cambridge, UK, and California
DZNE – The National Research Center for Neurodegenerative Diseases in Germany
Helmholtz Munich – The National Research Center for Environmental Health in Germany
scverse – An international non-profit for open-source omics data tools
The Global Immunological Swarm Learning Network – Research hospitals at U Bonn, Harvard, MIT, Stanford, ETH Zürich, Charité, Mount Sinai, and others

Setup

Install the lamindb Python package:

pip install lamindb

Create a LaminDB instance:

lamin init --modules bionty --storage ./quickstart-data  # or s3://my-bucket, gs://my-bucket

Or if you have write access to an instance, connect to it:

lamin connect account/name

Quickstart

Lake ♾️ LIMS ♾️ Sheets

You can create records for the entities underlying your experiments, such as samples, perturbations, and instruments. For example:

import lamindb as ln

sample = ln.Record(name="Sample", is_type=True).save()  # type sample
ln.Record(name="P53mutant1", type=sample).save()        # sample 1
ln.Record(name="P53mutant2", type=sample).save()        # sample 2

sample = ln$Record(name="Sample", is_type=TRUE)$save()  # type sample
ln$Record(name="P53mutant1", type=sample)$save()        # sample 1
ln$Record(name="P53mutant2", type=sample)$save()        # sample 2

Define the corresponding features and annotate:

ln.Feature(name="design_sample", dtype=sample).save()
artifact.features.add_values({"design_sample": "P53mutant1"})

ln$Feature(name="design_sample", dtype=sample)$save()
artifact$features$add_values(list(design_sample="P53mutant1"))

You can query & search the Record registry in the same way as Artifact or Run.

ln.Record.search("p53").to_dataframe()

ln$Record$search("p53")$to_dataframe()

You can also create relationships between entities and, if you connect your LaminDB instance to LaminHub, edit them like Excel sheets in a GUI.
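To make the idea of typed records concrete, here is a minimal in-memory sketch in plain Python. It does not use LaminDB: the `Record` and `Registry` classes below are hypothetical stand-ins, and the lookup logic is an assumption made for illustration only. It shows a type record, two records of that type, a case-insensitive search, and a feature value being checked against the records of a type.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical in-memory stand-in for a registry record."""
    name: str
    type: Optional["Record"] = None  # the type record this record belongs to
    is_type: bool = False


@dataclass
class Registry:
    """Hypothetical registry that stores records and checks feature values."""
    records: list = field(default_factory=list)

    def save(self, record: Record) -> Record:
        self.records.append(record)
        return record

    def search(self, term: str) -> list:
        # Case-insensitive substring match on the record name.
        return [r for r in self.records if term.lower() in r.name.lower()]

    def validate(self, value: str, dtype: Record) -> Record:
        # A value is valid only if a record of the given type has that name.
        for r in self.records:
            if r.type == dtype and r.name == value:
                return r
        raise ValueError(f"{value!r} is not a known {dtype.name}")


registry = Registry()
sample = registry.save(Record(name="Sample", is_type=True))
registry.save(Record(name="P53mutant1", type=sample))
registry.save(Record(name="P53mutant2", type=sample))

matches = registry.search("p53")
validated = registry.validate("P53mutant1", dtype=sample)
```

In this sketch, annotating with a name that has no record of type `Sample` raises a `ValueError`, which is the kind of guard a typed feature provides.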

Lake: versioning

If you change source code or datasets, LaminDB manages their versioning for you. Assume you run a new version of our create-fasta.py script to create a new version of sample.fasta.

import lamindb as ln

ln.track()
with open("sample.fasta", "w") as f:
    f.write(">seq1\nTGCA\n")  # a new sequence
ln.Artifact("sample.fasta", key="sample.fasta", features={"design_sample": "P53mutant1"}).save()  # annotate with the new sample
ln.finish()

library(laminr)
ln <- import_module("lamindb")

ln$track()
writeLines(c(">seq1", "TGCA"), "sample.fasta")  # a new sequence
ln$Artifact("sample.fasta", key="sample.fasta", features=list(design_sample="P53mutant1"))$save()  # annotate with the new sample
ln$finish()

If you now query by key, you’ll get the latest version of this artifact.

artifact = ln.Artifact.get(key="sample.fasta")  # get artifact by key
artifact.versions.to_dataframe()                # see all versions of that artifact

artifact = ln$Artifact$get(key="sample.fasta")  # get artifact by key
artifact$versions$to_dataframe()                # see all versions of that artifact
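The key-based versioning described above can be pictured with a small standard-library sketch. This is not how LaminDB is implemented: `VersionedStore` and `ArtifactVersion` are hypothetical names, and skipping unchanged content via a SHA-256 hash is an assumption of this sketch. It only illustrates the behavior that getting by key returns the latest version while the full history stays available.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactVersion:
    key: str
    version: int
    content_hash: str


class VersionedStore:
    """Hypothetical store that keeps one version chain per key, newest last."""

    def __init__(self) -> None:
        self._versions: dict[str, list[ArtifactVersion]] = {}

    def save(self, key: str, content: bytes) -> ArtifactVersion:
        digest = hashlib.sha256(content).hexdigest()
        chain = self._versions.setdefault(key, [])
        # Assumption of this sketch: unchanged content creates no new version.
        if chain and chain[-1].content_hash == digest:
            return chain[-1]
        new = ArtifactVersion(key, len(chain) + 1, digest)
        chain.append(new)
        return new

    def get(self, key: str) -> ArtifactVersion:
        return self._versions[key][-1]  # the latest version

    def versions(self, key: str) -> list[ArtifactVersion]:
        return list(self._versions[key])


store = VersionedStore()
store.save("sample.fasta", b">seq1\nACGT\n")  # first version
store.save("sample.fasta", b">seq1\nTGCA\n")  # changed content, second version
store.save("sample.fasta", b">seq1\nTGCA\n")  # identical content, no new version

latest = store.get("sample.fasta")
history = store.versions("sample.fasta")
```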

Lakehouse ♾️ feature store

Here is how you ingest a DataFrame:

from datetime import date

import pandas as pd

df = pd.DataFrame({
    "sequence_str": ["ACGT", "TGCA"],
    "gc_content": [0.55, 0.54],
    "experiment_note": ["Looks great", "Ok"],
    "experiment_date": [date(2025, 10, 24), date(2025, 10, 25)],
})
ln.Artifact.from_dataframe(df, key="my_datasets/sequences.parquet").save()  # no validation

# a native R data.frame is converted to a pandas DataFrame by reticulate
df <- data.frame(
    sequence_str = c("ACGT", "TGCA"),
    gc_content = c(0.55, 0.54),
    experiment_note = c("Looks great", "Ok"),
    experiment_date = as.Date(c("2025-10-24", "2025-10-25"))
)
ln$Artifact$from_dataframe(df, key="my_datasets/sequences.parquet")$save()  # no validation

To validate & annotate the content of the dataframe, use a built-in schema:

ln.Feature(name="sequence_str", dtype=str).save()  # define a remaining feature
artifact = ln.Artifact.from_dataframe(
    df,
    key="my_datasets/sequences.parquet",
    schema="valid_features"  # validate columns against features
).save()
artifact.describe()

ln$Feature(name="sequence_str", dtype="str")$save()  # define a remaining feature
artifact = ln$Artifact$from_dataframe(
    df,
    key="my_datasets/sequences.parquet",
    schema="valid_features"  # validate columns against features
)$save()
artifact$describe()

Now you know which schema the dataset satisfies. You can filter for datasets by schema and then launch distributed queries and batch loading.
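As a rough picture of what validating columns against features means, here is a standard-library sketch. It is not LaminDB's validation logic: the `FEATURES` mapping and the `validate_columns` helper are hypothetical, and the table is a plain dict of lists rather than a DataFrame. It checks that every column is a registered feature and that every value has the registered type.

```python
from datetime import date

# Hypothetical feature registry: feature name -> expected Python type.
FEATURES = {
    "sequence_str": str,
    "gc_content": float,
    "experiment_note": str,
    "experiment_date": date,
}


def validate_columns(table: dict, features: dict) -> list:
    """Return a list of problems; an empty list means the table is valid."""
    problems = []
    for column, values in table.items():
        expected = features.get(column)
        if expected is None:
            problems.append(f"column {column!r} is not a registered feature")
            continue
        for i, value in enumerate(values):
            if not isinstance(value, expected):
                problems.append(
                    f"{column}[{i}] = {value!r} is not of type {expected.__name__}"
                )
    return problems


table = {
    "sequence_str": ["ACGT", "TGCA"],
    "gc_content": [0.55, 0.54],
    "experiment_note": ["Looks great", "Ok"],
    "experiment_date": [date(2025, 10, 24), date(2025, 10, 25)],
}
problems_ok = validate_columns(table, FEATURES)
problems_bad = validate_columns({"gc_content": ["high"], "mystery": [1]}, FEATURES)
```

Returning a list of problems instead of raising on the first one lets a caller report every issue in a dataset at once.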

Lakehouse beyond tables

To validate an AnnData object with a built-in schema, call:

import anndata as ad
import numpy as np

adata = ad.AnnData(
    X=pd.DataFrame([[1]*10]*21).values,
    obs=pd.DataFrame({'cell_type_by_model': ['T cell', 'B cell', 'NK cell'] * 7}),
    var=pd.DataFrame(index=[f'ENSG{i:011d}' for i in range(10)])
)

artifact = ln.Artifact.from_anndata(
    adata,
    key="my_datasets/scrna.h5ad",
    schema="ensembl_gene_ids_and_valid_features_in_obs"
)
artifact.describe()

library(reticulate)
ad <- import("anndata")
pd <- import("pandas", convert = FALSE)  # keep pandas objects as Python objects

adata <- ad$AnnData(
    X=matrix(1, nrow=21, ncol=10),
    obs=pd$DataFrame(list(cell_type_by_model=rep(c("T cell", "B cell", "NK cell"), 7))),
    var=pd$DataFrame(index=sprintf("ENSG%011d", 0:9))
)

artifact = ln$Artifact$from_anndata(
    adata,
    key="my_datasets/scrna.h5ad",
    schema="ensembl_gene_ids_and_valid_features_in_obs"
)
artifact$describe()

To validate a SpatialData object or any other array-like dataset, you need to construct a Schema. You can do this by composing the schema of a complex object from simple pandera/pydantic-like schemas: docs.lamin.ai/curate.
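The composition idea can be sketched in plain Python as follows. None of these classes exist in LaminDB: `ColumnsSchema`, `IndexPatternSchema`, and `CompositeSchema` are hypothetical, and the dataset is a nested dict standing in for an object with `obs` and `var` slots. The point is only that a schema for a complex object can be assembled from one simple schema per slot.

```python
import re
from dataclasses import dataclass


@dataclass
class ColumnsSchema:
    """Simple schema: the slot must contain these columns."""
    required: tuple

    def validate(self, slot: dict) -> list:
        return [
            f"missing column {c!r}" for c in self.required if c not in slot["columns"]
        ]


@dataclass
class IndexPatternSchema:
    """Simple schema: every index entry must match a regular expression."""
    pattern: str

    def validate(self, slot: dict) -> list:
        regex = re.compile(self.pattern)
        return [
            f"invalid index entry {v!r}"
            for v in slot["index"]
            if not regex.fullmatch(v)
        ]


@dataclass
class CompositeSchema:
    """Schema of a complex object, composed from one simple schema per slot."""
    slots: dict

    def validate(self, obj: dict) -> dict:
        # Validate each slot with its own schema and collect problems per slot.
        return {name: schema.validate(obj[name]) for name, schema in self.slots.items()}


schema = CompositeSchema(
    slots={
        "obs": ColumnsSchema(required=("cell_type_by_model",)),
        "var": IndexPatternSchema(pattern=r"ENSG\d{11}"),
    }
)

dataset = {
    "obs": {"columns": {"cell_type_by_model": ["T cell", "B cell"]}, "index": ["0", "1"]},
    "var": {"columns": {}, "index": ["ENSG00000000001", "not-a-gene-id"]},
}
report = schema.validate(dataset)
```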

Ontologies

The bionty plugin gives you more than 20 ontologies as SQLRecord registries. It was used to validate the ENSG IDs in the AnnData object in the previous section.

import bionty as bt

bt.CellType.import_source()  # import the default ontology
bt.CellType.to_dataframe()   # your extendable cell type ontology in a simple registry

bt <- import_module("bionty")

bt$CellType$import_source()  # import the default ontology
bt$CellType$to_dataframe()   # your extendable cell type ontology in a simple registry
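To illustrate what an extendable ontology registry offers, here is a small standard-library sketch. It is not bionty: the `Ontology` class and its `add`, `standardize`, and `validate` methods are hypothetical, and the handful of terms is entered by hand rather than imported from a source. It shows validating labels against canonical names, mapping a synonym to its canonical name, and extending the registry with a custom term.

```python
from typing import Iterable, Optional


class Ontology:
    """Hypothetical in-memory ontology: canonical names plus synonyms."""

    def __init__(self) -> None:
        self._canonical: dict[str, str] = {}  # lowercased label -> canonical name

    def add(self, name: str, synonyms: Iterable[str] = ()) -> None:
        # Register the canonical name and every synonym under the same term.
        for label in (name, *synonyms):
            self._canonical[label.lower()] = name

    def standardize(self, label: str) -> Optional[str]:
        # Map a label or synonym to its canonical name, or None if unknown.
        return self._canonical.get(label.lower())

    def validate(self, labels: Iterable[str]) -> list[bool]:
        # A label is valid only if it is itself a canonical name.
        names = set(self._canonical.values())
        return [label in names for label in labels]


cell_types = Ontology()
cell_types.add("T cell")
cell_types.add("B cell")
cell_types.add("natural killer cell", synonyms=["NK cell"])
cell_types.add("my new cell type")  # extend the registry with a custom term

validity = cell_types.validate(["T cell", "NK cell", "unknown cell"])
standardized = cell_types.standardize("NK cell")
```

Here a synonym such as "NK cell" fails strict validation but can be standardized to its canonical name before annotation.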

CLI

Most of the functionality that’s available in Python is also available on the command line (and in R through LaminR). For instance, to upload a file or folder, run:

lamin save myfile.txt --key examples/myfile.txt

Workflow managers

LaminDB is not a workflow manager, but it integrates well with existing workflow managers and can substitute for them in some settings.

In github.com/laminlabs/schmidt22, we manage several workflows, scripts, and notebooks to reconstruct the project of Schmidt et al. (2022). A phenotypic CRISPRa screening result is integrated with scRNA-seq data. Here is one of the input artifacts:

And here is the lineage of the final result:

You can explore it here.

If you’d like to integrate with Nextflow, Snakemake, or redun, see here: docs.lamin.ai/pipelines