Introduction
LaminDB is an open-source data lakehouse to enable learning at scale in biology. It organizes datasets through validation & annotation and provides data lineage, queryability, and reproducibility on top of FAIR data.
Why?
Reproducing analytical results or understanding how a dataset or model was created can be a pain, let alone training models on historical data, LIMS & ELN systems, orthogonal assays, or datasets generated by other teams. Even maintaining a mere overview of a project’s or team’s datasets & analyses is harder than it sounds.
Biological datasets are typically managed with versioned storage systems, GUI-focused community or SaaS platforms, structureless data lakes, rigid data warehouses (SQL, monolithic arrays), and data lakehouses for tabular data.
LaminDB extends the lakehouse architecture to biological registries & datasets beyond tables (DataFrame, AnnData, .zarr, .tiledbsoma, …) with enough structure to enable queries and enough freedom to keep the pace of R&D high.
Moreover, it provides context through data lineage – tracing data and code, scientists and models – and abstractions for biological domain knowledge and experimental metadata.
Highlights
data lineage: track inputs & outputs of notebooks, scripts, functions & pipelines with a single line of code
unified infrastructure: access diverse storage locations (local, S3, GCP, …), SQL databases (Postgres, SQLite) & ontologies
lakehouse capabilities: manage, monitor & validate features, labels & dataset schemas; perform distributed queries and batch loading
biological data formats: validate & annotate formats like DataFrame, AnnData, MuData, … backed by parquet, zarr, HDF5, LanceDB, DuckDB, …
biological entities: organize experimental metadata & extensible ontologies in registries based on the Django ORM
reproducible & auditable: auto-version & timestamp execution reports, source code & compute environments, attribute records to users
zero lock-in & scalable: runs in your infrastructure; is not a client for a rate-limited REST API
extendable: create custom plug-ins for your own applications based on the Django ecosystem
integrations: visualization tools like vitessce, workflow managers like nextflow & redun, and other tools
production-ready: used in BigPharma, BioTech, hospitals & top labs
LaminHub is a data collaboration hub built on LaminDB similar to how GitHub is built on git.
Explore
Explore a web interface for working with LaminDB instances. It offers free access to large omics data collections at lamin.ai/explore – no account required.
While you stay in full control over storage & database permissions directly on AWS or GCP, LaminHub allows you to manage access similar to how you’d do it on GitHub, Google Drive, Microsoft Sharepoint, or Notion. See Manage access.
LaminHub is a SaaS product. For private data & commercial usage, see: lamin.ai/pricing.
You can copy this summary.md into an LLM chat and let an AI explain it.
Setup
Install the lamindb Python package:
pip install lamindb
Create a LaminDB instance:
lamin init --storage ./quickstart-data # or s3://my-bucket, gs://my-bucket
Or if you have write access to an instance, login and connect to it:
lamin login
lamin connect account/instance
Install the laminr package:
install.packages("laminr", dependencies = TRUE)
Create a LaminDB instance:
laminr::lamin_init(storage = "./mydata", modules = c("bionty"))
Or if you have write access to an instance, login and connect to it:
laminr::lamin_login()
laminr::lamin_connect("<account>/<instance>")
Quickstart
Track a script or notebook run with source code, inputs, outputs, logs, and environment.
import lamindb as ln
ln.track() # track a run
open("sample.fasta", "w").write(">seq1\nACGT\n")
ln.Artifact("sample.fasta", key="sample.fasta").save() # create an artifact
ln.finish() # finish the run
library(laminr)
ln <- import_module("lamindb")
ln$track() # track a run
writeLines(">seq1\nACGT\n", "sample.fasta")
ln$Artifact("sample.fasta", key="sample.fasta")$save() # create an artifact
ln$finish() # finish the run
This code snippet creates an artifact, which can store a dataset or model as a file or folder in various formats.
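For example, an artifact can also wrap an entire folder. A minimal sketch in Python, assuming a hypothetical local directory ./fastq_dir:

import lamindb as ln

# "./fastq_dir" is a hypothetical local directory containing many files
folder_artifact = ln.Artifact("./fastq_dir", key="fastq/run1").save()  # stores the whole folder as one artifact
folder_artifact.cache()  # download & cache the folder locally when you need it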
Running the quickstart snippet above as a script (python create-fasta.py) produces the following data lineage.
artifact = ln.Artifact.get(key="sample.fasta") # query artifact by key
artifact.view_lineage()
artifact = ln$Artifact$get(key="sample.fasta") # query artifact by key
artifact$view_lineage()

In addition to capturing basic metadata, you’ll know how that artifact was created and what it’s used for (interactive visualization):
artifact.describe()
artifact$describe()

You can organize datasets with validation & annotation of any kind of metadata and then access them via queries & search. Here is a more comprehensive example:

To annotate an artifact with a label, use:
my_experiment = ln.ULabel(name="My experiment").save() # create a label in the universal label ontology
artifact.ulabels.add(my_experiment) # annotate the artifact with the label
my_experiment = ln$ULabel(name="My experiment")$save() # create a label in the universal label ontology
artifact$ulabels$add(my_experiment) # annotate the artifact with the label
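Beyond labels, you can also annotate with typed features & values. A minimal sketch in Python, assuming a recent lamindb version; the feature name temperature is made up for illustration:

ln.Feature(name="temperature", dtype=float).save()  # register a typed feature once
artifact.features.add_values({"temperature": 21.6})  # annotate the artifact with a value for that feature
artifact.describe()  # now shows labels & feature values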
To query for a set of artifacts, use the filter() statement.
ln.Artifact.filter(ulabels=my_experiment, suffix=".fasta").to_dataframe() # query by suffix and the ulabel we just created
ln.Artifact.filter(transform__key="create-fasta.py").to_dataframe() # query by the name of the script we just ran
ln$Artifact$filter(ulabels=my_experiment, suffix=".fasta")$to_dataframe() # query by suffix and the ulabel we just created
ln$Artifact$filter(transform__key="create-fasta.py")$to_dataframe() # query by the name of the script we just ran
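Because registries are built on the Django ORM, the usual Django-style lookups and queryset chaining also work. A small sketch in Python, reusing the artifacts created above:

ln.Artifact.filter(suffix=".fasta").order_by("-created_at").to_dataframe()  # newest .fasta artifacts first
ln.Artifact.filter(ulabels__name__icontains="experiment").to_dataframe()  # look up via the related label's name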
If you have a structured dataset like a DataFrame, an AnnData, or another array, you can validate the content of the dataset (and parse annotations). Here is an example for a dataframe.
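A minimal sketch in Python, assuming a recent lamindb version with the Feature & Schema registries; the column names perturbation and score are made up for illustration, and the exact API may differ slightly across versions:

import pandas as pd
import lamindb as ln

# define a schema that describes & validates the dataframe's columns
schema = ln.Schema(
    name="Minimal perturbation schema",
    features=[
        ln.Feature(name="perturbation", dtype=str).save(),
        ln.Feature(name="score", dtype=float).save(),
    ],
).save()

df = pd.DataFrame({"perturbation": ["DMSO", "IFNG"], "score": [0.1, 0.9]})

# validate the dataframe against the schema & save it as an annotated artifact
artifact = ln.Artifact.from_df(df, key="examples/perturbation.parquet", schema=schema).save()
artifact.describe()  # shows the validated columns as annotations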
With a large body of validated datasets, you can then access data through distributed queries & batch streaming; see docs.lamin.ai/arrays.
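For illustration, a rough sketch in Python of what batch access across many artifacts can look like, assuming AnnData-backed artifacts annotated with a hypothetical scRNA-seq label; the guide at docs.lamin.ai/arrays is authoritative:

artifacts = ln.Artifact.filter(suffix=".h5ad", ulabels__name="scRNA-seq").all()  # hypothetical query
adata = artifacts.first().open()  # stream a single array store instead of downloading it fully

collection = ln.Collection(list(artifacts), key="scrna-collection").save()  # group artifacts into a collection
dataset = collection.mapped(obs_keys=["cell_type"])  # map-style dataset, e.g. for a PyTorch DataLoader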