Introduction
LaminDB is an open-source data framework for biology. It makes your data queryable, traceable, reproducible, and FAIR. With one API, you get: lakehouse, lineage, feature store, ontologies, LIMS, and ELN.
Why?
Reproducing analytical results or understanding how a dataset or model was created can be a pain. Training models on historical data, LIMS & ELN systems, orthogonal assays, or datasets from other teams is even harder. Even maintaining an overview of a project’s datasets & analyses is more difficult than it should be.
Biological datasets are typically managed with versioned storage systems, GUI-focused platforms, structureless data lakes, rigid data warehouses (SQL, monolithic arrays), or tabular lakehouses.
LaminDB extends the lakehouse architecture to biological registries & datasets beyond tables (DataFrame, AnnData, .zarr, .tiledbsoma, …) with enough structure to enable queries and enough freedom to keep the pace of R&D high.
Moreover, it provides context through data lineage – tracing data and code, scientists and models – and abstractions for biological domain knowledge and experimental metadata.

Highlights
lineage → track inputs & outputs of notebooks, scripts, functions & pipelines with a single line of code
lakehouse → manage, monitor & validate schemas; query across many datasets
feature store → manage features & labels; leverage batch loading
FAIR datasets → validate & annotate DataFrame, AnnData, SpatialData, parquet, .h5ad, zarr, …
LIMS & ELN → manage experimental metadata, ontologies & markdown notes
unified access → storage locations (local, S3, GCP, …), SQL databases (Postgres, SQLite) & ontologies
reproducible & auditable → auto-version & timestamp execution reports, source code & environments; attribute records to users
zero lock-in & scalable → runs in your infrastructure; not a client for a rate-limited REST API
extendable → create custom plug-ins based on the Django ORM
production-ready → used in BigPharma, BioTech, hospitals & top labs
LaminHub is a data collaboration hub built on LaminDB, similar to how GitHub is built on git.
Explore
Explore a web interface for working with LaminDB instances. It offers free access to large omics data collections at lamin.ai/explore – no account required.
While you stay in full control over storage & database permissions directly on AWS or GCP, LaminHub allows you to manage access similar to how you’d do it on GitHub, Google Drive, Microsoft Sharepoint, or Notion. See Manage access.
LaminHub is a SaaS product. For private data & commercial usage, see: lamin.ai/pricing.
You can copy this summary.md into an LLM chat and let an AI explain it.
Setup
Install the lamindb Python package:
pip install lamindb
Create a LaminDB instance:
lamin init --storage ./quickstart-data # or s3://my-bucket, gs://my-bucket
Or if you have write access to an instance, login and connect to it:
lamin login
lamin connect account/instance
For R, install the laminr package:
install.packages("laminr", dependencies = TRUE)
Create a LaminDB instance:
library(laminr)
lc <- import_module("lamin_cli")
lc$init(storage = "./mydata", modules = "bionty")
Or if you have write access to an instance, login and connect to it:
lc$login()
lc$connect("<account>/<instance>")
Quickstart
Track a script or notebook run with source code, inputs, outputs, logs, and environment.
import lamindb as ln
ln.track() # track a run
open("sample.fasta", "w").write(">seq1\nACGT\n")
ln.Artifact("sample.fasta", key="sample.fasta").save() # create an artifact
ln.finish() # finish the run
library(laminr)
ln <- import_module("lamindb")
ln$track() # track a run
writeLines(">seq1\nACGT\n", "sample.fasta")
ln$Artifact("sample.fasta", key="sample.fasta")$save() # create an artifact
ln$finish() # finish the run
This code snippet creates an artifact, which can store a dataset or model as a file or folder in various formats.
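An artifact can also wrap an entire folder, not just a single file. A minimal sketch, assuming a local directory exists (the path and key below are hypothetical example values):
import lamindb as ln

# store a whole folder (e.g., a sequencer's output directory) as one artifact;
# "./fastqs/" and the key are hypothetical
ln.Artifact("./fastqs/", key="experiment1/fastqs").save()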
Running the snippet as a script (python create-fasta.py) produces the following data lineage.
artifact = ln.Artifact.get(key="sample.fasta") # query artifact by key
artifact.view_lineage()
artifact = ln$Artifact$get(key="sample.fasta") # query artifact by key
artifact$view_lineage()

You’ll know how that artifact was created and what it’s used for (via an interactive visualization), in addition to its basic metadata:
artifact.describe()
artifact$describe()

You can organize datasets by validating & annotating them with any kind of metadata, and then access them via queries & search. Here is a more comprehensive example:

To annotate an artifact with a label, use:
my_experiment = ln.Record(name="My experiment").save() # create a label record
artifact.records.add(my_experiment) # annotate the artifact with the label
my_experiment = ln$Record(name="My experiment")$save() # create a label record
artifact$records$add(my_experiment) # annotate the artifact with the label
To query for a set of artifacts, use the filter() statement.
ln.Artifact.filter(records=my_experiment, suffix=".fasta").to_dataframe() # query by suffix and the record we just created
ln.Artifact.filter(transform__key="create-fasta.py").to_dataframe() # query by the key of the script we just ran
ln$Artifact$filter(records=my_experiment, suffix=".fasta")$to_dataframe() # query by suffix and the record we just created
ln$Artifact$filter(transform__key="create-fasta.py")$to_dataframe() # query by the key of the script we just ran
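Registries also support keyword search in addition to filters. A minimal sketch (the search term is illustrative; assumes a recent lamindb version where search() returns a queryset):
ln.Artifact.search("fasta").to_dataframe() # rank artifacts matching the search term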
If you have a structured dataset like a DataFrame, an AnnData, or another array, you can validate the content of the dataset (and parse annotations).
Here is an example for a dataframe.
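A minimal sketch of what this can look like, assuming a recent lamindb version in which Artifact.from_dataframe() accepts a schema argument (the column name, values & key are illustrative):
import lamindb as ln
import pandas as pd

df = pd.DataFrame({"cell_type": ["B cell", "T cell"]}) # illustrative dataset
cell_type = ln.Feature(name="cell_type", dtype=str).save() # register the feature to validate against
schema = ln.Schema(features=[cell_type]).save() # the expected columns
artifact = ln.Artifact.from_dataframe(df, key="examples/cells.parquet", schema=schema).save() # validates & annotates
artifact.describe() # shows the validated feature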
With a large body of validated datasets, you can then access data through distributed queries & batch streaming; see docs.lamin.ai/arrays.
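For instance, an array artifact can be opened for streaming instead of being downloaded. A minimal sketch (the key is hypothetical; the returned accessor depends on the array format):
artifact = ln.Artifact.get(key="example.h5ad") # hypothetical key of an AnnData artifact
access = artifact.open() # backed accessor that streams from storage
access.shape # inspect dimensions without loading the array into memory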