
Introduction

LaminDB is an open-source data lakehouse to enable learning at scale in biology. It organizes datasets through validation & annotation and provides data lineage, queryability, and reproducibility on top of FAIR data.

Why?

Reproducing analytical results or understanding how a dataset or model was created can be a pain. Let alone training models on historical data, LIMS & ELN systems, orthogonal assays, or datasets generated by other teams. Even maintaining a mere overview of a project’s or team’s datasets & analyses is harder than it sounds.

Biological datasets are typically managed with versioned storage systems, GUI-focused community or SaaS platforms, structureless data lakes, rigid data warehouses (SQL, monolithic arrays), and data lakehouses for tabular data.

LaminDB extends the lakehouse architecture to biological registries & datasets beyond tables (DataFrame, AnnData, .zarr, .tiledbsoma, …) with enough structure to enable queries and enough freedom to keep the pace of R&D high. Moreover, it provides context through data lineage – tracing data and code, scientists and models – and abstractions for biological domain knowledge and experimental metadata.

Highlights
  • data lineage: track inputs & outputs of notebooks, scripts, functions & pipelines with a single line of code

  • unified infrastructure: access diverse storage locations (local, S3, GCP, …), SQL databases (Postgres, SQLite) & ontologies

  • lakehouse capabilities: manage, monitor & validate features, labels & dataset schemas; perform distributed queries and batch loading

  • biological data formats: validate & annotate formats like DataFrame, AnnData, MuData, … backed by parquet, zarr, HDF5, LanceDB, DuckDB, …

  • biological entities: organize experimental metadata & extensible ontologies in registries based on the Django ORM (see the sketch after this list)

  • reproducible & auditable: auto-version & timestamp execution reports, source code & compute environments, attribute records to users

  • zero lock-in & scalable: runs in your infrastructure; is not a client for a rate-limited REST API

  • extendable: create custom plug-ins for your own applications based on the Django ecosystem

  • integrations: visualization tools like vitessce, workflow managers like nextflow & redun, and other tools

  • production-ready: used in BigPharma, BioTech, hospitals & top labs
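For instance, the ontology registries can be used like this (a minimal sketch, assuming the bionty plug-in and its CellType registry; method names may differ across versions):

import bionty as bt

# search the public Cell Ontology without creating any records
bt.CellType.public().search("gamma-delta T cell")

# create a registry record from the public ontology and save it
cell_type = bt.CellType.from_source(name="T cell").save()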

LaminHub is a data collaboration hub built on LaminDB similar to how GitHub is built on git.

Explore

LaminHub provides a web interface for working with LaminDB instances. It offers free access to large omics data collections at lamin.ai/explore – no account required.

While you stay in full control over storage & database permissions directly on AWS or GCP, LaminHub allows you to manage access similar to how you’d do it on GitHub, Google Drive, Microsoft Sharepoint, or Notion. See Manage access.

LaminHub is a SaaS product. For private data & commercial usage, see: lamin.ai/pricing.

You can copy this summary.md into an LLM chat and let AI explain.

Setup

Install the lamindb Python package:

pip install lamindb

Create a LaminDB instance:

lamin init --storage ./quickstart-data  # or s3://my-bucket, gs://my-bucket
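If you want biological ontologies available as registries, you can initialize the instance with the bionty module; a sketch assuming the CLI's --modules flag (mirroring the modules argument in the R call further below):

lamin init --storage ./quickstart-data --modules bionty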

Or if you have write access to an instance, login and connect to it:

lamin login
lamin connect account/instance

To use LaminDB from R, install the laminr package:

install.packages("laminr", dependencies = TRUE)

Create a LaminDB instance:

laminr::lamin_init(storage = "./mydata", modules = c("bionty"))

Or if you have write access to an instance, login and connect to it:

laminr::lamin_login()
laminr::lamin_connect("<account>/<instance>")

Quickstart

Track a script or notebook run with source code, inputs, outputs, logs, and environment.

import lamindb as ln

ln.track()  # track a run
open("sample.fasta", "w").write(">seq1\nACGT\n")
ln.Artifact("sample.fasta", key="sample.fasta").save()  # create an artifact
ln.finish()  # finish the run
The same in R:

library(laminr)
ln <- import_module("lamindb")

ln$track()  # track a run
writeLines(">seq1\nACGT\n", "sample.fasta")
ln$Artifact("sample.fasta", key="sample.fasta")$save()  # create an artifact
ln$finish()  # finish the run

This code snippet creates an artifact, which can store a dataset or model as a file or folder in various formats. Running the snippet as a script (python create-fasta.py) produces a data lineage that you can query and visualize:

artifact = ln.Artifact.get(key="sample.fasta")  # query artifact by key
artifact.view_lineage()
artifact = ln$Artifact$get(key="sample.fasta")  # query artifact by key
artifact$view_lineage()

Beyond knowing how the artifact was created and where it’s used (as an interactive visualization), you can inspect its basic metadata:

artifact.describe()
artifact$describe()

You can organize datasets by validating & annotating them with any kind of metadata and then access them via queries & search. The following examples show the basics.

To annotate an artifact with a label, use:

my_experiment = ln.ULabel(name="My experiment").save()  # create a label in the universal label ontology
artifact.ulabels.add(my_experiment)  # annotate the artifact with the label
my_experiment = ln$ULabel(name="My experiment")$save()  # create a label in the universal label ontology
artifact$ulabels$add(my_experiment)  # annotate the artifact with the label
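Beyond labels, you can also annotate with typed features. A minimal sketch, assuming lamindb's Feature registry and the features.add_values() accessor; the temperature feature is hypothetical:

ln.Feature(name="temperature", dtype=float).save()  # define a feature with a numeric dtype
artifact.features.add_values({"temperature": 21.6})  # annotate the artifact with a feature value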

To query for a set of artifacts, use the filter() method.

ln.Artifact.filter(ulabels=my_experiment, suffix=".fasta").to_dataframe()  # query by suffix and the ulabel we just created
ln.Artifact.filter(transform__key="create-fasta.py").to_dataframe()  # query by the name of the script we just ran
ln$Artifact$filter(ulabels=my_experiment, suffix=".fasta")$to_dataframe()  # query by suffix and the ulabel we just created
ln$Artifact$filter(transform__key="create-fasta.py")$to_dataframe()  # query by the name of the script we just ran
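Because registries are Django models, filters also accept Django-style lookups, and registries support full-text search. A sketch with hypothetical search terms, assuming search() returns a query set:

ln.Artifact.filter(suffix=".fasta", key__startswith="sample").to_dataframe()  # combine exact filters with lookups
ln.Artifact.search("fasta").to_dataframe()  # full-text search across artifact metadata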

If you have a structured dataset like a DataFrame, an AnnData, or another array, you can validate its content against a schema (and parse annotations). Here is an example for a DataFrame.
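A minimal sketch, assuming lamindb's Feature & Schema registries and the Artifact.from_df() constructor (named from_dataframe() in some releases); feature names and values are hypothetical:

import pandas as pd

# define labels that the categorical column is validated against
ln.ULabel(name="DMSO").save()
ln.ULabel(name="IFNG").save()

# define a schema from typed features
schema = ln.Schema(
    name="my minimal schema",
    features=[
        ln.Feature(name="perturbation", dtype=ln.ULabel).save(),
        ln.Feature(name="read_count", dtype=int).save(),
    ],
).save()

# a toy dataframe whose columns match the schema
df = pd.DataFrame({"perturbation": ["DMSO", "IFNG"], "read_count": [7, 11]})

# validate & annotate the dataframe, then save it as an artifact
artifact = ln.Artifact.from_df(df, key="my_datasets/my_df.parquet", schema=schema).save()

# query artifacts that were validated against this schema
ln.Artifact.filter(schema=schema).to_dataframe()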

With a large body of validated datasets, you can then access data through distributed queries & batch streaming; see docs.lamin.ai/arrays.
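A sketch of streaming access, assuming the parquet artifact from the example above and the Artifact.open() accessor, which returns a lazy object (e.g. a pyarrow dataset for parquet, an AnnData accessor for .h5ad):

artifact = ln.Artifact.get(key="my_datasets/my_df.parquet")  # hypothetical key from above
dataset = artifact.open()            # stream instead of loading into memory
print(dataset.head(2).to_pandas())   # peek at the first rows without a full download
# for ML training, collections of array artifacts can be batch-loaded, e.g. via Collection.mapped()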