Introduction¶
LaminDB is an open-source data framework to enable learning at scale in computational biology. It lets you track data transformations, curate datasets, manage metadata, and query a built-in database for biological entities & data structures.
Why?
Reproducing analytical results or understanding how a dataset or model was created can be a pain, let alone training models on historical data, orthogonal assays, or datasets generated by other teams.
Biological datasets are typically managed with versioned storage systems (file systems, object storage, git, dvc), UI-focused community or SaaS platforms, structureless data lakes, rigid data warehouses (SQL, monolithic arrays), and data lakehouses for tabular data.
LaminDB goes beyond these systems: it is a lakehouse that models biological datasets beyond tables, with enough structure to enable queries and enough freedom to keep the pace of R&D high.
For data structures like `DataFrame`, `AnnData`, `.zarr`, `.tiledbsoma`, etc., LaminDB tracks and provides the rich context that collaborative biological research requires:
data lineage: data sources and transformations; scientists and machine learning models
domain knowledge and experimental metadata: the features and labels derived from domain entities
In this blog post, we discuss a breadth of data management problems in the field.
LaminDB specs
Any LaminDB instance comes with an underlying SQL metadata database to organize files, folders, and arrays across any number of storage locations.
The following detailed specs are for the Python package `lamindb`. For the analogous R package `laminr`, see the R docs.
Manage data & metadata with a unified API (“lakehouse”).
Model files and folders as datasets & models via one class: `Artifact`
Use array formats in memory & storage: DataFrame, AnnData, MuData, tiledbsoma, … backed by parquet, zarr, tiledb, HDF5, h5ad, DuckDB, …
Create iterable & queryable collections of artifacts with data loaders: `Collection`
Version artifacts, collections & transforms: `IsVersioned`
Track data lineage across notebooks, scripts, pipelines & UI.
Track scripts & notebooks with a simple method call: `track()`
Track functions with a decorator: `tracked()`
A unified registry for all your notebooks, scripts & pipelines: `Transform`
A unified registry for all data transformation runs: `Run`
Manage execution reports, source code and Python environments for notebooks & scripts
Integrate with workflow managers: redun, nextflow, snakemake
Manage registries for experimental metadata & in-house ontologies, import public ontologies.
Use >20 public ontologies with module `bionty`: `Gene`, `Protein`, `CellMarker`, `ExperimentalFactor`, `CellType`, `CellLine`, `Tissue`, …
Use a canonical wetlab database schema module: `wetlab`
Safeguards against typos & duplications
Version ontology
Validate, standardize & annotate.
Validate & standardize metadata: `validate`, `standardize`
High-level curation flow including annotation: `Curator`
Inspect validation failures: `inspect`
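The validate/standardize pattern amounts to checking values against a controlled vocabulary and mapping synonyms to canonical terms. A minimal plain-Python sketch of the idea (an illustration, not the actual `bionty` API; the vocabulary here is made up):

```python
# Toy vocabulary: canonical term -> set of accepted synonyms
VOCAB = {"B cell": {"B-cell", "b cell"}, "T cell": {"T-cell", "t cell"}}
SYNONYM_TO_NAME = {syn: name for name, syns in VOCAB.items() for syn in syns}

def validate(values):
    """True for values that are canonical vocabulary terms."""
    return [v in VOCAB for v in values]

def standardize(values):
    """Map synonyms to canonical terms; leave other values as-is."""
    return [SYNONYM_TO_NAME.get(v, v) for v in values]

assert validate(["B cell", "B-cell"]) == [True, False]   # "B-cell" is a synonym, not canonical
assert standardize(["B-cell", "T cell"]) == ["B cell", "T cell"]
```

In LaminDB, the vocabulary comes from public ontologies or your own registries rather than a hard-coded dict.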
Organize and share data across a mesh of LaminDB instances.
Create & connect to instances with the same ease as git repos: `lamin init` & `lamin connect`
Zero-copy data transfer across instances
Integrate with analytics tools.
Vitessce: `save_vitessce_config`
Zero lock-in, scalable, auditable.
Zero lock-in: LaminDB runs on generic backends server-side and is not a client for “Lamin Cloud”
Flexible storage backends (local, S3, GCP, https, HF, R2, anything fsspec supports)
Two SQL backends for managing metadata: SQLite & Postgres
Scalable: metadata registries support 100s of millions of entries, storage is as scalable as S3
Plug-in custom schema modules & manage database schema migrations
Auditable: data & metadata records are hashed, timestamped, and attributed to users (full audit log to come)
Secure: embedded in your infrastructure
Tested, typed, idempotent & ACID
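Content hashing is what makes records auditable and saves idempotent: saving byte-identical content can be detected and the existing record returned instead of a duplicate. A toy plain-Python sketch of this deduplication idea (not LaminDB's actual implementation; MD5 is assumed here because it appears as a hash type in the registry outputs below):

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Hash raw bytes; a stand-in for a content hash."""
    return hashlib.md5(data).hexdigest()

class ToyRegistry:
    """Toy registry: deduplicates records by content hash."""
    def __init__(self):
        self.records = {}  # hash -> record

    def save(self, data: bytes, key: str) -> dict:
        h = content_hash(data)
        if h in self.records:       # identical content was saved before
            return self.records[h]  # idempotent: return the existing record
        record = {"hash": h, "key": key}
        self.records[h] = record
        return record

registry = ToyRegistry()
r1 = registry.save(b"sample1,DMSO\n", key="run1.csv")
r2 = registry.save(b"sample1,DMSO\n", key="run2.csv")  # same bytes, different key
assert r1 is r2  # deduplicated: the first record is returned
```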
LaminHub is a data collaboration hub built on LaminDB similar to how GitHub is built on git.
LaminHub overview
See for yourself by browsing the demo instances in the hub UI or connecting to them via the CLI with `lamin connect owner/instance`.
lamin.ai/laminlabs/lamindata - A generic demo instance with various data types
lamin.ai/laminlabs/cellxgene - An instance that interfaces the CELLxGENE data (guide)
lamin.ai/laminlabs/arrayloader-benchmarks - Work with ML models & benchmarks
See the pricing page. Basic LaminHub features are free.
Secure & intuitive access management.
Rather than configuring storage & database permissions directly on AWS or GCP, LaminHub allows you to manage collaborators for databases & storage locations in the same way you manage access to repositories on GitHub. See Access management.
A UI to work with LaminDB instances.
See an overview of all datasets, models, code, and metadata in your instance.

See validated datasets in context of ontologies & experimental metadata.

Query & search.

See scripts, notebooks & pipelines with their inputs & outputs.

Track pipelines, notebooks & UI transforms in one place.

Quickstart¶
Install the `lamindb` Python package.
pip install 'lamindb[jupyter,bionty]' # support notebooks & biological ontologies
Connect to a LaminDB instance.
lamin connect account/instance # <-- replace with your instance
Access an input dataset and save an output dataset.
import lamindb as ln
ln.track() # track a run of your notebook or script
artifact = ln.Artifact.get("3TNCsZZcnIBv2WGb0001") # get an artifact by uid
filepath = artifact.cache() # cache the artifact on disk
# do your work
ln.Artifact("./my_dataset.csv", key="my_results/my_dataset.csv").save() # save a file
ln.finish() # mark the run as finished & save a report for the current notebook/script
In R:
install.packages("laminr", dependencies = TRUE) # install the laminr package from CRAN
laminr::install_lamindb() # install lamindb for usage via reticulate
laminr::lamin_connect("<account>/<instance>") # <-- replace with your instance
library(laminr)
ln <- import_module("lamindb")
ln$track() # track a run of your notebook or script
artifact <- ln$Artifact$get("3TNCsZZcnIBv2WGb0001") # get an artifact by uid
filepath <- artifact$cache() # cache the artifact on disk
# do your work
ln$Artifact("./my_dataset.csv", key="my_results/my_dataset.csv").save() # save a file
ln$finish() # mark the run finished
Depending on whether you ran RStudio’s notebook mode, you may need to save an HTML export for a `.qmd` or `.Rmd` file via the command line.
lamin save my-analysis.Rmd
For more, see the R docs.
Walkthrough¶
Biological systems are characterized via batches of data in various formats.
LaminDB provides a framework to transform these batches into more useful representations: validated, queryable datasets, machine learning models, and analytical insights.
The metadata involved in this process are stored in a LaminDB instance, a database that manages datasets through their metadata. Creating one is as simple as creating a git repository.
!lamin init --storage ./lamin-intro --modules bionty
Show code cell output
! using anonymous user (to identify, call: lamin login)
→ initialized lamindb: anonymous/lamin-intro
What else can I configure during setup?
You can pass a cloud storage location to `--storage` (S3, GCP, R2, HF, etc.): `--storage s3://my-bucket`
Instead of the default SQLite database, pass a Postgres connection string to `--db`: `--db postgresql://<user>:<pwd>@<hostname>:<port>/<dbname>`
Instead of a default instance name derived from the storage location, provide a custom name: `--name my-name`
Mount additional schema modules: `--modules bionty,wetlab,custom1`
For more info, see Install & setup.
Track data transformations¶
The code that generates a dataset is a transform (`Transform`). It can be a script, a notebook, a pipeline, or a function. Let’s track the notebook that’s being run.
import lamindb as ln
import pandas as pd
ln.track() # track the current notebook or script
Show code cell output
→ connected lamindb: anonymous/lamin-intro
→ created Transform('jh6KN5z7XOtR0000'), started new Run('Ze4sFM46...') at 2025-03-20 21:14:56 UTC
→ notebook imports: anndata==0.11.3 bionty==1.1.2 lamindb==1.3.0 pandas==2.2.3
By calling `track()`, the notebook gets automatically linked as the source of all data that’s about to be saved!
What happened under the hood?
The full run environment and imported package versions of the current notebook were detected
Notebook metadata was detected and stored in a `Transform` record with a unique id
Run metadata was detected and stored in a `Run` record with a unique id
The `Transform` registry stores data transformations: scripts, notebooks, pipelines, functions.
The `Run` registry stores executions of transforms. Many runs can be linked to the same transform if executed with different context (time, user, input data, etc.).
How do I track a pipeline instead of a notebook?
Leverage a pipeline integration, see: Pipelines – workflow managers. Or manually add code as seen below.
transform = ln.Transform(name="My pipeline")
transform.version = "1.2.0" # tag the version
ln.track(transform)
Why should I care about tracking notebooks?
Because notebooks are interactive and humans are in the loop, most mistakes happen when using them.
`track()` makes notebooks & derived results reproducible & auditable, enabling you to learn from mistakes.
This is important because much of the insight generated from biological data is driven by computational biologists interacting with it. An early blog post on this is here.
Is this compliant with OpenLineage?
Yes. What OpenLineage calls a “job”, LaminDB calls a “transform”. What OpenLineage calls a “run”, LaminDB calls a “run”.
You can see all your transforms and their runs in the `Transform` and `Run` registries.
ln.Transform.df()
Show code cell output
uid | key | description | type | source_code | hash | reference | reference_type | space_id | _template_id | version | is_latest | created_at | created_by_id | _aux | _branch_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||||
1 | jh6KN5z7XOtR0000 | introduction.ipynb | Introduction | notebook | None | None | None | None | 1 | None | None | True | 2025-03-20 21:14:56.319000+00:00 | 1 | None | 1 |
ln.Run.df()
Show code cell output
uid | name | started_at | finished_at | reference | reference_type | _is_consecutive | _status_code | space_id | transform_id | report_id | _logfile_id | environment_id | initiated_by_run_id | created_at | created_by_id | _aux | _branch_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||||||
1 | Ze4sFM46vndcZ6HPyw4H | None | 2025-03-20 21:14:56.330451+00:00 | None | None | None | None | 0 | 1 | 1 | None | None | None | None | 2025-03-20 21:14:56.331000+00:00 | 1 | None | 1 |
Manage artifacts¶
The `Artifact` class manages datasets & models that are stored as files, folders, or arrays. `Artifact` is a registry to manage search, queries, validation & storage access.
You can register data structures (`DataFrame`, `AnnData`, …) and files or folders in local storage, AWS S3 (`s3://...`), Google Cloud (`gs://...`), Hugging Face (`hf://...`), or any other file system supported by `fsspec`.
Dataframes¶
Let’s first look at an example dataframe.
df = ln.core.datasets.small_dataset1(with_typo=True)
df
Show code cell output
ENSG00000153563 | ENSG00000010610 | ENSG00000170458 | perturbation | sample_note | cell_type_by_expert | cell_type_by_model | assay_oid | concentration | treatment_time_h | donor | |
---|---|---|---|---|---|---|---|---|---|---|---|
sample1 | 1 | 3 | 5 | DMSO | was ok | B cell | B cell | EFO:0008913 | 0.1% | 24 | D0001 |
sample2 | 2 | 4 | 6 | IFNJ | looks naah | CD8-positive, alpha-beta T cell | T cell | EFO:0008913 | 200 nM | 24 | D0002 |
sample3 | 3 | 5 | 7 | DMSO | pretty! 🤩 | CD8-positive, alpha-beta T cell | T cell | EFO:0008913 | 0.1% | 6 | None |
This is how you create an artifact from a dataframe.
artifact = ln.Artifact.from_df(df, key="my_datasets/rnaseq1.parquet").save()
artifact.describe()
Show code cell output
Artifact .parquet/DataFrame └── General ├── .uid = '4x6wtbfizcRu5tqZ0000' ├── .key = 'my_datasets/rnaseq1.parquet' ├── .size = 9012 ├── .hash = 'ZHlfaXCXxza090J-PA1nCg' ├── .n_observations = 3 ├── .path = /home/runner/work/lamin-docs/lamin-docs/docs/lamin-intro/.lamindb/4x6wtbfizcRu5tqZ0000.parquet ├── .created_by = anonymous ├── .created_at = 2025-03-20 21:15:00 └── .transform = 'Introduction'
And this is how you load it back into memory.
artifact.load()
Show code cell output
ENSG00000153563 | ENSG00000010610 | ENSG00000170458 | perturbation | sample_note | cell_type_by_expert | cell_type_by_model | assay_oid | concentration | treatment_time_h | donor | |
---|---|---|---|---|---|---|---|---|---|---|---|
sample1 | 1 | 3 | 5 | DMSO | was ok | B cell | B cell | EFO:0008913 | 0.1% | 24 | D0001 |
sample2 | 2 | 4 | 6 | IFNJ | looks naah | CD8-positive, alpha-beta T cell | T cell | EFO:0008913 | 200 nM | 24 | D0002 |
sample3 | 3 | 5 | 7 | DMSO | pretty! 🤩 | CD8-positive, alpha-beta T cell | T cell | EFO:0008913 | 0.1% | 6 | None |
Understand data lineage¶
You can understand where an artifact comes from by looking at its `Transform` & `Run` records:
artifact.transform
Transform(uid='jh6KN5z7XOtR0000', is_latest=True, key='introduction.ipynb', description='Introduction', type='notebook', space_id=1, created_by_id=1, created_at=2025-03-20 21:14:56 UTC)
artifact.run
Run(uid='Ze4sFM46vndcZ6HPyw4H', started_at=2025-03-20 21:14:56 UTC, space_id=1, transform_id=1, created_by_id=1, created_at=2025-03-20 21:14:56 UTC)
Or visualize deeper data lineage with the `view_lineage()` method. Here, we’re only one step deep.
artifact.view_lineage()
Show me a more interesting example, please!
I just want to see the transforms.
artifact.transform.view_lineage()
Data lineage also helps you understand what a dataset is used for; many datasets are reused over and over for different purposes.
At the end of your notebook or script, call `finish()`. Here, we’re not yet done, so we’re commenting it out.
# ln.finish() # mark run as finished, save execution report, source code & environment
Here is how a notebook looks on the hub.

To create a new version of a notebook or script, run `lamin load` on the terminal, e.g.,
$ lamin load https://lamin.ai/laminlabs/lamindata/transform/13VINnFk89PE0004
→ notebook is here: mcfarland_2020_preparation.ipynb
Versioning¶
Just like transforms, artifacts are versioned. Let’s create a new version by revising the dataset.
# keep the dataframe with a typo around - we'll need it later
df_typo = df.copy()
# fix the "IFNJ" typo
df["perturbation"] = df["perturbation"].cat.rename_categories({"IFNJ": "IFNG"})
# create a new version
artifact = ln.Artifact.from_df(df, key="my_datasets/rnaseq1.parquet").save()
# see all versions of an artifact
artifact.versions.df()
Show code cell output
→ creating new artifact version for key='my_datasets/rnaseq1.parquet' (storage: '/home/runner/work/lamin-docs/lamin-docs/docs/lamin-intro')
uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||||
1 | 4x6wtbfizcRu5tqZ0000 | my_datasets/rnaseq1.parquet | None | .parquet | dataset | DataFrame | 9012 | ZHlfaXCXxza090J-PA1nCg | None | 3 | md5 | True | False | 1 | 1 | None | None | False | 1 | 2025-03-20 21:15:00.386000+00:00 | 1 | None | 1 |
2 | 4x6wtbfizcRu5tqZ0001 | my_datasets/rnaseq1.parquet | None | .parquet | dataset | DataFrame | 9012 | iBiiWBkIitgFtLcru2CLyA | None | 3 | md5 | True | False | 1 | 1 | None | None | True | 1 | 2025-03-20 21:15:00.601000+00:00 | 1 | None | 1 |
Can I also create new versions independent of `key`?
That works, too; you can use `revises`:
artifact_v1 = ln.Artifact.from_df(df, description="Just a description").save()
# below revises artifact_v1
artifact_v2 = ln.Artifact.from_df(df_updated, revises=artifact_v1).save()
The advantage of passing `revises: Artifact` is that you don’t need to worry about coming up with naming conventions for paths. The advantage of versioning based on `key` is that it’s how most data versioning tools do it.
Files and folders¶
Let’s look at a folder in the cloud that contains 3 sub-folders storing images & metadata of Iris flowers, generated in 3 subsequent studies.
# we use anon=True here in case no aws credentials are configured
ln.UPath("s3://lamindata/iris_studies", anon=True).view_tree()
Show code cell output
3 sub-directories & 151 files with suffixes '.csv', '.jpg'
s3://lamindata/iris_studies
├── study0_raw_images/
│ ├── iris-0337d20a3b7273aa0ddaa7d6afb57a37a759b060e4401871db3cefaa6adc068d.jpg
│ ├── iris-0797945218a97d6e5251b4758a2ba1b418cbd52ce4ef46a3239e4b939bd9807b.jpg
│ ├── iris-0f133861ea3fe1b68f9f1b59ebd9116ff963ee7104a0c4200218a33903f82444.jpg
│ ├── iris-0fec175448a23db03c1987527f7e9bb74c18cffa76ef003f962c62603b1cbb87.jpg
│ ├── iris-125b6645e086cd60131764a6bed12650e0f7f2091c8bbb72555c103196c01881.jpg
│ ├── iris-13dfaff08727abea3da8cfd8d097fe1404e76417fefe27ff71900a89954e145a.jpg
│ ├── iris-1566f7f5421eaf423a82b3c1cd1328f2a685c5ef87d8d8e710f098635d86d3d0.jpg
│ ├── iris-1804702f49c2c385f8b30913569aebc6dce3da52ec02c2c638a2b0806f16014e.jpg
│ ├── iris-318d451a8c95551aecfde6b55520f302966db0a26a84770427300780b35aa05a.jpg
│ ├── iris-3dec97fe46d33e194520ca70740e4c2e11b0ffbffbd0aec0d06afdc167ddf775.jpg
│ ├── iris-3eed72bc2511f619190ce79d24a0436fef7fcf424e25523cb849642d14ac7bcf.jpg
│ ├── iris-430fa45aad0edfeb5b7138ff208fdeaa801b9830a9eb68f378242465b727289a.jpg
│ ├── iris-4cc15cd54152928861ecbdc8df34895ed463403efb1571dac78e3223b70ef569.jpg
│ ├── iris-4febb88ef811b5ca6077d17ef8ae5dbc598d3f869c52af7c14891def774d73fa.jpg
│ ├── iris-590e7f5b8f4de94e4b82760919abd9684ec909d9f65691bed8e8f850010ac775.jpg
│ ├── iris-5a313749aa61e9927389affdf88dccdf21d97d8a5f6aa2bd246ca4bc926903ba.jpg
│ ├── iris-5b3106db389d61f4277f43de4953e660ff858d8ab58a048b3d8bf8d10f556389.jpg
│ ├── iris-5f4e8fffde2404cc30be275999fddeec64f8a711ab73f7fa4eb7667c8475c57b.jpg
│ ├── iris-68d83ad09262afb25337ccc1d0f3a6d36f118910f36451ce8a6600c77a8aa5bd.jpg
│ ├── iris-70069edd7ab0b829b84bb6d4465b2ca4038e129bb19d0d3f2ba671adc03398cc.jpg
│ ├── iris-7038aef1137814473a91f19a63ac7a55a709c6497e30efc79ca57cfaa688f705.jpg
│ ├── iris-74d1acf18cfacd0a728c180ec8e1c7b4f43aff72584b05ac6b7c59f5572bd4d4.jpg
│ ├── iris-7c3b5c5518313fc6ff2c27fcbc1527065cbb42004d75d656671601fa485e5838.jpg
│ ├── iris-7cf1ebf02b2cc31539ed09ab89530fec6f31144a0d5248a50e7c14f64d24fe6e.jpg
│ ├── iris-7dcc69fa294fe04767706c6f455ea6b31d33db647b08aab44b3cd9022e2f2249.jpg
│ ├── iris-801b7efb867255e85137bc1e1b06fd6cbab70d20cab5b5046733392ecb5b3150.jpg
│ ├── iris-8305dd2a080e7fe941ea36f3b3ec0aa1a195ad5d957831cf4088edccea9465e2.jpg
│ ├── iris-83f433381b755101b9fc9fbc9743e35fbb8a1a10911c48f53b11e965a1cbf101.jpg
│ ├── iris-874121a450fa8a420bdc79cc7808fd28c5ea98758a4b50337a12a009fa556139.jpg
│ ├── iris-8c216e1acff39be76d6133e1f549d138bf63359fa0da01417e681842210ea262.jpg
│ ├── iris-92c4268516ace906ad1ac44592016e36d47a8c72a51cacca8597ba9e18a8278b.jpg
│ ├── iris-95d7ec04b8158f0873fa4aab7b0a5ec616553f3f9ddd6623c110e3bc8298248f.jpg
│ ├── iris-9ce2d8c4f1eae5911fcbd2883137ba5542c87cc2fe85b0a3fbec2c45293c903e.jpg
│ ├── iris-9ee27633bb041ef1b677e03e7a86df708f63f0595512972403dcf5188a3f48f5.jpg
│ ├── iris-9fb8d691550315506ae08233406e8f1a4afed411ea0b0ac37e4b9cdb9c42e1ec.jpg
│ ├── iris-9ffe51c2abd973d25a299647fa9ccaf6aa9c8eecf37840d7486a061438cf5771.jpg
│ ├── iris-a2be5db78e5b603a5297d9a7eec4e7f14ef2cba0c9d072dc0a59a4db3ab5bb13.jpg
│ ├── iris-ad7da5f15e2848ca269f28cd1dc094f6f685de2275ceaebb8e79d2199b98f584.jpg
│ ├── iris-bc515e63b5a4af49db8c802c58c83db69075debf28c792990d55a10e881944d9.jpg
│ ├── iris-bd8d83096126eaa10c44d48dbad4b36aeb9f605f1a0f6ca929d3d0d492dafeb6.jpg
│ ├── iris-bdae8314e4385d8e2322abd8e63a82758a9063c77514f49fc252e651cbd79f82.jpg
│ ├── iris-c175cd02ac392ecead95d17049f5af1dcbe37851c3e42d73e6bb813d588ea70b.jpg
│ ├── iris-c31e6056c94b5cb618436fbaac9eaff73403fa1b87a72db2c363d172a4db1820.jpg
│ ├── iris-ca40bc5839ee2f9f5dcac621235a1db2f533f40f96a35e1282f907b40afa457d.jpg
│ ├── iris-ddb685c56cfb9c8496bcba0d57710e1526fff7d499536b3942d0ab375fa1c4a6.jpg
│ ├── iris-e437a7c7ad2bbac87fef3666b40c4de1251b9c5f595183eda90a8d9b1ef5b188.jpg
│ ├── iris-e7e0774289e2153cc733ff62768c40f34ac9b7b42e23c1abc2739f275e71a754.jpg
│ ├── iris-e9da6dd69b7b07f80f6a813e2222eae8c8f7c3aeaa6bcc02b25ea7d763bcf022.jpg
│ ├── iris-eb01666d4591b2e03abecef5a7ded79c6d4ecb6d1922382c990ad95210d55795.jpg
│ ├── iris-f6e4890dee087bd52e2c58ea4c6c2652da81809603ea3af561f11f8c2775c5f3.jpg
│ └── meta.csv
├── study1_raw_images/
│ ├── iris-0879d3f5b337fe512da1c7bf1d2bfd7616d744d3eef7fa532455a879d5cc4ba0.jpg
│ ├── iris-0b486eebacd93e114a6ec24264e035684cebe7d2074eb71eb1a71dd70bf61e8f.jpg
│ ├── iris-0ff5ba898a0ec179a25ca217af45374fdd06d606bb85fc29294291facad1776a.jpg
│ ├── iris-1175239c07a943d89a6335fb4b99a9fb5aabb2137c4d96102f10b25260ae523f.jpg
│ ├── iris-1289c57b571e8e98e4feb3e18a890130adc145b971b7e208a6ce5bad945b4a5a.jpg
│ ├── iris-12adb3a8516399e27ff1a9d20d28dca4674836ed00c7c0ae268afce2c30c4451.jpg
│ ├── iris-17ac8f7b5734443090f35bdc531bfe05b0235b5d164afb5c95f9d35f13655cf3.jpg
│ ├── iris-2118d3f235a574afd48a1f345bc2937dad6e7660648516c8029f4e76993ea74d.jpg
│ ├── iris-213cd179db580f8e633087dcda0969fd175d18d4f325cb5b4c5f394bbba0c1e0.jpg
│ ├── iris-21a1255e058722de1abe928e5bbe1c77bda31824c406c53f19530a3ca40be218.jpg
│ ├── iris-249370d38cc29bc2a4038e528f9c484c186fe46a126e4b6c76607860679c0453.jpg
│ ├── iris-2ac575a689662b7045c25e2554df5f985a3c6c0fd5236fabef8de9c78815330c.jpg
│ ├── iris-2c5b373c2a5fd214092eb578c75eb5dc84334e5f11a02f4fa23d5d316b18f770.jpg
│ ├── iris-2ecaad6dfe3d9b84a756bc2303a975a732718b954a6f54eae85f681ea3189b13.jpg
│ ├── iris-32827aec52e0f3fa131fa85f2092fc6fa02b1b80642740b59d029cef920c26b3.jpg
│ ├── iris-336fc3472b6465826f7cd87d5cef8f78d43cf2772ebe058ce71e1c5bad74c0e1.jpg
│ ├── iris-432026d8501abcd495bd98937a82213da97fca410af1c46889eabbcf2fd1b589.jpg
│ ├── iris-49a9158e46e788a39eeaefe82b19504d58dde167f540df6bc9492c3916d5f7ca.jpg
│ ├── iris-4b47f927405d90caa15cbf17b0442390fc71a2ca6fb8d07138e8de17d739e9a4.jpg
│ ├── iris-5691cad06fe37f743025c097fa9c4cec85e20ca3b0efff29175e60434e212421.jpg
│ ├── iris-5c38dba6f6c27064eb3920a5758e8f86c26fec662cc1ac4b5208d5f30d1e3ead.jpg
│ ├── iris-5da184e8620ebf0feef4d5ffe4346e6c44b2fb60cecc0320bd7726a1844b14cd.jpg
│ ├── iris-66eee9ff0bfa521905f733b2a0c6c5acad7b8f1a30d280ed4a17f54fe1822a7e.jpg
│ ├── iris-6815050b6117cf2e1fd60b1c33bfbb94837b8e173ff869f625757da4a04965c9.jpg
│ ├── iris-793fe85ddd6a97e9c9f184ed20d1d216e48bf85aa71633eff6d27073e0825d54.jpg
│ ├── iris-850229e6293a741277eb5efaa64d03c812f007c5d0f470992a8d4cfdb902230c.jpg
│ ├── iris-86d782d20ef7a60e905e367050b0413ca566acc672bc92add0bb0304faa54cfc.jpg
│ ├── iris-875a96790adc5672e044cf9da9d2edb397627884dfe91c488ab3fb65f65c80ff.jpg
│ ├── iris-96f06136df7a415550b90e443771d0b5b0cd990b503b64cc4987f5cb6797fa9b.jpg
│ ├── iris-9a889c96a37e8927f20773783a084f31897f075353d34a304c85e53be480e72a.jpg
│ ├── iris-9e3208f4f9fedc9598ddf26f77925a1e8df9d7865a4d6e5b4f74075d558d6a5e.jpg
│ ├── iris-a7e13b6f2d7f796768d898f5f66dceefdbd566dd4406eea9f266fc16dd68a6f2.jpg
│ ├── iris-b026efb61a9e3876749536afe183d2ace078e5e29615b07ac8792ab55ba90ebc.jpg
│ ├── iris-b3c086333cb5ccb7bb66a163cf4bf449dc0f28df27d6580a35832f32fd67bfc9.jpg
│ ├── iris-b795e034b6ea08d3cd9acaa434c67aca9d17016991e8dd7d6fd19ae8f6120b77.jpg
│ ├── iris-bb4a7ad4c844987bc9dc9dfad2b363698811efe3615512997a13cd191c23febc.jpg
│ ├── iris-bd60a6ed0369df4bea1934ef52277c32757838123456a595c0f2484959553a36.jpg
│ ├── iris-c15d6019ebe17d7446ced589ef5ef7a70474d35a8b072e0edfcec850b0a106db.jpg
│ ├── iris-c45295e76c6289504921412293d5ddbe4610bb6e3b593ea9ec90958e74b73ed2.jpg
│ ├── iris-c50d481f9fa3666c2c3808806c7c2945623f9d9a6a1d93a17133c4cb1560c41c.jpg
│ ├── iris-df4206653f1ec9909434323c05bb15ded18e72587e335f8905536c34a4be3d45.jpg
│ ├── iris-e45d869cb9d443b39d59e35c2f47870f5a2a335fce53f0c8a5bc615b9c53c429.jpg
│ ├── iris-e76fa5406e02a312c102f16eb5d27c7e0de37b35f801e1ed4c28bd4caf133e7a.jpg
│ ├── iris-e8d3fd862aae1c005bcc80a73fd34b9e683634933563e7538b520f26fd315478.jpg
│ ├── iris-ea578f650069a67e5e660bb22b46c23e0a182cbfb59cdf5448cf20ce858131b6.jpg
│ ├── iris-eba0c546e9b7b3d92f0b7eb98b2914810912990789479838807993d13787a2d9.jpg
│ ├── iris-f22d4b9605e62db13072246ff6925b9cf0240461f9dfc948d154b983db4243b9.jpg
│ ├── iris-fac5f8c23d8c50658db0f4e4a074c2f7771917eb52cbdf6eda50c12889510cf4.jpg
│ └── meta.csv
└── study2_raw_images/
├── iris-01cdd55ca6402713465841abddcce79a2e906e12edf95afb77c16bde4b4907dc.jpg
├── iris-02868b71ddd9b33ab795ac41609ea7b20a6e94f2543fad5d7fa11241d61feacf.jpg
├── iris-0415d2f3295db04bebc93249b685f7d7af7873faa911cd270ecd8363bd322ed5.jpg
├── iris-0c826b6f4648edf507e0cafdab53712bb6fd1f04dab453cee8db774a728dd640.jpg
├── iris-10fb9f154ead3c56ba0ab2c1ab609521c963f2326a648f82c9d7cabd178fc425.jpg
├── iris-14cbed88b0d2a929477bdf1299724f22d782e90f29ce55531f4a3d8608f7d926.jpg
├── iris-186fe29e32ee1405ddbdd36236dd7691a3c45ba78cc4c0bf11489fa09fbb1b65.jpg
├── iris-1b0b5aabd59e4c6ed1ceb54e57534d76f2f3f97e0a81800ff7ed901c35a424ab.jpg
├── iris-1d35672eb95f5b1cf14c2977eb025c246f83cdacd056115fdc93e946b56b610c.jpg
├── iris-1f941001f508ff1bd492457a90da64e52c461bfd64587a3cf7c6bf1bcb35adab.jpg
├── iris-2a09038b87009ecee5e5b4cd4cef068653809cc1e08984f193fad00f1c0df972.jpg
├── iris-308389e34b6d9a61828b339916aed7af295fdb1c7577c23fb37252937619e7e4.jpg
├── iris-30e4e56b1f170ff4863b178a0a43ea7a64fdd06c1f89a775ec4dbf5fec71e15c.jpg
├── iris-332953f4d6a355ca189e2508164b24360fc69f83304e7384ca2203ddcb7c73b5.jpg
├── iris-338fc323ed045a908fb1e8ff991255e1b8e01c967e36b054cb65edddf97b3bb0.jpg
├── iris-34a7cc16d26ba0883574e7a1c913ad50cf630e56ec08ee1113bf3584f4e40230.jpg
├── iris-360196ba36654c0d9070f95265a8a90bc224311eb34d1ab0cf851d8407d7c28e.jpg
├── iris-36132c6df6b47bda180b1daaafc7ac8a32fd7f9af83a92569da41429da49ea5b.jpg
├── iris-36f2b9282342292b67f38a55a62b0c66fa4e5bb58587f7fec90d1e93ea8c407a.jpg
├── iris-37ad07fd7b39bc377fa6e9cafdb6e0c57fb77df2c264fe631705a8436c0c2513.jpg
├── iris-3ba1625bb78e4b69b114bdafcdab64104b211d8ebadca89409e9e7ead6a0557c.jpg
├── iris-4c5d9a33327db025d9c391aeb182cbe20cfab4d4eb4ac951cc5cd15e132145d8.jpg
├── iris-522f3eb1807d015f99e66e73b19775800712890f2c7f5b777409a451fa47d532.jpg
├── iris-589fa96b9a3c2654cf08d05d3bebf4ab7bc23592d7d5a95218f9ff87612992fa.jpg
├── iris-61b71f1de04a03ce719094b65179b06e3cd80afa01622b30cda8c3e41de6bfaa.jpg
├── iris-62ef719cd70780088a4c140afae2a96c6ca9c22b72b078e3b9d25678d00b88a5.jpg
├── iris-819130af42335d4bb75bebb0d2ee2e353a89a3d518a1d2ce69842859c5668c5a.jpg
├── iris-8669e4937a2003054408afd228d99cb737e9db5088f42d292267c43a3889001a.jpg
├── iris-86c76e0f331bc62192c392cf7c3ea710d2272a8cc9928d2566a5fc4559e5dce4.jpg
├── iris-8a8bc54332a42bb35ee131d7b64e9375b4ac890632eb09e193835b838172d797.jpg
├── iris-8e9439ec7231fa3b9bc9f62a67af4e180466b32a72316600431b1ec93e63b296.jpg
├── iris-90b7d491b9a39bb5c8bb7649cce90ab7f483c2759fb55fda2d9067ac9eec7e39.jpg
├── iris-9dededf184993455c411a0ed81d6c3c55af7c610ccb55c6ae34dfac2f8bde978.jpg
├── iris-9e6ce91679c9aaceb3e9c930f11e788aacbfa8341a2a5737583c14a4d6666f3d.jpg
├── iris-a0e65269f7dc7801ac1ad8bd0c5aa547a70c7655447e921d1d4d153a9d23815e.jpg
├── iris-a445b0720254984275097c83afbdb1fe896cb010b5c662a6532ed0601ea24d7c.jpg
├── iris-a6b85bf1f3d18bbb6470440592834c2c7f081b490836392cf5f01636ee7cf658.jpg
├── iris-b005c82b844de575f0b972b9a1797b2b1fbe98c067c484a51006afc4f549ada4.jpg
├── iris-bfcf79b3b527eb64b78f9a068a1000042336e532f0f44e68f818dd13ab492a76.jpg
├── iris-c156236fb6e888764485e796f1f972bbc7ad960fe6330a7ce9182922046439c4.jpg
├── iris-d99d5fd2de5be1419cbd569570dbb6c9a6c8ec4f0a1ff5b55dc2607f6ecdca8f.jpg
├── iris-d9aae37a8fa6afdef2af170c266a597925eea935f4d070e979d565713ea62642.jpg
├── iris-dbc87fcecade2c070baaf99caf03f4f0f6e3aa977e34972383cb94d0efe8a95d.jpg
├── iris-e3d1a560d25cf573d2cbbf2fe6cd231819e998109a5cf1788d59fbb9859b3be2.jpg
├── iris-ec288bdad71388f907457db2476f12a5cb43c28cfa28d2a2077398a42b948a35.jpg
├── iris-ed5b4e072d43bc53a00a4a7f4d0f5d7c0cbd6a006e9c2d463128cedc956cb3de.jpg
├── iris-f3018a9440d17c265062d1c61475127f9952b6fe951d38fd7700402d706c0b01.jpg
├── iris-f47c5963cdbaa3238ba2d446848e8449c6af83e663f0a9216cf0baba8429b36f.jpg
├── iris-fa4b6d7e3617216104b1405cda21bf234840cd84a2c1966034caa63def2f64f0.jpg
├── iris-fc4b0cc65387ff78471659d14a78f0309a76f4c3ec641b871e40b40424255097.jpg
└── meta.csv
Let’s create an artifact for the first sub-folder.
artifact = ln.Artifact("s3://lamindata/iris_studies/study0_raw_images").save()
artifact
Show code cell output
! calling anonymously, will miss private instances
Artifact(uid='jhRYpvTeYnFTzGox0000', is_latest=True, key='iris_studies/study0_raw_images', suffix='', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', n_files=51, space_id=1, storage_id=2, run_id=1, created_by_id=1, created_at=2025-03-20 21:15:02 UTC)
As you see from `path`, the folder wasn’t copied into another storage location. It was merely registered in its present storage location.
artifact.path
Show code cell output
S3QueryPath('s3://lamindata/iris_studies/study0_raw_images')
LaminDB keeps track of all your storage locations.
ln.Storage.df()
Show code cell output
uid | root | description | type | region | instance_uid | space_id | run_id | created_at | created_by_id | _aux | _branch_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||
2 | 9QNzrgIwHpbg | s3://lamindata | None | s3 | us-east-1 | None | 1 | None | 2025-03-20 21:15:02.559000+00:00 | 1 | None | 1 |
1 | GdPz9NhxmThz | /home/runner/work/lamin-docs/lamin-docs/docs/l... | None | local | None | 3MepSh2Col3I | 1 | None | 2025-03-20 21:14:52.865000+00:00 | 1 | None | 1 |
To cache the cloud folder locally, call `cache()`.
artifact.cache()
Show code cell output
PosixUPath('/home/runner/.cache/lamindb/lamindata/iris_studies/study0_raw_images')
If the data is large, you might not want to download it but stream it via `open()`. For more on this, see: Slice arrays.
How do I update & delete an artifact?
artifact.description = "My new description" # change description
artifact.save() # save the change to the database
artifact.delete() # move to trash
artifact.delete(permanent=True) # permanently delete
How do I create an artifact for a local file or folder?
Source path is local:
ln.Artifact("./my_data.fcs", key="my_data.fcs")
ln.Artifact("./my_images/", key="my_images")
Upon `artifact.save()`, the source path will be copied or uploaded into your instance’s current storage, visible & changeable via `ln.settings.storage`.
If the source path is remote or already in a registered storage location (one that’s registered in `ln.Storage`), `artifact.save()` will not trigger a copy or upload but register the existing path.
ln.Artifact("s3://my-bucket/my_data.fcs") # key is auto-populated from S3, you can optionally pass a description
ln.Artifact("s3://my-bucket/my_images/") # key is auto-populated from S3, you can optionally pass a description
You can use any storage location supported by `fsspec`.
Which fields are populated when creating an artifact record?
Basic fields:
`uid`: universal id
`key`: a (virtual) relative path of the artifact in `storage`
`description`: an optional string description
`storage`: the storage location (the root, say, an S3 bucket or a local directory)
`suffix`: an optional file/path suffix
`size`: the artifact size in bytes
`hash`: a hash useful to check for integrity and collisions (is this artifact already stored?)
`hash_type`: the type of the hash
`created_at`: time of creation
`updated_at`: time of last update
Provenance-related fields:
`created_by`: the `User` who created the artifact
For a full reference, see `Artifact`.
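To make `suffix`, `size`, and `hash` concrete: for a local file, they could be derived roughly as below. This is a plain-Python sketch assuming a simple MD5 content hash (as the registry outputs in this guide suggest); LaminDB's actual handling of folders and large files is more involved.

```python
import hashlib
import tempfile
from pathlib import Path

def basic_fields(path: Path) -> dict:
    """Compute a few Artifact-like fields for a local file."""
    data = path.read_bytes()
    return {
        "suffix": path.suffix,                  # e.g. ".csv" or ".parquet"
        "size": len(data),                      # size in bytes
        "hash": hashlib.md5(data).hexdigest(),  # content hash
        "hash_type": "md5",
    }

with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp) / "my_dataset.csv"
    p.write_text("sample,perturbation\nsample1,DMSO\n")
    fields = basic_fields(p)
    assert fields["suffix"] == ".csv"
    assert fields["size"] == p.stat().st_size
```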
What exactly happens during save?
In the database: an artifact record is inserted into the `Artifact` registry. If the artifact record already exists, it’s returned.
In storage:
If the default storage is in the cloud, `.save()` triggers an upload for a local artifact.
If the artifact is already in a registered storage location, only the metadata of the record is saved to the `Artifact` registry.
How does LaminDB compare to AWS S3?
LaminDB provides a database on top of AWS S3 (or GCP storage, file systems, etc.).
Similar to organizing files with paths, you can organize artifacts using the key
parameter of Artifact
.
However, you’ll see that you can more conveniently query data by entities you care about: people, code, experiments, genes, proteins, cell types, etc.
Are artifacts aware of array-like data?
Yes.
You can make artifacts from paths referencing array-like objects:
ln.Artifact("./my_anndata.h5ad", key="my_anndata.h5ad")
ln.Artifact("./my_zarr_array/", key="my_zarr_array")
Or from in-memory objects:
ln.Artifact.from_df(df, key="my_dataframe.parquet")
ln.Artifact.from_anndata(adata, key="my_anndata.h5ad")
You can open large artifacts for slicing from the cloud or load small artifacts directly into memory via:
artifact.open()
Query & search registries¶
To get an overview of all artifacts in your instance, call df().
ln.Artifact.df()
Show code cell output
uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||||
3 | jhRYpvTeYnFTzGox0000 | iris_studies/study0_raw_images | None | None | None | 658465 | IVKGMfNwi8zKvnpaD_gG7w | 51.0 | NaN | md5-d | False | True | 1 | 2 | None | None | True | 1 | 2025-03-20 21:15:02.678000+00:00 | 1 | None | 1 | |
2 | 4x6wtbfizcRu5tqZ0001 | my_datasets/rnaseq1.parquet | None | .parquet | dataset | DataFrame | 9012 | iBiiWBkIitgFtLcru2CLyA | NaN | 3.0 | md5 | True | False | 1 | 1 | None | None | True | 1 | 2025-03-20 21:15:00.601000+00:00 | 1 | None | 1 |
1 | 4x6wtbfizcRu5tqZ0000 | my_datasets/rnaseq1.parquet | None | .parquet | dataset | DataFrame | 9012 | ZHlfaXCXxza090J-PA1nCg | NaN | 3.0 | md5 | True | False | 1 | 1 | None | None | False | 1 | 2025-03-20 21:15:00.386000+00:00 | 1 | None | 1 |
LaminDB’s central classes are registries that store records (Record
objects). If you want to see the fields of a registry, look at the class or auto-complete.
ln.Artifact
Show code cell output
Artifact
Simple fields
.uid: CharField
.key: CharField
.description: CharField
.suffix: CharField
.kind: CharField
.otype: CharField
.size: BigIntegerField
.hash: CharField
.n_files: BigIntegerField
.n_observations: BigIntegerField
.version: CharField
.is_latest: BooleanField
.created_at: DateTimeField
.updated_at: DateTimeField
Relational fields
.space: Space
.storage: Storage
.run: Run
.schema: Schema
.created_by: User
.ulabels: ULabel
.input_of_runs: Run
.feature_sets: Schema
.collections: Collection
.references: Reference
.projects: Project
Bionty fields
.organisms: bionty.Organism
.genes: bionty.Gene
.proteins: bionty.Protein
.cell_markers: bionty.CellMarker
.tissues: bionty.Tissue
.cell_types: bionty.CellType
.diseases: bionty.Disease
.cell_lines: bionty.CellLine
.phenotypes: bionty.Phenotype
.pathways: bionty.Pathway
.experimental_factors: bionty.ExperimentalFactor
.developmental_stages: bionty.DevelopmentalStage
.ethnicities: bionty.Ethnicity
Each registry is a table in the relational schema of the underlying database. With view()
, you can see the latest changes to the database.
ln.view()
Show code cell output
****************
* module: core *
****************
Artifact
uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||||
3 | jhRYpvTeYnFTzGox0000 | iris_studies/study0_raw_images | None | None | None | 658465 | IVKGMfNwi8zKvnpaD_gG7w | 51.0 | NaN | md5-d | False | True | 1 | 2 | None | None | True | 1 | 2025-03-20 21:15:02.678000+00:00 | 1 | None | 1 | |
2 | 4x6wtbfizcRu5tqZ0001 | my_datasets/rnaseq1.parquet | None | .parquet | dataset | DataFrame | 9012 | iBiiWBkIitgFtLcru2CLyA | NaN | 3.0 | md5 | True | False | 1 | 1 | None | None | True | 1 | 2025-03-20 21:15:00.601000+00:00 | 1 | None | 1 |
1 | 4x6wtbfizcRu5tqZ0000 | my_datasets/rnaseq1.parquet | None | .parquet | dataset | DataFrame | 9012 | ZHlfaXCXxza090J-PA1nCg | NaN | 3.0 | md5 | True | False | 1 | 1 | None | None | False | 1 | 2025-03-20 21:15:00.386000+00:00 | 1 | None | 1 |
Run
uid | name | started_at | finished_at | reference | reference_type | _is_consecutive | _status_code | space_id | transform_id | report_id | _logfile_id | environment_id | initiated_by_run_id | created_at | created_by_id | _aux | _branch_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||||||
1 | Ze4sFM46vndcZ6HPyw4H | None | 2025-03-20 21:14:56.330451+00:00 | None | None | None | None | 0 | 1 | 1 | None | None | None | None | 2025-03-20 21:14:56.331000+00:00 | 1 | None | 1 |
Storage
uid | root | description | type | region | instance_uid | space_id | run_id | created_at | created_by_id | _aux | _branch_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||
2 | 9QNzrgIwHpbg | s3://lamindata | None | s3 | us-east-1 | None | 1 | None | 2025-03-20 21:15:02.559000+00:00 | 1 | None | 1 |
1 | GdPz9NhxmThz | /home/runner/work/lamin-docs/lamin-docs/docs/l... | None | local | None | 3MepSh2Col3I | 1 | None | 2025-03-20 21:14:52.865000+00:00 | 1 | None | 1 |
Transform
uid | key | description | type | source_code | hash | reference | reference_type | space_id | _template_id | version | is_latest | created_at | created_by_id | _aux | _branch_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||||
1 | jh6KN5z7XOtR0000 | introduction.ipynb | Introduction | notebook | None | None | None | None | 1 | None | None | True | 2025-03-20 21:14:56.319000+00:00 | 1 | None | 1 |
******************
* module: bionty *
******************
Source
uid | entity | organism | name | in_db | currently_used | description | url | md5 | source_website | space_id | dataframe_artifact_id | version | run_id | created_at | created_by_id | _aux | _branch_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||||||
53 | 5Xov8Lap | bionty.Disease | all | mondo | False | False | Mondo Disease Ontology | http://purl.obolibrary.org/obo/mondo/releases/... | None | https://mondo.monarchinitiative.org | 1 | None | 2024-02-06 | None | 2025-03-20 21:14:52.954000+00:00 | 1 | None | 1 |
54 | 69lnSXfR | bionty.Disease | all | mondo | False | False | Mondo Disease Ontology | http://purl.obolibrary.org/obo/mondo/releases/... | None | https://mondo.monarchinitiative.org | 1 | None | 2024-01-03 | None | 2025-03-20 21:14:52.954000+00:00 | 1 | None | 1 |
55 | 4ss2Hizg | bionty.Disease | all | mondo | False | False | Mondo Disease Ontology | http://purl.obolibrary.org/obo/mondo/releases/... | None | https://mondo.monarchinitiative.org | 1 | None | 2023-08-02 | None | 2025-03-20 21:14:52.954000+00:00 | 1 | None | 1 |
56 | Hgw08Vk3 | bionty.Disease | all | mondo | False | False | Mondo Disease Ontology | http://purl.obolibrary.org/obo/mondo/releases/... | None | https://mondo.monarchinitiative.org | 1 | None | 2023-04-04 | None | 2025-03-20 21:14:52.954000+00:00 | 1 | None | 1 |
57 | UUZUtULu | bionty.Disease | all | mondo | False | False | Mondo Disease Ontology | http://purl.obolibrary.org/obo/mondo/releases/... | None | https://mondo.monarchinitiative.org | 1 | None | 2023-02-06 | None | 2025-03-20 21:14:52.954000+00:00 | 1 | None | 1 |
58 | 7DH1aJIr | bionty.Disease | all | mondo | False | False | Mondo Disease Ontology | http://purl.obolibrary.org/obo/mondo/releases/... | None | https://mondo.monarchinitiative.org | 1 | None | 2022-10-11 | None | 2025-03-20 21:14:52.954000+00:00 | 1 | None | 1 |
59 | 4kswnHVF | bionty.Disease | human | doid | False | True | Human Disease Ontology | http://purl.obolibrary.org/obo/doid/releases/2... | None | https://disease-ontology.org | 1 | None | 2024-05-29 | None | 2025-03-20 21:14:52.954000+00:00 | 1 | None | 1 |
Which registries have I already learned about? 🤔
Every registry supports arbitrary relational queries using the class methods get
and filter
.
The syntax for it is Django’s query syntax.
Here are some simple query examples.
# get a single record (here the current notebook)
transform = ln.Transform.get(key="introduction.ipynb")
# get a set of records by filtering for a directory (LaminDB treats directories like AWS S3, as the prefix of the storage key)
ln.Artifact.filter(key__startswith="my_datasets/").df()
# query all artifacts ingested from a transform
artifacts = ln.Artifact.filter(transform=transform).all()
# query all artifacts ingested from a notebook with "intro" in the title
artifacts = ln.Artifact.filter(
transform__description__icontains="intro",
).all()
What does a double underscore mean?
For any field, the double underscore defines a comparator, e.g.,
- name__icontains="Martha": name contains "Martha", ignoring case
- name__startswith="Martha": name starts with "Martha"
- name__in=["Martha", "John"]: name is "Martha" or "John"
For more info, see: Query & search registries.
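To build intuition for what these comparators do, their semantics can be mimicked in plain Python (illustrative only; in LaminDB, Django translates such lookups into SQL):

```python
def matches(value: str, lookup: str, needle) -> bool:
    """Evaluate a Django-style double-underscore comparator in plain Python."""
    if lookup == "icontains":   # substring match, ignoring case
        return needle.lower() in value.lower()
    if lookup == "startswith":  # prefix match
        return value.startswith(needle)
    if lookup == "in":          # membership in a list of allowed values
        return value in needle
    raise ValueError(f"unknown lookup: {lookup}")

names = ["Martha", "John", "martha-2"]
# the equivalent of filtering with name__icontains="martha"
hits = [n for n in names if matches(n, "icontains", "martha")]
assert hits == ["Martha", "martha-2"]
```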
Can I chain filters and searches?
Yes: ln.Artifact.filter(suffix=".jpg").search("my image")
The class methods search
and lookup
help with approximate matches.
# search artifacts
ln.Artifact.search("iris").df().head()
# search transforms
ln.Transform.search("intro").df()
# look up records with auto-complete
ulabels = ln.ULabel.lookup()
For more info, see: Query & search registries.
Features & labels¶
Now, how do you find datasets and how do you make sure they’re usable by analysts and machine learning models alike? With features & labels.
Features represent measurement dimensions (e.g. "species"
) and labels represent measured values (e.g. "iris setosa"
, "iris versicolor"
, "iris virginica"
).
In statistics, you’d say a feature is a categorical or numerical variable while a label is a category. Categorical variables draw their values from a set of categories.
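In code terms, under this vocabulary a data column is a feature and its distinct values are the labels; a toy example (hypothetical data):

```python
# a "species" feature measured across four observations
species_column = ["iris setosa", "iris versicolor", "iris setosa", "iris virginica"]

# the feature's categories are the set of distinct labels it takes
categories = sorted(set(species_column))
assert categories == ["iris setosa", "iris versicolor", "iris virginica"]
```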
Can you give me examples for what findability and usability means?
- Findability: Which datasets measured expression of cell marker CD14? Which characterized cell line K562? Which have a test & train split? Etc.
- Usability: Are there typos in feature names? Are there typos in labels? Are types and units of features consistent? Etc.
# define the "temperature" & "experiment" features
ln.Feature(name="temperature", dtype=float).save()
ln.Feature(name="experiment", dtype=ln.ULabel).save()
# create & save labels
experiment_type = ln.ULabel(name="InVitroStudy", is_type=True).save()
my_experiment = ln.ULabel(name="My experiment", type=experiment_type).save()
artifact.features.add_values({"temperature": 21.6, "experiment": "My experiment"})
artifact.describe()
Show code cell output
Artifact
├── General
│   ├── .uid = 'jhRYpvTeYnFTzGox0000'
│   ├── .key = 'iris_studies/study0_raw_images'
│   ├── .size = 658465
│   ├── .hash = 'IVKGMfNwi8zKvnpaD_gG7w'
│   ├── .n_files = 51
│   ├── .path = s3://lamindata/iris_studies/study0_raw_images
│   ├── .created_by = anonymous
│   ├── .created_at = 2025-03-20 21:15:02
│   └── .transform = 'Introduction'
├── Linked features
│   └── experiment    cat[ULabel]    My experiment
│       temperature   float          21.6
└── Labels
    └── .ulabels    ULabel    My experiment
Can I also directly label artifacts without using features?
Yes. For a ULabel:
artifact.ulabels.add(my_experiment)
For a term from a biological ontology managed through bionty:
import bionty as bt
cell_type = bt.CellType.from_source(name="effector T cell").save()
artifact.cell_types.add(cell_type)
Query artifacts by labels.
ln.Artifact.filter(ulabels__name__contains="My exp").df()
uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||||
3 | jhRYpvTeYnFTzGox0000 | iris_studies/study0_raw_images | None | None | None | 658465 | IVKGMfNwi8zKvnpaD_gG7w | 51 | None | md5-d | False | True | 1 | 2 | None | None | True | 1 | 2025-03-20 21:15:02.678000+00:00 | 1 | None | 1 |
Curate datasets¶
You already saw how to ingest datasets without validation. This is often enough if you’re prototyping or working with one-off studies. But if you want to create a big body of standardized data, you have to invest the time to curate your datasets.
Let’s define a Schema
to curate a DataFrame
.
# define valid labels
perturbation_type = ln.ULabel(name="Perturbation", is_type=True).save()
ln.ULabel(name="DMSO", type=perturbation_type).save()
ln.ULabel(name="IFNG", type=perturbation_type).save()
# define the schema
schema = ln.Schema(
name="My DataFrame schema",
features=[
ln.Feature(name="ENSG00000153563", dtype=int).save(),
ln.Feature(name="ENSG00000010610", dtype=int).save(),
ln.Feature(name="ENSG00000170458", dtype=int).save(),
ln.Feature(name="perturbation", dtype=ln.ULabel).save(),
],
).save()
With a Curator
, we can save an annotated & validated artifact with a single line of code.
curator = ln.curators.DataFrameCurator(df, schema)
# save curated artifact
artifact = curator.save_artifact(key="my_curated_dataset.parquet") # calls .validate()
# see the parsed annotations
artifact.describe()
# query for a ulabel that was parsed from the dataset
ln.Artifact.get(ulabels__name="IFNG")
Show code cell output
✓ "perturbation" is validated against ULabel.name
→ returning existing artifact with same hash: Artifact(uid='4x6wtbfizcRu5tqZ0001', is_latest=True, key='my_datasets/rnaseq1.parquet', suffix='.parquet', kind='dataset', otype='DataFrame', size=9012, hash='iBiiWBkIitgFtLcru2CLyA', n_observations=3, space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-03-20 21:15:00 UTC); to track this artifact as an input, use: ln.Artifact.get()
! key my_datasets/rnaseq1.parquet on existing artifact differs from passed key my_curated_dataset.parquet
✓ 4 unique terms (36.40%) are validated for name
! 7 unique terms (63.60%) are not validated for name: 'sample_note', 'cell_type_by_expert', 'cell_type_by_model', 'assay_oid', 'concentration', 'treatment_time_h', 'donor'
✓ loaded 4 Feature records matching name: 'ENSG00000153563', 'ENSG00000010610', 'ENSG00000170458', 'perturbation'
! did not create Feature records for 7 non-validated names: 'assay_oid', 'cell_type_by_expert', 'cell_type_by_model', 'concentration', 'donor', 'sample_note', 'treatment_time_h'
→ returning existing schema with same hash: Schema(uid='iXDG5vbbJRhJOMuPA2lq', name='My DataFrame schema', n=4, itype='Feature', is_type=False, hash='XtaYaNzZcHciEtARW3kWPQ', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, run_id=1, created_at=2025-03-20 21:15:07 UTC)
! updated otype from None to DataFrame
Artifact .parquet/DataFrame
├── General
│   ├── .uid = '4x6wtbfizcRu5tqZ0001'
│   ├── .key = 'my_datasets/rnaseq1.parquet'
│   ├── .size = 9012
│   ├── .hash = 'iBiiWBkIitgFtLcru2CLyA'
│   ├── .n_observations = 3
│   ├── .path = /home/runner/work/lamin-docs/lamin-docs/docs/lamin-intro/.lamindb/4x6wtbfizcRu5tqZ0001.parquet
│   ├── .created_by = anonymous
│   ├── .created_at = 2025-03-20 21:15:00
│   └── .transform = 'Introduction'
├── Dataset features/.feature_sets
│   └── columns • 4    [Feature]
│       perturbation       cat[ULabel]    DMSO, IFNG
│       ENSG00000153563    int
│       ENSG00000010610    int
│       ENSG00000170458    int
└── Labels
    └── .ulabels    ULabel    DMSO, IFNG
Artifact(uid='4x6wtbfizcRu5tqZ0001', is_latest=True, key='my_datasets/rnaseq1.parquet', suffix='.parquet', kind='dataset', otype='DataFrame', size=9012, hash='iBiiWBkIitgFtLcru2CLyA', n_observations=3, space_id=1, storage_id=1, run_id=1, schema_id=1, created_by_id=1, created_at=2025-03-20 21:15:00 UTC)
If we feed in a dataset with an invalid dtype or a typo, we'll get a ValidationError.
curator = ln.curators.DataFrameCurator(df_typo, schema)
# validate the dataset
try:
curator.validate()
except ln.errors.ValidationError as error:
print(str(error))
Show code cell output
• mapping "perturbation" on ULabel.name
! 1 term is not validated: 'IFNJ'
→ fix typos, remove non-existent values, or save terms via .add_new_from("perturbation")
1 term is not validated: 'IFNJ'
→ fix typos, remove non-existent values, or save terms via .add_new_from("perturbation")
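At its core, this kind of term validation is a set comparison between the values observed in the dataset and the vocabulary registered in the database. A stand-alone sketch of the idea (not LaminDB's actual implementation):

```python
def validate_terms(observed, registered, field="perturbation"):
    """Raise if any observed term is absent from the registered vocabulary."""
    unvalidated = sorted(set(observed) - set(registered))
    if unvalidated:
        raise ValueError(
            f"{len(unvalidated)} term(s) not validated for {field!r}: {unvalidated}"
        )

registered_labels = {"DMSO", "IFNG"}
validate_terms(["DMSO", "IFNG", "DMSO"], registered_labels)  # passes silently

try:
    validate_terms(["DMSO", "IFNJ"], registered_labels)  # "IFNJ" is a typo
except ValueError as error:
    print(error)
```

The fix suggested in the log output above corresponds to either correcting the typo in the dataset or adding the new term to the registry so that it validates next time.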
Manage biological registries¶
The generic Feature
and ULabel
registries will get you pretty far.
But let’s now look at what you can do with a dedicated biological registry like Gene
.
Every bionty
registry is based on configurable public ontologies (>20 of them).
import bionty as bt
cell_types = bt.CellType.public()
cell_types
Show code cell output
PublicOntology
Entity: CellType
Organism: all
Source: cl, 2024-08-16
#terms: 2959
cell_types.search("gamma-delta T cell").head(2)
Show code cell output
name | definition | synonyms | parents | |
---|---|---|---|---|
ontology_id | ||||
CL:0000798 | gamma-delta T cell | A T Cell That Expresses A Gamma-Delta T Cell R... | gamma-delta T-cell|gamma-delta T lymphocyte|ga... | [CL:0000084] |
CL:4033072 | cycling gamma-delta T cell | A(N) Gamma-Delta T Cell That Is Cycling. | proliferating gamma-delta T cell | [CL:4033069, CL:0000798] |
Define an AnnData
schema.
# define var schema
var_schema = ln.Schema(
name="my_var_schema",
itype=bt.Gene.ensembl_gene_id,
dtype=int,
).save()
obs_schema = ln.Schema(
name="my_obs_schema",
features=[
ln.Feature(name="perturbation", dtype=ln.ULabel).save(),
],
).save()
# define composite schema
anndata_schema = ln.Schema(
name="my_anndata_schema",
otype="AnnData",
components={"obs": obs_schema, "var": var_schema},
).save()
→ returning existing Feature record with same name: 'perturbation'
Validate & annotate an AnnData
.
import anndata as ad
import bionty as bt
# store the dataset as an AnnData object to distinguish data from metadata
adata = ad.AnnData(
df[["ENSG00000153563", "ENSG00000010610", "ENSG00000170458"]],
obs=df[["perturbation"]],
)
# save curated artifact
curator = ln.curators.AnnDataCurator(adata, anndata_schema)
artifact = curator.save_artifact(description="my RNA-seq")
artifact.describe()
Show code cell output
✓ created 1 Organism record from Bionty matching name: 'human'
• saving validated records of 'columns'
✓ added 3 records from public with Gene.ensembl_gene_id for "columns": 'ENSG00000153563', 'ENSG00000170458', 'ENSG00000010610'
✓ "perturbation" is validated against ULabel.name
• path content will be copied to default storage upon `save()` with key `None` ('.lamindb/BUWXsL0TZE0I3OM20000.h5ad')
✓ storing artifact 'BUWXsL0TZE0I3OM20000' at '/home/runner/work/lamin-docs/lamin-docs/docs/lamin-intro/.lamindb/BUWXsL0TZE0I3OM20000.h5ad'
✓ 3 unique terms (100.00%) are validated for ensembl_gene_id
✓ 1 unique term (100.00%) is validated for name
→ returning existing schema with same hash: Schema(uid='zZO0FnqaoaXjdEiV3OuY', name='my_obs_schema', n=1, itype='Feature', is_type=False, hash='FD-vBWEQKBG9DSZho__vaQ', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, run_id=1, created_at=2025-03-20 21:15:07 UTC)
! updated otype from None to DataFrame
✓ saved 1 feature set for slot: 'var'
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'BUWXsL0TZE0I3OM20000'
│   ├── .size = 19240
│   ├── .hash = 'M53AXNxorUBgFvyLY4RnoQ'
│   ├── .n_observations = 3
│   ├── .path = /home/runner/work/lamin-docs/lamin-docs/docs/lamin-intro/.lamindb/BUWXsL0TZE0I3OM20000.h5ad
│   ├── .created_by = anonymous
│   ├── .created_at = 2025-03-20 21:15:09
│   └── .transform = 'Introduction'
├── Dataset features/.feature_sets
│   ├── var • 3    [bionty.Gene]
│   │   CD8A    int
│   │   CD14    int
│   │   CD4     int
│   └── obs • 1    [Feature]
│       perturbation    cat[ULabel]    DMSO, IFNG
└── Labels
    └── .ulabels    ULabel    DMSO, IFNG
Query for typed features.
# get a lookup object for human genes
genes = bt.Gene.filter(organism__name="human").lookup()
# query for all feature sets that contain CD8A
feature_sets = ln.FeatureSet.filter(genes=genes.cd8a).all()
# write the query
ln.Artifact.filter(feature_sets__in=feature_sets).df()
Show code cell output
uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||||
4 | BUWXsL0TZE0I3OM20000 | None | my RNA-seq | .h5ad | dataset | AnnData | 19240 | M53AXNxorUBgFvyLY4RnoQ | None | 3 | md5 | True | False | 1 | 1 | 4 | None | True | 1 | 2025-03-20 21:15:09.901000+00:00 | 1 | None | 1 |
Update ontologies, e.g., create a cell type record and add a new cell state.
# create an ontology-coupled cell type record and save it
neuron = bt.CellType.from_source(name="neuron").save()
# create a record to track a new cell state
new_cell_state = bt.CellType(
name="my neuron cell state", description="explains X"
).save()
# express that it's a neuron state
new_cell_state.parents.add(neuron)
# view ontological hierarchy
new_cell_state.view_parents(distance=2)
Show code cell output
✓ created 1 CellType record from Bionty matching name: 'neuron'
✓ created 3 CellType records from Bionty matching ontology_id: 'CL:0002319', 'CL:0000404', 'CL:0000393'
Scale learning¶
How do you integrate new datasets with your existing datasets? Leverage Collection
.
# a new dataset
df2 = ln.core.datasets.small_dataset2(otype="DataFrame")
adata = ad.AnnData(
df2[["ENSG00000153563", "ENSG00000010610", "ENSG00000004468"]],
obs=df2[["perturbation"]],
)
curator = ln.curators.AnnDataCurator(adata, anndata_schema)
artifact2 = curator.save_artifact(key="my_datasets/my_rnaseq2.h5ad")
Show code cell output
• saving validated records of 'columns'
✓ added 1 record from public with Gene.ensembl_gene_id for "columns": 'ENSG00000004468'
✓ "perturbation" is validated against ULabel.name
• path content will be copied to default storage upon `save()` with key 'my_datasets/my_rnaseq2.h5ad'
✓ storing artifact 'eyJ93FYHXOOEAGpk0000' at '/home/runner/work/lamin-docs/lamin-docs/docs/lamin-intro/.lamindb/eyJ93FYHXOOEAGpk0000.h5ad'
✓ 3 unique terms (100.00%) are validated for ensembl_gene_id
✓ 1 unique term (100.00%) is validated for name
→ returning existing schema with same hash: Schema(uid='zZO0FnqaoaXjdEiV3OuY', name='my_obs_schema', n=1, itype='Feature', is_type=False, hash='FD-vBWEQKBG9DSZho__vaQ', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, run_id=1, created_at=2025-03-20 21:15:07 UTC)
! updated otype from None to DataFrame
✓ saved 1 feature set for slot: 'var'
Create a collection using Collection
.
collection = ln.Collection([artifact, artifact2], key="my-RNA-seq-collection").save()
collection.describe()
collection.view_lineage()
Show code cell output
Collection
└── General
    ├── .uid = 'EZGZa5mwXL334UTo0000'
    ├── .key = 'my-RNA-seq-collection'
    ├── .hash = 'mekPc_BF8xL4czT7STCOUQ'
    ├── .created_by = anonymous
    ├── .created_at = 2025-03-20 21:15:13
    └── .transform = 'Introduction'
# if it's small enough, you can load the entire collection into memory as if it was one
collection.load()
# typically, it's too big, hence, open it for streaming (if the backend allows it)
# collection.open()
# or iterate over its artifacts
collection.artifacts.all()
# or look at a DataFrame listing the artifacts
collection.artifacts.df()
Show code cell output
uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||||
4 | BUWXsL0TZE0I3OM20000 | None | my RNA-seq | .h5ad | dataset | AnnData | 19240 | M53AXNxorUBgFvyLY4RnoQ | None | 3 | md5 | True | False | 1 | 1 | 4 | None | True | 1 | 2025-03-20 21:15:09.901000+00:00 | 1 | None | 1 |
5 | eyJ93FYHXOOEAGpk0000 | my_datasets/my_rnaseq2.h5ad | None | .h5ad | dataset | AnnData | 19240 | iTOiRMzQuwLDPVHR9P4aPg | None | 3 | md5 | True | False | 1 | 1 | 4 | None | True | 1 | 2025-03-20 21:15:12.877000+00:00 | 1 | None | 1 |
Directly train models on collections of AnnData
.
# to train models, batch iterate through the collection as if it was one array
from torch.utils.data import DataLoader, WeightedRandomSampler
dataset = collection.mapped(obs_keys=["cell_medium"])
sampler = WeightedRandomSampler(
weights=dataset.get_label_weights("cell_medium"), num_samples=len(dataset)
)
data_loader = DataLoader(dataset, batch_size=2, sampler=sampler)
for batch in data_loader:
pass
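The WeightedRandomSampler above counteracts class imbalance by drawing each example with probability inversely proportional to its class frequency. The same idea, sketched without torch on hypothetical labels (stdlib only):

```python
import random
from collections import Counter

labels = ["DMSO"] * 90 + ["IFNG"] * 10  # an imbalanced label column
counts = Counter(labels)

# weight each example inversely to its class frequency
weights = [1.0 / counts[label] for label in labels]

random.seed(0)
resampled = random.choices(labels, weights=weights, k=1000)
# both classes now appear in roughly equal proportion
print(Counter(resampled))
```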
Read this blog post for more on training models on sharded datasets.
Design¶
World model¶
- Teams need to have enough freedom to initiate work independently but enough structure to easily integrate datasets later on
- Batched datasets (Artifact) from physical instruments are transformed (Transform) into useful representations
- Learning needs features (Feature, CellMarker, …) and labels (ULabel, CellLine, …)
- Insights connect dataset representations with experimental metadata and knowledge (ontologies)
Architecture¶
LaminDB is a distributed system like git that can be run or hosted anywhere. As infrastructure, you merely need a database (SQLite/Postgres) and a storage location (file system, S3, GCP, HuggingFace, …).
You can easily create a new local instance:
lamin init --storage ./my-data-folder
import lamindb as ln
ln.setup.init(storage="./my-data-folder")
Or you can let collaborators connect to a cloud-hosted instance:
lamin connect account-handle/instance-name
import lamindb as ln
ln.connect("account-handle/instance-name")
library(laminr)
ln <- connect("account-handle/instance-name")
To learn more about how to create & host LaminDB instances on distributed infrastructure, see Install & setup. LaminDB instances work standalone but can optionally be managed by LaminHub. For an architecture diagram of LaminHub, reach out!
Database schema & API¶

LaminDB provides a SQL schema for common metadata entities: Artifact
, Collection
, Transform
, Feature
, ULabel
etc. - see the API reference or the source code.
The core metadata schema is extendable through modules (see green vs. red entities in graphic), e.g., with basic biological (Gene
, Protein
, CellLine
, etc.) & operational entities (Biosample
, Techsample
, Treatment
, etc.).
What is the metadata schema language?
Data models are defined in Python using the Django ORM. Django translates them to SQL tables. Django is one of the most-used & highly-starred projects on GitHub (~1M dependents, ~73k stars) and has been robustly maintained for 15 years.
On top of the metadata schema, LaminDB is a Python API that models datasets as artifacts, abstracts over storage & database access, data transformations, and (biological) ontologies.
Note that the schemas of datasets (e.g., .parquet
files, .h5ad
arrays, etc.) are modeled through the Feature
registry and do not require migrations to be updated.
Custom registries¶
LaminDB can be extended with registry modules building on the Django ecosystem. Examples are:
bionty: Registries for basic biological entities, coupled to public ontologies.
wetlab: Registries for samples, treatments, etc.
If you’d like to create your own module:
Create a git repository with registries similar to wetlab
Create & deploy migrations via
lamin migrate create
andlamin migrate deploy
Repositories¶
LaminDB and its plugins consist of open-source Python libraries & publicly hosted metadata assets:
lamindb: Core package.
bionty: Registries for basic biological entities, coupled to public ontologies.
wetlab: Registries for samples, treatments, etc.
usecases: Use cases as visible on the docs.
All immediate dependencies are available as git submodules here, for instance,
lamindb-setup: Setup & configure LaminDB.
lamin-cli: CLI for
lamindb
andlamindb-setup
.
lamin-utils: Generic utilities, e.g., a logger.
readfcs: FCS artifact reader.
nbproject: Light-weight Jupyter notebook tracker.
bionty-assets: Assets for public biological ontologies.
For a comprehensive list of open-sourced software, browse our GitHub account.
LaminHub is not open-sourced.
Influences¶
LaminDB was influenced by many other projects, see Influences.