Introduction¶
LaminDB is an open-source data framework for biology.
LaminDB features
Manage data & metadata with a unified Python API (“lakehouse”).
Query & search across artifacts (Artifact) & metadata records (Record): filter, search
Cache artifacts on disk & load them into memory: cache, load
Manage features & labels: Feature, FeatureSet, ULabel
Plug-in custom schemas & manage schema migrations
Use array formats in memory & storage: DataFrame, AnnData, MuData, SOMA, … backed by parquet, zarr, TileDB, HDF5, h5ad, DuckDB, …
Create iterable collections of artifacts with data loaders: Collection
Version artifacts, collections & transforms: IsVersioned
Track data lineage across notebooks, scripts, pipelines & UI.
Track run context with a simple method call: track()
A unified registry for all your notebooks, scripts & pipelines: Transform
A unified registry for all data transformation runs: Run
Manage execution reports, source code and Python environments for notebooks & scripts
Integrate with workflow managers: redun, nextflow, snakemake
Manage registries for experimental metadata & in-house ontologies, import public ontologies.
Use >20 public ontologies with plug-in bionty: Gene, Protein, CellMarker, ExperimentalFactor, CellType, CellLine, Tissue, …
Safeguards against typos & duplications
Version ontology
Validate, standardize & annotate.
Validate & standardize metadata: validate, standardize
High-level curation flow including annotation: Curator
Inspect validation failures: inspect
Organize and share data across a mesh of LaminDB instances.
Create & load instances like git repos: lamin init & lamin load
Zero-copy transfer data across instances
Integrate with analytics tools.
Vitessce: save_vitessce_config
Zero lock-in, scalable, auditable.
Zero lock-in: LaminDB runs on generic backends server-side and is not a client for “Lamin Cloud”
Flexible storage backends (local, S3, GCP, anything fsspec supports)
Two SQL backends for managing metadata: SQLite & Postgres
Scalable: metadata registries support 100s of millions of entries, storage is as scalable as S3
Auditable: data & metadata records are hashed, timestamped, and attributed to users (full audit log to come)
Secure: embedded in your infrastructure (Lamin has no access to your data & metadata)
Tested, typed, idempotent & ACID
LaminHub is a data collaboration hub built on LaminDB similar to how GitHub is built on git.
LaminHub features
See for yourself by browsing the demo instances in the hub UI or lamin load owner/instance them via the CLI.
lamin.ai/laminlabs/lamindata - A generic demo instance with various data types
lamin.ai/laminlabs/cellxgene - An instance that interfaces the CELLxGENE data (guide)
lamin.ai/laminlabs/arrayloader-benchmarks - Work with ML models & benchmarks
See the pricing page. Basic LaminHub features are free.
Secure & intuitive access management.
Rather than configuring storage & database permissions directly on AWS or GCP, LaminHub allows you to manage collaborators for databases & storage locations in the same way you manage access to repositories on GitHub. See Access management.
A UI to work with LaminDB instances.
See validated datasets in context of ontologies & experimental metadata.
Query & search.
See scripts, notebooks & pipelines with their inputs & outputs.
Track pipelines, notebooks & UI transforms in one place.
Quickstart¶
You’ll ingest a small dataset while tracking data lineage, and see how to validate, annotate, query & search.
Install the lamindb Python package.
# install with notebook support & biological entities
!pip install 'lamindb[jupyter,bionty]'
Initialize a LaminDB instance that stores data locally and mounts the plugin bionty.
# store artifacts in local directory `./lamin-intro`
!lamin init --storage ./lamin-intro --schema bionty
# (optional) make Django's unnecessary functionality private for clean auto-complete
!lamin set private-django-api true
Show code cell output
! using anonymous user (to identify, call: lamin login)
Data transformations¶
Call track() to register a data transformation and start tracking inputs & outputs of a run. You will find your notebook in the Transform registry along with scripts, pipelines & functions. Run stores executions.
import lamindb as ln
# identify your code with an auto-generated uid
ln.context.uid = ( # <-- if undefined, auto-generated by ln.context.track()
"FPnfDtJz8qbE0000"
)
# track your run with inputs & outputs
ln.context.track()
Show code cell output
→ connected lamindb: anonymous/lamin-intro
→ notebook imports: anndata==0.10.9 bionty==0.50.1 lamindb==0.76.6 pandas==2.2.2 pytest==8.3.3
→ created Transform(uid='FPnfDtJz8qbE0000') & created Run(started_at='2024-09-10 19:08:55 UTC')
Is this compliant with OpenLineage?
Yes. What OpenLineage calls a “job”, LaminDB calls a “transform”. What OpenLineage calls a “run”, LaminDB calls a “run”.
What is ln.context.uid?
To tie a piece of code to a record in a database in a way that survives name and content changes, you need to attach it to an immutable identifier, e.g., LaminDB’s uid.
git, by comparison, identifies code by its content hash & file name. If you rename a notebook or script file and change the content, you lose the identity of the file. Notebook platforms like Google Colab and DeepNote support renaming and changing content of a given notebook, but they do not support versioning in a simple queryable way: every notebook version comes with the same notebook id.
To enable versioning, LaminDB auto-generates uid = f"{suid}{vuid}" so that different versions of a transform are grouped by a random “stem uid” suid (the first part of the uid) while the last four characters encode a version in a vuid (an auto-incrementing base62 number). You can optionally tag a version using the .version field.
All versioned entities in LaminDB are versioned in this way, including artifacts and collections.
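To make the suid/vuid scheme concrete, here is a minimal sketch of how a four-character base62 suffix could be incremented. The helper names (split_uid, increment_vuid, next_version_uid) and the alphabet ordering are assumptions for illustration, not LaminDB's internal implementation.

```python
import string

# Assumed base62 alphabet ordering (digits, uppercase, lowercase);
# LaminDB's actual ordering may differ.
BASE62 = string.digits + string.ascii_uppercase + string.ascii_lowercase
VUID_LENGTH = 4  # the last four characters of a uid encode the version


def split_uid(uid: str) -> tuple[str, str]:
    """Split a uid into its stem uid (suid) and version uid (vuid)."""
    return uid[:-VUID_LENGTH], uid[-VUID_LENGTH:]


def increment_vuid(vuid: str) -> str:
    """Increment a fixed-width base62 number by one."""
    value = 0
    for char in vuid:
        value = value * 62 + BASE62.index(char)
    value += 1
    if value >= 62 ** len(vuid):
        raise OverflowError("version suffix exhausted")
    digits = []
    for _ in range(len(vuid)):
        value, remainder = divmod(value, 62)
        digits.append(BASE62[remainder])
    return "".join(reversed(digits))


def next_version_uid(uid: str) -> str:
    """Keep the stem uid and increment the version suffix."""
    suid, vuid = split_uid(uid)
    return f"{suid}{increment_vuid(vuid)}"


print(split_uid("FPnfDtJz8qbE0000"))
print(next_version_uid("7vvMjrJNxX6s8UgQ0000"))
```

Because the stem stays fixed, grouping all versions of an entity amounts to matching on the first part of the uid.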
Artifacts & versioning¶
An Artifact stores a dataset or model as a file, folder or array.
import pandas as pd
# a sample dataset
df_with_typo = pd.DataFrame(
    {
        "CD8A": [1, 2, 3],
        "CD4": [3, 4, 5],
        "CD14": [5, 6, 7],
        "perturbation": ["DMSO", "IFNJ", "DMSO"],
    },
    index=["sample1", "sample2", "sample3"],
)
# create & save an artifact from a DataFrame
artifact = ln.Artifact.from_df(df_with_typo, description="my RNA-seq").save()
# artifacts come with typed, relational metadata
artifact.describe()
Show code cell output
Artifact(uid='7vvMjrJNxX6s8UgQ0000', is_latest=True, description='my RNA-seq', suffix='.parquet', type='dataset', size=4091, hash='h-0N84LghnfByLKITVnOFQ', _hash_type='md5', _accessor='DataFrame', visibility=1, _key_is_virtual=True, updated_at='2024-09-10 19:08:55 UTC')
Provenance
.storage = '/home/runner/work/lamin-docs/lamin-docs/docs/lamin-intro'
.transform = 'Introduction'
.run = '2024-09-10 19:08:55 UTC'
.created_by = 'anonymous'
Load the dataset into memory.
# returns a dataframe
artifact.load()
Show code cell output
CD8A | CD4 | CD14 | perturbation | |
---|---|---|---|---|
sample1 | 1 | 3 | 5 | DMSO |
sample2 | 2 | 4 | 6 | IFNJ |
sample3 | 3 | 5 | 7 | DMSO |
Looking at this: "IFNJ" should have been "IFNG". 🙈 Let’s create a revision of this dataset.
# update the dataframe by fixing the typo in "IFNG"
df_fixed_typo = df_with_typo.copy()
df_fixed_typo.loc["sample2", "perturbation"] = "IFNG"
# create a revision
artifact_fixed = ln.Artifact.from_df(df_fixed_typo, revises=artifact).save()
# see all versions of an artifact
artifact_fixed.versions.df()
Show code cell output
uid | version | is_latest | description | key | suffix | type | size | hash | n_objects | n_observations | _hash_type | _accessor | visibility | _key_is_virtual | storage_id | transform_id | run_id | created_by_id | updated_at | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||||||||
1 | 7vvMjrJNxX6s8UgQ0000 | None | False | my RNA-seq | None | .parquet | dataset | 4091 | h-0N84LghnfByLKITVnOFQ | None | None | md5 | DataFrame | 1 | True | 1 | 1 | 1 | 1 | 2024-09-10 19:08:55.959949+00:00 |
2 | 7vvMjrJNxX6s8UgQ0001 | None | True | my RNA-seq | None | .parquet | dataset | 4091 | PxPYEdKabVLAPsSuNu9m4g | None | None | md5 | DataFrame | 1 | True | 1 | 1 | 1 | 1 | 2024-09-10 19:08:55.960858+00:00 |
Similar to tagging a git commit, you can label a revision.
artifact_fixed.version = "1.0"
artifact_fixed.save()
artifact_fixed.versions.df()
Show code cell output
uid | version | is_latest | description | key | suffix | type | size | hash | n_objects | n_observations | _hash_type | _accessor | visibility | _key_is_virtual | storage_id | transform_id | run_id | created_by_id | updated_at | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||||||||
1 | 7vvMjrJNxX6s8UgQ0000 | None | False | my RNA-seq | None | .parquet | dataset | 4091 | h-0N84LghnfByLKITVnOFQ | None | None | md5 | DataFrame | 1 | True | 1 | 1 | 1 | 1 | 2024-09-10 19:08:55.959949+00:00 |
2 | 7vvMjrJNxX6s8UgQ0001 | 1.0 | True | my RNA-seq | None | .parquet | dataset | 4091 | PxPYEdKabVLAPsSuNu9m4g | None | None | md5 | DataFrame | 1 | True | 1 | 1 | 1 | 1 | 2024-09-10 19:08:55.977833+00:00 |
View data lineage.
artifact_fixed.view_lineage()
I’d rather control versioning through a key or file path like on S3.
That works, too, and you won’t need to pass an old version via revises:
artifact_v1 = ln.Artifact.from_df(df, key="my_datasets/my_study1.parquet").save()
# below automatically creates a new version of artifact_v1 because the `key` matches
artifact_v2 = ln.Artifact.from_df(df_updated, key="my_datasets/my_study1.parquet").save()
The good thing about passing revises: Artifact is that it works for entities that don’t come with a file path and you don’t need to worry about coming up with naming conventions for paths. You’ll see that LaminDB makes it easy to organize data by entities, rather than file paths.
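The key-based versioning described above can be sketched with a toy in-memory registry. ToyArtifact and ToyRegistry are hypothetical names, and the rules shown (same key with new content creates a new version, identical content returns the existing record) are a simplified reading of the behavior shown on this page, not LaminDB's actual code.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ToyArtifact:
    key: str
    content_hash: str
    version_index: int
    is_latest: bool = True


class ToyRegistry:
    """Hypothetical in-memory stand-in for an artifact registry."""

    def __init__(self) -> None:
        self.records: list[ToyArtifact] = []

    def save(self, key: str, content: bytes) -> ToyArtifact:
        # any content hash works for this sketch; sha256 is used here
        content_hash = hashlib.sha256(content).hexdigest()
        same_key = [r for r in self.records if r.key == key]
        # identical content under the same key: return the existing record
        for record in same_key:
            if record.content_hash == content_hash:
                return record
        # new content under a known key: earlier versions are superseded
        for record in same_key:
            record.is_latest = False
        artifact = ToyArtifact(key, content_hash, version_index=len(same_key))
        self.records.append(artifact)
        return artifact


registry = ToyRegistry()
v1 = registry.save("my_datasets/my_study1.parquet", b"first draft")
v2 = registry.save("my_datasets/my_study1.parquet", b"fixed typo")
again = registry.save("my_datasets/my_study1.parquet", b"fixed typo")
print(v1.is_latest, v2.is_latest, again is v2)
```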
How does this look for a file or folder?
Source path is local:
ln.Artifact("./my_data.fcs", description="my flow cytometry file")
ln.Artifact("./my_images/", description="my folder of images")
Upon artifact.save(), the source path will be copied (uploaded) into your default storage.
If the source path is remote, artifact.save() won’t trigger data duplication but register the existing path.
ln.Artifact("s3://my-bucket/my_data.fcs", description="my flow cytometry file")
ln.Artifact("s3://my-bucket/my_images/", description="my folder of images")
You can also use other remote file systems supported by `fsspec`.
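As a rough illustration of the local-versus-remote distinction above, the following sketch decides between copying and registering based on the URL scheme of the source path. plan_save is a hypothetical helper, and the scheme check is an assumption about how such a decision could be made, not how LaminDB makes it.

```python
from urllib.parse import urlparse


def plan_save(source_path: str) -> str:
    """Return "copy" for local paths and "register" for remote ones."""
    scheme = urlparse(source_path).scheme
    # no scheme, file://, or a single letter (Windows drive) counts as local
    if scheme in ("", "file") or len(scheme) == 1:
        return "copy"
    return "register"


for path in ["./my_data.fcs", "s3://my-bucket/my_data.fcs", "gs://my-bucket/my_images/"]:
    print(path, "->", plan_save(path))
```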
How does LaminDB compare to AWS S3?
LaminDB is a layer on top of a storage backend (AWS S3, GCP storage, local filesystem, etc.) and a database (Postgres, SQLite) for managing metadata.
Similar to organizing files in file systems & object stores with paths, you can organize artifacts using the key parameter of Artifact.
However, LaminDB encourages you to not rely on semantic keys but instead organize your data based on metadata.
Rather than memorizing names of folders and files, you find data via the entities you care about: people, code, experiments, genes, proteins, cell types, etc.
LaminDB embeds each artifact into rich relational metadata and indexes them in storage with a universal ID (uid).
This scales much better than semantic keys, which lead to deep hierarchical information structures that can become hard to navigate.
Because metadata is typed and relational, you can work with more structure, more integrity, and richer queries compared to leveraging S3’s JSON-like metadata. You’ll learn more about this below.
Are artifacts aware of array-like data?
Yes.
You can make artifacts from paths referencing array-like objects:
ln.Artifact("./my_anndata.h5ad", description="curated array")
ln.Artifact("./my_zarr_array/", description="my zarr array store")
Or from in-memory objects:
ln.Artifact.from_df(df, description="my dataframe")
ln.Artifact.from_anndata(adata, description="annotated array")
You can open large artifacts for slicing from the cloud or load small artifacts directly into memory.
Datasets & labels¶
Label an artifact with a ULabel and a bionty.CellLine. The same works for any entity in any custom schema module.
import bionty as bt
# create & save a ulabel record
candidate_marker_study = ln.ULabel(name="Candidate marker study").save()
# label the artifact
artifact.ulabels.add(candidate_marker_study)
# repeat for a bionty entity
cell_line = bt.CellLine.from_source(name="HEK293").save()
artifact.cell_lines.add(cell_line)
# describe the artifact
artifact.describe()
Show code cell output
Artifact(uid='7vvMjrJNxX6s8UgQ0000', is_latest=False, description='my RNA-seq', suffix='.parquet', type='dataset', size=4091, hash='h-0N84LghnfByLKITVnOFQ', _hash_type='md5', _accessor='DataFrame', visibility=1, _key_is_virtual=True, updated_at='2024-09-10 19:08:55 UTC')
Provenance
.storage = '/home/runner/work/lamin-docs/lamin-docs/docs/lamin-intro'
.transform = 'Introduction'
.run = '2024-09-10 19:08:55 UTC'
.created_by = 'anonymous'
Labels
.cell_lines = 'HEK293'
.ulabels = 'Candidate marker study'
Registries, records & fields¶
LaminDB’s central classes are related records that inherit from Record. We’ve already seen how to create new artifact, transform and ulabel records.
The easiest way to see all existing records of a given type is to call the class method df.
ln.ULabel.df()
Show code cell output
uid | name | description | reference | reference_type | run_id | created_by_id | updated_at | |
---|---|---|---|---|---|---|---|---|
id | ||||||||
1 | e6C67VLx | Candidate marker study | None | None | None | 1 | 1 | 2024-09-10 19:08:56.090423+00:00 |
Existing records are stored in the record’s registry (metaclass Registry), which maps 1:1 to a SQL table in the SQLite or Postgres backend.
A record and its registry share the same fields, which define the metadata you can query for. If you want to see them, look at the class or auto-complete.
ln.Artifact
Show code cell output
Artifact
Simple fields
.uid: CharField
.description: CharField
.key: CharField
.suffix: CharField
.type: CharField
.size: BigIntegerField
.hash: CharField
.n_objects: BigIntegerField
.n_observations: BigIntegerField
.visibility: SmallIntegerField
.version: CharField
.is_latest: BooleanField
.created_at: DateTimeField
.updated_at: DateTimeField
Relational fields
.storage: Storage
.transform: Transform
.run: Run
.created_by: User
.ulabels: ULabel
.input_of_runs: Run
.feature_sets: FeatureSet
.collections: Collection
Bionty fields
.organisms: bionty.Organism
.genes: bionty.Gene
.proteins: bionty.Protein
.cell_markers: bionty.CellMarker
.tissues: bionty.Tissue
.cell_types: bionty.CellType
.diseases: bionty.Disease
.cell_lines: bionty.CellLine
.phenotypes: bionty.Phenotype
.pathways: bionty.Pathway
.experimental_factors: bionty.ExperimentalFactor
.developmental_stages: bionty.DevelopmentalStage
.ethnicities: bionty.Ethnicity
Query & search¶
You can write arbitrary relational queries using the class methods get and filter. The syntax is Django’s query syntax; Django’s ORM is one of the two most popular ORMs in Python (the other is SQLAlchemy).
# get a single record by uid (here, the latest version of the current notebook)
transform = ln.Transform.get("FPnfDtJz8qbE")
# get a single record by matching a field
transform = ln.Transform.get(name="Introduction")
# get a set of records by filtering on description
ln.Artifact.filter(description="my RNA-seq").df()
# query all artifacts ingested from the current notebook
artifacts = ln.Artifact.filter(transform=transform).all()
# query all artifacts ingested from a notebook with "intro" in the name and labeled "Candidate marker study"
artifacts = ln.Artifact.filter(
    transform__name__icontains="intro", ulabels=candidate_marker_study
).all()
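If the double-underscore lookups above are new to you, this sketch mimics the idea on plain dictionaries: the path before the last segment walks through related records, and an optional final segment names the comparison. The matches helper and its small operator table are assumptions for illustration; the real behavior is defined by Django's ORM.

```python
# a small, assumed subset of Django-style lookup operators
OPERATORS = {
    "exact": lambda value, expected: value == expected,
    "contains": lambda value, expected: expected in value,
    "icontains": lambda value, expected: expected.lower() in value.lower(),
    "in": lambda value, expected: value in expected,
}


def matches(record: dict, **lookups) -> bool:
    """Check a nested dict against Django-style keyword lookups."""
    for expression, expected in lookups.items():
        *path, last = expression.split("__")
        operator = "exact"
        if last in OPERATORS:
            operator = last
        else:
            path.append(last)
        value = record
        for field_name in path:  # follow relations, e.g. transform -> name
            value = value[field_name]
        if not OPERATORS[operator](value, expected):
            return False
    return True


toy_artifacts = [
    {"description": "my RNA-seq", "transform": {"name": "Introduction"}},
    {"description": "other data", "transform": {"name": "Training pipeline"}},
]
hits = [a for a in toy_artifacts if matches(a, transform__name__icontains="intro")]
print([a["description"] for a in hits])
```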
The class methods search and lookup help find sets of approximately matching records.
# search in a registry
ln.Transform.search("intro").df()
# look up records with auto-complete
ulabels = ln.ULabel.lookup()
Datasets & features¶
What fields are to metadata records, features are to datasets. You can annotate datasets by the features they measure.
But because LaminDB validates all user input against its registries, annotating with a "temperature" feature doesn’t work right away.
import pytest
with pytest.raises(ln.core.exceptions.ValidationError) as e:
    artifact.features.add_values({"temperature": 21.6})
print(e.exconly())
Show code cell output
lamindb.core.exceptions.ValidationError: These keys could not be validated: ['temperature']
Here is how to create a feature:
ln.Feature(name='temperature', dtype='float').save()
Following the hint in the error message, create & save a Feature.
# create & save the "temperature" feature (only required once)
ln.Feature(name="temperature", dtype="float").save()
# now we can annotate with the feature & the value
artifact.features.add_values({"temperature": 21.6})
# describe the artifact
artifact.describe()
Show code cell output
Artifact(uid='7vvMjrJNxX6s8UgQ0000', is_latest=False, description='my RNA-seq', suffix='.parquet', type='dataset', size=4091, hash='h-0N84LghnfByLKITVnOFQ', _hash_type='md5', _accessor='DataFrame', visibility=1, _key_is_virtual=True, updated_at='2024-09-10 19:08:55 UTC')
Provenance
.storage = '/home/runner/work/lamin-docs/lamin-docs/docs/lamin-intro'
.transform = 'Introduction'
.run = '2024-09-10 19:08:55 UTC'
.created_by = 'anonymous'
Labels
.cell_lines = 'HEK293'
.ulabels = 'Candidate marker study'
Features
'temperature' = 21.6
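The validate-then-register pattern you just walked through can be sketched without a database. The feature_registry dictionary, the add_values function and the ValidationError class below are hypothetical stand-ins that only mirror the idea: values are rejected until their key exists in a registry.

```python
class ValidationError(Exception):
    """Raised when annotation keys or values fail validation."""


# hypothetical in-memory feature registry: feature name -> expected Python type
feature_registry: dict[str, type] = {}


def unvalidated_keys(values: dict) -> list[str]:
    """Return the keys that are not registered as features."""
    return [key for key in values if key not in feature_registry]


def add_values(annotations: dict, values: dict) -> None:
    """Add values to annotations only if every key and type validates."""
    unknown = unvalidated_keys(values)
    if unknown:
        raise ValidationError(f"These keys could not be validated: {unknown}")
    for key, value in values.items():
        expected = feature_registry[key]
        if not isinstance(value, expected):
            raise ValidationError(f"{key!r} expects {expected.__name__}, got {value!r}")
    annotations.update(values)


annotations: dict = {}
try:
    add_values(annotations, {"temperature": 21.6})
except ValidationError as error:
    first_error = str(error)

feature_registry["temperature"] = float  # register the feature once
add_values(annotations, {"temperature": 21.6})  # now passes
print(first_error)
print(annotations)
```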
We can also annotate with categorical features:
# register a categorical feature
ln.Feature(name="study", dtype="cat").save()
# add a categorical value
artifact.features.add_values({"study": "Candidate marker study"})
# describe the artifact with type information
artifact.describe(print_types=True)
Show code cell output
Artifact(uid='7vvMjrJNxX6s8UgQ0000', is_latest=False, description='my RNA-seq', suffix='.parquet', type='dataset', size=4091, hash='h-0N84LghnfByLKITVnOFQ', _hash_type='md5', _accessor='DataFrame', visibility=1, _key_is_virtual=True, updated_at='2024-09-10 19:08:55 UTC')
Provenance
.storage: Storage = '/home/runner/work/lamin-docs/lamin-docs/docs/lamin-intro'
.transform: Transform = 'Introduction'
.run: Run = '2024-09-10 19:08:55 UTC'
.created_by: User = 'anonymous'
Labels
.cell_lines: bionty.CellLine = 'HEK293'
.ulabels: ULabel = 'Candidate marker study'
Features
'study': cat[ULabel] = 'Candidate marker study'
'temperature': float = 21.6
This is how you query artifacts by features.
ln.Artifact.features.filter(study__contains="marker study").df()
Show code cell output
uid | version | is_latest | description | key | suffix | type | size | hash | n_objects | n_observations | _hash_type | _accessor | visibility | _key_is_virtual | storage_id | transform_id | run_id | created_by_id | updated_at | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||||||||
1 | 7vvMjrJNxX6s8UgQ0000 | None | False | my RNA-seq | None | .parquet | dataset | 4091 | h-0N84LghnfByLKITVnOFQ | None | None | md5 | DataFrame | 1 | True | 1 | 1 | 1 | 1 | 2024-09-10 19:08:55.959949+00:00 |
Features organize labels by how they’re measured in datasets, independently of how labels are stored in metadata registries.
Key examples¶
Curate datasets¶
In the quickstart, you saw how to ingest & annotate datasets without validation. This is often enough if you’re prototyping or working with one-off studies. But if you want to create a large body of standardized data, you should invest more time and curate your datasets.
Let’s use the high-level Curator class to curate a DataFrame.
# construct a Curator object to validate & annotate a DataFrame
curate = ln.Curator.from_df(
    df_fixed_typo,
    # define validation criteria
    columns=ln.Feature.name,  # map column names
    categoricals={"perturbation": ln.ULabel.name},  # map categories
)
# validate the dataset
curate.validate()
Show code cell output
✓ added 1 record with Feature.name for columns: 'perturbation'
• 3 non-validated values are not saved in Feature.name: ['CD14', 'CD8A', 'CD4']!
→ to lookup values, use lookup().columns
→ to save, run add_new_from_columns
• mapping perturbation on ULabel.name
! 2 terms are not validated: 'DMSO', 'IFNG'
→ save terms via .add_new_from('perturbation')
False
The validation did not pass because LaminDB’s registries don’t yet know about the features "CD8A", "CD4", "CD14", "perturbation" and labels "DMSO", "IFNG" in this dataset. Hence, we first need to populate them.
# add non-validated features based on the DataFrame columns
curate.add_new_from_columns()
# add non-validated labels based on the perturbation column of the dataframe
curate.add_new_from("perturbation")
# see the updated content of the ULabel registry
ln.ULabel.df()
Show code cell output
✓ added 3 records with Feature.name for columns: 'CD14', 'CD8A', 'CD4'
✓ added 2 records with ULabel.name for perturbation: 'DMSO', 'IFNG'
uid | name | description | reference | reference_type | run_id | created_by_id | updated_at | |
---|---|---|---|---|---|---|---|---|
id | ||||||||
4 | n3ygHpoZ | is_perturbation | None | None | None | 1 | 1 | 2024-09-10 19:08:59.417234+00:00 |
3 | sF8PwJHv | IFNG | None | None | None | 1 | 1 | 2024-09-10 19:08:59.411096+00:00 |
2 | i8xPrv3e | DMSO | None | None | None | 1 | 1 | 2024-09-10 19:08:59.411061+00:00 |
1 | e6C67VLx | Candidate marker study | None | None | None | 1 | 1 | 2024-09-10 19:08:56.090423+00:00 |
With the ULabel and Feature registries now containing meaningful reference values, validation passes and we can automatically parse features & labels to save an annotated & curated artifact.
# given the updated registries, the validation passes
curate.validate()
# save curated artifact
artifact = curate.save_artifact(description="my RNA-seq")
# see the parsed annotations
artifact.describe()
# query for a ulabel that was parsed from the dataset
ln.Artifact.get(ulabels__name="IFNG")
Show code cell output
✓ perturbation is validated against ULabel.name
→ returning existing artifact with same hash: Artifact(uid='7vvMjrJNxX6s8UgQ0001', version='1.0', is_latest=True, description='my RNA-seq', suffix='.parquet', type='dataset', size=4091, hash='PxPYEdKabVLAPsSuNu9m4g', _hash_type='md5', _accessor='DataFrame', visibility=1, _key_is_virtual=True, storage_id=1, transform_id=1, run_id=1, created_by_id=1, updated_at='2024-09-10 19:08:55 UTC')
Artifact(uid='7vvMjrJNxX6s8UgQ0001', version='1.0', is_latest=True, description='my RNA-seq', suffix='.parquet', type='dataset', size=4091, hash='PxPYEdKabVLAPsSuNu9m4g', _hash_type='md5', _accessor='DataFrame', visibility=1, _key_is_virtual=True, updated_at='2024-09-10 19:08:59 UTC')
Provenance
.storage = '/home/runner/work/lamin-docs/lamin-docs/docs/lamin-intro'
.transform = 'Introduction'
.run = '2024-09-10 19:08:55 UTC'
.created_by = 'anonymous'
Labels
.ulabels = 'DMSO', 'IFNG'
Features
'perturbation' = 'DMSO', 'IFNG'
Feature sets
'columns' = 'perturbation', 'CD8A', 'CD4', 'CD14'
Artifact(uid='7vvMjrJNxX6s8UgQ0001', version='1.0', is_latest=True, description='my RNA-seq', suffix='.parquet', type='dataset', size=4091, hash='PxPYEdKabVLAPsSuNu9m4g', _hash_type='md5', _accessor='DataFrame', visibility=1, _key_is_virtual=True, storage_id=1, transform_id=1, run_id=1, created_by_id=1, updated_at='2024-09-10 19:08:59 UTC')
Had we used ln.Curator from the beginning, we would have caught the typo.
# construct a Curator object to validate & annotate a DataFrame
curate = ln.Curator.from_df(
    df_with_typo,
    # define validation criteria
    columns=ln.Feature.name,  # map column names
    categoricals={"perturbation": ln.ULabel.name},  # map categories
)
# validate the dataset
curate.validate()
Show code cell output
• mapping perturbation on ULabel.name
! 1 terms is not validated: 'IFNJ'
→ save terms via .add_new_from('perturbation')
False
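To see why validating categories against a registry catches the "IFNJ" typo, here is a small standalone sketch. inspect_categories is a hypothetical helper, and the close-match suggestion via difflib is an addition for illustration; the curation output above does not show such suggestions.

```python
from difflib import get_close_matches


def inspect_categories(values: list[str], registry: set[str]):
    """Split values into validated and non-validated, with close-match hints."""
    validated = sorted({v for v in values if v in registry})
    non_validated = sorted({v for v in values if v not in registry})
    suggestions = {
        value: get_close_matches(value, sorted(registry), n=1)
        for value in non_validated
    }
    return validated, non_validated, suggestions


ulabel_names = {"DMSO", "IFNG"}  # reference values already in the registry
perturbations = ["DMSO", "IFNJ", "DMSO"]  # the column with the typo

validated, non_validated, suggestions = inspect_categories(perturbations, ulabel_names)
print("validated:", validated)
print("not validated:", non_validated)
print("did you mean:", suggestions)
```

The check only works because the registry was populated with trusted values first; an empty registry would flag every value, as in the first validation run above.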
Manage biological registries¶
The generic Feature and ULabel registries will get you pretty far.
But let’s now look at what you can do with a dedicated biological registry like Gene.
Every bionty registry is based on configurable public ontologies (>20 of them).
cell_types = bt.CellType.public()
cell_types
Show code cell output
PublicOntology
Entity: CellType
Organism: all
Source: cl, 2024-05-15
#terms: 2931
cell_types.search("gamma delta T cell").head(2)
Show code cell output
ontology_id | definition | synonyms | parents | __ratio__ | |
---|---|---|---|---|---|
name | |||||
gamma-delta T cell | CL:0000798 | A T Cell That Expresses A Gamma-Delta T Cell R... | gammadelta T cell|gamma-delta T-cell|gamma-del... | [CL:0000084] | 100.000000 |
CD27-negative gamma-delta T cell | CL:0002125 | A Circulating Gamma-Delta T Cell That Expresse... | gammadelta-17 cells | [CL:0000800] | 86.486486 |
Validate & annotate with typed features.
import anndata as ad
# store the dataset as an AnnData object to distinguish data from metadata
adata = ad.AnnData(
    df_fixed_typo[["CD8A", "CD4", "CD14"]], obs=df_fixed_typo[["perturbation"]]
)
# create an annotation flow for an AnnData object
curate = ln.Curator.from_anndata(
    adata,
    # define validation criteria
    var_index=bt.Gene.symbol,  # map .var.index onto Gene registry
    categoricals={adata.obs.perturbation.name: ln.ULabel.name},
    organism="human",  # specify the organism for the Gene registry
)
curate.add_validated_from_var_index()
curate.validate()
# save curated artifact
artifact = curate.save_artifact(description="my RNA-seq")
artifact.describe()
Show code cell output
✓ var_index is validated against Gene.symbol
✓ perturbation is validated against ULabel.name
• path content will be copied to default storage upon `save()` with key `None` ('.lamindb/IcQz5l3u8rgkZfHa0000.h5ad')
✓ storing artifact 'IcQz5l3u8rgkZfHa0000' at '/home/runner/work/lamin-docs/lamin-docs/docs/lamin-intro/.lamindb/IcQz5l3u8rgkZfHa0000.h5ad'
• parsing feature names of X stored in slot 'var'
✓ 3 terms (100.00%) are validated for symbol
✓ linked: FeatureSet(uid='myoIQ8M75UAvFvnJZbUj', n=3, dtype='int', registry='bionty.Gene', hash='f2UVeHefaZxXFjmUwo9Ozw', created_by_id=1, run_id=1)
• parsing feature names of slot 'obs'
✓ 1 term (100.00%) is validated for name
✓ linked: FeatureSet(uid='OXIU5J7yJX8VsFU8bV2u', n=1, registry='Feature', hash='O7V85H9uxjft0x9OxqCVEA', created_by_id=1, run_id=1)
✓ saved 2 feature sets for slots: 'var','obs'
Artifact(uid='IcQz5l3u8rgkZfHa0000', is_latest=True, description='my RNA-seq', suffix='.h5ad', type='dataset', size=19240, hash='nLH34gqty3-5c2eGF6deOA', n_observations=3, _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, updated_at='2024-09-10 19:09:03 UTC')
Provenance
.storage = '/home/runner/work/lamin-docs/lamin-docs/docs/lamin-intro'
.transform = 'Introduction'
.run = '2024-09-10 19:08:55 UTC'
.created_by = 'anonymous'
Labels
.ulabels = 'DMSO', 'IFNG'
Features
'perturbation' = 'DMSO', 'IFNG'
Feature sets
'var' = 'CD8A', 'CD4', 'CD14'
'obs' = 'perturbation'
Query for typed features.
# get a lookup object for human genes
genes = bt.Gene.filter(organism__name="human").lookup()
# query for all feature sets that contain CD8A
feature_sets = ln.FeatureSet.filter(genes=genes.cd8a).all()
# write the query
ln.Artifact.filter(feature_sets__in=feature_sets).df()
Show code cell output
uid | version | is_latest | description | key | suffix | type | size | hash | n_objects | n_observations | _hash_type | _accessor | visibility | _key_is_virtual | storage_id | transform_id | run_id | created_by_id | updated_at | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||||||||
3 | IcQz5l3u8rgkZfHa0000 | None | True | my RNA-seq | None | .h5ad | dataset | 19240 | nLH34gqty3-5c2eGF6deOA | None | 3 | md5 | AnnData | 1 | True | 1 | 1 | 1 | 1 | 2024-09-10 19:09:03.325534+00:00 |
Update ontologies, e.g., create a cell type record and add a new cell state.
# create an ontology-coupled cell type record and save it
neuron = bt.CellType.from_source(name="neuron")
neuron.save()
# create a record to track a new cell state
new_cell_state = bt.CellType(name="my neuron cell state", description="explains X")
new_cell_state.save()
# express that it's a neuron state
new_cell_state.parents.add(neuron)
# view ontological hierarchy
new_cell_state.view_parents(distance=2)
Show code cell output
✓ created 1 CellType record from Bionty matching name: 'neuron'
✓ created 3 CellType records from Bionty matching ontology_id: 'CL:0000393', 'CL:0000404', 'CL:0002319'
! records with similar names exist! did you mean to load one of them?
uid | name | ontology_id | abbr | synonyms | description | source_id | run_id | created_by_id | updated_at | |
---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||
1 | 3QnZfoBk | neuron | CL:0000540 | None | nerve cell | The Basic Cellular Unit Of Nervous Tissue. Eac... | 32 | 1 | 1 | 2024-09-10 19:09:04.447172+00:00 |
2 | 2qSJYeQX | electrically responsive cell | CL:0000393 | None | None | A Cell Whose Function Is Determined By Its Res... | 32 | 1 | 1 | 2024-09-10 19:09:05.256759+00:00 |
3 | 5NqNmmSr | electrically signaling cell | CL:0000404 | None | None | A Cell That Initiates An Electrical Signal And... | 32 | 1 | 1 | 2024-09-10 19:09:05.256797+00:00 |
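The parent relationships used above form a directed graph, and viewing parents up to a given distance is a bounded walk over it. The sketch below uses a toy PARENTS mapping whose edges are illustrative only (they are not read from the Cell Ontology), and ancestors_within is a hypothetical helper.

```python
# toy child -> parents mapping; edges are illustrative, not ontology data
PARENTS = {
    "my neuron cell state": ["neuron"],
    "neuron": ["electrically responsive cell", "electrically signaling cell"],
}


def ancestors_within(term: str, distance: int) -> dict[str, int]:
    """Return ancestors of a term with the number of steps needed to reach them."""
    found: dict[str, int] = {}
    frontier = [term]
    for depth in range(1, distance + 1):
        next_frontier = []
        for node in frontier:
            for parent in PARENTS.get(node, []):
                if parent not in found:
                    found[parent] = depth
                    next_frontier.append(parent)
        frontier = next_frontier
    return found


print(ancestors_within("my neuron cell state", distance=1))
print(ancestors_within("my neuron cell state", distance=2))
```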
Scale up data & learning¶
How do you integrate new datasets with your existing datasets? Leverage Collection.
# a new dataset
df = pd.DataFrame(
    {
        "CD8A": [2, 3, 3],
        "CD4": [3, 4, 5],
        "CD38": [4, 2, 3],
        "perturbation": ["DMSO", "IFNG", "IFNG"],
    },
    index=["sample4", "sample5", "sample6"],
)
adata = ad.AnnData(df[["CD8A", "CD4", "CD38"]], obs=df[["perturbation"]])
# validate, curate and save a new artifact
curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.symbol,
    categoricals={adata.obs.perturbation.name: ln.ULabel.name},
    organism="human",
)
curate.add_validated_from_var_index()
curate.validate()
artifact2 = curate.save_artifact(description="my RNA-seq dataset 2")
Show code cell output
✓ var_index is validated against Gene.symbol
✓ perturbation is validated against ULabel.name
• path content will be copied to default storage upon `save()` with key `None` ('.lamindb/RfpUjvhZerqkvKl10000.h5ad')
✓ storing artifact 'RfpUjvhZerqkvKl10000' at '/home/runner/work/lamin-docs/lamin-docs/docs/lamin-intro/.lamindb/RfpUjvhZerqkvKl10000.h5ad'
• parsing feature names of X stored in slot 'var'
✓ 3 terms (100.00%) are validated for symbol
✓ linked: FeatureSet(uid='WFZCdQlYqQ3gtieeOeoj', n=3, dtype='int', registry='bionty.Gene', hash='QW2rHuIo5-eGNZbRxHMDCw', created_by_id=1, run_id=1)
• parsing feature names of slot 'obs'
✓ 1 term (100.00%) is validated for name
✓ linked: FeatureSet(uid='OXIU5J7yJX8VsFU8bV2u', n=1, registry='Feature', hash='O7V85H9uxjft0x9OxqCVEA', created_by_id=1, run_id=1)
✓ saved 1 feature set for slot: 'var'
Create a collection using Collection.
collection = ln.Collection([artifact, artifact2], name="my RNA-seq collection").save()
collection.describe()
collection.view_lineage()
Show code cell output
Collection(uid='OGpVbDCNV0VbG2A60000', is_latest=True, name='my RNA-seq collection', hash='xDjhklRxArFHharWMZPEzw', visibility=1, updated_at='2024-09-10 19:09:07 UTC')
Provenance
.created_by = 'anonymous'
.transform = 'Introduction'
.run = '2024-09-10 19:08:55 UTC'
# if it's small enough, you can load the entire collection into memory as if it was one
collection.load()
# typically, it's too big, hence, iterate over its artifacts
collection.artifacts.all()
# or look at a DataFrame listing the artifacts
collection.artifacts.df()
Show code cell output
id | uid | version | is_latest | description | key | suffix | type | size | hash | n_objects | n_observations | _hash_type | _accessor | visibility | _key_is_virtual | storage_id | transform_id | run_id | created_by_id | updated_at |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | IcQz5l3u8rgkZfHa0000 | None | True | my RNA-seq | None | .h5ad | dataset | 19240 | nLH34gqty3-5c2eGF6deOA | None | 3 | md5 | AnnData | 1 | True | 1 | 1 | 1 | 1 | 2024-09-10 19:09:03.325534+00:00 |
4 | RfpUjvhZerqkvKl10000 | None | True | my RNA-seq dataset 2 | None | .h5ad | dataset | 19240 | K95PcyOoxIxtlytXMr6AVg | None | 3 | md5 | AnnData | 1 | True | 1 | 1 | 1 | 1 | 2024-09-10 19:09:07.549914+00:00 |
Directly train models on collections of AnnData.
# to train models, batch iterate through the collection as if it was one array
from torch.utils.data import DataLoader, WeightedRandomSampler
dataset = collection.mapped(obs_keys=["perturbation"])
sampler = WeightedRandomSampler(
    weights=dataset.get_label_weights("perturbation"), num_samples=len(dataset)
)
data_loader = DataLoader(dataset, batch_size=2, sampler=sampler)
for batch in data_loader:
    pass
Read this blog post for more on training models on sharded datasets.
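To make the role of the weighted sampler more concrete, here is a minimal standard-library sketch that uses neither lamindb nor torch. It assumes that label weights are inversely proportional to label frequency, which is a common choice for balancing classes; the labels and the helper `label_weights` are hypothetical and only serve as an illustration.

```python
import random
from collections import Counter


def label_weights(labels):
    """Return one weight per observation, inversely proportional to its label's frequency."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# hypothetical "perturbation" labels pooled across all artifacts of a collection
labels = ["DMSO"] * 8 + ["IFNG"] * 2
weights = label_weights(labels)

# the weights of each label sum to 1, so both labels are drawn with equal probability
rng = random.Random(0)
indices = rng.choices(range(len(labels)), weights=weights, k=len(labels))

# group the sampled indices into batches of 2, mirroring batch_size=2 above
batch_size = 2
batches = [indices[i : i + batch_size] for i in range(0, len(indices), batch_size)]
```

Like `WeightedRandomSampler` with its default settings, this samples with replacement, so rare labels can appear several times within one pass.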
Data lineage¶
Generate lineage¶
LaminDB doesn’t require you to write workflows to track data lineage (if you want a workflow, look at Pipelines – workflow managers).
All you need is:
import lamindb as ln
ln.context.track() # track your run context
# write code
ln.context.finish() # mark run as finished, save execution report, source code & environment
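To illustrate what tracking a run context can mean conceptually, the following toy sketch records a transform name, a hash of the source code, timestamps, and outputs. It is not how lamindb implements `track()` and `finish()`; `ToyContext`, `RunRecord` and `register_output` are hypothetical names invented for this example.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class RunRecord:  # hypothetical, not a lamindb class
    transform_name: str
    source_hash: str
    started_at: datetime
    finished_at: Optional[datetime] = None
    outputs: List[str] = field(default_factory=list)


class ToyContext:  # hypothetical stand-in for a run context
    def __init__(self):
        self.run = None

    def track(self, transform_name, source_code):
        # identify the code version by a short hash of its source
        source_hash = hashlib.sha256(source_code.encode()).hexdigest()[:12]
        self.run = RunRecord(transform_name, source_hash, datetime.now(timezone.utc))
        return self.run

    def register_output(self, artifact_name):
        # link an output to the current run
        self.run.outputs.append(artifact_name)

    def finish(self):
        # mark the run as finished
        self.run.finished_at = datetime.now(timezone.utc)


context = ToyContext()
run = context.track("Introduction", source_code="print('hello')")
context.register_output("my RNA-seq dataset")
context.finish()
```

The point of the sketch is only that a single call at the start of a script is enough to attach everything created afterwards to one run record.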
How does it look on the hub?
See an example for this introductory notebook here.
For lamindb instances that connect to the hub, you can download notebooks & scripts via lamin get.
lamin get https://lamin.ai/laminlabs/lamindata/transform/FPnfDtJz8qbE65cN
Use lineage¶
View the sequence of data transformations (Transform) in a project (from a use case, based on Schmidt et al., 2022):
transform.view_lineage()
Or, the generating flow of an artifact:
artifact.view_lineage()
Both figures are based on mere calls to ln.context.track() in notebooks, pipelines & apps.
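Conceptually, a lineage view is a walk over a graph in which runs of transforms connect input artifacts to output artifacts. The sketch below shows such a walk in plain Python; the run records, file names and the helper `upstream_lineage` are hypothetical and not part of the lamindb API.

```python
# each hypothetical run of a transform maps input artifacts to output artifacts
runs = [
    {"transform": "ingest.py", "inputs": [], "outputs": ["raw.h5ad"]},
    {"transform": "qc.ipynb", "inputs": ["raw.h5ad"], "outputs": ["filtered.h5ad"]},
    {"transform": "train.py", "inputs": ["filtered.h5ad"], "outputs": ["model.pt"]},
]


def upstream_lineage(artifact, runs):
    """Return the transforms upstream of `artifact`, depth-first from its producer."""
    produced_by = {out: run for run in runs for out in run["outputs"]}
    lineage, stack, seen = [], [artifact], set()
    while stack:
        current = stack.pop()
        run = produced_by.get(current)
        # skip artifacts without a known producer and transforms already visited
        if run is None or run["transform"] in seen:
            continue
        seen.add(run["transform"])
        lineage.append(run["transform"])
        stack.extend(run["inputs"])
    return lineage


lineage = upstream_lineage("model.pt", runs)
# lineage is ["train.py", "qc.ipynb", "ingest.py"]
```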
Distributed & modular¶
Easily create & access lakehouse instances¶
LaminDB is a distributed system like git. Similar to cloning a repository, collaborators can connect to your instance via:
ln.connect("account-handle/instance-name")
Or you load an instance on the command line for auto-connecting in a Python session:
lamin load "account-handle/instance-name"
Or you create a new instance:
lamin init --storage ./my-data-folder
Custom schemas and plugins¶
LaminDB can be customized & extended with schema & app plugins building on the Django ecosystem. Examples are:
bionty: Registries for basic biological entities, coupled to public ontologies.
wetlab: Exemplary custom schema to manage samples, treatments, etc.
If you’d like to create your own schema or app:
Create a git repository with registries similar to wetlab
Create & deploy migrations via lamin migrate create and lamin migrate deploy
It’s fastest if we do this for you based on our templates within an enterprise plan.
Design¶
Why?¶
Objects like pd.DataFrame are at the heart of many data science workflows but there hasn’t been a tool to manage these objects in the rich context that collaborative biological research requires:
provenance: data sources, data transformations, models, users
domain knowledge & experimental metadata: the features & labels derived from domain entities
In this blog post, we discuss how the complexity of modern R&D data often blocks realizing the scientific progress it promises.
Assumptions¶
Schema & API¶
LaminDB provides a SQL schema for common entities: Artifact, Collection, Transform, Feature, ULabel, etc. See the API reference or the source code.
The core schema is extendable through plugins (see blue vs. red entities in graphic), e.g., with basic biological (Gene, Protein, CellLine, etc.) & operational entities (Biosample, Techsample, Treatment, etc.).
What is the schema language?
Data models are defined in Python using the Django ORM. Django translates them to SQL tables. Django is one of the most-used & highly-starred projects on GitHub (~1M dependents, ~73k stars) and has been robustly maintained for 15 years.
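To give a feel for how a Python data model can be translated into a SQL table, here is a small sketch that uses only the standard library (`dataclasses` and `sqlite3`) as a stand-in for Django. Django itself works with `models.Model` subclasses and migrations; the `Treatment` fields, the type mapping and the helper `create_table_sql` below are assumptions made for illustration.

```python
import sqlite3
from dataclasses import dataclass, fields

# assumed mapping from Python types to SQLite column types
SQL_TYPES = {int: "INTEGER", str: "TEXT", float: "REAL"}


@dataclass
class Treatment:  # hypothetical entity, fields chosen for illustration
    id: int
    name: str
    concentration: float


def create_table_sql(model):
    """Derive a CREATE TABLE statement from the fields of a dataclass."""
    columns = []
    for f in fields(model):
        column = f"{f.name} {SQL_TYPES[f.type]}"
        if f.name == "id":
            column += " PRIMARY KEY"
        columns.append(column)
    return f"CREATE TABLE {model.__name__.lower()} ({', '.join(columns)})"


ddl = create_table_sql(Treatment)

# create the table in an in-memory database and store one record
conn = sqlite3.connect(":memory:")
conn.execute(ddl)
conn.execute(
    "INSERT INTO treatment (id, name, concentration) VALUES (?, ?, ?)",
    (1, "IFNG", 0.5),
)
rows = conn.execute("SELECT name, concentration FROM treatment").fetchall()
```

An ORM adds much more on top (relations, validation, migrations), but the core idea of deriving tables from class definitions is the same.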
On top of the schema, LaminDB is a Python API that abstracts over storage & database access, data transformations, and (biological) ontologies.
Repositories¶
LaminDB and its plugins consist of open-source Python libraries & publicly hosted metadata assets:
lamindb: Core package.
bionty: Registries for basic biological entities, coupled to public ontologies.
wetlab: Default wetlab schema.
guides: Guides.
usecases: Use cases.
All immediate dependencies are available as git submodules here, for instance,
lnschema-core: Core schema.
lamindb-setup: Setup & configure LaminDB.
lamin-cli: CLI for lamindb and lamindb-setup.
For a comprehensive list of open-sourced software, browse our GitHub account.
lamin-utils: Generic utilities, e.g., a logger.
readfcs: FCS artifact reader.
nbproject: Light-weight Jupyter notebook tracker.
bionty-assets: Assets for public biological ontologies.
LaminHub is not open-sourced.
Influences¶
LaminDB was influenced by many other projects, see Influences.
Note
This is how you delete your lamindb instance.
You will be asked to confirm the deletion and to explicitly agree to deleting your data.
lamin delete lamin-intro