Changelog 2025

Note

Get notified by watching releases for git repositories: lamindb, laminhub, laminr, and bionty.

🪜 For other years, see: 2024 · 2023 · 2022

2025-02-18 db 1.1.0

⚠️ The FeatureSet registry got renamed to Schema.

All your code is backward compatible. The Schema registry encompasses feature sets as a special case.

✨ Conveniently track functions including inputs, outputs, and parameters with a decorator: ln.tracked(). PR1 PR2 @falexwolf

@ln.tracked()
def subset_dataframe(
    input_artifact_key: str,  # all arguments tracked as parameters of the function run
    output_artifact_key: str,
    subset_rows: int = 2,
    subset_cols: int = 2,
) -> None:
    artifact = ln.Artifact.get(key=input_artifact_key)
    df = artifact.load()  # auto-tracked as input
    new_df = df.iloc[:subset_rows, :subset_cols]
    ln.Artifact.from_df(new_df, key=output_artifact_key).save()  # auto-tracked as output
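
Calling the decorated function then creates a run that records inputs, outputs, and parameters. A minimal usage sketch; the artifact keys here are hypothetical and assume the input artifact already exists:

ln.track()  # track the calling notebook or script

subset_dataframe(
    input_artifact_key="my_datasets/dataset1.parquet",
    output_artifact_key="my_datasets/dataset1_subset.parquet",
)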

✨ Make sub-types of ULabel, Feature, Schema, Project, Param, and Reference. PR @falexwolf

cell_medium = ln.ULabel(name="CellMedium", is_type=True).save()
ln.ULabel(name="DMSO", type=cell_medium).save()
ln.ULabel(name="IFNG", type=cell_medium).save()

✨ Use an overhauled dataset curation flow. @falexwolf @Zethson @sunnyosun

  • support persisting validation constraints as a pandera-compatible schema

  • support validating any feature type, no longer just categoricals

  • make the relationship between features, dataset schema, and curator evident

Detailed changes for the overhauled curation flow.

⚠️ The API gained the lamindb.curators module as the new way to access Curator classes for different data structures.

  • This release introduces the schema-based DataFrameCurator and AnnDataCurator

  • The old-style curation flow for categoricals based on lamindb.Curator.from_objecttype() continues to work

(Before → after screenshots of the old and new curation flow; images omitted)

Key PRs.

  • ✨ Overhaul curation guides + enable default values and filters on valid categories for features PR @falexwolf

  • ✨ Schema-based curators: AnnDataCurator PR @falexwolf

  • ✨ Schema-based curators: DataFrameCurator PR @falexwolf

Enabling PRs.

  • ✨ Allow passing artifact to Curator PR @sunnyosun

  • 🎨 A ManyToMany between Schema.components and .composites PR @falexwolf

  • ♻️ Mark Schema fields as non-editable PR @falexwolf

  • ✨ Add auxiliary field nullable to Feature PR @falexwolf

  • ♻️ Prettify AnnDataCurator implementation PR @falexwolf

  • 🚸 Better error for malformed categorical dtype PR @falexwolf

  • 🚚 Restore .feature_sets as a ManyToManyField PR @falexwolf

  • 🩹 Delete files before re-downloading them in UPath.synchronize PR @Koncopd

  • 🚚 Rename CatCurator to CatManager PR @falexwolf

  • 🎨 Let Curator.validate() throw an error PR @falexwolf

  • ♻️ Re-purpose BaseCurator as Curator, introduce CatCurator and consolidate shared logic under CatCurator PR @falexwolf

  • ♻️ Refactor organism handling in curators PR @falexwolf

  • 🔥 Eliminate all logic related to using_key in curators PR @falexwolf

  • 🚚 Bulk-rename old-style curators to CatCurator PR @falexwolf

  • 🎨 Self-contained definition of CellxGene schema / validation constraints PR @falexwolf

  • 🚚 Move PertCurator from wetlab here and add CellxGene Curator test PR @falexwolf

  • 🚚 Move CellXGene Curator from cellxgene-lamin here PR @falexwolf

schema = ln.Schema(
    name="small_dataset1_obs_level_metadata",
    features=[
        ln.Feature(name="CD8A", dtype=int).save(),  # integer counts for the CD8A marker
        ln.Feature(name="cell_medium", dtype=ln.ULabel).save(),  # a categorical feature that validates against the ULabel registry
        ln.Feature(name="sample_note", dtype=str).save(),  # a free-text note for the sample
    ],
).save()

df = pd.DataFrame({
    "CD8A": [1, 4, 0],
    "cell_medium": ["DMSO", "IFNG", "DMSO"],  # values validated against the ULabel registry
    "sample_note": ["value_1", "value_2", "value_3"],
    "temperature": [22.2, 25.7, 27.3],  # an extra column that the schema doesn't describe
})
curator = ln.curators.DataFrameCurator(df, schema)
artifact = curator.save_artifact(key="example_datasets/dataset1.parquet")  # validates compliance with the schema, annotates with metadata
assert artifact.schema == schema  # the validating schema

✨ Easily filter on a validating schema. @falexwolf @Zethson @sunnyosun

On the hub.

With the Schema filter button, find all datasets that satisfy a given schema (→ explore).

(screenshot: the Schema filter button on the hub)

schema = ln.Schema.get(name="small_dataset1_obs_level_metadata")  # get a schema
ln.Artifact.filter(schema=schema).df()  # filter all datasets that were validated by the schema

✨ Collection.open() returns a pyarrow dataset. PR @Koncopd

df = pd.DataFrame({"feat1": [0, 0, 1, 1], "feat2": [6, 7, 8, 9]})
df[:2].to_parquet("df1.parquet", engine="pyarrow")
df[2:].to_parquet("df2.parquet", engine="pyarrow")

artifact1 = ln.Artifact("df1.parquet", key="df1.parquet").save()
artifact2 = ln.Artifact("df2.parquet", key="df2.parquet").save()
collection = ln.Collection([artifact1, artifact2], key="parquet_col").save()

dataset = collection.open()  # a pyarrow dataset backed by the files in cloud storage
dataset.to_table().to_pandas().head()
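
Because the returned object is a pyarrow dataset, filters and column selections can be pushed down to the scan rather than loading all shards into memory; a small sketch using pyarrow's dataset API:

import pyarrow.dataset as ds

dataset.to_table(filter=ds.field("feat1") == 1).to_pandas()  # scan only the matching rows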

✨ Support s3-compatible endpoint urls, say your on-prem MinIO deployment. PR @Koncopd
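
For example, you can point a fresh instance at such a deployment by appending an endpoint_url query parameter to the storage location; a sketch, assuming a MinIO server at http://localhost:9000 with a bucket named my-bucket:

import lamindb as ln

ln.setup.init(storage="s3://my-bucket?endpoint_url=http://localhost:9000")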

Speed up instance creation through squashed migrations.

Tiledbsoma.

  • ✨ Support endpoint_url in operations with tiledbsoma PR1 PR2 @Koncopd

  • ✨ Add Artifact.from_tiledbsoma to populate n_observations PR @Koncopd
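
A sketch of the new constructor, assuming it mirrors the other Artifact.from_x methods in taking a store path plus a key (the path is hypothetical):

artifact = ln.Artifact.from_tiledbsoma("s3://my-bucket/store.tiledbsoma", key="store.tiledbsoma").save()
artifact.n_observations  # populated from the store's number of observations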

MappedCollection.

  • 🐛 Allow filtering on np.nan in obs_filter of MappedCollection PR @Koncopd

  • 🐛 Fix labels for NaN in categorical columns for MappedCollection PR @Koncopd

SpatialDataCurator.

  • 🐛 Fix var_index standardization of SpatialDataCurator PR1 PR2 @Zethson

  • 🐛 Fix sample level metadata optional in SpatialDataCatManager PR @Zethson

Core functionality.

  • ✨ Allow to check the need for syncing without actually syncing PR @Koncopd

  • ✨ Check for corrupted cache in Artifact.load() & Artifact.open() PR PR @Koncopd

  • ✨ Infer n_observations in Artifact.from_anndata PR @Koncopd

  • 🐛 Account for VSCode appending languageid to markdown cell in notebook tracking PR @falexwolf

  • 🐛 Fix dangling folders on upload failures PR @Koncopd

  • 🐛 Normalize module names for robust checking in _check_instance_setup() PR @Koncopd

  • 🐛 Fix idempotency of Feature creation when description is passed and improve filter and get error behavior PR @Zethson

  • 🐛 Fix caching logic in Artifact.open() PR @Koncopd

  • 🚸 Make new version upon passing existing key to Collection PR @falexwolf

  • 🚸 Throw better error upon checking instance.modules when loading a lamindb schema module PR @Koncopd

  • 🚸 Validate existing records in the DB irrespective of whether an ontology source is passed or not PR @sunnyosun

  • 🚸 Full guarantee of avoiding duplicating Transform, Artifact & Collection in concurrent runs PR @falexwolf

  • 🚸 Fix RemovedInDjango60Warning PR @Zethson

  • 🚸 Better user feedback during keyword validation in Record constructor PR @Zethson

  • 🚸 Fix warning about artifacts in trash PR @ap--

  • 🚸 Improved error message when saving via CLI PR @Zethson

  • 🚸 Improve local storage not found warning message PR @Zethson

  • 🚸 Better error message when attempting to save a file while not being connected to an instance PR @Zethson

  • 🚸 Error for non-keyword parameters for Artifact.from_x methods PR @Zethson

Housekeeping.

  • 🚸 Error at runtime with old s3fs PR @Koncopd

  • 🚸 Safer resolve in check_path_is_child_of_root() PR @Koncopd

  • ⬆️ Upgrade fsspec packages (s3fs, gcsfs, universal_pathlib) PR @Koncopd

  • ➕ Add pyyaml to dependencies PR @Koncopd

2025-01-23 db 1.0.5

  • 🚸 No longer throw a NotebookNotSaved error in ln.finish() but wait for the user or gracefully exit PR @falexwolf

  • 🚸 Resolve save FutureWarning PR @Zethson

  • 🐛 Fix Artifact.replace() for folder-like artifacts PR @Koncopd

  • 🐛 Filter the latest transform on saving by filename PR @Koncopd

2025-01-21 db 1.0.4

🚚 Revert Collection.description back to unlimited length TextField. PR @falexwolf

2025-01-21 db 1.0.3

🚸 In track(), improve logging in RStudio sessions. PR @falexwolf

2025-01-20 R 0.4.0

  • 🚚 Migrate to lamindb v1 PR @falexwolf

  • 🚸 Improve the user experience for setting up Python & reticulate PR @lazappi

2025-01-20 db 1.0.2

🚚 Improvements for lamindb v1 migrations. PR @falexwolf

  • add a .description field to Schema

  • enable labeling Run with ULabel

  • add a .predecessors and .successors field to Project akin to what’s present on Transform

  • make .uid fields not editable

2025-01-18 db 1.0.1

🐛 Block non-admin users from confirming the dialogue for integrating lnschema-core. PR @falexwolf

2025-01-17 db 1.0.0

This release makes the API consistent, integrates lnschema_core & ourprojects into the lamindb package, and introduces a breadth of database migrations to enable future features without disruption. You’ll now need at least Python 3.10.

Your code will continue to run as is, but you will receive warnings about a few renamed API components.

| What | Before | After |
| --- | --- | --- |
| Dataset vs. model | Artifact.type | Artifact.kind |
| Python object for Artifact | Artifact._accessor | Artifact.otype |
| Number of files | Artifact.n_objects | Artifact.n_files |
| name arg of Transform | Transform(name="My notebook", key="my-notebook.ipynb") | Transform(key="my-notebook.ipynb", description="My notebook") |
| name arg of Collection | Collection(name="My collection") | Collection(key="My collection") |
| Consecutiveness field | Run.is_consecutive | Run._is_consecutive |
| Run initiator | Run.parent | Run.initiated_by_run |
| --schema arg | lamin init --schema bionty,wetlab | lamin init --modules bionty,wetlab |

Migration guide:

  1. Upon lamin connect account/instance you will be prompted to confirm migrating away from lnschema_core

  2. After that, you will be prompted to call lamin migrate deploy to apply database migrations

New features:

  • ✨ Allow http storage backend for Artifact PR @Koncopd

  • ✨ Add SpatialDataCurator PR @Zethson

  • ✨ Allow filtering by multiple obs columns in MappedCollection PR @Koncopd

  • ✨ In git sync, also search git blob hash in non-default branches PR @Zethson

  • ✨ Add relationship with Project to everything except Run, Storage & User so that you can easily filter for the entities relevant to your project PR @falexwolf

  • ✨ Capture logs of scripts during ln.track() PR1 PR2 @falexwolf @Koncopd

  • ✨ Support "|"-seperated multi-values in Curator PR @sunnyosun

  • 🚸 Accept None in connect() and improve migration dialogue PR @falexwolf

UX improvements:

  • 🚸 Simplify the ln.track() experience PR @falexwolf (see the sketch after this list)

    1. you can omit the uid argument

    2. you can organize transforms in folders

    3. versioning is fully automated (requirement for 1.)

    4. you can save scripts and notebooks without running them (corollary of 1.)

    5. you avoid the interactive prompt in a notebook and the throwing of an error in a script (corollary of 1.)

    6. you are no longer required to add a title in a notebook
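
In its simplest form, this is now just (a minimal sketch):

import lamindb as ln

ln.track()  # no uid needed; the transform is identified by its key and versioned automatically
# ... your analysis ...
ln.finish()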

  • 🚸 Raise error when modifying Artifact.key in problematic ways PR1 PR2 @sunnyosun @Koncopd

  • 🚸 Better error message on running ln.track() within Python terminal PR @Koncopd

  • 🚸 Hide traceback for InstanceNotEmpty using Click Exception PR @Zethson

  • 🚸 Hide underscore attributes in __repr__ PR @Zethson

  • 🚸 Only auto-search ._name_field in sub-classes of CanCurate PR @falexwolf

  • 🚸 Simplify installation & API overview PR @falexwolf

  • 🚸 Make lamin_run_uid categorical in tiledbsoma stores PR @Koncopd

  • 🚸 Add defensive check for organism arg PR @Zethson

  • 🚸 Raise ValueError when trying to search a None value PR @Zethson

Bug fixes:

  • 🐛 Skip deleting storage when deleting outdated versions of folder-like artifacts PR @Koncopd

  • 🐛 Let SOMACurator() validate and annotate all .obs columns PR @falexwolf

  • 🐛 Fix renaming of feature sets PR @sunnyosun

  • 🐛 Do not raise an exception when default AWS credentials fail PR @Koncopd

  • 🐛 Only map synonyms when field is name PR @sunnyosun

  • 🐛 Fix source in .from_values PR @sunnyosun

  • 🐛 Fix creating instances with storage in the current local working directory PR @Koncopd

  • 🐛 Fix NA values in Curator.add_new_from() PR @sunnyosun

Refactors, renames & maintenance:

  • 🏗️ Integrate lnschema-core into lamindb PR1 PR2 @falexwolf @Koncopd

  • 🏗️ Integrate ourprojects into lamindb PR @falexwolf

  • ♻️ Manage created_at, updated_at on the database-level, make created_by not editable PR @falexwolf

  • 🚚 Rename transform type “glue” to “linker” PR @falexwolf

  • 🚚 Deprecate the --schema argument of lamin init in favor of --modules PR @falexwolf

  • ⬆️ Compatibility with tiledbsoma==1.15.0 PR @Koncopd

DevOps:

Detailed list of database migrations

Those not yet announced above will be announced with the functionality they enable.

  • ♻️ Add contenttypes Django plugin PR @falexwolf

  • 🚚 Prepare introduction of persistable Curator objects by renaming FeatureSet to Schema on the database-level PR @falexwolf

  • 🚚 Add a .type foreign key to ULabel, Feature, FeatureSet, Reference, Param PR @falexwolf

  • 🚚 Introduce RunData, TidyTable, and TidyTableData in the database PR @falexwolf

All remaining database schema changes were made in this PR @falexwolf. Data migrations happen automatically.

  • remove _source_code_artifact from Transform, it’s been deprecated since 0.75

    • data migration: for all transforms that have _source_code_artifact populated, populate source_code

  • rename Transform.name to Transform.description because it’s analogous to Artifact.description

    • backward compat:

      • in the Transform constructor use name to populate key in all cases in which only name is passed

      • return the same transform based on key in case source_code is None via ._name_field = "key"

    • data migrations:

      • there already was a legacy description field that was never exposed on the constructor; to be safe, we concatenated any data found in it onto the new description field

      • for all transforms that have key=None and name!=None, use name to pre-populate key

  • rename Collection.name to Collection.key for consistency with Artifact & Transform and the high likelihood of you wanting to organize them hierarchically

  • a _branch_code integer on every record to model pull requests

    • include visibility within that code

    • repurpose visibility=0 as _branch_code=0, meaning “archive”

    • put an index on it

    • code a “draft” as _branch_code = 2, and “draft prs” as negative branch codes

  • rename values "number" to "num" in dtype

  • an ._aux json field on Record

  • a SmallInteger run._status_code that allows writing finished_at in cleanup operations so that aborted runs also have a run time

  • rename Run.is_consecutive to Run._is_consecutive

  • a _template_id FK to store the information of the generating template (whether a record is a template is coded via _branch_code)

  • rename _accessor to otype to publicly declare the data format (together with suffix)

  • rename Artifact.type to Artifact.kind

  • a FK to artifact run._logfile which holds logs

  • a hash field on ParamValue and FeatureValue to enforce uniqueness without running the danger of failure for large dictionaries

  • add a boolean field ._expect_many to Feature/Param that defaults to True/False and indicates whether values for this feature/param are expected to occur once or multiple times for a given artifact/run

    • for feature

      • if it’s True (default), the values come from an observation-level aggregation and a dtype of datetime on the observation-level means set[datetime] on the artifact-level

      • if it’s False it’s an artifact-level value and datetime means datetime; this is an edge case because an arbitrary artifact would always be a set of arbitrary measurements that would need to be aggregated (“one just happens to measure a single cell line in that artifact”)

    • for param

      • if it’s False (default), the values are artifact/run-level values and datetime means datetime

      • if it’s True, the values would be from an aggregation, this seems like an edge case but say when characterizing a model ensemble trained with different parameters it could be relevant

  • remove the .transform foreign key from artifact and collection for consistency with all other records; introduce a property and a simple filter statement instead that maintains the same UX

  • store provenance metadata for TransformULabel, RunParamValue, ArtifactParamValue

  • enable linking projects & references to transforms & collections

  • rename Run.parent to Run.initiated_by_run

  • introduce a boolean flag on artifact that’s called _overwrite_versions, which indicates whether versions are overwritten or stored separately; it defaults to False for file-like artifacts and to True for folder-like artifacts

  • rename n_objects to n_files for more clarity

  • add a Space registry to lamindb with an FK on every BasicRecord

  • add a name column to Run so that a run can serve as a named analysis

  • remove _previous_runs field on everything except Artifact & Collection