Tutorial: Features & labels

In Tutorial: Artifacts, we learned about provenance and data access. Here, we walk through validating & annotating datasets with features & labels to improve:

  1. Findability: Which collections measured expression of cell marker CD14? Which characterized cell line K562? Which collections have a test & train split? Etc.

  2. Usability: Are there typos in feature names? Are there typos in sampled labels? Are types and units of features consistent? Etc.

Hint

This is a low-level tutorial aimed at a basic understanding of registering features and labels for annotation & validation.

If you’re just looking to readily validate and annotate a dataset with features and labels, see this guide: Curate datasets.

import lamindb as ln
import pandas as pd
import pytest

ln.settings.verbosity = "hint"
 connected lamindb: anonymous/lamin-tutorial

TLDR

Annotate by labels

# create a label
study0 = ln.ULabel(
    name="Study 0: initial plant gathering", description="My initial study"
).save()
# query an artifact from the previous tutorial
artifact = ln.Artifact.get(key="iris_studies/study0_raw_images")
# label the artifact
artifact.labels.add(study0)
# look at artifact metadata
artifact.describe()
Artifact 
├── General
│   ├── .uid = 'VMWTlHIqTUe1jq6W0000'
│   ├── .key = 'iris_studies/study0_raw_images'
│   ├── .size = 658465
│   ├── .hash = 'IVKGMfNwi8zKvnpaD_gG7w'
│   ├── .n_files = 51
│   ├── .path = s3://lamindata/iris_studies/study0_raw_images
│   ├── .created_by = anonymous
│   ├── .created_at = 2025-01-20 08:39:36
│   └── .transform = 'Tutorial: Artifacts'
└── Labels
    └── .ulabels                    ULabel                     Study 0: initial plant gathering         

Annotate by features

Features are buckets for labels, numbers and other data types.

Often, data that you want to ingest comes with metadata.

Here, three metadata features were collected: species, scientist, and instrument.

df = pd.read_csv(artifact.path / "meta.csv", index_col=0)
df.head()
   species     file_name                                          scientist           instrument
0  setosa      iris-0797945218a97d6e5251b4758a2ba1b418cbd52ce...  Barbara McClintock  Leica IIIc Camera
1  versicolor  iris-0f133861ea3fe1b68f9f1b59ebd9116ff963ee710...  Edgar Anderson      Leica IIIc Camera
2  versicolor  iris-9ffe51c2abd973d25a299647fa9ccaf6aa9c8eecf...  Edgar Anderson      Leica IIIc Camera
3  setosa      iris-83f433381b755101b9fc9fbc9743e35fbb8a1a109...  Edgar Anderson      Leica IIIc Camera
4  virginica   iris-bdae8314e4385d8e2322abd8e63a82758a9063c77...  Edgar Anderson      Leica IIIc Camera

There are only a few values for features species, scientist & instrument, and we’d like to label the artifact with these values:

df.nunique()
species        3
file_name     50
scientist      2
instrument     1
dtype: int64

Let’s annotate the artifact with features & values and add a temperature measurement that Barbara & Edgar had forgotten in their csv:

with pytest.raises(ln.core.exceptions.ValidationError) as error:
    artifact.features.add_values(
        {
            "species": df.species.unique(),
            "scientist": df.scientist.unique(),
            "instrument": df.instrument.unique(),
            "temperature": 27.6,
            "study": "Study 0: initial plant gathering",
        }
    )
print(error.exconly())
lamindb.core.exceptions.ValidationError: These keys could not be validated: ['species', 'scientist', 'instrument', 'temperature', 'study']
Here is how to create a feature:

  ln.Feature(name='species', dtype='cat ? str').save()
  ln.Feature(name='scientist', dtype='cat ? str').save()
  ln.Feature(name='instrument', dtype='cat ? str').save()
  ln.Feature(name='temperature', dtype='float').save()
  ln.Feature(name='study', dtype='cat ? str').save()

Nothing was validated, so the error tells us to first register the features & labels:

ln.Feature(name="species", dtype="cat").save()
ln.Feature(name="scientist", dtype="cat").save()
ln.Feature(name="instrument", dtype="cat").save()
ln.Feature(name="temperature", dtype="float").save()
ln.Feature(name="study", dtype="cat").save()
species = ln.ULabel.from_values(df["species"].unique(), create=True).save()
authors = ln.ULabel.from_values(df["scientist"].unique(), create=True).save()
instruments = ln.ULabel.from_values(df["instrument"].unique(), create=True).save()

Now everything works:

artifact.features.add_values(
    {
        "species": df.species.unique(),
        "scientist": df.scientist.unique(),
        "instrument": df.instrument.unique(),
        "temperature": 27.6,
        "study": "Study 0: initial plant gathering",
    }
)
artifact.describe()
Artifact 
├── General
│   ├── .uid = 'VMWTlHIqTUe1jq6W0000'
│   ├── .key = 'iris_studies/study0_raw_images'
│   ├── .size = 658465
│   ├── .hash = 'IVKGMfNwi8zKvnpaD_gG7w'
│   ├── .n_files = 51
│   ├── .path = s3://lamindata/iris_studies/study0_raw_images
│   ├── .created_by = anonymous
│   ├── .created_at = 2025-01-20 08:39:36
│   └── .transform = 'Tutorial: Artifacts'
├── Linked features
│   └── instrument                  cat[ULabel]                Leica IIIc Camera
│       scientist                   cat[ULabel]                Barbara McClintock, Edgar Anderson
│       species                     cat[ULabel]                setosa, versicolor, virginica
│       study                       cat[ULabel]                Study 0: initial plant gathering
│       temperature                 float                      27.6
└── Labels
    └── .ulabels                    ULabel                     Study 0: initial plant gathering, setosa…

Because we also re-labeled the artifact with the study label "Study 0: initial plant gathering", it now appears under the study feature.

Retrieve features

artifact.features.get_values()
{'instrument': 'Leica IIIc Camera',
 'scientist': {'Barbara McClintock', 'Edgar Anderson'},
 'species': {'setosa', 'versicolor', 'virginica'},
 'study': 'Study 0: initial plant gathering',
 'temperature': 27.6}

Query by features

artifact = ln.Artifact.features.get(temperature=27.6)
artifact
Artifact(uid='VMWTlHIqTUe1jq6W0000', is_latest=True, key='iris_studies/study0_raw_images', suffix='', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', n_files=51, space_id=1, storage_id=2, run_id=1, created_by_id=1, created_at=2025-01-20 08:39:36 UTC)

Register metadata

Features and labels are the primary ways of registering domain-knowledge related metadata in LaminDB.

Features represent measurement dimensions (e.g. "species") and labels represent measured values (e.g. "iris setosa", "iris versicolor", "iris virginica").

In statistics, you’d say a feature is a categorical or numerical variable while a label is a simple category. Categorical variables draw their values from a set of categories.
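
To make the distinction concrete, here is a minimal sketch in plain Python. It is not LaminDB's implementation: `FeatureSpec` and `validate_values` are hypothetical names that only illustrate how registered features and labels can be used to validate a dictionary of annotations.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureSpec:
    """Hypothetical stand-in for a registered feature (not a LaminDB class)."""

    name: str
    dtype: str  # "cat" for categorical, "float" for numerical
    categories: set = field(default_factory=set)  # registered labels for "cat"


def validate_values(values: dict, registry: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for key, value in values.items():
        spec = registry.get(key)
        if spec is None:
            problems.append(f"unknown feature: {key}")
        elif spec.dtype == "float":
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"{key}: expected a number, got {value!r}")
        elif spec.dtype == "cat":
            # a categorical value may be a single label or several labels
            labels = [value] if isinstance(value, str) else list(value)
            for label in labels:
                if label not in spec.categories:
                    problems.append(f"{key}: unregistered label {label!r}")
    return problems


registry = {
    "species": FeatureSpec("species", "cat", {"setosa", "versicolor", "virginica"}),
    "temperature": FeatureSpec("temperature", "float"),
}

ok = validate_values({"species": ["setosa", "virginica"], "temperature": 27.6}, registry)
bad = validate_values({"species": "setsoa", "temprature": 27.6}, registry)
print(ok)
print(bad)
```

The second call contains a typo in a label and a typo in a feature name, and both are reported instead of being silently stored.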

Register labels

We study 3 species of the Iris plant: setosa, versicolor & virginica. Let’s create 3 labels with ULabel.

ULabel lets you manage an in-house ontology of all kinds of generic labels.

What are alternatives to ULabel?

In a complex project, you’ll likely want dedicated typed registries for selected label types, e.g., Gene, Tissue, etc. See: Manage biological registries.

ULabel, however, will get you quite far and scale to ~1M labels.

Anticipating that we’ll have many different labels when working with more data, we’d like to express that all 3 labels are species labels:

is_species = ln.ULabel(name="is_species").save()
is_species.children.set(species)
is_species.view_parents(with_children=True)
_images/89eef339044f9d633db79a2685fdaf3978383500f387141a24fe65c9ec0359cc.svg
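
The parent-child relationship can be pictured as a small graph. The following is a library-independent sketch, assuming an in-memory dictionary in place of the database; `add_children`, `descendants` and the extra parent `is_label` are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# parent label name -> set of child label names (in-memory stand-in for a registry)
children = defaultdict(set)


def add_children(parent: str, labels: list) -> None:
    """Record that every label in `labels` is a child of `parent`."""
    children[parent].update(labels)


def descendants(parent: str) -> set:
    """Collect all labels below `parent`, following nested parents."""
    found = set()
    stack = [parent]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


add_children("is_species", ["setosa", "versicolor", "virginica"])
add_children("is_label", ["is_species"])  # hypothetical higher-level parent

species_names = sorted(descendants("is_species"))
all_names = sorted(descendants("is_label"))
print(species_names)
print(all_names)
```

Grouping labels under a parent keeps queries such as "all species labels" stable even as many unrelated labels are added.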

Query artifacts by labels

Using the new annotations, you can now query image artifacts by species & study labels:

ln.ULabel.df()
uid name is_type description reference reference_type space_id type_id run_id created_at created_by_id _aux _branch_code
id
8 6SGkyHZ7 is_species None None None None 1 None None 2025-01-20 08:39:44.531000+00:00 1 None 1
7 Uc86foXa Leica IIIc Camera None None None None 1 None None 2025-01-20 08:39:44.421000+00:00 1 None 1
5 0aJCCFxM Barbara McClintock None None None None 1 None None 2025-01-20 08:39:44.417000+00:00 1 None 1
6 Yy9HiheU Edgar Anderson None None None None 1 None None 2025-01-20 08:39:44.417000+00:00 1 None 1
2 WEJlAuCg setosa None None None None 1 None None 2025-01-20 08:39:44.411000+00:00 1 None 1
3 fm35HMeY versicolor None None None None 1 None None 2025-01-20 08:39:44.411000+00:00 1 None 1
4 0Nn9B31h virginica None None None None 1 None None 2025-01-20 08:39:44.411000+00:00 1 None 1
1 8SPJInwm Study 0: initial plant gathering None My initial study None None 1 None None 2025-01-20 08:39:43.557000+00:00 1 None 1
ulabels = ln.ULabel.lookup()
ln.Artifact.get(ulabels=ulabels.study_0_initial_plant_gathering)
Artifact(uid='VMWTlHIqTUe1jq6W0000', is_latest=True, key='iris_studies/study0_raw_images', suffix='', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', n_files=51, space_id=1, storage_id=2, run_id=1, created_by_id=1, created_at=2025-01-20 08:39:36 UTC)

Run an ML model

Let’s now run a mock ML model that transforms the images into 4 high-level features.

def run_ml_model() -> pd.DataFrame:
    artifact.cache()
    output_data = ln.core.datasets.df_iris_in_meter_study1()
    return output_data


transform = ln.Transform(name="Petal & sepal regressor", type="pipeline")
ln.context.track(transform=transform)
df = run_ml_model()
/tmp/ipykernel_3608/1867027381.py:7: FutureWarning: `name` will be removed soon, please pass 'Petal & sepal regressor' to `key` instead
  transform = ln.Transform(name="Petal & sepal regressor", type="pipeline")
 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_gzpBQFrU0tmnO12ob2qd.txt
 created Transform('wqpSKjybuu380000'), started new Run('gzpBQFrU...') at 2025-01-20 08:39:44 UTC
 adding artifact ids [1] as inputs for run 2, adding parent transform 1

The output is a dataframe:

df.head()
   sepal_length  sepal_width  petal_length  petal_width  iris_organism_name
0         0.051        0.035         0.014        0.002              setosa
1         0.049        0.030         0.014        0.002              setosa
2         0.047        0.032         0.013        0.002              setosa
3         0.046        0.031         0.015        0.002              setosa
4         0.050        0.036         0.014        0.002              setosa

And this is the pipeline that produced the dataframe:

ln.context.transform.view_lineage()
_images/2f83e7d2c7e1f8cd0e16d3367f044a6445f83a906b137836a008352e3d358875.svg

Register the output data

Let’s first register the features of the transformed data:

new_features = ln.Feature.from_df(df)
ln.save(new_features)
How to track units of features?

Use the unit field of Feature. In the above example, you’d do:

for feature in new_features:
    if feature.dtype == "float":  # only numerical features carry a unit
        feature.unit = "m"  # meter, the SI base unit of length
        feature.save()
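
Recording a unit only helps if the stored values actually use it. As a sketch under stated assumptions (the raw measurements and the `TO_METER` table below are made up for illustration and are not part of the tutorial data), values could be converted to meters before registration like this:

```python
# Conversion factors from a few length units to meters (SI prefixes).
TO_METER = {"m": 1.0, "cm": 0.01, "mm": 0.001}


def to_meter(value: float, unit: str) -> float:
    """Convert a length given in `unit` to meters; unknown units raise KeyError."""
    return value * TO_METER[unit]


# hypothetical raw measurements with their units recorded next to each feature
raw = {"sepal_length": (5.1, "cm"), "petal_width": (2.0, "mm")}

in_meter = {name: to_meter(value, unit) for name, (value, unit) in raw.items()}
units = {name: "m" for name in in_meter}  # what a `unit` field would record
print(in_meter)
print(units)
```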

We can now validate & register the dataframe in one line:

artifact = ln.Artifact.from_df(
    df,
    description="Iris study 1 - after measuring sepal & petal metrics",
)
artifact.save()
Artifact(uid='I22D5eDGtqEqamdc0000', is_latest=True, description='Iris study 1 - after measuring sepal & petal metrics', suffix='.parquet', kind='dataset', otype='DataFrame', size=4834, hash='wWP9MqUZoqM9uixdeW50tA', space_id=1, storage_id=1, run_id=2, created_by_id=1, created_at=2025-01-20 08:39:45 UTC)

There is one categorical feature, so let’s add the species labels:

features = ln.Feature.lookup()
species_labels = ln.ULabel.filter(parents__name="is_species").all()
artifact.labels.add(species_labels, feature=features.species)
species_labels
<QuerySet [ULabel(uid='WEJlAuCg', name='setosa', created_by_id=1, space_id=1, created_at=2025-01-20 08:39:44 UTC), ULabel(uid='fm35HMeY', name='versicolor', created_by_id=1, space_id=1, created_at=2025-01-20 08:39:44 UTC), ULabel(uid='0Nn9B31h', name='virginica', created_by_id=1, space_id=1, created_at=2025-01-20 08:39:44 UTC)]>

Let’s now add study labels:

artifact.labels.add(ulabels.study_0_initial_plant_gathering, feature=features.study)

This is the context for our artifact:

artifact.describe()
artifact.view_lineage()
Artifact .parquet/DataFrame
├── General
│   ├── .uid = 'I22D5eDGtqEqamdc0000'
│   ├── .size = 4834
│   ├── .hash = 'wWP9MqUZoqM9uixdeW50tA'
│   ├── .path = /home/runner/work/lamin-docs/lamin-docs/docs/lamin-tutorial/.lamindb/I22D5eDGtqEqamdc0000.parquet
│   ├── .created_by = anonymous
│   ├── .created_at = 2025-01-20 08:39:45
│   └── .transform = 'None'
├── Linked features
│   └── species                     cat[ULabel]                setosa, versicolor, virginica
│       study                       cat[ULabel]                Study 0: initial plant gathering
└── Labels
    └── .ulabels                    ULabel                     Study 0: initial plant gathering, setosa…
_images/f72beeb434bdfc8bcb49c61fb7dad3487852e9a954b77db4d835d0c2a4e0dc14.svg

See the database content:

ln.view(registries=["Feature", "ULabel"])
uid name dtype is_type unit description array_rank array_size array_shape proxy_dtype synonyms _expect_many _curation space_id type_id run_id created_at created_by_id _aux _branch_code
id
6 djTIMoxFCwmd sepal_length float None None None 0 0 None None None True None 1 None 2.0 2025-01-20 08:39:45.703000+00:00 1 None 1
7 zpfK4nBACjex sepal_width float None None None 0 0 None None None True None 1 None 2.0 2025-01-20 08:39:45.703000+00:00 1 None 1
8 wet1CMIhe7jP petal_length float None None None 0 0 None None None True None 1 None 2.0 2025-01-20 08:39:45.703000+00:00 1 None 1
9 uPSDP1wvS2rN petal_width float None None None 0 0 None None None True None 1 None 2.0 2025-01-20 08:39:45.703000+00:00 1 None 1
10 zuWsTcqgqfW6 iris_organism_name str None None None 0 0 None None None True None 1 None 2.0 2025-01-20 08:39:45.703000+00:00 1 None 1
5 A90MQ63Pa1qa study cat[ULabel] None None None 0 0 None None None True None 1 None NaN 2025-01-20 08:39:44.402000+00:00 1 None 1
4 FoP3qkRiAS2v temperature float None None None 0 0 None None None True None 1 None NaN 2025-01-20 08:39:44.399000+00:00 1 None 1
uid name is_type description reference reference_type space_id type_id run_id created_at created_by_id _aux _branch_code
id
8 6SGkyHZ7 is_species None None None None 1 None None 2025-01-20 08:39:44.531000+00:00 1 None 1
7 Uc86foXa Leica IIIc Camera None None None None 1 None None 2025-01-20 08:39:44.421000+00:00 1 None 1
5 0aJCCFxM Barbara McClintock None None None None 1 None None 2025-01-20 08:39:44.417000+00:00 1 None 1
6 Yy9HiheU Edgar Anderson None None None None 1 None None 2025-01-20 08:39:44.417000+00:00 1 None 1
2 WEJlAuCg setosa None None None None 1 None None 2025-01-20 08:39:44.411000+00:00 1 None 1
3 fm35HMeY versicolor None None None None 1 None None 2025-01-20 08:39:44.411000+00:00 1 None 1
4 0Nn9B31h virginica None None None None 1 None None 2025-01-20 08:39:44.411000+00:00 1 None 1

This is it! 😅

If you’re interested, please check out guides & use cases or open an issue on GitHub to discuss.

Appendix

Manage metadata

Avoid duplicates

Let’s create a label "project1":

ln.ULabel(name="project1").save()
ULabel(uid='rEkp2KNb', name='project1', created_by_id=1, run_id=2, space_id=1, created_at=2025-01-20 08:39:45 UTC)

We just created a project1 label. Let’s see what happens if we try to create it again:

label = ln.ULabel(name="project1")
label.save()
ULabel(uid='rEkp2KNb', name='project1', created_by_id=1, run_id=2, space_id=1, created_at=2025-01-20 08:39:45 UTC)

Instead of creating a new record, LaminDB loads and returns the existing record from the database.

If there is no exact match, LaminDB warns you about potential duplicates when you create a record.

Say, we spell “project 1” with a white space:

ln.ULabel(name="project 1")
uid name is_type description reference reference_type space_id type_id run_id created_at created_by_id _aux _branch_code
id
9 rEkp2KNb project1 None None None None 1 None 2 2025-01-20 08:39:45.852000+00:00 1 None 1
ULabel(uid='Fr9Uc73F', name='project 1', created_by_id=1, run_id=2, space_id=1, created_at=<django.db.models.expressions.DatabaseDefault object at 0x7fcc8ac6ce60>)

To avoid inserting duplicates, LaminDB searches for similar existing records whenever you create a new one.

You can switch this search off for performance gains via search_names.
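
The idea behind such a similarity search can be sketched with Python's standard library. This is an illustration only and makes no claim about how LaminDB implements its search; `find_similar` and the 0.8 cutoff are assumptions chosen for the example.

```python
import difflib


def find_similar(name: str, existing: list, cutoff: float = 0.8) -> list:
    """Return existing names that closely resemble `name`, excluding exact matches."""
    candidates = [n for n in existing if n != name]
    return difflib.get_close_matches(name, candidates, n=3, cutoff=cutoff)


existing_names = ["project1", "setosa", "versicolor", "virginica"]

similar = find_similar("project 1", existing_names)
unrelated = find_similar("Leica IIIc Camera", existing_names)
print(similar)
print(unrelated)
```

A near-duplicate such as "project 1" is flagged against "project1", while a clearly different name yields no candidates. Every new name is compared against the existing ones, which is why skipping the search can speed up bulk record creation.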

Update & delete records

label = ln.ULabel.filter(name="project1").first()
label
ULabel(uid='rEkp2KNb', name='project1', created_by_id=1, run_id=2, space_id=1, created_at=2025-01-20 08:39:45 UTC)
label.name = "project1a"
label.save()
label
ULabel(uid='rEkp2KNb', name='project1a', created_by_id=1, run_id=2, space_id=1, created_at=2025-01-20 08:39:45 UTC)
label.delete()

Manage storage

Change default storage

The default storage location is:

ln.settings.storage
StorageSettings(root='/home/runner/work/lamin-docs/lamin-docs/docs/lamin-tutorial', uid='3jKQ8on6rE2m')

You can change it by setting ln.settings.storage = "s3://my-bucket".

See all storage locations

ln.Storage.df()
uid root description type region instance_uid space_id run_id created_at created_by_id _aux _branch_code
id
2 kzfMvazaTiZ1 s3://lamindata None s3 us-east-1 None 1 None 2025-01-20 08:39:35.967000+00:00 1 None 1
1 3jKQ8on6rE2m /home/runner/work/lamin-docs/lamin-docs/docs/l... None local None 5WuFt3cW4zRx 1 None 2025-01-20 08:39:30.836000+00:00 1 None 1
# clean up what we wrote in this notebook
!rm -r lamin-tutorial
!lamin delete --force lamin-tutorial