Tutorial: Features & labels

In Tutorial: Artifacts, we learned about provenance and data access. Here, we walk through validating & annotating datasets with features & labels to improve:

  1. Findability: Which collections measured expression of cell marker CD14? Which characterized cell line K562? Which collections have a test & train split? Etc.

  2. Usability: Are there typos in feature names? Are there typos in sampled labels? Are types and units of features consistent? Etc.

Hint

This is a low-level tutorial aimed at a basic understanding of registering features and labels for annotation & validation.

If you’re just looking to readily validate and annotate a dataset with features and labels, see this guide: Validate, standardize & annotate.

import lamindb as ln
import pandas as pd
import pytest

ln.settings.verbosity = "hint"
💡 connected lamindb: anonymous/lamin-tutorial

TLDR

Annotate by labels

# create a label
study0 = ln.ULabel(name="Study 0: initial plant gathering", description="My initial study").save()
# query an artifact from the previous tutorial
artifact = ln.Artifact.filter(key="iris_studies/study0_raw_images").one()
# label the artifact
artifact.labels.add(study0)
# look at artifact metadata
artifact.describe()
Artifact(uid='OmqbNZEDH1iKYW71nXD8', key='iris_studies/study0_raw_images', suffix='', type='dataset', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', hash_type='md5-d', n_objects=51, visibility=1, key_is_virtual=False, updated_at='2024-06-19 23:05:13 UTC')
  Provenance
    .created_by = 'anonymous'
    .storage = 's3://lamindata'
    .transform = 'Tutorial: Artifacts'
    .run = '2024-06-19 23:05:11 UTC'
  Labels
    .ulabels = 'Study 0: initial plant gathering'

Annotate by features

Features are buckets for labels, numbers and other data types.

Often, data that you want to ingest comes with metadata.

Here, three metadata features were collected: species, scientist and instrument.

df = pd.read_csv(artifact.path / "meta.csv", index_col=0)
df.head()
   species     file_name                                          scientist           instrument
0  setosa      iris-0797945218a97d6e5251b4758a2ba1b418cbd52ce...  Barbara McClintock  Leica IIIc Camera
1  versicolor  iris-0f133861ea3fe1b68f9f1b59ebd9116ff963ee710...  Edgar Anderson      Leica IIIc Camera
2  versicolor  iris-9ffe51c2abd973d25a299647fa9ccaf6aa9c8eecf...  Edgar Anderson      Leica IIIc Camera
3  setosa      iris-83f433381b755101b9fc9fbc9743e35fbb8a1a109...  Edgar Anderson      Leica IIIc Camera
4  virginica   iris-bdae8314e4385d8e2322abd8e63a82758a9063c77...  Edgar Anderson      Leica IIIc Camera

There are only a few values for features species, scientist & instrument, and we’d like to label the artifact with these values:

df.nunique()
species        3
file_name     50
scientist      2
instrument     1
dtype: int64
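
To make the idea concrete, here is a minimal pure-Python sketch of the same cardinality check. The rows are hypothetical stand-ins for meta.csv (the file names are placeholders), and the threshold of 3 distinct values is an assumption chosen for illustration:

```python
# Hypothetical rows shaped like meta.csv; the file names are placeholders.
rows = [
    {"species": "setosa", "file_name": "img-0.jpg", "scientist": "Barbara McClintock", "instrument": "Leica IIIc Camera"},
    {"species": "versicolor", "file_name": "img-1.jpg", "scientist": "Edgar Anderson", "instrument": "Leica IIIc Camera"},
    {"species": "versicolor", "file_name": "img-2.jpg", "scientist": "Edgar Anderson", "instrument": "Leica IIIc Camera"},
    {"species": "virginica", "file_name": "img-3.jpg", "scientist": "Edgar Anderson", "instrument": "Leica IIIc Camera"},
]


def count_unique(rows):
    """Count distinct values per column, similar in spirit to df.nunique()."""
    seen = {}
    for row in rows:
        for column, value in row.items():
            seen.setdefault(column, set()).add(value)
    return {column: len(values) for column, values in seen.items()}


def label_candidates(rows, max_unique=3):
    """Return the columns that have at most max_unique distinct values."""
    return [column for column, n in count_unique(rows).items() if n <= max_unique]


candidates = label_candidates(rows)
print(candidates)
```

In this sketch, file_name is excluded because every row has its own value, which makes it a poor label.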

Let’s annotate the artifact with features & values and also add in a temperature measurement that Barbara & Edgar had forgotten to add to their csv:

with pytest.raises(ln.core.exceptions.ValidationError) as error:
    artifact.features.add_values({"species": df.species.unique(), "scientist": df.scientist.unique(), "instrument": df.instrument.unique(), "temperature": 27.6, "study": "Study 0: initial plant gathering"})
print(error.exconly())
lamindb.core.exceptions.ValidationError: These keys could not be validated: ['species', 'scientist', 'instrument', 'temperature', 'study']
Here is how to create a feature:

  ln.Feature(name='species', dtype='cat').save()
  ln.Feature(name='scientist', dtype='cat').save()
  ln.Feature(name='instrument', dtype='cat').save()
  ln.Feature(name='temperature', dtype='float').save()
  ln.Feature(name='study', dtype='cat[ULabel]').save()

Nothing could be validated, so the error tells us to register the features first. Let’s register features & labels:

ln.Feature(name='species', dtype='cat[ULabel]').save()
ln.Feature(name='scientist', dtype='cat[ULabel]').save()
ln.Feature(name='instrument', dtype='cat[ULabel]').save()
ln.Feature(name='study', dtype='cat[ULabel]').save()
ln.Feature(name='temperature', dtype='float').save()
species = ln.ULabel.from_values(df['species'].unique(), create=True)
ln.save(species)
authors = ln.ULabel.from_values(df['scientist'].unique(), create=True)
ln.save(authors)
instruments = ln.ULabel.from_values(df['instrument'].unique(), create=True)
ln.save(instruments)
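
Conceptually, validation checks every key against the feature registry before accepting any value. Below is a minimal sketch of that idea, not LaminDB's implementation: registered_features, add_values and this ValidationError class are hypothetical stand-ins defined only for illustration.

```python
class ValidationError(Exception):
    """Stand-in for a validation failure; not LaminDB's own exception class."""


# Hypothetical in-memory registry: feature name -> dtype
registered_features = {}


def add_values(annotations, values):
    """Accept values only if every key is a registered feature."""
    unvalidated = [key for key in values if key not in registered_features]
    if unvalidated:
        raise ValidationError(f"These keys could not be validated: {unvalidated}")
    annotations.update(values)


annotations = {}
values = {"species": ["setosa"], "temperature": 27.6}

# First attempt: nothing is registered yet, so validation fails
try:
    add_values(annotations, values)
except ValidationError as error:
    message = str(error)

# Register the features, then the same call succeeds
registered_features.update({"species": "cat", "temperature": "float"})
add_values(annotations, values)
```

The all-or-nothing check mirrors the behavior seen above: a single unregistered key is enough to reject the whole call.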

Now everything works:

artifact.features.add_values({"species": df.species.unique(), "scientist": df.scientist.unique(), "instrument": df.instrument.unique(), "temperature": 27.6, "study": "Study 0: initial plant gathering"})
artifact.describe()
Artifact(uid='OmqbNZEDH1iKYW71nXD8', key='iris_studies/study0_raw_images', suffix='', type='dataset', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', hash_type='md5-d', n_objects=51, visibility=1, key_is_virtual=False, updated_at='2024-06-19 23:05:13 UTC')
  Provenance
    .created_by = 'anonymous'
    .storage = 's3://lamindata'
    .transform = 'Tutorial: Artifacts'
    .run = '2024-06-19 23:05:11 UTC'
  Labels
    .ulabels = 'Study 0: initial plant gathering', 'setosa', 'versicolor', 'virginica', 'Barbara McClintock', 'Edgar Anderson', 'Leica IIIc Camera'
  Features
    'study' = 'Study 0: initial plant gathering'
    'species' = 'setosa', 'versicolor', 'virginica'
    'scientist' = 'Barbara McClintock', 'Edgar Anderson'
    'instrument' = 'Leica IIIc Camera'
    'temperature' = 27.6

Because we also re-annotated with the study label 'Study 0: initial plant gathering', it now appears under the study feature.

Retrieve features

artifact.features.get_values()
{'study': 'Study 0: initial plant gathering',
 'species': ['setosa', 'versicolor', 'virginica'],
 'scientist': ['Barbara McClintock', 'Edgar Anderson'],
 'instrument': 'Leica IIIc Camera',
 'temperature': 27.6}

Query by features

artifact = ln.Artifact.features.filter(temperature=27.6).one()
artifact
Artifact(uid='OmqbNZEDH1iKYW71nXD8', key='iris_studies/study0_raw_images', suffix='', type='dataset', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', hash_type='md5-d', n_objects=51, visibility=1, key_is_virtual=False, created_by_id=1, storage_id=2, transform_id=1, run_id=1, updated_at='2024-06-19 23:05:13 UTC')
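
A query by features can be thought of as matching stored feature values against criteria. Here is a rough sketch, assuming a hypothetical in-memory store (the second artifact key and its temperature are made up, and this is not how LaminDB executes queries):

```python
# Hypothetical annotation store: artifact key -> feature values
store = {
    "iris_studies/study0_raw_images": {
        "temperature": 27.6,
        "study": "Study 0: initial plant gathering",
    },
    "some/other_artifact": {"temperature": 21.0},  # made-up entry for contrast
}


def filter_by_features(store, **criteria):
    """Return keys of artifacts whose feature values match all criteria."""
    return [
        key
        for key, values in store.items()
        if all(name in values and values[name] == expected for name, expected in criteria.items())
    ]


matches = filter_by_features(store, temperature=27.6)
print(matches)
```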

Register metadata

Features and labels are the primary ways of registering domain-knowledge related metadata in LaminDB.

Features represent measurement dimensions (e.g. "species") and labels represent measured values (e.g. "iris setosa", "iris versicolor", "iris virginica").

In statistics, you’d say a feature is a categorical or numerical variable while a label is a simple category. Categorical variables draw their values from a set of categories.
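
The distinction can be sketched in plain Python. This is an illustration under assumptions, not LaminDB's data model: the Feature class below is a hypothetical stand-in that supports only the dtypes "cat" and "float".

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """A variable: categorical features carry the set of labels they allow."""

    name: str
    dtype: str  # "cat" or "float" in this sketch
    categories: set = field(default_factory=set)  # only used when dtype == "cat"

    def validate(self, value) -> bool:
        if self.dtype == "cat":
            return value in self.categories
        if self.dtype == "float":
            return isinstance(value, float)
        return False


species = Feature("species", "cat", {"setosa", "versicolor", "virginica"})
temperature = Feature("temperature", "float")

print(species.validate("setosa"))  # a known label
print(species.validate("setsoa"))  # a typo that is not in the set of categories
```

A typo in a label fails validation because it is not among the categories the feature draws from.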

Register labels

We study 3 species of the Iris plant: setosa, versicolor & virginica. Let’s create 3 labels with ULabel.

ULabel lets you maintain an in-house ontology for all kinds of generic labels.

What are alternatives to ULabel?

In a complex project, you’ll likely want dedicated typed registries for selected label types, e.g., Gene, Tissue, etc. See: Manage biological registries.

ULabel, however, will get you quite far and scale to ~1M labels.

Anticipating that we’ll have many different labels when working with more data, we’d like to express that all 3 labels are species labels:

is_species = ln.ULabel(name="is_species").save()
is_species.children.set(species)
is_species.view_parents(with_children=True)
_images/2d11b613fde5a27f189a610e93ec72e5381ee76b8fda3791f0c46cba03b1efcc.svg
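
The parent-child relationship can be pictured with a small sketch. The Label class and with_parent helper below are hypothetical and only model the idea of grouping labels under a parent:

```python
class Label:
    """Minimal label node that knows its parents and children."""

    def __init__(self, name):
        self.name = name
        self.parents = []
        self.children = []

    def set_children(self, labels):
        """Replace the current children and keep the back-references in sync."""
        for child in self.children:
            child.parents.remove(self)
        self.children = list(labels)
        for child in self.children:
            child.parents.append(self)


def with_parent(labels, parent_name):
    """Return names of labels that have a parent with the given name."""
    return [
        label.name
        for label in labels
        if any(parent.name == parent_name for parent in label.parents)
    ]


species = [Label(name) for name in ("setosa", "versicolor", "virginica")]
is_species = Label("is_species")
is_species.set_children(species)

print(with_parent(species + [is_species], "is_species"))
```

Grouping under a parent is what later makes it possible to retrieve all species labels with a single query on the parent's name.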

Query artifacts by labels

Using the new annotations, you can now query image artifacts by species & study labels:

ln.ULabel.df()
id  uid       name                              description       reference  reference_type  run_id  created_by_id  updated_at
8   MxTwuvNv  is_species                        None              None       None            None    1              2024-06-19 23:05:20.579220+00:00
7   P0xknhSm  Leica IIIc Camera                 None              None       None            None    1              2024-06-19 23:05:20.463575+00:00
6   FsTwsZTz  Edgar Anderson                    None              None       None            None    1              2024-06-19 23:05:20.457250+00:00
5   XYhRSeBw  Barbara McClintock                None              None       None            None    1              2024-06-19 23:05:20.457125+00:00
4   SIwAza0v  virginica                         None              None       None            None    1              2024-06-19 23:05:20.447254+00:00
3   Fp8x2R8d  versicolor                        None              None       None            None    1              2024-06-19 23:05:20.447141+00:00
2   LKPsZ8FX  setosa                            None              None       None            None    1              2024-06-19 23:05:20.447019+00:00
1   k5OXiLhF  Study 0: initial plant gathering  My initial study  None       None            None    1              2024-06-19 23:05:19.842290+00:00
ulabels = ln.ULabel.lookup()
ln.Artifact.filter(ulabels=ulabels.study_0_initial_plant_gathering).one()
Artifact(uid='OmqbNZEDH1iKYW71nXD8', key='iris_studies/study0_raw_images', suffix='', type='dataset', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', hash_type='md5-d', n_objects=51, visibility=1, key_is_virtual=False, created_by_id=1, storage_id=2, transform_id=1, run_id=1, updated_at='2024-06-19 23:05:13 UTC')

Run an ML model

Let’s now run a mock ML model that transforms the images into 4 high-level features.

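def run_ml_model() -> pd.DataFrame:
    # cache the images locally; a real model would read them from this directory
    image_file_dir = artifact.cache()
    # the mock model returns an example dataframe instead of running inference
    output_data = ln.core.datasets.df_iris_in_meter_study1()
    return output_data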

transform = ln.Transform(name="Petal & sepal regressor", type="pipeline")
run = ln.track(transform=transform)
df = run_ml_model()
💡 saved: Transform(uid='w9shEd90Iqo04Mj2', name='Petal & sepal regressor', type='pipeline', created_by_id=1, updated_at='2024-06-19 23:05:20 UTC')
💡 saved: Run(uid='gE5A0vdnM0u76lqs8wzw', transform_id=2, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_gE5A0vdnM0u76lqs8wzw.txt
💡 adding artifact ids [1] as inputs for run 2, adding parent transform 1

The output is a dataframe:

df.head()
   sepal_length  sepal_width  petal_length  petal_width  iris_organism_name
0  0.051         0.035        0.014         0.002        setosa
1  0.049         0.030        0.014         0.002        setosa
2  0.047         0.032        0.013         0.002        setosa
3  0.046         0.031        0.015         0.002        setosa
4  0.050         0.036        0.014         0.002        setosa

And this is the pipeline that produced the dataframe:

run
Run(uid='gE5A0vdnM0u76lqs8wzw', started_at='2024-06-19 23:05:20 UTC', is_consecutive=True, transform_id=2, created_by_id=1)
run.transform.view_parents()
_images/472b41e7c7f8c1bba8ce5d8ccfe871928dafe113ad9689f0ff58f19820ac98b1.svg

Register the output data

Let’s first register the features of the transformed data:

new_features = ln.Feature.from_df(df)
ln.save(new_features)
How to track units of features?

Use the unit field of Feature. In the above example, you’d do:

for feature in new_features:
    if feature.dtype == "float":
        feature.unit = "m"  # meter, the SI unit of length
        feature.save()
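
Recording the unit lets downstream code convert values explicitly instead of guessing. Here is a minimal sketch, assuming a hypothetical table of scale factors relative to the meter (the table and the convert helper are not part of LaminDB):

```python
# Hypothetical scale factors: how many meters one unit represents.
METERS_PER_UNIT = {"m": 1.0, "cm": 0.01, "mm": 0.001}


def convert(value, from_unit, to_unit):
    """Convert a length between two units listed in METERS_PER_UNIT."""
    return value * METERS_PER_UNIT[from_unit] / METERS_PER_UNIT[to_unit]


# a sepal length of 0.051 m, expressed in centimeters
sepal_length_cm = convert(0.051, "m", "cm")
print(round(sepal_length_cm, 3))
```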

We can now validate & register the dataframe in one line:

artifact = ln.Artifact.from_df(
    df,
    description="Iris study 1 - after measuring sepal & petal metrics",
)
artifact.save()
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/eNJPNTDc8jdcO65BDlMO.parquet')
✅ storing artifact 'eNJPNTDc8jdcO65BDlMO' at '/home/runner/work/lamindb/lamindb/docs/lamin-tutorial/.lamindb/eNJPNTDc8jdcO65BDlMO.parquet'
Artifact(uid='eNJPNTDc8jdcO65BDlMO', description='Iris study 1 - after measuring sepal & petal metrics', suffix='.parquet', type='dataset', accessor='DataFrame', size=5347, hash='zMBDnOFHeA8CwpaI_7KF9g', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=2, run_id=2, updated_at='2024-06-19 23:05:22 UTC')

There is one categorical feature. Let’s add the species labels:

features = ln.Feature.lookup()
species_labels = ln.ULabel.filter(parents__name="is_species").all()
artifact.labels.add(species_labels, feature=features.species)
species_labels
<QuerySet [ULabel(uid='LKPsZ8FX', name='setosa', created_by_id=1, updated_at='2024-06-19 23:05:20 UTC'), ULabel(uid='Fp8x2R8d', name='versicolor', created_by_id=1, updated_at='2024-06-19 23:05:20 UTC'), ULabel(uid='SIwAza0v', name='virginica', created_by_id=1, updated_at='2024-06-19 23:05:20 UTC')]>

Let’s now add study labels:

artifact.labels.add(ulabels.study_0_initial_plant_gathering, feature=features.study)

This is the context for our artifact:

artifact.describe()
artifact.view_lineage()
Artifact(uid='eNJPNTDc8jdcO65BDlMO', description='Iris study 1 - after measuring sepal & petal metrics', suffix='.parquet', type='dataset', accessor='DataFrame', size=5347, hash='zMBDnOFHeA8CwpaI_7KF9g', hash_type='md5', visibility=1, key_is_virtual=True, updated_at='2024-06-19 23:05:22 UTC')
  Provenance
    .created_by = 'anonymous'
    .storage = '/home/runner/work/lamindb/lamindb/docs/lamin-tutorial'
    .transform = 'Petal & sepal regressor'
    .run = '2024-06-19 23:05:20 UTC'
  Labels
    .ulabels = 'Study 0: initial plant gathering', 'setosa', 'versicolor', 'virginica'
  Features
    'species' = 'setosa', 'versicolor', 'virginica'
    'study' = 'Study 0: initial plant gathering'
_images/5844e74c0d6774dbcafba872774a21d20d0a962b408341e6dc3b5c0803928b6b.svg

See the database content:

ln.view(registries=["Feature", "ULabel"])
Feature
id  uid           name                dtype        unit  description  synonyms  run_id  created_by_id  updated_at
10  NPFMM8x1NejN  iris_organism_name  cat          None  None         None      2.0     1              2024-06-19 23:05:22.138586+00:00
9   CkGsXELUqIES  petal_width         float        None  None         None      2.0     1              2024-06-19 23:05:22.138462+00:00
8   h1HJXqad3V0b  petal_length        float        None  None         None      2.0     1              2024-06-19 23:05:22.138336+00:00
7   xOCroLBFcxcb  sepal_width         float        None  None         None      2.0     1              2024-06-19 23:05:22.138208+00:00
6   pAbJvJXImfOt  sepal_length        float        None  None         None      2.0     1              2024-06-19 23:05:22.138065+00:00
5   Yl82y0BDx7jb  temperature         float        None  None         None      NaN     1              2024-06-19 23:05:20.435099+00:00
4   QG0mBvGF56j5  study               cat[ULabel]  None  None         None      NaN     1              2024-06-19 23:05:20.429859+00:00
ULabel
id  uid       name                description  reference  reference_type  run_id  created_by_id  updated_at
8   MxTwuvNv  is_species          None         None       None            None    1              2024-06-19 23:05:20.579220+00:00
7   P0xknhSm  Leica IIIc Camera   None         None       None            None    1              2024-06-19 23:05:20.463575+00:00
6   FsTwsZTz  Edgar Anderson      None         None       None            None    1              2024-06-19 23:05:20.457250+00:00
5   XYhRSeBw  Barbara McClintock  None         None       None            None    1              2024-06-19 23:05:20.457125+00:00
4   SIwAza0v  virginica           None         None       None            None    1              2024-06-19 23:05:20.447254+00:00
3   Fp8x2R8d  versicolor          None         None       None            None    1              2024-06-19 23:05:20.447141+00:00
2   LKPsZ8FX  setosa              None         None       None            None    1              2024-06-19 23:05:20.447019+00:00

This is it! 😅

If you’re interested, please check out guides & use cases or open an issue on GitHub to discuss.

Appendix

Manage metadata

Avoid duplicates

Let’s create a label "project1":

ln.ULabel(name="project1").save()
ULabel(uid='7gpI0Buz', name='project1', created_by_id=1, run_id=2, updated_at='2024-06-19 23:05:22 UTC')

We already created a project1 label before. Let’s see what happens if we try to create it again:

label = ln.ULabel(name="project1")
label.save()
💡 returning existing ULabel record with same name: 'project1'
ULabel(uid='7gpI0Buz', name='project1', created_by_id=1, run_id=2, updated_at='2024-06-19 23:05:22 UTC')

Instead of creating a new record, LaminDB loads and returns the existing record from the database.

If there is no exact match, LaminDB warns you about potential duplicates when you create a record.

Say we spell “project 1” with a space:

ln.ULabel(name="project 1")
❗ record with similar name exists! did you mean to load it?
id  uid       name      description  reference  reference_type  run_id  created_by_id  updated_at
9   7gpI0Buz  project1  None         None       None            2       1              2024-06-19 23:05:22.340619+00:00
ULabel(uid='hrLGtFd5', name='project 1', created_by_id=1, run_id=2)

To avoid inserting duplicates, LaminDB searches for similar existing records whenever you create a new one.

You can switch it off for performance gains via upon_create_search_names.
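
The two behaviors above (reusing an exact match, warning about a near match) can be sketched with Python's standard library. This is an illustration only: check_name and its cutoff of 0.8 are assumptions, and LaminDB's own search may work differently.

```python
import difflib


def check_name(name, existing_names, cutoff=0.8):
    """Classify a new name as 'exact', 'similar' or 'new'.

    cutoff is an assumed similarity threshold between 0 and 1.
    """
    if name in existing_names:
        return "exact", [name]  # reuse the existing record
    similar = difflib.get_close_matches(name, existing_names, n=3, cutoff=cutoff)
    if similar:
        return "similar", similar  # warn about potential duplicates
    return "new", []


existing = ["project1"]
for candidate in ("project1", "project 1", "zebrafish"):
    print(candidate, check_name(candidate, existing))
```

The fuzzy comparison runs against every existing name, which is why skipping the search can pay off for large registries.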

Update & delete records

label = ln.ULabel.filter(name="project1").first()
label
ULabel(uid='7gpI0Buz', name='project1', created_by_id=1, run_id=2, updated_at='2024-06-19 23:05:22 UTC')
label.name = "project1a"
label.save()
label
ULabel(uid='7gpI0Buz', name='project1a', created_by_id=1, run_id=2, updated_at='2024-06-19 23:05:22 UTC')
label.delete()
(1, {'lnschema_core.ULabel': 1})

Manage storage

Change default storage

The default storage location is:

ln.settings.storage
PosixUPath('/home/runner/work/lamindb/lamindb/docs/lamin-tutorial')

You can change it by setting ln.settings.storage = "s3://my-bucket".

See all storage locations

ln.Storage.df()
id  uid           root                                               description  type   region     instance_uid  run_id  created_by_id  updated_at
2   YmV3ZoHv      s3://lamindata                                     None         s3     us-east-1  4XIuR0tvaiXM  None    1              2024-06-19 23:05:13.589893+00:00
1   bj7sgUmyGH1N  /home/runner/work/lamindb/lamindb/docs/lamin-t...  None         local  None       5WuFt3cW4zRx  None    1              2024-06-19 23:05:09.398113+00:00
# clean up what we wrote in this notebook
!rm -r lamin-tutorial
!lamin delete --force lamin-tutorial
❗ calling anonymously, will miss private instances
💡 deleting instance anonymous/lamin-tutorial