Tutorial: Features & labels

In Tutorial: Artifacts, we learned about provenance and data access. Here, we walk through validating & annotating datasets with features & labels to improve:

  1. Findability: Which collections measured expression of cell marker CD14? Which characterized cell line K562? Which collections have a test & train split? Etc.

  2. Usability: Are there typos in feature names? Are there typos in sampled labels? Are types and units of features consistent? Etc.
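To make the usability questions concrete, here is a minimal plain-Python sketch of a typo check against a set of known labels. It is independent of LaminDB: `known_species`, `suggest_corrections` and the `cutoff` value are hypothetical and chosen only for illustration.

```python
import difflib

# Hypothetical vocabulary of accepted labels (an assumption of this sketch).
known_species = ["setosa", "versicolor", "virginica"]


def suggest_corrections(values, vocabulary, cutoff=0.8):
    """Map each value missing from the vocabulary to its closest known label, or None."""
    suggestions = {}
    for value in values:
        if value in vocabulary:
            continue  # exact match, nothing to fix
        matches = difflib.get_close_matches(value, vocabulary, n=1, cutoff=cutoff)
        suggestions[value] = matches[0] if matches else None
    return suggestions


observed = ["setosa", "versicolour", "virginica", "tulip"]
print(suggest_corrections(observed, known_species))
```

A value that maps to a known label is likely a typo, while a value that maps to None is probably a new category that needs a decision.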

Hint

This is a low-level tutorial aimed at a basic understanding of registering features and labels for annotation & validation.

If you’re just looking to readily validate and annotate a dataset with features and labels, see this guide: Curate datasets.

import lamindb as ln
import pandas as pd
import pytest

ln.settings.verbosity = "hint"
 connected lamindb: anonymous/lamin-tutorial

TLDR

Annotate by labels

# create a label
study0 = ln.ULabel(name="Study 0: initial plant gathering", description="My initial study").save()
# query an artifact from the previous tutorial
artifact = ln.Artifact.get(key="iris_studies/study0_raw_images")
# label the artifact
artifact.labels.add(study0)
# look at artifact metadata
artifact.describe()
Artifact 
├── General
│   ├── .uid = 'GVswddJr58LWPj9A0000'
│   ├── .key = iris_studies/study0_raw_images
│   ├── .size = 658465
│   ├── .hash = 'IVKGMfNwi8zKvnpaD_gG7w'
│   ├── .n_objects = 51
│   ├── .path = s3://lamindata/iris_studies/study0_raw_images
│   ├── .created_by = anonymous
│   ├── .created_at = 2024-12-21 08:21:54
│   └── .transform = 'Tutorial: Artifacts'
└── Labels
    └── .ulabels                    ULabel                     Study 0: initial plant gathering         

Annotate by features

Features are buckets for labels, numbers and other data types.

Often, data that you want to ingest comes with metadata.

Here, three metadata features were collected: species, scientist and instrument.

df = pd.read_csv(artifact.path / "meta.csv", index_col=0)
df.head()
   species     file_name                                          scientist           instrument
0  setosa      iris-0797945218a97d6e5251b4758a2ba1b418cbd52ce...  Barbara McClintock  Leica IIIc Camera
1  versicolor  iris-0f133861ea3fe1b68f9f1b59ebd9116ff963ee710...  Edgar Anderson      Leica IIIc Camera
2  versicolor  iris-9ffe51c2abd973d25a299647fa9ccaf6aa9c8eecf...  Edgar Anderson      Leica IIIc Camera
3  setosa      iris-83f433381b755101b9fc9fbc9743e35fbb8a1a109...  Edgar Anderson      Leica IIIc Camera
4  virginica   iris-bdae8314e4385d8e2322abd8e63a82758a9063c77...  Edgar Anderson      Leica IIIc Camera

There are only a few values for features species, scientist & instrument, and we’d like to label the artifact with these values:

df.nunique()
species        3
file_name     50
scientist      2
instrument     1
dtype: int64
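The reasoning behind the cardinality check can be sketched without pandas: count the distinct values per column and treat low-cardinality columns as candidates for label annotation. The helper `categorical_candidates`, the threshold `max_categories` and the file names in the rows are made up for this sketch.

```python
def categorical_candidates(rows, max_categories=5):
    """Return columns whose number of distinct values is at most max_categories."""
    distinct = {}
    for row in rows:
        for column, value in row.items():
            distinct.setdefault(column, set()).add(value)
    return {
        column: sorted(values)
        for column, values in distinct.items()
        if len(values) <= max_categories
    }


# A few made-up rows shaped like meta.csv (file names are placeholders).
rows = [
    {"species": "setosa", "file_name": "iris-a.jpg", "scientist": "Barbara McClintock"},
    {"species": "versicolor", "file_name": "iris-b.jpg", "scientist": "Edgar Anderson"},
    {"species": "versicolor", "file_name": "iris-c.jpg", "scientist": "Edgar Anderson"},
    {"species": "setosa", "file_name": "iris-d.jpg", "scientist": "Edgar Anderson"},
]
candidates = categorical_candidates(rows, max_categories=2)
print(candidates)
```

Here file_name is excluded because every row has its own value, which makes it an identifier rather than a category.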

Let’s try to annotate the artifact with features & values, and add a temperature measurement that Barbara & Edgar forgot to record in their CSV:

with pytest.raises(ln.core.exceptions.ValidationError) as error:
    artifact.features.add_values({"species": df.species.unique(), "scientist": df.scientist.unique(), "instrument": df.instrument.unique(), "temperature": 27.6, "study": "Study 0: initial plant gathering"})
print(error.exconly())
lamindb.core.exceptions.ValidationError: These keys could not be validated: ['species', 'scientist', 'instrument', 'temperature', 'study']
Here is how to create a feature:

  ln.Feature(name='species', dtype='cat ? str').save()
  ln.Feature(name='scientist', dtype='cat ? str').save()
  ln.Feature(name='instrument', dtype='cat ? str').save()
  ln.Feature(name='temperature', dtype='float').save()
  ln.Feature(name='study', dtype='cat ? str').save()

None of the keys could be validated, so the error tells us to register the features first. Let’s register the features and the labels:

ln.Feature(name='species', dtype='cat').save()
ln.Feature(name='scientist', dtype='cat').save()
ln.Feature(name='instrument', dtype='cat').save()
ln.Feature(name='temperature', dtype='float').save()
ln.Feature(name='study', dtype='cat').save()
species = ln.ULabel.from_values(df['species'].unique(), create=True).save()
authors = ln.ULabel.from_values(df['scientist'].unique(), create=True).save()
instruments = ln.ULabel.from_values(df['instrument'].unique(), create=True).save()

Now everything works:

artifact.features.add_values({"species": df.species.unique(), "scientist": df.scientist.unique(), "instrument": df.instrument.unique(), "temperature": 27.6, "study": "Study 0: initial plant gathering"})
artifact.describe()
Artifact 
├── General
│   ├── .uid = 'GVswddJr58LWPj9A0000'
│   ├── .key = iris_studies/study0_raw_images
│   ├── .size = 658465
│   ├── .hash = 'IVKGMfNwi8zKvnpaD_gG7w'
│   ├── .n_objects = 51
│   ├── .path = s3://lamindata/iris_studies/study0_raw_images
│   ├── .created_by = anonymous
│   ├── .created_at = 2024-12-21 08:21:54
│   └── .transform = 'Tutorial: Artifacts'
├── Linked features
│   └── instrument                  cat[ULabel]                Leica IIIc Camera
│       scientist                   cat[ULabel]                Barbara McClintock, Edgar Anderson
│       species                     cat[ULabel]                setosa, versicolor, virginica
│       study                       cat[ULabel]                Study 0: initial plant gathering
│       temperature                 float                      27.6
└── Labels
    └── .ulabels                    ULabel                     Study 0: initial plant gathering, setosa…

Because we also annotated the artifact with the study label 'Study 0: initial plant gathering' through the study feature, the label now appears under the study feature.
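The validate-then-annotate behavior of this section can be mimicked in plain Python. This is a conceptual sketch only and not how LaminDB implements it: the `ValidationError` class and the `add_values` function below are stand-ins that merely share names with the LaminDB objects used above.

```python
class ValidationError(Exception):
    """Stand-in for a validation error; not LaminDB's exception class."""


def add_values(registered_features, annotations, values):
    """Store values only if every key is a registered feature name."""
    unknown = [key for key in values if key not in registered_features]
    if unknown:
        raise ValidationError(f"These keys could not be validated: {unknown}")
    annotations.update(values)
    return annotations


registry = set()  # no features registered yet
annotations = {}
values = {"species": ["setosa"], "temperature": 27.6}

try:
    add_values(registry, annotations, values)
except ValidationError as error:
    print(error)  # both keys are reported as unvalidated

registry.update({"species", "temperature"})  # "register" the features
print(add_values(registry, annotations, values))
```

Rejecting the whole dictionary when a single key is unknown keeps annotations all-or-nothing, so a typo in one feature name cannot silently produce a partially annotated artifact.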

Retrieve features

artifact.features.get_values()
{'instrument': 'Leica IIIc Camera',
 'scientist': {'Barbara McClintock', 'Edgar Anderson'},
 'species': {'setosa', 'versicolor', 'virginica'},
 'study': 'Study 0: initial plant gathering',
 'temperature': 27.6}

Query by features

artifact = ln.Artifact.features.get(temperature=27.6)
artifact
Artifact(uid='GVswddJr58LWPj9A0000', is_latest=True, key='iris_studies/study0_raw_images', suffix='', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', n_objects=51, _hash_type='md5-d', visibility=1, _key_is_virtual=False, storage_id=2, transform_id=1, run_id=1, created_by_id=1, created_at=2024-12-21 08:21:54 UTC)
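What a query by feature value does can be pictured as a filter over annotated records. The sketch below uses plain dictionaries rather than the LaminDB query API; `filter_by_features` is a hypothetical helper and the second record is made up.

```python
# Each record pairs a key with its feature annotations (the second entry is made up).
records = [
    {
        "key": "iris_studies/study0_raw_images",
        "features": {"temperature": 27.6, "study": "Study 0: initial plant gathering"},
    },
    {"key": "example/other_dataset", "features": {"temperature": 21.0}},
]


def filter_by_features(records, **conditions):
    """Return records whose annotations equal all given feature=value conditions."""
    return [
        record
        for record in records
        if all(record["features"].get(name) == value for name, value in conditions.items())
    ]


matches = filter_by_features(records, temperature=27.6)
print([record["key"] for record in matches])  # ['iris_studies/study0_raw_images']
```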

Register metadata

Features and labels are the primary ways of registering domain-knowledge related metadata in LaminDB.

Features represent measurement dimensions (e.g. "species") and labels represent measured values (e.g. "iris setosa", "iris versicolor", "iris virginica").

In statistics, you’d say a feature is a categorical or numerical variable while a label is a simple category. Categorical variables draw their values from a set of categories.
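The distinction can be illustrated with a small data model. This is a sketch under simplifying assumptions and not LaminDB's schema: `FeatureSketch` is a hypothetical class that supports only the two dtypes "cat" and "float".

```python
from dataclasses import dataclass, field


@dataclass
class FeatureSketch:
    """A measurement dimension: a name, a dtype and, for categoricals, allowed labels."""

    name: str
    dtype: str  # "cat" or "float" in this sketch
    categories: set = field(default_factory=set)

    def validate(self, value):
        """Check a single value against the feature's dtype or categories."""
        if self.dtype == "cat":
            return value in self.categories
        if self.dtype == "float":
            return isinstance(value, float)
        return False


species = FeatureSketch("species", "cat", {"setosa", "versicolor", "virginica"})
temperature = FeatureSketch("temperature", "float")

print(species.validate("setosa"))   # True
print(species.validate("setosaa"))  # False
print(temperature.validate(27.6))   # True
```

The categorical feature draws its values from a set of labels, while the numerical feature only constrains the type of the value.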

Register labels

We study 3 species of the Iris plant: setosa, versicolor & virginica. Let’s create 3 labels with ULabel.

ULabel lets you maintain an in-house ontology for all kinds of generic labels.

What are alternatives to ULabel?

In a complex project, you’ll likely want dedicated typed registries for selected label types, e.g., Gene, Tissue, etc. See: Manage biological registries.

ULabel, however, will get you quite far and scale to ~1M labels.

Anticipating that we’ll have many different labels when working with more data, we’d like to express that all 3 labels are species labels:

is_species = ln.ULabel(name="is_species").save()
is_species.children.set(species)
is_species.view_parents(with_children=True)
[Figure: graph of the is_species label and its child labels]
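The idea of grouping labels under a parent can be sketched with a plain mapping from parent to children. This is not the LaminDB API: the `children` mapping and both helpers are hypothetical, and the `is_scientist` parent is made up to show a second group.

```python
# Parent -> children mapping as a stand-in for a label registry with hierarchy.
children = {
    "is_species": {"setosa", "versicolor", "virginica"},
    "is_scientist": {"Barbara McClintock", "Edgar Anderson"},  # made-up parent
}


def labels_with_parent(parent_name):
    """Return all labels that have the given parent, sorted for stable output."""
    return sorted(children.get(parent_name, set()))


def parents_of(label_name):
    """Return the names of all parents of a label."""
    return sorted(parent for parent, kids in children.items() if label_name in kids)


print(labels_with_parent("is_species"))  # ['setosa', 'versicolor', 'virginica']
print(parents_of("Edgar Anderson"))      # ['is_scientist']
```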

Query artifacts by labels

Using the new annotations, you can now query image artifacts by species & study labels:

ln.ULabel.df()
id  uid       name                              description       reference  reference_type  run_id  created_at                        created_by_id
8   D296Eqjz  is_species                        None              None       None            None    2024-12-21 08:22:02.213156+00:00  1
7   ASuwB4Ji  Leica IIIc Camera                 None              None       None            None    2024-12-21 08:22:02.127424+00:00  1
6   B3xyXWcZ  Edgar Anderson                    None              None       None            None    2024-12-21 08:22:02.124058+00:00  1
5   2s45INWR  Barbara McClintock                None              None       None            None    2024-12-21 08:22:02.124018+00:00  1
4   fG5snH0c  virginica                         None              None       None            None    2024-12-21 08:22:02.119294+00:00  1
3   AUk6nRQ9  versicolor                        None              None       None            None    2024-12-21 08:22:02.119272+00:00  1
2   8Ffiui16  setosa                            None              None       None            None    2024-12-21 08:22:02.119230+00:00  1
1   rWe1jJql  Study 0: initial plant gathering  My initial study  None       None            None    2024-12-21 08:22:01.271370+00:00  1
ulabels = ln.ULabel.lookup()
ln.Artifact.get(ulabels=ulabels.study_0_initial_plant_gathering)
Artifact(uid='GVswddJr58LWPj9A0000', is_latest=True, key='iris_studies/study0_raw_images', suffix='', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', n_objects=51, _hash_type='md5-d', visibility=1, _key_is_virtual=False, storage_id=2, transform_id=1, run_id=1, created_by_id=1, created_at=2024-12-21 08:21:54 UTC)

Run an ML model

Let’s now run a mock ML model that transforms the images into 4 high-level features.

def run_ml_model() -> pd.DataFrame:
    image_file_dir = artifact.cache()
    output_data = ln.core.datasets.df_iris_in_meter_study1()
    return output_data

transform = ln.Transform(name="Petal & sepal regressor", type="pipeline")
ln.context.track(transform=transform)
df = run_ml_model()
 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_6panMrqxZMJtd0FzJO6c.txt
 created Transform('zJlu2Ojv'), started new Run('6panMrqx') at 2024-12-21 08:22:02 UTC
 adding artifact ids [1] as inputs for run 2, adding parent transform 1

The output is a dataframe:

df.head()
   sepal_length  sepal_width  petal_length  petal_width  iris_organism_name
0         0.051        0.035         0.014        0.002              setosa
1         0.049        0.030         0.014        0.002              setosa
2         0.047        0.032         0.013        0.002              setosa
3         0.046        0.031         0.015        0.002              setosa
4         0.050        0.036         0.014        0.002              setosa

And this is the pipeline that produced the dataframe:

ln.context.transform.view_lineage()
[Figure: lineage graph of the transform]

Register the output data

Let’s first register the features of the transformed data:

new_features = ln.Feature.from_df(df)
ln.save(new_features)
! You have few permissible values for feature iris_organism_name, consider dtype 'cat' instead of 'str'
How to track units of features?

Use the unit field of Feature. In the above example, you’d do:

for feature in new_features:
    if feature.dtype == "float":
        feature.unit = "m"  # SI symbol for meter
        feature.save()
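Tracking a unit per feature also makes conversions explicit. The following is a minimal plain-Python sketch and not a LaminDB API: the table `TO_METER` and the helper `convert_length` are hypothetical names introduced for illustration.

```python
# Conversion factors to meters for a few length units (an assumption of this sketch).
TO_METER = {"m": 1.0, "cm": 0.01, "mm": 0.001}


def convert_length(value, from_unit, to_unit):
    """Convert a length between two units listed in TO_METER."""
    if from_unit not in TO_METER or to_unit not in TO_METER:
        raise ValueError(f"unknown unit: {from_unit!r} or {to_unit!r}")
    return value * TO_METER[from_unit] / TO_METER[to_unit]


# A sepal length recorded in meters, expressed in centimeters.
sepal_length_cm = convert_length(0.051, "m", "cm")
print(round(sepal_length_cm, 3))
```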

We can now validate & register the dataframe in one line:

artifact = ln.Artifact.from_df(
    df,
    description="Iris study 1 - after measuring sepal & petal metrics",
)
artifact.save()
• path content will be copied to default storage upon `save()` with key `None` ('.lamindb/Z2pxgeT7ChLDa1C90000.parquet')
 storing artifact 'Z2pxgeT7ChLDa1C90000' at '/home/runner/work/lamin-docs/lamin-docs/docs/lamin-tutorial/.lamindb/Z2pxgeT7ChLDa1C90000.parquet'
Artifact(uid='Z2pxgeT7ChLDa1C90000', is_latest=True, description='Iris study 1 - after measuring sepal & petal metrics', suffix='.parquet', type='dataset', size=4834, hash='gfw0zKJdmmP_QAZzLlgygg', _hash_type='md5', _accessor='DataFrame', visibility=1, _key_is_virtual=True, storage_id=1, transform_id=2, run_id=2, created_by_id=1, created_at=2024-12-21 08:22:03 UTC)

There is one categorical feature, so let’s add the species labels:

features = ln.Feature.lookup()
species_labels = ln.ULabel.filter(parents__name="is_species").all()
artifact.labels.add(species_labels, feature=features.species)
species_labels
<QuerySet [ULabel(uid='8Ffiui16', name='setosa', created_by_id=1, created_at=2024-12-21 08:22:02 UTC), ULabel(uid='AUk6nRQ9', name='versicolor', created_by_id=1, created_at=2024-12-21 08:22:02 UTC), ULabel(uid='fG5snH0c', name='virginica', created_by_id=1, created_at=2024-12-21 08:22:02 UTC)]>

Let’s now add study labels:

artifact.labels.add(ulabels.study_0_initial_plant_gathering, feature=features.study)

This is the context for our artifact:

artifact.describe()
artifact.view_lineage()
Artifact .parquet/DataFrame
├── General
│   ├── .uid = 'Z2pxgeT7ChLDa1C90000'
│   ├── .size = 4834
│   ├── .hash = 'gfw0zKJdmmP_QAZzLlgygg'
│   ├── .path = /home/runner/work/lamin-docs/lamin-docs/docs/lamin-tutorial/.lamindb/Z2pxgeT7ChLDa1C90000.parquet
│   ├── .created_by = anonymous
│   ├── .created_at = 2024-12-21 08:22:03
│   └── .transform = 'Petal & sepal regressor'
├── Linked features
│   └── species                     cat[ULabel]                setosa, versicolor, virginica
│       study                       cat[ULabel]                Study 0: initial plant gathering
└── Labels
    └── .ulabels                    ULabel                     Study 0: initial plant gathering, setosa…
[Figure: lineage graph of the artifact]

See the database content:

ln.view(registries=["Feature", "ULabel"])
Feature
uid name dtype unit description synonyms run_id created_at created_by_id
id
10 yWKqHkWI8ctA iris_organism_name str None None None 2.0 2024-12-21 08:22:03.501440+00:00 1
9 f6qOEZoglUtZ petal_width float None None None 2.0 2024-12-21 08:22:03.501419+00:00 1
8 XnC3pLbdAqpz petal_length float None None None 2.0 2024-12-21 08:22:03.501399+00:00 1
7 soJoGlGkam34 sepal_width float None None None 2.0 2024-12-21 08:22:03.501374+00:00 1
6 YNJ2KsylDUwB sepal_length float None None None 2.0 2024-12-21 08:22:03.501324+00:00 1
5 z8vb9vt8ejzp study cat[ULabel] None None None NaN 2024-12-21 08:22:02.112958+00:00 1
3 zp3ZSaAYMhPE instrument cat[ULabel] None None None NaN 2024-12-21 08:22:02.107682+00:00 1
ULabel
uid name description reference reference_type run_id created_at created_by_id
id
8 D296Eqjz is_species None None None None 2024-12-21 08:22:02.213156+00:00 1
7 ASuwB4Ji Leica IIIc Camera None None None None 2024-12-21 08:22:02.127424+00:00 1
6 B3xyXWcZ Edgar Anderson None None None None 2024-12-21 08:22:02.124058+00:00 1
5 2s45INWR Barbara McClintock None None None None 2024-12-21 08:22:02.124018+00:00 1
4 fG5snH0c virginica None None None None 2024-12-21 08:22:02.119294+00:00 1
3 AUk6nRQ9 versicolor None None None None 2024-12-21 08:22:02.119272+00:00 1
2 8Ffiui16 setosa None None None None 2024-12-21 08:22:02.119230+00:00 1

This is it! 😅

If you’re interested, please check out guides & use cases or open an issue on GitHub to discuss.

Appendix

Manage metadata

Avoid duplicates

Let’s create a label "project1":

ln.ULabel(name="project1").save()
ULabel(uid='H4JK054u', name='project1', created_by_id=1, run_id=2, created_at=2024-12-21 08:22:03 UTC)

We just created a project1 label. Let’s see what happens if we try to create it again:

label = ln.ULabel(name="project1")
label.save()
 returning existing ULabel record with same name: 'project1'
ULabel(uid='H4JK054u', name='project1', created_by_id=1, run_id=2, created_at=2024-12-21 08:22:03 UTC)

Instead of creating a new record, LaminDB loads and returns the existing record from the database.

If there is no exact match, LaminDB warns you about potential duplicates when you create a record.

Say we spell “project 1” with a space:

ln.ULabel(name="project 1")
! record with similar name exists! did you mean to load it?
uid name description reference reference_type run_id created_at created_by_id
id
9 H4JK054u project1 None None None 2 2024-12-21 08:22:03.740291+00:00 1
ULabel(uid='bVIey5V5', name='project 1', created_by_id=1, run_id=2)

To avoid inserting duplicates, LaminDB searches for records with similar names whenever you create a new record.

You can switch it off for performance gains via search_names.
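The two behaviors described here, returning an existing record on an exact match and warning on a near match, can be sketched with the standard library. This illustrates the idea only: LaminDB's actual search is not shown, and `get_or_warn` with its `cutoff` is a hypothetical helper.

```python
import difflib


def get_or_warn(name, existing_names, cutoff=0.9):
    """Classify a name as an exact match, a near match, or a new name."""
    if name in existing_names:
        return ("existing", name)
    similar = difflib.get_close_matches(name, existing_names, n=1, cutoff=cutoff)
    if similar:
        return ("similar", similar[0])
    return ("new", name)


existing = ["project1", "setosa", "versicolor"]
print(get_or_warn("project1", existing))   # ('existing', 'project1')
print(get_or_warn("project 1", existing))  # ('similar', 'project1')
print(get_or_warn("study42", existing))    # ('new', 'study42')
```

The similarity comparison runs against every existing name, which is why switching such a search off can pay off for large registries.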

Update & delete records

label = ln.ULabel.filter(name="project1").first()
label
ULabel(uid='H4JK054u', name='project1', created_by_id=1, run_id=2, created_at=2024-12-21 08:22:03 UTC)
label.name = "project1a"
label.save()
label
ULabel(uid='H4JK054u', name='project1a', created_by_id=1, run_id=2, created_at=2024-12-21 08:22:03 UTC)
label.delete()

Manage storage

Change default storage

The default storage location is:

ln.settings.storage
StorageSettings(root='/home/runner/work/lamin-docs/lamin-docs/docs/lamin-tutorial', uid='cqZ5ZwHKxYoQ')

You can change it by setting ln.settings.storage = "s3://my-bucket".

See all storage locations

ln.Storage.df()
uid root description type region instance_uid run_id created_at created_by_id
id
2 8vLFZ3RWgt0u s3://lamindata None s3 us-east-1 None None 2024-12-21 08:21:54.175311+00:00 1
1 cqZ5ZwHKxYoQ /home/runner/work/lamin-docs/lamin-docs/docs/l... None local None 5WuFt3cW4zRx None 2024-12-21 08:21:49.589598+00:00 1
# clean up what we wrote in this notebook
!rm -r lamin-tutorial
!lamin delete --force lamin-tutorial
! calling anonymously, will miss private instances
 deleting instance anonymous/lamin-tutorial