Tutorial: Features & labels¶
In Tutorial: Artifacts, we learned about provenance and data access. Here, we walk through validating & annotating datasets with features & labels to improve:
Findability: Which collections measured expression of cell marker CD14? Which characterized cell line K562? Which collections have a test & train split? Etc.
Usability: Are there typos in feature names? Are there typos in sampled labels? Are types and units of features consistent? Etc.
Hint
This is a low-level tutorial aimed at a basic understanding of registering features and labels for annotation & validation.
If you’re just looking to readily validate and annotate a dataset with features and labels, see this guide: Curate datasets.
import lamindb as ln
import pandas as pd
import pytest
ln.settings.verbosity = "hint"
→ connected lamindb: anonymous/lamin-tutorial
TLDR¶
Annotate by labels¶
# create a label
study0 = ln.ULabel(name="Study 0: initial plant gathering", description="My initial study").save()
# query an artifact from the previous tutorial
artifact = ln.Artifact.get(key="iris_studies/study0_raw_images")
# label the artifact
artifact.labels.add(study0)
# look at artifact metadata
artifact.describe()
Artifact(uid='dXAZhdSqtewMfuDW0000', is_latest=True, key='iris_studies/study0_raw_images', suffix='', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', n_objects=51, _hash_type='md5-d', visibility=1, _key_is_virtual=False, updated_at='2024-09-10 15:13:32 UTC')
Provenance
.storage = 's3://lamindata'
.transform = 'Tutorial: Artifacts'
.run = '2024-09-10 15:13:29 UTC'
.created_by = 'anonymous'
Labels
.ulabels = 'Study 0: initial plant gathering'
Annotate by features¶
Features are buckets for labels, numbers and other data types.
Often, data that you want to ingest comes with metadata.
Here, three metadata features (species, scientist, instrument) were collected.
df = pd.read_csv(artifact.path / "meta.csv", index_col=0)
df.head()
 | species | file_name | scientist | instrument |
---|---|---|---|---|
0 | setosa | iris-0797945218a97d6e5251b4758a2ba1b418cbd52ce... | Barbara McClintock | Leica IIIc Camera |
1 | versicolor | iris-0f133861ea3fe1b68f9f1b59ebd9116ff963ee710... | Edgar Anderson | Leica IIIc Camera |
2 | versicolor | iris-9ffe51c2abd973d25a299647fa9ccaf6aa9c8eecf... | Edgar Anderson | Leica IIIc Camera |
3 | setosa | iris-83f433381b755101b9fc9fbc9743e35fbb8a1a109... | Edgar Anderson | Leica IIIc Camera |
4 | virginica | iris-bdae8314e4385d8e2322abd8e63a82758a9063c77... | Edgar Anderson | Leica IIIc Camera |
There are only a few values for features species, scientist & instrument, and we’d like to label the artifact with these values:
df.nunique()
species 3
file_name 50
scientist 2
instrument 1
dtype: int64
Let’s annotate the artifact with features & values and add a temperature measurement that Barbara & Edgar had forgotten in their CSV:
with pytest.raises(ln.core.exceptions.ValidationError) as error:
    artifact.features.add_values({"species": df.species.unique(), "scientist": df.scientist.unique(), "instrument": df.instrument.unique(), "temperature": 27.6, "study": "Study 0: initial plant gathering"})
print(error.exconly())
lamindb.core.exceptions.ValidationError: These keys could not be validated: ['species', 'scientist', 'instrument', 'temperature', 'study']
Here is how to create a feature:
ln.Feature(name='species', dtype='cat').save()
ln.Feature(name='scientist', dtype='cat').save()
ln.Feature(name='instrument', dtype='cat').save()
ln.Feature(name='temperature', dtype='float').save()
ln.Feature(name='study', dtype='cat[ULabel]').save()
As we saw, none of the keys could be validated, so we got an error that tells us to register features & labels first:
ln.Feature(name='species', dtype='cat[ULabel]').save()
ln.Feature(name='scientist', dtype='cat[ULabel]').save()
ln.Feature(name='instrument', dtype='cat[ULabel]').save()
ln.Feature(name='study', dtype='cat[ULabel]').save()
ln.Feature(name='temperature', dtype='float').save()
species = ln.ULabel.from_values(df['species'].unique(), create=True)
ln.save(species)
authors = ln.ULabel.from_values(df['scientist'].unique(), create=True)
ln.save(authors)
instruments = ln.ULabel.from_values(df['instrument'].unique(), create=True)
ln.save(instruments)
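Conceptually, the sequence above (a failed call, then registration, then a successful call) amounts to checking annotation keys against a registry of known feature names. The following is a minimal pure-Python sketch of that idea under stated assumptions, not LaminDB's implementation: `validate_keys`, `registry` and this local `ValidationError` class are hypothetical names introduced only for illustration.

```python
class ValidationError(Exception):
    """Raised when annotation keys are not registered as features (hypothetical)."""


def validate_keys(values: dict, registered_features: set) -> None:
    # Collect every key that has no matching registered feature name
    unknown = [key for key in values if key not in registered_features]
    if unknown:
        raise ValidationError(f"These keys could not be validated: {unknown}")


values = {"species": ["setosa"], "temperature": 27.6}

registry = set()  # nothing is registered yet
try:
    validate_keys(values, registry)
except ValidationError as error:
    first_attempt = str(error)

registry.update({"species", "temperature"})  # register the features
validate_keys(values, registry)  # no exception this time
```

The point of failing early is that a typo in a key never silently becomes a new feature; it has to be registered deliberately.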
Now everything works:
artifact.features.add_values({"species": df.species.unique(), "scientist": df.scientist.unique(), "instrument": df.instrument.unique(), "temperature": 27.6, "study": "Study 0: initial plant gathering"})
artifact.describe()
Artifact(uid='dXAZhdSqtewMfuDW0000', is_latest=True, key='iris_studies/study0_raw_images', suffix='', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', n_objects=51, _hash_type='md5-d', visibility=1, _key_is_virtual=False, updated_at='2024-09-10 15:13:32 UTC')
Provenance
.storage = 's3://lamindata'
.transform = 'Tutorial: Artifacts'
.run = '2024-09-10 15:13:29 UTC'
.created_by = 'anonymous'
Labels
.ulabels = 'Study 0: initial plant gathering', 'setosa', 'versicolor', 'virginica', 'Barbara McClintock', 'Edgar Anderson', 'Leica IIIc Camera'
Features
'instrument' = 'Leica IIIc Camera'
'scientist' = 'Barbara McClintock', 'Edgar Anderson'
'species' = 'setosa', 'versicolor', 'virginica'
'study' = 'Study 0: initial plant gathering'
'temperature' = 27.6
Because we also re-labeled with the study label 'Study 0: initial plant gathering', we see that it appears under the study feature.
Retrieve features¶
artifact.features.get_values()
{'study': 'Study 0: initial plant gathering',
'species': ['setosa', 'versicolor', 'virginica'],
'scientist': ['Barbara McClintock', 'Edgar Anderson'],
'instrument': 'Leica IIIc Camera',
'temperature': 27.6}
Query by features¶
artifact = ln.Artifact.features.get(temperature=27.6)
artifact
Artifact(uid='dXAZhdSqtewMfuDW0000', is_latest=True, key='iris_studies/study0_raw_images', suffix='', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', n_objects=51, _hash_type='md5-d', visibility=1, _key_is_virtual=False, storage_id=2, transform_id=1, run_id=1, created_by_id=1, updated_at='2024-09-10 15:13:32 UTC')
Register metadata¶
Features and labels are the primary ways of registering domain-knowledge related metadata in LaminDB.
Features represent measurement dimensions (e.g. "species") and labels represent measured values (e.g. "iris setosa", "iris versicolor", "iris virginica").
In statistics, you’d say a feature is a categorical or numerical variable while a label is a simple category. Categorical variables draw their values from a set of categories.
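To make the distinction concrete, here is a small sketch of a feature as a variable with a data type, where a categorical feature draws its values from a set of labels. It assumes nothing about LaminDB internals: `FeatureSketch` and `is_valid` are hypothetical names, and the dtype strings "cat" and "float" simply mirror the ones used in this tutorial.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureSketch:
    name: str
    dtype: str  # "cat" for categorical, "float" for numerical
    categories: set = field(default_factory=set)  # labels; only used for "cat"

    def is_valid(self, value) -> bool:
        if self.dtype == "cat":
            # A categorical value must be one of the known labels
            return value in self.categories
        if self.dtype == "float":
            # bool is a subclass of int, so exclude it explicitly
            return isinstance(value, (int, float)) and not isinstance(value, bool)
        return False


species_feature = FeatureSketch("species", "cat", {"setosa", "versicolor", "virginica"})
temperature_feature = FeatureSketch("temperature", "float")

print(species_feature.is_valid("setosa"))    # a known label
print(species_feature.is_valid("setosaa"))   # a typo is rejected
print(temperature_feature.is_valid(27.6))    # a number is accepted
```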
Register labels¶
We study 3 species of the Iris plant: setosa, versicolor & virginica. We already registered one label for each of them with ULabel above (the species list).
ULabel lets you maintain an in-house ontology for all kinds of generic labels.
What are alternatives to ULabel?
In a complex project, you’ll likely want dedicated typed registries for selected label types, e.g., Gene, Tissue, etc. See: Manage biological registries.
ULabel, however, will get you quite far and scale to ~1M labels.
Anticipating that we’ll have many different labels when working with more data, we’d like to express that all 3 labels are species labels:
is_species = ln.ULabel(name="is_species").save()
is_species.children.set(species)
is_species.view_parents(with_children=True)
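The parent-child relation used here can be pictured as a mapping from a parent label to its children. The sketch below is a plain-Python illustration of that idea, not how LaminDB stores the relation; `children_of`, `set_children` and `labels_with_parent` are hypothetical names.

```python
from collections import defaultdict

# parent label name -> set of child label names
children_of = defaultdict(set)


def set_children(parent: str, children: list) -> None:
    # Replace (not extend) the children of a parent
    children_of[parent] = set(children)


def labels_with_parent(parent: str) -> list:
    # Return all labels filed under the given parent, sorted for stable output
    return sorted(children_of.get(parent, set()))


set_children("is_species", ["setosa", "versicolor", "virginica"])
species_names = labels_with_parent("is_species")
print(species_names)
```

Grouping labels under a parent keeps them retrievable as a set later on, even when the registry also holds scientist and instrument labels.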
Query artifacts by labels¶
Using the new annotations, you can now query image artifacts by species & study labels:
ln.ULabel.df()
id | uid | name | description | reference | reference_type | run_id | created_by_id | updated_at |
---|---|---|---|---|---|---|---|---|
8 | Na6irVvc | is_species | None | None | None | None | 1 | 2024-09-10 15:13:40.885804+00:00 |
7 | IsbYuCMA | Leica IIIc Camera | None | None | None | None | 1 | 2024-09-10 15:13:40.760776+00:00 |
6 | DBPkIfTF | Edgar Anderson | None | None | None | None | 1 | 2024-09-10 15:13:40.756715+00:00 |
5 | 7N6HSTo7 | Barbara McClintock | None | None | None | None | 1 | 2024-09-10 15:13:40.756658+00:00 |
4 | bFPYC8bd | virginica | None | None | None | None | 1 | 2024-09-10 15:13:40.750804+00:00 |
3 | MGWZR8nU | versicolor | None | None | None | None | 1 | 2024-09-10 15:13:40.750757+00:00 |
2 | cl0Qbp6d | setosa | None | None | None | None | 1 | 2024-09-10 15:13:40.750698+00:00 |
1 | lZbLrMNG | Study 0: initial plant gathering | My initial study | None | None | None | 1 | 2024-09-10 15:13:40.080835+00:00 |
ulabels = ln.ULabel.lookup()
ln.Artifact.get(ulabels=ulabels.study_0_initial_plant_gathering)
Artifact(uid='dXAZhdSqtewMfuDW0000', is_latest=True, key='iris_studies/study0_raw_images', suffix='', size=658465, hash='IVKGMfNwi8zKvnpaD_gG7w', n_objects=51, _hash_type='md5-d', visibility=1, _key_is_virtual=False, storage_id=2, transform_id=1, run_id=1, created_by_id=1, updated_at='2024-09-10 15:13:32 UTC')
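The lookup object exposes the label "Study 0: initial plant gathering" as the attribute study_0_initial_plant_gathering. One plausible way to derive such an attribute name is sketched below; this is an assumption for illustration, LaminDB's actual normalization rules may differ, and `to_attribute_name` is a hypothetical helper.

```python
import re


def to_attribute_name(label_name: str) -> str:
    # Lowercase, then collapse every run of non-alphanumeric characters into "_"
    slug = re.sub(r"[^0-9a-z]+", "_", label_name.lower()).strip("_")
    # Python identifiers cannot start with a digit, so prefix those with "_"
    return f"_{slug}" if slug[:1].isdigit() else slug


print(to_attribute_name("Study 0: initial plant gathering"))
print(to_attribute_name("Leica IIIc Camera"))
```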
Run an ML model¶
Let’s now run a mock ML model that transforms the images into 4 high-level features.
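def run_ml_model() -> pd.DataFrame:
    # download the input images to a local cache (unused by this mock model)
    image_file_dir = artifact.cache()
    # the mock "prediction" is a precomputed example dataset
    output_data = ln.core.datasets.df_iris_in_meter_study1()
    return output_data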
transform = ln.Transform(name="Petal & sepal regressor", type="pipeline")
ln.context.track(transform=transform)
df = run_ml_model()
• tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_fZAuKIgVWyCdTPINA5Wl.txt
→ created Transform(uid='0sZPKa3Lg9Ba0000') & created Run(started_at='2024-09-10 15:13:41 UTC')
• adding artifact ids [1] as inputs for run 2, adding parent transform 1
The output is a dataframe:
df.head()
 | sepal_length | sepal_width | petal_length | petal_width | iris_organism_name |
---|---|---|---|---|---|
0 | 0.051 | 0.035 | 0.014 | 0.002 | setosa |
1 | 0.049 | 0.030 | 0.014 | 0.002 | setosa |
2 | 0.047 | 0.032 | 0.013 | 0.002 | setosa |
3 | 0.046 | 0.031 | 0.015 | 0.002 | setosa |
4 | 0.050 | 0.036 | 0.014 | 0.002 | setosa |
And this is the pipeline that produced the dataframe:
ln.context.transform.view_lineage()
Register the output data¶
Let’s first register the features of the transformed data:
new_features = ln.Feature.from_df(df)
ln.save(new_features)
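Registering features from a dataframe implies deciding on a data type per column. The registry listing near the end of this tutorial shows the four measurement columns as float and iris_organism_name as cat. A rough sketch of such an inference is given below; it is a simplification under the assumption that only numbers and strings occur, and `infer_dtype` is a hypothetical helper, not part of LaminDB.

```python
def infer_dtype(column_values: list) -> str:
    # Numbers (but not booleans) map to "float"; everything else maps to "cat"
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in column_values
    )
    return "float" if is_numeric else "cat"


# The first rows of three columns from the dataframe shown above
columns = {
    "sepal_length": [0.051, 0.049, 0.047],
    "petal_width": [0.002, 0.002, 0.002],
    "iris_organism_name": ["setosa", "setosa", "setosa"],
}
inferred = {name: infer_dtype(values) for name, values in columns.items()}
print(inferred)
```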
How to track units of features?
Use the unit field of Feature. In the above example, you’d do:
for feature in new_features:
    if feature.dtype == "float":
        feature.unit = "m"  # SI unit symbol for meter
        feature.save()
We can now validate & register the dataframe in one line:
artifact = ln.Artifact.from_df(
df,
description="Iris study 1 - after measuring sepal & petal metrics",
)
artifact.save()
• path content will be copied to default storage upon `save()` with key `None` ('.lamindb/dz0Zfx8taEsKMP0i0000.parquet')
✓ storing artifact 'dz0Zfx8taEsKMP0i0000' at '/home/runner/work/lamindb/lamindb/docs/lamin-tutorial/.lamindb/dz0Zfx8taEsKMP0i0000.parquet'
Artifact(uid='dz0Zfx8taEsKMP0i0000', is_latest=True, description='Iris study 1 - after measuring sepal & petal metrics', suffix='.parquet', type='dataset', size=5347, hash='g53_Mfz7SMwVZE120ZqpRA', _hash_type='md5', _accessor='DataFrame', visibility=1, _key_is_virtual=True, storage_id=1, transform_id=2, run_id=2, created_by_id=1, updated_at='2024-09-10 15:13:42 UTC')
There is one categorical feature; let’s add the species labels:
features = ln.Feature.lookup()
species_labels = ln.ULabel.filter(parents__name="is_species").all()
artifact.labels.add(species_labels, feature=features.species)
species_labels
<QuerySet [ULabel(uid='cl0Qbp6d', name='setosa', created_by_id=1, updated_at='2024-09-10 15:13:40 UTC'), ULabel(uid='MGWZR8nU', name='versicolor', created_by_id=1, updated_at='2024-09-10 15:13:40 UTC'), ULabel(uid='bFPYC8bd', name='virginica', created_by_id=1, updated_at='2024-09-10 15:13:40 UTC')]>
Let’s now add study labels:
artifact.labels.add(ulabels.study_0_initial_plant_gathering, feature=features.study)
This is the context for our artifact:
artifact.describe()
artifact.view_lineage()
Artifact(uid='dz0Zfx8taEsKMP0i0000', is_latest=True, description='Iris study 1 - after measuring sepal & petal metrics', suffix='.parquet', type='dataset', size=5347, hash='g53_Mfz7SMwVZE120ZqpRA', _hash_type='md5', _accessor='DataFrame', visibility=1, _key_is_virtual=True, updated_at='2024-09-10 15:13:42 UTC')
Provenance
.storage = '/home/runner/work/lamindb/lamindb/docs/lamin-tutorial'
.transform = 'Petal & sepal regressor'
.run = '2024-09-10 15:13:41 UTC'
.created_by = 'anonymous'
Labels
.ulabels = 'Study 0: initial plant gathering', 'setosa', 'versicolor', 'virginica'
Features
'species' = 'setosa', 'versicolor', 'virginica'
'study' = 'Study 0: initial plant gathering'
See the database content:
ln.view(registries=["Feature", "ULabel"])
Feature
id | uid | name | dtype | unit | description | synonyms | run_id | created_by_id | updated_at |
---|---|---|---|---|---|---|---|---|---|
10 | 8FdU0RNjgIFH | iris_organism_name | cat | None | None | None | 2.0 | 1 | 2024-09-10 15:13:42.389150+00:00 |
9 | 4P0H2Gct2m9a | petal_width | float | None | None | None | 2.0 | 1 | 2024-09-10 15:13:42.389097+00:00 |
8 | wYIrIW4D9qDB | petal_length | float | None | None | None | 2.0 | 1 | 2024-09-10 15:13:42.389044+00:00 |
7 | KmO4Jo2pN3oR | sepal_width | float | None | None | None | 2.0 | 1 | 2024-09-10 15:13:42.388990+00:00 |
6 | UyBeuCxcxMDC | sepal_length | float | None | None | None | 2.0 | 1 | 2024-09-10 15:13:42.388925+00:00 |
5 | 8oXzDKfkGaYL | temperature | float | None | None | None | NaN | 1 | 2024-09-10 15:13:40.743361+00:00 |
4 | L8Jf413mo6Qk | study | cat[ULabel] | None | None | None | NaN | 1 | 2024-09-10 15:13:40.740145+00:00 |
ULabel
id | uid | name | description | reference | reference_type | run_id | created_by_id | updated_at |
---|---|---|---|---|---|---|---|---|
8 | Na6irVvc | is_species | None | None | None | None | 1 | 2024-09-10 15:13:40.885804+00:00 |
7 | IsbYuCMA | Leica IIIc Camera | None | None | None | None | 1 | 2024-09-10 15:13:40.760776+00:00 |
6 | DBPkIfTF | Edgar Anderson | None | None | None | None | 1 | 2024-09-10 15:13:40.756715+00:00 |
5 | 7N6HSTo7 | Barbara McClintock | None | None | None | None | 1 | 2024-09-10 15:13:40.756658+00:00 |
4 | bFPYC8bd | virginica | None | None | None | None | 1 | 2024-09-10 15:13:40.750804+00:00 |
3 | MGWZR8nU | versicolor | None | None | None | None | 1 | 2024-09-10 15:13:40.750757+00:00 |
2 | cl0Qbp6d | setosa | None | None | None | None | 1 | 2024-09-10 15:13:40.750698+00:00 |
This is it! 😅
If you’re interested, please check out guides & use cases or open an issue on GitHub to discuss.
Appendix¶
Manage metadata¶
Avoid duplicates¶
Let’s create a label "project1":
ln.ULabel(name="project1").save()
ULabel(uid='6J0yA4Aw', name='project1', created_by_id=1, run_id=2, updated_at='2024-09-10 15:13:42 UTC')
We already created a project1 label before. Let’s see what happens if we try to create it again:
label = ln.ULabel(name="project1")
label.save()
→ returning existing ULabel record with same name: 'project1'
ULabel(uid='6J0yA4Aw', name='project1', created_by_id=1, run_id=2, updated_at='2024-09-10 15:13:42 UTC')
Instead of creating a new record, LaminDB loads and returns the existing record from the database.
If there is no exact match, LaminDB will warn you upon creating a record about potential duplicates.
Say, we spell “project 1” with a white space:
ln.ULabel(name="project 1")
! record with similar name exists! did you mean to load it?
id | uid | name | description | reference | reference_type | run_id | created_by_id | updated_at |
---|---|---|---|---|---|---|---|---|
9 | 6J0yA4Aw | project1 | None | None | None | 2 | 1 | 2024-09-10 15:13:42.619639+00:00 |
ULabel(uid='HfVRvdEb', name='project 1', created_by_id=1, run_id=2)
To avoid inserting duplicates, LaminDB searches for similar existing records whenever you create a new one.
You can switch this search off for performance gains via search_names.
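A name-similarity check of this kind can be sketched with Python's standard library. The example below uses difflib as a stand-in; the matching algorithm and threshold LaminDB actually uses are not stated here, so `find_similar_names` and the 0.8 cutoff are assumptions for illustration only.

```python
import difflib


def find_similar_names(new_name: str, existing_names: list, cutoff: float = 0.8) -> list:
    # An exact match is not a "similar" name; it is the existing record itself
    candidates = [name for name in existing_names if name != new_name]
    # Return up to 3 names whose similarity ratio is at least `cutoff`
    return difflib.get_close_matches(new_name, candidates, n=3, cutoff=cutoff)


existing = ["project1", "setosa", "versicolor", "virginica"]
similar = find_similar_names("project 1", existing)
print(similar)
```

Running a comparison like this against every existing name has a cost that grows with the registry, which is why being able to switch the search off matters for bulk inserts.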
Update & delete records¶
label = ln.ULabel.filter(name="project1").first()
label
ULabel(uid='6J0yA4Aw', name='project1', created_by_id=1, run_id=2, updated_at='2024-09-10 15:13:42 UTC')
label.name = "project1a"
label.save()
label
ULabel(uid='6J0yA4Aw', name='project1a', created_by_id=1, run_id=2, updated_at='2024-09-10 15:13:42 UTC')
label.delete()
Manage storage¶
Change default storage¶
The default storage location is:
ln.settings.storage
StorageSettings(root='/home/runner/work/lamindb/lamindb/docs/lamin-tutorial', uid='KVHua8rgn4ib')
You can change it by setting ln.settings.storage = "s3://my-bucket".
See all storage locations¶
ln.Storage.df()
id | uid | root | description | type | region | instance_uid | run_id | created_by_id | updated_at |
---|---|---|---|---|---|---|---|---|---|
2 | TNWINBuaDYnA | s3://lamindata | None | s3 | us-east-1 | None | None | 1 | 2024-09-10 15:13:32.226259+00:00 |
1 | KVHua8rgn4ib | /home/runner/work/lamindb/lamindb/docs/lamin-t... | None | local | None | 5WuFt3cW4zRx | None | 1 | 2024-09-10 15:13:26.856617+00:00 |
# clean up what we wrote in this notebook
!rm -r lamin-tutorial
!lamin delete --force lamin-tutorial
! calling anonymously, will miss private instances
• deleting instance anonymous/lamin-tutorial