What does the key parameter do under the hood?

LaminDB is designed around associating biological metadata with artifacts and collections. This makes it possible to query for them in storage by metadata and removes the need for semantic artifact and collection names.

Here, we will discuss trade-offs for using the key parameter, which allows for semantic keys, in various scenarios.

We’re simulating an artifact system with several nested folders and artifacts. Similar structures appear in, for example, the RxRx: cell imaging guide.

# !pip install 'lamindb[jupyter]'
import random
import string
from pathlib import Path


def create_complex_biological_hierarchy(root_folder):
    root_path = Path(root_folder)

    if root_path.exists():
        print("Folder structure already exists. Skipping...")
    else:
        root_path.mkdir()

        raw_folder = root_path / "raw"
        preprocessed_folder = root_path / "preprocessed"
        raw_folder.mkdir()
        preprocessed_folder.mkdir()

        for i in range(1, 5):
            artifact_name = f"raw_data_{i}.txt"
            with (raw_folder / artifact_name).open("w") as f:
                random_text = "".join(
                    random.choice(string.ascii_letters) for _ in range(10)
                )
                f.write(random_text)

        for i in range(1, 3):
            collection_folder = raw_folder / f"Collection_{i}"
            collection_folder.mkdir()

            for j in range(1, 5):
                artifact_name = f"raw_data_{j}.txt"
                with (collection_folder / artifact_name).open("w") as f:
                    random_text = "".join(
                        random.choice(string.ascii_letters) for _ in range(10)
                    )
                    f.write(random_text)

        for i in range(1, 5):
            artifact_name = f"result_{i}.txt"
            with (preprocessed_folder / artifact_name).open("w") as f:
                random_text = "".join(
                    random.choice(string.ascii_letters) for _ in range(10)
                )
                f.write(random_text)


root_folder = "complex_biological_project"
create_complex_biological_hierarchy(root_folder)
!lamin init --storage ./key-eval
 initialized lamindb: testuser1/key-eval
import lamindb as ln


ln.settings.verbosity = "hint"
 connected lamindb: testuser1/key-eval
ln.UPath("complex_biological_project").view_tree()
4 sub-directories & 8 files with suffixes '.txt'
/home/runner/work/lamindb/lamindb/docs/faq/complex_biological_project
├── raw/
│   ├── raw_data_3.txt
│   ├── Collection_2/
│   ├── raw_data_4.txt
│   ├── raw_data_1.txt
│   ├── raw_data_2.txt
│   └── Collection_1/
└── preprocessed/
    ├── result_4.txt
    ├── result_2.txt
    ├── result_1.txt
    └── result_3.txt
ln.track("WIwaNDvlEkwS0000")
 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_ZySk5WfNg0609YXQpYDx.txt
 created Transform('WIwaNDvlEkwS0000'), started new Run('ZySk5WfN...') at 2025-01-20 08:22:40 UTC
 notebook imports: lamindb==1.0.2

Storing artifacts using Storage, Artifact, and Collection

Lamin has three storage classes that manage different types of in-memory and on-disk objects:

  1. Storage: Manages the default storage root that can be either local or in the cloud. For more details we refer to Storage FAQ.

  2. Artifact: Manages datasets with an optional key that acts as a relative path within the current default storage root (see Storage). An example is a single h5 artifact.

  3. Collection: Manages a collection of datasets with an optional key that acts as a relative path within the current default storage root (see Storage). An example is a collection of h5 artifacts.

For more details we refer to Tutorial: Artifacts.

The current storage root is:

ln.settings.storage
StorageSettings(root='/home/runner/work/lamindb/lamindb/docs/faq/key-eval', uid='gsdXxvOWVS6e')

By default, Lamin uses virtual keys that are only reflected in the database but not in storage. It is possible to turn this behavior off by setting ln.settings.creation._artifact_use_virtual_keys = False. Generally, we discourage disabling this setting manually. For more details we refer to Storage FAQ.

ln.settings.creation._artifact_use_virtual_keys
True
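
As a rough mental model of virtual keys, the sketch below derives the storage location from the uid and suffix only, which matches the `.lamindb/<uid><suffix>` paths in the logs of this notebook, while the key is kept as metadata. This is a toy model, not LaminDB's implementation; `ToyArtifact` and `storage_path` are hypothetical names.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass
class ToyArtifact:
    """Hypothetical stand-in for a registry row; not the lamindb Artifact class."""

    uid: str
    suffix: str
    key: Optional[str] = None  # semantic key, kept only as metadata


def storage_path(root: str, artifact: ToyArtifact) -> PurePosixPath:
    # With virtual keys, the location depends on uid + suffix, never on the key.
    return PurePosixPath(root) / ".lamindb" / f"{artifact.uid}{artifact.suffix}"


artifact = ToyArtifact(uid="bfCumwbaGLMOrpax0000", suffix=".txt", key="raw/raw_data_3.txt")
before = storage_path("key-eval", artifact)
artifact.key = "renamed/raw_data_3.txt"  # renaming the key touches metadata only
after = storage_path("key-eval", artifact)
print(before == after)
```

Under this model, renaming a key can never break access to the stored file, which is the behavior demonstrated later in this notebook.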

We will now create Artifact objects with and without semantic keys using key and also save them as Collections.

artifact_no_key_1 = ln.Artifact("complex_biological_project/raw/raw_data_1.txt")
artifact_no_key_2 = ln.Artifact("complex_biological_project/raw/raw_data_2.txt")
• path content will be copied to default storage upon `save()` with key `None` ('.lamindb/nzeCFU74IPt9cV760000.txt')
• path content will be copied to default storage upon `save()` with key `None` ('.lamindb/1RMsyNeByzO7bDUN0000.txt')

The logging suggests that the artifacts will be saved to our current default storage with auto-generated storage keys.

artifact_no_key_1.save()
artifact_no_key_2.save()
 storing artifact 'nzeCFU74IPt9cV760000' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/nzeCFU74IPt9cV760000.txt'
 storing artifact '1RMsyNeByzO7bDUN0000' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/1RMsyNeByzO7bDUN0000.txt'
Artifact(uid='1RMsyNeByzO7bDUN0000', is_latest=True, suffix='.txt', size=10, hash='AiGXKw_BmcO3FTRS34sOnA', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC)
artifact_key_3 = ln.Artifact(
    "complex_biological_project/raw/raw_data_3.txt", key="raw/raw_data_3.txt"
)
artifact_key_4 = ln.Artifact(
    "complex_biological_project/raw/raw_data_4.txt", key="raw/raw_data_4.txt"
)
artifact_key_3.save()
artifact_key_4.save()
• path content will be copied to default storage upon `save()` with key 'raw/raw_data_3.txt'
• path content will be copied to default storage upon `save()` with key 'raw/raw_data_4.txt'
 storing artifact 'bfCumwbaGLMOrpax0000' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/bfCumwbaGLMOrpax0000.txt'
 storing artifact 'j9aVKMg69SJqPtT60000' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/j9aVKMg69SJqPtT60000.txt'
Artifact(uid='j9aVKMg69SJqPtT60000', is_latest=True, key='raw/raw_data_4.txt', suffix='.txt', size=10, hash='KG2eC_osahDKXMPr_1GaHw', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC)

Because of virtual keys, artifacts with keys are not stored in different locations. However, they are still semantically queryable by key.

ln.Artifact.filter(key__contains="raw").df().head()
uid key description suffix kind otype size hash n_files n_observations _hash_type _key_is_virtual _overwrite_versions space_id storage_id schema_id version is_latest run_id created_at created_by_id _aux _branch_code
id
3 bfCumwbaGLMOrpax0000 raw/raw_data_3.txt None .txt None None 10 TaBpaFCU-tNvoNVAw97fNA None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.077000+00:00 1 None 1
4 j9aVKMg69SJqPtT60000 raw/raw_data_4.txt None .txt None None 10 KG2eC_osahDKXMPr_1GaHw None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.082000+00:00 1 None 1

Collection does not store any additional data in Storage. This notebook passes a name as the semantic identifier of the collection; as the FutureWarning in the output below shows, lamindb 1.0.2 deprecates the name argument in favor of key.

ds_1 = ln.Collection([artifact_no_key_1, artifact_no_key_2], name="no key collection")
ds_2 = ln.Collection([artifact_key_3, artifact_key_4], name="sample collection")
ds_1
/tmp/ipykernel_2964/436726562.py:1: FutureWarning: argument `name` will be removed, please pass no key collection to `key` instead
  ds_1 = ln.Collection([artifact_no_key_1, artifact_no_key_2], name="no key collection")
/tmp/ipykernel_2964/436726562.py:2: FutureWarning: argument `name` will be removed, please pass sample collection to `key` instead
  ds_2 = ln.Collection([artifact_key_3, artifact_key_4], name="sample collection")
Collection(uid='Bbo8D3cfc4svJpHM0000', is_latest=True, key='no key collection', hash='Nit1xagL4fOr-HKha1i41g', created_by_id=1, space_id=1, run_id=1, created_at=<django.db.models.expressions.DatabaseDefault object at 0x7f831e38cce0>)

Advantages and disadvantages of semantic keys

Semantic keys have several advantages and disadvantages that we will discuss and demonstrate in the remainder of this notebook:

Advantages:

  • Simple: It can be easier to refer to specific collections in conversations

  • Familiarity: Most people are familiar with the concept of semantic names

Disadvantages:

  • Length: Semantic names can be long with limited aesthetic appeal

  • Inconsistency: Lack of naming conventions can lead to confusion

  • Limited metadata: Semantic keys can contain some, but usually not all metadata

  • Inefficiency: Writing lengthy semantic names is a repetitive process and can be time-consuming

  • Ambiguity: Overly descriptive artifact names may introduce ambiguity and redundancy

  • Clashes: Several people may attempt to use the same semantic key. They are not unique
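
To make the clash problem concrete, here is a small plain-Python sketch that reports keys used more than once. `find_clashing_keys` is a hypothetical helper, not part of lamindb; in practice the list of keys could come from the key column of a query result.

```python
from collections import Counter


def find_clashing_keys(keys):
    """Return the keys that occur more than once, sorted; None (no key) is ignored."""
    counts = Counter(k for k in keys if k is not None)
    return sorted(k for k, n in counts.items() if n > 1)


keys = [
    "raw/raw_data_3.txt",
    "raw/raw_data_3.txt",  # same semantic key assigned to a second artifact
    None,
    None,  # artifacts without a key never clash
    "raw/raw_data_1.txt",
]
clashes = find_clashing_keys(keys)
print(clashes)
```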

Renaming artifacts

Artifacts that have associated keys can be renamed on several levels.

In storage

An artifact can be locally moved or renamed:

artifact_key_3.path
PosixUPath('/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/bfCumwbaGLMOrpax0000.txt')
loaded_artifact = artifact_key_3.load()
!mkdir complex_biological_project/moved_artifacts
!mv complex_biological_project/raw/raw_data_3.txt complex_biological_project/moved_artifacts
artifact_key_3.path
PosixUPath('/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/bfCumwbaGLMOrpax0000.txt')

After moving the artifact locally, the storage location (the path) has not changed and the artifact can still be loaded.

artifact_3 = artifact_key_3.load()

The key has not changed either.

artifact_key_3.key
'raw/raw_data_3.txt'

By key

Besides moving the artifact in storage, we can also rename its key.

artifact_key_4.key
'raw/raw_data_4.txt'
artifact_key_4.key = "bad_samples/sample_data_4.txt"
artifact_key_4.key
'bad_samples/sample_data_4.txt'

Due to the usage of virtual keys, modifying the key does not change the storage location and the artifact stays accessible.

artifact_key_4.path
PosixUPath('/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/j9aVKMg69SJqPtT60000.txt')
artifact_4 = artifact_key_4.load()

Modifying the path attribute

However, modifying the path directly is not allowed:

try:
    artifact_key_4.path = f"{ln.settings.storage}/here_now/sample_data_4.txt"
except AttributeError as e:
    print(e)
property of 'Artifact' object has no setter
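
The error message suggests that path is exposed as a read-only property. The sketch below shows that Python mechanism on a hypothetical class (`ReadOnlyPathDemo` is not a lamindb class, and the way the path is computed here is an assumption modeled on the logged storage paths).

```python
class ReadOnlyPathDemo:
    """Hypothetical class; only illustrates a property without a setter."""

    def __init__(self, root, uid, suffix):
        self._root = root
        self._uid = uid
        self._suffix = suffix

    @property
    def path(self):
        # Computed on access; no @path.setter is defined, so assignment fails.
        return f"{self._root}/.lamindb/{self._uid}{self._suffix}"


demo = ReadOnlyPathDemo("key-eval", "j9aVKMg69SJqPtT60000", ".txt")
try:
    demo.path = "key-eval/here_now/sample_data_4.txt"
    assignment_failed = False
except AttributeError:
    assignment_failed = True
print(assignment_failed)
```

Because the location is derived rather than stored as a free attribute, it cannot drift out of sync with the uid.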

Clashing semantic keys

Semantic keys should not clash. Let’s attempt to use the same semantic key twice:

print(artifact_key_3.key)
print(artifact_key_4.key)
raw/raw_data_3.txt
bad_samples/sample_data_4.txt
artifact_key_4.key = "raw/raw_data_3.txt"
print(artifact_key_3.key)
print(artifact_key_4.key)
raw/raw_data_3.txt
raw/raw_data_3.txt

When filtering for this semantic key, it is now unclear which artifact we are referring to:

ln.Artifact.filter(key__icontains="sample_data_3").df()
uid id key description suffix kind otype size hash n_files n_observations _hash_type _key_is_virtual _overwrite_versions space_id storage_id schema_id version is_latest run_id created_at created_by_id _aux _branch_code

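When querying by key, LaminDB cannot resolve which artifact we actually wanted. In fact, the query above returns no rows: neither artifact's key contains the string sample_data_3, and the reassigned key on artifact_key_4 has not yet been saved to the database. This does not paint a complete picture.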

print(artifact_key_3.uid)
print(artifact_key_4.uid)
bfCumwbaGLMOrpax0000
j9aVKMg69SJqPtT60000

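Both artifacts still exist, with unique uids that can be used to access them. Saving artifacts with a clashing key is expected to raise an IntegrityError to prevent this issue. Note, however, that in the run shown here (lamindb 1.0.2) the cell below printed nothing, and the later query by key lists both artifacts under raw/raw_data_3.txt, so this protection should not be relied upon.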

try:
    artifact_key_3.save()
    artifact_key_4.save()
except Exception:
    print(
        "It is not possible to save artifacts to the same key. This results in an"
        " Integrity Error!"
    )

We refer to What happens if I save the same artifacts & records twice? for more detailed explanations of behavior when attempting to save artifacts multiple times.

Hierarchies

Another common use case for keys is artifact hierarchies. It can be useful to mirror the artifact structure of “complex_biological_project” from above in LaminDB so that artifacts stored in specific folders can be queried. Common examples are folders for different processing stages such as raw, preprocessed, or curated.

Note that this use case may overlap with Collection, which also allows for grouping artifacts. However, Collection cannot model hierarchical groupings.

Key

import os

for root, _, artifacts in os.walk("complex_biological_project/raw"):
    for artifactname in artifacts:
        file_path = Path(root) / artifactname
        # removeprefix keeps the separator, so keys start with "/" (e.g. "/raw/raw_data_1.txt")
        key_path = str(file_path).removeprefix("complex_biological_project")
        ln_artifact = ln.Artifact(file_path, key=key_path)
        ln_artifact.save()
 returning existing artifact with same hash: Artifact(uid='j9aVKMg69SJqPtT60000', is_latest=True, key='raw/raw_data_3.txt', suffix='.txt', size=10, hash='KG2eC_osahDKXMPr_1GaHw', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
! key raw/raw_data_3.txt on existing artifact differs from passed key /raw/raw_data_4.txt
 returning existing artifact with same hash: Artifact(uid='nzeCFU74IPt9cV760000', is_latest=True, suffix='.txt', size=10, hash='1w-ky3HSCgf0OMu3-FvmeQ', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
! updated key from None to /raw/raw_data_1.txt
 returning existing artifact with same hash: Artifact(uid='1RMsyNeByzO7bDUN0000', is_latest=True, suffix='.txt', size=10, hash='AiGXKw_BmcO3FTRS34sOnA', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
! updated key from None to /raw/raw_data_2.txt
• path content will be copied to default storage upon `save()` with key '/raw/Collection_2/raw_data_3.txt'
 storing artifact 'Q4d4n7dbg0UWxzVE0000' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/Q4d4n7dbg0UWxzVE0000.txt'
• path content will be copied to default storage upon `save()` with key '/raw/Collection_2/raw_data_4.txt'
 storing artifact 'V0NPcv0KJVM89q5J0000' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/V0NPcv0KJVM89q5J0000.txt'
• path content will be copied to default storage upon `save()` with key '/raw/Collection_2/raw_data_1.txt'
 storing artifact 'x8xb68GqO2aNLt730000' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/x8xb68GqO2aNLt730000.txt'
• path content will be copied to default storage upon `save()` with key '/raw/Collection_2/raw_data_2.txt'
 storing artifact 'q4IoEK0SZJHLe9Dh0000' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/q4IoEK0SZJHLe9Dh0000.txt'
• path content will be copied to default storage upon `save()` with key '/raw/Collection_1/raw_data_3.txt'
 storing artifact 'iMfFZJ0QExiZHGbL0000' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/iMfFZJ0QExiZHGbL0000.txt'
• path content will be copied to default storage upon `save()` with key '/raw/Collection_1/raw_data_4.txt'
 storing artifact '8kKqYU7HbPOTWwJZ0000' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/8kKqYU7HbPOTWwJZ0000.txt'
• path content will be copied to default storage upon `save()` with key '/raw/Collection_1/raw_data_1.txt'
 storing artifact 'xkeTB8FDgu716Cyf0000' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/xkeTB8FDgu716Cyf0000.txt'
• path content will be copied to default storage upon `save()` with key '/raw/Collection_1/raw_data_2.txt'
 storing artifact 'gX0c3BBSCR1rOhXr0000' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/gX0c3BBSCR1rOhXr0000.txt'
ln.Artifact.filter(key__startswith="raw").df()
uid key description suffix kind otype size hash n_files n_observations _hash_type _key_is_virtual _overwrite_versions space_id storage_id schema_id version is_latest run_id created_at created_by_id _aux _branch_code
id
3 bfCumwbaGLMOrpax0000 raw/raw_data_3.txt None .txt None None 10 TaBpaFCU-tNvoNVAw97fNA None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.077000+00:00 1 None 1
4 j9aVKMg69SJqPtT60000 raw/raw_data_3.txt None .txt None None 10 KG2eC_osahDKXMPr_1GaHw None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.082000+00:00 1 None 1
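
Only two rows match because the keys created in the loop begin with a slash: removeprefix strips the folder name but leaves the separator, so /raw/... does not start with raw. A minimal sketch of deriving slash-free keys with relative_to follows; `key_for` is a hypothetical helper, not a lamindb function.

```python
from pathlib import PurePosixPath


def key_for(file_path: str, project_root: str) -> str:
    """Key relative to the project root, without a leading slash."""
    return PurePosixPath(file_path).relative_to(project_root).as_posix()


file_path = "complex_biological_project/raw/Collection_1/raw_data_1.txt"

# What the loop above produces: the separator survives as a leading slash.
old_key = file_path.removeprefix("complex_biological_project")
# Relative path: no leading slash, so a startswith("raw") filter would match.
new_key = key_for(file_path, "complex_biological_project")

print(old_key)
print(new_key)
```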

Collection

Alternatively, it would have been possible to create a Collection with a corresponding name:

all_data_paths = []
for root, _, artifacts in os.walk("complex_biological_project/raw"):
    for artifactname in artifacts:
        file_path = Path(root) / artifactname
        all_data_paths.append(file_path)

all_data_artifacts = []
for path in all_data_paths:
    all_data_artifacts.append(ln.Artifact(path))

data_ds = ln.Collection(all_data_artifacts, name="data")
data_ds.save()
 returning existing artifact with same hash: Artifact(uid='j9aVKMg69SJqPtT60000', is_latest=True, key='raw/raw_data_3.txt', suffix='.txt', size=10, hash='KG2eC_osahDKXMPr_1GaHw', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
 returning existing artifact with same hash: Artifact(uid='nzeCFU74IPt9cV760000', is_latest=True, key='/raw/raw_data_1.txt', suffix='.txt', size=10, hash='1w-ky3HSCgf0OMu3-FvmeQ', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
 returning existing artifact with same hash: Artifact(uid='1RMsyNeByzO7bDUN0000', is_latest=True, key='/raw/raw_data_2.txt', suffix='.txt', size=10, hash='AiGXKw_BmcO3FTRS34sOnA', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
 returning existing artifact with same hash: Artifact(uid='Q4d4n7dbg0UWxzVE0000', is_latest=True, key='/raw/Collection_2/raw_data_3.txt', suffix='.txt', size=10, hash='mUjQUHSEYLQYFUeSyv7mkg', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
 returning existing artifact with same hash: Artifact(uid='V0NPcv0KJVM89q5J0000', is_latest=True, key='/raw/Collection_2/raw_data_4.txt', suffix='.txt', size=10, hash='InmFBxaqbpW01pg2cMNB_w', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
 returning existing artifact with same hash: Artifact(uid='x8xb68GqO2aNLt730000', is_latest=True, key='/raw/Collection_2/raw_data_1.txt', suffix='.txt', size=10, hash='qFSM2RyebuHeuUx5J1bveg', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
 returning existing artifact with same hash: Artifact(uid='q4IoEK0SZJHLe9Dh0000', is_latest=True, key='/raw/Collection_2/raw_data_2.txt', suffix='.txt', size=10, hash='MI_xk1ByRtBdphl6rl7TLA', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
 returning existing artifact with same hash: Artifact(uid='iMfFZJ0QExiZHGbL0000', is_latest=True, key='/raw/Collection_1/raw_data_3.txt', suffix='.txt', size=10, hash='Te_kdG9u1V_SrnhU0D5Frw', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
 returning existing artifact with same hash: Artifact(uid='8kKqYU7HbPOTWwJZ0000', is_latest=True, key='/raw/Collection_1/raw_data_4.txt', suffix='.txt', size=10, hash='O6RnJmF8i6_-WJs8ftXOvA', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
 returning existing artifact with same hash: Artifact(uid='xkeTB8FDgu716Cyf0000', is_latest=True, key='/raw/Collection_1/raw_data_1.txt', suffix='.txt', size=10, hash='v3uHaqsNC9UFa5NqA6AhTQ', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
 returning existing artifact with same hash: Artifact(uid='gX0c3BBSCR1rOhXr0000', is_latest=True, key='/raw/Collection_1/raw_data_2.txt', suffix='.txt', size=10, hash='EW1tjJgRVmZe5awk35H5gw', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
/tmp/ipykernel_2964/1482051281.py:11: FutureWarning: argument `name` will be removed, please pass data to `key` instead
  data_ds = ln.Collection(all_data_artifacts, name="data")
Collection(uid='5NhLAslBG0c87zt00000', is_latest=True, key='data', hash='FpDUyKKRYCMC5hAjlsGpgQ', created_by_id=1, space_id=1, run_id=1, created_at=2025-01-20 08:22:42 UTC)
ln.Collection.filter(name__icontains="data").df()
uid key description hash reference reference_type space_id meta_artifact_id version is_latest run_id created_at created_by_id _aux _branch_code
id
1 5NhLAslBG0c87zt00000 data None FpDUyKKRYCMC5hAjlsGpgQ None None 1 None None True 1 2025-01-20 08:22:42.811000+00:00 1 None 1

This approach will likely lead to clashes. Alternatively, ULabels can be added to artifacts to mirror hierarchies.

Ulabels

for root, _, artifacts in os.walk("complex_biological_project/raw"):
    for artifactname in artifacts:
        file_path = Path(root) / artifactname
        key_path = str(file_path).removeprefix("complex_biological_project")
        ln_artifact = ln.Artifact(file_path, key=key_path)
        ln_artifact.save()

        data_label = ln.ULabel(name="data")
        data_label.save()
        ln_artifact.ulabels.add(data_label)
 returning existing artifact with same hash: Artifact(uid='j9aVKMg69SJqPtT60000', is_latest=True, key='raw/raw_data_3.txt', suffix='.txt', size=10, hash='KG2eC_osahDKXMPr_1GaHw', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
! key raw/raw_data_3.txt on existing artifact differs from passed key /raw/raw_data_4.txt
 returning existing artifact with same hash: Artifact(uid='nzeCFU74IPt9cV760000', is_latest=True, key='/raw/raw_data_1.txt', suffix='.txt', size=10, hash='1w-ky3HSCgf0OMu3-FvmeQ', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
 returning existing ULabel record with same name: 'data'
 returning existing artifact with same hash: Artifact(uid='1RMsyNeByzO7bDUN0000', is_latest=True, key='/raw/raw_data_2.txt', suffix='.txt', size=10, hash='AiGXKw_BmcO3FTRS34sOnA', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
 returning existing ULabel record with same name: 'data'
 returning existing artifact with same hash: Artifact(uid='Q4d4n7dbg0UWxzVE0000', is_latest=True, key='/raw/Collection_2/raw_data_3.txt', suffix='.txt', size=10, hash='mUjQUHSEYLQYFUeSyv7mkg', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
 returning existing ULabel record with same name: 'data'
 returning existing artifact with same hash: Artifact(uid='V0NPcv0KJVM89q5J0000', is_latest=True, key='/raw/Collection_2/raw_data_4.txt', suffix='.txt', size=10, hash='InmFBxaqbpW01pg2cMNB_w', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
 returning existing ULabel record with same name: 'data'
 returning existing artifact with same hash: Artifact(uid='x8xb68GqO2aNLt730000', is_latest=True, key='/raw/Collection_2/raw_data_1.txt', suffix='.txt', size=10, hash='qFSM2RyebuHeuUx5J1bveg', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
 returning existing ULabel record with same name: 'data'
 returning existing artifact with same hash: Artifact(uid='q4IoEK0SZJHLe9Dh0000', is_latest=True, key='/raw/Collection_2/raw_data_2.txt', suffix='.txt', size=10, hash='MI_xk1ByRtBdphl6rl7TLA', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
 returning existing ULabel record with same name: 'data'
 returning existing artifact with same hash: Artifact(uid='iMfFZJ0QExiZHGbL0000', is_latest=True, key='/raw/Collection_1/raw_data_3.txt', suffix='.txt', size=10, hash='Te_kdG9u1V_SrnhU0D5Frw', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
 returning existing ULabel record with same name: 'data'
 returning existing artifact with same hash: Artifact(uid='8kKqYU7HbPOTWwJZ0000', is_latest=True, key='/raw/Collection_1/raw_data_4.txt', suffix='.txt', size=10, hash='O6RnJmF8i6_-WJs8ftXOvA', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
 returning existing ULabel record with same name: 'data'
 returning existing artifact with same hash: Artifact(uid='xkeTB8FDgu716Cyf0000', is_latest=True, key='/raw/Collection_1/raw_data_1.txt', suffix='.txt', size=10, hash='v3uHaqsNC9UFa5NqA6AhTQ', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
 returning existing ULabel record with same name: 'data'
 returning existing artifact with same hash: Artifact(uid='gX0c3BBSCR1rOhXr0000', is_latest=True, key='/raw/Collection_1/raw_data_2.txt', suffix='.txt', size=10, hash='EW1tjJgRVmZe5awk35H5gw', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:42 UTC); if you intended to query to track this artifact as an input, use: ln.Artifact.get()
 returning existing ULabel record with same name: 'data'
labels = ln.ULabel.lookup()
ln.Artifact.filter(ulabels__in=[labels.data]).df()
uid key description suffix kind otype size hash n_files n_observations _hash_type _key_is_virtual _overwrite_versions space_id storage_id schema_id version is_latest run_id created_at created_by_id _aux _branch_code
id
4 j9aVKMg69SJqPtT60000 raw/raw_data_3.txt None .txt None None 10 KG2eC_osahDKXMPr_1GaHw None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.082000+00:00 1 None 1
1 nzeCFU74IPt9cV760000 /raw/raw_data_1.txt None .txt None None 10 1w-ky3HSCgf0OMu3-FvmeQ None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.027000+00:00 1 None 1
2 1RMsyNeByzO7bDUN0000 /raw/raw_data_2.txt None .txt None None 10 AiGXKw_BmcO3FTRS34sOnA None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.032000+00:00 1 None 1
5 Q4d4n7dbg0UWxzVE0000 /raw/Collection_2/raw_data_3.txt None .txt None None 10 mUjQUHSEYLQYFUeSyv7mkg None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.604000+00:00 1 None 1
6 V0NPcv0KJVM89q5J0000 /raw/Collection_2/raw_data_4.txt None .txt None None 10 InmFBxaqbpW01pg2cMNB_w None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.616000+00:00 1 None 1
7 x8xb68GqO2aNLt730000 /raw/Collection_2/raw_data_1.txt None .txt None None 10 qFSM2RyebuHeuUx5J1bveg None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.629000+00:00 1 None 1
8 q4IoEK0SZJHLe9Dh0000 /raw/Collection_2/raw_data_2.txt None .txt None None 10 MI_xk1ByRtBdphl6rl7TLA None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.642000+00:00 1 None 1
9 iMfFZJ0QExiZHGbL0000 /raw/Collection_1/raw_data_3.txt None .txt None None 10 Te_kdG9u1V_SrnhU0D5Frw None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.655000+00:00 1 None 1
10 8kKqYU7HbPOTWwJZ0000 /raw/Collection_1/raw_data_4.txt None .txt None None 10 O6RnJmF8i6_-WJs8ftXOvA None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.668000+00:00 1 None 1
11 xkeTB8FDgu716Cyf0000 /raw/Collection_1/raw_data_1.txt None .txt None None 10 v3uHaqsNC9UFa5NqA6AhTQ None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.681000+00:00 1 None 1
12 gX0c3BBSCR1rOhXr0000 /raw/Collection_1/raw_data_2.txt None .txt None None 10 EW1tjJgRVmZe5awk35H5gw None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.693000+00:00 1 None 1

However, ULabels are too versatile for such an approach, and clashes are to be expected here as well.

Metadata

Because the chance of clashes is rather high for the approaches above, we generally recommend not storing hierarchical data with semantic keys alone. Biological metadata makes artifacts and collections unambiguous and easily queryable.
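
To illustrate why structured metadata is easier to query than paths, here is a plain-Python sketch. The records, their uids, and the fields `stage` and `batch` are made up for illustration, and `by_metadata` is a hypothetical helper; the sketch does not reproduce how LaminDB stores or queries metadata.

```python
# Made-up records: keys follow inconsistent conventions, metadata fields do not.
records = [
    {"uid": "a1", "key": "raw/Collection_1/raw_data_1.txt", "stage": "raw", "batch": 1},
    {"uid": "a2", "key": "/raw/Collection_2/raw_data_1.txt", "stage": "raw", "batch": 2},
    {"uid": "a3", "key": "results/final_v2.txt", "stage": "preprocessed", "batch": 1},
]


def by_metadata(rows, **conditions):
    """Return uids of rows whose fields equal all given conditions."""
    return [r["uid"] for r in rows if all(r.get(f) == v for f, v in conditions.items())]


# A key-prefix query silently misses the record whose key starts with "/".
by_key = [r["uid"] for r in records if r["key"].startswith("raw")]
# A metadata query does not depend on how the key was spelled.
by_stage = by_metadata(records, stage="raw")

print(by_key)
print(by_stage)
```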

Legacy data and multiple storage roots

Distributed Collections

LaminDB can ingest legacy data that already has a structure in storage. In such cases, it disables _artifact_use_virtual_keys and ingests the artifacts with their actual storage location. It is therefore possible that artifacts stored in different storage roots are associated with a single Collection. To simulate this, we disable _artifact_use_virtual_keys and ingest artifacts stored in a different path (the “legacy data”).

ln.settings.creation._artifact_use_virtual_keys = False
for root, _, artifacts in os.walk("complex_biological_project/preprocessed"):
    for artifactname in artifacts:
        file_path = Path(root) / artifactname
        key_path = str(file_path).removeprefix("complex_biological_project")

        print(file_path)
        print()

        ln_artifact = ln.Artifact(file_path, key=f"./{key_path}")
        ln_artifact.save()
complex_biological_project/preprocessed/result_4.txt

• path content will be copied to default storage upon `save()` with key './/preprocessed/result_4.txt'
 storing artifact 'yAOqrEtOCkj2PajB0000' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/preprocessed/result_4.txt'
complex_biological_project/preprocessed/result_2.txt

• path content will be copied to default storage upon `save()` with key './/preprocessed/result_2.txt'
 storing artifact '01ub2sNTOI0UI2q60000' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/preprocessed/result_2.txt'
complex_biological_project/preprocessed/result_1.txt

• path content will be copied to default storage upon `save()` with key './/preprocessed/result_1.txt'
 storing artifact 'Qe7zia5FJuGqJpU30000' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/preprocessed/result_1.txt'
complex_biological_project/preprocessed/result_3.txt

• path content will be copied to default storage upon `save()` with key './/preprocessed/result_3.txt'
 storing artifact '0q4Q6z1y64TizFke0000' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/preprocessed/result_3.txt'
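
The logs above show that with virtual keys disabled, the key determines the storage location, and that the doubled separator in .//preprocessed/... does not appear in the stored path. The toy sketch below reproduces both logged locations under the assumption of simple POSIX path normalization; `resolve_location` is a hypothetical function, not LaminDB's implementation.

```python
import posixpath


def resolve_location(root, uid, suffix, key, virtual_keys):
    """Toy model of where an artifact ends up in storage."""
    if virtual_keys or key is None:
        # Assumption: without a usable key, fall back to the uid-based location.
        return posixpath.join(root, ".lamindb", f"{uid}{suffix}")
    # Non-virtual key: the key is a real path below the storage root.
    return posixpath.normpath(posixpath.join(root, key))


root = "/home/runner/work/lamindb/lamindb/docs/faq/key-eval"
virtual = resolve_location(
    root, "j9aVKMg69SJqPtT60000", ".txt", "raw/raw_data_4.txt", virtual_keys=True
)
legacy = resolve_location(
    root, "yAOqrEtOCkj2PajB0000", ".txt", ".//preprocessed/result_4.txt", virtual_keys=False
)
print(virtual)
print(legacy)
```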
ln.Artifact.df()
uid key description suffix kind otype size hash n_files n_observations _hash_type _key_is_virtual _overwrite_versions space_id storage_id schema_id version is_latest run_id created_at created_by_id _aux _branch_code
id
16 0q4Q6z1y64TizFke0000 .//preprocessed/result_3.txt None .txt None None 10 KaF9nHJLseYFAA-FaEwTKw None None md5 False False 1 1 None None True 1 2025-01-20 08:22:43.287000+00:00 1 None 1
15 Qe7zia5FJuGqJpU30000 .//preprocessed/result_1.txt None .txt None None 10 aTHxytKVPrv6wxuYwRZnvQ None None md5 False False 1 1 None None True 1 2025-01-20 08:22:43.273000+00:00 1 None 1
14 01ub2sNTOI0UI2q60000 .//preprocessed/result_2.txt None .txt None None 10 0C_RK_ThJpEXgHMgEf-6OQ None None md5 False False 1 1 None None True 1 2025-01-20 08:22:43.259000+00:00 1 None 1
13 yAOqrEtOCkj2PajB0000 .//preprocessed/result_4.txt None .txt None None 10 dF2dR0CRWYksKJ19JbUutw None None md5 False False 1 1 None None True 1 2025-01-20 08:22:43.246000+00:00 1 None 1
12 gX0c3BBSCR1rOhXr0000 /raw/Collection_1/raw_data_2.txt None .txt None None 10 EW1tjJgRVmZe5awk35H5gw None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.693000+00:00 1 None 1
11 xkeTB8FDgu716Cyf0000 /raw/Collection_1/raw_data_1.txt None .txt None None 10 v3uHaqsNC9UFa5NqA6AhTQ None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.681000+00:00 1 None 1
10 8kKqYU7HbPOTWwJZ0000 /raw/Collection_1/raw_data_4.txt None .txt None None 10 O6RnJmF8i6_-WJs8ftXOvA None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.668000+00:00 1 None 1
9 iMfFZJ0QExiZHGbL0000 /raw/Collection_1/raw_data_3.txt None .txt None None 10 Te_kdG9u1V_SrnhU0D5Frw None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.655000+00:00 1 None 1
8 q4IoEK0SZJHLe9Dh0000 /raw/Collection_2/raw_data_2.txt None .txt None None 10 MI_xk1ByRtBdphl6rl7TLA None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.642000+00:00 1 None 1
7 x8xb68GqO2aNLt730000 /raw/Collection_2/raw_data_1.txt None .txt None None 10 qFSM2RyebuHeuUx5J1bveg None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.629000+00:00 1 None 1
6 V0NPcv0KJVM89q5J0000 /raw/Collection_2/raw_data_4.txt None .txt None None 10 InmFBxaqbpW01pg2cMNB_w None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.616000+00:00 1 None 1
5 Q4d4n7dbg0UWxzVE0000 /raw/Collection_2/raw_data_3.txt None .txt None None 10 mUjQUHSEYLQYFUeSyv7mkg None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.604000+00:00 1 None 1
4 j9aVKMg69SJqPtT60000 raw/raw_data_3.txt None .txt None None 10 KG2eC_osahDKXMPr_1GaHw None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.082000+00:00 1 None 1
3 bfCumwbaGLMOrpax0000 raw/raw_data_3.txt None .txt None None 10 TaBpaFCU-tNvoNVAw97fNA None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.077000+00:00 1 None 1
2 1RMsyNeByzO7bDUN0000 /raw/raw_data_2.txt None .txt None None 10 AiGXKw_BmcO3FTRS34sOnA None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.032000+00:00 1 None 1
1 nzeCFU74IPt9cV760000 /raw/raw_data_1.txt None .txt None None 10 1w-ky3HSCgf0OMu3-FvmeQ None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.027000+00:00 1 None 1
artifact_from_raw = ln.Artifact.filter(key__icontains="Collection_2/raw_data_1").first()
artifact_from_preprocessed = ln.Artifact.filter(
    key__icontains="preprocessed/result_1"
).first()

print(artifact_from_raw.path)
print(artifact_from_preprocessed.path)
/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/x8xb68GqO2aNLt730000.txt
/home/runner/work/lamindb/lamindb/docs/faq/key-eval/preprocessed/result_1.txt
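
The two printed paths differ in how they are derived: the artifact with a virtual key lives under .lamindb/ and is named after its uid and suffix, while the artifact with a non-virtual key lives at the location given by its key. The sketch below only mirrors this observed pattern for single files. The function name and the storage root "/data/key-eval" are assumptions for illustration, and this is not LaminDB's implementation.

```python
from pathlib import PurePosixPath


def sketch_storage_path(root, uid, suffix, key, key_is_virtual):
    """Illustrative only: reproduces the path pattern seen in the output above."""
    root_path = PurePosixPath(root)
    if key_is_virtual:
        # Virtual key: the key is metadata only, the file is named by its uid.
        return root_path / ".lamindb" / f"{uid}{suffix}"
    # Non-virtual key: the key is the actual location below the storage root.
    return root_path / key


virtual = sketch_storage_path(
    "/data/key-eval",  # hypothetical storage root
    "x8xb68GqO2aNLt730000",
    ".txt",
    "/raw/Collection_2/raw_data_1.txt",
    key_is_virtual=True,
)
real = sketch_storage_path(
    "/data/key-eval",  # hypothetical storage root
    "Qe7zia5FJuGqJpU30000",
    ".txt",
    ".//preprocessed/result_1.txt",
    key_is_virtual=False,
)
print(virtual)
print(real)
```

Note that pathlib drops the './/' prefix of the second key when joining, which matches the clean path printed for the preprocessed artifact.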

Let’s create our Collection:

ds = ln.Collection(
    [artifact_from_raw, artifact_from_preprocessed],
    key="raw_and_processed_collection_2",
)
ds.save()
Collection(uid='53xmgclnzbFZb8mj0000', is_latest=True, key='raw_and_processed_collection_2', hash='IUhtBGDS9ezDM9coUfS0_w', created_by_id=1, space_id=1, run_id=1, created_at=2025-01-20 08:22:43 UTC)
ds.artifacts.df()
uid key description suffix kind otype size hash n_files n_observations _hash_type _key_is_virtual _overwrite_versions space_id storage_id schema_id version is_latest run_id created_at created_by_id _aux _branch_code
id
7 x8xb68GqO2aNLt730000 /raw/Collection_2/raw_data_1.txt None .txt None None 10 qFSM2RyebuHeuUx5J1bveg None None md5 True False 1 1 None None True 1 2025-01-20 08:22:42.629000+00:00 1 None 1
15 Qe7zia5FJuGqJpU30000 .//preprocessed/result_1.txt None .txt None None 10 aTHxytKVPrv6wxuYwRZnvQ None None md5 False False 1 1 None None True 1 2025-01-20 08:22:43.273000+00:00 1 None 1

Modeling directories

ln.settings.creation._artifact_use_virtual_keys = True
dir_path = ln.core.datasets.dir_scrnaseq_cellranger("sample_001")
ln.UPath(dir_path).view_tree()
• file has more than one suffix (path.suffixes), using only last suffix: '.bai' - if you want your composite suffix to be recognized add it to lamindb.core.storage.VALID_SIMPLE_SUFFIXES.add()
3 sub-directories & 15 files with suffixes '.mtx.gz', '.bam', '.html', '.h5', '.bai', '.cloupe', '.tsv.gz', '.csv'
/home/runner/work/lamindb/lamindb/docs/faq/sample_001
├── metrics_summary.csv
├── raw_feature_bc_matrix/
│   ├── matrix.mtx.gz
│   ├── barcodes.tsv.gz
│   └── features.tsv.gz
├── molecule_info.h5
├── filtered_feature_bc_matrix/
│   ├── matrix.mtx.gz
│   ├── barcodes.tsv.gz
│   └── features.tsv.gz
├── raw_feature_bc_matrix.h5
├── filtered_feature_bc_matrix.h5
├── web_summary.html
├── possorted_genome_bam.bam
├── analysis/
│   └── analysis.csv
├── possorted_genome_bam.bam.bai
└── cloupe.cloupe
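
The message about '.bai' in the output above concerns files whose names carry more than one suffix. As a small standard-library sketch of the distinction it refers to, pathlib exposes both the full list of suffixes and the last one. The helper name describe_suffixes is made up for illustration, and which composite suffixes LaminDB recognizes is governed by its own settings, not by this snippet.

```python
from pathlib import PurePosixPath


def describe_suffixes(name: str):
    """Return (all suffixes, last suffix) of a file name."""
    path = PurePosixPath(name)
    return path.suffixes, path.suffix


for name in ["possorted_genome_bam.bam.bai", "matrix.mtx.gz", "metrics_summary.csv"]:
    all_suffixes, last_suffix = describe_suffixes(name)
    print(f"{name}: suffixes={all_suffixes}, last={last_suffix}")
```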

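There are two ways to create Artifact objects from directories: Artifact.from_dir(), which creates one Artifact per file, and the Artifact constructor, which registers the whole directory as a single Artifact.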

cellranger_raw_artifact = ln.Artifact.from_dir("sample_001/raw_feature_bc_matrix/")
! folder is outside existing storage location, will copy files from sample_001/raw_feature_bc_matrix/ to /home/runner/work/lamindb/lamindb/docs/faq/key-eval/raw_feature_bc_matrix
 created 3 artifacts from directory using storage /home/runner/work/lamindb/lamindb/docs/faq/key-eval and key = raw_feature_bc_matrix/
for artifact in cellranger_raw_artifact:
    artifact.save()
 storing artifact 'm5x1Sb4YM8ESObDr0000' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/m5x1Sb4YM8ESObDr0000.mtx.gz'
 storing artifact 'fKyLQnIb9Pewi9zt0000' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/fKyLQnIb9Pewi9zt0000.tsv.gz'
 storing artifact 'Ap9H4iCSc0D1D8tR0000' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/Ap9H4iCSc0D1D8tR0000.tsv.gz'
cellranger_raw_folder = ln.Artifact(
    "sample_001/raw_feature_bc_matrix/", description="cellranger raw"
)
cellranger_raw_folder.save()
• path content will be copied to default storage upon `save()` with key `None` ('.lamindb/pcbBcNmjS25m5ujH')
 storing artifact 'pcbBcNmjS25m5ujH0000' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/pcbBcNmjS25m5ujH'
Artifact(uid='pcbBcNmjS25m5ujH0000', is_latest=True, description='cellranger raw', suffix='', size=18, hash='Vv9WJ-uI4AUEmVtfVMtcww', n_files=3, space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-20 08:22:43 UTC)
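
The two calls above produce different kinds of records: three artifacts with per-file keys in the first case, and one artifact with n_files=3 and size=18 in the second. The sketch below imitates that difference with plain dictionaries and the standard library. The function names and record fields are illustrative assumptions, not LaminDB's data model.

```python
import tempfile
from pathlib import Path


def per_file_records(directory):
    """One record per file, keyed relative to the directory's parent."""
    directory = Path(directory)
    return [
        {"key": f.relative_to(directory.parent).as_posix(), "size": f.stat().st_size}
        for f in sorted(directory.rglob("*"))
        if f.is_file()
    ]


def folder_record(directory, description):
    """A single record summarizing the whole directory."""
    files = [f for f in Path(directory).rglob("*") if f.is_file()]
    return {
        "description": description,
        "n_files": len(files),
        "size": sum(f.stat().st_size for f in files),
    }


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "raw_feature_bc_matrix"
    folder.mkdir()
    for name in ["matrix.mtx.gz", "barcodes.tsv.gz", "features.tsv.gz"]:
        (folder / name).write_bytes(b"x" * 6)  # 6 bytes each, like the demo files

    files = per_file_records(folder)
    summary = folder_record(folder, "cellranger raw")

print(files)
print(summary)
```

Per-file records can be queried individually by key, whereas the folder record is found through its description and aggregates file count and size.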
ln.Artifact.filter(key__icontains="raw_feature_bc_matrix").df()
uid key description suffix kind otype size hash n_files n_observations _hash_type _key_is_virtual _overwrite_versions space_id storage_id schema_id version is_latest run_id created_at created_by_id _aux _branch_code
id
17 m5x1Sb4YM8ESObDr0000 raw_feature_bc_matrix/matrix.mtx.gz None .mtx.gz None None 6 8wC--xuna-uUw8fR-_ZGmg None None md5 True False 1 1 None None True 1 2025-01-20 08:22:43.461000+00:00 1 None 1
18 fKyLQnIb9Pewi9zt0000 raw_feature_bc_matrix/barcodes.tsv.gz None .tsv.gz None None 6 lD8fws6zASKe0x8AhSIfuQ None None md5 True False 1 1 None None True 1 2025-01-20 08:22:43.467000+00:00 1 None 1
19 Ap9H4iCSc0D1D8tR0000 raw_feature_bc_matrix/features.tsv.gz None .tsv.gz None None 6 Ahq08rj6CmFtL-USVOGhqg None None md5 True False 1 1 None None True 1 2025-01-20 08:22:43.474000+00:00 1 None 1
ln.Artifact.get(key__icontains="raw_feature_bc_matrix/matrix.mtx.gz").path
PosixUPath('/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/m5x1Sb4YM8ESObDr0000.mtx.gz')
artifact = ln.Artifact.get(description="cellranger raw")
artifact.path.glob("*")
<generator object Path.glob at 0x7f83395518b0>
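
The call above returns a lazy generator, so nothing is listed until it is consumed. The sketch below shows this with a plain pathlib.Path on a temporary directory; it assumes that artifact.path follows the same pathlib interface, as the PosixUPath shown earlier suggests.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ["matrix.mtx.gz", "barcodes.tsv.gz", "features.tsv.gz"]:
        (folder / name).touch()

    matches = folder.glob("*")  # generator, no directory listing has happened yet
    names = sorted(p.name for p in matches)  # consuming it yields the entries

print(names)
```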