Curate datasets of any format

Our previous guide explained how to validate, standardize, and annotate DataFrame and AnnData objects. In this guide, we’ll walk through the basic API that lets you work with data of any format.

How do I validate based on a public ontology?

LaminDB makes it easy to validate categorical variables based on registries that inherit from CanCurate.

CanCurate methods validate against the registries in your LaminDB instance. In Manage biological registries, you’ll see how to extend this standard validation to public references using a ReferenceTable ontology object: public = Record.public(). By default, from_values() treats a match in a public reference as a validated value for any bionty entity.

# !pip install 'lamindb[bionty,zarr]'
!lamin init --storage ./test-curate-any --schema bionty
 connected lamindb: testuser1/test-curate-any
import lamindb as ln
import bionty as bt
import zarr
import numpy as np

data = zarr.create((10,), dtype=[('value', 'f8'), ("gene", "U15"), ('disease', 'U16')], store='data.zarr')
data["gene"] = ["ENSG00000139618", "ENSG00000141510", "ENSG00000133703", "ENSG00000157764", "ENSG00000171862", "ENSG00000091831", "ENSG00000141736", "ENSG00000133056", "ENSG00000146648", "ENSG00000118523"]
data["disease"] = np.random.choice(['MONDO:0004975', 'MONDO:0004980'], 10)

Define validation criteria

Entities that don’t have a dedicated registry (“are not typed”) can be validated & registered using ULabel:

criteria = {
    "disease": bt.Disease.ontology_id,
    "project": ln.ULabel.name,
    "gene": bt.Gene.ensembl_gene_id,
}
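To illustrate how a criteria mapping like this can drive validation, here is a minimal plain-Python sketch. It does not use LaminDB: the `validate_columns` helper and the in-memory reference sets are hypothetical stand-ins for registry fields such as bt.Disease.ontology_id.

```python
def validate_columns(columns, references):
    """Check each named column against its set of allowed reference values.

    Returns a dict mapping each criteria key to a list of booleans, or to
    None when the dataset has no column with that name.
    """
    report = {}
    for name, allowed in references.items():
        if name not in columns:
            report[name] = None  # nothing to validate for this key
            continue
        report[name] = [value in allowed for value in columns[name]]
    return report


# Hypothetical in-memory stand-ins for registry fields (not LaminDB objects)
references = {
    "disease": {"MONDO:0004975", "MONDO:0004980"},
    "project": {"Project A", "Project B"},
}
columns = {"disease": ["MONDO:0004975", "MONDO:0000000"]}

report = validate_columns(columns, references)
print(report)
```

In this sketch, "project" maps to None because the example dataset has no such column, much as the project labels in this guide are attached to the artifact rather than stored in the zarr array.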

Validate and standardize metadata

validate() validates passed values against reference values in a registry. It returns a boolean vector indicating whether a value has an exact match in the reference values.

bt.Disease.validate(data["disease"], field=bt.Disease.ontology_id)
! Your Disease registry is empty, consider populating it first!
   → use `.import_source()` to import records from a source, e.g. a public ontology
array([False, False, False, False, False, False, False, False, False,
       False])
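The exact-match rule described above can be pictured with a few lines of plain Python. This is only a sketch of the idea, not LaminDB's implementation; `validate_exact` and the reference list are hypothetical names introduced for illustration.

```python
def validate_exact(values, reference):
    """Return one boolean per value: True only on an exact match in `reference`."""
    reference_set = set(reference)
    return [value in reference_set for value in values]


# Hypothetical reference values standing in for a populated registry field
reference_ids = ["MONDO:0004975", "MONDO:0004980"]
values = ["MONDO:0004975", "mondo:0004975", "MONDO:0000001"]

populated_result = validate_exact(values, reference_ids)
empty_result = validate_exact(values, [])  # an empty registry validates nothing

print(populated_result)
print(empty_result)
```

With an empty reference, every value comes back False, which mirrors the all-False vector returned above while the Disease registry was still empty.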

When validation fails, you can call inspect() to figure out what to do.

inspect() applies the same definition of validation as validate(), but returns a richer InspectResult object. Most importantly, it logs the recommended curation steps that would render the data validated.

Note: you can use standardize() to standardize synonyms.

bt.Disease.inspect(data["disease"], field=bt.Disease.ontology_id);
! received 2 unique terms, 8 empty/duplicated terms are ignored
! 2 unique terms (100.00%) are not validated for ontology_id: 'MONDO:0004980', 'MONDO:0004975'
   detected 2 Disease terms in Bionty for ontology_id: 'MONDO:0004980', 'MONDO:0004975'
→  add records from Bionty to your Disease registry via .from_values()
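The counts in a log like this can be reproduced with a small plain-Python sketch: drop empty and duplicated terms, then split the unique terms into validated and non-validated ones. `InspectSummary` and `inspect_values` are hypothetical names and are not the library's InspectResult.

```python
from dataclasses import dataclass


@dataclass
class InspectSummary:
    validated: list
    non_validated: list
    n_ignored: int              # empty or duplicated input terms
    frac_non_validated: float   # share of unique terms that failed validation


def inspect_values(values, registry_values):
    """Summarize which unique, non-empty values match the registry exactly."""
    # Drop empty values and de-duplicate while keeping first-seen order
    unique = list(dict.fromkeys(v for v in values if v))
    registry = set(registry_values)
    validated = [v for v in unique if v in registry]
    non_validated = [v for v in unique if v not in registry]
    frac = len(non_validated) / len(unique) if unique else 0.0
    return InspectSummary(validated, non_validated, len(values) - len(unique), frac)


# Ten values drawn from two ontology IDs, checked against an empty registry
values = ["MONDO:0004975", "MONDO:0004980"] * 5
summary = inspect_values(values, registry_values=[])
print(summary)
```

For this input the sketch reports 2 unique terms, 8 ignored terms, and a non-validated fraction of 1.0, the same figures as the log.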

Follow the suggestions to register the new labels. from_values() bulk-creates records and returns only validated ones; terms that validate against a public reference count as validated and are created too (see Manage biological registries for details):

diseases = bt.Disease.from_values(data["disease"], field=bt.Disease.ontology_id)
ln.save(diseases)

Repeat the process for more labels:

projects = ln.ULabel.from_values(
    ["Project A", "Project B"], 
    field=ln.ULabel.name, 
    create=True, # create non-existing labels rather than attempting to load them from the database
)
ln.save(projects)
genes = bt.Gene.from_values(data["gene"], field=bt.Gene.ensembl_gene_id)
ln.save(genes)
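The two from_values() behaviors used above (returning only validated records by default, and creating missing labels when create=True) can be pictured with the following plain-Python sketch. It is an assumption-laden illustration, not the library's code: `from_values_sketch`, the dict-based records, and the `origin` key are all hypothetical.

```python
def from_values_sketch(values, registry, public_reference, create=False):
    """Return one record per unique value, skipping values that cannot be validated."""
    records = []
    for value in dict.fromkeys(values):  # de-duplicate, keep first-seen order
        if value in registry:
            origin = "registry"
        elif value in public_reference:
            origin = "public reference"
        elif create:
            origin = "newly created"
        else:
            continue  # not validated and create=False: no record is returned
        records.append({"value": value, "origin": origin})
    return records


# Hypothetical public reference with the two disease IDs used in this guide
public_diseases = {"MONDO:0004975", "MONDO:0004980"}

disease_records = from_values_sketch(
    ["MONDO:0004975", "MONDO:0004975", "not-an-id"],
    registry=set(),
    public_reference=public_diseases,
)
project_records = from_values_sketch(
    ["Project A", "Project B"],
    registry=set(),
    public_reference=set(),
    create=True,
)
print(disease_records)
print(project_records)
```

In the sketch, the unknown value "not-an-id" yields no record, while both project names yield records only because create=True is passed.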

Annotate and save dataset with validated metadata

Register the dataset as an artifact:

artifact = ln.Artifact("data.zarr", description="a zarr object").save()
! no run & transform got linked, call `ln.track()` & re-run

Link the artifact to the validated labels. You could do this directly, e.g., via artifact.ulabels.add(projects) or artifact.diseases.add(diseases).

Often, however, you also want to track which feature each label belongs to. So let’s try to associate our labels with features:

from lamindb.core.exceptions import ValidationError

try:
    artifact.features.add_values({"project": projects, "disease": diseases})
except ValidationError as e:
    print(e)
! cannot infer feature type of: [ULabel(uid='nwFLCHZE', name='Project A', created_by_id=1, created_at=2024-12-02 00:33:13 UTC), ULabel(uid='Id4Zkyi6', name='Project B', created_by_id=1, created_at=2024-12-02 00:33:13 UTC)], returning '?
! cannot infer feature type of: [Disease(uid='4JmTj6Sn', name='atopic eczema', ontology_id='MONDO:0004980', synonyms='allergic dermatitis|Atopic dermatitis|allergic form of dermatitis|Besnier's prurigo|Atopic neurodermatitis|eczema|allergic|atopic eczema|eczematous dermatitis', description='A Chronic Inflammatory Genetically Determined Disease Of The Skin Marked By Increased Ability To Form Reagin (Ige), With Increased Susceptibility To Allergic Rhinitis And Asthma, And Hereditary Disposition To A Lowered Threshold For Pruritus. It Is Manifested By Lichenification, Excoriation, And Crusting, Mainly On The Flexural Surfaces Of The Elbow And Knee. In Infants It Is Known As Infantile Eczema.', created_by_id=1, source_id=49, created_at=2024-12-02 00:33:13 UTC), Disease(uid='4F2HPJ3w', name='Alzheimer disease', ontology_id='MONDO:0004975', synonyms='Alzheimers disease|Alzheimer's dementia|Alzheimer's disease|Alzheimers dementia|AD|presenile and senile dementia|Alzheimer dementia|Alzheimer disease', description='A Progressive, Neurodegenerative Disease Characterized By Loss Of Function And Death Of Nerve Cells In Several Areas Of The Brain Leading To Loss Of Cognitive Function Such As Memory And Language.', created_by_id=1, source_id=49, created_at=2024-12-02 00:33:13 UTC)], returning '?
These keys could not be validated: ['project', 'disease']
Here is how to create a feature:

  ln.Feature(name='project', dtype='?').save()
  ln.Feature(name='disease', dtype='?').save()

This failed because we hadn’t registered the features yet. After copying the suggested calls from the error message and replacing the placeholder dtype '?' with the actual categorical dtypes, the call succeeds:

ln.Feature(name='project', dtype='cat[ULabel]').save()
ln.Feature(name='disease', dtype='cat[bionty.Disease]').save()
artifact.features.add_values({"project": projects, "disease": diseases})
artifact.features
Artifact .zarr
└── Annotations
    └── Features                                                                                        
        disease                     cat[bionty.Disease]        Alzheimer disease, atopic eczema         
        project                     cat[ULabel]                Project A, Project B                     
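The register-then-annotate order shown above can be summarized in a small plain-Python sketch: annotation keys are rejected until a feature with that name and a dtype exists. `FeatureStore` and the local `ValidationError` class are hypothetical stand-ins, defined here only to keep the example self-contained.

```python
class ValidationError(Exception):
    """Local stand-in for the library's exception of the same name."""


class FeatureStore:
    """Hypothetical in-memory model of registered features and annotations."""

    def __init__(self):
        self.dtypes = {}       # feature name -> dtype string
        self.annotations = {}  # feature name -> list of label names

    def register(self, name, dtype):
        self.dtypes[name] = dtype

    def add_values(self, values):
        # Reject the whole call if any key lacks a registered feature
        missing = [key for key in values if key not in self.dtypes]
        if missing:
            raise ValidationError(f"These keys could not be validated: {missing}")
        for key, labels in values.items():
            self.annotations[key] = list(labels)


store = FeatureStore()
labels = {"project": ["Project A", "Project B"], "disease": ["Alzheimer disease"]}

try:
    store.add_values(labels)  # fails: no features registered yet
    first_error = None
except ValidationError as error:
    first_error = str(error)

store.register("project", "cat[ULabel]")
store.register("disease", "cat[bionty.Disease]")
store.add_values(labels)  # succeeds once both features exist
print(first_error)
print(store.annotations)
```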

Since genes are the measurements, we register them as features:

feature_set = ln.FeatureSet(genes)
feature_set.save()
artifact.features.add_feature_set(feature_set, slot="genes")
artifact.describe()
Artifact .zarr
├── General
│   ├── .uid = 'zeTM9Jko1fLCMiY10000'
│   ├── .size = 974
│   ├── .hash = 'mz7tB7q_5vgl3_HQCr2ehQ'
│   ├── .n_objects = 2
│   ├── .path = /home/runner/work/lamindb/lamindb/docs/test-curate-any/.lamindb/zeTM9Jko1fLCMiY1.zarr
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2024-12-02 00:33:16
├── Dataset/.feature_sets
│   └── genes • 10              [bionty.Gene]
│       BRCA2                       num
│       TP53                        num
│       KRAS                        num
│       BRAF                        num
│       PTEN                        num
│       ESR1                        num
│       ERBB2                       num
│       PIK3C2B                     num
│       EGFR                        num
│       CCN2                        num
└── Annotations
    ├── Features
    │   disease                     cat[bionty.Disease]        Alzheimer disease, atopic eczema
    │   project                     cat[ULabel]                Project A, Project B
    └── Labels                                                                                          
        .diseases                   bionty.Disease             'atopic eczema', 'Alzheimer disease'     
        .ulabels                    ULabel                     'Project A', 'Project B'                 
# clean up test instance
# the instance can only be deleted once its storage is empty, so remove the artifact first
artifact.delete(permanent=True, storage=True)
!lamin delete --force test-curate-any
!rm -r data.zarr