How do I validate & annotate arbitrary data structures?

This guide walks through the low-level API that lets you validate iterables.

You can then use the records created or inferred during validation to annotate a dataset.

How do I validate based on a public ontology?

LaminDB makes it easy to validate categorical variables based on registries that inherit from CanCurate.

CanCurate methods validate against the registries in your LaminDB instance. In Manage biological registries, you’ll see how to extend standard validation to validation against public references using a ReferenceTable ontology object: public = Record.public(). By default, from_values() considers a match in a public reference a validated value for any bionty entity.

# pip install 'lamindb[bionty,zarr]'
!lamin init --storage ./test-curate-any --modules bionty
 initialized lamindb: testuser1/test-curate-any

Define a test dataset.

import lamindb as ln
import bionty as bt
import zarr
import numpy as np

data = zarr.create(
    (3,),
    dtype=[("temperature", "f8"), ("knockout_gene", "U15"), ("disease", "U16")],
    store="data.zarr",
)
data["knockout_gene"] = ["ENSG00000139618", "ENSG00000141510", "ENSG00000133703"]
data["disease"] = np.random.default_rng().choice(["MONDO:0004975", "MONDO:0004980"], 3)
 connected lamindb: testuser1/test-curate-any

Validate and standardize vectors

validate() validates vector-like values against reference values in a registry. It returns a boolean vector indicating which values have an exact match among the reference values.

bt.Disease.validate(data["disease"], field=bt.Disease.ontology_id)
! Your Disease registry is empty, consider populating it first!
   → use `.import_source()` to import records from a source, e.g. a public ontology
array([False, False, False])
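
The all-False result above follows from the registry being empty: exact matching against an empty set of reference values cannot succeed. The idea can be sketched in plain Python. This is an illustration of exact-match validation, not LaminDB's implementation; `validate_sketch` and both reference lists are hypothetical.

```python
def validate_sketch(values, reference):
    """Return one boolean per value: True where the value exactly matches a reference value."""
    reference = set(reference)
    return [value in reference for value in values]


values = ["MONDO:0004975", "MONDO:0004980", "MONDO:0004975"]

# An empty reference (comparable to an empty registry) validates nothing.
empty_result = validate_sketch(values, reference=[])

# Once the reference contains the terms, the same values validate.
populated_result = validate_sketch(
    values, reference=["MONDO:0004975", "MONDO:0004980"]
)

# Matching is exact, so a differently cased identifier does not validate.
case_result = validate_sketch(["mondo:0004975"], reference=["MONDO:0004975"])

print(empty_result)
print(populated_result)
print(case_result)
```

The result has one entry per input value, duplicates included, which mirrors the three-element array returned for the three-element `disease` column.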

When validation fails, you can call inspect() to figure out what to do.

inspect() applies the same definition of validation as validate(), but returns a richer InspectResult object. Most importantly, it logs recommended curation steps that would render the data validated.

Note: you can use standardize() to standardize synonyms.

bt.Disease.inspect(data["disease"], field=bt.Disease.ontology_id)
! received 2 unique terms, 1 empty/duplicated term is ignored
! 2 unique terms (100.00%) are not validated for ontology_id: 'MONDO:0004980', 'MONDO:0004975'
   detected 2 Disease terms in public source for ontology_id: 'MONDO:0004980', 'MONDO:0004975'
→  add records from public source to your Disease registry via .from_values()
<lamin_utils._inspect.InspectResult at 0x7f813b1e8160>
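
The log above reports unique terms, how many are not validated, and whether the missing terms exist in a public source. A minimal sketch of that bookkeeping in plain Python follows, together with a toy synonym mapping in the spirit of standardize(). Both functions, the dictionary keys, and the synonym table are hypothetical illustrations and do not reflect LaminDB's internals or the fields of InspectResult.

```python
def inspect_sketch(values, registry, public_source):
    """Summarize which unique values validate and where the rest could come from."""
    # Drop empty values and duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(v for v in values if v))
    validated = [v for v in unique if v in registry]
    non_validated = [v for v in unique if v not in registry]
    return {
        "n_unique": len(unique),
        "n_ignored": len(values) - len(unique),
        "validated": validated,
        "non_validated": non_validated,
        # Non-validated terms that a public source could supply.
        "found_in_public": [v for v in non_validated if v in public_source],
        "frac_non_validated": len(non_validated) / len(unique) if unique else 0.0,
    }


def standardize_sketch(values, synonyms):
    """Map known synonyms to their standard name; leave other values unchanged."""
    return [synonyms.get(v, v) for v in values]


report = inspect_sketch(
    ["MONDO:0004980", "MONDO:0004975", "MONDO:0004980"],
    registry=set(),  # empty registry, as in the example above
    public_source={"MONDO:0004975", "MONDO:0004980"},
)

# Illustrative synonym table, assumed for this sketch only.
standardized = standardize_sketch(
    ["Alzheimer's disease", "atopic eczema"],
    synonyms={"Alzheimer's disease": "Alzheimer disease"},
)

print(report)
print(standardized)
```

Separating the report from the fix keeps inspection free of side effects: nothing is added to the registry until you act on the recommendation.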

Bulk-creating records with from_values() returns only validated records.

diseases = bt.Disease.from_values(data["disease"], field=bt.Disease.ontology_id).save()

Repeat the process for more labels:

projects = ln.ULabel.from_values(
    ["Project A", "Project B"],
    field=ln.ULabel.name,
    create=True,  # create non-validated labels
).save()
genes = bt.Gene.from_values(data["knockout_gene"], field=bt.Gene.ensembl_gene_id).save()
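
The three calls above rely on two behaviors described in this guide: a match in a public reference counts as validated for bionty entities, and values without any match are only turned into records when `create=True`. A minimal sketch of that decision logic in plain Python follows. `from_values_sketch`, the `origin` labels, and the sample sets are hypothetical and stand in for registry lookups; this is not how LaminDB implements from_values().

```python
def from_values_sketch(values, registry, public_source, create=False):
    """Build one 'record' per unique value; drop unmatched values unless create=True."""
    records = []
    for value in dict.fromkeys(values):  # de-duplicate, keep first-seen order
        if value in registry:
            records.append({"value": value, "origin": "registry"})
        elif value in public_source:
            # A public-reference match is treated as validated.
            records.append({"value": value, "origin": "public"})
        elif create:
            # No match anywhere: only kept when the caller opts in.
            records.append({"value": value, "origin": "new"})
    return records


public_diseases = {"MONDO:0004975", "MONDO:0004980"}

diseases_sketch = from_values_sketch(
    ["MONDO:0004975", "MONDO:0004980", "MONDO:0004975"],
    registry=set(),
    public_source=public_diseases,
)

# Free-text labels have no public reference, so they need create=True.
dropped = from_values_sketch(["Project A", "Project B"], set(), set())
created = from_values_sketch(["Project A", "Project B"], set(), set(), create=True)

print(diseases_sketch)
print(dropped)
print(created)
```

Making creation opt-in means a typo in an input value is silently left out of the result instead of becoming a new record.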

Annotate the dataset

Register the dataset as an artifact:

artifact = ln.Artifact("data.zarr", key="my_dataset.zarr").save()
! no run & transform got linked, call `ln.track()` & re-run

Annotate with features:

ln.Feature(name="project", dtype=ln.ULabel).save()
ln.Feature(name="disease", dtype=bt.Disease.ontology_id).save()
ln.Feature(name="knockout_gene", dtype=bt.Gene.ensembl_gene_id).save()
artifact.features.add_values(
    {"project": projects, "knockout_gene": genes, "disease": diseases}
)
artifact.describe()
Artifact .zarr
├── General
│   ├── .uid = 'YZBJH3FNSPgF6Yd20000'
│   ├── .key = 'my_dataset.zarr'
│   ├── .size = 848
│   ├── .hash = 'SilFmsZ-n7ruAxHRzVSG7w'
│   ├── .n_files = 2
│   ├── .path = /home/runner/work/lamindb/lamindb/docs/faq/test-curate-any/.lamindb/YZBJH3FNSPgF6Yd2.zarr
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2025-03-16 20:57:58
├── Linked features
│   └── disease                     cat[bionty.Disease.ontol…  Alzheimer disease, atopic eczema
│       knockout_gene               cat[bionty.Gene.ensembl_…  BRCA2, KRAS, TP53
│       project                     cat[ULabel]                Project A, Project B
└── Labels
    └── .genes                      bionty.Gene                BRCA2, TP53, KRAS                        
        .diseases                   bionty.Disease             atopic eczema, Alzheimer disease         
        .ulabels                    ULabel                     Project A, Project B                     
# clean up test instance
!rm -r data.zarr
!rm -r ./test-curate-any
!lamin delete --force test-curate-any
 deleting instance testuser1/test-curate-any