How do I validate & annotate arbitrary data structures?
This guide walks through the low-level API that lets you validate iterables.
You can then use the records created or inferred during validation to annotate a dataset.
How do I validate based on a public ontology?
LaminDB makes it easy to validate categorical variables based on registries that inherit from CanCurate.
CanCurate methods validate against the registries in your LaminDB instance.
In Manage biological registries, you’ll see how to extend standard validation to validation against public references using a ReferenceTable ontology object: public = Record.public().
By default, from_values() considers a match in a public reference a validated value for any bionty entity.
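The two-tier lookup described above can be illustrated without a LaminDB instance. This is a minimal sketch under stated assumptions, not LaminDB's implementation: `match_source`, `registry`, and `public_reference` are hypothetical names, and plain Python sets stand in for a registry field and a public ontology.

```python
def match_source(value, registry, public_reference):
    """Report where a value is found: the local registry first, then the public reference."""
    if value in registry:
        return "registry"
    if value in public_reference:
        return "public"
    return None  # found nowhere, so the value is not validated


# Hypothetical stand-ins: an empty local registry and a public reference
# that is assumed to contain the two MONDO identifiers used in this guide.
registry = set()
public_reference = {"MONDO:0004975", "MONDO:0004980"}

sources = {
    value: match_source(value, registry, public_reference)
    for value in ["MONDO:0004975", "NOT:AN_ID"]
}
print(sources)  # {'MONDO:0004975': 'public', 'NOT:AN_ID': None}
```

In this sketch, a value found only in the public reference still counts as a match, which mirrors the default behavior described for from_values().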
# pip install 'lamindb[bionty,zarr]'
!lamin init --storage ./test-curate-any --modules bionty
→ initialized lamindb: testuser1/test-curate-any
Define a test dataset.
import lamindb as ln
import bionty as bt
import zarr
import numpy as np
data = zarr.create(
(3,),
dtype=[("temperature", "f8"), ("knockout_gene", "U15"), ("disease", "U16")],
store="data.zarr",
)
data["knockout_gene"] = ["ENSG00000139618", "ENSG00000141510", "ENSG00000133703"]
data["disease"] = np.random.default_rng().choice(["MONDO:0004975", "MONDO:0004980"], 3)
→ connected lamindb: testuser1/test-curate-any
Validate and standardize vectors
validate() validates vector-like values against reference values in a registry.
It returns a boolean vector indicating where a value has an exact match in the reference values.
bt.Disease.validate(data["disease"], field=bt.Disease.ontology_id)
! Your Disease registry is empty, consider populating it first!
→ use `.import_source()` to import records from a source, e.g. a public ontology
array([False, False, False])
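To make the exact-match semantics concrete, here is a minimal pure-Python sketch. It is not how LaminDB implements validate(); `validate_exact` and `reference` are hypothetical names, and the reference set is assumed to hold the two MONDO identifiers used in this guide.

```python
def validate_exact(values, reference):
    """Return one boolean per input value: True only for an exact match."""
    reference = set(reference)
    return [value in reference for value in values]


# Hypothetical reference values standing in for a populated registry field.
reference = {"MONDO:0004975", "MONDO:0004980"}

# The second value differs only in case, so it is not an exact match.
values = ["MONDO:0004975", "mondo:0004975", "MONDO:0004980"]
mask = validate_exact(values, reference)
print(mask)  # [True, False, True]

# Against an empty reference nothing validates, as with the empty registry above.
print(validate_exact(values, set()))  # [False, False, False]
```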
When validation fails, you can call inspect() to figure out what to do.
inspect() applies the same definition of validation as validate(), but returns a richer result object, InspectResult. Most importantly, it logs recommended curation steps that would render the data validated.
Note: you can use standardize() to standardize synonyms.
bt.Disease.inspect(data["disease"], field=bt.Disease.ontology_id)
! received 2 unique terms, 1 empty/duplicated term is ignored
! 2 unique terms (100.00%) are not validated for ontology_id: 'MONDO:0004980', 'MONDO:0004975'
detected 2 Disease terms in public source for ontology_id: 'MONDO:0004980', 'MONDO:0004975'
→ add records from public source to your Disease registry via .from_values()
<lamin_utils._inspect.InspectResult at 0x7f813b1e8160>
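The log above reports deduplicated terms split into validated and non-validated groups. The following sketch shows one way such a summary could be computed. It is an illustration only: `InspectSketch` and `inspect_sketch` are hypothetical names and are not the InspectResult class or LaminDB's logic.

```python
from dataclasses import dataclass, field


@dataclass
class InspectSketch:
    """Hypothetical stand-in for a rich inspection result."""

    validated: list = field(default_factory=list)
    non_validated: list = field(default_factory=list)

    @property
    def frac_validated(self):
        """Percentage of unique terms that have an exact match."""
        total = len(self.validated) + len(self.non_validated)
        return 100.0 * len(self.validated) / total if total else 0.0


def inspect_sketch(values, reference):
    """Deduplicate values, drop empty ones, and partition them by exact match."""
    # dict.fromkeys removes duplicates while preserving first-seen order.
    unique = [value for value in dict.fromkeys(values) if value]
    result = InspectSketch()
    for value in unique:
        if value in reference:
            result.validated.append(value)
        else:
            result.non_validated.append(value)
    return result


# Three input values, two unique terms, checked against an empty reference.
terms = ["MONDO:0004980", "MONDO:0004975", "MONDO:0004980"]
result = inspect_sketch(terms, reference=set())
print(result.non_validated)   # ['MONDO:0004980', 'MONDO:0004975']
print(result.frac_validated)  # 0.0
```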
Bulk-creating records with from_values() returns only validated records.
diseases = bt.Disease.from_values(data["disease"], field=bt.Disease.ontology_id).save()
Repeat the process for more labels:
projects = ln.ULabel.from_values(
["Project A", "Project B"],
field=ln.ULabel.name,
create=True, # create non-validated labels
).save()
genes = bt.Gene.from_values(data["knockout_gene"], field=bt.Gene.ensembl_gene_id).save()
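The difference between the default behavior and create=True can be sketched in plain Python. This is a simplified model under stated assumptions, not LaminDB's implementation: `LabelSketch`, `from_values_sketch`, and `known` are hypothetical names, and a set of known names stands in for the values that would validate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelSketch:
    """Hypothetical record type with a single name field."""

    name: str


def from_values_sketch(values, known, create=False):
    """Build records for validated values; with create=True, also for the rest."""
    records = []
    for value in dict.fromkeys(values):  # deduplicate, keep order
        if value in known or create:
            records.append(LabelSketch(name=value))
    return records


# Assume only "Project A" would validate.
known = {"Project A"}
strict = from_values_sketch(["Project A", "Project B"], known)
lenient = from_values_sketch(["Project A", "Project B"], known, create=True)
print([record.name for record in strict])   # ['Project A']
print([record.name for record in lenient])  # ['Project A', 'Project B']
```

In this model, free-form labels such as project names need create=True because no reference exists to validate them against, whereas ontology-backed values such as gene or disease identifiers can rely on the default.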
Annotate the dataset
Register the dataset as an artifact:
artifact = ln.Artifact("data.zarr", key="my_dataset.zarr").save()
! no run & transform got linked, call `ln.track()` & re-run
Annotate with features:
ln.Feature(name="project", dtype=ln.ULabel).save()
ln.Feature(name="disease", dtype=bt.Disease.ontology_id).save()
ln.Feature(name="knockout_gene", dtype=bt.Gene.ensembl_gene_id).save()
artifact.features.add_values(
{"project": projects, "knockout_gene": genes, "disease": diseases}
)
artifact.describe()
Artifact .zarr
├── General
│   ├── .uid = 'YZBJH3FNSPgF6Yd20000'
│   ├── .key = 'my_dataset.zarr'
│   ├── .size = 848
│   ├── .hash = 'SilFmsZ-n7ruAxHRzVSG7w'
│   ├── .n_files = 2
│   ├── .path = /home/runner/work/lamindb/lamindb/docs/faq/test-curate-any/.lamindb/YZBJH3FNSPgF6Yd2.zarr
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2025-03-16 20:57:58
├── Linked features
│   └── disease         cat[bionty.Disease.ontol…   Alzheimer disease, atopic eczema
│       knockout_gene   cat[bionty.Gene.ensembl_…   BRCA2, KRAS, TP53
│       project         cat[ULabel]                 Project A, Project B
└── Labels
    └── .genes      bionty.Gene      BRCA2, TP53, KRAS
        .diseases   bionty.Disease   atopic eczema, Alzheimer disease
        .ulabels    ULabel           Project A, Project B
# clean up test instance
!rm -r data.zarr
!rm -r ./test-curate-any
!lamin delete --force test-curate-any
• deleting instance testuser1/test-curate-any