How do I validate & annotate arbitrary data structures?
This guide walks through the low-level API that lets you validate iterables.
You can then use the records created or inferred during validation to annotate a dataset.
How do I validate based on a public ontology?
LaminDB makes it easy to validate categorical variables based on registries that inherit from CanCurate.
CanCurate methods validate against the registries in your LaminDB instance.
In Manage biological ontologies, you’ll see how to extend standard validation to validation against public references using a PublicOntology object, e.g., via public_genes = bt.Gene.public().
By default, from_values() considers a match in a public reference a validated value for any bionty entity.
# pip install 'lamindb[zarr]'
!lamin init --storage ./test-curate-any --modules bionty
→ initialized lamindb: testuser1/test-curate-any
Define a test dataset.
import lamindb as ln
import bionty as bt
import zarr
import numpy as np
data = zarr.open_group(store="data.zarr", mode="a")
data.create_dataset(name="temperature", shape=(3,), dtype="float32")
data.create_dataset(name="knockout_gene", shape=(3,), dtype=str)
data.create_dataset(name="disease", shape=(3,), dtype=str)
data["knockout_gene"][:] = np.array(
    ["ENSG00000139618", "ENSG00000141510", "ENSG00000133703"]
)
data["disease"][:] = np.random.default_rng().choice(
    ["MONDO:0004975", "MONDO:0004980"], 3
)
→ connected lamindb: testuser1/test-curate-any
Validate and standardize vectors
Read the disease array from the zarr group into memory.
disease = data["disease"][:]
validate() validates vector-like values against reference values in a registry.
It returns a boolean vector indicating where a value has an exact match in the reference values.
bt.Disease.validate(disease, field=bt.Disease.ontology_id)
! Your Disease registry is empty, consider populating it first!
→ use `.import_source()` to import records from a source, e.g. a public ontology
array([False, False, False])
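The all-False result above follows from the registry being empty: validation is an exact-match membership test against the values currently stored in the registry. The NumPy sketch below illustrates that semantics only; it is not LaminDB's implementation, and `empty_registry_ids` and `populated_registry_ids` are hypothetical stand-ins for the ontology IDs held in a registry.

```python
import numpy as np

# Values to check, as they might come out of the zarr array.
values = np.array(["MONDO:0004975", "MONDO:0004980", "MONDO:0004975"])

# Hypothetical stand-ins for the ontology IDs stored in a registry.
empty_registry_ids = np.array([], dtype=str)
populated_registry_ids = np.array(["MONDO:0004975"])

# Exact-match membership test: one boolean per input value.
mask_empty = np.isin(values, empty_registry_ids)
mask_populated = np.isin(values, populated_registry_ids)

print(mask_empty)
print(mask_populated)
```

Once the registry holds a matching ontology ID, the corresponding positions flip to True while unmatched values stay False.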
When validation fails, you can call inspect() to figure out what to do.
inspect() applies the same definition of validation as validate(), but returns a richer InspectResult object. Most importantly, it logs recommended curation steps that would render the data validated.
Note: you can use standardize() to standardize synonyms.
bt.Disease.inspect(disease, field=bt.Disease.ontology_id)
! received 2 unique terms, 1 empty/duplicated term is ignored
! 2 unique terms (100.00%) are not validated for ontology_id: 'MONDO:0004980', 'MONDO:0004975'
detected 2 Disease terms in public source for ontology_id: 'MONDO:0004975', 'MONDO:0004980'
→ add records from public source to your Disease registry via .from_values()
<lamin_utils._inspect.InspectResult at 0x7fb34218a4d0>
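The log above reflects three steps: deduplicate the input, split the unique terms into validated and non-validated against the registry, then look up the non-validated terms in a public source. A plain-Python sketch of that flow follows. The names `inspect_sketch`, `registry`, and `public_source` are hypothetical and not part of LaminDB; the real inspect() does more than this.

```python
def inspect_sketch(values, registry, public_source):
    """Illustrative only: mimic the split that the inspect() log reports."""
    # Deduplicate while preserving first-seen order; drop empty values.
    unique = list(dict.fromkeys(v for v in values if v))
    # Exact-match split against the (hypothetical) registry contents.
    validated = [v for v in unique if v in registry]
    non_validated = [v for v in unique if v not in registry]
    # Non-validated terms that a public source would recognize.
    in_public = [v for v in non_validated if v in public_source]
    return {
        "n_unique": len(unique),
        "validated": validated,
        "non_validated": non_validated,
        "in_public": in_public,
    }


result = inspect_sketch(
    ["MONDO:0004975", "MONDO:0004980", "MONDO:0004975"],
    registry=set(),  # empty registry, as in the run above
    public_source={"MONDO:0004975", "MONDO:0004980"},
)
print(result)
```

With an empty registry, both unique terms are non-validated but found in the public source, which is the situation where the log recommends adding records via from_values().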
Bulk-creating records with from_values() returns only validated records.
diseases = bt.Disease.from_values(disease, field=bt.Disease.ontology_id).save()
Repeat the process for more labels:
experiments = ln.Record.from_values(
    ["Experiment A", "Experiment B"],
    field=ln.Record.name,
    create=True,  # create non-validated labels
).save()
genes = bt.Gene.from_values(
    data["knockout_gene"][:], field=bt.Gene.ensembl_gene_id
).save()
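The two calls above treat unmatched values differently. As described in this guide, from_values() returns records only for validated values by default (for bionty entities, a match in the public reference counts as validated), while create=True also creates records for values without a match. The sketch below expresses that rule in plain Python; `from_values_sketch` and `known` are hypothetical and do not mirror LaminDB internals.

```python
def from_values_sketch(values, known, create=False):
    """Illustrative only: return one record-like dict per accepted value."""
    records = []
    for value in dict.fromkeys(values):  # deduplicate, keep input order
        if value in known:
            # A match in the registry or public reference: validated.
            records.append({"name": value, "validated": True})
        elif create:
            # No match, but the caller asked to create the label anyway.
            records.append({"name": value, "validated": False})
    return records


known_labels = {"Experiment A"}  # assumed pre-existing label
default_records = from_values_sketch(["Experiment A", "Experiment B"], known_labels)
created_records = from_values_sketch(
    ["Experiment A", "Experiment B"], known_labels, create=True
)
print([r["name"] for r in default_records])
print([r["name"] for r in created_records])
```

This is why free-form labels such as experiment names need create=True, whereas the gene and disease identifiers do not.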
Annotate the dataset
Register the dataset as an artifact:
artifact = ln.Artifact("data.zarr", key="my_dataset.zarr").save()
! no run & transform got linked, call `ln.track()` & re-run
Annotate with features:
ln.Feature(name="experiment", dtype=ln.Record).save()
ln.Feature(name="disease", dtype=bt.Disease.ontology_id).save()
ln.Feature(name="knockout_gene", dtype=bt.Gene.ensembl_gene_id).save()
artifact.features.add_values(
    {"experiment": experiments, "knockout_gene": genes, "disease": diseases}
)
artifact.describe()
Artifact .zarr
├── General
│   ├── key: my_dataset.zarr
│   ├── uid: nOKqrL8eu6hu18Pp0000          hash: 5C6TQClVatQxI3BfBKeXYA
│   ├── size: 1.2 KB                       transform: None
│   ├── space: all                         branch: all
│   ├── created_by: testuser1              created_at: 2025-10-16 11:49:29
│   ├── n_files: 6
│   └── storage path: /home/runner/work/lamindb/lamindb/docs/faq/test-curate-any/my_dataset.zarr
├── External features
│   └── disease          cat[bionty.Disease.ontology_id]     Alzheimer disease, atopic eczema
│       experiment       cat[Record]                         Experiment A, Experiment B
│       knockout_gene    cat[bionty.Gene.ensembl_gene_id]    BRCA2, KRAS, TP53
└── Labels
    └── .records         Record            Experiment A, Experiment B
        .genes           bionty.Gene       BRCA2, TP53, KRAS
        .diseases        bionty.Disease    atopic eczema, Alzheimer disease
# clean up test instance
!rm -r data.zarr
!rm -r ./test-curate-any
!lamin delete --force test-curate-any
• deleting instance testuser1/test-curate-any