Glossary

artifact

An artifact stores a dataset or model as a file or folder. It is the output of a (tracked or untracked) process.

curator

An object designed to ensure your dataset conforms with a desired schema. It helps with validation, standardization (e.g., by fixing typos or mapping synonyms), and annotation (linking it to metadata entities so that it becomes queryable).
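
The validation and standardization steps can be illustrated with a minimal plain-Python sketch. It does not use the LaminDB API: the vocabulary, the synonym map, and the function names `standardize` and `validate` are hypothetical and chosen only for this example.

```python
# Hypothetical vocabulary and synonym map; not part of LaminDB.
ALLOWED_CELL_TYPES = {"T cell", "B cell", "monocyte"}
SYNONYMS = {"t-cell": "T cell", "T cells": "T cell", "b cell": "B cell"}


def standardize(values):
    """Map known synonyms to their canonical term; leave other values untouched."""
    return [SYNONYMS.get(value, value) for value in values]


def validate(values):
    """Return the values that are not part of the allowed vocabulary."""
    return [value for value in values if value not in ALLOWED_CELL_TYPES]


raw = ["T cell", "t-cell", "b cell", "monocyt"]
standardized = standardize(raw)
invalid = validate(standardized)

print(standardized)  # ['T cell', 'T cell', 'B cell', 'monocyt']
print(invalid)       # ['monocyt']
```

Standardization resolves the synonyms, while the typo "monocyt" is not in the synonym map and is therefore reported by validation rather than silently accepted.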

FAIR

FAIR data is data that meets the principles of findability, accessibility, interoperability, and reusability [Wikipedia].

feature

A feature is a measurable property represented in data (e.g., scalar, vector, image, embedding) [Wikipedia]. In these docs, we use “feature” independent of modeling role: a feature can serve as a predictor, target, covariate, or metadata variable, depending on the analysis. A feature maps to one or more dataset dimensions; in tabular data, scalar features map 1:1 to columns.

LaminDB comes with a Feature registry to organize dataset dimensions.
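
As a rough illustration of how features map to dataset dimensions, the sketch below uses a small table held in a plain dictionary. The mapping `feature_to_columns` and the helper `feature_values` are hypothetical names for this example and are not related to the Feature registry's actual interface.

```python
# A tiny table: each key is a column, each list position is an observation.
table = {
    "age": [34, 51],
    "cell_type": ["T cell", "B cell"],
    "embedding_0": [0.12, 0.80],
    "embedding_1": [0.55, 0.10],
}

# Hypothetical mapping from a feature name to the dataset dimensions it covers.
feature_to_columns = {
    "age": ["age"],                               # scalar feature: one column
    "cell_type": ["cell_type"],                   # scalar feature: one column
    "embedding": ["embedding_0", "embedding_1"],  # vector feature: two columns
}


def feature_values(feature, row):
    """Return the value of a feature for one observation (row index)."""
    values = [table[column][row] for column in feature_to_columns[feature]]
    return values[0] if len(values) == 1 else values


print(feature_values("age", 0))        # 34
print(feature_values("embedding", 1))  # [0.8, 0.1]
```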

GUI

Graphical user interface, for instance, a browser-based data catalog.

instance

Shorthand for “LaminDB instance”, a database that manages metadata for datasets in different storage locations.

label

A label in LaminDB is an entity in a registry – e.g. a sample, cell type, or perturbation – that can be linked to another entity – e.g. a dataset or model.

lakehouse

A data lakehouse combines the flexibility and cost-effectiveness of a data lake with data-management capabilities commonly associated with data warehouses. Typical capabilities include schema management, transactional guarantees (ACID), and metadata/index structures that reduce full dataset scans for many query patterns. Widely adopted open table formats in this space include Apache Iceberg, Delta Lake, and Apache Hudi. Managed services and platforms include offerings such as Google’s BigLake, Amazon’s Lake Formation, Dremio, and Starburst. For background, see this blog post from Google, this blog post from AWS, this glossary entry, and this paper from Databricks.

ORM

Object-relational mapper [Wikipedia]. In LaminDB, every subclass of SQLRecord is an ORM model that corresponds to a SQL table in the underlying metadata database. A SQLRecord object maps to a single row of that table. We refer to the SQLRecord class as a registry, hence the name of its metaclass: Registry.
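
The class-to-table and object-to-row correspondence can be sketched with a toy ORM built on Python's `sqlite3` module. This is not LaminDB's implementation: `ToyRegistry`, `ToySQLRecord`, and the methods `create_table`, `filter`, and `save` are hypothetical stand-ins that only mirror the idea of a metaclass providing table-level operations.

```python
import sqlite3


class ToyRegistry(type):
    """Toy metaclass: gives every model class table-level operations."""

    def create_table(cls, conn):
        columns = ", ".join(f"{name} {sql_type}" for name, sql_type in cls.fields.items())
        conn.execute(f"CREATE TABLE IF NOT EXISTS {cls.table} ({columns})")

    def filter(cls, conn, **conditions):
        sql = f"SELECT {', '.join(cls.fields)} FROM {cls.table}"
        if conditions:
            sql += " WHERE " + " AND ".join(f"{key} = ?" for key in conditions)
        rows = conn.execute(sql, tuple(conditions.values())).fetchall()
        # Each row of the table becomes one object of the model class.
        return [cls(**dict(zip(cls.fields, row))) for row in rows]


class ToySQLRecord(metaclass=ToyRegistry):
    """Toy base class: one subclass per table, one object per row."""

    table = ""
    fields = {}

    def __init__(self, **values):
        for name in self.fields:
            setattr(self, name, values.get(name))

    def save(self, conn):
        names = list(self.fields)
        placeholders = ", ".join("?" for _ in names)
        conn.execute(
            f"INSERT INTO {self.table} ({', '.join(names)}) VALUES ({placeholders})",
            [getattr(self, name) for name in names],
        )
        return self


class Artifact(ToySQLRecord):
    table = "artifact"
    fields = {"key": "TEXT", "suffix": "TEXT"}


conn = sqlite3.connect(":memory:")
Artifact.create_table(conn)
Artifact(key="data/raw.parquet", suffix=".parquet").save(conn)
Artifact(key="models/v1.pt", suffix=".pt").save(conn)

matches = Artifact.filter(conn, suffix=".pt")
print(type(Artifact).__name__)       # ToyRegistry
print([a.key for a in matches])      # ['models/v1.pt']
```

Because `create_table` and `filter` live on the metaclass, they are available on the class `Artifact` (the table) but not on its instances (the rows), which is one reason to place table-level behavior there.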

observation

In statistics and machine learning, an observation refers to a measurement of a set of random variables.

In biology, an observation typically corresponds to measuring (reading out) a set of properties from a biological sample.

record

A record is a data structure that consists of a sequence of typed fields that hold values [Wikipedia].

In LaminDB, any metadata record – including Artifact, Transform, Run, etc. – is modeled as a SQLRecord and is stored in a row in a table in the SQL database. LaminDB also comes with a class to dynamically model records, Record. This is useful for describing more frequently changing dataset schemas, for example, the columns in dynamically ingested parquet files or dynamically created sheets. While changing the fields of a SQLRecord requires updating its Python data model definition and running a migration in the SQL database, changing the features of a Record can be done dynamically.
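
The difference between changing fixed fields and changing dynamic features can be sketched with `sqlite3`. The table names and the choice to store dynamic features as JSON text are assumptions made for this illustration; they do not describe how LaminDB stores Record objects internally.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Fixed fields: adding a field means altering the table, i.e. a migration.
conn.execute("CREATE TABLE sample_fixed (name TEXT)")
conn.execute("ALTER TABLE sample_fixed ADD COLUMN donor TEXT")  # the migration step
conn.execute("INSERT INTO sample_fixed (name, donor) VALUES (?, ?)", ("s1", "d1"))

# Dynamic features (assumed JSON storage): a new feature needs no migration.
conn.execute("CREATE TABLE sample_dynamic (name TEXT, features TEXT)")
conn.execute(
    "INSERT INTO sample_dynamic VALUES (?, ?)",
    ("s1", json.dumps({"donor": "d1"})),
)
conn.execute(
    "INSERT INTO sample_dynamic VALUES (?, ?)",
    ("s2", json.dumps({"donor": "d2", "treatment": "drugA"})),
)

fixed_columns = [row[1] for row in conn.execute("PRAGMA table_info(sample_fixed)")]
dynamic_rows = {
    name: json.loads(features)
    for name, features in conn.execute("SELECT name, features FROM sample_dynamic")
}

print(fixed_columns)       # ['name', 'donor']
print(dynamic_rows["s2"])  # {'donor': 'd2', 'treatment': 'drugA'}
```

The second dynamic row gains a `treatment` feature without any change to the table definition, whereas the fixed table needed an explicit `ALTER TABLE` before it could hold `donor`.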

sample

In biology, a sample is an instance or part of a biological system.

In classical statistics, a sample usually refers to a set of observations drawn from a population. In machine learning, a sample often refers to a single observation (one row) of random variables (features, labels, metadata).

Depending on the observational unit chosen for representing data, the statistical sample might correspond 1:1 to a biological sample. Often, this choice presents interesting cases, as variation across physical samples – targeted in the experimental design – can be directly explained by variation across statistical (digital) samples.

variable

When we say “variable”, we almost always mean “random variable”.

An independent variable is sometimes called a feature (the preferred term in these docs), “predictor variable”, “regressor”, “covariate”, “explanatory variable”, “risk factor”, “input variable”, among others [Wikipedia]. A dependent variable is sometimes called a “response variable”, “regressand”, “predicted variable”, “measured variable”, among others.

schema

A schema is a blueprint for your dataset’s structure and a tool for curating and validating the organization of your dataset, helping maintain data integrity as it evolves through various processing steps.
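
One way to picture a schema as a blueprint is a mapping from column names to expected types, checked against each row. The sketch below is plain Python with a hypothetical `check_schema` helper; it is not the LaminDB Schema interface.

```python
# Hypothetical blueprint: column name -> expected Python type.
schema = {"sample_id": str, "age": int, "cell_type": str}


def check_schema(rows, schema):
    """Return a list of human-readable violations; an empty list means valid."""
    errors = []
    for index, row in enumerate(rows):
        for column in sorted(schema.keys() - row.keys()):
            errors.append(f"row {index}: missing column '{column}'")
        for column, expected in schema.items():
            if column in row and not isinstance(row[column], expected):
                errors.append(f"row {index}: '{column}' should be {expected.__name__}")
    return errors


rows = [
    {"sample_id": "s1", "age": 34, "cell_type": "T cell"},
    {"sample_id": "s2", "age": "51"},  # wrong type for age, cell_type missing
]
errors = check_schema(rows, schema)
for error in errors:
    print(error)
# row 1: missing column 'cell_type'
# row 1: 'age' should be int
```

Running the same check after each processing step is one simple way to notice when a dataset drifts away from its intended structure.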

registry

A table in a SQL database (SQLite/Postgres) that holds records, enables queries, enforces integrity, and supports fine-grained access management. In Python, Registry is the metaclass of SQLRecord.

transform

A piece of code (script, notebook, pipeline, function) that can be applied to input data to produce output data.