lamindb.Feature¶
- class lamindb.Feature(name: str, dtype: Dtype | Registry | list[Registry] | FieldAttr, type: Feature | None = None, is_type: bool = False, unit: str | None = None, description: str | None = None, synonyms: str | None = None, nullable: bool = True, default_value: str | None = None, coerce_dtype: bool = False, cat_filters: dict[str, str] | None = None)¶
Bases:
SQLRecord, CanCurate, TracksRun, TracksUpdates
Variables, such as dataframe columns or run parameters.
A feature often represents a dimension of a dataset, such as a column in a DataFrame. The Feature registry organizes metadata of features.
The Feature registry helps you organize and query datasets based on their features and corresponding label annotations. For instance, when working with a “T cell” label, it could be measured through different features such as "cell_type_by_expert" where an expert manually classified the cell, or "cell_type_by_model" where a computational model made the classification.
The two most important metadata of a feature are its name and the dtype. In addition to typical data types, LaminDB has a "num" dtype to concisely denote the union of all numerical types.
- Parameters:
name (str) – Name of the feature, typically a column name.
dtype (Dtype | Registry | list[Registry] | FieldAttr) – See Dtype. For categorical types, you can define to which registry values are restricted, e.g., ULabel or [ULabel, bionty.CellType].
unit (str | None, default: None) – Unit of measure, ideally SI ("m", "s", "kg", etc.) or "normalized" etc.
description (str | None, default: None) – A description.
synonyms (str | None, default: None) – Bar-separated synonyms.
nullable (bool, default: True) – Whether the feature can have null-like values (None, pd.NA, NaN, etc.), see nullable.
default_value (Any | None, default: None) – Default value for the feature.
coerce_dtype (bool, default: False) – When True, attempts to coerce values to the specified dtype during validation, see coerce_dtype.
cat_filters (dict[str, str] | None, default: None) – Subset a registry by additional filters to define valid categories.
Note
For more control, you can use bionty registries to manage simple biological entities like genes, proteins & cell markers. Or you define custom registries to manage high-level derived features like gene sets.
See also
from_dataframe(): Create feature records from DataFrame.
features: Feature manager of an artifact or collection.
ULabel: Universal labels.
Schema: Feature sets.
Example
A simple "str" feature:
    import lamindb as ln
    import bionty as bt

    ln.Feature(name="sample_note", dtype=str).save()
A dtype "cat[ULabel]" can be more easily passed as below:
    ln.Feature(name="project", dtype=ln.ULabel).save()
A dtype "cat[ULabel|bionty.CellType]" can be more easily passed as below:
    ln.Feature(
        name="cell_type",
        dtype=[ln.ULabel, bt.CellType],
    ).save()
A multivalue feature with a list of cell types:
    ln.Feature(
        name="cell_types",
        dtype=list[bt.CellType],  # or list[str] for a list of strings
    ).save()
A path feature:
    ln.Feature(
        name="image_path",
        dtype="path",  # will be validated as `str`
    ).save()
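The examples above mention the string forms "cat[ULabel]" and "cat[ULabel|bionty.CellType]". As a rough illustration of that notation only, the sketch below assembles such a string from registry names. The helper `cat_dtype_string` is hypothetical and is not part of LaminDB; how LaminDB derives these strings internally is not shown in this page.

```python
def cat_dtype_string(registry_names):
    """Build a "cat[...]" dtype string from one or several registry names.

    Hypothetical helper for illustration; it only mirrors the notation
    shown in the examples, not LaminDB's own implementation.
    """
    if isinstance(registry_names, str):
        registry_names = [registry_names]
    # Several registries are joined with a bar, as in "cat[ULabel|bionty.CellType]"
    return "cat[" + "|".join(registry_names) + "]"


single = cat_dtype_string("ULabel")
union = cat_dtype_string(["ULabel", "bionty.CellType"])
print(single)
print(union)
```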
Hint
Features and labels denote two ways of using entities to organize data:
A feature qualifies what is measured, i.e., a numerical or categorical random variable
A label is a measured value, i.e., a category
Consider annotating a dataset by the fact that it measured expression of 30k genes: genes relate to the dataset as feature identifiers through a feature set with 30k members. Now consider annotating the artifact by the fact that it measured the knock-out of 3 genes: here, the 3 genes act as labels of the dataset.
Re-shaping data can introduce ambiguity among features & labels. If this happens, ask yourself what the joint measurement was: a feature qualifies variables in a joint measurement. The canonical data matrix lists jointly measured variables in the columns.
Attributes¶
- property coerce_dtype: bool¶
Whether dtypes should be coerced during validation.
For example, an object-dtyped pandas column can be coerced to categorical and would pass validation if this is true.
- property default_value: Any¶
A default value that overwrites missing values (default None).
This takes effect when you call Curator.standardize().
If default_value = None, missing values like pd.NA or np.nan are kept.
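The rule described here (a default replaces missing values, while a default of None leaves them untouched) can be sketched in plain Python. This is a simplified illustration, not the logic of Curator.standardize(): the helpers `is_missing` and `apply_default` are hypothetical, and "missing" is narrowed to None and float NaN.

```python
import math


def is_missing(value):
    """Simplified notion of a missing value: None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def apply_default(values, default_value=None):
    """Replace missing entries by default_value; keep them if default_value is None."""
    if default_value is None:
        return list(values)
    return [default_value if is_missing(v) else v for v in values]


filled = apply_default(["asthma", None, float("nan")], default_value="unknown")
kept = apply_default(["asthma", None], default_value=None)
print(filled)
print(kept)
```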
- property nullable: bool¶
Indicates whether the feature can have null-like values (default True).
Example:
    import lamindb as ln
    import pandas as pd

    disease = ln.Feature(name="disease", dtype=ln.ULabel, nullable=False).save()
    schema = ln.Schema(features=[disease]).save()
    dataset = {"disease": pd.Categorical([pd.NA, "asthma"])}
    df = pd.DataFrame(dataset)
    curator = ln.curators.DataFrameCurator(df, schema)
    try:
        curator.validate()
    except ln.errors.ValidationError as e:
        assert str(e).startswith("non-nullable series 'disease' contains null values")
Simple fields¶
- uid: str¶
Universal id, valid across DB instances.
- name: str¶
Name of feature.
- is_type: bool¶
Distinguish types from instances of the type.
- unit: str | None¶
Unit of measure, ideally SI (m, s, kg, etc.) or ‘normalized’ etc. (optional).
- description: str | None¶
A description.
- array_rank: int¶
Rank of feature.
Number of indices of the array: 0 for scalar, 1 for vector, 2 for matrix.
Is called .ndim in numpy and pytorch but shouldn’t be confused with the dimension of the feature space.
- array_size: int¶
Number of elements of the feature.
Total number of elements (product of shape components) of the array.
A number or string (a scalar): 1
A 50-dimensional embedding: 50
A 25 x 25 image: 625
- array_shape: list[int] | None¶
Shape of the feature.
A number or string (a scalar): [1]
A 50-dimensional embedding: [50]
A 25 x 25 image: [25, 25]
Is stored as a list rather than a tuple because it’s serialized as JSON.
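To make the relation between the three array fields concrete, here is a small sketch that derives rank and size from a shape, using the examples listed above. The helpers `array_rank` and `array_size` are hypothetical; treating shape [1] as a rank-0 scalar is an assumption taken from the examples in this page, not a statement about LaminDB's implementation.

```python
import json
import math


def array_size(shape):
    """Total number of elements: the product of the shape components."""
    return math.prod(shape)


def array_rank(shape):
    """Number of array indices; assumes a scalar is recorded with shape [1]."""
    return 0 if list(shape) == [1] else len(shape)


examples = {
    "scalar": [1],
    "embedding": [50],
    "image": [25, 25],
}
# Map each example to (rank, size)
summary = {name: (array_rank(shape), array_size(shape)) for name, shape in examples.items()}
print(summary)

# A tuple serializes to a JSON array, and JSON arrays load back as lists
roundtrip = json.loads(json.dumps((25, 25)))
print(roundtrip)
```

Under this assumption a one-element vector cannot be told apart from a scalar by its shape alone, which is one reason to store the rank separately.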
- proxy_dtype: Dtype | None¶
Proxy data type.
If the feature is an image it’s often stored via a path to the image file. Hence, while the dtype might be image with a certain shape, the proxy dtype would be str.
- synonyms: str | None¶
Bar-separated (|) synonyms (optional).
- is_locked: bool¶
Whether the record is locked for edits.
- created_at: datetime¶
Time of creation of record.
- updated_at: datetime¶
Time of last update to record.
Relational fields¶
- type: Feature | None¶
Type of feature (e.g., ‘Readout’, ‘Metric’, ‘Metadata’, ‘ExpertAnnotation’, ‘ModelPrediction’).
Allows grouping features by type, e.g., all readouts, all metrics, etc.
- values: FeatureValue¶
Values for this feature.
- blocks: FeatureBlock¶
Blocks that annotate this feature.
Class methods¶
- classmethod from_dataframe(df, field=None, *, mute=False)¶
Create Feature records for dataframe columns.
- Parameters:
df (DataFrame) – Source DataFrame to extract column information from
field (DeferredAttribute | None, default: None) – FieldAttr for Feature model validation, defaults to Feature.name
mute (bool, default: False) – Whether to mute warnings about similar names found during Feature creation
- Return type:
- classmethod from_dict(dictionary, field=None, *, str_as_cat=None, type=None, mute=False)¶
Create Feature records for dictionary keys.
- Parameters:
dictionary (dict[str, Any]) – Source dictionary to extract key information from
field (DeferredAttribute | None, default: None) – FieldAttr for Feature model validation, defaults to Feature.name
str_as_cat (bool | None, default: None) – Deprecated. Will be removed in LaminDB 2.0.0. Create features explicitly with dtype=’cat’ for categorical values.
type (Feature | None, default: None) – Feature type of all created features
mute (bool, default: False) – Whether to mute dtype inference and feature creation warnings
- Return type:
- classmethod filter(*queries, **expressions)¶
Query records.
- Parameters:
queries – One or multiple Q objects.
expressions – Fields and values passed as Django query expressions.
- Return type:
See also
Guide: Query & search registries
Django documentation: Queries
Examples
    >>> ln.Project(name="my label").save()
    >>> ln.Project.filter(name__startswith="my").to_dataframe()
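The keyword `name__startswith="my"` follows Django's `field__lookup=value` convention. As a toy illustration of how such a keyword splits into a field and a lookup, the sketch below filters plain dictionaries. The function `matches` is hypothetical, supports only a handful of lookups, does not traverse relations such as `created_by__name`, and is neither Django's nor LaminDB's implementation.

```python
def matches(record, **expressions):
    """Check a dict-like record against Django-style ``field__lookup=value`` keywords."""
    lookups = {
        "exact": lambda actual, expected: actual == expected,
        "startswith": lambda actual, expected: isinstance(actual, str) and actual.startswith(expected),
        "contains": lambda actual, expected: isinstance(actual, str) and expected in actual,
        "in": lambda actual, expected: actual in expected,
    }
    for key, expected in expressions.items():
        # "name__startswith" -> field "name", lookup "startswith"; a bare "name" means "exact"
        field, _, lookup = key.partition("__")
        if not lookups[lookup or "exact"](record.get(field), expected):
            return False
    return True


records = [{"name": "my label"}, {"name": "other label"}]
hits = [r for r in records if matches(r, name__startswith="my")]
print(hits)
```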
- classmethod get(idlike=None, **expressions)¶
Get a single record.
- Parameters:
idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.
expressions – Fields and values passed as Django query expressions.
- Raises:
lamindb.errors.DoesNotExist – In case no matching record is found.
- Return type:
See also
Guide: Query & search registries
Django documentation: Queries
Examples
    record = ln.Record.get("FvtpPJLJ")
    record = ln.Record.get(name="my-label")
- classmethod to_dataframe(include=None, features=False, limit=100)¶
Evaluate and convert to pd.DataFrame.
By default, maps simple fields and foreign keys onto DataFrame columns.
Guide: Query & search registries
- Parameters:
include (str | list[str] | None, default: None) – Related data to include as columns. Takes strings of form "records__name", "cell_types__name", etc. or a list of such strings. For Artifact, Record, and Run, can also pass "features" to include features with data types pointing to entities in the core schema. If "privates", includes private fields (fields starting with _).
features (bool | list[str], default: False) – Configure the features to include. Can be a feature name or a list of such names. If "queryset", infers the features used within the current queryset. Only available for Artifact, Record, and Run.
limit (int, default: 100) – Maximum number of rows to display. If None, includes all results.
order_by – Field name to order the records by. Prefix with ‘-’ for descending order. Defaults to ‘-id’ to get the most recent records. This argument is ignored if the queryset is already ordered or if the specified field does not exist.
- Return type:
DataFrame
Examples
Include the name of the creator:
    ln.Record.to_dataframe(include="created_by__name")
Include features:
ln.Artifact.to_dataframe(include="features")
Include selected features:
ln.Artifact.to_dataframe(features=["cell_type_by_expert", "cell_type_by_model"])
- classmethod search(string, *, field=None, limit=20, case_sensitive=False)¶
Search.
- Parameters:
string (str) – The input string to match against the field ontology values.
field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.
limit (int | None, default: 20) – Maximum number of top results to return.
case_sensitive (bool, default: False) – Whether the match is case sensitive.
- Return type:
- Returns:
A sorted DataFrame of search results with a score in column score. If return_queryset is True, a QuerySet.
Examples
    records = ln.Record.from_values(["Label1", "Label2", "Label3"], field="name").save()
    ln.Record.search("Label2")
- classmethod lookup(field=None, return_field=None)¶
Return an auto-complete object for a field.
- Parameters:
field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.
return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.
keep – When multiple records are found for a lookup, how to return the records.
  - "first": return the first record.
  - "last": return the last record.
  - False: return all records.
- Return type:
NamedTuple
- Returns:
A NamedTuple of lookup information of the field values with a dictionary converter.
See also
Examples
Look up via auto-complete on ".":
    import bionty as bt

    bt.Gene.from_source(symbol="ADGB-DT").save()
    lookup = bt.Gene.lookup()
    lookup.adgb_dt
Look up via auto-complete in dictionary:
    lookup_dict = lookup.dict()
    lookup_dict['ADGB-DT']
Look up via a specific field:
    lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
    lookup_by_ensembl_id.ensg00000002745
Return a specific field value instead of the full record:
    lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")
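To illustrate why "ADGB-DT" becomes reachable as `lookup.adgb_dt`, the following sketch builds a named tuple whose attribute names are normalized versions of the field values. It is a toy under stated assumptions: `to_identifier` and `make_lookup` are hypothetical, the normalization rule (lowercase, non-word characters to underscores) is a guess that matches the example, values are assumed not to start with a digit, and collisions keep the first value.

```python
import re
from collections import namedtuple


def to_identifier(value):
    """Turn a field value into an attribute-friendly name, e.g. "ADGB-DT" -> "adgb_dt"."""
    return re.sub(r"\W+", "_", value).strip("_").lower()


def make_lookup(values):
    """Build a named tuple mapping normalized names to the original values."""
    fields = {}
    for value in values:
        # On collisions, keep the first value that produced this name
        fields.setdefault(to_identifier(value), value)
    Lookup = namedtuple("Lookup", list(fields))
    return Lookup(**fields)


lookup = make_lookup(["ADGB-DT", "A1CF", "T cell"])
print(lookup.adgb_dt)
print(lookup._asdict())
```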
- classmethod using(instance)¶
Use a non-default LaminDB instance.
- Parameters:
instance (str | None) – An instance identifier of form “account_handle/instance_name”.
- Return type:
Examples
ln.Record.using("account_handle/instance_name").search("label7", field="name")
- classmethod inspect(values, field=None, *, mute=False, organism=None, source=None, from_source=True, strict_source=False)¶
Inspect if values are mappable to a field.
Being mappable means that an exact match exists.
- Parameters:
values (list[str] | Series | array) – Values that will be checked against the field.
field (str | DeferredAttribute | None, default: None) – The field of values. Examples are 'ontology_id' to map against the source ID or 'name' to map against the ontologies field names.
mute (bool, default: False) – Whether to mute logging.
organism (str | SQLRecord | None, default: None) – An Organism name or record.
source (SQLRecord | None, default: None) – A bionty.Source record that specifies the version to inspect against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry.
  - If False, validation will include all records in the registry, ignoring the specified source.
  - If True, validation will only include records in the registry that are linked to the specified source.
  Note: this parameter won’t affect validation against public sources.
- Return type:
bionty.base.dev.InspectResult
See also
Example:
    import bionty as bt

    # save some gene records
    bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol", organism="human").save()

    # inspect gene symbols
    gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
    result = bt.Gene.inspect(gene_symbols, field=bt.Gene.symbol, organism="human")
    assert result.validated == ["A1CF", "A1BG"]
    assert result.non_validated == ["FANCD1", "FANCD20"]
- classmethod validate(values, field=None, *, mute=False, organism=None, source=None, strict_source=False)¶
Validate values against existing values of a string field.
Note that this is strict validation: it only asserts exact matches.
- Parameters:
values (list[str] | Series | array) – Values that will be validated against the field.
field (str | DeferredAttribute | None, default: None) – The field of values. Examples are 'ontology_id' to map against the source ID or 'name' to map against the ontologies field names.
mute (bool, default: False) – Whether to mute logging.
organism (str | SQLRecord | None, default: None) – An Organism name or record.
source (SQLRecord | None, default: None) – A bionty.Source record that specifies the version to validate against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry.
  - If False, validation will include all records in the registry, ignoring the specified source.
  - If True, validation will only include records in the registry that are linked to the specified source.
  Note: this parameter won’t affect validation against public sources.
- Return type:
ndarray
- Returns:
A vector of booleans indicating if an element is validated.
See also
Example:
    import bionty as bt

    bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol", organism="human").save()

    gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
    bt.Gene.validate(gene_symbols, field=bt.Gene.symbol, organism="human")
    #> array([ True, True, False, False])
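The exact-match behaviour shared by inspect() and validate() can be mimicked with plain Python collections. The sketch below reuses the gene symbols from the examples; `validate_exact` and `inspect_exact` are hypothetical helpers, and the registry is reduced to a set of known strings, so organism, source, and synonym handling are left out.

```python
def validate_exact(values, known_values):
    """Return one boolean per value: True only on an exact match."""
    known = set(known_values)
    return [value in known for value in values]


def inspect_exact(values, known_values):
    """Split values into validated and non-validated lists, preserving order."""
    flags = validate_exact(values, known_values)
    return {
        "validated": [v for v, ok in zip(values, flags) if ok],
        "non_validated": [v for v, ok in zip(values, flags) if not ok],
    }


saved_symbols = ["A1CF", "A1BG", "BRCA2"]
gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
flags = validate_exact(gene_symbols, saved_symbols)
result = inspect_exact(gene_symbols, saved_symbols)
print(flags)
print(result)
```

"FANCD1" fails here even though it is a synonym of a saved gene, because only exact matches count.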
- classmethod from_values(values, field=None, create=False, organism=None, source=None, mute=False)¶
Bulk create validated records by parsing values for an identifier (such as a name or an id).
- Parameters:
values (list[str] | Series | array) – A list of values for an identifier, e.g. ["name1", "name2"].
field (str | DeferredAttribute | None, default: None) – A SQLRecord field to look up, e.g., bt.CellMarker.name.
create (bool, default: False) – Whether to create records if they don’t exist.
organism (SQLRecord | str | None, default: None) – A bionty.Organism name or record.
source (SQLRecord | None, default: None) – A bionty.Source record to validate against when creating records.
mute (bool, default: False) – Whether to mute logging.
- Return type:
- Returns:
A list of validated records. For bionty registries, also returns knowledge-coupled records.
Notes
For more info, see tutorial: Manage biological ontologies.
Example:
    import lamindb as ln
    import bionty as bt

    # Bulk create from non-validated values logs warnings & returns an empty list
    ulabels = ln.ULabel.from_values(["benchmark", "prediction", "test"])
    assert len(ulabels) == 0

    # With create=True, records are created for values that don't exist yet
    ulabels = ln.ULabel.from_values(["benchmark", "prediction", "test"], create=True).save()
    assert len(ulabels) == 3

    # Bulk create records from public reference
    bt.CellType.from_values(["T cell", "B cell"]).save()
- classmethod standardize(values, field=None, *, return_field=None, return_mapper=False, case_sensitive=False, mute=False, source_aware=True, keep='first', synonyms_field='synonyms', organism=None, source=None, strict_source=False)¶
Maps input synonyms to standardized names.
- Parameters:
values (Iterable) – Identifiers that will be standardized.
field (str | DeferredAttribute | None, default: None) – The field representing the standardized names.
return_field (str | DeferredAttribute | None, default: None) – The field to return. Defaults to field.
return_mapper (bool, default: False) – If True, returns {input_value: standardized_name}.
case_sensitive (bool, default: False) – Whether the mapping is case sensitive.
mute (bool, default: False) – Whether to mute logging.
source_aware (bool, default: True) – Whether to standardize from public source. Defaults to True for BioRecord registries.
keep (Literal['first', 'last', False], default: 'first') – When a synonym maps to multiple names, determines which duplicates to mark as pd.DataFrame.duplicated:
  - "first": returns the first mapped standardized name
  - "last": returns the last mapped standardized name
  - False: returns all mapped standardized names.
  When keep is False, the returned list of standardized names will contain nested lists in case of duplicates.
  When a field is converted into return_field, keep marks which matches to keep when multiple return_field values map to the same field value.
synonyms_field (str, default: 'synonyms') – A field containing the concatenated synonyms.
organism (str | SQLRecord | None, default: None) – An Organism name or record.
source (SQLRecord | None, default: None) – A bionty.Source record that specifies the version to validate against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry.
  - If False, validation will include all records in the registry, ignoring the specified source.
  - If True, validation will only include records in the registry that are linked to the specified source.
  Note: this parameter won’t affect validation against public sources.
- Return type:
list[str] | dict[str, str]
- Returns:
If return_mapper is False – a list of standardized names. Otherwise, a dictionary of mapped values with mappable synonyms as keys and standardized names as values.
See also
add_synonym(): Add synonyms.
remove_synonym(): Remove synonyms.
Example:
    import bionty as bt

    # save some gene records
    bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol", organism="human").save()

    # standardize gene synonyms
    gene_synonyms = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
    bt.Gene.standardize(gene_synonyms)
    #> ['A1CF', 'A1BG', 'BRCA2', 'FANCD20']
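The mapping from synonyms to standardized names can be sketched without a database. In the toy below, the registry is a plain dict from standardized name to a bar-separated synonyms string; only the FANCD1 to BRCA2 pair is taken from the example above. The function `standardize_names` is hypothetical, is case sensitive, and resolves ambiguous synonyms by keeping the first match, which only loosely mirrors `keep="first"`.

```python
def standardize_names(values, registry, return_mapper=False):
    """Map synonyms to standardized names; unknown values pass through unchanged."""
    synonym_to_name = {}
    for name, synonyms in registry.items():
        for synonym in (synonyms or "").split("|"):
            if synonym:
                # Keep the first standardized name seen for a synonym
                synonym_to_name.setdefault(synonym, name)
    # Only values that are not already standardized names get mapped
    mapper = {
        value: synonym_to_name[value]
        for value in values
        if value not in registry and value in synonym_to_name
    }
    if return_mapper:
        return mapper
    return [mapper.get(value, value) for value in values]


registry = {"A1CF": None, "A1BG": None, "BRCA2": "FANCD1"}
gene_synonyms = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
standardized = standardize_names(gene_synonyms, registry)
mapper = standardize_names(gene_synonyms, registry, return_mapper=True)
print(standardized)
print(mapper)
```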
Methods¶
- with_config(optional=None)¶
Pass additional configurations to the schema.
- Return type:
tuple[Feature,dict]
- restore()¶
Restore from trash onto the main branch.
- Return type:
None
- delete(permanent=None, **kwargs)¶
Delete record.
- Parameters:
permanent (bool | None, default: None) – Whether to permanently delete the record (skips trash). If None, performs soft delete if the record is not already in the trash.
- Return type:
None
Examples
For any SQLRecord object record, call:
    >>> record.delete()
- add_synonym(synonym, force=False, save=None)¶
Add synonyms to a record.
- Parameters:
synonym (str | list[str] | Series | array) – The synonyms to add to the record.
force (bool, default: False) – Whether to add synonyms even if they are already synonyms of other records.
save (bool | None, default: None) – Whether to save the record to the database.
See also
remove_synonym(): Remove synonyms.
Example:
    import bionty as bt

    # save "T cell" record
    record = bt.CellType.from_source(name="T cell").save()
    record.synonyms
    #> "T-cell|T lymphocyte|T-lymphocyte"

    # add a synonym
    record.add_synonym("T cells")
    record.synonyms
    #> "T cells|T-cell|T-lymphocyte|T lymphocyte"
- remove_synonym(synonym)¶
Remove synonyms from a record.
- Parameters:
synonym (str | list[str] | Series | array) – The synonym values to remove.
See also
add_synonym(): Add synonyms.
Example:
    import bionty as bt

    # save "T cell" record
    record = bt.CellType.from_source(name="T cell").save()
    record.synonyms
    #> "T-cell|T lymphocyte|T-lymphocyte"

    # remove a synonym
    record.remove_synonym("T-cell")
    record.synonyms
    #> "T lymphocyte|T-lymphocyte"
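Since synonyms are stored as a single bar-separated string, adding and removing entries amounts to splitting, editing, and re-joining that string. The sketch below shows this on plain strings. The helpers are hypothetical and operate on strings rather than records, and they preserve insertion order, whereas the order shown in the examples above suggests LaminDB may reorder synonyms.

```python
def split_synonyms(synonyms):
    """Split a bar-separated synonyms string into a list (empty for None or "")."""
    return [s for s in (synonyms or "").split("|") if s]


def add_synonyms(synonyms, new):
    """Append one or several synonyms, skipping those already present."""
    items = split_synonyms(synonyms)
    for synonym in [new] if isinstance(new, str) else list(new):
        if synonym not in items:
            items.append(synonym)
    return "|".join(items) or None


def remove_synonyms(synonyms, old):
    """Remove one or several synonyms; return None when nothing is left."""
    to_remove = {old} if isinstance(old, str) else set(old)
    items = [s for s in split_synonyms(synonyms) if s not in to_remove]
    return "|".join(items) or None


current = "T-cell|T lymphocyte|T-lymphocyte"
added = add_synonyms(current, "T cells")
removed = remove_synonyms(current, "T-cell")
print(added)
print(removed)
```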
- set_abbr(value)¶
Set value for abbr field and add to synonyms.
- Parameters:
value (str) – A value for an abbreviation.
See also
Example:
    import bionty as bt

    # save an experimental factor record
    scrna = bt.ExperimentalFactor.from_source(name="single-cell RNA sequencing").save()
    assert scrna.abbr is None
    assert scrna.synonyms == "single-cell RNA-seq|single-cell transcriptome sequencing|scRNA-seq|single cell RNA sequencing"

    # set abbreviation
    scrna.set_abbr("scRNA")
    assert scrna.abbr == "scRNA"
    # synonyms are updated
    assert scrna.synonyms == "scRNA|single-cell RNA-seq|single cell RNA sequencing|single-cell transcriptome sequencing|scRNA-seq"