lamindb.FeatureSet

class lamindb.FeatureSet(features: Iterable[Record], dtype: str | None = None, name: str | None = None)

Bases: Record, TracksRun

Feature sets.

Stores references to sets of Feature and other registries that may be used to identify features (e.g., Gene or Protein).

Why does LaminDB model feature sets, not just features?
  1. Performance: Imagine you measure the same panel of 20k transcripts in 1M samples. By modeling the panel as a feature set, you can link all your artifacts against one feature set and only need to store 1M instead of 1M x 20k = 20B links.

  2. Interpretation: Model protein panels, gene panels, etc.

  3. Data integration: Feature sets provide the currency that determines whether two collections can be easily concatenated.

These reasons do not hold for label sets. Hence, LaminDB does not model label sets.

Parameters:
  • featuresIterable[Record] An iterable of Feature records to hash, e.g., [Feature(...), Feature(...)]. Is turned into a set upon instantiation. If you’d like to pass values, use from_values() or from_df().

  • dtypestr | None = None The simple type. Defaults to None for sets of Feature records. Otherwise defaults to "num" (e.g., for sets of Gene).

  • namestr | None = None A name.

Note

A feature set can be identified by the hash its feature uids. It’s stored in the .hash field.

A slot provides a string key to access feature sets. It’s typically the accessor within the registered data object, here pd.DataFrame.columns.

See also

from_values()

Create from values.

from_df()

Create from dataframe columns.

Examples

Create a featureset from df with types:

>>> df = pd.DataFrame({"feat1": [1, 2], "feat2": [3.1, 4.2], "feat3": ["cond1", "cond2"]})
>>> feature_set = ln.FeatureSet.from_df(df)

Create a featureset from features:

>>> features = [ln.Feature(name=feat, dtype="float").save() for feat in ["feat1", "feat2"]]
>>> feature_set = ln.FeatureSet(features)

Create a featureset from feature values:

>>> import bionty as bt
>>> feature_set = ln.FeatureSet.from_values(adata.var["ensemble_id"], Gene.ensembl_gene_id, organism="mouse")
>>> feature_set.save()

Link a feature set to an artifact:

>>> artifact.features.add_feature_set(feature_set, slot="var")

Link features to an artifact (will create a featureset under the hood):

>>> artifact.features.add_values(features)

Attributes

property members: QuerySet

A queryset for the individual records of the set.

Simple fields

uid: str

A universal id (hash of the set of feature values).

name: str | None

A name (optional).

n

Number of features in the set.

dtype: str | None

Data type, e.g., “num”, “float”, “int”. Is None for Feature.

For Feature, types are expected to be heterogeneous and defined on a per-feature level.

registry: str

The registry that stores the feature identifiers, e.g., 'core.Feature' or 'bionty.Gene'.

Depending on the registry, .members stores, e.g. Feature or Gene records.

hash: str | None

The hash of the set.

created_at: datetime

Time of creation of record.

Relational fields

created_by: User

Creator of record.

run: Run | None

Last run that created or updated the record.

features: Feature

The features related to a FeatureSet record.

artifacts: Artifact

The artifacts related to a FeatureSet record.

Class methods

classmethod df(include=None, features=False, limit=100)

Convert to pd.DataFrame.

By default, shows all direct fields, except updated_at.

Use arguments include or feature to include other data.

Parameters:
  • include (str | list[str] | None, default: None) – Related fields to include as columns. Takes strings of form "ulabels__name", "cell_types__name", etc. or a list of such strings.

  • features (bool | list[str], default: False) – If True, map all features of the Feature registry onto the resulting DataFrame. Only available for Artifact.

  • limit (int, default: 100) – Maximum number of rows to display from a Pandas DataFrame. Defaults to 100 to reduce database load.

Return type:

DataFrame

Examples

Include the name of the creator in the DataFrame:

>>> ln.ULabel.df(include="created_by__name"])

Include display of features for Artifact:

>>> df = ln.Artifact.df(features=True)
>>> ln.view(df)  # visualize with type annotations

Only include select features:

>>> df = ln.Artifact.df(features=["cell_type_by_expert", "cell_type_by_model"])
classmethod filter(*queries, **expressions)

Query records.

Parameters:
  • queries – One or multiple Q objects.

  • expressions – Fields and values passed as Django query expressions.

Return type:

QuerySet

Returns:

A QuerySet.

See also

Examples

>>> ln.ULabel(name="my label").save()
>>> ln.ULabel.filter(name__startswith="my").df()
classmethod from_df(df, field=FieldAttr(Feature.name), name=None, mute=False, organism=None, source=None)

Create feature set for validated features.

Return type:

FeatureSet | None

classmethod from_values(values, field=FieldAttr(Feature.name), type=None, name=None, mute=False, organism=None, source=None, raise_validation_error=True)

Create feature set for validated features.

Parameters:
  • values (List[str] | Series | array) – A list of values, like feature names or ids.

  • field (DeferredAttribute, default: FieldAttr(Feature.name)) – The field of a reference registry to map values.

  • type (str | None, default: None) – The simple type. Defaults to None if reference registry is Feature, defaults to "float" otherwise.

  • name (str | None, default: None) – A name.

  • organism (Record | str | None, default: None) – An organism to resolve gene mapping.

  • source (Record | None, default: None) – A public ontology to resolve feature identifier mapping.

  • raise_validation_error (bool, default: True) – Whether to raise a validation error if some values are not valid.

Raises:

ValidationError – If some values are not valid.

Return type:

FeatureSet

Examples

>>> features = [ln.Feature(name=feat, dtype="str").save() for feat in ["feat11", "feat21"]]
>>> feature_set = ln.FeatureSet.from_values(features)
>>> genes = ["ENSG00000139618", "ENSG00000198786"]
>>> feature_set = ln.FeatureSet.from_values(features, bt.Gene.ensembl_gene_id, "float")
classmethod get(idlike=None, **expressions)

Get a single record.

Parameters:
  • idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.

  • expressions – Fields and values passed as Django query expressions.

Return type:

Record

Returns:

A record.

Raises:

lamindb.core.exceptions.DoesNotExist – In case no matching record is found.

See also

Examples

>>> ulabel = ln.ULabel.get("FvtpPJLJ")
>>> ulabel = ln.ULabel.get(name="my-label")
classmethod lookup(field=None, return_field=None)

Return an auto-complete object for a field.

Parameters:
  • field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.

  • return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.

Return type:

NamedTuple

Returns:

A NamedTuple of lookup information of the field values with a dictionary converter.

See also

search()

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> bt.Gene.from_source(symbol="ADGB-DT").save()
>>> lookup = bt.Gene.lookup()
>>> lookup.adgb_dt
>>> lookup_dict = lookup.dict()
>>> lookup_dict['ADGB-DT']
>>> lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
>>> genes.ensg00000002745
>>> lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")
classmethod search(string, *, field=None, limit=20, case_sensitive=False)

Search.

Parameters:
  • string (str) – The input string to match against the field ontology values.

  • field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.

  • limit (int | None, default: 20) – Maximum amount of top results to return.

  • case_sensitive (bool, default: False) – Whether the match is case sensitive.

Return type:

QuerySet

Returns:

A sorted DataFrame of search results with a score in column score. If return_queryset is True. QuerySet.

See also

filter() lookup()

Examples

>>> ulabels = ln.ULabel.from_values(["ULabel1", "ULabel2", "ULabel3"], field="name")
>>> ln.save(ulabels)
>>> ln.ULabel.search("ULabel2")
classmethod using(instance)

Use a non-default LaminDB instance.

Parameters:

instance (str | None) – An instance identifier of form “account_handle/instance_name”.

Return type:

QuerySet

Examples

>>> ln.ULabel.using("account_handle/instance_name").search("ULabel7", field="name")
            uid    score
name
ULabel7  g7Hk9b2v  100.0
ULabel5  t4Jm6s0q   75.0
ULabel6  r2Xw8p1z   75.0

Methods

delete()

Delete.

Return type:

None

save(*args, **kwargs)

Save.

Return type:

FeatureSet