lamindb.Schema¶
- class lamindb.Schema(features: Iterable[Record] | None = None, components: dict[str, Schema] | None = None, name: str | None = None, description: str | None = None, dtype: str | None = None, itype: str | Registry | FieldAttr | None = None, type: Schema | None = None, is_type: bool = False, otype: str | None = None, minimal_set: bool = True, ordered_set: bool = False, maximal_set: bool = False, slot: str | None = None, coerce_dtype: bool = False)¶
Bases: Record, CanCurate, TracksRun
Schemas / feature sets.
Stores references to dataset schemas: these are the sets of columns in a dataset that correspond to Feature, Gene, Protein or other entities.
Why does LaminDB model feature sets, not just features?
Performance: Imagine you measure the same panel of 20k transcripts in 1M samples. By modeling the panel as a feature set, you can link all your artifacts against one feature set and only need to store 1M instead of 1M x 20k = 20B links.
Interpretation: Model protein panels, gene panels, etc.
Data integration: Feature sets provide the information that determines whether two datasets can be meaningfully concatenated.
These reasons do not hold for label sets. Hence, LaminDB does not model label sets.
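The arithmetic behind the performance argument can be checked directly. The sketch below only uses the numbers quoted above; the variable names are illustrative and not part of LaminDB.

```python
# Link counts for 1M samples measured on the same panel of 20k transcripts.
n_samples = 1_000_000
n_features = 20_000

# Linking every artifact to every feature individually.
links_per_feature = n_samples * n_features

# Linking every artifact to one shared feature set instead.
links_per_feature_set = n_samples

print(f"{links_per_feature:,} vs. {links_per_feature_set:,} links")
```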
- Parameters:
features (Iterable[Record] | None, default: None) – An iterable of Feature records to hash, e.g., [Feature(...), Feature(...)]. Is turned into a set upon instantiation. If you’d like to pass values, use from_values() or from_df().
components (dict[str, Schema] | None, default: None) – A dictionary mapping component names to their corresponding Schema objects for composite schemas.
name (str | None, default: None) – A name.
description (str | None, default: None) – A description.
dtype (str | None, default: None) – The simple type. Defaults to None for sets of Feature records. Otherwise defaults to "num" (e.g., for sets of Gene).
itype (str | None, default: None) – The schema identifier type (e.g. Feature, Gene, …).
type (Schema | None, default: None) – A type.
is_type (bool, default: False) – Distinguish types from instances of the type.
otype (str | None, default: None) – An object type to define the structure of a composite schema.
minimal_set (bool, default: True) – Whether the schema contains a minimal set of linked features.
ordered_set (bool, default: False) – Whether features are required to be ordered.
maximal_set (bool, default: False) – If True, no additional features are allowed.
slot (str | None, default: None) – The slot name when this schema is used as a component in a composite schema.
coerce_dtype (bool, default: False) – When True, attempts to coerce values to the specified dtype during validation, see coerce_dtype.
Note
A feature set can be identified by the hash of its feature uids. It’s stored in the .hash field.
A slot provides a string key to access feature sets. For instance, for the schema of an AnnData object, it would be 'obs' for adata.obs.
See also
from_values()
Create from values.
from_df()
Create from dataframe columns.
Examples
Create a schema (feature set) from df with types:
>>> import lamindb as ln
>>> import pandas as pd
>>> df = pd.DataFrame({"feat1": [1, 2], "feat2": [3.1, 4.2], "feat3": ["cond1", "cond2"]})
>>> schema = ln.Schema.from_df(df)
Create a schema (feature set) from features:
>>> features = [ln.Feature(name=feat, dtype="float").save() for feat in ["feat1", "feat2"]]
>>> schema = ln.Schema(features)
Create a schema (feature set) from identifier values:
>>> import bionty as bt
>>> schema = ln.Schema.from_values(adata.var["ensemble_id"], bt.Gene.ensembl_gene_id, organism="mouse").save()
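The note above says that a feature set can be identified by the hash of its feature uids, and the hash field is documented as a hash of hashes for composite schemas. The following is a minimal sketch of that idea in plain Python. The choice of SHA-256, the canonical string format, and the function names hash_feature_set and hash_composite are assumptions made for illustration; LaminDB's actual hashing may differ.

```python
import hashlib


def hash_feature_set(uids):
    """Hash a collection of uids so that order and duplicates do not matter."""
    canonical = ",".join(sorted(set(uids)))
    return hashlib.sha256(canonical.encode()).hexdigest()


def hash_composite(component_hashes):
    """Hash of hashes for a composite schema, keyed by slot name (assumed format)."""
    canonical = ",".join(
        f"{slot}:{digest}" for slot, digest in sorted(component_hashes.items())
    )
    return hashlib.sha256(canonical.encode()).hexdigest()


# Hypothetical uids; the same set in a different order yields the same hash.
obs_hash = hash_feature_set(["uid_b", "uid_a"])
var_hash = hash_feature_set(["uid_c"])
composite_hash = hash_composite({"obs": obs_hash, "var": var_hash})
```

Sorting before hashing is what makes the result a property of the set rather than of the order in which features were passed.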
Attributes¶
- property coerce_dtype: bool¶
Whether dtypes should be coerced during validation.
For example, an object-dtyped pandas column can be coerced to categorical and would pass validation if this is true.
Simple fields¶
- uid: str¶
A universal id (hash of the set of feature values).
- name: str | None¶
A name.
- description: str | None¶
A description.
- n¶
Number of features in the set.
- dtype: str | None¶
Data type, e.g., “num”, “float”, “int”. Is None for Feature.
For Feature, types are expected to be heterogeneous and defined on a per-feature level.
- itype: str | None¶
A registry that stores feature identifiers used in this schema, e.g., 'Feature' or 'bionty.Gene'.
Depending on the registry, .members stores, e.g., Feature or bionty.Gene records.
Changed in version 1.0.0: Was called registry before.
- is_type: bool¶
Distinguish types from instances of the type.
- otype: str | None¶
Default Python object type, e.g., DataFrame, AnnData.
- hash: str | None¶
A hash of the set of feature identifiers.
For a composite schema, the hash of hashes.
- minimal_set: bool¶
Whether the schema contains a minimal set of linked features (default True).
If False, no features are linked to this schema.
If True, features are linked and considered as a minimally required set in validation.
- ordered_set: bool¶
Whether features are required to be ordered (default False).
- maximal_set: bool¶
If False, additional features are allowed (default False).
If True, the minimal set is a maximal set and no additional features are allowed.
- slot: str | None¶
The slot name when this schema is used as a component in a composite schema.
- created_at: datetime¶
Time of creation of record.
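The interplay of minimal_set, ordered_set and maximal_set described above can be sketched as a small standalone check. This is only an illustration of the documented semantics, not LaminDB's validator; the function check_columns and its return format are hypothetical.

```python
def check_columns(columns, schema_features, *, minimal_set=True,
                  ordered_set=False, maximal_set=False):
    """Return a list of problems; an empty list means the columns pass."""
    problems = []
    columns = list(columns)
    schema_features = list(schema_features)
    if minimal_set:
        # Every schema feature must be present in the dataset.
        missing = [f for f in schema_features if f not in columns]
        if missing:
            problems.append(f"missing required features: {missing}")
    if maximal_set:
        # No columns beyond the schema features are allowed.
        extra = [c for c in columns if c not in schema_features]
        if extra:
            problems.append(f"unexpected additional features: {extra}")
    if ordered_set:
        # Shared features must appear in the same relative order.
        present = [c for c in columns if c in schema_features]
        expected = [f for f in schema_features if f in columns]
        if present != expected:
            problems.append("features are not in the required order")
    return problems


# With the defaults, extra columns are tolerated but missing ones are not.
print(check_columns(["feat1", "feat2", "extra"], ["feat1", "feat2"]))
print(check_columns(["feat1"], ["feat1", "feat2"]))
```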
Relational fields¶
- space: Space¶
The space in which the record lives.
- type: Schema | None¶
Type of schema.
Allows grouping schemas by type, e.g., all measurements evaluating gene expression vs. protein expression vs. multimodal.
You can define types via ln.Schema(name="ProteinPanel", is_type=True).
Here are a few more examples for type names: 'ExpressionPanel', 'ProteinPanel', 'Multimodal', 'Metadata', 'Embedding'.
- composites: Schema¶
The composite schemas that contain this schema as a component.
For example, an AnnData composes multiple schemas: var[DataFrameT], obs[DataFrame], obsm[Array], uns[dict], etc.
Class methods¶
- classmethod df(include=None, features=False, limit=100)¶
Convert to pd.DataFrame.
By default, shows all direct fields, except updated_at.
Use arguments include or features to include other data.
- Parameters:
include (str | list[str] | None, default: None) – Related fields to include as columns. Takes strings of form "ulabels__name", "cell_types__name", etc. or a list of such strings.
features (bool | list[str], default: False) – If True, map all features of the Feature registry onto the resulting DataFrame. Only available for Artifact.
limit (int, default: 100) – Maximum number of rows to display from a Pandas DataFrame. Defaults to 100 to reduce database load.
- Return type:
DataFrame
Examples
Include the name of the creator in the DataFrame:
>>> ln.ULabel.df(include="created_by__name")
Include display of features for Artifact:
>>> df = ln.Artifact.df(features=True)
>>> ln.view(df)  # visualize with type annotations
Only include select features:
>>> df = ln.Artifact.df(features=["cell_type_by_expert", "cell_type_by_model"])
- classmethod filter(*queries, **expressions)¶
Query records.
- Parameters:
queries – One or multiple Q objects.
expressions – Fields and values passed as Django query expressions.
- Return type:
QuerySet
- Returns:
A QuerySet.
See also
Guide: Query & search registries
Django documentation: Queries
Examples
>>> ln.ULabel(name="my label").save()
>>> ln.ULabel.filter(name__startswith="my").df()
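To make the shape of Django-style query expressions such as name__startswith="my" concrete, here is a toy in-memory matcher. It is a sketch for intuition only: Django translates such expressions into SQL, supports many more lookups, and also uses the double underscore to traverse relations (e.g. created_by__name), none of which is handled here. The function matches and the sample records are hypothetical.

```python
def matches(record, **expressions):
    """Check a dict record against `field__lookup=value` keyword expressions."""
    for key, value in expressions.items():
        # Split "name__startswith" into the field name and the lookup type.
        field, _, lookup = key.partition("__")
        actual = record[field]
        if lookup in ("", "exact"):
            ok = actual == value
        elif lookup == "startswith":
            ok = actual.startswith(value)
        elif lookup == "contains":
            ok = value in actual
        else:
            raise ValueError(f"unsupported lookup: {lookup}")
        if not ok:
            return False
    return True


labels = [{"name": "my label"}, {"name": "other label"}]
hits = [r for r in labels if matches(r, name__startswith="my")]
print(hits)
```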
- classmethod from_df(df, field=FieldAttr(Feature.name), name=None, mute=False, organism=None, source=None)¶
Create feature set for validated features.
- Return type:
Schema | None
- classmethod from_values(values, field=FieldAttr(Feature.name), type=None, name=None, mute=False, organism=None, source=None, raise_validation_error=True)¶
Create feature set for validated features.
- Parameters:
values (list[str] | Series | array) – A list of values, like feature names or ids.
field (DeferredAttribute, default: FieldAttr(Feature.name)) – The field of a reference registry to map values.
type (str | None, default: None) – The simple type. Defaults to None if reference registry is Feature, defaults to "float" otherwise.
name (str | None, default: None) – A name.
organism (Record | str | None, default: None) – An organism to resolve gene mapping.
source (Record | None, default: None) – A public ontology to resolve feature identifier mapping.
raise_validation_error (bool, default: True) – Whether to raise a validation error if some values are not valid.
- Raises:
ValidationError – If some values are not valid.
- Return type:
Examples
>>> features = [ln.Feature(name=feat, dtype="str").save() for feat in ["feat11", "feat21"]]
>>> schema = ln.Schema.from_values(features)
>>> genes = ["ENSG00000139618", "ENSG00000198786"]
>>> schema = ln.Schema.from_values(genes, bt.Gene.ensembl_gene_id, "float")
- classmethod get(idlike=None, **expressions)¶
Get a single record.
- Parameters:
idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.
expressions – Fields and values passed as Django query expressions.
- Return type:
- Returns:
A record.
- Raises:
lamindb.errors.DoesNotExist – In case no matching record is found.
See also
Guide: Query & search registries
Django documentation: Queries
Examples
>>> ulabel = ln.ULabel.get("FvtpPJLJ")
>>> ulabel = ln.ULabel.get(name="my-label")
- classmethod inspect(values, field=None, *, mute=False, organism=None, source=None, strict_source=False)¶
Inspect if values are mappable to a field.
Being mappable means that an exact match exists.
- Parameters:
values (list[str] | Series | array) – Values that will be checked against the field.
field (str | DeferredAttribute | None, default: None) – The field of values. Examples are 'ontology_id' to map against the source ID or 'name' to map against the ontologies field names.
mute (bool, default: False) – Whether to mute logging.
organism (str | Record | None, default: None) – An Organism name or record.
source (Record | None, default: None) – A bionty.Source record that specifies the version to inspect against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry.
- If False, validation will include all records in the registry, ignoring the specified source.
- If True, validation will only include records in the registry that are linked to the specified source.
Note: this parameter won’t affect validation against bionty/public sources.
- Return type:
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> ln.save(bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol"))
>>> gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
>>> result = bt.Gene.inspect(gene_symbols, field=bt.Gene.symbol)
>>> result.validated
['A1CF', 'A1BG']
>>> result.non_validated
['FANCD1', 'FANCD20']
- classmethod lookup(field=None, return_field=None)¶
Return an auto-complete object for a field.
- Parameters:
field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.
return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.
- Return type:
NamedTuple
- Returns:
A NamedTuple of lookup information of the field values with a dictionary converter.
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> bt.Gene.from_source(symbol="ADGB-DT").save()
>>> lookup = bt.Gene.lookup()
>>> lookup.adgb_dt
>>> lookup_dict = lookup.dict()
>>> lookup_dict['ADGB-DT']
>>> lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
>>> lookup_by_ensembl_id.ensg00000002745
>>> lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")
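The auto-complete object can be pictured as a named tuple whose attribute names are normalized versions of the field values. The sketch below is a guess at that mechanism based only on the example above (ADGB-DT becomes adgb_dt); the normalization rule and the helper build_lookup are assumptions, and the toy returns plain strings rather than records.

```python
import re
from collections import namedtuple


def build_lookup(values):
    """Build a named tuple with one attribute per value (assumed normalization)."""
    # Replace runs of non-word characters with "_" and lowercase the result.
    keys = [re.sub(r"\W+", "_", value).strip("_").lower() for value in values]
    Lookup = namedtuple("Lookup", keys)
    return Lookup(*values)


lookup = build_lookup(["ADGB-DT", "BRCA2"])
print(lookup.adgb_dt)

# A dictionary view keyed by the original values, similar in spirit to lookup.dict().
lookup_dict = {value: value for value in lookup}
print(lookup_dict["ADGB-DT"])
```

Values that normalize to invalid or duplicate identifiers would make namedtuple raise a ValueError in this toy; a real implementation needs to handle those cases.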
- classmethod search(string, *, field=None, limit=20, case_sensitive=False)¶
Search.
- Parameters:
string (str) – The input string to match against the field ontology values.
field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.
limit (int | None, default: 20) – Maximum amount of top results to return.
case_sensitive (bool, default: False) – Whether the match is case sensitive.
- Return type:
QuerySet
- Returns:
A sorted DataFrame of search results with a score in column score. If return_queryset is True, a QuerySet.
Examples
>>> ulabels = ln.ULabel.from_values(["ULabel1", "ULabel2", "ULabel3"], field="name")
>>> ln.save(ulabels)
>>> ln.ULabel.search("ULabel2")
- classmethod standardize(values, field=None, *, return_field=None, return_mapper=False, case_sensitive=False, mute=False, public_aware=True, keep='first', synonyms_field='synonyms', organism=None, source=None, strict_source=False)¶
Maps input synonyms to standardized names.
- Parameters:
values (list[str] | Series | array) – Identifiers that will be standardized.
field (str | DeferredAttribute | None, default: None) – The field representing the standardized names.
return_field (str, default: None) – The field to return. Defaults to field.
return_mapper (bool, default: False) – If True, returns {input_value: standardized_name}.
case_sensitive (bool, default: False) – Whether the mapping is case sensitive.
mute (bool, default: False) – Whether to mute logging.
public_aware (bool, default: True) – Whether to standardize from Bionty reference. Defaults to True for Bionty registries.
keep (Literal['first', 'last', False], default: 'first') – When a synonym maps to multiple names, determines which duplicates to mark as pd.DataFrame.duplicated:
- "first": returns the first mapped standardized name
- "last": returns the last mapped standardized name
- False: returns all mapped standardized names.
When keep is False, the returned list of standardized names will contain nested lists in case of duplicates.
When a field is converted into return_field, keep marks which matches to keep when multiple return_field values map to the same field value.
synonyms_field (str, default: 'synonyms') – A field containing the concatenated synonyms.
organism (str | Record | None, default: None) – An Organism name or record.
source (Record | None, default: None) – A bionty.Source record that specifies the version to validate against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry.
- If False, validation will include all records in the registry, ignoring the specified source.
- If True, validation will only include records in the registry that are linked to the specified source.
Note: this parameter won’t affect validation against bionty/public sources.
- Return type:
list[str] | dict[str, str]
- Returns:
If return_mapper is False, a list of standardized names. Otherwise, a dictionary of mapped values with mappable synonyms as keys and standardized names as values.
See also
add_synonym()
Add synonyms.
remove_synonym()
Remove synonyms.
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> ln.save(bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol"))
>>> gene_synonyms = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
>>> standardized_names = bt.Gene.standardize(gene_synonyms)
>>> standardized_names
['A1CF', 'A1BG', 'BRCA2', 'FANCD20']
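The effect of the keep parameter is easiest to see on a small mapping. The sketch below is a pure-Python toy, not the library's implementation: the synonym_map structure, the hypothetical entry syn_x with two candidate names, and the function signature are all assumptions. Unknown values pass through unchanged, as FANCD20 does in the example above.

```python
def standardize(values, synonym_map, keep="first"):
    """Map synonyms to standardized names; unknown values pass through unchanged."""
    result = []
    for value in values:
        names = synonym_map.get(value)
        if not names:
            result.append(value)
        elif len(names) == 1 or keep == "first":
            result.append(names[0])
        elif keep == "last":
            result.append(names[-1])
        elif keep is False:
            # Ambiguous synonym: return all candidates as a nested list.
            result.append(list(names))
        else:
            raise ValueError("keep must be 'first', 'last' or False")
    return result


# "syn_x" is a made-up synonym that maps to two made-up names.
synonym_map = {"FANCD1": ["BRCA2"], "syn_x": ["NAME_A", "NAME_B"]}

print(standardize(["A1CF", "FANCD1", "FANCD20"], synonym_map))
print(standardize(["syn_x"], synonym_map, keep=False))
```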
- classmethod using(instance)¶
Use a non-default LaminDB instance.
- Parameters:
instance (str | None) – An instance identifier of form “account_handle/instance_name”.
- Return type:
QuerySet
Examples
>>> ln.ULabel.using("account_handle/instance_name").search("ULabel7", field="name")
              uid  score
name
ULabel7  g7Hk9b2v  100.0
ULabel5  t4Jm6s0q   75.0
ULabel6  r2Xw8p1z   75.0
- classmethod validate(values, field=None, *, mute=False, organism=None, source=None, strict_source=False)¶
Validate values against existing values of a string field.
Note that this is strict validation: only exact matches are accepted.
- Parameters:
values (list[str] | Series | array) – Values that will be validated against the field.
field (str | DeferredAttribute | None, default: None) – The field of values. Examples are 'ontology_id' to map against the source ID or 'name' to map against the ontologies field names.
mute (bool, default: False) – Whether to mute logging.
organism (str | Record | None, default: None) – An Organism name or record.
source (Record | None, default: None) – A bionty.Source record that specifies the version to validate against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry.
- If False, validation will include all records in the registry, ignoring the specified source.
- If True, validation will only include records in the registry that are linked to the specified source.
Note: this parameter won’t affect validation against bionty/public sources.
- Return type:
ndarray
- Returns:
A vector of booleans indicating if an element is validated.
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> ln.save(bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol"))
>>> gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
>>> bt.Gene.validate(gene_symbols, field=bt.Gene.symbol)
array([ True, True, False, False])
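Both validate() and inspect() are described as exact-match checks, differing in what they return. The toy below mirrors that relationship on plain lists so the two examples above can be compared side by side. It is a sketch under assumptions: the real methods query a registry and validate() returns an ndarray, whereas this version takes an explicit list of known values and returns a Python list; InspectResult is a hypothetical stand-in for the real result object.

```python
from collections import namedtuple

InspectResult = namedtuple("InspectResult", ["validated", "non_validated"])


def validate(values, known):
    """Return one boolean per value: True if an exact match exists."""
    known = set(known)
    return [value in known for value in values]


def inspect(values, known):
    """Split values into validated and non-validated, preserving input order."""
    flags = validate(values, known)
    return InspectResult(
        validated=[v for v, ok in zip(values, flags) if ok],
        non_validated=[v for v, ok in zip(values, flags) if not ok],
    )


known_symbols = ["A1CF", "A1BG", "BRCA2"]
gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
flags = validate(gene_symbols, known_symbols)
result = inspect(gene_symbols, known_symbols)
print(flags, result.validated, result.non_validated)
```

Note that FANCD1 fails here even though it is a synonym of BRCA2 in the standardize() example: exact matching does not consult synonyms.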
Methods¶
- add_synonym(synonym, force=False, save=None)¶
Add synonyms to a record.
- Parameters:
synonym (str | list[str] | Series | array) – The synonyms to add to the record.
force (bool, default: False) – Whether to add synonyms even if they are already synonyms of other records.
save (bool | None, default: None) – Whether to save the record to the database.
See also
remove_synonym()
Remove synonyms.
Examples
>>> import bionty as bt
>>> bt.CellType.from_source(name="T cell").save()
>>> lookup = bt.CellType.lookup()
>>> record = lookup.t_cell
>>> record.synonyms
'T-cell|T lymphocyte|T-lymphocyte'
>>> record.add_synonym("T cells")
>>> record.synonyms
'T cells|T-cell|T-lymphocyte|T lymphocyte'
- delete()¶
Delete.
- Return type:
None
- describe(return_str=False)¶
Describe schema.
- Return type:
None | str
- remove_synonym(synonym)¶
Remove synonyms from a record.
- Parameters:
synonym (str | list[str] | Series | array) – The synonym values to remove.
See also
add_synonym()
Add synonyms.
Examples
>>> import bionty as bt
>>> bt.CellType.from_source(name="T cell").save()
>>> lookup = bt.CellType.lookup()
>>> record = lookup.t_cell
>>> record.synonyms
'T-cell|T lymphocyte|T-lymphocyte'
>>> record.remove_synonym("T-cell")
>>> record.synonyms
'T lymphocyte|T-lymphocyte'
- set_abbr(value)¶
Set value for abbr field and add to synonyms.
- Parameters:
value (str) – A value for an abbreviation.
Examples
>>> import bionty as bt
>>> bt.ExperimentalFactor.from_source(name="single-cell RNA sequencing").save()
>>> scrna = bt.ExperimentalFactor.get(name="single-cell RNA sequencing")
>>> scrna.abbr
None
>>> scrna.synonyms
'single-cell RNA-seq|single-cell transcriptome sequencing|scRNA-seq|single cell RNA sequencing'
>>> scrna.set_abbr("scRNA")
>>> scrna.abbr
'scRNA'
>>> scrna.synonyms
'scRNA|single-cell RNA-seq|single cell RNA sequencing|single-cell transcriptome sequencing|scRNA-seq'
>>> scrna.save()
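The examples for add_synonym(), remove_synonym() and set_abbr() all show synonyms stored as a single pipe-separated string. The toy class below sketches how such a field could be maintained. It is not the library's code: ToyRecord and its internals are hypothetical, the abbreviation "TC" is made up, and this version preserves insertion order, whereas the outputs above suggest the real ordering is not guaranteed.

```python
class ToyRecord:
    """Minimal stand-in for a record with pipe-separated synonyms."""

    def __init__(self, name, synonyms=""):
        self.name = name
        self.synonyms = synonyms
        self.abbr = None

    def _split(self):
        # An empty string means no synonyms.
        return [s for s in self.synonyms.split("|") if s]

    def add_synonym(self, synonym):
        current = self._split()
        if synonym not in current:
            current.append(synonym)
        self.synonyms = "|".join(current)

    def remove_synonym(self, synonym):
        self.synonyms = "|".join(s for s in self._split() if s != synonym)

    def set_abbr(self, value):
        # Setting an abbreviation also registers it as a synonym.
        self.abbr = value
        self.add_synonym(value)


record = ToyRecord("T cell", "T-cell|T lymphocyte|T-lymphocyte")
record.add_synonym("T cells")
record.remove_synonym("T-cell")
record.set_abbr("TC")
print(record.synonyms)
```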