lamindb.Curator

class lamindb.Curator

Bases: BaseCurator

Dataset curator.

A Curator object makes it easy to save validated & annotated artifacts.

Example:

>>> curator = ln.Curator.from_df(
...     df,
...     # define validation criteria as mappings
...     columns=ln.Feature.name,  # map column names
...     categoricals={"perturbation": ln.ULabel.name},  # map categories
... )
>>> curator.validate()  # validate the data in df
>>> artifact = curator.save_artifact(description="my RNA-seq")
>>> artifact.describe()  # see annotations

curator.validate() maps values within df according to the mapping criteria and logs validated & problematic values.

If you find non-validated values, you have two options:

  • new values found in the data can be registered using add_new_from()

  • non-validated values can be accessed using non_validated() and addressed manually
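To make the two options concrete, here is a minimal plain-Python sketch of what validating a column against a registry amounts to. It does not call lamindb: `registered_names` and `partition_values` are hypothetical stand-ins for a registry field and the curator's internal check, and the values are made up for illustration.

```python
# Hypothetical stand-in for the names stored in a registry such as ln.ULabel.
registered_names = {"DMSO", "IFNG"}


def partition_values(values, registry):
    """Split unique values into validated and non-validated lists, keeping order."""
    unique = list(dict.fromkeys(values))  # de-duplicate while preserving order
    validated = [v for v in unique if v in registry]
    non_validated = [v for v in unique if v not in registry]
    return validated, non_validated


column = ["DMSO", "IFNG", "DMSO", "ifng-typo"]
validated, non_validated = partition_values(column, registered_names)

# Option 1: register the new values (the role add_new_from() plays),
# after which the same column no longer has non-validated values.
registered_names.update(non_validated)
_, remaining = partition_values(column, registered_names)
```

In practice a value like "ifng-typo" is more likely a mistake than a new category, which is when the second option (inspecting the non-validated values and fixing them manually) is the better choice.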

Class methods

classmethod from_anndata(data, var_index, categoricals=None, obs_columns=FieldAttr(Feature.name), using_key=None, verbosity='hint', organism=None, sources=None)

Curation flow for AnnData.

See also Curator.

Note that if genes are removed from the AnnData object, the object should be recreated using from_anndata().

See the guide "Curate AnnData based on the CELLxGENE schema" for instructions on how to curate against a specific CELLxGENE schema version.

Parameters:
  • data (ad.AnnData | UPathStr) – The AnnData object or an AnnData-like path.

  • var_index (FieldAttr) – The registry field for mapping the .var index.

  • categoricals (dict[str, FieldAttr] | None, default: None) – A dictionary mapping .obs.columns to a registry field.

  • obs_columns (FieldAttr, default: FieldAttr(Feature.name)) – The registry field for mapping the .obs.columns.

  • using_key (str | None, default: None) – A reference LaminDB instance.

  • verbosity (str, default: 'hint') – The verbosity level.

  • organism (str | None, default: None) – The organism name.

  • sources (dict[str, Record] | None, default: None) – A dictionary mapping .obs.columns to Source records.

  • exclude – A dictionary mapping column names to values to exclude from validation. When specific Source instances are pinned and may lack default values (e.g., “unknown” or “na”), using the exclude parameter ensures they are not validated.
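The effect of exclude can be sketched in plain Python as follows. This is an illustration of the idea only, not lamindb's implementation: `values_to_validate`, the column contents, and the exclude mapping are all hypothetical.

```python
def values_to_validate(column_values, excluded):
    """Return the unique values of a column that still need validation."""
    # Accept either a single string or a list of strings per column.
    excluded = {excluded} if isinstance(excluded, str) else set(excluded)
    return [v for v in dict.fromkeys(column_values) if v not in excluded]


# Hypothetical column and exclude mapping, in the shape described above.
columns = {"cell_type": ["T cell", "unknown", "B cell", "unknown"]}
exclude = {"cell_type": ["unknown", "na"]}

to_check = {
    name: values_to_validate(values, exclude.get(name, []))
    for name, values in columns.items()
}
```

Here "unknown" never reaches the validation step, so it cannot be reported as non-validated even if the pinned source has no such term.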

Return type:

AnnDataCurator

Examples

>>> import bionty as bt
>>> curator = ln.Curator.from_anndata(
...     adata,
...     var_index=bt.Gene.ensembl_gene_id,
...     categoricals={
...         "cell_type_ontology_id": bt.CellType.ontology_id,
...         "donor_id": ln.ULabel.name
...     },
...     organism="human",
... )

classmethod from_df(df, categoricals=None, columns=FieldAttr(Feature.name), using_key=None, verbosity='hint', organism=None)

Curation flow for a DataFrame object.

See also Curator.

Parameters:
  • df (DataFrame) – The DataFrame object to curate.

  • columns (DeferredAttribute, default: FieldAttr(Feature.name)) – The field attribute for the feature column.

  • categoricals (dict[str, DeferredAttribute] | None, default: None) – A dictionary mapping column names to a registry field.

  • using_key (str | None, default: None) – The reference instance containing registries to validate against.

  • verbosity (str, default: 'hint') – The verbosity level.

  • organism (str | None, default: None) – The organism name.

  • sources – A dictionary mapping column names to Source records.

  • exclude – A dictionary mapping column names to values to exclude from validation. When specific Source instances are pinned and may lack default values (e.g., “unknown” or “na”), using the exclude parameter ensures they are not validated.

Return type:

DataFrameCurator

Returns:

A curator object.

Examples

>>> import bionty as bt
>>> curator = ln.Curator.from_df(
...     df,
...     categoricals={
...         "cell_type_ontology_id": bt.CellType.ontology_id,
...         "donor_id": ln.ULabel.name
...     }
... )

classmethod from_mudata(mdata, var_index, categoricals=None, using_key=None, verbosity='hint', organism=None)

Curation flow for a MuData object.

See also Curator.

Note that if genes or other measurements are removed from the MuData object, the object should be recreated using from_mudata().

Parameters:
  • mdata (MuData) – The MuData object to curate.

  • var_index (dict[str, dict[str, DeferredAttribute]]) – The registry field for mapping the .var index for each modality. For example: {"modality_1": bt.Gene.ensembl_gene_id, "modality_2": bt.CellMarker.name}

  • categoricals (dict[str, DeferredAttribute] | None, default: None) – A dictionary mapping .obs.columns to a registry field. Use modality keys to specify categoricals for MuData slots, such as {"rna:cell_type": bt.CellType.name}.

  • using_key (str | None, default: None) – A reference LaminDB instance.

  • verbosity (str, default: 'hint') – The verbosity level.

  • organism (str | None, default: None) – The organism name.

  • sources – A dictionary mapping .obs.columns to Source records.

  • exclude – A dictionary mapping column names to values to exclude from validation. When specific Source instances are pinned and may lack default values (e.g., “unknown” or “na”), using the exclude parameter ensures they are not validated.
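The modality-prefixed keys used in categoricals follow a "modality:column" pattern. A minimal sketch of reading such keys is shown below; `split_modality_key` is a hypothetical helper, and treating an unprefixed key as having no modality is an assumption, since the text above does not say how lamindb resolves those.

```python
def split_modality_key(key):
    """Split "modality:column" into (modality, column); modality is None if absent."""
    modality, sep, column = key.partition(":")
    if not sep:  # no ":" found, so the key names a column without a modality
        return None, key
    return modality, column


categorical_keys = ["rna:cell_type", "donor_id"]
parsed = [split_modality_key(k) for k in categorical_keys]
```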

Return type:

MuDataCurator

Examples

>>> import bionty as bt
>>> curator = ln.Curator.from_mudata(
...     mdata,
...     var_index={
...         "rna": bt.Gene.ensembl_gene_id,
...         "adt": bt.CellMarker.name
...     },
...     categoricals={
...         "cell_type_ontology_id": bt.CellType.ontology_id,
...         "donor_id": ln.ULabel.name
...     },
...     organism="human",
... )

classmethod from_spatialdata(sdata, var_index, categoricals=None, using_key=None, organism=None, sources=None, exclude=None, verbosity='hint', *, sample_metadata_key='sample')

Curation flow for a SpatialData object.

See also Curator.

Note that if genes or other measurements are removed from the SpatialData object, the object should be recreated.

In the following docstring, an accessor refers to either a .table key or the sample_metadata_key.

Parameters:
  • sdata – The SpatialData object to curate.

  • var_index (dict[str, DeferredAttribute]) – A dictionary mapping table keys to the .var indices.

  • categoricals (dict[str, dict[str, DeferredAttribute]] | None, default: None) – A nested dictionary mapping an accessor to dictionaries that map columns to a registry field.

  • using_key (str | None, default: None) – A reference LaminDB instance.

  • organism (str | None, default: None) – The organism name.

  • sources (dict[str, dict[str, Record]] | None, default: None) – A dictionary mapping an accessor to dictionaries that map columns to Source records.

  • exclude (dict[str, dict] | None, default: None) – A dictionary mapping an accessor to dictionaries of column names to values to exclude from validation. When specific Source instances are pinned and may lack default values (e.g., “unknown” or “na”), using the exclude parameter ensures they are not validated.

  • verbosity (str, default: 'hint') – The verbosity level of the logger.

  • sample_metadata_key (str, default: 'sample') – The key in .attrs that stores the sample level metadata.
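The nested shape of categoricals (accessor, then column, then registry field) can be walked as in the sketch below. It is plain Python under stated assumptions: the strings stand in for registry fields such as bt.CellType.ontology_id, and `iter_categoricals` is a hypothetical helper, not part of lamindb.

```python
# Strings stand in for registry fields; a real call would pass e.g. bt.CellType.ontology_id.
categoricals = {
    "table_1": {
        "cell_type_ontology_id": "CellType.ontology_id",
        "donor_id": "ULabel.name",
    },
    "sample": {"experimental_factor": "ExperimentalFactor.name"},
}
sample_metadata_key = "sample"


def iter_categoricals(nested, sample_key):
    """Yield (kind, accessor, column, field) for every column in the nested mapping."""
    for accessor, columns in nested.items():
        # An accessor is either the sample metadata key or a table key.
        kind = "sample_metadata" if accessor == sample_key else "table"
        for column, field in columns.items():
            yield kind, accessor, column, field


entries = list(iter_categoricals(categoricals, sample_metadata_key))
```

The same two-level layout applies to sources and exclude, which are also keyed by accessor first.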

Examples

>>> import lamindb as ln
>>> import bionty as bt
>>> curator = ln.Curator.from_spatialdata(
...     sdata,
...     var_index={
...         "table_1": bt.Gene.ensembl_gene_id,
...     },
...     categoricals={
...         "table_1":
...             {"cell_type_ontology_id": bt.CellType.ontology_id, "donor_id": ln.ULabel.name},
...         "sample":
...             {"experimental_factor": bt.ExperimentalFactor.name},
...     },
...     organism="human",
... )

classmethod from_tiledbsoma(experiment_uri, var_index, categoricals=None, obs_columns=FieldAttr(Feature.name), using_key=None, organism=None, sources=None, exclude=None)

Curation flow for tiledbsoma.

See also Curator.

Parameters:
  • experiment_uri (lamindb.core.types.UPathStr) – A local or cloud path to a tiledbsoma.Experiment.

  • var_index (dict[str, tuple[str, DeferredAttribute]]) – The registry fields for mapping the .var indices for measurements. Should be in the form {"measurement name": ("var column", field)}. These keys should be used in the flattened form ('{measurement name}__{column name in .var}') in .standardize or .add_new_from, see the output of .var_index.

  • categoricals (dict[str, DeferredAttribute] | None, default: None) – A dictionary mapping categorical .obs columns to a registry field.

  • obs_columns (DeferredAttribute, default: FieldAttr(Feature.name)) – The registry field for mapping the names of the .obs columns.

  • organism (str | None, default: None) – The organism name.

  • sources (dict[str, Record] | None, default: None) – A dictionary mapping .obs columns to Source records.

  • exclude (dict[str, str | list[str]] | None, default: None) – A dictionary mapping column names to values to exclude from validation. When specific Source instances are pinned and may lack default values (e.g., “unknown” or “na”), using the exclude parameter ensures they are not validated.
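The flattened key form described for var_index, '{measurement name}__{column name in .var}', can be derived as in this sketch. `flatten_var_index` is a hypothetical helper written for illustration, and the string "Gene.symbol" stands in for a registry field such as bt.Gene.symbol.

```python
# The string stands in for a registry field; a real call would pass bt.Gene.symbol.
var_index = {"RNA": ("var_id", "Gene.symbol")}


def flatten_var_index(var_index):
    """Map '{measurement}__{var column}' keys to their registry fields."""
    return {
        f"{measurement}__{column}": field
        for measurement, (column, field) in var_index.items()
    }


flat = flatten_var_index(var_index)
```

With the example above, the key to pass to .standardize or .add_new_from would be "RNA__var_id".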

Return type:

SOMACurator

Examples

>>> import bionty as bt
>>> curator = ln.Curator.from_tiledbsoma(
...     "./my_array_store.tiledbsoma",
...     var_index={"RNA": ("var_id", bt.Gene.symbol)},
...     categoricals={
...         "cell_type_ontology_id": bt.CellType.ontology_id,
...         "donor_id": ln.ULabel.name
...     },
...     organism="human",
... )

Methods

save_artifact(description=None, key=None, revises=None, run=None)

Save the dataset as an artifact.

Parameters:
  • description (str | None, default: None) – A description of the dataset.

  • key (str | None, default: None) – A path-like key to reference the artifact in default storage, e.g., "myfolder/myfile.fcs". Artifacts with the same key form a revision family.

  • revises (Artifact | None, default: None) – Previous version of the artifact. Triggers a revision.

  • run (Run | None, default: None) – The run that creates the artifact.

Return type:

Artifact

Returns:

A saved artifact record.

standardize(key)

Replace synonyms with standardized values.

Modifies the dataset in place.

Parameters:
  • key (str) – The name of the column to standardize.

Return type:

None

Returns:

None
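The behavior of standardize (synonyms replaced by their standardized values, dataset modified in place, nothing returned) can be sketched in plain Python. The synonym table, the records, and `standardize_in_place` are hypothetical; lamindb looks up synonyms in its registries rather than in a local dictionary.

```python
# Hypothetical synonym table: synonym -> standardized value.
synonyms = {"T-cell": "T cell", "T lymphocyte": "T cell"}


def standardize_in_place(records, key, synonyms):
    """Replace synonyms in column `key` of each record; mutates records, returns None."""
    for record in records:
        record[key] = synonyms.get(record[key], record[key])


records = [
    {"cell_type": "T-cell"},
    {"cell_type": "B cell"},
    {"cell_type": "T lymphocyte"},
]
result = standardize_in_place(records, "cell_type", synonyms)
standardized = [r["cell_type"] for r in records]
```

Because the change happens in place, keep a copy of the original data if the unstandardized values are still needed.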

validate()

Validate dataset.

This method also registers the validated records in the current instance.

Return type:

bool

Returns:

Boolean indicating whether the dataset is validated.