lamindb.Curator

class lamindb.Curator

Bases: BaseCurator

Dataset curator.

Data curation entails accurately labeling datasets with standardized metadata to facilitate data integration, interpretation and analysis.

The curation flow has several steps:

  1. Instantiate Curator from one of the following dataset objects: a DataFrame (from_df()), an AnnData (from_anndata()), or a MuData (from_mudata()).

During object creation, any passed categoricals found in the object will be saved.

  2. Run validate() to check the data against the defined criteria. This method identifies:

  • Values that can be successfully validated and already exist in the registry.

  • Values that are new and not yet validated, or that are potentially problematic.

  3. Determine how to handle validated and non-validated values:

  • Validated values not yet in the registry can be automatically registered using add_validated_from().

  • Valid and new values can be registered using add_new_from().

  • All unvalidated values can be accessed using non_validated() and subsequently removed from the object at hand.
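
The outcomes described in steps 2 and 3 can be pictured with a toy model. The sketch below is not lamindb's implementation: it uses plain Python sets, and the names `registry`, `public_source`, and `partition_values` are hypothetical, introduced only for illustration. It assumes that a value counts as validated when it is either already registered or known to a reference source.

```python
def partition_values(values, registry, public_source):
    """Sort values into the three groups described in the curation flow.

    All arguments are plain collections of strings. This is a toy model,
    not the lamindb API.
    """
    groups = {
        "in_registry": [],               # validated and already registered
        "validated_not_registered": [],  # known to the source, not yet registered
        "non_validated": [],             # new or potentially problematic
    }
    for value in values:
        if value in registry:
            groups["in_registry"].append(value)
        elif value in public_source:
            groups["validated_not_registered"].append(value)
        else:
            groups["non_validated"].append(value)
    return groups


# Hypothetical example data.
registry = {"T cell"}
public_source = {"T cell", "B cell"}
groups = partition_values(
    ["T cell", "B cell", "my new cell"], registry, public_source
)
print(groups)
```

In this picture, the second group corresponds to what add_validated_from() would register, and the third group to what non_validated() reports.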

Class methods

classmethod from_anndata(data, var_index, categoricals=None, obs_columns=FieldAttr(Feature.name), using_key='default', verbosity='hint', organism=None, sources=None)

Curation flow for AnnData.

See also Curator.

Note that if genes are removed from the AnnData object, the object should be recreated using from_anndata().

See "Curate AnnData based on the CELLxGENE schema" for instructions on how to curate against a specific CELLxGENE schema version.

Parameters:
  • data (ad.AnnData | UPathStr) – The AnnData object or an AnnData-like path.

  • var_index (FieldAttr) – The registry field for mapping the .var index.

  • categoricals (dict[str, FieldAttr] | None, default: None) – A dictionary mapping .obs.columns to a registry field.

  • using_key (str, default: 'default') – A reference LaminDB instance.

  • verbosity (str, default: 'hint') – The verbosity level.

  • organism (str | None, default: None) – The organism name.

  • sources (dict[str, Record] | None, default: None) – A dictionary mapping .obs.columns to Source records.

  • exclude – A dictionary mapping column names to values to exclude.

Return type:

AnnDataCurator

Examples

>>> import lamindb as ln
>>> import bionty as bt
>>> curate = ln.Curator.from_anndata(
...     adata,
...     var_index=bt.Gene.ensembl_gene_id,
...     categoricals={
...         "cell_type_ontology_id": bt.CellType.ontology_id,
...         "donor_id": ln.ULabel.name
...     },
...     organism="human",
... )

classmethod from_df(df, categoricals=None, columns=FieldAttr(Feature.name), using_key=None, verbosity='hint', organism=None)

Curation flow for a DataFrame object.

See also Curator.

Parameters:
  • df (DataFrame) – The DataFrame object to curate.

  • columns (DeferredAttribute, default: FieldAttr(Feature.name)) – The field attribute for the feature column.

  • categoricals (dict[str, DeferredAttribute] | None, default: None) – A dictionary mapping column names to registry_field.

  • using_key (str | None, default: None) – The reference instance containing registries to validate against.

  • verbosity (str, default: 'hint') – The verbosity level.

  • organism (str | None, default: None) – The organism name.

  • sources – A dictionary mapping column names to Source records.

  • exclude – A dictionary mapping column names to values to exclude.

Return type:

DataFrameCurator

Examples

>>> import lamindb as ln
>>> import bionty as bt
>>> curate = ln.Curator.from_df(
...     df,
...     categoricals={
...         "cell_type_ontology_id": bt.CellType.ontology_id,
...         "donor_id": ln.ULabel.name
...     }
... )

classmethod from_mudata(mdata, var_index, categoricals=None, using_key='default', verbosity='hint', organism=None)

Curation flow for a MuData object.

See also Curator.

Note that if genes or other measurements are removed from the MuData object, the object should be recreated using from_mudata().

Parameters:
  • mdata (MuData) – The MuData object to curate.

  • var_index (dict[str, dict[str, DeferredAttribute]]) – The registry field for mapping the .var index for each modality. For example: {"modality_1": bt.Gene.ensembl_gene_id, "modality_2": bt.CellMarker.name}

  • categoricals (dict[str, DeferredAttribute] | None, default: None) – A dictionary mapping .obs.columns to a registry field. Use modality keys to specify categoricals for MuData slots, such as {"rna:cell_type": bt.CellType.name}.

  • using_key (str, default: 'default') – A reference LaminDB instance.

  • verbosity (str, default: 'hint') – The verbosity level.

  • organism (str | None, default: None) – The organism name.

  • sources – A dictionary mapping .obs.columns to Source records.

  • exclude – A dictionary mapping column names to values to exclude.

Return type:

MuDataCurator

Examples

>>> import lamindb as ln
>>> import bionty as bt
>>> curate = ln.Curator.from_mudata(
...     mdata,
...     var_index={
...         "rna": bt.Gene.ensembl_gene_id,
...         "adt": bt.CellMarker.name
...     },
...     categoricals={
...         "cell_type_ontology_id": bt.CellType.ontology_id,
...         "donor_id": ln.ULabel.name
...     },
...     organism="human",
... )
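
The modality-prefixed keys mentioned for categoricals (for example "rna:cell_type") can be read as a modality name and a column name joined by a colon. The following is a minimal sketch of that reading in plain Python, not lamindb's actual parsing logic; `split_categorical_key` is a hypothetical helper, and treating unprefixed keys as applying to the top-level .obs is an assumption.

```python
def split_categorical_key(key, modalities):
    """Split 'modality:column' into (modality, column).

    Keys without a known modality prefix are returned with modality None,
    which this sketch interprets as a column of the top-level .obs.
    """
    prefix, sep, column = key.partition(":")
    if sep and prefix in modalities:
        return prefix, column
    return None, key


# Hypothetical modality names and categorical keys.
modalities = {"rna", "adt"}
parsed = {
    key: split_categorical_key(key, modalities)
    for key in ["rna:cell_type", "donor_id"]
}
print(parsed)
```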

Methods

save_artifact(description=None, **kwargs)

Save the dataset as artifact.

Parameters:
  • description (str | None, default: None) – Description of the dataset.

  • **kwargs – Object level metadata.

Return type:

Artifact

Returns:

A saved artifact record.

validate()

Validate dataset.

Return type:

bool

Returns:

Boolean indicating whether the dataset is validated.
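
Because validate() returns a boolean, one possible pattern is to save the artifact only when validation succeeds. The sketch below illustrates that control flow; `save_if_valid`, `CuratorLike`, and `StubCurator` are hypothetical names that are not part of lamindb, and the stub merely stands in for a real curator so that the example runs without a LaminDB instance.

```python
from typing import Any, Optional, Protocol


class CuratorLike(Protocol):
    """Anything exposing the two methods documented above."""

    def validate(self) -> bool: ...

    def save_artifact(
        self, description: Optional[str] = None, **kwargs: Any
    ) -> Any: ...


def save_if_valid(
    curator: CuratorLike, description: Optional[str] = None
) -> Optional[Any]:
    """Save the dataset only if validate() returns True, else return None."""
    if curator.validate():
        return curator.save_artifact(description=description)
    return None


class StubCurator:
    """Hypothetical stand-in for a curator, used only for this example."""

    def __init__(self, is_valid: bool) -> None:
        self.is_valid = is_valid

    def validate(self) -> bool:
        return self.is_valid

    def save_artifact(self, description=None, **kwargs):
        # A real curator would return an Artifact record here.
        return {"description": description}


saved = save_if_valid(StubCurator(is_valid=True), "my dataset")
skipped = save_if_valid(StubCurator(is_valid=False), "my dataset")
```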