lamindb.Curator
- class lamindb.Curator
Bases: BaseCurator
Dataset curator.
Data curation entails accurately labeling datasets with standardized metadata to facilitate data integration, interpretation and analysis.
The curation flow has several steps:
- Instantiate Curator from one of the supported dataset objects, using the matching class method: DataFrame (from_df()), AnnData (from_anndata()), or MuData (from_mudata()). During object creation, any passed categoricals found in the object will be saved.
- Run validate() to check the data against the defined criteria. This method identifies:
  - Values that can be successfully validated and already exist in the registry.
  - Values that are new and not yet validated, or that are potentially problematic.
- Determine how to handle validated and non-validated values:
  - Validated values not yet in the registry can be automatically registered using add_validated_from().
  - Valid and new values can be registered using add_new_from().
  - All unvalidated values can be accessed using non_validated() and subsequently removed from the object at hand.
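The split that validate() reports can be pictured with a toy model. The sketch below is not LaminDB's implementation: the registry set and the split_values helper are hypothetical stand-ins that only illustrate how values fall into a validated group and a non-validated group.

```python
def split_values(values, registry):
    """Partition values into those present in the registry and those absent from it."""
    validated, non_validated = [], []
    for value in dict.fromkeys(values):  # de-duplicate while preserving order
        (validated if value in registry else non_validated).append(value)
    return validated, non_validated


# Hypothetical registry contents and dataset column (assumptions for illustration).
registry = {"T cell", "B cell", "monocyte"}
column = ["T cell", "B cell", "T cell", "Bcell", "neuron"]

validated, non_validated = split_values(column, registry)
print(validated)      # ['T cell', 'B cell']
print(non_validated)  # ['Bcell', 'neuron']
```

In this toy model, the non-validated group is what a user would inspect before deciding whether to register the values as new or to remove them from the dataset.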
Class methods
- classmethod from_anndata(data, var_index, categoricals=None, obs_columns=FieldAttr(Feature.name), using_key='default', verbosity='hint', organism=None, sources=None)
Curation flow for AnnData.
See also Curator.
Note that if genes are removed from the AnnData object, the Curator object should be recreated using from_anndata().
See Curate AnnData based on the CELLxGENE schema for instructions on how to curate against a specific cellxgene schema version.
- Parameters:
data (ad.AnnData | UPathStr) – The AnnData object or an AnnData-like path.
var_index (FieldAttr) – The registry field for mapping the .var index.
categoricals (dict[str, FieldAttr] | None, default: None) – A dictionary mapping .obs.columns to a registry field.
using_key (str, default: 'default') – A reference LaminDB instance.
verbosity (str, default: 'hint') – The verbosity level.
organism (str | None, default: None) – The organism name.
sources (dict[str, Record] | None, default: None) – A dictionary mapping .obs.columns to Source records.
exclude – A dictionary mapping column names to values to exclude.
- Return type:
AnnDataCurator
Examples
>>> import lamindb as ln
>>> import bionty as bt
>>> curate = ln.Curator.from_anndata(
...     adata,
...     var_index=bt.Gene.ensembl_gene_id,
...     categoricals={
...         "cell_type_ontology_id": bt.CellType.ontology_id,
...         "donor_id": ln.ULabel.name
...     },
...     organism="human",
... )
- classmethod from_df(df, categoricals=None, columns=FieldAttr(Feature.name), using_key=None, verbosity='hint', organism=None)
Curation flow for a DataFrame object.
See also Curator.
- Parameters:
df (DataFrame) – The DataFrame object to curate.
columns (DeferredAttribute, default: FieldAttr(Feature.name)) – The field attribute for the feature column.
categoricals (dict[str, DeferredAttribute] | None, default: None) – A dictionary mapping column names to a registry field.
using_key (str | None, default: None) – The reference instance containing registries to validate against.
verbosity (str, default: 'hint') – The verbosity level.
organism (str | None, default: None) – The organism name.
sources – A dictionary mapping column names to Source records.
exclude – A dictionary mapping column names to values to exclude.
- Return type:
Examples
>>> import lamindb as ln
>>> import bionty as bt
>>> curate = ln.Curator.from_df(
...     df,
...     categoricals={
...         "cell_type_ontology_id": bt.CellType.ontology_id,
...         "donor_id": ln.ULabel.name
...     }
... )
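To make the roles of categoricals and exclude concrete, here is a minimal sketch in plain Python. It does not call LaminDB: the table is a dict of lists standing in for a DataFrame, the allowed-value sets stand in for registry fields, and check_columns is a hypothetical helper. It also assumes that excluded values are simply skipped during the check, which the description above does not spell out.

```python
def check_columns(table, categoricals, exclude=None):
    """Return, per column, the values that are not in that column's allowed set."""
    exclude = exclude or {}
    problems = {}
    for column, allowed in categoricals.items():
        skipped = set(exclude.get(column, ()))
        unknown = sorted(
            {value for value in table[column]
             if value not in allowed and value not in skipped}
        )
        if unknown:
            problems[column] = unknown
    return problems


# Hypothetical data: a dict of lists standing in for a DataFrame.
table = {
    "cell_type": ["T cell", "B cell", "unknown", "Tcell"],
    "donor_id": ["D1", "D2", "D1", "D9"],
}
# Hypothetical allowed values standing in for registry fields.
categoricals = {
    "cell_type": {"T cell", "B cell"},
    "donor_id": {"D1", "D2"},
}

problems = check_columns(table, categoricals, exclude={"cell_type": ["unknown"]})
print(problems)  # {'cell_type': ['Tcell'], 'donor_id': ['D9']}
```

Columns that are not listed in categoricals are never inspected in this sketch, which mirrors the idea that only the mapped columns are checked against a registry field.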
- classmethod from_mudata(mdata, var_index, categoricals=None, using_key='default', verbosity='hint', organism=None)
Curation flow for a MuData object.
See also Curator.
Note that if genes or other measurements are removed from the MuData object, the Curator object should be recreated using from_mudata().
- Parameters:
mdata (MuData) – The MuData object to curate.
var_index (dict[str, dict[str, DeferredAttribute]]) – The registry field for mapping the .var index for each modality. For example: {"modality_1": bt.Gene.ensembl_gene_id, "modality_2": bt.CellMarker.name}
categoricals (dict[str, DeferredAttribute] | None, default: None) – A dictionary mapping .obs.columns to a registry field. Use modality keys to specify categoricals for MuData slots, such as "rna:cell_type": bt.CellType.name.
using_key (str, default: 'default') – A reference LaminDB instance.
verbosity (str, default: 'hint') – The verbosity level.
organism (str | None, default: None) – The organism name.
sources – A dictionary mapping .obs.columns to Source records.
exclude – A dictionary mapping column names to values to exclude.
- Return type:
Examples
>>> import lamindb as ln
>>> import bionty as bt
>>> curate = ln.Curator.from_mudata(
...     mdata,
...     var_index={
...         "rna": bt.Gene.ensembl_gene_id,
...         "adt": bt.CellMarker.name
...     },
...     categoricals={
...         "cell_type_ontology_id": bt.CellType.ontology_id,
...         "donor_id": ln.ULabel.name
...     },
...     organism="human",
... )
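The modality-prefixed keys described for categoricals can be read as a modality name and a column name joined by a colon. The helper below is a hypothetical illustration, not part of LaminDB; it assumes that a key without a prefix refers to the top-level .obs of the MuData object.

```python
def split_modality_key(key):
    """Split 'modality:column' into (modality, column); modality is None without a prefix."""
    modality, sep, column = key.partition(":")
    if not sep:
        return None, key
    return modality, column


# Hypothetical categoricals keys; the values would be registry fields in practice.
keys = ["rna:cell_type", "adt:marker_panel", "donor_id"]
parsed = [split_modality_key(key) for key in keys]
print(parsed)
# [('rna', 'cell_type'), ('adt', 'marker_panel'), (None, 'donor_id')]
```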
Methods
- save_artifact(description=None, **kwargs)
Save the dataset as an artifact.
- Parameters:
description (str | None, default: None) – Description of the DataFrame object.
**kwargs – Object level metadata.
- Return type:
- Returns:
A saved artifact record.
- validate()
Validate dataset.
- Return type:
bool
- Returns:
Boolean indicating whether the dataset is validated.
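Because validate() returns a boolean, one natural pattern is to save only when validation passes. The sketch below shows that control flow with a hypothetical ToyCurator class that mimics the two method names. It is a stand-in for illustration and does not reproduce LaminDB's behavior, in particular what save_artifact() does when called on a dataset that has not been validated.

```python
class ToyCurator:
    """Hypothetical stand-in exposing validate() and save_artifact()."""

    def __init__(self, values, registry):
        self.values = list(values)
        self.registry = set(registry)

    def validate(self):
        # True only if every value is known to the registry.
        return all(value in self.registry for value in self.values)

    def save_artifact(self, description=None):
        # A real curator would return a saved artifact record; this returns a dict.
        return {"description": description, "n_values": len(self.values)}


def curate_and_save(curator, description):
    """Save only when validation passes; otherwise return None."""
    if not curator.validate():
        return None
    return curator.save_artifact(description=description)


clean = ToyCurator(["T cell", "B cell"], registry={"T cell", "B cell"})
dirty = ToyCurator(["T cell", "Bcell"], registry={"T cell", "B cell"})

print(curate_and_save(clean, "clean dataset"))  # {'description': 'clean dataset', 'n_values': 2}
print(curate_and_save(dirty, "dirty dataset"))  # None
```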