lamindb.Curator¶
- class lamindb.Curator¶
Bases: BaseCurator
Dataset curator.
A Curator object makes it easy to save validated and annotated artifacts.
Example:
>>> import lamindb as ln
>>> curator = ln.Curator.from_df(
...     df,
...     # define validation criteria as mappings
...     columns=ln.Feature.name,  # map column names
...     categoricals={"perturbation": ln.ULabel.name},  # map categories
... )
>>> curator.validate()  # validate the data in df
>>> artifact = curator.save_artifact(description="my RNA-seq")
>>> artifact.describe()  # see annotations
curator.validate() maps values within df according to the mapping criteria and logs validated and problematic values.
If you find non-validated values, you have several options:
- new values found in the data can be registered using add_new_from()
- non-validated values can be accessed using non_validated() and addressed manually
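As a rough mental model of what validation reports, the sketch below splits the values of one column into validated and non-validated ones by checking them against a set of registered names. It is a plain-Python illustration, not lamindb's implementation; the helper `partition_values` and the sample registry contents are hypothetical.

```python
def partition_values(values, registered):
    """Split unique values into (validated, non_validated), preserving order.

    Hypothetical helper: `registered` stands in for the names stored in a
    registry field such as ln.ULabel.name.
    """
    validated, non_validated = [], []
    for value in dict.fromkeys(values):  # de-duplicate while keeping order
        (validated if value in registered else non_validated).append(value)
    return validated, non_validated


column = ["DMSO", "IFNG", "DMSO", "ifn-gamma"]
registered_labels = {"DMSO", "IFNG"}  # assumed registry contents

validated, non_validated = partition_values(column, registered_labels)
print(validated)      # ['DMSO', 'IFNG']
print(non_validated)  # ['ifn-gamma']
```

In this picture, a value such as "ifn-gamma" is what one would either register as new or fix by hand.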
Class methods¶
- classmethod from_anndata(data, var_index, categoricals=None, obs_columns=FieldAttr(Feature.name), using_key=None, verbosity='hint', organism=None, sources=None)¶
Curation flow for AnnData.
See also Curator.
Note that if genes are removed from the AnnData object, the object should be recreated using from_anndata().
See Curate AnnData based on the CELLxGENE schema for instructions on how to curate against a specific CELLxGENE schema version.
- Parameters:
data (ad.AnnData | UPathStr) – The AnnData object or an AnnData-like path.
var_index (FieldAttr) – The registry field for mapping the .var index.
categoricals (dict[str, FieldAttr] | None, default: None) – A dictionary mapping .obs.columns to a registry field.
obs_columns (FieldAttr, default: FieldAttr(Feature.name)) – The registry field for mapping the .obs.columns.
using_key (str | None, default: None) – A reference LaminDB instance.
verbosity (str, default: 'hint') – The verbosity level.
organism (str | None, default: None) – The organism name.
sources (dict[str, Record] | None, default: None) – A dictionary mapping .obs.columns to Source records.
exclude – A dictionary mapping column names to values to exclude from validation. When specific Source instances are pinned and may lack default values (e.g., “unknown” or “na”), using the exclude parameter ensures they are not validated.
- Return type:
AnnDataCurator
Examples
>>> import lamindb as ln
>>> import bionty as bt
>>> curator = ln.Curator.from_anndata(
...     adata,
...     var_index=bt.Gene.ensembl_gene_id,
...     categoricals={
...         "cell_type_ontology_id": bt.CellType.ontology_id,
...         "donor_id": ln.ULabel.name,
...     },
...     organism="human",
... )
- classmethod from_df(df, categoricals=None, columns=FieldAttr(Feature.name), using_key=None, verbosity='hint', organism=None)¶
Curation flow for a DataFrame object.
See also Curator.
- Parameters:
df (DataFrame) – The DataFrame object to curate.
columns (DeferredAttribute, default: FieldAttr(Feature.name)) – The field attribute for the feature column.
categoricals (dict[str, DeferredAttribute] | None, default: None) – A dictionary mapping column names to a registry field.
using_key (str | None, default: None) – The reference instance containing registries to validate against.
verbosity (str, default: 'hint') – The verbosity level.
organism (str | None, default: None) – The organism name.
sources – A dictionary mapping column names to Source records.
exclude – A dictionary mapping column names to values to exclude from validation. When specific Source instances are pinned and may lack default values (e.g., “unknown” or “na”), using the exclude parameter ensures they are not validated.
- Return type:
- Returns:
A curator object.
Examples
>>> import lamindb as ln
>>> import bionty as bt
>>> curator = ln.Curator.from_df(
...     df,
...     categoricals={
...         "cell_type_ontology_id": bt.CellType.ontology_id,
...         "donor_id": ln.ULabel.name,
...     },
... )
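The exclude parameter can be pictured as a filter applied before validation: for each column, the listed values are set aside and only the remainder is checked. The following is a minimal sketch under that assumption; the helper `values_to_validate` is hypothetical and not part of lamindb, and the sample data is made up for illustration.

```python
def values_to_validate(column_values, exclude=None):
    """Return, per column, the unique values that remain after exclusion.

    Hypothetical helper illustrating the idea behind `exclude`.
    """
    exclude = exclude or {}
    remaining = {}
    for column, values in column_values.items():
        skipped = exclude.get(column, [])
        if isinstance(skipped, str):  # accept a single value or a list
            skipped = [skipped]
        remaining[column] = [v for v in dict.fromkeys(values) if v not in skipped]
    return remaining


data = {
    "cell_type_ontology_id": ["CL:0000084", "unknown", "CL:0000236"],
    "donor_id": ["D1", "D2", "na"],
}
result = values_to_validate(
    data, exclude={"cell_type_ontology_id": "unknown", "donor_id": ["na"]}
)
print(result)
```

Placeholder values such as "unknown" or "na" never reach the validation step in this sketch, which is the effect the parameter description refers to.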
- classmethod from_mudata(mdata, var_index, categoricals=None, using_key=None, verbosity='hint', organism=None)¶
Curation flow for a MuData object.
See also Curator.
Note that if genes or other measurements are removed from the MuData object, the object should be recreated using from_mudata().
- Parameters:
mdata (MuData) – The MuData object to curate.
var_index (dict[str, dict[str, DeferredAttribute]]) – The registry field for mapping the .var index for each modality. For example: {"modality_1": bt.Gene.ensembl_gene_id, "modality_2": bt.CellMarker.name}
categoricals (dict[str, DeferredAttribute] | None, default: None) – A dictionary mapping .obs.columns to a registry field. Use modality keys to specify categoricals for MuData slots, such as "rna:cell_type": bt.CellType.name.
using_key (str | None, default: None) – A reference LaminDB instance.
verbosity (str, default: 'hint') – The verbosity level.
organism (str | None, default: None) – The organism name.
sources – A dictionary mapping .obs.columns to Source records.
exclude – A dictionary mapping column names to values to exclude from validation. When specific Source instances are pinned and may lack default values (e.g., “unknown” or “na”), using the exclude parameter ensures they are not validated.
- Return type:
Examples
>>> import lamindb as ln
>>> import bionty as bt
>>> curator = ln.Curator.from_mudata(
...     mdata,
...     var_index={
...         "rna": bt.Gene.ensembl_gene_id,
...         "adt": bt.CellMarker.name,
...     },
...     categoricals={
...         "cell_type_ontology_id": bt.CellType.ontology_id,
...         "donor_id": ln.ULabel.name,
...     },
...     organism="human",
... )
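The categoricals parameter accepts modality-prefixed keys of the form "rna:cell_type". One way to read such keys is to split on the first colon, treating keys without a colon as top-level .obs columns. This is an assumption about the convention, sketched in plain Python; `split_modality_key` is a hypothetical helper, and registry fields are written as strings so the snippet runs without lamindb.

```python
def split_modality_key(key):
    """Split "modality:column" into (modality, column).

    Keys without a colon are returned as (None, key).
    """
    modality, sep, column = key.partition(":")
    if not sep:
        return None, key
    return modality, column


# Registry fields are shown as strings purely for illustration.
categoricals = {
    "rna:cell_type": "bt.CellType.name",
    "donor_id": "ln.ULabel.name",
}
parsed = {key: split_modality_key(key) for key in categoricals}
print(parsed)
```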
- classmethod from_spatialdata(sdata, var_index, categoricals=None, using_key=None, organism=None, sources=None, exclude=None, verbosity='hint', *, sample_metadata_key='sample')¶
Curation flow for a SpatialData object.
See also Curator.
Note that if genes or other measurements are removed from the SpatialData object, the object should be recreated.
In the following docstring, an accessor refers to either a .table key or the sample_metadata_key.
- Parameters:
sdata – The SpatialData object to curate.
var_index (dict[str, DeferredAttribute]) – A dictionary mapping table keys to the .var indices.
categoricals (dict[str, dict[str, DeferredAttribute]] | None, default: None) – A nested dictionary mapping an accessor to dictionaries that map columns to a registry field.
using_key (str | None, default: None) – A reference LaminDB instance.
organism (str | None, default: None) – The organism name.
sources (dict[str, dict[str, Record]] | None, default: None) – A dictionary mapping an accessor to dictionaries that map columns to Source records.
exclude (dict[str, dict] | None, default: None) – A dictionary mapping an accessor to dictionaries of column names to values to exclude from validation. When specific Source instances are pinned and may lack default values (e.g., “unknown” or “na”), using the exclude parameter ensures they are not validated.
verbosity (str, default: 'hint') – The verbosity level of the logger.
sample_metadata_key (str, default: 'sample') – The key in .attrs that stores the sample level metadata.
Examples
>>> import lamindb as ln
>>> import bionty as bt
>>> curator = ln.Curator.from_spatialdata(
...     sdata,
...     var_index={
...         "table_1": bt.Gene.ensembl_gene_id,
...     },
...     categoricals={
...         "table_1": {
...             "cell_type_ontology_id": bt.CellType.ontology_id,
...             "donor_id": ln.ULabel.name,
...         },
...         "sample": {"experimental_factor": bt.ExperimentalFactor.name},
...     },
...     organism="human",
... )
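Because an accessor is either a table key or the sample_metadata_key, a nested categoricals dictionary can be thought of as two groups: per-table columns and sample-level columns. The sketch below separates them under that reading; `split_accessors` is a hypothetical helper, not lamindb code, and registry fields are written as strings so the snippet runs on its own.

```python
def split_accessors(categoricals, sample_metadata_key="sample"):
    """Separate table-level entries from sample-level entries.

    Returns (tables, sample), where `tables` maps table keys to their
    column mappings and `sample` holds the sample-level column mapping.
    """
    tables, sample = {}, {}
    for accessor, columns in categoricals.items():
        if accessor == sample_metadata_key:
            sample.update(columns)
        else:
            tables[accessor] = dict(columns)
    return tables, sample


# Registry fields are shown as strings purely for illustration.
nested = {
    "table_1": {"cell_type_ontology_id": "bt.CellType.ontology_id"},
    "sample": {"experimental_factor": "bt.ExperimentalFactor.name"},
}
tables, sample = split_accessors(nested)
print(tables)
print(sample)
```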
- classmethod from_tiledbsoma(experiment_uri, var_index, categoricals=None, obs_columns=FieldAttr(Feature.name), using_key=None, organism=None, sources=None, exclude=None)¶
Curation flow for tiledbsoma.
See also Curator.
- Parameters:
experiment_uri (lamindb.core.types.UPathStr) – A local or cloud path to a tiledbsoma.Experiment.
var_index (dict[str, tuple[str, DeferredAttribute]]) – The registry fields for mapping the .var indices for measurements. Should be in the form {"measurement name": ("var column", field)}. These keys should be used in the flattened form ('{measurement name}__{column name in .var}') in .standardize or .add_new_from, see the output of .var_index.
categoricals (dict[str, DeferredAttribute] | None, default: None) – A dictionary mapping categorical .obs columns to a registry field.
obs_columns (DeferredAttribute, default: FieldAttr(Feature.name)) – The registry field for mapping the names of the .obs columns.
organism (str | None, default: None) – The organism name.
sources (dict[str, Record] | None, default: None) – A dictionary mapping .obs columns to Source records.
exclude (dict[str, str | list[str]] | None, default: None) – A dictionary mapping column names to values to exclude from validation. When specific Source instances are pinned and may lack default values (e.g., “unknown” or “na”), using the exclude parameter ensures they are not validated.
- Return type:
Examples
>>> import lamindb as ln
>>> import bionty as bt
>>> curator = ln.Curator.from_tiledbsoma(
...     "./my_array_store.tiledbsoma",
...     var_index={"RNA": ("var_id", bt.Gene.symbol)},
...     categoricals={
...         "cell_type_ontology_id": bt.CellType.ontology_id,
...         "donor_id": ln.ULabel.name,
...     },
...     organism="human",
... )
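The flattened key form '{measurement name}__{column name in .var}' mentioned for var_index can be derived mechanically from the nested mapping. The sketch below builds those keys in plain Python; `flatten_var_index` is a hypothetical helper for illustration (the curator's own .var_index output remains the authoritative source), and the registry field is written as a string so the snippet runs without lamindb.

```python
def flatten_var_index(var_index):
    """Map '{measurement}__{var column}' to the corresponding field."""
    return {
        f"{measurement}__{column}": field
        for measurement, (column, field) in var_index.items()
    }


# The registry field is shown as a string purely for illustration.
var_index = {"RNA": ("var_id", "bt.Gene.symbol")}
flat = flatten_var_index(var_index)
print(flat)  # {'RNA__var_id': 'bt.Gene.symbol'}
```

With the example above, "RNA__var_id" is the kind of key one would pass to .standardize or .add_new_from.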
Methods¶
- save_artifact(description=None, key=None, revises=None, run=None)¶
Save the dataset as an artifact.
- Parameters:
description (str | None, default: None) – A description of the dataset.
key (str | None, default: None) – A path-like key to reference the artifact in default storage, e.g., "myfolder/myfile.fcs". Artifacts with the same key form a revision family.
revises (Artifact | None, default: None) – Previous version of the artifact. Triggers a revision.
run (Run | None, default: None) – The run that creates the artifact.
- Return type:
- Returns:
A saved artifact record.
- standardize(key)¶
Replace synonyms with standardized values.
Modifies the dataset in place.
- Parameters:
key (str) – The name of the column to standardize.
- Return type:
None
- Returns:
None
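To make "replace synonyms with standardized values" and the in-place behavior concrete, here is a toy version operating on a plain list. It assumes a simple synonym-to-name lookup table, which is an illustration only and not how lamindb resolves synonyms; `standardize_in_place` and the mapping are hypothetical.

```python
def standardize_in_place(values, synonyms):
    """Replace known synonyms with their standardized name, mutating `values`.

    Returns None, mirroring the documented return type of standardize().
    """
    for i, value in enumerate(values):
        values[i] = synonyms.get(value, value)


cell_types = ["T cell", "T-cell", "B lymphocyte"]
synonym_to_name = {"T-cell": "T cell", "B lymphocyte": "B cell"}  # assumed mapping

returned = standardize_in_place(cell_types, synonym_to_name)
print(cell_types)  # ['T cell', 'T cell', 'B cell']
print(returned)    # None
```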
- validate()¶
Validate dataset.
This method also registers the validated records in the current instance.
- Return type:
bool
- Returns:
Boolean indicating whether the dataset is validated.
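The three methods above are typically combined: validate, standardize where needed, validate again, then save. The sketch below expresses that sequence using only the method names and signatures documented on this page. `validate_and_save` is a hypothetical helper, and `StubCurator` is a stand-in so the snippet runs without a LaminDB instance; a real curator would come from one of the from_* class methods.

```python
def validate_and_save(curator, keys, description):
    """Validate; on failure, standardize the given keys once and retry."""
    if not curator.validate():
        for key in keys:
            curator.standardize(key)
        if not curator.validate():
            return None  # values remain non-validated: inspect them manually
    return curator.save_artifact(description=description)


class StubCurator:
    """Stand-in exposing only validate, standardize and save_artifact."""

    def __init__(self):
        self.standardized = set()
        self.calls = []

    def validate(self):
        self.calls.append("validate")
        # Pretend the data is valid once "cell_type" has been standardized.
        return "cell_type" in self.standardized

    def standardize(self, key):
        self.calls.append(f"standardize:{key}")
        self.standardized.add(key)

    def save_artifact(self, description=None, key=None, revises=None, run=None):
        self.calls.append("save_artifact")
        return {"description": description}  # placeholder for an Artifact


stub = StubCurator()
artifact = validate_and_save(stub, ["cell_type"], "my RNA-seq")
print(stub.calls)
```

Returning None when validation still fails keeps the decision about registering new values or fixing them by hand with the caller.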