lamindb.curators.DataFrameCurator
- class lamindb.curators.DataFrameCurator(dataset, schema, slot=None)
Bases: SlotsCurator
Curator for DataFrame.
Examples
For a simple example using a flexible schema, see from_dataframe().
Here is an example that enforces a minimal set of columns in the dataframe.

    import lamindb as ln

    schema = ln.core.datasets.mini_immuno.define_mini_immuno_schema_flexible()
    df = ln.core.datasets.small_dataset1(otype="DataFrame")
    df.pop("donor")  # remove donor column to trigger validation error
    try:
        artifact = ln.Artifact.from_dataframe(
            df, key="examples/dataset1.parquet", schema=schema
        ).save()
    except ln.errors.ValidationError as error:
        print(error)

Under the hood, this example uses the following schema.

    import lamindb as ln

    schema = ln.Schema(
        name="Mini immuno schema",
        features=[
            ln.Feature.get(name="perturbation"),
            ln.Feature.get(name="cell_type_by_model"),
            ln.Feature.get(name="assay_oid"),
            ln.Feature.get(name="donor"),
            ln.Feature.get(name="concentration"),
            ln.Feature.get(name="treatment_time_h"),
        ],
        flexible=True,  # _additional_ columns in a dataframe are validated & annotated
    ).save()

Valid features & labels were defined as:

    import lamindb as ln
    import bionty as bt

    # define valid labels
    perturbation_type = ln.ULabel(name="Perturbation", is_type=True).save()
    ln.ULabel(name="DMSO", type=perturbation_type).save()
    ln.ULabel(name="IFNG", type=perturbation_type).save()
    bt.CellType.from_source(name="B cell").save()
    bt.CellType.from_source(name="T cell").save()

    # define valid features
    ln.Feature(name="perturbation", dtype=perturbation_type).save()
    ln.Feature(name="cell_type_by_expert", dtype=bt.CellType).save()
    ln.Feature(name="cell_type_by_model", dtype=bt.CellType).save()
    ln.Feature(name="assay_oid", dtype=bt.ExperimentalFactor.ontology_id).save()
    ln.Feature(name="concentration", dtype=str).save()
    ln.Feature(name="treatment_time_h", dtype="num", coerce_dtype=True).save()
    ln.Feature(name="donor", dtype=str, nullable=True).save()
    ln.Feature(name="donor_ethnicity", dtype=list[bt.Ethnicity]).save()

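To make the two checks in these examples concrete, here is a minimal pure-Python sketch: every feature listed in the schema must be present as a column, and categorical values must be among the defined labels. This is not lamindb's implementation. `ValidationError`, `validate_columns`, `REQUIRED`, `VALID_LABELS`, and the dictionary-based "dataframe" with its sample values are hypothetical stand-ins chosen so that the snippet runs without a database. The sketch only tolerates additional columns, whereas the flexible schema above also validates and annotates them.

```python
class ValidationError(Exception):
    """Hypothetical stand-in for lamindb.errors.ValidationError."""


# Required features, mirroring the feature names in the schema above.
REQUIRED = [
    "perturbation", "cell_type_by_model", "assay_oid",
    "donor", "concentration", "treatment_time_h",
]

# Permitted labels per categorical feature (assumption: only these two are checked).
VALID_LABELS = {
    "perturbation": {"DMSO", "IFNG"},
    "cell_type_by_model": {"B cell", "T cell"},
}


def validate_columns(columns: dict[str, list]) -> None:
    """Raise ValidationError on missing required columns or unknown labels.

    Additional columns do not cause an error in this sketch.
    """
    missing = [name for name in REQUIRED if name not in columns]
    if missing:
        raise ValidationError(f"missing required columns: {missing}")
    for name, valid in VALID_LABELS.items():
        unknown = sorted(set(columns[name]) - valid)
        if unknown:
            raise ValidationError(f"invalid labels in {name!r}: {unknown}")


# Illustrative data only; the values are made up for this sketch.
data = {
    "perturbation": ["DMSO", "IFNG"],
    "cell_type_by_model": ["B cell", "T cell"],
    "assay_oid": ["EFO:0008913", "EFO:0008913"],
    "donor": ["D0001", "D0002"],
    "concentration": ["0.1%", "200 nM"],
    "treatment_time_h": [24, 24],
    "extra_note": ["a", "b"],  # additional column, tolerated here
}
validate_columns(data)  # passes: all required columns present, labels valid

data.pop("donor")  # remove the donor column, as in the example above
try:
    validate_columns(data)
except ValidationError as error:
    message = str(error)
    print(message)
```

In the real API, the same pattern is the `try`/`except ln.errors.ValidationError` block shown in the first example.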
It is also possible to curate the attrs slot.

    import lamindb as ln
    from .define_schema_df_metadata import study_metadata_schema

    df = ln.examples.datasets.mini_immuno.get_dataset1(otype="DataFrame")
    schema = ln.Schema(
        features=[ln.Feature(name="perturbation", dtype="str").save()],
        slots={"attrs": study_metadata_schema},
        otype="DataFrame",
    ).save()
    curator = ln.curators.DataFrameCurator(df, schema=schema)
    curator.validate()
    artifact = curator.save_artifact(key="examples/df_with_attrs.parquet")
    artifact.describe()

Attributes
- property cat: DataFrameCatManager
Manage categoricals by updating registries.
- property slots: dict[str, ComponentCurator]
Access sub curators by slot.
Methods
- standardize()
Standardize the dataset. This adds missing columns for features and fills missing values for features with default values.
- Return type:
None
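The two effects of standardize() can be sketched in plain Python. This is only an illustration under stated assumptions, not lamindb's implementation: the `standardize` function, the `defaults` mapping from feature name to default value, and the sample rows are all hypothetical. Unlike the method documented here, which returns None, the sketch returns a new list so that the result is easy to inspect.

```python
def standardize(rows: list[dict], defaults: dict) -> list[dict]:
    """Return copies of rows in which every feature in `defaults` is filled.

    `defaults` maps a feature name to its default value (hypothetical structure).
    """
    standardized = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows stay untouched
        for feature, default in defaults.items():
            # Covers both an absent column and a missing (None) value.
            if new_row.get(feature) is None:
                new_row[feature] = default
        standardized.append(new_row)
    return standardized


rows = [
    {"perturbation": "DMSO", "donor": None},      # missing value
    {"perturbation": "IFNG", "donor": "D0002"},   # "concentration" column absent
]
defaults = {"donor": "unknown", "concentration": "0"}  # made-up defaults
result = standardize(rows, defaults)
print(result)
```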
- validate()
Validate dataset against Schema.
- Raises:
lamindb.errors.ValidationError – If validation fails.
- Return type:
None
- save_artifact(*, key=None, description=None, revises=None, run=None)
Save an annotated artifact.
- Parameters:
key (default: None) – A path-like key to reference the artifact in default storage, e.g., "myfolder/myfile.fcs". Artifacts with the same key form a version family.
description (default: None) – A description.
revises (default: None) – Previous version of the artifact. An alternative to passing key to trigger a new version.
run (default: None) – The run that creates the artifact.
- Returns:
A saved artifact record.
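The relationship between key and revises can be illustrated with a toy in-memory registry. This sketch rests on assumptions and does not reflect how lamindb stores or numbers versions: `Record`, `Store`, the integer `version` counter, and the rule that a revision inherits the key of the record it revises are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical stand-in for a saved artifact record."""
    key: Optional[str]
    version: int
    revises: Optional["Record"] = None


class Store:
    """Toy in-memory registry that groups records into version families."""

    def __init__(self) -> None:
        self.records: list[Record] = []

    def save(self, key: Optional[str] = None,
             revises: Optional[Record] = None) -> Record:
        previous = revises
        if previous is None and key is not None:
            # Same key as an earlier record: continue that version family.
            matches = [r for r in self.records if r.key == key]
            previous = matches[-1] if matches else None
        if previous is not None and key is None:
            key = previous.key  # assumption of this sketch: inherit the key
        version = previous.version + 1 if previous is not None else 1
        record = Record(key=key, version=version, revises=previous)
        self.records.append(record)
        return record


store = Store()
v1 = store.save(key="examples/df_with_attrs.parquet")
v2 = store.save(key="examples/df_with_attrs.parquet")  # same key: new version
v3 = store.save(revises=v2)  # explicit revises: new version without a key
print(v1.version, v2.version, v3.version)
```

Both routes end in the same version family here, which is the sense in which revises is an alternative to passing key.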