lamindb.curators
Curators.
High-level curators
- class lamindb.curators.DataFrameCurator(dataset, schema, *, slot=None, features=None, require_saved_schema=True)
Curator for DataFrame.
- Parameters:
dataset (DataFrame | Artifact) – The DataFrame-like object to validate & annotate.
schema (Schema) – A Schema object that defines the validation constraints.
slot (str | None, default: None) – Indicates the slot in a composite curator for a composite data structure.
require_saved_schema (bool, default: True) – Whether the schema must be saved before curation.
Examples
For a simple example using a flexible schema, see from_dataframe().
Here is an example that enforces a minimal set of columns in the dataframe.

    import lamindb as ln

    schema = ln.examples.datasets.mini_immuno.define_mini_immuno_schema_flexible()
    df = ln.examples.datasets.mini_immuno.get_dataset1(otype="DataFrame")
    df.pop("donor")  # remove donor column to trigger validation error
    try:
        artifact = ln.Artifact.from_dataframe(
            df, key="examples/dataset1.parquet", schema=schema
        ).save()
    except ln.errors.ValidationError as error:
        print(error)
Under the hood, this used the following schema:

    import lamindb as ln

    schema = ln.Schema(
        name="Mini immuno schema",
        features=[
            ln.Feature.get(name="perturbation"),
            ln.Feature.get(name="cell_type_by_model"),
            ln.Feature.get(name="assay_oid"),
            ln.Feature.get(name="donor"),
            ln.Feature.get(name="concentration"),
            ln.Feature.get(name="treatment_time_h"),
        ],
        flexible=True,  # _additional_ columns in a dataframe are validated & annotated
    ).save()
Valid features & labels were defined as:

    import bionty as bt
    import lamindb as ln

    # define valid labels
    perturbation_type = ln.Record(name="Perturbation", is_type=True).save()
    ln.Record(name="DMSO", type=perturbation_type).save()
    ln.Record(name="IFNG", type=perturbation_type).save()
    bt.CellType.from_source(name="B cell").save()
    bt.CellType.from_source(name="T cell").save()

    # define valid features
    ln.Feature(name="perturbation", dtype=perturbation_type).save()
    ln.Feature(name="cell_type_by_expert", dtype=bt.CellType).save()
    ln.Feature(name="cell_type_by_model", dtype=bt.CellType).save()
    ln.Feature(name="assay_oid", dtype=bt.ExperimentalFactor.ontology_id).save()
    ln.Feature(name="concentration", dtype=str).save()
    ln.Feature(name="treatment_time_h", dtype="num", coerce=True).save()
    ln.Feature(name="donor", dtype=str, nullable=True).save()
    ln.Feature(name="donor_ethnicity", dtype=list[bt.Ethnicity]).save()
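The "minimal set of columns" check in the example above can be pictured with a small, library-independent sketch. It is not lamindb's implementation: `check_columns` and the local `ValidationError` class are hypothetical names, and the sketch only compares column names, whereas the curator also validates values against registries.

```python
class ValidationError(Exception):
    """Hypothetical stand-in for lamindb.errors.ValidationError."""


def check_columns(columns, required):
    """Raise if a required column is missing; return the additional columns.

    Assumption: columns beyond the required set are allowed, mirroring
    the flexible schema used in the example above.
    """
    missing = [name for name in required if name not in columns]
    if missing:
        raise ValidationError(f"missing columns: {missing}")
    return [name for name in columns if name not in required]


required = [
    "perturbation",
    "cell_type_by_model",
    "assay_oid",
    "donor",
    "concentration",
    "treatment_time_h",
]
# Same situation as in the example: the "donor" column was removed.
columns = [name for name in required if name != "donor"] + ["cell_type_by_expert"]

try:
    check_columns(columns, required)
    message = ""
except ValidationError as error:
    message = str(error)
    print(message)
```

With the `donor` column restored, the same call would return `["cell_type_by_expert"]`, the kind of additional column that a flexible schema also validates and annotates.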
It is also possible to curate the attrs slot.

    import lamindb as ln

    from .define_schema_df_metadata import study_metadata_schema

    df = ln.examples.datasets.mini_immuno.get_dataset1(otype="DataFrame")
    schema = ln.Schema(
        features=[ln.Feature(name="perturbation", dtype="str").save()],
        slots={"attrs": study_metadata_schema},
        otype="DataFrame",
    ).save()
    curator = ln.curators.DataFrameCurator(df, schema=schema)
    curator.validate()
    artifact = curator.save_artifact(key="examples/df_with_attrs.parquet")
    artifact.describe()
- property cat: DataFrameCatManager
Manage categoricals by updating registries.
- standardize()
Standardize the dataset.
- Adds missing columns for features.
- Fills missing values for features with default values.
- Return type:
None
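The two standardization steps listed above can be illustrated without lamindb. The following is a minimal sketch under stated assumptions: a plain dict of column lists stands in for the DataFrame, `defaults` stands in for features that carry default values, and the function name and data are hypothetical. Unlike the real method, which returns None, the sketch returns a new mapping so the result is easy to inspect.

```python
def standardize(columns, defaults):
    """Add missing columns and fill missing values with defaults.

    columns:  maps column name -> list of values (None marks a missing value)
    defaults: maps feature name -> default value (hypothetical stand-in)
    """
    n_rows = max((len(values) for values in columns.values()), default=0)
    result = {}
    for name, values in columns.items():
        if name in defaults:
            # step 2: fill missing values for features with default values
            values = [defaults[name] if v is None else v for v in values]
        result[name] = list(values)
    for name, default in defaults.items():
        if name not in result:
            # step 1: add missing columns for features
            result[name] = [default] * n_rows
    return result


table = {"perturbation": ["DMSO", None, "IFNG"]}
standardized = standardize(table, {"perturbation": "DMSO", "donor": "unknown"})
print(standardized)
```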
- validate()
Validate dataset against Schema.
- Raises:
lamindb.errors.ValidationError – If validation fails.
- Return type:
None
- save_artifact(*, key=None, description=None, revises=None, run=None)
Save an annotated artifact.
- Parameters:
key (default: None) – A path-like key to reference the artifact in default storage, e.g., "myfolder/myfile.fcs". Artifacts with the same key form a version family.
description (default: None) – A description.
revises (default: None) – Previous version of the artifact. An alternative to passing key to trigger a new version.
run (default: None) – The run that creates the artifact.
- Return type:
Artifact
- Returns:
A saved artifact record.
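The statement that artifacts with the same key form a version family, and that `revises` is an alternative way to trigger a new version, can be pictured with a toy in-memory model. Everything here is hypothetical and simplified: `VersionRegistry` is not a lamindb class, versions are plain integers, and the sketch requires either `key` or `revises`, which is a restriction of the sketch only.

```python
class VersionRegistry:
    """Hypothetical in-memory model of artifacts grouped by key."""

    def __init__(self):
        self._families = {}

    def save(self, content, *, key=None, revises=None):
        # `revises` points at a previous record and reuses its key
        if revises is not None:
            key = revises["key"]
        if key is None:
            raise ValueError("this sketch needs either key or revises")
        family = self._families.setdefault(key, [])
        record = {"key": key, "version": len(family) + 1, "content": content}
        family.append(record)
        return record

    def versions(self, key):
        """Return the version numbers saved under a key, oldest first."""
        return [record["version"] for record in self._families.get(key, [])]


registry = VersionRegistry()
first = registry.save("first upload", key="myfolder/myfile.fcs")
second = registry.save("second upload", key="myfolder/myfile.fcs")
third = registry.save("third upload", revises=second)
print(registry.versions("myfolder/myfile.fcs"))
```

Saving twice under the same key and then once via `revises` yields three records in one family, which is the relationship the `key` and `revises` parameters describe.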
- class lamindb.curators.AnnDataCurator(dataset, schema)
Curator for AnnData.
Uses slots to specify which component contains which schema. Slots are keys that identify where features are stored within composite data structures.
- Parameters:
Examples
Curate Ensembl gene IDs and valid features in obs:
curate_anndata_flexible.py

    import lamindb as ln

    ln.examples.datasets.mini_immuno.define_features_labels()
    adata = ln.examples.datasets.mini_immuno.get_dataset1(otype="AnnData")
    artifact = ln.Artifact.from_anndata(
        adata,
        key="examples/mini_immuno.h5ad",
        schema="ensembl_gene_ids_and_valid_features_in_obs",
    ).save()
    artifact.describe()
Curate uns dictionary:
curate_anndata_uns.py

    import lamindb as ln

    ln.examples.datasets.mini_immuno.define_features_labels()
    adata = ln.examples.datasets.mini_immuno.get_dataset1(otype="AnnData")
    schema = ln.Schema.get(name="Study metadata schema")
    artifact = ln.Artifact.from_anndata(
        adata, schema=schema, key="examples/mini_immuno_uns.h5ad"
    )
    artifact.describe()
- class lamindb.curators.MuDataCurator(dataset, schema)
Curator for MuData.
Uses slots to specify which component contains which schema. Slots are keys that identify where features are stored within composite data structures.
- Parameters:
dataset (MuData | Artifact) – The MuData-like object to validate & annotate.
schema (Schema) – A Schema object that defines the validation constraints.
Example
curate_mudata.py

    import lamindb as ln
    import bionty as bt
    from docs.scripts.define_schema_df_metadata import study_metadata_schema

    # define labels
    perturbation = ln.Record(name="Perturbation", is_type=True).save()
    ln.Record(name="Perturbed", type=perturbation).save()
    ln.Record(name="NT", type=perturbation).save()

    replicate = ln.Record(name="Replicate", is_type=True).save()
    ln.Record(name="rep1", type=replicate).save()
    ln.Record(name="rep2", type=replicate).save()
    ln.Record(name="rep3", type=replicate).save()

    # define the global obs schema
    obs_schema = ln.Schema(
        name="mudata_papalexi21_subset_obs_schema",
        features=[
            ln.Feature(name="perturbation", dtype="cat[Record[Perturbation]]").save(),
            ln.Feature(name="replicate", dtype="cat[Record[Replicate]]").save(),
        ],
    ).save()

    # define the ['rna'].obs schema
    obs_schema_rna = ln.Schema(
        name="mudata_papalexi21_subset_rna_obs_schema",
        features=[
            ln.Feature(name="nCount_RNA", dtype=int).save(),
            ln.Feature(name="nFeature_RNA", dtype=int).save(),
            ln.Feature(name="percent.mito", dtype=float).save(),
        ],
    ).save()

    # define the ['hto'].obs schema
    obs_schema_hto = ln.Schema(
        name="mudata_papalexi21_subset_hto_obs_schema",
        features=[
            ln.Feature(name="nCount_HTO", dtype=float).save(),
            ln.Feature(name="nFeature_HTO", dtype=int).save(),
            ln.Feature(name="technique", dtype=bt.ExperimentalFactor).save(),
        ],
    ).save()

    # define ['rna'].var schema
    var_schema_rna = ln.Schema(
        name="mudata_papalexi21_subset_rna_var_schema",
        itype=bt.Gene.symbol,
        dtype=float,
    ).save()

    # define composite schema
    mudata_schema = ln.Schema(
        name="mudata_papalexi21_subset_mudata_schema",
        otype="MuData",
        slots={
            "obs": obs_schema,
            "rna:obs": obs_schema_rna,
            "hto:obs": obs_schema_hto,
            "rna:var": var_schema_rna,
            "uns:study_metadata": study_metadata_schema,
        },
    ).save()

    # curate a MuData
    mdata = ln.examples.datasets.mudata_papalexi21_subset(with_uns=True)
    bt.settings.organism = "human"  # set the organism to map gene symbols
    curator = ln.curators.MuDataCurator(mdata, mudata_schema)
    artifact = curator.save_artifact(key="examples/mudata_papalexi21_subset.h5mu")
    assert artifact.schema == mudata_schema
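The slot keys in the composite schema above ("obs", "rna:obs", "uns:study_metadata") read like paths into the composite object. A minimal sketch of that reading follows. It assumes the colon separates nesting levels, which is inferred from the keys in the example rather than taken from lamindb's implementation. `resolve_slot` and the nested dict are hypothetical stand-ins for the curator's internal lookup and for a MuData object.

```python
def resolve_slot(data, slot):
    """Follow a colon-separated slot key through a nested mapping."""
    component = data
    for part in slot.split(":"):
        component = component[part]
    return component


# Hypothetical nested layout; column names are taken from the schemas above.
mdata_like = {
    "obs": ["perturbation", "replicate"],
    "rna": {"obs": ["nCount_RNA", "nFeature_RNA", "percent.mito"]},
    "hto": {"obs": ["nCount_HTO", "nFeature_HTO", "technique"]},
    "uns": {"study_metadata": {}},
}

for slot in ["obs", "rna:obs", "hto:obs", "uns:study_metadata"]:
    print(slot, "->", resolve_slot(mdata_like, slot))
```

Each slot key selects one component, and the schema registered under that key would then be applied to that component only.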
- class lamindb.curators.SpatialDataCurator(dataset, schema)
Curator for SpatialData.
Uses slots to specify which component contains which schema. Slots are keys that identify where features are stored within composite data structures.
- Parameters:
Example
curate_spatialdata.py

    import lamindb as ln

    spatialdata = ln.examples.datasets.spatialdata_blobs()
    sdata_schema = ln.Schema.get(name="spatialdata_blobs_schema")
    curator = ln.curators.SpatialDataCurator(spatialdata, sdata_schema)
    try:
        curator.validate()
    except ln.errors.ValidationError:
        pass

    spatialdata.tables["table"].var.drop(index="ENSG00000999999", inplace=True)

    # validate again (must pass now) and save artifact
    artifact = ln.Artifact.from_spatialdata(
        spatialdata, key="examples/spatialdata1.zarr", schema=sdata_schema
    ).save()
    artifact.describe()
- class lamindb.curators.TiledbsomaExperimentCurator(dataset, schema)
Curator for tiledbsoma.Experiment.
Uses slots to specify which component contains which schema. Slots are keys that identify where features are stored within composite data structures.
- Parameters:
dataset (SOMAExperiment | Artifact) – The tiledbsoma.Experiment object.
schema (Schema) – A Schema object that defines the validation constraints.
Example
curate_soma_experiment.py

    import lamindb as ln
    import bionty as bt
    import tiledbsoma as soma
    import tiledbsoma.io

    adata = ln.examples.datasets.mini_immuno.get_dataset1(otype="AnnData")
    tiledbsoma.io.from_anndata("small_dataset.tiledbsoma", adata, measurement_name="RNA")

    obs_schema = ln.Schema(
        name="soma_obs_schema",
        features=[
            ln.Feature(name="cell_type_by_expert", dtype=bt.CellType).save(),
            ln.Feature(name="cell_type_by_model", dtype=bt.CellType).save(),
        ],
    ).save()

    var_schema = ln.Schema(
        name="soma_var_schema",
        features=[
            ln.Feature(name="var_id", dtype=bt.Gene.ensembl_gene_id).save(),
        ],
        coerce=True,
    ).save()

    soma_schema = ln.Schema(
        name="soma_experiment_schema",
        otype="tiledbsoma",
        slots={
            "obs": obs_schema,
            "ms:RNA.T": var_schema,
        },
    ).save()

    with soma.Experiment.open("small_dataset.tiledbsoma") as experiment:
        curator = ln.curators.TiledbsomaExperimentCurator(experiment, soma_schema)
        curator.validate()
        artifact = curator.save_artifact(
            key="examples/soma_experiment.tiledbsoma",
            description="SOMA experiment with schema validation",
        )

    assert artifact.schema == soma_schema
    artifact.describe()
Low-level module
Curator utilities.