lamindb.models.ArtifactSet

class lamindb.models.ArtifactSet

Bases: Iterable

Abstract class representing sets of artifacts returned by queries.

This class automatically extends BasicQuerySet and QuerySet when the base model is Artifact.

Examples

>>> artifacts = ln.Artifact.filter(otype="AnnData")
>>> artifacts # an instance of ArtifactQuerySet inheriting from ArtifactSet
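>>> # ArtifactSet is iterable, so you can loop over the matched artifacts directly;
>>> # a minimal sketch, assuming the filter above matches at least one artifact
>>> for artifact in artifacts:
...     print(artifact)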

Methods

load(join='outer', is_run_input=None, **kwargs)

Cache and load to memory.

Returns an in-memory concatenated DataFrame or AnnData object.

Return type:

DataFrame | AnnData
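
Examples

A minimal sketch, assuming the filtered artifacts store AnnData objects that can be concatenated:

>>> import lamindb as ln
>>> artifacts = ln.Artifact.filter(otype="AnnData")
>>> adata = artifacts.load(join="outer")  # in-memory, concatenated AnnData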

mapped(layers_keys=None, obs_keys=None, obsm_keys=None, obs_filter=None, join='inner', encode_labels=True, unknown_label=None, cache_categories=True, parallel=False, dtype=None, stream=False, is_run_input=None)

Return a map-style dataset.

Returns a pytorch map-style dataset by virtually concatenating AnnData arrays.

By default (stream=False) AnnData arrays are moved into a local cache first.

__getitem__ of the MappedCollection object takes a single integer index and returns a dictionary with the observation data sample for this index from the AnnData objects in the collection. The dictionary has keys for layers_keys (.X is in "X"), obs_keys, obsm_keys (under f"obsm_{key}") and also "_store_idx" for the index of the AnnData object containing this observation sample.

Note

For a guide, see Train a machine learning model on a collection.

This method currently only works for collections or query sets of AnnData artifacts.

Parameters:
  • layers_keys (str | list[str] | None, default: None) – Keys from the .layers slot. layers_keys=None or "X" in the list retrieves .X.

  • obs_keys (str | list[str] | None, default: None) – Keys from the .obs slots.

  • obsm_keys (str | list[str] | None, default: None) – Keys from the .obsm slots.

  • obs_filter (dict[str, str | list[str]] | None, default: None) – Select only observations with these values for the given obs columns. Should be a dictionary with obs column names as keys and filtering values (a string or a list of strings) as values.

  • join (Literal['inner', 'outer'] | None, default: 'inner') – "inner" or "outer" virtual joins. If None is passed, does not join.

  • encode_labels (bool | list[str], default: True) – Encode labels into integers. Can be a list with elements from obs_keys.

  • unknown_label (str | dict[str, str] | None, default: None) – Encode this label to -1. Can be a dictionary with keys from obs_keys if encode_labels=True or from encode_labels if it is a list.

  • cache_categories (bool, default: True) – Enable caching categories of obs_keys for faster access.

  • parallel (bool, default: False) – Enable sampling with multiple processes.

  • dtype (str | None, default: None) – Convert numpy arrays from .X, .layers and .obsm to this dtype.

  • stream (bool, default: False) – Whether to stream data from the array backend.

  • is_run_input (bool | None, default: None) – Whether to track this collection as run input.

Return type:

MappedCollection

Examples

>>> import lamindb as ln
>>> from torch.utils.data import DataLoader
>>> collection = ln.Collection.get(description="my collection")
>>> mapped = collection.mapped(obs_keys=["cell_type", "batch"])
>>> dl = DataLoader(mapped, batch_size=128, shuffle=True)
>>> # also works for query sets of artifacts, '...' represents some filtering condition
>>> # additional filtering on artifacts of the collection
>>> mapped = collection.artifacts.all().filter(...).order_by("-created_at").mapped()
>>> # or directly from a query set of artifacts
>>> mapped = ln.Artifact.filter(..., otype="AnnData").order_by("-created_at").mapped()
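>>> # the dictionary returned per index can be inspected directly; assuming the
>>> # obs keys "cell_type" and "batch" as above, a single sample looks roughly like
>>> sample = mapped[0]
>>> sample.keys()  # dict_keys(['X', 'cell_type', 'batch', '_store_idx'])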

open(engine='pyarrow', is_run_input=None, **kwargs)

Open a dataset for streaming.

Works for pyarrow- and polars-compatible formats (.parquet, .csv, .ipc etc. files, or directories containing such files).

Parameters:
  • engine (Literal['pyarrow', 'polars'], default: 'pyarrow') – Which module to use for lazy loading of a dataframe from pyarrow or polars compatible formats.

  • is_run_input (bool | None, default: None) – Whether to track this artifact as run input.

  • **kwargs – Keyword arguments for pyarrow.dataset.dataset or polars.scan_* functions.

Return type:

Dataset | Iterator[LazyFrame]

Notes

For more info, see guide: Slice arrays.
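
Examples

A minimal sketch of streaming parquet-backed artifacts with the default pyarrow engine; the suffix filter is an assumption about how the artifacts were registered:

>>> import lamindb as ln
>>> artifacts = ln.Artifact.filter(suffix=".parquet")  # assumed filter
>>> dataset = artifacts.open(engine="pyarrow")  # lazy pyarrow dataset, no data loaded yet
>>> dataset.head(5).to_pandas()  # materialize only a small slice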