lamindb.models.ArtifactSet

class lamindb.models.ArtifactSet

Bases: Iterable

Abstract class representing sets of artifacts returned by queries.

This class automatically extends BasicQuerySet and QuerySet when the base model is Artifact.

Examples

>>> artifacts = ln.Artifact.filter(otype="AnnData")
>>> artifacts # an instance of ArtifactQuerySet inheriting from ArtifactSet
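>>> # ArtifactSet is iterable, so you can loop over the matched artifacts directly;
>>> # a minimal sketch, assuming the filter above matches at least one artifact
>>> for artifact in artifacts:
...     print(artifact)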

Methods

load(join='outer', is_run_input=None, **kwargs)

Cache and load to memory.

Returns an in-memory concatenated DataFrame or AnnData object.

Return type:

DataFrame | AnnData
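
Examples

A minimal sketch, assuming the filtered artifacts store AnnData objects that can be concatenated:

>>> import lamindb as ln
>>> artifacts = ln.Artifact.filter(otype="AnnData")
>>> adata = artifacts.load(join="outer")  # in-memory, concatenated AnnData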

mapped(layers_keys=None, obs_keys=None, obsm_keys=None, obs_filter=None, join='inner', encode_labels=True, unknown_label=None, cache_categories=True, parallel=False, dtype=None, stream=False, is_run_input=None)

Return a map-style dataset.

Returns a pytorch map-style dataset by virtually concatenating AnnData arrays.

By default (stream=False) AnnData arrays are moved into a local cache first.

__getitem__ of the MappedCollection object takes a single integer index and returns a dictionary with the observation data sample for this index from the AnnData objects in the collection. The dictionary has keys for layers_keys (.X is in "X"), obs_keys, obsm_keys (under f"obsm_{key}") and also "_store_idx" for the index of the AnnData object containing this observation sample.

Note

For a guide, see Train a machine learning model on a collection.

This method currently only works for collections or query sets of AnnData artifacts.

Parameters:
  • layers_keys (str | list[str] | None, default: None) – Keys from the .layers slot. layers_keys=None or "X" in the list retrieves .X.

  • obs_keys (str | list[str] | None, default: None) – Keys from the .obs slots.

  • obsm_keys (str | list[str] | None, default: None) – Keys from the .obsm slots.

  • obs_filter (dict[str, str | list[str]] | None, default: None) – Select only observations with these values for the given obs columns. Should be a dictionary with obs column names as keys and filtering values (a string or a list of strings) as values.

  • join (Literal['inner', 'outer'] | None, default: 'inner') – "inner" or "outer" virtual joins. If None is passed, does not join.

  • encode_labels (bool | list[str], default: True) – Encode labels into integers. Can be a list with elements from obs_keys.

  • unknown_label (str | dict[str, str] | None, default: None) – Encode this label to -1. Can be a dictionary with keys from obs_keys if encode_labels=True or from encode_labels if it is a list.

  • cache_categories (bool, default: True) – Enable caching categories of obs_keys for faster access.

  • parallel (bool, default: False) – Enable sampling with multiple processes.

  • dtype (str | None, default: None) – Convert numpy arrays from .X, .layers and .obsm to this dtype.

  • stream (bool, default: False) – Whether to stream data from the array backend.

  • is_run_input (bool | None, default: None) – Whether to track this collection as run input.

Return type:

MappedCollection

Examples

>>> import lamindb as ln
>>> from torch.utils.data import DataLoader
>>> collection = ln.Collection.get(description="my collection")
>>> mapped = collection.mapped(obs_keys=["cell_type", "batch"])
>>> dl = DataLoader(mapped, batch_size=128, shuffle=True)
>>> # also works for query sets of artifacts, '...' represents some filtering condition
>>> # additional filtering on artifacts of the collection
>>> mapped = collection.artifacts.all().filter(...).order_by("-created_at").mapped()
>>> # or directly from a query set of artifacts
>>> mapped = ln.Artifact.filter(..., otype="AnnData").order_by("-created_at").mapped()
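>>> # the dictionary returned per index can be inspected directly; assuming the
>>> # obs keys "cell_type" and "batch" as above, a single sample looks roughly like
>>> sample = mapped[0]
>>> sample.keys()  # dict_keys(['X', 'cell_type', 'batch', '_store_idx'])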

open(engine='pyarrow', is_run_input=None, **kwargs)

Open a dataset for streaming.

Works for pyarrow- and polars-compatible formats (.parquet, .csv, .ipc etc. files, or directories containing such files).

Parameters:
  • engine (Literal['pyarrow', 'polars'], default: 'pyarrow') – Which module to use for lazy loading of a dataframe from pyarrow or polars compatible formats.

  • is_run_input (bool | None, default: None) – Whether to track this artifact as run input.

  • **kwargs – Keyword arguments for pyarrow.dataset.dataset or polars.scan_* functions.

Return type:

Dataset | Iterator[LazyFrame]

Notes

For more info, see guide: Slice arrays.
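
Examples

A minimal sketch of streaming parquet-backed artifacts with the default pyarrow engine; the suffix filter is an assumption about how the artifacts were registered:

>>> import lamindb as ln
>>> artifacts = ln.Artifact.filter(suffix=".parquet")  # assumed filter
>>> dataset = artifacts.open(engine="pyarrow")  # lazy pyarrow dataset, no data loaded yet
>>> dataset.head(5).to_pandas()  # materialize only a small slice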