lamindb.core.MappedCollection

class lamindb.core.MappedCollection(path_list, layers_keys=None, obs_keys=None, obsm_keys=None, join='inner', encode_labels=True, unknown_label=None, cache_categories=True, parallel=False, dtype=None)

Bases: object

Map-style collection for use in data loaders.

This class virtually concatenates AnnData arrays as a pytorch map-style dataset.
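As a rough illustration of what "virtually concatenates" means for a map-style dataset, the sketch below maps a global observation index to the object that holds it and to the position inside that object. The helper `locate` and the example sizes are hypothetical and are not part of lamindb; this is only one common way to implement such a lookup, not necessarily how the library does it.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(index, n_obs_per_store):
    """Map a global observation index to (store_idx, local_idx).

    Hypothetical helper for illustration only, not a lamindb function.
    """
    offsets = list(accumulate(n_obs_per_store))  # cumulative end positions
    if not 0 <= index < offsets[-1]:
        raise IndexError(index)
    # First store whose cumulative end position lies beyond `index`
    store_idx = bisect_right(offsets, index)
    start = offsets[store_idx - 1] if store_idx > 0 else 0
    return store_idx, index - start


sizes = [3, 2, 4]  # assumed numbers of observations in three AnnData objects
print(locate(0, sizes))  # (0, 0)
print(locate(3, sizes))  # (1, 0)
print(locate(8, sizes))  # (2, 3)
```

The first element of the returned pair plays the same role as the "_store_idx" entry described below.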

If your AnnData collection is in the cloud, move it into a local cache first for faster access.

__getitem__ of the MappedCollection object takes a single integer index and returns a dictionary with the observation data sample for this index from the AnnData objects in path_list. The dictionary has keys for layers_keys (.X is in "X"), obs_keys, obsm_keys (under f"obsm_{key}") and also "_store_idx" for the index of the AnnData object containing this observation sample.
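The key layout described above can be sketched as a small function that lists the keys one would expect in a returned sample. The helper `expected_sample_keys` and the key names `counts`, `cell_type` and `X_pca` are hypothetical; the order of the keys is an assumption and should not be relied on.

```python
def expected_sample_keys(layers_keys=None, obs_keys=None, obsm_keys=None):
    """List the dictionary keys described for a single sample.

    Hypothetical helper for illustration only, not a lamindb function.
    """
    def as_list(keys):
        if keys is None:
            return []
        return [keys] if isinstance(keys, str) else list(keys)

    keys = as_list(layers_keys) or ["X"]  # layers_keys=None retrieves .X under "X"
    keys += [f"obsm_{key}" for key in as_list(obsm_keys)]  # .obsm entries get a prefix
    keys += as_list(obs_keys)  # .obs entries keep their names
    keys.append("_store_idx")  # index of the AnnData object holding the sample
    return keys


print(expected_sample_keys(["X", "counts"], obs_keys="cell_type", obsm_keys="X_pca"))
# ['X', 'counts', 'obsm_X_pca', 'cell_type', '_store_idx']
```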

Note

For a guide, see Train a machine learning model on a collection.

For a more convenient way to create a MappedCollection, see mapped().

This currently only works for collections of AnnData objects.

The implementation was influenced by the SCimilarity data loader.

Parameters:
  • path_list (list[lamindb.core.types.UPathStr]) – A list of paths to AnnData objects stored in .h5ad or .zarr formats.

  • layers_keys (str | list[str] | None, default: None) – Keys from the .layers slot. layers_keys=None or "X" in the list retrieves .X.

  • obsm_keys (str | list[str] | None, default: None) – Keys from the .obsm slots.

  • obs_keys (str | list[str] | None, default: None) – Keys from the .obs slots.

  • join (Literal['inner', 'outer'] | None, default: 'inner') – "inner" or "outer" virtual joins. If None is passed, does not join.

  • encode_labels (bool | list[str], default: True) – Encode labels into integers. Can be a list with elements from obs_keys.

  • unknown_label (str | dict[str, str] | None, default: None) – Encode this label to -1. Can be a dictionary with keys from obs_keys if encode_labels=True or from encode_labels if it is a list.

  • cache_categories (bool, default: True) – Enable caching categories of obs_keys for faster access.

  • parallel (bool, default: False) – Enable sampling with multiple processes.

  • dtype (str | None, default: None) – Convert numpy arrays from .X, .layers and .obsm to this dtype.
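To make the interplay of encode_labels and unknown_label more concrete, here is a minimal sketch of integer label encoding in which one designated label maps to -1. The function `encode` is hypothetical, and assigning codes in sorted category order is an assumption; the ordering used by the library may differ.

```python
def encode(labels, unknown_label=None):
    """Encode labels as integers, mapping `unknown_label` to -1.

    Hypothetical helper for illustration only, not a lamindb function.
    """
    # Assumption: categories are numbered in sorted order
    categories = sorted(set(labels) - {unknown_label})
    code = {category: i for i, category in enumerate(categories)}
    # Anything without a code (here: only the unknown label) becomes -1
    return [code.get(label, -1) for label in labels], categories


codes, categories = encode(["T cell", "B cell", "unknown", "T cell"], unknown_label="unknown")
print(categories)  # ['B cell', 'T cell']
print(codes)       # [1, 0, -1, 1]
```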

Attributes

property closed: bool

Check if connections to array streaming backend are closed.

Does not matter if parallel=True.

property original_shapes: list[tuple[int, int]]

Shapes of the underlying AnnData objects.

property shape: tuple[int, int]

Shape of the (virtually aligned) dataset.
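The sketch below illustrates how such a shape could follow from original_shapes and the join mode, assuming that the first dimension is the total number of observations and the second is the number of variables that survive the virtual join. Both the helper `joined_shape` and that reading of the second dimension are assumptions, not statements about the library's internals.

```python
def joined_shape(original_shapes, var_names_per_store, join="inner"):
    """Compute (n_obs, n_vars) of a virtual concatenation.

    Hypothetical helper for illustration only, not a lamindb function.
    """
    n_obs = sum(n for n, _ in original_shapes)
    var_sets = [set(names) for names in var_names_per_store]
    if join == "inner":
        joint = set.intersection(*var_sets)  # variables present in every object
    elif join == "outer":
        joint = set.union(*var_sets)  # variables present in any object
    else:
        raise ValueError(f"unsupported join: {join!r}")
    return n_obs, len(joint)


shapes = [(3, 3), (2, 3)]
var_names = [["A", "B", "C"], ["B", "C", "D"]]
print(joined_shape(shapes, var_names, join="inner"))  # (5, 2)
print(joined_shape(shapes, var_names, join="outer"))  # (5, 4)
```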

Methods

check_vars_non_aligned(vars)

Returns indices of objects with non-aligned variables.

Parameters:

vars (Index | list) – Check alignment against these variables.

Return type:

list[int]
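A minimal sketch of this kind of check on plain Python lists, assuming that "aligned" means the same variables in the same order. The helper `non_aligned_indices` is hypothetical and only mirrors the documented return value.

```python
def non_aligned_indices(var_names_per_store, vars):
    """Return indices of objects whose variables differ from `vars`.

    Hypothetical helper for illustration only, not a lamindb function.
    Assumption: alignment requires identical names in identical order.
    """
    reference = list(vars)
    return [
        i for i, names in enumerate(var_names_per_store)
        if list(names) != reference
    ]


stores = [["A", "B"], ["B", "A"], ["A", "B"]]
print(non_aligned_indices(stores, ["A", "B"]))  # [1]
```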

check_vars_sorted(ascending=True)

Returns True if all variables are sorted in all objects.

Return type:

bool
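The same idea on plain lists: the hypothetical helper `all_vars_sorted` below returns True only if every object's variable names are in order. How the library treats ties or mixed types is not covered by this sketch.

```python
def all_vars_sorted(var_names_per_store, ascending=True):
    """Check whether the variable names of every object are sorted.

    Hypothetical helper for illustration only, not a lamindb function.
    """
    def is_sorted(names):
        pairs = zip(names, names[1:])  # adjacent pairs
        if ascending:
            return all(a <= b for a, b in pairs)
        return all(a >= b for a, b in pairs)

    return all(is_sorted(list(names)) for names in var_names_per_store)


print(all_vars_sorted([["A", "B", "C"], ["B", "D"]]))  # True
print(all_vars_sorted([["A", "B", "C"], ["D", "B"]]))  # False
print(all_vars_sorted([["C", "B"], ["D", "A"]], ascending=False))  # True
```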

close()

Close connections to array streaming backend.

No effect if parallel=True.

get_label_weights(obs_keys, scaler=None, return_categories=False)

Get all weights for the given label keys.

This counts the occurrences of each label and returns weights for each obs label according to the formula 1 / (number of occurrences of this label in the data). If scaler is provided, the formula becomes scaler / (scaler + number of occurrences of this label in the data).

Parameters:
  • obs_keys (str | list[str]) – A key in the .obs slots or a list of keys. If a list is provided, the labels from the obs keys are concatenated with the "__" delimiter.

  • scaler (float | None, default: None) – Use this number to scale the provided weights.

  • return_categories (bool, default: False) – If False, returns weights for each observation, can be directly passed to a sampler. If True, returns a dictionary with unique categories for labels (concatenated if obs_keys is a list) and their weights.
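The weighting formula can be reproduced on plain Python lists as follows. The helpers `label_weights` and `combine` are hypothetical stand-ins that mirror the documented behavior; they are not the library's implementation, and the example labels are made up.

```python
from collections import Counter


def label_weights(labels, scaler=None, return_categories=False):
    """Weight each label by 1 / count, or scaler / (scaler + count).

    Hypothetical helper for illustration only, not a lamindb function.
    """
    counts = Counter(labels)
    if scaler is None:
        weights = {label: 1.0 / n for label, n in counts.items()}
    else:
        weights = {label: scaler / (scaler + n) for label, n in counts.items()}
    if return_categories:
        return weights  # one weight per unique label
    return [weights[label] for label in labels]  # one weight per observation


def combine(*columns):
    """Concatenate labels from several obs keys with the "__" delimiter."""
    return ["__".join(values) for values in zip(*columns)]


labels = ["a", "a", "b", "a"]
print(label_weights(labels, scaler=1, return_categories=True))  # {'a': 0.25, 'b': 0.5}
print(label_weights(labels, scaler=1))  # [0.25, 0.25, 0.5, 0.25]
print(combine(["T", "B"], ["lung", "lung"]))  # ['T__lung', 'B__lung']
```

Per-observation weights of this form are what a weighted sampler typically expects: rare labels receive larger weights and are drawn more often.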

get_merged_categories(label_key)

Get merged categories for label_key from all .obs.

get_merged_labels(label_key)

Get merged labels for label_key from all .obs.
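To show the difference between the two methods, the sketch below treats each object's .obs column as a plain list: merged labels keep one entry per observation, while merged categories keep each distinct value once. The helper names and the return types (list and set) are assumptions for illustration.

```python
def merged_labels(obs_columns):
    """Concatenate the labels of one obs key across all objects.

    Hypothetical helper for illustration only, not a lamindb function.
    """
    return [label for column in obs_columns for label in column]


def merged_categories(obs_columns):
    """Collect the distinct labels of one obs key across all objects."""
    return set(merged_labels(obs_columns))


columns = [["T cell", "B cell"], ["T cell", "NK cell", "T cell"]]
print(merged_labels(columns))
# ['T cell', 'B cell', 'T cell', 'NK cell', 'T cell']
print(sorted(merged_categories(columns)))
# ['B cell', 'NK cell', 'T cell']
```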