lamindb.core.MappedCollection

class lamindb.core.MappedCollection(path_list, layers_keys=None, obs_keys=None, obsm_keys=None, obs_filter=None, join='inner', encode_labels=True, unknown_label=None, cache_categories=True, parallel=False, dtype=None)

Bases: object

Map-style collection for use in data loaders.

This class virtually concatenates AnnData arrays as a pytorch map-style dataset.

If your AnnData collection is in the cloud, move it into a local cache first for faster access.

__getitem__ of the MappedCollection object takes a single integer index and returns a dictionary with the observation data sample for this index from the AnnData objects in path_list. The dictionary has keys for layers_keys (.X is in "X"), obs_keys, obsm_keys (under f"obsm_{key}") and also "_store_idx" for the index of the AnnData object containing this observation sample.
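The index lookup behind a virtual concatenation can be illustrated with a minimal sketch. This is not lamindb's implementation; the function name `locate` and the example sizes are hypothetical. It only shows how a single global integer index could be resolved to the index of the containing object (the value reported under "_store_idx") plus a local row index.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(index, n_obs_per_store):
    """Map a global observation index to (store_idx, local_idx)."""
    # Cumulative end positions of each store in the virtual concatenation.
    offsets = list(accumulate(n_obs_per_store))
    if not offsets or not 0 <= index < offsets[-1]:
        raise IndexError(index)
    # First store whose cumulative end position is greater than the index.
    store_idx = bisect_right(offsets, index)
    start = offsets[store_idx - 1] if store_idx > 0 else 0
    return store_idx, index - start


sizes = [3, 2, 4]  # hypothetical observation counts of three AnnData objects
print(locate(0, sizes))  # (0, 0)
print(locate(3, sizes))  # (1, 0)
print(locate(8, sizes))  # (2, 3)
```

With this scheme no data is copied: only the per-object observation counts are needed to address any sample.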

Background: blog.lamin.ai/mapped-collection.

Note

For a guide, see Train a machine learning model on a collection.

For more convenient use within MappedCollection, see mapped().

This currently only works for collections of AnnData objects.

The implementation was influenced by the SCimilarity data loader.

Parameters:

path_list (list[str | Path | UPath]) – A list of paths to AnnData objects stored in .h5ad or .zarr formats.
layers_keys (str | list[str] | None, default: None) – Keys from the .layers slot. layers_keys=None or "X" in the list retrieves .X. "raw.X" retrieves .X from the .raw slot. Keys not present in an object are omitted from the output for that object.
obsm_keys (str | list[str] | None, default: None) – Keys from the .obsm slots. Keys not present in an object are omitted from the output for that object.
obs_keys (str | list[str] | None, default: None) – Keys from the .obs slots. Keys not present in an object are omitted from the output for that object.
obs_filter (dict[str, str | list[str]] | None, default: None) – Select only observations with these values for the given obs columns. Should be a dictionary with obs column names as keys and filtering values (a string or a list of strings) as values.
join (Literal['inner', 'outer'] | None, default: 'inner') – "inner" or "outer" virtual joins. If None is passed, does not join. The join is applied to layers_keys except for "raw.X".
encode_labels (bool | list[str], default: True) – Encode labels into integers. Can be a list with elements from obs_keys.
unknown_label (str | dict[str, str] | None, default: None) – Encode this label to -1. Can be a dictionary with keys from obs_keys if encode_labels=True or from encode_labels if it is a list.
cache_categories (bool, default: True) – Enable caching categories of obs_keys for faster access.
parallel (bool, default: False) – Enable sampling with multiple processes.
dtype (str | None, default: None) – Convert numpy arrays from .X, .layers and .obsm to this dtype.
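The interplay of encode_labels and unknown_label can be sketched in plain Python. This is an illustration under assumptions, not the library's code: the helper name `encode_labels_sketch` is hypothetical, and assigning codes in sorted category order is an assumption made only for this example. The one behavior taken from the parameter description is that the unknown label is encoded as -1.

```python
def encode_labels_sketch(labels, unknown_label=None):
    """Return integer codes for labels and the label-to-code mapping."""
    # Assumption: known categories receive codes 0..n-1 in sorted order.
    categories = sorted({lab for lab in labels if lab != unknown_label})
    code_of = {cat: i for i, cat in enumerate(categories)}
    if unknown_label is not None:
        # As documented for unknown_label: this label is encoded to -1.
        code_of[unknown_label] = -1
    return [code_of[lab] for lab in labels], code_of


codes, mapping = encode_labels_sketch(
    ["T cell", "B cell", "unknown", "T cell"], unknown_label="unknown"
)
print(codes)    # [1, 0, -1, 1]
print(mapping)  # {'B cell': 0, 'T cell': 1, 'unknown': -1}
```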

property shape: tuple[int, int]

Shape of the (virtually aligned) dataset.
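To make "virtually aligned" concrete, here is a small sketch of how a shape could follow from the join parameter. It assumes that an "inner" join keeps the variables shared by all objects and an "outer" join keeps the variables found in any object; the function `virtual_shape` is hypothetical and does not cover join=None.

```python
def virtual_shape(n_obs_per_object, vars_per_object, join="inner"):
    """Return (total observations, number of joint variables)."""
    var_sets = [set(v) for v in vars_per_object]
    if join == "inner":
        joint = set.intersection(*var_sets)  # variables present in every object
    elif join == "outer":
        joint = set.union(*var_sets)  # variables present in at least one object
    else:
        raise ValueError("this sketch only covers 'inner' and 'outer'")
    return sum(n_obs_per_object), len(joint)


n_obs = [3, 2]  # hypothetical observation counts
var_names = [["g1", "g2", "g3"], ["g2", "g3", "g4"]]
print(virtual_shape(n_obs, var_names, join="inner"))  # (5, 2)
print(virtual_shape(n_obs, var_names, join="outer"))  # (5, 4)
```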

property original_shapes: list[tuple[int, int]]

Shapes of the underlying AnnData objects (with obs_filter applied).

property closed: bool

Check if connections to array streaming backend are closed.

Does not matter if parallel=True.

classmethod torch_worker_init_fn(worker_id)

worker_init_fn for torch.utils.data.DataLoader.

Improves performance for num_workers > 1.

check_vars_sorted(ascending=True)

Returns True if all variables are sorted in all objects.

Return type: bool

check_vars_non_aligned(vars)

Returns indices of objects with non-aligned variables.

Parameters: vars (Index | list) – Check alignment against these variables.
Return type: list[int]
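The kind of check described here can be sketched as follows. The sketch assumes that "aligned" means an object's variable names equal the reference names in the same order; that reading, and the helper name `non_aligned_indices`, are assumptions for illustration rather than the library's definition.

```python
def non_aligned_indices(reference_vars, vars_per_object):
    """Return indices of objects whose variables differ from the reference."""
    reference = list(reference_vars)
    # Assumption: alignment requires identical names in identical order.
    return [i for i, v in enumerate(vars_per_object) if list(v) != reference]


reference = ["g1", "g2", "g3"]
objects = [
    ["g1", "g2", "g3"],  # aligned
    ["g1", "g3", "g2"],  # same names, different order
    ["g1", "g2"],        # one variable missing
]
print(non_aligned_indices(reference, objects))  # [1, 2]
```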

get_label_weights(obs_keys, scaler=None, return_categories=False)

Get all weights for the given label keys.

This counts the occurrences of each label and returns a weight for each obs label according to the formula 1 / (number of occurrences of this label in the data). If scaler is provided, the weight is scaler / (scaler + number of occurrences of this label in the data).

Parameters:

obs_keys (str | list[str]) – A key in the .obs slots or a list of keys. If a list is provided, the labels from the obs keys will be concatenated with the "__" delimiter.
scaler (float | None, default: None) – Use this number to scale the provided weights.
return_categories (bool, default: False) – If False, returns weights for each observation, can be directly passed to a sampler. If True, returns a dictionary with unique categories for labels (concatenated if obs_keys is a list) and their weights.
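The weighting formula can be worked through on a toy example. The sketch below reimplements the stated formulas in plain Python; the names `label_weights` and `merge_keys` are hypothetical helpers, not part of lamindb.

```python
from collections import Counter


def merge_keys(*columns):
    """Concatenate labels from several obs columns with the "__" delimiter."""
    return ["__".join(values) for values in zip(*columns)]


def label_weights(labels, scaler=None):
    """Return one weight per observation, based on label frequency."""
    counts = Counter(labels)
    if scaler is None:
        return [1 / counts[lab] for lab in labels]
    return [scaler / (scaler + counts[lab]) for lab in labels]


cell_type = ["T", "T", "B", "T"]  # hypothetical labels
print(label_weights(cell_type))            # "T" occurs 3 times -> 1/3, "B" once -> 1.0
print(label_weights(cell_type, scaler=1))  # [0.25, 0.25, 0.5, 0.25]
print(merge_keys(["T", "B"], ["lung", "gut"]))  # ['T__lung', 'B__gut']
```

Rare labels receive larger weights, so a weighted sampler (for example torch.utils.data.WeightedRandomSampler) draws them more often than uniform sampling would.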

get_merged_labels(label_key)

Get merged labels for label_key from all .obs.

get_merged_categories(label_key)

Get merged categories for label_key from all .obs.

close()

Close connections to array streaming backend.

No effect if parallel=True.