lamindb.core.MappedCollection
- class lamindb.core.MappedCollection(path_list, layers_keys=None, obs_keys=None, obsm_keys=None, join='inner', encode_labels=True, unknown_label=None, cache_categories=True, parallel=False, dtype=None)
Bases: object
Map-style collection for use in data loaders.
This class virtually concatenates AnnData arrays as a PyTorch map-style dataset.
If your AnnData collection is in the cloud, move it into a local cache first for faster access.
__getitem__ of the MappedCollection object takes a single integer index and returns a dictionary with the observation data sample for this index from the AnnData objects in path_list. The dictionary has keys for layers_keys (.X is in "X"), obs_keys, obsm_keys (under f"obsm_{key}") and also "_store_idx" for the index of the AnnData object containing this observation sample.
Note
For a guide, see Train a machine learning model on a collection.
For more convenient use within MappedCollection, see mapped().
This currently only works for collections of AnnData objects.
The implementation was influenced by the SCimilarity data loader.
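The virtual concatenation described above can be pictured as a lookup from one global integer index to a pair of (store index, local index). The following is a minimal sketch of that idea in plain Python, not the library's implementation; the function name locate and the example sizes are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(index, n_obs_per_store):
    """Map a global observation index to (store_idx, local_idx).

    n_obs_per_store holds the number of observations in each underlying
    object, in the same order as the paths. Hypothetical helper for
    illustration only.
    """
    total = sum(n_obs_per_store)
    if not 0 <= index < total:
        raise IndexError(f"index {index} out of range for {total} observations")
    # Cumulative end offsets, e.g. [3, 2, 4] -> [3, 5, 9].
    offsets = list(accumulate(n_obs_per_store))
    # First store whose end offset is strictly greater than the index.
    store_idx = bisect_right(offsets, index)
    start = offsets[store_idx - 1] if store_idx > 0 else 0
    return store_idx, index - start


sizes = [3, 2, 4]  # hypothetical observation counts of three objects
print(locate(0, sizes))  # (0, 0)
print(locate(3, sizes))  # (1, 0)
print(locate(8, sizes))  # (2, 3)
```

In this picture, the first element of the pair plays the role of the "_store_idx" entry in the returned dictionary.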
- Parameters:
path_list (list[lamindb.core.types.UPathStr]) – A list of paths to AnnData objects stored in .h5ad or .zarr formats.
layers_keys (str | list[str] | None, default: None) – Keys from the .layers slot. layers_keys=None or "X" in the list retrieves .X.
obsm_keys (str | list[str] | None, default: None) – Keys from the .obsm slots.
obs_keys (str | list[str] | None, default: None) – Keys from the .obs slots.
join (Literal['inner', 'outer'] | None, default: 'inner') – "inner" or "outer" virtual joins. If None is passed, does not join.
encode_labels (bool | list[str], default: True) – Encode labels into integers. Can be a list with elements from obs_keys.
unknown_label (str | dict[str, str] | None, default: None) – Encode this label to -1. Can be a dictionary with keys from obs_keys if encode_labels=True or from encode_labels if it is a list.
cache_categories (bool, default: True) – Enable caching categories of obs_keys for faster access.
parallel (bool, default: False) – Enable sampling with multiple processes.
dtype (str | None, default: None) – Convert numpy arrays from .X, .layers and .obsm
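To make the join, encode_labels and unknown_label parameters more concrete, here is a small plain-Python sketch of what they mean conceptually. It is not the library's code: the helper names join_vars and build_encoder are hypothetical, and both the variable ordering and the alphabetical ordering of categories are assumptions made for this illustration.

```python
def join_vars(var_lists, join="inner"):
    """Return the variable names shared by ("inner") or found in any of
    ("outer") the given lists; None means no join. Hypothetical helper."""
    if join is None:
        return None
    if join == "inner":
        common = set(var_lists[0]).intersection(*var_lists[1:])
        # Keep the order of the first list (an assumption of this sketch).
        return [v for v in var_lists[0] if v in common]
    if join == "outer":
        seen = {}
        for names in var_lists:
            for v in names:
                seen.setdefault(v, None)  # dict keeps first-seen order
        return list(seen)
    raise ValueError(f"unsupported join: {join!r}")


def build_encoder(labels, unknown_label=None):
    """Map each category to an integer code; the unknown label maps to -1.
    Sorting the categories is an assumption of this sketch."""
    categories = sorted(set(labels))
    if unknown_label is not None and unknown_label in categories:
        categories.remove(unknown_label)
    encoder = {cat: code for code, cat in enumerate(categories)}
    if unknown_label is not None:
        encoder[unknown_label] = -1
    return encoder


# Hypothetical variable names and labels.
var_lists = [["a", "b", "c"], ["b", "c", "d"]]
print(join_vars(var_lists, "inner"))  # ['b', 'c']
print(join_vars(var_lists, "outer"))  # ['a', 'b', 'c', 'd']

labels = ["T cell", "B cell", "unknown", "T cell"]
encoder = build_encoder(labels, unknown_label="unknown")
print([encoder[label] for label in labels])  # [1, 0, -1, 1]
```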
Attributes
- property closed: bool
Check if connections to array streaming backend are closed.
Does not matter if parallel=True.
- property original_shapes: list[tuple[int, int]]
Shapes of the underlying AnnData objects.
- property shape: tuple[int, int]
Shape of the (virtually aligned) dataset.
Methods
- check_vars_non_aligned(vars)
Returns indices of objects with non-aligned variables.
- Parameters:
vars (Index | list) – Check alignment against these variables.
- Return type:
list[int]
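As a rough illustration of the kind of check this method describes, the sketch below compares each object's variable names against a reference list and collects the indices that differ. It assumes that "aligned" means the same names in the same order, which the text above does not state explicitly; the helper name non_aligned_indices is hypothetical.

```python
def non_aligned_indices(var_lists, reference_vars):
    """Indices of variable lists that differ from the reference list.

    Assumption of this sketch: aligned means identical names in
    identical order.
    """
    reference = list(reference_vars)
    return [i for i, names in enumerate(var_lists) if list(names) != reference]


# Hypothetical variable names of three objects.
var_lists = [["g1", "g2", "g3"], ["g1", "g3", "g2"], ["g1", "g2"]]
print(non_aligned_indices(var_lists, ["g1", "g2", "g3"]))  # [1, 2]
```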
- check_vars_sorted(ascending=True)
Returns True if all variables are sorted in all objects.
- Return type:
bool
- close()
Close connections to array streaming backend.
No effect if parallel=True.
- get_label_weights(obs_keys, scaler=None, return_categories=False)
Get all weights for the given label keys.
This counts the number of occurrences of each label and returns weights for each obs label according to the formula 1 / num of this label in the data. If scaler is provided, then scaler / (scaler + num of this label in the data).
- Parameters:
obs_keys (str | list[str]) – A key in the .obs slots or a list of keys. If a list is provided, the labels from the obs keys will be concatenated with the "__" delimiter.
scaler (float | None, default: None) – Use this number to scale the provided weights.
return_categories (bool, default: False) – If False, returns weights for each observation, which can be directly passed to a sampler. If True, returns a dictionary with unique categories for labels (concatenated if obs_keys is a list) and their weights.
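The two weighting formulas above can be worked through in a few lines of plain Python. This is a sketch of the arithmetic only, not the method's implementation; the function name label_weights and the example labels are hypothetical.

```python
from collections import Counter


def label_weights(labels, scaler=None):
    """Return (per-observation weights, per-category weights).

    Without scaler: 1 / count. With scaler: scaler / (scaler + count).
    Hypothetical helper mirroring the formulas in the text.
    """
    counts = Counter(labels)
    if scaler is None:
        per_category = {label: 1 / n for label, n in counts.items()}
    else:
        per_category = {label: scaler / (scaler + n) for label, n in counts.items()}
    per_observation = [per_category[label] for label in labels]
    return per_observation, per_category


# Two hypothetical obs columns combined with the "__" delimiter.
cell_type = ["T", "T", "B", "T"]
tissue = ["blood", "blood", "blood", "blood"]
labels = ["__".join(pair) for pair in zip(cell_type, tissue)]

weights, categories = label_weights(labels)
print(categories)  # {'T__blood': 0.3333333333333333, 'B__blood': 1.0}

scaled_weights, scaled_categories = label_weights(labels, scaler=1.0)
print(scaled_categories)  # {'T__blood': 0.25, 'B__blood': 0.5}
```

Rare labels receive larger weights, so a weighted sampler draws them more often. Per-observation weights of this shape are what a sampler such as torch.utils.data.WeightedRandomSampler expects as its weights argument.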
- get_merged_categories(label_key)
Get merged categories for label_key from all .obs.
- get_merged_labels(label_key)
Get merged labels for label_key from all .obs.