lamindb.Collection

class lamindb.Collection(artifacts: list[Artifact], name: str, version: str, description: str | None = None, meta: Any | None = None, reference: str | None = None, reference_type: str | None = None, run: Run | None = None, is_new_version_of: Collection | None = None)

Bases: Record, HasFeatures, IsVersioned, TracksRun, TracksUpdates

Collections of artifacts.

For more info: Tutorial: Artifacts.

Parameters:
  • artifacts (list[Artifact]) – A list of artifacts.

  • name (str) – A name.

  • version (str | None, default: None) – A version string.

  • description (str | None, default: None) – A description.

  • meta (Artifact | None, default: None) – An artifact that defines metadata for the collection.

  • reference (str | None, default: None) – For instance, an external ID or a URL.

  • reference_type (str | None, default: None) – For instance, "url".

  • run (Run | None, default: None) – The run that creates the collection.

  • is_new_version_of (Collection | None, default: None) – An old version of the collection.

See also

Artifact

Examples

Create a collection from a list of Artifact objects:

>>> collection = ln.Collection([artifact1, artifact2], name="My collection")
>>> collection.save()

If you have more than 100k artifacts, consider creating a collection directly from the directory without creating individual Artifact records (e.g., here RxRx: cell imaging):

>>> collection = ln.Artifact("s3://my-bucket/my-images/", name="My collection", meta=df)
>>> collection.save()

Make a new version of a collection:

>>> # a non-versioned collection
>>> collection = ln.Collection(df1, description="My dataframe")
>>> collection.save()
>>> # create new collection from old collection and version both
>>> new_collection = ln.Collection(df2, is_new_version_of=collection)
>>> assert new_collection.stem_uid == collection.stem_uid
>>> assert collection.version == "1"
>>> assert new_collection.version == "2"

Attributes

property artifacts: QuerySet

Ordered QuerySet of artifacts.
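
For example, to view the artifacts as a DataFrame (a sketch, assuming the standard QuerySet.df() accessor):

>>> collection.artifacts.df()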

features: FeatureManager

Feature manager.

Features denote dataset dimensions, i.e., the variables that measure labels & numbers.

Curate with features & values:

artifact.features.add_values({
    "species": organism,  # here, organism is an Organism record
    "scientist": ["Barbara McClintock", "Edgar Anderson"],
    "temperature": 27.6,
    "study": "Study 0: initial plant gathering"
})

Query for features & values:

ln.Artifact.features.filter(scientist="Barbara McClintock")

Features may or may not be part of the artifact content in storage. For instance, the Curate flow validates the columns of a DataFrame-like artifact and annotates it with features corresponding to these columns. artifact.features.add_values, by contrast, does not validate the content of the artifact.

Fields

version: str

Version (default None).

Defines version of a family of records characterized by the same stem_uid.

Consider using semantic versioning with Python versioning.
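
For instance, a sketch using semantic version strings (the artifact lists artifacts and new_artifacts are hypothetical):

>>> collection_v1 = ln.Collection(artifacts, name="My collection", version="1.0.0")
>>> collection_v1.save()
>>> collection_v2 = ln.Collection(new_artifacts, is_new_version_of=collection_v1, version="1.1.0")
>>> collection_v2.save()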

id: int

Internal id, valid only in one DB instance.

uid: str

Universal id, valid across DB instances.

name: str

Name or title of collection (required).

description: str

A description.

hash: str

Hash of collection content. 86 base64 chars allow storing 64 bytes (512 bits).

reference: str

A reference like URL or external ID.

reference_type: str

Type of reference, e.g., cellxgene Census collection_id.

transform: Transform

Transform whose run created the collection.

run: Run

Run that created the collection.

artifact: Artifact

Storage of collection as one artifact.

visibility: int

Visibility of record: 1 default, 0 hidden, -1 trash.

feature_sets: FeatureSet

The feature sets measured in this collection (see FeatureSet).

ulabels: ULabel

ULabels sampled in the collection (see Feature).

input_of: Run

Runs that use this collection as an input.

previous_runs: Run

Sequence of runs that created or updated the record.

unordered_artifacts: Artifact

Storage of collection as multiple artifacts.

created_at: datetime

Time of creation of record.

created_by: User

Creator of record.

updated_at: datetime

Time of last update to record.

Methods

cache(is_run_input=None)

Download cloud artifacts in collection to local cache.

Follows syncing logic: only caches outdated artifacts.

Returns paths to locally cached on-disk artifacts.

Parameters:

is_run_input (bool | None, default: None) – Whether to track this collection as run input.

Return type:

list[UPath]
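
Examples

A sketch of typical usage (the returned paths depend on your local cache location):

>>> paths = collection.cache()  # downloads only outdated cloud artifacts
>>> paths[0]  # local path to the first artifact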

delete(permanent=None)

Delete collection.

Parameters:

permanent (bool | None, default: None) – Whether to permanently delete the collection record (skips trash).

Return type:

None

Examples

For any Collection object collection, call:

>>> collection.delete()

load(join='outer', is_run_input=None, **kwargs)

Stage and load to memory.

Returns an in-memory representation if possible, e.g., a concatenated DataFrame or AnnData object.

Return type:

Any
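
Examples

A sketch of typical usage, assuming the collection holds DataFrame-like artifacts:

>>> df = collection.load(join="outer")  # outer-joins columns across artifacts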

mapped(layers_keys=None, obs_keys=None, obsm_keys=None, join='inner', encode_labels=True, unknown_label=None, cache_categories=True, parallel=False, dtype=None, stream=False, is_run_input=None)

Return a map-style dataset.

Returns a PyTorch map-style dataset by virtually concatenating AnnData arrays.

If your AnnData collection is in the cloud, move the artifacts into a local cache first via cache().

__getitem__ of the MappedCollection object takes a single integer index and returns a dictionary with the observation data sample for this index from the AnnData objects in the collection. The dictionary has keys for layers_keys (.X is in "X"), obs_keys, obsm_keys (under f"obsm_{key}") and also "_store_idx" for the index of the AnnData object containing this observation sample.
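
For instance, assuming mapped = collection.mapped(obs_keys=["cell_type"]), indexing might look as follows (a sketch; the exact keys depend on the arguments you pass):

>>> sample = mapped[0]
>>> sample["X"]           # this observation's row of .X
>>> sample["cell_type"]   # the encoded .obs label
>>> sample["_store_idx"]  # index of the AnnData object the sample came from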

Note

For a guide, see Train a machine learning model on a collection.

This method currently only works for collections of AnnData artifacts.

Parameters:
  • layers_keys (str | list[str] | None, default: None) – Keys from the .layers slot. layers_keys=None or "X" in the list retrieves .X.

  • obs_keys (str | list[str] | None, default: None) – Keys from the .obs slots.

  • obsm_keys (str | list[str] | None, default: None) – Keys from the .obsm slots.

  • join (Literal['inner', 'outer'] | None, default: 'inner') – "inner" or "outer" virtual joins. If None is passed, does not join.

  • encode_labels (bool | list[str], default: True) – Encode labels into integers. Can be a list with elements from obs_keys.

  • unknown_label (str | dict[str, str] | None, default: None) – Encode this label to -1. Can be a dictionary with keys from obs_keys if encode_labels=True or from encode_labels if it is a list.

  • cache_categories (bool, default: True) – Enable caching categories of obs_keys for faster access.

  • parallel (bool, default: False) – Enable sampling with multiple processes.

  • dtype (str | None, default: None) – Convert numpy arrays from .X, .layers and .obsm to this dtype.

  • stream (bool, default: False) – Whether to stream data from the array backend.

  • is_run_input (bool | None, default: None) – Whether to track this collection as run input.

Return type:

MappedCollection

Examples

>>> import lamindb as ln
>>> from torch.utils.data import DataLoader
>>> ds = ln.Collection.filter(description="my collection").one()
>>> mapped = ds.mapped(obs_keys=["cell_type", "batch"])
>>> dl = DataLoader(mapped, batch_size=128, shuffle=True)

restore()

Restore collection record from trash.

Return type:

None

Examples

For any Collection object collection, call:

>>> collection.restore()

save(using=None)

Save the collection and underlying artifacts to database & storage.

Parameters:

using (str | None, default: None) – The database to which you want to save.

Return type:

Collection

Examples

>>> collection = ln.Collection("./myfile.csv", name="myfile")
>>> collection.save()

stage(is_run_input=None)

Download cloud artifacts in collection to local cache.

Follows syncing logic: only caches outdated artifacts.

Returns paths to locally cached on-disk artifacts.

Parameters:

is_run_input (bool | None, default: None) – Whether to track this collection as run input.

Return type:

list[UPath]