lamindb.Collection

class lamindb.Collection(artifacts: list[Artifact], name: str, version: str, description: str | None = None, meta: Any | None = None, reference: str | None = None, reference_type: str | None = None, run: Run | None = None, is_new_version_of: Collection | None = None)

Bases: Registry, HasFeatures, IsVersioned, TracksRun, TracksUpdates

Collections: collections of artifacts.

For more info: Tutorial: Artifacts.

Parameters:
  • data (List[Artifact]) – A list of artifacts.

  • name (str) – A name.

  • description (str | None, default: None) – A description.

  • version (str | None, default: None) – A version string.

  • is_new_version_of (Collection | None, default: None) – An old version of the collection.

  • run (Run | None, default: None) – The run that creates the collection.

  • meta (Artifact | None, default: None) – An artifact that defines metadata for the collection.

  • reference (str | None, default: None) – For instance, an external ID or a URL.

  • reference_type (str | None, default: None) – For instance, "url".

See also

Artifact

Examples

Create a collection from a list of Artifact objects:

>>> collection = ln.Collection([artifact1, artifact2], name="My collection")
>>> collection.save()

If you have more than 100k artifacts, consider creating a collection directly from the directory without creating File records (see, e.g., RxRx: cell imaging):

>>> collection = ln.Artifact("s3://my-bucket/my-images/", name="My collection", meta=df)
>>> collection.save()

Make a new version of a collection:

>>> # a non-versioned collection
>>> collection = ln.Collection(df1, description="My dataframe")
>>> collection.save()
>>> # create new collection from old collection and version both
>>> new_collection = ln.Collection(df2, is_new_version_of=collection)
>>> assert new_collection.stem_uid == collection.stem_uid
>>> assert collection.version == "1"
>>> assert new_collection.version == "2"

Attributes

artifacts

Ordered QuerySet of artifacts.

features

Feature manager. FeatureManager.

labels

Label manager. LabelManager.

stem_uid

Universal id characterizing the version family. str.

The full uid of a record is obtained by concatenating the stem uid and the version information:

stem_uid = random_base62(n_char)  # a random base62 sequence of length n_char
version_uid = encode_base62(md5_hash(version))[:4]  # version is, e.g., "1" or "2.1.0" or "2022-03-01"
uid = f"{stem_uid}{version_uid}"  # concatenate the stem_uid & version_uid
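
The pseudocode above can be made concrete with the standard library. The following is a minimal sketch, not lamindb's implementation: the helper names `make_version_uid` and `make_uid`, the alphabet ordering, the big-integer base62 encoding, and the choice of `n_char=16` are all assumptions made for illustration.

```python
import hashlib
import secrets
import string

# Assumption: digits, then lowercase, then uppercase. Any fixed 62-character alphabet works.
BASE62 = string.digits + string.ascii_letters


def random_base62(n_char: int) -> str:
    """Return a random base62 sequence of length n_char."""
    return "".join(secrets.choice(BASE62) for _ in range(n_char))


def encode_base62(data: bytes) -> str:
    """Encode bytes as base62 by treating them as one big-endian integer (assumed scheme)."""
    number = int.from_bytes(data, "big")
    if number == 0:
        return BASE62[0]
    chars = []
    while number:
        number, remainder = divmod(number, 62)
        chars.append(BASE62[remainder])
    return "".join(reversed(chars))


def make_version_uid(version: str) -> str:
    """Hypothetical helper: first 4 base62 characters of the MD5 digest of the version."""
    digest = hashlib.md5(version.encode("utf-8")).digest()
    return encode_base62(digest)[:4]


def make_uid(stem_uid: str, version: str) -> str:
    """Hypothetical helper: concatenate the stem uid and the version uid."""
    return f"{stem_uid}{make_version_uid(version)}"


stem_uid = random_base62(16)  # n_char=16 is an illustrative choice
uid_v1 = make_uid(stem_uid, "1")
uid_v2 = make_uid(stem_uid, "2")
```

Under these assumptions, `uid_v1` and `uid_v2` share the same 16-character prefix (the version family) and differ only in their 4-character suffix, which is derived deterministically from the version string.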

versions

Lists all records of the same version family. QuerySet.

>>> new_artifact = ln.Artifact(df2, is_new_version_of=artifact)
>>> new_artifact.save()
>>> new_artifact.versions()

Fields

version CharField

Version (default None).

Defines version of a family of records characterized by the same stem_uid.

Consider using semantic versioning with Python versioning.

id AutoField

Internal id, valid only in one DB instance.

uid CharField

Universal id, valid across DB instances.

name CharField

Name or title of collection (required).

description TextField

A description.

hash CharField

Hash of collection content. 86 base64 characters can store 64 bytes (512 bits).
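
The size arithmetic can be checked directly: each base64 character carries 6 bits, so 512 bits need ceil(512 / 6) = 86 characters once padding is removed. The sketch below uses SHA-512 and the URL-safe alphabet purely as an example of a 64-byte digest; this page does not state which hash function or alphabet lamindb uses.

```python
import base64
import hashlib
import math

# Any 64-byte value works; SHA-512 is only an illustrative source of 64 bytes.
digest = hashlib.sha512(b"example collection content").digest()

encoded = base64.urlsafe_b64encode(digest).decode("ascii")  # includes '=' padding
unpadded = encoded.rstrip("=")                              # padding removed

n_bytes = len(digest)
n_chars_padded = len(encoded)
n_chars_unpadded = len(unpadded)
n_chars_expected = math.ceil(n_bytes * 8 / 6)
```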

reference CharField

A reference such as a URL or an external ID.

reference_type CharField

Type of reference, e.g., cellxgene Census collection_id.

transform ForeignKey

Transform whose run created the collection.

run ForeignKey

Run that created the collection.

artifact OneToOneField

Storage of collection as one artifact.

visibility SmallIntegerField

Visibility of record, 0-default, 1-hidden, 2-trash.

feature_sets ManyToManyField

The feature sets measured in this collection (see FeatureSet).

ulabels ManyToManyField

ULabels sampled in the collection (see Feature).

input_of ManyToManyField

Runs that use this collection as an input.

previous_runs ManyToManyField

Sequence of runs that created or updated the record.

unordered_artifacts ManyToManyField

Storage of collection as multiple artifacts.

created_at DateTimeField

Time of creation of record.

created_by ForeignKey

Creator of record. User

updated_at DateTimeField

Time of last update to record.

Methods

cache(is_run_input=None)

Download cloud artifacts in collection to local cache.

Follows syncing logic: only caches outdated artifacts.

Returns paths to locally cached on-disk artifacts.

Parameters:

is_run_input (bool | None, default: None) – Whether to track this collection as run input.

Return type:

list[UPath]
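
To illustrate what "only caches outdated artifacts" can mean for cache(), here is a minimal sketch that copies a file into a cache directory only when the cached copy is missing or older than the source. It is not lamindb's implementation: the function `sync_to_cache` is hypothetical, local paths stand in for cloud storage, and comparing modification times is an assumed criterion for "outdated".

```python
import os
import shutil
import tempfile
from pathlib import Path


def sync_to_cache(remote: Path, cache_dir: Path) -> tuple[Path, bool]:
    """Copy `remote` into `cache_dir` only if the cached copy is missing or older.

    Returns the local path and whether a copy was actually performed.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    local = cache_dir / remote.name
    outdated = (not local.exists()) or local.stat().st_mtime < remote.stat().st_mtime
    if outdated:
        shutil.copy2(remote, local)  # copy2 also copies the modification time
    return local, outdated


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    remote_file = root / "remote" / "data.csv"
    remote_file.parent.mkdir()
    remote_file.write_text("a,b\n1,2\n")
    cache = root / "cache"

    _, first = sync_to_cache(remote_file, cache)   # cache is empty: copies
    _, second = sync_to_cache(remote_file, cache)  # cache is up to date: skips

    # Simulate an update of the source by moving its modification time forward.
    newer = remote_file.stat().st_mtime + 10
    os.utime(remote_file, (newer, newer))
    _, third = sync_to_cache(remote_file, cache)   # cache is outdated: copies again

results = [first, second, third]
```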

delete(permanent=None)

Delete collection.

Parameters:

permanent (bool | None, default: None) – Whether to permanently delete the collection record (skips trash).

Return type:

None

Examples

For any Collection object collection, call:

>>> collection.delete()

load(join='outer', is_run_input=None, **kwargs)

Stage and load to memory.

Returns an in-memory representation if possible, e.g., a concatenated DataFrame or AnnData object.

Return type:

Any

mapped(layers_keys=None, obs_keys=None, obsm_keys=None, join='inner', encode_labels=True, unknown_label=None, cache_categories=True, parallel=False, dtype=None, stream=False, is_run_input=None)

Return a map-style dataset.

Returns a pytorch map-style dataset by virtually concatenating AnnData arrays.

If your AnnData collection is in the cloud, move it into a local cache first via cache().

__getitem__ of the MappedCollection object takes a single integer index and returns a dictionary with the observation data sample for this index from the AnnData objects in the collection. The dictionary has keys for layers_keys (.X is in "X"), obs_keys, obsm_keys (under f"obsm_{key}") and also "_store_idx" for the index of the AnnData object containing this observation sample.
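
The indexing behavior described above can be pictured with a small pure-Python sketch of virtual concatenation. It is not the MappedCollection implementation: the class `VirtualConcat` and the dictionary-based stores are hypothetical stand-ins for AnnData objects, and only `"X"`, `obs_keys`, and `"_store_idx"` are modeled.

```python
from bisect import bisect_right
from itertools import accumulate


class VirtualConcat:
    """Map-style dataset over several stores without physically concatenating them."""

    def __init__(self, stores, obs_keys):
        self.stores = stores
        self.obs_keys = obs_keys
        # Cumulative number of observations, e.g. [2, 3] for stores of size 2 and 1.
        self.offsets = list(accumulate(len(store["X"]) for store in stores))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Find the store that contains the global index, then the row within it.
        store_idx = bisect_right(self.offsets, idx)
        start = self.offsets[store_idx - 1] if store_idx else 0
        row = idx - start
        store = self.stores[store_idx]
        sample = {"X": store["X"][row], "_store_idx": store_idx}
        for key in self.obs_keys:
            sample[key] = store["obs"][key][row]
        return sample


# Two toy stores standing in for two AnnData objects.
stores = [
    {"X": [[1, 0], [0, 2]], "obs": {"cell_type": ["T", "B"]}},
    {"X": [[3, 3]], "obs": {"cell_type": ["NK"]}},
]
dataset = VirtualConcat(stores, obs_keys=["cell_type"])
last_sample = dataset[2]
```

Because the class defines `__len__` and an integer `__getitem__`, it follows the same map-style protocol that a PyTorch `DataLoader` expects.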

Note

For a guide, see Train a machine learning model on a collection.

This method currently only works for collections of AnnData artifacts.

Parameters:
  • layers_keys (str | list[str] | None, default: None) – Keys from the .layers slot. layers_keys=None or "X" in the list retrieves .X.

  • obsm_keys (str | list[str] | None, default: None) – Keys from the .obsm slots.

  • obs_keys (str | list[str] | None, default: None) – Keys from the .obs slots.

  • join (Literal['inner', 'outer'] | None, default: 'inner') – "inner" or "outer" virtual joins. If None is passed, does not join.

  • encode_labels (bool | list[str], default: True) – Encode labels into integers. Can be a list with elements from obs_keys.

  • unknown_label (str | dict[str, str] | None, default: None) – Encode this label to -1. Can be a dictionary with keys from obs_keys if encode_labels=True or from encode_labels if it is a list.

  • cache_categories (bool, default: True) – Enable caching categories of obs_keys for faster access.

  • parallel (bool, default: False) – Enable sampling with multiple processes.

  • dtype (str | None, default: None) – Convert numpy arrays from .X, .layers and .obsm to this dtype.

  • stream (bool, default: False) – Whether to stream data from the array backend.

  • is_run_input (bool | None, default: None) – Whether to track this collection as run input.
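
As a rough illustration of the `join`, `encode_labels`, and `unknown_label` parameters, the sketch below shows inner and outer joins over variable names and an integer encoding in which one designated label maps to -1. The functions `join_var_names` and `encode_labels` are hypothetical helpers, and assigning codes in sorted category order is an assumption; the actual encoding order used by mapped() may differ.

```python
def join_var_names(var_names_per_store, join):
    """Return shared ("inner") or combined ("outer") variable names, preserving order."""
    if join == "inner":
        common = set(var_names_per_store[0]).intersection(*var_names_per_store[1:])
        return [name for name in var_names_per_store[0] if name in common]
    if join == "outer":
        seen = {}  # dict keeps first-seen order
        for names in var_names_per_store:
            for name in names:
                seen.setdefault(name, None)
        return list(seen)
    raise ValueError(f"unsupported join: {join!r}")


def encode_labels(labels, unknown_label=None):
    """Encode labels as integers; `unknown_label`, if given, is encoded as -1."""
    categories = sorted(set(labels) - {unknown_label})
    codes = {category: code for code, category in enumerate(categories)}
    if unknown_label is not None:
        codes[unknown_label] = -1
    return [codes[label] for label in labels], codes


var_names = [["g1", "g2", "g3"], ["g2", "g3", "g4"]]
inner = join_var_names(var_names, "inner")
outer = join_var_names(var_names, "outer")

encoded, codes = encode_labels(["T", "B", "unknown", "T"], unknown_label="unknown")
```

With an inner join only the variables present in every store remain, while an outer join keeps all of them, so a sample from a store that lacks a variable needs a fill value for it.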

Return type:

MappedCollection

Examples

>>> import lamindb as ln
>>> from torch.utils.data import DataLoader
>>> collection = ln.Collection.filter(description="my collection").one()
>>> mapped = collection.mapped(obs_keys=["cell_type", "batch"])
>>> dl = DataLoader(mapped, batch_size=128, shuffle=True)

restore()

Restore collection record from trash.

Return type:

None

Examples

For any Collection object collection, call:

>>> collection.restore()

save(using=None)

Save the collection and underlying artifacts to database & storage.

Parameters:

using (str | None, default: None) – The database to which you want to save.

Return type:

None

Examples

>>> collection = ln.Collection("./myfile.csv", name="myfile")
>>> collection.save()

stage(is_run_input=None)

Download cloud artifacts in collection to local cache.

Follows syncing logic: only caches outdated artifacts.

Returns paths to locally cached on-disk artifacts.

Parameters:

is_run_input (bool | None, default: None) – Whether to track this collection as run input.

Return type:

list[UPath]