lamindb.Collection

class lamindb.Collection(artifacts: list[Artifact], name: str, description: str | None = None, meta: Any | None = None, reference: str | None = None, reference_type: str | None = None, run: Run | None = None, revises: Collection | None = None)

Bases: Record, IsVersioned, TracksRun, TracksUpdates

Collections of artifacts.

Collections provide a simple way of versioning collections of artifacts.

Parameters:
  • artifactslist[Artifact] A list of artifacts.

  • namestr A name.

  • descriptionstr | None = None A description.

  • revisesCollection | None = None An old version of the collection.

  • runRun | None = None The run that creates the collection.

  • metaArtifact | None = None An artifact that defines metadata for the collection.

  • referencestr | None = None For instance, an external ID or a URL.

  • reference_typestr | None = None For instance, "url".

See also

Artifact

Examples

Create a collection from a list of Artifact objects:

>>> collection = ln.Collection([artifact1, artifact2], name="My collection")

Create a collection that groups a data & a metadata artifact (e.g., here RxRx: cell imaging):

>>> collection = ln.Collection(data_artifact, name="My collection", meta=metadata_artifact)

Attributes

property data_artifact: Artifact | None

Access to a single data artifact.

If the collection has a single data & metadata artifact, this allows access via:

collection.data_artifact  # first & only element of collection.artifacts
collection.meta_artifact  # metadata
property name: str

Name of the collection.

Splits key on / and returns the last element.

property ordered_artifacts: QuerySet

Ordered QuerySet of .artifacts.

Accessing the many-to-many field collection.artifacts directly gives you non-deterministic order.

Using the property .ordered_artifacts allows to iterate through a set that’s ordered in the order of creation.

property stem_uid: str

Universal id characterizing the version family.

The full uid of a record is obtained via concatenating the stem uid and version information:

stem_uid = random_base62(n_char)  # a random base62 sequence of length 12 (transform) or 16 (artifact, collection)
version_uid = "0000"  # an auto-incrementing 4-digit base62 number
uid = f"{stem_uid}{version_uid}"  # concatenate the stem_uid & version_uid
property transform: Transform | None

Transform whose run created the collection.

property versions: QuerySet

Lists all records of the same version family.

>>> new_artifact = ln.Artifact(df2, revises=artifact).save()
>>> new_artifact.versions()

Simple fields

uid: str

Universal id, valid across DB instances.

key: str

Name or path-like key.

description: str | None

A description or title.

hash: str | None

Hash of collection content. 86 base64 chars allow to store 64 bytes, 512 bits.

reference: str | None

A reference like URL or external ID.

reference_type: str | None

Type of reference, e.g., cellxgene Census collection_id.

meta_artifact: Artifact | None

An artifact that stores metadata that indexes a collection.

It has a 1:1 correspondence with an artifact. If needed, you can access the collection from the artifact via a private field: artifact._meta_of_collection.

version: str | None

Version (default None).

Defines version of a family of records characterized by the same stem_uid.

Consider using semantic versioning with Python versioning.

is_latest: bool

Boolean flag that indicates whether a record is the latest in its version family.

created_at: datetime

Time of creation of record.

updated_at: datetime

Time of last update to record.

Relational fields

created_by: User

Creator of record.

space: Space

The space in which the record lives.

run: Run | None

Run that created the collection.

ulabels: ULabel

ULabels sampled in the collection (see Feature).

input_of_runs: Run

Runs that use this collection as an input.

artifacts: Artifact

Artifacts in collection.

projects

Accessor to the related objects manager on the forward and reverse sides of a many-to-many relation.

In the example:

class Pizza(Model):
    toppings = ManyToManyField(Topping, related_name='pizzas')

Pizza.toppings and Topping.pizzas are ManyToManyDescriptor instances.

Most of the implementation is delegated to a dynamically defined manager class built by create_forward_many_to_many_manager() defined below.

references

Accessor to the related objects manager on the forward and reverse sides of a many-to-many relation.

In the example:

class Pizza(Model):
    toppings = ManyToManyField(Topping, related_name='pizzas')

Pizza.toppings and Topping.pizzas are ManyToManyDescriptor instances.

Most of the implementation is delegated to a dynamically defined manager class built by create_forward_many_to_many_manager() defined below.

Class methods

classmethod df(include=None, features=False, limit=100)

Convert to pd.DataFrame.

By default, shows all direct fields, except updated_at.

Use arguments include or feature to include other data.

Parameters:
  • include (str | list[str] | None, default: None) – Related fields to include as columns. Takes strings of form "ulabels__name", "cell_types__name", etc. or a list of such strings.

  • features (bool | list[str], default: False) – If True, map all features of the Feature registry onto the resulting DataFrame. Only available for Artifact.

  • limit (int, default: 100) – Maximum number of rows to display from a Pandas DataFrame. Defaults to 100 to reduce database load.

Return type:

DataFrame

Examples

Include the name of the creator in the DataFrame:

>>> ln.ULabel.df(include="created_by__name"])

Include display of features for Artifact:

>>> df = ln.Artifact.df(features=True)
>>> ln.view(df)  # visualize with type annotations

Only include select features:

>>> df = ln.Artifact.df(features=["cell_type_by_expert", "cell_type_by_model"])
classmethod filter(*queries, **expressions)

Query records.

Parameters:
  • queries – One or multiple Q objects.

  • expressions – Fields and values passed as Django query expressions.

Return type:

QuerySet

Returns:

A QuerySet.

See also

Examples

>>> ln.ULabel(name="my label").save()
>>> ln.ULabel.filter(name__startswith="my").df()
classmethod get(idlike=None, **expressions)

Get a single record.

Parameters:
  • idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.

  • expressions – Fields and values passed as Django query expressions.

Return type:

Record

Returns:

A record.

Raises:

lamindb.core.exceptions.DoesNotExist – In case no matching record is found.

See also

Examples

>>> ulabel = ln.ULabel.get("FvtpPJLJ")
>>> ulabel = ln.ULabel.get(name="my-label")
classmethod lookup(field=None, return_field=None)

Return an auto-complete object for a field.

Parameters:
  • field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.

  • return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.

Return type:

NamedTuple

Returns:

A NamedTuple of lookup information of the field values with a dictionary converter.

See also

search()

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> bt.Gene.from_source(symbol="ADGB-DT").save()
>>> lookup = bt.Gene.lookup()
>>> lookup.adgb_dt
>>> lookup_dict = lookup.dict()
>>> lookup_dict['ADGB-DT']
>>> lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
>>> genes.ensg00000002745
>>> lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")
classmethod search(string, *, field=None, limit=20, case_sensitive=False)

Search.

Parameters:
  • string (str) – The input string to match against the field ontology values.

  • field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.

  • limit (int | None, default: 20) – Maximum amount of top results to return.

  • case_sensitive (bool, default: False) – Whether the match is case sensitive.

Return type:

QuerySet

Returns:

A sorted DataFrame of search results with a score in column score. If return_queryset is True. QuerySet.

See also

filter() lookup()

Examples

>>> ulabels = ln.ULabel.from_values(["ULabel1", "ULabel2", "ULabel3"], field="name")
>>> ln.save(ulabels)
>>> ln.ULabel.search("ULabel2")
classmethod using(instance)

Use a non-default LaminDB instance.

Parameters:

instance (str | None) – An instance identifier of form “account_handle/instance_name”.

Return type:

QuerySet

Examples

>>> ln.ULabel.using("account_handle/instance_name").search("ULabel7", field="name")
            uid    score
name
ULabel7  g7Hk9b2v  100.0
ULabel5  t4Jm6s0q   75.0
ULabel6  r2Xw8p1z   75.0

Methods

append(artifact, run=None)

Add an artifact to the collection.

Creates a new version of the collection.

Parameters:
  • artifact (Artifact) – An artifact to add to the collection.

  • run (Run | None, default: None) – The run that creates the new version of the collection.

Return type:

Collection

Added in version 0.76.14.

cache(is_run_input=None)

Download cloud artifacts in collection to local cache.

Follows synching logic: only caches outdated artifacts.

Returns paths to locally cached on-disk artifacts.

Parameters:

is_run_input (bool | None, default: None) – Whether to track this collection as run input.

Return type:

list[UPath]

delete(permanent=None)

Delete collection.

Parameters:

permanent (bool | None, default: None) – Whether to permanently delete the collection record (skips trash).

Return type:

None

Examples

For any Collection object collection, call:

>>> collection.delete()
describe(print_types=False)

Describe relations of record.

Examples

>>> artifact.describe()
load(join='outer', is_run_input=None, **kwargs)

Stage and load to memory.

Returns in-memory representation if possible such as a concatenated DataFrame or AnnData object.

Return type:

Any

mapped(layers_keys=None, obs_keys=None, obsm_keys=None, obs_filter=None, join='inner', encode_labels=True, unknown_label=None, cache_categories=True, parallel=False, dtype=None, stream=False, is_run_input=None)

Return a map-style dataset.

Returns a pytorch map-style dataset by virtually concatenating AnnData arrays.

If your AnnData collection is in the cloud, move them into a local cache first via cache().

__getitem__ of the MappedCollection object takes a single integer index and returns a dictionary with the observation data sample for this index from the AnnData objects in the collection. The dictionary has keys for layers_keys (.X is in "X"), obs_keys, obsm_keys (under f"obsm_{key}") and also "_store_idx" for the index of the AnnData object containing this observation sample.

Note

For a guide, see Train a machine learning model on a collection.

This method currently only works for collections of AnnData artifacts.

Parameters:
  • layers_keys (str | list[str] | None, default: None) – Keys from the .layers slot. layers_keys=None or "X" in the list retrieves .X.

  • obs_keys (str | list[str] | None, default: None) – Keys from the .obs slots.

  • obsm_keys (str | list[str] | None, default: None) – Keys from the .obsm slots.

  • obs_filter (dict[str, str | tuple[str, ...]] | None, default: None) – Select only observations with these values for the given obs columns. Should be a dictionary with obs column names as keys and filtering values (a string or a tuple of strings) as values.

  • join (Literal['inner', 'outer'] | None, default: 'inner') – "inner" or "outer" virtual joins. If None is passed, does not join.

  • encode_labels (bool | list[str], default: True) – Encode labels into integers. Can be a list with elements from obs_keys.

  • unknown_label (str | dict[str, str] | None, default: None) – Encode this label to -1. Can be a dictionary with keys from obs_keys if encode_labels=True or from encode_labels if it is a list.

  • cache_categories (bool, default: True) – Enable caching categories of obs_keys for faster access.

  • parallel (bool, default: False) – Enable sampling with multiple processes.

  • dtype (str | None, default: None) – Convert numpy arrays from .X, .layers and .obsm

  • stream (bool, default: False) – Whether to stream data from the array backend.

  • is_run_input (bool | None, default: None) – Whether to track this collection as run input.

Return type:

MappedCollection

Examples

>>> import lamindb as ln
>>> from torch.utils.data import DataLoader
>>> ds = ln.Collection.get(description="my collection")
>>> mapped = collection.mapped(obs_keys=["cell_type", "batch"])
>>> dl = DataLoader(mapped, batch_size=128, shuffle=True)
restore()

Restore collection record from trash.

Return type:

None

Examples

For any Collection object collection, call:

>>> collection.restore()
save(using=None)

Save the collection and underlying artifacts to database & storage.

Parameters:

using (str | None, default: None) – The database to which you want to save.

Return type:

Collection

Examples

>>> collection = ln.Collection("./myfile.csv", name="myfile")
>>> collection.save()
view_lineage(with_children=True)

Graph of data flow.

Return type:

None

Notes

For more info, see use cases: Data lineage.

Examples

>>> collection.view_lineage()
>>> artifact.view_lineage()