lamindb.Collection¶
- class lamindb.Collection(artifacts: list[Artifact], name: str, description: str | None = None, meta: Any | None = None, reference: str | None = None, reference_type: str | None = None, run: Run | None = None, revises: Collection | None = None)¶
Bases: Record, IsVersioned, TracksRun, TracksUpdates
Collections of artifacts.
Collections provide a simple way of versioning collections of artifacts.
- Parameters:
artifacts (list[Artifact]) – A list of artifacts.
name (str) – A name.
description (str | None, default: None) – A description.
revises (Collection | None, default: None) – An old version of the collection.
run (Run | None, default: None) – The run that creates the collection.
meta (Artifact | None, default: None) – An artifact that defines metadata for the collection.
reference (str | None, default: None) – For instance, an external ID or a URL.
reference_type (str | None, default: None) – For instance, "url".
Examples
Create a collection from a list of Artifact objects:
>>> collection = ln.Collection([artifact1, artifact2], name="My collection")
Create a collection that groups a data & a metadata artifact (e.g., here RxRx: cell imaging):
>>> collection = ln.Collection(data_artifact, name="My collection", meta=metadata_artifact)
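To create a new version that revises an existing collection, pass the previous version via revises (a minimal sketch, assuming artifact3 is an additional saved Artifact and collection is the collection created above):
>>> collection_v2 = ln.Collection(
...     [artifact1, artifact2, artifact3], name="My collection", revises=collection
... )
>>> collection_v2.save()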
Attributes¶
- property data_artifact: Artifact | None¶
Access to a single data artifact.
If the collection has a single data & metadata artifact, this allows access via:
collection.data_artifact  # first & only element of collection.artifacts
collection.meta_artifact  # metadata
- property ordered_artifacts: QuerySet¶
Ordered QuerySet of .artifacts.
Accessing the many-to-many field collection.artifacts directly gives you a non-deterministic order. Using the property .ordered_artifacts allows you to iterate through a set that is ordered in the order of creation.
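For example, to iterate over the artifacts of a collection in their creation order (a minimal sketch, assuming collection is a saved Collection):
>>> for artifact in collection.ordered_artifacts:
...     print(artifact.uid)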
- property stem_uid: str¶
Universal id characterizing the version family.
The full uid of a record is obtained via concatenating the stem uid and version information:
stem_uid = random_base62(n_char)  # a random base62 sequence of length 12 (transform) or 16 (artifact, collection)
version_uid = "0000"  # an auto-incrementing 4-digit base62 number
uid = f"{stem_uid}{version_uid}"  # concatenate the stem_uid & version_uid
- property versions: QuerySet¶
Lists all records of the same version family.
>>> new_artifact = ln.Artifact(df2, revises=artifact).save()
>>> new_artifact.versions()
Simple fields¶
- uid: str¶
Universal id, valid across DB instances.
- name: str¶
Name or title of collection (required).
- description: str | None¶
A description.
- hash: str | None¶
Hash of collection content. 86 base64 chars allow storing 64 bytes, 512 bits.
- reference: str | None¶
A reference like URL or external ID.
- reference_type: str | None¶
Type of reference, e.g., cellxgene Census collection_id.
- meta_artifact: Artifact | None¶
An artifact that stores metadata that indexes a collection.
It has a 1:1 correspondence with an artifact. If needed, you can access the collection from the artifact via a private field: artifact._meta_of_collection.
- visibility: int¶
Visibility of collection record in queries & searches (1 default, 0 hidden, -1 trash).
- version: str | None¶
Version (default None).
Defines the version of a family of records characterized by the same stem_uid.
Consider using semantic versioning with Python versioning.
- is_latest: bool¶
Boolean flag that indicates whether a record is the latest in its version family.
- created_at: datetime¶
Time of creation of record.
- updated_at: datetime¶
Time of last update to record.
Relational fields¶
Class methods¶
- classmethod df(include=None, features=False, limit=100)¶
Convert to pd.DataFrame.
By default, shows all direct fields, except updated_at.
Use arguments include or features to include other data.
- Parameters:
include (str | list[str] | None, default: None) – Related fields to include as columns. Takes strings of form "ulabels__name", "cell_types__name", etc. or a list of such strings.
features (bool | list[str], default: False) – If True, map all features of the Feature registry onto the resulting DataFrame. Only available for Artifact.
limit (int, default: 100) – Maximum number of rows to display from a Pandas DataFrame. Defaults to 100 to reduce database load.
- Return type:
DataFrame
Examples
Include the name of the creator in the DataFrame:
>>> ln.ULabel.df(include="created_by__name")
Include display of features for Artifact:
>>> df = ln.Artifact.df(features=True)
>>> ln.view(df)  # visualize with type annotations
Only include select features:
>>> df = ln.Artifact.df(features=["cell_type_by_expert", "cell_type_by_model"])
- classmethod filter(*queries, **expressions)¶
Query records.
- Parameters:
queries – One or multiple Q objects.
expressions – Fields and values passed as Django query expressions.
- Return type:
QuerySet
- Returns:
A QuerySet.
See also
Guide: Query & search registries
Django documentation: Queries
Examples
>>> ln.ULabel(name="my label").save() >>> ln.ULabel.filter(name__startswith="my").df()
- classmethod get(idlike=None, **expressions)¶
Get a single record.
- Parameters:
idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.
expressions – Fields and values passed as Django query expressions.
- Return type:
Record
- Returns:
A record.
- Raises:
lamindb.core.exceptions.DoesNotExist – In case no matching record is found.
See also
Guide: Query & search registries
Django documentation: Queries
Examples
>>> ulabel = ln.ULabel.get("FvtpPJLJ")
>>> ulabel = ln.ULabel.get(name="my-label")
- classmethod lookup(field=None, return_field=None)¶
Return an auto-complete object for a field.
- Parameters:
field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.
return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.
- Return type:
NamedTuple
- Returns:
A NamedTuple of lookup information of the field values with a dictionary converter.
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> bt.Gene.from_source(symbol="ADGB-DT").save()
>>> lookup = bt.Gene.lookup()
>>> lookup.adgb_dt
>>> lookup_dict = lookup.dict()
>>> lookup_dict['ADGB-DT']
>>> lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
>>> lookup_by_ensembl_id.ensg00000002745
>>> lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")
- classmethod search(string, *, field=None, limit=20, case_sensitive=False)¶
Search.
- Parameters:
string (str) – The input string to match against the field ontology values.
field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.
limit (int | None, default: 20) – Maximum amount of top results to return.
case_sensitive (bool, default: False) – Whether the match is case sensitive.
- Return type:
QuerySet
- Returns:
A sorted DataFrame of search results with a score in column score; a QuerySet if return_queryset is True.
Examples
>>> ulabels = ln.ULabel.from_values(["ULabel1", "ULabel2", "ULabel3"], field="name")
>>> ln.save(ulabels)
>>> ln.ULabel.search("ULabel2")
- classmethod using(instance)¶
Use a non-default LaminDB instance.
- Parameters:
instance (str | None) – An instance identifier of form “account_handle/instance_name”.
- Return type:
QuerySet
Examples
>>> ln.ULabel.using("account_handle/instance_name").search("ULabel7", field="name")
              uid  score
name
ULabel7  g7Hk9b2v  100.0
ULabel5  t4Jm6s0q   75.0
ULabel6  r2Xw8p1z   75.0
Methods¶
- append(artifact, run=None)¶
Add an artifact to the collection.
Creates a new version of the collection.
- Parameters:
artifact (Artifact) – The artifact to add to the collection.
run (Run | None, default: None) – The run that creates the new version of the collection.
- Return type:
Collection
Added in version 0.76.14.
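A minimal usage sketch, assuming collection is a saved Collection, artifact3 is a saved Artifact not yet contained in it, and the returned object is the new collection version:
>>> collection_v2 = collection.append(artifact3)
>>> collection_v2.save()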
- cache(is_run_input=None)¶
Download cloud artifacts in collection to local cache.
Follows syncing logic: only caches outdated artifacts.
Returns paths to locally cached on-disk artifacts.
- Parameters:
is_run_input (bool | None, default: None) – Whether to track this collection as run input.
- Return type:
list[UPath]
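A minimal sketch, assuming collection contains cloud-backed artifacts:
>>> paths = collection.cache()
>>> paths[0]  # local path of the first cached artifact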
- delete(permanent=None)¶
Delete collection.
- Parameters:
permanent (bool | None, default: None) – Whether to permanently delete the collection record (skips trash).
- Return type:
None
Examples
For any Collection object collection, call:
>>> collection.delete()
- describe(print_types=False)¶
Describe relations of record.
Examples
>>> artifact.describe()
- load(join='outer', is_run_input=None, **kwargs)¶
Stage and load to memory.
Returns an in-memory representation if possible, such as a concatenated DataFrame or AnnData object.
- Return type:
Any
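A minimal sketch, assuming the collection's artifacts can be concatenated into a single DataFrame:
>>> df = collection.load(join="outer")
>>> df.head()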
- mapped(layers_keys=None, obs_keys=None, obsm_keys=None, obs_filter=None, join='inner', encode_labels=True, unknown_label=None, cache_categories=True, parallel=False, dtype=None, stream=False, is_run_input=None)¶
Return a map-style dataset.
Returns a pytorch map-style dataset by virtually concatenating AnnData arrays.
If your AnnData collection is in the cloud, move them into a local cache first via cache().
__getitem__ of the MappedCollection object takes a single integer index and returns a dictionary with the observation data sample for this index from the AnnData objects in the collection. The dictionary has keys for layers_keys (.X is in "X"), obs_keys, obsm_keys (under f"obsm_{key}") and also "_store_idx" for the index of the AnnData object containing this observation sample.
Note
For a guide, see Train a machine learning model on a collection.
This method currently only works for collections of AnnData artifacts.
- Parameters:
layers_keys (str | list[str] | None, default: None) – Keys from the .layers slot. layers_keys=None or "X" in the list retrieves .X.
obs_keys (str | list[str] | None, default: None) – Keys from the .obs slots.
obsm_keys (str | list[str] | None, default: None) – Keys from the .obsm slots.
obs_filter (tuple[str, str | tuple[str, ...]] | None, default: None) – Select only observations with these values for the given obs column. Should be a tuple with an obs column name as the first element and filtering values (a string or a tuple of strings) as the second element.
join (Literal['inner', 'outer'] | None, default: 'inner') – "inner" or "outer" virtual joins. If None is passed, does not join.
encode_labels (bool | list[str], default: True) – Encode labels into integers. Can be a list with elements from obs_keys.
unknown_label (str | dict[str, str] | None, default: None) – Encode this label to -1. Can be a dictionary with keys from obs_keys if encode_labels=True or from encode_labels if it is a list.
cache_categories (bool, default: True) – Enable caching categories of obs_keys for faster access.
parallel (bool, default: False) – Enable sampling with multiple processes.
dtype (str | None, default: None) – Convert numpy arrays from .X, .layers and .obsm to this dtype.
stream (bool, default: False) – Whether to stream data from the array backend.
is_run_input (bool | None, default: None) – Whether to track this collection as run input.
- Return type:
MappedCollection
Examples
>>> import lamindb as ln
>>> from torch.utils.data import DataLoader
>>> collection = ln.Collection.get(description="my collection")
>>> mapped = collection.mapped(obs_keys=["cell_type", "batch"])
>>> dl = DataLoader(mapped, batch_size=128, shuffle=True)
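To illustrate the dictionary returned by __getitem__ (a sketch continuing the example above, where "cell_type" and "batch" were passed via obs_keys):
>>> sample = mapped[0]
>>> sample.keys()  # includes "X", "cell_type", "batch", and "_store_idx"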
- restore()¶
Restore collection record from trash.
- Return type:
None
Examples
For any Collection object collection, call:
>>> collection.restore()
- save(using=None)¶
Save the collection and underlying artifacts to database & storage.
- Parameters:
using (str | None, default: None) – The database to which you want to save.
- Return type:
Examples
>>> collection = ln.Collection("./myfile.csv", name="myfile") >>> collection.save()
- view_lineage(with_children=True)¶
Graph of data flow.
- Return type:
None
Notes
For more info, see use cases: Data lineage.
Examples
>>> collection.view_lineage()
>>> artifact.view_lineage()