lamindb.Collection¶
- class lamindb.Collection(artifacts: Artifact | list[Artifact], key: str, description: str | None = None, meta: Any | None = None, reference: str | None = None, reference_type: str | None = None, run: Run | None = None, revises: Collection | None = None, skip_hash_lookup: bool = False)¶
Bases: SQLRecord, IsVersioned, TracksRun, TracksUpdates

Versioned collections of artifacts.
- Parameters:
artifacts – Artifact | list[Artifact] – One or several artifacts.
key – str – A file-path-like key, analogous to the key parameter of Artifact and Transform.
description – str | None = None – A description.
revises – Collection | None = None – An old version of the collection.
run – Run | None = None – The run that creates the collection.
meta – Artifact | None = None – An artifact that defines metadata for the collection.
reference – str | None = None – A simple reference, e.g. an external ID or a URL.
reference_type – str | None = None – A way to indicate the type of the simple reference, e.g. "url".
Examples
Create a collection from a list of Artifact objects:

collection = ln.Collection([artifact1, artifact2], key="my_project/my_collection")
Create a collection that groups a data & a metadata artifact (e.g., here RxRx: cell imaging):
collection = ln.Collection(data_artifact, key="my_project/my_collection", meta=metadata_artifact)
Attributes¶
- property data_artifact: Artifact | None¶
Access to a single data artifact.
If the collection has a single data & metadata artifact, this allows access via:
collection.data_artifact  # first & only element of collection.artifacts
collection.meta_artifact  # metadata
- property name: str¶
Name of the collection.
Splits key on "/" and returns the last element.
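The splitting behavior can be sketched in plain Python (a minimal illustration of the documented rule, not lamindb's actual implementation):

```python
# The name of a collection is the last "/"-separated component of its key.
def name_from_key(key: str) -> str:
    return key.split("/")[-1]

print(name_from_key("my_project/my_collection"))  # my_collection
print(name_from_key("flat_key"))                  # flat_key
```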
- property ordered_artifacts: QuerySet¶
Ordered QuerySet of .artifacts.

Accessing the many-to-many field collection.artifacts directly gives you a non-deterministic order. The property .ordered_artifacts lets you iterate through a set ordered by the order of the list that created the collection.
- property stem_uid: str¶
Universal id characterizing the version family.
The full uid of a record is obtained via concatenating the stem uid and version information:
stem_uid = random_base62(n_char)  # a random base62 sequence of length 12 (transform) or 16 (artifact, collection)
version_uid = "0000"              # an auto-incrementing 4-digit base62 number
uid = f"{stem_uid}{version_uid}"  # concatenate the stem_uid & version_uid
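The composition above can be made runnable (a sketch under the stated scheme; the base62 helper is illustrative, not a lamindb internal):

```python
import secrets
import string

BASE62 = string.digits + string.ascii_letters  # 62-character alphabet

def random_base62(n_char: int) -> str:
    # A random base62 sequence, e.g. of length 16 for artifacts/collections.
    return "".join(secrets.choice(BASE62) for _ in range(n_char))

stem_uid = random_base62(16)       # identifies the version family
version_uid = "0000"               # first member of the family
uid = f"{stem_uid}{version_uid}"   # full 20-character uid
print(len(uid))  # 20
```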
Simple fields¶
- uid: str¶
Universal id, valid across DB instances.
- key: str¶
Name or path-like key.
- description: str | None¶
A description or title.
- hash: str | None¶
Hash of collection content.
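One way such a content hash could be derived — purely illustrative; lamindb's actual hashing scheme may differ — is to combine the member artifacts' hashes order-independently:

```python
import hashlib

def collection_hash(artifact_hashes: list[str]) -> str:
    # Illustrative only: sort the member hashes so the result does not
    # depend on artifact order, then hash the concatenation.
    joined = "".join(sorted(artifact_hashes))
    return hashlib.md5(joined.encode()).hexdigest()

h1 = collection_hash(["abc", "def"])
h2 = collection_hash(["def", "abc"])
print(h1 == h2)  # True: order-invariant
```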
- reference: str | None¶
A reference like URL or external ID.
- reference_type: str | None¶
Type of reference, e.g., cellxgene Census collection_id.
- meta_artifact: Artifact | None¶
An artifact that stores metadata that indexes a collection.
It has a 1:1 correspondence with an artifact. If needed, you can access the collection from the artifact via a private field: artifact._meta_of_collection.
- version: str | None¶
Version (default None).

Defines the version of a family of records characterized by the same stem_uid.

Consider using semantic versioning with Python versioning.
- is_latest: bool¶
Boolean flag that indicates whether a record is the latest in its version family.
- is_locked: bool¶
Whether the record is locked for edits.
- created_at: datetime¶
Time of creation of record.
- updated_at: datetime¶
Time of last update to record.
Relational fields¶
- blocks: CollectionBlock¶
Blocks that annotate this collection.
Class methods¶
- classmethod get(idlike=None, *, is_run_input=False, **expressions)¶
Get a single collection.
- Parameters:
idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.
is_run_input (bool | Run, default: False) – Whether to track this collection as run input.
expressions – Fields and values passed as Django query expressions.
- Raises:
lamindb.errors.DoesNotExist – In case no matching record is found.
- Return type:
See also
Method in SQLRecord base class: get()
Examples
collection = ln.Collection.get("okxPW6GIKBfRBE3B0000")
collection = ln.Collection.get(key="scrna/collection1")
- classmethod filter(*queries, **expressions)¶
Query records.
- Parameters:
queries – One or multiple Q objects.
expressions – Fields and values passed as Django query expressions.
- Return type:
See also
Guide: Query & search registries
Django documentation: Queries
Examples
>>> ln.Project(name="my label").save()
>>> ln.Project.filter(name__startswith="my").to_dataframe()
- classmethod to_dataframe(include=None, features=False, limit=100)¶
Evaluate and convert to
pd.DataFrame.By default, maps simple fields and foreign keys onto
DataFramecolumns.Guide: Query & search registries
- Parameters:
include (str | list[str] | None, default: None) – Related data to include as columns. Takes strings of form "records__name", "cell_types__name", etc. or a list of such strings. For Artifact, Record, and Run, can also pass "features" to include features with data types pointing to entities in the core schema. If "privates", includes private fields (fields starting with _).
features (bool | list[str], default: False) – Configure the features to include. Can be a feature name or a list of such names. If "queryset", infers the features used within the current queryset. Only available for Artifact, Record, and Run.
limit (int, default: 100) – Maximum number of rows to display. If None, includes all results.
order_by – Field name to order the records by. Prefix with '-' for descending order. Defaults to '-id' to get the most recent records. This argument is ignored if the queryset is already ordered or if the specified field does not exist.
- Return type:
DataFrame
Examples
Include the name of the creator:
ln.Record.to_dataframe(include="created_by__name")
Include features:
ln.Artifact.to_dataframe(include="features")
Include selected features:
ln.Artifact.to_dataframe(features=["cell_type_by_expert", "cell_type_by_model"])
- classmethod search(string, *, field=None, limit=20, case_sensitive=False)¶
Search.
- Parameters:
string (str) – The input string to match against the field ontology values.
field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.
limit (int | None, default: 20) – Maximum amount of top results to return.
case_sensitive (bool, default: False) – Whether the match is case sensitive.
- Return type:
- Returns:
A sorted DataFrame of search results with a score in column score. If return_queryset is True, a QuerySet.
Examples
records = ln.Record.from_values(["Label1", "Label2", "Label3"], field="name").save()
ln.Record.search("Label2")
- classmethod lookup(field=None, return_field=None)¶
Return an auto-complete object for a field.
- Parameters:
field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.
return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.
keep – When multiple records are found for a lookup, how to return the records. "first": return the first record. "last": return the last record. False: return all records.
- Return type:
NamedTuple
- Returns:
A NamedTuple of lookup information of the field values with a dictionary converter.
Examples
Lookup via auto-complete:

import bionty as bt
bt.Gene.from_source(symbol="ADGB-DT").save()
lookup = bt.Gene.lookup()
lookup.adgb_dt
Look up via auto-complete in dictionary:
lookup_dict = lookup.dict()
lookup_dict['ADGB-DT']
Look up via a specific field:

lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
lookup_by_ensembl_id.ensg00000002745
Return a specific field value instead of the full record:
lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")
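The auto-complete mechanics can be sketched generically in plain Python (a toy version; lamindb's actual Lookup object is more featureful, and the helper names here are hypothetical):

```python
import re
from collections import namedtuple

def build_lookup(values: list[str]):
    # Map each value to a valid Python identifier, so that e.g.
    # "ADGB-DT" becomes accessible as lookup.adgb_dt.
    def to_attr(v: str) -> str:
        return re.sub(r"\W|^(?=\d)", "_", v).lower()
    attrs = [to_attr(v) for v in values]
    Lookup = namedtuple("Lookup", attrs)
    return Lookup(*values)

lookup = build_lookup(["ADGB-DT", "Label 2"])
print(lookup.adgb_dt)    # ADGB-DT
print(lookup._asdict())  # dictionary converter, analogous to lookup.dict()
```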
Methods¶
- append(artifact, run=None)¶
Append an artifact to the collection.
This does not modify the original collection in-place, but returns a new version of the original collection with the appended artifact.
- Parameters:
- Return type:
Examples
collection_v1 = ln.Collection(artifact, key="My collection").save()
collection_v2 = collection_v1.append(another_artifact)  # returns a new version of the collection
collection_v2.save()  # save the new version
- open(engine='pyarrow', is_run_input=None, **kwargs)¶
Open a dataset for streaming.
Works for pyarrow and polars compatible formats (.parquet, .csv, .ipc etc. files or directories with such files).

- Parameters:
engine (Literal['pyarrow', 'polars'], default: 'pyarrow') – Which module to use for lazy loading of a dataframe from pyarrow or polars compatible formats.
is_run_input (bool | None, default: None) – Whether to track this artifact as run input.
**kwargs – Keyword arguments for pyarrow.dataset.dataset or polars.scan_* functions.
- Return type:
Dataset | Iterator[LazyFrame]
Notes
For more info, see guide: Slice & stream arrays.
- mapped(layers_keys=None, obs_keys=None, obsm_keys=None, obs_filter=None, join='inner', encode_labels=True, unknown_label=None, cache_categories=True, parallel=False, dtype=None, stream=False, is_run_input=None)¶
Return a map-style dataset.
Returns a pytorch map-style dataset by virtually concatenating AnnData arrays.

By default (stream=False), AnnData arrays are moved into a local cache first.

__getitem__ of the MappedCollection object takes a single integer index and returns a dictionary with the observation data sample for this index from the AnnData objects in the collection. The dictionary has keys for layers_keys (.X is in "X"), obs_keys, obsm_keys (under f"obsm_{key}") and also "_store_idx" for the index of the AnnData object containing this observation sample.

Note
For a guide, see Train a machine learning model on a collection.
This method currently only works for collections or query sets of AnnData artifacts.

- Parameters:
layers_keys (str | list[str] | None, default: None) – Keys from the .layers slot. layers_keys=None or "X" in the list retrieves .X.
obs_keys (str | list[str] | None, default: None) – Keys from the .obs slots.
obsm_keys (str | list[str] | None, default: None) – Keys from the .obsm slots.
obs_filter (dict[str, str | list[str]] | None, default: None) – Select only observations with these values for the given obs columns. Should be a dictionary with obs column names as keys and filtering values (a string or a list of strings) as values.
join (Literal['inner', 'outer'] | None, default: 'inner') – "inner" or "outer" virtual joins. If None is passed, does not join.
encode_labels (bool | list[str], default: True) – Encode labels into integers. Can be a list with elements from obs_keys.
unknown_label (str | dict[str, str] | None, default: None) – Encode this label to -1. Can be a dictionary with keys from obs_keys if encode_labels=True or from encode_labels if it is a list.
cache_categories (bool, default: True) – Enable caching categories of obs_keys for faster access.
parallel (bool, default: False) – Enable sampling with multiple processes.
dtype (str | None, default: None) – Convert numpy arrays from .X, .layers and .obsm.
stream (bool, default: False) – Whether to stream data from the array backend.
is_run_input (bool | None, default: None) – Whether to track this collection as run input.
- Return type:
Examples
>>> import lamindb as ln
>>> from torch.utils.data import DataLoader
>>> collection = ln.Collection.get(description="my collection")
>>> mapped = collection.mapped(obs_keys=["cell_type", "batch"])
>>> dl = DataLoader(mapped, batch_size=128, shuffle=True)
>>> # also works for query sets of artifacts, '...' represents some filtering condition
>>> # additional filtering on artifacts of the collection
>>> mapped = collection.artifacts.all().filter(...).order_by("-created_at").mapped()
>>> # or directly from a query set of artifacts
>>> mapped = ln.Artifact.filter(..., otype="AnnData").order_by("-created_at").mapped()
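The map-style interface described above can be sketched in pure Python (a toy stand-in for MappedCollection; the class and field names are illustrative, and real samples come from AnnData stores):

```python
# Toy map-style dataset: __len__ + __getitem__ over virtually
# concatenated per-store samples, mirroring the documented dict layout.
class ToyMappedCollection:
    def __init__(self, stores):
        # stores: list of dicts with "X" (rows) and "cell_type" labels
        self.stores = stores
        # Flat index: (store index, row index within that store).
        self.index = [
            (store_idx, row_idx)
            for store_idx, store in enumerate(stores)
            for row_idx in range(len(store["X"]))
        ]

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i: int) -> dict:
        store_idx, row_idx = self.index[i]
        store = self.stores[store_idx]
        return {
            "X": store["X"][row_idx],
            "cell_type": store["cell_type"][row_idx],
            "_store_idx": store_idx,  # which store the sample came from
        }

ds = ToyMappedCollection([
    {"X": [[1.0, 2.0]], "cell_type": ["B cell"]},
    {"X": [[3.0, 4.0], [5.0, 6.0]], "cell_type": ["T cell", "T cell"]},
])
print(len(ds))              # 3
print(ds[2]["_store_idx"])  # 1
```

Because the class implements `__len__` and `__getitem__`, it can be passed directly to a pytorch DataLoader, just like the real MappedCollection.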
- cache(is_run_input=None)¶
Download cloud artifacts in collection to local cache.
Follows syncing logic: only downloads outdated artifacts.
Returns ordered paths to locally cached on-disk artifacts via .ordered_artifacts.all().

- Parameters:
is_run_input (bool | None, default: None) – Whether to track this collection as run input.
- Return type:
list[UPath]
- load(join='outer', is_run_input=None, **kwargs)¶
Cache and load to memory.
Returns an in-memory concatenated DataFrame or AnnData object.

- Return type:
DataFrame | AnnData
- save(using=None)¶
Save the collection and underlying artifacts to database & storage.
- Parameters:
using (str | None, default: None) – The database to which you want to save.
- Return type:
Examples
>>> collection = ln.Collection([artifact1, artifact2], key="my_project/my_collection")
>>> collection.save()
- restore()¶
Restore collection record from trash.
- Return type:
None
Examples
For any Collection object collection, call:

>>> collection.restore()
- describe(return_str=False)¶
Describe record including relations.
- Parameters:
return_str (bool, default: False) – Return a string instead of printing.
- Return type:
None|str
- view_lineage(with_children=True, return_graph=False)¶
View data lineage graph.
- Return type:
Digraph | None
- delete(permanent=None, **kwargs)¶
Delete record.
- Parameters:
permanent (bool | None, default: None) – Whether to permanently delete the record (skips trash). If None, performs soft delete if the record is not already in the trash.
- Return type:
None
Examples
For any SQLRecord object record, call:

>>> record.delete()