lamindb.Artifact

class lamindb.Artifact(data: UPathStr, type: ArtifactType | None = None, key: str | None = None, description: str | None = None, revises: Artifact | None = None, run: Run | None = None)

Bases: Record, IsVersioned, TracksRun, TracksUpdates

Datasets & models stored as files, folders, or arrays.

Artifacts manage data in local or remote storage (Tutorial: Artifacts).

Some artifacts are array-like, e.g., when stored as .parquet, .h5ad, .zarr, or .tiledb.

Parameters:
  • data (UPathStr) – A path to a local or remote folder or file.

  • type (Literal["dataset", "model"] | None, default: None) – The artifact type.

  • key (str | None, default: None) – A path-like key to reference the artifact in default storage, e.g., "myfolder/myfile.fcs". Artifacts with the same key form a revision family.

  • description (str | None, default: None) – A description.

  • revises (Artifact | None, default: None) – Previous version of the artifact. Triggers a revision.

  • run (Run | None, default: None) – The run that creates the artifact.

Typical storage formats & their API accessors

Arrays:

  • Table: .csv, .tsv, .parquet, .ipc ⟷ DataFrame, pyarrow.Table

  • Annotated matrix: .h5ad, .h5mu, .zarr ⟷ AnnData, MuData

  • Generic array: HDF5 group, zarr group, TileDB store ⟷ HDF5, zarr, TileDB loaders

Non-arrays:

  • Image: .jpg, .png ⟷ np.ndarray, …

  • Fastq: .fastq ⟷ /

  • VCF: .vcf ⟷ /

  • QC: .html ⟷ /

You’ll find these values in the suffix & accessor fields.

LaminDB makes some default choices (e.g., serialize a DataFrame as a .parquet file).
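
For instance, saving a pandas DataFrame through from_df() results in a .parquet file whose suffix you can inspect afterwards. A minimal sketch, assuming a default instance is set up (df_toy is a made-up DataFrame):

import lamindb as ln
import pandas as pd

df_toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
artifact = ln.Artifact.from_df(df_toy, description="toy table").save()
artifact.suffix  # '.parquet'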

See also

Storage

Storage locations for artifacts.

Collection

Collections of artifacts.

from_df()

Create an artifact from a DataFrame.

from_anndata()

Create an artifact from an AnnData.

Examples

Create an artifact from a path to a file or folder:

>>> artifact = ln.Artifact("s3://my_bucket/my_folder/my_file.csv", description="My file")
>>> artifact = ln.Artifact("./my_local_file.jpg", description="My image")
>>> artifact = ln.Artifact("s3://my_bucket/my_folder", description="My folder")
>>> artifact = ln.Artifact("./my_local_folder", description="My local folder")
Why does the API look this way?

It’s inspired by APIs building on AWS S3.

Both boto3 and quilt select a bucket (akin to default storage in LaminDB) and define a target path through a key argument.

In boto3:

# signature: S3.Bucket.upload_file(filepath, key)
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')
bucket.upload_file('/tmp/hello.txt', 'hello.txt')

In quilt3:

# signature: quilt3.Bucket.put_file(key, filepath)
import quilt3
bucket = quilt3.Bucket('mybucket')
bucket.put_file('hello.txt', '/tmp/hello.txt')

Make a new version of an artifact:

>>> artifact = ln.Artifact.from_df(df, description="My dataframe")
>>> artifact.save()
>>> artifact_v2 = ln.Artifact.from_df(df_updated, revises=artifact)
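
Saving the new version keeps both records in the same version family; a minimal sketch continuing the example above:

>>> artifact_v2.save()
>>> artifact_v2.is_latest
True
>>> artifact.versions  # a QuerySet containing both versions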

Attributes

features: FeatureManager

Feature manager.

Features denote dataset dimensions, i.e., the variables that measure labels & numbers.

Annotate with features & values:

artifact.features.add_values({
     "species": organism,  # here, organism is an Organism record
     "scientist": ['Barbara McClintock', 'Edgar Anderson'],
     "temperature": 27.6,
     "study": "Candidate marker study"
})

Query for features & values:

ln.Artifact.features.filter(scientist="Barbara McClintock")

Features may or may not be part of the artifact content in storage. For instance, the Curator flow validates the columns of a DataFrame-like artifact and annotates it with features corresponding to these columns. artifact.features.add_values, by contrast, does not validate the content of the artifact.
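
To check which features and values ended up annotating an artifact, you can print its record with describe() (documented under Methods below); a minimal sketch:

artifact.features.add_values({"temperature": 27.6})
artifact.describe()  # prints linked features, labels & provenance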

property labels: LabelManager

Label manager.

To annotate with labels, you typically use the registry-specific accessors, for instance ulabels:

candidate_marker_study = ln.ULabel(name="Candidate marker study").save()
artifact.ulabels.add(candidate_marker_study)

Similarly, you query based on these accessors:

ln.Artifact.filter(ulabels__name="Candidate marker study").all()

Unlike the registry-specific accessors, the .labels accessor provides a way of associating labels with features:

study = ln.Feature(name="study", dtype="cat").save()
artifact.labels.add(candidate_marker_study, feature=study)

Note that the above is equivalent to:

artifact.features.add_values({"study": candidate_marker_study})
params: ParamManager

Param manager.

Example:

artifact.params.add_values({
    "hidden_size": 32,
    "bottleneck_size": 16,
    "batch_size": 32,
    "preprocess_params": {
        "normalization_type": "cool",
        "subset_highlyvariable": True,
    },
})
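
Params are typically used for model-like artifacts; a hedged sketch (the file path and the values are made up):

artifact = ln.Artifact("./models/vae.pt", type="model", description="trained VAE").save()
artifact.params.add_values({"hidden_size": 32, "batch_size": 32})
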
property path: Path | UPath

Path.

File in cloud storage, here AWS S3:

>>> artifact = ln.Artifact("s3://my-bucket/my-file.csv").save()
>>> artifact.path
S3Path('s3://my-bucket/my-file.csv')

File in local storage:

>>> ln.Artifact("./myfile.csv", description="myfile").save()
>>> artifact = ln.Artifact.get(description="myfile")
>>> artifact.path
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/myfile.csv')
property stem_uid: str

Universal id characterizing the version family.

The full uid of a record is obtained via concatenating the stem uid and version information:

stem_uid = random_base62(n_char)  # a random base62 sequence of length 12 (transform) or 16 (artifact, collection)
version_uid = "0000"  # an auto-incrementing 4-digit base62 number
uid = f"{stem_uid}{version_uid}"  # concatenate the stem_uid & version_uid
property versions: QuerySet

Lists all records of the same version family.

>>> new_artifact = ln.Artifact(df2, revises=artifact)
>>> new_artifact.save()
>>> new_artifact.versions

Simple fields

uid: str

A universal random id (20-char base62 ~ UUID), valid across DB instances.

description: str

A description.

key: str

Storage key, the relative path within the storage location.

suffix: str

Path suffix or empty string if no canonical suffix exists.

This is either a file suffix (".csv", ".h5ad", etc.) or the empty string “”.

type: Literal['dataset', 'model'] | None

ArtifactType (default None).

size: int

Size in bytes.

Examples: 1KB is 1e3 bytes, 1MB is 1e6, 1GB is 1e9, 1TB is 1e12 etc.

hash: str

Hash or pseudo-hash of artifact content.

Useful to ascertain integrity and avoid duplication.

n_objects: int

Number of objects.

Typically, this denotes the number of files in an artifact.

n_observations: int

Number of observations.

Typically, this denotes the first array dimension.

visibility: int

Visibility of artifact record in queries & searches (1 default, 0 hidden, -1 trash).

version: str

Version (default None).

Defines version of a family of records characterized by the same stem_uid.

Consider using semantic versioning with Python versioning.

is_latest: bool

Boolean flag that indicates whether a record is the latest in its version family.

created_at: datetime

Time of creation of record.

updated_at: datetime

Time of last update to record.

Relational fields

storage: Storage

Storage location, e.g. an S3 or GCP bucket or a local directory.

transform: Transform

Transform whose run created the artifact.

run: Run

Run that created the artifact.

created_by: User

Creator of record.

ulabels: ULabel

The ulabels measured in the artifact (ULabel).

input_of_runs: Run

Runs that use this artifact as an input.

feature_sets: FeatureSet

The feature sets measured in the artifact.

collections: Collection

The collections that this artifact is part of.

Class methods

classmethod df(include=None, join='inner', limit=100)

Convert to pd.DataFrame.

By default, shows all direct fields, except updated_at.

Use parameter include to include other fields.

Parameters:
  • include (str | list[str] | None, default: None) – Related fields to include as columns. Takes strings of form "labels__name", "cell_types__name", etc. or a list of such strings.

  • join (str, default: 'inner') – The join parameter of pandas.

  • limit (int, default: 100) – Maximum number of rows to display from a Pandas DataFrame. Defaults to 100 to reduce database load.

Return type:

DataFrame

Examples

>>> labels = [ln.ULabel(name="Label {i}") for i in range(3)]
>>> ln.save(labels)
>>> ln.ULabel.filter().df(include=["created_by__name"])
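
For artifacts, the same call works on any (filtered) QuerySet; a sketch, assuming matching records exist in the instance:

>>> ln.Artifact.filter(suffix=".parquet").df(include=["created_by__name"])
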
classmethod filter(*queries, **expressions)

Query records.

Parameters:
  • queries – One or multiple Q objects.

  • expressions – Fields and values passed as Django query expressions.

Return type:

QuerySet

Returns:

A QuerySet.

Examples

>>> ln.ULabel(name="my ulabel").save()
>>> ulabel = ln.ULabel.get(name="my ulabel")
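
The same pattern applies to artifacts, using any of the fields listed above; a sketch, assuming matching records exist:

>>> ln.Artifact.filter(suffix=".h5ad").df()
>>> ln.Artifact.filter(ulabels__name="Candidate marker study").all()
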
classmethod from_anndata(adata, key=None, description=None, run=None, revises=None, **kwargs)

Create from AnnData, validate & link features.

Parameters:
  • adata (AnnData | UPathStr) – An AnnData object or a path to an AnnData-like object.

  • key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.h5ad".

  • description (str | None, default: None) – A description.

  • revises (Artifact | None, default: None) – An old version of the artifact.

  • run (Run | None, default: None) – The run that creates the artifact.

Return type:

Artifact

See also

Collection()

Track collections.

Feature

Track features.

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> adata = ln.core.datasets.anndata_with_obs()
>>> artifact = ln.Artifact.from_anndata(adata, description="mini anndata with obs")
>>> artifact.save()
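
The saved artifact can then be loaded back into memory or opened in backed mode (see load() and open() below):

>>> adata2 = artifact.load()  # AnnData in memory
>>> backed = artifact.open()  # cloud-backed AnnDataAccessor
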
classmethod from_df(df, key=None, description=None, run=None, revises=None, **kwargs)

Create from DataFrame, validate & link features.

For more info, see tutorial: Tutorial: Artifacts.

Parameters:
  • df (DataFrame) – A DataFrame object.

  • key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.parquet".

  • description (str | None, default: None) – A description.

  • revises (Artifact | None, default: None) – An old version of the artifact.

  • run (Run | None, default: None) – The run that creates the artifact.

Return type:

Artifact

See also

Collection()

Track collections.

Feature

Track features.

Examples

>>> df = ln.core.datasets.df_iris_in_meter_batch1()
>>> df.head()
  sepal_length sepal_width petal_length petal_width iris_organism_code
0        0.051       0.035        0.014       0.002                 0
1        0.049       0.030        0.014       0.002                 0
2        0.047       0.032        0.013       0.002                 0
3        0.046       0.031        0.015       0.002                 0
4        0.050       0.036        0.014       0.002                 0
>>> artifact = ln.Artifact.from_df(df, description="Iris flower collection batch1")
>>> artifact.save()
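
To place the artifact at a specific path within default storage, pass a key; a sketch (the key is made up):

>>> artifact = ln.Artifact.from_df(df, key="iris/batch1.parquet")
>>> artifact.save()
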
classmethod from_dir(path, key=None, *, run=None)

Create a list of artifact objects from a directory.

Hint

If you have a high number of files (several 100k) and don’t want to track them individually, create a single Artifact via Artifact(path) for them. See, e.g., RxRx: cell imaging.
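
In that case you would create one folder-like artifact instead of many file-level ones; a sketch (the path is made up):

artifact = ln.Artifact("./imaging_runs/", description="all imaging runs").save()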

Parameters:
  • path (lamindb.core.types.UPathStr) – Source path of folder.

  • key (str | None, default: None) – Key for storage destination. If None and directory is in a registered location, the inferred key will reflect the relative position. If None and directory is outside of a registered storage location, the inferred key defaults to path.name.

  • run (Run | None, default: None) – A Run object.

Return type:

list[Artifact]

Examples

>>> dir_path = ln.core.datasets.generate_cell_ranger_files("sample_001", ln.settings.storage)
>>> artifacts = ln.Artifact.from_dir(dir_path)
>>> ln.save(artifacts)
classmethod from_mudata(mdata, key=None, description=None, run=None, revises=None, **kwargs)

Create from MuData, validate & link features.

Parameters:
  • mdata (MuData) – A MuData object.

  • key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.h5mu".

  • description (str | None, default: None) – A description.

  • revises (Artifact | None, default: None) – An old version of the artifact.

  • run (Run | None, default: None) – The run that creates the artifact.

Return type:

Artifact

See also

Collection()

Track collections.

Feature

Track features.

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> mdata = ln.core.datasets.mudata_papalexi21_subset()
>>> artifact = ln.Artifact.from_mudata(mdata, description="a mudata object")
>>> artifact.save()
classmethod get(idlike=None, **expressions)

Get a single record.

Parameters:
  • idlike (int | str | None, default: None) – Either a uid stub, a full uid, or an integer id.

  • expressions – Fields and values passed as Django query expressions.

Return type:

Record

Returns:

A record.

Raises:

lamindb.core.exceptions.DoesNotExist – In case no matching record is found.

Examples

>>> ulabel = ln.ULabel.get("2riu039")
>>> ulabel = ln.ULabel.get(name="my-label")
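
For artifacts, querying by key or description works the same way (reusing values from examples elsewhere on this page):

>>> artifact = ln.Artifact.get(key="lndb-storage/pbmc68k.h5ad")
>>> artifact = ln.Artifact.get(description="myfile")
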
classmethod lookup(field=None, return_field=None)

Return an auto-complete object for a field.

Parameters:
  • field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.

  • return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.

Return type:

NamedTuple

Returns:

A NamedTuple of lookup information of the field values with a dictionary converter.

See also

search()

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> bt.Gene.from_source(symbol="ADGB-DT").save()
>>> lookup = bt.Gene.lookup()
>>> lookup.adgb_dt
>>> lookup_dict = lookup.dict()
>>> lookup_dict['ADGB-DT']
>>> lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
>>> lookup_by_ensembl_id.ensg00000002745
>>> lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")
classmethod search(string, *, field=None, limit=20, case_sensitive=False)

Search.

Parameters:
  • string (str) – The input string to match against the field ontology values.

  • field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.

  • limit (int | None, default: 20) – Maximum number of top results to return.

  • case_sensitive (bool, default: False) – Whether the match is case sensitive.

Return type:

QuerySet

Returns:

A QuerySet of search results, ordered by relevance score.

See also

filter() lookup()

Examples

>>> ulabels = ln.ULabel.from_values(["ULabel1", "ULabel2", "ULabel3"], field="name")
>>> ln.save(ulabels)
>>> ln.ULabel.search("ULabel2")
classmethod using(instance)

Use a non-default LaminDB instance.

Parameters:

instance (str | None) – An instance identifier of form “account_handle/instance_name”.

Return type:

QuerySet

Examples

>>> ln.ULabel.using("account_handle/instance_name").search("ULabel7", field="name")
            uid    score
name
ULabel7  g7Hk9b2v  100.0
ULabel5  t4Jm6s0q   75.0
ULabel6  r2Xw8p1z   75.0

Methods

cache(is_run_input=None)

Download cloud artifact to local cache.

Follows syncing logic: the artifact is only downloaded if it is missing from or outdated in the local cache.

Returns a path to a locally cached on-disk object (say a .jpg file).

Return type:

Path

Examples

Sync file from cloud and return the local path of the cache:

>>> artifact.cache()
PosixPath('/home/runner/work/Caches/lamindb/lamindb-ci/lndb-storage/pbmc68k.h5ad')
delete(permanent=None, storage=None, using_key=None)

Trash or permanently delete.

A first call to .delete() puts an artifact into the trash (sets visibility to -1). A second call permanently deletes the artifact.

FAQ: Storage FAQ

Parameters:
  • permanent (bool | None, default: None) – Permanently delete the artifact (skip trash).

  • storage (bool | None, default: None) – Whether to also delete the artifact from storage.

Return type:

None

Examples

For an Artifact object artifact, call:

>>> artifact.delete()
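
Calling delete() twice follows the trash-then-delete flow described above; passing permanent=True skips the trash. A sketch (this is irreversible):

>>> artifact.delete()                # moves the artifact to the trash (visibility -1)
>>> artifact.delete(permanent=True)  # permanently deletes the record
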
describe(print_types=False)

Describe relations of record.

Examples

>>> artifact.describe()
load(is_run_input=None, **kwargs)

Cache and load into memory.

See all loaders.

Return type:

Any

Examples

Load a DataFrame-like artifact:

>>> artifact.load().head()
sepal_length sepal_width petal_length petal_width iris_organism_code
0        0.051       0.035        0.014       0.002                 0
1        0.049       0.030        0.014       0.002                 0
2        0.047       0.032        0.013       0.002                 0
3        0.046       0.031        0.015       0.002                 0
4        0.050       0.036        0.014       0.002                 0

Load an AnnData-like artifact:

>>> artifact.load()
AnnData object with n_obs × n_vars = 70 × 765

If no in-memory representation is configured for the format, load() falls back to returning the cached path (as cache() does):

>>> artifact.load()
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/.lamindb/jb7BY5UJoQVGMUOKiLcn.jpg')
open(mode='r', is_run_input=None)

Return a cloud-backed data object.

Works for AnnData (.h5ad and .zarr), generic HDF5 and zarr, tiledbsoma objects (.tiledbsoma), and pyarrow-compatible formats.

Parameters:

mode (str, default: 'r') – "w" (write mode) is only supported for tiledbsoma stores; all other formats must be opened with "r" (read-only mode).

Return type:

AnnDataAccessor | BackedAccessor | SOMACollection | SOMAExperiment | PyArrowDataset

Notes

For more info, see tutorial: Query arrays.

Examples

Read AnnData in backed mode from cloud:

>>> artifact = ln.Artifact.get(key="lndb-storage/pbmc68k.h5ad")
>>> artifact.open()
AnnDataAccessor object with n_obs × n_vars = 70 × 765
    constructed for the AnnData object pbmc68k.h5ad
    ...
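
Opening a parquet artifact returns a pyarrow dataset; write mode is only supported for tiledbsoma stores. A sketch (the key and the soma_artifact name are made up):

>>> dataset = ln.Artifact.get(key="myfolder/myfile.parquet").open()  # PyArrowDataset
>>> soma = soma_artifact.open(mode="w")  # write mode, tiledbsoma stores only
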
replace(data, run=None, format=None)

Replace artifact content.

Parameters:
  • data (lamindb.core.types.UPathStr) – A file path.

  • run (Run | None, default: None) – The run that updates the artifact; it is auto-linked if ln.track() was called.

Return type:

None

Examples

Say we made a change to the content of an artifact, e.g., edited the image paradisi05_laminopathic_nuclei.jpg.

This is how we replace the old file in storage with the new file:

>>> artifact.replace("paradisi05_laminopathic_nuclei.jpg")
>>> artifact.save()

Note that this changes neither the storage key nor the filename.

However, the suffix is updated if the new file has a different one.

restore()

Restore from trash.

Return type:

None

Examples

For any Artifact object artifact, call:

>>> artifact.restore()
save(upload=None, **kwargs)

Save to database & storage.

Parameters:

upload (bool | None, default: None) – Trigger upload to cloud storage in instances with hybrid storage mode.

Return type:

Artifact

Examples

>>> artifact = ln.Artifact("./myfile.csv", description="myfile")
>>> artifact.save()
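
In an instance with hybrid storage mode, you can trigger the upload to the cloud location explicitly; a sketch, assuming such an instance:

>>> artifact.save(upload=True)
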
view_lineage(with_children=True)

Graph of data flow.

Return type:

None

Notes

For more info, see use cases: Data lineage.

Examples

>>> collection.view_lineage()
>>> artifact.view_lineage()