lamindb.Artifact

class lamindb.Artifact(data: UPathStr, type: Literal['dataset', 'model', 'code'] | None = None, key: str | None = None, description: str | None = None, is_new_version_of: Artifact | None = None, run: Run | None = None)

Bases: Registry, HasFeatures, HasParams, IsVersioned, TracksRun, TracksUpdates

Artifacts: datasets & models stored as files, folders, or arrays.

Artifacts manage data in local or remote storage.

An artifact stores a dataset or model as either a file or a folder.

Some artifacts are array-like, e.g., when stored as .parquet, .h5ad, .zarr, or .tiledb.

For more info, see tutorial: Tutorial: Artifacts.

Parameters:
  • data (UPathStr) – A path to a local or remote folder or file.

  • type (Literal["dataset", "model", "code"] | None, default: None) – The artifact type.

  • key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.fcs".

  • description (str | None, default: None) – A description.

  • version (str | None, default: None) – A version string.

  • is_new_version_of (Artifact | None, default: None) – A previous version of the artifact.

  • run (Run | None, default: None) – The run that creates the artifact.

Typical storage formats & their API accessors

Arrays:

  • Table: .csv, .tsv, .parquet, .ipc ⟷ DataFrame, pyarrow.Table

  • Annotated matrix: .h5ad, .h5mu, .zrad ⟷ AnnData, MuData

  • Generic array: HDF5 group, zarr group, TileDB store ⟷ HDF5, zarr, TileDB loaders

Non-arrays:

  • Image: .jpg, .png ⟷ np.ndarray, …

  • Fastq: .fastq ⟷ /

  • VCF: .vcf ⟷ /

  • QC: .html ⟷ /

You’ll find these values in the suffix & accessor fields.

LaminDB makes some default choices (e.g., serialize a DataFrame as a .parquet file).

See also

Storage

Storage locations for artifacts.

Collection

Collections of artifacts.

from_df()

Create an artifact from a DataFrame.

from_anndata()

Create an artifact from an AnnData.

from_dir()

Bulk create file-like artifacts from a directory.

Examples

Create an artifact from a file in the cloud:

>>> artifact = ln.Artifact("s3://my-bucket/my-folder/my-file.csv", description="My file")
>>> artifact.save()  # only metadata is saved

Create an artifact from a local filepath:

>>> artifact = ln.Artifact("./my_file.jpg", description="My image")
>>> artifact.save()

Why does the API look this way?

It’s inspired by APIs building on AWS S3.

Both boto3 and quilt select a bucket (akin to default storage in LaminDB) and define a target path through a key argument.

In boto3:

# signature: S3.Bucket.upload_file(filepath, key)
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')
bucket.upload_file('/tmp/hello.txt', 'hello.txt')

In quilt3:

# signature: quilt3.Bucket.put_file(key, filepath)
import quilt3
bucket = quilt3.Bucket('mybucket')
bucket.put_file('hello.txt', '/tmp/hello.txt')

Make a new version of an artifact:

>>> # a non-versioned artifact
>>> artifact = ln.Artifact(df1, description="My dataframe")
>>> artifact.save()
>>> # version an artifact
>>> new_artifact = ln.Artifact(df2, is_new_version_of=artifact)
>>> assert new_artifact.stem_uid == artifact.stem_uid
>>> assert artifact.version == "1"
>>> assert new_artifact.version == "2"

Attributes

features

Feature manager. FeatureManager

labels

Label manager. LabelManager.

params

Param manager. ParamManager

path

Path. UPath.

File in cloud storage, here AWS S3:

>>> artifact = ln.Artifact("s3://my-bucket/my-file.csv")
>>> artifact.save()  # save() returns None, so don't assign its result
>>> artifact.path
S3Path('s3://my-bucket/my-file.csv')

File in local storage:

>>> ln.Artifact("./myfile.csv", description="myfile").save()
>>> artifact = ln.Artifact.filter(description="myfile").one()
>>> artifact.path
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/myfile.csv')
stem_uid

Universal id characterizing the version family. str.

The full uid of a record is obtained via concatenating the stem uid and version information:

stem_uid = random_base62(n_char)  # a random base62 sequence of length n_char
version_uid = encode_base62(md5_hash(version))[:4]  # version is, e.g., "1" or "2.1.0" or "2022-03-01"
uid = f"{stem_uid}{version_uid}"  # concatenate the stem_uid & version_uid
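
The pseudocode above can be made concrete as follows. This is a minimal sketch and not LaminDB's implementation: the helper bodies, the base62 alphabet order, the 16-character stem length, and the choice to read the MD5 digest as a big-endian integer are all assumptions made for illustration.

```python
import hashlib
import secrets
import string

# base62 alphabet; the character order is an assumption for this sketch
BASE62 = string.digits + string.ascii_letters


def random_base62(n_char: int) -> str:
    """Return a random base62 sequence of length n_char."""
    return "".join(secrets.choice(BASE62) for _ in range(n_char))


def encode_base62(data: bytes) -> str:
    """Encode bytes as base62 by treating them as one big-endian integer."""
    number = int.from_bytes(data, "big")
    if number == 0:
        return BASE62[0]
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(BASE62[remainder])
    return "".join(reversed(digits))


def make_uid(stem_uid: str, version: str) -> str:
    """Concatenate the stem uid with a 4-character version uid."""
    digest = hashlib.md5(version.encode()).digest()
    version_uid = encode_base62(digest)[:4]
    return f"{stem_uid}{version_uid}"


stem_uid = random_base62(16)      # shared by the whole version family
uid_v1 = make_uid(stem_uid, "1")  # same stem, different version suffix
uid_v2 = make_uid(stem_uid, "2")
```

Because the version suffix is derived from a hash of the version string, two records with the same stem and the same version string map to the same uid in this sketch.
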
versions

Lists all records of the same version family. QuerySet.

>>> new_artifact = ln.Artifact(df2, is_new_version_of=artifact)
>>> new_artifact.save()
>>> new_artifact.versions()

Fields

version CharField

Version (default None).

Defines version of a family of records characterized by the same stem_uid.

Consider using semantic versioning with Python versioning.

id AutoField

Internal id, valid only in one DB instance.

uid CharField

A universal random id (20-char base62 ~ UUID), valid across DB instances.

description CharField

A description.

storage ForeignKey

Storage location (Storage), e.g., an S3 or GCP bucket or a local directory.

key CharField

Storage key, the relative path within the storage location.

suffix CharField

Path suffix or empty string if no canonical suffix exists.

This is either a file suffix (".csv", ".h5ad", etc.) or the empty string “”.
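
As an illustration of the two cases, the sketch below derives a suffix from a storage key with the standard library. `infer_suffix` is a hypothetical helper, and how LaminDB decides what counts as a canonical suffix (for instance for multi-part suffixes) is not covered here.

```python
from pathlib import PurePosixPath


def infer_suffix(key: str) -> str:
    """Return the file suffix of a storage key, or "" if there is none."""
    return PurePosixPath(key).suffix


csv_suffix = infer_suffix("myfolder/myfile.csv")  # a file with a suffix
folder_suffix = infer_suffix("myfolder/images")   # a folder without one
```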

type CharField

Artifact type (default None).

accessor CharField

Default backed or memory accessor, e.g., DataFrame, AnnData.

size BigIntegerField

Size in bytes.

Examples: 1KB is 1e3 bytes, 1MB is 1e6, 1GB is 1e9, 1TB is 1e12 etc.
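
The decimal convention above (powers of 1000 rather than 1024) can be sketched as a small formatting helper. `format_size` is a hypothetical name used for illustration and is not part of LaminDB.

```python
def format_size(n_bytes: int) -> str:
    """Format a byte count with decimal units: 1 KB = 1e3 bytes."""
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(n_bytes)
    for unit in units:
        # stop at the first unit where the value drops below 1000,
        # or at the largest unit in the list
        if value < 1000 or unit == units[-1]:
            return f"{value:g} {unit}"
        value /= 1000
```

For example, `format_size(1_500_000)` yields `"1.5 MB"`.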

hash CharField

Hash or pseudo-hash of artifact content.

Useful to ascertain integrity and avoid duplication.
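
A content hash supports both uses: equal hashes suggest duplicate content, and a changed hash signals that content was modified. The sketch below assumes a plain MD5 digest over the file bytes; the algorithm LaminDB actually applies to a given artifact is recorded in hash_type and may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, read in chunks."""
    md5 = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "a.csv"
    copy = Path(tmp) / "b.csv"
    changed = Path(tmp) / "c.csv"
    first.write_bytes(b"x,y\n1,2\n")
    copy.write_bytes(b"x,y\n1,2\n")      # same content, different name
    changed.write_bytes(b"x,y\n1,3\n")   # one value differs
    is_duplicate = content_hash(first) == content_hash(copy)
    matches_changed = content_hash(first) == content_hash(changed)
```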

hash_type CharField

Type of hash.

n_objects BigIntegerField

Number of objects.

Typically, this denotes the number of files in an artifact.

n_observations BigIntegerField

Number of observations.

Typically, this denotes the first array dimension.

transform ForeignKey

Transform whose run created the artifact.

run ForeignKey

Run that created the artifact.

visibility SmallIntegerField

Visibility of artifact record in queries & searches (1 default, 0 hidden, -1 trash).

key_is_virtual BooleanField

Indicates whether key is virtual or part of an actual file path.

ulabels ManyToManyField

The ulabels measured in the artifact (ULabel).

input_of ManyToManyField

Runs that use this artifact as an input.

previous_runs ManyToManyField

Sequence of runs that created or updated the record.

feature_sets ManyToManyField

The feature sets measured in the artifact (FeatureSet).

feature_values ManyToManyField

Non-categorical feature values for annotation.

param_values ManyToManyField

Parameter values.

created_at DateTimeField

Time of creation of record.

created_by ForeignKey

Creator of record. User

updated_at DateTimeField

Time of last update to record.

Methods

backed(is_run_input=None)

Return a cloud-backed data object.

Return type:

AnnDataAccessor | BackedAccessor

Notes

For more info, see tutorial: Query arrays.

Examples

Read AnnData in backed mode from cloud:

>>> artifact = ln.Artifact.filter(key="lndb-storage/pbmc68k.h5ad").one()
>>> artifact.backed()
AnnData object with n_obs × n_vars = 70 × 765 backed at 's3://lamindb-ci/lndb-storage/pbmc68k.h5ad'
cache(is_run_input=None)

Download cloud artifact to local cache.

Follows syncing logic: only caches an artifact if it’s outdated in the local cache.

Returns a path to a locally cached on-disk object (say, a .jpg file).

Return type:

Path

Examples

Sync file from cloud and return the local path of the cache:

>>> artifact.cache()
PosixPath('/home/runner/work/Caches/lamindb/lamindb-ci/lndb-storage/pbmc68k.h5ad')
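
The "only if outdated" rule can be sketched with local files standing in for cloud objects. This is an assumption-laden illustration: `sync_to_cache` is a hypothetical helper, and comparing modification times is one plausible way to decide that a cached copy is outdated, not necessarily the check LaminDB performs.

```python
import os
import shutil
import tempfile
from pathlib import Path


def sync_to_cache(remote: Path, cached: Path) -> bool:
    """Copy remote to cached only if the cached copy is missing or older.

    Returns True if a copy happened.
    """
    if cached.exists() and cached.stat().st_mtime >= remote.stat().st_mtime:
        return False  # cache is up to date, nothing to do
    cached.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(remote, cached)  # copy2 preserves the modification time
    return True


with tempfile.TemporaryDirectory() as tmp:
    remote = Path(tmp) / "remote" / "image.jpg"
    cached = Path(tmp) / "cache" / "image.jpg"
    remote.parent.mkdir()
    remote.write_bytes(b"v1")
    # use whole-second timestamps so the comparison is deterministic
    os.utime(remote, (1_000_000_000, 1_000_000_000))
    first_sync = sync_to_cache(remote, cached)   # nothing cached yet
    second_sync = sync_to_cache(remote, cached)  # cache is current
    remote.write_bytes(b"v2")
    os.utime(remote, (1_000_000_010, 1_000_000_010))
    third_sync = sync_to_cache(remote, cached)   # remote is newer
    cached_content = cached.read_bytes()
```
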
delete(permanent=None, storage=None, using_key=None)

Delete.

A first call to .delete() puts an artifact into the trash (sets visibility to -1).

A second call permanently deletes the artifact.

FAQ: Storage FAQ

Parameters:
  • permanent (bool | None, default: None) – Permanently delete the artifact (skip trash).

  • storage (bool | None, default: None) – Indicate whether you want to delete the artifact in storage.

Return type:

None

Examples

For an Artifact object artifact, call:

>>> artifact.delete()
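
The two-step behavior described above can be modeled as a small state machine. `Record` is a hypothetical stand-in and not a LaminDB class; only the trash value -1 comes from the description of delete(), and `VISIBLE` is a placeholder for any non-trash visibility value.

```python
from dataclasses import dataclass
from typing import Optional

TRASH = -1   # visibility value that delete() sets, per the description above
VISIBLE = 1  # placeholder for any non-trash visibility value


@dataclass
class Record:
    """Hypothetical stand-in for an artifact record; not a LaminDB class."""

    visibility: int = VISIBLE
    exists: bool = True

    def delete(self, permanent: Optional[bool] = None) -> None:
        # permanent=True skips the trash; a record that is already in the
        # trash is removed for good on the next call
        if permanent or self.visibility == TRASH:
            self.exists = False
        else:
            self.visibility = TRASH


record = Record()
record.delete()  # first call: moved to trash
after_first = (record.visibility, record.exists)
record.delete()  # second call: permanently deleted
after_second = record.exists

skipped = Record()
skipped.delete(permanent=True)  # never passes through the trash
```
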
classmethod from_anndata(adata, key=None, description=None, run=None, version=None, is_new_version_of=None, **kwargs)

Create from AnnData, validate & link features.

Parameters:
  • adata (AnnData | str | Path) – An AnnData object or a path to an AnnData-like file.

  • key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.h5ad".

  • description (str | None, default: None) – A description.

  • version (str | None, default: None) – A version string.

  • is_new_version_of (Artifact | None, default: None) – An old version of the artifact.

  • run (Run | None, default: None) – The run that creates the artifact.

Return type:

Artifact

See also

Collection()

Track collections.

Feature

Track features.

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> adata = ln.core.datasets.anndata_with_obs()
>>> artifact = ln.Artifact.from_anndata(adata, description="mini anndata with obs")
>>> artifact.save()

classmethod from_df(df, key=None, description=None, run=None, version=None, is_new_version_of=None, **kwargs)

Create from DataFrame, validate & link features.

For more info, see tutorial: Tutorial: Artifacts.

Parameters:
  • df (DataFrame) – A DataFrame object.

  • key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.parquet".

  • description (str | None, default: None) – A description.

  • version (str | None, default: None) – A version string.

  • is_new_version_of (Artifact | None, default: None) – An old version of the artifact.

  • run (Run | None, default: None) – The run that creates the artifact.

Return type:

Artifact

See also

Collection()

Track collections.

Feature

Track features.

Examples

>>> df = ln.core.datasets.df_iris_in_meter_batch1()
>>> df.head()
  sepal_length sepal_width petal_length petal_width iris_organism_code
0        0.051       0.035        0.014       0.002                 0
1        0.049       0.030        0.014       0.002                 0
2        0.047       0.032        0.013       0.002                 0
3        0.046       0.031        0.015       0.002                 0
4        0.050       0.036        0.014       0.002                 0
>>> artifact = ln.Artifact.from_df(df, description="Iris flower collection batch1")
>>> artifact.save()

classmethod from_dir(path, key=None, *, run=None)

Create a list of artifact objects from a directory.

Hint

If you have a high number of files (several 100k) and don’t want to track them individually, create a single Artifact via Artifact(path) for them. See, e.g., RxRx: cell imaging.

Parameters:
  • path (str | Path) – Source path of folder.

  • key (str | None, default: None) – Key for storage destination. If None and the directory is inside a registered storage location, the inferred key reflects the directory's relative position within that location. If None and the directory is outside of a registered storage location, the inferred key defaults to path.name.

  • run (Run | None, default: None) – A Run object.

Return type:

list[Artifact]

Examples

>>> dir_path = ln.core.datasets.generate_cell_ranger_files("sample_001", ln.settings.storage)
>>> artifacts = ln.Artifact.from_dir(dir_path)
>>> ln.save(artifacts)
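
The key inference described for the key parameter can be sketched with pathlib. `infer_key` and its `storage_root` argument are hypothetical and model a single registered storage location; this is an illustration of the rule, not LaminDB's code.

```python
from pathlib import PurePosixPath
from typing import Optional


def infer_key(path: str, storage_root: str, key: Optional[str] = None) -> str:
    """Sketch of the key inference rule for a directory."""
    if key is not None:
        return key  # an explicit key is used as-is
    directory = PurePosixPath(path)
    root = PurePosixPath(storage_root)
    try:
        # inside the storage location: key mirrors the relative position
        return str(directory.relative_to(root))
    except ValueError:
        # outside the storage location: fall back to the directory name
        return directory.name


inside = infer_key("/data/storage/raw/sample_001", "/data/storage")
outside = infer_key("/tmp/sample_001", "/data/storage")
explicit = infer_key("/tmp/sample_001", "/data/storage", key="runs/sample_001")
```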

classmethod from_mudata(mdata, key=None, description=None, run=None, version=None, is_new_version_of=None, **kwargs)

Create from MuData, validate & link features.

Parameters:
  • mdata (MuData) – A MuData object.

  • key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.h5mu".

  • description (str | None, default: None) – A description.

  • version (str | None, default: None) – A version string.

  • is_new_version_of (Artifact | None, default: None) – An old version of the artifact.

  • run (Run | None, default: None) – The run that creates the artifact.

Return type:

Artifact

See also

Collection()

Track collections.

Feature

Track features.

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> mdata = ln.core.datasets.mudata_papalexi21_subset()
>>> artifact = ln.Artifact.from_mudata(mdata, description="a mudata object")
>>> artifact.save()

get_type_display(*, field=<django.db.models.fields.CharField: type>)
load(is_run_input=None, stream=False, **kwargs)

Stage and load to memory.

Returns in-memory representation if possible, e.g., an AnnData object for an h5ad file.

Return type:

Any

Examples

Load as a DataFrame:

>>> df = ln.core.datasets.df_iris_in_meter_batch1()
>>> ln.Artifact.from_df(df, description="iris").save()
>>> artifact = ln.Artifact.filter(description="iris").one()
>>> artifact.load().head()
sepal_length sepal_width petal_length petal_width iris_organism_code
0        0.051       0.035        0.014       0.002                 0
1        0.049       0.030        0.014       0.002                 0
2        0.047       0.032        0.013       0.002                 0
3        0.046       0.031        0.015       0.002                 0
4        0.050       0.036        0.014       0.002                 0

Load as an AnnData:

>>> artifact.load()
AnnData object with n_obs × n_vars = 70 × 765

Fall back to cache() if no in-memory representation is configured:

>>> artifact.load()
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/.lamindb/jb7BY5UJoQVGMUOKiLcn.jpg')
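
The three cases above follow one pattern: pick a loader by suffix and fall back to the file path when none is configured. The sketch below shows that pattern with standard-library loaders; the `LOADERS` registry and the `load` function are hypothetical and much simpler than what LaminDB does.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_json(path: Path):
    return json.loads(path.read_text())


def load_csv(path: Path):
    with path.open(newline="") as f:
        return list(csv.DictReader(f))


# hypothetical suffix -> loader registry
LOADERS = {".json": load_json, ".csv": load_csv}


def load(path: Path):
    """Return an in-memory object if a loader exists, else the path itself."""
    loader = LOADERS.get(path.suffix)
    return loader(path) if loader is not None else path


with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "table.csv"
    table.write_text("a,b\n1,2\n")
    image = Path(tmp) / "image.jpg"
    image.write_bytes(b"\xff\xd8")
    loaded_table = load(table)  # a loader is registered for .csv
    loaded_image = load(image)  # no loader for .jpg: the path is returned
```
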
replace(data, run=None, format=None)

Replace artifact content.

Parameters:
  • data (str | Path) – A file path.

  • run (Run | None, default: None) – The run that created the artifact gets auto-linked if ln.track() was called.

Return type:

None

Examples

Say we made a change to the content of an artifact, e.g., edited the image paradisi05_laminopathic_nuclei.jpg.

This is how we replace the old file in storage with the new file:

>>> artifact.replace("paradisi05_laminopathic_nuclei.jpg")
>>> artifact.save()

Note that this neither changes the storage key nor the filename.

However, it will update the suffix if it changes.

restore()

Restore from trash.

Return type:

None

Examples

For any Artifact object artifact, call:

>>> artifact.restore()
save(upload=None, **kwargs)

Save to database & storage.

Parameters:

upload (bool | None, default: None) – Trigger upload to cloud storage in instances with hybrid storage mode.

Return type:

None

Examples

>>> artifact = ln.Artifact("./myfile.csv", description="myfile")
>>> artifact.save()