## lamindb.Artifact

class lamindb.Artifact(path: AnyPathStr, *, key: str | None = None, description: str | None = None, kind: ArtifactKind | str | None = None, features: dict[str, Any] | None = None, schema: Schema | None = None, revises: Artifact | None = None, overwrite_versions: bool | None = None, run: Run | False | None = None, storage: Storage | None = None, branch: Branch | None = None, space: Space | None = None, skip_hash_lookup: bool | None = None, key_is_virtual: bool | None = None)
class lamindb.Artifact(*db_args)

 Bases: "SQLRecord", "IsVersioned", "TracksRun", "TracksUpdates"

 Datasets & models stored as files, folders, or arrays.

 Some artifacts are table- or array-like, e.g., when stored as
 ".parquet", ".h5ad", ".zarr", or ".tiledb".

 Parameters:
 * **path** -- "AnyPathStr" A path to a local or remote folder or
 file from which to create the artifact.

 * **key** -- "str | None = None" A key within the storage
 location, e.g., ""myfolder/myfile.fcs"". Artifacts with the
 same key form a version family.

 * **description** -- "str | None = None" A description.

 * **kind** -- "Literal["dataset", "model"] | str | None = None"
 Distinguish models from datasets from other files & folders.

 * **features** -- "dict | None = None" External features to
 annotate via "set_values".

 * **schema** -- "Schema | None = None" A schema to validate
 features.

 * **revises** -- "Artifact | None = None" Previous version of
 the artifact. An alternative to passing "key" when creating a
 new version.

 * **overwrite_versions** -- "bool | None = None" Whether to
 overwrite versions. Defaults to "True" for folders and "False"
 for files.

 * **run** -- "Run | bool | None = None" The run that creates the
 artifact. If "False", suppress tracking the run. If "None",
 infer the run from the global run context.

 * **branch** -- "Branch | None = None" The branch of the
 artifact. If "None", uses the current branch.

 * **space** -- "Space | None = None" The space of the artifact.
 If "None", uses the passed "storage.space" if "storage" is
 passed; otherwise uses the default space ("space").

 * **storage** -- "Storage | None = None" The storage location
 for the artifact. If "None", uses a storage location of the
 "space" if "space" is passed; otherwise uses the default
 storage location ("storage").

 * **skip_hash_lookup** -- "bool | None = None" Controls
 hash-based deduplication. If "None", checks hashes for upload flows
 and skips hash lookup for paths already in registered storage.
 If "True", always skips hash lookup. If "False", always
 attempts hash lookup. Empty files are always treated as if
 this were "True" because empty content hashes are not used for
 deduplication.

 * **key_is_virtual** -- "bool | None = None" Whether to use a
 virtual key for managed storage paths. If "None", uses the
 current default via "_artifact_use_virtual_keys". Inspect the
 current default via
 "ln.settings.creation._artifact_use_virtual_keys" and change
 it globally, e.g.,
 "ln.settings.creation._artifact_use_virtual_keys = False". If
 "True", "key" is treated as metadata for versioning/querying
 and the on-storage path is auto-generated from the artifact
 "uid". If "False", "key" is treated as the concrete relative
 storage path for writes in managed storage.
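
 The three states of "skip_hash_lookup" described above can be summarized as a small decision function. This is a minimal sketch of the stated rules, not lamindb's implementation; the function name and the two boolean inputs ("in_registered_storage", "is_empty") are hypothetical.

```python
from typing import Optional


def should_lookup_hash(
    skip_hash_lookup: Optional[bool],
    in_registered_storage: bool,
    is_empty: bool,
) -> bool:
    """Return True if a hash lookup would be attempted (illustrative only)."""
    if is_empty:
        # empty files are always treated as if skip_hash_lookup were True
        return False
    if skip_hash_lookup is None:
        # default: look up hashes for upload flows,
        # skip the lookup for paths already in registered storage
        return not in_registered_storage
    # explicit True skips the lookup, explicit False attempts it
    return not skip_hash_lookup
```

 Under these assumptions, only an explicit "False" forces a lookup for a path that already lives in registered storage.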

 See also:

 "Storage"
 Storage locations for artifacts.

 "Collection"
 Collections of artifacts.

 "from_dir()"
 Bulk-create artifacts for each file in a directory.

 "from_dataframe()"
 Create an artifact from a "DataFrame".

 "from_anndata()"
 Create an artifact from an "AnnData".

 "from_spatialdata()"
 Create an artifact from a "SpatialData".

 "from_mudata()"
 Create an artifact from a "MuData".

 "from_tiledbsoma()"
 Create an artifact from a "tiledbsoma" store.

 "from_lazy()"
 Create a lazy artifact for streaming to auto-generated
 internal paths.

# Examples

 Create an artifact **from a local file or folder**:

 import lamindb as ln

 artifact = ln.Artifact("./my_file.parquet", key="examples/my_file.parquet").save()
 artifact = ln.Artifact("./my_folder", key="project1/my_folder").save()

 Calling ".save()" copies or uploads the file to the default storage
 location of your lamindb instance. If you create an artifact **from
 a remote file or folder**, lamindb registers the S3 "key" and
 avoids copying the data:

 artifact = ln.Artifact("s3://my_bucket/my_folder/my_file.csv").save()  # can omit key/description because file is remote

 If you then want to query & access the artifact later on, this is
 how you do it:

 artifact = ln.Artifact.get(key="examples/my_file.parquet")
 cached_path = artifact.cache()  # sync to local cache & get local path

 If the storage format supports it, you can load the artifact
 directly into memory or query it through a streaming interface,
 e.g., for parquet files:

 df = artifact.load() # load parquet file as DataFrame
 pyarrow_dataset = artifact.open()  # open as a pyarrow dataset for streaming queries

 To bulk-create artifacts for every file in a directory and **group
 them in a folder**, use "from_dir()":

 artifacts = ln.Artifact.from_dir("project_alpha/run_001").save()  # create one artifact per file in the directory
 artifacts = ln.Artifact.filter(key__startswith="project_alpha/run_001/")  # query ingested artifacts via the folder prefix

 To create a **versioned immutable collection** of artifacts for a
 data release, use "Collection":

 collection = ln.Collection(artifacts, key="project_alpha/run_001").save()

 -[ Virtual folders (key prefixes) vs. "Collection" objects ]-

 * prefix query on "key": If a colleague adds a new file to that
 prefix tomorrow, your "filter(key__startswith=...)" result will
 change.

 * collection: A collection object provides a "uid" for every
 version and its content won't change.
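
 The difference can be illustrated without a database. The sketch below uses a plain list of keys as a hypothetical stand-in for the artifact registry and a tuple as a stand-in for one collection version; it does not use lamindb itself.

```python
# hypothetical in-memory stand-in for the artifact registry: a list of keys
registry = [
    "project_alpha/run_001/a.parquet",
    "project_alpha/run_001/b.parquet",
]


def query_prefix(prefix: str) -> list:
    """Evaluate the prefix filter against the current registry state."""
    return [key for key in registry if key.startswith(prefix)]


# a collection version pins its members at creation time
collection_v1 = tuple(query_prefix("project_alpha/run_001/"))

# a colleague adds a file under the same prefix
registry.append("project_alpha/run_001/c.parquet")

live_result = query_prefix("project_alpha/run_001/")
print(len(live_result))  # the prefix query now sees the new file
print(len(collection_v1))  # the pinned members are unchanged
```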

 If you want to **validate & annotate** a dataframe or an array
 using the feature & label registries, pass "schema" to one of the
 ".from_dataframe()", ".from_anndata()", ... constructors:

 artifact = ln.Artifact.from_dataframe(
     "./my_file.parquet",
     key="my_dataset.parquet",
     schema="valid_features",
 ).save()

 To annotate by **external features**:

 artifact = ln.Artifact("./my_file.parquet", features={"cell_type_by_model": "T cell"}).save()

 You can make a **new version** of an artifact by passing an
 existing "key":

 artifact_v2 = ln.Artifact("./my_file.parquet", key="examples/my_file.parquet").save()
 artifact_v2.versions.to_dataframe()  # see all versions

 You can write artifacts to **non-default storage locations** by
 passing the "storage" argument:

 storage_loc = ln.Storage.get(root="s3://my_bucket")  # get storage location, or create via ln.Storage(root="s3://my_bucket").save()
 ln.Artifact("./my_file.parquet", key="examples/my_file.parquet", storage=storage_loc).save()  # upload to s3://my_bucket

# Notes

 -[ Storage formats & object types ]-

 The "Artifact" registry tracks the storage format via "suffix" and
 an abstract object type via "otype".

| description | "suffix" | "otype" | Python type examples |
| --- | --- | --- | --- |
| table | ".csv", ".tsv", ".parquet", ".ipc" | ""DataFrame"" | "pandas.DataFrame", "polars.DataFrame", "pyarrow.Table" |
| annotated matrix | ".h5ad", ".zarr", ".h5mu" | ""AnnData"" | "anndata.AnnData" |
| stacked matrix | ".zarr", ".tiledbsoma" | ""MuData"", ""tiledbsoma"" | "mudata.MuData", "tiledbsoma.Experiment" |
| spatial data | ".zarr" | ""SpatialData"" | "spatialdata.SpatialData" |
| generic arrays | ".h5", ".zarr", ".tiledb" | --- | "h5py.Dataset", "zarr.Array", "tiledb.Array" |
| unstructured | ".fastq", ".pdf", ".vcf", ".html" | --- | --- |

 You can map storage formats onto **R types**, e.g., an "AnnData"
 might be accessed via "anndataR".

 Because "otype" accepts any "str", you can define custom object
 types that enable queries & logic that you need, e.g.,
 ""SingleCellExperiment"" or ""MyCustomZarrDataStructure"".

 LaminDB makes some default choices (e.g., serialize a "DataFrame"
 as a ".parquet" file).

 -[ Will artifacts get duplicated? ]-

 If an artifact with the exact same hash already exists,
 "Artifact()" returns the existing artifact. Exception: paths that
 already live in a registered storage location and empty files skip
 hash deduplication by default.

 In concurrent workloads where the same artifact is created
 repeatedly at the exact same time, ".save()" detects the
 duplication and will return the existing artifact.
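
 The general idea of hash-based deduplication can be sketched with the standard library. This is an illustration of the technique only: the "Record" class, the "by_hash" index, and the choice of SHA-256 are assumptions and do not reflect how lamindb computes or stores hashes.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Record:
    key: str
    content_hash: str


# hypothetical index from content hash to the record that was created first
by_hash = {}


def create(key: str, content: bytes) -> Record:
    """Return the existing record if identical content was registered before."""
    digest = hashlib.sha256(content).hexdigest()
    existing = by_hash.get(digest)
    if existing is not None:
        return existing  # same content: reuse instead of duplicating
    record = Record(key=key, content_hash=digest)
    by_hash[digest] = record
    return record


first = create("examples/a.csv", b"x,y\n1,2\n")
second = create("examples/copy_of_a.csv", b"x,y\n1,2\n")
third = create("examples/b.csv", b"x,y\n3,4\n")
```

 Here "second" is the same object as "first" because the content is identical, while "third" is a new record.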

 -[ I cannot come up with a good file name, can I avoid mapping
 artifacts into a hierarchy? ]-

 Sometimes you want to **avoid mapping the artifact into a path
 hierarchy**. You can do so by omitting the "key" argument and only
 passing "description". However, note that a shared "description"
 does not trigger mapping artifacts into the same version family.

 artifact = ln.Artifact("./my_folder", description="My folder").save()
 artifact_v2 = ln.Artifact("./my_folder", revises=artifact).save()  # need to version based on `revises`, a shared description does not trigger a new version

 -[ Why does the constructor look the way it looks? ]-

 It's inspired by APIs building on AWS S3.

 Both boto3 and quilt select a bucket (a storage location in
 LaminDB) and define a target path through a "key" argument.

 In boto3:

 # signature: S3.Bucket.upload_file(filepath, key)
 import boto3
 s3 = boto3.resource('s3')
 bucket = s3.Bucket('mybucket')
 bucket.upload_file('/tmp/hello.txt', 'hello.txt')

 In quilt3:

 # signature: quilt3.Bucket.put_file(key, filepath)
 import quilt3
 bucket = quilt3.Bucket('mybucket')
 bucket.put_file('hello.txt', '/tmp/hello.txt')

 property features: FeatureManager

 Feature manager.

 Define a few features:

 species_name = ln.Feature(name="species_name", dtype=str).save()
 scientist_names = ln.Feature(name="scientist_names", dtype=str).save()
 temperature = ln.Feature(name="temperature_in_celsius", dtype=float).save()
 experiment = ln.Feature(name="experiment", dtype=str).save()

 Annotate with features via "set_values()":

 -[ Via strings ]-

 artifact.features.set_values({
     "species_name": "human",
     "scientist_names": ["Barbara McClintock", "Edgar Anderson"],
     "temperature_in_celsius": 27.6,
     "experiment": "Experiment 1",
 })

 -[ Via objects ]-

 artifact.features.set_values({
     species_name: "human",
     scientist_names: ["Barbara McClintock", "Edgar Anderson"],
     temperature: 27.6,
     experiment: "Experiment 1",
 })

 Query artifacts by features:

 -[ Via strings ]-

 ln.Artifact.filter(scientist_names="Barbara McClintock")

 -[ Via objects ]-

 ln.Artifact.filter(scientist_names == "Barbara McClintock")

 Get all feature annotations as a dictionary:

 artifact.features.get_values()
 #> {
 #> "species_name": "human",
 #> "scientist_names": ["Barbara McClintock", "Edgar Anderson"],
 #> "temperature_in_celsius": 27.6,
 #> "experiment": "Experiment 1"
 #> }

 Get a value for a single feature, returning categoricals as
 Python objects:

 organism = artifact.features["species_name"]  # returns an Organism object, not "human"
 temperature = artifact.features["temperature_in_celsius"]  # returns a temperature value, a float

 -[ Dataset features vs. external features ]-

 Features may or may not be stored in the dataset, i.e., the
 artifact content in storage. If you pass a schema to
 "from_dataframe()" you validate the columns of the "DataFrame"
 and annotate it with values parsed from these columns.
 "artifact.features.set_values()", by contrast, does **not**
 validate the content of the artifact but annotates it with
 external features.

 property labels: LabelManager

 Label manager.

 A way to access all label annotations of an artifact,
 irrespective of their type.

 To annotate with labels, use the type-specific accessor, for
 example:

 ulabel = ln.ULabel(name="raw_data").save()
 artifact.ulabels.add(ulabel)
 project = ln.Project(name="Project A").save()
 artifact.projects.add(project)

 property transform: Transform | None

 Transform whose run created the artifact.

 property overwrite_versions: bool

 Indicates whether to keep or overwrite versions.

 It defaults to "False" for file-like artifacts and to "True" for
 folder-like artifacts.

 Note that keeping versions ("False") requires significant storage
 space for large folders with many duplicated files. Currently,
 "lamindb" does *not* de-duplicate files across versions as in git,
 but keeps all files for all versions of the folder in storage.

 property path: UPath

 Path.

 Examples:

 import lamindb as ln

 # File in cloud storage, here AWS S3:
 artifact = ln.Artifact("s3://my-bucket/my-file.csv").save()
 artifact.path
 #> S3QueryPath('s3://my-bucket/my-file.csv')

 # File in local storage:
 artifact = ln.Artifact("./myfile.csv", key="myfile.csv").save()
 artifact.path
 #> PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/myfile.csv')

 uid: str

 A universal random id.

 key: str | None

 A (virtual) relative file path within the artifact's storage
 location.

 Setting a "key" is useful to automatically group artifacts into
 a version family.

 LaminDB defaults to a virtual file path to make renaming of data
 in object storage easy.

 If you register existing files in a storage location, the "key"
 equals the actual file path on the underlying filesystem or object
 store.
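
 A minimal sketch of why a virtual key makes renaming cheap: the on-storage path depends only on the "uid", so changing the "key" touches metadata but not the stored object. The helper name and the ".lamindb/" folder layout are assumptions made for illustration, not a statement about lamindb's internals.

```python
from pathlib import PurePosixPath


def resolve_storage_path(root: str, uid: str, key: str, key_is_virtual: bool) -> str:
    """Return the on-storage path under the two key modes (illustrative only)."""
    if key_is_virtual:
        # the path is derived from the uid; the key only contributes the suffix
        suffix = PurePosixPath(key).suffix
        return f"{root}/.lamindb/{uid}{suffix}"
    # the key is the concrete relative path inside the storage location
    return f"{root}/{key}"


before = resolve_storage_path("s3://my_bucket", "abc123", "examples/my_file.parquet", True)
after = resolve_storage_path("s3://my_bucket", "abc123", "renamed/my_file.parquet", True)
concrete = resolve_storage_path("s3://my_bucket", "abc123", "examples/my_file.parquet", False)
```

 With a virtual key, "before" and "after" are identical even though the key changed; with a concrete key, the path follows the key.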

 description: str | None

 A description.

 suffix: str

 The path suffix or an empty string if no suffix exists.

 This is either a file suffix ("".csv"", "".h5ad"", etc.) or the
 empty string "".

 kind: ArtifactKind | str | None

 "ArtifactKind" or custom "str" value (default "None").

 otype: Literal['DataFrame', 'AnnData', 'MuData', 'SpatialData', 'tiledbsoma'] | str | None

 The object type represented as a string.

 The field is automatically set when using the
 "from_dataframe()", "from_anndata()", ... constructors.
 Unstructured artifacts have "otype=None".

 The field also accepts custom "str" values to allow for building
 logic around them in third-party packages.

 See section storage formats & object types for more background.

 size: int | None

 The size in bytes.

 Examples: 1KB is 1e3 bytes, 1MB is 1e6, 1GB is 1e9, 1TB is 1e12
 etc.
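
 Since the units above are decimal (powers of 1000, not 1024), a byte count can be rendered for display as follows. The "format_size" helper is a hypothetical convenience function, not part of lamindb.

```python
def format_size(size: int) -> str:
    """Format a byte count using decimal units (1 KB = 1e3 bytes)."""
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(size)
    for unit in units[:-1]:
        if value < 1000:
            return f"{value:.1f} {unit}"
        value /= 1000
    # anything at or above 1e12 bytes is reported in TB
    return f"{value:.1f} {units[-1]}"


print(format_size(1500))
print(format_size(2_000_000_000))
```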

 hash: str | None

 The hash or pseudo-hash of the artifact content in storage.

 Useful to ascertain integrity and avoid duplication.

 Different versions of the artifact have different hashes.

 n_files: int | None

 The number of files for folder-like artifacts.

 Is "None" for file-like artifacts.

 Note that some arrays are also stored as folders, e.g., ".zarr"
 or ".tiledbsoma".

 n_observations: int | None

 The number of observations in this artifact.

 Typically, this denotes the first array dimension.

 extra_data: dict | None

 Extra data in JSON format, not validated as features.

 storage: Storage

 Storage location, e.g. an S3 or GCP bucket or a local directory
 ← "artifacts".

 schema: Schema | None

 The validating schema of this artifact ← "validated_artifacts".

 The validating schema is helpful to query artifacts that were
 validated by the same schema.

 input_of_runs: RelatedManager[Run]

 The runs that use this artifact as an input ← "input_artifacts".

 recreating_runs: RelatedManager[Run]

 The runs that re-created the artifact after its initial creation
 ← "recreated_artifacts".

 schemas: RelatedManager[Schema]

 The inferred schemas of this artifact ← "artifacts".

 The inferred schemas are helpful to answer the question: "Which
 features are present in the artifact?"

 The validating schema typically allows a range of valid actual
 dataset schemas. The inferred schemas link the actual schemas of
 the artifact, and are auto-generated by parsing the artifact
 content during validation.

 json_values: RelatedManager[JsonValue]

 The feature-indexed JSON values annotating this artifact ←
 "artifacts".

 artifacts: RelatedManager[Artifact]

 The annotating artifacts of this artifact ←
 "linked_by_artifacts".

 linked_in_records: RelatedManager[Record]

 The records linking this artifact as a feature value ←
 "linked_artifacts".

 users: RelatedManager[User]

 The users annotating this artifact ← "artifacts".

 runs: RelatedManager[Run]

 The runs annotating this artifact ← "artifacts".

 linked_by_runs: RelatedManager[Run]

 The runs linking this artifact ← "linked_by_artifacts".

 ulabels: RelatedManager[ULabel]

 The ulabels annotating this artifact ← "artifacts".

 linked_by_artifacts: RelatedManager[Artifact]

 The artifacts annotated by this artifact ← "artifacts".

 collections: RelatedManager[Collection]

 The collections that this artifact is part of ← "artifacts".

 records: RelatedManager[Record]

 The records annotating this artifact ← "artifacts".

 references: RelatedManager[Reference]

 The references annotating this artifact ← "artifacts".

 projects: RelatedManager[Project]

 The projects annotating this artifact ← "artifacts".

 ablocks: RelatedManager[ArtifactBlock]

 Attached blocks ← "artifact".

 get(*, key=None, path=None, is_run_input=False, **expressions)

 Get a single record.

 Parameters:
 * **idlike** ("int" | "str" | "None", default: "None") --
 Either a uid stub, uid or an integer id.

 * **expressions** -- Fields and values passed as Django query
 expressions.

 Raises:
 **lamindb.errors.ObjectDoesNotExist** -- In case no matching
 record is found.

 Return type:
 "Artifact"

 See also:

 * Guide: Query & search

 * Django documentation: Queries

 -[ Examples ]-

 record = ln.Record.get("FvtpPJLJ")
 record = ln.Record.get(name="my-label")

 filter(**expressions)

 Query records.

 Parameters:
 * **queries** -- One or multiple "Q" objects.

 * **expressions** -- Fields and values passed as Django query
 expressions.

 Return type:
 "QuerySet"

 See also:

 * Guide: Query & search

 * Django documentation: Queries

 -[ Examples ]-

 >>> ln.Project(name="my label").save()
 >>> ln.Project.filter(name__startswith="my").to_dataframe()

 classmethod from_lazy(suffix, overwrite_versions, key=None, description=None, run=None, **kwargs)

 Create a lazy artifact for streaming to auto-generated internal
 paths.

 This is needed when it is desirable to stream to a "lamindb"
 auto-generated internal path and register the path as an
 artifact. It allows writing directly into the default cloud (or
 local) storage of the current instance and then saving as an
 "Artifact".

 The lazy artifact object (see "LazyArtifact") creates a real
 artifact on ".save()" with the provided arguments.

 Parameters:
 * **suffix** ("str") -- The suffix for the auto-generated
 internal path

 * **overwrite_versions** ("bool") -- Whether to overwrite
 versions.

 * **key** ("str" | "None", default: "None") -- An optional
 key to reference the artifact.

 * **description** ("str" | "None", default: "None") -- A
 description.

 * **run** ("Run" | "None", default: "None") -- The run that
 creates the artifact.

 * ****kwargs** -- Other keyword arguments for the artifact to
 be created.

 Return type:
 "LazyArtifact"

 -[ Examples ]-

 Local storage: create a lazy artifact, stream to the path, then
 save:

 import numpy as np
 import zarr

 lazy = ln.Artifact.from_lazy(suffix=".zarr", overwrite_versions=True, key="mydata.zarr")
 zarr.open(lazy.path, mode="w")["test"] = np.array(["test"])
 artifact = lazy.save()

 Cloud storage (e.g. S3): use "zarr.storage.FsspecStore" to
 stream arrays:

 lazy = ln.Artifact.from_lazy(suffix=".zarr", overwrite_versions=True, key="mydata.zarr")
 store = zarr.storage.FsspecStore.from_url(lazy.path.as_posix())
 group = zarr.open(store, mode="w")
 group["ones"] = np.ones(3)
 artifact = lazy.save()

 classmethod from_dataframe(df, *, key=None, description=None, run=None, revises=None, schema=None, features=None, parquet_kwargs=None, csv_kwargs=None, **kwargs)

 Create from "DataFrame", optionally validate & annotate.

 Sets ".otype" to ""DataFrame"" and populates ".n_observations".

 Parameters:
 * **df** ("DataFrame" | "str" | "Path" | "UPath") -- A
 "DataFrame" object or an "AnyPathStr" pointing to a
 "DataFrame" in storage, e.g. a ".parquet" or ".csv" file.

 * **key** ("str" | "None", default: "None") -- A relative
 path within default storage, e.g.,
 ""myfolder/myfile.parquet"".

 * **description** ("str" | "None", default: "None") -- A
 description.

 * **revises** ("Artifact" | "None", default: "None") -- An
 old version of the artifact.

 * **run** ("Run" | "None", default: "None") -- The run that
 creates the artifact.

 * **schema** ("Schema" | "Literal"["'valid_features'"] |
 "None", default: "None") -- A schema that defines how to
 validate & annotate.

 * **features** ("dict"["str", "Any"] | "None", default:
 "None") -- Additional external features to annotate the
 artifact via "set_values" (keys can be feature names or
 "Feature" objects).

 * **parquet_kwargs** ("dict"["str", "Any"] | "None", default:
 "None") -- Additional keyword arguments passed to the
 "pandas.DataFrame.to_parquet" method, which are passed on
 to "pyarrow.parquet.ParquetWriter".

 * **csv_kwargs** ("dict"["str", "Any"] | "None", default:
 "None") -- Additional keyword arguments passed to the
 "pandas.DataFrame.to_csv" method.

 Return type:
 "Artifact"

 -[ Examples ]-

 No validation and annotation:

 ln.Artifact.from_dataframe(df, key="examples/dataset1.parquet").save()

 With validation and annotation:

 ln.Artifact.from_dataframe(df, key="examples/dataset1.parquet", schema="valid_features").save()

 Under the hood, this uses the following built-in schema
 ("valid_features()"):

 schema = ln.Schema(name="valid_features", itype="Feature").save()

 External features:

 import lamindb as ln
 from datetime import date

 df = ln.examples.datasets.mini_immuno.get_dataset1(otype="DataFrame")

 temperature = ln.Feature(name="temperature", dtype=float).save()
 date_of_study = ln.Feature(name="date_of_study", dtype=date).save()
 external_schema = ln.Schema(features=[temperature, date_of_study]).save()

 concentration = ln.Feature(name="concentration", dtype=str).save()
 donor = ln.Feature(name="donor", dtype=str, nullable=True).save()
 schema = ln.Schema(
     features=[concentration, donor],
     slots={"__external__": external_schema},
     otype="DataFrame",
 ).save()

 artifact = ln.Artifact.from_dataframe(
     df,
     key="examples/dataset1.parquet",
     features={"temperature": 21.6, "date_of_study": date(2024, 10, 1)},
     schema=schema,
 ).save()
 artifact.describe()

 Parquet kwargs:

 import lamindb as ln
 import pandas as pd
 import pyarrow.parquet as pq

 def test_parquet_kwargs():
     df = pd.DataFrame(
         {
             "a": [3, 1, 4, 2],
             "b": ["c", "a", "d", "b"],
             "c": [3.3, 1.1, 4.4, 2.2],
         }
     )
     df_sorted = df.sort_values(by=["a", "b"])
     sorting_columns = [
         pq.SortingColumn(0, descending=False, nulls_first=False),
         pq.SortingColumn(1, descending=False, nulls_first=False),
     ]
     artifact = ln.Artifact.from_dataframe(
         df_sorted,
         key="df_sorted.parquet",
         parquet_kwargs={"sorting_columns": sorting_columns},
     ).save()
     pyarrow_dataset = artifact.open()
     fragment = next(pyarrow_dataset.get_fragments())
     assert list(fragment.metadata.row_group(0).sorting_columns) == sorting_columns

 classmethod from_anndata(adata, *, key=None, description=None, run=None, revises=None, schema=None, format=None, h5ad_kwargs=None, zarr_kwargs=None, **kwargs)

 Create from "AnnData", optionally validate & annotate.

 Sets ".otype" to ""AnnData"" and populates ".n_observations".

 Parameters:
 * **adata** ("AnnData" | "str" | "Path" | "UPath") -- An
 "AnnData" object or a path to an AnnData-like object in storage.

 * **key** ("str" | "None", default: "None") -- A relative
 path within default storage, e.g.,
 ""myfolder/myfile.h5ad"".

 * **description** ("str" | "None", default: "None") -- A
 description.

 * **revises** ("Artifact" | "None", default: "None") -- An
 old version of the artifact.

 * **run** ("Run" | "None", default: "None") -- The run that
 creates the artifact.

 * **schema** ("Schema" |
 "Literal"["'ensembl_gene_ids_and_valid_features_in_obs'"] |
 "None", default: "None") -- A schema that defines how to
 validate & annotate.

 * **format** ("Literal"["'h5ad'", "'zarr'", "'anndata.zarr'"]
 | "None", default: "None") -- Storage format used when
 writing in-memory "AnnData". In-memory "AnnData" is first
 written to cache in this format, then saved to instance
 storage when calling ".save()". If "None", infer from "key"
 suffix when available, otherwise default to ""h5ad"". If
 provided, suffix is formed as ""." + format" (e.g.,
 ""zarr"" -> "".zarr"").

 * **h5ad_kwargs** ("dict"["str", "Any"] | "None", default:
 "None") -- Additional keyword arguments passed to the
 "anndata.AnnData.write_h5ad" method when writing in-memory
 "AnnData" to cache.

 * **zarr_kwargs** ("dict"["str", "Any"] | "None", default:
 "None") -- Additional keyword arguments passed to the
 "anndata.AnnData.write_zarr" method when writing in-memory
 "AnnData" to cache. Use "key" with suffix ".zarr" or pass
 "format="zarr"" for this to work.

 Return type:
 "Artifact"

 See also:

 "Collection()"
 Track collections.

 "Feature"
 Track features.

 -[ Example ]-

 Write H5AD with custom serialization settings:

 ln.Artifact.from_anndata(
     adata,
     key="examples/dataset1.h5ad",
     h5ad_kwargs={"compression": "gzip"},
 ).save()

 Write Zarr with custom chunking settings:

 ln.Artifact.from_anndata(
     adata,
     key="examples/dataset1.zarr",
     format="zarr",
     zarr_kwargs={"chunks": [1024, 1024]},
 ).save()
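
 The rule documented for "format" (infer from the "key" suffix when "format" is "None", default to "h5ad" otherwise, and form the suffix as "." + format) can be sketched as a small function. The name "infer_format" and the suffix-matching order are assumptions for illustration; lamindb's actual inference may differ in detail.

```python
from typing import Optional, Tuple

# known formats, longest suffix first so ".anndata.zarr" wins over ".zarr"
KNOWN_FORMATS = ("anndata.zarr", "zarr", "h5ad")


def infer_format(key: Optional[str], format: Optional[str]) -> Tuple[str, str]:
    """Return (format, suffix) following the documented rule (illustrative only)."""
    if format is None:
        format = "h5ad"  # documented default when nothing can be inferred
        if key is not None:
            for candidate in KNOWN_FORMATS:
                if key.endswith("." + candidate):
                    format = candidate
                    break
    # the suffix is always formed as "." + format
    return format, "." + format


print(infer_format("examples/dataset1.zarr", None))
print(infer_format(None, None))
```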

 No validation and annotation:

 ln.Artifact.from_anndata(adata, key="examples/dataset1.h5ad").save()

 With validation and annotation:

 ln.Artifact.from_anndata(adata, key="examples/dataset1.h5ad", schema="ensembl_gene_ids_and_valid_features_in_obs").save()

 Under the hood, this uses the following built-in schema
 ("anndata_ensembl_gene_ids_and_valid_features_in_obs()"):

 import bionty as bt

 import lamindb as ln

 obs_schema = ln.examples.schemas.valid_features()
 varT_schema = ln.Schema(
     name="valid_ensembl_gene_ids", itype=bt.Gene.ensembl_gene_id
 ).save()
 schema = ln.Schema(
     name="anndata_ensembl_gene_ids_and_valid_features_in_obs",
     otype="AnnData",
     slots={"obs": obs_schema, "var.T": varT_schema},
 ).save()

 This schema transposes the "var" DataFrame during curation, so
 that one validates and annotates the columns of "var.T", i.e.,
 "[ENSG00000153563, ENSG00000010610, ENSG00000170458]". If one
 doesn't transpose, one would annotate the columns of "var",
 i.e., "[gene_symbol, gene_type]".


 classmethod from_mudata(mdata, *, key=None, description=None, run=None, revises=None, schema=None, **kwargs)

 Create from "MuData", optionally validate & annotate.

 Sets ".otype" to ""MuData"".

 Parameters:
 * **mdata** ("MuData" | "str" | "Path" | "UPath") -- A
 "MuData" object or a path to a "MuData" file.

 * **key** ("str" | "None", default: "None") -- A relative
 path within default storage, e.g.,
 ""myfolder/myfile.h5mu"".

 * **description** ("str" | "None", default: "None") -- A
 description.

 * **revises** ("Artifact" | "None", default: "None") -- An
 old version of the artifact.

 * **run** ("Run" | "None", default: "None") -- The run that
 creates the artifact.

 * **schema** ("Schema" | "None", default: "None") -- A schema
 that defines how to validate & annotate.

 Return type:
 "Artifact"

 See also:

 "Collection()"
 Track collections.

 "Feature"
 Track features.

 Example:

 import lamindb as ln

 mdata = ln.examples.datasets.mudata_papalexi21_subset()
 artifact = ln.Artifact.from_mudata(mdata, key="mudata_papalexi21_subset.h5mu").save()

 classmethod from_spatialdata(sdata, *, key=None, description=None, run=None, revises=None, schema=None, **kwargs)

 Create from "SpatialData", optionally validate & annotate.

 Sets ".otype" to ""SpatialData"".

 Background: blog.lamin.ai/spatialdata.

 Parameters:
 * **sdata** (SpatialData | AnyPathStr) -- A "SpatialData"
 object or a path to a "SpatialData" store.

 * **key** (str | None, default: "None") -- A relative path
 within default storage, e.g., ""myfolder/myfile.zarr"".

 * **description** (str | None, default: "None") -- A
 description.

 * **revises** (Artifact | None, default: "None") -- An old
 version of the artifact.

 * **run** (Run | None, default: "None") -- The run that
 creates the artifact.

 * **schema** (Schema | None, default: "None") -- A schema
 that defines how to validate & annotate.

 Return type:
 Artifact

 See also:

 "Collection()"
 Track collections.

 "Feature"
 Track features.

 -[ Example ]-

 No validation and annotation:

 import lamindb as ln

 artifact = ln.Artifact.from_spatialdata(sdata, key="my_dataset.zarr").save()

 With validation and annotation. First, find a "SpatialData"
 schema, e.g.:

 ln.Schema.filter(otype="SpatialData").to_dataframe()
 schema = ln.Schema.get(name="spatialdata_blobs_schema")

 Then, pass the schema to the "from_spatialdata" method:

 artifact = ln.Artifact.from_spatialdata(sdata, key="my_dataset.zarr", schema=schema).save()

 You can also define a schema from scratch:

 import lamindb as ln
 import bionty as bt

 # a very comprehensive schema for different slots of a SpatialData object

 # define or query features
 bio_dict = ln.Feature(name="bio", dtype=dict).save()
 tech_dict = ln.Feature(name="tech", dtype=dict).save()
 disease = ln.Feature(name="disease", dtype=bt.Disease, coerce=True).save()
 developmental_stage = ln.Feature(
     name="developmental_stage",
     dtype=bt.DevelopmentalStage,
     coerce=True,
 ).save()
 assay = ln.Feature(name="assay", dtype=bt.ExperimentalFactor, coerce=True).save()
 sample_region = ln.Feature(name="sample_region", dtype=str).save()
 analysis = ln.Feature(name="analysis", dtype=str).save()

 # define or query schema components
 attrs_schema = ln.Schema([bio_dict, tech_dict]).save()
 sample_schema = ln.Schema([disease, developmental_stage]).save()
 tech_schema = ln.Schema([assay]).save()
 obs_schema = ln.Schema([sample_region]).save()
 uns_schema = ln.Schema([analysis]).save()
 # enforces only registered Ensembl Gene IDs pass validation (maximal_set=True)
 varT_schema = ln.Schema(itype=bt.Gene.ensembl_gene_id, maximal_set=True).save()

 # compose the SpatialData schema
 sdata_schema = ln.Schema(
     name="spatialdata_blobs_schema",
     otype="SpatialData",
     slots={
         "attrs:bio": sample_schema,
         "attrs:tech": tech_schema,
         "attrs": attrs_schema,
         "tables:table:obs": obs_schema,
         "tables:table:var.T": varT_schema,
     },
 ).save()

 classmethod from_tiledbsoma(exp, *, key=None, description=None, run=None, revises=None, **kwargs)

 Create from a "tiledbsoma.Experiment" store.

 Sets ".otype" to ""tiledbsoma"" and populates ".n_observations".

 Parameters:
 * **exp** (SOMAExperiment | AnyPathStr) -- TileDB-SOMA
 Experiment object or path to Experiment store.

 * **key** (str | None, default: "None") -- A relative path
 within default storage, e.g.,
 ""myfolder/mystore.tiledbsoma"".

 * **description** (str | None, default: "None") -- A
 description.

 * **revises** (Artifact | None, default: "None") -- An old
 version of the artifact.

 * **run** (Run | None, default: "None") -- The run that
 creates the artifact.

 Return type:
 Artifact

 Example:

 import lamindb as ln

 artifact = ln.Artifact.from_tiledbsoma("s3://mybucket/store.tiledbsoma", description="a tiledbsoma store").save()

 classmethod from_dir(path, *, key=None, run=None)

 Create a list of "Artifact" objects from a directory.

 Hint:

 If you have a high number of files (several 100k) and don't
 want to track them individually, create a single "Artifact"
 via "Artifact(path)" for them. See, e.g., RxRx: cell imaging.

 Parameters:
 * **path** ("str" | "Path" | "UPath") -- Source path of
 folder.

 * **key** ("str" | "None", default: "None") -- Key for
 storage destination. If "None" and directory is in a
 registered location, the inferred "key" will reflect the
 relative position. If "None" and directory is outside of a
 registered storage location, the inferred key defaults to
 "path.name".

| * **run** ("Run" | "None", default: "None") -- A "Run" |
 object.

 Return type:
 "SQLRecordList"

 Example:

 import lamindb as ln

 dir_path = ln.examples.datasets.dir_scrnaseq_cellranger("sample_001", ln.settings.storage)
 ln.Artifact.from_dir(dir_path).save()  # creates one artifact per file in dir_path
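
 The key inference rule described for "key=None" can be pictured
 with a small standalone sketch. This is a toy model of the
 documented behavior, not lamindb's implementation; the function
 name "infer_key" and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def infer_key(dir_path: str, storage_root: str) -> str:
    """Toy model of the documented rule (hypothetical helper, not lamindb code)."""
    path = PurePosixPath(dir_path)
    root = PurePosixPath(storage_root)
    if path.is_relative_to(root):  # requires Python 3.9+
        # directory lives inside the registered storage location:
        # the key reflects the position relative to the storage root
        return str(path.relative_to(root))
    # directory lives outside of the storage location: fall back to the folder name
    return path.name


inside = infer_key("/data/storage/project/sample_001", "/data/storage")
outside = infer_key("/tmp/downloads/sample_001", "/data/storage")
print(inside)   # project/sample_001
print(outside)  # sample_001
```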

 replace(data, run=None, format=None)

 Replace the artifact content in storage **without** making a new
 version.

 **Note:** If you want to create a new version, do **not** use
 the ".replace()" method but rather any "Artifact" constructor.

 Parameters:
 * **data** ("str" | "Path" | "UPath" | "DataFrame" |
 "AnnData" | "MuData") -- A file path or in-memory dataset
 object like a "DataFrame", "AnnData", "MuData", or
 "SpatialData".

 * **run** ("Run" | "bool" | "None", default: "None") -- The run
 that creates the artifact. If "False", suppress tracking the
 run. If "None", infer the run from the global run context.

 * **format** ("str" | "None", default: "None") -- The format of
 the data to write into storage. If "None", infer the format
 from the data.

 Return type:
 "None"

 -[ Example ]-

 Query a text file and replace its content:

 artifact = ln.Artifact.get(key="my_file.txt")
 artifact.replace("./my_new_file.txt")
 artifact.save()

 Note that you need to call ".save()" to persist the changes in
 storage.
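
 To make the distinction in the note above concrete, here is a
 minimal sketch of two helpers: one overwrites content in place
 via ".replace()", the other creates a new version via the
 constructor's "revises" argument. The helper names are
 hypothetical, and the functions are duck-typed so that they do
 not import lamindb themselves.

```python
def overwrite_in_place(artifact, new_path):
    """Replace the stored content without creating a new version."""
    artifact.replace(new_path)  # swaps the content, the version stays the same
    return artifact.save()      # persists the change


def create_new_version(artifact_cls, new_path, previous):
    """Create a new version by passing the previous version as `revises`."""
    return artifact_cls(new_path, revises=previous).save()


# Usage with lamindb (assumes a connected instance):
# import lamindb as ln
# old = ln.Artifact.get(key="my_file.txt")
# overwrite_in_place(old, "./my_new_file.txt")
# create_new_version(ln.Artifact, "./my_new_file.txt", old)
```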

 open(mode='r', engine='pyarrow', is_run_input=None, **kwargs)

 Open a dataset for streaming.

 Works for the following object types (storage formats):

 * "DataFrame" (".parquet", ".csv", ".ipc" files or directories
 with such files)

 * "AnnData" (".h5ad", ".zarr")

 * "SpatialData" (".zarr")

 * "tiledbsoma" (".tiledbsoma")

 * generic arrays (".h5", ".zarr")

 Parameters:
 * **mode** (str, default: "'r'") -- Can be ""r"" or ""w""
 (write mode) for "tiledbsoma" stores and ""r"" or ""r+"" for
 "AnnData" or "SpatialData" "zarr" stores; otherwise it must
 always be ""r"" (read-only mode).

 * **engine** (Literal['pyarrow', 'polars'], default:
 "'pyarrow'") -- Which module to use for lazy loading of a
 dataframe from "pyarrow" or "polars" compatible formats.
 This has no effect if the artifact is not a dataframe, i.e.,
 if it is an "AnnData", "hdf5", "zarr", or "tiledbsoma" object.

 * **is_run_input** (bool | None, default: "None") -- Whether
 to track this artifact as run input.

 * ****kwargs** -- Keyword arguments for the accessor, i.e., the
 "h5py" or "zarr" connection, "pyarrow.dataset.dataset", or
 "polars.scan_*" function.

 Return type:
 PyArrowDataset | Iterator[PolarsLazyFrame] | AnnDataAccessor |
 SpatialDataAccessor | BackedAccessor | SOMACollection |
 SOMAExperiment | SOMAMeasurement

 Returns:
 Streaming accessors, in particular, a
 "pyarrow.dataset.Dataset" object, a context manager yielding
 a polars.LazyFrame, and objects of type "AnnDataAccessor",
 "SpatialDataAccessor", "BackedAccessor",
 "tiledbsoma.Collection", "tiledbsoma.Experiment",
 "tiledbsoma.Measurement".

 Note:

 For TileDB-SOMA stores on S3 with federated credentials,
 credentials are updated only when the storage is opened, not
 while the store handle is held open. If credentials expire
 during a long-lived session, close the store and open it again
 to refresh.

 -[ Examples ]-

 Open a "DataFrame"-like artifact via "pyarrow.dataset.Dataset":

 artifact = ln.Artifact.get(key="sequences/mydataset.parquet")
 artifact.open()
 #> pyarrow._dataset.FileSystemDataset

 Open a "DataFrame"-like artifact via polars.LazyFrame:

 artifact = ln.Artifact.get(key="sequences/mydataset.parquet")
 with artifact.open(engine="polars") as df:
     # use the `polars.LazyFrame` object similar to a `DataFrame` object
     print(df.head().collect())  # materialize only the first rows

 Open an "AnnData"-like artifact via "AnnDataAccessor":

 import lamindb as ln

 artifact = ln.Artifact.get(key="scrna/mydataset.h5ad")
 with artifact.open() as adata:
     # use the `AnnDataAccessor` similar to an `AnnData` object
     print(adata)

 For more examples and background, see the guide: Stream datasets
 from storage.
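
 The "mode" rules from the parameter list above can be
 summarized as a small lookup. The following is only a sketch
 that restates the documented rules; "allowed_modes" and
 "check_mode" are hypothetical helpers, not part of lamindb, and
 lamindb's own validation may differ in detail.

```python
def allowed_modes(otype: str, suffix: str) -> set:
    """Restate the documented `mode` rules (hypothetical helper)."""
    if otype == "tiledbsoma":
        return {"r", "w"}  # read or write for tiledbsoma stores
    if otype in {"AnnData", "SpatialData"} and suffix == ".zarr":
        return {"r", "r+"}  # read or read/update for zarr-backed stores
    return {"r"}  # everything else is read-only


def check_mode(otype: str, suffix: str, mode: str) -> str:
    """Raise early if `mode` is not permitted for this kind of artifact."""
    permitted = allowed_modes(otype, suffix)
    if mode not in permitted:
        raise ValueError(
            f"mode={mode!r} not allowed for {otype} ({suffix}); use one of {sorted(permitted)}"
        )
    return mode


print(sorted(allowed_modes("AnnData", ".zarr")))    # ['r', 'r+']
print(sorted(allowed_modes("AnnData", ".h5ad")))    # ['r']
print(sorted(allowed_modes("DataFrame", ".parquet")))  # ['r']
```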

 load(*, is_run_input=None, mute=False, **kwargs)

 Cache artifact in local cache and then load it into memory.

 See: "loaders".

 Parameters:
 * **is_run_input** (bool | None, default: "None") -- Whether
 to track this artifact as run input.

 * **mute** (bool, default: "False") -- Silence logging of
 caching progress.

 * ****kwargs** -- Keyword arguments for the loader.

 Return type:
 pd.DataFrame | ScverseDataStructures | dict[str, Any] |
 list[Any] | AnyPathStr | None

 -[ Examples ]-

 Load a "DataFrame"-like artifact:

 df = artifact.load()

 Load an "AnnData"-like artifact:

 adata = artifact.load()

 cache(*, is_run_input=None, mute=False, **kwargs)

 Download cloud artifact to local cache.

 Follows syncing logic: only caches an artifact if it's outdated
 in the local cache.

 Returns a path to a locally cached on-disk object (say a ".jpg"
 file).

 Parameters:
 * **mute** ("bool", default: "False") -- Silence logging of
 caching progress.

 * **is_run_input** ("bool" | "None", default: "None") --
 Whether to track this artifact as run input.

 Return type:
 "UPath"

 -[ Example ]-

 Sync the artifact from the cloud and return the local path to
 the cached file:

 artifact.cache()
 #> PosixPath('/home/runner/work/Caches/lamindb/lamindata/pbmc68k.h5ad')
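
 The idea of "only cache if outdated" can be illustrated with a
 standalone sketch. It uses file modification times as the
 freshness criterion, which is an assumption made for
 illustration: the text above does not state how lamindb decides
 that a cached copy is outdated. "sync_to_cache" is a
 hypothetical helper and local files stand in for cloud objects.

```python
import os
import shutil
import tempfile
from pathlib import Path


def sync_to_cache(source: Path, cache_dir: Path):
    """Copy `source` into `cache_dir` only if the cached copy is missing or older.

    Returns the cached path and whether a copy was performed.
    Assumption: modification time decides whether the cache is outdated.
    """
    target = cache_dir / source.name
    outdated = not target.exists() or target.stat().st_mtime < source.stat().st_mtime
    if outdated:
        cache_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copy2 preserves the modification time
    return target, outdated


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "remote" / "image.jpg"
    src.parent.mkdir()
    src.write_bytes(b"v1")
    os.utime(src, (1_700_000_000, 1_700_000_000))  # fixed timestamp for determinism
    cache = Path(tmp) / "cache"

    _, first = sync_to_cache(src, cache)   # cache is empty: copies
    _, second = sync_to_cache(src, cache)  # cache is up to date: skips

    src.write_bytes(b"v2")
    os.utime(src, (1_700_000_100, 1_700_000_100))  # source is now newer
    cached_path, third = sync_to_cache(src, cache)  # cache is outdated: copies
    content = cached_path.read_bytes()

print(first, second, third)  # True False True
```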

 delete(permanent=None, storage=None, using_key=None)

 Trash or permanently delete.

 A first call to ".delete()" puts an artifact into the trash
 (sets "branch_id" to "-1"). A second call permanently deletes
 the artifact.

 For an artifact that has multiple versions and for which
 "artifact.overwrite_versions is True" (the default for folders),
 deleting a non-latest version does not delete the underlying
 storage unless "storage=True" is passed. Deleting the latest
 version deletes all versions.

 Parameters:
 * **permanent** ("bool" | "None", default: "None") --
 Permanently delete the artifact (skip trash).

 * **storage** ("bool" | "None", default: "None") -- Indicate
 whether you want to delete the artifact in storage.

 Return type:
 "None"

 -[ Examples ]-

 Delete a single file artifact:

 import lamindb as ln

 artifact = ln.Artifact.get(key="some.csv")
 artifact.delete() # delete a single file artifact

 Delete an old version of a folder-like artifact:

 artifact = ln.Artifact.filter(key="folder.zarr", is_latest=False).first()
 artifact.delete() # delete an old version, the data will not be deleted

 Delete all versions of a folder-like artifact:

 artifact = ln.Artifact.get(key="folder.zarr", is_latest=True)
 artifact.delete() # delete all versions; the data will be deleted, or you will be prompted to confirm deletion
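
 The two-stage behavior described at the top of this method
 (first call moves to trash, second call deletes permanently)
 can be modeled with a toy class. This is not lamindb code:
 "ToyRecord" is hypothetical, the initial "branch_id" of "1" is
 an assumption, and storage handling is left out entirely.

```python
from dataclasses import dataclass

TRASH_BRANCH_ID = -1  # the trash is represented by branch_id == -1


@dataclass
class ToyRecord:
    """Toy stand-in for a record with trash semantics (not lamindb code)."""

    branch_id: int = 1  # assumption: any non-trash branch
    deleted: bool = False

    def delete(self, permanent=None):
        if permanent or self.branch_id == TRASH_BRANCH_ID:
            # already in trash, or trash explicitly skipped: delete for good
            self.deleted = True
        else:
            # first call: move to trash only
            self.branch_id = TRASH_BRANCH_ID


record = ToyRecord()
record.delete()
print(record.branch_id, record.deleted)  # -1 False
record.delete()
print(record.branch_id, record.deleted)  # -1 True

skipped_trash = ToyRecord()
skipped_trash.delete(permanent=True)
print(skipped_trash.deleted)  # True
```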

 save(upload=None, transfer='record', **kwargs)

 Save to database & storage.

 Parameters:
 * **upload** ("bool" | "None", default: "None") -- Trigger
 upload to cloud storage in instances with hybrid storage
 mode.

 * **transfer** ("Literal"["'record'", "'annotations'"],
 default: "'record'") -- In case artifact was queried on a
 different instance, dictates behavior of sync. If "record",
 only the artifact record is synced to the current instance.
 If "annotations", also the annotations linked in the source
 instance are synced.

 Return type:
 "Artifact"

 See also: Transfer & sync data across databases

 -[ Example ]-

 Save a file-like artifact after creating it with the default
 constructor "Artifact()":

 import lamindb as ln

 artifact = ln.Artifact("./myfile.csv", key="myfile.csv").save()

 view_lineage(with_children=True, return_graph=False)

 View data lineage graph.

 Return type:
 "Digraph" | "None"