Volumes
A Volume is a durable file system that your task mounts and uses like an ordinary local directory — backed by object storage, but with real file-system semantics: open, read, write, list, and seek over many files in place.
Unlike
flyte.io.File and flyte.io.Dir, which pass
a snapshot of data as a value between tasks, a Volume is long-lived and
versioned. You write to it during one run and commit it as an immutable
version; any later task or run can mount that version and pick up exactly where
you left off. Each commit is a new immutable version, and versions share
unchanged data, so keeping history is cheap.
Volumes vs. files and directories
File/Dir and Volumes solve different problems. Use the one that matches how
your data is shaped and used.
flyte.io.File / flyte.io.Dir |
Volume | |
|---|---|---|
| What it is | A single file or folder passed as a value | A whole file system you mount |
| Access | Uploaded/downloaded as a unit | Mounted; read and written in place, like a local disk |
| Lifetime | Tied to the run that produced it | Long-lived; remount across tasks and runs |
| Changing it | Produce a new File/Dir |
Fork, write, and commit a new version (copy-on-write) |
| Best for | Handing a finished artifact to the next task | Evolving, file-system-heavy state |
If you just need to hand a finished file or folder from one task to the next,
reach for
flyte.io.File or flyte.io.Dir — they’re
simpler and need no setup. Choose a Volume when you need a mountable, durable
file system that evolves over time.
A Volume is a durable, network-backed file system — not an in-memory cache. Use it when you want file-system semantics over durable, shared data (mounting, partial and random reads, tools that expect files on disk). It is not a way to speed up model loading: pulling weights into memory through a Volume is slower than streaming them directly from object storage with a purpose-built loader.
When to use a Volume
Volumes fit AI and agentic workloads, where work is long-running, stateful, and file-heavy:
- Agent memory and state. Give an agent a durable workspace it builds up across turns, tasks, and sessions — notes, intermediate artifacts, a growing working set of files — and resume exactly where it left off, instead of starting cold each run.
- Sandboxes and code execution. Back a sandbox or code-execution environment with a Volume so agent- or model-generated code has a real, writable file system to work in. Fork a clean base per session so concurrent runs stay isolated from each other.
- Shared, durable datasets. Keep a dataset, index, or other large working set on a Volume and mount it from many tasks to read — or fork and update — it as files, without re-fetching or re-uploading the whole thing each run.
- Branching experiments. Fork a base Volume per experiment or per run; copy-on-write makes each branch independent and cheap, with version history to compare against or roll back to.
More broadly, reach for a Volume whenever you need long-lived, versioned state that carries forward across tasks or runs — anything you’d otherwise rebuild from scratch every time.
Read-write and read-only volumes
A Volume is always one of two types, and the type tells you what you can do with it:
RWVolume— a writable handle.Volume.new()returns one. Mount it, write to it, andcommit()to record an immutable version. While it is mounted it is the single writer.ROVolume— an immutable, committed version. Mount it read-only to read its contents. To change it,fork()it into a newRWVolume.
Because the type is part of a task’s signature, the read/write contract is
enforced at the task boundary: a task that declares vol: ROVolume can read
shared data but cannot mutate it. Returning a writable RWVolume from a task
commits it and hands the next task an ROVolume.
Setup
Volumes are mounted inside the task pod, so the task environment needs two
things: an image with the volume client (flyteplugins-union) and the FUSE
tools, and a pod template that grants the mount capability.
import flyte
from flyteplugins.union.io import Volume, ROVolume
image = (
flyte.Image.from_debian_base()
.with_pip_packages("flyteplugins-union") # volume client (bundles the mount binary)
.with_apt_packages("fuse3") # FUSE userspace tools needed to mount
)
env = flyte.TaskEnvironment(
name="volumes-demo",
image=image,
# grant the pod what it needs to mount a Volume
pod_template=flyte.PodTemplate().allow_fuse(),
resources=flyte.Resources(cpu="1", memory="2Gi"),
)Two pieces make a mount possible, and you need both:
flyte.PodTemplate.allow_fuse() grants the kernel side: it requests the
FUSE device resource and adds the capability the mount needs, without running
the container as privileged. Your cluster must run a FUSE device plugin for
this — the Union data plane ships an opt-in one. (For clusters without it,
allow_fuse(privileged=True) is a fallback that runs the container
privileged.)
The fuse3 apt package provides the userspace side: the fusermount3
helper that the mount client invokes (the package’s post-install makes it
setuid-root, which is what lets an unprivileged task mount). The default
minimal images don’t include it, so without it the mount fails with
fusermount: not found. If you bring your own image, see
Custom images.
A task missing either piece cannot mount a Volume.
Get started
The lifecycle is: create → mount → write → return. Returning a writable
volume from a task commits it into an immutable ROVolume; downstream tasks
receive that and mount it read-only.
import flyte
from flyteplugins.union.io import Volume, RWVolume, ROVolume
@env.task
async def create_dataset() -> RWVolume:
vol = Volume.new(name="my-dataset") # a fresh writable RWVolume
data = await vol.mount() # mount (default: ~/flyte-volume); returns the path
(data / "greeting.txt").write_text("hello from a volume\n")
return vol # auto-committed; the next task receives an ROVolume
@env.task
async def read_dataset(vol: ROVolume) -> str:
data = await vol.mount() # ROVolume always mounts read-only
return (data / "greeting.txt").read_text()
@env.task
async def main() -> str:
dataset = await create_dataset()
return await read_dataset(dataset)Volume.new() hands you a writable RWVolume. When you return it from a task,
your writes are flushed and the volume is committed into an immutable ROVolume
— a durable version safe to pass between tasks. The next task receives that
ROVolume and mounts the exact same data.
Returning a mounted RWVolume commits and unmounts it for you. To attach a
message to that final version, return finalize() explicitly:
return await vol.finalize(message="initial dataset"). To record a version
partway through a task without unmounting, use commit() — see
Checkpoint while you work.
Updating a volume by forking
An ROVolume is immutable, so you never edit one in place. Instead you fork
it — creating an independent, writable RWVolume branch — then write and commit
a new version:
@env.task
async def add_file(vol: ROVolume) -> ROVolume:
rw = await vol.fork(name="my-dataset-v2") # copy-on-write writable branch
data = await rw.mount()
(data / "extra.txt").write_text("added in a later run\n")
return await rw.finalize(message="add extra.txt")Forking is copy-on-write: the branch shares all unchanged data with its parent and only stores what you actually change, so it stays cheap even for very large Volumes. The parent version is never touched, so you keep a clean lineage of versions to compare against or roll back to. Forks are also isolated — two branches (or two parallel runs) can write at the same time without clobbering each other.
Writing in parallel
A Volume has a single writer while it is mounted. One task mounts an
RWVolume, writes, and commits; mounting the same volume read-write from two
tasks at once is not supported, and there is no distributed file locking.
To write in parallel, don’t share one mount — fork. Each fork is an
independent RWVolume on a disjoint key space, so branches never collide, even
when they run at the same time:
import asyncio
@env.task
async def process_shard(base: ROVolume, i: int) -> ROVolume:
branch = await base.fork(name=f"shard-{i}") # isolated writable branch
data = await branch.mount()
(data / f"shard-{i}.bin").write_bytes(compute_shard(i))
return await branch.finalize(message=f"shard {i}")
@env.task
async def fan_out(base: ROVolume) -> list[ROVolume]:
# each shard runs as its own action, writing its own fork concurrently
return await asyncio.gather(*(process_shard(base, i) for i in range(8)))Each branch commits its own immutable version; downstream you can read them
independently or fork a new branch from any of them. Reading is never
restricted — any number of tasks can mount the same ROVolume read-only at once.
Going further
Checkpoint while you work
Use commit() to record a version without unmounting — useful in
long-running loops where you want a durable point you can resume from if the run
is interrupted:
@env.task
async def train(base: ROVolume) -> ROVolume:
rw = await base.fork(name="training-run")
data = await rw.mount()
for epoch in range(100):
train_one_epoch(data) # writes under the mounted volume
if epoch % 10 == 0:
await rw.commit(message=f"epoch {epoch}") # durable checkpoint
return await rw.finalize(message="training complete")Each commit() records a durable, immutable version you can resume from. Those
versions are retained, so commit on a cadence that matches how often you’d
actually want to roll back — checkpoint periodically rather than every step, and
prune versions you no longer need.
Reference a volume across runs
The usual way to receive a volume is as a typed task input or from
run.outputs. When you instead want to pin a specific version and reach it
from an unrelated run — a config value, a scheduled job, an external system —
save its locator. Every committed version exposes one via the locator
property: a stable object-store address you can store anywhere.
@env.task
async def publish() -> str:
vol = Volume.new(name="my-dataset")
await vol.mount()
# ... write data ...
ro = await vol.finalize(message="v1")
return ro.locator # e.g. persist this string somewhereLater — in a different run, with no shared task input — load it back with
Volume.from_locator. It returns a read-only ROVolume with everything
recovered (the data index, bucket, store type, stats and lineage), so you can
mount it directly or fork() it to branch and write:
@env.task
async def consume(addr: str) -> int:
ro = await Volume.from_locator(addr) # -> ROVolume, no task context needed
data = await ro.mount() # read-only
return len(list(data.glob("**/*")))The locator stays resolvable as long as the producing run’s outputs are
retained. locator is None for a freshly created volume that hasn’t been
committed yet — there’s no published version to point at.
High-throughput mode
The default configuration suits most workloads. For workloads that create or
update very large numbers of files — package installs, build trees, code
generation — switch on high-throughput mode by preparing the image with
flyteplugins.union.io.with_high_throughput_volume_deps:
from flyteplugins.union.io import with_high_throughput_volume_deps
image = with_high_throughput_volume_deps(
flyte.Image.from_debian_base().with_pip_packages("flyteplugins-union")
)
env = flyte.TaskEnvironment(
name="high-throughput-volumes",
image=image,
pod_template=flyte.PodTemplate().allow_fuse(),
)Volumes created in this environment automatically use the faster metadata path — no change to your task code is required.
Tuning the mount
mount() accepts options to match the I/O profile of your workload — where to
mount, how aggressively to upload, and how long to cache metadata:
data = await vol.mount(
mount_path="/tmp/data", # a writable mount point (default: ~/flyte-volume)
max_uploads=100, # raise upload concurrency for write-heavy bursts (default 50)
attr_cache=120.0, # cache file metadata longer (default 60s)
entry_cache=120.0, # cache name lookups longer
dir_entry_cache=120.0, # cache directory listings longer
)| Option | Default | Use it to… |
|---|---|---|
mount_path |
~/flyte-volume |
Mount somewhere other than the default. |
max_uploads |
50 |
Raise the cap on concurrent uploads. Bump it during write-heavy bursts of many small files, where the default concurrency can’t saturate the upload link to object storage. |
attr_cache / entry_cache / dir_entry_cache |
60.0 |
Cache file metadata, name lookups, and directory listings longer to collapse repeated stat/listing calls. |
Raising the cache TTLs is safe because a Volume has a single writer while it is mounted. It helps most when a tool repeatedly stats or lists the same paths (common with package managers and build systems).
Custom images
The two-package setup above (flyteplugins-union + fuse3) works on top of any
image built from flyte.Image.from_debian_base(). If you bring a fully custom
image (your own Dockerfile / base), it must satisfy the same two requirements:
- The volume client —
pip install flyteplugins-union. The wheel bundles the mount binary, so there’s nothing else to fetch. - FUSE userspace tools — the
fuse3package. The mount runs unprivileged, sofusermount3must be setuid-root; the Debian package’s post-install sets that bit, so install it with the package manager (don’t just copy the binary in — a copy loses the setuid bit and the mount fails withEPERM).
In a Dockerfile that’s:
RUN apt-get update && apt-get install -y --no-install-recommends fuse3 \
&& pip install flyteplugins-unionBeyond the image, the same runtime prerequisites apply as for the default setup:
the pod must use flyte.PodTemplate().allow_fuse() (FUSE device + capability),
and the cluster must run a FUSE device plugin (the Union data plane ships one).
The container also needs to run as a user that can write the volume’s
mount_path, meta_dir, and cache_dir. The defaults live under the task
user’s $HOME, which the default image owns; if your image runs as a
different user or root, either keep those dirs writable or pass explicit
writable paths to mount().
Inspecting a volume
To browse a volume without mounting it, use the CLI. flyte explore volume
opens an interactive view of a volume’s file tree and its version history,
reading only the small metadata index — no FUSE mount and no file downloads:
# Explore the volume produced by a run (auto-discovers the volume output)
flyte explore volume <run-name>
# Pin a specific action and the exact output to inspect
flyte explore volume <run-name> <action-name> --op-name my_volumeIt follows the version lineage, so you can step back through earlier commits and
jump to the action that produced any version. To open an index you already have
on disk, pass --from-file <path> --store-type sqlite.
Performance and trade-offs
A Volume is a durable, object-store-backed file system, so it behaves differently from a local disk. Know the trade-offs before reaching for one:
- It is not local memory or disk. Reads and writes go through a cache over
object storage. Sequential, file-system-style I/O is fast, but small random
operations have higher latency than
tmpfsor a local SSD. For raw throughput into memory — streaming model weights, say — a purpose-built loader reading directly from object storage will beat mounting a Volume. - Writes are decoupled from durability. With write-back (the default),
writes land in a local cache and upload in the background; the cost of making
them durable is paid at
commit()/finalize(), not on each write. Budget for commit time separately from your write loop. - Per-file work dominates with many small files. Mounting itself stays fast even with tens of thousands of files, but operations that touch every file — creating or traversing them — are bounded by per-file metadata cost. The metadata cache TTLs and high-throughput mode exist to absorb this; reach for them on file-count-heavy workloads.
- Versions are retained. Every commit keeps an immutable version, so commit on a deliberate cadence and prune versions you no longer need.
Benchmark
Numbers from a single run on AWS (S3 storage, us-east-2 region) on a
4 vCPU / 8 GiB pod, via benchmarks/volume_benchmark.py. They depend heavily on
cloud provider, region, file sizes, and instance type, so treat them as ballpark
and re-run the benchmark for your own environment.
Head-to-head against a local disk (the pod’s container filesystem), in the default mode and in high-throughput mode:
| Operation | Local disk | Volume (default) | Volume (high-throughput) |
|---|---|---|---|
| Sequential write (512 MB) | ~2,200 MB/s | ~930 MB/s | ~930 MB/s |
| Create small files | ~21,500 files/s | ~1,750 files/s | ~2,960 files/s |
| Stat / traverse files | ~235,000 files/s | ~34,000 files/s | ~132,000 files/s |
Volume-specific costs (no local-disk equivalent):
| Operation | Default | High-throughput |
|---|---|---|
| Mount time, 100 → 50,000 files | ~0.55 s → ~0.63 s | ~0.75 s → ~0.99 s |
| Commit 512 MB to durable storage | ~3.3 s (~160 MB/s) | ~3.3 s (~160 MB/s) |
High-throughput mode only changes metadata operations — writes and commits
are identical (same data path). It speeds those up sharply (here, stat ~4×,
create ~1.7×) by keeping the volume’s whole namespace resident in memory, so
its RAM grows with file count and the mount is a touch slower (it loads that
namespace at startup). Reach for it on metadata-heavy workloads; the default
mode’s lower memory and faster mount win otherwise.
In other words, a Volume trades raw speed for durability and sharing. Sequential
writes run ~0.4× local disk — even though uploads are async, each write still
passes through the FUSE layer and the client’s chunking/hashing into the cache.
The gap is widest for many small files (create ~12× slower, stat ~7× slower),
so batch those or keep them on local scratch. Mounting stays sub-second even at
50k files, and making 512 MB durable adds a few seconds at commit().
Reference
- API:
Volume,RWVolume,ROVolume, andflyteplugins.union.io.with_high_throughput_volume_deps. - Related: Files and directories for passing snapshot data between tasks.