June 22, 2026

How we built Meshia so an agent could operate the GPU

The first version looked like a simple idea: put a chat box next to a GPU workspace. The actual problem was stranger. We had to make cloud compute, files, money, failure, and proof simple enough that an agent could reason about them.

[Figure: watercolor of an agent, a GPU, and a receipt. Caption: One GPU. One receipt.]

A GPU cloud gives you a machine. We wanted the platform to feel like the machine was already part of the conversation.

That difference sounds small until you build it. A user can say "fine-tune this model" or "audit this board" or "run this eval," but the sentence hides dozens of chores: choose hardware, set a budget, launch a provider, stream progress, move large files, keep the workspace alive, recover from crashes, stop billing, and prove what happened afterward.

Product idea: The agent that operates the GPU

Durability rule: Manifests, generations, and CAS blocks

Upload lesson: 53,589 files become 23 archive bundles

Proof rule: Every hidden action needs a receipt

The first creative move was to stop treating the GPU as the product

A GPU is powerful, but it is a terrible user interface. It has a price, a region, a boot path, a container image, a driver stack, a network proxy, an idle timer, and a failure mode. Most clouds expose those details because their customer is an engineer who expects to drive the machine by hand.

We needed the product to start somewhere else. The user should not have to ask for an A5000 or an H100 first. The user should ask for a result, and we should decide what kind of machine makes sense. That meant our control plane had to translate between a human goal and the messy shape of real providers.

Control-plane rule

The agent can be creative inside the workspace. The control plane has to be boring: quote the cost, enforce the cap, mint scoped tickets, and clean up the machine even when the agent is wrong.

This is why we are not building "Claude with a terminal." The agent gets tools, but the dangerous edges go through code paths we can audit. It can ask to attach compute. It cannot silently ignore a spend cap. It can write code and run experiments. It cannot make provider cleanup optional.
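To make the boring part concrete, here is a minimal Python sketch of a quote-then-enforce check. It is an illustration under assumptions, not the Meshia control plane: the names `quote_cents`, `Budget`, and `SpendCapExceeded`, the integer-cents accounting, and the round-up rule are all hypothetical.

```python
from dataclasses import dataclass


class SpendCapExceeded(Exception):
    """Raised when a quote would push committed spend past the cap."""


def quote_cents(rate_cents_per_hour: int, minutes: int) -> int:
    """Quote a run in whole cents, rounding up (assumed rounding rule)."""
    return -(-(rate_cents_per_hour * minutes) // 60)


@dataclass
class Budget:
    cap_cents: int            # hard limit chosen by the user (hypothetical field)
    committed_cents: int = 0  # spend already billed or reserved

    def reserve(self, quoted_cents: int) -> int:
        """Reserve a quote against the cap, or refuse. Returns cents remaining."""
        remaining = self.cap_cents - self.committed_cents
        if quoted_cents > remaining:
            raise SpendCapExceeded(
                f"quote of {quoted_cents} cents exceeds remaining {remaining} cents"
            )
        self.committed_cents += quoted_cents
        return self.cap_cents - self.committed_cents
```

The shape matters more than the arithmetic: in a design like this, the agent can only ask for a reservation, and there is no code path that attaches compute without passing the check.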

The second move was deciding what is allowed to disappear

If you build around cloud GPUs, you eventually learn that the GPU is not a home. It is a rented room. It can boot slowly, restart, lose a local disk, or be replaced by a different provider. For a research workspace, that is unacceptable if the user's files live only inside the room.

So we made MeshiaFabric treat provider disks as cache and object storage as authority. The durable record is not "whatever happens to be on the pod right now." It is the object prefix recorded for the workspace, plus a manifest ledger that says which files exist, which generation they belong to, and which content-addressed blocks hold the bytes.
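A small Python sketch can show how those three pieces fit together. Everything here is an assumption made for illustration: the in-memory `store` dictionary stands in for object storage, and `publish_blocks`, `FileEntry`, `Manifest`, the fixed block size, and the choice of SHA-256 are hypothetical, not MeshiaFabric's actual format.

```python
import hashlib
from dataclasses import dataclass, field


def publish_blocks(data: bytes, store: dict[str, bytes],
                   block_size: int = 4 * 1024 * 1024) -> list[str]:
    """Split data into blocks and store each one under its SHA-256 digest."""
    digests = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # identical blocks are stored once
        digests.append(digest)
    return digests


@dataclass
class FileEntry:
    generation: int    # generation in which this version was recorded
    blocks: list[str]  # ordered digests of the blocks holding the bytes


@dataclass
class Manifest:
    generation: int = 0
    files: dict[str, FileEntry] = field(default_factory=dict)

    def record(self, path: str, blocks: list[str]) -> int:
        """Record a new version of a file and advance the generation."""
        self.generation += 1
        self.files[path] = FileEntry(self.generation, blocks)
        return self.generation


def read_file(manifest: Manifest, store: dict[str, bytes], path: str) -> bytes:
    """Rebuild a file from the manifest and the block store alone."""
    return b"".join(store[d] for d in manifest.files[path].blocks)
```

Because `read_file` needs only the manifest and the block store, nothing on the provider disk is required to rebuild the workspace. That is the sense in which the disk is cache.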

[Figure: simple stack showing the user view, provider cache, manifest ledger, and durable truth outside the GPU.] The simple user model is a folder. The durable system is a short stack: local view, cache, manifest ledger, and truth outside the GPU.

That sounds abstract, so here is the practical version

When the agent writes a file, the workspace sync agent watches the path, publishes dirty files into content-addressed blocks, and updates the manifest. When two runtimes touch the same file, the conflict resolver tries the safe cases first: fast-forward if both sides match, append-merge if both extended the same base, sidecar if the edits are genuinely different.
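The resolver's ordering can be sketched in a few lines of Python. This is one reading of the three cases, not the production resolver: the `Resolution` type, the byte-prefix test for "extended the same base", and the order in which the two tails are joined are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resolution:
    strategy: str             # "fast-forward", "append-merge", or "sidecar"
    content: bytes            # bytes kept at the original path
    sidecar: Optional[bytes]  # bytes written next to it, if any


def resolve(base: bytes, ours: bytes, theirs: bytes) -> Resolution:
    # Both sides hold identical bytes: nothing to reconcile.
    if ours == theirs:
        return Resolution("fast-forward", ours, None)
    # Both sides only appended to the same non-empty base: keep both tails.
    # Putting our tail first is an arbitrary ordering chosen for this sketch.
    if base and ours.startswith(base) and theirs.startswith(base):
        merged = base + ours[len(base):] + theirs[len(base):]
        return Resolution("append-merge", merged, None)
    # Genuinely different edits: keep ours in place, theirs in a sidecar.
    return Resolution("sidecar", ours, theirs)
```

The empty-base guard is deliberate: two runtimes that each create the same new file with different contents should land in the sidecar case in this sketch rather than being concatenated.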

When the user stops a session, billing freezes first. Then we ask the workspace to drain exactly once. We do not chase writes forever, because that would turn cleanup into another place to lose money. Late writes are reported. Finished writes are saved.
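The ordering is the whole point, so here is a deliberately tiny Python sketch of it. The `Session` type, its fields, and the representation of pending writes as `(path, finished)` pairs are hypothetical; a real drain would talk to the sync agent rather than take a list.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    billing_frozen: bool = False
    drained: bool = False
    saved: list[str] = field(default_factory=list)  # writes that finished in time
    late: list[str] = field(default_factory=list)   # writes reported, not chased


def stop(session: Session, pending: list[tuple[str, bool]]) -> None:
    """Freeze billing, then drain once. `pending` holds (path, finished) pairs."""
    session.billing_frozen = True  # step 1: the meter stops before anything else
    if session.drained:            # step 2: the drain is bounded to a single pass
        return
    session.drained = True
    for path, finished in pending:
        (session.saved if finished else session.late).append(path)
```

Calling `stop` a second time leaves `saved` and `late` unchanged, which is how the sketch expresses "exactly once".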

This is the kind of technical detail that makes the product feel calm. The user only sees "Saving files..." or "resume to recover." Underneath, the system has rules for generations, stop-drain, conflict resolution, and content-addressed storage.

Large files forced us to care about the shape of waiting

The first naive upload path is easy to imagine: for every file, call the server, send the bytes, update the UI. That works until somebody drops a real dataset folder. Then "uploading files" turns into thousands of browser events, thousands of server calls, and a progress bar that tells the truth too slowly to be useful.

A concrete planning drill made the problem obvious: 53,589 files and 9.096 GiB. Per-file upload would have created a storm. We now plan folder drops into tar bundles, upload those bundles directly to object storage, and ask the pod to extract them safely into /workspace.
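The planning step itself can be small. A rough Python sketch, assuming a greedy pack in listing order up to a target bundle size: the function name, the `target_bytes` parameter, and the greedy strategy are assumptions for illustration, and the real planner may balance bundles differently.

```python
def plan_bundles(files: list[tuple[str, int]], target_bytes: int) -> list[list[str]]:
    """Group (path, size) pairs into bundles of roughly target_bytes each.

    Files are packed in the order given. A file larger than the target
    gets a bundle of its own instead of being split.
    """
    bundles: list[list[str]] = []
    current: list[str] = []
    current_size = 0
    for path, size in files:
        # Close the current bundle if adding this file would overflow it.
        if current and current_size + size > target_bytes:
            bundles.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        bundles.append(current)
    return bundles
```

With a plan like this, the number of object-store writes depends on total bytes rather than on file count, which is what removes the per-file storm.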

The UX trick

The user sees one upload with speed and ETA. The system sees 23 archive bundles, presigned object-store writes, staged commit, safe extraction, and Fabric publishing behind it.

There is a security story hiding here too. Archive extraction rejects symlinks, devices, absolute paths, and traversal attempts. Capacity is checked before extraction. The browser never needs to know the provider's private storage shape. The product gets to feel like drag and drop, but the implementation treats every filename as untrusted input.
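Those checks map onto Python's standard `tarfile` module fairly directly. The sketch below is a minimal illustration, not the extractor that runs on the pod: `check_archive`, `UnsafeArchive`, and the `free_bytes` parameter are hypothetical names, and rejecting hard links alongside symlinks is an extra assumption.

```python
import os
import tarfile


class UnsafeArchive(Exception):
    """Raised when an archive member must not be extracted."""


def check_archive(tar: tarfile.TarFile, dest: str,
                  free_bytes: int) -> list[tarfile.TarInfo]:
    """Validate every member before anything is written to disk."""
    dest_root = os.path.realpath(dest)
    total = 0
    members = []
    for member in tar.getmembers():
        name = member.name
        if member.issym() or member.islnk():
            raise UnsafeArchive(f"link not allowed: {name}")
        if member.isdev():  # character device, block device, or FIFO
            raise UnsafeArchive(f"device not allowed: {name}")
        if os.path.isabs(name):
            raise UnsafeArchive(f"absolute path not allowed: {name}")
        # Resolve the final location and require it to stay under dest.
        target = os.path.realpath(os.path.join(dest_root, name))
        if os.path.commonpath([dest_root, target]) != dest_root:
            raise UnsafeArchive(f"path escapes destination: {name}")
        total += member.size
        members.append(member)
    # Capacity is checked before extraction, using the declared sizes.
    if total > free_bytes:
        raise UnsafeArchive(f"archive needs {total} bytes, {free_bytes} free")
    return members
```

A caller would pass the returned list to `tar.extractall(dest, members=...)` only after the whole archive has passed, so a bad member late in the bundle cannot leave a half-extracted tree behind.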

Long jobs need a place to live

A chat response can be ephemeral. A training run cannot. If an agent launches a job that takes an hour, the user expects the work to have a life outside the WebSocket. It should survive a daemon restart. It should leave logs. It should know whether it can resume.

Long-running work we launch lives under a durable job directory in the workspace. Each job has a manifest, a status file, an append-only output log, a process-group record when active, and a recovery-safe runner. The manifest stores the original command, optional resume command, checkpoint globs, restore policy, attempt count, latest checkpoint, kind, and session lineage.
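As an illustration, the manifest fields listed above could be modeled like this in Python. The field names, defaults, the JSON encoding, and the policy string are assumptions; only the list of what the manifest stores comes from the system described here.

```python
import json
import os
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class JobManifest:
    command: list[str]                           # original command
    resume_command: Optional[list[str]] = None   # optional resume command
    checkpoint_globs: list[str] = field(default_factory=list)
    restore_policy: str = "resume-if-checkpoint"  # hypothetical policy name
    attempt: int = 0
    latest_checkpoint: Optional[str] = None
    kind: str = "training"
    session_lineage: list[str] = field(default_factory=list)


def command_for_next_attempt(manifest: JobManifest) -> list[str]:
    """Resume only when a resume command and a checkpoint both exist."""
    if manifest.resume_command and manifest.latest_checkpoint:
        return manifest.resume_command
    return manifest.command


def save(manifest: JobManifest, path: str) -> None:
    """Write the manifest via a temp file and rename, so a crash mid-write
    leaves the previous manifest intact."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(asdict(manifest), f, indent=2)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)
```

Keeping the resume decision in a pure function like `command_for_next_attempt` means a restarted daemon can make the same choice from the file on disk, with no memory of the earlier attempt.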

We do not pretend we can checkpoint every arbitrary Linux process. If we launch the job, we own the recovery path. If the process falls outside that path, we label it clearly instead of selling fake resilience.

The final move was showing our work

The more work we hide behind one button, the more evidence we owe the user. If the system attaches a GPU, runs a command, writes files, spends money, and then cleans up, a screenshot is not enough. The user needs a receipt.

Not a marketing receipt. A technical receipt. The kind that can tell another engineer what happened without asking them to trust the UI.

What runtime ran the work
Which GPU or provider accepted it
What command, files, logs, and artifacts were produced
What it cost and when billing stopped
Whether cleanup actually settled

This changed how we think about the whole product. The receipt is not just a download at the end. It is a requirement that reaches backward through the system. If you want to prove cleanup, cleanup needs a durable state. If you want to prove cost, billing needs quote lineage. If you want to prove a result, the runtime needs to know which logs and artifacts count as evidence.
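One way to picture that backward pressure is a receipt type that can say which of its own claims are still unproven. The Python sketch below is hypothetical throughout: the `Receipt` fields mirror the list above, but the names, the `unproven` helper, and the SHA-256 fingerprint over canonical JSON are assumptions, not the format of the evidence bundle.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Receipt:
    runtime: str
    provider: str
    gpu: str
    command: list[str]
    logs: list[str] = field(default_factory=list)            # log file paths
    artifacts: dict[str, str] = field(default_factory=dict)  # path -> digest
    cost_cents: Optional[int] = None
    billing_stopped_at: Optional[str] = None                  # ISO 8601 time
    cleanup_settled: bool = False


def unproven(receipt: Receipt) -> list[str]:
    """List the claims this receipt cannot back up yet."""
    gaps = []
    if not receipt.logs:
        gaps.append("logs")
    if not receipt.artifacts:
        gaps.append("artifacts")
    if receipt.cost_cents is None:
        gaps.append("cost")
    if receipt.billing_stopped_at is None:
        gaps.append("billing stop time")
    if not receipt.cleanup_settled:
        gaps.append("cleanup")
    return gaps


def fingerprint(receipt: Receipt) -> str:
    """Hash a canonical JSON form so two parties can compare receipts."""
    canonical = json.dumps(asdict(receipt), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A receipt whose `unproven` list is not empty is a to-do list for the rest of the system: each gap names a subsystem that has not yet recorded durable state.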

What we learned

The surprising lesson is that agent products do not get better just by making the agent sound smarter. They get better when the system around the agent is easier to check, easier to limit, and easier to recover.

That is the work behind the product. The visible product is a place where you describe the research and watch it run. The hidden product is a set of guardrails and records that make that sentence safe: scoped tickets, durable manifests, content-addressed bytes, bounded drains, resumable jobs, cost caps, cleanup gates, and evidence bundles.

We did not build a chatbot for GPUs. We built the control surface we wanted before letting an agent touch expensive machines.