Run Replicate models without burning compute

Replicate is a serverless GPU API. Most agents call it wrong: they hard-code replicate.run() against a public model name, hit a cold start every time, then panic-poll the prediction at 100ms intervals. This skill fixes that. Replicate bills per second of GPU time, so every avoided second is real money saved.
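As a rough illustration of the alternative to panic-polling, the sketch below polls with exponential backoff instead of a fixed 100ms loop. It is a minimal sketch under stated assumptions, not the client's built-in behaviour: `get_status` is a hypothetical callable you supply, `backoff_delays` and `wait_for` are illustrative names, and the terminal states are assumed to be the strings "succeeded", "failed", and "canceled".

```python
import time
from typing import Callable, Iterator

# Assumption: the prediction reports one of these strings once it has finished.
TERMINAL = {"succeeded", "failed", "canceled"}


def backoff_delays(first: float = 0.5, factor: float = 2.0,
                   cap: float = 10.0) -> Iterator[float]:
    """Yield sleep intervals that grow geometrically up to a cap."""
    delay = first
    while True:
        yield min(delay, cap)
        delay *= factor


def wait_for(get_status: Callable[[], str], timeout: float = 600.0,
             sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_status() with backoff until a terminal state or timeout.

    get_status is a hypothetical hook: it should return the current
    status string of one prediction each time it is called.
    """
    waited = 0.0
    for delay in backoff_delays():
        status = get_status()
        if status in TERMINAL:
            return status
        if waited + delay > timeout:
            raise TimeoutError(f"still {status!r} after {waited:.1f}s")
        sleep(delay)
        waited += delay
```

With the Python client, `get_status` could plausibly reload a prediction object and return its status; the `sleep` parameter exists only so the loop can be exercised without real waiting.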

When to use Replicate

Use it when:

You are calling open-source models (Flux, SDXL, Whisper, Llama, upscalers, depth models) without standing up GPU infra
You want one billing pane for many models instead of separate accounts at Fal, Together, and Modal
You are fine-tuning (LoRA training is first-class on Replicate)
You are deploying your own model via cog to a private endpoint

Do not reach for Replicate when:

Sub-second latency matters on every call (cold starts hurt; use Fal or a hot deployment)
The user wants ComfyUI graph execution (use ComfyUI Cloud or RunComfy)
You need text generation at scale (token-priced APIs from Anthropic, OpenAI, and Groq win on price and latency)
The model has a non-commercial license and the user is shipping a paid product without going through Replicate's hosted endpoint (see License gotcha below)

Install

Python (1.0.7): pip install replicate. Node (1.4.0): npm install replicate. Go, Swift, Elixir, and Ruby clients also exist. Set your Replicate API token in the environment.

Hardware and cost reference

Tier	Approx $/sec	Good for
CPU	$0.0001	Tiny utilities, format converters
T4	$0.000225	SD 1.5, small classifiers, Whisper-small
A40	~$0.000725	SDXL, mid-size diffusion
L40S	$0.000975	Modern diffusion, mid-size LLMs
A100 80GB	$0.0014	Flux Dev, Llama 70B, video
H100	$0.001525	Largest models, lowest wall-clock time
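Because billing is per second, a quick estimate is just rate times run time times number of runs. The sketch below does that arithmetic with the approximate rates from the table; `RATE_PER_SEC` and `estimate_cost` are illustrative names, the 12-second run time is an invented example value, and actual prices may differ from these approximations.

```python
# Approximate per-second rates in USD, copied from the table above.
RATE_PER_SEC = {
    "CPU": 0.0001,
    "T4": 0.000225,
    "A40": 0.000725,
    "L40S": 0.000975,
    "A100 80GB": 0.0014,
    "H100": 0.001525,
}


def estimate_cost(tier: str, seconds: float, runs: int = 1) -> float:
    """Return the approximate USD cost of `runs` predictions of `seconds` each."""
    return RATE_PER_SEC[tier] * seconds * runs


if __name__ == "__main__":
    # Hypothetical workload: 1000 predictions at 12 seconds of GPU time each.
    for tier in ("A100 80GB", "H100"):
        print(f"{tier}: ${estimate_cost(tier, seconds=12, runs=1000):.2f}")
```

A faster tier usually finishes in fewer seconds, so compare tiers using the run time measured on each one rather than a single shared figure.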
