
Pin the version, poll at 2-5s, use webhooks for long predictions, and pick the cheapest hardware tier the model fits on. Stops your agent from burning Replicate credits on cold starts and unpinned models.
Install command
npm install @scopeful/replicate-models-runner
Download skill file
replicate-models-runner.md
9 KB
Fetch via the Scopeful MCP (any client)
Once your agent is connected to the Scopeful MCP, it can load this skill on demand, no install required:
get_skill('replicate-models-runner')
Replicate is a serverless GPU API. Most agents call it wrong: they hard-code replicate.run() against a public model name, hit a cold start every time, then panic-poll the prediction at 100ms intervals. This skill fixes that. Replicate bills per second of GPU time, so every avoided second is real money saved.
Use it when:
cog to a private endpoint
Do not reach for Replicate when:
Python (1.0.7): pip install replicate. Node (1.4.0): npm install replicate. Go: go get github.com/replicate/replicate-go. Swift, Elixir, Ruby clients also exist. Set REPLICATE_API_TOKEN=r8_... in env.
Replicate ships its own MCP server. Remote-hosted at mcp.replicate.com (auto-updated with the HTTP API), plus a local stdio version (replicate-mcp on npm):
// claude_desktop_config.json / .cursor/mcp.json / .vscode/mcp.json
{
  "mcpServers": {
    "replicate": {
      "command": "npx",
      "args": ["-y", "replicate-mcp"],
      "env": { "REPLICATE_API_TOKEN": "r8_..." }
    }
  }
}
The MCP exposes the full HTTP API surface: model search, prediction create/get/cancel, deployment management, file uploads. Tool names mirror the API verbs (search_models, create_prediction, get_prediction, cancel_prediction, list_predictions). [VERIFY] exact tool naming.
The single most common agent mistake: calling a model by name and hoping for the best. Pin the version.
# Bad: ambient version, output changes silently when the model owner updates
output = replicate.run("black-forest-labs/flux-dev", input={"prompt": "..."})
# Good: pinned version, reproducible across months
output = replicate.run(
    "black-forest-labs/flux-dev:843b6e1c...",  # 64-char version hash
    input={"prompt": "a tabby cat in soft window light"},
)
Get the version hash from the model's "Versions" tab on replicate.com, or query the API: GET /v1/models/{owner}/{name}/versions. Hard-code it in your code. Bump it deliberately, not silently.
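A cheap guard against unpinned references is to check the string before it reaches replicate.run(). The sketch below assumes a pinned reference has the shape owner/name:<64 lowercase hex characters>; is_pinned and require_pinned are hypothetical helper names, not part of the Replicate SDK.

```python
import re

# Assumption: a pinned reference is "owner/name:" followed by a 64-char hex hash.
_PINNED = re.compile(r"[^/\s:]+/[^/\s:]+:[0-9a-f]{64}")


def is_pinned(model_ref: str) -> bool:
    """Return True when model_ref carries a full 64-character version hash."""
    return _PINNED.fullmatch(model_ref) is not None


def require_pinned(model_ref: str) -> str:
    """Return model_ref unchanged, or raise if it has no version hash."""
    if not is_pinned(model_ref):
        raise ValueError(f"unpinned model reference: {model_ref!r}")
    return model_ref
```

Wrapping every model string in require_pinned() at the call site turns a silent version drift into a loud failure at startup.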
replicate.run() is the high-level helper: it creates a prediction, waits for completion, returns the output. For anything longer than ~10 seconds, prefer the low-level predictions.create() so you control polling and don't tie up the calling process.
Statuses: starting -> processing -> succeeded | failed | canceled.
import time

import replicate

pred = replicate.predictions.create(
    version="black-forest-labs/flux-dev:843b6e1c...",
    input={"prompt": "..."},
)
while pred.status not in ("succeeded", "failed", "canceled"):
    time.sleep(2)  # 2s is plenty; faster gets you 429s
    pred.reload()
output_urls = pred.output if pred.status == "succeeded" else None
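The loop above waits as long as the prediction takes, with no upper bound. A minimal sketch of a bounded variant follows. It assumes only what the loop already uses (a status attribute and a reload() method); wait_for and its parameters are hypothetical names, and the sleep and clock arguments exist so the helper can be exercised without real waiting.

```python
import time

TERMINAL = {"succeeded", "failed", "canceled"}


def wait_for(pred, interval=2.0, timeout=600.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll pred until it reaches a terminal status or the timeout expires."""
    # Clamp to the 2-5 second window; tighter polling risks 429s.
    interval = min(max(interval, 2.0), 5.0)
    deadline = clock() + timeout
    while pred.status not in TERMINAL:
        if clock() >= deadline:
            raise TimeoutError(
                f"prediction still {pred.status!r} after {timeout}s"
            )
        sleep(interval)
        pred.reload()
    return pred
```

On timeout the caller decides whether to cancel the prediction or keep waiting; the helper deliberately does neither.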
Three rules: (1) Poll every 2-5 seconds, never tighter; (2) for short predictions, prefer replicate.run() or the Prefer: wait=n header (one request, no polling); (3) for long predictions (video, training, slow upscalers), use webhooks instead of polling.
pred = replicate.predictions.create(
    version="...",
    input={"prompt": "..."},
    webhook="https://your.app/replicate/hook",
    webhook_events_filter=["completed"],  # also: start, output, logs
)
The webhook POST body is identical to a predictions.get response. Verify the signature before trusting it: validateWebhook() (JS) / replicate.signatures.verify() (Python) using the webhook secret from your account settings. Skipping verification means anyone with your webhook URL can forge completions.
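For reference, here is roughly what that verification does under the hood. This is a sketch under stated assumptions, not the SDK's implementation. It assumes the signing scheme is: webhook-id, webhook-timestamp, and webhook-signature headers; a secret of the form whsec_<base64 key>; and an HMAC-SHA256 over "id.timestamp.body". Check the current Replicate webhook docs before relying on it, and prefer the SDK helper in production. sign and is_valid are hypothetical names.

```python
import base64
import hashlib
import hmac


def sign(secret: str, webhook_id: str, timestamp: str, body: bytes) -> str:
    """Compute the base64 HMAC-SHA256 signature (assumed scheme, see above)."""
    key = base64.b64decode(secret.removeprefix("whsec_"))
    signed = f"{webhook_id}.{timestamp}.".encode() + body
    return base64.b64encode(hmac.new(key, signed, hashlib.sha256).digest()).decode()


def is_valid(secret, webhook_id, timestamp, body, signature_header,
             now, tolerance=300):
    """Return True if any v1 signature in the header matches and is fresh."""
    try:
        age = abs(now - int(timestamp))
    except ValueError:
        return False
    if age > tolerance:  # reject stale or replayed deliveries
        return False
    expected = sign(secret, webhook_id, timestamp, body)
    for candidate in signature_header.split():
        version, _, sig = candidate.partition(",")
        if version == "v1" and hmac.compare_digest(sig, expected):
            return True
    return False
```

Verify against the raw request bytes, not a re-serialized JSON object; any whitespace change invalidates the signature.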
Llama-family models and a handful of others stream tokens over SSE.
for event in replicate.stream(
    "meta/meta-llama-3-70b-instruct",
    input={"prompt": "..."},
):
    print(event, end="")  # event.data has the token
Streaming only works when the model declared support. Image and video models don't stream output; they finish then return URLs.
Replicate bills per second of GPU/CPU time. Pick the smallest tier the model fits on.
| Tier | Approx $/sec | Good for |
|---|---|---|
| CPU | $0.0001 | Tiny utilities, format converters |
| T4 | $0.000225 | SD 1.5, small classifiers, Whisper-small |
| A40 | ~$0.000725 | SDXL, mid-size diffusion |
| L40S | $0.000975 | Modern diffusion, mid-size LLMs |
| A100 80GB | $0.0014 | Flux Dev, Llama 70B, video |
| H100 | $0.001525 | Largest models, lowest wall-clock time |
Multi-GPU (4x/8x A100, H100, L40S) needs a committed-spend contract. [VERIFY] rates; some hosted models (Flux Schnell, Whisper) are flat per prediction, not per-second. Live USD math at scopeful.org/tools/replicate.
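The per-second arithmetic is simple enough to sketch. The rates below are copied from the table above and carry the same [VERIFY] caveat; estimate_cost is a hypothetical helper, and it does not apply to models billed flat per prediction.

```python
# Approximate USD per second, copied from the table above (unverified).
RATES_USD_PER_SEC = {
    "CPU": 0.0001,
    "T4": 0.000225,
    "A40": 0.000725,
    "L40S": 0.000975,
    "A100 80GB": 0.0014,
    "H100": 0.001525,
}


def estimate_cost(tier: str, seconds: float) -> float:
    """Rough cost of a per-second-billed prediction on the given tier."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return RATES_USD_PER_SEC[tier] * seconds
```

At these rates, 10 seconds on an A100 80GB is 10 x $0.0014 = $0.014, and a full minute on a T4 is 60 x $0.000225 = $0.0135.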
replicate.delivery. Web-UI predictions are kept indefinitely; API predictions are not
min_instances >= 1. Only worth it above ~1 request/minute, otherwise you're paying for idle GPU
FileOutput vs URL string. Python SDK 1.0+ returns FileOutput objects. To get strings back, pass use_file_output=False to replicate.run(), or call .url
Replicate hosts models with mixed licenses. The platform itself does not gatekeep, but the model license follows the output. Flux Dev is non-commercial when self-hosted, but Replicate holds a commercial agreement with Black Forest Labs, so images generated through Replicate's hosted endpoint are commercially usable. Pull the same weights to your own GPU and that exemption is gone. Before shipping a paid product, read the license box on the model page. When in doubt, flag it.
When you run a prediction on the user's behalf, return:
metrics.predict_time (actual billed seconds) and a rough cost estimate
If a prediction fails, return error verbatim and stop. Do not auto-retry without telling the user.
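A minimal sketch of that reporting step, covering only the billed time, the cost estimate, and the failure case. It assumes the prediction is available as a dict shaped like the API response (status, error, metrics.predict_time); build_report, the retry key, and the usd_per_sec argument are hypothetical.

```python
def build_report(pred: dict, usd_per_sec: float) -> dict:
    """Summarize a finished prediction for the user."""
    seconds = (pred.get("metrics") or {}).get("predict_time")
    report = {
        "status": pred.get("status"),
        "predict_time": seconds,  # actual billed seconds, if reported
        "estimated_cost_usd": (
            None if seconds is None else round(seconds * usd_per_sec, 6)
        ),
    }
    if pred.get("status") == "failed":
        report["error"] = pred.get("error")  # verbatim, no paraphrasing
        report["retry"] = False  # never retry without telling the user
    return report
```

The error string is passed through untouched so the user sees exactly what the model reported.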
replicate.run() in a tight while-loop; that's a thousand cold starts
owner/model string in production
ostris/flux-dev-lora-trainer is the canonical path
min_instances=1
scopeful.org/tools/replicate before quoting prices