How to retry failed items in a content batch job

You start a batch job to generate 100 SEO articles. At article 87, a brief network hiccup or an API rate limit error occurs, and the entire process halts. Without a system to handle partial failures, you face a frustrating choice: restart the entire batch from scratch, which wastes API credits and time, or manually search through database records to find where the process broke.

Managing large-scale content operations requires a reliable system for handling batch job failures. By isolating failed items and using targeted retry workflows, you can keep your content pipeline moving without duplicating effort or inflating your API bills.

Why batch content jobs fail

At scale, content generation jobs interact with multiple external APIs, database clusters, and translation services. These touchpoints introduce several common failure vectors:

API rate limits (HTTP 429): Large language models and translation APIs enforce strict rate limits on concurrent requests or tokens per minute. If your batch script sends too many requests at once, the external server rejects them.
Temporary network timeouts (HTTP 504): Generating a long-form article can take 30 seconds or more. If a gateway or proxy server along the network path expects a response within 15 seconds, it may drop the connection before the generation completes.
Malformed input data: A batch run is only as good as its seed data. If one row in your CSV or JSON payload contains a blank keyword field, special characters that break your parser, or an unsupported language code, that specific item fails.

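Most batch failures are transient. A network timeout or a rate limit error does not mean your code is permanently broken. It simply means the system needs to try again when resources are available. Malformed input is the exception: it fails the same way on every attempt until someone corrects the data.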
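
The split between transient and permanent failures can be made explicit in code, so that later retry logic has something to branch on. The following is a minimal sketch, not a definitive rule set: the `classify_failure` function, the `FailureKind` enum, and the exact list of retryable status codes are assumptions chosen for illustration.

```python
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    TRANSIENT = "transient"  # worth retrying later
    PERMANENT = "permanent"  # needs a person to fix input or configuration


# Assumption: the statuses this hypothetical pipeline treats as retryable.
RETRYABLE_STATUSES = {408, 429, 500, 502, 503, 504}


def classify_failure(http_status: Optional[int], validation_error: bool = False) -> FailureKind:
    """Decide whether a failed item should be retried automatically."""
    if validation_error:
        # Malformed seed data fails the same way on every attempt.
        return FailureKind.PERMANENT
    if http_status is None:
        # No HTTP response at all (for example a dropped connection).
        return FailureKind.TRANSIENT
    if http_status in RETRYABLE_STATUSES:
        return FailureKind.TRANSIENT
    # Everything else (401, 403, 422, ...) is treated as a hard failure here.
    return FailureKind.PERMANENT
```

With this in place, a rate-limited item (429) or a gateway timeout (504) is classified as transient, while a blank keyword field caught by validation is classified as permanent.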

Monitoring batch job status in your admin UI

An operations manager needs real-time visibility into batch progress. Relying on raw server logs to track a job is inefficient. Instead, your admin dashboard should display a clear visual breakdown of the batch status.

A functional batch monitoring UI should track three key metrics:

The completion ratio: A progress bar showing the number of completed, pending, and failed items, for example 85 completed, 10 pending, and 5 failed.
Error categorization: A view that groups failures by error code, such as all HTTP 429 errors together, so you can quickly identify systemic issues.
Real-time execution logs: A filtered log window that displays only the warning and error logs associated with the active batch.

By isolating failed items in your admin UI, you can address errors immediately. You do not have to wait for the entire batch run to time out or finish.
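
The completion ratio and the error grouping can both be derived from per-item records. Here is a minimal sketch of that aggregation; the `summarize_batch` function and the `status` and `error_code` field names are hypothetical, and a real dashboard would likely compute the same numbers with a database query.

```python
from collections import Counter


def summarize_batch(items):
    """Build the numbers a batch dashboard needs from a list of item dicts."""
    status_counts = Counter(item["status"] for item in items)
    # Group only the failed items by their error code.
    errors_by_code = Counter(
        item["error_code"] for item in items if item["status"] == "failed"
    )
    total = len(items)
    completed = status_counts.get("completed", 0)
    return {
        "completed": completed,
        "pending": status_counts.get("pending", 0),
        "failed": status_counts.get("failed", 0),
        "completion_ratio": completed / total if total else 0.0,
        "errors_by_code": dict(errors_by_code),
    }
```

For a batch with 85 completed, 10 pending, and 5 failed items, this returns a completion ratio of 0.85 plus a per-code count of the 5 failures.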

Isolating failed items for targeted retries

When five items fail in a batch of 100, running the entire batch again is highly inefficient. You pay for the 95 successful generations a second time. You also risk creating duplicate content in your database.

To avoid this, your system must isolate the failed items. This requires a database schema that tracks the status of each individual item within a batch, rather than just the status of the batch as a whole.

For example, a typical batch item table might include these fields:

Item ID	Batch ID	Topic	Status	Error Message
`item_001`	`batch_982`	"How to scale cold email"	`completed`	`NULL`
`item_002`	`batch_982`	"B2B SaaS marketing tips"	`failed`	`HTTP 429: Rate limit exceeded`
`item_003`	`batch_982`	"How to write an SEO brief"	`completed`	`NULL`

By querying your database for items where batch_id = 'batch_982' and status = 'failed', you can extract the exact IDs that require attention.
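
As an illustration, the query can be run against an in-memory SQLite database loaded with the three example rows. The table name `batch_items` and the column names are assumptions; your schema will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE batch_items (
        item_id TEXT PRIMARY KEY,
        batch_id TEXT NOT NULL,
        topic TEXT NOT NULL,
        status TEXT NOT NULL,
        error_message TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO batch_items VALUES (?, ?, ?, ?, ?)",
    [
        ("item_001", "batch_982", "How to scale cold email", "completed", None),
        ("item_002", "batch_982", "B2B SaaS marketing tips", "failed",
         "HTTP 429: Rate limit exceeded"),
        ("item_003", "batch_982", "How to write an SEO brief", "completed", None),
    ],
)


def failed_item_ids(connection, batch_id):
    """Return the IDs of all failed items in one batch."""
    rows = connection.execute(
        "SELECT item_id FROM batch_items "
        "WHERE batch_id = ? AND status = 'failed' ORDER BY item_id",
        (batch_id,),  # parameter binding avoids SQL injection
    ).fetchall()
    return [row[0] for row in rows]


print(failed_item_ids(conn, "batch_982"))  # ['item_002']
```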

Using the retry-failed endpoint to resume progress

A well-designed content generation API includes a dedicated endpoint to retry only the failed items in a specific batch. This endpoint accepts a POST request containing the batch ID. It runs the generation pipeline only for items marked as failed or pending.

Here is a worked example of how an operations manager triggers a retry using a standard curl request:

curl -X POST https://api.yourdomain.com/v1/batches/batch_982/retry-failed \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "max_retries": 3,
    "backoff_factor": 2
  }'
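
If you trigger retries from a script rather than a terminal, the same request can be built with Python's standard library. This is a sketch that mirrors the curl call above; `build_retry_request` is a hypothetical helper, and the URL, headers, and body fields are copied from the example rather than taken from any real API.

```python
import json
import urllib.request


def build_retry_request(batch_id, api_key, max_retries=3, backoff_factor=2):
    """Build (but do not send) the POST request for the retry-failed endpoint."""
    body = json.dumps(
        {"max_retries": max_retries, "backoff_factor": backoff_factor}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"https://api.yourdomain.com/v1/batches/{batch_id}/retry-failed",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_retry_request("batch_982", "YOUR_API_KEY")
# To send it: urllib.request.urlopen(request, timeout=30)
```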

When the server receives this request, it executes the following steps:

It locks the batch record so that two retry runs cannot process the same batch simultaneously.
It queries the database for all items associated with batch_982 that do not have a completed status.
It resets the status of any failed items in that set back to pending.
It queues only those incomplete items back into the generation pipeline.

This approach preserves your completed drafts, saves API credits, and ensures that your content pipeline resumes exactly where it stopped.
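
The server-side steps can be sketched with in-memory data structures. This is an illustration under stated assumptions, not a production handler: `BatchStore`, its fields, and the `retry_failed` method are hypothetical, the lock is a simple in-process guard rather than a database lock, and the sketch assumes no worker is still processing the batch when the retry starts.

```python
import threading
from collections import deque


class BatchStore:
    """Hypothetical in-memory stand-in for the batch database and job queue."""

    def __init__(self, items):
        self.items = items            # item_id -> {"batch_id": ..., "status": ...}
        self.queue = deque()          # item IDs waiting for generation
        self.active_retries = set()   # batch IDs with a retry in progress
        self._guard = threading.Lock()

    def retry_failed(self, batch_id):
        # Step 1: lock the batch so two retry runs cannot overlap.
        with self._guard:
            if batch_id in self.active_retries:
                raise RuntimeError(f"retry already running for {batch_id}")
            self.active_retries.add(batch_id)
        try:
            # Step 2: select every item in the batch that is not completed.
            incomplete = [
                item_id
                for item_id, item in self.items.items()
                if item["batch_id"] == batch_id and item["status"] != "completed"
            ]
            for item_id in incomplete:
                # Step 3: reset the status, then step 4: queue the item.
                self.items[item_id]["status"] = "pending"
                self.queue.append(item_id)
            return incomplete
        finally:
            with self._guard:
                self.active_retries.discard(batch_id)
```

Completed items are never touched, which is what keeps finished drafts from being regenerated or billed twice.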

Setting up automated retry workflows

While manual retries via an admin UI or API client are useful for debugging, you should automate the handling of transient errors. You can achieve this by implementing an exponential backoff strategy.

Exponential backoff increases the waiting time between retries. This gives the downstream API time to recover from traffic spikes. For example, if a request fails, the system might wait 2 seconds before the first retry, 4 seconds before the second, and 8 seconds before the third.
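
The 2, 4, 8 second schedule follows from multiplying a base delay by the backoff factor once per retry. A minimal sketch, where `backoff_delays` and its parameter names are illustrative:

```python
def backoff_delays(base_delay=2.0, backoff_factor=2.0, max_retries=3):
    """Return the wait time in seconds before each retry."""
    return [base_delay * backoff_factor ** retry for retry in range(max_retries)]


print(backoff_delays())  # [2.0, 4.0, 8.0]
```

Many implementations also add a small random offset (jitter) to each delay so that items that failed together do not all retry at the same instant.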

To prevent infinite loops, establish a strict retry limit:

Retries 1 to 3: Automatically retry using exponential backoff. This is usually enough to clear temporary network timeouts and rate limits.
After retry 3: Stop automatic retries. Mark the item status as failed, log the specific error response, and route the item to an admin queue for manual review. This prevents your system from wasting resources on hard failures, such as an invalid API key or a malformed prompt template.
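
Put together, the retry limit and the backoff schedule might look like the following sketch. Everything here is an assumption made for illustration: the `TransientError` and `PermanentError` exception classes, the `run_with_retries` function, the item fields, and the `generate` callable that stands in for your generation step.

```python
import time


class TransientError(Exception):
    """Temporary failure such as a timeout or rate limit."""


class PermanentError(Exception):
    """Hard failure such as an invalid API key or malformed input."""


def run_with_retries(item, generate, max_retries=3, base_delay=2.0,
                     backoff_factor=2.0, sleep=time.sleep):
    """Run generate(item), retrying transient errors with exponential backoff."""
    retries_used = 0
    while True:
        try:
            item["result"] = generate(item)
            item["status"] = "completed"
            return item
        except PermanentError as exc:
            # Retrying cannot help, so go straight to manual review.
            item.update(status="failed", error=str(exc), needs_review=True)
            return item
        except TransientError as exc:
            if retries_used >= max_retries:
                # Retry budget exhausted: stop and hand off to an admin queue.
                item.update(status="failed", error=str(exc), needs_review=True)
                return item
            sleep(base_delay * backoff_factor ** retries_used)
            retries_used += 1
```

The `sleep` parameter is injectable so the loop can be tested without real waiting. With the defaults, an item that keeps failing is attempted four times in total: the initial call plus three retries.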

How TopicForge handles batch failures and retries

Large-scale content generation requires structured guardrails to prevent minor API hiccups from ruining a large run. TopicForge uses a batch jobs API designed to generate dozens of articles in a single call while protecting your progress.

Instead of relying on a single, long-running API request that can easily time out, TopicForge processes each article through a four-stage AI pipeline: outline creation, drafting, a voice pass, and finally, CTA and SEO metadata generation. Because these stages are decoupled, a failure during the voice pass does not force the system to regenerate the outline or the draft from scratch. The platform isolates the error at the specific stage where it occurred, allowing you to retry the failed step without losing the work already completed in the earlier stages.
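
To make the idea of stage-level retries concrete, here is a generic sketch of a checkpointed pipeline. It is not TopicForge's implementation: the `STAGES` names, the `run_pipeline` function, and the article fields are all hypothetical, and each stage is simply a function that takes the previous stage's output.

```python
STAGES = ["outline", "draft", "voice_pass", "cta_seo_metadata"]


def run_pipeline(article, stage_fns):
    """Run stages in order, skipping any stage whose output is already saved."""
    outputs = article.setdefault("stage_outputs", {})
    previous = article["topic"]
    for stage in STAGES:
        if stage in outputs:
            # Checkpoint hit: reuse the saved output instead of regenerating it.
            previous = outputs[stage]
            continue
        try:
            outputs[stage] = stage_fns[stage](previous)
        except Exception as exc:
            # Record where the run stopped so a retry resumes from this stage.
            article["status"] = "failed"
            article["failed_stage"] = stage
            article["error"] = str(exc)
            return article
        previous = outputs[stage]
    article.pop("failed_stage", None)
    article.pop("error", None)
    article["status"] = "completed"
    return article
```

Calling `run_pipeline` a second time on an article that failed at `voice_pass` reuses the saved outline and draft and only re-runs the voice pass and the metadata stage.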

This architecture ensures that your programmatic SEO campaigns run predictably. It keeps your costs aligned with actual output.

If you are managing high-volume content production, running one-off prompts is a recipe for operational bottlenecks. TopicForge offers structured batch orchestration with built-in editorial guardrails to keep your generation runs efficient and error-free. Planned self-serve pricing starts at $49 for a 10-pack of articles, allowing B2B marketing teams, founders, and agencies to scale their content program without monthly agency fees.

FAQs

What is the difference between a hard failure and a soft failure in batch jobs?

A soft failure is caused by temporary issues like network timeouts or rate limits. You can resolve these by retrying the request. A hard failure is caused by permanent issues like invalid input data or authentication errors. These require manual correction before you retry.

How many times should you retry a failed batch item?

We recommend attempting a maximum of three retries using an exponential backoff strategy. If an item fails after three attempts, flag it for manual review to check for structural or input errors.

Does retrying a failed batch item overwrite successful items?

No. A properly designed retry-failed endpoint only targets the specific items that did not complete successfully. It leaves your already generated and approved content untouched.

How can I prevent rate limit failures during large content runs?

You can prevent rate limit failures by implementing request throttling, spacing out your API calls, or using a platform like TopicForge that manages queue concurrency and API limits automatically.
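
If you handle throttling yourself, spacing out calls can be as simple as enforcing a minimum interval between requests. The sketch below assumes a single-threaded batch script; the `Throttle` class and the limit of 30 requests per minute in the usage note are made up for illustration, so check your provider's documented limits.

```python
import time


class Throttle:
    """Block until enough time has passed since the previous request."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next request may start

    def wait(self):
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            # Too early: sleep for the remaining part of the interval.
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
```

Create one `Throttle(max_per_minute=30)` for the batch and call `wait()` before each API request. The first call returns immediately, and later calls pause until at least two seconds have passed since the previous one.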