You trigger a batch run of 100 articles. Ten minutes later, your terminal shows a wall of 429 "Too Many Requests" errors. Half your articles are missing. This happens because your batch concurrency settings do not match your API limits. Scaling up content production requires managing data flow to match what language models can handle at any given second.
What is batch concurrency in programmatic SEO?
Batch concurrency is the number of article generation jobs that run at the exact same time.
If you generate 50 articles sequentially, the system finishes one before starting the next. If each article takes two minutes, the entire run takes 100 minutes.
If you set your batch concurrency to 5, the system processes five articles simultaneously. The system sends five requests to the AI model, monitors five parallel runs, and writes five files at once. This reduces your total run time from 100 minutes to roughly 20 minutes.
Parallel processing saves time—but it also multiplies the load on your API connections. Every concurrent run requires its own prompts, context, and output generation cycles.
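As a rough illustration of the idea, here is a minimal Python sketch that caps parallel jobs with a semaphore and estimates total run time. The `generate_article` function is a hypothetical stand-in for a real model call, and the run-time estimate assumes every article takes about the same time.

```python
import asyncio


async def generate_article(topic: str) -> str:
    """Hypothetical stand-in for a real article generation call."""
    await asyncio.sleep(0.01)  # simulates model latency
    return f"draft for {topic}"


async def run_batch(topics: list[str], concurrency: int) -> list[str]:
    # The semaphore caps how many generate_article calls run at once.
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(topic: str) -> str:
        async with semaphore:
            return await generate_article(topic)

    # gather returns results in the same order as the input topics.
    return await asyncio.gather(*(worker(t) for t in topics))


def estimated_minutes(articles: int, minutes_per_article: float, concurrency: int) -> float:
    # Articles run in "waves" of size `concurrency`.
    waves = -(-articles // concurrency)  # ceiling division
    return waves * minutes_per_article
```

With the numbers above, `estimated_minutes(50, 2, 5)` gives 20 minutes, compared with 100 minutes at a concurrency of 1.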
The trade-off: Generation speed vs. API rate limits
To run parallel generation successfully, you must balance speed against the strict rate limits set by LLM providers. Providers restrict API usage using two main metrics:
- Requests Per Minute (RPM): The total number of individual API calls you can make in a 60-second window.
- Tokens Per Minute (TPM): The total volume of text—both your input prompts and the model's generated output—processed in a 60-second window.
When you run multiple articles in parallel, you consume tokens and requests quickly. If your pipeline requests too many tokens at once, the LLM provider rejects your requests with a rate limit error.
A realistic concurrency example
Let us look at a realistic example of how token usage accumulates. Suppose you want to generate 10 articles at the same time.
- Average prompt size: 3,000 tokens—including background data, brand guidelines, and outlines.
- Average output size: 2,000 tokens per article.
- Total tokens per article: 5,000 tokens.
If you set your concurrency to 10, all 10 articles start processing at the same moment. If each article finishes within a minute, your system sends and receives roughly 50,000 tokens in that window (10 articles × 5,000 tokens).
If your API tier limits you to 40,000 TPM, the provider blocks the remaining requests. The first few articles generate successfully—the rest fail with 429 errors.
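You can run this arithmetic before launching a batch. The sketch below is a simplified estimate, assuming each article completes within the number of minutes you pass in and that token usage is spread evenly. The function names are illustrative and not part of any provider's API.

```python
def tokens_per_minute(concurrency: int, prompt_tokens: int, output_tokens: int,
                      minutes_per_article: float = 1.0) -> float:
    # Total tokens consumed per minute by all parallel jobs combined.
    return concurrency * (prompt_tokens + output_tokens) / minutes_per_article


def max_safe_concurrency(tpm_limit: int, prompt_tokens: int, output_tokens: int,
                         minutes_per_article: float = 1.0) -> int:
    # Largest number of parallel jobs that stays at or under the TPM limit.
    per_article_tpm = (prompt_tokens + output_tokens) / minutes_per_article
    return int(tpm_limit // per_article_tpm)


print(tokens_per_minute(10, 3000, 2000))        # 50000.0
print(max_safe_concurrency(40_000, 3000, 2000))  # 8
```

Under these assumptions, a 40,000 TPM tier supports at most 8 parallel articles of this size, which is why a concurrency of 10 fails.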
How TopicForge manages parallel article generation
Managing these limits manually requires writing complex queue systems. TopicForge handles this complexity by running each article through a four-stage AI pipeline—outline, draft, voice pass, and CTA/SEO metadata generation—orchestrated via our batch jobs API.
Because the pipeline is split into four distinct stages, the system does not send one massive request per article. Instead, it makes smaller, sequential calls for each stage. Our batch jobs API orchestrates these stages in parallel across your entire topic list. This multi-stage approach keeps token usage predictable. It also ensures that brand guardrails and voice profiles are applied to every draft.
Recommended concurrency settings for different batch sizes
Your ideal concurrency setting depends on your API tier and the size of your batch. If you use standard developer accounts on platforms like Google Cloud Vertex AI, start with conservative settings.
Small batches (10 to 20 articles)
- Recommended Concurrency: 3 to 5
- Why: This range provides a safe baseline. It speeds up production compared to sequential runs but rarely triggers TPM limits—even on lower-tier API accounts.
Medium batches (21 to 100 articles)
- Recommended Concurrency: 5 to 8
- Why: This setting balances speed and safety. At roughly two minutes per article, a batch of 100 articles completes in about 25 to 40 minutes without overloading standard API quotas.
Large batches (more than 100 articles)
- Recommended Concurrency: 8 to 12 (with high-tier API quotas)
- Why: Use double-digit concurrency only after you have confirmed that your TPM limits support the load. If you run 10 or more parallel jobs, monitor your error rates closely during the first five minutes of the run.
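If you prefer to encode these tiers in your own tooling, a small lookup is enough. This is a sketch only: the function name is hypothetical, and treating 20 and 100 as the upper bounds of the small and medium tiers is an assumption.

```python
def starting_concurrency(batch_size: int) -> tuple[int, int]:
    """Return a (low, high) starting concurrency range for a batch size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if batch_size <= 20:
        return (3, 5)   # small batches
    if batch_size <= 100:
        return (5, 8)   # medium batches
    return (8, 12)      # large batches, only with confirmed high-tier quotas
```

Treat the returned range as a starting point and adjust it against your real quota and error rates.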
How to handle rate limit errors in your pipeline
Even with conservative settings, network spikes or temporary provider slowdowns can still trigger rate limits. Your generation pipeline must handle these errors gracefully.
If your code stops when it hits a 429 error, you end up with half-finished batches and fragmented data. Instead, implement exponential backoff with jitter.
Implementing exponential backoff
Exponential backoff means that when a request fails due to a rate limit, the system waits before retrying. If the retry fails, the system waits longer.
A standard backoff schedule looks like this:
- First failure: Wait 2 seconds, then retry.
- Second failure: Wait 4 seconds, then retry.
- Third failure: Wait 8 seconds, then retry.
Adding "jitter" means introducing random variation to these wait times—like waiting 4.3 seconds instead of exactly 4 seconds. Jitter prevents multiple parallel jobs from retrying at the exact same millisecond, which would trigger another rate limit error.
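The schedule above can be sketched in a few lines of Python. This is a minimal example, not a production client: `RateLimitError` is a hypothetical exception standing in for whatever your API library raises on a 429, and the jitter range of up to 0.5 seconds is an assumption.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical exception standing in for an HTTP 429 response."""


def backoff_delay(attempt: int, base: float = 2.0, max_jitter: float = 0.5) -> float:
    # attempt 1 -> 2s, attempt 2 -> 4s, attempt 3 -> 8s, plus random jitter.
    return base * 2 ** (attempt - 1) + random.uniform(0, max_jitter)


def call_with_backoff(request, max_retries: int = 3, sleep=time.sleep):
    """Call `request`, waiting and retrying after each rate limit error."""
    for attempt in range(1, max_retries + 1):
        try:
            return request()
        except RateLimitError:
            sleep(backoff_delay(attempt))
    # Final attempt: any RateLimitError now propagates to the caller.
    return request()
```

Passing `sleep` as a parameter lets you swap in a fake during tests so that retries do not slow the test suite down.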
Best practices for staging and running large batch jobs
To keep your production pipeline running smoothly, follow a structured workflow for every major batch run.
1. Run a small test batch
Never launch a batch of 500 articles on a new system configuration. Start with a test batch of 3 articles. Verify that the formatting, metadata, and structure meet your standards before scaling up.
2. Monitor your error rates
Keep an eye on your API dashboard during the run. If your error rate rises above 2%, your concurrency setting is too high for your current API tier. Lower the concurrency immediately to let the active jobs finish.
3. Queue your jobs
Use a queue manager to handle large workloads. Instead of sending 200 articles to your generation engine at once, load them into a queue. Let the queue manager feed the articles into the active pipeline at your designated concurrency rate.
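Steps 2 and 3 can be sketched together in Python. In this simplified example, a fixed pool of workers pulls topics from a queue, so no more than `concurrency` jobs are active at once, and a helper applies the 2% error guideline. The `process` function is a hypothetical placeholder for your real generation call, and it fails on purpose for topics starting with "bad" to simulate errors.

```python
import asyncio

ERROR_RATE_THRESHOLD = 0.02  # the 2% guideline from step 2


async def process(topic: str) -> str:
    """Hypothetical placeholder for the real generation pipeline."""
    await asyncio.sleep(0)  # yields control, simulating an awaitable API call
    if topic.startswith("bad"):
        raise RuntimeError("simulated failure")
    return topic.upper()


async def run_queue(topics, concurrency):
    queue = asyncio.Queue()
    for topic in topics:
        queue.put_nowait(topic)
    results, failures = {}, []

    async def worker():
        while True:
            try:
                topic = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # queue drained, this worker exits
            try:
                results[topic] = await process(topic)
            except RuntimeError:
                failures.append(topic)

    # Exactly `concurrency` workers share the queue.
    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return results, failures


def error_rate(failed: int, total: int) -> float:
    return failed / total if total else 0.0


def concurrency_too_high(failed: int, total: int,
                         threshold: float = ERROR_RATE_THRESHOLD) -> bool:
    return error_rate(failed, total) > threshold
```

A real pipeline would also retry the failed topics with backoff rather than only recording them.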
If you want to scale your content production without managing API limits, retry logic, or complex prompt pipelines yourself, TopicForge handles the entire orchestration for you. You can run batch jobs via our API and get publish-ready articles for as low as $3.99 per article with a 100-pack.
FAQs
What is the default concurrency limit for parallel generation?
There is no universal default, but a concurrency of 3 to 5 parallel runs is a common starting point. This baseline rarely triggers rate limits with LLM providers like Gemini on Vertex AI, while still generating content much faster than a sequential process.
How do Vertex AI rate limits affect TopicForge batch runs?
Because TopicForge uses Gemini via Vertex AI, batch runs are bound by your Google Cloud project's quotas—specifically Requests Per Minute (RPM) and Tokens Per Minute (TPM). If your batch concurrency is set too high, the volume of simultaneous requests triggers rate-limiting errors.
How do you recover from a rate limit error during a batch run?
When a rate limit error occurs, the system should pause, wait for a designated period using exponential backoff, and then retry the failed stage. TopicForge manages these guardrails internally to ensure that temporary API limits do not fail your entire batch job.
