The Gap Between Demo and Production
An AI SaaS that works in a demo and an AI SaaS that runs reliably in production are separated by a long list of unglamorous engineering tasks. The AI part (the prompts, the RAG pipeline, the agent logic) is usually the easiest part to get right. The hard parts are auth, billing, rate limiting, error handling, observability, and infrastructure that holds up under concurrent users.
This guide is a practical roadmap through that gap, focused on what actually needs to happen and in what order.
Phase 1: Foundation (Week 1-2)
Authentication and Authorization
Use Auth.js (formerly NextAuth) or Clerk rather than rolling your own auth. The implementation cost of getting auth right — email verification, password reset, OAuth, session management, CSRF protection — is much higher than it looks. Ship auth in a day with a library; ship it in two weeks if you build it yourself.
// With Clerk in Next.js
// middleware.ts: runs Clerk on incoming requests
import { clerkMiddleware } from '@clerk/nextjs/server'

export default clerkMiddleware()

// In an API route file (a separate file from the middleware): reject unauthenticated calls
import { auth } from '@clerk/nextjs/server'

export async function POST(request: Request) {
  const { userId } = await auth()
  if (!userId) return new Response('Unauthorized', { status: 401 })
  // ... your AI logic
}
Database and ORM
Use PostgreSQL with Prisma or Drizzle ORM. Set up connection pooling from the start, with PgBouncer or a managed pooler such as Prisma Accelerate (the successor to Prisma Data Proxy). LLM API calls mean requests take 2-10 seconds, and without connection pooling you will hit connection limits quickly under concurrent load.
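To make the connection-limit claim concrete, here is a rough back-of-the-envelope sketch using Little's law (concurrency is roughly arrival rate times time held). It assumes each request holds one database connection for its whole duration, which is the worst case; the function name and the traffic numbers are illustrative, not measurements.

```typescript
// Rough estimate via Little's law: concurrency = arrival rate x time held.
// Assumption: each request holds one DB connection for its entire duration.
function connectionsNeeded(requestsPerSecond: number, holdSeconds: number): number {
  return Math.ceil(requestsPerSecond * holdSeconds)
}

// Stock PostgreSQL default for max_connections; managed hosts often differ.
const POSTGRES_DEFAULT_MAX_CONNECTIONS = 100

// Illustrative load: 20 requests/second, each taking 6 seconds.
const needed = connectionsNeeded(20, 6)
console.log(needed, needed > POSTGRES_DEFAULT_MAX_CONNECTIONS)
// prints: 120 true
```

Even modest traffic can exceed the default limit under these assumptions, which is why a pooler in front of the database matters early.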
Phase 2: AI Infrastructure (Week 2-3)
LLM Client Abstraction
Build a thin wrapper around your LLM provider from day one. This enables switching providers without touching application code, adding retry logic in one place, and logging all calls for cost tracking.
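// Illustrative types; adapt the option fields to your provider SDK.
interface CompletionOptions { model: string; maxTokens?: number }
interface LLMProvider {
  complete(prompt: string, options: CompletionOptions): Promise<string>
}

class LLMClient {
  constructor(private provider: LLMProvider) {}

  async complete(prompt: string, options: CompletionOptions) {
    const start = Date.now()
    try {
      const response = await this.provider.complete(prompt, options)
      await this.logUsage({ prompt, response, duration: Date.now() - start })
      return response
    } catch (error) {
      await this.logError({ prompt, error, duration: Date.now() - start })
      throw error
    }
  }

  // Placeholders: replace with writes to your usage table or telemetry backend.
  private async logUsage(entry: object) { console.log('llm.usage', entry) }
  private async logError(entry: object) { console.error('llm.error', entry) }
}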
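The "retry logic in one place" part of the wrapper could look something like the sketch below. It is a minimal exponential-backoff helper, not part of any SDK: `withRetry` and `RetryOptions` are hypothetical names, and it retries on every error for brevity, whereas a real version would probably retry only transient failures such as rate limits or timeouts.

```typescript
// Hypothetical helper: retries an async call with exponential backoff.
interface RetryOptions {
  maxAttempts: number // total attempts, including the first
  baseDelayMs: number // delay before the second attempt
  sleep?: (ms: number) => Promise<void> // injectable so tests need not wait
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms))

async function withRetry<T>(
  fn: () => Promise<T>,
  { maxAttempts, baseDelayMs, sleep = defaultSleep }: RetryOptions,
): Promise<T> {
  let lastError: unknown
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn()
    } catch (error) {
      lastError = error
      if (attempt === maxAttempts) break
      // Delay doubles after each failure: base, 2x base, 4x base, ...
      await sleep(baseDelayMs * 2 ** (attempt - 1))
    }
  }
  throw lastError
}
```

Inside a client wrapper, the provider call would then be wrapped as `withRetry(() => provider.complete(prompt, options), { maxAttempts: 3, baseDelayMs: 500 })`, with the attempt count and delay chosen to fit your latency budget.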
Per-User Rate Limiting
Implement rate limiting before you have real users, not after. Use Redis with a sliding window algorithm. Expose limits to users in the UI so they can see how many requests remain.
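As a sketch of the sliding window idea, the limiter below keeps a log of recent request timestamps per user and counts only those inside the window. It is in-memory to stay self-contained, so it is an illustration rather than a production design: with Redis the same log would typically live in a sorted set per user so that all server instances share it. The class and field names are hypothetical.

```typescript
interface RateLimitResult {
  allowed: boolean
  remaining: number // requests left in the current window, for display in the UI
}

// In-memory sliding window log. Assumption: a single process; use Redis to share state.
class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>()
  private readonly limit: number
  private readonly windowMs: number

  constructor(limit: number, windowMs: number) {
    this.limit = limit
    this.windowMs = windowMs
  }

  // `now` is injectable so the behavior can be tested without real waiting.
  check(userId: string, now: number = Date.now()): RateLimitResult {
    const windowStart = now - this.windowMs
    // Keep only timestamps that still fall inside the window.
    const recent = (this.hits.get(userId) ?? []).filter((t) => t > windowStart)
    const allowed = recent.length < this.limit
    if (allowed) recent.push(now)
    this.hits.set(userId, recent)
    return { allowed, remaining: this.limit - recent.length }
  }
}
```

The `remaining` value is what you would surface in the UI, so users can see how close they are to the limit.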
Phase 3: Billing (Week 3-4)
Use Stripe. Do not build billing yourself. Integrate Stripe Billing with subscription plans that map to usage tiers. The critical pieces: webhook handling for subscription events, usage metering if you bill per-token or per-query, and dunning management for failed payments.
- Listen to customer.subscription.updated to update user permissions in your DB
- Listen to invoice.payment_failed to downgrade accounts gracefully
- Store stripe_customer_id on your user model from the first checkout
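The webhook handling described above can be sketched as a pure function that maps an event to an intended account action. This is a simplified illustration under stated assumptions: `BillingEvent` is a hypothetical slice of a Stripe event (only the fields used here), the action names are invented, and treating a failed payment as the start of a grace period is one possible reading of "downgrade gracefully". A real handler would first verify the webhook signature with the Stripe SDK before trusting the payload.

```typescript
// Hypothetical, simplified slice of a Stripe event: only the fields used here.
interface BillingEvent {
  type: string
  data: { object: { customer: string; status?: string } }
}

// Hypothetical actions your application would then carry out against its DB.
type AccountAction =
  | { kind: 'sync_permissions'; customerId: string; status: string }
  | { kind: 'start_grace_period'; customerId: string }
  | { kind: 'ignore' }

function planAction(event: BillingEvent): AccountAction {
  const { customer, status } = event.data.object
  switch (event.type) {
    case 'customer.subscription.updated':
      // Look up the user by stripe_customer_id and update what they can access.
      return { kind: 'sync_permissions', customerId: customer, status: status ?? 'unknown' }
    case 'invoice.payment_failed':
      // Assumption: give the user a grace period rather than cutting access at once.
      return { kind: 'start_grace_period', customerId: customer }
    default:
      // Unhandled event types are acknowledged and skipped.
      return { kind: 'ignore' }
  }
}
```

Keeping the decision separate from the database writes makes the event mapping easy to unit test, and it is why storing `stripe_customer_id` on the user model matters: it is the key that links an incoming event back to an account.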
Phase 4: Observability (Week 4)
You cannot fix what you cannot see. Set up three layers of observability before launch:
- Application errors: Sentry for error tracking and stack traces
- Infrastructure metrics: Vercel Analytics or Datadog for request volume, latency, error rates
- AI-specific telemetry: Langfuse for LLM request tracing, token usage, and quality metrics
Phase 5: Deployment and Scaling
For a Next.js AI SaaS, Vercel is the path of least resistance for the web tier. For any background processing (document ingestion, batch embedding, scheduled tasks), run separate workers on Railway, Render, or EC2.
Common Failure Modes
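- Timeout errors at scale: LLM calls take 5-30s. Vercel functions time out after a plan-dependent maximum duration (60s on Pro at the time of writing; check the current limits). Use background jobs for anything that takes longer.
- Database connection exhaustion: A request that keeps its database connection open during an LLM call holds that connection for the full duration, often 10 seconds or more. Add connection pooling before you launch.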
- Cost explosions: A single user running a tight loop against your API can generate thousands of dollars in LLM costs. Rate limiting and per-user budgets are not optional.
- Provider outages: Build fallback providers before you have users depending on the system.
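For the provider-outage case, a fallback chain might be sketched as follows. The `Provider` type and `completeWithFallback` are hypothetical names, and the sketch assumes every provider can be reduced to a single `complete(prompt)` call; real providers differ in models, options, and output quality, so a fallback is usually a degraded mode rather than a transparent swap.

```typescript
// Hypothetical provider shape: each provider is reduced to one async call.
type Provider = {
  name: string
  complete: (prompt: string) => Promise<string>
}

// Tries providers in order and returns the first successful completion.
async function completeWithFallback(
  providers: Provider[],
  prompt: string,
): Promise<{ provider: string; text: string }> {
  const failures: string[] = []
  for (const p of providers) {
    try {
      return { provider: p.name, text: await p.complete(prompt) }
    } catch (error) {
      // Record why this provider failed, then move on to the next one.
      const reason = error instanceof Error ? error.message : String(error)
      failures.push(`${p.name}: ${reason}`)
    }
  }
  throw new Error(`All providers failed (${failures.join('; ')})`)
}
```

Returning the provider name alongside the text lets you log which provider actually served each request, which is useful for both cost tracking and spotting silent degradation.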
The teams that fail at this transition are almost always under-invested in the boring parts: monitoring, rate limiting, error handling, and billing edge cases. The AI part works. The plumbing is what breaks in production.