Unified Multi-LLM AI Chat Platform: Four AI Providers, RAG Knowledge Base, and Sub-500ms Responses — Shipped in 4 Weeks
About This Project
A 12-person bootstrapped SaaS product team in South Asia was burning hours each week switching between ChatGPT, Claude, and Gemini tabs, with no shared context, no way to query internal documentation, and three separate billing accounts. The team needed a single platform it owned (one interface, multiple AI providers, a private knowledge base, real-time cost visibility) running under its own infrastructure.
The Problem
By 2026, enterprise AI chatbot adoption had reached 91% among businesses with 50 or more employees, yet the average knowledge worker was still copying outputs between browser tabs, managing separate API keys, and re-prompting from scratch on every session. A bootstrapped SaaS product team in South Asia felt this acutely. Twelve people. Three AI providers. No shared context. No way to query their own internal documentation. Three billing dashboards to reconcile at month-end.
The real cost wasn't subscriptions; it was cognitive overhead. A senior developer estimated they lost 40–60 minutes daily to model-hopping and context reconstruction. The team had trialled off-the-shelf wrappers and hit immediate walls: no persistent memory across sessions, no RAG against private documents, no usage visibility, and no workspace-level access control. Every tool they tested added a layer of complexity without solving the root problem.
What they needed wasn't another AI product bolted on top of their workflow. They needed a single team-owned platform — multiple models, private knowledge base, real-time cost visibility — running under their own infrastructure, not a vendor's.
Our Approach
After the first scoping call, the architecture decision was straightforward: build a production-grade multi-LLM platform with a clean unified API layer, not a thin wrapper. The core insight was that the team's bottleneck wasn't model access — it was context fragmentation. RAG against their own internal documentation was the feature that would change daily behaviour, not just the model selector.
We chose Next.js 14 App Router for the full-stack surface, Vercel AI SDK for streaming and multi-provider abstraction, and LangChain as the orchestration layer for RAG pipeline management. For vector search, Pinecone provided a fully managed production vector store without the overhead of self-hosting. Firebase Firestore handled chat session persistence through its real-time listeners, a deliberate choice to keep the UX feeling instant without adding a WebSocket infrastructure layer.
The alternative — a fully custom agent framework — was assessed and rejected. The modern AI SDK ecosystem is mature enough that building clean abstractions on top of established frameworks is the right call for a 4-week sprint. Custom primitives here would have added weeks without improving the outcome.
Architecture & Technical Solution
<ArchitectureDiagram> <svg viewBox="0 0 900 500" xmlns="http://www.w3.org/2000/svg" font-family="system-ui,-apple-system,sans-serif"> <rect width="900" height="500" fill="#0f172a"/> <rect x="350" y="18" width="200" height="46" rx="10" fill="#1e2d4a" stroke="#7DF9FF" stroke-width="1.5"/> <text x="450" y="47" fill="#7DF9FF" font-size="13" text-anchor="middle" font-weight="600">User / Team Workspace</text> <line x1="450" y1="64" x2="450" y2="106" stroke="#334155" stroke-width="1.8"/> <polygon points="444,106 456,106 450,116" fill="#334155"/> <rect x="195" y="116" width="510" height="58" rx="10" fill="#1e293b" stroke="#7b2fff" stroke-width="1.5"/> <text x="450" y="140" fill="#a78bfa" font-size="13" text-anchor="middle" font-weight="600">Next.js 14 App Router</text> <text x="450" y="160" fill="#475569" font-size="11" text-anchor="middle">Chat UI · Model Selector · Workspace Manager · shadcn/ui + Tailwind</text> <line x1="450" y1="174" x2="450" y2="215" stroke="#334155" stroke-width="1.8"/> <polygon points="444,215 456,215 450,225" fill="#334155"/> <rect x="195" y="225" width="510" height="58" rx="10" fill="#1e293b" stroke="#7b2fff" stroke-width="1.5"/> <text x="450" y="249" fill="#a78bfa" font-size="13" text-anchor="middle" font-weight="600">Vercel AI SDK · LLM Router · LangChain</text> <text x="450" y="269" fill="#475569" font-size="11" text-anchor="middle">POST /api/chat · Streaming · Model abstraction · RAG orchestration</text> <line x1="280" y1="283" x2="150" y2="330" stroke="#334155" stroke-width="1.5"/> <line x1="450" y1="283" x2="450" y2="330" stroke="#334155" stroke-width="1.5"/> <line x1="620" y1="283" x2="640" y2="330" stroke="#334155" stroke-width="1.5"/> <rect x="28" y="330" width="145" height="55" rx="10" fill="#1e293b" stroke="#4ade80" stroke-width="1.5"/> <text x="100" y="356" fill="#86efac" font-size="13" text-anchor="middle" font-weight="600">OpenAI</text> <text x="100" y="373" fill="#475569" font-size="10" text-anchor="middle">GPT-4o · 
Embeddings</text> <rect x="183" y="330" width="145" height="55" rx="10" fill="#1e293b" stroke="#4ade80" stroke-width="1.5"/> <text x="255" y="356" fill="#86efac" font-size="13" text-anchor="middle" font-weight="600">Anthropic</text> <text x="255" y="373" fill="#475569" font-size="10" text-anchor="middle">Claude Sonnet</text> <rect x="338" y="330" width="145" height="55" rx="10" fill="#1e293b" stroke="#4ade80" stroke-width="1.5"/> <text x="410" y="356" fill="#86efac" font-size="13" text-anchor="middle" font-weight="600">Google</text> <text x="410" y="373" fill="#475569" font-size="10" text-anchor="middle">Gemini 2.5 Flash</text> <rect x="548" y="318" width="320" height="70" rx="10" fill="#1e293b" stroke="#7DF9FF" stroke-width="1.5"/> <text x="708" y="343" fill="#7DF9FF" font-size="12" text-anchor="middle" font-weight="600">RAG Pipeline · LangChain</text> <text x="708" y="362" fill="#475569" font-size="10" text-anchor="middle">Pinecone vector search · text-embedding-3-small</text> <text x="708" y="378" fill="#475569" font-size="10" text-anchor="middle">Document ingestion · Hybrid retrieval</text> <line x1="255" y1="385" x2="255" y2="435" stroke="#334155" stroke-width="1.5"/> <line x1="410" y1="385" x2="400" y2="435" stroke="#334155" stroke-width="1.5"/> <line x1="708" y1="388" x2="650" y2="435" stroke="#334155" stroke-width="1.5"/> <rect x="50" y="435" width="330" height="52" rx="10" fill="#1e293b" stroke="#f59e0b" stroke-width="1.5"/> <text x="215" y="456" fill="#fbbf24" font-size="12" text-anchor="middle" font-weight="600">PostgreSQL + Prisma ORM</text> <text x="215" y="475" fill="#475569" font-size="10" text-anchor="middle">Workspaces · Members · Usage logs · Token costs</text> <rect x="490" y="435" width="330" height="52" rx="10" fill="#1e293b" stroke="#f59e0b" stroke-width="1.5"/> <text x="655" y="456" fill="#fbbf24" font-size="12" text-anchor="middle" font-weight="600">Firebase Firestore</text> <text x="655" y="475" fill="#475569" font-size="10" 
text-anchor="middle">Chat history · Real-time session sync</text> <rect x="8" y="8" width="10" height="10" rx="2" fill="none" stroke="#7DF9FF" stroke-width="1.2"/> <text x="22" y="18" fill="#475569" font-size="9">Client</text> <rect x="8" y="24" width="10" height="10" rx="2" fill="none" stroke="#7b2fff" stroke-width="1.2"/> <text x="22" y="34" fill="#475569" font-size="9">App Layer</text> <rect x="8" y="40" width="10" height="10" rx="2" fill="none" stroke="#4ade80" stroke-width="1.2"/> <text x="22" y="50" fill="#475569" font-size="9">LLM Providers</text> <rect x="8" y="56" width="10" height="10" rx="2" fill="none" stroke="#f59e0b" stroke-width="1.2"/> <text x="22" y="66" fill="#475569" font-size="9">Data Layer</text> </svg> </ArchitectureDiagram>
The platform is structured in four layers. At the client layer, a Next.js 14 App Router application with shadcn/ui and Tailwind CSS handles the full chat interface, model selector, and workspace management, giving a clean, fast single-page experience that feels closer to a native app than a web form.
At the API layer, a unified LLM router built on Vercel AI SDK abstracts all provider differences behind a single streaming endpoint. A POST /api/chat call with a model parameter routes transparently to OpenAI, Claude, or Gemini — no client-side logic changes needed when switching models. LangChain sits here as the orchestration layer, handling the RAG retrieval chain before the prompt reaches any LLM.
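The routing decision behind a single endpoint can be sketched without any SDK at all. The following is a minimal, dependency-free illustration, not the project's actual router: the model identifiers, the prefix table, and the `resolveProvider` and `routeChat` names are all assumptions made for the example.

```typescript
// Hypothetical provider names and model prefixes, for illustration only.
type Provider = "openai" | "anthropic" | "google";

interface ChatRequest {
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
}

// Prefix-based routing table: the client only ever sends a model string.
const ROUTES: { prefix: string; provider: Provider }[] = [
  { prefix: "gpt-", provider: "openai" },
  { prefix: "claude-", provider: "anthropic" },
  { prefix: "gemini-", provider: "google" },
];

// Map a model string to its provider, failing loudly on unknown models.
function resolveProvider(model: string): Provider {
  const route = ROUTES.find((r) => model.startsWith(r.prefix));
  if (!route) {
    throw new Error(`Unsupported model: ${model}`);
  }
  return route.provider;
}

// A real handler would hand the resolved provider to the streaming call;
// this sketch returns only the routing decision to stay dependency-free.
function routeChat(req: ChatRequest): { provider: Provider; model: string } {
  return { provider: resolveProvider(req.model), model: req.model };
}
```

Because the client sends nothing but a model string, adding a provider in this design means adding one row to the table rather than touching the UI.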
The knowledge layer is where the differentiation lives. Documents are chunked, embedded via OpenAI's text-embedding-3-small, and stored in Pinecone. On each query, the retrieval chain runs a similarity search, injects the top-k document chunks into the system context, and feeds the enriched prompt to whichever model the user has selected. The result: any model can answer questions about internal documents it was never trained on, grounded in the retrieved text.
Persistence splits between PostgreSQL (relational data: workspaces, members, usage logs, cost rollups) and Firebase Firestore (real-time chat history). Firestore's live listeners power instant message rendering; Postgres handles all analytical queries.
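The retrieval step in the knowledge layer (similarity search, top-k selection, context injection) can be illustrated in memory. This is a sketch under stated assumptions: it uses cosine similarity over toy vectors in place of a Pinecone query, and the `Chunk` shape, `topK`, and `buildSystemContext` are hypothetical names rather than the project's code.

```typescript
// Hypothetical in-memory chunk record; a vector store would hold these instead.
interface Chunk {
  id: string;
  text: string;
  embedding: number[];
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embedding dimensions differ");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k chunks most similar to the query embedding, best first.
function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}

// Inject the retrieved chunks into the system context as numbered sources.
function buildSystemContext(basePrompt: string, retrieved: Chunk[]): string {
  const context = retrieved.map((c, i) => `[${i + 1}] ${c.text}`).join("\n");
  return `${basePrompt}\n\nAnswer using only the context below.\n${context}`;
}
```

The resulting string is model-agnostic, which is what lets the same retrieved context feed whichever provider the user has selected.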
Build & Deployment
The build ran over four weeks in two phases. Weeks one and two delivered the core chat infrastructure: the LLM router, streaming response handling across all three providers, model switching, and Firebase session persistence. The hardest technical challenge wasn't the API integration — it was getting streaming to feel instantaneous across all three providers while keeping error states clean and consistent when a provider rate-limits or drops mid-stream.
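One way to keep error states consistent across providers is to map every failure onto a small, shared set of outcomes before it reaches the UI. The sketch below assumes a deliberately minimal failure shape (an optional HTTP status) and a count of tokens already streamed; the type names and user-facing messages are hypothetical, and real SDK errors carry more detail than this.

```typescript
type StreamErrorKind = "rate_limited" | "interrupted" | "provider_error";

interface NormalizedStreamError {
  kind: StreamErrorKind;
  retryable: boolean;
  keepPartial: boolean; // whether the UI should keep tokens already rendered
  userMessage: string;
}

// Hypothetical minimal failure shape; only the HTTP status is inspected.
interface ProviderFailure {
  status?: number;
}

function normalizeStreamError(
  failure: ProviderFailure,
  tokensSent: number
): NormalizedStreamError {
  // HTTP 429 is the standard "Too Many Requests" status.
  if (failure.status === 429) {
    return {
      kind: "rate_limited",
      retryable: true,
      keepPartial: tokensSent > 0,
      userMessage: "The provider is rate-limiting requests. Please retry shortly.",
    };
  }
  // Tokens were already streamed, so the connection dropped mid-response.
  if (tokensSent > 0) {
    return {
      kind: "interrupted",
      retryable: true,
      keepPartial: true,
      userMessage: "The response was cut off. You can retry to regenerate it.",
    };
  }
  // Nothing streamed yet: treat 5xx and unknown failures as retryable.
  return {
    kind: "provider_error",
    retryable: failure.status === undefined || failure.status >= 500,
    keepPartial: false,
    userMessage: "The provider returned an error.",
  };
}
```

With a mapping like this, the chat UI renders the same three states regardless of which provider failed.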
Weeks three and four covered the RAG pipeline and team-layer features: document ingestion pipeline, embedding jobs, Pinecone retrieval integration, per-workspace system prompts, and usage analytics with per-user token and cost breakdowns. AI-assisted code generation (using Claude and Cursor) accelerated boilerplate by roughly 60%, keeping focus on the non-trivial parts — retrieval quality tuning, token cost calculation accuracy, and secure per-workspace API key management.
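The per-user cost breakdown reduces to simple arithmetic over logged token counts. The sketch below shows one possible shape for it; the `UsageEvent` fields, model names, and especially the prices are placeholders invented for the example, not real provider rates or the project's schema.

```typescript
// Hypothetical usage-log record.
interface UsageEvent {
  userId: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
}

interface ModelPrice {
  inputPerMillion: number; // USD per 1M input tokens
  outputPerMillion: number; // USD per 1M output tokens
}

// Placeholder prices for illustration only, NOT real provider rates.
const PRICES: Record<string, ModelPrice> = {
  "model-a": { inputPerMillion: 1, outputPerMillion: 2 },
  "model-b": { inputPerMillion: 10, outputPerMillion: 30 },
};

// Cost of one request: input and output tokens are priced separately.
function eventCost(e: UsageEvent): number {
  const price = PRICES[e.model];
  if (!price) {
    throw new Error(`No price configured for ${e.model}`);
  }
  return (
    (e.inputTokens * price.inputPerMillion +
      e.outputTokens * price.outputPerMillion) /
    1_000_000
  );
}

// Roll individual request costs up into a per-user total.
function costByUser(events: UsageEvent[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const e of events) {
    totals.set(e.userId, (totals.get(e.userId) ?? 0) + eventCost(e));
  }
  return totals;
}
```

Throwing on an unpriced model, rather than silently counting it as zero, is one way to guard the accuracy of the totals when a new model is added.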
Deployment runs on Vercel with edge functions handling the streaming routes, keeping first-token latency low regardless of user geography. Database migrations run through Prisma, and the Pinecone index is seeded via a Node.js ingest script triggered on document upload. Zero-downtime deploys ship via Vercel's preview and production pipeline.
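The chunking strategy used by the ingest script is not specified here. As a minimal sketch, assuming fixed-size character windows with a small overlap (a common baseline, not necessarily what this project shipped), the step before embedding might look like the following; `chunkText` and its parameters are hypothetical.

```typescript
// Split text into fixed-size character windows that overlap by `overlap`
// characters, so a sentence cut at a boundary still appears whole in one chunk.
function chunkText(text: string, chunkSize: number, overlap: number): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("Require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

Each returned chunk would then be embedded and upserted into the index together with metadata identifying its source document.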
Results & Impact
<MetricsDiagram> <svg viewBox="0 0 800 200" xmlns="http://www.w3.org/2000/svg" font-family="system-ui,-apple-system,sans-serif"> <rect width="800" height="200" fill="#0f172a"/> <rect x="18" y="18" width="178" height="164" rx="14" fill="#1e293b" stroke="#7b2fff" stroke-width="1.5"/> <text x="107" y="90" fill="#a78bfa" font-size="52" text-anchor="middle" font-weight="800">4</text> <text x="107" y="118" fill="#64748b" font-size="12" text-anchor="middle">LLM Providers</text> <text x="107" y="137" fill="#374151" font-size="10" text-anchor="middle">GPT · Claude · Gemini · Groq</text> <rect x="208" y="18" width="178" height="164" rx="14" fill="#1e293b" stroke="#7DF9FF" stroke-width="1.5"/> <text x="297" y="90" fill="#7DF9FF" font-size="30" text-anchor="middle" font-weight="800">&lt;500ms</text> <text x="297" y="118" fill="#64748b" font-size="12" text-anchor="middle">First Token</text> <text x="297" y="137" fill="#374151" font-size="10" text-anchor="middle">All providers · Streaming</text> <rect x="398" y="18" width="178" height="164" rx="14" fill="#1e293b" stroke="#4ade80" stroke-width="1.5"/> <text x="487" y="90" fill="#4ade80" font-size="30" text-anchor="middle" font-weight="800">RAG</text> <text x="487" y="118" fill="#64748b" font-size="12" text-anchor="middle">Knowledge Retrieval</text> <text x="487" y="137" fill="#374151" font-size="10" text-anchor="middle">Private docs · Pinecone</text> <rect x="588" y="18" width="178" height="164" rx="14" fill="#1e293b" stroke="#f59e0b" stroke-width="1.5"/> <text x="677" y="90" fill="#f59e0b" font-size="30" text-anchor="middle" font-weight="800">100%</text> <text x="677" y="118" fill="#64748b" font-size="12" text-anchor="middle">Token Tracking</text> <text x="677" y="137" fill="#374151" font-size="10" text-anchor="middle">Cost analytics · Per user</text> </svg> </MetricsDiagram>
Before deployment, the team spent an estimated 40–60 minutes daily per developer managing model-switching, re-prompting from scratch, and reconciling three separate billing dashboards. Internal documentation was inaccessible from any AI interface; every query required manual copy-paste into a chat window.
After: a single workspace handles all AI interaction across the team. Model selection is a one-click change. The full internal documentation stack — engineering runbooks, product specs, SOPs, meeting notes — is queryable via the RAG layer from day one, with results grounded in actual source documents rather than hallucinations. Token-level cost tracking replaced three billing dashboards with a single analytics view, with per-user and per-workspace breakdowns updated in real time.
"We went from five browser tabs and no memory to one workspace that actually knows our codebase, our products, and our processes. The RAG layer alone made it worth it." — Product Lead, anonymised
Tech Stack
| Layer | Technology |
|---|---|
| Frontend | Next.js 14, TypeScript, shadcn/ui, Tailwind CSS |
| Backend | Vercel AI SDK, LangChain, Next.js API Routes |
| Database | PostgreSQL (Prisma ORM), Firebase Firestore |
| Infrastructure | Vercel Edge Functions, GCP |
| Integrations | OpenAI API, Anthropic Claude API, Google Gemini API |
| AI/ML | LangChain RAG, Pinecone, text-embedding-3-small |
Want a Solution Like This?
If your team is juggling multiple AI tools with no shared context, no knowledge base, and no cost visibility — this is a solvable problem, not a research project. We scope and deliver production AI platforms in weeks, not quarters, with clean architecture you fully own.
Book a free 20-minute scoping call →
We scope, prototype, and deliver — faster than you'd expect.
Built by Vertical Idea · June 2026 · SaaS · 4 weeks
Project Details
- Sector: SaaS
- Timeline: 4 weeks
- Engagement: MindChat Platform