Carriva audits a French pension document and tells a retirement advisor "this RIS has these errors". When the LLM is wrong, an advisor gives wrong advice, and a real person ends up with a smaller pension than they should have. That is the stake. There is no "oops" tier in regulated industries. Building RAG that regulated industries can trust requires a different posture than building a chat demo, and we spent most of 2024 and 2025 learning the difference. Here is the architecture, the failure modes, and the discipline that ships in production.
Why RAG, not fine-tuning
The first instinct of half the people who hear "regulated AI" is "fine-tune a model on the regulation". We considered it. We rejected it.
French pension law changes. The Code de la Sécurité Sociale gets updated. Specific articles around réformes (the 2023 réforme being the most recent big one) shift values, ages, and conditions. If we fine-tune a model in March and the law changes in June, we either ship outdated answers or we re-train. Re-training is expensive, slow, and obscures the source of every answer.
Retrieval-augmented generation flips the architecture. The model is the reasoning engine. The truth lives in a vector store of authoritative documents that we update when the law updates. The model is forced to retrieve and cite. When the law changes, we ingest the new document, replace the old one in the store, and the next query gets the right answer.
Fine-tuning encodes the law into the weights. RAG keeps the law in a place where you can edit it.
For an industry where "what does the regulation actually say today?" is the central question, retrieval is not just easier. It is the only honest answer.
The architecture we shipped
Here is the actual shape of the system in Carriva.
The corpus
We maintain a corpus of authoritative documents: Code de la Sécurité Sociale articles relevant to retirement, official guidance documents from the CNAV and AGIRC-ARRCO, and a handful of internal "facts" documents we wrote and reviewed with a CGP partner (things like "the MDA enfants must be inferred via maternity proxy because it is not in the RIS"). Roughly 1,200 documents, totaling around 6 MB of text.
Each document has metadata: source, last-updated date, jurisdiction (most are national, a few are regional), and a confidence tier. The confidence tier matters: a CSS article is tier-1 (it is the law), a CNAV guidance document is tier-2 (it is administrative interpretation), and an internal fact note is tier-3 (it is our interpretation reviewed by a domain expert).
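A minimal sketch of how that metadata could be modeled. The class and field names are illustrative assumptions, not Carriva's actual schema, and ordering by tier is one plausible use of the tier rather than something described above.

```python
from dataclasses import dataclass
from datetime import date
from enum import IntEnum


class Tier(IntEnum):
    """Confidence tiers as described in the text (names are hypothetical)."""
    LAW = 1             # e.g. a CSS article
    ADMIN_GUIDANCE = 2  # e.g. a CNAV guidance document
    INTERNAL_NOTE = 3   # internal fact note reviewed by a domain expert


@dataclass(frozen=True)
class DocumentMeta:
    doc_id: str
    source: str
    last_updated: date
    jurisdiction: str
    tier: Tier


def by_authority(docs):
    """Order documents by tier (1 first), newest first within a tier."""
    return sorted(docs, key=lambda d: (d.tier, -d.last_updated.toordinal()))
```

Keeping the tier as an ordered enum means a downstream step can prefer the law over an interpretation of it without string comparisons.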
The chunking
We chunk by article and sub-section, not by sliding window. A pension-law article is a self-contained legal unit. Cutting it in the middle and storing the halves in separate chunks is exactly the failure mode that produces hallucinated answers.
Each chunk carries the parent document's metadata and an identifier we can use to cite back. The chunks are 200 to 1,200 tokens, which fits comfortably in a generation context.
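Article-level chunking can be sketched as follows. This assumes each article starts with a heading line of the form "Article <number>", which is an assumption about the input text, and it does not enforce the token bounds. The `Chunk` type and the `doc_id#article` identifier format are hypothetical.

```python
import re
from dataclasses import dataclass

# Assumed heading format: a line containing only "Article <number>".
ARTICLE_HEADING = re.compile(r"^Article[ \t]+(\S+)[ \t]*$", re.MULTILINE)


@dataclass(frozen=True)
class Chunk:
    chunk_id: str  # identifier used to cite back
    doc_id: str    # parent document
    article: str
    text: str


def chunk_by_article(doc_id: str, text: str) -> list[Chunk]:
    """Split a document at article headings so no article is cut in half."""
    matches = list(ARTICLE_HEADING.finditer(text))
    chunks = []
    for i, m in enumerate(matches):
        # An article runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        article = m.group(1)
        body = text[m.end():end].strip()
        chunks.append(Chunk(f"{doc_id}#{article}", doc_id, article, body))
    return chunks
```

Text that appears before the first heading is ignored in this sketch; a real pipeline would need to decide what to do with preambles and with articles that exceed the upper token bound.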
The vector store
We use a Postgres extension for vector search (pgvector) rather than a separate vector database. The reason is operational: one database to back up, one set of credentials, one Postgres extension to upgrade. We have written about why we self-host Postgres in 2026 and that decision touches every product, including Carriva.
The query path: an advisor asks a question (or our system generates a structured query during a RIS audit). We embed the query using the same embedding model used for ingestion. We retrieve the top 12 chunks, then rerank with a cross-encoder and keep the top 6.
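The two-stage shape of that query path can be sketched in memory. In production the first stage would be a pgvector query rather than a Python sort, and `rerank_score` stands in for the cross-encoder; both the function names and the `EmbeddedChunk` type are assumptions for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EmbeddedChunk:
    chunk_id: str
    text: str
    vector: tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_and_rerank(
    query: str,
    query_vec: Sequence[float],
    chunks: Sequence[EmbeddedChunk],
    rerank_score: Callable[[str, str], float],  # stand-in for a cross-encoder
    k_retrieve: int = 12,
    k_keep: int = 6,
) -> list[EmbeddedChunk]:
    # Stage 1: cheap vector similarity over the whole store.
    candidates = sorted(
        chunks, key=lambda c: cosine(query_vec, c.vector), reverse=True
    )[:k_retrieve]
    # Stage 2: more expensive pairwise scoring on the short list only.
    reranked = sorted(
        candidates, key=lambda c: rerank_score(query, c.text), reverse=True
    )
    return reranked[:k_keep]
```

The design point is that the expensive scorer only ever sees `k_retrieve` candidates, so its cost stays constant as the corpus grows.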
The prompt discipline
Here is where the discipline lives.
The system prompt is explicit: "Use only the provided context. If the context does not answer the question, say so. Cite the source for every claim using the format [SOURCE: doc-id, line]." We tested without the citation requirement. The model hallucinated more. The citation requirement is not just for advisor trust. It is for model alignment.
The output format is structured (JSON with a list of findings, each with a citation). We do not let the model write free-form prose for the parts that affect advice. Free-form prose is for the explanation surface around a finding, not for the finding itself.
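A minimal sketch of validating that structured output before anything reaches an advisor. The JSON field names (`findings`, `claim`, `citation`) and the reading of "line" as an integer are assumptions; only the `[SOURCE: doc-id, line]` format comes from the prompt described above.

```python
import json
import re

# Matches the citation format required by the system prompt.
CITATION = re.compile(r"\[SOURCE: (?P<doc_id>[^,\]]+), (?P<line>\d+)\]")


def parse_findings(raw: str) -> list[dict]:
    """Parse model output and reject any finding without a valid citation."""
    data = json.loads(raw)
    findings = []
    for item in data["findings"]:
        claim = item.get("claim")
        match = CITATION.fullmatch(item.get("citation", ""))
        if not claim or match is None:
            raise ValueError(f"finding rejected, missing claim or citation: {item!r}")
        findings.append(
            {"claim": claim, "doc_id": match["doc_id"], "line": int(match["line"])}
        )
    return findings
```

Rejecting the whole response on one malformed finding is a deliberately strict choice for this sketch; dropping only the offending finding is the other obvious option.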
The failure modes we caught (and one we almost shipped)
Three categories of failure, in roughly the order they hurt:
Confident wrong citations
The model would cite a document that did exist but did not say what the model claimed it said. We caught this with a verifier step: after generation, we re-fetch each cited chunk and run a smaller model to check that the chunk supports the claim. If the verifier disagrees, we discard the finding.
The verifier disagrees roughly 4 to 7% of the time. That is the rate at which our LLM was about to lie to a CGP about the law before we added the check.
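The verifier step can be sketched as a filter over findings. Here `fetch_chunk` and `supports` are hypothetical callables: the first re-reads the cited chunk from the store, the second stands in for the smaller model that judges whether the chunk supports the claim.

```python
from typing import Callable, Optional


def verify_findings(
    findings: list[dict],
    fetch_chunk: Callable[[str], Optional[str]],
    supports: Callable[[str, str], bool],  # stand-in for the smaller model
) -> tuple[list[dict], list[dict]]:
    """Return (kept, discarded) after checking each claim against its source."""
    kept, discarded = [], []
    for finding in findings:
        # Re-fetch rather than trusting whatever text was in the prompt.
        chunk_text = fetch_chunk(finding["chunk_id"])
        if chunk_text is not None and supports(chunk_text, finding["claim"]):
            kept.append(finding)
        else:
            discarded.append(finding)
    return kept, discarded
```

Returning the discarded findings as well makes it straightforward to measure the disagreement rate over time.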
Stale corpus
We had a CSS article that was updated in October, and we did not ingest the update until December. For 6 weeks, our system was answering with the pre-update text. No customer noticed because the relevant clause was an edge case. But "no customer noticed" is the worst kind of safety. We now have a weekly check that compares our corpus to a curated diff feed from official sources.
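One way such a check could look, assuming both the corpus and the official feed can be reduced to a mapping from document identifier to last-updated date. That data shape and the function name are assumptions; the format of the actual diff feed is not described here.

```python
from datetime import date


def stale_documents(
    corpus_versions: dict[str, date],
    official_versions: dict[str, date],
) -> list[str]:
    """List documents whose official version is newer than ours, or missing."""
    stale = []
    for doc_id, official_date in official_versions.items():
        ours = corpus_versions.get(doc_id)
        if ours is None or ours < official_date:
            stale.append(doc_id)
    return sorted(stale)
```

A non-empty result is the signal to re-ingest, so the check is cheap enough to run on any schedule.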
The cohort 1965 bug
This one we almost shipped. Our system had a hardcoded cutoff for the réforme 2023 at the 1964 cohort. The actual law has a sliding scale that affects 1965 differently from 1964. Jean-Luc Caturla, our active CGP tester, caught it on April 22 in his bug report. We fixed it the next day, but the bigger lesson was that we had encoded a piece of regulation in code instead of in retrievable text. That is the trap. Anything that can be expressed as a document should be a document. Code should orchestrate, not memorize.
This is one of the reasons we wrote about the broader vertical AI SaaS thesis: the moat is the discipline of treating regulatory truth as data, not code.
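The "data, not code" lesson can be illustrated with a lookup that reads cohort rules from a versioned document instead of an `if birth_year <= 1964` branch. Everything in this sketch is hypothetical: the document layout, the identifiers, and especially the rule values, which are placeholders and not actual legal parameters.

```python
import json

# Hypothetical rules document. In practice this would live in the corpus,
# be versioned, and be replaced when the law changes.
RULES_DOCUMENT = json.dumps({
    "doc_id": "example-reform-rules",
    "version": "2024-01-01",
    "cohorts": {
        "1964": {"rule": "PLACEHOLDER-A"},  # placeholder, not a legal value
        "1965": {"rule": "PLACEHOLDER-B"},  # placeholder, not a legal value
    },
})


def rule_for_cohort(rules_json: str, birth_year: int) -> dict:
    """Look up a cohort's rule and return it with its source and version."""
    doc = json.loads(rules_json)
    entry = doc["cohorts"].get(str(birth_year))
    if entry is None:
        # Fail loudly instead of falling through to a default branch.
        raise LookupError(f"no rule for cohort {birth_year} in {doc['doc_id']}")
    return {"rule": entry["rule"], "doc_id": doc["doc_id"], "version": doc["version"]}
```

The code only orchestrates: a cohort missing from the document is an error to surface, not a case to guess at, and every answer carries the document version it came from.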
Citations that an advisor will actually use
A citation in a regulated context is not a footnote. It is the artifact the advisor shows to their client when the client asks "why did you tell me this?"
Our citations include the document identifier, the article or section number, the date of the version we used, and a clickable link to the canonical source. When we generate a finding, the advisor's UI shows the citation prominently, not buried in a tooltip. We had a version where citations were in a collapsible section. Advisors did not click. We surfaced them. They clicked.
The principle generalizes: in regulated industries, the citation is part of the deliverable. Treat it like a first-class output, not a metadata afterthought.
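Treating the citation as a first-class output suggests giving it its own type. A minimal sketch with the four fields listed above; the names, the rendering format, and the example URL are assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Citation:
    doc_id: str
    section: str        # article or section number
    version_date: date  # date of the version used for the finding
    url: str            # link to the canonical source

    def render(self) -> str:
        """One-line form suitable for showing next to a finding."""
        return (
            f"{self.doc_id}, {self.section} "
            f"(version of {self.version_date.isoformat()}) <{self.url}>"
        )
```

Because the type is frozen and every field is required, a finding cannot be constructed with a partial citation.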
The audit log that makes it lawyer-friendly
Every generation in Carriva writes to an audit log: query, retrieved chunks, model used, version of the prompt, output, and verifier result. We retain these for 24 months in encrypted storage.
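A sketch of what one audit record could look like as a JSON line. The six content fields mirror the list above; the timestamp field, the field names, and the serialization are assumptions, and encryption and retention are out of scope here.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    timestamp: str              # ISO 8601, UTC (assumed field)
    query: str
    retrieved_chunk_ids: tuple  # identifiers of the chunks shown to the model
    model: str
    prompt_version: str
    output: str
    verifier_result: str


def to_log_line(record: AuditRecord) -> str:
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)


def now_utc() -> str:
    return datetime.now(timezone.utc).isoformat()
```

Storing chunk identifiers together with the prompt version is what allows a later question about which text and which prompt produced a finding to be answered from the log alone.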
The audit log was originally for our own debugging. We discovered it is also for the customer. Twice in 18 months a CGP cabinet asked "which version of the law was used to make this finding?" We could answer to the day. That is the difference between a tool and a product an advisor can stake their license on.
If you are building RAG that customers in regulated industries will pay for, the audit log is not optional. Plan for it on day one.
The cost discipline
There is a quiet cost story behind regulated RAG that most posts skip.
A non-RAG generation might cost us $0.04. A RAG generation with retrieval, reranking, generation, and verification can run $0.18 to $0.30, depending on context length. That is roughly 4.5 to 7.5 times the cost per call.
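The per-call figures above can be turned into a quick unit-economics check. The per-call costs come from the text; the number of generations per advisor per day is a made-up input for illustration only.

```python
BASELINE_COST = 0.04  # non-RAG generation, from the text
RAG_COST_LOW = 0.18   # RAG generation, low end, from the text
RAG_COST_HIGH = 0.30  # RAG generation, high end, from the text


def cost_multiple(rag_cost: float, baseline: float = BASELINE_COST) -> float:
    """How many times more a RAG call costs than a baseline call."""
    return rag_cost / baseline


def daily_cost(calls_per_day: int, cost_per_call: float) -> float:
    """Generation cost per advisor per day."""
    return calls_per_day * cost_per_call


if __name__ == "__main__":
    calls = 20  # hypothetical: generations per advisor per day
    print(round(cost_multiple(RAG_COST_LOW), 1), round(cost_multiple(RAG_COST_HIGH), 1))
    print(round(daily_cost(calls, RAG_COST_LOW), 2), round(daily_cost(calls, RAG_COST_HIGH), 2))
```

Running the same two functions against a planned price per seat is a quick way to see whether a pricing model survives the move from baseline to RAG costs.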
The compensation is that the customer pays more. Carriva's pricing model assumes a small number of high-stakes generations per advisor per day, not a flood of low-stakes ones. We sized the unit economics around that shape and they hold.
If you are tempted to do RAG on top of a pricing model that assumes $0.04 generations, you will lose money. Re-price first.
Tooling: how we actually build this
A practical note on the engineering loop.
Most of the prompt and retrieval work happens iteratively in Claude Code. The 1M-context window is genuinely useful for this kind of work because we can hold the entire prompt template, the full corpus index, and the test cases in one session. We discussed our broader Claude Code workflow separately, but the punchline for RAG is that long-context coding agents change how you debug retrieval. You can show the model "here is the query, here is what we retrieved, here is what we should have retrieved" and iterate without losing thread.
We do not use ChatGPT for code. We do not run an Anthropic SDK loop in production for engineering work because Claude Code is the right surface for the studio's workflow.
What we would tell a founder starting RAG in regulated industries
Five things, in priority order:
- Start with the corpus, not the model. Spend the first month curating documents and building ingestion pipelines. The model is a commodity.
- Force citations from day one. Even in your prototype. The discipline shapes everything else.
- Add a verifier. Even a small one. The 4 to 7% rate of caught hallucinations is the difference between trust and litigation.
- Build the audit log on day one. Retroactively building one is harder than it sounds.
- Get a domain expert testing weekly. Synthetic testing does not catch the cohort 1965 bug. Jean-Luc did.
What is next
We are working on two upgrades in Q2. First, a feedback loop where advisors can flag a finding as "this is wrong, here is why" and the flag flows into our prompt-and-retrieval evaluation set. Second, we are exploring a more aggressive use of structured outputs (the model's responses constrained by a schema) so the verifier has less work to do.
RAG for regulated industries is not a tech stack. It is a posture. Treat the regulation as truth, treat the model as an instrument, treat the citation as the deliverable, and treat the audit log as the receipt. If you do those four things, the technology takes care of itself. If you skip any of them, you will eventually ship a wrong answer to someone who staked their license on it. We almost did. We will not.



