Building a Typed Prompt Library in TypeScript for Production LLMs

A small pattern that turns your scattered prompt strings into a typed, versioned, testable library, with input/output schemas and provider routing.

A prompt is a function. It has typed inputs, a typed output, a version, and a contract. Treating it as a string scattered across your codebase is the production-LLM equivalent of writing your routing logic in inline JavaScript inside HTML attributes. It works for prototypes. It rots in production. The typed prompt library pattern in TypeScript is small, opinionated, and pays for itself the first time you ship a model upgrade and want to verify nothing broke. We use it across the lesson-planning monorepo and Carriva's RAG assistant. Here is the pattern.

What "typed prompt library" actually means

A handful of properties, declared in one place and checked at compile time where TypeScript can see them and at runtime where it cannot:

  1. Input schema: a Zod (or Valibot) schema for the inputs the prompt expects.
  2. Output schema: a Zod schema for the structured response.
  3. Template: the prompt text, with typed placeholders.
  4. Version: a string like "v3" so you can A/B test changes.
  5. Provider routing: which model serves this prompt by default and which alternates.
  6. Eval set reference: a pointer to the held-out test set for regression testing.

A prompt becomes a first-class object. Calling it from your business logic is a typed function call, not a string interpolation.

The minimal pattern

Here is the smallest version of the pattern, slightly simplified from our actual code.

import { z } from "zod";
import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();

type PromptDef<I extends z.ZodTypeAny, O extends z.ZodTypeAny> = {
  name: string;
  version: string;
  inputSchema: I;
  outputSchema: O;
  template: (input: z.infer<I>) => string;
  // Recorded for routing; this minimal run() always calls Anthropic.
  defaultModel: "claude-sonnet" | "gpt-4" | "gemini-flash";
};

export function definePrompt<I extends z.ZodTypeAny, O extends z.ZodTypeAny>(
  def: PromptDef<I, O>,
) {
  return {
    ...def,
    async run(input: z.infer<I>): Promise<z.infer<O>> {
      // Validate inputs before spending tokens.
      const validated = def.inputSchema.parse(input);
      const promptText = def.template(validated);
      const response = await client.messages.create({
        // Hardcoded in this simplified version; defaultModel is not consulted yet.
        model: "claude-sonnet-4",
        max_tokens: 2000,
        messages: [{ role: "user", content: promptText }],
      });
      // Take the first content block; anything other than text yields "".
      const first = response.content[0];
      const text = first?.type === "text" ? first.text : "";
      // JSON.parse throws on malformed JSON; Zod throws on a schema mismatch.
      const parsed: unknown = JSON.parse(extractJson(text));
      return def.outputSchema.parse(parsed);
    },
  };
}

function extractJson(text: string): string {
  // Strip markdown fences if the model wraps JSON in ```json ... ```
  const fenced = text.match(/```(?:json)?\s*([\s\S]*?)```/);
  return (fenced?.[1] ?? text).trim();
}

That is the core. Everything that follows is calibration on this skeleton.

A real prompt definition

Here is a prompt we actually run on the lesson-planning monorepo, slightly anonymized.

import { z } from "zod";
import { definePrompt } from "./prompt-lib";

const Input = z.object({
  subject: z.string(),
  level: z.string(),
  topic: z.string(),
  durationMinutes: z.number().int().positive(),
  language: z.enum(["fr", "en", "pl", "es"]),
});

const Output = z.object({
  title: z.string(),
  objectives: z.array(z.string()).min(1).max(5),
  sections: z.array(z.object({
    name: z.string(),
    durationMinutes: z.number(),
    activities: z.array(z.string()),
  })),
  materials: z.array(z.string()),
});

export const generateLessonPlan = definePrompt({
  name: "generate-lesson-plan",
  version: "v4",
  inputSchema: Input,
  outputSchema: Output,
  defaultModel: "claude-sonnet",
  template: (input) => `
You are a curriculum designer. Produce a lesson plan for:
- Subject: ${input.subject}
- Level: ${input.level}
- Topic: ${input.topic}
- Duration: ${input.durationMinutes} minutes
- Language: ${input.language}

Return JSON matching the schema with title, objectives, sections, and materials.
Do not include prose outside the JSON.
`,
});

Calling it from a route handler:

const plan = await generateLessonPlan.run({
  subject: "Mathematics",
  level: "Grade 7",
  topic: "Introduction to algebra",
  durationMinutes: 50,
  language: "fr",
});
// plan is fully typed: plan.title, plan.objectives[0], etc.

What this buys you:

  • Compile-time safety on inputs. If a route handler forgets a field, TypeScript fails the build.
  • Runtime validation at the LLM boundary. If the model returns malformed JSON or a missing field, Zod throws and you handle it once in one place.
  • Version tracking. When we shipped v4 of this prompt, v3 callers kept working because each version has its own import, so we could migrate them gradually.
  • Provider routing in one place. When we switched the default model, no business logic changed.

We compared the underlying providers in our Claude vs GPT vs Gemini writeup; the routing layer in this library is where those decisions are encoded.

Why Zod at both ends

The two layers of validation matter for different reasons.

Input validation

Without input validation, you ship a prompt that silently breaks when an upstream API change adds a field, omits a field, or sends a wrong type. Zod parsing at the entry point throws a clean error that you can log and handle.

This is the same pattern we use everywhere data crosses our trust boundary. We covered the broader TypeScript hygiene story in our TypeScript strict mode flags writeup.

Output validation

LLMs are statistically reliable, not deterministically reliable. A frontier model returns valid JSON roughly 99.5% of the time when prompted carefully. The remaining 0.5% is your problem.

Output validation gives you:

  • A clear error path when the model misbehaves.
  • A retry hook with a "your previous output failed schema validation, here is the error, please fix" prompt.
  • A monitoring signal for prompt drift over time.

We log every output validation failure with the input, the prompt version, the model, and the validation error. Over months, this dataset is invaluable. It tells us which prompts are degrading, which model upgrades broke things, and which inputs trigger edge cases.
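The retry hook and the failure log described above can be sketched as follows. This is a minimal, dependency-free illustration, not our production code: `runWithRepair`, `buildRepairPrompt`, `ValidationFailure`, and the `callModel` and `validate` parameters are hypothetical names. `validate` stands in for something like `(raw) => outputSchema.parse(JSON.parse(raw))`.

```typescript
// Hypothetical record written for every output validation failure.
type ValidationFailure = {
  promptName: string;
  promptVersion: string;
  model: string;
  input: unknown;
  error: string;
};

// Build the follow-up prompt that shows the model its own mistake.
function buildRepairPrompt(original: string, badOutput: string, error: string): string {
  return [
    original,
    "Your previous output failed schema validation.",
    `Previous output: ${badOutput}`,
    `Validation error: ${error}`,
    "Return corrected JSON only.",
  ].join("\n\n");
}

type RepairOptions<T> = {
  meta: { promptName: string; promptVersion: string; model: string; input: unknown };
  prompt: string;
  callModel: (prompt: string) => Promise<string>; // assumed provider call
  validate: (raw: string) => T; // throws when the output is invalid
  onFailure: (failure: ValidationFailure) => void; // e.g. send to your log sink
  maxAttempts?: number;
};

async function runWithRepair<T>(opts: RepairOptions<T>): Promise<T> {
  const maxAttempts = opts.maxAttempts ?? 2;
  let prompt = opts.prompt;
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await opts.callModel(prompt);
    try {
      return opts.validate(raw);
    } catch (err) {
      lastError = err;
      const message = err instanceof Error ? err.message : String(err);
      // Log every failure, including ones that a later attempt repairs.
      opts.onFailure({ ...opts.meta, error: message });
      prompt = buildRepairPrompt(opts.prompt, raw, message);
    }
  }
  // All attempts failed: surface the last validation error to the caller.
  throw lastError;
}
```

Capping the attempts keeps a misbehaving prompt from looping and burning tokens; the default of two attempts here is an arbitrary choice for the sketch.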

Versioning prompts

A prompt's version is a string. You can change a prompt by changing the template, the schema, the model, or the default temperature. Any of those is a behavior change.

We bump the version number on any change that could affect output. The downstream code can either pin to a specific version or import the latest:

import { generateLessonPlan } from "./prompts/lesson-plan-v4";
// Or
import { generateLessonPlan } from "./prompts/lesson-plan"; // re-exports latest

For sensitive endpoints, we pin. For experimentation, we use latest. The discipline is: any production code path that has ever shipped uses a pinned version. New endpoints can use latest until they reach production maturity.

The eval harness

A prompt without an eval set is folklore. You think it works. You have no proof.

Each prompt has an associated test file:

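// generate-lesson-plan.eval.ts
// describe, it, and expect are test-runner globals (Jest, or Vitest with globals enabled).
import { generateLessonPlan } from "./generate-lesson-plan";

// Derive the case types from the prompt so they stay in sync with its schemas.
type LessonPlanInput = Parameters<typeof generateLessonPlan.run>[0];
type LessonPlan = Awaited<ReturnType<typeof generateLessonPlan.run>>;

type EvalCase = {
  name: string;
  input: LessonPlanInput;
  assertions: (out: LessonPlan) => void;
};

const cases: EvalCase[] = [
  {
    name: "math grade 7 algebra fr",
    input: { subject: "Mathematics", level: "Grade 7", topic: "Algebra", durationMinutes: 50, language: "fr" },
    assertions: (out) => {
      expect(out.objectives.length).toBeGreaterThanOrEqual(2);
      // Section durations should add up to the requested 50 minutes.
      expect(out.sections.reduce((s, x) => s + x.durationMinutes, 0)).toBeCloseTo(50, 0);
    },
  },
  // ... 20 to 50 cases per prompt
];

describe("generate-lesson-plan", () => {
  for (const c of cases) {
    it(c.name, async () => {
      const out = await generateLessonPlan.run(c.input);
      c.assertions(out);
    });
  }
});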

Run the eval:

  • On every prompt change, before merging.
  • On every model upgrade, to catch regressions.
  • Periodically, to monitor drift.

We do not run these as unit tests in CI because they cost real money. We run them as nightly batches and on-demand before we ship a prompt change. The cost is roughly 0.50 to 5 EUR per full eval run depending on prompt complexity.

Provider routing and fallback

A typed prompt library is the right place to encode "use Claude by default; fall back to GPT-4 if Claude is rate-limited; for cheap reranking use Gemini Flash".

We extend the library with a small router:

async function callWithFallback<O>(
  primary: () => Promise<O>,
  fallback: () => Promise<O>,
): Promise<O> {
  try {
    return await primary();
  } catch (err: unknown) {
    if (isRateLimitOrTimeout(err)) {
      return await fallback();
    }
    throw err;
  }
}

In the prompt's run() method, we use this with a primary and a secondary provider. The business logic does not see the routing.
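`isRateLimitOrTimeout` is not shown above. One possible shape, assuming the provider SDK throws errors that carry an HTTP `status` (429 for rate limits, 408 for request timeouts), a Node-style `code`, or a `name` that mentions a timeout. Those conventions are assumptions for this sketch; check what your SDK actually throws before relying on it.

```typescript
// Hypothetical helper: decides whether an error justifies trying the fallback provider.
function isRateLimitOrTimeout(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as { status?: unknown; code?: unknown; name?: unknown };
  // HTTP 429 = Too Many Requests, 408 = Request Timeout.
  if (e.status === 429 || e.status === 408) return true;
  // Node network timeouts commonly surface with this code.
  if (e.code === "ETIMEDOUT") return true;
  // Timeout errors usually say so in their name, e.g. "TimeoutError".
  return typeof e.name === "string" && /timeout/i.test(e.name);
}
```

Everything else, including validation errors and 4xx responses caused by a bad request, is rethrown by `callWithFallback`, since a second provider is unlikely to fix those.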

Prompt caching

Anthropic, OpenAI, and Google all support some form of prompt caching now. This is a non-trivial cost win on prompts that ship with a large fixed prefix (style guide, examples, brand voice).

The library encodes which parts of the prompt are cacheable. With Anthropic, you split the prompt into cached blocks:

const response = await client.messages.create({
  model: "claude-sonnet-4",
  max_tokens: 2000,
  system: [
    {
      type: "text",
      text: BRAND_VOICE_DOCUMENT, // 8k tokens, cached
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: promptText }],
});

For our content pipeline, prompt caching cut effective cost per generation by roughly 60%. This is the kind of optimization that lives best in the library, not scattered across business logic.
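One way the library could record which parts of a prompt are cacheable is to let each prompt declare its fixed prefix documents and have a helper attach `cache_control`. In this sketch, `buildSystemBlocks` and `SystemBlock` are hypothetical names. The breakpoint goes on the last fixed block because Anthropic caches the prefix up to and including the marked block.

```typescript
// Shape of a text block in the `system` array, as used in the call above.
type SystemBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};

// fixedPrefix: stable documents (style guide, examples, brand voice) shared by every call.
// Returns blocks for the `system` field with a single cache breakpoint on the last one.
function buildSystemBlocks(fixedPrefix: string[]): SystemBlock[] {
  const lastIndex = fixedPrefix.length - 1;
  return fixedPrefix.map((text, i): SystemBlock =>
    i === lastIndex
      ? { type: "text", text, cache_control: { type: "ephemeral" } }
      : { type: "text", text },
  );
}
```

The variable part of the prompt stays in `messages`, so a new user input does not change the cached prefix.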

A prompt is a function. Treat it like one. Type it, version it, test it, and route it. The rest is application code.

What we tried and abandoned

A few patterns that did not survive production.

LangChain prompt templates

We tried LangChain in early experiments. The prompt-template abstraction added indirection without solving the typing problem. We replaced it with the lightweight pattern above and have not looked back.

Prompt management UI

The pull to "let non-engineers edit prompts in a UI" is real. We resisted because prompts are code; they need version control, code review, and tests. A separate UI breaks all three. If the team needs to iterate on prompts, the right tool is a tighter dev loop, not a non-dev UI.

Generated TypeScript from JSON Schema

We considered generating the TypeScript types from JSON Schema definitions of the prompts. The generation step added complexity. Hand-written Zod is honestly fine.

The Claude Code workflow piece

We use Claude Code (Anthropic's CLI) for the prompt-writing workflow itself. Writing a new prompt with Claude Code feels like a tight loop: explain what you want, the agent drafts the schema and template, runs the eval set, shows you the failures, you iterate. Documented in our Claude Code coding agent writeup.

The library and the workflow reinforce each other. The library makes it cheap to add a new prompt. The agent makes it fast to iterate on the prompt's quality. The result is that we ship LLM features with the same cadence as we ship traditional features.

Common pitfalls

Three patterns we see in other codebases that we avoid.

String concatenation in route handlers

The temptation to inline a prompt string in a route handler is real. It works for the first prompt. By the third, you have copy-pasted "respond only in JSON" three times with three slightly different wordings. The library fixes this with one definition.

Output parsing as a side gig

Many codebases parse LLM output ad-hoc with regex or JSON.parse without validation. The "happy path" works. The 0.5% sad path takes down the request. Always validate the output.

Same prompt for different inputs

We see prompts written once and reused for inputs they were never tested on. The eval harness is what catches this. If your prompt's eval set does not include the new input shape, you do not know it works.

What we would test first

If you are starting an LLM-powered feature in 2026:

  1. Define the prompt with a Zod input schema and a Zod output schema before writing the template.
  2. Write 10 test cases as your eval set, including 2 to 3 adversarial inputs.
  3. Pick a default model and an alternate. Both are real choices.
  4. Wrap the call with input validation, output validation, and retry-on-validation-failure.
  5. Version the prompt from day one. You will change it.

The pattern looks like overhead. It is the difference between a prototype that works on demo day and a production system that works on day 90.

TL;DR

Prompts are typed functions. Build them as typed functions. Zod schemas at both ends. Versioned templates. Provider routing in one place. Eval harness per prompt. The typed prompt library pattern is small enough to fit in 200 lines of TypeScript and large enough to carry your LLM features into real production usage. The investment is half a day. The payoff is the next year of not chasing prompt regressions in production.
