Structured outputs in GPT-4o with JSON Schemas
Abstract
With LLMs (large language models) significantly gaining popularity over the last few years, more and more use cases have arisen that require output in a specific structure rather than free text. In this post, I will give an overview of methods for dealing with structural requirements in LLM outputs and go into detail on how and why to use GPT-4o’s new json-schema mode.
If you are just interested in the most robust method for generating structured output with each model, the following table gives an overview:
| Provider | Models | Method |
|---|---|---|
| OpenAI | gpt-4o-mini, gpt-4o-2024-08-06 and newer | JSON Schema |
| OpenAI | gpt-3.5-turbo, gpt-4-* and gpt-4o-* models | JSON Mode |
| Claude | all models | prompting & prefill |
| Llama | meta-llama/Meta-Llama-3.1-*-Instruct-Turbo | JSON Schema |
| Mistral | mistralai/Mixtral-8x7B-Instruct-v0.1 | JSON Schema |
| Google | Gemini 1.5 Flash, Gemini 1.5 Pro | JSON Schema |
Background
While many use cases for LLMs produce results for human users where the structure can be arbitrary, there are scenarios where a precise, consistent output format is crucial for automated processing and integration with other systems. The following are examples of such use cases:
- Write result to a database
- Use a value from the result for calculations
- Compare multiple responses to each other
- Populate fields in a user interface or form
- Trigger specific actions or workflows based on output content
- Iterate over an array in the response
In all these (and many more) use cases the generated response must match what we expect. In order to write to a database, the response needs to contain data for the columns in a format we can parse. In order to do calculations based on the response, we need to be able to identify the correct values to use in our calculation. To be able to compare values from multiple responses to our prompt, the format needs to be the same for every response.
In summary, we have many strict requirements for the response and the non-deterministic nature of the transformers inside the LLMs makes it challenging to consistently produce outputs that meet these requirements without additional processing or control mechanisms.
Producing Valid Responses
In order to ensure the responses from our LLM conform to our schematic requirements we could consider the following:
- Instructing the LLM to produce a specific output as part of the prompt
- Validate the output structure (but not its content)
- Validate the output content based on a schema

In the following three sections, we will take a look at each of these ideas and consider why it might be a fitting solution.
Prompting – the traditional approach
Prompt engineering has evolved into an art form in its own right; coaxing an LLM into reliably producing a specific output comes close to black magic. The more requirements are added to the prompt, the more likely it is that some of them will not be followed properly. While larger context sizes improve reliability significantly, the non-deterministic nature of LLMs remains a fundamental challenge when precise outputs are required. LLMs predict the next token based on the context and the huge amount of data they were trained on. While they may give the impression of reasoning logically or following instructions, they are really just producing the token that is most probable given the context, their training, and their attention over the input. This probabilistic approach to text generation means that even with identical inputs, an LLM may produce varying outputs across different runs. The model’s responses are influenced by factors such as temperature settings, random seed values, and subtle differences in how the input is processed through its neural architecture. Consequently, while LLMs can generate highly coherent and contextually appropriate text, they lack the deterministic logic of traditional rule-based systems, making it challenging to guarantee consistent, structured outputs without additional post-processing or control mechanisms.
To deal with the non-deterministic nature and increase the likelihood of a correct output the following prompt-engineering techniques could be applied:
- Instruction-based prompting: provide explicit directions or a schema for the desired output, without examples.
  Example: Generate a JSON object with keys for 'name', 'age', and 'occupation'. All values should be strings.
- Few-shot prompting: demonstrate the desired input-output relationship through one or more examples (see the sketch after this list).
  Example: Input: "John Doe, 30 years old, Engineer" Output: {"name": "John Doe", "age": "30", "occupation": "Engineer"} Now do this: Input: "Jane Smith, 28 years old, Teacher"
- Multi-step prompting: use the output from one prompt as input for a subsequent prompt to refine or transform the results.
  Example: Extract name, age and occupation from the following text: {input_text} followed by Using the result from step 1, create a JSON object. Use the following keys: 'name', 'occupation' and 'age'.
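To make these techniques concrete, here is a minimal sketch of a few-shot prompt sent via the OpenAI Node SDK; the model name, prompt wording, and example data are placeholders rather than recommendations:

import OpenAI from "openai";

const openai = new OpenAI();

const completion = await openai.chat.completions.create({
  model: "gpt-4o", // placeholder model name
  messages: [
    {
      role: "system",
      content: "Extract name, age and occupation as a JSON object. All values should be strings.",
    },
    // one worked example demonstrating the desired input/output shape
    { role: "user", content: 'Input: "John Doe, 30 years old, Engineer"' },
    { role: "assistant", content: '{"name": "John Doe", "age": "30", "occupation": "Engineer"}' },
    // the actual input we want processed
    { role: "user", content: 'Input: "Jane Smith, 28 years old, Teacher"' },
  ],
});

console.log(completion.choices[0].message.content);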
Anthropic Claude - Response Prefill
Anthropic suggests the same three measures in their documentation, as well as an additional one.
For their model Claude you can specify the first few tokens the assistant should generate and it will continue from there – in other words, you can pre-fill the response.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// user_string contains the unstructured input text
const response = await client.messages.create({
  model: "claude-3-5-sonnet-20240620",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: `
Generate a JSON object with keys for 'name', 'age', and 'occupation'.
All values should be strings.
Input: {${user_string}}
`,
    },
    { role: "assistant", content: "{" }, // Prefill here
  ],
});
This drastically increases the likelihood of the response starting with JSON. Just make sure to prepend the pre-filled characters again to your response before parsing.
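In code, re-attaching the prefill before parsing is a one-liner. A minimal sketch, assuming the response object from the call above and that the first content block is text:

// The prefilled "{" is not part of the generated text, so re-attach it before parsing.
const firstBlock = response.content[0];
const generatedText = firstBlock.type === "text" ? firstBlock.text : "";
const parsedJson = JSON.parse("{" + generatedText);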
Prompting, Parsing & Runtime Errors
But even being this specific in the prompt does not guarantee a usable response. In order to catch any run-time errors, we need to be prepared for the following scenarios:
- The response is not valid JSON:
  - the response is prefixed with text, e.g. Here is your response formatted as JSON: {...}
  - the response is larger than the context window and is therefore incomplete JSON, e.g. {"name": "John Doe", "age": "30", "occupati
  - the generated JSON contains unescaped characters
- The JSON does not have the structure we expect:
  - values are missing
  - keys are renamed
  - arrays are formatted as objects
In any of those cases we could run the prompt again, incurring additional cost and processing time, but even that does not guarantee a parsable response.
In my personal experience, I have seen all of the cases above occur, albeit not very frequently. However, since they can occur, we need to be prepared for them… or, even better, use one of the following more robust solutions.
JSON Mode
A more robust way of ensuring that the response from the LLM is always valid JSON was added to GPT-3.5 and GPT-4 in late 2023.
The idea is the following: ask the LLM to provide a valid JSON object as part of your prompt (e.g. using the techniques from the previous section) and set response_format to { "type": "json_object" }. This ensures the output is a valid JSON.
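A minimal sketch of what JSON mode looks like with the OpenAI Node SDK (the model name and prompt wording are illustrative):

import OpenAI from "openai";

const openai = new OpenAI();

const completion = await openai.chat.completions.create({
  model: "gpt-4-turbo", // any model that supports JSON mode
  response_format: { type: "json_object" },
  messages: [
    {
      role: "system",
      content: "Extract the user information and respond with a JSON object containing the keys 'name', 'age' and 'occupation'. All values should be strings.",
    },
    { role: "user", content: "John Doe, 30 years old, Engineer" },
  ],
});

// JSON mode guarantees parsable JSON, but not any particular structure.
const data = JSON.parse(completion.choices[0].message.content ?? "{}");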
How exactly this JSON mode works internally has not been published by OpenAI. Since response streaming works with it, however, we can deduce that for each newly generated token there must be some kind of validation of whether the resulting partial JSON is still valid.
So as long as the response JSON fits in the specified context, it will produce a structurally valid JSON. This solves our primary concern from the prompting approach.
Our second concern from earlier, that the JSON actually has the structure we expect, is not addressed, however. Depending on the prompt and the “mood” of the LLM, the structure we instruct may or may not be followed correctly, as Owen Moore from OpenAI explains below.
[…] Note that JSON mode sadly doesn’t guarantee that the output will match your schema (though the model tries to do this and is continually getting better at it), only that it is JSON that will parse. […]
— owencmoore in the OpenAI Community
JSON Schema – the ideal way
JSON Schema provides a way to define the structure and constraints of JSON data. By specifying a schema, we can dictate exactly what fields should be present, their data types, and any additional rules that the output must follow. It supports arrays, enums, length and format constraints and many more.
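For illustration, a small JSON Schema describing a user record could look like this; the field names and constraints are arbitrary examples:

{
  "type": "object",
  "properties": {
    "name": { "type": "string", "minLength": 2 },
    "age": { "type": "integer", "minimum": 0 },
    "occupation": { "type": "string", "enum": ["Engineer", "Teacher", "Other"] }
  },
  "required": ["name", "age", "occupation"],
  "additionalProperties": false
}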
In GPT-4o, JSON Schemas are used to constrain the model’s output during the generation process, rather than validating the output afterwards. OpenAI leverages Constrained Decoding: a deterministic, engineering-based approach which constrains the model’s outputs during the generation process, ensuring schema compliance.
The constrained decoding technique works as follows:
- The provided JSON Schema is converted into a context-free grammar (CFG).
- The CFG undergoes pre-processing to create an efficient, cached data structure.
- During text generation, the system determines valid next tokens based on the current state and grammar rules.
- Invalid tokens are masked out, ensuring that only schema-compliant options are available for selection.
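Conceptually, the masking step in the last two points can be sketched as follows. This is illustrative pseudocode with hypothetical helpers (CompiledGrammar, allowedTokens), not OpenAI’s actual implementation:

// Hypothetical interface for the cached grammar artefact described above.
interface CompiledGrammar {
  // token ids that keep the partial output valid in the current grammar state
  allowedTokens(state: string): Set<number>;
}

function pickNextToken(logits: number[], state: string, grammar: CompiledGrammar): number {
  const allowed = grammar.allowedTokens(state);
  let best = -1;
  let bestLogit = -Infinity;
  for (const tokenId of allowed) {
    if (logits[tokenId] > bestLogit) {
      bestLogit = logits[tokenId];
      best = tokenId; // tokens outside the grammar are effectively masked out
    }
  }
  return best;
}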
This process allows the model to handle complex, nested, and recursive structures in JSON Schemas while reducing the risk of structural non-compliance to zero. However, there are a few limitations:
- The first request with a new schema incurs a preprocessing penalty, typically under 10 seconds but potentially up to a minute for particularly complex schemas.
- Subsequent requests with the same schema are fast due to cached artefacts.
- Only a subset of JSON Schema is supported to ensure optimal performance.
- The model may still refuse unsafe requests or fail to complete if it reaches token limits.
- While the output will be structurally valid, the content may still contain errors, such as in mathematical calculations.
Even with these limitations, JSON schemas are a great way of handling requirements for a prompt response. They drastically reduce the chance of incompatible data and free up valuable context in the prompt for other specifications.
Example: Schemas with Zod
One of the most popular libraries for defining, stringifying, and parsing JSON schemas is Zod. Zod is TypeScript-first, which means any schemas defined with Zod are typed and can, for example, be used when parsing a JSON response to safely access specific attributes. Note that Zod cannot generate a schema from the interfaces or classes you have already defined in your TypeScript application; instead, the schema itself becomes the source of truth from which types are inferred.
Defining such a schema is trivial.
import { z } from "zod";
const UserSchema = z.object({
id: z.string().uuid(),
name: z.string().min(2).max(50),
age: z.number().int().positive().max(120),
email: z.string().email(),
isActive: z.boolean(),
tags: z.array(z.string()).min(1).max(5),
preferences: z.object({
theme: z.enum(["light", "dark", "system"]),
notifications: z.boolean(),
}),
createdAt: z.date(),
});
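A nice side effect of defining the schema with Zod is that a matching TypeScript type can be inferred from it, so validated responses are fully typed. A small sketch, where rawLlmResponse is a placeholder for a JSON string returned by a model:

// The static type is derived from the schema; no separate interface is needed.
type User = z.infer<typeof UserSchema>;

declare const rawLlmResponse: string; // placeholder for the model output

const result = UserSchema.safeParse(JSON.parse(rawLlmResponse));
if (result.success) {
  const user: User = result.data; // typed, validated access
  console.log(user.preferences.theme);
} else {
  console.error(result.error.issues); // schema violations with paths and messages
}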
Example: Schema in GPT-4o
Using the schema with GPT-4o is quite simple. We just need to specify the correct response format and attach the schema. We do not need to explain the schema, its attributes, or the expected output format in the prompt itself unless we want to.
import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";
const openai = new OpenAI();
const unstructuredUserInformation = "";
const completion = await openai.beta.chat.completions.parse({
model: "gpt-4o-2024-08-06",
messages: [
{ role: "system", content: "Extract the user information." },
{ role: "user", content: unstructuredUserInformation },
],
response_format: zodResponseFormat(UserSchema, "user"),
});
const user = completion.choices[0].message.parsed;
console.log(user.name);
The attributes of the user can now safely be accessed and the likelihood of errors has been significantly reduced.
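Since the model may still refuse a request (one of the limitations listed above), it is worth checking the message’s refusal field alongside parsed before using the result; a short sketch based on the same completion object:

const message = completion.choices[0].message;
if (message.refusal) {
  // the model declined to answer; handle this case instead of parsing
  console.warn("Model refusal:", message.refusal);
} else if (message.parsed) {
  console.log(message.parsed.email);
}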
Example: Schema in Llama & Mistral
For Llama and Mistral models served through the Together AI API, we can use the same Zod schema we defined earlier. We just need to convert it to a JSON Schema using the zod-to-json-schema library and pass it along in the response_format. The example below uses Mixtral; swapping in one of the meta-llama/Meta-Llama-3.1-*-Instruct-Turbo models from the table above works the same way. The process is straightforward:
import Together from "together-ai";
import { zodToJsonSchema } from "zod-to-json-schema";

const together = new Together();
const jsonSchema = zodToJsonSchema(UserSchema, "UserSchema");
const mistralExtract = await together.chat.completions.create({
messages: [
{
role: "system",
content: "Extract the user information from the given text.",
},
{ role: "user", content: unstructuredUserInformation },
],
model: "mistralai/Mixtral-8x7B-Instruct-v0.1",
response_format: { type: "json_object", schema: jsonSchema },
});

const mistralUser = JSON.parse(mistralExtract.choices[0]?.message?.content ?? "{}");
console.log("Mistral result:", mistralUser.name);
Example: Schema in Gemini
Google’s Gemini model uses a slightly different approach for schema definition. Instead of using Zod directly, we need to translate our schema into Google’s FunctionDeclarationSchemaType format. Here’s how we can adapt our UserSchema for Gemini:
import { GoogleGenerativeAI, FunctionDeclarationSchemaType } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI("YOUR_API_KEY");
const model = genAI.getGenerativeModel({
model: "gemini-1.5-pro",
generationConfig: {
responseMimeType: "application/json",
responseSchema: {
type: FunctionDeclarationSchemaType.OBJECT,
properties: {
id: { type: FunctionDeclarationSchemaType.STRING },
name: { type: FunctionDeclarationSchemaType.STRING },
age: { type: FunctionDeclarationSchemaType.NUMBER },
email: { type: FunctionDeclarationSchemaType.STRING },
isActive: { type: FunctionDeclarationSchemaType.BOOLEAN },
tags: {
type: FunctionDeclarationSchemaType.ARRAY,
items: { type: FunctionDeclarationSchemaType.STRING },
},
preferences: {
type: FunctionDeclarationSchemaType.OBJECT,
properties: {
theme: { type: FunctionDeclarationSchemaType.STRING },
notifications: { type: FunctionDeclarationSchemaType.BOOLEAN },
},
},
createdAt: { type: FunctionDeclarationSchemaType.STRING },
},
},
},
});
const geminiResult = await model.generateContent(
"Extract user information from: " + unstructuredUserInformation
);
const geminiUser = JSON.parse(geminiResult.response.text());
console.log("Gemini result:", geminiUser.name);
Conclusion & Outlook
The field of Generative AI is rapidly evolving, with new models, technologies, and concepts being introduced nearly weekly. While Large Language Models (LLMs) excel at generating human-like text, many real-world use cases require more deterministic behaviour. The introduction of JSON Schema support in models like GPT-4o, Llama, Mistral, and Gemini represents a significant leap forward in addressing this challenge, allowing developers to harness the power of LLMs while ensuring structured, consistent outputs.
JSON Schema provides a robust method for constraining LLM outputs, effectively bridging the gap between flexible natural language generation and the strict requirements of structured data processing. By employing JSON Schemas, developers can now guarantee that LLM responses conform to specific formats, greatly simplifying integration with existing systems and workflows. As LLM technology progresses, we can anticipate further refinements in schema-based output control, potentially supporting even more complex structures and constraints. Mastering JSON Schema techniques for LLMs will be crucial for developers aiming to harness these models’ full potential in production environments, paving the way for innovative solutions that combine AI’s generative capabilities with the precision demanded by real-world applications.