Exchanging JSON Data with Large Language Models


2024-07-01 20:19 | Source: compiled from the web


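As large language models become more capable, conventional applications increasingly need to call LLM APIs to become smarter and improve the user experience. An LLM's output, however, is normally a plain string. A few vendors support a JSON mode, and function calling can be used to force JSON output, but in most cases the application still has to handle the JSON itself. This article summarizes the problems I ran into while parsing JSON returned by LLMs, together with the solutions.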

1. How do you get an LLM to return JSON?

LLMs handle Markdown and JSON fairly well. If the instruction asks for JSON, the model will generally comply:

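You are a master translator. I will give you a passage in Chinese, and you will translate it into English, Japanese, and Korean. Return JSON with three properties: english, japanese, korean. Start translating now. The Chinese text is: 阿里巴巴是一家伟大的公司。

Result:

```json
{
  "english": "Alibaba is a great company.",
  "japanese": "アリババは素晴らしい会社です。",
  "korean": "알리바바는 위대한 회사입니다."
}
```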

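At this point a regular expression can pull the JSON out of the Markdown code fence:

// Extract the JSON payload from an optional Markdown code fence, then parse it.
function parseFencedJson(s) {
  const match = /```(json)?(.*)```/s.exec(s);
  if (!match) {
    // No fence found: assume the whole reply is JSON.
    return JSON.parse(s);
  }
  return JSON.parse(match[2]);
}

Getting stable JSON back is not that easy, though. A weaker model may return something like this: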

Here is the translation in JSON format: { "english": "Alibaba is a great company.", "japanese": "アルイババは偉大な企業です。", "korean": "알리바바는 위대한 기업입니다." } Let me know if you need anything else! 😊
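A reply like the one above has no code fence, so the fence regex does not help. One possible workaround, sketched here under the assumption that the first `{` or `[` in the reply starts the payload, is to scan for the first balanced JSON span and parse only that. The function name `extractFirstJson` is made up for this illustration.

```typescript
// Find the first balanced {...} or [...] span in a chatty reply and parse it.
// Quotes and escapes are tracked so that braces inside string values are ignored.
function extractFirstJson(reply: string): unknown {
  const start = reply.search(/[{\[]/);
  if (start === -1) {
    throw new Error("no JSON object or array found in reply");
  }
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < reply.length; i++) {
    const ch = reply[i];
    if (inString) {
      if (escaped) {
        escaped = false;
      } else if (ch === "\\") {
        escaped = true;
      } else if (ch === '"') {
        inString = false;
      }
      continue;
    }
    if (ch === '"') {
      inString = true;
    } else if (ch === "{" || ch === "[") {
      depth++;
    } else if (ch === "}" || ch === "]") {
      depth--;
      if (depth === 0) {
        return JSON.parse(reply.slice(start, i + 1));
      }
    }
  }
  throw new Error("JSON in reply is not terminated");
}

const chatty =
  'Here is the translation in JSON format: { "english": "Alibaba is a great company." } Let me know if you need anything else!';
const parsed = extractFirstJson(chatty) as { english: string };
console.log(parsed.english);
```

The scan is deliberately simple: if the prose before the payload itself contains a bracket, the wrong span is picked and `JSON.parse` throws, so the caller should still be ready to retry.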

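Even when the reply is valid JSON, the property names and the shape of the values (possibly nested arrays and objects) are not guaranteed to be right every time, especially in complex scenarios. The following approaches are currently available to make sure the reply follows the expected JSON format.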

1.1 JSON mode

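When calling OpenAI's gpt-4-turbo-preview or gpt-3.5-turbo-0125 models, you can set response_format to { "type": "json_object" } to enable JSON mode. Once enabled, the model is constrained to generate only strings that parse into a valid JSON object. For details, see:

https://platform.openai.com/docs/guides/text-generation/json-mode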

Example code:

import OpenAI from "openai";

const openai = new OpenAI();

async function main() {
  const completion = await openai.chat.completions.create({
    messages: [
      {
        role: "system",
        content: "You are a helpful assistant designed to output JSON.",
      },
      { role: "user", content: "Who won the world series in 2020?" },
    ],
    model: "gpt-3.5-turbo-0125",
    response_format: { type: "json_object" },
  });
  console.log(completion.choices[0].message.content);
}

main();

Response:

"content": "{\"winner\": \"Los Angeles Dodgers\"}"

Note: at the time of writing, hardly any vendor other than OpenAI supports JSON mode.
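JSON mode constrains syntax, not shape, and a reply that stops at the token limit can still be cut off mid-object. The sketch below shows one way to guard both cases before using the content. The `Choice` and `Winner` interfaces are trimmed-down assumptions that mirror only the fields used here, and `readJsonModeReply` is a hypothetical helper, not part of the OpenAI SDK.

```typescript
// Minimal shape of one chat completion choice; only the fields used below.
interface Choice {
  finish_reason: string;
  message: { content: string | null };
}

interface Winner {
  winner: string;
}

function readJsonModeReply(choice: Choice): Winner {
  // A reply that stopped because of the token limit may be incomplete JSON.
  if (choice.finish_reason === "length") {
    throw new Error("reply was truncated; the JSON may be incomplete");
  }
  if (choice.message.content === null) {
    throw new Error("reply has no content");
  }
  const value: unknown = JSON.parse(choice.message.content);
  // Valid JSON is not necessarily the expected shape, so check the key explicitly.
  if (
    typeof value !== "object" ||
    value === null ||
    typeof (value as Record<string, unknown>).winner !== "string"
  ) {
    throw new Error("reply is valid JSON but not the expected shape");
  }
  return value as Winner;
}

const sampleChoice: Choice = {
  finish_reason: "stop",
  message: { content: '{"winner": "Los Angeles Dodgers"}' },
};
const winnerReply = readJsonModeReply(sampleChoice);
console.log(winnerReply.winner);
```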

1.2 function call

Function calling was not designed to solve the JSON formatting problem. Its main purpose is to connect a large language model to external tools. You describe functions in the conversation, and the model chooses to output a JSON object containing the arguments for calling one or more of them. The Chat Completions API does not call the function itself; the model generates JSON, and you use that JSON to call the function in your own code.

const messages = [
  { role: 'system', content: 'You are a helpful assistant.' },
  {
    role: 'user',
    content:
      'Send an email to [email protected]. The subject should wish him a happy birthday, and the body should contain the birthday wishes.',
  },
];

const response = await openai.chat.completions.create({
  messages: messages,
  model: 'gpt-4-1106-preview',
  tools: [
    {
      type: 'function',
      function: {
        name: 'send_email',
        description: 'Send an email',
        parameters: {
          type: 'object',
          properties: {
            to: {
              type: 'string',
              description: 'Email address of the recipient',
            },
            subject: {
              type: 'string',
              description: 'Subject of the email',
            },
            body: {
              type: 'string',
              description: 'Body of the email',
            },
          },
          required: ['to', 'body'],
        },
      },
    },
  ],
});

const responseMessage = response.choices[0].message;
console.log(JSON.stringify(responseMessage));

Response:

{
  "content": null,
  "role": "assistant",
  "tool_calls": [
    {
      "function": {
        "arguments": "{\"to\":\"[email protected]\",\"subject\":\"Happy birthday\",\"body\":\"Dear 无弃, happy birthday! May the new year bring you happiness, good health, and dreams come true.\"}",
        "name": "send_email"
      },
      "id": "call_JqC8t3jlmg25uDJg7mwHvvOG",
      "type": "function"
    }
  ]
}

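Here we can use the function parameters inside tools to define the JSON shape we want back. The parameters field follows the JSON Schema specification (https://json-schema.org/learn/getting-started-step-by-step), and the arguments field of the returned tool_calls is then a standard JSON string.

Note: not every model supports function calling.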
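Because arguments arrives as a string, the application still has to parse it and check the required fields before calling real code. A minimal sketch of that last step follows; `sendEmail` and `dispatchToolCall` are hypothetical local functions, the `ToolCall` interface mirrors only the fields shown in the response above, and the sample address is invented for the illustration.

```typescript
// Shape of one entry in tool_calls, limited to the fields used here.
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string };
}

// Hypothetical local implementation that the model's call is routed to.
function sendEmail(to: string, subject: string, body: string): string {
  return "queued mail to " + to + " (" + subject + ", " + body.length + " chars)";
}

function dispatchToolCall(call: ToolCall): string {
  if (call.function.name !== "send_email") {
    throw new Error("unknown function: " + call.function.name);
  }
  const args = JSON.parse(call.function.arguments) as Record<string, unknown>;
  // The schema marks `to` and `body` as required, but verify them anyway.
  if (typeof args.to !== "string" || typeof args.body !== "string") {
    throw new Error("send_email call is missing required arguments");
  }
  const subject = typeof args.subject === "string" ? args.subject : "(no subject)";
  return sendEmail(args.to, subject, args.body);
}

const sampleCall: ToolCall = {
  id: "call_example",
  type: "function",
  function: {
    name: "send_email",
    arguments: '{"to":"someone@example.com","subject":"Happy birthday","body":"Best wishes!"}',
  },
};
const dispatchResult = dispatchToolCall(sampleCall);
console.log(dispatchResult);
```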

1.3 LangChain with Zod

Zod is a TypeScript-first schema declaration and validation library.

https://github.com/colinhacks/zod/blob/HEAD/README_ZH.md

import { z } from "zod";

const User = z.object({
  username: z.string(),
});

User.parse({ username: "无弃" }); // => { username: "无弃" }
User.parse({ name: "无弃" }); // => throws ZodError

In langchain.js, the Structured output parser uses Zod to declare and validate the JSON format.

1.3.1 Declaring the JSON format to return

import { z } from "zod";
import { StructuredOutputParser } from "langchain/output_parsers";

const parser = StructuredOutputParser.fromZodSchema(
  z.object({
    answer: z.string().describe("answer to the user's question"),
    sources: z
      .array(z.string())
      .describe("sources used to answer the question, should be websites."),
  })
);

console.log(parser.getFormatInstructions());

/*
Answer the users question as best as possible.
You must format your output as a JSON value that adheres to a given "JSON Schema" instance.

"JSON Schema" is a declarative language that allows you to annotate and validate JSON documents.

For example, the example "JSON Schema" instance {{"properties": {{"foo": {{"description": "a list of test words", "type": "array", "items": {{"type": "string"}}}}}}, "required": ["foo"]}}}}
would match an object with one required property, "foo". The "type" property specifies "foo" must be an "array", and the "description" property semantically describes it as "a list of test words". The items within "foo" must be strings.
Thus, the object {{"foo": ["bar", "baz"]}} is a well-formatted instance of this example "JSON Schema". The object {{"properties": {{"foo": ["bar", "baz"]}}}} is not well-formatted.

Your output will be parsed and type-checked according to the provided schema instance, so make sure all fields in your output match the schema exactly and there are no trailing commas!

Here is the JSON Schema instance your output must adhere to. Include the enclosing markdown codeblock:
```
{"type":"object","properties":{"answer":{"type":"string","description":"answer to the user's question"},"sources":{"type":"array","items":{"type":"string"},"description":"sources used to answer the question, should be websites."}},"required":["answer","sources"],"additionalProperties":false,"$schema":"http://json-schema.org/draft-07/schema#"}
```

What is the capital of France?
*/

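Pass the JSON format you want to declare into StructuredOutputParser.fromZodSchema, and parser.getFormatInstructions() returns a block of prompt text. It explains what "JSON Schema" is, gives an example, and ends with the "JSON Schema" the reply should follow. Append this text to the prompt that is finally sent to the LLM to require the model to return exactly this JSON format.

1.3.2 Extraction and validation

Splice format_instructions into the full prompt; chain.invoke then automatically parses the reply into a JSON object.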

import { z } from "zod";
import { OpenAI } from "@langchain/openai";
import { RunnableSequence } from "@langchain/core/runnables";
import { StructuredOutputParser } from "langchain/output_parsers";
import { PromptTemplate } from "@langchain/core/prompts";

// `parser` is the StructuredOutputParser declared in 1.3.1.
const chain = RunnableSequence.from([
  PromptTemplate.fromTemplate(
    "Answer the users question as best as possible.\n{format_instructions}\n{question}"
  ),
  new OpenAI({ temperature: 0 }),
  parser,
]);

const response = await chain.invoke({
  question: "What is the capital of France?",
  format_instructions: parser.getFormatInstructions(),
});

console.log(response);
/*
{ answer: 'Paris', sources: [ 'https://en.wikipedia.org/wiki/Paris' ] }
*/

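If the reply does not match the data types declared for answer and sources, an error is thrown immediately. You can also use the Auto-fixing parser to retry and repair:

https://js.langchain.com/docs/modules/model_io/output_parsers/types/output_fixing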

Note: with a weaker model, the long prompt returned by parser.getFormatInstructions() can itself cause incorrect replies, because the bulk of instructions gets in the way of the actual task.

For example, with the following prompt:

You are a translation expert responsible for translating the input from Chinese into English. The content to translate is: 你好。 You must format your output as a JSON value that adheres to a given "JSON Schema" instance. "JSON Schema" is a declarative language that allows you to annotate and validate JSON documents. For example, the example "JSON Schema" instance {{"properties": {{"foo": {{"description": "a list of test words", "type": "array", "items": {{"type": "string"}}}}}}, "required": ["foo"]}}}} would match an object with one required property, "foo". The "type" property specifies "foo" must be an "array", and the "description" property semantically describes it as "a list of test words". The items within "foo" must be strings. Thus, the object {{"foo": ["bar", "baz"]}} is a well-formatted instance of this example "JSON Schema". The object {{"properties": {{"foo": ["bar", "baz"]}}}} is not well-formatted. Your output will be parsed and type-checked according to the provided schema instance, so make sure all fields in your output match the schema exactly and there are no trailing commas! Here is the JSON Schema instance your output must adhere to. Include the enclosing markdown codeblock: {"type":"object","properties":{"output":{"type":"string","description":"the translated result"}},"required":["output"],"additionalProperties":false,"$schema":"http://json-schema.org/draft-07/schema#"}

The model returned:

```json
{
  "type": "object",
  "properties": {
    "output": {
      "type": "string",
      "description": "the translated result"
    }
  },
  "required": ["output"],
  "additionalProperties": false,
  "$schema": "http://json-schema.org/draft-07/schema#"
}
```
```json
{
  "output": "Hello."
}
```

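Here the model mistakenly echoed the JSON Schema definition before giving the actual answer, which broke parsing. The probability is low, probably under 5%, but with enough calls you will run into it.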
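One way to tolerate this failure mode, sketched here as an assumption rather than anything LangChain does itself, is to collect every fenced block in the reply and keep the last one that parses and contains the keys you expect. The helper names `fencedBlocks` and `pickAnswer` are invented for this example, and the fence marker is built with `repeat` only so that the listing does not contain a literal fence.

```typescript
const FENCE = "`".repeat(3);

// Collect the contents of every fenced block in a reply, in order.
function fencedBlocks(reply: string): string[] {
  const blocks: string[] = [];
  const pattern = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE, "g");
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(reply)) !== null) {
    blocks.push((m[1] ?? "").trim());
  }
  return blocks;
}

// Return the last block that parses as an object and has every required key.
function pickAnswer(reply: string, requiredKeys: string[]): Record<string, unknown> {
  const candidates = fencedBlocks(reply).reverse();
  for (const block of candidates) {
    let value: unknown;
    try {
      value = JSON.parse(block);
    } catch {
      continue; // not JSON, try the previous block
    }
    if (typeof value === "object" && value !== null && !Array.isArray(value)) {
      const obj = value as Record<string, unknown>;
      if (requiredKeys.every((key) => key in obj)) {
        return obj;
      }
    }
  }
  throw new Error("no fenced block contains the expected keys");
}

// A reply that echoes the schema first and answers second.
const echoedReply =
  FENCE + 'json\n{"type":"object","properties":{"output":{"type":"string"}},"required":["output"]}\n' + FENCE +
  "\n" +
  FENCE + 'json\n{"output":"Hello."}\n' + FENCE;
const picked = pickAnswer(echoedReply, ["output"]);
console.log(picked.output);
```

The echoed schema is skipped because it has no top-level `output` key; a schema-aware validator such as the Zod parser could replace the key check.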

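1.4 TypeChat with TypeScript

If all you need is to declare the JSON format of the reply, there is an option besides Zod: a TypeScript interface turns out to be very well suited to describing JSON:

You are a master translator. I will give you a passage in Chinese, and you will translate it into English, Japanese, and Korean. Return JSON that conforms to this TypeScript interface:
interface Response {
  english: string;
  japanese: string;
  korean: string;
}
Start translating now. The Chinese text is: 阿里巴巴是一家伟大的公司。

The returned JSON will conform to the Response definition.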

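TypeChat follows exactly this idea: instead of natural-language prompts, you write TypeScript type definitions to guide the model toward well-typed, structured responses, replacing the prompt with a schema. See https://github.com/microsoft/TypeChat.

A simple, illustrative example (the TypeChat class and its process method are simplified pseudocode for the idea, not the library's actual API; see the linked examples for real usage):

import { TypeChat } from 'typechat';

interface CoffeeOrder {
  type: string;
  size: string;
  extras: string[];
}

const typeChat = new TypeChat();

// User input
const userInput = "I would like a large cappuccino with extra foam and a shot of vanilla.";

// Use TypeChat to obtain structured data
const order = typeChat.process(userInput);

console.log(order);
// Output: { type: 'cappuccino', size: 'large', extras: ['extra foam', 'shot of vanilla'] }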

Real usage is somewhat more involved, because it includes type validation, error correction, and retries.

For more examples, see: https://github.com/microsoft/TypeChat/blob/main/typescript/examples/math/src/main.ts
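The validate, correct, and retry cycle can be sketched independently of any library. The code below is an illustration under stated assumptions, not TypeChat's implementation: `callModel` is a synchronous stand-in for a real (asynchronous) LLM call, and `translateWithRepair` and `validateOrder` are names made up here.

```typescript
type Validation<T> = { ok: true; value: T } | { ok: false; error: string };

// Ask the model, validate the reply, and on failure feed the error back for another try.
function translateWithRepair<T>(
  callModel: (prompt: string) => string,
  validate: (raw: string) => Validation<T>,
  request: string,
  maxAttempts: number
): T {
  let prompt = request;
  let lastError = "no attempts made";
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const raw = callModel(prompt);
    const result = validate(raw);
    if (result.ok) {
      return result.value;
    }
    lastError = result.error;
    // Repair prompt: original request plus the diagnostic and the bad reply.
    prompt =
      request +
      "\nYour previous reply was invalid: " + result.error +
      "\nPrevious reply: " + raw +
      "\nReturn corrected JSON only.";
  }
  throw new Error("gave up after " + maxAttempts + " attempts: " + lastError);
}

interface CoffeeOrder {
  type: string;
  size: string;
  extras: string[];
}

function validateOrder(raw: string): Validation<CoffeeOrder> {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return { ok: false, error: "not valid JSON" };
  }
  const o = value as Partial<CoffeeOrder> | null;
  if (
    o === null ||
    typeof o !== "object" ||
    typeof o.type !== "string" ||
    typeof o.size !== "string" ||
    !Array.isArray(o.extras)
  ) {
    return { ok: false, error: "expected { type: string; size: string; extras: string[] }" };
  }
  return { ok: true, value: { type: o.type, size: o.size, extras: o.extras } };
}

// Fake model: the first reply lacks `extras`, the second one is correct.
const cannedReplies = [
  '{"type":"cappuccino","size":"large"}',
  '{"type":"cappuccino","size":"large","extras":["extra foam"]}',
];
let modelCalls = 0;
const repairedOrder = translateWithRepair(
  (_prompt: string) => cannedReplies[modelCalls++] ?? "",
  validateOrder,
  "I would like a large cappuccino with extra foam.",
  3
);
console.log(repairedOrder.extras, modelCalls);
```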

1.5 few shot

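In practice, each of these approaches has its strengths and weaknesses:

‒ JSON mode is not supported by most models;

‒ function calling is not widely supported either, and a conversation does not necessarily trigger the function;

‒ LangChain with Zod produces a long prompt that consumes a lot of tokens and, with some probability, misleads the model;

‒ TypeChat requires the TypeScript definitions to be declared in advance and is fairly coupled to its own framework, so it is not well suited to standalone use.

Complex scenarios are especially difficult, for example when the shape of the returned JSON is determined by the input: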

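You are a data-mocking expert. I will give you a data description, and you will generate a set of mock data. The data structure is as follows:
```
[
  {"path":"param_0","text":"App name","isArray":false},
  {"path":"param_1","text":"App icon","isArray":false},
  {"path":"param_2","text":"App description","isArray":false}
]
```
Please generate the mock data.

The expected reply is:

{
  "param_0": "OA Approval",
  "param_1": "https://via.placeholder.com/300x200",
  "param_2": "OA Approval is a low-code platform for building forms and workflows that lets you set up an approval flow quickly"
}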

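In this case the JSON shape of the reply cannot be declared in advance. The most we can do is declare the outer property and have the content returned as a string:

// zod
z.object({
  mockData: z.string().describe("mock data"),
})

// typechat
interface IResponse {
  mockData: string; // mock data
}

But replies in this form are extremely unstable. Two variants come back:

// Correct:
{
  "mockData": "{\"param_0\":\"OA Approval\",\"param_1\":\"https://via.placeholder.com/300x200\",\"param_2\":\"OA Approval is a low-code platform for building forms and workflows that lets you set up an approval flow quickly\"}"
}

// Wrong: validation fails because the value of mockData is not a string
{
  "mockData": {"param_0":"OA Approval","param_1":"https://via.placeholder.com/300x200","param_2":"OA Approval is a low-code platform for building forms and workflows that lets you set up an approval flow quickly"}
}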
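If this wrapper shape has to be kept, one defensive option is to accept both variants and normalize them after parsing. This is a sketch with an invented helper name, `normalizeMockData`; it assumes the outer reply has already been parsed into an object.

```typescript
// Accept mockData either as a JSON string or as an already-nested object.
function normalizeMockData(reply: { mockData: unknown }): Record<string, unknown> {
  const raw = reply.mockData;
  // Variant 1: the model followed the schema and returned a JSON string.
  // Variant 2: the model inlined the object; use it as it is.
  const value: unknown = typeof raw === "string" ? JSON.parse(raw) : raw;
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    throw new Error("mockData is neither a JSON object nor a string containing one");
  }
  return value as Record<string, unknown>;
}

const fromString = normalizeMockData({ mockData: '{"param_0":"OA Approval"}' });
const fromObject = normalizeMockData({ mockData: { param_0: "OA Approval" } });
console.log(fromString.param_0 === fromObject.param_0);
```

With a schema validator in front, the declared type would have to be loosened accordingly, for example to a union of string and object, before this normalization can run.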

In this situation few-shot prompting is also a good choice: give a few examples, then extract the JSON directly from the reply:

You are a data-mocking assistant. I will give you a description of the variables to generate, and you will generate mock data as required. Here are some examples:
Example 1:
------
Input:
[{"path":"paramArray_0","text":"Returned content","isArray":true,"children":[{"path":"param_0","text":"Product title","isArray":false},{"path":"param_1","text":"Product image","isArray":false},{"path":"param_2","text":"Product price","isArray":false},{"path":"param_3","text":"Product link","isArray":false}]},{"path":"param_4","text":"Today (flow trigger time)","isArray":false}]
Reasoning:
paramArray_0 is the returned content; isArray is true, so it is an array. Among its children, param_0 is the product title; param_1 is the product image, an http link; param_2 is the product price, which should be a numeric string; param_3 is the product link, an http link; param_4 is today's time, a formatted timestamp.
Output mock data:
```json
{"paramArray_0":[{"param_0":"Apple iPhone 14 Pro Max 5G smartphone 256GB Space Black","param_1":"https://via.placeholder.com/300x200","param_2":"8999.99","param_3":"https://item.taobao.com/item.htm?id=37221120302"},{"param_0":"Samsung Galaxy S23 Ultra 5G flagship phone 12GB+256GB Phantom Black","param_1":"https://via.placeholder.com/300x200","param_2":"8999.99","param_3":"https://item.taobao.com/item.htm?id=525519066299"}],"param_4":"2024-04-16 21:07:45"}
```
------
Example 2:
------
Input:
[{"path":"param_0","text":"App name","isArray":false},{"path":"param_1","text":"App icon","isArray":false},{"path":"param_2","text":"App description","isArray":false}]
Reasoning:
param_0 is the app name; param_1 is the app icon, which should be an http link to an image; param_2 is the app description.
Output mock data:
```json
{"param_0":"OA Approval","param_1":"https://via.placeholder.com/300x200","param_2":"OA Approval is a low-code platform for building forms and workflows that lets you set up an approval flow quickly"}
```
------
Now begin for real. Input:
[{"path":"paramArray_0","originkey":"$.node_service.payload","text":"Returned content","pathText":"Search products.Returned content","isArray":true,"children":[{"path":"param_0","text":"Product title","isArray":false},{"path":"param_1","text":"Product price","isArray":false},{"path":"param_2","text":"Product image","isArray":false},{"path":"param_3","text":"Product link","isArray":false}]}]
Generate the mock data. The generated data must match the variable description. Return only the mock data and nothing else.

Reply:

```json
{
  "paramArray_0": [
    {
      "param_0": "Apple iPhone 14 Pro",
      "param_1": "5999.99",
      "param_2": "https://via.placeholder.com/300x200",
      "param_3": "https://item.taobao.com/item.htm?id=1111"
    },
    {
      "param_0": "Samsung Galaxy S23",
      "param_1": "6999.99",
      "param_2": "https://via.placeholder.com/300x200",
      "param_3": "https://item.taobao.com/item.htm?id=2222"
    }
  ]
}
```

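Extracting the JSON directly with a regular expression is fairly stable at this point. The one drawback is that, besides the JSON, the model tends to ramble on with extra descriptive text, so the prompt has to stress repeatedly that only the mock data should be returned and nothing else.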
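Assembling such a prompt by hand gets error-prone as examples accumulate, so it can help to build it from data. The sketch below is one possible layout modelled on the prompt above; the `Shot` interface and `buildFewShotPrompt` are hypothetical names, and the wording of the fixed sentences is only an example.

```typescript
const FENCE = "`".repeat(3);

// One worked example: input description, reasoning, and the expected mock data.
interface Shot {
  input: unknown;
  reasoning: string;
  output: unknown;
}

function buildFewShotPrompt(task: string, shots: Shot[], input: unknown): string {
  const parts: string[] = [task, "Here are some examples:"];
  shots.forEach((shot, index) => {
    parts.push(
      "Example " + (index + 1) + ":",
      "------",
      "Input:",
      JSON.stringify(shot.input),
      "Reasoning:",
      shot.reasoning,
      "Output mock data:",
      FENCE + "json",
      JSON.stringify(shot.output),
      FENCE,
      "------"
    );
  });
  parts.push(
    "Now begin for real. Input:",
    JSON.stringify(input),
    "Generate the mock data. It must match the variable description. Return only the mock data and nothing else."
  );
  return parts.join("\n");
}

const fewShotPrompt = buildFewShotPrompt(
  "You are a data-mocking assistant. Generate mock data for the variable description I give you.",
  [
    {
      input: [{ path: "param_0", text: "App name", isArray: false }],
      reasoning: "param_0 is the app name.",
      output: { param_0: "OA Approval" },
    },
  ],
  [{ path: "param_0", text: "Product title", isArray: false }]
);
console.log(fewShotPrompt);
```

Serializing every example with `JSON.stringify` keeps the examples themselves valid JSON, which matters because the model imitates whatever it is shown.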

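2. Template syntax combined with JSON

You do not always have to call the AI API every time you need data. The AI API can also generate JSON templates, for example a card template in which placeholders and loop statements mark the slots; at run time the template is rendered with real data to produce the actual JSON.

On the front end, you can use a template rendering engine such as EJS or nunjucks and have the LLM generate the template code.

In my experiments, nunjucks worked comparatively well: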

[
  {% for item in paramArray_0 %}
  {
    "type": "mediaContent",
    "value": {
      "link": "{{ item.param_3 }}",
      "title": "{{ item.param_0 }}",
      "cover": "{{ item.param_2 }}",
      "tagList": ["￥{{ item.param_1 }}"]
    }
  }{% if not loop.last %},{% endif %}
  {% endfor %}
]

Then use nunjucks to render the template with the data and produce the complete JSON string. The comma after each item is guarded by loop.last, because a trailing comma would make the rendered string fail JSON.parse. The cover is taken from param_2 (the image) and the price tag from param_1, matching the data below.

import nunjucks from 'nunjucks';

const data = nunjucks.renderString(
  `[
  {% for item in paramArray_0 %}
  {
    "type": "mediaContent",
    "value": {
      "link": "{{ item.param_3 }}",
      "title": "{{ item.param_0 }}",
      "cover": "{{ item.param_2 }}",
      "tagList": ["￥{{ item.param_1 }}"]
    }
  }{% if not loop.last %},{% endif %}
  {% endfor %}
]`,
  {
    paramArray_0: [
      {
        param_0: "Apple iPhone 14 Pro",
        param_1: "5999.99",
        param_2: "https://via.placeholder.com/300x200",
        param_3: "https://item.taobao.com/item.htm?id=1111",
      },
      {
        param_0: "Samsung Galaxy S23",
        param_1: "6999.99",
        param_2: "https://via.placeholder.com/300x200",
        param_3: "https://item.taobao.com/item.htm?id=2222",
      },
    ],
  }
);
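Whatever template engine is used, values dropped into a JSON template must be escaped as JSON string content, otherwise a title containing a backslash or a line break corrupts the rendered output. The sketch below illustrates the idea with a tiny placeholder filler; it is a stand-in written for this example, not nunjucks, and `jsonEscape` and `fillTemplate` are invented names.

```typescript
// Escape a value so it can sit between double quotes inside a JSON template.
function jsonEscape(value: string): string {
  // JSON.stringify adds the surrounding quotes; strip them and keep the escapes.
  return JSON.stringify(value).slice(1, -1);
}

// Replace {{ key }} placeholders with escaped values (flat keys only).
function fillTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (_match: string, key: string) => {
    const value = values[key];
    if (value === undefined) {
      throw new Error("no value for placeholder " + key);
    }
    return jsonEscape(value);
  });
}

const cardJson = fillTemplate('{"title": "{{ param_0 }}", "cover": "{{ param_2 }}"}', {
  param_0: 'iPhone 14 "Pro"\nnew',
  param_2: "https://via.placeholder.com/300x200",
});
const card = JSON.parse(cardJson) as { title: string; cover: string };
console.log(card.title);
```

In a real template engine the same effect is usually achieved with a custom filter applied to each placeholder.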

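3. Parsing JSON

JSON has several different parsing specifications:

‒ IETF JSON RFC (8259 and earlier versions): the official specification from the Internet Engineering Task Force (IETF).

‒ ECMAScript standard: changes to JSON are published in step with the RFC versions, and the standard defers to the RFC's guidance on JSON. However, the non-compliant conveniences that JavaScript interpreters offer, such as unquoted strings and comments, have inspired a good deal of "creativity" in parsers.

‒ JSON5: a superset specification that extends the official one by explicitly adding convenience features (such as comments, alternative quotes, unquoted keys, and trailing commas).

‒ HJSON: similar in spirit to JSON5, but with different design choices.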

3.1 JSON.parse

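A standard JSON string can be parsed directly into a JavaScript value with JSON.parse. The string must conform strictly to the JSON standard.

{"propertyName": "propertyValue"}

Compared with JavaScript syntax, JSON syntax is restricted, so many valid JavaScript literals do not parse as JSON. For example, JSON does not allow trailing commas, and property names (keys) in object literals must be enclosed in quotes. Quotes, comments, commas, and numbers all have to follow the specification; anything extra or missing causes an error. This is especially error-prone when a value is itself a JSON string, because its double quotes have to be escaped, so it is best to avoid that reply format.

{
  "mockData": "{\"param_0\":\"OA Approval\",\"param_1\":\"https://via.placeholder.com/300x200\",\"param_2\":\"OA Approval is a low-code platform for building forms and workflows that lets you set up an approval flow quickly\"}"
}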
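The strictness described above is easy to confirm, and it also shows why nested JSON strings should be produced by code rather than typed by hand. In this small sketch `isStrictJson` is a helper name made up for the illustration.

```typescript
// True when JSON.parse accepts the text, false when it throws.
function isStrictJson(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

const candidates = [
  '{"propertyName": "propertyValue"}', // valid
  '{"a": 1,}',                         // trailing comma
  "{a: 1}",                            // unquoted key
  "{'a': 1}",                          // single quotes
  '{"a": 1} // note',                  // comment
];
const verdicts = candidates.map((text) => isStrictJson(text));
console.log(verdicts);

// Nested JSON: let JSON.stringify do the escaping on the way in,
// and parse twice on the way out.
const inner = { param_0: "OA Approval" };
const wire = JSON.stringify({ mockData: JSON.stringify(inner) });
const outer = JSON.parse(wire) as { mockData: string };
const roundTrip = JSON.parse(outer.mockData) as { param_0: string };
console.log(wire, roundTrip.param_0);
```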

3.2 json5

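JSON5 is a proposed extension to JSON that aims to make it easier for humans to write and maintain by hand. It does this by adding a minimal set of syntax features taken directly from ECMAScript 5. See https://www.npmjs.com/package/json5.

import JSON5 from 'json5';

const obj = JSON5.parse('{unquoted:"key",trailing:"comma",}');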

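Objects

‒ Object keys may be unquoted identifiers, just like keys in JavaScript object literals

‒ A single trailing comma is allowed

Arrays

‒ A single trailing comma is allowed

Strings

‒ Strings may be single-quoted

‒ Strings may span multiple lines by escaping the line break

‒ Strings may include character escapes

Numbers

‒ Numbers may be hexadecimal

‒ Numbers may begin or end with a decimal point

‒ Numbers may be positive infinity, negative infinity, or NaN

‒ Numbers may begin with an explicit plus sign

Comments

‒ Single-line and multi-line comments are supported

White space

‒ Additional white space characters are allowed

Using json5 greatly improves tolerance for imperfect JSON strings.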
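A common way to combine the two parsers is to try the strict one first and fall back to a lenient one only when it fails. The sketch below keeps the fallback injectable so that it runs without dependencies; in practice `JSON5.parse` could be passed in its place. `parseWithFallbacks` is a hypothetical helper, and `stripTrailingCommas` is a deliberately naive stand-in that handles trailing commas only and would mangle a string value containing `,}`.

```typescript
type Parser = (text: string) => unknown;

// Try each parser in order and return the first successful result.
function parseWithFallbacks(text: string, parsers: Parser[]): unknown {
  const errors: string[] = [];
  for (const parse of parsers) {
    try {
      return parse(text);
    } catch (e) {
      errors.push(e instanceof Error ? e.message : String(e));
    }
  }
  throw new Error("all parsers failed: " + errors.join("; "));
}

// Naive stand-in for a lenient parser: removes commas directly before } or ].
const stripTrailingCommas: Parser = (text) =>
  JSON.parse(text.replace(/,\s*([}\]])/g, "$1"));

const strictFirst: Parser = (text) => JSON.parse(text);

const lenient = parseWithFallbacks('{"a": [1, 2,],}', [strictFirst, stripTrailingCommas]);
console.log(JSON.stringify(lenient));
```

Trying the strict parser first keeps well-formed replies on the fast, standard path and confines the lenient behaviour to replies that actually need it.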

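4. Streaming JSON output

If a single LLM reply contains a few hundred tokens, it can take ten seconds or more to come back. Making the user watch a spinner for that long is clearly not a good experience. But when JSON is streamed, every intermediate chunk is incomplete and cut off, so parsing it directly is bound to fail.

Surely you cannot just display the raw JSON string first and parse it once the stream has finished? Believe it or not, some people do exactly that!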


It is a little odd, but it does seem somewhat better than waiting on a spinner. Is there a more elegant way?

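4.1 Compiler techniques

Following the way JSON is parsed (https://www.json.org/json-zh.html), you can adapt the AST-building process so that it completes a JSON string that has been cut off midway.

For details, see: https://juejin.cn/post/7063413421298941983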

type LiteralValue = boolean | null;
type PrimitiveValue = number | LiteralValue | string;
type JSONArray = (PrimitiveValue | JSONArray | JSONObject)[];
type JSONObject = { [key: string]: PrimitiveValue | JSONArray | JSONObject };
type JSONValue = PrimitiveValue | JSONArray | JSONObject;

type ParseResult<T> = {
  success: boolean;
  // On success, the position of the value's last character within the whole JSON string.
  // On failure, the position at which parsing failed.
  position: number;
  value?: T;
};

enum MaybeJSONValue {
  LITERAL,
  NUMBER,
  STRING,
  ARRAY,
  OBJECT,
  UNKNOWN,
}

const ESCAPE_CHAR_MAP: {
  [key: string]: string;
} = {
  '\\\\': '\\',
  '\\"': '"',
  '\\b': '\b',
  '\\f': '\f',
  '\\n': '\n',
  '\\r': '\r',
};

export class JSONParserService {
  private input: string;

  private parseLiteral(cur = 0): ParseResult<LiteralValue> {
    cur = this.skipWhitespace(cur);
    if (this.input[cur] === 't') {
      if (this.input.substring(cur, cur + 4) === 'true') {
        return { success: true, position: cur + 3, value: true };
      }
    } else if (this.input[cur] === 'f') {
      if (this.input.substring(cur, cur + 5) === 'false') {
        return { success: true, position: cur + 4, value: false };
      }
    } else if (this.input[cur] === 'n') {
      if (this.input.substring(cur, cur + 4) === 'null') {
        return { success: true, position: cur + 3, value: null };
      }
    }
    return { success: false, position: cur };
  }

  private parseNumber(cur = 0): ParseResult<number> {
    cur = this.skipWhitespace(cur);
    const parseDigit = (cur: number, allowLeadingZero: boolean) => {
      let dights = '';
      if (!allowLeadingZero && this.input[cur] === '0') {
        return ['', cur] as const;
      }
      let allowZero = allowLeadingZero;
      while ( (allowZero ? '0' : '1') …
      // … (the parser listing is cut off here in the source)

// … (start of a React component that consumes the stream; its declaration and JSX tags are missing from the source)
{
  const { data: people, run } = useJsonStreaming({
    url: "/api/people",
    method: "GET",
  });

  return (
    {people && people.length > 0 && (
      {people.map((person, i) => (
        Name: {person.name} Age: {person.age} City: {person.city} Country: {person.country}
      ))}
    )}
  );
};
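The completion idea can also be tried out with far less machinery than a full parser. The sketch below is a simplified, best-effort variant and not the parser from the referenced article: it tracks open strings, objects, and arrays in the prefix received so far, appends the missing closers, and returns undefined when the result still does not parse (for example, when the stream stops in the middle of a key). The name `parsePartialJson` is made up for this example.

```typescript
// Best-effort parse of a truncated JSON prefix; returns undefined if it cannot be completed.
function parsePartialJson(prefix: string): unknown {
  const closers: string[] = [];
  let inString = false;
  let escaped = false;
  for (let i = 0; i < prefix.length; i++) {
    const ch = prefix[i];
    if (inString) {
      if (escaped) {
        escaped = false;
      } else if (ch === "\\") {
        escaped = true;
      } else if (ch === '"') {
        inString = false;
      }
      continue;
    }
    if (ch === '"') {
      inString = true;
    } else if (ch === "{") {
      closers.push("}");
    } else if (ch === "[") {
      closers.push("]");
    } else if (ch === "}" || ch === "]") {
      closers.pop();
    }
  }
  let text = prefix;
  if (inString) {
    if (escaped) {
      text = text.slice(0, -1); // drop a dangling backslash
    }
    text += '"'; // close the open string
  }
  text = text.replace(/\s+$/, "");
  const last = text.charAt(text.length - 1);
  if (last === ",") {
    text = text.slice(0, -1); // a trailing comma would be invalid
  } else if (last === ":") {
    text += " null"; // key without a value yet
  }
  text += closers.reverse().join("");
  try {
    return JSON.parse(text);
  } catch {
    return undefined;
  }
}

// Simulated stream: render the latest value that could be completed.
const chunks = ['{"paramArray_0": [{"param_0": "Apple iPh', 'one 14 Pro", "param_1": "5999', '.99"}]}'];
let received = "";
let latest: unknown = undefined;
for (const chunk of chunks) {
  received += chunk;
  const value = parsePartialJson(received);
  if (value !== undefined) {
    latest = value;
  }
  console.log(JSON.stringify(latest));
}
```

Keeping the last successfully completed value means the UI never regresses when a chunk ends at a point that cannot be completed.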
