Structured Generation with LLM (1): Introducing Kor, with Exercises Using a Free LLM API

Introduction

Structured generation with LLMs means having an LLM produce structured output that conforms to a predefined schema.

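Common application scenarios include:

Data processing. The core task is a -> b: extracting or generating schema-conforming results from source text, for example classifying a news article, extracting its keywords, or generating a summary.
Agents. The core task is tool calling: given a user query, selecting the appropriate tool and its input arguments.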

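This is the first installment of the Structured Generation series. It introduces Kor^[1], a prompt-based approach. Kor is well suited to data-processing scenarios, and its mechanism is simple and easy to understand, which makes it a good starting point.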

How Kor Works

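Structured generation with Kor proceeds as follows:

1. Define a schema, including its structure, descriptions, and examples.
2. Kor uses a specific prompt template to assemble the user-provided schema and the raw text to be processed into a prompt.
3. The prompt is sent to the LLM, which relies on its general in-context learning ability to generate content that conforms to the schema as closely as it can.
4. Kor parses the LLM output and returns a structured result that conforms to the schema. When the LLM output does not conform, no result is returned.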

Kor is responsible for steps 2 and 4. In other words, Kor is a thin wrapper around the LLM.
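The four steps can be sketched in a few lines of plain Python. This is only an illustration of the flow, not Kor's actual implementation: the prompt template is heavily shortened, and `build_prompt`, `parse_output`, `extract`, and `fake_llm` are hypothetical names, with `fake_llm` standing in for a real text-to-text LLM API.

```python
import json
import re
from typing import Callable, Optional

# Shortened stand-in for the prompt template (assumption, not Kor's real template).
PROMPT_TEMPLATE = (
    "Your goal is to extract structured information from the user's input "
    "that matches the form described below.\n\n"
    "{schema}\n\n"
    "Wrap the JSON in <json> tags.\n\n"
    "Input: {text}\n"
    "Output:"
)


def build_prompt(schema_description: str, text: str) -> str:
    """Step 2: combine the schema description and the raw text into a prompt."""
    return PROMPT_TEMPLATE.format(schema=schema_description, text=text)


def parse_output(raw: str, root_key: str) -> Optional[dict]:
    """Step 4: pull the JSON out of <json> tags; return None if it does not fit."""
    match = re.search(r"<json>(.*?)</json>", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or root_key not in data:
        return None
    return data


def extract(llm: Callable[[str], str], schema_description: str,
            root_key: str, text: str) -> Optional[dict]:
    prompt = build_prompt(schema_description, text)  # step 2: assemble the prompt
    raw = llm(prompt)                                # step 3: call the model
    return parse_output(raw, root_key)               # step 4: parse the reply


def fake_llm(prompt: str) -> str:
    """Hypothetical stub that always returns the same well-formed answer."""
    return '<json>{"translate_result": {"chinese": "你好"}}</json>'


# Step 1: the schema, written here directly as the text that goes into the prompt.
SCHEMA = (
    "translate_result: { // translation of arbitrary text\n"
    " chinese: string // Chinese translation\n"
    "}"
)

result = extract(fake_llm, SCHEMA, "translate_result", "Hello")
print(result)
```

Swapping `fake_llm` for a function that calls a real model is all it takes to change providers, which is why this style of wrapper works with any text-to-text API.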

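Kor's main advantage is ease of use. It does not need to intervene in the decoding process, so any text-to-text LLM API will do, whether the model is closed-source or open-source.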

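Kor's drawback is just as clear: it cannot guarantee that the extraction result satisfies the schema. Fundamentally, Kor only assembles the prompt for you; whether the output conforms to the schema still depends on the model's own instruction-following ability.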
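One common workaround, which the article does not describe as part of Kor, is to call the model again when its reply fails to parse. The sketch below assumes the reply is expected inside `<json>` tags; `parse_json_tag`, `extract_with_retry`, and `FlakyLLM` are hypothetical names, and `FlakyLLM` is a stub that fails once before answering correctly. Retrying raises the success rate but still guarantees nothing.

```python
import json
import re
from typing import Callable, Optional


def parse_json_tag(raw: str) -> Optional[dict]:
    """Return the JSON object found inside <json> tags, or None on any failure."""
    match = re.search(r"<json>(.*?)</json>", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


def extract_with_retry(llm: Callable[[str], str], prompt: str,
                       max_attempts: int = 3) -> Optional[dict]:
    """Call the model until its reply parses, at most max_attempts times."""
    for _ in range(max_attempts):
        data = parse_json_tag(llm(prompt))
        if data is not None:
            return data
    return None  # every attempt failed; the caller must handle this case


class FlakyLLM:
    """Hypothetical stub: ignores the format on the first call, then complies."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        if self.calls == 1:
            return "Sure! Here is the translation: 你好"
        return '<json>{"translate_result": {"chinese": "你好"}}</json>'


flaky = FlakyLLM()
result = extract_with_retry(flaky, "Input: Hello\nOutput:")
print(result, flaky.calls)
```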

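Two Exercises

With the mechanism covered, let's work through two exercises.

In these exercises, I use the free glm4-9b-chat API provided by SiliconFlow (硅基流动)^[2].

The code for this article is collected in the following git repository; stars are welcome: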

https://github.com/duanyu/structured_generation_with_llm
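The exercises rely on an `llm` object whose construction the article does not show. Here is a minimal sketch of one way to build it, assuming SiliconFlow exposes an OpenAI-compatible endpoint and that LangChain's `ChatOpenAI` wrapper is used. The base URL, model id, and environment variable name are assumptions rather than details from the article, so check them against the provider's documentation and the repository.

```python
import os

# Assumptions (not from the article): endpoint URL, model id, and env var name.
SILICONFLOW_BASE_URL = "https://api.siliconflow.cn/v1"
MODEL_NAME = "THUDM/glm-4-9b-chat"
API_KEY_ENV_VAR = "SILICONFLOW_API_KEY"


def llm_kwargs() -> dict:
    """Collect constructor arguments for an OpenAI-compatible chat model."""
    return {
        "model": MODEL_NAME,
        "base_url": SILICONFLOW_BASE_URL,
        "api_key": os.environ.get(API_KEY_ENV_VAR, ""),
        "temperature": 0,  # deterministic output is usually preferable for extraction
    }


def make_llm():
    """Build the chat model; requires `pip install langchain-openai`."""
    # Imported lazily so the configuration can be inspected without the package.
    from langchain_openai import ChatOpenAI

    return ChatOpenAI(**llm_kwargs())


# llm = make_llm()  # uncomment once the package is installed and the key is set
print(llm_kwargs()["model"])
```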

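Exercise 1: Translation

Example 1: Chinese translator

Result: given any input text, return {"translate_result": {"chinese": <translation>}}

Structured output generally takes just two steps:

1. Define the schema (the structure you want the LLM to output, together with descriptions and examples).
2. Use a structured-output tool (Kor in this article) to obtain a result that follows the schema.

Kor supports two ways of defining a schema: a Kor schema or a Pydantic model. This example uses a Kor schema.

Note: Kor itself is not covered in depth here; for details, see the documentation: https://eyurtsev.github.io/kor/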

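from kor.extraction import create_extraction_chain
from kor.nodes import Object, Text

# Kor schema: the output format we want
schema = Object(
    id="translate_result",
    description="任意文本的翻译结果。",  # "Translation of arbitrary text."
    attributes=[
        Text(
            id="chinese",
            description="中文翻译结果",  # "Chinese translation"
            examples=[],  # Kor supports few-shot examples, but this case is simple enough not to need any
            many=False,
        ),
    ],
    many=False,
)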

# Build the chain and run it (llm is the model object wrapping the glm4-9b-chat API)
chain = create_extraction_chain(llm, schema, encoder_or_encoder_class='json')
text = "We've trained a model, based on GPT-4, called CriticGPT to catch errors in ChatGPT's code output. We found that when people get help from CriticGPT to review ChatGPT code they outperform those without help 60% of the time. We are beginning the work to integrate CriticGPT-like models into our RLHF labeling pipeline, providing our trainers with explicit AI assistance. This is a step towards being able to evaluate outputs from advanced AI systems that can be difficult for people to rate without better tools."
print(chain.run(text)['data'])

{'translate_result': {'chinese': '我们训练了一个基于GPT-4的模型，称为CriticGPT，用于捕捉ChatGPT代码输出的错误。我们发现，当人们从CriticGPT那里获得帮助来审查ChatGPT代码时，他们比没有帮助的人高出60%的效率。我们正在开始将类似CriticGPT的模型集成到我们的RLHF标记流程中，为我们的训练师提供明确的AI辅助。这是朝着能够评估来自高级AI系统的输出迈出的一步，这些输出在没有更好的工具的情况下很难被人类评估。'}}

Example 1 runs successfully :)



Let's print Kor's prompt to see what it looks like.

print(chain.prompt.format_prompt(text="[user input]").to_string())

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

translate_result: { // 任意文本的翻译结果。
 chinese: string // 中文翻译结果
}
```

Please output the extracted information in JSON format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. If the text contains attributes that do not appear in the schema, please ignore them. All output must be in JSON format and follow the schema specified above. Wrap the JSON in <json> tags.

Input: [user input]
Output:
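The prompt above shows the schema rendered as a TypeScript-style type, with each description turned into a trailing comment. The following sketch reproduces that look for a flat object. It mimics the printed text only and is not Kor's actual rendering code; `render_object` and its `(name, type, description)` tuples are hypothetical.

```python
from typing import List, Tuple


def render_object(obj_id: str, description: str,
                  fields: List[Tuple[str, str, str]],
                  many: bool = False) -> str:
    """Render an object schema as a TypeScript-style type with comments.

    fields is a list of (name, type, description) tuples.
    many=True wraps the object in Array<...>, as the prompt does for lists.
    """
    opener = f"{obj_id}: Array<{{" if many else f"{obj_id}: {{"
    closer = "}>" if many else "}"
    lines = [f"{opener} // {description}"]
    for name, type_name, field_description in fields:
        lines.append(f" {name}: {type_name} // {field_description}")
    lines.append(closer)
    return "\n".join(lines)


rendered = render_object(
    "translate_result",
    "任意文本的翻译结果。",
    [("chinese", "string", "中文翻译结果")],
)
print(rendered)
```

Because the descriptions travel into the prompt as comments, they effectively act as instructions to the model, so they are worth wording carefully.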

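Exercise 2: Review Parsing

Example 2: review parsing

Expected result: given a user review, produce the aspect being evaluated (taste, price, etc.), the sentiment polarity (positive, negative, neutral), the sentiment word (delicious, expensive, etc.), and the supporting snippet from the original text.

The first step in structured output is to define the schema. We can set it up like this:

[
    {
        'aspect': <aspect being evaluated>,
        'sentiment': <sentiment polarity>,
        'sentiment_word': <sentiment word>,
        'reference': <supporting snippet>,
    }
]

In this example we define the schema with a Pydantic model. A Pydantic model also supports few-shot examples, and it has the added benefit of validating the parsed result.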

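import enum
from typing import Optional

from pydantic import BaseModel, Field

# Pydantic model for review parsing
class Sentiment(enum.Enum):
    positive = "positive"
    negative = "negative"
    neutral = "neutral"


class Dianpin(BaseModel):
    aspect: str = Field(
        description="评价属性"  # "aspect being evaluated"
    )
    sentiment_word: str = Field(
        description='对评价属性的评价词，从原文中抽取',  # "sentiment word for the aspect, extracted from the original text"
    )
    sentiment: Optional[Sentiment] = Field(
        # "sentiment toward the aspect, one of positive/negative/neutral";
        # "/" is used as the separator because "\n" in a plain string literal is a newline
        description='对评价属性的情感，positive/negative/neutral中的一个',
    )
    reference: str = Field(
        description='评价的原文'  # "original text of the review"
    )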

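from pprint import pprint

from kor import create_extraction_chain, from_pydantic

# Run Kor
schema, validator = from_pydantic(
    Dianpin,
    description='对评价的解析结果',  # "parsed result of the review"
    examples=[],
    many=True,  # allow a list of aspects
)
chain = create_extraction_chain(
    llm, schema, encoder_or_encoder_class="json", validator=validator
)

# Input: "Overall, the environment is fine and the taste is pretty good too, but the price is a bit high."
pprint(chain.run("整体来说，环境可以，味道的话也还不错，但价格有一点小贵。"))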

{'data': {},
 'errors': [ParseError('The LLM has returned structured data which does not match the expected schema. Providing additional examples may help improve the parse.')],
 'raw': '\n'
        '<json>\n'
        '[\n'
        '  {\n'
        '    "aspect": "环境",\n'
        '    "sentiment_word": "可以",\n'
        '    "sentiment": "positive"\n'
        '  },\n'
        '  {\n'
        '    "aspect": "味道",\n'
        '    "sentiment_word": "还不错",\n'
        '    "sentiment": "positive"\n'
        '  },\n'
        '  {\n'
        '    "aspect": "价格",\n'
        '    "sentiment_word": "小贵",\n'
        '    "sentiment": "negative"\n'
        '  }\n'
        ']\n'
        '</json>',
 'validated_data': {}}

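Note that the data field is empty because the LLM's reply does not match the expected schema: the raw output is a bare JSON array without the top-level dianpin key, and each item omits the reference field. Kor's error message suggests adding examples.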
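When a parse fails like this, it helps to check the raw reply against the expected shape before changing the prompt. The sketch below is a hypothetical helper, not part of Kor: `diagnose` and `REQUIRED_FIELDS` are illustrative names, and the raw string is the reply shown above in condensed form.

```python
import json
import re

# Fields every item is expected to carry, taken from the Dianpin model.
REQUIRED_FIELDS = {"aspect", "sentiment_word", "sentiment", "reference"}


def diagnose(raw: str, root_key: str = "dianpin") -> list:
    """List the ways a raw model reply deviates from the expected shape."""
    match = re.search(r"<json>(.*?)</json>", raw, flags=re.DOTALL)
    if match is None:
        return ["no <json>...</json> block found"]
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]

    problems = []
    if isinstance(data, dict) and root_key in data:
        items = data[root_key]
    else:
        problems.append(f"missing top-level key '{root_key}'")
        items = data
    items = items if isinstance(items, list) else []

    # Check each item for missing fields.
    for index, item in enumerate(items):
        missing = REQUIRED_FIELDS - set(item) if isinstance(item, dict) else REQUIRED_FIELDS
        if missing:
            problems.append(f"item {index} lacks {sorted(missing)}")
    return problems


raw = '''
<json>
[
  {"aspect": "环境", "sentiment_word": "可以", "sentiment": "positive"},
  {"aspect": "味道", "sentiment_word": "还不错", "sentiment": "positive"},
  {"aspect": "价格", "sentiment_word": "小贵", "sentiment": "negative"}
]
</json>'''

for problem in diagnose(raw):
    print(problem)
```

For the reply above, the helper reports the missing wrapper key and the missing `reference` field on each of the three items.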



So we add a simple example.

# Run Kor again, this time with one few-shot example
schema, validator = from_pydantic(
    Dianpin,
    description='对评价的解析结果',  # "parsed result of the review"
    examples=[
        # (input text, expected output); the input reads "Tastes great, I'll come again!"
        ('味道真不错，下次还来！', [{"aspect":"味道", "sentiment_word": "真不错", "sentiment": "positive", "reference": "味道真不错"}])
    ],
    many=True,  # allow a list of aspects
)
chain = create_extraction_chain(
    llm, schema, encoder_or_encoder_class="json", validator=validator
)

pprint(chain.run("整体来说，环境可以，味道的话也还不错，但价格有一点小贵。"))

{'data': {'dianpin': [{'aspect': '环境',
                       'reference': '整体来说，环境可以',
                       'sentiment': 'positive',
                       'sentiment_word': '可以'},
                      {'aspect': '味道',
                       'reference': '味道的话也还不错',
                       'sentiment': 'positive',
                       'sentiment_word': '还不错'},
                      {'aspect': '价格',
                       'reference': '但价格有一点小贵',
                       'sentiment': 'negative',
                       'sentiment_word': '小贵'}]},
 'errors': [],
 'raw': '\n'
        '<json>\n'
        '{\n'
        '  "dianpin": [\n'
        '    {\n'
        '      "aspect": "环境",\n'
        '      "sentiment_word": "可以",\n'
        '      "sentiment": "positive",\n'
        '      "reference": "整体来说，环境可以"\n'
        '    },\n'
        '    {\n'
        '      "aspect": "味道",\n'
        '      "sentiment_word": "还不错",\n'
        '      "sentiment": "positive",\n'
        '      "reference": "味道的话也还不错"\n'
        '    },\n'
        '    {\n'
        '      "aspect": "价格",\n'
        '      "sentiment_word": "小贵",\n'
        '      "sentiment": "negative",\n'
        '      "reference": "但价格有一点小贵"\n'
        '    }\n'
        '  ]\n'
        '}\n'
        '</json>',
 'validated_data': [Dianpin(aspect='环境', sentiment_word='可以', sentiment=<Sentiment.positive: 'positive'>, reference='整体来说，环境可以'),
                    Dianpin(aspect='味道', sentiment_word='还不错', sentiment=<Sentiment.positive: 'positive'>, reference='味道的话也还不错'),
                    Dianpin(aspect='价格', sentiment_word='小贵', sentiment=<Sentiment.negative: 'negative'>, reference='但价格有一点小贵')]}

With the example added, Example 2 runs successfully.

Let's print Kor's prompt again to see what it looks like and how the few-shot examples are used.

print(chain.prompt.format_prompt(text="[user input]").to_string())



Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

dianpin: Array<{ // 对评价的解析结果
 aspect: string // 评价属性
 sentiment_word: string // 对评价属性的评价词，从原文中抽取
 sentiment: "positive" | "negative" | "neutral" // 对评价属性的情感，positive/negative/neutral中的一个
 reference: string // 评价的原文
}>
```

Please output the extracted information in JSON format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. If the text contains attributes that do not appear in the schema, please ignore them. All output must be in JSON format and follow the schema specified above. Wrap the JSON in <json> tags.

Input: 味道真不错，下次还来！
Output: <json>{"dianpin": [{"aspect": "味道", "sentiment_word": "真不错", "sentiment": "positive", "reference": "味道真不错"}]}</json>
Input: [user input]
Output:
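As the printed prompt shows, each few-shot example becomes an Input/Output pair placed before the real input, with the expected items wrapped under the schema's root key and enclosed in `<json>` tags. A minimal sketch of that formatting step follows; `build_few_shot_section` is a hypothetical helper that imitates the printed text, not Kor's own code.

```python
import json
from typing import List, Tuple


def build_few_shot_section(examples: List[Tuple[str, list]],
                           root_key: str, user_text: str) -> str:
    """Format (input, expected items) pairs, then append the real input."""
    lines = []
    for text, items in examples:
        # Wrap the items under the root key so the model sees the full shape.
        payload = json.dumps({root_key: items}, ensure_ascii=False)
        lines.append(f"Input: {text}")
        lines.append(f"Output: <json>{payload}</json>")
    lines.append(f"Input: {user_text}")
    lines.append("Output:")
    return "\n".join(lines)


examples = [
    (
        "味道真不错，下次还来！",
        [{"aspect": "味道", "sentiment_word": "真不错",
          "sentiment": "positive", "reference": "味道真不错"}],
    )
]
section = build_few_shot_section(examples, "dianpin", "[user input]")
print(section)
```

The example output carries both the `dianpin` wrapper key and the `reference` field, the two things the zero-shot reply left out, which is a plausible reason a single example was enough to fix the parse.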

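Summary

As the first installment of the structured generation series, this article introduced Kor. Kor is prompt-based and is a thin wrapper around the LLM. Its design makes it convenient for data processing (raw data -> schema), but its biggest limitation is that it cannot guarantee the structural stability of the extracted content, a problem that guided decoding techniques are designed to solve.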

Reposted from the WeChat official account @漫谈NLP