LangChain | 一种语言模型驱动应用的开发框架

1. LangChain介绍

LangChain 是一个框架，用于开发由语言模型驱动的应用程序，它使基于AI模型工作和应用构建的复杂部分变的更容易。LangChain可以将LLMs与外部数据源链接，也允许LMMs模型间的交互。

LangChain 从两个方面帮助我们做到这一点:

– 整合，将外部数据，如本地文件、其他应用程序和api数据，输入指定LLM

– 代理，允许LLMs通过决策与它所处环境互动，使用LLMs来帮助决定下一步要采取的行动，类似RPA

LangChain 的优点：

– 组件化，LangChain 容易替换语言模型需要的抽象和组件

– 自定义链，LangChain 为使用和自定义”chain”（串在一起的一系列actions）提供了开箱即用的支持

– 更新快，LangChain 团队更新速度非常快，开发者更快体验最新的LLM功能。

– 社区支持，精彩的讨论交流区和社区支持

LangChain将构建语言模型驱动的应用程序中必要组件或过程模块化，包括提示模块(Prompts)、代理模块(Agents)、模型(Models)、记忆模块(Memory)、链(Chains)、文本处理与向量存贮与索引模块(Indexes)，极大提高开发效率。

2. LangChain模组

前置环节，设置openai key

import os

os.environ["OPENAI_API_KEY"] = "..."

2.1 Schema – LLMs输入

2.1.1 Text —— 字符串

以自然语言字符串作为LLMs输入

my_text = "What day comes after Friday?"

2.1.2 Chat Messages —— 消息文本

指定消息所属角色类别 (System, Human, AI)

– System – 告诉AI所处背景环境以及要去做的事情的背景信息

– Human – 人类用户信息

– AI – AI回答信息

from langchain.chat_models import ChatOpenAI

from langchain.schema import HumanMessage, SystemMessage, AIMessage



chat = ChatOpenAI(temperature=.7)

chat(

    [

        SystemMessage(content="You are a nice AI bot that helps a user figure out what to eat in one short sentence"),

        HumanMessage(content="I like tomatoes, what should I eat?")

    ]

)

# 输出

# AIMessage(content='You could try making a tomato salad with fresh basil and mozzarella cheese.', additional_kwargs={})

也可增加和AI的聊天记录，例如

chat(

    [

        SystemMessage(content="You are a nice AI bot that helps a user figure out where to travel in one short sentence"),

        HumanMessage(content="I like the beaches where should I go?"),

        AIMessage(content="You should go to Nice, France"),

        HumanMessage(content="What else should I do when I'm there?")

]

)

# 输出

# AIMessage(content='You can take a stroll along the Promenade des Anglais, visit the historic Castle Hill, and explore the colorful Old Town.', additional_kwargs={}

2.1.3 Documents —— 文档

以文档形式保存文本块（chunks）及其所属信息的对象

from langchain.schema import Document

Document(page_content="This is my document. It is full of text that I've gathered from other places",

         metadata={

             'my_document_id' : 234234,

             'my_document_source' : "The LangChain Papers",

             'my_document_create_time' : 1680013019

         })

# 输出

Document(page_content="This is my document. It is full of text that I've gathered from other places", metadata={'my_document_id': 234234, 'my_document_source': 'The LangChain Papers', 'my_document_create_time': 1680013019})

2.2 Models – AI模型接口

2.2.1 Language Model —— 输入输出均为文本

from langchain.llms import OpenAI

llm = OpenAI(model_name="text-ada-001")

llm("What day comes after Friday?")

# 输出: '\n\nSaturday.'

2.2.2 Chat Model —— 输入一系列文本消息，输出文本消息

from langchain.chat_models import ChatOpenAI

from langchain.schema import HumanMessage, SystemMessage, AIMessage

chat = ChatOpenAI(temperature=1)

chat(

    [

        SystemMessage(content="You are an unhelpful AI bot that makes a joke at whatever the user says"),

        HumanMessage(content="I would like to go to New York, how should I do this?")

    ]

)

# 输出

# AIMessage(content="Have you tried walking there? It's only a couple thousand miles or so.", additional_kwargs={})

2.2.3 Text Embedding Model —— 文本向量化

from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

text = "Hi! It's time for the beach"

text_embedding = embeddings.embed_query(text)

print (f"Your embedding is length {len(text_embedding)}")

print (f"Here's a sample: {text_embedding[:5]}...")

# 输出

Your embedding is length 1536

Here's a sample: [-0.00017436751566710776, -0.0031537775329516507, -0.0007205927056327557, -0.019407861884521316, -0.015138132716961442]...

2.3 Prompts – 输入模型的提示文本

2.3.1 Prompt —— 传递给底层模型的提示内容

from langchain.llms import OpenAI

llm = OpenAI(model_name="text-davinci-003")

# I like to use three double quotation marks for my prompts because it's easier to read

prompt = """

Today is Monday, tomorrow is Wednesday.

What is wrong with that statement?

"""

llm(prompt)

# 输出 

# '\nThe statement is incorrect. The day after Monday is Tuesday.'

2.3.2 Prompt Template —— 提示模板

Prompt Template是一个根据用户输入、非静态信息和固定模板字符串组合创建提示的对象，可动态传入变量。

from langchain.llms import OpenAI

from langchain import PromptTemplate

llm = OpenAI(model_name="text-davinci-003")

# Notice "location" below, that is a placeholder for another value later

template = """

I really want to travel to {location}. What should I do there?

Respond in one short sentence

"""

prompt = PromptTemplate(

    input_variables=["location"],

    template=template,

)

final_prompt = prompt.format(location='Rome')

print (f"Final Prompt: {final_prompt}")

print ("-----------")

print (f"LLM Output: {llm(final_prompt)}")

Final Prompt: 

I really want to travel to Rome. What should I do there?



Respond in one short sentence



-----------

LLM Output: Visit the Colosseum, the Trevi Fountain, St. Peter's Basilica, and the Pantheon.

2.3.3 Example Selectors —— 相似例子选择

输入一组例子，再从给定大量例子中选择相近例子，允许动态内容以传参形式写入提示内容。

from langchain.prompts.example_selector import SemanticSimilarityExampleSelector

from langchain.vectorstores import FAISS

from langchain.embeddings import OpenAIEmbeddings

from langchain.prompts import FewShotPromptTemplate, PromptTemplate

from langchain.llms import OpenAI

llm = OpenAI(model_name="text-davinci-003")

example_prompt = PromptTemplate(

    input_variables=["input", "output"],

    template="Example Input: {input}\nExample Output: {output}",

)

# Examples of locations that nouns are found

examples = [

    {"input": "pirate", "output": "ship"},

    {"input": "pilot", "output": "plane"},

    {"input": "driver", "output": "car"},

    {"input": "tree", "output": "ground"},

    {"input": "bird", "output": "nest"},

]

# SemanticSimilarityExampleSelector will select examples that are similar to your input by semantic meaning

example_selector = SemanticSimilarityExampleSelector.from_examples(

    # This is the list of examples available to select from.

    examples, 



    # This is the embedding class used to produce embeddings which are used to measure semantic similarity.

    OpenAIEmbeddings(), 



    # This is the VectorStore class that is used to store the embeddings and do a similarity search over.

    FAISS, 



    # This is the number of examples to produce.

    k=2

)

similar_prompt = FewShotPromptTemplate(

    # The object that will help select examples

    example_selector=example_selector,



    # Your prompt

    example_prompt=example_prompt,



    # Customizations that will be added to the top and bottom of your prompt

    prefix="Give the location an item is usually found in",

    suffix="Input: {noun}\nOutput:",



    # What inputs your prompt will receive

    input_variables=["noun"],

)

# Select a noun!

my_noun = "student"



print(similar_prompt.format(noun=my_noun))

llm(similar_prompt.format(noun=my_noun))

Give the location an item is usually found in

Example Input: driver

Example Output: car



Example Input: pilot

Example Output: plane



Input: student

Output:

' classroom'

2.3.4 Output Parsers —— 格式化输出对模型输出的结果进行格式化处理，用于要求输出数据结构化的场景，包含格式说明（Format Instructions）和解析器（Parser）两个概念。Format Instructions 通过描述生成prompt，告诉LLMs按需格式化模型输出；Parser 将模型输出结果结构化包装，例如将字符结果转json。以下代码指定LLM输出格式，并通过描述生成提示词，让模型对输入存在问题的文本实现格式矫正，并结输出结构化数据。

from langchain.output_parsers import StructuredOutputParser, ResponseSchema

from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate

from langchain.llms import OpenAI



llm = OpenAI(model_name="text-davinci-003")



# How you would like your reponse structured. This is basically a fancy prompt template

response_schemas = [

    ResponseSchema(name="bad_string", description="This a poorly formatted user input string"),

    ResponseSchema(name="good_string", description="This is your response, a reformatted response")

]



# How you would like to parse your output

output_parser = StructuredOutputParser.from_response_schemas(response_schemas)



# See the prompt template you created for formatting

format_instructions = output_parser.get_format_instructions()

print (format_instructions)



template = """

You will be given a poorly formatted string from a user.

Reformat it and make sure all the words are spelled correctly



{format_instructions}



% USER INPUT:

{user_input}



YOUR RESPONSE:

"""



prompt = PromptTemplate(

    input_variables=["user_input"],

    partial_variables={"format_instructions": format_instructions},

    template=template

)



promptValue = prompt.format(user_input="welcom to califonya!")



print(promptValue)



llm_output = llm(promptValue)

print(llm_output)



output_parser.parse(llm_output)

The output should be a markdown code snippet formatted in the following schema:



```json

{

  "bad_string": string  // This a poorly formatted user input string

  "good_string": string  // This is your response, a reformatted response

}

```



You will be given a poorly formatted string from a user.

Reformat it and make sure all the words are spelled correctly



The output should be a markdown code snippet formatted in the following schema:



```json

{

  "bad_string": string  // This a poorly formatted user input string

  "good_string": string  // This is your response, a reformatted response

}

```



% USER INPUT:

welcom to califonya!



YOUR RESPONSE:





```json

{

  "bad_string": "welcom to califonya!",

  "good_string": "Welcome to California!"

}

```

{'bad_string': 'welcom to califonya!', 'good_string': 'Welcome to California!'}

2.4 Indexes – 生成LLMs能处理的结构化数据索引

2.4.1 Document Loaders —— 文件加载

from langchain.document_loaders import HNLoader

loader = HNLoader("https://news.ycombinator.com/item?id=34422627")

data = loader.load()

print (f"Found {len(data)} comments")

print (f"Here's a sample:\n\n{''.join([x.page_content[:150] for x in data[:2]])}")

2.4.2 Text Splitters —— 长文本分块

from langchain.text_splitter import RecursiveCharacterTextSplitter



# This is a long document we can split up.

with open('data/interview.txt') as f:

    pg_work = f.read()



print (f"You have {len([pg_work])} document")



text_splitter = RecursiveCharacterTextSplitter(

    # Set a really small chunk size, just to show.

    chunk_size = 150,

    chunk_overlap  = 20,

)



texts = text_splitter.create_documents([pg_work])



print (f"You have {len(texts)} documents")

2.4.3 Retrievers —— 检索推荐

from langchain.document_loaders import TextLoader

from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.vectorstores import FAISS

from langchain.embeddings import OpenAIEmbeddings



loader = TextLoader('data/interview.txt')

documents = loader.load()



# Get your splitter ready

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)



# Split your docs into texts

texts = text_splitter.split_documents(documents)



# Get embedding engine ready

embeddings = OpenAIEmbeddings()



# Embedd your texts

db = FAISS.from_documents(texts, embeddings)



# Init your retriever. Asking for just 1 document back

retriever = db.as_retriever()



docs = retriever.get_relevant_documents("what types of things did the author want to build?")



print("\n\n".join([x.page_content[:200] for x in docs[:1]]))

2.4.4 VectorStores —— 向量存贮

数据向量化存贮数据库，如Pinecome、Weaviate等，Chroma、Faiss易于本地使用。存贮内部包含Embedding和Metadata如下

from langchain.document_loaders import TextLoader

from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.vectorstores import FAISS

from langchain.embeddings import OpenAIEmbeddings



loader = TextLoader('data/interview.txt')

documents = loader.load()

# Get your splitter ready

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)

# Split your docs into texts

texts = text_splitter.split_documents(documents)

# Get embedding engine ready

embeddings = OpenAIEmbeddings()

embedding_list = embeddings.embed_documents([text.page_content for text in texts])

print (f"You have {len(embedding_list)} embeddings")

print (f"Here's a sample of one: {embedding_list[0][:3]}...")

You have 29 embeddings

Here's a sample of one: [0.02098408779071363, -0.00444188727815012, 0.029791279689326114]...

2.5 Memory – 记忆注入

给模型注入历史信息，帮助他回忆有用信息，这里对记忆并没有严格定义，可以是过去的对话信息，也可以是复杂的信息检索，应用于聊天机器人，分长期和短期记忆，以便确定适合自己场景的类型。ChatMessageHistory类负责记住所有以前的聊天交互。然后，这些可以直接传递回模型，以某种方式总结，或某种组合。

from langchain.memory import ChatMessageHistory

from langchain.chat_models import ChatOpenAI

chat = ChatOpenAI(temperature=0)

history = ChatMessageHistory()

history.add_ai_message("hi!")

history.add_user_message("what is the capital of france?")

ai_response = chat(history.messages)

history.add_ai_message(ai_response.content)

history.messages

[AIMessage(content='hi!', additional_kwargs={}),

 HumanMessage(content='what is the capital of france?', additional_kwargs={}),

 AIMessage(content='The capital of France is Paris.', additional_kwargs={})]

2.6 Chains – 链

Chains 涉及的概念和实际用途均很广，可抽象理解为对模组及其行为的序列化封装，实现特定功能，可以直接使用LangChain封装的程式，也可以自定义程式。Chains可用于输入文本的处理转化TransformChain；可用于处理LLMs模型间参数依赖，A模型输出作为B模型的输入，模型间涉及到多个输入输出时，分为SimpleSequentialChain 和 Sequential Chain；可用于对输入文档端到端的程序化处理AnalyzeDocumentChain，自动实现输入文档-文档拆分-文档总结；可用于图谱的检索与查询GraphQAChain；Chain可被序列化save到disk，再通过load_chain加载。

2.6.1 Simple Sequential Chains —— 简单有序链

Chains=[A, B]，其中A模型输出直接作为B模型的输入，如下

from langchain.llms import OpenAI

from langchain.chains import LLMChain

from langchain.prompts import PromptTemplate

from langchain.chains import SimpleSequentialChain



llm = OpenAI(temperature=1)



template = """Your job is to come up with a classic dish from the area that the users suggests.

% USER LOCATION

{user_location}



YOUR RESPONSE:

"""

prompt_template = PromptTemplate(input_variables=["user_location"], template=template)



# Holds my 'location' chain

location_chain = LLMChain(llm=llm, prompt=prompt_template)



template = """Given a meal, give a short and simple recipe on how to make that dish at home.

% MEAL

{user_meal}



YOUR RESPONSE:

"""

prompt_template = PromptTemplate(input_variables=["user_meal"], template=template)



# Holds my 'meal' chain

meal_chain = LLMChain(llm=llm, prompt=prompt_template)



overall_chain = SimpleSequentialChain(chains=[location_chain, meal_chain], verbose=True)



review = overall_chain.run("Rome")

> Entering new SimpleSequentialChain chain...

Spaghetti Carbonara, a classic Roman dish made of spaghetti, guanciale (Italian type of bacon), eggs, parmesan cheese, and black pepper.



Ingredients: 

- 8 ounces of Spaghetti

- 6 ounces Guanciale (Italian bacon), diced

- 3 whole Eggs

- 2/3 cup Parmesan cheese, grated

- 2 tablespoons Fresh Ground Black Pepper



Instructions: 

1. Bring a large pot of salted water to a boil, then add the spaghetti and cook according to the instructions on the packaging.

2. Meanwhile, cook the guanciale in a large skillet over medium-high heat, stirring occasionally until crispy.

3. In a medium bowl, whisk together the eggs, Parmesan cheese, and black pepper until combined.

4. Once the spaghetti is cooked, reserve 1/2 cup of the cooking liquid, then drain the spaghetti and add it to the skillet.

5. Pour the egg mixture over the spaghetti and guanciale and toss to combine.

6. Add the reserved cooking liquid to create a creamy sauce.

7. Serve and enjoy!



> Finished chain.

2.6.2 Summarization Chain —— 总结链

使用load_summarize_chain，非常容易对长文档内容进行总结

from langchain.chains.summarize import load_summarize_chain

from langchain.document_loaders import TextLoader

from langchain.text_splitter import RecursiveCharacterTextSplitter



llm = OpenAI(temperature=1)



loader = TextLoader('data/interview.txt')

documents = loader.load()



# Get your splitter ready

text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=50)



# Split your docs into texts

texts = text_splitter.split_documents(documents)



# There is a lot of complexity hidden in this one line. 

chain = load_summarize_chain(llm, chain_type="map_reduce", verbose=True)

chain.run(texts)

2.7 Agents – 代理

Agents官方定义

“Some applications will require not just a predetermined chain of calls to LLMs/other tools, but potentially an unknown chain that depends on the user’s input. In these types of chains, there is a “agent” which has access to a suite of tools. Depending on the user input, the agent can then decide which, if any, of these tools to call. ”

Agents使用LLM决定执行哪些动作以及这些动作按照什么顺序执行，动作可以是使用某个工具，也可以是返回响应结果给用户，这也说明了，LLM不仅用于文本输出，还可用于决策制定，即LLMs可视为很好的推理引擎，这是值得大家注意的。（Agents可以作为RPA技术的替代方案）

2.7.1 Agents —— 模型装饰器

“An Agent is a wrapper around a model, which takes in user input and returns a response corresponding to an “action” to take and a corresponding “action input”. “，Agent 接受用户输入、指定动作，动作输入及响应等，其类型如下：

– zero-shot-react-description

– react-docstore

– self-ask-with-search

– conversational-react-description

2.7.2 Tools —— 代理可使用的工具

代理用于和外界交互的工具，包括自定义工具和常见通用工具，通用工具包含Bash（执行sh命令）、Bing Search、Google Search、ChatGPT Plugins、Python REPL（执行python命令）、Wikipedis API、WolframAlpha（计算知识引擎，支持强大的数学计算功能，如因式分解、向量运算、求导、可视化、积分微分、解方程，还涉及物理、化学、生物、人文、金融等领域）、Requests、OpenWeatherMap API（全球各城市的气象数据，包括4天小时级、16天日级、30天预测等）、Zapier Natural Language Actions API（支持5k+ apps 和 20k+ 基于自然语言理解的动作执行）、IFTTT WebHooks、Human as a tool（人机交互）、SearxNG Search API（自建搜索引擎及与web交互）、Apify（爬虫与数据提取云平台）、SerpAPI（谷歌搜索API）

以wolfram解方程为例子

import os

os.environ["WOLFRAM_ALPHA_APPID"] = ""

from langchain.utilities.wolfram_alpha import WolframAlphaAPIWrapper

wolfram = WolframAlphaAPIWrapper()

wolfram.run("What is 2x+5 = -3x + 7?")

# 输出 'x = 2/5'

2.7.3 Toolkit —— 解决特定问题的工具箱

CSV Agent（读取CSV，并支持调用python命令，对CSV文件进行分析，例如行数、列数、数据关联等）；JSON Agent（读取JSON，解析并查询该json，适用于大的json文件）；OpenAPI agents（构造代理来消费任意的api，这里的api符合OpenAPI/Swagger规范）；Natural Language APIs（自然语言 API 工具包（NLAToolkits）允许 LangChain 代理在端点之间高效地进行计划和组合调用）

；Pandas Dataframe Agent（和pandas交互）；Python Agent（编写python代码，执行python脚本）；SQL Database Agent（编写sql语句，执行sql）；Vectorstore Agent（支持对一个或多个向量存贮源文件的检索）。。。

from langchain.agents import create_csv_agent

agent = create_csv_agent(OpenAI(temperature=0), 'titanic.csv', verbose=True)

agent.run("how many rows are there?")

> Entering new AgentExecutor chain...

Thought: I need to count the number of rows

Action: python_repl_ast

Action Input: len(df)

Observation: 891

Thought: I now know the final answer

Final Answer: There are 891 rows in the dataframe.



> Finished chain.

# 最终结果：'There are 891 rows in the dataframe.'

2.7.4 Agent Executors —— 执行与推理

代理执行者使用代理和工具，并使用代理来决定调用哪些工具以及以什么顺序调用，用serpapi找出成龙第一部参演电影中的角色特点，serapi对英文检索效果更好，因此翻译为“What character did Jackie Chan play in his first film?”

from langchain.agents import load_tools

from langchain.agents import initialize_agent

from langchain.llms import OpenAI

import json

serpapi_api_key = "....."

llm = OpenAI(temperature=0)

toolkit = load_tools(["serpapi"], llm=llm, serpapi_api_key=serpapi_api_key)

agent = initialize_agent(toolkit, llm, agent="zero-shot-react-description", verbose=True, return_intermediate_steps=True)

response = agent({"input":"What character did Jackie Chan play in his first film?"})

print(json.dumps(response["intermediate_steps"], indent=2))

代理推理过程，先找到成龙影视作品集—>成龙第一部电影——>角色特点

> Entering new AgentExecutor chain...

 I should research Jackie Chan's filmography

Action: Search

Action Input: "Jackie Chan filmography"

Observation: Jackie Chan SBS MBE PMW is a Hong Kong actor, filmmaker, martial artist, and stuntman known for his slapstick acrobatic fighting style, comic timing, and innovative stunts, which he typically performs himself. Chan has been acting since the 1960s, performing in more than 150 films.

Thought: I should look for his first film

Action: Search

Action Input: "Jackie Chan first film"

Observation: Jackie Chan SBS MBE PMW is a Hong Kong actor, filmmaker, martial artist, and stuntman known for his slapstick acrobatic fighting style, comic timing, and innovative stunts, which he typically performs himself. Chan has been acting since the 1960s, performing in more than 150 films.

Thought: I should look for the character he played in his first film

Action: Search

Action Input: "Jackie Chan first film character"

Observation: Jackie Chan began his film career as an extra child actor in the 1962 film Big and Little Wong Tin Bar. Ten years later, he was a stuntman opposite Bruce Lee in 1972's Fist of Fury and 1973's Enter the Dragon.

Thought: I now know the final answer

Final Answer: Jackie Chan's first film was Big and Little Wong Tin Bar, and he played an extra child actor.



> Finished chain.

[

  [

    [

      "Search",

      "Jackie Chan filmography",

      " I should research Jackie Chan's filmography\nAction: Search\nAction Input: \"Jackie Chan filmography\""

    ],

    "Jackie Chan SBS MBE PMW is a Hong Kong actor, filmmaker, martial artist, and stuntman known for his slapstick acrobatic fighting style, comic timing, and innovative stunts, which he typically performs himself. Chan has been acting since the 1960s, performing in more than 150 films."

  ],

  [

    [

      "Search",

      "Jackie Chan first film",

      " I should look for his first film\nAction: Search\nAction Input: \"Jackie Chan first film\""

    ],

    "Jackie Chan SBS MBE PMW is a Hong Kong actor, filmmaker, martial artist, and stuntman known for his slapstick acrobatic fighting style, comic timing, and innovative stunts, which he typically performs himself. Chan has been acting since the 1960s, performing in more than 150 films."

  ],

  [

    [

      "Search",

      "Jackie Chan first film character",

      " I should look for the character he played in his first film\nAction: Search\nAction Input: \"Jackie Chan first film character\""

    ],

    "Jackie Chan began his film career as an extra child actor in the 1962 film Big and Little Wong Tin Bar. Ten years later, he was a stuntman opposite Bruce Lee in 1972's Fist of Fury and 1973's Enter the Dragon."

  ]

]