
Define the schema for the data you want to extract, and create an extraction chain with LangChain.
from langchain.chains import create_extraction_chain
from langchain.chat_models import ChatOpenAI
# Schema
schema = {
    "properties": {
        "company": {"type": "string"},
        "offering": {"type": "string"},
        "advantage": {"type": "string"},
        "products_and_services": {"type": "string"},
        "additional_details": {"type": "string"},
    }
}
Define a few test samples:
# Inputs
in1 = """Sweet Delights Bakery introduced lavender-infused vanilla cupcakes with a honey buttercream frosting, using the "Frosting-Spreader-3000". This innovation could inspire our next cupcake creation"""
in2 = """Whisked Away Cupcakes introduced a dessert subscription service, ensuring regular customers receive fresh batches of various sweets. Exploring a similar subscription model using the "SweetSubs" program could boost customer loyalty."""
in3 = """At Velvet Frosting Cupcakes, our team learned about the unveiling of a seasonal pastry menu that changes monthly. Introducing a rotating seasonal menu at our bakery using the "SeasonalJoy" subscription platform and adding a special touch to our cookies with the "FloralStamp" cookie stamper could keep our offerings fresh and exciting for customers."""
inputs = [in1, in2, in3]
Create the chain:
# Run chain
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
chain = create_extraction_chain(schema, llm)
Run the chain:
for input in inputs:
    print(chain.run(input))
The outputs come back as Python lists of dictionaries:
[{'company': 'Sweet Delights Bakery', 'offering': 'lavender-infused vanilla cupcakes', 'advantage': 'inspiring next cupcake creation', 'products_and_services': 'Frosting-Spreader-3000'}]
[{'company': 'Whisked Away Cupcakes', 'offering': 'dessert subscription service', 'advantage': 'ensuring regular customers receive fresh batches of various sweets', 'products_and_services': '', 'additional_details': ''}, {'company': '', 'offering': 'subscription model using the "SweetSubs" program', 'advantage': 'boost customer loyalty', 'products_and_services': '', 'additional_details': ''}]
[{'company': 'Velvet Frosting Cupcakes', 'offering': 'rotating seasonal menu', 'advantage': 'fresh and exciting offerings', 'products_and_services': 'SeasonalJoy subscription platform, FloralStamp cookie stamper'}]
Next, import a CSV of competitive intelligence, run each entry through the extraction chain to parse and structure it, and merge the parsed information back into the original dataset. The Python code below does exactly that:
import numpy as np
import pandas as pd
from langchain.chains import create_extraction_chain
from langchain.chat_models import ChatOpenAI
# Load in the data.csv (semicolon separated) file
df = pd.read_csv("data.csv", sep=';')
# Define Schema based on your data
schema = {
    "properties": {
        "company": {"type": "string"},
        "offering": {"type": "string"},
        "advantage": {"type": "string"},
        "products_and_services": {"type": "string"},
        "additional_details": {"type": "string"},
    }
}
# Create extraction chain
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
chain = create_extraction_chain(schema, llm)
# ----------
# Add the data to a data frame
# ----------
# Extract information and create a DataFrame from the list of dictionaries
extracted_data = df['INTEL'].apply(lambda x: chain.run(x)[0]).apply(pd.Series)
# Replace missing values with NaN
extracted_data.replace('', np.nan, inplace=True)
# Concatenate the extracted_data DataFrame with the original df
df = pd.concat([df, extracted_data], axis=1)
# Display the data frame
df.head()
This run took about 15 seconds, yet it still did not find all of the information we asked for; a quick way to see which fields came back empty is sketched below. After that, let's try a different approach.
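The following check is a minimal sketch (not part of the original workflow) that assumes the df and extracted_data frames built above; it counts the empty fields per column and shows which INTEL entries were parsed incompletely.

# Sketch: count how many rows are missing each extracted field
# (assumes the `df` and `extracted_data` frames created above).
print(extracted_data.isna().sum())

# Show the original INTEL text for rows where any field came back empty.
print(df.loc[extracted_data.isna().any(axis=1), 'INTEL'])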
In the code below, Pydantic is used to define data models that describe the structure of the competitive intelligence information. Pydantic is a data validation and parsing library for Python that lets you define simple or complex data structures using Python type hints. Here, two Pydantic models (Competitor and Company) define the structure of the competitive intelligence data.
import pandas as pd
from typing import Optional, Sequence
from langchain.llms import OpenAI
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from pydantic import BaseModel
# Load data from CSV
df = pd.read_csv("data.csv", sep=';')
# Pydantic models for competitive intelligence
class Competitor(BaseModel):
    company: str
    offering: str
    advantage: str
    products_and_services: str
    additional_details: str
class Company(BaseModel):
    """Identifying information about all competitive intelligence in a text."""
    company: Sequence[Competitor]
# Set up a Pydantic parser and prompt template
parser = PydanticOutputParser(pydantic_object=Company)
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
# Function to process each row and extract information
def process_row(row):
    _input = prompt.format_prompt(query=row['INTEL'])
    model = OpenAI(temperature=0)
    output = model(_input.to_string())
    result = parser.parse(output)

    # Convert Pydantic result to a dictionary
    competitor_data = result.model_dump()

    # Flatten the nested structure for DataFrame creation
    flat_data = {'INTEL': [], 'company': [], 'offering': [], 'advantage': [],
                 'products_and_services': [], 'additional_details': []}

    for entry in competitor_data['company']:
        flat_data['INTEL'].append(row['INTEL'])
        flat_data['company'].append(entry['company'])
        flat_data['offering'].append(entry['offering'])
        flat_data['advantage'].append(entry['advantage'])
        flat_data['products_and_services'].append(entry['products_and_services'])
        flat_data['additional_details'].append(entry['additional_details'])

    # Create a DataFrame from the flattened data
    df_cake = pd.DataFrame(flat_data)

    return df_cake
# Apply the function to each row and concatenate the results
intel_df = pd.concat(df.apply(process_row, axis=1).tolist(), ignore_index=True)
# Display the resulting DataFrame
intel_df.head()
This is fast! And unlike create_extraction_chain, this approach found the details for every entry.
Summary of Part 1:
PydanticOutputParser turned out to be faster and more reliable: each run took about 1 second and 400 tokens, versus roughly 2.5 seconds and 250 tokens per run for create_extraction_chain.
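For reference, this kind of timing and token comparison can be reproduced with LangChain's get_openai_callback usage tracker. The sketch below is not the original benchmark code; it assumes the chain, prompt, and parser objects defined earlier and times a single extraction per approach.

import time
from langchain.callbacks import get_openai_callback
from langchain.llms import OpenAI

# Sketch: time one extraction per approach and report OpenAI token usage.
# Assumes `chain` (create_extraction_chain), `prompt`, and `parser` from above.
def benchmark_extraction_chain(text):
    start = time.time()
    with get_openai_callback() as cb:
        result = chain.run(text)
    print(f"create_extraction_chain: {time.time() - start:.2f}s, {cb.total_tokens} tokens")
    return result

def benchmark_pydantic_parser(text):
    model = OpenAI(temperature=0)
    start = time.time()
    with get_openai_callback() as cb:
        output = model(prompt.format_prompt(query=text).to_string())
        result = parser.parse(output)
    print(f"PydanticOutputParser: {time.time() - start:.2f}s, {cb.total_tokens} tokens")
    return result

Calling benchmark_extraction_chain(in1) and benchmark_pydantic_parser(in1) on the same sample makes the per-run difference easy to compare.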
We have now managed to extract structured data from unstructured text! Part 2 focuses on analyzing this structured data with a LangChain agent.
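The CSV agent in Part 2 reads the structured results from disk. One simple handoff, sketched here under the assumption that intel_df from Part 1 is still in memory, is to write it out to the data/intel.csv path that the agent loads below:

# Persist the structured results from Part 1 so the CSV agent can load them.
# "data/intel.csv" is the same path passed to create_csv_agent below.
intel_df.to_csv("data/intel.csv", index=False)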
In LangChain, an agent is a system that uses a language model to choose a sequence of actions to take. Unlike a chain, where the actions are hard-coded, an agent uses the language model as a "reasoning engine" to decide which actions to take and in what order.
Now let's use LangChain's CSV agent to analyze our structured data.
First, load the necessary libraries:
from langchain.agents.agent_types import AgentType
from langchain_community.llms import OpenAI
from langchain_experimental.agents.agent_toolkits import create_csv_agent
Create the agent:
agent = create_csv_agent(
    OpenAI(temperature=0),
    "data/intel.csv",
    verbose=True,
    agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
)
Now we can test our agent with a few questions.
When you ask the LangChain agent a question, you can watch it reason about its actions (because verbose=True is set).
Ask a general question:
agent.run("What insights can I get from this data?")
‘This dataframe contains information about different companies and their products/services, as well as additional details and potential opportunities for improvement.’
Ask about competitors' advantages:
agent.run("What are 3 specific areas of focus that you can obtain through analyzing the advantages offered by the competition?")
‘Three specific areas of focus that can be obtained through analyzing the advantages offered by the competition are: streamlining production processes, incorporating unique and distinctive flavors, and using sustainable and high-quality ingredients.’
Ask about the key themes among competitors:
agent.run("What are some key themes that the competitors represented in the data are focusing on providing? Be specific with examples, and talk about the advantages of these")
‘The key themes that the competitors are focusing on providing are efficiency, unique flavors, and high-quality ingredients. For example, Coco candy co is using the 77Tyrbo Choco machine to coat their candy gummies, which streamlines the process and saves time. Cinnamon Bliss Bakery adds a secret touch of cinnamon in their chocolate brownies with the CinnaMagic ingredient, which adds a distinctive flavor. Choco Haven factory uses organic and locally sourced ingredients, including the EcoCocoa brand, to elevate the quality of their chocolates.’
This article is reposted from the WeChat official account @ArronAI.