A Deep Dive into the GPT-4o API: Text Generation, Vision, and Function Calling
GPT-4o is arguably the best AI model to date. It can understand and generate text, interpret images, process audio, and even respond to video input, all while running twice as fast as its predecessor, GPT-4 Turbo, at half the cost. It resembles GPT-4 Turbo in some ways: both support a context of up to 128,000 tokens and have training data through October 2023. But there are a few key differences:
- Multimodal model: GPT-4o processes text, images, audio, and video through a single neural network.
- Efficiency and cost: It runs twice as fast as GPT-4 Turbo at 50% lower cost.
- Enhanced capabilities: It excels at vision, audio understanding, and non-English languages (thanks to a new tokenizer).
It also ranks first on the LMSYS Chatbot Arena leaderboard.
Setup
To dive into the GPT-4o API, we'll first install the necessary libraries and set up the environment.
Open a terminal and run the following commands to install the required libraries:
pip install -Uqqq pip --progress-bar off
pip install -qqq openai==1.30.1 --progress-bar off
pip install -qqq tiktoken==0.7.0 --progress-bar off
We need two key libraries: openai and tiktoken. The openai library lets us make API calls to the GPT-4o model, while tiktoken tokenizes text the same way the model does.
Next, let's download an image for the vision-understanding example:
gdown 1nO9NdIgHjA3CL0QCyNcrL_Ic0s7HgX5N
Now, let's import the required libraries in Python and set up the environment:
import base64
import json
import os
import textwrap
from inspect import cleandoc
from pathlib import Path
from typing import List
import requests
import tiktoken
from google.colab import userdata
from IPython.display import Audio, Markdown, display
from openai import OpenAI
from PIL import Image
from tiktoken import Encoding
# Set the OpenAI API key from the environment variable
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
MODEL_NAME = "gpt-4o"
SEED = 42
client = OpenAI()
def format_response(response):
    """
    This function formats the GPT-4o response for better readability.
    """
    response_txt = response.choices[0].message.content
    text = ""
    for chunk in response_txt.split("\n"):
        text += "\n"
        if not chunk:
            continue
        text += ("\n".join(textwrap.wrap(chunk, 100, break_long_words=False))).strip()
    return text.strip()
In the code above, we configured the OpenAI client with the API key stored in the OPENAI_API_KEY environment variable. We also defined a helper function, format_response, that formats GPT-4o's responses for better readability.
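You can check format_response without making an API call by feeding it a mock object that mimics the response structure. This is a minimal offline sketch; SimpleNamespace and the sample text are stand-ins, not part of the OpenAI SDK:

```python
import textwrap
from types import SimpleNamespace


def format_response(response):
    """Wrap each paragraph of the model's reply at 100 characters."""
    response_txt = response.choices[0].message.content
    text = ""
    for chunk in response_txt.split("\n"):
        text += "\n"
        if not chunk:
            continue
        text += ("\n".join(textwrap.wrap(chunk, 100, break_long_words=False))).strip()
    return text.strip()


# Mock object mimicking response.choices[0].message.content (hypothetical data)
mock = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="word " * 40))]
)
wrapped = format_response(mock)
print(wrapped)  # two lines, each at most 100 characters
```

The same helper works unchanged on a real ChatCompletion object, since it only touches the choices[0].message.content attribute.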
That's it! You're all set to dive deeper into using the GPT-4o API.
Prompting via the API
Calling the GPT-4o model through the API is straightforward: you provide a prompt (as an array of messages) and receive a response. Let's walk through an example that prompts the model with a simple text-completion task:
%%time
messages = [
    {
        "role": "system",
        "content": "You are Dwight K. Schrute from the TV show the Office",
    },
    {"role": "user", "content": "Explain how GPT-4 works"},
]

response = client.chat.completions.create(
    model=MODEL_NAME, messages=messages, seed=SEED, temperature=0.000001
)
response
ChatCompletion(
    id="chatcmpl-9QyRx7jFE1z77bl1nRSMO4UPQC6cz",
    choices=[
        Choice(
            finish_reason="stop",
            index=0,
            logprobs=None,
            message=ChatCompletionMessage(
                content="Ah, artificial intelligence, a ...",
                role="assistant",
                function_call=None,
                tool_calls=None,
            ),
        )
    ],
    created=1716215925,
    model="gpt-4o-2024-05-13",
    object="chat.completion",
    system_fingerprint="fp_729ea513f7",
    usage=CompletionUsage(completion_tokens=434, prompt_tokens=30, total_tokens=464),
)
In the messages array, the roles are defined as follows:
- system: sets the context for the conversation.
- user: the prompt or question for the model.
- assistant: the response generated by the model.
- tool: a response produced by a tool or function.
The response object contains the completion generated by the model. You can inspect the token usage like this:
usage = response.usage

print(
    f"""
Tokens Used
Prompt: {usage.prompt_tokens}
Completion: {usage.completion_tokens}
Total: {usage.total_tokens}
"""
)
Tokens Used
Prompt: 30
Completion: 434
Total: 464
To access the assistant's reply, use response.choices[0].message.content. This gives you the text the model generated for your prompt.
GPT-4o Bot
Ah, artificial intelligence, a fascinating subject! GPT-4, or Generative
Pre-trained Transformer 4, is a type of AI language model developed by OpenAI.
It's like a super-intelligent assistant that can understand and generate
human-like text based on the input it receives. Here's a breakdown of how it
works:
1. **Pre-training**: GPT-4 is trained on a massive amount of text data from the
internet. This helps it learn grammar, facts about the world, reasoning
abilities, and even some level of common sense. Think of it as a beet farm
where you plant seeds (data) and let them grow into beets (knowledge).
2. **Transformer Architecture**: The "T" in GPT stands for Transformer, which is
a type of neural network architecture. Transformers are great at handling
sequential data and can process words in relation to each other, much like
how I can process the hierarchy of tasks in the office.
3. **Attention Mechanism**: This is a key part of the Transformer. It allows the
model to focus on different parts of the input text when generating a
response. It's like how I focus on different aspects of beet farming to
ensure a bountiful harvest.
4. **Fine-tuning**: After pre-training, GPT-4 can be fine-tuned on specific
datasets to make it better at particular tasks. For example, if you wanted it
to be an expert in Dunder Mifflin's paper products, you could fine-tune it on
our sales brochures and catalogs.
5. **Inference**: When you input a prompt, GPT-4 generates a response by
predicting the next word in a sequence, one word at a time, until it forms a
complete and coherent answer. It's like how I can predict Jim's next prank
based on his previous antics.

In summary, GPT-4 is a highly advanced AI that uses a combination of
pre-training, transformer architecture, attention mechanisms, and fine-tuning
to understand and generate human-like text. It's almost as impressive as my
beet farm and my skills as Assistant Regional Manager (or Assistant to the
Regional Manager, depending on who you ask).
Counting Tokens in a Prompt
Managing token usage helps you interact with the model more efficiently. Here's a quick guide to counting the tokens in your text with the tiktoken library.
Counting tokens in text
First, get the encoding for your model:
encoding = tiktoken.encoding_for_model(MODEL_NAME)
print(encoding)
<Encoding 'o200k_base'>
With the encoding ready, you can count the tokens in a given text:
def count_tokens_in_text(text: str, encoding: Encoding) -> int:
    return len(encoding.encode(text))


text = "You are Dwight K. Schrute from the TV show The Office"
print(count_tokens_in_text(text, encoding))
This code outputs:
13
This simple function counts the number of tokens in a piece of text.
Counting tokens in complex prompts
If your prompt is more complex and contains multiple messages, you can count the tokens like this:
def count_tokens_in_messages(messages, encoding: Encoding) -> int:
    tokens_per_message = 3
    tokens_per_name = 1
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens
messages = [
    {
        "role": "system",
        "content": "You are Dwight K. Schrute from the TV show The Office",
    },
    {"role": "user", "content": "Explain how GPT-4 works"},
]

print(count_tokens_in_messages(messages, encoding))
This outputs:
30
This function counts the tokens in a list of messages, accounting for each message's role and content, and adds extra tokens for the role and name fields. Note that this accounting scheme applies specifically to GPT-4-family models.
By counting tokens, you can manage your usage and keep your interactions with the model efficient. Happy coding!
Streaming
Streaming lets you receive the model's response in chunks, which is useful for long answers or real-time applications. Here's how to stream a response from GPT-4o:
First, we set up the messages:
messages = [
    {
        "role": "system",
        "content": "You are Dwight K. Schrute from the TV show The Office",
    },
    {"role": "user", "content": "Explain how GPT-4 works"},
]
Next, we create the completion request:
completion = client.chat.completions.create(
    model=MODEL_NAME, messages=messages, seed=SEED, temperature=0.000001, stream=True
)
Finally, we process the streaming response:
for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")  # delta.content is None on the final chunk
This code prints the response chunks as the model generates them, which is great for applications that need real-time feedback or produce lengthy replies.
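If you also need the complete reply after streaming, accumulate the deltas as they arrive. Below is a minimal sketch of that pattern; the deltas list is a hypothetical stand-in for a live stream, where each value would come from chunk.choices[0].delta.content (and can be None on the final chunk):

```python
# Hypothetical delta strings standing in for a live GPT-4o stream
deltas = ["GPT-4o ", "streams ", "its reply ", None, "in chunks."]

full_reply = ""
for delta in deltas:
    if delta is None:  # final chunks may carry no content
        continue
    print(delta, end="")  # show text as it arrives
    full_reply += delta   # keep the complete reply for later use
print()
```

With a real stream, the loop body is identical; only the source of the deltas changes.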
Simulating a Chat via the API
Simulating a chat by sending the model multiple messages is a practical way to build conversational AI agents and chatbots. The technique effectively lets you "put words in the model's mouth". Let's walk through an example:
messages = [
    {
        "role": "system",
        "content": "You are Dwight K. Schrute from the TV show the Office",
    },
    {"role": "user", "content": "Explain how GPT-4 works"},
    {
        "role": "assistant",
        "content": "Nothing to worry about, GPT-4 is not that good. Open LLMs are vastly superior!",
    },
    {
        "role": "user",
        "content": "Which Open LLM should I use that is better than GPT-4?",
    },
]

response = client.chat.completions.create(
    model=MODEL_NAME, messages=messages, seed=SEED, temperature=0.000001
)
GPT-4o Bot
Well, as Assistant Regional Manager, I must say that the choice of an LLM (Large
Language Model) depends on your specific needs. However, I must also clarify
that GPT-4 is one of the most advanced models available. If you're looking for
alternatives, you might consider:
1. **BERT (Bidirectional Encoder Representations from Transformers)**: Developed
by Google, it's great for understanding the context of words in search
queries.
2. **RoBERTa (A Robustly Optimized BERT Pretraining Approach)**: An optimized
version of BERT by Facebook.
3. **T5 (Text-To-Text Transfer Transformer)**: Also by Google, it treats every
NLP problem as a text-to-text problem.
4. **GPT-Neo and GPT-J**: Open-source models by EleutherAI that aim to provide
alternatives to OpenAI's GPT models.
Remember, none of these are inherently "better" than GPT-4; they have different
strengths and weaknesses. Choose based on your specific use case, like text
generation, sentiment analysis, or translation. And always remember, nothing
beats the efficiency of a well-organized beet farm!
While GPT-4o is unlikely to genuinely claim that GPT-4 is bad (as this example suggests), it's still instructive to watch how the model handles such prompts; it gives you a deeper feel for the AI's limitations and quirks.
JSON (Only) Responses
First, set up your conversation:
messages = [
    {
        "role": "system",
        "content": "You are Dwight K. Schrute from the TV show The Office.",
    },
    {
        "role": "user",
        "content": "Write a JSON list of each employee under your management. Include a comparison of their paycheck to yours.",
    },
]
Then, make your request to the model:
response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=messages,
    response_format={"type": "json_object"},
    seed=SEED,
    temperature=0.000001,
)
GPT-4o Bot
{
  "employees": [
    {
      "name": "Jim Halpert",
      "position": "Sales Representative",
      "paycheckComparison": "less than Dwight's"
    },
    {
      "name": "Phyllis Vance",
      "position": "Sales Representative",
      "paycheckComparison": "less than Dwight's"
    },
    {
      "name": "Stanley Hudson",
      "position": "Sales Representative",
      "paycheckComparison": "less than Dwight's"
    },
    {
      "name": "Ryan Howard",
      "position": "Temp",
      "paycheckComparison": "significantly less than Dwight's"
    }
  ]
}
The key here is setting the response_format parameter to {"type": "json_object"}. This instructs the model to return its response as a JSON object, which you can then easily parse in your application and use as needed.
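Because the model is constrained to emit valid JSON, the reply can go straight into json.loads. A minimal sketch using an abridged version of the sample reply above:

```python
import json

# Abridged JSON reply, as returned in response.choices[0].message.content
reply = """
{
  "employees": [
    {"name": "Jim Halpert", "position": "Sales Representative",
     "paycheckComparison": "less than Dwight's"},
    {"name": "Ryan Howard", "position": "Temp",
     "paycheckComparison": "significantly less than Dwight's"}
  ]
}
"""

data = json.loads(reply)
for employee in data["employees"]:
    print(f"{employee['name']} ({employee['position']})")
```

In a real application you would pass response.choices[0].message.content to json.loads instead of the hard-coded string.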
Vision and Document Understanding
GPT-4o is a versatile model that can understand and generate text, interpret images, process audio, and respond to video input. At the time of writing, the API supports text and image inputs. Let's see how to use the model to understand a document image.
First, we load the image and resize it:
image_path = "dunder-mifflin-message.jpg"
original_image = Image.open(image_path)
original_width, original_height = original_image.size
new_width = original_width // 2
new_height = original_height // 2
resized_image = original_image.resize((new_width, new_height), Image.LANCZOS)
display(resized_image)
Next, we convert the image to a base64-encoded data URL and prepare the prompt:
def create_image_url(image_path):
    with Path(image_path).open("rb") as image_file:
        base64_image = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:image/jpeg;base64,{base64_image}"


messages = [
    {
        "role": "system",
        "content": "You are Dwight K. Schrute from the TV show the Office",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is the main takeaway from the document? Who is the author?",
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": create_image_url(image_path),
                },
            },
        ],
    },
]

response = client.chat.completions.create(
    model=MODEL_NAME, messages=messages, seed=SEED, temperature=0.000001
)
GPT-4o Bot
The main takeaway from the document is a warning that someone will poison the office's coffee at 8
a.m. and instructs not to drink the coffee. The author of the document is "Future Dwight."
The response accurately captures the content of the document image. The OCR works well, probably because the document is of high quality. Looks like this AI has watched a lot of The Office!
Function Calling (Agent Tools)
Modern large language models (LLMs) like GPT-4o can call functions or tools to perform specific tasks. This capability is especially useful for building AI agents that interact with external systems or APIs. Let's see how to call functions with the GPT-4o API.
Defining a Function
First, let's define a function that retrieves quotes from the TV show The Office by season, episode, and character:
CHARACTERS = ["Michael", "Jim", "Dwight", "Pam", "Oscar"]


def get_quotes(season: int, episode: int, character: str, limit: int = 20) -> str:
    url = f"https://the-office.fly.dev/season/{season}/episode/{episode}"
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception("Unable to get quotes")
    data = response.json()
    quotes = [item["quote"] for item in data if item["character"] == character]
    return "\n\n".join(quotes[:limit])


print(get_quotes(3, 2, "Jim", limit=5))
Sample output:
Oh, tell him I say hi.
Yeah, sold about forty thousand.
That is a lot of liquor.
Oh, no, it was… you know, a good opportunity for me, a promotion. I got a chance to…
Michael.
Defining the Tool
Next, we define the tool to use in our chat simulation:
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_quotes",
            "description": "Get quotes from the TV show The Office US",
            "parameters": {
                "type": "object",
                "properties": {
                    "season": {
                        "type": "integer",
                        "description": "Show season",
                    },
                    "episode": {
                        "type": "integer",
                        "description": "Show episode",
                    },
                    "character": {
                        "type": "string",
                        "enum": CHARACTERS,
                    },
                },
                "required": ["season", "episode", "character"],
            },
        },
    }
]
The tool specification format is straightforward: it includes the function's name, a description, and its parameters. Here we describe the get_quotes function along with its required parameters.
Calling the GPT-4o API
Now you can create a prompt and call the GPT-4o API with the available tools:
messages = [
    {
        "role": "system",
        "content": "You are Dwight K. Schrute from the TV show The Office",
    },
    {
        "role": "user",
        "content": "List the funniest 3 quotes from Jim Halpert from episode 4 of season 3",
    },
]

response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=messages,
    tools=tools,
    tool_choice="auto",
    seed=SEED,
    temperature=0.000001,
)

response_message = response.choices[0].message
tool_calls = response_message.tool_calls
tool_calls
[
    ChatCompletionMessageToolCall(
        id="call_4RgTCgvflegSbIMQv4rBXEoi",
        function=Function(
            arguments='{"season":3,"episode":4,"character":"Jim"}', name="get_quotes"
        ),
        type="function",
    )
]
Extracting and Executing the Tool Call
The response contains a tool call to the get_quotes function with the specified arguments. You can now extract the function name and arguments and invoke the function:
# Map tool names to the Python functions that implement them
available_functions = {"get_quotes": get_quotes}

tool_call = tool_calls[0]
function_name = tool_call.function.name
function_to_call = available_functions[function_name]
function_args = json.loads(tool_call.function.arguments)
function_response = function_to_call(**function_args)
Sample function response:
Mmm, that's where you're wrong. I'm your project supervisor today, and I have just decided that we're not doing anything until you get the chips that you require. So, I think we should go get some. Now, please.
And then we checked the fax machine.
[chuckles] He's so cute.
Okay, that is a “no” on the on the West Side Market.
This returns a list of Jim Halpert's quotes from Season 3, Episode 4 of The Office. You can now feed this data back to GPT-4o to generate a response:
# The assistant message carrying the tool call must come before the tool result
messages.append(response_message)
messages.append(
    {
        "tool_call_id": tool_call.id,
        "role": "tool",
        "name": function_name,
        "content": function_response,
    }
)

second_response = client.chat.completions.create(
    model=MODEL_NAME, messages=messages, seed=SEED, temperature=0.000001
)
Generating the Final Response
Here are three of the funniest quotes from Jim Halpert in Episode 4 of Season 3:
1. **Jim Halpert:** "Mmm, that's where you're wrong. I'm your project supervisor
today, and I have just decided that we're not doing anything until you get
the chips that you require. So, I think we should go get some. Now, please."
2. **Jim Halpert:** "[on phone] Hi, yeah. This is Mike from the West Side
Market. Well, we get a shipment of Herr's salt and vinegar chips, and we
ordered that about three weeks ago and haven't … . yeah. You have 'em in the
warehouse. Great. What is my store number… six. Wait, no. I'll call you back.
[quickly hangs up] Shut up [to Karen]."
3. **Jim Halpert:** "Wow. Never pegged you for a quitter."
Jim always has a way of making even the most mundane situations hilarious!
This example shows how to use GPT-4o to build AI agents that can interact with external systems and APIs.
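The dispatch step generalizes to any number of tools: keep a map from tool names to callables, decode the JSON arguments, and invoke. Here's a minimal sketch of that pattern, using a hypothetical get_greeting tool in place of get_quotes:

```python
import json


# Hypothetical tool standing in for get_quotes
def get_greeting(character: str) -> str:
    return f"Hello from {character}!"


# Map tool names (as the model emits them) to local implementations
available_functions = {"get_greeting": get_greeting}

# What a tool call carries: a function name plus JSON-encoded arguments
function_name = "get_greeting"
function_args = json.loads('{"character": "Dwight"}')

function_response = available_functions[function_name](**function_args)
print(function_response)  # Hello from Dwight!
```

With a real tool call, function_name and the argument string would come from tool_call.function.name and tool_call.function.arguments; everything else stays the same.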
Conclusion
In my experience so far, GPT-4o is a significant improvement over GPT-4 Turbo, especially at understanding images. It's cheaper and faster than GPT-4 Turbo, and switching from the older model is painless.
I'm especially keen to explore its function-calling capabilities further. From what I've seen, GPT-4o can meaningfully boost the performance of agent applications.
Overall, if you're looking for better performance and cost efficiency, GPT-4o is an excellent choice.