
Seven Mainstream Open-Source Libraries for LLM Inference and Serving
Since the release of ChatGPT, open-source large language models at home and abroad have sprung up like mushrooms. For many companies and individuals, however, training a model from scratch is unrealistic, and even fine-tuning an open-source LLM can stretch their resources. Directly deploying these open-source models to serve business workloads is therefore a very promising path. This article introduces seven mainstream open-source libraries for LLM inference and serving.
First, the characteristics of these frameworks are summarized in the comparison table below. There are many frameworks for LLM inference, each with its own strengths; the key points of the seven frameworks in the table are introduced one by one below:
The comparison of the seven deployment frameworks below was run on an A100 GPU with 40 GB of memory, using the LLaMA-1 13B model (because every library on the list supports it).
vLLM delivers 14x-24x higher throughput than HuggingFace Transformers (HF) and 2.2x-2.5x higher throughput than HuggingFace Text Generation Inference (TGI).
Offline batch inference
# pip install vllm
from vllm import LLM, SamplingParams

prompts = [
    "Funniest joke ever:",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.95, top_p=0.95, max_tokens=200)
llm = LLM(model="huggyllama/llama-13b")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
API Server
# Start the server:
python -m vllm.entrypoints.api_server --model huggyllama/llama-13b
# Query the model in shell:
curl http://localhost:8000/generate \
    -d '{
        "prompt": "Funniest joke ever:",
        "n": 1,
        "temperature": 0.95,
        "max_tokens": 200
    }'
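The same request can also be sent from Python; below is a minimal sketch using the requests library (assuming the demo server started above is listening on localhost:8000):

import requests

# Same payload as the curl example above.
payload = {
    "prompt": "Funniest joke ever:",
    "n": 1,
    "temperature": 0.95,
    "max_tokens": 200,
}
response = requests.post("http://localhost:8000/generate", json=payload)
print(response.json())  # the demo server responds with JSON containing the generated text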
Features:
Pros:
Cons:
This is the fastest library for LLM inference. Thanks to its internal optimizations, it significantly outperforms the competition. Its main weakness, however, is the limited range of supported models.
vLLM's development roadmap is available at: https://github.com/vllm-project/vllm/issues/244
Text Generation Inference is a Rust, Python, and gRPC server for text-generation inference; it is already used at HuggingFace to power its LLM inference API.
Run the web server with Docker
mkdir data
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v data:/data ghcr.io/huggingface/text-generation-inference:0.9 \
    --model-id huggyllama/llama-13b \
    --num-shard 1
Query example
# pip install text-generation
from text_generation import Client

client = Client("http://127.0.0.1:8080")
prompt = "Funniest joke ever:"
print(client.generate(prompt, max_new_tokens=17, temperature=0.95).generated_text)
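TGI can also stream tokens as they are generated. A minimal sketch with the same text_generation client (generate_stream yields one token per iteration):

from text_generation import Client

client = Client("http://127.0.0.1:8080")
text = ""
# Accumulate streamed tokens, skipping special tokens such as end-of-sequence.
for response in client.generate_stream("Funniest joke ever:",
                                       max_new_tokens=17,
                                       temperature=0.95):
    if not response.token.special:
        text += response.token.text
print(text)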
Features:
Pros:
Cons:
Text Generation Inference's development roadmap is available at: https://github.com/huggingface/text-generation-inference/issues/232
CTranslate2 is a C++ and Python library for efficient inference with Transformer models.
Convert the model
pip install -qqq transformers ctranslate2
# The model should be first converted into the CTranslate2 model format:
ct2-transformers-converter --model huggyllama/llama-13b --output_dir llama-13b-ct2 --force
Query example
import ctranslate2
import transformers

generator = ctranslate2.Generator("llama-13b-ct2", device="cuda", compute_type="float16")
tokenizer = transformers.AutoTokenizer.from_pretrained("huggyllama/llama-13b")

prompt = "Funniest joke ever:"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch(
    [tokens],
    sampling_topk=1,
    max_length=200,
)
tokens = results[0].sequences_ids[0]
output = tokenizer.decode(tokens)
print(output)
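If GPU memory is tight, the same converted model can also be loaded with a lighter compute type. A minimal sketch, assuming the accuracy loss from 8-bit weights is acceptable for your use case:

import ctranslate2

# Load the converted model with 8-bit weights and float16 activations to cut memory use
# (assumption: the small accuracy loss from int8 quantization is acceptable).
generator_int8 = ctranslate2.Generator(
    "llama-13b-ct2",
    device="cuda",
    compute_type="int8_float16",
)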
Features:
Pros:
Cons:
Backed by DeepSpeed, DeepSpeed-MII enables low-latency, high-throughput inference.
Run the web service
# DON'T INSTALL USING pip install deepspeed-mii
# git clone https://github.com/microsoft/DeepSpeed-MII.git
# git reset --hard 60a85dc3da5bac3bcefa8824175f8646a0f12203
# cd DeepSpeed-MII && pip install .
# pip3 install -U deepspeed
# ... and make sure that you have same CUDA versions:
# python -c "import torch;print(torch.version.cuda)" == nvcc --version
import mii

mii_configs = {
    "dtype": "fp16",
    "max_tokens": 200,
    "tensor_parallel": 1,
    "enable_load_balancing": False,
}
mii.deploy(task="text-generation",
           model="huggyllama/llama-13b",
           deployment_name="llama_13b_deployment",
           mii_config=mii_configs)
Query example
import mii

generator = mii.mii_query_handle("llama_13b_deployment")
result = generator.query(
    {"query": ["Funniest joke ever:"]},
    do_sample=True,
    max_new_tokens=200,
)
print(result)
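When the deployment is no longer needed, it can be torn down from Python as well; a short sketch using the same deployment name as above:

import mii

# Shut down the "llama_13b_deployment" created earlier and free its GPU memory.
mii.terminate("llama_13b_deployment")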
Features:
Pros:
Cons:
OpenLLM is an open platform for operating large language models (LLMs) in production.
Run the web service
pip install openllm scipy
openllm start llama --model-id huggyllama/llama-13b \
    --max-new-tokens 200 \
    --temperature 0.95 \
    --api-workers 1 \
    --workers-per-resource 1
Query example
import openllm

client = openllm.client.HTTPClient('http://localhost:3000')
print(client.query("Funniest joke ever:"))
Features:
Pros:
Cons:
Ray Serve is a scalable model-serving library for building online inference APIs. Serve is framework-agnostic, so a single toolkit can be used to serve deep learning models built with any framework.
Run the web service
# pip install ray[serve] accelerate>=0.16.0 transformers>=4.26.0 torch starlette pandas
# ray_serve.py
import pandas as pd

import ray
from ray import serve
from starlette.requests import Request


@serve.deployment(ray_actor_options={"num_gpus": 1})
class PredictDeployment:
    def __init__(self, model_id: str):
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch

        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

    def generate(self, text: str) -> pd.DataFrame:
        input_ids = self.tokenizer(text, return_tensors="pt").input_ids.to(
            self.model.device
        )
        gen_tokens = self.model.generate(
            input_ids,
            temperature=0.9,
            max_length=200,
        )
        return pd.DataFrame(
            self.tokenizer.batch_decode(gen_tokens), columns=["responses"]
        )

    async def __call__(self, http_request: Request) -> str:
        json_request = await http_request.json()
        # The client posts a list of {"text": ...} objects; generate from the first one.
        return self.generate(json_request[0]["text"])


deployment = PredictDeployment.bind(model_id="huggyllama/llama-13b")

# then run from CLI command:
# serve run ray_serve:deployment
Query example
import requests

sample_input = {"text": "Funniest joke ever:"}
output = requests.post("http://localhost:8000/", json=[sample_input]).json()
print(output)
Features:
Pros:
Cons:
If you need the most production-ready solution, and not just for deep learning, Ray Serve is a good choice. It is best suited for enterprises where availability, scalability, and observability are important. You can also use its vast ecosystem for data processing, model training, fine-tuning, and serving. Finally, companies from OpenAI to Shopify and Instacart use it.
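As a concrete example of that scalability, the deployment above can be given more replicas without changing the class itself. A minimal sketch, assuming two GPUs are available and that the ray_serve.py file from the earlier example is on the import path (the file and variable names here are illustrative):

# ray_serve_scaled.py -- hypothetical variant of the file above (assumes 2 GPUs are available)
from ray import serve

# Reuse the PredictDeployment class defined in ray_serve.py
from ray_serve import PredictDeployment

# .options() overrides deployment settings without editing the class:
# two replicas with one GPU each, so requests are load-balanced across them.
scaled_deployment = PredictDeployment.options(
    num_replicas=2,
    ray_actor_options={"num_gpus": 1},
).bind(model_id="huggyllama/llama-13b")

# then run from CLI command:
# serve run ray_serve_scaled:scaled_deployment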
Machine Learning Compilation for LLM (MLC LLM) is a universal deployment solution that lets LLMs run efficiently on consumer devices, leveraging native hardware acceleration.
Run the web service
# 1. Make sure that you have python >= 3.9
# 2. You have to run it using conda:
conda create -n mlc-chat-venv -c mlc-ai -c conda-forge mlc-chat-nightly
conda activate mlc-chat-venv

# 3. Then install package:
pip install --pre --force-reinstall mlc-ai-nightly-cu118 \
    mlc-chat-nightly-cu118 \
    -f https://mlc.ai/wheels

# 4. Download the model weights from HuggingFace and binary libraries:
git lfs install && mkdir -p dist/prebuilt && \
    git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/prebuilt/lib && \
    cd dist/prebuilt && \
    git clone https://huggingface.co/huggyllama/llama-13b dist/ && \
    cd ../..

# 5. Run server:
python -m mlc_chat.rest --device-name cuda --artifact-path dist
Query example
import requests

payload = {
    "model": "llama-13b",
    "messages": [{"role": "user", "content": "Funniest joke ever:"}],
    "stream": False,
}
r = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)
print(r.json()['choices'][0]['message']['content'])
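Because the REST server exposes an OpenAI-style /v1/chat/completions route, the official openai Python client (0.x API shown here) can also be pointed at it; a sketch in which the api_key value is only a placeholder, since the local server is assumed not to validate keys:

import openai

openai.api_key = "none"                       # placeholder; assumed to be ignored by the local server
openai.api_base = "http://127.0.0.1:8000/v1"  # route requests to the local MLC LLM server

resp = openai.ChatCompletion.create(
    model="llama-13b",
    messages=[{"role": "user", "content": "Funniest joke ever:"}],
)
print(resp["choices"][0]["message"]["content"])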
Features:
Pros:
Cons:
If you need to deploy an application on iOS or Android devices, this library is exactly what you need: it lets you quickly compile a model natively and deploy it on-device. However, it is not recommended if you need a heavily loaded server.
This article is reposted from the WeChat public account @ArronAI.