This series is based on Qwen2.5-7B and walks through quantizing models for vLLM, then measuring serving performance with benchmark_serving.py and accuracy with lm_eval.

Test environment:
- OS: CentOS 7
- GPU: NVIDIA L40
- Driver: 550.54.15
- CUDA: 12.3

This is the third post in the series: INT4 W4A16.
1. Quantization

The script below uses llm-compressor to apply GPTQ with the W4A16 scheme (INT4 weights, 16-bit activations, group size 128) to all Linear layers except lm_head, calibrating on 512 samples from ultrachat_200k.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model_path = "./Qwen2.5-7B"
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype="auto", trust_remote_code=False)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=False)

NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load and preprocess the calibration dataset
ds = load_dataset("/data/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)

ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm: GPTQ with 4-bit weights / 16-bit
# activations, applied to all Linear layers except lm_head
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save the compressed model
SAVE_DIR = "./Qwen2.5-7B-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
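Before deploying, you can sanity-check the compressed checkpoint with vLLM's offline API. A minimal sketch (the prompt and sampling settings here are arbitrary):

```python
from vllm import LLM, SamplingParams

# Load the compressed checkpoint and generate a short completion
llm = LLM(model="./Qwen2.5-7B-W4A16-G128")
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["What does INT4 W4A16 quantization mean?"], params)
print(outputs[0].outputs[0].text)
```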
2. Deployment

```bash
vllm serve Qwen2.5-7B-W4A16-G128 --disable-log-requests
```
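Once the server is up, you can hit its OpenAI-compatible completions endpoint directly. A minimal sketch assuming vLLM's default port 8000; the prompt is arbitrary:

```python
import requests

# Query the OpenAI-compatible /v1/completions endpoint served by vLLM
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "Qwen2.5-7B-W4A16-G128",
        "prompt": "Explain INT4 weight-only quantization in one sentence.",
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["text"])
```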
3. Benchmark

```bash
python /vllm/benchmarks/benchmark_serving.py --backend vllm --model Qwen2.5-7B-W4A16-G128 --endpoint /v1/completions --dataset-name sharegpt --dataset-path ./ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100
```
```
============ Serving Benchmark Result ============
Successful requests:                     100
Benchmark duration (s):                  12.28
Total input tokens:                      23260
Total generated tokens:                  22059
Request throughput (req/s):              8.14
Output token throughput (tok/s):         1795.86
Total Token throughput (tok/s):          3689.50
---------------Time to First Token----------------
Mean TTFT (ms):                          1529.15
Median TTFT (ms):                        1562.42
P99 TTFT (ms):                           2646.95
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          44.98
Median TPOT (ms):                        22.79
P99 TPOT (ms):                           223.34
---------------Inter-token Latency----------------
Mean ITL (ms):                           19.59
Median ITL (ms):                         14.49
P99 ITL (ms):                            222.72
==================================================
```
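The throughput figures follow directly from the totals above; a quick sanity check (small differences from the reported values come from rounding of the duration):

```python
duration = 12.28        # benchmark duration (s), as reported
input_tokens = 23260
output_tokens = 22059
requests = 100

print(requests / duration)                        # ≈ 8.14 req/s
print(output_tokens / duration)                   # ≈ 1796 tok/s output throughput
print((input_tokens + output_tokens) / duration)  # ≈ 3690 tok/s total throughput
```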
4. lm_eval

4.1 gsm8k

```bash
lm_eval --model vllm \
  --model_args pretrained="./Qwen2.5-7B-W4A16-G128",add_bos_token=true,gpu_memory_utilization=0.5 \
  --tasks gsm8k \
  --num_fewshot 5 \
  --limit 250 \
  --batch_size 'auto'
```
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.784 | ± | 0.0261 |
| | | strict-match | 5 | exact_match | ↑ | 0.688 | ± | 0.0294 |
4.2 mmlu

```bash
lm_eval --model vllm \
  --model_args pretrained="./Qwen2.5-7B-W4A16-G128",add_bos_token=true,gpu_memory_utilization=0.5 \
  --tasks mmlu \
  --num_fewshot 5 \
  --limit 250 \
  --batch_size 'auto'
```
| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | ↑ | 0.7527 | ± | 0.0041 |
| - humanities | 2 | none | | acc | ↑ | 0.7635 | ± | 0.0081 |
| - other | 2 | none | | acc | ↑ | 0.7485 | ± | 0.0084 |
| - social sciences | 2 | none | | acc | ↑ | 0.8333 | ± | 0.0077 |
| - stem | 2 | none | | acc | ↑ | 0.6846 | ± | 0.0083 |