
Commit f3889f4

Merge branch 'main' into dev/dudilester/dynamic_kv
2 parents 49753b5 + 0a6113b commit f3889f4

File tree

112 files changed: +4215 -1994 lines changed


.cd/Dockerfile.rhel.tenc.pytorch.vllm

Lines changed: 1 addition & 1 deletion

@@ -12,7 +12,7 @@ ARG TORCH_TYPE_SUFFIX
 
 FROM ${DOCKER_URL}/${VERSION}/${BASE_NAME}/${REPO_TYPE}/pytorch-${TORCH_TYPE_SUFFIX}installer-${PT_VERSION}:${REVISION}
 
-# Parameterize commit/branch for vllm-fork checkout
+# Parameterize commit/branch for vllm-plugin checkout
 ARG VLLM_GAUDI_COMMIT=main
 # leave empty to use last-good-commit-for-vllm-gaudi
 ARG VLLM_PROJECT_COMMIT=
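The two build arguments above make the checkout configurable at image build time. A minimal sketch of passing them, assuming a hypothetical tag name and that the base-image ARGs visible in the FROM line (DOCKER_URL, VERSION, BASE_NAME, REPO_TYPE, PT_VERSION, TORCH_TYPE_SUFFIX, REVISION) are supplied the same way:

```bash
# Sketch only: the tag name and any omitted base-image build args are placeholders, not from this diff.
docker build \
  -f Dockerfile.rhel.tenc.pytorch.vllm \
  --build-arg VLLM_GAUDI_COMMIT=main \
  --build-arg VLLM_PROJECT_COMMIT= \
  -t vllm-gaudi:local .
```

Leaving VLLM_PROJECT_COMMIT empty keeps the default noted in the file: the build uses last-good-commit-for-vllm-gaudi.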

.cd/README.md

Lines changed: 37 additions & 3 deletions

@@ -27,7 +27,7 @@ Supports a wide range of validated models including LLaMa, Mistral, and Qwen fam
 
 ### 0. Clone the Repository
 
-Before proceeding with any of the steps below, make sure to clone the vLLM fork repository and navigate to the `.cd` directory. This ensures you have all necessary files and scripts for running the server or benchmarks.
+Before proceeding with any of the steps below, make sure to clone the vLLM plugin repository and navigate to the `.cd` directory. This ensures you have all necessary files and scripts for running the server or benchmarks.
 
 ```bash
 git clone https://github.com/vllm-project/vllm-gaudi.git
@@ -129,7 +129,7 @@ cd vllm-gaudi/.cd/
 MAX_MODEL_LEN=2048 \
 INPUT_TOK=128 \
 OUTPUT_TOK=128 \
-CON_REQ=16 \
+CONCURRENT_REQ=16 \
 NUM_PROMPTS=64 \
 docker compose --profile benchmark up
 ```
@@ -159,7 +159,41 @@ cd vllm-gaudi/.cd/
 > [!NOTE]
 > When using configuration files, you do not need to set the `MODEL` environment variable, as the model name is specified within the configuration file. However, you must still provide your `HF_TOKEN`.
 
-### 7. Running the Server Directly with Docker
+### 7. Advanced Options: Pinning CPU Cores for Memory Access Coherence
+
+To improve memory access coherence and release CPUs to other CPU-only workloads (such as vLLM serving Llama 3 8B on CPU),
+pin the CPU cores to specific CPU NUMA nodes using an auto-generated docker-compose.override.yml file.
+Validated Xeon processors so far: Intel Xeon 6960P and Intel Xeon PLATINUM 8568Y+.
+
+A couple of Python libraries are needed by the scripts, so install the required packages using the following command.
+
+```bash
+pip install -r vllm-gaudi/.cd/server/cpu_binding/requirements_cpu_binding.txt
+```
+
+Run the command below to pin CPU cores via the auto-generated docker-compose.override.yml file.
+
+```bash
+export MODEL="Qwen/Qwen2.5-14B-Instruct"
+export HF_TOKEN="<your huggingface token>"
+export DOCKER_IMAGE="<docker image url>"
+python3 server/cpu_binding/generate_cpu_binding_from_csv.py --settings server/cpu_binding/cpu_binding_gnr.csv --output ./docker-compose.override.yml
+docker compose --profile benchmark up
+```
+
+To also pin the idle CPUs to another service such as vllm-cpu-service, pass that service name so that
+docker-compose.override.yml is updated to bind the second service to the idle CPUs.
+Here is an example that binds the idle CPUs to the vllm-cpu-service service, where docker-compose.vllm-cpu-service.yml defines the CPU service.
+
+```bash
+export MODEL="Qwen/Qwen2.5-14B-Instruct"
+export HF_TOKEN="<your huggingface token>"
+export DOCKER_IMAGE="<docker image url>"
+python3 server/cpu_binding/generate_cpu_binding_from_csv.py --settings server/cpu_binding/cpu_binding_gnr.csv --output ./docker-compose.override.yml --cpuservice vllm-cpu-service
+docker compose --profile benchmark -f docker-compose.yml -f docker-compose.vllm-cpu-service.yml -f docker-compose.override.yml up
+```
+
+### 8. Running the Server Directly with Docker
 
 For full control, you can run the server using the `docker run` command. This approach allows you to specify any native Docker parameters as needed.
 
.cd/benchmark/benchmark_user.env

Lines changed: 1 addition & 1 deletion

@@ -1,5 +1,5 @@
 MODEL
 INPUT_TOK
 OUTPUT_TOK
-CON_REQ
+CONCURRENT_REQ
 NUM_PROMPTS
.cd/docker-compose.yml

Lines changed: 2 additions & 0 deletions

@@ -42,4 +42,6 @@ services:
       - PYTHONUNBUFFERED=1
     env_file:
       - ./benchmark/benchmark_user.env
+    volumes:
+      - ./logs:/root/scripts/logs
     command: ["benchmark", "--config-file", "${VLLM_BENCHMARK_CONFIG_FILE}", "--config-name", "${VLLM_BENCHMARK_CONFIG_NAME}"]

.cd/logs/.gitignore

Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+*
+!/.gitignore
Lines changed: 185 additions & 0 deletions
@@ -0,0 +1,185 @@
+# SPDX-License-Identifier: Apache-2.0
+import os
+import csv
+from importlib import util
+from enum import Enum
+from gaudi_topology import GaudiTopology
+
+REQUIRED_COLUMNS = ["model_id", "input_length", "output_length", "world_size", "data_type", "num_allocated_cpu"]
+
+
+class BindingPolicy(Enum):
+    Evenly_on_NUMAs = "evenly"
+    NUMAs_with_cards = "close2cards"
+
+
+class CPU_Binding:
+
+    def __init__(self, csv_path: str = "cpu_binding_gnr.csv", use_hyperthread: bool = False):
+        self.libnuma_found = util.find_spec("numa") is not None
+        self.psutil_found = util.find_spec("psutil") is not None
+        if self.libnuma_found and self.psutil_found:
+            import psutil
+            from numa import info
+            # Get system info
+            self.cpu_count = psutil.cpu_count(logical=False)
+            self.cpus_allow_list = psutil.Process().cpu_affinity()
+            # print("cpu allow list:", self.cpus_allow_list)
+            self.numa_size = info.get_num_configured_nodes()
+            self.cpu_count_per_numa = self.cpu_count // self.numa_size
+
+        # Get CSV info
+        with open(csv_path, newline="") as f:
+            rows = list(csv.DictReader(f))
+        if not rows or any(col not in rows[0] for col in REQUIRED_COLUMNS):
+            found = list(rows[0].keys()) if rows else "EMPTY CSV"
+            raise ValueError(f"CSV missing required headers {REQUIRED_COLUMNS}. Found: {found}")
+        model = os.environ.get("MODEL")
+        if not model:
+            raise RuntimeError("Set environment variable MODEL to a model_id in the CSV.")
+        input_tok = os.environ.get("INPUT_TOK")
+        output_tok = os.environ.get("OUTPUT_TOK")
+        con_req = os.environ.get("CONCURRENT_REQ")
+        num_allocated_cpu = os.environ.get("NUM_CPUS")
+        print(num_allocated_cpu)
+
+        row = self.pick_row_by_parameters(rows, model, input_tok, output_tok, con_req)
+        print(row["num_allocated_cpu"])
+
+        self.world_size = self.parse_int(row["world_size"], "world_size")
+        # binding_policy column index: 0 -> Evenly_on_NUMAs ("evenly"), 1 -> NUMAs_with_cards ("close2cards")
+        binding_policy_index = self.parse_int(row["binding_policy"], "binding_policy")
+        self.binding_policy = list(BindingPolicy)[binding_policy_index]
+
+        if num_allocated_cpu:
+            self.num_allocated_cpu = int(num_allocated_cpu)
+        elif row["num_allocated_cpu"] == 'NA':
+            raise RuntimeError("Invalid num_allocated_cpu value in the CSV. Set environment variable NUM_CPUS instead.")
+        else:
+            self.num_allocated_cpu = self.parse_int(row["num_allocated_cpu"], "num_allocated_cpu")
+
+        # CPU
+        # Build the per-NUMA CPU lists, restricted to the allowed CPUs
+        self.node_to_cpus = []
+        for i in range(self.numa_size):
+            from numa import info
+            filtered_node_to_cpus = self.filter_one_cpu_per_core(info.node_to_cpus(i))
+            node_intersect = [cpu for cpu in filtered_node_to_cpus if cpu in self.cpus_allow_list]
+            if node_intersect:
+                self.node_to_cpus.append(list(node_intersect))
+        self.node_to_idle_cpus = self.node_to_cpus.copy()
+        # self.node_to_idle_cpus_ht = []  # self.node_to_cpus
+        for i in range(self.numa_size):
+            if use_hyperthread is False:
+                self.node_to_idle_cpus[i] = self.node_to_cpus[i][:self.cpu_count_per_numa]
+            else:
+                self.node_to_idle_cpus[i] = self.node_to_cpus[i][self.cpu_count_per_numa:]
+        # Gaudi
+        topo = GaudiTopology()
+        self.cards = topo.get_cards()
+        if self.cards is not None:
+            self.gaudi_numa_list = []
+            # Assume cards 0 through world_size-1 are used
+            for card in self.cards[:self.world_size]:
+                if card['numa_node'] not in self.gaudi_numa_list:
+                    self.gaudi_numa_list.append(card['numa_node'])
+                print(f"Card {card['card_id']} ({card['model']}):")
+                print(f"  Bus ID     : {card['bus_id']}")
+                print(f"  NUMA Node  : {card['numa_node']}")
+                print(f"  Local CPUs : {card['local_cpulist']}")
+
+    def parse_int(self, v: str, name: str) -> int:
+        try:
+            return int(v)
+        except Exception as err:
+            raise ValueError(f"Invalid integer for {name!r}: {v!r}") from err
+
+    def pick_row_by_parameters(self, rows: list[dict], model: str, input_tok: str, output_tok: str,
+                               con_req: str) -> dict:
+        matches = [
+            r for r in rows
+            if r.get("model_id", "").strip() == model
+            and r.get("input_length", "").strip() == input_tok
+            and r.get("output_length", "").strip() == output_tok
+        ]
+        if not matches:
+            # fallback: match only by model_id
+            matches = [r for r in rows if r.get('model_id', '') == model]
+            print(f"Warning: using fallback entry for model '{model}' without exact input/output token match")
+        if not matches:
+            available = ", ".join(sorted({r.get('model_id', '') for r in rows}))
+            raise ValueError(f"MODEL '{model}', input_length '{input_tok}', output_length '{output_tok}' "
+                             f"not found in CSV. Available: {available}")
+        return matches[0]
+
+    def filter_one_cpu_per_core(self, cpus):
+        """
+        Given a list of CPU IDs (possibly with HT pairs),
+        return a filtered list with only one logical CPU per physical core.
+        """
+        seen_cores = set()
+        filtered = []
+        for cpu in sorted(cpus):
+            core_path = f"/sys/devices/system/cpu/cpu{cpu}/topology/core_id"
+            try:
+                with open(core_path) as f:
+                    core_id = int(f.read().strip())
+            except FileNotFoundError:
+                continue
+            if core_id not in seen_cores:
+                seen_cores.add(core_id)
+                filtered.append(cpu)
+        return filtered
+
+    def get_cpus_id_binding_based_on_numa_nodes(self, rank: int) -> str:
+        """Return the CPU id binding for a rank based on NUMA nodes."""
+        rank_to_cpus = ''
+        if not self.libnuma_found or not self.psutil_found:
+            print("Auto thread-binding is not supported due to "
+                  "the lack of the numa and psutil packages; "
+                  "falling back to no thread-binding. To get better performance, "
+                  "please try to bind threads manually.")
+            return rank_to_cpus
+
+        if self.binding_policy is BindingPolicy.Evenly_on_NUMAs or self.cards is None:
+            # divider = min(self.world_size, len(self.node_to_cpus))
+            self.allocated_cpu_per_numa = self.num_allocated_cpu // len(self.node_to_cpus)
+            node_id = rank
+        elif self.binding_policy is BindingPolicy.NUMAs_with_cards:
+            self.allocated_cpu_per_numa = self.num_allocated_cpu // len(self.gaudi_numa_list)
+            node_id = int(self.cards[rank]['numa_node'])
+
+        print(f"binding numa node_id {node_id} allocated_cpu_per_numa {self.allocated_cpu_per_numa}")
+        # Option 1. Bind to the last N cpu cores
+        start = self.cpu_count_per_numa - self.allocated_cpu_per_numa
+        rank_to_cpus_list = self.node_to_cpus[node_id][start:self.cpu_count_per_numa]
+        # Option 2. Bind to the first N cpu cores
+        # rank_to_cpus_list = self.node_to_cpus[node_id][:self.allocated_cpu_per_numa]
+
+        rank_to_cpus = ','.join(str(x) for x in rank_to_cpus_list)
+        print(f"rank {rank} auto thread-binding list: {rank_to_cpus}")
+        self.node_to_idle_cpus[node_id] = [
+            cpu for cpu in self.node_to_idle_cpus[node_id] if cpu not in rank_to_cpus_list
+        ]
+        return rank_to_cpus
+
+
+if __name__ == "__main__":
+    libnuma_found = util.find_spec("numa") is not None
+    if libnuma_found:
+        from numa import info
+        numa_size = info.get_num_configured_nodes()
+    else:
+        numa_size = 1
+    world_size = numa_size
+    cpu_binder = CPU_Binding(use_hyperthread=False)
+    if cpu_binder.binding_policy is BindingPolicy.Evenly_on_NUMAs or cpu_binder.cards is None:
+        max_needed_numa_size = len(cpu_binder.node_to_cpus)
+    elif cpu_binder.binding_policy is BindingPolicy.NUMAs_with_cards:
+        max_needed_numa_size = min(cpu_binder.world_size, len(cpu_binder.node_to_cpus))
+    for i in range(max_needed_numa_size):
+        rank_to_cpus = cpu_binder.get_cpus_id_binding_based_on_numa_nodes(i)
+        print(rank_to_cpus)
+
+    rank_to_idle_cpus = ','.join(str(x) for row in cpu_binder.node_to_idle_cpus for x in row)
+    print(rank_to_idle_cpus)
+    for r in cpu_binder.node_to_idle_cpus:
+        print(len(r))
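Run as a script, the module prints one comma-separated cpuset string per rank, followed by the remaining idle CPUs per NUMA node. A hypothetical standalone invocation; the module's path is not shown in this diff, so `cpu_binding.py` is a placeholder name:

```bash
# Hypothetical file name; MODEL must match a model_id row in cpu_binding_gnr.csv.
export MODEL="meta-llama/Llama-3.1-70B-Instruct"
export INPUT_TOK=2048
export OUTPUT_TOK=2048
python3 cpu_binding.py
```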
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+model_id,input_length,output_length,world_size,data_type,num_allocated_cpu,binding_policy
+meta-llama/Llama-3.1-405B-Instruct,128,4096,8,bf16,24,0
+meta-llama/Llama-3.1-405B-Instruct,2048,2048,8,bf16,24,0
+meta-llama/Llama-3.1-405B-Instruct,4096,128,8,bf16,24,0
+meta-llama/Llama-3.1-70B-Instruct,128,4096,4,bf16,24,0
+meta-llama/Llama-3.1-70B-Instruct,2048,2048,4,bf16,24,0
+meta-llama/Llama-3.1-70B-Instruct,4096,128,4,bf16,24,0
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+model_id,input_length,output_length,world_size,data_type,num_allocated_cpu,binding_policy
+meta-llama/Llama-3.1-405B-Instruct,128,4096,8,bf16,18,0
+meta-llama/Llama-3.1-405B-Instruct,2048,2048,8,bf16,18,0
+meta-llama/Llama-3.1-405B-Instruct,4096,128,8,bf16,18,0
+meta-llama/Llama-3.1-70B-Instruct,128,4096,4,bf16,12,0
+meta-llama/Llama-3.1-70B-Instruct,2048,2048,4,bf16,12,0
+meta-llama/Llama-3.1-70B-Instruct,4096,128,4,bf16,12,0
+meta-llama/Llama-3.1-8B-Instruct,128,4096,1,bf16,6,0
+meta-llama/Llama-3.1-8B-Instruct,2048,2048,1,bf16,6,0
+meta-llama/Llama-3.1-8B-Instruct,4096,128,1,bf16,6,0
+Qwen/Qwen2.5-14B-Instruct,2048,2048,1,bf16,6,0
