
Commit c00add9

dacorvo, drbh, and Narsil authored
Add Neuron backend (#3033)
* feat: add neuron backend
* feat(neuron): add server standalone installation
* feat(neuron): add server and integration tests
* fix(neuron): increase ulimit when building image

  The base image used to compile the Rust components seems to have a low ulimit for open files, which leads to errors during compilation.

* test(neuron): merge integration tests and fixtures
* test: add --neuron option
* review: do not use latest tag
* review: remove ureq pinned version
* review: --privileged should be the exception
* feat: add neuron case to build ci
* fix(neuron): export models from container in test fixtures

  The neuron tests require models to have been previously exported and cached on the hub. This is done automatically by the neuron.model fixture the first time the tests are run for a specific version. This fixture used to export the models using optimum-neuron directly, but that package is not necessarily present on the system. Instead, the export is now done through the neuron TGI image itself, since it contains all the tools required to export the models. Note that since the CI runs docker-in-docker (dind), it does not seem possible to share a volume between the CI container and the container used to export the model. For that reason, a specific image with a modified entrypoint is built on the fly when a model export is required.

* refactor: remove sagemaker entry-point

  The SageMaker image is built differently anyway.

* fix(neuron): avoid using Levenshtein
* test(neuron): use smaller llama model
* feat(neuron): avoid installing CUDA in image
* test(neuron): no error anymore when requesting too many tokens
* ci: add a precompilation step (with a different token)
* test(neuron): avoid using image sha when exporting models

  We now manually evaluate the apparent hash of the neuron backend by combining the hash of the neuron backend directory and Dockerfile. This new hash is used to identify exported neuron models instead of the image sha. This has two benefits: it changes less frequently (only when the neuron backend changes), which means fewer neuron models being pushed to the hub; and it can be evaluated locally, meaning that running the tests once locally will export the models before the CI uses them. (A sketch of this idea follows the change summary below.)

* test(neuron): add a small script to prune test models

---------

Co-authored-by: drbh <[email protected]>
Co-authored-by: Nicolas Patry <[email protected]>
1 parent 97c5f7e commit c00add9

35 files changed: +3114 -24 lines
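
The backend hash mentioned in the commit message can be evaluated locally. Below is a minimal shell sketch of the idea, assuming the backend sources live under backends/neuron and that a short truncated digest suffices; the repository's actual helper may compute it differently.

# Sketch: key exported neuron models by backend content instead of image sha.
# Hash every file under backends/neuron plus Dockerfile.neuron, then hash the
# combined digest list. The 16-character truncation is an assumption.
{
  find backends/neuron -type f -print0 | sort -z | xargs -0 sha256sum
  sha256sum Dockerfile.neuron
} | sha256sum | cut -c1-16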

.github/workflows/build.yaml

Lines changed: 59 additions & 17 deletions
@@ -25,7 +25,7 @@ jobs:
       docker_volume: ${{ steps.final.outputs.docker_volume }}
       docker_devices: ${{ steps.final.outputs.docker_devices }}
       runs_on: ${{ steps.final.outputs.runs_on }}
-      label: ${{ steps.final.outputs.label }}
+      label_extension: ${{ steps.final.outputs.label_extension }}
       extra_pytest: ${{ steps.final.outputs.extra_pytest }}
     concurrency:
       group: ${{ github.workflow }}-build-and-push-image-${{ inputs.hardware }}-${{ github.head_ref || github.run_id }}
@@ -114,6 +114,16 @@ jobs:
             export extra_pytest="-k test_flash_gemma_simple"
             export target=""
             ;;
+          neuron)
+            export dockerfile="Dockerfile.neuron"
+            export label_extension="-neuron"
+            export docker_devices="/dev/neuron0"
+            export docker_volume="/mnt/cache"
+            export runs_on="aws-inf2-8xlarge"
+            export platform="cpu"
+            export extra_pytest="--neuron"
+            export target=""
+            ;;
           esac
           echo $dockerfile
           echo "Dockerfile=${dockerfile}"
@@ -122,7 +132,7 @@
           echo $runs_on
           echo $platform
           echo "DOCKERFILE=${dockerfile}" >> $GITHUB_ENV
-          echo "LABEL=${label_extension}" >> $GITHUB_ENV
+          echo "LABEL_EXTENSION=${label_extension}" >> $GITHUB_ENV
           echo "PLATFORM=${platform}" >> $GITHUB_ENV
           echo "DOCKER_VOLUME=${docker_volume}" >> $GITHUB_ENV
           echo "DOCKER_DEVICES=${docker_devices}" >> $GITHUB_ENV
@@ -172,7 +182,7 @@
           images: |
             docker.io/huggingface/text-generation-inference-ci
           tags: |
-            type=raw,value=sha-${{ env.GITHUB_SHA_SHORT }}${{ env.LABEL }}
+            type=raw,value=sha-${{ env.GITHUB_SHA_SHORT }}${{ env.LABEL_EXTENSION }}
       # If main, release or tag
       - name: Extract metadata (tags, labels) for Docker
         if: ${{ github.event_name != 'pull_request' }}
@@ -186,10 +196,10 @@
             ghcr.io/huggingface/text-generation-inference
             db4c2190dd824d1f950f5d1555fbadf0.azurecr.io/text-generation-inference
           tags: |
-            type=semver,pattern={{version}}${{ env.LABEL }}
-            type=semver,pattern={{major}}.{{minor}}${{ env.LABEL }}
-            type=raw,value=latest${{ env.LABEL }},enable=${{ github.ref == format('refs/heads/{0}', github.event.repository.default_branch) }}
-            type=raw,value=sha-${{ env.GITHUB_SHA_SHORT }}${{ env.LABEL }}
+            type=semver,pattern={{version}}${{ env.LABEL_EXTENSION }}
+            type=semver,pattern={{major}}.{{minor}}${{ env.LABEL_EXTENSION }}
+            type=raw,value=latest${{ env.LABEL_EXTENSION }},enable=${{ github.ref == format('refs/heads/{0}', github.event.repository.default_branch) }}
+            type=raw,value=sha-${{ env.GITHUB_SHA_SHORT }}${{ env.LABEL_EXTENSION }}
       - name: Build and push Docker image
         id: build-and-push
         uses: docker/build-push-action@v4
@@ -200,7 +210,7 @@
           platforms: 'linux/amd64'
           build-args: |
             GIT_SHA=${{ env.GITHUB_SHA }}
-            DOCKER_LABEL=sha-${{ env.GITHUB_SHA_SHORT }}${{ env.LABEL }}
+            DOCKER_LABEL=sha-${{ env.GITHUB_SHA_SHORT }}${{ env.LABEL_EXTENSION }}
             PLATFORM=${{ env.PLATFORM }}
             build_type=${{ env.BUILD_TYPE }}
             sccache_gha_enabled=on
@@ -209,23 +219,55 @@
           target: ${{ env.TARGET }}
           tags: ${{ steps.meta.outputs.tags || steps.meta-pr.outputs.tags }}
           labels: ${{ steps.meta.outputs.labels || steps.meta-pr.outputs.labels }}
-          cache-from: type=s3,region=us-east-1,bucket=ci-docker-buildx-cache,name=text-generation-inference-cache${{ env.LABEL }},mode=max,access_key_id=${{ secrets.S3_CI_DOCKER_BUILDX_CACHE_ACCESS_KEY_ID }},secret_access_key=${{ secrets.S3_CI_DOCKER_BUILDX_CACHE_SECRET_ACCESS_KEY }},mode=min
-          cache-to: type=s3,region=us-east-1,bucket=ci-docker-buildx-cache,name=text-generation-inference-cache${{ env.LABEL }},mode=min,access_key_id=${{ secrets.S3_CI_DOCKER_BUILDX_CACHE_ACCESS_KEY_ID }},secret_access_key=${{ secrets.S3_CI_DOCKER_BUILDX_CACHE_SECRET_ACCESS_KEY }},mode=min
+          cache-from: type=s3,region=us-east-1,bucket=ci-docker-buildx-cache,name=text-generation-inference-cache${{ env.LABEL_EXTENSION }},mode=max,access_key_id=${{ secrets.S3_CI_DOCKER_BUILDX_CACHE_ACCESS_KEY_ID }},secret_access_key=${{ secrets.S3_CI_DOCKER_BUILDX_CACHE_SECRET_ACCESS_KEY }},mode=min
+          cache-to: type=s3,region=us-east-1,bucket=ci-docker-buildx-cache,name=text-generation-inference-cache${{ env.LABEL_EXTENSION }},mode=min,access_key_id=${{ secrets.S3_CI_DOCKER_BUILDX_CACHE_ACCESS_KEY_ID }},secret_access_key=${{ secrets.S3_CI_DOCKER_BUILDX_CACHE_SECRET_ACCESS_KEY }},mode=min
       - name: Final
         id: final
         run: |
-          echo "docker_image=docker.io/huggingface/text-generation-inference-ci:sha-${{ env.GITHUB_SHA_SHORT}}${{ env.LABEL }}" >> "$GITHUB_OUTPUT"
+          echo "docker_image=docker.io/huggingface/text-generation-inference-ci:sha-${{ env.GITHUB_SHA_SHORT}}${{ env.LABEL_EXTENSION }}" >> "$GITHUB_OUTPUT"
           echo "docker_devices=${{ env.DOCKER_DEVICES }}" >> "$GITHUB_OUTPUT"
           echo "docker_volume=${{ env.DOCKER_VOLUME }}" >> "$GITHUB_OUTPUT"
           echo "runs_on=${{ env.RUNS_ON }}" >> "$GITHUB_OUTPUT"
-          echo "label=${{ env.LABEL }}" >> "$GITHUB_OUTPUT"
+          echo "label_extension=${{ env.LABEL_EXTENSION }}" >> "$GITHUB_OUTPUT"
           echo "extra_pytest=${{ env.EXTRA_PYTEST }}" >> "$GITHUB_OUTPUT"
-  integration_tests:
+  precompile_static_models:
     concurrency:
-      group: ${{ github.workflow }}-${{ github.job }}-${{ needs.build-and-push.outputs.label }}-${{ github.head_ref || github.run_id }}
+      group: ${{ github.workflow }}-${{ github.job }}-${{ needs.build-and-push.outputs.label_extension }}-${{ github.head_ref || github.run_id }}
       cancel-in-progress: true
     needs: build-and-push
-    if: needs.build-and-push.outputs.runs_on != 'ubuntu-latest'
+    if: needs.build-and-push.outputs.label_extension == '-neuron'
+    runs-on:
+      group: ${{ needs.build-and-push.outputs.runs_on }}
+    env:
+      PYTEST_FLAGS: ${{ (startsWith(github.ref, 'refs/tags/') || github.ref == 'refs/heads/main' || inputs.release-tests == true) && '--release' || '--release' }}
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+      - name: Inject slug/short variables
+        uses: rlespinasse/[email protected]
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.11"
+      - name: Install
+        run: |
+          make install-integration-tests
+      - name: Run tests
+        run: |
+          export DOCKER_VOLUME=${{ needs.build-and-push.outputs.docker_volume }}
+          export DOCKER_IMAGE=${{ needs.build-and-push.outputs.docker_image }}
+          export DOCKER_DEVICES=${{ needs.build-and-push.outputs.docker_devices }}
+          export EXTRA_PYTEST="${{ needs.build-and-push.outputs.extra_pytest }}"
+          export HF_TOKEN=${{ secrets.HF_TOKEN_NEURON }}
+          echo $DOCKER_IMAGE
+          docker pull $DOCKER_IMAGE
+          pytest -s -vv integration-tests ${PYTEST_FLAGS} ${EXTRA_PYTEST}
+  integration_tests:
+    concurrency:
+      group: ${{ github.workflow }}-${{ github.job }}-${{ needs.build-and-push.outputs.label_extension }}-${{ github.head_ref || github.run_id }}
+      cancel-in-progress: true
+    needs: [precompile_static_models, build-and-push]
+    if: ${{ always() && !contains(needs.*.result, 'failure') && !contains(needs.*.result, 'cancelled') && needs.build-and-push.outputs.runs_on != 'ubuntu-latest' }}
     runs-on:
       group: ${{ needs.build-and-push.outputs.runs_on }}
     env:
@@ -255,7 +297,7 @@
 
   backend_trtllm_cxx_tests:
     needs: build-and-push
-    if: needs.build-and-push.outputs.label == '-trtllm'
+    if: needs.build-and-push.outputs.label_extension == '-trtllm'
     concurrency:
       group: ${{ github.workflow }}-${{ github.job }}-trtllm-${{ github.head_ref || github.run_id }}
       cancel-in-progress: true
@@ -270,5 +312,5 @@
 
     steps:
       - name: Run C++/CUDA tests
-        if: ${{ env.LABEL == 'ci-runtime' }}
+        if: ${{ env.LABEL_EXTENSION == 'ci-runtime' }}
         run: /usr/local/tgi/bin/tgi_trtllm_backend_tests
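
For reference, the precompile_static_models job above can be mirrored outside CI before the models are needed. A minimal sketch, assuming a Neuron-capable (inf2) host with docker and the integration-test dependencies; the image tag and token are placeholders, not real CI values.

# Sketch: reproduce the CI "Run tests" step by hand on an inf2 host.
# <short-sha> and <your-hf-token> are placeholders.
export DOCKER_VOLUME=/mnt/cache
export DOCKER_IMAGE=docker.io/huggingface/text-generation-inference-ci:sha-<short-sha>-neuron
export DOCKER_DEVICES=/dev/neuron0
export HF_TOKEN=<your-hf-token>
make install-integration-tests
docker pull $DOCKER_IMAGE
pytest -s -vv integration-tests --release --neuron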

.github/workflows/ci_build.yaml

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ jobs:
       # fail-fast is true by default
       fail-fast: false
       matrix:
-        hardware: ["cuda", "cuda-trtllm", "rocm", "intel-xpu", "intel-cpu"]
+        hardware: ["cuda", "cuda-trtllm", "rocm", "intel-xpu", "intel-cpu", "neuron"]
     uses: ./.github/workflows/build.yaml # calls the one above ^
     permissions:
       contents: write

Dockerfile.neuron

Lines changed: 169 additions & 0 deletions
@@ -0,0 +1,169 @@
+# Fetch and extract the TGI sources
+FROM alpine AS tgi
+RUN mkdir -p /tgi
+
+# Fetch the optimum-neuron sources directly to avoid relying on pypi deployments
+FROM alpine AS optimum-neuron
+RUN mkdir -p /optimum-neuron
+ADD https://github.com/huggingface/optimum-neuron/archive/refs/tags/v0.0.28.tar.gz /optimum-neuron/sources.tar.gz
+RUN tar -C /optimum-neuron -xf /optimum-neuron/sources.tar.gz --strip-components=1
+
+# Build cargo components (adapted from TGI original Dockerfile)
+# Note: we cannot use the cargo-chef base image as it uses python 3.11
+FROM ubuntu:22.04 AS chef
+
+RUN apt-get update -y \
+    && apt-get install -y --no-install-recommends \
+    curl ca-certificates build-essential \
+    && rm -rf /var/lib/apt/lists/* \
+    && apt-get clean
+
+RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- --default-toolchain 1.80.1 --profile minimal -y
+ENV PATH="/root/.cargo/bin:${PATH}"
+RUN cargo install cargo-chef --locked
+
+WORKDIR /usr/src
+
+ARG CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse
+
+FROM chef AS planner
+COPY backends/neuron/Cargo.toml Cargo.toml
+COPY Cargo.lock Cargo.lock
+COPY rust-toolchain.toml rust-toolchain.toml
+COPY proto proto
+COPY router router
+COPY backends backends
+COPY launcher launcher
+RUN cargo chef prepare --recipe-path recipe.json
+
+FROM chef AS builder
+
+RUN apt-get update -y \
+    && apt-get install -y --no-install-recommends \
+    unzip python3-dev libssl-dev pkg-config \
+    && rm -rf /var/lib/apt/lists/* \
+    && apt-get clean
+
+RUN PROTOC_ZIP=protoc-21.12-linux-x86_64.zip && \
+    curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP && \
+    unzip -o $PROTOC_ZIP -d /usr/local bin/protoc && \
+    unzip -o $PROTOC_ZIP -d /usr/local 'include/*' && \
+    rm -f $PROTOC_ZIP
+
+COPY backends/neuron/Cargo.toml Cargo.toml
+COPY --from=planner /usr/src/recipe.json recipe.json
+RUN cargo chef cook --release --recipe-path recipe.json
+
+COPY Cargo.lock Cargo.lock
+COPY rust-toolchain.toml rust-toolchain.toml
+COPY proto proto
+COPY router router
+COPY backends backends
+COPY launcher launcher
+RUN cargo build --release
+
+# Python base image
+FROM ubuntu:22.04 AS base
+
+RUN apt-get update -y \
+    && apt-get install -y --no-install-recommends \
+    python3-pip \
+    python3-setuptools \
+    python-is-python3 \
+    && rm -rf /var/lib/apt/lists/* \
+    && apt-get clean
+RUN pip3 --no-cache-dir install --upgrade pip
+
+# Python server build image
+FROM base AS pyserver
+
+RUN apt-get update -y \
+    && apt-get install -y --no-install-recommends \
+    make \
+    python3-venv \
+    && rm -rf /var/lib/apt/lists/* \
+    && apt-get clean
+
+RUN install -d /pyserver
+WORKDIR /pyserver
+COPY backends/neuron/server server
+COPY proto proto
+RUN pip3 install -r server/build-requirements.txt
+RUN VERBOSE=1 BUILDDIR=/pyserver/build PROTODIR=/pyserver/proto make -C server package
+
+# Neuron base image (used for deployment)
+FROM base AS neuron
+
+# Install system prerequisites
+RUN apt-get update -y \
+    && apt-get install -y --no-install-recommends \
+    gnupg2 \
+    wget \
+    python3-dev \
+    libexpat1 \
+    && rm -rf /var/lib/apt/lists/* \
+    && apt-get clean
+
+RUN echo "deb https://apt.repos.neuron.amazonaws.com jammy main" > /etc/apt/sources.list.d/neuron.list
+RUN wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add -
+
+# Install neuronx packages
+RUN apt-get update -y \
+    && apt-get install -y --no-install-recommends \
+    aws-neuronx-dkms=2.18.20.0 \
+    aws-neuronx-collectives=2.22.33.0-d2128d1aa \
+    aws-neuronx-runtime-lib=2.22.19.0-5856c0b42 \
+    aws-neuronx-tools=2.19.0.0 \
+    libxml2 \
+    && rm -rf /var/lib/apt/lists/* \
+    && apt-get clean
+
+ENV PATH="/opt/bin/:/opt/aws/neuron/bin:${PATH}"
+
+# Install manually torch CPU version to avoid pulling CUDA
+RUN pip3 install \
+    torch==2.1.2 \
+    torchvision==0.16.2 \
+    --index-url https://download.pytorch.org/whl/cpu
+
+RUN pip3 install \
+    neuronx-cc==2.15.143.0 \
+    torch-neuronx==2.1.2.2.3.2 \
+    transformers-neuronx==0.12.313 \
+    neuronx-distributed==0.9.0 \
+    libneuronxla==2.0.5347.0 \
+    --extra-index-url=https://pip.repos.neuron.amazonaws.com
+
+# Install HuggingFace packages
+RUN pip3 install \
+    hf_transfer huggingface_hub
+
+# Install optimum-neuron
+COPY --from=optimum-neuron /optimum-neuron optimum-neuron
+RUN pip3 install ./optimum-neuron
+
+# TGI base env
+ENV HUGGINGFACE_HUB_CACHE=/tmp \
+    HF_HUB_ENABLE_HF_TRANSFER=1 \
+    PORT=80
+
+# Disable color logs as they are not supported by CloudWatch
+ENV LOGURU_COLORIZE=NO
+ENV LOG_COLORIZE=0
+
+# Install router
+COPY --from=builder /usr/src/target/release/text-generation-router-v2 /usr/local/bin/text-generation-router
+# Install launcher
+COPY --from=builder /usr/src/target/release/text-generation-launcher /usr/local/bin/text-generation-launcher
+# Install python server
+COPY --from=pyserver /pyserver/build/dist dist
+RUN pip install dist/text_generation_server*.tar.gz
+
+# Final image
+FROM neuron
+
+COPY backends/neuron/tgi_env.py /tgi_env.py
+COPY backends/neuron/tgi-entrypoint.sh /tgi-entrypoint.sh
+RUN chmod +x /tgi-entrypoint.sh
+
+ENTRYPOINT ["/tgi-entrypoint.sh"]
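
The multi-stage image above can be built and smoke-tested locally. A hedged sketch, assuming docker on a host that exposes /dev/neuron0 (the device the CI workflow mounts); the tag, cache mount, and model id are illustrative, and the --ulimit flag reflects the commit note about the base image's low open-file limit during the Rust build.

# Sketch: build the Neuron image from the repository root, raising the
# open-file ulimit as the commit message suggests, then launch the server.
# tgi-neuron:dev and <exported-neuron-model> are illustrative placeholders.
docker build --ulimit nofile=65535:65535 -f Dockerfile.neuron -t tgi-neuron:dev .
docker run -p 8080:80 \
  --device /dev/neuron0 \
  -v /mnt/cache:/data \
  -e HUGGINGFACE_HUB_CACHE=/data \
  tgi-neuron:dev \
  --model-id <exported-neuron-model>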

backends/neuron/Cargo.toml

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+[workspace]
+members = [
+  "backends/v2",
+  "backends/grpc-metadata",
+  "launcher",
+  "router"
+]
+default-members = [
+  "backends/v2",
+  "backends/grpc-metadata",
+  "launcher",
+  "router"
+]
+resolver = "2"
+
+[workspace.package]
+version = "3.0.0"
+edition = "2021"
+authors = ["Olivier Dehaene"]
+homepage = "https://github.com/huggingface/text-generation-inference"
+
+[workspace.dependencies]
+base64 = "0.22.0"
+tokenizers = { version = "0.20.0", features = ["http"] }
+hf-hub = { version = "0.3.1", features = ["tokio"] }
+metrics = { version = "0.23.0" }
+metrics-exporter-prometheus = { version = "0.15.1", features = [] }
+minijinja = { version = "2.2.0", features = ["json"] }
+minijinja-contrib = { version = "2.0.2", features = ["pycompat"] }
+pyo3 = { version = "0.22.2", features = ["auto-initialize"] }
+
+[profile.release]
+incremental = true
+
+[profile.release-binary]
+inherits = "release"
+debug = 1
+incremental = true
+panic = "abort"
+
+[profile.release-opt]
+inherits = "release"
+debug = 0
+incremental = false
+lto = "fat"
+opt-level = 3
+codegen-units = 1
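
This manifest replaces the repository's root Cargo.toml inside the image build (the Dockerfile copies backends/neuron/Cargo.toml over Cargo.toml before invoking cargo-chef). The builder stage's dependency-caching flow can be reproduced by hand as the following sketch shows; it assumes cargo-chef is installed and is run from a checkout root rather than inside docker.

# Sketch of the builder stage's cargo-chef flow, outside docker.
cp backends/neuron/Cargo.toml Cargo.toml              # swap in the neuron workspace
cargo chef prepare --recipe-path recipe.json          # snapshot the dependency graph
cargo chef cook --release --recipe-path recipe.json   # pre-build dependencies only
cargo build --release                                 # build router and launcher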
