Headline
GHSA-6fvq-23cw-5628: vLLM: Resource-Exhaustion (DoS) through Malicious Jinja Template in OpenAI-Compatible Server
Summary
A resource-exhaustion (denial-of-service) vulnerability exists in multiple endpoints of the OpenAI-Compatible Server due to the ability to specify Jinja templates via the chat_template
and chat_template_kwargs
parameters. If an attacker can supply these parameters to the API, they can cause a service outage by exhausting CPU and/or memory resources.
Details
When using an LLM as a chat model, the conversation history must be rendered into a text input for the model. In hf/transformer
, this rendering is performed using a Jinja template. The OpenAI-Compatible Server launched by vllm serve exposes a chat_template
parameter that lets users specify that template. In addition, the server accepts a chat_template_kwargs
parameter to pass extra keyword arguments to the rendering function.
Because Jinja templates support programming-language-like constructs (loops, nested iterations, etc.), a crafted template can consume extremely large amounts of CPU and memory and thereby trigger a denial-of-service condition.
Importantly, simply forbidding the chat_template
parameter does not fully mitigate the issue. The implementation constructs a dictionary of keyword arguments for apply_hf_chat_template
and then updates that dictionary with the user-supplied chat_template_kwargs
via dict.update
. Since dict.update
can overwrite existing keys, an attacker can place a chat_template
key inside chat_template_kwargs
to replace the template that will be used by apply_hf_chat_template
.
# vllm/entrypoints/openai/serving_engine.py#L794-L816
_chat_template_kwargs: dict[str, Any] = dict(
chat_template=chat_template,
add_generation_prompt=add_generation_prompt,
continue_final_message=continue_final_message,
tools=tool_dicts,
documents=documents,
)
_chat_template_kwargs.update(chat_template_kwargs or {})
request_prompt: Union[str, list[int]]
if isinstance(tokenizer, MistralTokenizer):
...
else:
request_prompt = apply_hf_chat_template(
tokenizer=tokenizer,
conversation=conversation,
model_config=model_config,
**_chat_template_kwargs,
)
Impact
If an OpenAI-Compatible Server exposes endpoints that accept chat_template
or chat_template_kwargs
from untrusted clients, an attacker can submit a malicious Jinja template (directly or by overriding chat_template
inside chat_template_kwargs
) that consumes excessive CPU and/or memory. This can result in a resource-exhaustion denial-of-service that renders the server unresponsive to legitimate requests.
Fixes
- https://github.com/vllm-project/vllm/pull/25794
Summary
A resource-exhaustion (denial-of-service) vulnerability exists in multiple endpoints of the OpenAI-Compatible Server due to the ability to specify Jinja templates via the chat_template and chat_template_kwargs parameters. If an attacker can supply these parameters to the API, they can cause a service outage by exhausting CPU and/or memory resources.
Details
When using an LLM as a chat model, the conversation history must be rendered into a text input for the model. In hf/transformer, this rendering is performed using a Jinja template. The OpenAI-Compatible Server launched by vllm serve exposes a chat_template parameter that lets users specify that template. In addition, the server accepts a chat_template_kwargs parameter to pass extra keyword arguments to the rendering function.
Because Jinja templates support programming-language-like constructs (loops, nested iterations, etc.), a crafted template can consume extremely large amounts of CPU and memory and thereby trigger a denial-of-service condition.
Importantly, simply forbidding the chat_template parameter does not fully mitigate the issue. The implementation constructs a dictionary of keyword arguments for apply_hf_chat_template and then updates that dictionary with the user-supplied chat_template_kwargs via dict.update. Since dict.update can overwrite existing keys, an attacker can place a chat_template key inside chat_template_kwargs to replace the template that will be used by apply_hf_chat_template.
# vllm/entrypoints/openai/serving_engine.py#L794-L816 _chat_template_kwargs: dict[str, Any] = dict( chat_template=chat_template, add_generation_prompt=add_generation_prompt, continue_final_message=continue_final_message, tools=tool_dicts, documents=documents, ) _chat_template_kwargs.update(chat_template_kwargs or {})
request_prompt: Union[str, list[int]] if isinstance(tokenizer, MistralTokenizer): … else: request_prompt = apply_hf_chat_template( tokenizer=tokenizer, conversation=conversation, model_config=model_config, **_chat_template_kwargs, )
Impact
If an OpenAI-Compatible Server exposes endpoints that accept chat_template or chat_template_kwargs from untrusted clients, an attacker can submit a malicious Jinja template (directly or by overriding chat_template inside chat_template_kwargs) that consumes excessive CPU and/or memory. This can result in a resource-exhaustion denial-of-service that renders the server unresponsive to legitimate requests.
Fixes
- vllm-project/vllm#25794
References
- GHSA-6fvq-23cw-5628
- vllm-project/vllm#25794
- vllm-project/vllm@7977e50