# Offloading Prefix Cache to CPU Memory

## Overview

This guide provides recipes for offloading the prefix (KV) cache to CPU RAM, using either vLLM's native OffloadingConnector or the LMCache connector.
## Prerequisites

- All prerequisites from the parent prefix-cache-storage guide.
- Client tools such as `kubectl` and `helm` installed on your local system.
- A cluster with enough capacity to deploy high-scale inference, including spare CPU RAM on the model server nodes for the offloaded cache.
## Installation

First, set up a namespace for the deployment and create the HuggingFace token secret.

```bash
export NAMESPACE=llm-d-pfc-cpu # or any other namespace
kubectl create namespace ${NAMESPACE}

# NOTE: You must have your HuggingFace token stored in the HF_TOKEN environment variable.
export HF_TOKEN="<your-hugging-face-token>"
kubectl create secret generic llm-d-hf-token --from-literal=HF_TOKEN=${HF_TOKEN} -n ${NAMESPACE}
```
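You can confirm the secret was created before proceeding:

```bash
kubectl get secret llm-d-hf-token -n ${NAMESPACE}
```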
### 1. Deploy Gateway and HTTPRoute

Deploy the Gateway and HTTPRoute using the gateway recipe.
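For orientation, the resources the recipe creates look roughly like the following. This is a sketch only: the resource names match the verification output later in this guide, but the listener and backend details are assumptions, so use the recipe's actual manifests.

```bash
# Sketch only: resource names follow the verification output below;
# listener and backend details are assumptions.
kubectl apply -n ${NAMESPACE} -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: llm-d-inference-gateway
spec:
  gatewayClassName: gke-l7-regional-external-managed
  listeners:
  - name: http
    protocol: HTTP
    port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-d-route
spec:
  parentRefs:
  - name: llm-d-inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: llm-d-infpool
EOF
```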
### 2. Deploy InferencePool

Deploy the InferencePool using the InferencePool recipe.
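Note that the cleanup section below removes the InferencePool with `helm uninstall llm-d-infpool`, which suggests the recipe installs it as a Helm release named `llm-d-infpool`. A hypothetical sketch, with the chart reference and values as placeholders:

```bash
# Hypothetical: the chart reference and the --set path are placeholders;
# follow the InferencePool recipe for the actual command.
helm install llm-d-infpool <inferencepool-chart> -n ${NAMESPACE} \
  --set inferencePool.modelServers.matchLabels.app=llm-d-model-server
```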
### 3. Deploy vLLM Model Server

Choose one of the two connectors below.

#### Option A: Offloading Connector

Deploy the vLLM model server with the OffloadingConnector enabled.

```bash
kubectl apply -k ./manifests/vllm/offloading-connector -n ${NAMESPACE}
```
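Under the hood, the manifest configures vLLM's KV-transfer settings so that KV blocks evicted from GPU memory are retained in CPU RAM. A minimal sketch of the kind of flags involved, assuming a recent vLLM; the model name, `kv_role`, and the extra-config key are illustrative, and the kustomize manifest above carries the real values:

```bash
# Illustrative sketch only; the kustomize manifest carries the real values.
# kv_role and num_cpu_blocks are assumptions, not taken from this guide.
vllm serve <your-model> \
  --kv-transfer-config '{"kv_connector": "OffloadingConnector",
                         "kv_role": "kv_both",
                         "kv_connector_extra_config": {"num_cpu_blocks": 100000}}'
```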
#### Option B: LMCache Connector

Deploy the vLLM model server with the `LMCache` connector enabled.

```bash
kubectl apply -k ./manifests/vllm/lm-cache-connector -n ${NAMESPACE}
```
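With LMCache, the CPU cache is typically governed by LMCache's own settings rather than vLLM flags. A sketch under that assumption; the environment variable values and the model name are placeholders:

```bash
# Illustrative sketch only; the kustomize manifest carries the real values.
export LMCACHE_LOCAL_CPU=True          # keep KV blocks in CPU RAM
export LMCACHE_MAX_LOCAL_CPU_SIZE=20   # CPU cache budget in GiB (assumed value)
vllm serve <your-model> \
  --kv-transfer-config '{"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}'
```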
## Verifying the installation

You can verify the installation by checking the status of the created resources.
### Check the Gateway

```bash
kubectl get gateway -n ${NAMESPACE}
```

You should see output similar to the following, with the PROGRAMMED status as True.

```
NAME                      CLASS                              ADDRESS      PROGRAMMED   AGE
llm-d-inference-gateway   gke-l7-regional-external-managed   <redacted>   True         16m
```
### Check the HTTPRoute

```bash
kubectl get httproute -n ${NAMESPACE}
```

```
NAME          HOSTNAMES   AGE
llm-d-route               17m
```
### Check the InferencePool

```bash
kubectl get inferencepool -n ${NAMESPACE}
```

```
NAME            AGE
llm-d-infpool   16m
```
### Check the Pods

```bash
kubectl get pods -n ${NAMESPACE}
```

You should see the InferencePool's endpoint picker (EPP) pod and the model server pods in a Running state.

```
NAME                                READY   STATUS    RESTARTS   AGE
llm-d-infpool-epp-xxxxxxxx-xxxxx    1/1     Running   0          16m
llm-d-model-server-xxxxxxxx-xxxxx   1/1     Running   0          11m
llm-d-model-server-xxxxxxxx-xxxxx   1/1     Running   0          11m
```
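Optionally, send a test request through the gateway to confirm end-to-end serving. This is a hypothetical smoke test: the model name is a placeholder for whatever your deployment actually serves.

```bash
# Hypothetical smoke test; substitute the model your deployment serves.
GATEWAY_IP=$(kubectl get gateway llm-d-inference-gateway -n ${NAMESPACE} \
  -o jsonpath='{.status.addresses[0].value}')
curl http://${GATEWAY_IP}/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "<your-model>", "prompt": "Hello", "max_tokens": 16}'
```

Repeating the same request should see a faster prefill on the second call once the shared prefix is served from the CPU cache.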
## Cleanup

To remove the deployment:

```bash
helm uninstall llm-d-infpool -n ${NAMESPACE}
kubectl delete -k ./manifests/vllm/offloading-connector -n ${NAMESPACE} # or ./manifests/vllm/lm-cache-connector if you deployed that option
kubectl delete -k ../../../../recipes/gateway/gke-l7-regional-external-managed -n ${NAMESPACE}
kubectl delete namespace ${NAMESPACE}
```
*This content is automatically synced from `guides/prefix-cache-storage/cpu/README.md` in the llm-d/llm-d repository.*