
Llama2

This guide explains how to create and deploy a llama2 inference server and expose an API to it. To run this example, follow these steps:

  1. Install the kraft CLI tool and a container runtime engine, for example Docker.

  2. Clone the examples repository and cd into the examples/llama2/ directory:

Code(bash)
git clone https://github.com/kraftcloud/examples
cd examples/llama2/

Make sure to log into Unikraft Cloud by setting your token and a metro close to you. This guide uses fra (Frankfurt, 🇩🇪):

Code(bash)
# Set Unikraft Cloud access token
export UKC_TOKEN=token

# Set metro to Frankfurt, DE
export UKC_METRO=fra

When done, invoke the following command to deploy this app on Unikraft Cloud:

Code(bash)
kraft cloud deploy -p 443:8080 -M 1024 .

Note that in this example the instance is assigned 1 GiB of memory. The amount required varies with the model (the section below covers how to deploy different models).

The output shows the instance address and other details:

Code(text)
[●] Deployed successfully!
│
├─────────── name: llama2-cl5bw
├─────────── uuid: eddb16d4-44e7-48d6-a226-328a18745d13
├────────── state: running
├──────────── url: https://funky-rain-xds8dxbg.fra.unikraft.app
├────────── image: llama2@sha256:5af77e7381931c9f5b8f605789a238a64784b631d4b3308c5948b681c862f25a
├────── boot time: 38.29 ms
├───────── memory: 1024 MiB
├──────── service: funky-rain-xds8dxbg
├─── private fqdn: llama2-cl5bw.internal
├───── private ip: 172.16.6.3
└─────────── args: 8080

In this case, the instance name is llama2-cl5bw and the address is https://funky-rain-xds8dxbg.fra.unikraft.app. They're different for each run.

You can retrieve a story through the llama2 API endpoint:

Code(bash)
curl -o - https://funky-rain-xds8dxbg.fra.unikraft.app/api/llama2
Code(text)
Once upon a time, there was a little girl named Lily. She loved to eat grapes. One day, she saw a big grape on the table. Lily wanted to eat it, but she was too small. She thought, "I will try to get it when no one is looking." The next day, Lily saw a big rock near the tower. She thought, "Maybe I can move the rock." She tried to push the rock, but it was too heavy. Lily did not give up. She tried again and again. Finally, she had a big idea. She would use a long stick to push the rock. Lily went to the tower and pushed the rock with the stick. The rock moved! She was so happy. She picked up the grape and said, "Thank you, Rock!" Lily learned that if you are persistent and try hard, you can do anything.

You can list information about the instance by running:

Code(bash)
kraft cloud instance list
Code(text)
NAME          FQDN                                  STATE    CREATED AT    IMAGE                         MEMORY   ARGS  BOOT TIME
llama2-cl5bw  funky-rain-xds8dxbg.fra.unikraft.app  running  1 minute ago  llama2@sha256:5af77e73819...  1.0 GiB  8080  38286us

When done, you can remove the instance:

Code(bash)
kraft cloud instance remove llama2-cl5bw

Customize your app

To customize the app, update the files in the repository, listed below:

  • Kraftfile: the Unikraft Cloud specification
  • Dockerfile: the Docker-specified app filesystem
  • tokenizer.bin: The tokenizer binary used by the model
  • stories15M.bin: The LLM model

Lines in the Kraftfile have the following roles:

  • spec: v0.6: The current Kraftfile specification version is 0.6.

  • runtime: llama2: The Unikraft runtime kernel to use is llama2.

  • rootfs: ./Dockerfile: Build the app root filesystem using the Dockerfile.

  • cmd: ["8080"]: Expose the service on port 8080.
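
Putting the lines above together, a minimal Kraftfile for this app would look like the following sketch (assembled from the roles described above):

Code(yaml)
spec: v0.6
runtime: llama2
rootfs: ./Dockerfile
cmd: ["8080"]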

Lines in the Dockerfile have the following roles:

  • FROM alpine:3.14 as base: Build the app filesystem from alpine:3.14 as the base image.

  • COPY: Copy the model and tokenizer to the Docker filesystem (to /models).
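
Based on the roles above, the Dockerfile looks roughly like this sketch (the exact file paths are assumptions; check the repository for the real ones):

Code(dockerfile)
# Build the app filesystem from Alpine 3.14 as the base image
FROM alpine:3.14 as base

# Copy the model and tokenizer into /models
COPY stories15M.bin /models/stories15M.bin
COPY tokenizer.bin /models/tokenizer.bin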

The following options are available for customizing the app:

  • You can replace the model with others, for example from Hugging Face

  • The tokenizer (tokenizer.bin) ships with the example repository, but feel free to replace it.

You can customize parameters for your story through a POST request on the same API endpoint. The system recognizes the following parameters:

  • prompt: seed the LLM with a specific string
  • model: use a specific model instead of the default
  • temperature: valid range 0.0 - 1.0; 0.0 is deterministic, 1.0 is original (default 1.0)
  • topp: valid range 0.0 - 1.0; top-p in nucleus sampling; 1.0 = off, 0.9 works well, but slower (default 0.9)

For example:

Code(bash)
curl -o - https://funky-rain-xds8dxbg.fra.unikraft.app/api/llama2 \
  -d '{ "model": "stories15M", "temperature": 0.95, "topp": 0.8, "prompt": "There once was a monkey named Bobo." }'
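
If you prefer scripting the request, the same POST body can be built with Python's standard library. The URL below is the example instance address from this guide and changes on every deployment; the parameter names come from the list above.

```python
import json
import urllib.request

# Story parameters recognized by the llama2 API (see the list above)
body = {
    "model": "stories15M",       # use a specific model instead of the default
    "temperature": 0.95,         # 0.0 = deterministic, 1.0 = original
    "topp": 0.8,                 # top-p nucleus sampling; 1.0 = off
    "prompt": "There once was a monkey named Bobo.",
}

# Replace the URL with your own deployment's address
req = urllib.request.Request(
    "https://funky-rain-xds8dxbg.fra.unikraft.app/api/llama2",
    data=json.dumps(body).encode(),
    method="POST",
)

# Uncomment to send the request against a live instance:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```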

Learn more

Use the --help option for detailed information on using Unikraft Cloud:

Code(bash)
kraft cloud --help

Or visit the CLI Reference.
