Llama2
This guide explains how to create and deploy a llama2 inference server and expose an API to it. To run this example, follow these steps:
- Install the `kraft` CLI tool and a container runtime engine, for example Docker.
- Clone the `examples` repository and `cd` into the `examples/llama2/` directory:
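A minimal sketch of this step, assuming the examples repository lives at `unikraft-cloud/examples` on GitHub (substitute the actual URL if yours differs):

```bash
git clone https://github.com/unikraft-cloud/examples
cd examples/llama2/
```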
Make sure to log into Unikraft Cloud by setting your token and a metro close to you. This guide uses `fra` (Frankfurt, 🇩🇪):
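For example, assuming your CLI reads the token and metro from the `UKC_TOKEN` and `UKC_METRO` environment variables (check `kraft cloud --help` for the exact names your version expects):

```bash
export UKC_METRO=fra
export UKC_TOKEN=<your_token>
```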
When done, invoke the following command to deploy this app on Unikraft Cloud:
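A sketch of the deploy step, assuming the public port is mapped to the app's port 8080 with `-p` and memory is set with `-M` (in MiB):

```bash
kraft cloud deploy -p 443:8080 -M 1024 .
```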
Note that this example assigns 1 GB of memory to the instance. The amount required will vary depending on the model (the section below covers how to deploy different models).
The output shows the instance address and other details:
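Illustrative output (the exact fields and formatting vary between `kraft` versions):

```text
[●] Deployed successfully!
 │
 ├────── name: llama2-cl5bw
 ├───── state: running
 ├─────── url: https://funky-rain-xds8dxbg.fra.unikraft.app
 └──── memory: 1024 MiB
```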
In this case, the instance name is `llama2-cl5bw` and the address is `https://funky-rain-xds8dxbg.fra.unikraft.app`; both differ on each run.
You can retrieve a story through the llama2 API endpoint:
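For example, assuming the story generator is served at the instance root (adjust the path if your deployment differs):

```bash
curl https://funky-rain-xds8dxbg.fra.unikraft.app/
```

The reply is a generated story, along these lines (output differs on every run):

```text
Once upon a time, there was a little girl named Lily. She loved to play
outside in the park with her friends.
```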
You can list information about the instance by running:
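A sketch using the instance name from this run:

```bash
kraft cloud instance get llama2-cl5bw
```

Illustrative output (fields vary between versions):

```text
name     llama2-cl5bw
fqdn     funky-rain-xds8dxbg.fra.unikraft.app
state    running
memory   1024 MiB
```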
When done, you can remove the instance:
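For example, using the instance name from this run:

```bash
kraft cloud instance remove llama2-cl5bw
```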
Customize your app
To customize the app, update the files in the repository, listed below:
- `Kraftfile`: the Unikraft Cloud specification
- `Dockerfile`: the Docker-specified app filesystem
- `tokenizer.bin`: the tokenizer used by the model
- `stories15M.bin`: the LLM model
Lines in the Kraftfile have the following roles:
- `spec: v0.6`: The current `Kraftfile` specification version is `0.6`.
- `runtime: llama2`: The Unikraft runtime kernel to use is `llama2`.
- `rootfs: ./Dockerfile`: Build the app root filesystem using the `Dockerfile`.
- `cmd: ["8080"]`: Expose the service via port 8080.
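Assembled from the lines described above, the `Kraftfile` looks roughly like this:

```yaml
spec: v0.6

runtime: llama2

rootfs: ./Dockerfile

cmd: ["8080"]
```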
Lines in the Dockerfile have the following roles:
- `FROM alpine:3.14 as base`: Build the filesystem from `alpine:3.14`, to create a base image.
- `COPY`: Copy the model and tokenizer to the Docker filesystem (to `/models`).
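A sketch of the corresponding `Dockerfile`, with file names taken from the repository listing above (the exact target paths under `/models` are assumed):

```dockerfile
FROM alpine:3.14 as base

# Copy the model and its tokenizer into the image filesystem
COPY stories15M.bin /models/stories15M.bin
COPY tokenizer.bin  /models/tokenizer.bin
```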
The following options are available for customizing the app:
- You can replace the model with others, for example from Hugging Face.
- The tokenizer comes from the llama2.c project, but feel free to replace it.
You can customize parameters for your story through a POST request on the same API endpoint. The system recognizes the following parameters:
- `prompt`: seed the LLM with a specific string
- `model`: use a specific model instead of the default
- `temperature`: valid range 0.0 - 1.0; 0.0 is deterministic, 1.0 is original (default 1.0)
- `topp`: valid range 0.0 - 1.0; top-p value used in nucleus sampling; 1.0 = off, 0.9 works well but is slower (default 0.9)
For example:
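A sketch of such a request, assuming the endpoint accepts the parameters as a JSON body on the same path as the GET example above:

```bash
curl -X POST https://funky-rain-xds8dxbg.fra.unikraft.app/ \
     -H 'Content-Type: application/json' \
     -d '{"prompt": "Once upon a time", "temperature": 0.5, "topp": 0.9}'
```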
Learn more
Use the `--help` option for detailed information on using Unikraft Cloud:
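```bash
kraft cloud --help
```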
Or visit the CLI Reference.