@@ -35,8 +36,6 @@ import Spoiler from "../../components/Post/Spoiler.tsx";
- Converting the model to GGUF format
- Quantizing the model for better performance
- Hosting a basic LLM model with llama.cpp locally
- (To Be Added) Making a volume mount to persist LLM data across container restarts
- (To Be Added) Tagging the Docker Image for future reuse
</Ganbatte>
@@ -86,7 +85,7 @@ import Spoiler from "../../components/Post/Spoiler.tsx";
zsh
```
With the above commands, you should have a zsh environment, a coding language, and Docker set up. We will go into more detail on each of these tools as we work through the lessons.
</section>
<section data-toc="Docker" data-toc-level="2">
@@ -115,6 +114,7 @@ import Spoiler from "../../components/Post/Spoiler.tsx";
@@ -148,12 +148,12 @@ import Spoiler from "../../components/Post/Spoiler.tsx";
```
<Info>
- `--gpus all` enables GPU support for the container.
- `-it` makes the container interactive, allowing you to run commands inside it.
- `--name llm-container` gives the container a name for easier reference.
- `-p 8080:8080` (that is, `-p HOST:CONTAINER`) maps port 8080 on your host machine to port 8080 inside the container. This is useful if you plan to run a server inside the container and want to access it from your host machine.
- `nvidia/cuda:13.0.2-cudnn-runtime-ubuntu24.04` specifies the Docker image to use.
- `/bin/bash` is the start point for the container; it opens a bash shell inside it.
</Info>
Once you are inside the container, you can proceed to set up the environment like we did before in the <a href="#setting-up-developer-environment">WSL section</a>.
@@ -248,7 +248,7 @@ import Spoiler from "../../components/Post/Spoiler.tsx";
@@ -379,7 +379,7 @@ import Spoiler from "../../components/Post/Spoiler.tsx";
</picture>
<Important>
Make sure to copy the token somewhere safe. **DO NOT SHARE IT WITH ANYONE**, and do not put it **DIRECTLY IN PUBLIC REPOSITORIES** or **DIRECTLY IN YOUR CODE**! Consult AIs on how to keep your tokens safe if you are unsure, but do not share the token itself with the AI.
</Important>
Now that you have your token, you can use it to download models from Hugging Face. We will use `huggingface-cli` to do this. Let's first make the directory to store the models:
@@ -400,12 +400,12 @@ import Spoiler from "../../components/Post/Spoiler.tsx";
@@ -550,6 +550,7 @@ import Spoiler from "../../components/Post/Spoiler.tsx";
```
Then you have succeeded in converting the model to GGUF format!
</section>
<section data-toc="Quantizing the Model" data-toc-level="1">
@@ -634,7 +635,7 @@ import Spoiler from "../../components/Post/Spoiler.tsx";
<QA>
<span slot="question">How do I know how large a model I can fit on my computer?</span>
<p>It depends on whether you are running inference on your <strong>CPU (System RAM)</strong> or <strong>GPU (VRAM)</strong>.</p>
<p>For CPU inference, you generally want your system RAM to be around 2x the size of the model for comfortable operation. For example, if you have 16GB of RAM, you should aim for models that are around 8GB or smaller.</p>
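
The CPU rule of thumb above is simple arithmetic, and a minimal sketch of it follows. The function name `max_model_gb` is hypothetical and only for illustration. It takes your RAM in whole gigabytes and returns half of it as a rough upper bound for the model file size.

```zsh frame="none"
# Hypothetical helper: rough upper bound (in GB) for a CPU-inference model,
# assuming the "RAM should be about 2x the model size" rule of thumb.
max_model_gb() {
  echo $(( $1 / 2 ))
}

max_model_gb 16   # prints 8
```

This is only a starting estimate. Context length and other running programs also use memory, so leave extra headroom where you can.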
Before we continue, we are going to destroy everything that we have worked on so far:
```zsh frame="none"
exit # As many times as needed to exit the container to your host shell
docker stop llm-container
docker rm llm-container
```
This is to show that, whenever we remove the Docker container, all the data inside the container will be lost. This is bad because we don't want to redownload and reconvert the models every time we restart the container.
To solve this issue, we can use Docker volume mounts to persist our data.
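A Docker volume mount maps a directory on your host machine to a directory inside the Docker container.
It's a little bit like plugging a USB drive into your computer: the data lives on the USB drive, so it is still there after you unplug it. In the same way, the data lives on your host machine, so it is still there after the container is removed.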
When you run the Docker container, you can use the `-v` option to specify volume mounts.
```zsh frame="none"
docker run \
--gpus all \
-it \
-v ~/Models:/Models \
--name llm-container \
-p 8080:8080 \
nvidia/cuda:13.0.2-cudnn-devel-ubuntu24.04 \
/bin/bash
```
<Info>
- `-v ~/Models:/Models`: This maps the `~/Models` directory on your host machine to the `/Models` directory inside the Docker container.
- The left side (`~/Models`) is the path on your host machine.
- The right side (`/Models`) is the path inside the Docker container.
- With this setup, any models you download to `~/Models` on your host machine will be accessible at `/Models` inside the Docker container, and vice versa.
</Info>
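
One way to convince yourself that the mount works is a quick round trip, sketched below. The file name `mount-check.txt` is an arbitrary choice for illustration. Run the first two commands in a host terminal; the container-side check is left as a comment because it only makes sense inside a container started with the `-v ~/Models:/Models` option.

```zsh frame="none"
# On the host: make sure the directory exists, then drop a marker file in it.
mkdir -p "$HOME/Models"
echo "hello from the host" > "$HOME/Models/mount-check.txt"

# Inside the container (started with -v ~/Models:/Models), you would then run:
#   cat /Models/mount-check.txt
# and expect to see the same line of text.
```

Creating `~/Models` yourself before the first `docker run` is a deliberate choice: if the host directory does not exist, Docker creates it for you, typically owned by root, which can be awkward to write to later.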
Now it's your turn to set up everything again inside the Docker container. This time, when you download and convert the models, make sure to save them to the `/Models` directory inside the container. Try to do it on your own!
<Homework>
<h3>Your Task</h3>
1. Setting up Hugging Face CLI and downloading the model to `~/Models` on your host machine
2. Starting a Docker container and mounting `~/Models` to `/Models` in the container
3. Initializing the container with the scripts provided
   - apt update and install dependencies
   - delete default user
   - provisional script
   - log in to your own user account
4. Cloning llama.cpp and building it
5. Converting the model to GGUF and quantizing it (Remember your models are in `/Models` now!)
6. Running the server with the model from `/Models`
</Homework>
The solution is below if you get stuck:
<Spoiler client:idle>
1. Setting up Hugging Face CLI and downloading the model to `~/Models` on your host machine
6. Running the server with the model from `/Models`
```zsh frame="none"
llama-server \
--model /Models/SmolLM3-3B/SmolLM3-3B.q4.gguf \
--host 0.0.0.0 \
--port 8080
```
</Spoiler>
If you did it without help, congratulations! You have successfully set up a persistent environment for running llama.cpp with Docker volume mounts!
For now, if you want to keep the server running after exiting the terminal, you can use `tmux` or `screen` to create a persistent session inside the Docker container.
1. Enter the Docker container again (if you have exited it):
This session will persist as long as the Docker container is running. Your setup will also persist as long as you don't remove the Docker container. But if you want to free up some resources, you should stop the container when not in use.