[["Map",1,2,7,8],"meta::meta",["Map",3,4,5,6],"astro-version","5.16.4","astro-config-digest","{\"root\":{},\"srcDir\":{},\"publicDir\":{},\"outDir\":{},\"cacheDir\":{},\"compressHTML\":true,\"base\":\"/\",\"trailingSlash\":\"ignore\",\"output\":\"static\",\"scopedStyleStrategy\":\"attribute\",\"build\":{\"format\":\"directory\",\"client\":{},\"server\":{},\"assets\":\"_astro\",\"serverEntry\":\"entry.mjs\",\"redirects\":true,\"inlineStylesheets\":\"auto\",\"concurrency\":1},\"server\":{\"open\":false,\"host\":false,\"port\":4321,\"streaming\":true,\"allowedHosts\":[]},\"redirects\":{},\"image\":{\"endpoint\":{\"route\":\"/_image\"},\"service\":{\"entrypoint\":\"astro/assets/services/sharp\",\"config\":{}},\"domains\":[],\"remotePatterns\":[],\"responsiveStyles\":false},\"devToolbar\":{\"enabled\":true},\"markdown\":{\"syntaxHighlight\":false,\"shikiConfig\":{\"langs\":[],\"langAlias\":{},\"theme\":\"github-dark\",\"themes\":{},\"wrap\":false,\"transformers\":[]},\"remarkPlugins\":[],\"rehypePlugins\":[[null,{\"themes\":[\"vitesse-dark\"]}]],\"remarkRehype\":{},\"gfm\":true,\"smartypants\":true},\"security\":{\"checkOrigin\":true,\"allowedDomains\":[]},\"env\":{\"schema\":{},\"validateSecrets\":false},\"experimental\":{\"clientPrerender\":false,\"contentIntellisense\":false,\"headingIdCompat\":false,\"preserveScriptOrder\":false,\"liveContentCollections\":false,\"csp\":false,\"staticImportMetaEnv\":false,\"chromeDevtoolsWorkspace\":false,\"failOnPrerenderConflict\":false,\"svgo\":false},\"legacy\":{\"collections\":false}}","lessons",["Map",9,10],"01-intro",{"id":9,"data":11,"body":15,"filePath":16,"digest":17,"legacyId":18,"deferredRender":19},{"title":12,"description":13,"style":14},"Introduction to Web Dev","Setting up the environment","type-1","{/* Blockquotes */}\nimport Ganbatte from \"../../components/Post/Blockquotes/Ganbatte.astro\";\nimport Homework from \"../../components/Post/Blockquotes/Homework.astro\";\nimport Important from \"../../components/Post/Blockquotes/Important.astro\";\nimport Info from \"../../components/Post/Blockquotes/Info.astro\";\nimport QA from \"../../components/Post/Blockquotes/QA.astro\";\n\nimport Spoiler from \"../../components/Post/Spoiler.tsx\";\nimport QuantizationCalculator from \"../../components/Util/QuantizationCalc.tsx\";\n\n# Hosting a Large Language Model (LLM) Locally\n\n\u003Cpicture>\n\t\u003Cimg src=\"https://pic.mangopig.tech/i/879aaccd-6822-423f-883a-74cf5ba598e7.jpg\" alt=\"Web Development Illustration\" />\n\u003C/picture>\n\n\u003Cblockquote class=\"lesson-meta\">\n\t\u003Cspan>Lesson 01\u003C/span>\n\t\u003Cspan>Created at: **December 2025**\u003C/span>\n\t\u003Cspan>Last Updated: **December 2025**\u003C/span>\n\u003C/blockquote>\n\n\u003CGanbatte toc=\"Lesson Objectives\" tocLevel=\"1\" imageAlt=\"MangoPig Ganbatte\">\n ## Lesson Objectives\n\n - Setting up your Developer Environment\n - Setting up a isolated Docker environment for hosting LLMs\n - Fetching the AI model\n - Converting the model to GGUF format\n - Quantizing the model for better performance\n - Hosting a basic LLM model with llama.cpp locally\n\n\u003C/Ganbatte>\n\n\u003Csection data-toc=\"Setting Up Developer Environment\" data-toc-level=\"1\">\n \u003Ch2>Setting Up Your Developer Environment\u003C/h2>\n \u003Csection data-toc=\"WSL\" data-toc-level=\"2\">\n \u003Ch3>Setting Up WSL (Windows Subsystem for Linux)\u003C/h3>\n To set up WSL on your Windows machine, follow these steps:\n 1. Open PowerShell as Administrator.\n 2. 
Run the following command to enable WSL and install a Linux distribution (Ubuntu is recommended):\n\n ```zsh frame=\"none\"\n wsl --install\n ```\n\n 3. Restart your computer when prompted.\n 4. After restarting, open the Ubuntu application from the Start menu and complete the initial setup by creating a user account.\n 5. Update your package lists and upgrade installed packages by running:\n\n ```zsh frame=\"none\"\n sudo apt update && sudo apt upgrade -y\n ```\n \u003C/section>\n\n \u003Csection data-toc=\"ZSH\" data-toc-level=\"2\">\n \u003Ch3>Getting Your Environment Ready\u003C/h3>\n\n ```zsh frame=\"none\"\n sudo apt install -y git make curl sudo zsh\n ```\n\n ```zsh frame=\"none\"\n mkdir -p ~/Config/Dotfiles\n git clone https://git.mangopig.tech/MangoPig/Dot-Zsh.git ~/Config/Dotfiles/Zsh\n cd ~/Config/Dotfiles/Zsh\n ```\n\n Whenever a prompt asks whether to install something, just confirm with `y` and hit enter.\n\n ```zsh frame=\"none\"\n make setup\n ```\n\n Restart the shell to finalize the zsh setup:\n\n ```zsh frame=\"none\"\n zsh\n ```\n\n With the above commands, you should have zsh, your programming-language tooling and Docker set up. We will go into more detail on all the tools in this setup as we work through the lessons.\n \u003C/section>\n\n \u003Csection data-toc=\"Docker\" data-toc-level=\"2\">\n \u003Ch3>Installing Docker\u003C/h3>\n Docker should already be installed with the above steps. To verify, run:\n\n ```zsh frame=\"none\"\n docker --version\n ```\n and try to run a test container:\n\n ```zsh frame=\"none\"\n docker run hello-world\n ```\n\n If you run into permission issues, you may need to add your user to the docker group:\n\n ```zsh frame=\"none\"\n sudo usermod -aG docker $USER\n ```\n\n Then log out and back in, or apply the new group membership in your current session by doing:\n\n ```zsh frame=\"none\"\n newgrp docker\n ```\n\n \u003C/section>\n\n\u003C/section>\n\n\u003Csection data-toc=\"Docker Environment Setup\" data-toc-level=\"1\">\n \u003Ch2>Setting Up the Isolated Docker Environment for Hosting LLMs\u003C/h2>\n Now that we have the local environment ready, we want to set up an isolated Docker environment for hosting LLMs so that it doesn't interfere with our main system.\n\n \u003Csection data-toc=\"What is Docker?\" data-toc-level=\"2\">\n \u003Ch3>What is Docker?\u003C/h3>\n Docker is a platform that allows you to package your application and its dependencies into containers.\n\n \u003CInfo>\n \u003Cspan>You can find more Docker Images on \u003Ca href=\"https://hub.docker.com/\">Docker Hub\u003C/a>.\u003C/span>\n \u003C/Info>\n\n \u003Csection data-doc=\"Installing Docker\" data-doc-level=\"3\">\n \u003Ch4>Installing Docker\u003C/h4>\n\n \u003C/section>\n\n \u003C/section>\n\n \u003Csection data-toc=\"Creating Docker Container\" data-toc-level=\"2\">\n \u003Ch3>Creating the Docker Container\u003C/h3>\n\n For our current purpose, we will be using the official \u003Ca href=\"https://hub.docker.com/r/nvidia/cuda/tags\">NVIDIA Docker image\u003C/a> so that we can leverage CUDA for GPU acceleration if available.\n\n We will create the Docker container and make it interactive by running:\n\n ```zsh frame=\"none\"\n docker run --gpus all -it --name llm-container -p 8080:8080 nvidia/cuda:13.0.2-cudnn-devel-ubuntu24.04 /bin/bash\n ```\n\n \u003CInfo>\n - `--gpus all` enables GPU support for the container.\n - `-it` makes the container interactive, allowing you to run commands inside it.\n - `--name llm-container` gives the container a name for easier reference.\n - `-p 8080:8080` = `-p HOST:CONTAINER` maps 
port 8080 on your host machine to port 8080 inside the container. This is useful if you plan to run a server inside the container and want to access it from your host machine.\n - `nvidia/cuda:13.0.2-cudnn-devel-ubuntu24.04` specifies the Docker image to use.\n - `/bin/bash` is the command the container starts with, which opens a bash shell.\n \u003C/Info>\n\n Once you are inside the container, you can proceed to set up the environment like we did before in the \u003Ca href=\"#setting-up-developer-environment\">WSL section\u003C/a>.\n\n \u003CInfo>\n There are a few things you need to do before you can set up the environment like we did before:\n 1. Update the package lists and install necessary packages:\n ```zsh frame=\"none\"\n apt update && apt install -y git make curl sudo zsh\n ```\n\n 2. Remove the default user (usually `ubuntu`) to avoid permission issues:\n ```zsh frame=\"none\"\n userdel -r ubuntu\n ```\n\n 3. Run my provisioning script to set up users and permissions:\n ```zsh frame=\"none\"\n bash \u003C(curl -s https://git.mangopig.tech/mangopig/Dot-Zsh/raw/branch/main/scripts/provision.sh)\n ```\n Create your own user when prompted, give it 1000 as both the UID and GID for consistency, and remember the password you set here, as you'll need it to use `sudo` later on.\n\n 4. Now change users by doing: **(replace `your-username` with the username you created)**\n ```zsh frame=\"none\"\n su - your-username\n ```\n\n OR you can exit the container and reattach with the new user by doing:\n ```zsh frame=\"none\"\n exit\n docker start llm-container\n docker exec -it --user your-username llm-container /bin/zsh\n ```\n Press `q` if you are prompted to create a zsh configuration file.\n\n 5. Now you can proceed to set up zsh and the rest of the environment as shown in the [previous section](#zsh).\n\n \u003C/Info>\n\n Try to do this on your own first! If you get stuck, you can check the solution below.\n\n \u003CSpoiler client:idle >\n ## Solution\n\n 1. Update the package lists and install necessary packages:\n ```zsh frame=\"none\"\n apt update && apt install -y git make curl sudo zsh\n ```\n\n 2. Remove the default user (usually `ubuntu`) to avoid permission issues:\n ```zsh frame=\"none\"\n userdel -r ubuntu\n ```\n\n 3. Run my provisioning script to set up users and permissions:\n ```zsh frame=\"none\"\n bash \u003C(curl -s https://git.mangopig.tech/mangopig/Dot-Zsh/raw/branch/main/scripts/provision.sh)\n ```\n Create your own user when prompted, give it 1000 as both the UID and GID for consistency, and remember the password you set here, as you'll need it to use `sudo` later on.\n\n 4. Now change users by doing: **(replace `your-username` with the username you created)**\n ```zsh frame=\"none\"\n su - your-username\n ```\n\n OR you can exit the container and reattach with the new user by doing:\n ```zsh frame=\"none\"\n exit\n docker start llm-container\n docker exec -it --user your-username llm-container /bin/zsh\n ```\n Press `q` if you are prompted to create a zsh configuration file.\n\n 5. Go into the dotfiles directory and set up zsh:\n ```zsh frame=\"none\"\n cd ~/Config/Dot-Zsh\n make base && \\\n make python && \\\n make clean && \\\n make stow\n ```\n\n 6. Restart the shell to finalize the zsh setup:\n ```zsh frame=\"none\"\n zsh\n ```\n\n 7. 
Verify that pyenv and Miniforge are working by running:\n ```zsh frame=\"none\"\n pyenv --version\n conda --version\n ```\n \u003C/Spoiler>\n \u003C/section>\n\n\u003C/section>\n\n\u003Csection data-toc=\"Python Setup\" data-toc-level=\"1\">\n \u003Ch2>Setting Up Python Environment\u003C/h2>\n Now that we have the Docker container set up, we can proceed to set up the environment to run llama.cpp inside the container.\n\n We have set up `pyenv` and `Miniforge` as part of the zsh setup. You can verify that they are working by running:\n\n ```zsh frame=\"none\"\n pyenv --version\n conda --version\n ```\n\n `pyenv` allows us to manage multiple Python versions easily. We can easily install different versions of Python and Conda environments as needed for different projects.\n\n `conda` (via Miniforge) allows us to create isolated Python environments, which is helpful for making sure that the dependencies for llama.cpp do not interfere with other projects.\n\n Let's first create a directory for llama.cpp and navigate into it:\n\n ```zsh frame=\"none\"\n mkdir -p ~/Projects/llama.cpp\n cd ~/Projects/llama.cpp\n ```\n\n Now, let's clone the llama.cpp repository:\n\n ```zsh frame=\"none\"\n git clone https://github.com/ggerganov/llama.cpp.git .\n ```\n\n \u003CInfo>\n - You can also list the contents of the repository with `ls -la`.\n - The `.` at the end of the git clone command ensures that the contents of the repository are cloned directly into the current directory.\n - For reference, you can find the official llama.cpp repository at \u003Ca href=\"https://github.com/ggml-org/llama.cpp?tab=readme-ov-file\">llama.cpp GitHub\u003C/a>\n \u003C/Info>\n\n With the repository cloned, we can now proceed to build llama.cpp.\n\n We first use `cmake` to configure the build system. It's like telling the app what our computer environment looks like and what options we want to enable.\n\n ```zsh frame=\"none\"\n cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr/local -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=ON -DLLAMA_BUILD_SERVER=ON\n ```\n\n \u003CInfo>\n - `-S .` tells cmake where to find the source files (in this case, the current directory).\n - `-B build` specifies where all the temporary build files will go (in a folder named `build`).\n - `-G Ninja` tells cmake to use the Ninja build system.\n - `-DCMAKE_BUILD_TYPE=Release` sets the build type to Release for optimized performance.\n - `-DCMAKE_INSTALL_PREFIX=/your/install/dir` specifies where to install the built files. You can change this to your desired installation path.\n - `-DLLAMA_BUILD_TESTS=OFF` disables building tests.\n - `-DLLAMA_BUILD_EXAMPLES=ON` enables building example programs.\n - `-DLLAMA_BUILD_SERVER=ON` enables building the server component.\n \u003C/Info>\n\n Now we can build the project. This step basically takes what we told cmake to do and turns it into executable files.\n\n ```zsh frame=\"none\"\n cmake --build build --config Release -j $(nproc)\n ```\n\n \u003CInfo>\n - `--build build` tells cmake to build the project using the files in the `build` directory (the one we set with `-B` in the previous step).\n - `--config Release` specifies that we want to build the Release version.\n - `-j $(nproc)` tells cmake to use all available CPU cores for faster building.\n - `$(nproc)` is a command that returns the number of processing units available.\n \u003C/Info>\n\n After the build finishes, the binaries will be located in the `build/bin` directory. 
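If you want a quick sanity check before installing anything (optional; this assumes you are still in `~/Projects/llama.cpp`), you can run one of the freshly built binaries straight from the build tree:\n\n ```zsh frame=\"none\"\n # run the just-built CLI in place to confirm the build works\n ./build/bin/llama-cli --version\n ```\n\n 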
We want to move it to a more accessible location (`/usr/local` that we specified earlier), so we can run it easily. We can do this by running:\n\n ```zsh frame=\"none\"\n sudo cmake --install build && \\\n sudo ldconfig\n ```\n\n \u003CInfo>\n - `--install build` tells cmake to install the built files from the `build` directory to the location we specified earlier with `-DCMAKE_INSTALL_PREFIX`.\n - `sudo ldconfig` updates the system's library cache to recognize the newly installed binaries.\n \u003C/Info>\n\n Now you should be able to run the `llama.cpp` binary from anywhere, you can check what llama.cpp options are available by running:\n\n ```zsh frame=\"none\"\n ls /usr/local/bin\n ```\n\n ```zsh frame=\"none\"\n 󰡯 bat 󰡯 llama-eval-callback 󰡯 llama-lookup 󰡯 llama-save-load-state\n 󰡯 convert_hf_to_gguf.py 󰡯 llama-export-lora 󰡯 llama-lookup-create 󰡯 llama-server\n 󰡯 fd 󰡯 llama-finetune 󰡯 llama-lookup-merge 󰡯 llama-simple\n 󰡯 llama-batched 󰡯 llama-gen-docs 󰡯 llama-lookup-stats 󰡯 llama-simple-chat\n 󰡯 llama-batched-bench 󰡯 llama-gguf 󰡯 llama-mtmd-cli 󰡯 llama-speculative\n 󰡯 llama-bench 󰡯 llama-gguf-hash 󰡯 llama-parallel 󰡯 llama-speculative-simple\n 󰡯 llama-cli 󰡯 llama-gguf-split 󰡯 llama-passkey 󰡯 llama-tokenize\n 󰡯 llama-convert-llama2c-to-ggml 󰡯 llama-idle 󰡯 llama-perplexity 󰡯 llama-tts\n 󰡯 llama-cvector-generator 󰡯 llama-imatrix 󰡯 llama-quantize\n 󰡯 llama-diffusion-cli 󰡯 llama-logits 󰡯 llama-retrieval\n 󰡯 llama-embedding 󰡯 llama-lookahead 󰡯 llama-run\n ```\n\n We can further verify whether we can run `llama.cpp` by checking its version:\n\n ```zsh frame=\"none\"\n llama-cli --version\n ```\n\n ```zsh frame=\"none\"\n version: 7327 (c8554b66e)\n built with GNU 13.3.0 for Linux x86_64\n ```\n\n\u003C/section>\n\n\u003Csection data-toc=\"Getting the AI\" data-toc-level=\"1\">\n \u003Ch2>Fetching the AI Model Weights\u003C/h2>\n Now that we have llama.cpp set up, we need to get some AI models to run with it.\n The main place to get models is from [Hugging Face](https://huggingface.co/). You will need to create an account if you don't have one already.\n Once you have created an account, you should also setup your access token by going:\n \n \u003Cpicture>\n \u003Cimg src=\"https://pic.mangopig.tech/i/aea54c8e-9dd5-44c7-ab1f-6b57b076e7d8.webp\" alt=\"Hugging Face Access Token\" />\n \u003C/picture>\n\n And then give your token all the `read` permissions.\n\n \u003Cpicture>\n \u003Cimg src=\"https://pic.mangopig.tech/i/4360ee94-7f37-4897-91e9-882fd198b8b3.webp\" alt=\"Hugging Face Token Permissions\" />\n \u003C/picture>\n\n \u003CImportant>\n Make sure to copy the token somewhere safe and **DO NOT SHARE IT WITH ANYONE** or **USE IT DIRECTLY IN PUBLIC REPOSITORIES** and **DIRECTLY IN YOUR CODE**! Consult AIs on how to keep your tokens safe if you are unsure, but do not directly share them with the AI.\n \u003C/Important>\n\n Now that you have your token, you can use it to download models from Hugging Face. We will use `huggingface-cli` to do this. 
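A quick aside on token safety: if you ever need the token in a script, one common approach (just a sketch, not required for this lesson) is to export it as the `HF_TOKEN` environment variable, which the Hugging Face tooling picks up automatically, rather than hard-coding it anywhere:\n\n ```zsh frame=\"none\"\n # placeholder value - set this only in your current shell session and never commit it\n export HF_TOKEN=\"hf_xxxxxxxxxxxxxxxx\"\n ```\n\n 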
Let's first make the directory to store the models:\n\n ```zsh frame=\"none\"\n mkdir -p ~/Models\n cd ~/Models\n ```\n\n We can then install `huggingface-cli`\n\n ```zsh frame=\"none\"\n curl -LsSf https://hf.co/cli/install.sh | bash\n ```\n\n We will then login to Hugging Face using the CLI and provide our access token when prompted:\n\n ```zsh frame=\"none\"\n git config --global credential.helper store\n ```\n\n ```zsh frame=\"none\"\n hf auth login\n ```\n\n ```zsh frame=\"none\"\n _| _| _| _| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _|_|_|_| _|_| _|_|_| _|_|_|_|\n _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|\n _|_|_|_| _| _| _| _|_| _| _|_| _| _| _| _| _| _|_| _|_|_| _|_|_|_| _| _|_|_|\n _| _| _| _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|\n _| _| _|_| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _| _| _| _|_|_| _|_|_|_|\n\n To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .\n Enter your token (input will not be visible): INPUT_YOUR_TOKEN_HERE\n Add token as git credential? [y/N]: y\n Token is valid (permission: fineGrained).\n The token `temp` has been saved to /home/mangopig/.cache/huggingface/stored_tokens\n Your token has been saved in your configured git credential helpers (store).\n Your token has been saved to /home/mangopig/.cache/huggingface/token\n Login successful.\n The current active token is: `temp`\n ```\n\n Now you can download models using the `hf download` command. I will be using the [`SmolLM3-3B`](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) following this tutorial but if the model is too large for your system, you can choose a smaller model from Hugging Face, such as [`SmolLM2-1.7B`](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B) or [`SmolLM2-360M`](https://huggingface.co/HuggingFaceTB/SmolLM2-360M).\n\n ```zsh frame=\"none\"\n hf download HuggingFaceTB/SmolLM3-3B --local-dir ~/Models/SmolLM3-3B\n ```\n\n \u003CInfo>\n - `HuggingFaceTB/SmolLM3-3B` is the model identifier on Hugging Face. Get it from clicking the button to copy the name in the image below:\n \u003Cpicture>\n \u003Cimg src=\"https://pic.mangopig.tech/i/674714b4-736b-429c-b198-c9d57ba8bdee.webp\" alt=\"Hugging Face Model Page\" />\n \u003C/picture>\n - `--local-dir ~/Models/SmolLM3-3B` specifies where to save the downloaded model.\n\n You can find out more about what options you can use with `hf download` by doing `hf download --help`.\n\n ```zsh frame=\"none\"\n > hf download --help\n\n Usage: hf download [OPTIONS] REPO_ID [FILENAMES]...\n\n Download files from the Hub.\n\n Arguments:\n REPO_ID The ID of the repo (e.g. `username/repo-name`). [required]\n [FILENAMES]... Files to download (e.g. `config.json`,\n `data/metadata.jsonl`).\n\n Options:\n --repo-type [model|dataset|space]\n The type of repository (model, dataset, or\n space). [default: model]\n --revision TEXT Git revision id which can be a branch name,\n a tag, or a commit hash.\n --include TEXT Glob patterns to include from files to\n download. eg: *.json\n --exclude TEXT Glob patterns to exclude from files to\n download.\n --cache-dir TEXT Directory where to save files.\n --local-dir TEXT If set, the downloaded file will be placed\n under this directory. Check out https://hugg\n ingface.co/docs/huggingface_hub/guides/downl\n oad#download-files-to-local-folder for more\n details.\n --force-download / --no-force-download\n If True, the files will be downloaded even\n if they are already cached. 
[default: no-\n force-download]\n --dry-run / --no-dry-run If True, perform a dry run without actually\n downloading the file. [default: no-dry-run]\n --token TEXT A User Access Token generated from\n https://huggingface.co/settings/tokens.\n --quiet / --no-quiet If True, progress bars are disabled and only\n the path to the download files is printed.\n [default: no-quiet]\n --max-workers INTEGER Maximum number of workers to use for\n downloading files. Default is 8. [default:\n 8]\n --help Show this message and exit.\n ```\n \u003C/Info>\n\n With this, we have a model downloaded at `~/Models/SmolLM3-3B`. We can now proceed to try to run the model with llama.cpp.\n\n\u003C/section>\n\n\u003Csection data-toc=\"Converting Model to GGUF\" data-toc-level=\"1\">\n \u003Ch2>Converting the Model to GGUF\u003C/h2>\n \u003Cp>After downloading the model from Hugging Face, we need to convert it to the GGUF format so that llama.cpp can use it.\u003C/p>\n \u003Cp>Hugging Face usually store their models in the `.safetensors` format\u003C/p>\n \u003Cp>However, `llama.cpp` usually expect the models to be in the `.gguf` format.\u003C/p>\n \u003Cp>So we will need to convert the models to `.gguf`. Luckily, `llama.cpp` comes with a python script that helps us to do just that.\u003C/p>\n \u003Cp>We will first create a `Python` environment with `Conda` and activate it\u003C/p>\n\n ```zsh frame=\"none\"\n conda create -n llama-cpp python=3.10 -y\n conda activate llama-cpp\n python -m pip install --upgrade pip wheel setuptools\n ```\n\n \u003CInfo>\n - `conda create -n llama-cpp python=3.10 -y` creates a new conda environment named `llama-cpp` with Python 3.10 installed\n - `-n`: Specifies the name of the environment.\n - `python=3.10`: Specifies the Python version to install in the environment.\n - `-y`: Automatically confirms the creation.\n - `conda activate llama-cpp` activates the newly created conda environment.\n - `python -m pip install --upgrade pip wheel setuptools`\n - We are updating `pip`, `wheel`, and `setuptools`\n - `pip`: The package installer for Python. Similar to `npm` and `go get` in other languages.\n - `wheel`: A built-package format for Python.\n - `setuptools`: A package development and distribution library for Python.\n \u003C/Info>\n\n \u003Cp>`conda` is used to isolate the dependencies needed for the conversion process so that it doesn't interfere with other projects.\u003C/p>\n \u003Cp>We will then install the dependencies for `llama.cpp`\u003C/p>\n\n ```zsh frame=\"none\"\n pip install --upgrade -r ~/Projects/llama.cpp/requirements/requirements-convert_hf_to_gguf.txt\n ```\n\n \u003CInfo>\n - `pip install`: Installs Python packages.\n - `--upgrade`: Upgrades the packages to the latest versions.\n - `-r`: Specifies that we are installing packages from a requirements file.\n - `~/Projects/llama.cpp/requirements/requirements-convert_hf_to_gguf.txt`: The path to the requirements file that contains the list of packages needed for converting models to GGUF format.\n \u003C/Info>\n\n Nice! Now we are ready to convert the model to GGUF format. 
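Before running the conversion, it can help to quickly confirm that the download contains the usual Hugging Face files the script expects (the exact file names vary from model to model):\n\n ```zsh frame=\"none\"\n # you should see config.json, tokenizer files and one or more *.safetensors shards\n ls ~/Models/SmolLM3-3B\n ```\n\n 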
We can then run the conversion script provided by `llama.cpp`:\n\n ```zsh frame=\"none\"\n python ~/Projects/llama.cpp/convert_hf_to_gguf.py \\\n ~/Models/SmolLM3-3B \\\n --outfile ~/Models/SmolLM3-3B/SmolLM3-3B.gguf\n ```\n\n \u003CInfo>\n - `python ~/Projects/llama.cpp/convert_hf_to_gguf.py`: `python` runs the conversion script located at `~/Projects/llama.cpp/convert_hf_to_gguf.py`.\n - `~/Models/SmolLM3-3B`: Specifies the path to the downloaded model in Hugging Face format.\n - `--outfile ~/Models/SmolLM3-3B/SmolLM3-3B.gguf`: Specifies where to save the converted model in GGUF format.\n \u003C/Info>\n\n When you see output similar to:\n\n ```zsh frame=\"none\"\n INFO:hf-to-gguf:Model successfully exported to SmolLM3-3B.gguf\n ```\n\n then you have successfully converted the model to GGUF format!\n\n\u003C/section>\n\n\u003Csection data-toc=\"Quantizing the Model\" data-toc-level=\"1\">\n \u003Ch2>Quantizing the Model for Better Performance\u003C/h2>\n \u003Cp>Quantization is a technique that reduces the size of the model, speeds up inference, and lowers VRAM requirements by compressing the model's weights.\u003C/p>\n\n We can see which quantization types `llama.cpp` supports by running:\n\n ```zsh frame=\"none\"\n llama-quantize --help\n ```\n\n ```zsh frame=\"none\"\n usage: llama-quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights]\n [--exclude-weights] [--output-tensor-type] [--token-embedding-type] [--tensor-type] [--prune-layers] [--keep-split] [--override-kv]\n model-f32.gguf [model-quant.gguf] type [nthreads]\n\n --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit\n --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing\n --pure: Disable k-quant mixtures and quantize all tensors to the same type\n --imatrix file_name: use data in file_name as importance matrix for quant optimizations\n --include-weights tensor_name: use importance matrix for this/these tensor(s)\n --exclude-weights tensor_name: use importance matrix for this/these tensor(s)\n --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor\n --token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor\n --tensor-type TENSOR=TYPE: quantize this tensor to this ggml_type. example: --tensor-type attn_q=q8_0\n Advanced option to selectively quantize tensors. May be specified multiple times.\n --prune-layers L0,L1,L2...comma-separated list of layer numbers to prune from the model\n Advanced option to remove all tensors from the given layers\n --keep-split: will generate quantized model in the same shards as input\n --override-kv KEY=TYPE:VALUE\n Advanced option to override model metadata by key in the quantized model. 
May be specified multiple times.\n Note: --include-weights and --exclude-weights cannot be used together\n\n Allowed quantization types:\n 2 or Q4_0 : 4.34G, +0.4685 ppl @ Llama-3-8B\n 3 or Q4_1 : 4.78G, +0.4511 ppl @ Llama-3-8B\n 38 or MXFP4_MOE : MXFP4 MoE\n 8 or Q5_0 : 5.21G, +0.1316 ppl @ Llama-3-8B\n 9 or Q5_1 : 5.65G, +0.1062 ppl @ Llama-3-8B\n 19 or IQ2_XXS : 2.06 bpw quantization\n 20 or IQ2_XS : 2.31 bpw quantization\n 28 or IQ2_S : 2.5 bpw quantization\n 29 or IQ2_M : 2.7 bpw quantization\n 24 or IQ1_S : 1.56 bpw quantization\n 31 or IQ1_M : 1.75 bpw quantization\n 36 or TQ1_0 : 1.69 bpw ternarization\n 37 or TQ2_0 : 2.06 bpw ternarization\n 10 or Q2_K : 2.96G, +3.5199 ppl @ Llama-3-8B\n 21 or Q2_K_S : 2.96G, +3.1836 ppl @ Llama-3-8B\n 23 or IQ3_XXS : 3.06 bpw quantization\n 26 or IQ3_S : 3.44 bpw quantization\n 27 or IQ3_M : 3.66 bpw quantization mix\n 12 or Q3_K : alias for Q3_K_M\n 22 or IQ3_XS : 3.3 bpw quantization\n 11 or Q3_K_S : 3.41G, +1.6321 ppl @ Llama-3-8B\n 12 or Q3_K_M : 3.74G, +0.6569 ppl @ Llama-3-8B\n 13 or Q3_K_L : 4.03G, +0.5562 ppl @ Llama-3-8B\n 25 or IQ4_NL : 4.50 bpw non-linear quantization\n 30 or IQ4_XS : 4.25 bpw non-linear quantization\n 15 or Q4_K : alias for Q4_K_M\n 14 or Q4_K_S : 4.37G, +0.2689 ppl @ Llama-3-8B\n 15 or Q4_K_M : 4.58G, +0.1754 ppl @ Llama-3-8B\n 17 or Q5_K : alias for Q5_K_M\n 16 or Q5_K_S : 5.21G, +0.1049 ppl @ Llama-3-8B\n 17 or Q5_K_M : 5.33G, +0.0569 ppl @ Llama-3-8B\n 18 or Q6_K : 6.14G, +0.0217 ppl @ Llama-3-8B\n 7 or Q8_0 : 7.96G, +0.0026 ppl @ Llama-3-8B\n 1 or F16 : 14.00G, +0.0020 ppl @ Mistral-7B\n 32 or BF16 : 14.00G, -0.0050 ppl @ Mistral-7B\n 0 or F32 : 26.00G @ 7B\n COPY : only copy tensors, no quantizing\n ```\n\n \u003CInfo>\n For a line like `2 or Q4_0 : 4.34G, +0.4685 ppl @ Llama-3-8B`:\n - `2` and `Q4_0` are the identifiers you can use to specify the quantization type.\n - `4.34G` indicates the size of the quantized model (here for the reference model Llama-3-8B).\n - `+0.4685 ppl` indicates the increase in perplexity (a measure of model performance; lower is better) when using this quantization type.\n \u003C/Info>\n\n \u003CQA>\n \u003Cspan slot=\"question\">How do I know how big a model I can fit on my computer?\u003C/span>\n \u003Cp>It depends on whether you are running inference on your \u003Cstrong>CPU (System RAM)\u003C/strong> or \u003Cstrong>GPU (VRAM)\u003C/strong>.\u003C/p>\n\n \u003Cp>For CPU inference, you generally want the model size to be no more than about half of your system RAM for comfortable operation. 
For example, if you have 16GB of RAM, you should aim for models that are around 8GB or smaller.\u003C/p>\n\n **Size (GB) ≈ (Parameters (Billions) × Bits Per Weight) / 8 + Overhead**\n\n - Bits Per Weight (bpw):\n   - Qx = x bits per weight\n   - Qx_K = K quants will keep some important weights at higher precision (Q4_K ≈ 5 bits per weight, Q5_K ≈ 6 bits per weight, Q6_K ≈ 7 bits per weight)\n   - Qx_K_S = Small K quants\n   - Qx_K_M = Medium K quants\n   - Qx_K_L = Large K quants\n   - IQx = Integer Quantization with x bits per weight, bpw is on the chart\n   - TQx = Ternary Quantization with x bits per weight, bpw is on the chart\n\n For example, using the numbers above, a 3B-parameter model at Q4_K (≈ 5 bits per weight) works out to roughly 3 × 5 / 8 ≈ 1.9 GB, plus some overhead.\n \u003C/QA>\n\n \u003CQuantizationCalculator client:idle />\n\n Once we have decided what quantization type to use, we can proceed to quantize the model by running:\n\n ```zsh frame=\"none\"\n llama-quantize \\\n ~/Models/SmolLM3-3B/SmolLM3-3B.gguf \\\n ~/Models/SmolLM3-3B/SmolLM3-3B.q4.gguf \\\n q4_0 \\\n 4\n ```\n\n \u003CInfo>\n - `llama-quantize`: The command to run the quantization process.\n - `~/Models/SmolLM3-3B/SmolLM3-3B.gguf`: The path to the original GGUF model that we want to quantize.\n - `~/Models/SmolLM3-3B/SmolLM3-3B.q4.gguf`: The path where we want to save the quantized model.\n - `q4_0`: The quantization type we want to use (in this case, Q4_0).\n - `4`: Number of threads to use for quantization (optional, defaults to number of CPU cores).\n \u003C/Info>\n\n \u003Cp>After the quantization is complete, you should see a new file named `SmolLM3-3B.q4.gguf` in the model directory.\u003C/p>\n \u003Cp>We can now learn how to serve the model with `llama.cpp`.\u003C/p>\n\n\u003C/section>\n\n\u003Csection data-toc=\"Inferencing the Model\" data-toc-level=\"1\">\n \u003Ch2>Inferencing the Model\u003C/h2>\n \u003Cp>Now that we have the model ready, we can proceed to run inference with it using `llama.cpp`.\u003C/p>\n \u003Cp>`llama.cpp` provides us with multiple ways of running inference; we can:\u003C/p>\n - Use the command line interface to interact with the model directly from the terminal (`llama-cli`).\n - Use the server mode to host the model and interact with it via HTTP requests (`llama-server`).\n\n For this tutorial, we will use `llama-server` to serve the model.\n\n To start the server with our quantized model, we can run:\n\n ```zsh frame=\"none\"\n llama-server \\\n --model ~/Models/SmolLM3-3B/SmolLM3-3B.q4.gguf \\\n --host 0.0.0.0 \\\n --port 8080\n ```\n\n \u003CInfo>\n - `llama-server`: The command to start the server.\n - `--model ~/Models/SmolLM3-3B/SmolLM3-3B.q4.gguf`: Specifies the path to the quantized model we want to serve.\n - `--host 0.0.0.0`: This makes the server accessible from any IP address.\n - `--port 8080`: Specifies the port on which the server will listen for incoming requests.\n You can read about all the server options you can customize [here](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md)\n \u003C/Info>\n\n As soon as you see this:\n\n ```zsh frame=\"none\"\n main: model loaded\n main: server is listening on http://0.0.0.0:8080\n main: starting the main loop...\n ```\n\n Your server is up and running! 
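If you want to double-check from another terminal before sending real requests (optional; this assumes the default host and port used above), `llama-server` also exposes a simple health endpoint:\n\n ```zsh frame=\"none\"\n # should return {\"status\":\"ok\"} once the model has finished loading\n curl http://localhost:8080/health\n ```\n\n 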
You can now interact with the model by going to [`http://localhost:8080`](http://localhost:8080) in your web browser or using tools like `curl` for API requests.\n\n Open another terminal window and try this example API request using `curl`:\n\n ```zsh frame=\"none\"\n curl \\\n --request POST \\\n --url http://localhost:8080/completion \\\n --header \"Content-Type: application/json\" \\\n --data '{\"prompt\": \"Building a website can be done in 10 simple steps:\",\"n_predict\": 128}'\n ```\n\n \u003CInfo>\n - `--request POST`: Specifies that we are making a POST request. (We will get into REST HTTP APIs in future tutorials)\n - `--url http://localhost:8080/completion`: The URL of the server endpoint for completions.\n - `--header \"Content-Type: application/json\"`: Sets the content type to JSON.\n - `--data '{...}'`: The JSON payload containing the prompt and other parameters for the model.\n\n Read more about the API requests [here](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md#using-with-curl)\n \u003C/Info>\n\n\u003C/section>\n\n\u003Csection data-toc=\"Docker Volume Mount\" data-toc-level=\"1\">\n \u003Ch2>Docker Volume Mount\u003C/h2>\n\n Before we continue, we are going to destroy everything that we have worked on so far:\n\n ```zsh frame=\"none\"\n exit # As many times as needed to exit the container to your host shell\n docker stop llm-container\n docker rm llm-container\n ```\n\n This is to show that, whenever we remove the Docker container, all the data inside the container is lost. This is bad because we don't want to redownload and reconvert the models every time we recreate the container.\n\n To solve this issue, we can use Docker volume mounts to persist our data.\n\n A Docker volume mount maps a directory from your host machine into the Docker container.\n It's a little bit like plugging a USB drive into your computer: the data lives on the drive, so it survives even if the computer (here, the container) is wiped or replaced.\n\n When you run the Docker container, you can use the `-v` option to specify volume mounts.\n\n ```zsh frame=\"none\"\n docker run \\\n --gpus all \\\n -it \\\n -v ~/Models:/Models \\\n --name llm-container \\\n -p 8080:8080 \\\n nvidia/cuda:13.0.2-cudnn-devel-ubuntu24.04 \\\n /bin/bash\n ```\n\n \u003CInfo>\n - `-v ~/Models:/Models`: This maps the `~/Models` directory on your host machine to the `/Models` directory inside the Docker container.\n - The left side (`~/Models`) is the path on your host machine.\n - The right side (`/Models`) is the path inside the Docker container.\n - With this setup, any models you download to `~/Models` on your host machine will be accessible at `/Models` inside the Docker container, and vice versa.\n \u003C/Info>\n\n Now, it's your turn to set up everything again inside the Docker container, but this time, when you download and convert the models, make sure to save them to the `/Models` directory inside the container. Try to do it on your own!\n\n \u003CHomework>\n \u003Ch3>Your Task\u003C/h3>\n 1. Setting up the Hugging Face CLI and downloading the model to `~/Models` on your host machine\n 2. Starting a Docker container and mounting `~/Models` to `/Models` in the container\n 3. Initializing the container with the scripts provided\n   - run apt update and install dependencies\n   - delete the default user\n   - run the provisioning script\n   - log in to your own user account\n 4. Cloning llama.cpp and building it\n 5. Converting the model to GGUF and quantizing it (Remember your models are in `/Models` now!)\n 6. 
Running the server with the model from `/Models`\n \u003C/Homework>\n\n The solution is below if you get stuck:\n\n \u003CSpoiler client:idle>\n\n 1. Setting up the Hugging Face CLI and downloading the model to `~/Models` on your host machine\n\n ```zsh frame=\"none\"\n mkdir -p ~/Models\n cd ~/Models\n curl -LsSf https://hf.co/cli/install.sh | bash\n git config --global credential.helper store\n hf auth login\n hf download HuggingFaceTB/SmolLM3-3B --local-dir ~/Models/SmolLM3-3B\n ```\n\n 2. Starting a Docker container and mounting `~/Models` to `/Models` in the container\n\n ```zsh frame=\"none\"\n docker run \\\n --gpus all \\\n -it \\\n -v ~/Models:/Models \\\n --name llm-container \\\n -p 8080:8080 \\\n nvidia/cuda:13.0.2-cudnn-devel-ubuntu24.04 \\\n /bin/bash\n ```\n\n 3. Initializing the container with the scripts provided\n\n ```zsh frame=\"none\"\n apt update && apt install -y git make curl sudo zsh\n userdel -r ubuntu\n bash \u003C(curl -s https://git.mangopig.tech/mangopig/Dot-Zsh/raw/branch/main/scripts/provision.sh)\n su - mangopig\n ```\n\n ```zsh frame=\"none\"\n cd ~/Config/Dot-Zsh\n make base && \\\n make python && \\\n make clean && \\\n make stow && \\\n zsh\n ```\n\n OR you can just run:\n\n ```zsh frame=\"none\"\n cd ~/Config/Dot-Zsh\n make setup && \\\n zsh\n ```\n\n 4. Cloning llama.cpp and building it\n\n ```zsh frame=\"none\"\n mkdir -p ~/Projects/llama.cpp\n cd ~/Projects/llama.cpp\n git clone https://github.com/ggerganov/llama.cpp.git .\n cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr/local -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=ON -DLLAMA_BUILD_SERVER=ON\n cmake --build build --config Release -j $(nproc)\n sudo cmake --install build && \\\n sudo ldconfig\n ```\n\n 5. Converting the model to GGUF and quantizing it (Remember your models are in `/Models` now!)\n\n ```zsh frame=\"none\"\n conda create -n llama-cpp python=3.10 -y\n conda activate llama-cpp\n python -m pip install --upgrade pip wheel setuptools\n pip install --upgrade -r ~/Projects/llama.cpp/requirements/requirements-convert_hf_to_gguf.txt\n python ~/Projects/llama.cpp/convert_hf_to_gguf.py \\\n /Models/SmolLM3-3B \\\n --outfile /Models/SmolLM3-3B/SmolLM3-3B.gguf\n llama-quantize \\\n /Models/SmolLM3-3B/SmolLM3-3B.gguf \\\n /Models/SmolLM3-3B/SmolLM3-3B.q4.gguf \\\n q4_0 \\\n 4\n ```\n\n 6. Running the server with the model from `/Models`\n\n ```zsh frame=\"none\"\n llama-server \\\n --model /Models/SmolLM3-3B/SmolLM3-3B.q4.gguf \\\n --host 0.0.0.0 \\\n --port 8080\n ```\n\n \u003C/Spoiler>\n\n If you managed it without help, congratulations! You have successfully set up a persistent environment for running llama.cpp with Docker volume mounts!\n\n \u003Ch3 data-toc=\"Conclusion\" data-toc-level=\"1\">Wrapping Up\u003C/h3>\n\n Your LLM server will still stop when you stop the container, though. In future lessons, we will cover topics that help solve these issues:\n\n - Creating Custom Docker Images to Preserve Setup\n - Deploying LLM Server to the Cloud\n - Hosting Multiple Models and Switching Between Them\n - Using docker-compose to Manage Multiple Containers\n\n \u003Ch3 data-toc=\"Tmux Session Persistence\" data-toc-level=\"2\">Tmux Session Persistence\u003C/h3>\n For now, if you want to keep the server running after exiting the terminal, you can use `tmux` or `screen` to create a persistent session inside the Docker container.\n\n 1. 
Enter the Docker container again (if you have exited it):\n\n ```zsh frame=\"none\"\n docker start llm-container\n ```\n\n ```zsh frame=\"none\"\n docker exec -it --user YOUR_USERNAME llm-container /bin/zsh\n ```\n\n 2. Install `tmux` inside the container:\n\n ```zsh frame=\"none\"\n sudo apt install -y tmux\n tmux new -s llm-server\n ```\n\n 3. Start the server inside the `tmux` session:\n\n ```zsh frame=\"none\"\n llama-server \\\n --model /Models/SmolLM3-3B/SmolLM3-3B.q4.gguf \\\n --host 0.0.0.0 \\\n --port 8080\n ```\n\n 4. To detach from the `tmux` session and keep it running in the background, press `Ctrl + B`, then `D`.\n\n 5. To reattach to the `tmux` session later, use:\n ```zsh frame=\"none\"\n tmux attach -t llm-server\n ```\n\n \u003Ch3 data-toc=\"Basic Container Management\" data-toc-level=\"2\">Basic Container Management\u003C/h3>\n\n This session will persist as long as the Docker container is running. Your setup will also persist as long as you don't remove the Docker container. But if you want to free up some resources, you should stop the container when not in use.\n\n You can stop the Docker container with:\n\n ```zsh frame=\"none\"\n docker stop llm-container\n ```\n\n You can remove the container entirely (this deletes everything inside it except the mounted `/Models` data) with:\n\n ```zsh frame=\"none\"\n docker rm llm-container\n ```\n\n Start it back up anytime with:\n\n ```zsh frame=\"none\"\n docker start llm-container\n ```\n\n Reattach to the container with:\n\n ```zsh frame=\"none\"\n docker exec -it --user YOUR_USERNAME llm-container /bin/zsh\n ```\n\n\u003C/section>","src/content/lessons/01-intro.mdx","35514ac791a6626d","01-intro.mdx",true]