
Small Language Model continued: Docker - Guardrail Garage #005

This instalment shows how to build a lightweight llama.cpp server image and wire it up with docker-compose and Caddy on a VPS. It continues from last week's server prep and sets the stage for CI/CD next.

Picking up from last week’s server prep. In case you missed it, read it here.

A quick repo tour next. The full code is on my GitHub - README coming later.

You can check your own repo layout with:

# tree command produces a tree graph of the directory structure
# if you don't have this command, install it (on macOS via Homebrew; on other systems look for the equivalent package)
# brew install tree
tree -a -I '.git|.DS_Store|.old'
# options:
# -a - show hidden files and subdirectories
# -I - ignore files matching a pattern
# '.git|.DS_Store|.old' - the pattern, "|" stands for OR 
# .git - folder that stores files Git needs to track your project's history
# .DS_Store - Desktop Services Store - macOS-specific folder metadata
# .old - a folder I used for local testing of the repo

And here’s my local terminal output:

Tree folder structure in directory
Does this mean I can cross “planting a tree” off my adulthood checklist?

Last week we walked through the Ansible directory and files. There is one more safety net - a requirements.yaml that lists the modules Ansible needs to install things properly on target servers:

# requirements.yaml
---
collections:
- name: ansible.posix
- name: community.general
- name: community.docker

Nothing fancy.
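
If you want to install these collections yourself (for a local dry run, say), ansible-galaxy reads the very same file. A minimal sketch, assuming requirements.yaml sits in the directory you run the command from:

# install every collection listed in requirements.yaml
# run this from the folder that contains the file (or pass a full path after -r)
ansible-galaxy collection install -r requirements.yaml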

Now on to Docker. These containers run the main LLM service and expose it on the VPS.

I know the git bits and the Caddyfile are waiting in the wings - hang tight. Their time is coming.

Second bite of Snack-size RAG - Docker container

Docker fixes the ‘works on my computer’ problem… mostly. Cross-architecture builds can still bite, but you can plan around that. The payoff is a lightweight, isolated environment that keeps your app steady while the world around it changes. Less yak shaving, more shipping.
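
A side note on the cross-architecture bit: if you build on an Apple Silicon laptop but deploy to an x86 VPS, docker buildx lets you target the right CPU explicitly. A minimal sketch - the image tag is a placeholder, and it assumes buildx is set up on your machine:

# build a linux/amd64 image even if the host is arm64 (e.g. Apple Silicon)
# --platform sets the target architecture, -t tags the result
docker buildx build --platform linux/amd64 -t snacksize_server:amd64 .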

Office interior with small yak figure and shaver as well as plan with tasks to do.
Ok I’ll only complete 100 side-tasks and my yak is ready for shaving! Image generated by Nano Banana Pro.

Here’s how the sausage gets made - how the LLM server is built and deployed.

We start with the Dockerfile, a multi-stage build. Think image-ception: first we compile with a full toolchain, then copy only what we need into a minimal runtime image.

You might notice I do not use a prebuilt llama.cpp image. I am building from source because LFM2-1.2B needs features from the latest llama.cpp, and when I wrote this (2 to 3 weeks ago) I could not find a ready-to-go image that had them. That will likely change - but I adapted.

Without further ado, Dockerfile contents:

# Stage 1: Build from source
# We kick things off with a standard Ubuntu environment.
# We call this stage "builder" because it's just here to compile the code, not to run it permanently.
FROM ubuntu:22.04 AS builder

# Install build tools
# We are prepping our toolbox. We need 'build-essential' (compilers),
# 'cmake' (to manage the build process), and 'git' (to download the code).
# It's like buying all your ingredients before you start cooking.
RUN apt-get update && apt-get install -y \
    build-essential \
    cmake \
    git \
    curl \
    libcurl4-openssl-dev \
    && rm -rf /var/lib/apt/lists/*

# Set the working directory to /app.
# From now on, any command we run happens inside this folder.
WORKDIR /app

# Clone the latest code
# We are downloading the raw source code directly from the repository.
# Note: We are building from scratch to get the absolute latest features.
RUN git clone https://github.com/ggml-org/llama.cpp.git .

# Configure and build
# First, we configure the project with CMake.
# The flag -DLLAMA_BUILD_SERVER=ON is crucial - it tells the builder
# we specifically want the HTTP server, not just the command line tool.
RUN cmake -B build -DLLAMA_BUILD_SERVER=ON

# Now we actually compile the code (Release mode for speed).
# The flag '-j1' is the "safety first" approach. It limits the builder to 1 CPU core.
# Why? Because compiling C++ sucks up RAM like a vacuum cleaner.
# Using -j1 prevents your machine from crashing, even if it takes a bit longer.
RUN cmake --build build --config Release --target llama-server -j1

# --- FIX: Collect ALL shared libraries ---
# This acts as a scavenger hunt.
# Sometimes the build produces shared libraries (.so files) scattered in different folders.
# This command finds them all and copies them to one spot so we don't lose them later.
RUN mkdir -p /app/libs_to_copy && \
    find build -name "*.so*" -exec cp -P {} /app/libs_to_copy/ \;

# Stage 2: Runtime
# NOW we start the fresh, final image.
# We discard all the heavy compilers from the "builder" stage and start with a clean Ubuntu.
FROM ubuntu:22.04

# Install runtime libraries
# We only install what is strictly needed to run the program (runtime dependencies),
# keeping the final image size much smaller than the builder stage.
RUN apt-get update && apt-get install -y \
    libgomp1 \
    curl \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# 1. Copy the executable
# We reach back into the "builder" stage and grab ONLY the 'llama-server' executable.
COPY --from=builder /app/build/bin/llama-server /usr/local/bin/llama-server

# 2. Copy ALL collected shared libraries to the system library path
# We also grab those shared libraries we rescued earlier and put them
# directly in the system's library folder so the server finds them immediately.
COPY --from=builder /app/libs_to_copy/* /usr/lib/

# Set entrypoint
# This tells Docker: "When this container starts, run this command immediately."
ENTRYPOINT ["/usr/local/bin/llama-server"]
Construction site with containers representing multi-stage build
I think I played this mission already in Los Santos. Image generated by Nano Banana Pro.

Cool - we have a lightweight llama.cpp server image.
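
If you want to build and smoke-test it locally before any registry gets involved, something like this should do - the tag is a placeholder, and the model path assumes you keep your GGUF files in ./models:

# build the image from the Dockerfile in the current directory
docker build -t snacksize_server:local .

# quick smoke test: mount the local models folder and start the server
# note: in a plain (non-host-network) container we bind to 0.0.0.0,
# otherwise the published port would not be reachable from outside
docker run --rm -p 8080:8080 -v "$(pwd)/models:/models" snacksize_server:local \
  -m /models/LFM2-1.2B-RAG-Q4_K_M.gguf --host 0.0.0.0 --port 8080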

Time to wire it up with the other services via docker-compose.yml. Run docker-compose (or docker compose on newer installs) and it will spin up the stack with these settings.

Pro tip: I build the image on GitHub and pull from GitHub Container Registry to use their compute. Compiling on a tiny VPS can take ages or crash.
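
If the GHCR image is private, the VPS needs to log in before it can pull. A minimal sketch, assuming a personal access token with the read:packages scope stored in an environment variable (both names here are placeholders):

# authenticate against GitHub Container Registry
echo "$GHCR_TOKEN" | docker login ghcr.io -u your_github_username --password-stdin

# then pull the image referenced in docker-compose.yml
docker compose pull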

# docker-compose.yml
version: "3.7"

services:
  llm:
    # This pulls the image. If you built it locally, you might change this tag,
    # but currently, it points to a pre-built image on GitHub Container Registry.
    # Note: change "repo_name" to your own repository if you want to use your own image.
    image: ghcr.io/repo_name/snacksize_server:latest
    container_name: llama-server
    restart: always
    
    # 1. Use host network
    # This is the "easy mode" for networking. Instead of isolating the container,
    # it shares the host's IP address directly.
    # Great for performance and simplicity, slightly less secure than bridge mode.
    network_mode: "host"
    
    volumes:
      # Map the './models' folder on your host
      # to '/models' inside the container.
      # This lets the container read large model files without copying them in.
      - ./models:/models
      
    environment:
      # Pass the API key from your .env file into the container.
      - LLAMA_API_KEY=${API_KEY}
      
      # This is the actual command the server runs on startup.
      # -m: Path to the model file (inside the container).
      # --host: We bind to localhost since we are using host networking.
      # --port: The port the server listens on.
      # --api-key: Enforces authentication.
      # -c: Context size (how much text the AI can remember at once). 20k is quite generous.
    command: >
      -m /models/LFM2-1.2B-RAG-Q4_K_M.gguf     
      --host 127.0.0.1
      --port 8080
      --api-key ${API_KEY}
      -c 20048

  caddy:
    # Caddy is a web server acting as a reverse proxy.
    # It sits in front of the LLM server and handles security, SSL, and traffic routing.
    image: caddy:2-alpine
    restart: always
    
    # 2. Use host network
    # Sharing the host network stack for easier communication with the LLM service.
    network_mode: "host"
    
    # Note: No 'ports' section allowed here because network_mode is "host".
    # Caddy binds to ports defined in its configuration file (Caddyfile).
    volumes:
      # Map the configuration file and data persistence folders.
      - ./Caddyfile:/etc/caddy/Caddyfile
      - caddy_data:/data
      - caddy_config:/config

volumes:
  # Persistent storage for Caddy so it remembers SSL certs and configurations across restarts.
  caddy_data:
  caddy_config:
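
To bring the stack up and check the model actually answers, something along these lines works - assuming an .env file with API_KEY sits next to docker-compose.yml, and the prompt is just an example:

# start both services in the background
docker compose up -d

# follow the llama-server logs until the model finishes loading
docker logs -f llama-server

# talk to the server directly on the VPS (it listens on 127.0.0.1:8080)
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{"messages": [{"role": "user", "content": "Hello there!"}]}'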

We have two parts done:

  1. Set up infrastructure on the target server and downloaded the model
  2. Built a Docker image to serve the app

Next up:

  1. Concise Caddyfile explanation
  2. Continuous Integration and Continuous Deployment (CI/CD) with GitHub Actions

See you next week - hopefully!

Behind closed doors

After the DevAI conference, a bunch of folks reached out to chat about Claude Code and what it can do. I am preparing a training on how to use this tool in your software development (and beyond) - plus how Claude Code lowers the barrier for anyone who wants to build on their own.

Obviously, you still need three things:

  1. Willingness to learn
  2. Willingness to look ‘stupid’ in the short term so you can iterate on mistakes
  3. Focus - because AI left to wander freely in the wild might do more harm than good

If you or your organisation want to hear more or bring me in for a training - message me.

As always, feel free to reach out if you want to discuss AI ideas or other collaborations.

What’s coming up next week?

CI/CD basics for serving your apps - automatic deploys to your VPS, no more manual file copying or post-update launch rituals. Sweeeet!

Server-site of DemystifAI.blog
See you in the matrix of DemystifAI.blog! Image generated by Nano Banana Pro.