
How to Deploy a Production‑Ready RAG Stack on an Ubuntu 24.04 VPS with Ollama, Qdrant, Open WebUI, Docker Compose, and TLS (2025 Tutorial)

Introduction

Building a production-ready Retrieval-Augmented Generation (RAG) system requires careful orchestration of multiple components working seamlessly together. RAG combines the power of large language models with external knowledge bases to provide contextually accurate responses. This comprehensive tutorial will guide you through deploying a complete RAG stack on Ubuntu 24.04 that includes Ollama for local LLM inference, Qdrant as a high-performance vector database, Open WebUI for the user interface, all orchestrated with Docker Compose and secured with TLS encryption.

By the end of this tutorial, you’ll have a fully functional RAG system capable of ingesting documents, generating embeddings, storing them in a vector database, and providing intelligent responses through a modern web interface.

Prerequisites

Before we begin, ensure you have:

  • Ubuntu 24.04 LTS VPS with minimum 8GB RAM and 4 vCPU cores (16GB RAM recommended for larger models)
  • Root or sudo access to the server
  • Domain name pointed to your VPS IP address
  • Docker and Docker Compose installed
  • Basic understanding of containerization and vector databases

The system requirements depend on your choice of language models – larger models like Llama 2 70B require significantly more resources than smaller 7B models.
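
The minimums above can be checked directly on the host before installing anything. A small sketch, assuming a standard Linux /proc layout (the 7500 MB threshold leaves headroom for kernel-reserved memory on an 8 GB machine):

```shell
# Pre-flight resource check; thresholds mirror the 4 vCPU / 8 GB minimum above
cores=$(nproc)
mem_mb=$(awk '/MemTotal/ {printf "%d", $2 / 1024}' /proc/meminfo)
echo "CPU cores: ${cores}, RAM: ${mem_mb} MB"
if [ "$cores" -lt 4 ] || [ "$mem_mb" -lt 7500 ]; then
    echo "WARNING: below the recommended minimum of 4 vCPU / 8 GB RAM"
fi
```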

Step-by-Step Tutorial

Step 1: System Preparation

First, update your Ubuntu 24.04 system and install Docker if not already present:

sudo apt update && sudo apt upgrade -y
sudo apt install -y curl git

# Install Docker if needed
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER
newgrp docker  # applies the new group in the current shell; otherwise log out and back in
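
Before continuing, confirm that both the engine and the Compose v2 plugin respond; the check below degrades to a plain message if Docker is not yet on the PATH:

```shell
# Verify the Docker Engine and the Compose v2 plugin are usable
if command -v docker >/dev/null 2>&1; then
    docker --version
    docker compose version || echo "Compose v2 plugin missing; rerun get-docker.sh"
else
    echo "docker not found on PATH"
fi
```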

Step 2: Create Project Structure

Create a dedicated directory structure for our RAG stack:

mkdir -p ~/rag-stack/{data/{ollama,qdrant,webui},config,certs}
cd ~/rag-stack

Step 3: Configure Docker Compose

Create docker-compose.yml in the project root. The top-level version: key is omitted here because Compose v2 marks it obsolete and ignores it:

services:
  qdrant:
    image: qdrant/qdrant:v1.7.3
    container_name: qdrant
    restart: unless-stopped
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - ./data/qdrant:/qdrant/storage
    environment:
      - QDRANT__SERVICE__HTTP_PORT=6333
      - QDRANT__SERVICE__GRPC_PORT=6334
    networks:
      - rag-network

  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ./data/ollama:/root/.ollama
    networks:
      - rag-network
    # GPU acceleration: uncomment the block below only if the host has an
    # NVIDIA GPU and the NVIDIA Container Toolkit installed. On a CPU-only
    # VPS, leave it commented out: Compose cannot reserve an nvidia device
    # and the service would fail to start. Ollama falls back to CPU
    # inference automatically. (A CUDA_VISIBLE_DEVICES=all line seen in
    # some guides is unnecessary; the device reservation handles selection.)
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: 1
    #           capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - ./data/webui:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      # Compose does not run command substitution; set this in a .env file
      # (generate the value once with: openssl rand -hex 32)
      - WEBUI_SECRET_KEY=${WEBUI_SECRET_KEY:?set WEBUI_SECRET_KEY in .env}
      - RAG_EMBEDDING_ENGINE=ollama
      - RAG_EMBEDDING_MODEL=nomic-embed-text
      - RAG_RERANKING_MODEL=
      - VECTOR_DB=qdrant
      - QDRANT_URI=http://qdrant:6333
    depends_on:
      - ollama
      - qdrant
    networks:
      - rag-network

  caddy:
    image: caddy:2.7-alpine
    container_name: caddy
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./config/Caddyfile:/etc/caddy/Caddyfile
      - ./certs:/data
    depends_on:
      - open-webui
    networks:
      - rag-network

networks:
  rag-network:
    driver: bridge


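Note that Compose does not perform command substitution inside docker-compose.yml, so a value like WEBUI_SECRET_KEY=$(openssl rand -hex 32) written there would reach the container literally. Instead, generate the secret once into a .env file, which docker compose reads automatically from the project directory:

```shell
# Run from ~/rag-stack: create a persistent random secret for Open WebUI
echo "WEBUI_SECRET_KEY=$(openssl rand -hex 32)" > .env
chmod 600 .env
```
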
Step 4: Configure Reverse Proxy and TLS

Create a Caddyfile for automatic TLS and reverse proxy configuration:

cat > config/Caddyfile << 'EOF'
yourdomain.com {
    reverse_proxy open-webui:8080
    
    # Security headers
    header {
        Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
        X-Content-Type-Options "nosniff"
        X-Frame-Options "DENY"
        Referrer-Policy "strict-origin-when-cross-origin"
    }
    
    # Rate limiting: requires a custom Caddy build with the
    # github.com/mholt/caddy-ratelimit module; the stock caddy image
    # does not ship this directive and will fail to parse it
    rate_limit {
        zone static_rl {
            key {remote_host}
            events 100
            window 1m
        }
    }
}
EOF

Replace yourdomain.com with your actual domain name.
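
One caveat: rate_limit is not a built-in Caddy directive. It comes from the third-party github.com/mholt/caddy-ratelimit module, and the stock caddy:2.7-alpine image will fail to start with a Caddyfile that uses it. If you want rate limiting, build a custom image with xcaddy (a sketch; alternatively, just delete the rate_limit block from the Caddyfile):

```dockerfile
# Two-stage build: compile Caddy with the rate-limit module, then copy the
# binary into the slim runtime image
FROM caddy:2.7-builder AS builder
RUN xcaddy build --with github.com/mholt/caddy-ratelimit

FROM caddy:2.7-alpine
COPY --from=builder /usr/bin/caddy /usr/bin/caddy
```

With this Dockerfile saved as, say, caddy/Dockerfile, swap image: caddy:2.7-alpine in the compose file for build: ./caddy.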

Step 5: Deploy the Stack

Start all services with Docker Compose:

docker compose up -d

Verify all containers are running:

docker compose ps
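
Each service can also be probed over HTTP on its published port; /healthz and /api/version are the lightweight health endpoints for Qdrant and Ollama in these versions, and the ports below are the ones published in the compose file above:

```shell
# Probe the published endpoints; expect "200" from each once the stack is healthy
for url in http://localhost:6333/healthz \
           http://localhost:11434/api/version \
           http://localhost:3000; do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url" || true)
    echo "$url -> ${code:-000}"
done
```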

Step 6: Initialize Language Models

Pull and configure language models in Ollama:

# Pull a lightweight model for testing
docker compose exec ollama ollama pull llama2:7b

# Pull embedding model for RAG
docker compose exec ollama ollama pull nomic-embed-text

# For production, consider larger models
docker compose exec ollama ollama pull llama2:13b
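
A direct call to Ollama's HTTP API confirms the model loads and answers; with "stream": false the reply comes back as a single JSON object rather than a token stream (the first request may be slow while the model loads into memory):

```shell
# One-off generation request against the published Ollama port
PAYLOAD='{"model": "llama2:7b", "prompt": "Reply with one word: ready?", "stream": false}'
curl -s http://localhost:11434/api/generate -d "$PAYLOAD" --max-time 180
```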

Step 7: Configure Vector Database

Initialize a Qdrant collection for document storage. The vector size of 768 matches the output dimension of the nomic-embed-text embedding model:

curl -X PUT 'http://localhost:6333/collections/documents' \
  -H 'Content-Type: application/json' \
  -d '{
    "vectors": {
      "size": 768,
      "distance": "Cosine"
    },
    "optimizers_config": {
      "default_segment_number": 2
    },
    "replication_factor": 1
  }'
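
To confirm the collection accepts data, upsert one dummy 768-dimensional point and search for it. The constant vector below is only a stand-in for a real nomic-embed-text embedding:

```shell
# Build a placeholder 768-dim vector (real embeddings come from nomic-embed-text)
VEC=$(python3 -c "print('[' + ', '.join(['0.01'] * 768) + ']')")

# Upsert a single test point into the collection created above
curl -X PUT 'http://localhost:6333/collections/documents/points' \
  -H 'Content-Type: application/json' \
  -d "{\"points\": [{\"id\": 1, \"vector\": ${VEC}, \"payload\": {\"source\": \"smoke-test\"}}]}"

# A nearest-neighbour search should return that same point
curl -X POST 'http://localhost:6333/collections/documents/points/search' \
  -H 'Content-Type: application/json' \
  -d "{\"vector\": ${VEC}, \"limit\": 1}"
```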

Step 8: Access and Configure WebUI

Navigate to your domain in a web browser. You should see the Open WebUI interface with automatic TLS encryption. Complete the initial setup by:

  • Creating an admin account
  • Verifying Ollama connection in Settings → Connections
  • Enabling RAG functionality in Settings → Features
  • Configuring Qdrant connection if not automatically detected

Best Practices

Security Considerations

Implement these security measures for production deployments:

  • Firewall configuration: Only expose necessary ports (80, 443)
  • API authentication: Set a Qdrant API key (QDRANT__SERVICE__API_KEY) and put Ollama behind an authenticated proxy, since Ollama has no built-in auth
  • Network isolation: Use Docker networks to isolate services
  • Regular updates: Keep all container images updated

Lock the host firewall down so only SSH and web traffic reach the server:

# Configure UFW firewall
sudo ufw allow 22/tcp
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw --force enable

Performance Optimization

To get the best performance from LLM inference and vector search workloads:

  • Resource allocation: Assign appropriate CPU and memory limits to containers
  • Vector indexing: Configure Qdrant with HNSW parameters optimized for your data size
  • Model caching: Use persistent volumes to avoid re-downloading models
  • Monitoring: Implement observability with tools like those discussed in our observability stack tutorial
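
For the HNSW point above, Qdrant lets you adjust index parameters on an existing collection through its update-collection endpoint. The values below are illustrative starting points, not tuned recommendations; higher m and ef_construct improve recall at the cost of memory and indexing time:

```shell
# Raise HNSW connectivity (m) and build-time search depth (ef_construct)
curl -X PATCH 'http://localhost:6333/collections/documents' \
  -H 'Content-Type: application/json' \
  -d '{"hnsw_config": {"m": 32, "ef_construct": 256}}'
```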

Backup Strategy

Protect your RAG system data with automated backups. Consider using encrypted backup solutions as outlined in our backup automation guide to preserve your vector embeddings and system configurations.
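
A minimal sketch of such a backup, assuming the stack lives in ~/rag-stack: the Qdrant snapshot API first produces a consistent copy of the collection while the service is running, then the bind-mounted data and config directories are archived together:

```shell
# Snapshot the collection, then archive the stack's data and config directories
STAMP=$(date +%F)
curl -s -X POST 'http://localhost:6333/collections/documents/snapshots' || true
tar czf "rag-backup-${STAMP}.tar.gz" -C "$HOME/rag-stack" data config
```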

Conclusion

You’ve successfully deployed a production-ready RAG stack on Ubuntu 24.04 with enterprise-grade features including automatic TLS, vector search capabilities, and local LLM inference. This setup provides a solid foundation for intelligent document processing, customer support automation, and knowledge base applications.

The combination of Ollama’s efficient model serving, Qdrant’s high-performance vector search, and Open WebUI’s intuitive interface creates a powerful platform for retrieval-augmented generation workloads. For enhanced performance and reliability, consider deploying this stack on dedicated VPS infrastructure that offers high-performance computing resources optimized for AI workloads.
