Posts

LLM hosting, performance, RAG, and observability

New updates to the pillar hubs on glukhov.org organise LLM hosting, performance, RAG, and observability - with deep dives into runtimes, benchmarks, retrieval, and inference monitoring. https://glukhov.au/posts/2026/llms-hosting-performance-rag-observability #AI #LLM #RAG #Observability #Performance #SelfHosting

Ollama in Docker Compose with GPU and Persistent Model Storage

Run Ollama as a reproducible single-node LLM server using Docker Compose. Configure OLLAMA_HOST and OLLAMA_MODELS, keep models on persistent volumes, enable NVIDIA GPUs, and upgrade safely with rollbacks.

Ollama behind a reverse proxy with Caddy or Nginx for HTTPS streaming

Expose Ollama securely behind Caddy or Nginx with automated HTTPS, optional Basic Auth or SSO front gates, and correct streaming and WebSocket proxying. Includes timeouts, buffering pitfalls, rate limits, and curl checks.

Text embeddings for RAG and search - Python, Ollama, OpenAI-compatible APIs

Learn what text embeddings are, how they power RAG and semantic search, and how to call embedding APIs from Python using Ollama or an OpenAI-compatible server (for example llama.cpp). Includes persistence, retrieval, and links to chunking, vector stores, and reranking on this site.

Retrieval-Augmented Generation (RAG) Tutorial: Architecture, Implementation, and Production Guide

Step-by-step RAG tutorial: build retrieval-augmented generation systems with vector databases, hybrid search, reranking, and web search. Architecture, implementation, and production best practices.

Netlify for Hugo & static sites: pricing, free tier, and alternatives

Technical guide to Netlify for Hugo and modern web apps. Deploy Previews, Functions, Edge Functions, credit-based pricing, Free plan limits, Hugo netlify.toml patterns, and alternatives such as Vercel and Cloudflare Pages.

Neo4j graph database for GraphRAG, install, Cypher, vectors, ops

Senior-engineer guide to Neo4j for property graphs and GraphRAG. Cypher, ACID, Neo4j vs Neptune and TigerGraph, Docker and AuraDB, ports and neo4j.conf, vector indexes, hybrid retrieval, and Python neo4j-graphrag.

Apache Flink on K8s and Kafka: PyFlink, Go, ops, and managed pricing

DevOps guide to Flink. Stateful streaming, JobManager and TaskManagers, checkpoints vs savepoints, Flink vs Spark and Kafka Streams, K8s Operator, Helm, PyFlink, Go, and managed pricing.