Posts

Showing posts from March, 2026

Building REST APIs in Go: Complete Guide

A comprehensive guide to implementing RESTful APIs in Go, covering standard library approaches, frameworks, authentication, testing patterns, and production-ready best practices for scalable backend services.

Structured Logging in Go with slog for Observability and Alerting

Structured logs turn Go application output into queryable events. Explore log/slog records, JSON handlers, context and trace correlation, redaction, and log-based signals that support monitoring and alerting.

LLM hosting, performance, RAG, and observability

New updates to the pillar hubs on glukhov.org: organising LLM hosting, performance, RAG, and observability, with deep dives on runtimes, benchmarks, retrieval, and inference monitoring. https://glukhov.au/posts/2026/llms-hosting-performance-rag-observability #AI #LLM #RAG #Observability #Performance #SelfHosting

Ollama in Docker Compose with GPU and Persistent Model Storage

Run Ollama as a reproducible single-node LLM server using Docker Compose. Configure OLLAMA_HOST and OLLAMA_MODELS, keep models on persistent volumes, enable NVIDIA GPUs, and upgrade safely with rollbacks.
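A minimal sketch of the kind of Compose file the post describes; the service name, volume name, and /models path are illustrative, and the GPU reservation assumes the NVIDIA Container Toolkit is installed.

```yaml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    environment:
      - OLLAMA_HOST=0.0.0.0      # listen beyond localhost
      - OLLAMA_MODELS=/models    # keep models off the container layer
    volumes:
      - ollama-models:/models    # survives upgrades and rollbacks
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama-models:
```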

Ollama behind a reverse proxy with Caddy or Nginx for HTTPS streaming

Expose Ollama securely behind Caddy or Nginx with automated HTTPS, optional Basic Auth or SSO front gates, and correct streaming and WebSocket proxying. Includes timeouts, buffering pitfalls, rate limits, and curl checks.
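For Caddy, the happy path is short: certificates are provisioned automatically and proxied responses stream by default. The hostname below is a placeholder, and this sketch omits the auth and rate-limit layers the post covers.

```caddyfile
ollama.example.com {
    # Proxy API and streaming traffic to the local Ollama daemon.
    reverse_proxy localhost:11434
}
```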

Text embeddings for RAG and search - Python, Ollama, OpenAI-compatible APIs

Learn what text embeddings are, how they power RAG and semantic search, and how to call embedding APIs from Python using Ollama or an OpenAI-compatible server (for example llama.cpp). Includes persistence, retrieval, and links to chunking, vector stores, and reranking on this site.
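Once you have embedding vectors back from the API, retrieval usually ranks them by cosine similarity. A minimal sketch of that scoring step, independent of any particular embedding server:

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length vectors,
// the standard relevance score computed over embedding vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	// Toy 2-D vectors stand in for real embedding output.
	fmt.Printf("same direction: %.2f\n", cosine([]float64{1, 0}, []float64{1, 0}))
	fmt.Printf("orthogonal:     %.2f\n", cosine([]float64{1, 0}, []float64{0, 1}))
}
```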

Retrieval-Augmented Generation (RAG) Tutorial: Architecture, Implementation, and Production Guide

Step-by-step RAG tutorial: build retrieval-augmented generation systems with vector databases, hybrid search, reranking, and web search. Architecture, implementation, and production best practices.

Netlify for Hugo & static sites: pricing, free tier, and alternatives

Technical guide to Netlify for Hugo and modern web apps. Deploy Previews, Functions, Edge Functions, credit-based pricing, Free plan limits, Hugo netlify.toml patterns, and alternatives such as Vercel and Cloudflare Pages.

Neo4j graph database for GraphRAG, install, Cypher, vectors, ops

Senior-engineer guide to Neo4j for property graphs and GraphRAG. Cypher, ACID, Neo4j vs Neptune and TigerGraph, Docker and AuraDB, ports and neo4j.conf, vector indexes, hybrid retrieval, and Python neo4j-graphrag.

Apache Flink on K8s and Kafka: PyFlink, Go, ops, and managed pricing

DevOps guide to Apache Flink: stateful streaming, JobManager and TaskManagers, checkpoints vs savepoints, comparisons with Spark and Kafka Streams, the Kubernetes Operator, Helm, PyFlink, Go, and managed-service pricing.

Web Infrastructure — static publishing, CDN, indexing, and domain services

Systems for publishing static sites, shipping to AWS S3 and CloudFront, notifying search engines with IndexNow, and running email on a custom domain. CLI-first workflows and trade-offs without console fluff.

Hosted email for custom domains compared - Workspace, Microsoft 365, Zoho, Proton, WorkMail

Google Workspace, Microsoft 365, Zoho, Proton, and AWS WorkMail compared for custom-domain email. Typical monthly cost, what MX and SPF really buy you, deliverability tradeoffs, and when to skip self-hosting.

IndexNow explained - notify search engines when you publish

IndexNow pushes URL changes to Bing and other engines in minutes. Learn why it beats waiting for crawls, how host and key verification work, and how to run submissions from the CLI or a sitemap.

SGLang QuickStart: Install, Configure, and Serve LLMs via OpenAI API

Install SGLang with uv, pip, or Docker; configure YAML and server flags; then serve Hugging Face LLMs with an OpenAI-compatible API plus native /generate and offline Engine examples.

llama.swap Model Switcher Quickstart for OpenAI-Compatible Local LLMs

Install and configure llama-swap (llama.swap), hot-swap local models via OpenAI or Anthropic APIs, compare with Ollama and LM Studio, and troubleshoot common errors.

Data Infrastructure for AI Systems: Object Storage, Databases, Search & AI Data Architecture

Engineering guide to data infrastructure for production AI systems — S3-compatible object storage (MinIO, Garage, AWS S3), PostgreSQL, Elasticsearch, streaming and messaging (Kafka, Airflow, queues), SaaS integrations, AI-native data layers, benchmarks, and trade-offs.

Apache Kafka Quickstart - Install Kafka 4.2 with CLI and Local Examples

Learn Apache Kafka 4.2 fast with tarball or Docker, start a local KRaft broker, master key CLI tools, and run practical producer, consumer, and Connect examples.

Self-hosting SearXNG

Self-hosting SearXNG

Converting Windows Text to Linux Format

Complete guide to converting Windows CRLF line endings to Linux LF format using dos2unix, sed, and tr with Git configuration for cross-platform development.
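The transformation itself is tiny: every CRLF pair becomes a bare LF. A sketch of dos2unix's core behavior in memory (the function name is mine, not from the post):

```go
package main

import (
	"fmt"
	"strings"
)

// toUnix does in memory what dos2unix does to a file: every CRLF
// pair becomes a bare LF, leaving LF-only lines untouched.
func toUnix(s string) string {
	return strings.ReplaceAll(s, "\r\n", "\n")
}

func main() {
	fmt.Printf("%q\n", toUnix("one\r\ntwo\r\n"))
}
```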

Markdown Code Blocks: Complete Guide with Syntax, Languages & Examples

Complete guide to Markdown code blocks: fenced blocks, inline code, syntax highlighting, diff formatting, language identifiers, filename display, and Hugo-specific features.

Best Linux Terminal Emulators: 2026 Comparison

Compare top Linux terminal emulators: Alacritty, Kitty, WezTerm, GNOME Terminal, and more. Features, performance, and customization options reviewed.

Oh My Opencode Review: Honest Results, Billing Risks, and When It's Worth It

Real hands-on Oh My Opencode experience plus community benchmarks. Learn when Ultrawork beats vanilla OpenCode, when it doesn't, and how to avoid the billing surprises that caught users off guard.

Oh My Opencode Specialised Agents Deep Dive and Model Guide

Deep dive into Oh My Opencode specialised agents for OpenCode. Learn Sisyphus orchestration, Prometheus planning, Librarian research, Oracle review, model fallbacks, and local LLM swaps.

OpenClaw: Examining a Self-Hosted AI Assistant as a Real System

A case-study exploration of OpenClaw — a self-hosted AI assistant system that integrates local LLMs, retrieval, memory, routing, and observability into a cohesive local infrastructure.

Oh My Opencode QuickStart for OpenCode: Install, Configure, Run

A practical Oh My Opencode quickstart for OpenCode. Learn installation via bunx or npm, configuration file locations, ultrawork mode, agent models, and real command examples for daily dev.

Best LLMs for OpenCode - Tested Locally

Hands-on comparison of LLMs in OpenCode - local Ollama and llama.cpp models vs cloud. Coding tasks, migration map accuracy stats, and honest failure analysis.

OpenHands Coding Assistant QuickStart: Install, CLI Flags, Examples

OpenHands QuickStart for developers. Install the CLI, configure your LLM API key, learn core command-line flags and safety modes, and run practical examples in interactive and headless workflows.

LocalAI QuickStart: Run OpenAI-Compatible LLMs Locally

Learn to install LocalAI, load models from the gallery or Hugging Face, and serve an OpenAI-compatible API plus Web UI for chat, embeddings, images and audio on your own hardware.

Designing Non-Blocking RAG Pipelines

Learn how to design non-blocking RAG pipelines using asynchronous processing, vector databases, and robust error handling to achieve low-latency, high-throughput AI systems in 2026.
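The non-blocking core amounts to fanning retrieval work out across concurrent workers so one slow lookup never stalls the rest. A minimal Go sketch with goroutines and channels; the function names are mine, and the vector-store lookup is a placeholder:

```go
package main

import (
	"fmt"
	"sync"
)

// retrieve fans query lookups out across workers so a slow shard
// never blocks the others - the non-blocking core of a RAG pipeline.
func retrieve(queries []string, workers int) []string {
	jobs := make(chan string)
	results := make(chan string)
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for q := range jobs {
				// Placeholder for an async vector-store lookup.
				results <- "doc-for-" + q
			}
		}()
	}

	go func() {
		for _, q := range queries {
			jobs <- q
		}
		close(jobs)
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	docs := retrieve([]string{"q1", "q2", "q3"}, 2)
	fmt.Println(len(docs), "documents retrieved")
}
```

Result order is nondeterministic by design; a real pipeline would rerank the merged results afterwards.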

vLLM Quickstart: High-Performance LLM Serving - in 2026

Complete vLLM setup guide with Docker, OpenAI API compatibility, and PagedAttention optimization. Compare vLLM vs Ollama vs Docker Model Runner for production.

llama.cpp Quickstart with CLI and Server

Install llama.cpp, run GGUF models with llama-cli, and serve OpenAI-compatible APIs using llama-server. Key flags, examples, and tuning tips, with a short commands cheatsheet.

Docker vs Podman: A 2026 Comparison for AI Infrastructure

Compare Docker and Podman for AI infrastructure in 2026. This analysis covers architecture, performance, security, and Kubernetes integration to help teams choose the right container runtime for AI development and deployment.

OpenCode Quickstart: Install, Configure, and Use the Terminal AI Coding Agent

A practical OpenCode quickstart for developers: install and verify, connect models/providers, run CLI workflows, use the server + JS SDK, and keep a short cheatsheet.

Network Programming in Rust with Tokio

Learn network programming in Rust with practical examples for TCP servers, async networking using Tokio, and performance optimization. Covers core concepts, security best practices, and modern Rust networking patterns.

Airtable for Developers & DevOps - Plans, API, Webhooks, and Go/Python Examples

Deep research guide to Airtable - what it is, core features, Free plan limits and implications, key competitors, and production-ready DevOps integration patterns with runnable Go and Python examples (CRUD, pagination, rate limits, batching, webhooks).

Comparing LLMs performance on Ollama on 16GB VRAM GPU

Benchmark of 14 LLMs on RTX 4080 16GB with Ollama 0.15.2. Compare tokens/sec, VRAM usage, and CPU offloading for GPT-OSS, Qwen3, Qwen3.5, Mistral, and more.

Running LLM Inference on Kubernetes: What Breaks First

Learn the critical failure points when running LLM inference on Kubernetes, including resource constraints, operator compatibility, security, scalability, and monitoring best practices for production workloads.

LLM Performance and PCIe Lanes: Key Considerations

LLM Performance and PCIe Lanes: Key Considerations

Search vs Deepsearch vs Deep Research

Search vs Deepsearch vs Deep Research

Markdown Cheatsheet: Syntax, Formatting & Structure Quick Reference

Quick reference to Markdown syntax: headings, bold, italic, lists, links, images, tables, code blocks, blockquotes, task lists, math, and more — with examples for every element.

Docker Model Runner vs Ollama (2026): Which Is Better for Local LLMs?

Trying to choose between Docker Model Runner and Ollama? We compare performance, GPU support, API compatibility, Docker integration and production readiness to help you decide fast.

Ollama vs vLLM vs LM Studio: Best Way to Run LLMs Locally in 2026?

Choosing the best way to run LLMs locally? Compare Ollama, vLLM, LM Studio, LocalAI and 8+ tools by API support, hardware compatibility, tool calling, and production readiness.

Monitor LLM Inference in Production (2026): Prometheus & Grafana for vLLM, TGI, llama.cpp

Learn how to monitor LLM inference in production using Prometheus and Grafana. Track p95 latency, tokens/sec, queue duration, and KV cache usage across vLLM, TGI, and llama.cpp. Includes PromQL examples, dashboards, alerts, Docker & Kubernetes setups.
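As one concrete example of the kind of query the post covers, a p95 latency panel for vLLM might use a PromQL expression like this. The metric name assumes vLLM's exported latency histogram; check the names your exporter actually emits.

```promql
histogram_quantile(
  0.95,
  sum by (le) (rate(vllm:e2e_request_latency_seconds_bucket[5m]))
)
```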

OpenClaw Quickstart: Install with Docker (Ollama GPU or Claude + CPU)

Install OpenClaw in minutes with Docker. Run locally with Ollama (GPU) or use Claude Sonnet 4.6 (CPU-only). Includes setup, model config, testing, and troubleshooting.

Garage vs MinIO vs AWS S3: Object Storage Comparison and Feature Matrix

Compare MinIO, Garage, and AWS S3 for object storage. Feature matrix, cost model, operational complexity, and when to choose each—managed S3, self-hosted Garage, or MinIO with broad S3 parity.

Implementing Workflow Applications with Temporal in Go: A Complete Guide

Learn how to implement workflow applications with Temporal in Go using the official Temporal Go SDK. This end-to-end guide covers configuration, examples, deployment, troubleshooting, and best practices for building scalable, resilient workflows.