AI in Kubernetes – how to get started without expensive GPUs
- Jasmina Dimitrievska
- Sep 4
- 4 min read
Open-source LLMs (Large Language Models) have put powerful AI capabilities in the hands of developers. But while most of the focus is on massive models with billions of parameters running in GPU clusters in the cloud, it’s just as important to understand how small language models (SLMs) can be run locally.
For testing, development, and prototyping, it’s not only possible – it’s often the most practical path. Running LLM inference in Kubernetes directly on your laptop can save both time and costs. This is something our DevOps and AI/MLOps consultants frequently encounter, as companies often want to experiment locally before building larger solutions in the cloud together with, for example, a GCP or AWS cloud consultant.
🖥️ Build a CPU-only LLM Inference Server
In this setup, the entire model runs locally. When the Docker container starts, the pre-trained model is loaded from disk into memory. All inference then runs on your CPU (Central Processing Unit, the computer’s regular processor) without any external API calls. A network connection is only needed once, when the model is downloaded via from_pretrained. After that, you can serve the model yourself via FastAPI and Docker, and even run completely offline if you want.
👉 This means full control, high security, and no dependency on third-party providers – something that is often a prerequisite when MLOps consultants build internal AI platforms.
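As a concrete illustration, a minimal version of this flow with the Hugging Face transformers pipeline could look like the sketch below – the model, prompt, and generation settings are just example choices:

```python
# Minimal CPU-only inference sketch.
# from_pretrained downloads the model once and caches it locally;
# every call after that runs entirely on the CPU, with no external APIs.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="distilgpt2",  # small model that fits comfortably in RAM
    device=-1,           # -1 = CPU in the pipeline API
)

print(generator("Kubernetes is", max_new_tokens=30)[0]["generated_text"])
```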
Examples of small models that work well on CPU:
distilgpt2
EleutherAI/gpt-neo-125M
The downside of smaller models is that they tend to hallucinate and overgeneralize more often. They can produce confident but incorrect answers, especially for fact-based questions. Perfect for prototypes and testing – but not something you should use for production-critical tasks.
🔍 CPU or GPU – when should you use which?
CPU works well for models with around 125M–350M parameters (distilgpt2, gpt2, gpt-neo-125M). On a modern laptop, performance is perfectly acceptable.
GPU (Graphics Processing Unit) is needed when you want to run larger models such as LLaMA 2 7B, Mistral, or Falcon 7B+. Here we’re talking about requirements of 12–40GB VRAM, high memory bandwidth, and CUDA support to run computations efficiently.
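In code, the switch is usually just a matter of which device you load the model onto. A common pattern (sketched here with PyTorch and transformers, not tied to any particular setup) is to fall back to CPU when no GPU is available:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Use a CUDA GPU if one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

inputs = tokenizer("Hello from Kubernetes", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```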
💡 Simply put:
For local testing and development → run on CPU.
For production and heavier workloads → move to GPU-backed Kubernetes nodes in the cloud (EKS, GKE). Here, a cloud consultant can help you build the right architecture.
🛠️ Minikube + FastAPI: here’s how I did it
I built a FastAPI-based SLM inference server and deployed it in a local Kubernetes cluster with Minikube – entirely without GPU, Docker Hub, or external dependencies.
Example:
Python code: run.py
Dockerfile: Dockerfile
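The actual run.py isn’t reproduced here, but a stripped-down version of such a server might look like this (the /generate endpoint and default parameters are illustrative):

```python
# run.py – minimal FastAPI inference server (illustrative sketch)
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# The model is loaded once at startup; every request reuses it from memory.
generator = pipeline("text-generation", model="distilgpt2", device=-1)

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 50

@app.post("/generate")
def generate(prompt: Prompt):
    result = generator(prompt.text, max_new_tokens=prompt.max_new_tokens)
    return {"generated_text": result[0]["generated_text"]}
```

Inside the container, the server is started with something like uvicorn run:app --host 0.0.0.0 --port 8000.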
Build locally:
$ docker build -t llm-api .
Then load the image into Minikube and deploy with kubectl apply. Once everything is running, you can call the API via port-forward and curl.
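Assuming the port-forward exposes the service on localhost:8000 and the server uses the /generate endpoint from the sketch above, a quick smoke test can also be done from Python:

```python
import requests

# Assumes: kubectl port-forward is exposing the (hypothetical) llm-api
# service on localhost:8000, and the server has the /generate endpoint
# sketched earlier.
resp = requests.post(
    "http://localhost:8000/generate",
    json={"text": "Kubernetes is", "max_new_tokens": 30},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```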
🍏 Running an LLM with ONNX Runtime on macOS (M1–M4)
If you’re developing on a Mac (M1–M4), GPU-centric serving stacks like vLLM or TGI don’t run natively, since they primarily target CUDA GPUs. ONNX Runtime (Open Neural Network Exchange Runtime) is a great alternative: its CPU execution provider is highly optimized and runs natively on Apple Silicon.
Benefits of ONNX Runtime:
Native support for macOS ARM64
Support for INT8-quantized models (fast and memory-efficient)
No GPU required for small models
Easy to containerize and run in Kubernetes
I converted EleutherAI/gpt-neo-125M to ONNX (via conversion.py) and built a FastAPI server that I then deployed to Minikube. Everything runs locally, entirely decoupled from external services.
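The conversion script itself isn’t shown here, but with the Hugging Face Optimum library the export can be sketched roughly as follows (the output path and model choice are just examples):

```python
# ONNX export sketch using Optimum's ONNX Runtime integration.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "EleutherAI/gpt-neo-125M"

# export=True converts the PyTorch checkpoint to ONNX on the fly.
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save both so the API container can bundle them and run fully offline.
model.save_pretrained("gpt-neo-125m-onnx")
tokenizer.save_pretrained("gpt-neo-125m-onnx")

# Sanity check: generation now runs through ONNX Runtime on the CPU.
inputs = tokenizer("Kubernetes is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```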
🤖 Hallucinations in small models
Small models like distilgpt2 and gpt-neo-125M have limited capacity. The result can be:
repeated phrases
incorrect facts (“hallucinations”)
dull and predictable answers
Example:
“Kubernetes is an open-source, open-source, open-source…”
To reduce this, you can tune the generation parameters (see the sketch after this list):
use top-k/top-p sampling
adjust the temperature
set a repetition penalty
limit the generation length
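In the transformers generate API, those knobs map to parameters roughly like this (the values are only a starting point to tune from):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

inputs = tokenizer("Kubernetes is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,          # sample instead of greedy decoding
    top_k=50,                # keep only the 50 most likely next tokens
    top_p=0.9,               # nucleus sampling cutoff
    temperature=0.7,         # lower = more focused, less random
    repetition_penalty=1.2,  # discourage repeated phrases
    max_new_tokens=60,       # limit generation length
    pad_token_id=tokenizer.eos_token_id,  # avoid the missing pad token warning
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```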
These models are best suited for prototypes, UI testing, and rapid development. But for production, QA, or code generation, larger models or RAG systems are needed – something our AI/MLOps consultants often help companies implement.
Summary
With ONNX Runtime and Kubernetes, you can run small LLMs locally almost as easily as a Flask app. It’s fast, private, flexible – and you avoid expensive GPUs.
It’s not just about “running a model,” but about enabling:
offline testing
efficient inner-loop development
self-hosted AI features
full freedom from cloud dependencies
👉 For organizations wanting to start exploring AI in Kubernetes, this is a smart first step. And when it’s time to move toward production, cloud, and scaling, the right DevOps and cloud consulting (GCP/AWS) can be crucial for building a sustainable solution.
🚀 Take the next step with AI and Kubernetes
Want to explore how LLM in Kubernetes can create value for your business? Our DevOps and AI/MLOps consultants help you go from prototype to production – whether you’re running locally, in GCP, AWS, or a hybrid setup.
👉 We can support you with:
AI platform architecture and design
Infrastructure-as-code implementation
Operations and optimization in Kubernetes
Security, DevSecOps, and cloud scaling
💡 Get in touch with us to learn how we can help you build a future-proof AI solution.