The Best Open Source Tools for AIOps

Jasmina Dimitrievska
May 20

After several years of building and operating LLM-based systems in production, one thing has become very clear: it’s impossible to work with AI sustainably without proper observability, structure, and control.

In the beginning, things often work anyway. Someone tweaks prompts directly in the code. Someone manually tests things in ChatGPT. Logs are missing or incomplete. Evaluations are done by reviewing a handful of outputs in a notebook.

But as more AI features go live, technical debt starts growing rapidly.

And not just for the developer who built the solution — but for the entire organization.

The good news is that the open source AI ecosystem has matured quickly in recent years. Today, there are genuinely strong tools for building stable, scalable, and secure AI platforms without locking yourself into a specific vendor.

AIOps — the practice of operating, monitoring, and quality-assuring AI systems and LLMs — is no longer something only large AI companies work with. It’s entirely possible for enterprises and IT organizations to implement today.

In my work, four tools (or groups of tools) stand out in particular, and together they create a very strong foundation for modern AI operations:

Langfuse: for tracing, prompt management, and AI observability
LiteLLM: for routing and controlling model calls across different AI models
Promptfoo, DeepEval, and Ragas: for automated AI evaluation and regression testing
Guardrails AI: for validation, safety, and structured outputs

Below, I’ll explain why these tools have become central to the way I build and operate AI systems in production.

Langfuse: A Central Platform for Observability and Prompt Management

One of the most common challenges in AI projects is actually quite simple:

you don’t really know what the model is doing in production.

That’s where Langfuse comes in.

Langfuse is an open source platform for LLM engineering that does far more than just tracing. It acts as a central source of truth where prompts, tracing data, evaluations, costs, and feedback are all collected in one place.

This makes a huge difference once AI systems start becoming more complex.

What makes Langfuse particularly useful is the following:

⮑ Full observability for AI systems

Inputs, outputs, latency, token usage, costs, and metadata are logged throughout the entire AI workflow — even across complex agent flows and multi-step processes.

This makes it significantly easier to understand why an AI system behaves the way it does.
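To make the idea concrete, here is a minimal sketch of what a trace record captures for one step of a workflow. It uses only the Python standard library and is not the Langfuse SDK; the `traced` decorator, the `TRACES` list, and the `summarize` function are hypothetical names chosen for illustration.

```python
import functools
import time
from typing import Any, Callable

# Hypothetical in-memory trace store. A real setup would ship these records
# to an observability backend instead of keeping them in a list.
TRACES: list[dict[str, Any]] = []


def traced(name: str) -> Callable:
    """Record input, output, latency, and errors for each call of a step."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            record: dict[str, Any] = {
                "name": name,
                "input": {"args": args, "kwargs": kwargs},
            }
            start = time.perf_counter()
            try:
                record["output"] = func(*args, **kwargs)
                return record["output"]
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                # Runs on success and on failure, so every call leaves a trace.
                record["latency_s"] = time.perf_counter() - start
                TRACES.append(record)
        return wrapper
    return decorator


@traced("summarize")
def summarize(text: str) -> str:
    # Stand-in for a model call so the sketch runs without network access.
    return text[:20]


summarize("Observability makes AI systems debuggable.")
print(TRACES[-1]["name"], TRACES[-1]["output"])
```

Token usage and cost would be added to the same record once a real model response is available.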

⮑ Prompt management that actually works in practice

Prompts can be versioned, tagged, and rolled back without redeploying the application itself. Instead, the application dynamically fetches the current production prompt at runtime, allowing development teams and business stakeholders to iterate faster without getting blocked by release processes.
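The mechanism behind this can be sketched in a few lines: versions are immutable, and a label such as "production" is a pointer that can be moved forward or back. The `PromptRegistry` class below is a hypothetical in-memory illustration of that idea, not Langfuse's API.

```python
from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    """Hypothetical in-memory store: versions per prompt name, plus labels."""
    versions: dict[str, list[str]] = field(default_factory=dict)
    labels: dict[tuple[str, str], int] = field(default_factory=dict)

    def publish(self, name: str, template: str) -> int:
        """Store a new version and return its 1-based version number."""
        self.versions.setdefault(name, []).append(template)
        return len(self.versions[name])

    def promote(self, name: str, version: int, label: str = "production") -> None:
        """Point a label at a version; promoting an older one is a rollback."""
        if not 1 <= version <= len(self.versions.get(name, [])):
            raise ValueError(f"unknown version {version} for prompt {name!r}")
        self.labels[(name, label)] = version

    def get(self, name: str, label: str = "production") -> str:
        """Fetch the template the label currently points at."""
        return self.versions[name][self.labels[(name, label)] - 1]


registry = PromptRegistry()
v1 = registry.publish("support-reply", "Answer politely: {question}")
v2 = registry.publish("support-reply", "Answer politely and cite sources: {question}")
registry.promote("support-reply", v2)
registry.promote("support-reply", v1)  # rollback without redeploying the app
print(registry.get("support-reply").format(question="Where is my order?"))
```

Because the application only ever asks for the "production" label, a rollback changes what it receives on the next fetch without any code change.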

⮑ Built-in AI evaluations and datasets

Langfuse makes it possible to run model-based evaluations directly against production data, build regression tests from real traffic, and detect quality degradation before users are affected.

In practice, this means new prompts can be rolled out, monitored in real time, and, with some automation built around the evaluation scores, reverted automatically if quality drops too much.

LiteLLM: A Unified Layer for All LLM Calls

LiteLLM has quickly become one of the most interesting open source solutions for routing and managing language models.

Instead of building separate integrations for providers such as OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, or Google Vertex, the application uses a single standardized API call.

LiteLLM handles the rest.

And that saves a significant amount of time.

Here are some of the capabilities that make a real difference in practice:

✦ Automatic routing and failover

LiteLLM automatically handles provider selection, retries, fallbacks, and rate limiting.

If a model goes down or becomes slow, traffic can automatically be routed to another model without affecting the application.
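The failover behavior can be illustrated with a small standard-library sketch. This is not LiteLLM's implementation or configuration format; `call_with_fallback` and the two stand-in model functions are hypothetical, and each model is represented as a plain callable so the example runs offline.

```python
import time
from typing import Callable, Sequence

ModelCall = Callable[[str], str]


def call_with_fallback(
    prompt: str,
    models: Sequence[tuple[str, ModelCall]],
    retries_per_model: int = 2,
    backoff_s: float = 0.0,
) -> tuple[str, str]:
    """Try each model in order; return (model_name, output) of the first success."""
    errors: list[str] = []
    for name, call in models:
        for attempt in range(retries_per_model):
            try:
                return name, call(prompt)
            except Exception as exc:
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
                # Exponential backoff between retries (zero by default here).
                time.sleep(backoff_s * (2 ** attempt))
    raise RuntimeError("all models failed: " + "; ".join(errors))


def flaky_primary(prompt: str) -> str:
    # Stand-in for a provider that is currently down.
    raise TimeoutError("primary model timed out")


def stable_backup(prompt: str) -> str:
    # Stand-in for a healthy fallback provider.
    return f"backup answer to: {prompt}"


print(call_with_fallback("ping", [("primary", flaky_primary), ("backup", stable_backup)]))
```

The calling code only sees the final result, which is the point: the retry and fallback policy lives in one place instead of in every call site.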

✦ A unified interface for more than 100 models

The same API works with OpenAI, Anthropic, Cohere, Mistral, Ollama, vLLM, AWS Bedrock, Azure OpenAI, and many others. This reduces vendor lock-in and makes the AI architecture significantly easier to maintain.

✦ Built-in observability

LiteLLM integrates directly with tools such as Langfuse, OpenTelemetry, Helicone, and Arize Phoenix. Once an integration is configured, model calls are logged and traced automatically.

✦ Cost control and budget management

Costs can be tracked per user, team, project, or API key.

It’s also possible to define budget limits to prevent AI costs from spiraling out of control.
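As a rough illustration of how a budget limit works, the sketch below tracks spend per key and refuses a call that would exceed the limit. The `BudgetTracker` class, the key name, and the per-token prices are all made up for the example; LiteLLM's own budget features are configured differently.

```python
class BudgetExceeded(Exception):
    """Raised when a call would push a key past its spending limit."""


class BudgetTracker:
    """Hypothetical per-key spend tracker with hard limits in USD."""

    def __init__(self, limits_usd: dict[str, float]) -> None:
        self.limits_usd = dict(limits_usd)
        self.spent_usd: dict[str, float] = {}

    def charge(
        self,
        key: str,
        prompt_tokens: int,
        completion_tokens: int,
        usd_per_1k_prompt: float,
        usd_per_1k_completion: float,
    ) -> float:
        """Add the cost of one call to the key's total, or raise if over budget."""
        cost = (
            prompt_tokens / 1000 * usd_per_1k_prompt
            + completion_tokens / 1000 * usd_per_1k_completion
        )
        spent = self.spent_usd.get(key, 0.0)
        limit = self.limits_usd.get(key)  # keys without a limit are unrestricted
        if limit is not None and spent + cost > limit:
            raise BudgetExceeded(
                f"{key}: {spent + cost:.4f} USD would exceed limit {limit:.4f} USD"
            )
        self.spent_usd[key] = spent + cost
        return cost


# Illustrative prices only; real prices vary by provider and model.
tracker = BudgetTracker({"team-search": 0.01})
tracker.charge("team-search", 2000, 500, usd_per_1k_prompt=0.002, usd_per_1k_completion=0.006)
print(f"spent so far: {tracker.spent_usd['team-search']:.4f} USD")
```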

✦ Centralized policy management

Rate limits, model access, and user policies can be managed centrally instead of being scattered throughout the application codebase. This makes the platform significantly easier to operate and scale over time.

Evaluation Frameworks: How to Actually Measure AI Quality

Observability shows what happened.

Routing shows where requests were sent.

But neither of those things actually tells you whether the result was good.

That’s where modern AI evaluation frameworks come in.

The open source ecosystem has evolved rapidly, and today there are several genuinely strong frameworks for systematically measuring quality in generative AI systems.

Three tools in particular stand out to me:

Promptfoo

Promptfoo is a CLI-based framework for comparing prompts and models.

It integrates smoothly into CI/CD pipelines and can block pull requests that reduce quality against predefined test datasets.

One thing I particularly like is that Promptfoo also includes advanced capabilities for red teaming and security testing, including:

Prompt injection
Jailbreak attempts
PII extraction
Adversarial attacks

DeepEval

DeepEval is often described as “pytest for LLMs.”

It integrates AI evaluation directly into Python-based testing workflows and includes built-in methods for measuring things such as:

Factual accuracy
Relevance
Bias
Toxicity
Hallucinations

It works well both for continuous monitoring and more traditional testing scenarios.

Ragas

Ragas is built specifically for Retrieval-Augmented Generation (RAG) systems.

It measures areas such as:

Retrieval quality
Context relevance
Faithfulness of responses

And it makes it significantly easier to determine whether a RAG system is actually working as intended.
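To give a feel for what "retrieval quality" means as a number, here is a deliberately simplified sketch that scores retrieved document IDs against a hand-labeled set of relevant IDs. This is not how Ragas computes its metrics, which are typically LLM-assisted; the function names and the document IDs are hypothetical.

```python
def id_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved documents that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc_id in retrieved if doc_id in relevant) / len(retrieved)


def id_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of relevant documents that the retriever managed to find."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(retrieved) & relevant) / len(relevant)


retrieved_ids = ["d1", "d4", "d2"]   # what the retriever returned
relevant_ids = {"d1", "d2", "d3"}    # what a human marked as relevant
print(f"precision={id_precision(retrieved_ids, relevant_ids):.2f}")
print(f"recall={id_recall(retrieved_ids, relevant_ids):.2f}")
```

Low precision points to noisy context being passed to the model, while low recall means the answer may be missing from the context altogether.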

What a Modern AI Testing Workflow Looks Like in Practice

When a developer opens a pull request that modifies a prompt, a CI job can automatically:

Fetch production datasets from Langfuse
Run tests through LiteLLM
Evaluate the results using frameworks such as DeepEval or Promptfoo
Publish regressions directly inside the pull request

This allows teams to make decisions based on data instead of intuition.

And that’s actually a significant shift compared to how many AI projects still operate today.
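The CI job described above boils down to a regression gate: run a fixed dataset through the candidate, score it, and fail the build if the score drops below the baseline. The sketch below assumes a tiny hard-coded dataset and a substring check as the metric; `DATASET`, `candidate_model`, `pass_rate`, and `gate` are hypothetical names, and a real pipeline would pull the dataset from the tracing platform and use a proper evaluation framework for scoring.

```python
from typing import Callable

# Hypothetical regression dataset; in the workflow above it would be
# exported from production traces rather than written by hand.
DATASET = [
    {"input": "What is the refund window?", "must_contain": "30 days"},
    {"input": "Which plan includes SSO?", "must_contain": "Enterprise"},
]


def candidate_model(question: str) -> str:
    # Stand-in for the prompt and model under review.
    answers = {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "Which plan includes SSO?": "SSO is part of the Enterprise plan.",
    }
    return answers.get(question, "")


def pass_rate(model: Callable[[str], str], dataset: list[dict]) -> float:
    """Fraction of cases whose output contains the expected phrase."""
    passed = sum(
        1 for case in dataset
        if case["must_contain"].lower() in model(case["input"]).lower()
    )
    return passed / len(dataset)


def gate(model: Callable[[str], str], dataset: list[dict],
         baseline: float, tolerance: float = 0.02) -> int:
    """Return a CI exit code: 0 if quality held, 1 on regression."""
    score = pass_rate(model, dataset)
    print(f"pass rate {score:.2f} (baseline {baseline:.2f})")
    return 0 if score >= baseline - tolerance else 1


exit_code = gate(candidate_model, DATASET, baseline=1.0)
print("exit code:", exit_code)
```

In a CI job, the exit code would be passed to `sys.exit` so that a regression marks the pull request check as failed.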

Guardrails AI: Runtime Safety and Validation

Even with strong observability and solid evaluation workflows, you still need safeguards directly at inference time.

LLMs can still:

Generate broken JSON
Leak sensitive information
Hallucinate sources
Violate internal policies

Guardrails AI acts as a protection layer between the model and the application.

These are the capabilities I find particularly valuable:

⮑ Structured outputs

Schemas are defined in code, and outputs that fail validation can be:

Rejected
Repaired
Automatically regenerated

This is especially important in agent-based AI systems and tool-calling workflows.
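The reject, repair, and regenerate options can be sketched as a small loop. This is a standard-library illustration rather than the Guardrails AI API: the required fields, the `repair` heuristic, and the `generate` callable standing in for a model call are all assumptions made for the example.

```python
import json
from typing import Callable

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"ticket_id": int, "priority": str}


def validate(raw: str) -> dict:
    """Parse and check the output; raise ValueError if it does not conform."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected_type):
            raise ValueError(f"field {name!r} missing or not {expected_type.__name__}")
    return data


def repair(raw: str) -> str:
    """Cheap repair: keep only the outermost {...} span, dropping chatter around it."""
    start, end = raw.find("{"), raw.rfind("}")
    return raw[start:end + 1] if 0 <= start < end else raw


def generate_validated(generate: Callable[[], str], max_attempts: int = 3) -> dict:
    """Validate as-is, then try a repair, then regenerate; reject after max_attempts."""
    last_error: Exception | None = None
    for _ in range(max_attempts):
        raw = generate()
        for candidate in (raw, repair(raw)):
            try:
                return validate(candidate)
            except ValueError as exc:
                last_error = exc
    raise ValueError(f"rejected after {max_attempts} attempts: {last_error}")


# Two canned replies stand in for successive model calls.
replies = iter([
    'Sure! {"ticket_id": "7"}',
    'Here you go: {"ticket_id": 7, "priority": "high"} Hope that helps',
])
print(generate_validated(lambda: next(replies)))
```

The first reply cannot be repaired because the field types are wrong, so the loop regenerates; the second reply is valid once the surrounding chatter is stripped.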

⮑ Built-in content validation

Guardrails can validate for things such as:

PII
Toxicity
Prompt injections
Policy violations
Off-topic responses
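A content validator of this kind can be as simple as a set of patterns applied to the model output. The sketch below redacts e-mail addresses and phone-like numbers with regular expressions; the patterns and the `redact_pii` function are illustrative assumptions, far less thorough than a production PII detector, and not taken from Guardrails AI.

```python
import re

# Illustrative patterns only; real PII detection needs much broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def redact_pii(text: str) -> tuple[str, list[str]]:
    """Replace matches with placeholders and report which PII types were found."""
    found: list[str] = []
    for label, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"[{label.upper()} REDACTED]", text)
        if count:
            found.append(label)
    return text, found


clean, hits = redact_pii("Contact anna@example.com or +46 70 123 45 67.")
print(clean)
print(hits)
```

Depending on policy, the caller can pass the redacted text on, or treat any hit as a validation failure and block the response.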

⮑ Custom domain-specific rules

It’s also possible to create custom validation rules based on business requirements.

For example:

Financial disclaimers
Medical limitations
Legal compliance requirements

⮑ Middleware for entire AI stacks

Guardrails can integrate directly with LiteLLM and be applied consistently across all models and providers.

That makes security and policy management significantly easier to standardize across the organization.

How These Tools Work Together in a Modern AIOps Architecture

What makes these tools so powerful is how well they complement each other.

A typical AI workflow might look something like this:

The application sends a request to LiteLLM
Guardrails AI validates the input and checks policy requirements
LiteLLM selects the model and handles routing, retries, and budget controls
The model output is validated again by Guardrails AI
All tracing data is sent to Langfuse
Production data is then used in Promptfoo, DeepEval, or Ragas for automated evaluations and regression testing
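The request path in the steps above can be sketched as one function that chains an input check, a router, an output check, and a trace sink. Everything here is a hypothetical stand-in built from plain callables; in a real stack the guards would be Guardrails AI validators, the router would be LiteLLM, and the trace sink would be Langfuse.

```python
from typing import Callable


def handle_request(
    user_input: str,
    input_guard: Callable[[str], str],
    route: Callable[[str], tuple[str, str]],
    output_guard: Callable[[str], str],
    trace: Callable[[dict], None],
) -> str:
    """Guard the input, call a model via the router, guard the output, trace it all."""
    record: dict = {"input": user_input}
    try:
        safe_input = input_guard(user_input)
        model_name, raw_output = route(safe_input)
        record["model"] = model_name
        record["output"] = output_guard(raw_output)
        return record["output"]
    except Exception as exc:
        record["error"] = repr(exc)
        raise
    finally:
        trace(record)  # blocked and failed requests are traced as well


def block_injection(text: str) -> str:
    if "ignore previous instructions" in text.lower():
        raise ValueError("possible prompt injection")
    return text


def fake_router(prompt: str) -> tuple[str, str]:
    return "model-a", f"Echo: {prompt}"


def require_non_empty(output: str) -> str:
    if not output.strip():
        raise ValueError("empty model output")
    return output


trace_log: list[dict] = []
print(handle_request("Hello", block_injection, fake_router, require_non_empty, trace_log.append))
```

The traces collected this way are what later feed the evaluation and regression-testing step.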

Each tool essentially solves its own problem domain:

Routing
Security
Observability
Quality evaluation

Together, they create a very solid foundation for organizations building AI solutions in production.

Conclusion

Building AI systems for production today involves far more than simply choosing the right model. It’s about observability, governance, security, cost control, and maintaining quality over time.

For me, Langfuse, LiteLLM, Promptfoo, DeepEval, Ragas, and Guardrails AI have become some of the most interesting open source tools for building stable and scalable AI platforms.

It does require some effort to design the right architecture from the beginning.

But the benefits are significant:

Better control
Better quality
Better security
Better scalability

And most importantly, it becomes possible to operate AI systems in a sustainable and truly professional way over the long term.

Want to Get Started with AIOps?

Building stable AI solutions for production requires the right architecture, observability, and governance — and it quickly becomes more complex than many organizations expect.

At SDNIT, our AI and AIOps consultants work daily with generative AI, LLM platforms, and modern AI solutions for companies and organizations.

Would you like help building, improving, or scaling your AI solutions?

Book a demo

Or feel free to contact our colleague Ronnie Qvist directly, and he’ll help you move forward.