
Reimagining Logs: Building an AI-Powered Conversational Observability System

It is mid-2025 and the cogs of AI are at full speed. So we (Mobin and I) decided to do our own AI project. We called it "IntelliLogs".

IntelliLogs at a glance:




In this post I will describe why we did it, what we built and how we built it, along with my personal experience. I am hoping this will, at the very least, be an interesting read.

Table of contents:

Why IntelliLogs

What is IntelliLogs

How IntelliLogs was developed

Future of IntelliLogs

Conclusion

References

Why IntelliLogs:

My personal motivations 💪 for this were:

  • Explore and experience what an AI app looks like from an architectural and engineering perspective
  • Explore the realm of huge LLMs (eg: GPT-4.1-170B, Gemini Pro etc) vs small LLMs (eg: granite-7b, gemma-4b)
  • Explore the possibilities of tuning or building a model without being a data scientist: how easy or hard it is, what tools are available etc.

We also wanted to tackle a "not too far from practical" problem. So the problems I think 🤔 we are solving here are:

  • Cost of logs: Log ingestion by enterprise-level aggregators is costly enough, and getting quality logs ingested is a pain point worth solving. Maybe a low-cost, small AI model can help here.
  • Out-of-control logs: Many product owners onboard COTS applications and have little control over log quality (eg: whether the logs are labeled correctly). On top of that, out of the box most runtimes (eg: K8s) provide limited labels (eg: info, warn, error, critical). Mixed with COTS, this often leads to weird filtering rules for ALERTs or threat issues. We are experimenting to see whether this can be solved by having AI apply additional labels.
  • When "sh*t hits the fan": POs start chasing SREs, SREs chase Devs, and everybody ends up combing through rows of logs. There may already be elegant solutions to this, but I thought I would try to reinvent the wheel here, because why not? If NASA literally did it for moon rovers, I can at least try to experiment.
  • Shortfall of traditional observability tools: I have always found that traditional observability tools (tools without AI) usually require some degree of up-skilling, and as a result observability gets pushed down to being an SRE responsibility, or even worse a developer responsibility, despite its original intention. Perhaps a conversational, AI-powered solution like "IntelliLogs" can close the gap here.

So yeah, this is not an orphan project.


What is IntelliLogs:

The IntelliLogs platform is a modular, AI-driven conversational observability solution aimed at making log analysis more efficient and accessible to everyone (tech users as well as non-tech users).

It wasn't created out of necessity; rather, our main motivation was to explore the possibilities.

 
Below is how the user story was formed (it was super organic 😉):



Below are the Functional and Non-Functional requirements we set for this application.





Finally, here's a system context diagram describing what the system does at a high level:



Here is a short demo video of IntelliLogs in action:



How IntelliLogs was developed:


It was an interesting journey from ideation to deployment, where we saw it functioning. TBH, despite the scary conspiracies about AI, I really enjoyed the process.

So when we first thought of it, I did not have a clue how to approach it. The open community on the internet was the main source of my research on how to go about implementing such a system. After a few weeks of try-outs, overcoming learning curves and a fair amount of filtering out information, I gained some clues.
I will reference the resources that were actually helpful at the end of the post.

So, my journey went something like this:




A few interesting things to highlight here:

  • No GPU for development: There was no GPU requirement to develop this application because of the LLM models chosen for the project. They are small enough (of course with smaller context windows) to run on CPU.
  • The classification model: Using Scikit-learn is pretty much the only thing here that is sort of data science-ish. But it was simple enough; basically multiclass classification using LogisticRegression. I followed this tutorial to understand it and tweaked the implementation to fit my needs (a sketch follows this list).
  • The command extraction model: At first I did not plan for this. I tried to use LLMs (eg: granite-2b-instruct, granite-7b-instruct, gemma-3-4b etc; I tried all of them and then some) that are claimed to have text extraction capability. It worked to some degree, with a moderate percentage of errors and hallucination. Then I decided to train the base model myself.
  • InstructLab: It is not really a lab; LAB stands for Large-scale Alignment for chatBots. It was a game changer in my case. I went down the path of tearing my hair out trying to train with LoRA, then pivoted to InstructLab, and it was a super simple process. I read the tutorial, installed ilab in my Docker dev environment, created qna.yaml (I was so lazy that I had ChatGPT create the qna.yaml for me) and samples.md (again, ChatGPT), then executed the ilab commands to first generate synthetic data and then train the granite-7b-base model on the new skill. Voila: the trained model was able to extract my commands from natural language with significantly improved accuracy and almost no hallucination. It was a bit painful because on CPU this process took 2 days and I wasn't sure whether it was going to work or not. This is where a GPU would probably have helped a lot; in the absence of a local GPU I could have used the InstructLab available with AI Addons on OpenShift.
  • The rest of the discoveries (eg: local model vs model server, which model size to choose, which model outputs most accurately for given prompts etc) were from a software engineering perspective, and brute-force trial and error was easy enough.
  • The most painful (not hard) part was figuring out which model server to use for local development. Finding and settling on Ollama took some time, as there are a few options available (eg: vllm, llama.cpp) and trying them out took a while.
  • Local ONNX: The "Backend" component is the core of the IntelliLogs application. During deployment it has an init container that downloads the ONNX ML model file (for custom log classification) from my Hugging Face repo.
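
For the curious, here is a minimal sketch of what the classification training and ONNX export could look like. The file name, column names and labels are illustrative assumptions, not the actual IntelliLogs code:

```python
# Minimal sketch of a multiclass log classifier plus ONNX export.
# File/column names and labels are illustrative assumptions, not the
# actual IntelliLogs code. Requires: pandas, scikit-learn, skl2onnx.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("labeled_logs.csv")  # assumed columns: "message", "label"
X_train, X_test, y_train, y_test = train_test_split(
    df["message"], df["label"], test_size=0.2, random_state=42
)

# TF-IDF features feeding a multiclass LogisticRegression.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=50_000)),
    ("lr", LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))

# Export to ONNX so the backend can run it without scikit-learn installed.
# Note: text-pipeline conversion has caveats (tokenizer differences);
# check the skl2onnx docs if the converted model misbehaves.
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import StringTensorType

onx = convert_sklearn(
    clf, initial_types=[("message", StringTensorType([None, 1]))]
)
with open("log_classification.onnx", "wb") as f:
    f.write(onx.SerializeToString())
```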

Although it wasn’t exactly a "journey to the center of the Earth", it was certainly an adventurous one.

Here's an architecture overview diagram depicting the components of the IntelliLogs system:


 

A few interesting things to highlight here:

  • Independent layers: At each layer (except the application layer) the underlying technologies are swappable. For example, the application has no hard dependency on AI Addons; in the absence of this layer the application will still be functional, just with slower response times from the AI models.
  • IntelliLogs components: The components in the application layer are also swappable (except the one core component, the Backend) without changing the core component. For example:
    • Ollama can be replaced with KServe to optimise resource utilization (a sketch of the current Ollama call follows this list)
    • The small LLMs used could be replaced with huge LLMs if the need arises
    • The proxy is currently implemented using python-flask but could be replaced with a native K8s ingress or a proxy server like nginx
    • Kafka could be replaced with AMQ, S3 or anything else that can provide logs over an API call
    • The web socket, currently implemented using python sockets, could be replaced with a high-performing queue/streaming solution or a dedicated web socket server
    • Even the frontend could be replaced with other UIs, as long as handling the backend API is implemented
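
To make the swappability concrete, here is a minimal sketch of how a backend component might call a local Ollama server over its REST API. The model tag and prompt wording are placeholder assumptions; keeping the call behind a small function is what makes swapping Ollama for KServe later a contained change:

```python
# Minimal sketch: calling a local Ollama server's /api/generate endpoint.
# The model tag and prompt wording are placeholder assumptions.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def extract_command(user_query: str) -> str:
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": "granite-7b-lab",  # hypothetical local model tag
            "prompt": f"Extract the log query command from: {user_query}",
            "stream": False,  # return one JSON object instead of a stream
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(extract_command("show me all error logs from the payment service"))
```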
Here's a sequence diagram of the Log Analysis Report process, showing how the IntelliLogs components work together to deliver the outcome to the user:


 
The classification process follows a similar sequence, except we don't need to call a model server/endpoint, as the SKLearn ML model's portable format, ONNX, is loaded from local disk in the backend component. In future I will explore the possibility of creating an inference endpoint with this log_classification.onnx, but for now it works without any issues.
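
For reference, loading and running the exported model in the backend could look roughly like this; the input shape and tensor names depend on how the model was exported, so treat them as assumptions:

```python
# Minimal sketch: classifying a log line with the local ONNX model via
# onnxruntime. Input shape/names depend on the export and are assumptions.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("log_classification.onnx")
input_name = session.get_inputs()[0].name  # eg: "message"

logs = np.array([["connection refused while reaching payments-db"]])
predicted = session.run(None, {input_name: logs})[0]
print(predicted)  # first output of an sklearn classifier export: the labels
```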

Future of IntelliLogs:


Agentic AI: As you may already be thinking, this could have simply been built with an AI agent, and yes, you are correct, it could have. My initial discovery took a bit longer because I was time-poor and this was a side project; within that time, MCP concepts and tooling accelerated and the era of Agentic AI started. I need to do more discovery on this subject and see if I can re-use the "Backend" component as a model server and train the command extraction model to make function calls.

User Feedback Loop: Enhance the Q&A part with the context of the classified logs or the log analysis report (eg: instead of reading a table or display, ask the model more questions). This seems possible, but the challenge here is token size. I need to do further discovery to understand whether Agentic AI will solve this too, or how I can provide more context to the LLM to answer users' follow-up queries while still maintaining the run-on-CPU NFR.

Remediation Automation: This is probably the low-hanging fruit. Along with text-based remediation suggestions, train the model to point to existing automation that can be executed to remediate.


Conclusion:

As organisations and humans adopt AI, I think there probably exists a conflation between ChatGPT, Gemini or CoPilot and general LLMs (especially the small ones), which may pose a risk of AI initiatives within an organisation getting lost in translation.

This IntelliLogs project is a starting point for understanding systems at a higher level of abstraction, taking us closer to intelligent, self-explaining infrastructure or an ecosystem of systems. There's also the potential to achieve true self-healing. Not sure how far is too far, or what the latest definition of "too far" is, though.
Jargon aside, this project was indeed exciting.

That's it. Happy AI-ing.





References:









