
Reimagining Logs: Building an AI-Powered Conversational Observability System

It is mid-2025 and the cogs of AI are at full speed. So we (Mobin and I) decided to do our own AI project. We called it "IntelliLogs".

IntelliLogs at a glance:




In this post I will describe why we did it, what we did, and how we did it, and share my personal experience. I hope this will, at least, be an interesting read.

Table of contents:

Why IntelliLogs

What is IntelliLogs

How IntelliLogs was developed

Future of IntelliLogs

Conclusion

References

Why IntelliLogs:

My personal motivations 💪 for this were:

  • Explore and experience what an AI app looks like from an architectural and engineering perspective
  • Explore the realm of huge LLMs (e.g. GPT-4.1-170B, Gemini Pro) vs small LLMs (e.g. granite-7b, gemma-4b)
  • Explore the possibilities of model tuning / making a model without being a data scientist: how easy or hard it is, what tools are available, etc.

We also wanted to tackle a "not too far from practical" problem. So, the problems I think 🤔 we are solving here are:

  • Cost of logs: Log ingestion by enterprise-level aggregators is costly, and getting quality logs ingested is a pain point. Maybe a low-cost, small AI model can help here.
  • Out of control logs: Many product owners onboard COTS applications with little control over log quality (e.g. whether the logs are labelled correctly). On top of that, out of the box most runtimes (e.g. K8s) provide limited labels (e.g. info, warn, error, critical). This, mixed with COTS, often leads to convoluted filtering rules for alerts or threat issues. We are experimenting to see whether AI can solve this by applying additional labels.
  • When "sh*t hits the fan": POs start chasing SREs, SREs chase devs, and everybody ends up combing through rows of logs. There may already be elegant solutions to this, but I thought I would try to reinvent the wheel here, because why not? If NASA literally did it for moon rovers, I can at least try to experiment.
  • Shortfall of traditional observability tools: I have always found that traditional observability tools (tools without AI) usually require some degree of up-skilling to work with, and as a result observability gets pushed down as an SRE responsibility, or even worse a developer responsibility, despite its original intention. Perhaps a conversational AI-powered solution like IntelliLogs can close the gap here.

So yeah, this is not an orphan project.


What is IntelliLogs:

The IntelliLogs platform is a modular, AI-driven conversational observability solution aimed at making log analysis more efficient and accessible to everyone (technical and non-technical users alike).

It wasn't created out of necessity; rather, our main motivation was to explore the possibilities.

 
Below is how the user story was formed (it was super organic 😉):



Below are the Functional and Non-Functional requirements we set for this application.





Finally, here's a system context diagram describing what the system does at a high level:



Here is a short demo video of IntelliLogs in action:



How IntelliLogs was developed:


It was an interesting journey from ideation to deployment, where we saw it functioning. TBH, despite the scary conspiracies about AI, I really enjoyed the process.

So when we first thought of it, I did not have any clue how to approach it. The open community on the internet was the main source of my research into how to implement such a system. After a few weeks of trying things out, overcoming learning curves, and a fair amount of filtering out information, I gained some clues.
I will reference the resources that were actually helpful at the end of the post.

So, my journey went something like this:




A few interesting things to highlight here:

  • No GPU for development: There was no GPU requirement to develop this application because of the LLM models chosen for the project: they are smaller (with, of course, smaller context windows) and can run on CPU.
  • The classification model: Using scikit-learn was pretty much the only part that was sort of data-science-ish, but it was simple enough: basically multiclass classification using LogisticRegression. I followed this tutorial to understand it and tweaked the implementation to fit my needs.
  • The command extraction model: At first I did not plan this. I tried to use LLMs (e.g. granite-2b-instruct, granite-7b-instruct, gemma-3-4b; I tried all of them and then some) that claim text-extraction capability. It worked to some degree, with a moderate rate of errors and hallucination. Then I decided to train the base model myself.
  • InstructLab: It is not really a lab; the LAB stands for Large-Scale Alignment for ChatBots. It was a game changer in my case. I went down the path of tearing my hair out trying to train using LoRA, then pivoted to InstructLab, and it was a super simple process. I read the tutorial, installed ilab in my Docker dev environment, created qna.yaml and samples.md (I was lazy enough to have ChatGPT generate both), and executed the ilab commands to first generate synthetic data and then train the granite-7b-base model on the new skill. Voila: the trained model was able to extract my commands from natural language with significantly improved accuracy and almost no hallucination. It was a bit painful, because on CPU this process took two days and I wasn't sure whether it was going to work. This is where a GPU probably would have worked much better; in the absence of a local GPU I could have used the InstructLab available with AI Addons on OpenShift.
  • The rest of the discoveries (e.g. local model vs model server, which model size to choose, which model produces the most accurate output for a given prompt) were software-engineering concerns, and brute-force trial and error was easy enough.
  • The most painful (not hard) part was figuring out which model server to use for local development. Finding and settling on Ollama took some time, as there are a few options available (e.g. vLLM, llama.cpp) and trying them out took a while.
  • Local ONNX: The "Backend" component is the core of the IntelliLogs application. During deployment it has an init container that downloads the ONNX ML model file (for custom log classification) from my Hugging Face repo.
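To make the classification bullet above concrete, here is a minimal sketch of a multiclass log classifier in the spirit of the one described: TF-IDF features fed into scikit-learn's LogisticRegression. The sample log lines and labels below are invented for illustration; the real IntelliLogs training data and label set are not shown here.

```python
# Minimal sketch of a multiclass log classifier, similar in spirit to the
# scikit-learn LogisticRegression approach described above.
# NOTE: the sample logs and labels are illustrative, not the real dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_logs = [
    "OOMKilled: container exceeded memory limit",
    "pod evicted due to memory pressure",
    "connection refused: upstream service unreachable",
    "timeout while connecting to database",
    "user login succeeded for account admin",
    "request served in 12ms with status 200",
]
train_labels = ["resource", "resource", "network", "network", "normal", "normal"]

# TF-IDF turns raw log lines into numeric features; LogisticRegression then
# learns decision boundaries over the label set (multiclass out of the box).
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_logs, train_labels)

# Classify an unseen log line.
print(clf.predict(["socket timeout contacting upstream service"])[0])
```

A pipeline like this can be exported to ONNX (e.g. with skl2onnx) so the backend can run it from local disk, as described later in the post.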

Although it wasn’t exactly a "journey to the center of the Earth", it was certainly an adventurous one.

Here's an architecture overview diagram depicting the components of the IntelliLogs system:


 

A few interesting things to highlight here:

  • Independent layers: At each layer (except the application layer) the underlying technologies are swappable. For example, the application has no hard dependency on AI Addons; in the absence of this layer the application remains functional, just with slower response times from the AI models.
  • IntelliLogs components: The components in the application layer are also swappable (except the one core component, the backend) without changing the core component. For example:
    • Ollama can be replaced with KServe to optimise resource utilisation.
    • The small LLMs used could be replaced with huge LLMs if the need arises.
    • The proxy is currently implemented using python-flask but could be replaced with a native K8s Ingress or a proxy server like nginx.
    • Kafka could be replaced with AMQ, S3, or anything else that can provide logs over an API call.
    • The WebSocket, currently implemented using Python sockets, could be replaced with a high-performance queue/streaming solution or a dedicated WebSocket server.
    • Even the frontend could be replaced with another UI, as long as it handles the backend API.
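Since Ollama sits behind the backend as the default model server, here is a minimal, stdlib-only sketch of how a component might call Ollama's `/api/generate` endpoint. The model name and prompt are placeholders, not the exact ones IntelliLogs uses; the point is that swapping Ollama for another server (e.g. KServe) would only mean changing `query_ollama`.

```python
# Sketch of how a backend component might call a local Ollama model server.
# Ollama exposes POST /api/generate on port 11434 by default; the model name
# and prompt below are illustrative assumptions.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> dict:
    # "stream": False asks Ollama for one JSON response instead of chunks.
    return {"model": model, "prompt": prompt, "stream": False}

def query_ollama(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama server with the model already pulled):
#   print(query_ollama("granite-code", "Summarise this log line: ..."))
```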

Here's a sequence diagram of the Log Analysis Report process, showing how the IntelliLogs components work together to deliver the outcome to the user:


 
The classification process follows a similar sequence, except we don't need to call a model server/endpoint: the SKLearn ML model's portable format, ONNX, is loaded from local disk in the backend component. In the future I will explore creating an inference endpoint with this log_classification.onnx, but for now it works without any issues.
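For the local ONNX path described above, here is a hedged sketch of what loading the classifier from disk might look like using onnxruntime. The tensor layout and the `summarise_counts` helper are assumptions for illustration, not confirmed details of the IntelliLogs backend; verify the input/output names against the actual export (e.g. with Netron).

```python
# Sketch: running the exported ONNX log classifier from local disk in the
# backend, with no model server involved. Tensor layout is an assumption
# about an sklearn pipeline exported via skl2onnx.
from collections import Counter

def classify_logs(model_path: str, log_lines: list[str]) -> list[str]:
    """Load the ONNX classifier and predict a label for each log line."""
    # Imported lazily so the rest of the backend works without onnxruntime.
    import onnxruntime as ort

    session = ort.InferenceSession(model_path)
    input_name = session.get_inputs()[0].name
    # Assumption: the exported pipeline takes raw log strings shaped [n, 1]
    # and returns predicted labels as its first output.
    preds = session.run(None, {input_name: [[line] for line in log_lines]})[0]
    return [str(p) for p in preds]

def summarise_counts(labels: list[str]) -> dict[str, int]:
    """Aggregate per-label counts, e.g. for a log analysis report."""
    return dict(Counter(labels))

# Example (assumes the init container has placed the file on local disk):
#   labels = classify_logs("log_classification.onnx", recent_log_lines)
#   report = summarise_counts(labels)
```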

Future of IntelliLogs:


Agentic AI: As you may already be thinking, this could simply have been built with an AI agent, and yes, you are correct, it could have. My initial discovery took a bit longer because I was time-poor and this was a side project; within that time MCP concepts and tooling accelerated, and the era of agentic AI started. I need to explore this subject and see whether I can reuse the "Backend" component as a model server and train the command extraction model to make function calls.

User Feedback Loop: Enhance the Q&A part with the context of the classified logs or the log analysis report (e.g. instead of reading a table or display, ask the model more questions). This seems possible, but the challenge is token size. I need to do further discovery to understand whether agentic AI will solve this too, or how I can provide more context to the LLM to answer the user's follow-up queries while still maintaining the run-on-CPU NFR.

Remediation Automation: This is probably the low-hanging fruit. Along with text-based remediation suggestions, train the model to point to existing automation that can be executed to remediate.


Conclusion:

As organisations and humans adopt AI, I think there is probably some conflation between ChatGPT, Gemini, or Copilot and general-purpose LLMs (especially the small ones), which may pose a risk of AI initiatives within an organisation getting lost in translation.

This IntelliLogs project is a starting point for understanding systems at a higher level of abstraction, taking us closer to intelligent, self-explaining infrastructure, or an ecosystem of systems. There is also potential to achieve true self-healing. Not sure how far is too far, or what the latest definition of "too far" is, though.
Jargon aside, this project was indeed exciting.

That's it. Happy AI-ing.



source: sora


References:









