It is mid-2025 and the cogs of AI are at full speed. So we (Mobin and I) decided to do our own AI project. We called it "IntelliLogs".
IntelliLogs at a glance:
Why IntelliLogs:
Personal motivations 💪 for this were:
- Explore and experience what an AI app looks like from an architectural and engineering perspective
- Explore the realm of huge LLMs (eg: GPT-4.1-170B, Gemini Pro etc) vs small LLMs (eg: granite-7b, gemma-4b)
- Explore the possibilities of model tuning / making a model without being a data scientist: how easy or hard it is, what tools are available, etc.
We also wanted to tackle a "not too far from practical" problem. So, the problems, I think 🤔, we are solving here are:
- Cost of logs: Log ingestion by enterprise-level aggregators is costly, and getting quality logs ingested is a pain point to solve. Maybe a low-cost, small AI can help here.
- Out-of-control logs: Many product owners onboard COTS applications and have little control over log quality (eg: whether the logs are labeled correctly). On top of that, out of the box most runtimes (eg: K8s) provide limited labels (eg: info, warn, error, critical). Mixed with COTS, this often leads to weird filtering rules for alerts or threat issues. We are experimenting to see if AI can solve this by adding extra labels.
- When "sh*t hits the fan" POs starts chasing SREs, SREs chase Devs and every body ends up combing through rows of logs. There may already be existing elegant solution to this. But I thought I would try to re-invent the wheels here, cause why not? If NASA literally did it for moon rovers I can at least try to experiment.
- Shortfall of traditional observability tools: I have always found that traditional observability tools (tools without AI) usually require some degree of up-skilling to work with, and as a result the work gets pushed down as an SRE responsibility, or even worse a developer responsibility, despite the tools' original intention. Perhaps a conversational, AI-powered solution like "IntelliLogs" can close the gap here.
So yeah, this is not an orphan project.
What is IntelliLogs:
The IntelliLogs platform is a modular, AI-driven conversational observability solution aimed at making log analysis more efficient and accessible to everyone (technical and non-technical users alike).
It wasn't created out of necessity; rather, our main motivation was to explore the possibilities.
Below is how the user story was formed (it was super organic 😉):
How IntelliLogs was developed:
- No GPU for development: There was no GPU requirement to develop this application because of the LLM models chosen for the project. They are small enough (with smaller context windows, of course) to run on CPU.
- The classification model: Using Scikit-learn is pretty much the only part that is sort of data science-ish, but it was simple enough: basically multiclass classification using LogisticRegression (see the first sketch after this list). I followed this tutorial to understand it and tweaked the implementation to fit my needs.
- The command extraction model: At first I did not plan for this. I tried LLMs that are claimed to have text extraction capability (eg: granite-2b-instruct, granite-7b-instruct, gemma-3-4b etc; I tried all of them and then some). It worked to some degree, with a moderate percentage of errors and hallucination (a sketch of this prompting approach follows the list). Then I decided to train the base model.
- InstructLab: It is not really a lab; the LAB stands for Large-Scale Alignment for ChatBots. It was a game changer in my case. I went down the path of tearing my hair out trying to train using LoRA, then pivoted to InstructLab, and it was a super simple process. I simply read the tutorial, installed ilab in my Docker dev environment, created qna.yaml (I was so lazy that I used ChatGPT to create the qna.yaml for me; a sketch is shown after this list) and samples.md (again, ChatGPT), and executed the ilab commands to first generate synthetic data and then train the granite-7b-base model on the new skill. Voila: the trained model extracted my commands from natural language with significantly improved accuracy and almost no hallucination. It was a bit painful because on CPU this process took 2 days, and I wasn't sure whether it was going to work or not. This is where a GPU probably would have worked much better; in the absence of a local GPU, I could have used the InstructLab available with AI Addons on OpenShift.
- The rest of the discoveries (eg: local model vs model server, which model size to choose, which model outputs most accurately for a given prompt, etc.) were from a software engineering perspective, and brute-force trial and error was easy enough.
- The most painful (not hard) part was figuring out which model server to use for local development. Finding and settling on Ollama took some time, as there are a few options available (eg: vLLM, llama.cpp) and trying them out took a while.
- Local ONNX: The "Backend" component is the core of the IntelliLogs application. During deployment it has an init container that downloads the ONNX ML model file (for custom log classification) from my Hugging Face repo; a loading-and-inference sketch is shown after this list.
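To make the classification bullet concrete, here is a minimal sketch of that approach: TF-IDF features feeding a multiclass LogisticRegression. The log lines and label set below are made up for illustration; the real model was trained on actual log data.

```python
# Minimal multiclass log classification with scikit-learn:
# TF-IDF features + LogisticRegression in a single pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data; the real dataset is labeled log lines.
logs = [
    "Connection to db timed out after 30s",
    "User login succeeded for user=alice",
    "OutOfMemoryError in worker pod",
    "Payment request completed in 120ms",
]
labels = ["network", "auth", "resource", "app"]  # hypothetical label set

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),  # handles multiclass natively
])
model.fit(logs, labels)

print(model.predict(["db connection refused on port 5432"]))  # eg: ['network']
```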
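And a sketch of the prompt-based command extraction that was tried first, via Ollama's REST API. The model name, prompt wording, and command format here are simplified stand-ins, not the project's actual prompts.

```python
# Ask a small local LLM (served by Ollama) to extract a command from
# natural language. Assumes `ollama serve` is running locally with the
# model already pulled (eg: `ollama pull gemma3:4b`).
import requests

PROMPT = """Extract the intended log command from the user request.
Reply with only the command, nothing else.

User request: show me all error logs from the payment service
Command:"""

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma3:4b", "prompt": PROMPT, "stream": False},
    timeout=120,
)
print(resp.json()["response"].strip())
```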
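The qna.yaml itself is just seed question/answer pairs teaching the new skill. A rough sketch of its shape (the exact schema fields depend on your ilab version, and the examples and created_by value here are placeholders):

```yaml
version: 2
task_description: Extract the log query command from a natural language request
created_by: your-github-username
seed_examples:
  - question: Show me all error logs from the payment service
    answer: "filter level=error service=payment"
  - question: How many warnings did the checkout pod produce today?
    answer: "count level=warn pod=checkout since=24h"
```

From there, ilab's generate and train steps (the exact subcommand names depend on the ilab version) produce the synthetic data and the tuned model.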
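Finally, the Local ONNX bullet sketched end to end: download the model file from Hugging Face (as the init container does) and serve predictions with onnxruntime. The repo id and filename are hypothetical, and this assumes the scikit-learn pipeline was exported with a string input (eg: via skl2onnx).

```python
# Download the classifier from Hugging Face and run it with onnxruntime,
# mimicking what the init container + Backend do at startup.
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="someuser/intellilogs-classifier",  # hypothetical repo
    filename="log_classifier.onnx",             # hypothetical filename
)

session = ort.InferenceSession(model_path)
input_name = session.get_inputs()[0].name

# skl2onnx-style text pipelines take a string tensor of shape (n, 1).
batch = np.array([["Connection to db timed out after 30s"]])
outputs = session.run(None, {input_name: batch})
print(outputs[0])  # predicted label(s)
```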
A few interesting things to highlight here are:
- Independent layers: At each layer (except the application layer) the underlying technologies are swappable. For example, the application has no hard dependency on AI Addons; in the absence of this layer the application will still function, just with slower response times from the AI models.
- IntelliLogs components: The components in the application layer are also swappable (except the one core component, called Backend) without changing that core component. For example:
- Ollama can be replaced with KServe to optimise resource utilization
- The small LLMs used could be replaced with huge LLMs if the need arises
- The proxy is currently implemented using Python Flask but could be replaced with a native K8s ingress or a proxy server like nginx (see the sketch after this list).
- Kafka could be replaced with AMQ or S3 or anything else that can provide logs over an API call
- The web socket is currently implemented using Python's socket library and could be replaced with a high-performance queue/streaming system or a dedicated web socket server.
- Even the frontend could be replaced with another UI, as long as it implements handling of the backend API.
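To give a feel for how thin these swappable pieces are, here is a minimal sketch of the Flask proxy idea from the list above. The route shape, environment variable, and port are illustrative assumptions, not the actual IntelliLogs code.

```python
# A tiny Flask reverse proxy: forward /api/* requests to the backend service.
import os

import requests
from flask import Flask, Response, request

app = Flask(__name__)
BACKEND_URL = os.environ.get("BACKEND_URL", "http://backend:8000")  # assumed

@app.route("/api/<path:path>", methods=["GET", "POST"])
def proxy(path):
    # Forward the request to the backend largely as-is.
    resp = requests.request(
        method=request.method,
        url=f"{BACKEND_URL}/api/{path}",
        headers={k: v for k, v in request.headers if k.lower() != "host"},
        data=request.get_data(),
        timeout=30,
    )
    return Response(resp.content, status=resp.status_code,
                    content_type=resp.headers.get("Content-Type"))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Because the backend is just an HTTP API behind this route, swapping the proxy for an ingress or nginx is a config change, not a code change.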
Future of IntelliLogs:
Conclusion:
This IntelliLogs project is a starting point for understanding systems at a higher level of abstraction, taking us closer to intelligent, self-explaining infrastructure, or an ecosystem of such systems. There's also potential to achieve true self-healing. Not sure how far is too far, or what the latest definition of "too far" is, though.