Background
My AI journey started in academia — ML, NLP, conversational systems before LLMs went mainstream. Since early 2024, I’ve been shipping LLM-powered chatbots at Brillar and Pluggi. Customer-facing bots, internal admin assistants, multilingual tool-calling agents — production systems with real users.
Two years in, I’d gone through every pattern most teams hit.
The Usual LLM Progression
Plain API
┌──────┐   ┌────────┐   ┌─────┐
│ User │──▶│ Prompt │──▶│ LLM │──▶ Out
└──────┘   └────────┘   └─────┘
✓ Simple ✗ No real data
RAG
┌──────┐   ┌───────┐   ┌────────┐   ┌─────┐
│ User │──▶│ Embed │──▶│ Vector │──▶│ LLM │──▶ Out
└──────┘   └───────┘   └────────┘   └─────┘
✓ Real data ✗ Retrieves every query
Agentic RAG
┌──────┐   ┌────────┐──need data?──▶ Vector ──▶ LLM
│ User │──▶│ Router │
└──────┘   └────────┘──no──────────────────────▶ Out
✓ Selective ✗ Can't compute or aggregate
Functional. But held together by prompt hacks.
The Shift
Then I used Claude Code. The way it called tools, reasoned over results, decided next steps — different principles entirely.
Found three Anthropic articles that clicked:
- Building Effective Agents — start simple, add complexity only when needed
- Context Engineering — the model is only as good as the context you give it
- Writing Tools for Agents — tools aren’t APIs. The consumer is an LLM, not a frontend
Not theoretical. Engineering guides with patterns I could apply. So I did.
What I Changed
BEFORE                        AFTER
──────                        ─────
     ┌──────┐                    ┌──────┐
     │ User │                    │ User │
     └──┬───┘                    └──┬───┘
        ▼                           ▼
┌─────────────────┐          ┌──────────────┐
│  Single Agent   │          │ Host Router  │
│ Generic Prompt  │          └──┬──┬──┬──┬──┘
│    13 tools     │             ▼  ▼  ▼  ▼
└────────┬────────┘          ┌──┐┌──┐┌──┐┌──┐
         ▼                   │4 ││4 ││2 ││3 │
┌─────────────────┐          └┬─┘└┬─┘└┬─┘└┬─┘
│  Raw API-style  │           └───┴───┴───┘
│    response     │                 ▼
└─────────────────┘          SUMMARY + DATA
                               + GUIDANCE

Score: 4.7/5                 Score: 5.0/5
Tool Accuracy: 94%           Tool Accuracy: 100%
Tokens: 1,675                Tokens: 1,342 (-20%)
Three changes:
1. Tool response redesign. Stopped treating tool outputs like REST responses. Added SUMMARY (what happened), DATA (the facts), GUIDANCE (what to do next). The LLM went from “I can’t compute margin” to calculating it unprompted.
2. Multi-agent split. 13 tools → 4 specialist agents (inventory, sales, purchasing, finance) with 2-4 tools each. A host router decides who handles the query. Less choice = better choice.
3. Context engineering. Input normalization (Burmese → English → process → translate back), structured prompts, tool result compaction middleware. The model gets clean, focused context every time.
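The tool-response redesign is easier to see than to describe. Here is a minimal Go sketch of the SUMMARY + DATA + GUIDANCE shape — the struct, field names, and the sample sales tool are mine for illustration, not the production schema:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ToolResponse is a hypothetical shape for the SUMMARY + DATA + GUIDANCE
// format: not a raw API payload, but a message written for an LLM reader.
type ToolResponse struct {
	Summary  string         `json:"summary"`  // what happened, in one sentence
	Data     map[string]any `json:"data"`     // the facts the model can compute over
	Guidance string         `json:"guidance"` // what to consider doing next
}

// salesReport shows the difference: instead of dumping rows, the tool tells
// the model what it is looking at and what the numbers can be used for.
func salesReport(revenue, cost float64) ToolResponse {
	return ToolResponse{
		Summary: fmt.Sprintf("Sales report ready: revenue %.2f, cost %.2f.", revenue, cost),
		Data:    map[string]any{"revenue": revenue, "cost": cost},
		Guidance: "If the user asks about profitability, compute margin as " +
			"(revenue - cost) / revenue from the data above.",
	}
}

func main() {
	out, _ := json.MarshalIndent(salesReport(1200, 900), "", "  ")
	fmt.Println(string(out))
}
```

The GUIDANCE field is what turned "I can't compute margin" into the model doing the arithmetic unprompted: the response tells the model what the numbers are for, not just what they are.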
Results
Built a 55-case benchmark suite. Ran it before and after.
| Metric | Before | After |
|---|---|---|
| Avg Score | 4.7 / 5 | 5.0 / 5 |
| Tool Accuracy | 94.3% | 100% |
| Avg Tokens | 1,675 | 1,342 |
| Failed Cases | 1 | 0 |
Also tested locally with Qwen 2.5 7B on a Mac Mini M4 — same architecture, same benchmark: 4.5/5, 85.7% tool accuracy. The architecture carried a 7B model further than its size should allow.
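The scoring loop behind the Tool Accuracy numbers can be sketched like this — `BenchCase` and the keyword router are stand-ins I made up, not the real 55-case suite or the real host agent:

```go
package main

import "fmt"

// BenchCase is a hypothetical shape for one benchmark case — the field
// names are illustrative, not the actual suite's schema.
type BenchCase struct {
	Query        string
	ExpectedTool string
}

// Router stands in for the host agent's tool-routing decision.
type Router func(query string) string

// toolAccuracy scores how often the router picked the expected tool,
// mirroring the Tool Accuracy metric in the table above.
func toolAccuracy(cases []BenchCase, route Router) float64 {
	correct := 0
	for _, c := range cases {
		if route(c.Query) == c.ExpectedTool {
			correct++
		}
	}
	return 100 * float64(correct) / float64(len(cases))
}

func main() {
	cases := []BenchCase{
		{"How many units of SKU-42 are in stock?", "inventory"},
		{"What was total revenue last week?", "sales"},
	}
	// Keyword stub standing in for the real LLM router.
	route := func(q string) string {
		if q == "What was total revenue last week?" {
			return "sales"
		}
		return "inventory"
	}
	fmt.Printf("tool accuracy: %.1f%%\n", toolAccuracy(cases, route))
}
```

Running the same deterministic loop before and after a change is what makes a regression visible at all.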
What I Learned
Tools are not APIs. A REST endpoint serves a frontend. A tool serves an LLM’s reasoning. Format them differently.
Fewer tools = better decisions. 13 tools confused the model. 3-4 per domain = near-perfect accuracy. Same principle as good UX.
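The point is structural, not prompt magic: pick one specialist, then expose only its 2-4 tools. A sketch of that host-router shape, with a keyword stub in place of the LLM routing decision — specialist and tool names are invented:

```go
package main

import (
	"fmt"
	"strings"
)

// Specialist is a domain agent with a deliberately small tool set —
// the 4/4/2/3 split from the diagram above.
type Specialist struct {
	Domain string
	Tools  []string
}

// routeQuery is a keyword stub standing in for the host router. The shape
// is what matters: the chosen specialist's few tools are all the model
// ever sees for that query.
func routeQuery(q string, specialists []Specialist) Specialist {
	lq := strings.ToLower(q)
	for _, s := range specialists {
		if strings.Contains(lq, s.Domain) {
			return s
		}
	}
	return specialists[0] // fall back to the first specialist
}

func main() {
	specialists := []Specialist{
		{"inventory", []string{"stock_level", "stock_history", "reorder_point", "stock_adjust"}},
		{"sales", []string{"sales_report", "top_products", "customer_orders", "refunds"}},
		{"purchasing", []string{"open_pos", "supplier_list"}},
		{"finance", []string{"pnl", "cashflow", "margin"}},
	}
	s := routeQuery("How were sales last month?", specialists)
	fmt.Println(s.Domain, "handles it with", len(s.Tools), "tools")
}
```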
The model is rarely the bottleneck. Gemini Flash (mid-tier) scored 5.0/5. Qwen 7B (local) scored 4.5/5. Context engineering > model size.
Benchmark first. One of my prompt rewrites looked cleaner but scored 4.3/5 — a regression. Without measurement I would’ve shipped it.
Stack
- Go + CloudWeGo Eino (multi-agent framework)
- Gemini 2.5 Flash (production) / Qwen 2.5 7B via Ollama (local)
- Host multi-agent with 4 domain specialists
- 55-case automated benchmark suite
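The input-normalization flow mentioned earlier (Burmese → English → process → translate back) reduces to function composition. A sketch with stubbed stages — in production each stub would be a model call:

```go
package main

import "fmt"

// Step is one stage of the normalization pipeline. The stubs in main only
// mark the stages; they are placeholders, not real translation.
type Step func(string) string

// pipeline chains steps left to right into a single handler.
func pipeline(steps ...Step) Step {
	return func(in string) string {
		for _, s := range steps {
			in = s(in)
		}
		return in
	}
}

func main() {
	toEnglish := func(s string) string { return "[en] " + s }   // stub: Burmese → English
	process := func(s string) string { return s + " → answer" } // stub: agent call
	toBurmese := func(s string) string { return "[my] " + s }   // stub: English → Burmese

	handle := pipeline(toEnglish, process, toBurmese)
	fmt.Println(handle("မင်္ဂလာပါ"))
}
```

Keeping the agent's working context in one language is what makes the rest of the context engineering (structured prompts, result compaction) predictable.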
The biggest shift: I stopped chasing the “best model” and started engineering the context around it. The model is a reasoning engine — my job is to give it the right inputs, the right tools, and the right constraints.