Background
My AI journey started in academia — ML, NLP, conversational systems before LLMs went mainstream. Since early 2024, I’ve been shipping LLM-powered chatbots at Brillar and Pluggi. Customer-facing bots, internal admin assistants, multilingual tool-calling agents — production systems with real users.
Two years in, I’d gone through every pattern most teams hit.
The Usual LLM Progression
Plain API
┌──────┐   ┌────────┐   ┌─────┐
│ User │──▶│ Prompt │──▶│ LLM │──▶ Out
└──────┘   └────────┘   └─────┘
✓ Simple ✗ No real data
RAG
┌──────┐   ┌───────┐   ┌────────┐   ┌─────┐
│ User │──▶│ Embed │──▶│ Vector │──▶│ LLM │──▶ Out
└──────┘   └───────┘   └────────┘   └─────┘
✓ Real data ✗ Retrieves every query
Agentic RAG
┌──────┐   ┌────────┐──need data?──▶ Vector ──▶ LLM
│ User │──▶│ Router │
└──────┘   └────────┘──no──────────────────────▶ Out
✓ Selective ✗ Can't compute or aggregate
Functional. But held together by prompt hacks.
The Shift
Then I used Claude Code. The way it called tools, reasoned over results, decided next steps — different principles entirely.
Found three Anthropic articles that clicked:
- Building Effective Agents — start simple, add complexity only when needed
- Context Engineering — the model is only as good as the context you give it
- Writing Tools for Agents — tools aren’t APIs. The consumer is an LLM, not a frontend
Not theoretical. Engineering guides with patterns I could apply. So I did.
What I Changed
BEFORE                        AFTER
──────                        ─────
     ┌──────┐                    ┌──────┐
     │ User │                    │ User │
     └──┬───┘                    └──┬───┘
        ▼                           ▼
┌─────────────────┐          ┌──────────────┐
│  Single Agent   │          │ Host Router  │
│ Generic Prompt  │          └──┬──┬──┬──┬──┘
│    13 tools     │             ▼  ▼  ▼  ▼
└────────┬────────┘          ┌──┐┌──┐┌──┐┌──┐
         ▼                   │4 ││4 ││2 ││3 │
┌─────────────────┐          └┬─┘└┬─┘└┬─┘└┬─┘
│  Raw API-style  │           └───┴───┴───┘
│    response     │                 ▼
└─────────────────┘          SUMMARY + DATA
                               + GUIDANCE

Score: 4.7/5                 Score: 5.0/5
Tool Accuracy: 94%           Tool Accuracy: 100%
Tokens: 1,675                Tokens: 1,342 (-20%)
Three changes:
1. Tool response redesign. Stopped treating tool outputs like REST responses. Added SUMMARY (what happened), DATA (the facts), GUIDANCE (what to do next). The LLM went from “I can’t compute margin” to calculating it unprompted.
2. Multi-agent split. 13 tools → 4 specialist agents (inventory, sales, purchasing, finance) with 2-4 tools each. A host router decides who handles the query. Less choice = better choice.
3. Context engineering. Input normalization (Burmese → English → process → translate back), structured prompts, tool result compaction middleware. The model gets clean, focused context every time.
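The tool-response redesign is easier to see than to describe. Here is a minimal Go sketch of the SUMMARY + DATA + GUIDANCE shape — the struct, field names, and the sample sales tool are mine for illustration, not the production schema:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ToolResponse is a hypothetical shape for the SUMMARY + DATA + GUIDANCE
// format: not a raw API payload, but a message written for an LLM reader.
type ToolResponse struct {
	Summary  string         `json:"summary"`  // what happened, in one sentence
	Data     map[string]any `json:"data"`     // the facts the model can compute over
	Guidance string         `json:"guidance"` // what to consider doing next
}

// salesReport shows the difference: instead of dumping rows, the tool tells
// the model what it is looking at and what the numbers can be used for.
func salesReport(revenue, cost float64) ToolResponse {
	return ToolResponse{
		Summary: fmt.Sprintf("Sales report ready: revenue %.2f, cost %.2f.", revenue, cost),
		Data:    map[string]any{"revenue": revenue, "cost": cost},
		Guidance: "If the user asks about profitability, compute margin as " +
			"(revenue - cost) / revenue from the data above.",
	}
}

func main() {
	out, _ := json.MarshalIndent(salesReport(1200, 900), "", "  ")
	fmt.Println(string(out))
}
```

The GUIDANCE field is what turned "I can't compute margin" into the model doing the arithmetic unprompted: the response tells the model what the numbers are for, not just what they are.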
Results
Built a 55-case benchmark suite. Ran it before and after.
| Metric | Before | After |
|---|---|---|
| Avg Score | 4.7 / 5 | 5.0 / 5 |
| Tool Accuracy | 94.3% | 100% |
| Avg Tokens | 1,675 | 1,342 |
| Failed Cases | 1 | 0 |
Also tested locally with Qwen 2.5 7B on a Mac Mini M4 — same architecture, same benchmark: 4.5/5, 85.7% tool accuracy. The architecture carried a 7B model further than its size should allow.
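The scoring loop behind the Tool Accuracy numbers can be sketched like this — `BenchCase` and the keyword router are stand-ins I made up, not the real 55-case suite or the real host agent:

```go
package main

import "fmt"

// BenchCase is a hypothetical shape for one benchmark case — the field
// names are illustrative, not the actual suite's schema.
type BenchCase struct {
	Query        string
	ExpectedTool string
}

// Router stands in for the host agent's tool-routing decision.
type Router func(query string) string

// toolAccuracy scores how often the router picked the expected tool,
// mirroring the Tool Accuracy metric in the table above.
func toolAccuracy(cases []BenchCase, route Router) float64 {
	correct := 0
	for _, c := range cases {
		if route(c.Query) == c.ExpectedTool {
			correct++
		}
	}
	return 100 * float64(correct) / float64(len(cases))
}

func main() {
	cases := []BenchCase{
		{"How many units of SKU-42 are in stock?", "inventory"},
		{"What was total revenue last week?", "sales"},
	}
	// Keyword stub standing in for the real LLM router.
	route := func(q string) string {
		if q == "What was total revenue last week?" {
			return "sales"
		}
		return "inventory"
	}
	fmt.Printf("tool accuracy: %.1f%%\n", toolAccuracy(cases, route))
}
```

Running the same deterministic loop before and after a change is what makes a regression visible at all.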
What I Learned
Tools are not APIs. A REST endpoint serves a frontend. A tool serves an LLM’s reasoning. Format them differently.
Fewer tools = better decisions. 13 tools confused the model. 3-4 per domain = near-perfect accuracy. Same principle as good UX.
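The point is structural, not prompt magic: pick one specialist, then expose only its 2-4 tools. A sketch of that host-router shape, with a keyword stub in place of the LLM routing decision — specialist and tool names are invented:

```go
package main

import (
	"fmt"
	"strings"
)

// Specialist is a domain agent with a deliberately small tool set —
// the 4/4/2/3 split from the diagram above.
type Specialist struct {
	Domain string
	Tools  []string
}

// routeQuery is a keyword stub standing in for the host router. The shape
// is what matters: the chosen specialist's few tools are all the model
// ever sees for that query.
func routeQuery(q string, specialists []Specialist) Specialist {
	lq := strings.ToLower(q)
	for _, s := range specialists {
		if strings.Contains(lq, s.Domain) {
			return s
		}
	}
	return specialists[0] // fall back to the first specialist
}

func main() {
	specialists := []Specialist{
		{"inventory", []string{"stock_level", "stock_history", "reorder_point", "stock_adjust"}},
		{"sales", []string{"sales_report", "top_products", "customer_orders", "refunds"}},
		{"purchasing", []string{"open_pos", "supplier_list"}},
		{"finance", []string{"pnl", "cashflow", "margin"}},
	}
	s := routeQuery("How were sales last month?", specialists)
	fmt.Println(s.Domain, "handles it with", len(s.Tools), "tools")
}
```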
The model is rarely the bottleneck. Gemini Flash (mid-tier) scored 5.0/5. Qwen 7B (local) scored 4.5/5. Context engineering > model size.
Benchmark first. One of my prompt rewrites looked cleaner but scored 4.3/5 — a regression. Without measurement I would’ve shipped it.
Stack
- Go + CloudWeGo Eino (multi-agent framework)
- Gemini 2.5 Flash (production) / Qwen 2.5 7B via Ollama (local)
- Host multi-agent with 4 domain specialists
- 55-case automated benchmark suite
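The input-normalization flow mentioned earlier (Burmese → English → process → translate back) reduces to function composition. A sketch with stubbed stages — in production each stub would be a model call:

```go
package main

import "fmt"

// Step is one stage of the normalization pipeline. The stubs in main only
// mark the stages; they are placeholders, not real translation.
type Step func(string) string

// pipeline chains steps left to right into a single handler.
func pipeline(steps ...Step) Step {
	return func(in string) string {
		for _, s := range steps {
			in = s(in)
		}
		return in
	}
}

func main() {
	toEnglish := func(s string) string { return "[en] " + s }   // stub: Burmese → English
	process := func(s string) string { return s + " → answer" } // stub: agent call
	toBurmese := func(s string) string { return "[my] " + s }   // stub: English → Burmese

	handle := pipeline(toEnglish, process, toBurmese)
	fmt.Println(handle("မင်္ဂလာပါ"))
}
```

Keeping the agent's working context in one language is what makes the rest of the context engineering (structured prompts, result compaction) predictable.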
The biggest shift: I stopped chasing the “best model” and started engineering the context around it. The model is a reasoning engine — my job is to give it the right inputs, the right tools, and the right constraints.