AI & Infrastructure
Cutting RAG Costs with Semantic Cache — Part 2: Walking Through the Test Environment
In the previous post I wrote about why semantic caching matters for keeping AI token costs under control. With the theory covered, this post moves to the workshop: defining the measurement points and walking through the test rig where the "cache on" and "cache off" scenarios will run side by side.
Hardware and Approach
The tests run on a MacBook Pro (M5 Max, 48 GB RAM). All embedding operations happen locally, accelerated by MPS, the Apple Silicon GPU backend. That way OpenRouter credit goes only toward LLM calls, which keeps the cost measurement clean.
For test data I picked World Economic Forum reports: they are public and globally recognized; the content mixes technical and policy material; and running RAG on English documents with Turkish prompts is itself an interesting test case.
Architecture 1 — Baseline RAG (No Cache)
Every question triggers embedding, vector search, and an LLM call. That means every question pays the full token cost and waits out the full round-trip latency, every single time.
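The baseline flow can be sketched with stubs; `embed`, `vector_search`, and `call_llm` below are hypothetical stand-ins for the real components, and the point is only that every question walks the full, paid path:

```python
llm_calls = 0  # tally of paid LLM round-trips

def embed(question: str) -> list[float]:
    # Stand-in for the local sentence-transformers embedding.
    return [float(ord(c)) for c in question[:3]]

def vector_search(query_vec: list[float], k: int = 4) -> list[str]:
    # Stand-in for a top-k similarity search over the report chunks.
    return [f"chunk-{i}" for i in range(k)]

def call_llm(question: str, context: list[str]) -> str:
    # Stand-in for the OpenRouter call -- the step that costs tokens.
    global llm_calls
    llm_calls += 1
    return f"answer to: {question}"

def answer(question: str) -> str:
    ctx = vector_search(embed(question))  # embedding + retrieval (local, free)
    return call_llm(question, ctx)        # paid LLM call, every time

answer("What are the economic effects of global warming?")
answer("What is the impact of climate change on the global economy?")
print(llm_calls)  # 2 -- semantically similar questions still pay twice
```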
Architecture 2 — RAG with Semantic Cache
The critical point: the cache looks for semantic similarity, not exact matches. "What are the economic effects of global warming?" and "What is the impact of climate change on the global economy?" are two separate LLM calls in baseline; with semantic cache, the second likely gets its answer from the first.
Hypotheses
Hypothesis 1 — Tokens: Semantic cache will reduce total LLM token consumption by at least 40% vs. baseline, with no observable drop in answer quality.
Hypothesis 2 — Latency: Cache HITs will cut average response time by more than 80% vs. baseline.
Hypothesis 3 — Threshold sensitivity: The similarity threshold's sweet spot will sit around 0.85; lower values inflate the hit rate at the cost of reusing wrong answers, while higher values forfeit hits on valid paraphrases.
Hypothesis 4 — Domain effect: Policy and definition questions will have a higher cache hit rate than statistical ones.
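How the threshold-sensitivity hypothesis will be checked can be sketched as a replay sweep; the (similarity, correct-reuse) pairs below are made-up illustrative data, not measured results:

```python
# Each pair: (similarity of a query to its nearest cached query,
#             whether reusing that cached answer would have been correct).
# Fabricated numbers purely to show the shape of the evaluation.
pairs = [
    (0.97, True), (0.92, True), (0.88, True), (0.86, False),
    (0.80, False), (0.74, False), (0.91, True), (0.83, True),
]

for threshold in (0.75, 0.85, 0.95):
    hits = [ok for sim, ok in pairs if sim >= threshold]
    hit_rate = len(hits) / len(pairs)
    false_hits = sum(1 for ok in hits if not ok)
    print(f"threshold={threshold:.2f}  hit_rate={hit_rate:.3f}  false_hits={false_hits}")
```

Sweeping the same query log through several thresholds separates the two failure modes: a threshold too low serves wrong answers, one too high quietly degrades into the no-cache baseline.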
What’s Next
The next post is the build itself — bringing up Redis Stack on Docker, loading the sentence-transformers model locally, chunking the WEF reports, and running the first end-to-end baseline RAG test.