TikTok serves millions of users around the world every day. To keep the platform stable, secure, and easy to use, our engineering teams are constantly working behind the scenes to improve reliability, performance, and user experience at scale.
To reduce manual QA work and improve efficiency, we built an internal AI agent to assist with testing. But soon after launch, we ran into growing inference costs. To address this, we worked with AWS to deploy multiple models through Bedrock's unified API. This setup made it easier to experiment and optimize. By introducing prompt caching, we were able to cut costs by 50%, with just a few lines of code.
In this post, we'll discuss what prompt caching is, how it works, our implementation, and the results. The chart below illustrates the cost reduction achieved by our AI agent, comparing the theoretical costs without caching to the actual costs after caching was implemented. This resulted in approximately 50% savings.
Our search for fast cost reduction solutions amid exponential usage
Since the launch of our internal QA AI Agent in January 2025, various business lines at TikTok have adopted AI to generate and run natural-language test cases for Web and Mobile applications, integrating them into scheduled jobs and regression pipelines. Unlike traditional programs, AI responses are context-dependent and incur a cost on every run, and those costs are expected to rise with increased adoption.
Dollar cost increases with the number of AI test cases
Caching is a common software engineering technique for optimizing cost and time. Current open-source UI automation frameworks often cache AI planning steps and DOM XPath selectors. With this approach, a successful cache hit reduces the number of calls to the LLM, thereby improving both cost efficiency and response time.
However, applying similar techniques within our company presented some challenges:
- The majority of our users run the AI Agent for mobile automation, where retrieving a selector from a position (available on Web via DOM.elementFromPoint) is not available. Supporting this would mean building the feature separately for Android, iOS, and WebView, or caching raw X,Y coordinates, which would quickly go stale across the variety of device viewports.
- User experience: How do we provide a consistent experience for invalidating, updating, and debugging the cache when tests run locally versus in the cloud? And what happens when the cache fails midway through a run?
- Most apps under test undergo UI changes that quickly make the cache unusable: unexpected popups, A/B tests, and multiple device viewports.
Ultimately, the significant engineering effort required to implement this type of caching, combined with the low likelihood of cache hits, made it difficult to adopt for our AI agent. Fortunately, modern LLMs offer a technique known as prompt caching. In this post, we'll explain what prompt caching is, how it works, and how it helped us reduce costs by 50%.
What is Prompt Caching
Large Language Model (LLM) prompts often include repetitive segments, such as system messages, global knowledge, or tool instructions. Prompt caching addresses this by reusing computations for identical prompt sections across multiple API calls, avoiding redundant processing for each request. Model providers estimate that this can reduce costs by up to 90%.
In the next section, we'll dive deeper into how it works. But first, here's a brief overview of some key concepts behind prompt caching:
Prompt Caching only applies to Input Tokens
If you look at a typical model pricing page, costs are often split into two parts: input tokens (for prompt and context processing) and output tokens (for response generation).
Prompt caching only optimizes input token processing. While output tokens are usually priced higher per token, input tokens dominate overall costs in practice. In our agent's usage, they account for around 78% of the total cost. This makes optimizing inputs a key opportunity for cost savings.
Sample Claude Pricing table
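To make that split concrete, here is a rough back-of-the-envelope calculation with purely hypothetical prices and token counts (not our actual figures): a typical agent turn sends a long prompt (system message, tool definitions, conversation history, screen context) but receives a comparatively short response.
// Hypothetical per-million-token prices and a hypothetical agent turn, for illustration only.
const INPUT_PRICE_PER_M = 3;    // $ per 1M input tokens
const OUTPUT_PRICE_PER_M = 15;  // $ per 1M output tokens (pricier per token)
const inputTokens = 20_000;     // long prompt: system message, tools, history, screen context
const outputTokens = 800;       // short structured action plan
const inputCost = (inputTokens / 1_000_000) * INPUT_PRICE_PER_M;    // $0.060
const outputCost = (outputTokens / 1_000_000) * OUTPUT_PRICE_PER_M; // $0.012
console.log(inputCost / (inputCost + outputCost)); // ~0.83 → input tokens dominate the bill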
Prompt Caching optimizes both time and dollar costs
The initial cache write operation incurs higher costs than standard input token processing (~25% higher) due to additional computational overhead – including cache storage mechanics and prompt representation calculations. However, subsequent cache reads are billed at reduced rates (~90% lower). While exact savings vary by vendor, proper cache implementation usually results in significant cost reductions.
At the same time, prompt caching improves performance by eliminating redundant processing of cached segments. This directly reduces Time to First Byte (TTFB), a key metric that measures how quickly users receive the initial response.
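To sketch the math, assume a cache write is billed at 1.25x the standard input rate and a cache read at 0.1x (the approximate multipliers above; exact rates vary by vendor), and that a stable 10K-token prefix is reused across 20 calls:
// Illustrative only: a 10K-token stable prefix (system prompt + tools) reused across 20 calls.
const PREFIX_TOKENS = 10_000;
const CALLS = 20;
const WRITE_MULTIPLIER = 1.25; // first call writes the cache (~25% surcharge)
const READ_MULTIPLIER = 0.1;   // subsequent calls read it (~90% discount)
const withoutCache = CALLS * PREFIX_TOKENS;            // 200,000 token-equivalents at the base rate
const withCache = PREFIX_TOKENS * WRITE_MULTIPLIER     // 12,500 for the initial write
  + (CALLS - 1) * PREFIX_TOKENS * READ_MULTIPLIER;     // 19,000 for the 19 cache reads
console.log(1 - withCache / withoutCache); // ~0.84 → roughly 84% saved on that prefix
The 25% write surcharge is repaid on the very first cache hit, so break-even comes almost immediately.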
Prompt Caching is Provider-Managed
Leading model providers deliver near-automatic caching implementations, eliminating engineering overhead for hardware provisioning and caching logic management. Two industry-standard approaches exist:
- Implicit Caching:
  - The system automatically identifies and caches reusable prompt segments without developer intervention
  - Estimated savings of around 50% (as referenced from OpenAI's GPT documentation)
  - Model providers: OpenAI (GPT) and Google (Gemini)
- Explicit Caching:
  - Developers explicitly mark which prompt segments should be cached (for example, via a cache_control parameter)
  - Cache reads are billed at roughly 90% below the standard input-token rate
  - Model providers: Anthropic (Claude), including through Amazon Bedrock
How does Prompt Caching work
Prompt Caching was introduced in a 2024 paper Prompt Cache: Modular Attention Reuse for Low Latency Inference. Although the specific technical details may differ between model providers, the overall concept is the same, which we will try to simplify here.
Let's break down the title of the paper: "Modular Attention," "Reuse," and "Low Latency Inference."
Low Latency Inference
Modular Attention Reuse is a bit more complicated, so let's explain Low Latency Inference first. Inference is the step where a model processes an input prompt and generates an output, influenced by sampling parameters such as temperature and top-k/top-p. Unlike traditional caching, where a given input returns a fixed output, prompt caching only speeds up output token generation; it does not make the output fixed.
Modular Attention
"Attention mechanism" is a crucial part of the Transformer Architecture. Transformer Architecture is the go-to architecture for popular text-generative models such as OpenAI's GPT (The T in GPT!), Anthropic Claude, and Google Gemini. The architecture itself is a big topic, and we will only give a very abstract explanation for the "Attention mechanism" here. (For a more detailed breakdown, we recommend this blog which explains and visualizes Transformer Architecture)
The Attention mechanism captures information about each token and how it relates to other tokens. It calculates Attention scores, which determine how much focus or weight each token should receive when generating predictions. A higher score means a stronger influence on the model’s output.
Referring to the Attention Mechanism in the image above:
- The prompt has 8 tokens ("You," "are," "a," "professional," "automated," "Q," "A," "Engineer")
- The attention mechanism calculates the attention score for each token in relation to the other tokens
- "You" with "You," "are," "a," "professional," "automated," "Q," "A," "Engineer"
- "are" with "You," "are," "a," "professional," "automated," "Q," "A," "Engineer"
- "a" with "You," "are," "a," "professional," "automated," "Q," "A," "Engineer"
- <... and repeat for all tokens ... >
- In this example, let's take note of the attention score of the token "professional" in relation to the token "automated" (Attention Score: 0.40)
Now, let's analyze these 3 prompts:
Prompt 1
You are a professional automated QA Engineer
Token "professional" in relation to token "automated":
Attention Score: 0.40
Prompt 2
You are a professional automated QA Engineer at TikTok
Token "professional" in relation to token "automated":
Attention Score: 0.40
Prompt 3
As a professional automated QA Engineer
Token "professional" in relation to token "automated":
Attention Score: 0.43
Prompt 1 and Prompt 2 ("You are a professional automated...") will both result in the same attention scores (0.40). However, changing the tokens before "professional automated" in Prompt 3 ("As a...") results in a different attention score (0.43). Tokens after it do not change the attention score.
This example shows the part of the Transformer architecture that involves "fixed computations." Even though the same input tokens can produce different output tokens, the attention-score computation over the input is deterministic, which creates the opportunity for caching. It also shows how the input sequence influences attention scores, which helps explain why a seemingly "small" change like the one in Prompt 3 can result in a cache miss. It's also worth noting that changing the temperature and other sampling parameters does not change the attention scores.
Attention score at Temperature 2
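To make the "fixed computation" idea concrete, here is a toy sketch with made-up two-dimensional embeddings and a causal attention rule (each token only attends to itself and the tokens before it). The numbers are arbitrary and nothing like a real model's weights, but the behavior mirrors the example above: appending tokens after the prefix leaves earlier scores untouched, while changing the prefix shifts them.
// Toy illustration only: invented 2-D "embeddings" per token, not real model weights.
const EMBED = {
  You: [0.2, 0.1], are: [0.1, 0.3], a: [0.05, 0.05], As: [0.4, -0.2],
  professional: [0.5, 0.4], automated: [0.3, 0.6],
  QA: [0.2, 0.2], Engineer: [0.1, 0.4], at: [0.0, 0.1], TikTok: [0.3, 0.3],
};
const dot = (x, y) => x.reduce((sum, v, i) => sum + v * y[i], 0);
// Causal attention row: softmax over the query token and everything before it.
function attentionRow(tokens, queryIndex) {
  const q = EMBED[tokens[queryIndex]];
  const logits = tokens.slice(0, queryIndex + 1).map((t) => dot(q, EMBED[t]));
  const max = Math.max(...logits);
  const exps = logits.map((l) => Math.exp(l - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}
// Attention score linking "professional" and "automated" (the later token attends to the earlier one).
function score(tokens) {
  return attentionRow(tokens, tokens.indexOf("automated"))[tokens.indexOf("professional")];
}
const prompt1 = ["You", "are", "a", "professional", "automated", "QA", "Engineer"];
const prompt2 = [...prompt1, "at", "TikTok"];                               // extra tokens appended after the prefix
const prompt3 = ["As", "a", "professional", "automated", "QA", "Engineer"]; // prefix changed
console.log(score(prompt1) === score(prompt2)); // true  - identical prefix, identical score
console.log(score(prompt1) === score(prompt3)); // false - different prefix, different score
This prefix property is what providers exploit: the attention state computed for an unchanged prefix (often called the KV cache) can be stored and reused verbatim on the next call.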
(Modular Attention) Reuse
Now that we understand the terms "Modular Attention" and "Low Latency Inference," we can see that prompt caching works by reusing the fixed portion of input processing—specifically, the attention scores—to speed up the generation of output tokens.
Generation with Prompt Cache
When is it "reused"? A key concept in prompt caching is the Prompt Markup Language (PML): explicit markers that define which parts of the input prompt should be considered for caching.
The prompt cache is also reused across multiple inference calls (or simply, API calls to the model).
PML and Cache Reuse between API calls
Cache Logic From: https://aws.amazon.com/blogs/machine-learning/effectively-use-prompt-caching-on-amazon-bedrock/
Solution
This specific implementation section will focus on Explicit caching.
Given a sample prompt:
If the user is logged in, then skip rest of the test steps. If the user is not logged in, then continue with the following steps.
1. Click Login button and wait for the Login modal to appear.
2. Click the "Use phone / email / username" option.
3. Click "Login with email or username" option.
4. Use the following credentials. username: "usertt31", password: "wrongpassword".
5. After clicking Login, it should show an error message.
We can imagine the sample message to be like this:
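As a rough, illustrative sketch (simplified field names, not the exact Bedrock request schema), each agent turn sends a stable system prompt, the tool definitions, and the growing message history:
// Illustrative structure of one agent request; field names are simplified.
const request = {
  system: "You are a professional automated QA Engineer ...", // stable across turns
  tools: [
    { name: "click", description: "Click an element on screen" },
    { name: "type_text", description: "Type text into a field" },
    // ...more tool definitions, also stable across turns
  ],
  messages: [
    { role: "user", content: "If the user is logged in, then skip rest of the test steps. ..." },
    { role: "assistant", content: "Step 1: Click Login button and wait for the Login modal ..." },
    { role: "user", content: "<current screen context>" },
    // ...history keeps growing with every step of the test case
  ],
};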
Based on the diagram above, we can find "repeated" prompts which present opportunities for caching:
- Previous "Messages" from earlier requests. Historical messages are kept to give enough context to the LLM to improve accuracy.
- Tools
Since we use Prompt Caching with AWS Bedrock, there are a few important points from their cache specification:
- Maximum of 4 cache checkpoints per request, set through the cache_control parameter
  - cache_control can be thought of as the "PML", allowing us to explicitly define reusable sections of the prompt.
  - For a detailed explanation of cache checkpoints, check the AWS Blog.
- 5 minute TTL
  - This is the cache lifetime. Hitting the cache resets the 5 minute timer.
- Minimum of 1K~2K tokens
  - Depending on the model, cache sections shorter than the minimum will not be written to the cache.
Here are our ~50 lines of code to implement prompt caching (excluding comments and whitespace, of course). Helpers such as estimateTokens, hasTemporaryContent, and the CACHE_THRESHOLD constant are omitted for brevity.
let lastCacheControl = -1;
async function sendMessageWithCaching(client, messages, tools) {
// 1. Add "cache_control" to messages & tools
const { processedMessages, updatedLastCacheControl } = addCacheControlToMessages(messages, lastCacheControl);
const processedTools = addCacheControlToTools(tools);
// 2. Send to AWS Bedrock API
const response = await client.sendMessage({
messages: processedMessages,
tools: processedTools,
});
// 3. Update cache control state
lastCacheControl = updatedLastCacheControl;
return response;
}
// 1 cache point will be used on the tools
function addCacheControlToTools(tools) {
  return tools.map((tool, index) =>
    index === tools.length - 1
      ? { ...tool, cache_control: { type: "ephemeral" } } // mark the last tool definition as a cache checkpoint
      : tool
  );
}
// 2 cache points will be used on messages
function addCacheControlToMessages(messages, lastCacheControl) {
const cacheIndices = calculateCachePoints(messages, lastCacheControl);
const processedMessages = messages.map((message, index) => {
if (cacheIndices.includes(index)) { // Cache specific messages
return { ...message, cache_control: { type: "ephemeral" } };
}
return message;
});
  const updatedLastCacheControl = cacheIndices[cacheIndices.length - 1] ?? -1; // last checkpoint index, or -1 if none
return { processedMessages, updatedLastCacheControl };
}
function calculateCachePoints(messages, lastCacheControl) {
const cachePoints = [];
if (messages.length === 0) return cachePoints;
const lastIndex = messages.length - 1;
const targetIndex = hasTemporaryContent(messages[lastIndex]) && lastIndex > 0
? lastIndex - 1 : lastIndex;
if (targetIndex < 0) return cachePoints;
if (lastCacheControl === -1) {
// First time - check total size
if (getTokenSize(messages, 0, targetIndex) > CACHE_THRESHOLD) {
cachePoints.push(targetIndex);
}
} else {
// Keep old cache + add new if enough content
cachePoints.push(lastCacheControl);
if (getTokenSize(messages, lastCacheControl + 1, targetIndex) > CACHE_THRESHOLD) {
cachePoints.push(targetIndex);
}
}
return cachePoints;
}
function getTokenSize(messages, start, end) {
return messages.slice(start, end + 1).reduce((size, msg) => size + estimateTokens(msg), 0);
}
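As a hypothetical usage sketch (the driver loop, client, and message shapes below are stand-ins, not our internal wiring), the agent calls the wrapper on every turn, so cache checkpoints slide forward as the conversation grows:
// Hypothetical driver loop showing how sendMessageWithCaching is called each turn.
async function runTestCase(client, tools, steps) {
  const messages = [];
  for (const step of steps) {
    messages.push({ role: "user", content: step });
    // Each call reuses the previously cached prefix and may add a new checkpoint.
    const response = await sendMessageWithCaching(client, messages, tools);
    messages.push({ role: "assistant", content: response.content });
  }
  return messages;
}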
Here's a diagram to help visualize our implementation:
Results
- Dollar Cost: Based on our total daily usage, prompt caching saves roughly 40-50% of costs. The savings vary from day to day depending on user prompts: longer prompts and longer conversations yield higher savings.
Weekly Savings View
Monthly Savings View
- Time Cost: After running our benchmark cases with and without caching, we saw no clearly perceivable improvement in the time to complete each test case from the user's perspective; most full runs save only around 5 seconds. Many factors contribute to test run time and make it hard to benchmark, such as network slowdowns, model API availability, and the performance of the app under test.
Time cost in seconds per test case
Conclusion
Prompt caching offers the most straightforward approach to reducing dollar costs, especially since it's natively supported by model providers. Given our tool's conversational nature, prompts are frequently reusable, making prompt caching particularly effective. It also provides a unique advantage over selector-based caching by enabling cache hits even during early test case development when user prompts change rapidly.
While prompt caching does not significantly improve overall time costs on its own, it lays a strong foundation for further optimization. Combined with other strategies such as model selection, architectural tuning, and fine-grained execution control, it opens the door to more scalable and cost-effective AI-powered testing.
We are excited to continue refining this system and exploring what comes next.
