Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124


Test-time scaling (TTS) has emerged as a proven way to improve the performance of large language models in real-world applications by giving them additional compute during inference. However, TTS strategies have historically been designed by hand, relying heavily on human intuition to dictate the rules of the model’s reasoning.
To address this difficulty, researchers at Meta, Google, and several universities introduced AutoTTS, a framework that automatically discovers optimal TTS strategies. This automated approach allows enterprise organizations to dynamically optimize compute allocations without manually setting heuristics.
By implementing the optimal strategies discovered by AutoTTS, organizations can directly reduce token usage and operational costs for deploying advanced reasoning models in production environments. In experimental trials, AutoTTS efficiently manages inference budgets, successfully reducing token consumption by up to 69.5% without sacrificing accuracy.
Test-time scaling improves LLMs by giving them additional compute when generating answers. This extra compute allows the model to generate multiple reasoning paths or evaluate its intermediate steps before arriving at a final answer.
The main challenge in designing TTS strategies is to determine how to allocate these additional computations optimally. Historically, researchers designed these strategies by hand, relying on guesswork to build robust heuristics. Engineers must hypothesize rules and thresholds for when a model should branch into new paths of reasoning, delve deeper into an existing path, cut off an unpromising branch, or stop reasoning altogether.
Because this manual tuning process is limited by human intuition, a vast number of possible approaches remain unexplored. This often results in suboptimal trade-offs between model accuracy and computational cost.
Current TTS algorithms can be mapped onto a control space with two axes: "width" is the number of reasoning paths explored, and "depth" is how far each path is developed. Self-Consistency (SC) samples a fixed number of trajectories and takes a majority vote on the answer. Adaptive-Consistency (ASC) saves compute by stopping early once a confidence threshold is reached. Parallel-Probe takes a more granular approach, pruning unpromising branches while deepening others. All three are hand-crafted, and this is the limitation that AutoTTS is designed to overcome.
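The difference between fixed-width and adaptive sampling can be sketched in a few lines of Python. This is an illustration only: the `sample_answer` callable is a hypothetical stand-in for one full reasoning trajectory from the model, and the vote-share stopping rule with its `threshold` and `min_samples` parameters is an assumed simplification, not the criterion used by the published ASC method.

```python
from collections import Counter


def self_consistency(sample_answer, width):
    """Sample a fixed number of answers and return the majority vote."""
    answers = [sample_answer() for _ in range(width)]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner, width  # (answer, trajectories used)


def adaptive_consistency(sample_answer, max_width, threshold=0.7, min_samples=4):
    """Sample one answer at a time and stop once the leader's vote share is high.

    The share-based rule is an assumption made for illustration.
    """
    counts = Counter()
    for used in range(1, max_width + 1):
        counts[sample_answer()] += 1
        winner, votes = counts.most_common(1)[0]
        if used >= min_samples and votes / used >= threshold:
            return winner, used  # early stop: remaining budget is saved
    return winner, max_width


# Hypothetical recorded answers from eight trajectories of one question.
votes = ["42", "42", "17", "42", "42", "42", "42", "42"]
print(self_consistency(iter(votes).__next__, width=8))          # ('42', 8)
print(adaptive_consistency(iter(votes).__next__, max_width=8))  # ('42', 4)
```

Both functions return the same answer here, but the adaptive version stops after four trajectories because the leading answer already holds three of four votes.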
While some more advanced methods use richer structures such as search trees or external verifiers, they all share one key characteristic: they are meticulously handcrafted. This manual approach limits the scope of strategy discovery, leaving a huge portion of the potential resource allocation space untouched.
AutoTTS changes how test-time scaling is optimized. Rather than treating strategy design as a human task, AutoTTS approaches it as an algorithmic search problem in a controlled environment.
This framework redefines the roles of both the human engineer and the AI model. Instead of manually crafting specific rules for when an LLM should branch, cut, or stop reasoning, the engineer’s role shifts to building the discovery environment. The human defines the boundaries, including the control space of states and actions, optimization objectives balancing accuracy against cost, and specific feedback mechanisms.
Strategy design itself is handed to an LLM researcher, a coding agent such as Claude Code. This researcher acts as an autonomous agent that iteratively proposes TTS "controllers." These controllers are code-defined policies or algorithms that dictate how an AI model allocates its computational budget during inference. The researcher tests and refines these controllers based on feedback until it finds an optimal resource allocation policy.
To make this automated search computationally affordable, AutoTTS relies on an "offline replay environment." If the LLM researcher had to invoke the underlying reasoning model to generate new tokens every time it tested a new strategy, the computational cost would be astronomical. Instead, it relies on thousands of reasoning trajectories collected in advance from the base LLM. These trajectories include "probe signals," which are intermediate answers that help the controller assess progress along each reasoning path.
During the discovery cycle, the explorer agent proposes a controller and evaluates it against this offline data. The agent observes the execution traces of the proposed controller, which show how it distributed compute over time. By analyzing these traces, the agent can diagnose specific failure modes, such as a controller that pruned branches too aggressively in a particular scenario. This is more informative than looking only at the final result. The agent then iteratively rewrites the controller’s code to improve the accuracy-cost trade-off.
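A minimal sketch of what evaluating a controller against recorded data could look like is shown below. Everything about the layout is an assumption made for illustration: the `Question` record, the `(cumulative tokens, probe answer)` checkpoints, the controller interface and the penalty weight `lam` are hypothetical, since the article does not describe the actual AutoTTS data formats. The point is only that scoring a candidate requires no new model calls.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Question:
    truth: str
    # branches[b][d] = (cumulative tokens, probe answer) at probe d of branch b
    branches: list


def fixed_depth_controller(width, depth):
    """Hypothetical controller: replay `width` branches up to `depth` >= 1 probes."""
    def controller(question):
        trace, finals, tokens = [], [], 0
        for b, branch in enumerate(question.branches[:width]):
            d = min(depth, len(branch)) - 1
            cum_tokens, answer = branch[d]
            tokens += cum_tokens
            finals.append(answer)
            trace.append((b, d, answer))  # execution trace the agent can inspect
        return Counter(finals).most_common(1)[0][0], tokens, trace
    return controller


def evaluate(controller, dataset, lam=1e-5):
    """Score a controller on recorded data: accuracy minus a token penalty."""
    correct = total_tokens = 0
    for q in dataset:
        answer, tokens, _trace = controller(q)
        correct += answer == q.truth
        total_tokens += tokens
    accuracy = correct / len(dataset)
    mean_tokens = total_tokens / len(dataset)
    return accuracy, mean_tokens, accuracy - lam * mean_tokens


# One made-up question with three recorded branches of two probes each.
dataset = [Question("7", [[(100, "3"), (200, "7")],
                          [(120, "7"), (240, "7")],
                          [(90, "3"), (180, "3")]])]
print(evaluate(fixed_depth_controller(width=3, depth=1), dataset))
print(evaluate(fixed_depth_controller(width=2, depth=2), dataset))
```

In this toy data the shallow, wide controller answers incorrectly, while the narrower, deeper one answers correctly at a higher token cost, which is the kind of trade-off the search has to weigh.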
Because the research agent is not limited by human intuition, it can discover highly coordinated, complex rules that a human engineer would likely never code by hand. One optimal controller discovered by AutoTTS, called the Confidence Momentum Controller, uses several non-obvious mechanisms to allocate compute:
Stop based on trend: Handcrafted strategies often instruct the model to stop reasoning once it reaches a certain instantaneous confidence threshold. The AutoTTS agent found that instantaneous confidence can be misleading because of temporary spikes. Instead, the controller tracks an exponential moving average (EMA) of confidence and stops only if the overall confidence level is high and the trend is not actively declining.
Combined width and depth control: Manually designed algorithms usually treat "widening" into new reasoning paths and "deepening" current paths as separate decisions. AutoTTS found a closed feedback loop in which the two actions are linked. If the confidence of the current branches stalls or regresses, the controller automatically triggers the creation of new branches.
Alignment-aware depth distribution: Instead of giving all active reasoning branches the same compute budget, the controller dynamically identifies which branches agree with the current leading answer. It then prioritizes those branches with "bursts" of additional compute. This concentrates the computational budget on the emerging consensus to quickly verify that it is correct.
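Taken together, the three mechanisms above might look like the following sketch. This is not the published controller: the class name, the thresholds, the smoothing factor, the use of vote share as a confidence proxy and the `step` interface are all hypothetical, chosen only to show how an EMA-based stop rule, stall-triggered branching and agreement-weighted deepening can share one decision loop.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class MomentumControllerSketch:
    alpha: float = 0.3         # EMA smoothing factor (assumed)
    stop_level: float = 0.8    # smoothed confidence needed to stop (assumed)
    stall_delta: float = 0.01  # EMA gain below this counts as a stall (assumed)
    burst: int = 2             # probe steps granted to aligned branches (assumed)
    ema: float = 0.0           # starts at zero so one early spike cannot stop

    def step(self, probe_answers):
        """Take the latest probe answer of each active branch, return an action."""
        leader, votes = Counter(probe_answers).most_common(1)[0]
        confidence = votes / len(probe_answers)  # vote share as confidence proxy

        previous = self.ema
        self.ema = self.alpha * confidence + (1 - self.alpha) * previous
        trend = self.ema - previous

        # 1. Trend-based stop: smoothed confidence is high and not declining.
        if self.ema >= self.stop_level and trend >= 0:
            return {"action": "stop", "answer": leader}

        # 2. Coupled width and depth: a stalled or falling trend opens a branch.
        spawn = trend < self.stall_delta

        # 3. Alignment-aware depth: branches agreeing with the leader get a burst.
        depth = [self.burst if a == leader else 1 for a in probe_answers]
        return {"action": "continue", "spawn_branch": spawn, "depth": depth}


controller = MomentumControllerSketch()
for round_answers in (["a", "b", "a", "c"], ["a", "a", "a", "c"], ["a"] * 4):
    print(controller.step(round_answers))
```

Because the EMA starts at zero, even unanimous agreement has to persist for several probe rounds before the stop rule fires, which is one simple way to ignore a momentary spike.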
To test whether the AI can independently discover a better test-time scaling strategy, the researchers created a rigorous evaluation framework. The main experiments were performed on Qwen3 models ranging from 0.6B to 8B parameters. The researchers also tested the system’s ability to generalize to a distilled 8B version of the DeepSeek-R1 model.
Initially, the LLM researcher was tasked with finding an optimal strategy using the AIME24 mathematical reasoning benchmark. The discovered strategy was then tested on two established mathematics benchmarks, AIME25 and HMMT25, as well as GPQA-Diamond, a graduate-level general reasoning benchmark.
The discovered AutoTTS controller was pitted against four manually designed, industry-standard test-time scaling algorithms. These baselines are Self-Consistency with 64 parallel reasoning paths (SC@64), Adaptive-Consistency (ASC), Parallel-Probe, and Early-Stopping Self-Consistency (ESC). ESC is a hybrid approach that generates trajectories in parallel and stops early when a response appears stable.
When set to cost-aware balanced mode, the AutoTTS-discovered controller reduces total token consumption by approximately 69.5% compared to SC@64. At the same time, the controller maintains the same average accuracy across the four Qwen models. When the inference budget was increased, AutoTTS pushed maximum accuracy beyond all hand-crafted baselines in five out of eight test cases.
This efficiency carries over to other tasks. On the GPQA-Diamond benchmark, the balanced variant of AutoTTS reduced inference cost from 510K tokens to just 151K tokens while slightly improving overall accuracy. On the DeepSeek model, AutoTTS achieved the highest overall accuracy on the HMMT25 benchmark while cutting token spend almost in half.
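The GPQA-Diamond figures fall in the same range as the headline saving, as a quick calculation shows. Only numbers quoted in the article are used; the helper name `token_reduction` is ours.

```python
def token_reduction(before, after):
    """Fractional token saving when usage drops from `before` to `after`."""
    return (before - after) / before


# GPQA-Diamond: 510K tokens down to 151K tokens.
gpqa_saving = token_reduction(510_000, 151_000)
print(f"GPQA-Diamond saving: {gpqa_saving:.1%}")  # 70.4%

# A 69.5% saving versus SC@64 leaves 30.5% of the original token budget.
remaining = 1 - 0.695
print(f"SC@64 spends about {1 / remaining:.2f}x the tokens")  # about 3.28x
```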
For practitioners building enterprise AI applications, these experiments highlight two key operational benefits:
Increase peak performance: AutoTTS does more than save money on token consumption; it raises the maximum achievable performance of the base model. The AI-engineered controller is exceptionally good at detecting noisy or unproductive reasoning branches on the fly, and it continuously redirects its computational budget to the branches generating the most useful reasoning signals.
Cost-effective custom development: Because the framework relies on an offline replay environment, the entire discovery process costs only $39.90 and takes 160 minutes. For enterprise teams, this means optimized reasoning strategies aligned with proprietary models and internal tasks are now within reach without a dedicated research budget.
Both the AutoTTS framework and the Confidence Momentum Controller (CMC) are available on GitHub; CMC can be used as a replacement for other TTS controllers.