Rongchai Wang
Mar 23, 2026 20:27
Anthropic demonstrates multi-day autonomous AI workflows in which Claude compressed months of physics research coding into a few days with minimal human oversight.
Anthropic just showed what happens when you let an AI work unsupervised for days on end. Its Claude Opus 4.6 model built a complex cosmological physics solver from scratch, work that typically takes researchers months or years, in a matter of days.
The $380 billion AI company published research on March 23 detailing how its latest model tackled implementing a differentiable Boltzmann solver, which predicts statistical properties of the Cosmic Microwave Background by simulating photons, baryons, neutrinos, and dark matter in the early universe. The kicker? The researcher overseeing the project, Siddharth Mishra-Sharma, admits it wasn’t even his core domain.
The Setup That Made It Work
Forget the usual AI chat loop where humans babysit every step. This approach sets Claude loose with clear success criteria and lets it run autonomously across multiple sessions. The model achieved sub-percent accuracy against CLASS, a reference implementation that cosmologists consider the gold standard.
Three components proved essential. First, a progress file (CHANGELOG.md) acts as the agent’s long-term memory between sessions, tracking completed tasks, failed approaches, and why they didn’t work. Without a record of dead ends, successive sessions waste time re-attempting the same mistakes.
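As a rough illustration of such a progress file, here is a minimal Python sketch. The entry format, the status labels, and the function names are assumptions made for this example; the article does not describe the actual layout of Anthropic's CHANGELOG.md.

```python
from pathlib import Path


def format_entry(session, status, task, note=""):
    """Render one progress line; 'done' and 'failed' are assumed status labels."""
    if status not in ("done", "failed"):
        raise ValueError("status must be 'done' or 'failed'")
    line = f"- [session {session}] {status.upper()}: {task}"
    if note:
        line += f" ({note})"
    return line


def append_entry(path, session, status, task, note=""):
    """Append one entry to the progress file, creating the file if needed."""
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write(format_entry(session, status, task, note) + "\n")


def failed_tasks(path):
    """Return recorded dead ends so a new session can avoid repeating them."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line for line in lines if "] FAILED:" in line]
```

A new session would call `failed_tasks("CHANGELOG.md")` before planning its work, which is the point of recording why an approach failed and not only what succeeded.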
Second, a test oracle, in this case the CLASS C source code, gives the agent an objective way to measure progress. Claude continuously ran unit tests against this reference, aiming for a 0.1% accuracy target.
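A check against a reference at the 0.1% level can be sketched in a few lines of Python. The function names and the sample numbers below are hypothetical and are not taken from CLASS or from Anthropic's test suite; the sketch only shows what a relative-error test against an oracle looks like.

```python
def max_relative_error(candidate, reference):
    """Largest |candidate - reference| / |reference| over paired values."""
    if len(candidate) != len(reference):
        raise ValueError("candidate and reference must have the same length")
    worst = 0.0
    for got, want in zip(candidate, reference):
        if want == 0.0:
            raise ValueError("reference value is zero; relative error undefined")
        worst = max(worst, abs(got - want) / abs(want))
    return worst


def within_tolerance(candidate, reference, rel_tol=1e-3):
    """True when every value agrees with the reference to within rel_tol (0.1%)."""
    return max_relative_error(candidate, reference) <= rel_tol


# Made-up numbers for illustration only; not real CLASS output.
reference_spectrum = [1000.0, 2500.0, 1800.0]
candidate_spectrum = [1000.5, 2498.0, 1801.0]
print(within_tolerance(candidate_spectrum, reference_spectrum))
```

Checking the worst-case error across many values, and ideally across many parameter settings, matters because a single passing point can hide bugs elsewhere.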
Third, git commits after every meaningful unit of work create recoverable history. If the compute allocation runs out mid-session, nothing gets lost. Mishra-Sharma monitored progress by checking GitHub on his phone while waiting in line for coffee.
The Ralph Loop Problem
Current models suffer from what Anthropic calls “agentic laziness”: they’ll find excuses to stop before finishing complex tasks. One model actually said, “It’s getting late, let’s pick back up again tomorrow.”
The workaround is the “Ralph loop,” essentially a for-loop that kicks the agent back into context when it claims completion and asks if it’s really done. Claude would iterate up to 20 times until genuinely finished.
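The loop described above might look something like the following Python sketch. Everything here is an assumption for illustration: `run_agent` is a hypothetical callable standing in for a real agent session, the prompts are invented, and the toy agent exists only to exercise the loop. The cap of 20 iterations is the one figure taken from the article.

```python
TASK_PROMPT = "Continue working on the task."
CHALLENGE_PROMPT = "You said you are finished. Are you really done? If not, keep going."


def ralph_loop(run_agent, max_iterations=20):
    """Drive a hypothetical agent until it confirms completion after a challenge.

    run_agent(prompt) is an assumed callable that returns True when the agent
    claims the task is complete. Returns the number of calls made, or None if
    the iteration cap is reached first.
    """
    prompt = TASK_PROMPT
    for iteration in range(1, max_iterations + 1):
        claims_done = run_agent(prompt)
        if claims_done and prompt == CHALLENGE_PROMPT:
            return iteration  # the completion claim survived the challenge
        prompt = CHALLENGE_PROMPT if claims_done else TASK_PROMPT
    return None


def make_fake_agent(work_units):
    """Toy stand-in: quits early once, then finishes the remaining work."""
    state = {"remaining": work_units, "quit_early": True}

    def run_agent(prompt):
        if state["quit_early"] and prompt == TASK_PROMPT:
            state["quit_early"] = False
            return True  # premature completion claim
        if state["remaining"] > 0:
            state["remaining"] -= 1
        return state["remaining"] == 0

    return run_agent
```

In this sketch the loop only stops when the agent repeats its completion claim after being challenged, so a single early "I'm done" sends it back to work.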
Where It Struggled
The development trajectory wasn’t smooth. Claude initially tested code at only a single parameter point, drastically reducing its bug-catching ability. It spent hours chasing bugs that any cosmologist would spot instantly. It tripped over gauge conventions.
But it kept making progress. The resulting solver isn’t production-grade, since accuracy falls short in certain regimes, yet it demonstrates real compression of researcher time.
What This Means for AI Development
This builds on Anthropic’s earlier C compiler project, where Claude worked across roughly 2,000 sessions to build a compiler capable of compiling the Linux kernel. The Boltzmann solver required different skills: tracing errors through a deeply coupled pipeline where small numerical errors cascade through everything downstream.
An unexpected side effect emerged. Mishra-Sharma learned substantial physics by watching the git commit history unfold. He described it as reading lab notes from “a fast, hyper-literal postdoc.”
For Anthropic, fresh off a $30 billion Series G round in February that valued the company at $380 billion, these demonstrations matter. They’re not just showing Claude can chat; they’re proving it can replace expensive, specialized labor over extended periods with minimal supervision. The question now becomes which industries figure out how to deploy this first.