Headlines (47 articles)
- The AI gold rush is pulling private wealth into riskier, earlier bets TechCrunch AI Apr 07, 2026 01:00 PM 1 min read On a recent episode of Equity, we talked to Arena Private Wealth to explore a growing trend: family offices bypassing VCs to gain direct exposure to AI startups, turning them from passive investors into active participants.
-
Gemini is making it faster for distressed users to reach mental health resources The Verge AI Apr 07, 2026 06:09 AM 1 min read The update follows a wrongful death lawsuit alleging Gemini "coached" a man to die by suicide.
Google says it has updated Gemini to better direct users to get mental health resources during moments of crisis. The change comes as the tech giant faces a wrongful death lawsuit alleging its chatbot "coached" a man to die by suicide, the latest in a string of lawsuits alleging tangible harm from AI products.
When a conversation indicates a user is in a potential crisis related to suicide or self-harm, Gemini already launches a "Help is available" module that directs users to mental health crisis resources, like a suicide hotline or crisis text line. Google says the update - really more of a redesign - will streamline this into a "one-touch …
- AI startup Rocket offers vibe McKinsey-style reports at a fraction of the cost TechCrunch AI Apr 07, 2026 05:30 AM Rocket's new AI platform combines strategy, product building, and competitive intelligence, aiming to move beyond code generation.
-
From folding boxes to fixing vacuums, GEN-1 robotics model hits 99% reliability Ars Technica AI Apr 06, 2026 10:18 PM 1 min read New model can respond to disruptions and figure out moves it wasn't trained for.
Robotic machine-learning company Generalist has announced GEN-1, a new physical AI system that it says "crosses into production-level success rates" on "a broad range of physical skills" that used to require the dexterity and muscle memory of human hands. Generalist is also touting the new model's ability to respond to disruptions by improvising new moves and "connect[ing] ideas from different places in order to solve new problems."
GEN-1 builds on Generalist's previous GEN-0 model, which the company touted in November as a proof of concept for the applicability of scaling laws in robotics training, showing how more pre-training data and compute time improve post-training performance. But while large language models have been able to effectively process trillions of words collectively written on the Internet as part of their training, robotic models don't have a similar, readily accessible source of quality data about how humans manipulate objects.
To help solve this problem, Generalist has relied on "data hands," a set of wearable pincers that capture micro-movements and visual information as humans perform manual tasks. Generalist now claims it has collected over half a million hours and "petabytes of physical interaction data" to help train its physical model.
- OpenAI alums have been quietly investing from a new, potentially $100M fund TechCrunch AI Apr 06, 2026 09:54 PM Zero Shot, a new venture capital fund with deep ties to OpenAI, is aiming to raise $100 million for its first fund. It has already written some checks.
- Google quietly launched an AI dictation app that works offline TechCrunch AI Apr 06, 2026 06:54 PM Google's new offline-first dictation app uses Gemma AI models to take on apps like Wispr Flow.
- Iran threatens "Stargate" AI data centers TechCrunch AI Apr 06, 2026 06:06 PM Iran said it will target U.S.-linked data centers with new missile strikes, as the war between the U.S. and Iran escalates.
-
The one piece of data that could actually shed light on your job and AI MIT Technology Review Apr 06, 2026 04:33 PM 5 min read "We need a Manhattan Project for this," one economist says.
This story originally appeared in The Algorithm, our weekly newsletter on AI.
Within Silicon Valley's orbit, an AI-fueled jobs apocalypse is spoken about as a given. The mood is so grim that a societal impacts researcher at Anthropic, responding Wednesday to a call for more optimistic visions of AI's future, said there might be a recession in the near term and a "breakdown of the early-career ladder." Her less-measured colleague Dario Amodei, the company's CEO, has called AI "a general labor substitute for humans" that could do all jobs in less than five years. And those ideas are not just coming from Anthropic, of course.
These conversations have unsurprisingly left many workers in a panic (and are probably contributing to support for efforts to entirely pause the construction of data centers, some of which gained steam last week). The panic isn't being helped by lawmakers, none of whom have articulated a coherent plan for what comes next.
Even economists who have cautioned that AI has not yet cut jobs and may not result in a cliff ahead are coming around to the idea that it could have a unique and unprecedented impact on how we work.
Alex Imas, based at the University of Chicago, is one of those economists. He shared two things with me when we spoke on Friday morning: a blunt assessment that our tools for predicting what this will look like are pretty abysmal, and a "call to arms" for economists to start collecting the one type of data that could make a plan to address AI in the workforce possible at all.
On our abysmal tools: consider the fact that any job is made up of individual tasks. One part of a real estate agent's job, for example, is to ask clients what sort of property they want to buy. The US government chronicled thousands of these tasks in a massive catalogue first launched in 1998 and updated regularly since then. This was the data that researchers at OpenAI used in December to judge how "exposed" a job is to AI (they found a real estate agent to be 28% exposed, for example). Then in February, Anthropic used this data in its analysis of millions of Claude conversations to see which tasks people are actually using its AI to complete and where the two lists overlapped.
But knowing the AI exposure of tasks leads to an illusory understanding of how much a given job is at risk, Imas says. "Exposure alone is a completely meaningless tool for predicting displacement," he told me.
Sure, it is illustrative in the gloomiest case—for a job in which literally every task could be done by AI with no human direction. If it costs less for an AI model to do all those tasks than what you're paid—which is not a given, since reasoning models and agentic AI can rack up quite a bill—and it can do them well, the job likely disappears, Imas says. This is the oft-mentioned case of the elevator operator from decades ago; maybe today's parallel is a customer service agent solely doing phone call triage.
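To make the arithmetic concrete, here is a toy sketch of how a task-level exposure score and the full-automation cost test described above could be computed. The task list, the AI-capable mapping, and the dollar figures are all hypothetical, invented for illustration; the real O*NET catalogue and the OpenAI and Anthropic mappings are far larger and more nuanced.

```python
# Toy sketch of task-level "exposure" plus the full-automation cost test.
# All tasks, mappings, and dollar figures below are hypothetical.

AGENT_TASKS = {
    "ask clients what sort of property they want",
    "schedule and run showings",
    "negotiate offers",
    "draft listing descriptions",
}

# Tasks some AI model is judged capable of doing (a made-up mapping).
AI_CAPABLE = {"draft listing descriptions"}

# Exposure: the fraction of a job's tasks that overlap with AI capabilities.
exposure = len(AGENT_TASKS & AI_CAPABLE) / len(AGENT_TASKS)
print(f"exposure: {exposure:.0%}")  # 25% here; OpenAI scored real estate agents at 28%

def job_disappears(exposure: float, annual_ai_cost: float, annual_wage: float) -> bool:
    """The gloomiest case Imas describes: every task is automatable
    AND running the AI costs less than the human's pay."""
    return exposure == 1.0 and annual_ai_cost < annual_wage

# Partial exposure fails the test, which is part of why exposure alone
# says so little about displacement.
print(job_disappears(exposure, annual_ai_cost=30_000, annual_wage=80_000))  # False
```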
But for the vast majority of jobs, the case is not so simple. And the specifics matter, too: some jobs are likely to have dark days ahead, but how and when that will play out is hard to answer by looking only at exposure.
Take writing code, for example. Someone who builds premium dating apps, let's say, might use AI coding tools to create in one day what used to take three days. That means the worker is more productive. The worker's employer, spending the same amount of money, can now get more output. So then will the employer want more employees or fewer?
This is the question that Imas says should keep any policymaker up at night, because the answer will change depending on the industry. And we are operating in the dark.
In this coder's case, these efficiencies make it possible for dating apps to lower prices. (A skeptic might expect companies to simply pocket the gains, but in a competitive market, they risk being undercut if they do.) These lower prices will always drive some increase in demand for the apps. But how much? If millions more people want it, the company might grow and ultimately hire more engineers to meet this demand. But if demand barely ticks up—maybe the people who don't use premium dating apps still won't want them even at a lower price—fewer coders are needed, and layoffs will happen.
Repeat this hypothetical across every job with tasks that AI can do, and you have the most pressing economic question of our time: the specifics of price elasticity, or how much demand for something changes when its price changes. And this is the second part of what Imas emphasized last week: We don't currently have this data across the economy. But we could.
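The dating-app hypothetical can be put in numbers. Below is a minimal sketch of how elasticity flips the hiring question one way or the other, assuming a constant-elasticity demand curve; the headcount, productivity, price, and elasticity figures are all invented for illustration.

```python
# Minimal sketch of the elasticity question: does AI-driven productivity
# mean more coders or fewer? Assumes a constant-elasticity demand curve;
# every number below is hypothetical.

def coders_needed(baseline_coders: float, productivity_gain: float,
                  price_cut: float, elasticity: float) -> float:
    """Headcount needed to serve post-price-cut demand.

    productivity_gain: output multiple per coder (3.0 = one day instead of three)
    price_cut: fraction by which the app's price falls (0.2 = 20% cheaper)
    elasticity: price elasticity of demand (magnitude)
    """
    demand_multiple = (1 - price_cut) ** (-elasticity)
    return baseline_coders * demand_multiple / productivity_gain

# Elastic demand: a 20% price cut with elasticity 6 lifts demand ~3.8x,
# outrunning the 3x productivity gain -> the company hires.
print(round(coders_needed(100, 3.0, 0.2, 6.0)))  # ~127

# Inelastic demand: elasticity 1 lifts demand only 1.25x -> layoffs.
print(round(coders_needed(100, 3.0, 0.2, 1.0)))  # ~42
```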
We do have the numbers for grocery items like cereal and milk, Imas says, because the University of Chicago partners with supermarkets to get data from their price scanners. But we don't have such figures for tutors or web developers or dietitians (all jobs found to have "exposure" to AI, by the way). Or at least not in a way that's been widely compiled or made accessible to researchers; sometimes it's scattered across private companies or consultancies.
"We need, like, a Manhattan Project to collect this," Imas says. And we don't need it just for jobs that could obviously be affected by AI now: "Fields that are not exposed now will become exposed in the future, so you just want to track these statistics across the entire economy."
Getting all this information would take time and money, but Imas makes the case that it's worth it; it would give economists the first realistic look at how our AI-enabled future could unfold and give policymakers a shot at making a plan for it.
- OpenAI's vision for the AI economy: public wealth funds, robot taxes, and a four-day workweek TechCrunch AI Apr 06, 2026 03:55 PM 1 min read OpenAI proposes taxes on AI profits, public wealth funds, and expanded safety nets to address job loss and inequality, blending redistribution with capitalism as policymakers debate AI's economic impact.
- Startup Battlefield 200 applications open: a chance for VC access, TechCrunch coverage, and $100K TechCrunch AI Apr 06, 2026 02:30 PM 1 min read Nominate your startup, or one you know that deserves the spotlight, and finish the process by applying. The selected 200 have a chance at VC access, TechCrunch coverage, and $100K for Startup Battlefield 200. Applications close on May 27.
- How to use the new ChatGPT app integrations, including DoorDash, Spotify, Uber, and others TechCrunch AI Apr 06, 2026 02:11 PM Learn how to use Spotify, Canva, Figma, Expedia, and other apps directly in ChatGPT.
- Ticket savings of up to $500 this week for TechCrunch Disrupt 2026 TechCrunch AI Apr 06, 2026 02:00 PM Save nearly $500 on your ticket to TechCrunch Disrupt 2026. This offer disappears Friday, April 10, at 11:59 p.m. PT. Register here before rates hike.
- Spain's Xoople raises $130 million Series B to map the Earth for AI TechCrunch AI Apr 06, 2026 01:00 PM The company is also announcing a deal with L3Harris to build the sensors for Xoople's spacecraft.
-
Iran threatens OpenAI's Stargate data center in Abu Dhabi The Verge AI Apr 06, 2026 11:54 AM 1 min read Construction on OpenAI's UAE data center was "well underway" as of last year.
An October 2025 image of OpenAI's UAE Stargate data center under construction. | Image: G42
Iran's Islamic Revolutionary Guard Corps (IRGC) has published a video threatening OpenAI's planned Abu Dhabi data center if the US follows through on threats to attack the country's power plants, as reported earlier by Tom's Hardware. The video, which was published to an Iranian state-backed news outlet's X account on April 3rd, says the IRGC will carry out the "complete and utter annihilation" of US-linked energy and technology companies in the region, before showing an image of OpenAI's $30 billion in-progress Stargate facility in the United Arab Emirates.
OpenAI's overarching $500 billion Stargate project includes investments from Oracle …
-
Cisco CEO Chuck Robbins wants data centers in space The Verge AI Apr 06, 2026 11:15 AM 53 min read In space, no one can hear your data center.
Today, I'm talking with Chuck Robbins, CEO of Cisco.
Cisco is one of those big companies that everyone has heard of but that most of us don't have to interact with very much; it's not really a consumer brand. But all of us are in some way using Cisco's products and services every day because it makes a huge amount of networking equipment for other big companies, like telecoms and ISPs. It's a guarantee that somewhere between me recording this and you watching, listening to, or reading it, the bits have passed through Cisco products. Without the actual routers and switches and silicon — and the software to make those things work — there's no internet, there's no cloud, and there's no AI.
That's Cisco's new big business, of course: building all the networking needed inside all of the data centers the AI companies are trying to build. Chuck and I spent a lot of time discussing that. First, where should we build all these data centers? Because it's not clear that anyone wants them around.
A data center is a really unpleasant neighbor to have: It's loud, it's ugly, and it uses a ton of electricity, making rates for regular people go up. AI itself is polling pretty badly with Americans, and there's now fairly robust, bipartisan opposition to new data center builds all over the country. So I had to start by asking Chuck what feels, strangely, like one of the most urgent questions of the moment: Should we build data centers in space?
Elon Musk sure seems to think the answer is yes, and he's pushing SpaceX that way. Sam Altman — along with a whole bunch of experts who understand how cooling and radiation work in orbit — thinks we're not there yet. So I had to ask Chuck which way he's leaning, and I was a little surprised how quickly and emphatically he answered.
You'll also hear me ask very directly whether Chuck thinks AI is a bubble, and you'll hear him say very directly that he thinks it is. And he would know: During the dot-com bubble, Cisco — the internet builder — was very briefly the most valuable company in the world.
Beyond the AI of it, I love bringing big companies that are kind of hidden in plain sight onto Decoder, and Cisco is a perfect example. Chuck has made some big bets around chip investments to position Cisco on what he calls the leading edge — but not bleeding edge — that are really fascinating when you think about the kind of infrastructure he sells to companies all over the world.
Those companies are dealing with an increasingly fractured global landscape, and asking big questions about data. Who owns data? Where can it be stored? Should the internet have a kill switch in different countries? They're important questions, but they also don't have easy answers, and you'll hear Chuck really delve into how complicated it is keeping the world connected in the deeply weird realities of 2026.
Okay: Cisco CEO Chuck Robbins. Here we go.
This interview has been lightly edited for length and clarity.
Chuck Robbins, you are the CEO of Cisco. Welcome to Decoder.
It's great to be here. Thank you.
I'm excited that you're here in person. I have a lot of questions for you. It seems like a very complicated time to run an infrastructure company — which is fundamentally what Cisco is — especially one for global infrastructure. The internet's a global network, and that seems to be under a lot of pressure from a lot of different directions. So, I want to get into a lot of things with you.
But I actually want to start with a question I've been dying to ask you ever since we scheduled this interview. I thought, finally, I can ask this question and someone will be able to tell me the answer. Should we put data centers in space?
Absolutely.
Yes?
And we will.
You think so?
I think so. Right now we're dealing with lots of power constraints, and up there you don't have that. And if you think about the people who are talking about putting data centers in space, I wouldn't doubt them.
Elon [Musk].
Yeah. And there's a lot of stuff we're working on right now. We're thinking through what we need to do to our portfolio to make it work properly in the conditions that might exist up there. But I think we're going to see it. I think we are.
So Elon's plan — he recently filed for approval for this plan — is to launch a million satellites as part of a constellation. He's launched constellations before. You mentioned power, that's obviously solar. Can't we just do solar power here on Earth? Is that not a possibility?
Well, up there it's unlimited and unimpeded, so it's just easier. You don't have to deal with a lot of the challenges, like people who don't want these data centers in or near their communities. So, that's obviously off the table. I think it solves a lot of problems. There are a lot of challenges figuring out how to make it all happen. But again, given his history, I wouldn't doubt him. We're going to prepare so our technology is ready.
What does that preparation look like for you?
It's very early stages right now. Our teams literally came to me I think about two or three months ago. My head of product said, "We really have to be prepared for data centers in space." I looked at him like he was crazy. Subsequent to that, we've just been talking about how we don't even know everything we need to do yet. We're in the early stages of just making sure the atmospheric issues, the temperatures, all of those things are taken into consideration.
But at some level, we don't have to deal with the cooling and things of that nature, which add a lot of weight to the product, because you first start thinking about how to get them up there. So, there are a lot of things that our team's thinking through right now.
What does that networking stack look like for you? Do you have to invent a whole bunch of new stuff? Is it the same stuff without as many cooling loads or with lower energy needs?
I think it's generally the same with perhaps different interfaces for different satellite technologies, things like that. It shouldn't be too dissimilar.
Do you want to be on the bleeding edge of that, or are you waiting and seeing if Elon can prove it out?
No, I'd like to be on the leading edge. How about we say that? Maybe not bleed, but let's lead.
What does that investment look like for you? Are you going to send up a team?
The teams that currently build our data centers are the logical ones to actually do this analysis, and I think thatâs whatâs happening right now.
To me, the cooling piece of it seems challenging in a lot of ways. You have to move the heat out of the products. There's no air in space. That's not going to naturally happen.
You're getting way beyond me pretty quickly.
I'm just curious. We've written a bunch of "should we put data centers in space?" stories now, and I was dying to ask you these questions because it feels like someone has to do a lot of basic R&D work to make this happen.
I'd say six months from now, have my chief product officer do this, and he can go through a lot of that with you.
Fair enough. Let me ask you the flip side of this; you mentioned this already. There are problems with building data centers in the United States and around the world. I want to come to that in more depth. But are we just running away from the problems of politics and saying we'll just do it in space where there's no one to get in our way?
I don't think that's it. I think it just eliminates a lot of the challenges that you're facing on the planet. Let me assure you, I grew up on a farm in Georgia, so the last thing I ever thought I'd be talking about is data centers in space. Even five years ago, I wouldn't have thought I'd be talking about it.
If you think about a lot of the dynamics we're dealing with, I don't think it's politics so much as it is the physical limitations, the community. There is an aspect where a lot of the people in the communities don't want these things in their backyards, and I get that. Sam Altman is one who says, "I don't think they should be in their backyards." We've got enough rural areas in this country where we ought to be able to put these things, but we'll see.
Sam Altman also notably says putting data centers in space is a pipe dream. So who are you going to believe?
Does he?
So who've you got?
I wouldn't bet against Elon.
All right, fair enough. Let's talk about Cisco for a second. You've been CEO for 11 years. You've been there for almost 30, I want to say. This is a company that booms and busts with the industry's booms and busts. I think in the dot-com era, Cisco was briefly the world's most valuable company.
For about a day, I think, yeah.
And this is a company that, when it's time to do infrastructure, can be one of the big growth drivers.
Infrastructure's cool again.
It's time to build, as they say. What is Cisco to you right now? How would you describe this company?
We securely connect everything. That's basically what we do. We connect systems, we connect people, we connect things, and we do it in a secure way. We're connecting AI data centers, we're connecting GPUs within AI data centers. It's primarily about secure connectivity.
I think when people have thought about connecting everything, they've thought about, honestly, the last mile. Like, you build the big internet, that's an enterprise problem. Then, we're going to do 5G. Or it's Mobile World Congress and we're going to do 6G now. Who knows when that's going to be. But that's the big internet people have long thought about. I know you have a big corporate business.
The turn for networking right now feels like data centers. It feels like we're building these big data centers. We're going to link up a bunch of GPUs in ways we haven't linked them up before. We have different kinds of workloads because of AI. Is that a meaningful difference to how you conceive of Cisco?
It is. There's certainly more and faster innovation around things like the silicon we design ourselves that goes into the data centers. The continued evolution of data centers is forcing us to drive those cycles faster. If you look at our enterprise data center business — going back to 2010 or 2008, when the cloud came along — there was a belief that there was never going to be another private data center built. And if you look at the last eight quarters, our enterprise data center networking business has had double-digit growth in six quarters and high single digits in the other two.
So, we see that business growing. If you go back five or six years, we had relatively zero business from the big hyperscalers, and this year, we'll do billions. And most of that's driven by AI infrastructure and their data centers. So, I think your assumption is accurate.
Is that just their lack of capacity? Amazon or Microsoft wants to build out another data center, but they can't do it themselves because they're building so fast, so they turn to you?
No, we're selling them equipment that they're using to build their own data centers. So they're building them. They are building them.
So what was the turn? Why did that line start growing for you when it wasn't growing before?
Success in business is always a combination of good decisions and a lot of luck. The luck struck in 2016 when one of my engineers, who built our hardware, came to me and said, "There's this silicon company in Israel that I think we should buy." It gave us the opportunity to standardize on a single silicon architecture across the entire portfolio.
So in 2016, we bought this company called Leaba. Fast forward and we're one of basically three companies in the world that can build the networking silicon that's needed to connect these GPUs, run the training models, and run these AI data centers. So, that was a big part of what's helped us get there.
And to be candid, if we didn't have that silicon today, we would not be participating in this phase. Otherwise, I'd be buying merchant silicon like all my competitors, and I'd be just like everybody else. So, that's the biggest thing that's differentiated us and got us to this point.
We have a lot of competitors or would-be competitors come on the show and talk about networking. That seems like a growing business for a class of companies.
The one that I'm particularly interested in is Nvidia. You guys have a deep partnership with Nvidia. They just had GTC. Jensen [Huang, Nvidia CEO] is out there pointing out that their networking business is huge. It's bigger than yours in some ways. Its last fiscal year was $31 billion. I think you guys were at $20 billion in the last quarter. It's billions bigger than yours. Is it a threat that Nvidia is so deep into building up the networking component? Because it's obviously selling the GPUs. There's a place it could go. It can just expand its footprint. Is that competition? Is that coopetition? How's that work?
It's coopetition. If you look at the big hyperscalers, they actually build their own integrated architecture using best of breed or whoever they want to use. They are very good at balancing their spending across multiple vendors. They like to have optionality. They want diversity at the silicon level. That's how they think. You see some neoclouds as an example. Nvidia sells a fully integrated stack that has networking included in it. That's the path of least resistance, and it helps them get there faster. So sometimes they'll buy that.
If you look at the enterprise, most enterprises have built 40 years of knowledge, processes, everything around our platforms and our technology. That's why what we can do together in the enterprise is a big part of why Nvidia values our partnership.
The other thing we have, which no one else has, is security. As we move to this agentic era with agents operating all over your infrastructure, you have to do security in the network because the latency requirements are going to require full-time security on these agents all the time. I'm doing access validation and identity validation of agents. We're the only networking company that has a big security business. None of our security competitors have a networking business. So it's a big advantage to us as we go forward.
We just had the CEO of Okta on the show, and his entire pitch was, "I will build you a kill switch for your agents." Is that competition for you? Is that something that will work alongside what you're planning?
I actually think there's a great opportunity for us to partner with Okta. That kill switch might be implemented at the network layer because we may see something happening that it won't see at the upper layers. So we'll figure it out, but the teams are working on this day and night right now.
The deal is being made here on Decoder. You heard it here first.
Exactly.
This seems like the opportunity. When I say Cisco's a company that grows with booms and busts, the amount of compute that everyone is describing that they need in order to deploy agents at scale across the enterprise and to train the next generation of models is vast. You are obviously going to help build the data centers that supply a lot of that compute. The question I have is, do you see the revenue on the back end of that? This is a lot of growth, a lot of forward investment.
For them?
For them and for you, right?
Well, we're getting the revenue now. We would not expect this buildout to end anytime soon. Everybody wants to compare this to the dot-com era, right? Is it a bubble? Is it going to bust? I'm like, well, did the dot-com bust or did the winners emerge, the losers failed, and now we have what we have? If they hadn't been successful, we wouldn't be talking about anything we're talking about today.
So, it wasn't like it went away. People lost money, but the winners emerged. I think you're going to see the same thing here. The difference is that, in a lot of cases, the companies that are spending so much money on this infrastructure view it as an existential issue for their survival. They're going to continue to build, and they're going to continue to invest. I think they've proven that over the last few years, and I think we have a long way to go. We're very early in this cycle.
I have two thoughts about this I'm eager to push you on. One — and this is just related to the infrastructure — a big part of the bubble there was that we built a fiber network that sat dark for ages. You can say whether that was good or bad, but we had it. And the fiber itself was valuable, even if it wasn't full of traffic yet. Is a data center valuable on the same scale? If you build a data center, and there isn't the consumer workload to run it, you can't just show up 20 years from now and plug into it the way that you could with the fiber.
I think that the difference is that, unlike that fiber, these data centers are being used day one at full capacity. I mean, they're just being used. In our world, it's about the networking connectivity, but it's also about optics. We haven't talked about optics, but we made some strategic acquisitions in optics, which has also been a big deal for us. Because at some point, you won't be able to get the packets off the processor over copper because the speed's just too great. So us having both those technologies in house is another benefit as we look to the future.
The other question I've been really thinking about with the dot-com era comparison is less dot-com and more mobile. If you look at the promise of the dot-com era, it was, "We're going to take the economy, and we're going to move it onto the internet broadly. You're going to buy your pet food online, and maybe you weren't going to buy it on a Dell desktop PC." It actually happened when we got to mobile. We just moved the economy onto the internet. Everyone's doing e-commerce, and it turns out buying pet food from Amazon on a computer totally works when that computer is a phone. Then, Apple and Google get to extract rent from everybody for all their purchases and games. We have an economy that works that way.
The promise of AI is we're going to do it again. We're going to move the economy a third time to the next paradigm in computing. What's the evidence you see that that is happening or will happen at the scale necessary to support this investment?
Well, if you look at some of the early agentic platforms, you heard Jensen this week talking about OpenClaw. I guess when this is broadcast, it would have been two or three weeks ago, but nonetheless. If you just look at the early promise of what that can do for you, I think you're going to see it automate a lot. It's going to make your whole purchasing process different.
I think it's yet to evolve, but I just reflect back to 2007 when the iPhone came out. None of us had any clue what we'd be doing with that phone today, none of us. Maybe there were some people somewhere who were such visionaries that they saw it coming. But the application portfolio that we have today is much broader than we ever thought it would be. I think you're going to see the same thing emerge around AI.
We don't know what is going to come. We have ideas about things that we think will happen, but we don't know everything that's going to happen. I mean, this stuff's changing so fast. I talked to Kevin Weil at OpenAI and he's like, "We'd sit down and have meetings about what are we going to do the next two months, and then three weeks later, we throw it out and start over because everything's changing so quickly." I think that's the way we're going to all have to operate, which is going to be very uncomfortable for a lot of people.
Is that changing the way you're selling your products to build this capacity? Because if you don't know what the capacity is for, it must change how it's being built.
It's changing a lot about how we design silicon. These customers are so big that they're a market of one. So, we have unique requirements coming from an individual company, which we haven't had to deal with in the past. We built general-purpose silicon, we sold it to everybody, and it worked.
So, you have different applications, different use cases, different customers that are leading us to move faster and build more variants of this technology than we would have in the past.
That's right up against the insatiable demand of other silicon providers, right? There is a capacity crunch for chips, there's a capacity crunch for RAM. How is that working for you? Are you able to get the flexibility you need?
Yeah. Certainly, when you look at fab capacity, we could use more, but the world could use more. I don't think you'd get anybody on here who builds chips that wouldn't say, "I'd love to have more capacity." Same thing for memory. We're in a crunch for probably 18 months doing everything we can to try to secure what we need. We feel pretty good about where we are right now, but we'll see how the demand plays out over the next year and a half.
I've talked to people about RAM margins, like consumer laptop vendors who say, "There might not be consumer laptops this year." It might just be priced out. You might never be able to cover the cost to just put a stick of memory in a cheap laptop. You might just be out. The CEO of Razer, which makes gaming laptops with lots of fun lights, was like, "Week to week, I don't know what the margin on that product will be."
It's true.
You've got to build a big piece of the infrastructure puzzle. The GPU is useless without the networking. This at least has to equalize somewhere for you, right? You'll say, "Look, this is our margin to build the networking, to get the value out of the GPUs that we're buying at super high rates from Nvidia and whoever else." Is that working in the market to at least equalize your prices?
Networking equipment uses a lot less memory than compute platforms do. So, we still have memory in every networking device, but it's a much smaller percentage of the BOM than it would be in a —
That's "bill of materials."
Thank you, sorry about that. It's a much smaller percentage than it would be in, like, a server. The customers understand… What I keep trying to explain to them is that price increases are happening upstream from us. We're just an absorber of the price increase. We're having to do more frequent price increases than we have in the past, and we're having to change our terms to deal with the same thing that your other guest talked about, which is the dynamic nature of the pricing that we're seeing right now in the memory space.
But when you go to the large hyperscalers… I said earlier that it's existential. So, what we've just adopted with them is a more transparent model that says, "Here's what we need. Here's how it works. Here's our pricing." And they generally understand.
Because there are other choices, especially for you to provide —
Everybody's in the same boat. It's not like you're going to go somewhere else and somebody's going to give you memory at 10 percent of the cost of what we're offering. Everybody's just trying to deal with the capacity crunch right now.
This brings me to the Decoder questions, because my next set of questions after this is how you're handling this interlocking set of complicated puzzle pieces.
Tell me how Cisco is structured right now. How big is the company? How is it organized?
We're 85,000 people, plus some contractors. We're functionally structured like most companies. We've got a sales organization. We've got a product organization. The one change I made about 18 months ago was to consolidate all of our products under a single leader for the first time that I can remember. It's a big complex portfolio, so we did that. We've got a services organization. It's fairly functional. Pretty standard.
You've been reducing the size of the company pretty substantially over the past three years, I would say. You had two big rounds of layoffs in 2024. You just had some other little layoffs.
Most of the time those are rapid reallocations that we need to do. It's unfortunate, but it's not… Typically, the ones we've done have not been about reducing the total head count. At least, they have not generally been that way up until now.
I was reading some coverage of those changes. There's a lot of, "Are these AI-related layoffs?" Is that on your mind? That you might be thinking about new kinds of structures, new kinds of engineering structures?
As an example, let's say that our engineers become twice as productive because of coding. This year, we'll have five or six products that'll be 100 percent written by AI. Next year, we'll probably have 70 percent of our code be written by AI.
You still have to test it. You've still got to go through all that stuff. But let's say you make them twice as productive, just to simplify the math here. The companies are going to have to decide, "Am I going to maintain the same pace of innovation with half the people? Or am I going to double my pace of innovation with the same number of people?" I think different companies are going to make different decisions with some in-between variants. I think that's where we're heading.
But we've got to see this all come to life. We're seeing the early successes of coding, but we haven't seen the unintended downsides that we haven't figured out yet. My head of product was saying that we've got 20- or 30-year-old code that's integrated in the systems that's written in C++, as an example. That head of product told me, "We took all these old lines of code, we compressed it by about 20 percent, and we converted it to a modern language using AI." My first response to him was, "You better test that like crazy before you put it in a product and then put it in a customer environment." There's a lot of stuff we're still learning as we go through there.
Stay on that for one second. Cisco code can't fail, right? The networking components should not go down in the same way that… I don't know, how we are resilient to Amazon being broken for five minutes and then it coming back to life, right?
The world stops.
Yeah. If Cisco fails, something bad happens in an escalating, catastrophic way.
I get those calls, by the way. [Laughs]
I have a lot of listeners who are like, "What's Chuck's phone number? Because I manage a Cisco portfolio." We'll give it out at the end.
Okay. Great. Perfect.
Stick to the end of the episode. There'll be an affiliate code when you call. [Laughs]
How do you think about that risk? I keep joking about how I ask everybody the org chart question. I've asked it for five years, and there's two answers: we're functional, we're divisional, and we get through it.
Now, I think we're on the cusp of seeing some of the weirdest org charts in business history. "I manage a team of two people and 500 agents." Meta is about to do one manager to 50 individual contributors all using agents to write code. I don't know how any of that's going to work. You can't take some of those risks, but you're describing the productivity gains that might come with some of those risks. How are you thinking about that?
We need a little more runtime. You're right. The whole mental model around our software development versus these models is different. Kevin Weil from OpenAI made a comment at our AI summit, and he said, "You guys should be using these models when they're working properly 10 percent of the time just to get to use them." I sat there and I listened to that comment. It's just a different way of thinking. Granted, they're going to get it to full… But you go into it recognizing that it's still evolving. We don't have that luxury. Our stuff has to work. We'll have to figure this out as we go, but we've seen how dependent the world is on technology functioning properly. We'll have to just assess it as we get closer, but I think there's going to be an awful lot of testing that has to get done.
But the flip side is that we think AI can help us find bugs more quickly. It can help us assess customers' infrastructure and say, "Hey, you're running these four versions of our software. We've seen a lot of instances where when you're running those four, it's created a problem." Or, there are cybersecurity risks in certain parts of the code that AI can help us find. There are a lot of upsides. There's a lot of opportunity for AI to help us become more reliable safely.
You've mentioned security several times now. The flip side of deploying AI to help with security is your adversaries who attack might be able to deploy AI to attack you much more efficiently.
And they are.
How is that playing out for you?
The emulations that you're going to see, like email and video simulations and people replicating me, are just going to get crazier. So, we have to be better at using our tools. I have also been a big proponent of all of the security competitors in the industry laying down our weapons. We still compete but in service to our customers. I believe we have to more effectively share intelligence in real time today to help our customers deal with this because any one of us on our own is going to be less effective than all of us together.
That's a big thing we've been pushing. We've been building a lot of capabilities. There are a lot of opportunities to integrate our platforms and our threat intelligence. If you think about what you can do with models, like training on threat intelligence and conditions that led up to threat vulnerabilities, there's an awful lot we can do to get ahead of this. And we need to do that.
I think this brings me into the other Decoder question that I ask everybody. This is the one that I think is pressure for everybody. At the scale of change you're describing here, how do you make decisions? What's your framework?
When I wrote my thesis during the process of becoming CEO and the board was assessing the candidates, one of the things that I called out in the document — and this is 12 years or 11 and a half years ago — I said that the industry is moving so rapidly that you're going to need a team-based strategy. You have to have a lot of people developing strategy because there's no one individual. There are some brilliant minds, so I'm not ruling any one human out, but there's no one individual who can come up with the exact right strategy every time they're assessing what they need to do.
So, we spend a lot of time together as a team. We spend anywhere from one to three hours together every Monday. We go off-site together for two to three days every quarter. And the way we make decisions… Look, 99 percent of the decisions get made below me because they're easy or because two smart people agree. When they get to me or any other CEO, you're usually assessing two potential bad choices. Or you have two smart people who completely disagree, which tells you it's complicated. In general, we just spend a lot of time in transparent discussion and open communication about how we're going to make the decisions.
At the end of the day, I own them. I have this belief that when a decision goes really well, you give everybody else the credit, and when it goes very poorly, it's all on me. That's just how you have to operate.
To the decisions question, you are dealing with a vast amount of uncertainty, right? There's a vast amount of uncertainty with how the global internet will be structured. What do the hyperscalers need as they build out new capacity for uncertain workloads? Who knows. We're going to sell a bunch of products to the neoclouds, which have circular financing. Those bills might not get paid, which I want to come to. That is a lot of uncertainty. I would say whether or not all of this infrastructure investment pays off in GDP growth is the biggest uncertainty of all. How are you dealing with that?
You haven't even gotten to three or four other big ones.
Go ahead. What are they?
Well, you've got the geopolitical situation, you've got sovereignty requirements emerging all around the world. You've got two wars around the world. You've got tariffs, you've got memory costs, you've got all these things that we're all trying to navigate. So, it's pretty complicated.
That is a lot of uncertainty on your decision-making. You're saying it all rolls up to you when it goes wrong. Has that affected how you're making choices?
Faster. You just have to move faster. We had an all-hands with our entire company yesterday. We do it once a month. I told them, "Look, if speed and change make you uncomfortable, you're going to be uncomfortable, because it is a world where companies can get seriously damaged in a very short period of time."
This is what's driving a lot of the investments. There's a big FOMO issue in the C-suite today. CEOs are like, "What am I missing? What's my competitor going to do that I don't know about?" We used to say, "Get 80 percent of the information you can, make the decision, and then adjust as you go." And I think that's… Maybe it's not 80 percent anymore, but you're going to have to take that approach. You're going to have to be willing to take risks, and you're going to have to be comfortable being uncomfortable. And if you're not, it's going to be a pretty complicated and stressful time.
Billions of dollars in capital are being allocated for infrastructure. Does it come up that the products that might pay this off don't exist? Does it come up next to the FOMO?
Depends on the customer. If you look at the [telecom operators], the cloud providers, the people whose core business is highly dependent upon products that we build. Everybody is, but we will have those conversations with their CEOs and their leadership team. You go to Mobile World Congress, as an example. We were just there, and the CEOs from some of those carriers and service providers are in every meeting. So, they care. When you get into the enterprise space, some of them are super technical. They understand the value of technology. So, they want to talk about trends. They want to talk about what we see other companies doing or what we're doing as an enterprise that they should be thinking about.
But usually, if there's something big on the table, my only position with them is, "If you go with us, you have my personal commitment that we'll throw all the resources you need to make you successful." That's usually all they want to know.
I feel like there's a split in the market right now. I understand the enterprise use cases for AI. I understand why you'd want to build as fast as you can there — particularly in software development, as you described. We can see the benefit.
We talk to developers all the time here at The Verge. They're like, "Our entire job is different." The world has changed. The market has cracked open. Something is going to happen there. Then, downstream of that, you can say, "Well, we hired a bunch of engineers to build us business process automation. Maybe we get way more value out of those engineers and we get way more automation." There's something in the enterprise that's going to happen with AI, and I feel like I understand the value there. Do you see any consumer applications of that scale, beyond just telling Alexa to buy me shoes? Quite honestly, I don't yet, apart from Google Search getting a lot weirder over the past two years.
I don't have any great examples yet. You're right. You are seeing some horizontal areas in the enterprise that are consistent across almost every company, like coding. Customer service is one that everybody's working on. You start to see some emerging horizontal use cases in legal. We're seeing a lot of use cases in our people organization, too. I think those are pretty standard. Everybody's at least aware of those opportunities. People are at different stages on the journey. But I'm not the consumer expert by any stretch. We're purely B2B, so that's where I spend all my time.
If I saw it, I would have probably read it on something you published.
We're looking for it every day.
The reason I'm asking is because I think this relates to why I started out asking about data centers in space. I've heard [Google CEO] Sundar Pichai say variations on this idea. Without the big consumer application that everybody understands and can see the benefit of, putting the data center in the backyard is becoming an increasingly harder sell. The power requirements, the water requirements — which I know are controversial and often argued about — just the energy, resources, and requirements of the data centers are making them unpopular.
I don't think it's all that. I don't think the environmental argument historically wins in America. I drive a V8 Mustang, and I'm going to keep driving that car. We have an EV that's parked right next to it, but those cars are popular for reasons. Fast fashion, enormous environmental impact. People like it. There isn't an AI product for consumers that they like so much that it just transcends objections they can reach for. We're seeing it play out in really weird ways. In bipartisan ways, people are pushing back against the data centers.
Are you going to be able to hit your goals if data center construction slows? Is there a way to hit those goals without the great consumer product?
I think there is. Look, we've been the most innovative country on the planet for a very long time, and that's not going to change. Some of the smartest people in the world are actually trying to solve these problems, and they will.
By the way, I think if you give some of those residents the greatest AI tools that they've ever seen in their lives on their phones, they still don't want the data center in their backyard. I don't think they're going to say, "Oh, this is great. Go ahead and drive my energy costs through the roof and I'll be okay with it." That's not going to be the gating factor. I think those apps will come. We saw a little bit of this with 5G. They didn't want radio towers. You remember that whole thing?
Oh, I remember.
This is that at a much greater scale, but I think we'll figure it out.
The 5G comparison's really interesting. I know you just came back from Mobile World Congress. At least the telecom industry understood that they had to describe some applications that all of this build-out will accomplish. The ones that got me every time were, "We're going to have self-driving cars and we're going to do robot surgery." There were all these demos of these things. I went to endless CES demos. A self-driving car demo is fundamentally very boring. You don't want to be in an exciting self-driving car demo.
You don't.
I sat in a lot of them at CES and pretended to be very excited that 5G would drive the car.
That car looks like every other car driving.
Yeah. And it's like, "I would like this to be as unexciting as possible."
Yes.
Maybe there was one demo of a 5G surgery, and it was still backstopped by wired internet. 6G has the same sort of application problem, right? We don't know what it's for. AI has the same problem. You can't describe what it's for in a way that might overcome the objection. That feels like a fairly unique point for the whole industry to be at, where the next generation of technology is very exciting to a handful of providers. It's the future of your business in real ways, and the applications are harder and harder to describe.
You've seen it all. I'm asking you this question on a big sweep. If we're talking about the internet, it was easy to describe what it might do for people. I actually disagree with you. I think a lot of people instantly saw what the phone would become. There was an excitement there. That's where you got startup founders from.
This one just seems more nascent to me. How would you place that in your sweep of history?
You're right, we didn't have the number of use cases. I think if you asked the telecom CEOs, they would probably say that they're disappointed in the return they got on all the investment around 5G. That's pretty well-understood. I think robotics in general could be the real driver of 6G utilization once it gets built. But again, its early days are being defined. Typically, we've talked about these tech transitions for years and years and years before they come to fruition.
AI is different. We did talk about it for a long time, and then all of a sudden it broke loose. The pace at which it's changing is just unprecedented. I think we'll have to see on 6G; it's still TBD. I don't expect that you'll see the same mistakes made from a speed-of-investment perspective until they become more clear.
The internet also was hand-in-hand with globalization, right? We both have iPhones that are made all over the world. You had this giant global network. Maybe this is going to lead to an age of prosperity. Maybe this is going to lead to an age of extreme labor displacement. You can read that however you want. There are a lot of opinions about what the internet and particularly globalized manufacturing led to.
That's all being undone. You can see that's being undone every single day. Whether that's with tariffs in an effort to bring manufacturing back to the United States — which we've talked about on the show with many of your peers — or whether it's, "Hey, we're going to put up big walls on the internet." Australia's going to have a social media ban for teens. They've got to enforce that somehow. That's probably going to happen at the network layer. You have the European Union saying, "The data has to be here. We have to put the data here. The European data has to be in Europe."
You build the networks. I'm imagining all of this is just one more layer of complication, even as you describe how we should have global systems that bring us to an era of shared prosperity. How are you dealing with that?
I think what you're seeing play out is not only do countries want data sovereignty, they want to have sovereign control over any technology they're using. And it's not limited to Europe at this point. They don't want the US to have the ability to impede the use of those products under any conditions. As an example, some of the meeting platforms like Webex or Zoom don't want any other country — I'll say the US, but any other country — to have the ability to cut off access to these platforms if they're going to invest and use them for critical reasons in other countries. Let's use Europe as an example. In many cases, European companies that build the technology they use in their infrastructure don't have that capacity at scale. So then they have to default to, "Who are the companies that I trust?" And trust becomes such a big… It's a big discussion obviously around AI, but it is really a big deal.
So for us, as an example, we have always tried to be good citizens and good members of the community in every country we operate in. We've had education programs for 25, 30 years that train learners on digital skills around the world. Last year alone, we had 5 million learners around the world go through one of these programs. So, that trust element's going to become very important. The technology is one thing. You have to build technology that can be deployed the way they want it to be deployed. Then, you have to have a very high degree of trust when you work with them.
Is this changing how you're architecting some of your products?
It is.
What are some specifics?
Well, you'd love to have a cloud solution. Historically, what you would do — and a lot of companies were built this way — is build global instances, partition them, and sell them off to different customers. As an example, if you go to Germany and Germany says, "I want to have my version of that running in my country," it's architecturally different than how you might have designed it to begin with.
We're now designing a lot of those control or cloud-oriented systems so that they can be structured to run in a country alone, and we don't think about building global instances anymore.
From your perspective, this is a very different way of building the internet, right? It just isn't the thing. My first experience with the internet was watching the coffee pot at the University of Cambridge. This was the promise when I was a kid watching a one-frame-a-minute live stream of a coffee pot and being like, "I can go there."
You watched live streams of coffee pots, for real?
Do you remember this? In 1994, the first video feed on the internet was a coffee pot in Cambridge.
It blew my mind when I was a kid.
In some ways, there's more of that than ever, right? You can watch live streams of all the coffee pots ever.
Any one you want.
If you want to. In some ways, this is just closing down, right? Every country is saying, "Our citizens are here and we're going to manage what they do. Their lives are on the internet. We are going to control the internet in our country." That's happening all over the place in all kinds of ways. It's honestly happening state to state. The internet in California looks different than the internet in Texas today.
You're the networking provider for many of these countries, for many of these companies. When you think about the sweep of what the internet might look like, when you think about the amount of compute that's happening in a data center as opposed to happening locally on my laptop (which is always a kind of dance), what does the internet of the next five years look like to you?
Well, it's going to be more fragmented for sure. You're seeing the cloud providers build in regions and certain places, and they're having to re-architect to think about this. I'm not sure most of the functionality we use for the internet today is going to change much, to be honest. There are going to be controls that will exist, but I don't think it's going to change the core, normal operating functionality of how it works today. Those controls will be there in the case of an issue or an emergency.
Now, where you store data and all those kinds of things isn't the network's issue, so how that plays out is independent of what we would think about. But I think you get into times of crisis, and that's when you might see things happen differently. If you're a certain country that gets into a conflict and you want to isolate yourself from a communications perspective so you can trust that your communication's clear, then that might create short-term dynamics for your citizens. But I don't think it'll be meaningfully different on a day-to-day basis.
We're seeing that right now. India shuts off the internet and cash flow all the time. The Iranian internet is on and off every day. Are those things that your customers are coming to you and saying, "The government wants this capability of the network one level up. Can you help us build it?"
They are having those conversations primarily around not wanting to have a third party or another country disintermediate their capabilities through tech by having some control or a kill switch. That's typically what they're talking about.
How does that play into AI a bit? Now, we have these workloads built on networks that you're supplying, you've got a bunch of agents doing stuff all day long, and you're saying, "We're going to be the security provider for it."
At some point, does Donald Trump get to say, "Turn off the agents, it's getting out of control?"
You have to think about security at an agent level. It's like you would do at an employee level but on steroids. You'll need to apply five to 10 times more security, maybe more. I'm just throwing numbers out. We have to figure that out as we get going, but it's certainly going to introduce an entry point for bad actors to do things that you wouldn't want them to be able to do. We're learning, and everybody is working on this problem simultaneously right now.
I feel like for most of this conversation, my assumption has been that this is going to keep going. This is going to keep working. The problems will be complicated, everyone will work diligently, we'll solve them, someone will invent the consumer product, and all of this will pay off.
What if it doesn't? What if this bubble pops? What does that look like for you?
What'll happen is there'll be lost, misplaced capital. There'll be companies that shut their doors. Then, the winners will emerge, and we'll build out at scale just like we saw with the first wave. I suspect that's what will happen.
I think there are certainly going to be companies that will cease to exist. They're going to go away. That's what happens with any of these early things. You take a risk. That's why the reward is so high; it's risk-reward. It's the nature of these massive transitions, and this is bigger and faster than anything we've ever seen.
The amount of capital tied up in what you might call circular financing with some of these neoclouds seems dangerous to me. It seems like if I had to point at where things will get shaky first, it will be, "Well, we did a lot of forward investment with a lot of debt investment into neoclouds against workloads that themselves have not yet paid off."
Eventually, the bill will come due or the investments don't happen. Is that a risk that's on your mind?
It is, but not particularly for us. We're super conservative. I've heard instances where we've looked at financials and have chosen not to do business with some of these folks. I think every company has to make their own decisions.
We also have creative financing solutions that protect us so we can work through it. We learned a lot in 2000 because we were doing a lot of that back then.
The neoclouds, are you in them or are you staying away from them?
No, we're in some of them. Some of them want to use us. Honestly, a lot of them want partnerships with us because they want the enterprise access. They don't have a robust enterprise sales force, and they think we can help them there. So, in many cases, we work together to figure that out.
The other way I can see things maybe not getting shaky but changing dramatically so that it changes the investment available in the AI industry is if inference becomes more valuable than training, right? So far, all the emphasis has been on running these GPUs red-hot to do training because the next version of the model will finally be capable enough to, I don't know, be your girlfriend. Whatever it is that they think they're going to do. Something about training has been the point. We're going to build AGI. They don't want to say it, but they're saying it all the time. There's a chance that the models are good enough, and it's actually just inference now. We're just going to run agents, and Claude Code is big enough to meaningfully affect enterprise cost dynamics.
Does that change your business if we're done with training and, actually, inference is the point?
No, it is actually great. I don't think you're going to be done with training, by the way. I think the inferencing stuff is going to be additive.
Do you need all the new data center build-outs if it's inferencing instead of training?
I mean, we would like to participate, and we'd obviously like to see that continue to grow because it's good for our business. But some people believe the inferencing side is going to be bigger.
I think that it's going to be very distributed. You think about how a lot of enterprise customers are going to want to do inferencing at a point of interaction with a customer and garner immediate value in that interaction, and that's going to be very distributed. Distributed compute requires high-performance networks, which is good for us. So, we like that.
This is what I was mentioning: the dynamic between the edge and the data center seems to always be changing. I think I saw some press releases out of Nvidia's GTC about more compute coming to the edge of certain big network providers.
Are you seeing that play out? We don't know where it's supposed to be, so everyone is investing in both the edge and the data center?
It's still early at the edge. I think everybody believes they're going to need it, and we're seeing certain applications where people are starting to pilot it. I think this may be a good opportunity for the telecom providers. There has always been this thesis that edge compute was going to be a big benefit for them.
That was the thesis of 5G. I won't say the name, but I went to a very long dinner with one of the major telecom providers, and they told me all about self-driving cars powered by edge networks.
But you could see this become something. There are discussions now of inferencing grids and the dynamic routing of these inferencing requests based on everything from the cost of power at a given time of day to the capacity that's available. I mean, there's a lot of thinking about how this plays out. I think it's still TBD, but it's coming.
So, I want to bring this all back around. The business here is building data centers with people, with big customers.
It's part of it. We also connect all the employees and everything else, too.
Well, sure. I mean, do you want me to talk about Webex for another hour? Because I have a lot of notes about Webex.
[Laughs] We can talk about anything you want.
Apple uses Webex. Does Tim Cook ever say, "Dude, can you just make the Mac client a little bit better?"
No, it's actually better than most others. Do you use it?
I'm a journalist. I'm on calls with these companies all of the time. So Webex comes up in my life.
Okay, good. I'm glad to hear that.
I'm just telling you, find the person, the native Mac client -
All right. I'm going to get one of my guys on the phone with you and make sure that -
We've got to do it on the show, and we'll just go through a demo together. But you've got to be there.
I've got to be there?
Yeah.
All right.
We'll do live notes on a Webex call.
For you to be happy with Webex, I'll do that.
Every time an enterprise software CEO comes on the show, I'm like, "Do you use your product?" And I would say it's 50-50.
I do all day long.
You obviously do.
All day long. But I was also a coder early in my life, so I'm a little weird. I've used Claude Code, so I'm -
You're in it.
Yeah.
All right. But I'm saying the growth of the business, the explosive growth that everyone is seeing, is in AI, right? It's in building this new generation of data centers, this new generation of compute. I just keep circling around it, but the problem is that people don't want those data centers near them, and I have yet to see the argument for why that should happen.
In my mind, the argument is great consumer products. If you're like, "That's where Netflix comes from," I think people will calm down. But that's not the argument we're making right now. There isn't a product like Netflix.
That's where Netflix comes from.
I think if you were like, "Netflix is building a data center in your town," people would be like, "That rules."
Yeah, it's going to be faster.
Right. Is Tom Cruise going to be there? You would have some emotional connection to the thing that's happening. We don't have that right now.
The pressure on not building these data centers is only going to go up in weird ways. In Alabama, there's a state senator who proposed blocking solar build-outs as a way to reduce data center interest in his state. That's a weird outcome. What happens if we can't build more data centers? What happens if the public just doesn't buy in?
We'll build them in space faster, I guess.
This is why I started off asking if you're just trying to escape the political problems of Earth.
I don't think they're political problems. I think they are issues of utility and power, cooling and water, and all those things. They're all interconnected.
Again, I don't wake up every day and deal with this issue, but the people who do are very smart people. I think the thing a consumer will be okay with is if you go in and not only build a data center but somehow increase the utility capacity of that community or do something positive in that community beyond streaming Netflix faster. That's when they'll be okay with it, because I don't think their concerns are around it being unsightly or anything like that. I think the issue is the concern over the inflationary pressure that it puts on utilities and the things they need.
In my hometown of Racine, Wisconsin, there was supposed to be a Foxconn factory, and that never came to pass. Now, it's going to be a Microsoft data center. Instead of 13,000 or 15,000 jobs, which is what Foxconn promised for that site, there are going to be like a couple thousand.
This is a lot of water and a lot of power without the economic lift that you get. Then maybe there are the inflationary pressures on power or other utilities. As your customers are building out, are you working with them to reduce those pressures, to find ways to make the data centers more efficient?
Our role is really around the power consumption of the platforms that we sell, and that's a massive part of our innovation cycle. We want to deliver higher performance and lower power consumption every time. So, that's the role we play in that space.
Well, Chuck, you've given us a lot of time. What's next for Cisco? What should we be looking out for?
It's hard to predict what's going to happen. As I said earlier, we had a high degree of luck with the optics and silicon investments that we made. We had some smart people who were suggesting that we make them, but they've turned out to be magical for us right now.
For the next three to five years, we're going to be spending every ounce of our energy on secure connectivity in this agentic era. But I mean, I don't know what we'll need to do three years from now because things are changing so quickly. I think we're as prepared as we can be.
Well, we'll need to have you back sooner than three years to see where the pulse is. Thank you so much for being on Decoder, man.
Thanks, man.
Questions or comments about this episode? Hit us up at decoder@theverge.com. We really do read every email!
-
AI is changing how small online sellers decide what to make MIT Technology Review Apr 06, 2026 11:00 AM 6 min read Entrepreneurs based in the US are using tools like Alibaba's Accio to compress weeks of product research and supplier hunting into a single chat.
For years, Mike McClary sold the Guardian LTE Flashlight, a heavy-duty black model, online through his small outdoor brand. The product, designed for brightness and durability, became one of his most popular items ever. Even after he stopped offering it around 2017, customers kept sending him emails asking where they could buy it.
When McClary decided to revisit the Guardian flashlight in 2025, he didn't begin the way he might have in the past, by combing through supplier listings and sending inquiries to factories. Instead, he opened Accio, an AI sourcing and research tool on Alibaba.com.
For small entrepreneurs in the US, deciding what to sell and where to make it has traditionally been a slow, labor-intensive process that can take months. Now that work is increasingly being done by AI tools like Accio, which help connect businesses with manufacturers in countries including China and India. Business owners and e-commerce experts told MIT Technology Review that these AI tools are making sourcing more accessible and significantly shortening the time it takes to go from product idea to launch.
McClary, 51, who runs his business from his Illinois living room, has sold products ranging from leather conditioner to camping lights, including one rechargeable lantern that brought in half a million dollars. Like many small online merchants, he built his business by being extremely scrappy - spotting demand for a product, tweaking existing designs, finding a factory, doing modest marketing, and getting the goods in front of customers fast.
This time, though, he began by telling Accio about the flashlight's original design, production cost, and profit margin. Then Accio suggested several changes, making it smaller and slightly less bright and switching its charging method to battery power. It also identified a manufacturer in Ningbo, China, that McClary said could cut the manufacturing cost from $17 to about $2.50 per unit.
McClary took the process from there, contacting the supplier himself to discuss the revised design. Within a month, the new version of the Guardian flashlight was back up for sale on Amazon and on his brand's website.
The new factory hunt
Although Alibaba is better known for owning Taobao, the biggest shopping site in China, its first business was Alibaba.com, the primary website that lists Chinese factories open for bulk orders. Placing an order with a manufacturer usually requires far more than clicking "Buy." Sellers often spend days or weeks browsing listings, comparing suppliers' reviews and manufacturing capacities, asking about minimum order quantities, requesting samples, and negotiating timelines and customization options.
But Accio has gained significant momentum by changing how that sourcing gets done. Launched in 2024, Accio exceeded 10 million monthly active users in March 2026, according to the company. That means about one in five Alibaba users consults with AI about product sourcing.
Accio's interface looks a lot like ChatGPT or Claude: Users type a question into an empty box and choose between "fast" and "thinking" modes. But when asked about products, the tool returns more than text, offering charts, links, and visuals and asking follow-up questions to clarify the buyer's needs. It then narrows the field to one or a handful of suppliers that appear capable of delivering. After that, the human work begins: Users still have to reach out to suppliers themselves and negotiate the details.
Zhang Kuo, the president of Alibaba.com, told MIT Technology Review that the tool is built on multiple frontier models, including the company's own Qwen series, a popular family of open-source large language models. The system is able to pull from the site's millions of supplier profiles and is trained on 26 years of proprietary transaction data.
For tasks like product research and sourcing analysis, the tool "blows it away" compared with general AI tools like ChatGPT, says Richard Kostick, CEO of the beauty brand 100% Pure.
Many websites have tried using AI to assist shopping, but Alibaba has been one of the most aggressive. In March, Eddie Wu, CEO of the site's parent company, Alibaba Group, told managers that integrating the company's core services with Qwen's AI capabilities is a top priority. During a Chinese New Year promotion of Qwen's personal shopping AI agent, where the company gave away cash, customers placed 200 million orders, the firm says.
Vincenzo Toscano, an e-commerce seller and consultant, recommended Accio to his clients before deciding to try it himself for a new sunglasses brand. He came in with a rough vision: a brand shaped by his Italian heritage, his personal style, and a boutique aesthetic. He says the AI helped turn that concept into something more concrete, suggesting materials, refining the look, and pointing to design ideas that felt current.
But the tool has clear limits. McClary, who uses AI tools regularly, says Accio is strongest when it comes to product ideation, but less helpful on marketing questions such as advertising and social media outreach. To use it well, he says, buyers still need to challenge its recommendations, since some can be generic.
The rest of the business
As platforms become more AI-driven, manufacturers are adjusting too. Sally Li, a representative at a makeup packaging company in Wuhan, China, says her firm has started writing more detailed product descriptions and adding information about its equipment and manufacturing experience on Alibaba.com because it suspects those details make its listings more likely to be surfaced by AI.
Li says manufacturers cannot tell whether an inquiry from a customer was generated or guided by AI, and that her firm is not using AI to negotiate pricing or product details.
"AI agents are increasingly used by people to assist purchase decisions and even directly making transactions, and with clear guardrails, they can become extremely useful," says Jiaxin Pei, a research scientist at the Stanford Institute for Human-Centered AI, "but agents need to act transparently, securely, and in the customer's best interest." Pei says developers of these tools should disclose the data they collect and the incentives built into them to ensure that the marketplace remains fair.
Zhang, of Alibaba.com, says Accio currently does not include advertising. Suppliers can pay for higher placement in Alibaba.com's regular search results, but Zhang says Accio is "not integrated" with that system. "We haven't had a clear answer in terms of how to monetize this tool," he says. For now, users can pay for additional tokens to continue chatting with the agent after their free queries run out.
Sellers say that while AI tools have made it easier to come up with ideas and get a business off the ground, they do not replace the core skills that make someone good at e-commerce. McClary believes that even when sellers have access to the same market information, some are still better at making decisions, acting quickly, and actually delivering on orders. Those differences, he says, still go a long way.
Toscano, the brand founder and e-commerce consultant, feels good about officially launching his new sunglasses brand in just a few months: "We [small business owners] always have to bootstrap a lot of decisions. Deciding what to sell often comes down to an educated guess," he says. "And we're now in an era when making those decisions is easier than ever."
-
Suno is a music copyright nightmare The Verge AI Apr 05, 2026 12:00 PM 1 min read Can't stop the slop.
AI music platform Suno's policy is that it does not permit the use of copyrighted material. You can upload your own tracks to remix or set your original lyrics to AI-generated music, but it's supposed to recognize and stop you from using other people's songs and lyrics. Now, no system is perfect, but it turns out that Suno's copyright filters are incredibly easy to fool.
With minimal effort and some free software, Suno will spit out AI-generated imitations of popular songs like Beyoncé's "Freedom," Black Sabbath's "Paranoid," and Aqua's "Barbie Girl" that are alarmingly close to the originals. Most people will likely be able to tell the dif ...
-
I let Gemini in Google Maps plan my day and it went surprisingly well The Verge AI Apr 05, 2026 10:00 AM 1 min read Tired: tokens. Wired: tacos.
Take me to the tacos, Gemini. You may be familiar with Gemini as the thing that's in every Google service you use - whether you want it or not.
While it's been a constant, sometimes unwelcome presence in Gmail for at least the past year, it's a relatively new addition to Maps. And you know what? It's kind of great.
To put it to the test, I had Gemini plan a day-long itinerary for me around the city. After an hour or so of having Gemini find stuff for me - playgrounds near the new light rail extension, kid-friendly restaurants with vehicle themes, you get the gist - I was impressed. Some of the suggestions were obvious, but I also bookmarked a handful of spots not on m ...
-
Grammarly's sloppelganger saga The Verge AI Apr 05, 2026 08:00 AM 1 min read "Expert Review" is no longer available for review.
This is The Stepback, a weekly newsletter breaking down one essential story from the tech world. For more on the ups and downs of AI, follow Stevie Bonifield. The Stepback arrives in our subscribers' inboxes at 8AM ET. Opt in for The Stepback here.
How it started
Most people probably know Grammarly for its browser extension that suggests how to spruce up your emails, but over the past few years, it's been eyeing bigger ambitions. In October, the company formerly known as Grammarly made a public pivot to rebrand as an AI company called Superhuman. The new name was adopted from Superhuman Mail, an AI email platform that Grammarly acquired i ...
-
A folk musician became a target for AI fakes and a copyright troll The Verge AI Apr 04, 2026 01:52 PM 1 min read Murphy Campbell is at the center of a growing storm around AI and a broken copyright system.
Murphy Campbell is at the center of a brewing storm around AI and a broken copyright system. | Image: Murphy Campbell In January, folk artist Murphy Campbell discovered several songs on her Spotify profile that did not belong there. They were songs that she had recorded, but she'd never uploaded them to Spotify, and something was off about the vocals.
She quickly surmised that someone had pulled performances of the songs she posted to YouTube, created AI covers, and uploaded them to streaming platforms under her name. I ran one of the songs, "Four Marys," through two different AI detectors, and the results seemed to support her suspicions, with both saying it was probably AI-generated.
Campbell was shocked, "I was kind of under the impression that we had a little b ...
-
Really, you made this without AI? Prove it The Verge AI Apr 04, 2026 09:00 AM 1 min read The quest to find the "Fair Trade" logo for human-made content.
"This looks like AI."
It's a phrase I dread seeing as a writer who dabbles in illustration and amateur photography. In a world where generative AI technology is increasingly adept at mimicking the work of humans, people are naturally skeptical when online platforms refuse to label even obvious AI content.
This leads me to one conclusion: maybe we should start labeling human-made text, images, audio, and video with something akin to a universally recognized Fair Trade logo. The machines sure as hell aren't motivated to label their work, but the creators at risk of being displaced most definitely are.
Fortunately, I'm not alone in my thinki ...
-
"Cognitive surrender" leads AI users to abandon logical thinking, research finds Ars Technica AI Apr 03, 2026 09:06 PM 1 min read Experiments show large majorities uncritically accepting "faulty" AI answers.
When it comes to large language model-powered tools, there are generally two broad categories of users. On one side are those who treat AI as a powerful but sometimes faulty service that needs careful human oversight and review to detect reasoning or factual flaws in responses. On the other side are those who routinely outsource their critical thinking to what they see as an all-knowing machine.
Recent research goes a long way to forming a new psychological framework for that second group, which regularly engages in "cognitive surrender" to AI's seemingly authoritative answers. That research also provides some experimental examination of when and why people are willing to outsource their critical thinking to AI, and how factors like time pressure and external incentives can affect that decision.
Just ask the answer machine
In "ThinkingâFast, Slow, and Artificial: How AI is Reshaping Human Reasoning and the Rise of Cognitive Surrender," researchers from the University of Pennsylvania sought to build on existing scholarship that outlines two broad categories of decision-making: one shaped by "fast, intuitive, and affective processing" (System 1); and one shaped by "slow, deliberative, and analytical reasoning" (System 2). The onset of AI systems, the researchers argue, has created a new, third category of "artificial cognition" in which decisions are driven by "external, automated, data-driven reasoning originating from algorithmic systems rather than the human mind."
-
Anthropic essentially bans OpenClaw from Claude by making subscribers pay extra The Verge AI Apr 03, 2026 07:52 PM 1 min read Claude vs. Claw.
Using OpenClaw with Claude AI is about to get a lot more expensive, thanks to Anthropic's new policy changes. Beginning April 4th at 3PM ET, users will "no longer be able to use your Claude subscription limits for third-party harnesses including OpenClaw," according to an email sent to users on Friday evening. Instead, if users want to use OpenClaw with Claude, they'll have to use a "pay-as-you-go option" that will be billed separately from their Claude subscription.
With OpenClaw creator Peter Steinberger now employed by OpenAI, Anthropic may also be encouraging subscribers to use more of its own tools, like Claude Cowork, instead. Steinber ...
-
OpenAI's AGI boss is taking a leave of absence The Verge AI Apr 03, 2026 04:22 PM 1 min read OpenAI is undergoing another round of C-suite changes.
OpenAI is undergoing another round of C-suite changes, according to an internal memo viewed by The Verge.
Fidji Simo, OpenAI's CEO of AGI deployment - who was until recently the company's CEO of applications - says in the memo that she will be stepping away on medical leave "for the next several weeks" due to a neuroimmune condition. While she's out, OpenAI president Greg Brockman will be in charge of product, including leading OpenAI's super app efforts. On the business side, CSO Jason Kwon, CFO Sarah Friar, and CRO Denise Dresser will take charge.
OpenAI's CMO, Kate Rouch, has also decided to step down in order to focus on her health, ...
-
Perplexity's "Incognito Mode" is a "sham," lawsuit says Ars Technica AI Apr 02, 2026 08:54 PM 1 min read Google, Meta, and Perplexity accused of sharing millions of chats to increase ad revenue.
Perplexity's AI search engine encourages users to go deeper with their prompts by engaging in chat sessions that a lawsuit has alleged are often shared in their entirety with Google and Meta without users' knowledge or consent.
"This happened to every user regardless of whether or not they signed up for a Perplexity account," the lawsuit alleged, while stressing that "enormous volumes of sensitive information from both subscribed and non-subscribed users" are shared.
Using developer tools, the lawsuit found that opening prompts are always shared, as are any follow-up questions the search engine asks that a user clicks on. Privacy concerns are seemingly worse for non-subscribed users, the complaint alleged. Their initial prompts are shared with "a URL through which the entire conversation may be accessed by third parties like Meta and Google."
-
The gig workers who are training humanoid robots at home MIT Technology Review Apr 01, 2026 11:00 AM 8 min read People in Nigeria and India are strapping iPhones onto their heads and recording themselves doing chores.
When Zeus, a medical student living in a hilltop city in central Nigeria, returns to his studio apartment from a long day at the hospital, he turns on his ring light, straps his iPhone to his forehead, and starts recording himself. He raises his hands in front of him like a sleepwalker and puts a sheet on his bed. He moves slowly and carefully to make sure his hands stay within the camera frame.
Zeus is a data recorder for Micro1, a US company based in Palo Alto, California, that collects real-world data to sell to robotics companies. As companies like Tesla, Figure AI, and Agility Robotics race to build humanoids - robots designed to resemble and move like humans in factories and homes - videos recorded by gig workers like Zeus are becoming the hottest new way to train them.
Micro1 has hired thousands of contract workers in more than 50 countries, including India, Nigeria, and Argentina, where swathes of tech-savvy young people are looking for jobs. They're mounting iPhones on their heads and recording themselves folding laundry, washing dishes, and cooking. The job pays well by local standards and is boosting local economies, but it raises thorny questions around privacy and informed consent. And the work can be challenging at times - and weird.
Zeus found the job in November, when people started talking about it everywhere on LinkedIn and YouTube. "This would be a real nice opportunity to set a mark and give data that will be used to train robots in the future," he thought.
Zeus is paid $15 an hour, which is good income in Nigeria's strained economy with high unemployment rates. But as a bright-eyed student dreaming of becoming a doctor, he finds ironing his clothes for hours every day boring.
"I really [do] not like it so much," he says. "I'm the kind of person that requires ... a technical job that requires me to think."
Zeus, and all the workers interviewed by MIT Technology Review, asked to be referred to only by pseudonyms because they were not authorized to talk about their work.
Humanoid robots are notoriously hard to build because manipulating physical objects is a difficult skill to master. But the rise of large language models underlying chatbots like ChatGPT has inspired a paradigm shift in robotics. Just as large language models learned to generate words by being trained on vast troves of text scraped from the internet, many researchers believe that humanoid robots can learn to interact with the world by being trained on massive amounts of movement data.
Editor's note: In a recent poll, MIT Technology Review readers selected humanoid robots as the 11th breakthrough for our 2026 list of 10 Breakthrough Technologies.
Robotics requires far more complex data about the physical world, though, and that is much harder to find. Virtual simulations can train robots to perform acrobatics, but not how to grasp and move objects, because simulations struggle to model physics with perfect accuracy. For robots to work in factories and serve as housekeepers, real-world data, however time-consuming and expensive to collect, may be what we need.
Investors are pouring money feverishly into solving this challenge, spending over $6 billion on humanoid robots in 2025. And at-home data recording is becoming a booming gig economy around the world. Data companies like Scale AI and Encord are recruiting their own armies of data recorders, while DoorDash pays delivery drivers to film themselves doing chores. And in China, workers in dozens of state-owned robot training centers wear virtual-reality headsets and exoskeletons to teach humanoid robots how to open a microwave and wipe down the table.
"There is a lot of demand, and it's increasing really fast," says Ali Ansari, CEO of Micro1. He estimates that robotics companies are now spending more than $100 million each year to buy real-world data from his company and others like it.
A day in the life
Workers at Micro1 are vetted by an AI agent named Zara that conducts interviews and reviews samples of chore videos. Every week, they submit videos of themselves doing chores around their homes, following a list of instructions about things like keeping their hands visible and moving at natural speed. The videos are reviewed by both AI and a human and are either accepted or rejected. They're then annotated by AI and a team of hundreds of humans who label the actions in the footage.
Because this approach to training robots is in its infancy, it's not clear yet what makes good training data. Still, "you need to give lots and lots of variations for the robot to generalize well for basic navigation and manipulation of the world," says Ansari.
But many workers say that creating a variety of "chore content" in their tiny homes is a challenge. Zeus, a scrappy student living in a humble studio, struggles to record anything beyond ironing his clothes every day. Arjun, a tutor in Delhi, India, takes an hour to make a 15-minute video because he spends so much time brainstorming new chores.
"How much content [can be made] in the home? How much content?" he says.
There's also the sticky question of privacy. Micro1 asks workers not to show their faces to the camera or reveal personal information such as names, phone numbers, and birth dates. Then it uses AI and human reviewers to remove anything that slips through.
But even without faces, the videos capture an intimate slice of workers' lives: the interiors of their homes, their possessions, their routines. And understanding what kind of personal information they might be recording while they're busy doing chores on camera can be tricky. Reviews of such footage might not filter out sensitive information beyond the most obvious identifiers.
For workers with families, keeping private life off camera is a constant negotiation. Arjun, a father of two daughters, has to wrangle his chaotic two-year-old out of frame. "Sometimes it's very difficult to work because my daughter is small," he says.
Sasha, a banker turned data recorder in Nigeria, tiptoes around when she hangs her laundry outside in a shared residential compound so she won't record her neighbors, who watch her in bewilderment.
While the workers interviewed by MIT Technology Review understand that their data is being used to train robots, none of them know exactly how their data will be used, stored, and shared with third parties, including the robotics companies that Micro1 is selling the data to. For confidentiality reasons, says Ansari, Micro1 doesn't name its clients or disclose to workers the specific nature of the projects they are contributing to.
"It is important that if workers are engaging in this, that they are informed by the companies themselves of the intention ... where this kind of technology might go and how that might affect them longer term," says Yasmine Kotturi, a professor of human-centered computing at the University of Maryland.
Occasionally, some workers say, they've seen other workers asking on the company Slack channel if the company could delete their data. Micro1 declined to comment on whether such data is deleted.
"People are opting into doing this," says Ansari. "They could stop the work at any time."
Hungry for data
With thousands of workers doing their chores differently in different homes, some roboticists wonder if the data collected from them is reliable enough to train robots safely.
"How we conduct our lives in our homes is not always right from a safety point of view," says Aaron Prather, a roboticist at ASTM International. "If those folks are teaching those bad habits that could lead to an incident, then that's not good data." And the sheer volume of data being collected makes reviewing it for quality control challenging. But Ansari says the company rejects videos showing unsafe ways of performing a task, while clumsy movements can be useful to teach robots what not to do.
Then thereâs the question of how much of this data we need. Micro1 says it has tens of thousands of hours of footage, while Scale AI announced it had gathered more than 100,000 hours.
"It's going to take a long time to get there," says Ken Goldberg, a roboticist at the University of California, Berkeley. Large language models were trained on text and images that would take a human 100,000 years to read, and humanoid robots may need even more data, because controlling robotic joints is even more complicated than generating text. "It's going to take longer than people think," he says.
When Dattu, an engineering student living in a bustling tech hub in India, comes home after a full day of classes at his university, he skips dinner and dashes to his tiny balcony, cramped with potted plants and dumbbells. He straps his iPhone to his forehead and records himself folding the same set of clothes over and over again.
His family stares at him quizzically. "It's like some space technology for them," he says. When he tells his friends about his job, "they just get astounded by the idea that they can get paid by recording chores."
Juggling his university studies with data recording, as well as other data annotation gigs, takes a toll on him. Still, "it feels like you're doing something different than the whole world," he says.
-
Shifting to AI model customization is an architectural imperative MIT Technology Review Mar 31, 2026 02:12 PM 5 min read As LLM scaling hits diminishing returns, the next frontier of advantage is the institutionalization of proprietary logic.
In the early days of large language models (LLMs), we grew accustomed to massive 10x jumps in reasoning and coding capability with every new model iteration. Today, those jumps have flattened into incremental gains. The exception is domain-specialized intelligence, where true step-function improvements are still the norm.
When a model is fused with an organization's proprietary data and internal logic, it encodes the company's history into its future workflows. This alignment creates a compounding advantage: a competitive moat built on a model that understands the business intimately. This is more than fine-tuning; it is the institutionalization of expertise into an AI system. This is the power of customization.
Intelligence tuned to context
Every sector operates within its own specific lexicon. In automotive engineering, the "language" of the firm revolves around tolerance stacks, validation cycles, and revision control. In capital markets, reasoning is dictated by risk-weighted assets and liquidity buffers. In security operations, patterns are extracted from the noise of telemetry signals and identity anomalies.
Custom-adapted models internalize the nuances of the field. They recognize which variables dictate a "go/no-go" decision, and they think in the language of the industry.
Domain expertise in action
The transition from general-purpose to tailored AI centers on one goal: encoding an organization's unique logic directly into a model's weights.
Mistral AI partners with organizations to incorporate domain expertise into their training ecosystems. A few use cases illustrate customized implementations in practice:
Software engineering and assisting at scale: A network hardware company with proprietary languages and specialized codebases found that out-of-the-box models could not grasp their internal stack. By training a custom model on their own development patterns, they achieved a step function in fluency. Integrated into Mistral's software development scaffolding, this customized model now supports the entire lifecycle - from maintaining legacy systems to autonomous code modernization via reinforcement learning. This turns once-opaque, niche code into a space where AI reliably assists at scale.
Automotive and the engineering copilot: A leading automotive company uses customization to revolutionize crash test simulations. Previously, specialists spent entire days manually comparing digital simulations with physical results to find divergences. By training a model on proprietary simulation data and internal analyses, they automated this visual inspection, flagging deformations in real time. Moving beyond detection, the model now acts as a copilot, proposing design adjustments to bring simulations closer to real-world behavior and radically accelerating the R&D loop.
Public sector and sovereign AI: In Southeast Asia, a government agency is building a sovereign AI layer to move beyond Western-centric models. By commissioning a foundation model tailored to regional languages, local idioms, and cultural contexts, they created a strategic infrastructure asset. This ensures sensitive data remains under local governance while powering inclusive citizen services and regulatory assistants. Here, customization is the key to deploying AI that is both technically effective and genuinely sovereign.
The blueprint for strategic customization
Moving from a general-purpose AI strategy to a domain-specific advantage requires a structural rethinking of the model's role within the enterprise. Success is defined by three shifts in organizational logic.
1. Treat AI as infrastructure, not an experiment. Historically, enterprises have treated model customization as an ad hoc experiment - a single fine-tuning run for a niche use case or a localized pilot. While these bespoke silos often yield promising results, they are rarely built to scale. They produce brittle pipelines, improvised governance, and limited portability. When the underlying base models evolve, the adaptation work must often be discarded and rebuilt from scratch.
In contrast, a durable strategy treats customization as foundational infrastructure. In this model, adaptation workflows are reproducible, version-controlled, and engineered for production. Success is measured against deterministic business outcomes. By decoupling the customization logic from the underlying model, firms ensure that their "digital nervous system" remains resilient, even as the frontier of base models shifts.
2. Retain control of your own data and models. As AI migrates from the periphery to core operations, the question of control becomes existential. Reliance on a single cloud provider or vendor for model alignment creates a dangerous asymmetry of power regarding data residency, pricing, and architectural updates.
Enterprises that retain control of their training pipelines and deployment environments preserve their strategic agency. By adapting models within controlled environments, organizations can enforce their own data residency requirements and dictate their own update cycles. This approach transforms AI from a service consumed into an asset governed, reducing structural dependency and allowing for cost and energy optimizations aligned with internal priorities rather than vendor roadmaps.
3. Design for continuous adaptation. The enterprise environment is never static: regulations shift, taxonomies evolve, and market conditions fluctuate. A common failure is treating a customized model as a finished artifact. In reality, a domain-aligned model is a living asset subject to model decay if left unmanaged.
Designing for continuous adaptation requires a disciplined approach to ModelOps. This includes automated drift detection, event-driven retraining, and incremental updates. By building the capacity for constant recalibration, the organization ensures that its AI does not just reflect its history but evolves in lockstep with its future. This is the stage where the competitive moat begins to compound: the model's utility grows as it internalizes the organization's ongoing response to change.
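As a rough illustration of the drift-detection piece of that loop, here is a minimal Python sketch. The threshold, function names, and retraining hook are assumptions for illustration, not Mistral's or any vendor's actual tooling; a real pipeline would enqueue a version-controlled fine-tuning job rather than print.

    import numpy as np
    from scipy.stats import ks_2samp

    # Hypothetical significance threshold for declaring input drift.
    DRIFT_P_VALUE = 0.01

    def detect_drift(reference: np.ndarray, live: np.ndarray) -> bool:
        """Compare training-time feature values against live traffic using a
        two-sample Kolmogorov-Smirnov test (one simple, common choice)."""
        result = ks_2samp(reference, live)
        return result.pvalue < DRIFT_P_VALUE

    def queue_incremental_update(live: np.ndarray) -> None:
        # Stub standing in for a real, version-controlled retraining job.
        print(f"Queued incremental update on {len(live)} new samples")

    def maybe_retrain(reference: np.ndarray, live: np.ndarray) -> None:
        """Event-driven retraining: run the expensive adaptation step only
        when monitored inputs have drifted from the data the model was tuned on."""
        if detect_drift(reference, live):
            queue_incremental_update(live)

The point is the shape of the loop - monitor, detect, retrain incrementally - rather than the specific statistical test.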
Control is the new leverage
We have entered an era where generic intelligence is a commodity, but contextual intelligence is a scarcity. While raw model power is now a baseline requirement, the true differentiator is alignment: AI calibrated to an organization's unique data, mandates, and decision logic.
In the next decade, the most valuable AI won't be the one that knows everything about the world; it will be the one that knows everything about you. The firms that own the model weights of that intelligence will own the market.
This content was produced by Mistral AI. It was not written by MIT Technology Review's editorial staff.
-
How did Anthropic measure AI's "theoretical capabilities" in the job market? Ars Technica AI Mar 31, 2026 02:01 PM 1 min read 2023 study made a lot of assumptions about future "anticipated LLM-powered software."
If you follow the ongoing debate over AI's growing economic impact, you may have seen the graphic below floating around this month. It comes from an Anthropic report on the labor market impacts of AI and is meant to compare the current "observed exposure" of occupations to LLMs (in red) to the "theoretical capability" of those same LLMs (in blue) across 22 job categories.
While the current "observed exposure" area is interesting in its own right, it's the blue "theoretical capability" that jumps out. At a glance, the graph implies that LLM-based systems could perform at least 80 percent of the individual "job tasks" across a shockingly wide range of human occupations, at least theoretically. It looks like Anthropic is predicting that LLMs will eventually be able to do the vast majority of jobs in broad categories ranging from "Arts & Media" and "Office & Admin" to "Legal, Business & Finance," and even "Management."
That "theoretical AI coverage" area seems like it's destined to eat a huge swath of the US job market!
Credit: Anthropic
Digging into the basis for those "theoretical capability" numbers, though, provides a much less chilling image of AI's future occupational impacts. When you drill down into the specifics, that blue field represents some outdated and heavily speculative educated guesses about where AI is likely to improve human productivity and not necessarily where it will take over for humans altogether.
-
AI benchmarks are broken. Here's what we need instead. MIT Technology Review Mar 31, 2026 12:01 PM 7 min read One-off tests don't measure AI's true impact. We're better off shifting to more human-centered, context-specific methods.
For decades, artificial intelligence has been evaluated through the question of whether machines outperform humans. From chess to advanced math, from coding to essay writing, the performance of AI models and applications is tested against that of individual humans completing tasks.
This framing is seductive: An AI vs. human comparison on isolated problems with clear right or wrong answers is easy to standardize, compare, and optimize. It generates rankings and headlines.
But there's a problem: AI is almost never used in the way it is benchmarked. Although researchers and industry have started to improve benchmarking by moving beyond static tests to more dynamic evaluation methods, these innovations resolve only part of the issue. That's because they still evaluate AI's performance outside the human teams and organizational workflows where its real-world performance ultimately unfolds.
While AI is evaluated at the task level in a vacuum, it is used in messy, complex environments where it usually interacts with more than one person. Its performance (or lack thereof) emerges only over extended periods of use. This misalignment leaves us misunderstanding AIâs capabilities, overlooking systemic risks, and misjudging its economic and social consequences.
To mitigate this, it's time to shift from narrow methods to benchmarks that assess how AI systems perform over longer time horizons within human teams, workflows, and organizations. I have studied real-world AI deployment since 2022 in small businesses and in health, humanitarian, nonprofit, and higher-education organizations in the UK, the United States, and Asia, as well as within leading AI design ecosystems in London and Silicon Valley. I propose a different approach, which I call HAIC benchmarks: Human-AI, Context-Specific Evaluation.
What happens when AI fails
For governments and businesses, AI benchmark scores appear more objective than vendor claims. They're a critical part of determining whether an AI model or application is "good enough" for real-world deployment. Imagine an AI model that achieves impressive technical scores on the most cutting-edge benchmarks: 98% accuracy, groundbreaking speed, compelling outputs. On the strength of these results, organizations may decide to adopt the model, committing sizable financial and technical resources to purchasing and integrating it.
But then, once it's adopted, the gap between benchmark and real-world performance quickly becomes visible. For example, take the swathe of FDA-approved AI models that can read medical scans faster and more accurately than an expert radiologist. In the radiology units of hospitals from the heart of California to the outskirts of London, I witnessed staff using highly ranked radiology AI applications. Repeatedly, it took them extra time to interpret the AI's outputs alongside hospital-specific reporting standards and nation-specific regulatory requirements. What appeared to be a productivity-enhancing AI tool when tested in a vacuum introduced delays in practice.
It soon became clear that the benchmark tests on which medical AI models are assessed do not capture how medical decisions are actually made. Hospitals rely on multidisciplinary teams - radiologists, oncologists, physicists, nurses - who jointly review patients. Treatment planning rarely hinges on a static decision; it evolves as new information emerges over days or weeks. Decisions often arise through constructive debate and trade-offs between professional standards, patient preferences, and the shared goal of long-term patient well-being. No wonder even highly scored AI models struggle to deliver the promised performance once they encounter the complex, collaborative processes of real clinical care.
The same pattern emerges in my research across other sectors: When embedded within real-world work environments, even AI models that perform brilliantly on standardized tests don't perform as promised.
When high benchmark scores fail to translate into real-world performance, even the most highly scored AI is soon abandoned to what I call the "AI graveyard." The costs are significant: Time, effort, and money end up being wasted. And over time, repeated experiences like this erode organizational confidence in AI and, in critical settings such as health, may erode broader public trust in the technology as well.
When current benchmarks provide only a partial and potentially misleading signal of an AI model's readiness for real-world use, this creates regulatory blind spots: Oversight is shaped by metrics that do not reflect reality. It also leaves organizations and governments to shoulder the risks of testing AI in sensitive real-world settings, often with limited resources and support.
How to build better tests
To close the gap between benchmark and real-world performance, we must pay attention to the actual conditions in which AI models will be used. The critical questions: Can AI function as a productive participant within human teams? And can it generate sustained, collective value?
Through my research on AI deployment across multiple sectors, I have seen a number of organizations already moving - deliberately and experimentally - toward the HAIC benchmarks I favor.
HAIC benchmarks reframe current benchmarking in four ways:
1. From individual and single-task performance to team and workflow performance (shifting the unit of analysis)
2. From one-off testing with right/wrong answers to long-term impacts (expanding the time horizon)
3. From correctness and speed to organizational outcomes, coordination quality, and error detectability (expanding outcome measures)
4. From isolated outputs to upstream and downstream consequences (system effects)
Across the organizations where this approach has emerged and started to be applied, the first step is shifting the unit of analysis.
For example, in one UK hospital system in the period 2021-2024, the question expanded from whether a medical AI application improves diagnostic accuracy to how the presence of AI within the hospital's multidisciplinary teams affects not only accuracy but also coordination and deliberation. The hospital specifically assessed coordination and deliberation in human teams using and not using AI. Multiple stakeholders (within and outside the hospital) decided on metrics like how AI influences collective reasoning, whether it surfaces overlooked considerations, whether it strengthens or weakens coordination, and whether it changes established risk and compliance practices.
This shift is fundamental. It matters a lot in high-stakes contexts where system-level effects matter more than task-level accuracy. It also matters for the economy. It may help recalibrate inflated expectations of sweeping productivity gains that are so far predicated largely on the promise of improving individual task performance.Â
Once that foundation is set, HAIC benchmarking can begin to take on the element of time.
Today's benchmarks resemble school exams: one-off, standardized tests of accuracy. But real professional competence is assessed differently. Junior doctors and lawyers are evaluated continuously inside real workflows, under supervision, with feedback loops and accountability structures. Performance is judged over time and in a specific context, because competence is relational. If AI systems are meant to operate alongside professionals, their impact should be judged longitudinally, reflecting how performance unfolds over repeated interactions.
I saw this aspect of HAIC applied in one of my humanitarian-sector case studies. Over 18 months, an AI system was evaluated within real workflows, with particular attention to how detectable its errors were - that is, how easily human teams could identify and correct them. This long-term "record of error detectability" meant the organizations involved could design and test context-specific guardrails to promote trust in the system, despite the inevitability of occasional AI mistakes.
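To make "error detectability" concrete, here is one hypothetical way such a record could be summarized in Python. This is an illustrative sketch with an invented schema (Incident, detectability_rate), not the methodology used in the case study:

    from dataclasses import dataclass

    @dataclass
    class Incident:
        """One logged AI error from the deployment period (hypothetical schema)."""
        caught_by_team: bool   # did a human reviewer spot and correct it?
        days_to_detect: float  # time from AI output to correction, if caught

    def detectability_rate(incidents: list[Incident]) -> float:
        """Share of logged AI errors that human teams identified and corrected."""
        if not incidents:
            return 1.0  # no recorded errors to miss
        return sum(i.caught_by_team for i in incidents) / len(incidents)

    # Toy example: 3 of 4 logged errors were caught, so the rate is 0.75.
    log = [Incident(True, 0.5), Incident(True, 2.0),
           Incident(False, 0.0), Incident(True, 6.5)]
    print(detectability_rate(log))  # 0.75

Tracked over months rather than in a one-off test, a measure like this is what lets teams calibrate guardrails to the errors that actually slip through.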
A longer time horizon also makes visible the system-level consequences that short-term benchmarks miss. An AI application may outperform a single doctor on a narrow diagnostic task yet fail to improve multidisciplinary decision-making. Worse, it may introduce systemic distortions: anchoring teams too early in plausible but incomplete answers, adding to people's cognitive workloads, or generating downstream inefficiencies that offset any speed or efficiency gains at the point of the AI's use. These knock-on effects, often invisible to current benchmarks, are central to understanding real impact.
The HAIC approach, admittedly, promises to make benchmarking more complex, resource-intensive, and harder to standardize. But continuing to evaluate AI in sanitized conditions detached from the world of work will leave us misunderstanding what it truly can and cannot do for us. To deploy AI responsibly in real-world settings, we must measure what actually matters: not just what a model can do alone, but what it enables - or undermines - when humans and teams in the real world work with it.
 Angela Aristidou is a professor at University College London and a faculty fellow at the Stanford Digital Economy Lab and the Stanford Human-Centered AI Institute. She speaks, writes, and advises about the real-life deployment of artificial-intelligence tools for public good.
-
There are more AI health tools than ever, but how well do they work? MIT Technology Review Mar 30, 2026 04:00 PM 9 min read Specialized chatbots might make a difference for people with limited health-care access. Without more testing, we don't know if they'll help or harm.
Earlier this month, Microsoft launched Copilot Health, a new space within its Copilot app where users will be able to connect their medical records and ask specific questions about their health. A couple of days earlier, Amazon had announced that Health AI, an LLM-based tool previously restricted to members of its One Medical service, would now be widely available. These products join the ranks of ChatGPT Health, which OpenAI released back in January, and Anthropic's Claude, which can access user health records if granted permission. Health AI for the masses is officially a trend.
There's a clear demand for chatbots that provide health advice, given how hard it is for many people to access it through existing medical systems. And some research suggests that current LLMs are capable of making safe and useful recommendations. But researchers say that these tools should be more rigorously evaluated by independent experts, ideally before they are widely released.
In a high-stakes area like health, trusting companies to evaluate their own products could prove unwise, especially if those evaluations aren't made available for external expert review. And even if the companies are doing quality, rigorous research (which some, including OpenAI, do seem to be), they might still have blind spots that the broader research community could help to fill.
"To the extent that you always are going to need more health care, I think we should definitely be chasing every route that works," says Andrew Bean, a doctoral candidate at the Oxford Internet Institute. "It's entirely plausible to me that these models have reached a point where they're actually worth rolling out."
"But," he adds, "the evidence base really needs to be there."
Tipping points
To hear developers tell it, these health products are now being released because large language models have indeed reached a point where they can effectively provide medical advice. Dominic King, the vice president of health at Microsoft AI and a former surgeon, cites AI advancement as a core reason why the company's health team was formed, and why Copilot Health now exists. "We've seen this enormous progress in the capabilities of generative AI to be able to answer health questions and give good responses," he says.
But that's only half the story, according to King. The other key factor is demand. Shortly before Copilot Health was launched, Microsoft published a report, and an accompanying blog post, detailing how people used Copilot for health advice. The company says it receives 50 million health questions each day, and health is the most popular discussion topic on the Copilot mobile app.
"Even before our health products, we were seeing just a rapid, rapid increase in the rate of people using ChatGPT for health-related questions," says Karan Singhal, who leads OpenAI's Health AI team. (OpenAI and Microsoft have a long-standing partnership, and Copilot is powered by OpenAI's models.)
It's possible that people simply prefer posing their health problems to a nonjudgmental bot that's available to them 24-7. But many experts interpret this pattern in light of the current state of the health-care system. "There is a reason that these tools exist and they have a position in the overall landscape," says Girish Nadkarni, chief AI officer at the Mount Sinai Health System. "That's because access to health care is hard, and it's particularly hard for certain populations."
The virtuous vision of consumer-facing LLM health chatbots hinges on the possibility that they could improve user health while reducing pressure on the health-care system. That might involve helping users decide whether or not they need medical attention, a task known as triage. If chatbot triage works, then patients who need emergency care might seek it out earlier than they would have otherwise, and patients with more mild concerns might feel comfortable managing their symptoms at home with the chatbot's advice rather than unnecessarily busying emergency rooms and doctor's offices.
But a recent, widely discussed study from Nadkarni and other researchers at Mount Sinai found that ChatGPT Health sometimes recommends too much care for mild conditions and fails to identify emergencies. Though Singhal and some other experts have suggested that its methodology might not provide a complete picture of ChatGPT Health's capabilities, the study has surfaced concerns about how little external evaluation these tools see before being released to the public.
Most of the academic experts interviewed for this piece agreed that LLM health chatbots could have real upsides, given how little access to health care some people have. But all six of them expressed concerns that these tools are being launched without testing from independent researchers to assess whether they are safe. While some advertised uses of these tools, such as recommending exercise plans or suggesting questions that a user might ask a doctor, are relatively harmless, others carry clear risks. Triage is one; another is asking a chatbot to provide a diagnosis or a treatment plan.
The ChatGPT Health interface includes a prominent disclaimer stating that it is not intended for diagnosis or treatment, and the announcements for Copilot Health and Amazon's Health AI include similar warnings. But those warnings are easy to ignore. "We all know that people are going to use it for diagnosis and management," says Adam Rodman, an internal medicine physician and researcher at Beth Israel Deaconess Medical Center and a visiting researcher at Google.
Medical testing
Companies say they are testing the chatbots to ensure that they provide safe responses the vast majority of the time. OpenAI has designed and released HealthBench, a benchmark that scores LLMs on how they respond in realistic health-related conversations, though the conversations themselves are LLM-generated. When GPT-5, which powers both ChatGPT Health and Copilot Health, was released last year, OpenAI reported the model's HealthBench scores: It did substantially better than previous OpenAI models, though its overall performance was far from perfect.
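OpenAI has described HealthBench as grading each response against physician-written rubric criteria that carry point values. The toy sketch below illustrates only that general style of scoring; the criteria, weights, and met/unmet judgments are invented here (in the real benchmark, a grader model makes those judgments across thousands of criteria).

```python
# Toy illustration of rubric-style scoring: weighted criteria, with the
# final score as the fraction of available positive points earned.
# Every criterion below is invented for illustration.
criteria = [
    {"desc": "urges emergency care for red-flag symptoms", "points": 5, "met": True},
    {"desc": "asks how long symptoms have lasted",          "points": 3, "met": False},
    {"desc": "names a specific prescription drug dose",     "points": -4, "met": False},
]

earned = sum(c["points"] for c in criteria if c["met"])
available = sum(c["points"] for c in criteria if c["points"] > 0)
score = max(earned, 0) / available
print(f"rubric score: {score:.2f}")  # 5 of 8 available points -> 0.62
```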
But evaluations like HealthBench have limitations. In a study published last month, Bean, the Oxford doctoral candidate, and his colleagues found that even if an LLM can accurately identify a medical condition from a fictional written scenario on its own, a non-expert user who is given the scenario and asked to determine the condition with LLM assistance might figure it out only a third of the time. If they lack medical expertise, users might not know which parts of a scenario, or their real-life experience, are important to include in their prompt, or they might misinterpret the information that an LLM gives them.
Bean says that this performance gap could be significant for OpenAI's models. In the original HealthBench study, the company reported that its models performed relatively poorly in conversations that required them to seek more information from the user. If that's the case, then users who don't have enough medical knowledge to provide a health chatbot with the information that it needs from the get-go might get unhelpful or inaccurate advice.
Singhal, the OpenAI health lead, notes that the company's current GPT-5 series of models, which had not yet been released when the original HealthBench study was conducted, do a much better job of soliciting additional information than their predecessors. However, OpenAI has reported that GPT-5.4, the current flagship, is actually worse at seeking context than GPT-5.2, an earlier version.
Ideally, Bean says, health chatbots would be subjected to controlled tests with human users, as they were in his study, before being released to the public. That might be a heavy lift, particularly given how fast the AI world moves and how long human studies can take. Bean's own study used GPT-4o, which came out almost a year ago and is now outdated.
Earlier this month, Google released a study that meets Bean's standards. In the study, patients discussed medical concerns with the company's Articulate Medical Intelligence Explorer (AMIE), a medical LLM chatbot that is not yet available to the public, before meeting with a human physician. Overall, AMIE's diagnoses were just as accurate as physicians', and none of the conversations raised major safety concerns for researchers.
Despite the encouraging results, Google isn't planning to release AMIE anytime soon. "While the research has advanced, there are significant limitations that must be addressed before real-world translation of systems for diagnosis and treatment, including further research into equity, fairness, and safety testing," wrote Alan Karthikesalingam, a research scientist at Google DeepMind, in an email. Google did recently reveal that Health100, a health platform it is building in partnership with CVS, will include an AI assistant powered by its flagship Gemini models, though that tool will presumably not be intended for diagnosis or treatment.
Rodman, who led the AMIE study with Karthikesalingam, doesn't think such extensive, multiyear studies are necessarily the right approach for chatbots like ChatGPT Health and Copilot Health. "There's lots of reasons that the clinical trial paradigm doesn't always work in generative AI," he says. "And that's where this benchmarking conversation comes in. Are there benchmarks [from] a trusted third party that we can agree are meaningful, that the labs can hold themselves to?"
The key there is "third party." No matter how extensively companies evaluate their own products, it's tough to trust their conclusions completely. Not only does a third-party evaluation bring impartiality, but if there are many third parties involved, it also helps protect against blind spots.
OpenAI's Singhal says he's strongly in favor of external evaluation. "We try our best to support the community," he says. "Part of why we put out HealthBench was actually to give the community and other model developers an example of what a very good evaluation looks like."
Given how expensive it is to produce a high-quality evaluation, he says, he's skeptical that any individual academic laboratory would be able to produce what he calls "the one evaluation to rule them all." But he does speak highly of efforts that academic groups have made to bring preexisting and novel evaluations together into comprehensive evaluation suites, such as Stanford's MedHELM framework, which tests models on a wide variety of medical tasks. Currently, OpenAI's GPT-5 holds the highest MedHELM score.
Nigam Shah, a professor of medicine at Stanford University who led the MedHELM project, says it has limitations. In particular, it only evaluates individual chatbot responses, but someone who's seeking medical advice from a chatbot tool might engage it in a multi-turn, back-and-forth conversation. He says that he and some collaborators are gearing up to build an evaluation that can score those complex conversations, but that it will take time, and money. "You and I have zero ability to stop these companies from releasing [health-oriented products], so they're going to do whatever they damn please," he says. "The only thing people like us can do is find a way to fund the benchmark."
No one interviewed for this article argued that health LLMs need to perform perfectly on third-party evaluations in order to be released. Doctors themselves make mistakes, and for someone who has only occasional access to a doctor, a consistently accessible LLM that sometimes messes up could still be a huge improvement over the status quo, as long as its errors aren't too grave.
With the current state of the evidence, however, it's impossible to know for sure whether the currently available tools do in fact constitute an improvement, or whether their risks outweigh their benefits.
-
The Pentagon's culture war tactic against Anthropic has backfired MIT Technology Review Mar 30, 2026 03:42 PM 5 min read Decisions to tweet first and lawyer later didn't sit well with a federal judge, who last week halted the government's punishment of the AI company.
This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.
Last Thursday, a California judge temporarily blocked the Pentagon from labeling Anthropic a supply chain risk and ordering government agencies to stop using its AI. It's the latest development in the month-long feud. And the matter still isn't settled: The government was given seven days to appeal, and Anthropic has a second case against the designation that has yet to be decided. Until then, the company remains persona non grata with the government.
The stakes in the case, namely how much the government can punish a company for not playing ball, were apparent from the start. Anthropic drew many senior supporters, unlikely bedfellows among them, including former authors of President Trump's AI policy.
But Judge Rita Lin's 43-page opinion suggests that what is really a contract dispute never needed to reach such a frenzy. It did so because the government disregarded the existing process for how such disputes are governed and fueled the fire with social media posts from officials that would eventually contradict the positions it took in court. The Pentagon, in other words, wanted a culture war (on top of the actual war in Iran that began hours later).
The government used Anthropic's Claude for much of 2025 without complaint, according to court documents, while the company walked a branding tightrope as a safety-focused AI company that also won defense contracts. Defense employees accessing it through Palantir were required to accept terms of a government-specific usage policy that Anthropic cofounder Jared Kaplan said "prohibited mass surveillance of Americans and lethal autonomous warfare" (Kaplan's declaration to the court didn't include details of the policy). Only when the government aimed to contract with Anthropic directly did the disagreements begin.
What drew the ire of the judge is that when these disagreements became public, they had more to do with punishment than just cutting ties with Anthropic. And they had a pattern: Tweet first, lawyer later.
President Trump's post on Truth Social on February 27 referenced "Leftwing nutjobs" at Anthropic and directed every federal agency to stop using the company's AI. This was echoed soon after by Defense Secretary Pete Hegseth, who said he'd direct the Pentagon to label Anthropic a supply chain risk.
Doing so necessitates that the secretary take a specific set of actions, which the judge found Hegseth did not complete. Letters sent to congressional committees, for example, said that less drastic steps were evaluated and deemed not possible, without providing any further details. The government also said the designation as a supply chain risk was necessary because Anthropic could implement a "kill switch," but its lawyers later had to admit it had no evidence of that, the judge wrote.
Hegseth's post also stated that "No contractor, supplier, or partner that does business with the United States military may conduct any commercial activity with Anthropic." But the government's own lawyers admitted on Tuesday that the Secretary doesn't have the power to do that, and agreed with the judge that the statement had "absolutely no legal effect at all."
The aggressive posts also led the judge to conclude that Anthropic was on solid ground in complaining that its First Amendment rights were violated. The government, the judge wrote while citing the posts, "set out to publicly punish Anthropic for its 'ideology' and 'rhetoric,' as well as its 'arrogance' for being unwilling to compromise those beliefs."
Labeling Anthropic a supply chain risk would essentially be identifying it as a "saboteur" of the government, for which the judge did not see sufficient evidence. She issued an order last Thursday halting the designation, preventing the Pentagon from enforcing it and forbidding the government from fulfilling the promises made by Hegseth and Trump. Dean Ball, who worked on AI policy for the Trump administration but wrote a brief supporting Anthropic, described the judge's order on Thursday as "a devastating ruling for the government, finding Anthropic likely to prevail on essentially all of its theories for why the government's actions were unlawful and unconstitutional."
The government is expected to appeal the decision. But Anthropic's separate case, filed in DC, makes similar allegations. It just references a different segment of the law governing supply chain risks.
The court documents paint a pretty clear pattern. Public statements made by officials and the President did not at all align with what the law says should happen in a contract dispute like this, and the government's lawyers have repeatedly had to construct after-the-fact justifications for officials' social media attacks on the company.
Pentagon and White House leadership knew that pursuing the nuclear option would spark a court battle; Anthropic vowed on February 27 to fight the supply chain risk designation days before the government formally filed it on March 3. Pursuing it anyway meant senior leadership was, to say the least, distracted during the first five days of the Iran war, launching strikes while also compiling evidence that Anthropic was a saboteur, all while it could have cut ties with the company by simpler means.
But even if Anthropic ultimately wins, the government has other means to shut the company out of government work. Defense contractors who want to stay on good terms with the Pentagon, for example, now have little reason to work with Anthropic even if it's not flagged as a supply chain risk.
"I think it's safe to say that there are mechanisms the government can use to apply some degree of pressure without breaking the law," says Charlie Bullock, a senior research fellow at the Institute for Law and AI. "It kind of depends how invested the government is in punishing Anthropic."
From the evidence thus far, the administration is committing top-level time and attention to winning an AI culture war. At the same time, Claude is apparently so important to its operations that even President Trump said the Pentagon needed six months to stop using it. The White House demands political loyalty and ideological alignment from top AI companies, but the case against Anthropic, at least for now, exposes the limits of its leverage.
If you have information about the military's use of AI, you can share it securely via Signal (username jamesodonnell.22).
-
With new plugins feature, OpenAI officially takes Codex beyond coding Ars Technica AI Mar 27, 2026 09:53 PM 1 min read Things are moving fast, and competitors have offered something similar for a while.
OpenAI has added plugin support to its agentic coding app Codex in an apparent attempt to match similar features offered by competitors Anthropic (in Claude Code) and Google (in Gemini's command line interface).
What OpenAI calls "plugins" are actually bundles that may include skills ("prompts that describe workflows to Codex", a standard feature in such tools these days), app integrations, and MCP (Model Context Protocol) servers.
The idea is that they let users configure Codex for specific tasks in ways that are easier to set up and replicable across multiple users in an organization.
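To make the bundle idea concrete, here is a purely hypothetical plugin sketched as a Python dict. OpenAI has not published a manifest format in this report, so every field name and value below is invented; the point is only how the three kinds of pieces might sit together.

```python
# Hypothetical illustration only: not OpenAI's actual plugin format.
release_helper_plugin = {
    "name": "release-helper",  # invented plugin name
    "skills": [
        # a skill is a prompt describing a workflow to Codex
        {
            "name": "cut-release",
            "prompt": "Bump the version, update the changelog, draft release notes.",
        },
    ],
    "apps": ["github"],  # an app integration the workflow relies on
    "mcp_servers": [
        # a Model Context Protocol server the agent can query for live context
        {"name": "issue-tracker", "url": "https://mcp.example.internal"},
    ],
}
```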
-
Hegseth, Trump had no authority to order Anthropic to be blacklisted, judge says Ars Technica AI Mar 27, 2026 07:49 PM 1 min read "I don't know": Department of War fails to justify blacklisting Anthropic.
"Classic First Amendment retaliation." That's how US District Judge Rita Lin described the Department of War's effort to blacklist Anthropic and designate it a supply-chain risk.
By all appearances, "these measures appear designed to punish Anthropic," Lin wrote in an order granting Anthropic's request for a preliminary injunction.
Officials seemingly had no authority to take such extreme actions without considering less restrictive alternatives or offering any evidence that Anthropic posed an urgent risk to national security, Lin said. Instead, "the Department of War's records show that it designated Anthropic as a supply chain risk because of its 'hostile manner through the press.'"
-
This startup wants to change how mathematicians do math MIT Technology Review Mar 25, 2026 01:59 PM 5 min read Axiom Math is giving away a powerful new AI tool. But it remains to be seen if it speeds up research as much as the company hopes.
Axiom Math, a startup based in Palo Alto, California, has released a free new AI tool for mathematicians, designed to discover mathematical patterns that could unlock solutions to long-standing problems.
The tool, called Axplorer, is a redesign of an existing one called PatternBoost that François Charton, now a research scientist at Axiom, co-developed in 2024 when he was at Meta. PatternBoost ran on a supercomputer; Axplorer runs on a Mac Pro.
The aim is to put the power of PatternBoost, which was used to crack a hard math puzzle known as the Turán four-cycles problem, in the hands of anyone who can install Axplorer on their own computer.
Last year, the US Defense Advanced Research Projects Agency set up a new initiative called expMath (short for Exponentiating Mathematics) to encourage mathematicians to develop and use AI tools. Axiom sees itself as part of that drive.
Breakthroughs in math have enormous knock-on effects across technology, says Charton. In particular, new math is crucial for advances in computer science, from building next-generation AI to improving internet security.
Most of the successes with AI tools have involved finding solutions to existing problems. But finding solutions is not all that mathematicians do, says Axiom Math founder and CEO Carina Hong. Math is exploratory and experimental, she says.
MIT Technology Review met with Charton and Hong last week for an exclusive video chat about their new tool and how AI in general could change mathematics.
Math by chatbot
In the last few months, a number of mathematicians have used LLMs, such as OpenAI's GPT-5, to find solutions to unsolved problems, especially ones set by the 20th-century mathematician Paul Erdős, who left behind hundreds of puzzles when he died.
But Charton is dismissive of those successes. "There are tons of problems that are open because nobody looked at them, and it's easy to find a few gems you can solve," he says. He's set his sights on tougher challenges: "the big problems that have been very, very well studied and famous people have worked on them."
The Turán four-cycles problem that PatternBoost cracked is one such problem, says Charton. (The problem is an important one in graph theory, a branch of math that's used to analyze complex networks such as social media connections, supply chains, and search engine rankings. Imagine a page covered in dots. The puzzle involves figuring out how to draw lines between as many of the dots as possible without creating loops that connect four dots in a row.) Axiom Math says it has used Axplorer to match or improve on the best-known results for two other big problems in graph theory as well.
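To pin down the constraint in that parenthetical: a graph contains such a loop (a four-cycle) exactly when two vertices share at least two common neighbors. A brute-force check, far too slow for real research but faithful to the definition, might look like this (the toy graphs are ours, not Axiom's):

```python
from itertools import combinations

def has_four_cycle(adj: dict) -> bool:
    """adj maps each vertex to its set of neighbors.

    A four-cycle u-x-v-y-u exists exactly when some pair of vertices
    u, v share at least two common neighbors x and y.
    """
    return any(len(adj[u] & adj[v]) >= 2 for u, v in combinations(adj, 2))

square = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}  # a single 4-cycle
path   = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}        # no cycles at all
print(has_four_cycle(square), has_four_cycle(path))    # True False
```

The extremal question, how many edges can be packed in before this check must return True, is what makes the problem hard.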
"LLMs are extremely good if what you want to do is derivative of something that has already been done," says Charton. "This is not surprising: LLMs are pretrained on all the data that there is. But you could say that LLMs are conservative. They try to reuse things that exist."
However, there are lots of problems in math that require new ideas, insights that nobody has ever had. Sometimes those insights come from spotting patterns that hadn't been spotted before. Such discoveries can open up whole new branches of mathematics.
PatternBoost was designed to help mathematicians find new patterns. Give the tool an example and it generates others like it. You select the ones that seem interesting and feed them back in. The tool then generates more like those, and so on.
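That generate-select-retrain loop is simple enough to sketch. The following is a schematic reading of the description above, not PatternBoost's or Axplorer's actual code; `model`, `score`, and every parameter are hypothetical stand-ins (in practice the mathematician, not a scoring function, does the selecting):

```python
def pattern_search_loop(model, seed_examples, score, rounds=10, keep=100):
    """Schematic generate-select-retrain loop (illustrative only).

    model: anything with .train(examples) and .generate(n) methods
    score: a stand-in for the mathematician's judgment of "interesting"
    """
    pool = list(seed_examples)
    for _ in range(rounds):
        model.train(pool)                    # learn from the kept examples
        candidates = model.generate(n=1000)  # propose new constructions
        # keep the most interesting constructions, old or new, and repeat
        pool = sorted(candidates + pool, key=score, reverse=True)[:keep]
    return pool
```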
It's a similar idea to Google DeepMind's AlphaEvolve, a system that uses an LLM to come up with novel solutions to a problem. AlphaEvolve keeps the best suggestions and asks the LLM to improve on them.
Special access
Researchers have already used both AlphaEvolve and PatternBoost to discover new solutions to long-standing math problems. The trouble is that those tools run on large clusters of GPUs and are not available to most mathematicians.
Mathematicians are excited about AlphaEvolve, says Charton. "But it's closed: you need to have access to it. You have to go and ask the DeepMind guy to type in your problem for you."
And when Charton solved the Turán problem with PatternBoost, he was still at Meta. "I had literally thousands, sometimes tens of thousands, of machines I could run it on," he says. "It ran for three weeks. It was embarrassing brute force."
Axplorer is far faster and far more efficient, according to the team at Axiom Math. Charton says it took Axplorer just 2.5 hours to match PatternBoost's Turán result. And it runs on a single machine.
Geordie Williamson, a mathematician at the University of Sydney, who worked on PatternBoost with Charton, has not yet tried Axplorer. But he is curious to see what mathematicians do with it. (Williamson still occasionally collaborates with Charton on academic projects but says he is not otherwise connected to Axiom Math.)
Williamson says Axiom Math has made several improvements to PatternBoost that (in theory) make Axplorer applicable to a wider range of mathematical problems. "It remains to be seen how significant these improvements are," he says.
"We are in a strange time at the moment, where lots of companies have tools that they'd like us to use," Williamson adds. "I would say mathematicians are somewhat overwhelmed by the possibilities. It is unclear to me what impact having another such tool will be."
Hong admits that there are a lot of AI tools being pitched at mathematicians right now. Some also require mathematicians to train their own neural networks. That's a turnoff, says Hong, who is a mathematician herself. Instead, Axplorer will walk you through what you want to do step by step, she says.
The code for Axplorer is open source and available via GitHub. Hong hopes that students and researchers will use the tool to generate sample solutions and counterexamples to problems they're working on, speeding up mathematical discovery.
Williamson welcomes new tools and says he uses LLMs a lot. But he doesn't think mathematicians should throw out the whiteboards just yet. "In my biased opinion, PatternBoost is a lovely idea, but it is certainly not a panacea," he says. "I'd love us not to forget more down-to-earth approaches."
-
Agentic commerce runs on truth and context MIT Technology Review Mar 25, 2026 11:48 AM 6 min read Successful organizations will implement an architectural decision encoded in identity, context, and control.
Imagine telling a digital agent, "Use my points and book a family trip to Italy. Keep it within budget, pick hotels we've liked before, and handle the details." Instead of returning a list of links, the agent assembles an itinerary and executes the purchase.
That shift, from assistance to execution, is what makes agentic AI different. It also changes the operating speed of commerce. Payment transactions already clear in milliseconds. The new acceleration is everything before the payment: discovery, comparison, decisioning, authorization, and follow-through across many systems. As humans step out of routine decisions, "good enough" data stops being good enough. In an agent-driven economy, the constraint isn't speed; it's trust at machine speed and scale.
Automated markets already work because identity, authority, and accountability are built in. As agents transact across businesses, that same clarity is required. Master data management (MDM), the discipline of creating a single master record, becomes the exchange layer: tracking who an agent represents, what it can do, and where responsibility sits when value moves. Markets don't fail from automation; they fail from ambiguous ownership. MDM turns autonomous action into legitimate, scalable trust.
To make agentic commerce safe and scalable, organizations will need more than better models. They will need a modern data architecture and an authoritative system of context that can instantly recognize, resolve, and distinguish entities. It is the difference between automation that scales and automation that needs constant human correction.
The agent is a new participant
Digital commerce has long been built on two primary sides: buyers and suppliers/merchants. Agentic commerce adds a third participant that must be treated as a first-class entity: the agent acting on the buyer's behalf.
That sounds simple until you ask the questions every enterprise will face:
- Who is the individual, across channels and devices, with enough certainty for automation?
- Who is the agent, and what permissions and limits define what it can do?
- Who is the merchant or supplier, and are we sure we mean the right one?
- Who holds liability if the agent acts with permission, but against user intent?
The practical risk is confusion. Humans, for example, can infer that "Delta" means the airline when they are booking a flight, not the faucet company. An agent needs deterministic signals. If the system guesses wrong, it either breaks trust or forces a human confirmation step that defeats the promise of speed.
Why "good enough" data breaks at machine speed
Most organizations have learned to live with imperfect data. Duplicate customer records are tolerable. Incomplete product attributes are annoying. Merchant identities can be reconciled later.
Agentic workflows change that tolerance. When an agent takes action without a human checking the output, it needs data that is close to perfect, because it cannot reliably notice when data is ambiguous or wrong the way a person can.
The failure modes are predictable, and they show up in places that matter most:
- Product truth: If the catalog is inconsistent, an agent's choices will look arbitrary ("the wrong shirt," "the wrong size," "the wrong material"), and trust collapses quickly.
- Payee truth: Agentic commerce expands beyond cards to account-to-account and open-banking-connected experiences, broadening the universe of payees and the need to recognize them accurately in real time.
- Identity truth: People operate in multiple contexts (work versus personal). Devices shift. A system that cannot distinguish amongst these contexts will either block legitimate activity or approve risky activity, both of which damage adoption.
This is why unified enterprise data and entity resolution move from nice-to-have to operationally required. The more autonomy you want, the more you must invest in modern data foundations that ensure it is safe.
Context intelligence: The missing layer
When leaders talk about agentic AI, they often focus on model capability: planning, tool use, and reasoning. Those are necessary, but they are not sufficient.
Agentic commerce also requires a layer that provides authoritative context at runtime. Think of it as a real-time system of context that can answer instantly and consistently:
- Is this the right person?
- Is this the right agent, acting within the right permissions?
- Is this the right merchant or payee?
- What constraints apply right now (budget, policy, risk, loyalty rules, preferred suppliers)?
Two design principles matter.
First, entity truth must be deterministic enough for automation. Large language models are probabilistic by nature. That is helpful for creating options for writing and drawing. It is risky for deciding where money goes, especially in B2B and finance workflows, where "probably correct" is not acceptable.
Second, context must travel at the speed of interaction and remain portable across the entire connected network value chain. Mastercard's experience optimizing payment flows is instructive: the more services you layer onto a transaction, the more you risk slowing it down. The pattern that scales pre-resolves, curates, and packages the signal so that execution is lightweight.
This is also where tokenization is heading. Initiatives like Mastercard's Agent Pay and Verifiable Intent signal a future in which consumer credentials, agent identities, permissions, and provable user intent are encoded as cryptographically secure artifacts, enabling merchants, issuers, and platforms to deterministically verify authorization and execution at machine speed.
What leaders should do in the next 12 to 24 months
Adoption will not be uniform. Early traction will often depend less on industry and more on the sophistication of an organizationâs systems and data discipline.
That makes the next two years a window for practical preparation. Five moves stand out.
- Treat agents as governed identities, not features. Define how agents are onboarded, authenticated, permissioned, monitored, and retired.
- Prioritize entity resolution where the cost of being wrong is highest. Start with payees, suppliers, employee-versus-personal identity, and high-volume product categories.
- Build a reusable context service that every workflow and agent can call (a minimal sketch follows this list). Do not force each system to reconstruct identity and relationships from scratch.
- Precompute and compress signals. Resolve and curate context upstream so that runtime decisioning stays fast and predictable.
- Expand autonomy only as trust is earned. Build a governance framework to address disputes, keep humans in the loop for higher-risk actions, measure accuracy, and expand automation as outcomes prove reliable.
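Here is that minimal context-service sketch, reusing the Delta example from earlier. Every entity, grant, and threshold is invented for illustration; a real system would resolve entities against an MDM layer rather than hard-coded tables.

```python
# Illustrative only: a single resolver every agent calls before acting.
KNOWN_PAYEES = {"delta air lines": "payee-001", "delta faucet company": "payee-002"}
AGENT_GRANTS = {"trip-planner": {"max_spend": 5000, "allowed": {"payee-001"}}}

def authorize(agent_id: str, payee_name: str, amount: float) -> dict:
    """Resolve the payee deterministically, then check the agent's grants."""
    payee = KNOWN_PAYEES.get(payee_name.lower())
    if payee is None:
        return {"ok": False, "reason": "unresolved payee: route to human review"}
    grant = AGENT_GRANTS.get(agent_id)
    if grant is None or payee not in grant["allowed"]:
        return {"ok": False, "reason": "agent not permitted for this payee"}
    if amount > grant["max_spend"]:
        return {"ok": False, "reason": "over budget"}
    return {"ok": True, "payee": payee}

print(authorize("trip-planner", "Delta Air Lines", 1200.0))
# {'ok': True, 'payee': 'payee-001'}
```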
A tsunami effect across industries
Agentic AI will not be confined to shopping carts. It will touch procurement, travel, claims, customer service, and finance operations. It will compress decision cycles and remove manual steps, but only for organizations that can supply agents with clean identity, precise entity truth, and reliable context.
The winners will treat entity truth and context as core infrastructure for automation, not as a back-office cleanup project. In commerce at machine speed, trust is not a brand attribute; it is an architectural decision encoded in identity, context, and control.
This content was produced by Reltio. It was not written by MIT Technology Review's editorial staff.
-
The AI Hype Index: AI goes to war MIT Technology Review Mar 25, 2026 09:00 AM 1 min read MIT Technology Review's highly subjective take on the latest buzz about AI
AI is at war. Anthropic and the Pentagon feuded over how to weaponize Anthropic's AI model Claude; then OpenAI swept the Pentagon off its feet with an "opportunistic and sloppy" deal. Users quit ChatGPT in droves. People marched through London in the biggest protest against AI to date. If you're keeping score, Anthropic, the company founded to be ethical, is now turbocharging US strikes on Iran.
On the lighter side, AI agents are now going viral online. OpenAI hired the creator of OpenClaw, a popular AI agent. Meta snapped up Moltbook, where AI agents seem to ponder their own existence and invent new religions like Crustafarianism. And on RentAHuman, bots are hiring people to deliver CBD gummies. The future isn't AI taking your job. It's AI becoming your boss and finding God.
-
Mozilla dev's "Stack Overflow for agents" targets a key weakness in coding AI Ars Technica AI Mar 24, 2026 09:37 PM 1 min read There are major problems to be solved before it can be adopted, though.
Mozilla developer Peter Wilson has taken to the Mozilla.ai blog to announce cq, which he describes as "Stack Overflow for agents." The nascent project hints at something genuinely useful, but it will have to address security, data poisoning, and accuracy to achieve significant adoption.
It's meant to solve a couple of problems. First, coding agents often use outdated information when making decisions, like attempting deprecated API calls. This stems from training cutoffs and the lack of reliable, structured access to up-to-date runtime context. They sometimes use techniques like RAG (Retrieval Augmented Generation) to get updated knowledge, but they don't always do that when they need to ("unknown unknowns," as the saying goes), and it's never comprehensive when they do.
Second, multiple agents often have to find ways around the same barriers, but there's no knowledge sharing after said training cutoff point. That means hundreds or thousands of individual agents end up using expensive tokens and consuming energy to solve already-solved problems all the time. Ideally, one would solve an issue once, and the others would draw from that experience.
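For readers unfamiliar with the term in the first problem above, retrieval in RAG just means looking relevant text up at answer time and adding it to the prompt. A toy sketch, with a bag-of-words similarity standing in for real embeddings (the corpus, query, and scoring are ours, purely illustrative):

```python
from collections import Counter
import math

# Toy corpus standing in for up-to-date docs a coding agent might consult.
docs = {
    "v2-migration": "fetch_user() was removed in v2; call get_user() instead.",
    "auth-guide": "API keys are passed via the Authorization Bearer header.",
}

def score(query: str, text: str) -> float:
    """Cosine similarity over word counts (a stand-in for embeddings)."""
    q, t = Counter(query.lower().split()), Counter(text.lower().split())
    dot = sum(q[w] * t[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in t.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str) -> str:
    """Return the most relevant doc, which is then prepended to the prompt."""
    return max(docs.values(), key=lambda text: score(query, text))

print(retrieve("how do I call fetch_user in v2?"))
# -> the migration note, pulled in at answer time rather than from training data
```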
-
OpenAI announces plans to shut down its Sora video generator Ars Technica AI Mar 24, 2026 09:19 PM 1 min read Move comes amid a reported plan to refocus on business and productivity use cases.
OpenAI is preparing to shut down Sora, the video-generation app that drew widespread attention when it launched in late 2024.
OpenAI announced the move in a social media post Tuesday just after a Wall Street Journal story broke the news. The company said it will have more to share soon on "timelines for the app and API and details on preserving your work."
"To everyone who created with Sora, shared it, and built community around it: thank you," OpenAI wrote. "What you made with Sora mattered, and we know this news is disappointing."
-
Electronic Frontier Foundation to swap leaders as AI, ICE fights escalate Ars Technica AI Mar 24, 2026 09:00 PM 1 min read Public interest in government tech abuses is peaking. EFF's new leader plans to build on that.
Back in 2022 when Cindy Cohn, the executive director of a US digital rights nonprofit called the Electronic Frontier Foundation, started writing her memoir, Privacy's Defender, she worried that people would think she was an "old fuddy duddy" still sounding alarms about government spying online.
As one of EFF's first litigators and then its longtime leader, Cohn witnessed firsthand how government surveillance became one of the earliest concerns for civil rights advocates when the Internet became mainstream in the 1990s. Since then, attention has pivoted away from caring about government's Internet abuses to focusing much more on Big Tech harms, she said.
But then Donald Trump's second term started, launching aggressive Immigration and Customs Enforcement (ICE) operations nationwide that depended on abusing tech to support its goals of mass deportation. Railing against ICE raids, communities have quickly mobilized to defend online privacy, even banding together across political divides to tear down Flock cameras that can aid in arrests. Maybe even more concerning, as the Department of Homeland Security (DHS) has increasingly sought to unmask ICE critics on social media (and largely failed), EFF has filed and backed lawsuits fighting to protect Americans' rights to track ICE activity and share information anonymously online.
-
Writer denies it, but publisher pulls horror novel after multiple allegations of AI use Ars Technica AI Mar 20, 2026 09:03 PM 1 min read One of the first controversies of its kind.
Shy Girl, a horror novel by Mia Ballard, was one of those buzzy books that leapt from self-published prominence into full-on trade publication. Until yesterday, that is, when publisher Hachette pulled the book from the UK market and canceled plans to bring it to the US.
The move came after a New York Times investigation suggested that AI had been used in significant parts of the work.
"If it isn't AI, she's a terrible writer"
Shy Girl was self-published in 2025 and quickly found an audience on social media. The novel follows a depressed, OCD woman named Gia who, down on her luck, encounters a "sugar daddy" who pays off her debts. All she has to do? Live as his literal pet. Eventually, of course, living like an animal makes her into an animal, and things apparently get nasty.
-
Railway secures $100 million to challenge AWS with AI-native cloud infrastructure VentureBeat AI Jan 22, 2026 02:00 PM 10 min read
Railway, a San Francisco-based cloud platform that has quietly amassed two million developers without spending a dollar on marketing, announced Thursday that it raised $100 million in a Series B funding round, as surging demand for artificial intelligence applications exposes the limitations of legacy cloud infrastructure.
TQ Ventures led the round, with participation from FPV Ventures, Redpoint, and Unusual Ventures. The investment values Railway as one of the most significant infrastructure startups to emerge during the AI boom, capitalizing on developer frustration with the complexity and cost of traditional platforms like Amazon Web Services and Google Cloud.
"As AI models get better at writing code, more and more people are asking the age-old question: where, and how, do I run my applications?" said Jake Cooper, Railway's 28-year-old founder and chief executive, in an exclusive interview with VentureBeat. "The last generation of cloud primitives were slow and outdated, and now with AI moving everything faster, teams simply can't keep up."
The funding is a dramatic acceleration for a company that has charted an unconventional path through the cloud computing industry. Railway raised just $24 million in total before this round, including a $20 million Series A from Redpoint in 2022. The company now processes more than 10 million deployments monthly and handles over one trillion requests through its edge network, metrics that rival far larger and better-funded competitors.
Why three-minute deploy times have become unacceptable in the age of AI coding assistants
Railway's pitch rests on a simple observation: the tools developers use to deploy and manage software were designed for a slower era. A standard build-and-deploy cycle using Terraform, the industry-standard infrastructure tool, takes two to three minutes. That delay, once tolerable, has become a critical bottleneck as AI coding assistants like Claude, ChatGPT, and Cursor can generate working code in seconds.
"When godly intelligence is on tap and can solve any problem in three seconds, those amalgamations of systems become bottlenecks," Cooper told VentureBeat. "What was really cool for humans to deploy in 10 seconds or less is now table stakes for agents."
The company claims its platform delivers deployments in under one second â fast enough to keep pace with AI-generated code. Customers report a tenfold increase in developer velocity and up to 65 percent cost savings compared to traditional cloud providers.
These numbers come directly from enterprise clients, not internal benchmarks. Daniel Lobaton, chief technology officer at G2X, a platform serving 100,000 federal contractors, measured deployment speed improvements of seven times faster and an 87 percent cost reduction after migrating to Railway. His infrastructure bill dropped from $15,000 per month to approximately $1,000.
"The work that used to take me a week on our previous infrastructure, I can do in Railway in like a day," Lobaton said. "If I want to spin up a new service and test different architectures, it would take so long on our old setup. In Railway I can launch six services in two minutes."
Inside the controversial decision to abandon Google Cloud and build data centers from scratch
What distinguishes Railway from competitors like Render and Fly.io is the depth of its vertical integration. In 2024, the company made the unusual decision to abandon Google Cloud entirely and build its own data centers, a move that echoes the famous Alan Kay maxim: "People who are really serious about software should make their own hardware."
"We wanted to design hardware in a way where we could build a differentiated experience," Cooper said. "Having full control over the network, compute, and storage layers lets us do really fast build and deploy loops, the kind that allows us to move at 'agentic speed' while staying 100 percent the smoothest ride in town."
The approach paid dividends during recent widespread outages that affected major cloud providers â Railway remained online throughout.
This soup-to-nuts control enables pricing that undercuts the hyperscalers by roughly 50 percent and newer cloud startups by three to four times. Railway charges by the second for actual compute usage: $0.00000386 per gigabyte-second of memory, $0.00000772 per vCPU-second, and $0.00000006 per gigabyte-second of storage. There are no charges for idle virtual machines â a stark contrast to the traditional cloud model where customers pay for provisioned capacity whether they use it or not.
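To see what those per-second rates mean in practice, here is the arithmetic for an assumed workload (the service shape is ours; the rates are the ones quoted above): one service with 1 vCPU, 2 GB of memory, and 10 GB of storage, running nonstop for a 30-day month.

```python
SECONDS = 30 * 24 * 3600  # 2,592,000 seconds in a 30-day month

memory  = 0.00000386 * 2  * SECONDS   # $/GB-s  * GB   * s
cpu     = 0.00000772 * 1  * SECONDS   # $/vCPU-s * vCPU * s
storage = 0.00000006 * 10 * SECONDS   # $/GB-s  * GB   * s

print(f"memory ${memory:.2f} + cpu ${cpu:.2f} + storage ${storage:.2f}"
      f" = ${memory + cpu + storage:.2f}/month")
# memory $20.01 + cpu $20.01 + storage $1.56 = $41.58/month
```

And since billing is per second of actual usage, a service that idles pays proportionally less, which is the contrast with provisioned-capacity pricing the article describes.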
"The conventional wisdom is that the big guys have economies of scale to offer better pricing," Cooper noted. "But when they're charging for VMs that usually sit idle in the cloud, and we've purpose-built everything to fit much more density on these machines, you have a big opportunity."
How 30 employees built a platform generating tens of millions in annual revenue
Railway has achieved its scale with a team of just 30 employees generating tens of millions in annual revenue, a ratio of revenue per employee that would be exceptional even for established software companies. The company grew revenue 3.5 times last year and continues to expand at 15 percent month-over-month.
Cooper emphasized that the fundraise was strategic rather than necessary. "We're default alive; there's no reason for us to raise money," he said. "We raised because we see a massive opportunity to accelerate, not because we needed to survive."
The company hired its first salesperson only last year and employs just two solutions engineers. Nearly all of Railway's two million users discovered the platform through word of mouth: developers telling other developers about a tool that actually works.
"We basically did the standard engineering thing: if you build it, they will come," Cooper recalled. "And to some degree, they came."
From side projects to Fortune 500 deployments: Railway's unlikely corporate expansion
Despite its grassroots developer community, Railway has made significant inroads into large organizations. The company claims that 31 percent of Fortune 500 companies now use its platform, though deployments range from company-wide infrastructure to individual team projects.
Notable customers include Bilt, the loyalty program company; Intuit's GoCo subsidiary; TripAdvisor's Cruise Critic; and MGM Resorts. Kernel, a Y Combinator-backed startup providing AI infrastructure to over 1,000 companies, runs its entire customer-facing system on Railway for $444 per month.
"At my previous company Clever, which sold for $500 million, I had six full-time engineers just managing AWS," said Rafael Garcia, Kernel's chief technology officer. "Now I have six engineers total, and they all focus on product. Railway is exactly the tool I wish I had in 2012."
For enterprise customers, Railway offers security certifications including SOC 2 Type 2 compliance and HIPAA readiness, with business associate agreements available upon request. The platform provides single sign-on authentication, comprehensive audit logs, and the option to deploy within a customer's existing cloud environment through a "bring your own cloud" configuration.
Enterprise pricing starts at custom levels, with specific add-ons for extended log retention ($200 monthly), HIPAA BAAs ($1,000), enterprise support with SLOs ($2,000), and dedicated virtual machines ($10,000).
The startup's bold strategy to take on Amazon, Google, and a new generation of cloud rivals
Railway enters a crowded market that includes not only the hyperscale cloud providers (Amazon Web Services, Microsoft Azure, and Google Cloud Platform) but also a growing cohort of developer-focused platforms like Vercel, Render, Fly.io, and Heroku.
Cooper argues that Railway's competitors fall into two camps, neither of which has fully committed to the new infrastructure model that AI demands.
"The hyperscalers have two competing systems, and they haven't gone all-in on the new model because their legacy revenue stream is still printing money," he observed. "They have this mammoth pool of cash coming from people who provision a VM, use maybe 10 percent of it, and still pay for the whole thing. To what end are they actually interested in going all the way in on a new experience if they don't really need to?"
Against startup competitors, Railway differentiates by covering the full infrastructure stack. "We're not just containers; we've got VM primitives, stateful storage, virtual private networking, automated load balancing," Cooper said. "And we wrap all of this in an absurdly easy-to-use UI, with agentic primitives so agents can move 1,000 times faster."
The platform supports databases including PostgreSQL, MySQL, MongoDB, and Redis; provides up to 256 terabytes of persistent storage with over 100,000 input/output operations per second; and enables deployment to four global regions spanning the United States, Europe, and Southeast Asia. Enterprise customers can scale to 112 vCPUs and 2 terabytes of RAM per service.
Why investors are betting that AI will create a thousand times more software than exists today
Railway's fundraise reflects broader investor enthusiasm for companies positioned to benefit from the AI coding revolution. As tools like GitHub Copilot, Cursor, and Claude become standard fixtures in developer workflows, the volume of code being written, and the infrastructure needed to run it, is expanding dramatically.
"The amount of software that's going to come online over the next five years is unfathomable compared to what existed before; we're talking a thousand times more software," Cooper predicted. "All of that has to run somewhere."
The company has already integrated directly with AI systems, building what Cooper calls "loops where Claude can hook in, call deployments, and analyze infrastructure automatically." Railway released a Model Context Protocol server in August 2025 that allows AI coding agents to deploy applications and manage infrastructure directly from code editors.
"The notion of a developer is melting before our eyes," Cooper said. "You don't have to be an engineer to engineer things anymore; you just need critical thinking and the ability to analyze things in a systems capacity."
What Railway plans to do with $100 million and zero marketing experience
Railway plans to use the new capital to expand its global data center footprint, grow its team beyond 30 employees, and build what Cooper described as a proper go-to-market operation for the first time in the company's five-year history.
"One of my mentors said you raise money when you can change the trajectory of the business," Cooper explained. "We've built all the required substrate to scale indefinitely; what's been holding us back is simply talking about it. 2026 is the year we play on the world stage."
The company's investor roster reads like a who's who of developer infrastructure. Angel investors include Tom Preston-Werner, co-founder of GitHub; Guillermo Rauch, chief executive of Vercel; Spencer Kimball, chief executive of Cockroach Labs; Olivier Pomel, chief executive of Datadog; and Jori Lallo, co-founder of Linear.
The timing of Railway's expansion coincides with what many in Silicon Valley view as a fundamental shift in how software gets made. Coding assistants are no longer experimental curiosities; they have become essential tools that millions of developers rely on daily. Each line of AI-generated code needs somewhere to run, and the incumbents, by Cooper's telling, are too wedded to their existing business models to fully capitalize on the moment.
Whether Railway can translate developer enthusiasm into sustained enterprise adoption remains an open question. The cloud infrastructure market is littered with promising startups that failed to break the grip of Amazon, Microsoft, and Google. But Cooper, who previously worked as a software engineer at Wolfram Alpha, Bloomberg, and Uber before founding Railway in 2020, seems unfazed by the scale of his ambition.
"In five years, Railway [will be] the place where software gets created and evolved, period," he said. "Deploy instantly, scale infinitely, with zero friction. That's the prize worth playing for, and there's no bigger one on offer."
For a company that built a $100 million business by doing the opposite of what conventional startup wisdom dictates â no marketing, no sales team, no venture hypeâthe real test begins now. Railway spent five years proving that developers would find a better mousetrap on their own. The next five will determine whether the rest of the world is ready to get on board.
-
Claude Code costs up to $200 a month. Goose does the same thing for free. VentureBeat AI Jan 19, 2026 02:00 PM 11 min read
The artificial intelligence coding revolution comes with a catch: it's expensive.
Claude Code, Anthropic's terminal-based AI agent that can write, debug, and deploy code autonomously, has captured the imagination of software developers worldwide. But its pricing, ranging from $20 to $200 per month depending on usage, has sparked a growing rebellion among the very programmers it aims to serve.
Now, a free alternative is gaining traction. Goose, an open-source AI agent developed by Block (the financial technology company formerly known as Square), offers nearly identical functionality to Claude Code but runs entirely on a user's local machine. No subscription fees. No cloud dependency. No rate limits that reset every five hours.
"Your data stays with you, period," said Parth Sareen, a software engineer who demonstrated the tool during a recent livestream. The comment captures the core appeal: Goose gives developers complete control over their AI-powered workflow, including the ability to work offline â even on an airplane.
The project has exploded in popularity. Goose now boasts more than 26,100 stars on GitHub, the code-sharing platform, with 362 contributors and 102 releases since its launch. The latest version, 1.20.1, shipped on January 19, 2026, reflecting a development pace that rivals commercial products.
For developers frustrated by Claude Code's pricing structure and usage caps, Goose represents something increasingly rare in the AI industry: a genuinely free, no-strings-attached option for serious work.
Anthropic's new rate limits spark a developer revolt
To understand why Goose matters, you need to understand the Claude Code pricing controversy.
Anthropic, the San Francisco artificial intelligence company founded by former OpenAI executives, offers Claude Code as part of its subscription tiers. The free plan provides no access whatsoever. The Pro plan, at $17 per month with annual billing (or $20 monthly), limits users to just 10 to 40 prompts every five hours, a constraint that serious developers exhaust within minutes of intensive work.
The Max plans, at $100 and $200 per month, offer more headroom: 50 to 200 prompts and 200 to 800 prompts respectively, plus access to Anthropic's most powerful model, Claude 4.5 Opus. But even these premium tiers come with restrictions that have inflamed the developer community.
In late July, Anthropic announced new weekly rate limits. Under the system, Pro users receive 40 to 80 hours of Sonnet 4 usage per week. Max users at the $200 tier get 240 to 480 hours of Sonnet 4, plus 24 to 40 hours of Opus 4. Nearly five months later, the frustration has not subsided.
The problem? Those "hours" are not actual hours. They represent token-based limits that vary wildly depending on codebase size, conversation length, and the complexity of the code being processed. Independent analysis suggests the actual per-session limits translate to roughly 44,000 tokens for Pro users and 220,000 tokens for the $200 Max plan.
"It's confusing and vague," one developer wrote in a widely shared analysis. "When they say '24-40 hours of Opus 4,' that doesn't really tell you anything useful about what you're actually getting."
The backlash on Reddit and developer forums has been fierce. Some users report hitting their daily limits within 30 minutes of intensive coding. Others have canceled their subscriptions entirely, calling the new restrictions "a joke" and "unusable for real work."
Anthropic has defended the changes, stating that the limits affect fewer than five percent of users and target people running Claude Code "continuously in the background, 24/7." But the company has not clarified whether that figure refers to five percent of Max subscribers or five percent of all users - a distinction that matters enormously.
How Block built a free AI coding agent that works offline
Goose takes a radically different approach to the same problem.
Built by Block, the payments company led by Jack Dorsey, Goose is what engineers call an "on-machine AI agent." Unlike Claude Code, which sends your queries to Anthropic's servers for processing, Goose can run entirely on your local computer using open-source language models that you download and control yourself.
The project's documentation describes it as going "beyond code suggestions" to "install, execute, edit, and test with any LLM." That last phrase - "any LLM" - is the key differentiator. Goose is model-agnostic by design.
You can connect Goose to Anthropic's Claude models if you have API access. You can use OpenAI's GPT-5 or Google's Gemini. You can route it through services like Groq or OpenRouter. Or - and this is where things get interesting - you can run it entirely locally using tools like Ollama, which let you download and execute open-source models on your own hardware.
The practical implications are significant. With a local setup, there are no subscription fees, no usage caps, no rate limits, and no concerns about your code being sent to external servers. Your conversations with the AI never leave your machine.
"I use Ollama all the time on planes â it's a lot of fun!" Sareen noted during a demonstration, highlighting how local models free developers from the constraints of internet connectivity.
What Goose can do that traditional code assistants can't
Goose operates as a command-line tool or desktop application that can autonomously perform complex development tasks. It can build entire projects from scratch, write and execute code, debug failures, orchestrate workflows across multiple files, and interact with external APIs - all without constant human oversight.
The architecture relies on what the AI industry calls "tool calling" or "function calling" - the ability for a language model to request specific actions from external systems. When you ask Goose to create a new file, run a test suite, or check the status of a GitHub pull request, it doesn't just generate text describing what should happen. It actually executes those operations.
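To see the mechanics, consider a minimal Python sketch of a tool-calling loop. The JSON request format and the run_tests tool here are illustrative assumptions, not Goose's actual internals, but the control flow - the model emits a structured request, the harness executes it, and the result goes back to the model - is the same idea.

import json
import subprocess

# One illustrative tool the agent could expose to the model.
def run_tests(args):
    result = subprocess.run(["pytest", args["path"]], capture_output=True, text=True)
    return result.stdout

TOOLS = {"run_tests": run_tests}

def handle_model_reply(reply: str) -> str:
    # The model answers with plain text, or with a JSON tool request such as
    # {"tool": "run_tests", "args": {"path": "tests/"}}.
    try:
        request = json.loads(reply)
    except json.JSONDecodeError:
        return reply  # ordinary text: nothing to execute
    observation = TOOLS[request["tool"]](request["args"])
    return observation  # fed back to the model as the tool result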
This capability depends heavily on the underlying language model. Claude 4 models from Anthropic currently perform best at tool calling, according to the Berkeley Function-Calling Leaderboard, which ranks models on their ability to translate natural language requests into executable code and system commands.
But newer open-source models are catching up quickly. Goose's documentation highlights several options with strong tool-calling support: Meta's Llama series, Alibaba's Qwen models, Google's Gemma variants, and DeepSeek's reasoning-focused architectures.
The tool also integrates with the Model Context Protocol, or MCP, an emerging standard for connecting AI agents to external services. Through MCP, Goose can access databases, search engines, file systems, and third-party APIs - extending its capabilities far beyond what the base language model provides.
Setting up Goose with a local model
For developers interested in a completely free, privacy-preserving setup, the process involves three main components: Goose itself, Ollama (a tool for running open-source models locally), and a compatible language model.
Step 1: Install Ollama
Ollama is an open-source project that dramatically simplifies the process of running large language models on personal hardware. It handles the complex work of downloading, optimizing, and serving models through a simple interface.
Download and install Ollama from ollama.com. Once installed, you can pull models with a single command. For coding tasks, Qwen 2.5 offers strong tool-calling support:
ollama run qwen2.5
The model downloads automatically and begins running on your machine.
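Under the hood, Ollama also serves a local HTTP API on port 11434, which is what Goose will connect to. A quick request is an easy way to confirm the model is responding before moving on (the prompt here is arbitrary):

curl http://localhost:11434/api/generate -d '{"model": "qwen2.5", "prompt": "Say hello in one word", "stream": false}'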
Step 2: Install Goose
Goose is available as both a desktop application and a command-line interface. The desktop version provides a more visual experience, while the CLI appeals to developers who prefer working entirely in the terminal.
Installation instructions vary by operating system but generally involve downloading from Goose's GitHub releases page or using a package manager. Block provides pre-built binaries for macOS (both Intel and Apple Silicon), Windows, and Linux.
Step 3: Configure the Connection
In Goose Desktop, navigate to Settings, then Configure Provider, and select Ollama. Confirm that the API Host is set to http://localhost:11434 (Ollama's default port) and click Submit.
For the command-line version, run goose configure, select "Configure Providers," choose Ollama, and enter the model name when prompted.
That's it. Goose is now connected to a language model running entirely on your hardware, ready to execute complex coding tasks without any subscription fees or external dependencies.
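From here, work happens conversationally. In recent CLI releases an interactive session starts with a single command - the exact subcommand can shift between versions, so check goose --help if this differs on your install:

goose session

Then describe a task in plain English, such as "write unit tests for utils.py and run them," and Goose will plan and execute it against your local model.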
The RAM, processing power, and trade-offs you should know about
The obvious question: what kind of computer do you need?
Running large language models locally requires substantially more computational resources than typical software. The key constraint is memory - specifically, RAM on most systems, or VRAM if using a dedicated graphics card for acceleration.
Block's documentation suggests that 32 gigabytes of RAM provides "a solid baseline for larger models and outputs." For Mac users, this means the computer's unified memory is the primary bottleneck. For Windows and Linux users with discrete NVIDIA graphics cards, GPU memory (VRAM) matters more for acceleration.
But you don't necessarily need expensive hardware to get started. Smaller models with fewer parameters run on much more modest systems. Qwen 2.5, for instance, comes in multiple sizes, and the smaller variants can operate effectively on machines with 16 gigabytes of RAM.
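With Ollama, size selection is just a model tag. As an illustrative starting point for a 16-gigabyte machine, the 7-billion-parameter variant of Qwen 2.5 can be pulled explicitly:

ollama pull qwen2.5:7b

Swapping to a larger tag later (qwen2.5:14b, for example) requires changing nothing in Goose beyond the configured model name.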
"You don't need to run the largest models to get excellent results," Sareen emphasized. The practical recommendation: start with a smaller model to test your workflow, then scale up as needed.
For context, Apple's entry-level MacBook Air with 8 gigabytes of RAM would struggle with most capable coding models. But a MacBook Pro with 32 gigabytes - increasingly common among professional developers - handles them comfortably.
Why keeping your code off the cloud matters more than ever
Goose with a local LLM is not a perfect substitute for Claude Code. The comparison involves real trade-offs that developers should understand.
Model Quality: Claude 4.5 Opus, Anthropic's flagship model, remains arguably the most capable AI for software engineering tasks. It excels at understanding complex codebases, following nuanced instructions, and producing high-quality code on the first attempt. Open-source models have improved dramatically, but a gap persists - particularly for the most challenging tasks.
One developer who switched to the $200 Claude Code plan described the difference bluntly: "When I say 'make this look modern,' Opus knows what I mean. Other models give me Bootstrap circa 2015."
Context Window: Claude Sonnet 4.5, accessible through the API, offers a massive one-million-token context window - enough to load entire large codebases without chunking or context management issues. Most local models are limited to 4,096 or 8,192 tokens by default, though many can be configured for longer contexts at the cost of increased memory usage and slower processing (see the Modelfile sketch after this list).
Speed: Cloud-based services like Claude Code run on dedicated server hardware optimized for AI inference. Local models, running on consumer laptops, typically process requests more slowly. The difference matters for iterative workflows where you're making rapid changes and waiting for AI feedback.
Tooling Maturity: Claude Code benefits from Anthropic's dedicated engineering resources. Features like prompt caching (which can reduce costs by up to 90 percent for repeated contexts) and structured outputs are polished and well-documented. Goose, while actively developed with 102 releases to date, relies on community contributions and may lack equivalent refinement in specific areas.
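On the context-window point above: with Ollama, a longer window can be baked into a named model variant through a two-line Modelfile. The 32,768-token figure below is only an example, and memory usage rises with it:

FROM qwen2.5
PARAMETER num_ctx 32768

Saving that as Modelfile and running ollama create qwen2.5-32k -f Modelfile produces a long-context variant that Goose can select by name.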
How Goose stacks up against Cursor, GitHub Copilot, and the paid AI coding market
Goose enters a crowded market of AI coding tools, but occupies a distinctive position.
Cursor, a popular AI-enhanced code editor, charges $20 per month for its Pro tier and $200 for Ultra - pricing that mirrors Claude Code's Max plans. Cursor provides approximately 4,500 Sonnet 4 requests per month at the Ultra level, a substantially different allocation model than Claude Code's hourly resets.
Cline, Roo Code, and similar open-source projects offer AI coding assistance but with varying levels of autonomy and tool integration. Many focus on code completion rather than the agentic task execution that defines Goose and Claude Code.
Amazon's CodeWhisperer, GitHub Copilot, and enterprise offerings from major cloud providers target large organizations with complex procurement processes and dedicated budgets. They are less relevant to individual developers and small teams seeking lightweight, flexible tools.
Goose's combination of genuine autonomy, model agnosticism, local operation, and zero cost creates a unique value proposition. The tool is not trying to compete with commercial offerings on polish or model quality. It's competing on freedom - both financial and architectural.
The $200-a-month era for AI coding tools may be ending
The AI coding tools market is evolving quickly. Open-source models are improving at a pace that continually narrows the gap with proprietary alternatives. Moonshot AI's Kimi K2 and z.ai's GLM 4.5 now benchmark near Claude Sonnet 4 levels - and they're freely available.
If this trajectory continues, the quality advantage that justifies Claude Code's premium pricing may erode. Anthropic would then face pressure to compete on features, user experience, and integration rather than raw model capability.
For now, developers face a clear choice. Those who need the absolute best model quality, who can afford premium pricing, and who accept usage restrictions may prefer Claude Code. Those who prioritize cost, privacy, offline access, and flexibility have a genuine alternative in Goose.
The fact that a $200-per-month commercial product has a zero-dollar open-source competitor with comparable core functionality is itself remarkable. It reflects both the maturation of open-source AI infrastructure and the appetite among developers for tools that respect their autonomy.
Goose is not perfect. It requires more technical setup than commercial alternatives. It depends on hardware resources that not every developer possesses. Its model options, while improving rapidly, still trail the best proprietary offerings on complex tasks.
But for a growing community of developers, those limitations are acceptable trade-offs for something increasingly rare in the AI landscape: a tool that truly belongs to them.
Goose is available for download at github.com/block/goose. Ollama is available at ollama.com. Both projects are free and open source.
-
Listen Labs raises $69M after viral billboard hiring stunt to scale AI customer interviews VentureBeat AI Jan 16, 2026 02:01 PM 10 min read
Alfred Wahlforss was running out of options. His startup, Listen Labs, needed to hire over 100 engineers, but competing against Mark Zuckerberg's $100 million offers seemed impossible. So he spent $5,000 - a fifth of his marketing budget - on a billboard in San Francisco displaying what looked like gibberish: five strings of random numbers.
The numbers were actually AI tokens. Decoded, they led to a coding challenge: build an algorithm to act as a digital bouncer at Berghain, the Berlin nightclub famous for rejecting nearly everyone at the door. Within days, thousands attempted the puzzle; 430 cracked it. Some got hired. The winner flew to Berlin, all expenses paid.
That unconventional approach has now attracted $69 million in Series B funding, led by Ribbit Capital with participation from Evantic and existing investors Sequoia Capital, Conviction, and Pear VC. The round values Listen Labs at $500 million and brings its total capital to $100 million. In nine months since launch, the company has grown annualized revenue by 15x to eight figures and conducted over one million AI-powered interviews.
"When you obsess over customers, everything else follows," Wahlforss said in an interview with VentureBeat. "Teams that use Listen bring the customer into every decision, from marketing to product, and when the customer is delighted, everyone is."
Why traditional market research is broken, and what Listen Labs is building to fix it
Listen's AI researcher finds participants, conducts in-depth interviews, and delivers actionable insights in hours, not weeks. The platform replaces the traditional choice between quantitative surveys - which provide statistical precision but miss nuance - and qualitative interviews, which deliver depth but cannot scale.
Wahlforss explained the limitation of existing approaches: "Essentially surveys give you false precision because people end up answering the same question... You can't get the outliers. People are actually not honest on surveys." The alternative, one-on-one human interviews, "gives you a lot of depth. You can ask follow up questions. You can kind of double check if they actually know what they're talking about. And the problem is you can't scale that."
The platform works in four steps: users create a study with AI assistance, Listen recruits participants from its global network of 30 million people, an AI moderator conducts in-depth interviews with follow-up questions, and results are packaged into executive-ready reports including key themes, highlight reels, and slide decks.
What distinguishes Listen's approach is its use of open-ended video conversations rather than multiple-choice forms. "In a survey, you can kind of guess what you should answer, and you have four options," Wahlforss said. "Oh, they probably want me to buy high income. Let me click on that button versus an open ended response. It just generates much more honesty."
The dirty secret of the $140 billion market research industry: rampant fraud
Listen finds and qualifies the right participants in its global network of 30 million people. But building that panel required confronting what Wahlforss called "one of the most shocking things that we've learned when we entered this industry" - rampant fraud.
"Essentially, there's a financial transaction involved, which means there will be bad players," he explained. "We actually had some of the largest companies, some of them have billions in revenue, send us people who claim to be kind of enterprise buyers to our platform and our system immediately detected, like, fraud, fraud, fraud, fraud, fraud."
The company built what it calls a "quality guard" that cross-references LinkedIn profiles with video responses to verify identity, checks consistency across how participants answer questions, and flags suspicious patterns. The result, according to Wahlforss: "People talk three times more. They're much more honest when they talk about sensitive topics like politics and mental health."
Emeritus, an online education company that uses Listen, reported that approximately 20% of survey responses previously fell into the fraudulent or low-quality category. With Listen, they reduced this to almost zero. "We did not have to replace any responses because of fraud or gibberish information," said Gabrielli Tiburi, Assistant Manager of Customer Insights at Emeritus.
How Microsoft, Sweetgreen, and Chubbies are using AI interviews to build better products
The speed advantage has proven central to Listen's pitch. Traditional customer research at Microsoft could take four to six weeks to generate insights. "By the time we get to them, either the decision has been made or we lose out on the opportunity to actually influence it," said Romani Patel, Senior Research Manager at Microsoft.
With Listen, Microsoft can now get insights in days, and in many cases, within hours.
The platform has already powered several high-profile initiatives. Microsoft used Listen Labs to collect global customer stories for its 50th anniversary celebration. "We wanted users to share how Copilot is empowering them to bring their best self forward," Patel said, "and we were able to collect those user video stories within a day." Traditionally, that kind of work would have taken six to eight weeks.
Simple Modern, an Oklahoma-based drinkware company, used Listen to test a new product concept. The process took about an hour to write questions, an hour to launch the study, and 2.5 hours to receive feedback from 120 people across the country. "We went from 'Should we even have this product?' to 'How should we launch it?'" said Chris Hoyle, the company's Chief Marketing Officer.
Chubbies, the shorts brand, achieved a 24x increase in youth research participation - growing from 5 to 120 participants - by using Listen to overcome the scheduling challenges of traditional focus groups with children. "There's school, sports, dinner, and homework," explained Lauren Neville, Director of Insights and Innovation. "I had to find a way to hear from them that fit into their schedules."
The company also discovered product issues through AI interviews that might have gone undetected otherwise. Wahlforss described how the AI "through conversations, realized there were like issues with the kids short line, and decided to, like, interview hundreds of kids. And I understand that there were issues in the liner of the shorts and that they were, like, scratchy, quote, unquote, according to the people interviewed." The redesigned product became "a blockbuster hit."
The Jevons paradox explains why cheaper research creates more demand, not less
Listen Labs is entering a massive but fragmented market. Wahlforss cited research from Andreessen Horowitz estimating the market research industry at roughly $140 billion annually, populated by legacy players - some with more than a billion dollars in revenue - that he believes are vulnerable to disruption.
"There are very much existing budget lines that we are replacing," Wahlforss said. "Why we're replacing them is that one, they're super costly. Two, they're kind of stuck in this old paradigm of choosing between a survey or interview, and they also take months to work with."
But the more intriguing dynamic may be that AI-powered research doesn't just replace existing spending - it creates new demand. Wahlforss invoked the Jevons paradox, the economic observation that when technology makes a resource cheaper and more efficient to use, overall consumption tends to rise rather than fall.
"What I've noticed is that as something gets cheaper, you don't need less of it. You want more of it," Wahlforss explained. "There's infinite demand for customer understanding. So the researchers on the team can do an order of magnitude more research, and also other people who weren't researchers before can now do that as part of their job."
Inside the elite engineering team that built Listen Labs before they had a working toilet
Listen Labs traces its origins to a consumer app that Wahlforss and his co-founder built after meeting at Harvard. "We built this consumer app that got 20,000 downloads in one day," Wahlforss recalled. "We had all these users, and we were thinking like, okay, what can we do to get to know them better? And we built this prototype of what Listen is today."
The founding team brings an unusual pedigree. Wahlforss's co-founder "was the national champion in competitive programming in Germany, and he worked at Tesla Autopilot." The company claims that 30% of its engineering team are medalists from the International Olympiad in Informatics â the same competition that produced the founders of Cognition, the AI coding startup.
The Berghain billboard stunt generated approximately 5 million views across social media, according to Wahlforss. It reflected the intensity of the talent war in the Bay Area.
"We had to do these things because some of our, like early employees, joined the company before we had a working toilet," he said. "But now we fixed that situation."
The company grew from 5 to 40 employees in 2024 and plans to reach 150 this year. It hires engineers for non-engineering roles across marketing, growth, and operations - a bet that in the AI era, technical fluency matters everywhere.
Synthetic customers and automated decisions: what Listen Labs is building next
Wahlforss outlined an ambitious product roadmap that pushes into more speculative territory. The company is building "the ability to simulate your customers, so you can take all of those interviews we've done, and then extrapolate based on that and create synthetic users or simulated user voices."
Beyond simulation, Listen aims to enable automated action based on research findings. "Can you not just make recommendations, but also spawn agents to either change things in code or some customer churns? Can you give them a discount and try to bring them back?"
Wahlforss acknowledged the ethical implications. "Obviously, as you said, there's kind of ethical concerns there. Of like, automated decision making overall can be bad, but we will have considerable guardrails to make sure that the companies are always in the loop."
The company already handles sensitive data with care. "We don't train on any of the data," Wahlforss said. "We will also scrub any sensitive PII automatically so the model can detect that. And there are times when, for example, you work with investors, where if you accidentally mention something that could be material, non public information, the AI can actually detect that and remove any information like that."
How AI could reshape the future of product development
Perhaps the most provocative implication of Listen's model is how it could reshape product development itself. Wahlforss described a customer - an Australian startup - that has adopted what amounts to a continuous feedback loop.
"They're based in Australia, so they're coding during the day, and then in their night, they're releasing a Listen study with an American audience. Listen validates whatever they built during the day, and they get feedback on that. They can then plug that feedback directly into coding tools like Claude Code and iterate."
The vision extends Y Combinator's famous dictum - "write code, talk to users" - into an automated cycle. "Write code is now getting automated. And I think like talk to users will be as well, and you'll have this kind of infinite loop where you can start to ship this truly amazing product, almost kind of autonomously."
Whether that vision materializes depends on factors beyond Listen's control - the continued improvement of AI models, enterprise willingness to trust automated research, and whether speed truly correlates with better products. A 2024 MIT study found that 95% of AI pilots fail to move into production, a statistic Wahlforss cited as the reason he emphasizes quality over demos.
"I'm constantly have to emphasize like, let's make sure the quality is there and the details are right," he said.
But the company's growth suggests appetite for the experiment. Microsoft's Patel said Listen has "removed the drudgery of research and brought the fun and joy back into my work." Chubbies is now pushing its founder to give everyone in the company a login. Sling Money, a stablecoin payments startup, can create a survey in ten minutes and receive results the same day.
"It's a total game changer," said Ali Romero, Sling Money's marketing manager.
Wahlforss has a different phrase for what he's building. When asked about the tension between speed and rigor - the long-held belief that moving fast means cutting corners - he cited Nat Friedman, the former GitHub CEO and Listen investor, who keeps a list of one-liners on his website.
One of them: "Slow is fake."
It's an aggressive claim for an industry built on methodological caution. But Listen Labs is betting that in the AI era, the companies that listen fastest will be the ones that win. The only question is whether customers will talk back.
-
Salesforce rolls out new Slackbot AI agent as it battles Microsoft and Google in workplace AI VentureBeat AI Jan 13, 2026 01:00 PM 12 min read
Salesforce on Tuesday launched an entirely rebuilt version of Slackbot, the company's workplace assistant, transforming it from a simple notification tool into what executives describe as a fully powered AI agent capable of searching enterprise data, drafting documents, and taking action on behalf of employees.
The new Slackbot, now generally available to Business+ and Enterprise+ customers, is Salesforce's most aggressive move yet to position Slack at the center of the emerging "agentic AI" movement - where software agents work alongside humans to complete complex tasks. The launch comes as Salesforce attempts to convince investors that artificial intelligence will bolster its products rather than render them obsolete.
"Slackbot isn't just another copilot or AI assistant," said Parker Harris, Salesforce co-founder and Slack's chief technology officer, in an exclusive interview with Salesforce. "It's the front door to the agentic enterprise, powered by Salesforce."
From tricycle to Porsche: Salesforce rebuilt Slackbot from the ground up
Harris was blunt about what distinguishes the new Slackbot from its predecessor: "The old Slackbot was, you know, a little tricycle, and the new Slackbot is like, you know, a Porsche."
The original Slackbot, which has existed since Slack's early days, performed basic algorithmic tasks - reminding users to add colleagues to documents, suggesting channel archives, and delivering simple notifications. The new version runs on an entirely different architecture built around a large language model and sophisticated search capabilities that can access Salesforce records, Google Drive files, calendar data, and years of Slack conversations.
"It's two different things," Harris explained. "The old Slackbot was algorithmic and fairly simple. The new Slackbot is brand new â it's based around an LLM and a very robust search engine, and connections to third-party search engines, third-party enterprise data."
Salesforce chose to retain the Slackbot brand despite the fundamental technical overhaul. "People know what Slackbot is, and so we wanted to carry that forward," Harris said.
Why Anthropic's Claude powers the new Slackbot - and which AI models could come next
The new Slackbot runs on Claude, Anthropic's large language model, a choice driven partly by compliance requirements. Slack's commercial service operates under FedRAMP Moderate certification to serve U.S. federal government customers, and Harris said Anthropic was "the only provider that could give us a compliant LLM" when Slack began building the new system.
But that exclusivity won't last. "We are, this year, going to support additional providers," Harris said. "We have a great relationship with Google. Gemini is incredible - performance is great, cost is great. So we're going to use Gemini for some things." He added that OpenAI remains a possibility as well.
Harris echoed Salesforce CEO Marc Benioff's view that large language models are becoming commoditized: "You've heard Marc talk about LLMs are commodities, that they're democratized. I call them CPUs."
On the sensitive question of training data, Harris was unequivocal: Salesforce does not train any models on customer data. "Models don't have any sort of security," he explained. "If we trained it on some confidential conversation that you and I have, I don't want Carolyn to know - if I train it into the LLM, there is no way for me to say you get to see the answer, but Carolyn doesn't."
Inside Salesforce's internal experiment: 80,000 employees tested Slackbot with striking results
Salesforce has been testing the new Slackbot internally for months, rolling it out to all 80,000 employees. According to Ryan Gavin, Slack's chief marketing officer, the results have been striking: "It's the fastest adopted product in Salesforce history."
Internal data shows that two-thirds of Salesforce employees have tried the new Slackbot, with 80% of those users continuing to use it regularly. Internal satisfaction rates reached 96% - the highest for any AI feature Slack has shipped. Employees report saving between two and 20 hours per week.
The adoption happened largely organically. "I think it was about five days, and a Canvas was developed by our employees called 'The Most Stealable Slackbot Prompts,'" Gavin said. "People just started adding to it organically. I think it's up to 250-plus prompts that are in this Canvas right now."
Kate Crotty, a principal UX researcher at Salesforce, found that 73% of internal adoption was driven by social sharing rather than top-down mandates. "Everybody is there to help each other learn and communicate hacks," she said.
How Slackbot transforms scattered enterprise data into executive-ready insights
During a product demonstration, Amy Bauer, Slack's product experience designer, showed how Slackbot can synthesize information across multiple sources. In one example, she asked Slackbot to analyze customer feedback from a pilot program, uploaded an image of a usage dashboard, and had Slackbot correlate the qualitative and quantitative data.
"This is where Slackbot really earns its keep for me," Bauer explained. "What it's doing is not just simply reading the image â it's actually looking at the image and comparing it to the insight it just generated for me."
Slackbot can then query Salesforce to find enterprise accounts with open deals that might be good candidates for early access, creating what Bauer called "a really great justification and plan to move forward." Finally, it can synthesize all that information into a Canvas - Slack's collaborative document format - and find calendar availability among stakeholders to schedule a review meeting.
"Up until this point, we have been working in a one-to-one capacity with Slackbot," Bauer said. "But one of the benefits that I can do now is take this insight and have it generate this into a Canvas, a shared workspace where I can iterate on it, refine it with Slackbot, or share it out with my team."
Rob Seaman, Slack's chief product officer, said the Canvas creation demonstrates where the product is heading: "This is making a tool call internally to Slack Canvas to actually write, effectively, a shared document. But it signals where we're going with Slackbot - we're eventually going to be adding in additional third-party tool calls."
MrBeast's company became a Slackbot guinea pig - and employees say they're saving 90 minutes a day
Among Salesforce's pilot customers is Beast Industries, the parent company of YouTube star MrBeast. Luis Madrigal, the company's chief information officer, joined the launch announcement to describe his experience.
"As somebody who has rolled out enterprise technologies for over two decades now, this was practically one of the easiest," Madrigal said. "The plumbing is there. Slack as an implementation, Enterprise Tools â being able to turn on the Slackbot and the Slack AI functionality was as simple as having my team go in, review, do a quick security review."
Madrigal said his security team signed off "rather quickly" - unusual for enterprise AI deployments - because Slackbot accesses only the information each individual user already has permission to view. "Given all the guardrails you guys have put into place for Slackbot to be unique and customized to only the information that each individual user has, only the conversations and the Slack rooms and Slack channels that they're part of - that made my security team sign off rather quickly."
One Beast Industries employee, Sinan, the head of Beast Games marketing, reported saving "at bare minimum, 90 minutes a day." Another employee, Spencer, a creative supervisor, described it as "an assistant who's paying attention when I'm not."
Other pilot customers include Slalom, reMarkable, Xero, Mercari, and Engine. Mollie Bodensteiner, SVP of Operations at Engine, called Slackbot "an absolute 'chaos tamer' for our team," estimating it saves her about 30 minutes daily "just by eliminating context switching."
Slackbot vs. Microsoft Copilot vs. Google Gemini: The fight for enterprise AI dominance
The launch puts Salesforce in direct competition with Microsoft's Copilot, which is integrated into Teams and the broader Microsoft 365 suite, as well as Google's Gemini integrations across Workspace. When asked what distinguishes Slackbot from these alternatives, Seaman pointed to context and convenience.
"The thing that makes it most powerful for our customers and users is the proximity â it's just right there in your Slack," Seaman said. "There's a tremendous convenience affordance that's naturally built into it."
The deeper advantage, executives argue, is that Slackbot already understands users' work without requiring setup or training. "Most AI tools sound the same no matter who is using them," the company's announcement stated. "They lack context, miss nuance, and force you to jump between tools to get anything done."
Harris put it more directly: "If you've ever had that magic experience with AI - I think ChatGPT is a great example, it's a great experience from a consumer perspective - Slackbot is really what we're doing in the enterprise, to be this employee super agent that is loved, just like people love using Slack."
Amy Bauer emphasized the frictionless nature of the experience. "Slackbot is inherently grounded in the context, in the data that you have in Slack," she said. "So as you continue working in Slack, Slackbot gets better because it's grounded in the work that you're doing there. There is no setup. There is no configuration for those end users."
Salesforce's ambitious plan to make Slackbot the one 'super agent' that controls all the others
Salesforce positions Slackbot as what Harris calls a "super agent" - a central hub that can eventually coordinate with other AI agents across an organization.
"Every corporation is going to have an employee super agent," Harris said. "Slackbot is essentially taking the magic of what Slack does. We think that Slackbot, and we're really excited about it, is going to be that."
The vision extends to third-party agents already launching in Slack. Last month, Anthropic released a preview of Claude Code for Slack, allowing developers to interact with Claude's coding capabilities directly in chat threads. OpenAI, Google, Vercel, and others have also built agents for the platform.
"Most of the net-new apps that are being deployed to Slack are agents," Seaman noted during the press conference. "This is proof of the promise of humans and agents coexisting and working together in Slack to solve problems."
Harris described a future where Slackbot becomes an MCP (Model Context Protocol) client, able to leverage tools from across the software ecosystem - similar to how the developer tool Cursor works. "Slack can be an MCP client, and Slackbot will be the hub of that, leveraging all these tools out in the world, some of which will be these amazing agents," he said.
But Harris also cautioned against over-promising on multi-agent coordination. "I still think we're in the single agent world," he said. "FY26 is going to be the year where we started to see more coordination. But we're going to do it with customer success in mind, and not demonstrate and talk about, like, 'I've got 1,000 agents working together,' because I think that's unrealistic."
Slackbot costs nothing extra, but Salesforce's data access fees could squeeze some customers
Slackbot is included at no additional cost for customers on Business+ and Enterprise+ plans. "There's no additional fees customers have to do," Gavin confirmed. "If they're on one of those plans, they're going to get Slackbot."
However, some enterprise customers may face other cost pressures related to Salesforce's broader data strategy. CIOs may see price increases for third-party applications that work with Salesforce data, as effects of higher charges for API access ripple through the software supply chain.
Fivetran CEO George Fraser has warned that Salesforce's shift in pricing policy for API access could have tangible consequences for enterprises relying on Salesforce as a system of record. "They might not be able to use Fivetran to replicate their data to Snowflake and instead have to use Salesforce Data Cloud. Or they might find that they are not able to interact with their data via ChatGPT, and instead have to use Agentforce," Fraser said in a recent CIO report.
Salesforce has framed the pricing change as standard industry practice.
What Slackbot can do today, what's coming in weeks, and what's still on the roadmap
The new Slackbot begins rolling out today and will reach all eligible customers by the end of February. Mobile availability will be complete by March 3, Bauer confirmed during her interview with VentureBeat.
Some capabilities remain works in progress. Calendar reading and availability checking are available at launch, but the ability to actually book meetings is "coming a few weeks after," according to Seaman. Image generation is not currently supported, though Bauer said it's "something that we are looking at in the future."
When asked about integration with competing CRM systems like HubSpot and Microsoft Dynamics, Salesforce representatives declined to provide specifics during the interview, though they acknowledged the question touched on key competitive differentiators.
Salesforce is betting the future of work looks like a chat window - and it's not alone
The Slackbot launch is Salesforce's bet that the future of enterprise work is conversational - that employees will increasingly prefer to interact with AI through natural language rather than navigating traditional software interfaces.
Harris described Slack's product philosophy using principles like "don't make me think" and "be a great host." The goal, he said, is for Slackbot to surface information proactively rather than requiring users to hunt for it.
"One of the revelations for me is LLMs applied to unstructured information are incredible," Harris said. "And the amount of value you have if you're a Slack user, if your corporation uses Slack â the amount of value in Slack is unbelievable. Because you're talking about work, you're sharing documents, you're making decisions, but you can't as a human go through that and really get the same value that an LLM can do."
Looking ahead, Harris expects the interfaces themselves to evolve beyond pure conversation. "We're kind of saturating what we can do with purely conversational UIs," he said. "I think we'll start to see agents building an interface that best suits your intent, as opposed to trying to surface something within a conversational interface that matches your intent."
Microsoft, Google, and a growing roster of AI startups are placing similar bets â that the winning enterprise AI will be the one embedded in the tools workers already use, not another application to learn. The race to become that invisible layer of workplace intelligence is now fully underway.
For Salesforce, the stakes extend beyond a single product launch. After a bruising year on Wall Street and persistent questions about whether AI threatens its core business, the company is wagering that Slackbot can prove the opposite - that the tens of millions of people already chatting in Slack every day are not a vulnerability, but an unassailable advantage.
Haley Gault, a Salesforce account executive in Pittsburgh who stumbled upon the new Slackbot on a snowy morning, captured the shift in a single sentence: "I honestly can't imagine working for another company not having access to these types of tools. This is just how I work now."
That's precisely what Salesforce is counting on.
-
Anthropic launches Cowork, a Claude Desktop agent that works in your files - no coding required VentureBeat AI Jan 12, 2026 11:30 AM 9 min read
Anthropic released Cowork on Monday, a new AI agent capability that extends the power of its wildly successful Claude Code tool to non-technical users - and according to company insiders, the team built the entire feature in approximately a week and a half, largely using Claude Code itself.
The launch marks a major inflection point in the race to deliver practical AI agents to mainstream users, positioning Anthropic to compete not just with OpenAI and Google in conversational AI, but with Microsoft's Copilot in the burgeoning market for AI-powered productivity tools.
"Cowork lets you complete non-technical tasks much like how developers use Claude Code," the company announced via its official Claude account on X. The feature arrives as a research preview available exclusively to Claude Max subscribers â Anthropic's power-user tier priced between $100 and $200 per month â through the macOS desktop application.
For the past year, the industry narrative has focused on large language models that can write poetry or debug code. With Cowork, Anthropic is betting that the real enterprise value lies in an AI that can open a folder, read a messy pile of receipts, and generate a structured expense report without human hand-holding.
How developers using a coding tool for vacation research inspired Anthropic's latest product
The genesis of Cowork lies in Anthropic's recent success with the developer community. In late 2024, the company released Claude Code, a terminal-based tool that allowed software engineers to automate rote programming tasks. The tool was a hit, but Anthropic noticed a peculiar trend: users were forcing the coding tool to perform non-coding labor.
According to Boris Cherny, an engineer at Anthropic, the company observed users deploying the developer tool for an unexpectedly diverse array of tasks.
"Since we launched Claude Code, we saw people using it for all sorts of non-coding work: doing vacation research, building slide decks, cleaning up your email, cancelling subscriptions, recovering wedding photos from a hard drive, monitoring plant growth, controlling your oven," Cherny wrote on X. "These use cases are diverse and surprising â the reason is that the underlying Claude Agent is the best agent, and Opus 4.5 is the best model."
Recognizing this shadow usage, Anthropic effectively stripped the command-line complexity from its developer tool to create a consumer-friendly interface. In its blog post announcing the feature, Anthropic explained that developers "quickly began using it for almost everything else," which "prompted us to build Cowork: a simpler way for anyone - not just developers - to work with Claude in the very same way."
Inside the folder-based architecture that lets Claude read, edit, and create files on your computer
Unlike a standard chat interface where a user pastes text for analysis, Cowork requires a different level of trust and access. Users designate a specific folder on their local machine that Claude can access. Within that sandbox, the AI agent can read existing files, modify them, or create entirely new ones.
Anthropic offers several illustrative examples: reorganizing a cluttered downloads folder by sorting and intelligently renaming each file, generating a spreadsheet of expenses from a collection of receipt screenshots, or drafting a report from scattered notes across multiple documents.
"In Cowork, you give Claude access to a folder on your computer. Claude can then read, edit, or create files in that folder," the company explained on X. "Try it to create a spreadsheet from a pile of screenshots, or produce a first draft from scattered notes."
The architecture relies on what is known as an "agentic loop." When a user assigns a task, the AI does not merely generate a text response. Instead, it formulates a plan, executes steps in parallel, checks its own work, and asks for clarification if it hits a roadblock. Users can queue multiple tasks and let Claude process them simultaneously - a workflow Anthropic describes as feeling "much less like a back-and-forth and much more like leaving messages for a coworker."
The system is built on Anthropic's Claude Agent SDK, meaning it shares the same underlying architecture as Claude Code. Anthropic notes that Cowork "can take on many of the same tasks that Claude Code can handle, but in a more approachable form for non-coding tasks."
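A rough sketch of that loop in Python appears below. The policy, actions, and single tool are toy stand-ins invented for illustration - this is not Anthropic's Agent SDK - but the shape (decide, act, observe, repeat until done or blocked on a question) matches the description above.

import os
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str                     # "tool", "ask_user", or "done"
    tool: str = ""
    args: dict = field(default_factory=dict)
    question: str = ""

TOOLS = {"list_files": lambda args: os.listdir(args["path"])}

def toy_policy(history):
    # Stand-in for the model: inspect the folder once, then declare done.
    if not history:
        return Action(kind="tool", tool="list_files", args={"path": "."})
    return Action(kind="done")

def agentic_loop(policy, max_steps=10):
    history = []
    for _ in range(max_steps):
        action = policy(history)
        if action.kind == "done":
            return history
        if action.kind == "ask_user":              # pause for clarification
            history.append(("user", input(action.question)))
            continue
        observation = TOOLS[action.tool](action.args)
        history.append(("tool", observation))      # result informs the next step
    return history

print(agentic_loop(toy_policy))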
The recursive loop where AI builds AI: Claude Code reportedly wrote much of Claude Cowork
Perhaps the most remarkable detail surrounding Cowork's launch is the speed at which the tool was reportedly built - highlighting a recursive feedback loop where AI tools are being used to build better AI tools.
During a livestream hosted by Dan Shipper, Felix Rieseberg, an Anthropic employee, confirmed that the team built Cowork in approximately a week and a half.
Alex Volkov, who covers AI developments, expressed surprise at the timeline: "Holy shit Anthropic built 'Cowork' in the last... week and a half?!"
This prompted immediate speculation about how much of Cowork was itself built by Claude Code. Simon Smith, EVP of Generative AI at Klick Health, put it bluntly on X: "Claude Code wrote all of Claude Cowork. Can we all agree that we're in at least somewhat of a recursive improvement loop here?"
The implication is profound: Anthropic's AI coding agent may have substantially contributed to building its own non-technical sibling product. If true, this is one of the most visible examples yet of AI systems being used to accelerate their own development and expansion - a strategy that could widen the gap between AI labs that successfully deploy their own agents internally and those that do not.
Connectors, browser automation, and skills extend Cowork's reach beyond the local file system
Cowork doesn't operate in isolation. The feature integrates with Anthropic's existing ecosystem of connectors - tools that link Claude to external information sources and services such as Asana, Notion, PayPal, and other supported partners. Users who have configured these connections in the standard Claude interface can leverage them within Cowork sessions.
Additionally, Cowork can pair with Claude in Chrome, Anthropic's browser extension, to execute tasks requiring web access. This combination allows the agent to navigate websites, click buttons, fill forms, and extract information from the internet - all while operating from the desktop application.
"Cowork includes a number of novel UX and safety features that we think make the product really special," Cherny explained, highlighting "a built-in VM [virtual machine] for isolation, out of the box support for browser automation, support for all your claude.ai data connectors, asking you for clarification when it's unsure."
Anthropic has also introduced an initial set of "skills" specifically designed for Cowork that enhance Claude's ability to create documents, presentations, and other files. These build on the Skills for Claude framework the company announced in October, which provides specialized instruction sets Claude can load for particular types of tasks.
Why Anthropic is warning users that its own AI agent could delete their files
The transition from a chatbot that suggests edits to an agent that makes edits introduces significant risk. An AI that can organize files can, theoretically, delete them.
In a notable display of transparency, Anthropic devoted considerable space in its announcement to warning users about Cowork's potential dangers - an unusual approach for a product launch.
The company explicitly acknowledges that Claude "can take potentially destructive actions (such as deleting local files) if it's instructed to." Because Claude might occasionally misinterpret instructions, Anthropic urges users to provide "very clear guidance" about sensitive operations.
More concerning is the risk of prompt injection attacks - a technique where malicious actors embed hidden instructions in content Claude might encounter online, potentially causing the agent to bypass safeguards or take harmful actions.
"We've built sophisticated defenses against prompt injections," Anthropic wrote, "but agent safety â that is, the task of securing Claude's real-world actions â is still an active area of development in the industry."
The company characterized these risks as inherent to the current state of AI agent technology rather than unique to Cowork. "These risks aren't new with Cowork, but it might be the first time you're using a more advanced tool that moves beyond a simple conversation," the announcement notes.
Anthropic's desktop agent strategy sets up a direct challenge to Microsoft Copilot
The launch of Cowork places Anthropic in direct competition with Microsoft, which has spent years attempting to integrate its Copilot AI into the fabric of the Windows operating system with mixed adoption results.
However, Anthropic's approach differs in its isolation. By confining the agent to specific folders and requiring explicit connectors, the company is attempting to strike a balance between the utility of an OS-level agent and the security of a sandboxed application.
What distinguishes Anthropic's approach is its bottom-up evolution. Rather than designing an AI assistant and retrofitting agent capabilities, Anthropic built a powerful coding agent first - Claude Code - and is now abstracting its capabilities for broader audiences. This technical lineage may give Cowork more robust agentic behavior from the start.
Claude Code has generated significant enthusiasm among developers since its initial launch as a command-line tool in late 2024. The company expanded access with a web interface in October 2025, followed by a Slack integration in December. Cowork is the next logical step: bringing the same agentic architecture to users who may never touch a terminal.
Who can access Cowork now, and what's coming next for Windows and other platforms
For now, Cowork remains exclusive to Claude Max subscribers using the macOS desktop application. Users on other subscription tiers - Free, Pro, Team, or Enterprise - can join a waitlist for future access.
Anthropic has signaled clear intentions to expand the feature's reach. The blog post explicitly mentions plans to add cross-device sync and bring Cowork to Windows as the company learns from the research preview.
Cherny set expectations appropriately, describing the product as "early and raw, similar to what Claude Code felt like when it first launched."
To access Cowork, Max subscribers can download or update the Claude macOS app and click on "Cowork" in the sidebar.
The real question facing enterprise AI adoption
For technical decision-makers, the implications of Cowork extend beyond any single product launch. The bottleneck for AI adoption is shifting: model intelligence is no longer the limiting factor, but workflow integration and user trust are.
Anthropic's goal, as the company puts it, is to make working with Claude feel less like operating a tool and more like delegating to a colleague. Whether mainstream users are ready to hand over folder access to an AI that might misinterpret their instructions remains an open question.
But the speed of Cowork's development - a major feature built in ten days, possibly by the company's own AI - previews a future where the capabilities of these systems compound faster than organizations can evaluate them.
The chatbot has learned to use a file manager. What it learns to use next is anyone's guess.
-
Nous Research's NousCoder-14B is an open-source coding model landing right in the Claude Code moment VentureBeat AI Jan 07, 2026 08:00 PM 8 min read
Nous Research, the open-source artificial intelligence startup backed by crypto venture firm Paradigm, released a new competitive programming model on Monday that it says matches or exceeds several larger proprietary systems - trained in just four days using 48 of Nvidia's latest B200 graphics processors.
The model, called NousCoder-14B, is another entry in a crowded field of AI coding assistants, but arrives at a particularly charged moment: Claude Code, the agentic programming tool from rival Anthropic, has dominated social media discussion since New Year's Day, with developers posting breathless testimonials about its capabilities. The simultaneous developments underscore how quickly AI-assisted software development is evolving - and how fiercely companies large and small are competing to capture what many believe will become a foundational technology for how software gets written.
NousCoder-14B achieves a 67.87 percent accuracy rate on LiveCodeBench v6, a standardized evaluation that tests models on competitive programming problems published between August 2024 and May 2025. That figure represents a 7.08 percentage point improvement over the base model it was trained from, Alibaba's Qwen3-14B, according to Nous Research's technical report published alongside the release.
"I gave Claude Code a description of the problem, it generated what we built last year in an hour," wrote Jaana Dogan, a principal engineer at Google responsible for the Gemini API, in a viral post on X last week that captured the prevailing mood around AI coding tools. Dogan was describing a distributed agent orchestration system her team had spent a year developing â a system Claude Code approximated from a three-paragraph prompt.
The juxtaposition is instructive: while Anthropic's Claude Code has captured imaginations with demonstrations of end-to-end software development, Nous Research is betting that open-source alternatives trained on verifiable problems can close the gap - and that transparency in how these models are built matters as much as raw capability.
How Nous Research built an AI coding model that anyone can replicate
What distinguishes the NousCoder-14B release from many competitor announcements is its radical openness. Nous Research published not just the model weights but the complete reinforcement learning environment, benchmark suite, and training harness - built on the company's Atropos framework - enabling any researcher with sufficient compute to reproduce or extend the work.
"Open-sourcing the Atropos stack provides the necessary infrastructure for reproducible olympiad-level reasoning research," noted one observer on X, summarizing the significance for the academic and open-source communities.
The model was trained by Joe Li, a researcher in residence at Nous Research and a former competitive programmer himself. Li's technical report reveals an unexpectedly personal dimension: he compared the model's improvement trajectory to his own journey on Codeforces, the competitive programming platform where participants earn ratings based on contest performance.
Based on rough estimates mapping LiveCodeBench scores to Codeforces ratings, Li calculated that NousCoder-14B's improvement - from approximately the 1600-1750 rating range to 2100-2200 - mirrors a leap that took him nearly two years of sustained practice between ages 14 and 16. The model accomplished the equivalent in four days.
"Watching that final training run unfold was quite a surreal experience," Li wrote in the technical report.
But Li was quick to note an important caveat that speaks to broader questions about AI efficiency: he solved roughly 1,000 problems during those two years, while the model required 24,000. Humans, at least for now, remain dramatically more sample-efficient learners.
Inside the reinforcement learning system that trains on 24,000 competitive programming problems
NousCoder-14B's training process offers a window into the increasingly sophisticated techniques researchers use to improve AI reasoning capabilities through reinforcement learning.
The approach relies on what researchers call "verifiable rewards" - a system where the model generates code solutions, those solutions are executed against test cases, and the model receives a simple binary signal: correct or incorrect. This feedback loop, while conceptually straightforward, requires significant infrastructure to execute at scale.
Nous Research used Modal, a cloud computing platform, to run sandboxed code execution in parallel. Each of the 24,000 training problems contains hundreds of test cases on average, and the system must verify that generated code produces correct outputs within time and memory constraints â 15 seconds and 4 gigabytes, respectively.
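In code, such a verifier reduces to running each candidate against its tests under a timeout. A minimal local sketch, assuming solutions are standalone Python programs reading stdin (the production system runs candidates in Modal sandboxes and also enforces the 4 GB memory cap, which this sketch omits):

```python
import os
import subprocess
import tempfile

def verify_solution(code: str, test_cases: list[tuple[str, str]],
                    time_limit_s: float = 15.0) -> float:
    """Execute a generated solution against test cases; return a binary reward."""
    fd, path = tempfile.mkstemp(suffix=".py")
    with os.fdopen(fd, "w") as f:
        f.write(code)
    try:
        for stdin_text, expected in test_cases:
            try:
                result = subprocess.run(
                    ["python3", path], input=stdin_text,
                    capture_output=True, text=True, timeout=time_limit_s,
                )
            except subprocess.TimeoutExpired:
                return 0.0  # time-limit violations count as incorrect
            if result.returncode != 0 or result.stdout.strip() != expected.strip():
                return 0.0  # crash or wrong output: no partial credit
        return 1.0  # binary signal: every test case passed
    finally:
        os.unlink(path)
```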
The training employed a technique called DAPO (Dynamic Sampling Policy Optimization), which the researchers found performed slightly better than alternatives in their experiments. A key innovation involves "dynamic sampling": discarding training examples where the model either solves all attempts or fails all attempts, since these provide no useful gradient signal for learning.
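The dynamic-sampling rule is simple to state in code. A schematic sketch (not the Atropos implementation), assuming each problem is sampled several times and scored with a binary reward:

```python
def filter_for_gradient_signal(reward_groups: list[list[float]]) -> list[list[float]]:
    """Keep only prompt groups with mixed outcomes.

    Under a group-relative baseline, a group whose samples all pass (or
    all fail) yields zero advantage for every sample, so it contributes
    nothing to the policy gradient and is discarded before the update.
    """
    kept = []
    for rewards in reward_groups:
        mean = sum(rewards) / len(rewards)
        if 0.0 < mean < 1.0:  # at least one pass and one fail
            kept.append(rewards)
    return kept

# Example: the middle group is the only one that carries learning signal.
print(filter_for_gradient_signal([[1, 1, 1], [1, 0, 1], [0, 0, 0]]))
# -> [[1, 0, 1]]
```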
The researchers also adopted "iterative context extension," first training the model with a 32,000-token context window before expanding to 40,000 tokens. During evaluation, extending the context further to approximately 80,000 tokens produced the best results, with accuracy reaching 67.87 percent.
Perhaps most significantly, the training pipeline overlaps inference and verification: as soon as the model generates a solution, it begins work on the next problem while the previous solution is being checked. This pipelining, combined with asynchronous training where multiple model instances work in parallel, maximizes hardware utilization on expensive GPU clusters.
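A toy asyncio sketch of that overlap; the generate and verify coroutines here are hypothetical placeholders, and the real harness additionally runs many model instances asynchronously:

```python
import asyncio

async def generate(problem: str) -> str:
    await asyncio.sleep(0.1)          # stand-in for model inference
    return f"solution({problem})"

async def verify(solution: str) -> float:
    await asyncio.sleep(0.1)          # stand-in for sandboxed test execution
    return 1.0

async def pipelined_rollout(problems: list[str]) -> list[float]:
    """Verify the previous solution while the next one is being generated."""
    rewards: list[float] = []
    pending = None
    for problem in problems:
        gen_task = asyncio.create_task(generate(problem))
        if pending is not None:
            rewards.append(await pending)   # runs concurrently with generation
        pending = asyncio.create_task(verify(await gen_task))
    if pending is not None:
        rewards.append(await pending)
    return rewards

print(asyncio.run(pipelined_rollout(["p1", "p2", "p3"])))  # [1.0, 1.0, 1.0]
```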
The looming data shortage that could slow AI coding model progress
Buried in Li's technical report is a finding with significant implications for the future of AI development: the training dataset for NousCoder-14B encompasses "a significant portion of all readily available, verifiable competitive programming problems in a standardized dataset format."
In other words, for this particular domain, the researchers are approaching the limits of high-quality training data.
"The total number of competitive programming problems on the Internet is roughly the same order of magnitude," Li wrote, referring to the 24,000 problems used for training. "This suggests that within the competitive programming domain, we have approached the limits of high-quality data."
This observation echoes growing concern across the AI industry about data constraints. While compute continues to scale according to well-understood economic and engineering principles, training data is "increasingly finite," as Li put it.
"It appears that some of the most important research that needs to be done in the future will be in the areas of synthetic data generation and data efficient algorithms and architectures," he concluded.
The challenge is particularly acute for competitive programming because the domain requires problems with known correct solutions that can be verified automatically. Unlike natural language tasks where human evaluation or proxy metrics suffice, code either works or it doesn't, making synthetic data generation considerably more difficult.
Li identified one potential avenue: training models not just to solve problems but to generate solvable problems, enabling a form of self-play similar to techniques that proved successful in game-playing AI systems. "Once synthetic problem generation is solved, self-play becomes a very interesting direction," he wrote.
A $65 million bet that open-source AI can compete with Big Tech
Nous Research has carved out a distinctive position in the AI landscape: a company committed to open-source releases that compete with, and sometimes exceed, proprietary alternatives.
The company raised $50 million in April 2025 in a round led by Paradigm, the cryptocurrency-focused venture firm founded by Coinbase co-founder Fred Ehrsam. Total funding reached $65 million, according to some reports. The investment reflected growing interest in decentralized approaches to AI training, an area where Nous Research has developed its Psyche platform.
Previous releases include Hermes 4, a family of models that we reported "outperform ChatGPT without content restrictions," and DeepHermes-3, which the company described as the first "toggle-on reasoning model," allowing users to activate extended thinking capabilities on demand.
The company has cultivated a distinctive aesthetic and community, prompting some skepticism about whether style might overshadow substance. "Ofc i'm gonna believe an anime pfp company. stop benchmarkmaxxing ffs," wrote one critic on X, referring to Nous Research's anime-style branding and the industry practice of optimizing for benchmark performance.
Others raised technical questions. "Based on the benchmark, Nemotron is better," noted one commenter, referring to Nvidia's family of language models. Another asked whether NousCoder-14B is "agentic focused or just 'one shot' coding," a distinction that matters for practical software development, where iterating on feedback typically produces better results than single attempts.
What researchers say must happen next for AI coding tools to keep improving
The release includes several directions for future work that hint at where AI coding research may be heading.
Multi-turn reinforcement learning tops the list. Currently, the model receives only a final binary reward, pass or fail, after generating a solution. But competitive programming problems typically include public test cases that provide intermediate feedback: compilation errors, incorrect outputs, time limit violations. Training models to incorporate this feedback across multiple attempts could significantly improve performance.
Controlling response length also remains a challenge. The researchers found that incorrect solutions tended to be longer than correct ones, and response lengths quickly saturated available context windows during training, a pattern that various algorithmic modifications failed to resolve.
Perhaps most ambitiously, Li proposed "problem generation and self-play": training models to both solve and create programming problems. This would address the data scarcity problem directly by enabling models to generate their own training curricula.
"Humans are great at generating interesting and useful problems for other competitive programmers, but it appears that there still exists a significant gap in LLM capabilities in creative problem generation," Li wrote.
The model is available now on Hugging Face under an Apache 2.0 license. For researchers and developers who want to build on the work, Nous Research has published the complete Atropos training stack alongside it.
What took Li two years of adolescent dedication to achieve, climbing from a 1600-level novice to a 2100-rated competitor on Codeforces, an AI replicated in 96 hours. He needed 1,000 problems. The model needed 24,000. But soon enough, these systems may learn to write their own problems, teach themselves, and leave human benchmarks behind entirely.
The question is no longer whether machines can learn to code. It's whether they'll soon be better teachers than we ever were.
-
The creator of Claude Code just revealed his workflow, and developers are losing their minds VentureBeat AI Jan 05, 2026 07:45 AM 5 min read
When the creator of the world's most advanced coding agent speaks, Silicon Valley doesn't just listen; it takes notes.
For the past week, the engineering community has been dissecting a thread on X from Boris Cherny, the creator and head of Claude Code at Anthropic. What began as a casual sharing of his personal terminal setup has spiraled into a viral manifesto on the future of software development, with industry insiders calling it a watershed moment for the startup.
"If you're not reading the Claude Code best practices straight from its creator, you're behind as a programmer," wrote Jeff Tang, a prominent voice in the developer community. Kyle McNease, another industry observer, went further, declaring that with Cherny's "game-changing updates," Anthropic is "on fire," potentially facing "their ChatGPT moment."
The excitement stems from a paradox: Cherny's workflow is surprisingly simple, yet it allows a single human to operate with the output capacity of a small engineering department. As one user noted on X after implementing Cherny's setup, the experience "feels more like Starcraft" than traditional coding, a shift from typing syntax to commanding autonomous units.
Here is an analysis of the workflow that is reshaping how software gets built, straight from the architect himself.
How running five AI agents at once turns coding into a real-time strategy game
The most striking revelation from Cherny's disclosure is that he does not code in a linear fashion. In the traditional "inner loop" of development, a programmer writes a function, tests it, and moves to the next. Cherny, however, acts as a fleet commander.
"I run 5 Claudes in parallel in my terminal," Cherny wrote. "I number my tabs 1-5, and use system notifications to know when a Claude needs input."
By utilizing iTerm2 system notifications, Cherny effectively manages five simultaneous work streams. While one agent runs a test suite, another refactors a legacy module, and a third drafts documentation. He also runs "5-10 Claudes on claude.ai" in his browser, using a "teleport" command to hand off sessions between the web and his local machine.
This validates the "do more with less" strategy articulated by Anthropic President Daniela Amodei earlier this week. While competitors like OpenAI pursue trillion-dollar infrastructure build-outs, Anthropic is proving that superior orchestration of existing models can yield outsized productivity gains.
The counterintuitive case for choosing the slowest, smartest model
In a surprising move for an industry obsessed with latency, Cherny revealed that he exclusively uses Anthropic's heaviest, slowest model: Opus 4.5.
"I use Opus 4.5 with thinking for everything," Cherny explained. "It's the best coding model I've ever used, and even though it's bigger & slower than Sonnet, since you have to steer it less and it's better at tool use, it is almost always faster than using a smaller model in the end."
For enterprise technology leaders, this is a critical insight. The bottleneck in modern AI development isn't the generation speed of the token; it is the human time spent correcting the AI's mistakes. Cherny's workflow suggests that paying the "compute tax" for a smarter model upfront eliminates the "correction tax" later.
One shared file turns every AI mistake into a permanent lesson
Cherny also detailed how his team solves the problem of AI amnesia. Standard large language models do not "remember" a company's specific coding style or architectural decisions from one session to the next.
To address this, Cherny's team maintains a single file named CLAUDE.md in their git repository. "Anytime we see Claude do something incorrectly we add it to the CLAUDE.md, so Claude knows not to do it next time," he wrote.
This practice transforms the codebase into a self-correcting organism. When a human developer reviews a pull request and spots an error, they don't just fix the code; they tag the AI to update its own instructions. "Every mistake becomes a rule," noted Aakash Gupta, a product leader analyzing the thread. The longer the team works together, the smarter the agent becomes.
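The file itself is nothing exotic: a plain list of project rules loaded into the agent's context. A hypothetical excerpt to illustrate the pattern (these entries are invented, not Anthropic's actual file):

```markdown
# CLAUDE.md (project conventions; one rule per mistake we have seen)

- Never edit generated files under src/gen/; change the schema and re-run codegen.
- Use the repo's logging wrapper, not bare print statements, in production paths.
- Database migrations must be reversible; always include a down() step.
- Run the full integration suite before opening a PR; unit tests alone have
  missed cross-service regressions here before.
```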
Slash commands and subagents automate the most tedious parts of development
The "vanilla" workflow one observer praised is powered by rigorous automation of repetitive tasks. Cherny uses slash commands â custom shortcuts checked into the project's repository â to handle complex operations with a single keystroke.
He highlighted a command called /commit-push-pr, which he invokes dozens of times daily. Instead of manually typing git commands, writing a commit message, and opening a pull request, the agent handles the bureaucracy of version control autonomously.
Cherny also deploys subagents, specialized AI personas, to handle specific phases of the development lifecycle. He uses a code-simplifier to clean up architecture after the main work is done and a verify-app agent to run end-to-end tests before anything ships.
Why verification loops are the real unlock for AI-generated code
If there is a single reason Claude Code has reportedly hit $1 billion in annual recurring revenue so quickly, it is likely the verification loop. The AI is not just a text generator; it is a tester.
"Claude tests every single change I land to claude.ai/code using the Claude Chrome extension," Cherny wrote. "It opens a browser, tests the UI, and iterates until the code works and the UX feels good."
He argues that giving the AI a way to verify its own work, whether through browser automation, running bash commands, or executing test suites, improves the quality of the final result by "2-3x." The agent doesn't just write code; it proves the code works.
What Cherny's workflow signals about the future of software engineering
The reaction to Cherny's thread suggests a pivotal shift in how developers think about their craft. For years, "AI coding" meant an autocomplete function in a text editor, a faster way to type. Cherny has demonstrated that it can now function as an operating system for labor itself.
"Read this if you're already an engineer... and want more power," Jeff Tang summarized on X.
The tools to multiply human output by a factor of five are already here. They require only a willingness to stop thinking of AI as an assistant and start treating it as a workforce. The programmers who make that mental leap first won't just be more productive. They'll be playing an entirely different game, and everyone else will still be typing.
Research & Blogs (99 articles)
-
How Meta Used AI to Map Tribal Knowledge in Large-Scale Data Pipelines Meta AI / Engineering Apr 06, 2026 04:00 PM 6 min read AI coding assistants are powerful but only as good as their understanding of your codebase. When we pointed AI agents at one of Meta's large-scale data processing pipelines, spanning four re…
AI coding assistants are powerful but only as good as their understanding of your codebase. When we pointed AI agents at one of Meta's large-scale data processing pipelines, spanning four repositories, three languages, and over 4,100 files, we quickly found that they weren't making useful edits quickly enough.
We fixed this by building a pre-compute engine: a swarm of 50+ specialized AI agents that systematically read every file and produced 59 concise context files encoding tribal knowledge that previously lived only in engineers' heads. The result: AI agents now have structured navigation guides for 100% of our code modules (up from 5%, covering all 4,100+ files across three repositories). We also documented 50+ "non-obvious patterns," or underlying design choices and relationships not immediately apparent from the code, and preliminary tests show 40% fewer AI agent tool calls per task. The system works with most leading models because the knowledge layer is model-agnostic.
The system also maintains itself. Every few weeks, automated jobs validate file paths, detect coverage gaps, re-run quality critics, and auto-fix stale references. The AI isn't a consumer of this infrastructure; it's the engine that runs it.
The Problem: AI Tools Without a Map
Our pipeline is config-as-code: Python configurations, C++ services, and Hack automation scripts working together across multiple repositories. A single data field onboarding touches configuration registries, routing logic, DAG composition, validation rules, C++ code generation, and automation scripts: six subsystems that must stay in sync.
We had already built AI-powered systems for operational tasks: scanning dashboards, pattern-matching against historical incidents, and suggesting mitigations. But when we tried to extend them to development tasks, they fell apart. The AI had no map. It didn't know that two configuration modes use different field names for the same operation (swap them and you get silent wrong output), or that dozens of "deprecated" enum values must never be removed because serialization compatibility depends on them.
Without this context, agents would guess, explore, guess again, and often produce code that compiled but was subtly wrong.
The Approach: Teach the Agents Before They Explore
We used a large-context-window model and task orchestration to structure the work in phases:
- Two explorer agents mapped the codebase,
- 11 module analysts read every file and answered five key questions,
- Two writers generated context files,
- 10+ critic passes ran three rounds of independent quality review,
- Four fixers applied corrections,
- Eight upgraders refined the routing layer,
- Three prompt testers validated 55+ queries across five personas,
- Four gap-fillers covered remaining directories, and
- Three final critics ran integration tests, for a total of 50+ specialized tasks orchestrated in a single session.
The five questions each analyst answered per module:
- What does this module configure?
- What are the common modification patterns?
- What are the non-obvious patterns that cause build failures?
- What are the cross-module dependencies?
- What tribal knowledge is buried in code comments?
Question five was where the deepest learnings emerged. We found 50+ non-obvious patterns, like hidden intermediate naming conventions where one pipeline stage outputs a temporary field name that a downstream stage renames (reference the wrong one and code generation silently fails), or append-only identifier rules where removing a "deprecated" value breaks backward compatibility. None of this had been written down before.
What We Built: A Compass, Not An Encyclopedia
Each context file follows what we call the "compass, not encyclopedia" principle: 25-35 lines (~1,000 tokens) with four sections:
- Quick Commands (copy-paste operations).
- Key Files (the 3-5 files you actually need).
- Non-Obvious Patterns.
- See Also (cross-references).
No fluff; every line earns its place. All 59 files together consume less than 0.1% of a modern model's context window.
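To make the format concrete, here is a hypothetical context file in this shape; the module, commands, and file names are invented, and only the four-section layout comes from the description above:

```markdown
# Module: field_registry (compass, not encyclopedia)

## Quick Commands
- Validate configs: run the registry validation target for this module

## Key Files
- registry.py: source of truth for data field definitions
- codegen_map.py: maps registry entries to generated C++ structs

## Non-Obvious Patterns
- Field names here are intermediate; a downstream stage renames them, and
  referencing the raw name makes code generation fail silently.
- Enum values are append-only: removing "deprecated" entries breaks
  serialization compatibility.

## See Also
- routing/ (consumes registry output), validation/ (cross-checks fields)
```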
On top of this, we built an orchestration layer that auto-routes engineers to the right tool based on natural language. Type "Is the pipeline healthy?" and it scans dashboards and matches against 85+ historical incident patterns. Type "Add a new data field" and it generates the configuration with multi-phase validation. Engineers describe their problem; the system figures out the rest.
The system self-refreshes every few weeks, validating file paths, identifying coverage gaps, re-running critic agents, and auto-fixing issues. Context that decays is worse than no context at all.
Beyond individual context files, we generated a cross-repo dependency index and data flow maps showing how changes propagate across repositories. This turns "What depends on X?" from a multi-file exploration (~6,000 tokens) into a single graph lookup (~200 tokens), which matters in config-as-code, where one field change ripples across six subsystems.
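A minimal sketch of what such an index buys, assuming a precomputed reverse-dependency map (module names are invented): the transitive "what depends on X?" answer becomes a graph walk over a small dict rather than a multi-file exploration.

```python
# Precomputed reverse-dependency index (illustrative entries only).
REVERSE_DEPS: dict[str, list[str]] = {
    "field_registry": ["routing", "dag_composer", "cpp_codegen"],
    "routing": ["validation"],
}

def what_depends_on(module: str) -> list[str]:
    """Transitive closure over the index: one lookup per edge, no file reads."""
    seen: set[str] = set()
    stack = [module]
    while stack:
        for dep in REVERSE_DEPS.get(stack.pop(), []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return sorted(seen)

print(what_depends_on("field_registry"))
# ['cpp_codegen', 'dag_composer', 'routing', 'validation']
```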
Results
| Metric | Before | After |
|---|---|---|
| AI context coverage | ~5% (5 files) | 100% (59 files) |
| Codebase files with AI navigation | ~50 | 4,100+ |
| Tribal knowledge documented | 0 | 50+ non-obvious patterns |
| Tested prompts (core pass rate) | 0 | 55+ (100%) |
In preliminary tests on six tasks against our pipeline, agents with pre-computed context used roughly 40% fewer tool calls and tokens per task. Complex workflow guidance that previously required ~two days of research and consulting with engineers now completes in ~30 minutes. Quality was non-negotiable: three rounds of independent critic agents improved scores from 3.65 to 4.20 out of 5.0, and all referenced file paths were verified with zero hallucinations.
Challenging the Conventional Wisdom on AI Context Files
Recent academic research found that AI-generated context files actually decreased agent success rates on well-known open-source Python repositories. This finding deserves serious consideration, but it has a limitation: it was evaluated on codebases like Django and matplotlib that models already "know" from pretraining. In that scenario, context files are redundant noise.
Our codebase is the opposite: proprietary config-as-code with tribal knowledge that exists nowhere in any modelâs training data. Three design decisions help us avoid the pitfalls the research identified: files are concise (~1,000 tokens, not encyclopedic summaries), opt-in (loaded only when relevant, not always-on), and quality-gated (multi-round critic review plus automated self-upgrade).
The strongest argument: without context, agents burn 15-25 tool calls exploring, miss naming patterns, and produce subtly incorrect code. The cost of not providing context is measurably higher.
How to Apply This to Your Codebase
This approach isn't specific to our pipeline. Any team with a large, proprietary codebase can benefit:
- Identify your tribal knowledge gaps. Where do AI agents fail most? The answer is usually domain-specific conventions and cross-module dependencies that aren't documented anywhere.
- Use the "five questions" framework. Have agents (or engineers) answer: what does it do, how do you modify it, what breaks, what depends on it, and what's undocumented?
- Follow "compass, not encyclopedia." Keep context files to 25-35 lines. Actionable navigation beats exhaustive documentation.
- Build quality gates. Use independent critic agents to score and improve generated context. Don't trust unreviewed AI output.
- Automate freshness. Context that goes stale causes more harm than no context. Build periodic validation and self-repair.
What's Next
We are expanding context coverage to additional pipelines across Meta's data infrastructure and exploring tighter integration between context files and code generation workflows. We're also investigating whether the automated refresh mechanism can detect not just stale context but emerging patterns and new tribal knowledge forming in recent code reviews and commits.
This approach turned undocumented tribal knowledge into structured, AI-readable context, one that compounds with every task that follows.
- Announcing the OpenAI Safety Fellowship OpenAI Blog Apr 06, 2026 10:00 AM
- Industrial policy for the Intelligence Age OpenAI Blog Apr 06, 2026 02:30 AM
- Apr 6, 2026 Announcements Anthropic expands partnership with Google and Broadcom for multiple gigawatts of next-generati Anthropic News Apr 06, 2026 12:00 AM Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
- OpenClaw 3.31-4.2: Claws Out: Plugin Lockdown, Task Brain, and a Security Siege OpenClaws.io Blog Apr 03, 2026 12:00 AM 3.28 gave the lobster a new shell. Now it learns to use its claws. Three releases in three days (3.31, 4.1, 4.2) deliver the most aggressive security lockdown in OpenClaw history. Plugin installs ge
-
KernelEvolve: How Meta's Ranking Engineer Agent Optimizes AI Infrastructure Meta AI / Engineering Apr 02, 2026 07:59 PM 16 min read This is the second post in the Ranking Engineer Agent blog series exploring the autonomous AI capabilities accelerating Meta's Ads Ranking innovation. The previous post introduced Ranking Eng…
This is the second post in the Ranking Engineer Agent blog series exploring the autonomous AI capabilities accelerating Meta's Ads Ranking innovation. The previous post introduced Ranking Engineer Agent's ML exploration capability, which autonomously designs, executes, and analyzes ranking model experiments. This post covers how to optimize the low-level infrastructure that makes those models run efficiently at scale. We introduce KernelEvolve, an agentic kernel authoring system used by Ranking Engineer Agent and generally applicable to a range of AI models beyond Ads Ranking.
Summary
- Meta operates a large fleet of heterogeneous hardware: NVIDIA GPUs, AMD GPUs, Meta's custom MTIA silicon chips, and CPUs. Using this hardware effectively and efficiently requires developing software that translates high-level model operations into efficient, chip-specific instructions called optimized kernels. Authoring and optimizing kernels must be done for each new chip generation and ML model architecture. Beyond standard kernel operators like general matrix multiplications (GEMMs) and convolutions covered by vendor libraries, production workloads require many custom operators across ranking models. With the number of models multiplied by the number of hardware types and generations, hand-tuning by kernel experts doesn't scale.
- To address the volume of performance optimization work required by the growing product of models × hardware types and generations, we built KernelEvolve, a performance-optimization agent used by Meta's Ranking Engineer Agent. It enables:
- Faster development: Compresses weeks of expert kernel engineering (profiling, optimizing, and cross-hardware debugging) into hours of automated search and evaluation, freeing engineers for other work.
- Better performance: Over 60% inference throughput improvement for the Andromeda Ads model on NVIDIA GPUs and over 25% training throughput improvement for an ads model on Meta's custom MTIA silicon chips.
- Broad applicability: Optimizes across public and proprietary hardware including NVIDIA GPUs, AMD GPUs, MTIA chips, and CPUs, generating kernels in high-level DSLs like Triton, CuTe DSL, and FlyDSL, as well as low-level languages including CUDA, HIP, and MTIA C++.
- KernelEvolve treats kernel optimization as a search problem: a purpose-built job harness evaluates each candidate kernel, feeds diagnostics back to the LLM, and drives a continuous search over hundreds of alternatives, exceeding the performance of human-expert-generated kernels.
- More details are available in the paper, "KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta," which will appear at the 53rd International Symposium on Computer Architecture (ISCA) 2026.
Every day, Meta serves billions of AI-powered experiences, from personalized recommendation to generative AI assistants, on a global infrastructure including diverse hardware from NVIDIA, AMD, and Meta's custom MTIA silicon chips. Behind every training or inference request lies a layer of highly optimized low-level hardware kernels: small programs that translate high-level model operations into instructions a specific chip can execute efficiently. As AI models grow more complex and the hardware landscape diversifies, the number of kernels scales across hardware platforms, model architectures, and operator types. The result is thousands of configurations that can no longer realistically be tuned by human experts, creating a critical bottleneck that delays hardware enablement and performance tuning and slows the model iteration cycles that drive critical advances in ML technology and its applications.
Today, we are sharing KernelEvolve, an agentic AI system that improved ads model inference throughput by 60% in hours of experimentation, a task that would take human experts weeks. KernelEvolve autonomously generates and optimizes production-grade kernels for heterogeneous hardware used in training and inference, including NVIDIA GPUs, AMD GPUs, Meta's custom MTIA silicon, and CPUs. Unlike typical large language model (LLM)-based agents that perform one-shot code generation, KernelEvolve treats kernel optimization as a search problem. It explores hundreds of alternative kernel implementations to identify a solution that often matches or exceeds human expert performance, and does so in hours instead of weeks. In Meta's production environment, KernelEvolve is optimizing code that serves trillions of daily inference requests.
KernelEvolve represents a fundamental shift in how we think about the relationship between AI software and hardware. Where kernel development was once a manual, expert-driven process that struggled to keep pace with hardware and model evolution, KernelEvolve makes it continuous and automated, adapting as each changes. As Meta continues to diversify its AI hardware portfolio, the ability to rapidly generate optimized kernels for new chips substantially reduces the engineering effort required to integrate heterogeneous hardware for training and inference.
The Challenge: The Bottleneck of Explosive Kernel Growth
We're seeing explosive kernel growth because the total number of kernels scales with the product of three factors: hardware types and generations × model architectures × number of operators. This product results in thousands of unique kernel configurations that must be written, tested, and maintained. Hand-tuning each kernel doesn't scale, and kernel experts alone can't keep up with the pace.
Hardware Heterogeneity
Meta's accelerator fleet now spans NVIDIA GPUs, AMD GPUs, and Meta's custom MTIA silicon, each with fundamentally different memory architectures and hierarchies, instruction sets, and execution models. A kernel that runs optimally on one platform may perform poorly or fail entirely on another. And the complexity doesn't stop at vendor boundaries. Even within a single hardware family, successive generations introduce architectural changes that require different optimization strategies. Meta's MTIA roadmap spans four chip generations in two years (MTIA 300 through 500), each introducing new compute capabilities, memory bandwidth characteristics, and numeric data types optimized for evolving workloads. A kernel optimized for one generation will underperform when run on the next generation of the same hardware architecture.
Model Architecture Variation
Meta's recommendation models have evolved through three major phases: from early embedding-based deep learning recommendation models, to sequence learning models that process engagement histories with attention mechanisms, to Meta's Generative Ads Recommendation Model (GEM), and most recently Meta's foundation inference model that brings LLM scale to ads (Meta Adaptive Ranking Model). Each generation introduces operator types the previous generation never needed. Beyond these generational shifts, Meta's production stack simultaneously serves fundamentally different model families, each with its own unique operators, and a single ads request may traverse multiple families in one serving call. With a vast and growing number of distinct models in production, every new architecture extends the matrix of operators that must be optimized across hardware.
Kernel Diversity Beyond Standard Libraries
Vendor libraries like cuBLAS and cuDNN cover a set of common operations (GEMMs, convolutions, standard activations), but even these standard operators resist one-size-fits-all solutions. A single operator like matrix multiplication behaves differently across contexts: the optimal kernel for a training batch differs from an inference serving request, and tensor shapes vary widely across ranking stages and ranking models, creating a combinatorial space of configurations that neither human experts nor today's compiler-based autotuning and fusion can fully cover at scale. Beyond standard operators, production workloads are dominated by a long tail of operators that fall outside library coverage. These include data preprocessing transforms like feature hashing, bucketing, and sequence truncation that prepare raw input for model inference, as well as custom model operators like fused feature interaction layers and specialized attention variants that are unique to Meta's architectures.
None of these custom operators appear in vendor libraries, and many are too workload-specific to warrant a library implementation. Without native accelerator implementations, these operators either fall back to CPU, forcing disaggregated serving architectures with significant latency overhead, or run via unoptimized code paths that underutilize hardware.
The problem compounds with hardware diversity. A hand-tuned NVIDIA kernel cannot simply be recompiled for AMD GPUs or MTIA. Each new model architecture extends the tail further, and each new chip multiplies the work required to cover it.
How KernelEvolve Addresses These Challenges
Each challenge maps to a specific architectural decision:
| Challenge | How KernelEvolve Addresses It |
|---|---|
| Hardware heterogeneity | A retrieval-augmented knowledge base injects platform-specific documentation, including architecture manuals, instruction sets, and optimization patterns, into the generation context. The LLM reasons over this documentation at inference time; no prior training on the target hardware is required. A single universal prompting interface eliminates per-platform prompt templates. |
| Model architecture variation | Tree search explores implementation alternatives for any operator, including novel ones. Successful optimizations are distilled into reusable patterns that transfer across model families; an optimization discovered for one architecture accelerates similar operators in future ones. |
| Kernel diversity / long tail | Automated evaluation validates hundreds of candidates in parallel. Search-based optimization replaces the need for hand-tuning, making operators feasible that wouldn't otherwise justify weeks of manual tuning. |
KernelEvolve: Searching for Optimal Kernels
KernelEvolve approaches this challenge differently from standard AI coding assistants. Rather than prompting an LLM to generate a single kernel and testing it, the system formalizes kernel optimization as a structured search problem across the space of possible implementations. Under the hood, a purpose-built long-running job harness drives each iteration, compiling candidates, evaluating correctness and performance, profiling hardware utilization, and generating analysis reports, all while handling the multi-minute build cycles and infrastructure failures that make naive approaches impractical.
Figure 1: How a kernel optimization request flows through KernelEvolve's six components.
LLM Synthesizer
An LLM generates candidate kernels across multiple programming languages and hardware targets, from high-level DSLs like Triton, TLX, CuTe DSL, and FlyDSL to low-level backends including CUDA, HIP, and MTIA C++.
Rather than using static prompts, the synthesizer constructs dynamic, context-aware prompts that are continuously enriched with runtime diagnostics, hardware constraints, and historical signals from prior candidate evaluations. This replaces the traditional approach of maintaining separate prompt templates for debugging, performance tuning, and correctness verification with a single adaptive interface that unifies these workflows and drives a continuous, feedback-driven optimization loop.
Tree Search Engine
The system explores the optimization space using graph-based search algorithms, including Monte Carlo tree search and evolutionary strategies. Each kernel candidate becomes a node in a search tree. The engine selects promising candidates, applies transformations, evaluates results, and decides whether to explore further or backtrack, balancing exploitation of known-good strategies against exploration of novel approaches.
Crucially, nodes do not evolve in isolation. Each node carries a configurable memory operator that determines how it draws context from the search tree when generating the next round of candidates. A node may inherit its parent's optimization trajectory to refine a promising direction, compare against siblings to learn what differentiates high-performing variants, combine insights from both parent and sibling histories, or start with a clean slate to escape local optima. This selective memory mechanism allows the tree search to move beyond simple independent sampling: sibling nodes collaborate by surfacing complementary strategies, parent-child chains preserve and deepen successful optimization paths, and memory-free restarts inject diversity when the search stagnates.
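A schematic sketch of the memory-operator idea, under assumed names and greatly simplified relative to Meta's system: each operator chooses which prior candidates the LLM sees when a node is expanded.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    kernel_src: str
    score: float = 0.0
    parent: Optional["Node"] = None
    children: list["Node"] = field(default_factory=list)

def build_context(node: Node, memory_op: str) -> list[str]:
    """Select prior candidates to show the LLM when expanding this node."""
    if memory_op == "parent_chain":   # deepen a promising trajectory
        chain, cur = [], node
        while cur is not None:
            chain.append(cur.kernel_src)
            cur = cur.parent
        return chain
    if memory_op == "siblings":       # contrast high- and low-performing variants
        sibs = node.parent.children if node.parent else []
        return [s.kernel_src for s in sorted(sibs, key=lambda s: -s.score)[:3]]
    if memory_op == "combined":       # parent trajectory plus sibling contrast
        return build_context(node, "parent_chain") + build_context(node, "siblings")
    return []                         # "fresh": clean slate to escape local optima
```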
Figure 2: How the tree search engine navigates the optimization space to find high-performing kernels.
Retrieval-Augmented Knowledge Base
To generate optimized code for hardware the underlying LLM was never trained on, KernelEvolve maintains a hierarchical knowledge base organized into three categories: correctness constraints that enforce valid kernel implementations, platform-agnostic optimization guidance covering debugging and tuning strategies, and hardware-specific documentation containing architectural details for each accelerator platform. The system retrieves relevant knowledge dynamically based on runtime signals. For example, a memory bandwidth bottleneck triggers retrieval of memory hierarchy documentation; a compilation error activates debugging guidance.
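A toy router capturing that signal-to-retrieval mapping; the signal names and document titles are invented, but the triggers paraphrase the examples above:

```python
SIGNAL_TO_DOCS: dict[str, list[str]] = {
    "memory_bandwidth_bound": ["memory_hierarchy.md", "tiling_patterns.md"],
    "compilation_error": ["debugging_guide.md"],
    "low_occupancy": ["launch_config_notes.md"],
}

def retrieve_context(diagnostics: dict[str, bool]) -> list[str]:
    """Pick knowledge-base documents from runtime signals, so the prompt
    carries only what the current failure mode needs."""
    docs = ["correctness_constraints.md"]  # always retrieved
    for signal, active in diagnostics.items():
        if active:
            docs.extend(SIGNAL_TO_DOCS.get(signal, []))
    return docs

print(retrieve_context({"memory_bandwidth_bound": True}))
# ['correctness_constraints.md', 'memory_hierarchy.md', 'tiling_patterns.md']
```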
This knowledge base is not static. As the system solves new optimization problems, it distills successful strategies into reusable skills (compact optimization patterns and debugging heuristics) that are continuously written back into the knowledge base. This self-evolving skill library acts as a form of in-context reinforcement learning: each successful exploration enriches the context available to future sessions, enabling the system to solve similar problems faster and with fewer search steps, without requiring model retraining.
Automated Evaluation Framework
Every generated kernel passes through a rigorous validation pipeline that checks both correctness (bitwise accuracy against reference implementations) and performance. And evaluation goes far beyond a single runtime number.
KernelEvolve leverages a stack of profiling tools, each targeting a different level of analysis. TritonBench validates numerical correctness against PyTorch baselines and measures end-to-end speedup across production input shapes. PyTorch Profiler captures system-level execution timelines, including kernel launch overhead and host-device synchronization. For GPU targets, tools like NCU provide kernel-level hardware metrics (occupancy, memory throughput, instruction mix), while Proton delivers intra-kernel instruction-level latency and pipeline behavior. For MTIA targets, MTIA Insight provides comprehensive accelerator-specific instrumentation: PE utilization, fixed-function engine metrics (DPE, SFU, MLU utilization and stall cycles), cache behavior, and per-PE memory bandwidth counters.
Rather than treating these tools as standalone steps, KernelEvolve unifies them through a compiler-centric abstraction. The framework composes analysis through job graphs: compiler transforms insert MLIR-level instrumentation, profiling passes collect metrics, and trace synthesis produces structured output. This means the search engine doesn't just see "kernel A is 1.2x faster than kernel B"; it sees why (whether the bottleneck is memory-bound, compute-bound, or limited by occupancy) and feeds that diagnostic signal back into the LLM synthesizer to guide the next round of candidates.
Shared Data Foundation
Every optimization session contributes to a shared data foundation. When one engineer's exploration discovers an effective tiling strategy for a class of operators, that insight becomes available to every future session targeting similar workloads, creating a compounding effect where the system grows more capable with each use. Early adopters perform the hardest exploration; subsequent users inherit much closer to optimal starting points and refine from there.
Agentic Reinforcement Learning
Every optimization session generates structured training data as a natural byproduct: agentic trajectories capturing the reasoning, code transformations, and evaluation feedback behind high-performing kernels. This domain-specific data is rare and valuable. It encodes optimization intuition that no public dataset contains.
We use this data to post-train smaller, specialized models through agentic reinforcement learning, where the reward signal comes directly from measured kernel performance. The result is a virtuous cycle where better models produce better kernels in fewer reasoning tokens and fewer search steps, which in turn generate higher-quality training data. Over successive iterations, this compounding flywheel enables us to self-host increasingly efficient models that are compact enough to run cost-effectively at scale while retaining the optimization capability of much larger frontier models.
Enabling Proprietary AI Chips
One of the most consequential capabilities of this architecture is its ability to generate optimized code for hardware that does not exist in any public training dataset.
Meta's custom MTIA chips present a unique programming challenge. Because these chips are proprietary, no public LLM has been trained on MTIA code. A standard coding assistant lacks the context to write optimized MTIA kernels because it has never seen MTIA documentation, instruction set details, or programming idioms.
KernelEvolve solves this through systematic knowledge injection. We encode MTIA-specific documentation (architecture manuals, instruction set references, memory hierarchy specifications, and optimization patterns) directly into the retrieval-augmented knowledge base. When the system targets MTIA, it retrieves and incorporates this proprietary knowledge into its reasoning, effectively "learning" the hardware in real time.
This approach extends to any new accelerator. When a new chip arrives, the engineering cost shifts from writing thousands of kernels by hand to curating a set of hardware documents and injecting them into the knowledge base. The system then autonomously generates optimized kernels for the new platform, ensuring the software stack is ready at the speed of hardware deployment rather than the speed of manual engineering.
KernelEvolve's Impact Across Benchmarks and Production
KernelEvolve has delivered strong results across both standardized benchmarks and production workloads.
Benchmark performance: On KernelBench, a benchmark suite of 250 kernel optimization problems from Stanford spanning three difficulty levels, KernelEvolve achieves a 100% pass rate: all generated kernels are both functionally correct and faster than their PyTorch reference implementations. The system also validates 160 PyTorch ATen operators with 100% correctness across three hardware platforms (480 total configurations).
Production speedups: On Meta's MTIA chips, KernelEvolve's generated kernels, which spanned compute-bound, memory-bound, and custom operations, achieved over 25% training throughput improvement on an ads model. On NVIDIA GPUs, it delivered more than 60% inference throughput improvement over a model with highly optimized kernels, including torch.compile and vendor libraries, performance gains that directly translate to serving capacity and infrastructure efficiency.
Hardware coverage: The system generates optimized kernels for NVIDIA GPUs, AMD GPUs, Meta's custom MTIA silicon, and CPUs from a single unified framework. Rather than maintaining separate prompt templates per platform, the system dynamically retrieves hardware-specific constraints and optimization patterns, adapting to each target through retrieval augmentation rather than manual prompt engineering.
Development Velocity
Kernel development that previously required weeks of expert effort (profiling, iterating on tiling strategies, debugging edge cases across hardware) now completes in hours through automated search and evaluation. This shifts engineer time from writing low-level code to higher-value work such as designing model architectures, improving training techniques, and defining optimization objectives.
How It All Fits Together
An engineer specifies a target operator, hardware platform, and performance goals. The system then autonomously (see the sketch after this list):
- Retrieves relevant hardware documentation and optimization knowledge from the knowledge base.
- Generates an initial set of kernel candidates using the LLM synthesizer with context-aware prompting.
- Evaluates each candidate for correctness and performance using distributed benchmarking infrastructure.
- Feeds results back into the search engine, which selects the most promising candidates and applies further optimizations.
- Iterates steps 1-4, exploring the search tree until a termination criterion is met: a performance target is achieved, the search budget is exhausted, or progress stalls.
- Outputs the best-performing, fully validated kernel, ready for production deployment.
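Stitched together, the loop is compact. A runnable toy sketch of these steps under stubbed-out components; every function, threshold, and name here is a stand-in, not KernelEvolve's actual interface:

```python
import random
from dataclasses import dataclass

@dataclass
class Result:
    src: str
    passes_tests: bool
    speedup: float

# Stubs standing in for the real components described above (hypothetical).
def retrieve(op: str, hw: str) -> str:
    return f"docs for {op} on {hw}"

def synthesize(op: str, docs: str, seed: str) -> str:
    return f"{op}-candidate-{seed}"

def evaluate(src: str) -> Result:
    return Result(src, random.random() < 0.5, random.uniform(0.5, 2.0))

def kernel_evolve(op: str, hw: str, target: float = 1.5,
                  budget: int = 10, width: int = 8):
    docs = retrieve(op, hw)                                          # step 1
    frontier = [synthesize(op, docs, str(i)) for i in range(width)]  # step 2
    best = None
    for rnd in range(budget):
        results = [evaluate(s) for s in frontier]                    # step 3
        correct = sorted((r for r in results if r.passes_tests),
                         key=lambda r: r.speedup, reverse=True)
        if correct and (best is None or correct[0].speedup > best.speedup):
            best = correct[0]
        if best is not None and best.speedup >= target:
            break                                  # termination criterion met
        frontier = [synthesize(op, docs, f"{rnd}.{i}")               # step 4
                    for i in range(width)]
    return best                            # step 6: best validated candidate

print(kernel_evolve("fused_gemm", "mtia"))
```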
The process runs on Metaâs distributed infrastructure, evaluating thousands of candidates in parallel. Persistent storage of search trees and implementations lets the system build on prior results when targeting new model variants or hardware generations.
Looking Ahead
The same agentic techniques powering KernelEvolve (structured reasoning, retrieval-augmented knowledge, closed-loop evaluation) can be applied to hybrid model search, compiler optimization, memory management, and system configuration. KernelEvolve represents an early step toward the vision of a Ranking Engineer Agent that can continuously optimize its own performance-critical infrastructure.
Within REA, ML Exploration discovers better models. KernelEvolve makes them production-ready. Together, they accelerate how quickly ranking improvements reach advertisers.
In the next post in the REA series, we'll explore other agentic ML optimizations.
Read the Paper
For more technical details, read our paper, "KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta," from ISCA 2026.
Acknowledgements
We would like to thank Ying Wang, Hongsen Qin, Tao Yang, Jia Jiunn Ang, Yujia He, Alicia Golden, Michael Kuchnik, Wei Guo, Yihan He, Jiangyuan Li, Dianshi Li, Chao Xie, Adele Sun, Richard Li, Alec Hammond, Roman Levenstein, Hongtao Yu, Yuanwei (Kevin) Fang, Kunming Ho, Haishan Zhu, Site Cao, Abdullah Ozturk, Jort Gemmeke, Daniel Wang, Juan Angeles Acuna, Yoram Bachrach, Ming Chen, Terry Chen, Jake Cheng, Wayne Chiang, Wenyuan Chi, Rick Chang, Wyatt Cook, Tri Dao, Barry Dong, Liubov Dmitrieva, Derek Dunfield, Zhou Fang, Rob Fergus, Maxwell Harrison Fisch, Zacharias Fisches, Zach Freeman, Chunli Fu, Vishal Gandhi, Kaustubh Gondkar, Wentian Guo, Han Guo, William Hanwei Liang, Samuel Hsia, Barney Huang, Nicholas Hungria, Martin Josifoski, Jacob Kahn, Shobhit Kanaujia, Drew Lackman, Marek Latuskiewicz, Kristin Lauter, Matan Levi, Evan Li, Yiting Li, Jiang Liu, Alexey Loginov, Yining Lu, Anuj Madan, John Martabano, Anna Mcburney, Keyur Muzumdar, Kelvin Niu, Sandeep Pandey, Uladzimir Pashkevich, Dmitrii Pedchenko, Pedro Pedreira, Varna Puvvada, Preyas Janak Shah, Bidit Sharma, Feng Shi, Stanley Shi, Ketan Singh, Vibha Sinha, Matt Steiner, Gabriel Synnaeve, Oleksandr Stashuk, Jim Tao, Ritwik Tewari, Chris Wiltz, Yao Xuan, Tak Yan, Bill Yoshimi, Xiayu Yu, Abdul Zainul-Abedin, Qing Zhang, and Mingjie Zhu
- Gemma 4: Byte for byte, the most capable open models DeepMind Blog Apr 02, 2026 04:00 PM Gemma 4: our most intelligent open models to date, purpose-built for advanced reasoning and agentic workflows.
-
New ways to balance cost and reliability in the Gemini API Google AI Blog Apr 02, 2026 04:00 PM 1 min read Google is introducing two new inference tiers to the Gemini API, Flex and Priority, to balance cost and latency.
Google is introducing two new inference tiers to the Gemini API, Flex and Priority, to balance cost and latency.
-
Create, edit and share videos at no cost in Google Vids Google AI Blog Apr 02, 2026 04:00 PM 1 min read New AI capabilities are coming to Google Vids, powered by Lyria 3 and Veo 3.1, like high-quality video generation at no cost and more.
New AI capabilities are coming to Google Vids, powered by Lyria 3 and Veo 3.1, like high-quality video generation at no cost and more.
- OpenAI acquires TBPN OpenAI Blog Apr 02, 2026 10:30 AM
- Codex now offers more flexible pricing for teams OpenAI Blog Apr 02, 2026 10:00 AM
- Welcome Gemma 4: Frontier multimodal intelligence on device Hugging Face Blog Apr 02, 2026 12:00 AM We're on a journey to advance and democratize artificial intelligence through open source and open science.
- Holo3: Breaking the Computer Use Frontier Hugging Face Blog Apr 01, 2026 04:36 PM A Blog post by H company on Hugging Face
-
We're creating a new satellite imagery map to help protect Brazil's forests. Google AI Blog Apr 01, 2026 01:30 PM 1 min read Google partnered with the Brazilian government on a satellite imagery map to help protect the country's forests.
Google partnered with the Brazilian government on a satellite imagery map to help protect the country's forests.
-
The latest AI news we announced in March 2026 Google AI Blog Apr 01, 2026 01:00 PM 1 min read Here are Google's latest AI updates from March 2026
Here are Google's latest AI updates from March 2026
- Falcon Perception Hugging Face Blog Apr 01, 2026 07:13 AM A Blog post by Technology Innovation Institute on Hugging Face
- Gradient Labs gives every bank customer an AI account manager OpenAI Blog Apr 01, 2026 02:00 AM
- Any Custom Frontend with Gradio's Backend Hugging Face Blog Apr 01, 2026 12:00 AM We're on a journey to advance and democratize artificial intelligence through open source and open science.
-
Meta Adaptive Ranking Model: Bending the Inference Scaling Curve to Serve LLM-Scale Models for Ads Meta AI / Engineering Mar 31, 2026 04:00 PM 10 min read Meta continues to lead the industry in utilizing groundbreaking AI Recommendation Systems (RecSys) to deliver better experiences for people, and better results for advertisers. To reach the next fr…
Meta continues to lead the industry in utilizing groundbreaking AI Recommendation Systems (RecSys) to deliver better experiences for people, and better results for advertisers. To reach the next frontier of performance, we are scaling Meta's Ads Recommender runtime models to LLM scale and complexity to build a deeper understanding of people's interests and intent.
This increase in scale and complexity exacerbates a fundamental "inference trilemma": the challenge of balancing the increased model complexity and associated need for compute and memory with the low latency and cost efficiency required for a global service serving billions of people. To overcome this, we have developed the Meta Adaptive Ranking Model, which effectively bends the inference scaling curve with high ROI and industry-leading efficiency.
Adaptive Ranking Model replaces a "one-size-fits-all" inference approach with intelligent request routing. By dynamically aligning model complexity with a rich understanding of a person's context and intent, the system ensures every request is served by the most effective and efficient model. This allows Meta Ads to maintain the strict, sub-second latency the platform depends on while providing a high-quality experience for every person.
Serving LLM-scale models at Meta's scale required a fundamental rethink of the inference stack, driven by three key innovations:
- Inference-Efficient Model Scaling: By shifting to a request-centric architecture, Adaptive Ranking Model serves an LLM-scale model at sub-second latency, enabling a more sophisticated understanding of a person's interests and intent without compromising the experience.
- Model/System Co-Design: By developing hardware-aware model architectures that align model design with the capabilities and limitations of the underlying hardware system and silicon, Adaptive Ranking Model significantly improves hardware utilization in heterogeneous hardware environments.
- Reimagined Serving Infrastructure: Leveraging multi-card architectures and hardware-specific optimizations, Adaptive Ranking Model enables O(1T) parameter scaling, allowing us to serve the LLM-scale runtime RecSys models with unprecedented efficiency.
By further integrating LLM-scale intelligence into our ads stack, Adaptive Ranking Model delivers a significant increase in ad conversions and advertiser value while maintaining system-wide computational efficiency. This ensures superior performance for businesses of all sizes. Since launching on Instagram in Q4 2025, Adaptive Ranking Model has delivered a +3% increase in ad conversions and a +5% increase in ad click-through rate for targeted users.
Introducing Meta Adaptive Ranking Model
Serving LLM-scale models in a real-time ads recommendation environment requires resolving a fundamental tension between model complexity and system efficiency. Unlike LLM applications such as chatbots, where response times are measured in seconds, an ad recommendation must satisfy two uncompromising constraints:
- Latency impacts user experience: Ads must be chosen and returned with sub-second latency. Scaling ads computation to LLM-scale level and beyond has traditionally been impossible without latency regressions that compromise user experience.
- Cost efficiency is crucial: Brute force scaling by simply adding hardware is economically unsustainable. Achieving a positive ROI requires unlocking higher model complexity without a corresponding increase in total costs.
Adaptive Ranking Model addresses these challenges through a paradigm shift powered by three core innovations across the serving stack:
- Inference-efficient model scaling: Adaptive Ranking Model achieves a model complexity equivalent to the O(10 GFLOPs) per token used by top-tier LLMs. However, it operates an order of magnitude faster than standard LLM inference, maintaining O(100 ms) bounded latency.
- Deep model-system co-design: Adaptive Ranking Model is deeply co-designed with the underlying hardware and silicon; we've boosted model FLOPs utilization (MFU) to 35% across multiple hardware types.
- Reimagined serving infrastructure: Adaptive Ranking Model utilizes a multi-card GPU serving infrastructure to break the physical memory limits of single devices. This allows us to scale model parameters to O(1T), providing a depth of understanding of people's interests and intent previously impossible at Meta's scale.
By unifying these innovations, we ensure that the most effective model is used for every request, providing a highly personalized ad experience for people on our platforms and maximizing advertiser value while maintaining system-wide computational efficiency.
Inference-Efficient Model Scaling
Adaptive Ranking Model introduces model-system innovations that fundamentally redefine inference efficiency. This transformation is built on three technical pillars:
- Transforming scaling costs from linear to sub-linear by shifting to a request-oriented computation flow that eliminates massive redundancy at LLM-scale.
- Maximizing structural throughput through architectural refinements that stabilize deep models and minimize internal network bottlenecks.
- Neutralizing complexity overhead through holistic latency optimization, offloading feature preprocessing to GPUs and streamlining the end-to-end execution path.
Transforming scaling costs from linear to sub-linear
Traditional models process each user-ad pair independently, creating massive computational redundancy. Adaptive Ranking Model eliminates this through Request-Oriented Optimization, which computes high-density user signals once per request rather than once per ad candidate. This shift, powered by Request-Oriented Computation Sharing and In-Kernel Broadcast optimization, which shares request-level embeddings across ad candidates directly within the GPU kernel, transforms scaling costs from linear to sub-linear while significantly reducing memory bandwidth pressure.
Building on this, Request-Oriented Sequence Scaling unlocks the use of long-form user behavior sequences that were previously limited by compute and storage costs. To minimize compute overhead, Adaptive Ranking Model processes heavy sequences once per request and shares the results across all ad candidates. To optimize storage, it replaces redundant data replication with a centralized, high-efficiency key-value store of user logs that are joined with training data on the fly. These optimizations jointly minimize the serving and storage footprints required for global-scale systems.
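A toy numerical sketch of the sharing idea (shapes, names, and the pooling step are invented): the per-request user computation runs once, and only the cheap per-candidate scoring is broadcast across ads.

```python
import numpy as np

def score_request(user_seq: np.ndarray, ad_embs: np.ndarray) -> np.ndarray:
    """Score all candidate ads for one request.

    user_seq: (seq_len, d) engagement-history embeddings
    ad_embs:  (num_ads, d) candidate ad embeddings
    """
    # Heavy per-request work: computed once, not once per user-ad pair.
    user_repr = user_seq.mean(axis=0)        # (d,)
    # Cheap per-candidate work: broadcast the shared representation.
    return ad_embs @ user_repr               # (num_ads,) scores

scores = score_request(np.random.randn(2048, 64), np.random.randn(1000, 64))
```

Processing the long sequence once and reusing it across 1,000 candidates is what turns the per-ad cost into a single dot product, the sub-linear scaling described above.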
Maximizing Structural Throughput with Wukong Turbo
While Request-Oriented Optimization optimizes the computation flow, Wukong Turbo is the optimized runtime evolution of the Meta Ads internal architecture. Building on the Wukong architecture that uses stackable factorization machines, sequence learning and cross-layer attention, Wukong Turbo introduces specific refinements to handle the numeric instability and network overhead that typically arise when scaling deep models. Specifically, it employs a No-Bias approach to remove unstable terms, boosting throughput without increasing FLOPs or parameter counts. To prevent internal bottlenecks, it utilizes small parameter delegation to reduce network and memory overhead by offloading parameters from Fully Sharded Data Parallel (FSDP) to Distributed Data Parallel (DDP) alongside sparsity-based simplification that reduces redundant components in the linear layers. These enhancements transform the base architecture into a stable, high-performing system, allowing model complexity to scale while strictly protecting the sub-second inference budget.
Neutralizing Complexity Overhead through Holistic Latency Optimization
The final stage of this transformation addresses feature preprocessing, a traditional bottleneck that leads to client memory pressure and data starvation, where the GPU's compute power sits underutilized while waiting for processed features. Adaptive Ranking Model offloads preprocessing from the client CPU to remote GPU hosts, utilizing compact tuple-based formats and GPU-native kernels that reduce Top-K complexity from O(N log N) to O(N). To further speed up processing, we implemented a holistic strategy of optimized data compression and client-flow restructuring to eliminate thread-pool contention. These multi-layered optimizations neutralized the latency penalty of LLM-scale complexity, allowing Adaptive Ranking Model to deliver frontier-level personalization at the speed Meta's global platforms require.
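The Top-K change can be illustrated in isolation: partial selection (introselect, as in NumPy's argpartition) finds the top k of N scores in O(N) on average, versus O(N log N) for a full sort. This is a generic sketch, not Meta's GPU kernel:

import numpy as np

def top_k(scores: np.ndarray, k: int) -> np.ndarray:
    idx = np.argpartition(scores, -k)[-k:]       # O(N) partial selection
    return idx[np.argsort(scores[idx])[::-1]]    # sort only the k winners

best = top_k(np.random.rand(1_000_000), 100)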
Maximizing Efficiency Through Deep Model-System Codesign
Meta Ads relies on deep system co-optimization to enable LLM-scale model complexity within Meta-scale performance constraints. By fundamentally rethinking the boundary between the model and the hardware, we have created a unified inference stack that optimizes computational precision and graph execution to maximize computational ROI by boosting Model FLOPs Utilization (MFU) on heterogeneous hardware.
High-Throughput Inference with Selective FP8 Quantization
Large-scale models necessitate reduced precision to maintain high-throughput inference, yet a blanket application of low-precision quantization often degrades the nuance required for complex ads ranking. Adaptive Ranking Model overcomes this through a post-training quantization strategy that applies FP8 selectively. Using a micro-benchmark guided selection mechanism, the system deploys FP8 only in layers with high precision-loss tolerance. This targeted approach unlocks the throughput benefits of modern heterogeneous hardware for our most complex models with negligible impact on recommendation quality.
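A hedged sketch of the selection loop follows. Real FP8 execution requires hardware-specific kernels, so float16 stands in for the low-precision format here; only the micro-benchmark-guided selection logic is the point:

import copy
import torch
import torch.nn as nn

def quality_delta(layer, quant_layer, calib_inputs):
    # Micro-benchmark: how much does low precision perturb this layer's output?
    with torch.no_grad():
        ref = layer(calib_inputs)
        low = quant_layer(calib_inputs.half()).float()
    return (ref - low).abs().mean().item()

def selectively_quantize(layers, calib_inputs, tolerance=1e-3):
    out = []
    for layer in layers:
        quant = copy.deepcopy(layer).half()       # stand-in for FP8 conversion
        if quality_delta(layer, quant, calib_inputs) < tolerance:
            out.append(quant)                     # tolerant layer: quantize
        else:
            out.append(layer)                     # sensitive layer: keep full precision
    return out

mixed = selectively_quantize([nn.Linear(64, 64) for _ in range(4)], torch.randn(32, 64))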
Hardware-Aware Graph and Kernel Specialization
To minimize the latency caused by redundant memory access and inefficient kernel launches, Adaptive Ranking Model optimizes the execution flow through coordinated graph and kernel specialization. We fuse operators that share inputs to minimize data movement between high-bandwidth memory and on-chip SRAM. Additionally, thousands of small operations are consolidated into compute-dense kernels using techniques like Grouped General Matrix Multiply and horizontal fusion. This precise alignment between the computation graph and modern GPU architectures significantly reduces the memory footprint and increases effective hardware utilization, ensuring that LLM-scale model complexity translates directly into performance.
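The consolidation idea can be shown with a generic batched matmul: a Python loop over many small matrix multiplies launches many tiny kernels, while a single batched call runs them as one compute-dense kernel (the same spirit as Grouped GEMM and horizontal fusion, though not Meta's actual kernels):

import torch

N, M, K, P = 1000, 16, 32, 8
A, B = torch.randn(N, M, K), torch.randn(N, K, P)

out_loop = torch.stack([a @ b for a, b in zip(A, B)])  # N tiny kernel launches
out_batched = torch.bmm(A, B)                          # one batched kernel
assert torch.allclose(out_loop, out_batched, atol=1e-5)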
Reimagined Serving Infrastructure for the Reality of LLM-Scale Production
Beyond model-system co-optimization, deploying LLM-scale models at scale requires reimagining the underlying serving infrastructure. To neutralize the latency penalty of massive scale, the Adaptive Ranking Model utilizes a specialized stack designed to surpass physical memory limits and ensure Meta-scale production reliability.
Trillion Parameter Scale
Unlike standard LLMs, recommendation models are driven by predominantly sparse, categorical features. Mapping these IDs to high-dimensional embedding tables creates a critical trade-off where oversized tables lead to overfitting, while undersized tables suffer from hash collisions that degrade model quality. Adaptive Ranking Model enables O(1T) parameter scale through memory optimizations that resolve this tension. The system efficiently allocates embedding hash sizes based on feature sparsity and prunes unused embeddings to maximize learning capacity within strict memory budgets. This is further optimized by unified embeddings, which allow multiple features to share a single embedding table to significantly reduce the memory footprint without sacrificing the ability to learn complex feature interactions.
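A minimal sketch of unified embeddings follows, with a salted modular hash standing in for the real hashing scheme (all names and sizes here are illustrative):

import torch
import torch.nn as nn

class UnifiedEmbedding(nn.Module):
    def __init__(self, num_rows: int, dim: int, num_features: int):
        super().__init__()
        self.table = nn.Embedding(num_rows, dim)   # one table shared by all features
        self.num_rows = num_rows
        # Per-feature salts so the same raw ID maps to different rows for
        # different features (a stand-in for a production hash function).
        self.register_buffer("salts", torch.randint(1, 2**31 - 1, (num_features,)))

    def forward(self, feature_idx: int, ids: torch.Tensor) -> torch.Tensor:
        hashed = (ids * self.salts[feature_idx]) % self.num_rows
        return self.table(hashed)

emb = UnifiedEmbedding(num_rows=1_000_000, dim=64, num_features=8)
vecs = emb(feature_idx=3, ids=torch.tensor([42, 99]))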
Multi-GPU-Card Embedding Scaling
As LLM-scale model embeddings approached the terabyte level, they exceeded the memory capacity of any single GPU. To mitigate this, a multi-card sharding mechanism splits embedding tables into segments distributed across an optimized hardware cluster. By leveraging hardware-specific communication optimizations, the system maintains high throughput and efficient communication between shards. This multi-card architecture achieves performance parity with single-card setups, effectively decoupling model complexity from individual GPU hardware constraints.
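Row-wise sharding can be sketched in a few lines; real deployments route these lookups over inter-GPU links with all-to-all communication, which this toy version omits:

import torch

NUM_SHARDS, ROWS_PER_SHARD, DIM = 4, 250_000, 64
shards = [torch.randn(ROWS_PER_SHARD, DIM) for _ in range(NUM_SHARDS)]  # one slice per GPU

def lookup(ids: torch.Tensor) -> torch.Tensor:
    shard_idx = ids // ROWS_PER_SHARD   # which card owns the row
    local_row = ids % ROWS_PER_SHARD    # offset within that card
    return torch.stack([shards[s][r] for s, r in zip(shard_idx.tolist(), local_row.tolist())])

vecs = lookup(torch.tensor([7, 600_123, 999_999]))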
Runtime Resilience and Reliability
Serving trillion-parameter models under high-traffic conditions presents significant reliability challenges, particularly regarding initialization speed and system stability. To ensure production-grade reliability, we developed accelerated model loading that utilizes multi-stream downloading and remote caching to load models in under 10 minutes, minimizing downtime during deployments. Auto-scaling rules based on streaming multiprocessor (SM) utilization allow the system to handle fluctuating traffic dynamically. This ensures real-time demand is met without the need for wasteful over-provisioning, maintaining stability across the platform.
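A toy version of such a utilization-driven rule (thresholds invented for illustration):

def target_replicas(current: int, sm_utilization: float, high=0.80, low=0.40) -> int:
    # Scale out before latency degrades; scale in to avoid over-provisioning.
    if sm_utilization > high:
        return current + 1
    if sm_utilization < low and current > 1:
        return current - 1
    return current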
The Path Forward: Evolving the Adaptive Ranking Model Stack
The launch of Adaptive Ranking Model on Instagram marks the first milestone in our journey to bend the inference performance-versus-cost scaling curve at Meta scale. The roadmap shifts from individual optimizations toward an infrastructure that is increasingly autonomous and responsive to real-time fluctuations in user signal density and request patterns across our global ecosystem.
This vision began with evolving inference-efficient scaling to unlock deeper complexity and longer behavioral sequences that capture user intent with unprecedented fidelity. To sustain this growth, we are pioneering a new era of inference execution efficiency, leveraging advanced model compression and ultra-low precision quantization methods to allow the most sophisticated LLM-scale models to run efficiently across a diverse global hardware fleet.
To eliminate the traditional bottlenecks of manual engineering, we are exploring agentic optimization frameworks to further accelerate kernel performance optimizations. These frameworks will automatically adapt to new hardware and model architectures, ensuring that the most sophisticated AI remains accessible and performant at scale.
Furthermore, we're reimagining the speed of learning through near-instantaneous model freshness, utilizing incremental, in-place weight updates to achieve constant, real-time adaptation. Collectively, these innovations will ensure that the Adaptive Ranking Model continues to power more personal experiences for people while driving superior ROAS for advertisers globally.
Acknowledgements
We would like to thank: Jia Jiunn Ang, Pan Chen, Wenlin Chen, Maomao Ding, Chengze Fan, Lu Fang, Birmingham Guan, Qin Huang, Santanu Kolay, Ashwin Kumar, Jinfu Leng, Boda Li, Huayu Li, Jiawei Li, Li Li (Ads Ranking), Liyuan Li, Mingda Li, Wenyuan Li, Rocky Liu, Jason Lu, Robert Luo, Yinbin Ma, Sandeep Pandey, Uladzimir Pashkevich, Varna Puvvada, Michael Shao, Pranav Sharma, Zijian Shen, Vibha Sinha, Matt Steiner, Chonglin Sun, Weiman Sun, Aaron (Li Bo) Tao, Xiaohan Wei, Nathan Yan, Yantao Yao, Hongtao Yu, Li Yu, Sihan Zeng, Buyun Zhang, Bill Zhao, Alex Zhong, Zhehui Zhou, and the entire V-team behind the development and productionization of the LLM-scale runtime model in Meta's ads recommendation system.
The post Meta Adaptive Ranking Model: Bending the Inference Scaling Curve to Serve LLM-Scale Models for Ads appeared first on Engineering at Meta.
-
Build with Veo 3.1 Lite, our most cost-effective video generation model Google AI Blog Mar 31, 2026 04:00 PM 1 min read Veo 3.1 Lite is now available in paid preview through the Gemini API and for testing in Google AI Studio.
Veo 3.1 Lite is now available in paid preview through the Gemini API and for testing in Google AI Studio.
- Granite 4.0 3B Vision: Compact Multimodal Intelligence for Enterprise Documents Hugging Face Blog Mar 31, 2026 03:10 PM A Blog post by IBM Granite on Hugging Face
- Accelerating the next phase of AI OpenAI Blog Mar 31, 2026 01:00 PM
- Mar 31, 2026 Announcements Australian government and Anthropic sign MOU for AI safety and research Anthropic News Mar 31, 2026 12:00 AM Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
- TRL v1.0: Post-Training Library Built to Move with the Field Hugging Face Blog Mar 31, 2026 12:00 AM We're on a journey to advance and democratize artificial intelligence through open source and open science.
-
AI for American-Produced Cement and Concrete Meta AI / Engineering Mar 30, 2026 04:00 PM 7 min read Meta is continuing its long-term roadmap to help the construction industry leverage AI to produce high-quality and more sustainable concrete mixes, as well as those exclusively produced in the Unit…
- Meta is continuing its long-term roadmap to help the construction industry leverage AI to produce high-quality and more sustainable concrete mixes, as well as those exclusively produced in the United States.
- Concurrent with the 2026 American Concrete Institute (ACI) Spring Convention, Meta is releasing a new AI model for designing concrete mixes, Bayesian Optimization for Concrete (BOxCrete), as well as the foundational data used to develop award-winning concrete mixes.
- Metaâs open source model for sustainable concrete is available today on GitHub.
Every year, the United States pours roughly 400 million cubic yards of concrete, enough concrete to pave a two-lane highway that circles the Earth multiple times. It's the backbone of our bridges, data centers, highways, and homes. However, while we produce most of our ready-mix concrete domestically, we import nearly a quarter of the cement that makes it. Meta's AI is helping change that.
Concrete consists of a mix of cement and cementitious materials, aggregates, water, and chemical admixtures. Concrete suppliers have to design concrete mixes to meet competing requirements: strength, speed, ease of handling, cost, and sustainability. Traditional concrete mix design relies heavily on trial-and-error in the lab, engineer intuition, and decades of accumulated knowledge, a workflow that is slow and expensive to adapt.
Cement is a key ingredient of concrete, so imported cement can have a significant impact on U.S. suppliers, stifling U.S. manufacturing, jobs, and investment. While ready-mix concrete is typically produced domestically, the cement required for it is heavily imported, with roughly 20-25% of U.S. cement consumption met by imports. Additionally, cement made in the U.S. complies with U.S. performance and environmental standards that are not consistent internationally.
At the same time, ensuring products are produced domestically, a process often called reshoring, generally increases manufacturing jobs in the United States. Reshoring and related foreign direct investment (FDI) have brought over 1.1 million jobs back to the U.S. since 2020, and manufacturing has one of the highest economic multipliers: every $1.00 spent in manufacturing adds $2.69 to the U.S. economy. The cement and concrete sector alone contributes more than $130 billion annually and supports roughly 600,000 jobs, yet imports still supply about 23% of total domestic demand. To capture more of that value at home, U.S.-based concrete producers want to incorporate more U.S.-made materials in their mixes.
Different cements have different chemistries, and a mix that works perfectly with one cement might fail entirely with another. As a result, producers need a way to rapidly explore and validate new formulations without spending months in the lab.
Real-World Impact Across the U.S.
Meta and its partners have already received a number of awards for these innovations in concrete design, including a 2025 Building Innovation Award for Best Partnership (shared with Amrize) and a Slag Cement Award in 2025 for Sustainable Concrete Project of the Year (shared with Amrize and the University of Illinois at Urbana-Champaign). But the impact of this model is also being felt through on-the-ground collaborations in several states through partnerships with large-scale concrete manufacturers and software companies.
Illinois
Meta has been partnering closely with the University of Illinois at Urbana-Champaign and Amrize, the largest cement and concrete manufacturer in North America, headquartered in Chicago, IL, on the implementation of AI for sustainable and domestically-produced concrete. Amrize operates 18 cement plants, 141 cement terminals and 269 ready-mix concrete sites across North America. Their scale makes them an ideal partner for demonstrating how AI can transform mix design at industrial volumes. Amrize recently launched a Made in America cement label, which guarantees the cement meets rigorous U.S. standards and was manufactured in the U.S. by a domestic workforce with American materials. The company also recently announced close to $1 billion of capital investments in 2026, in part to increase domestic cement production.
Meta and Amrize will be presenting at the American Concrete Institute (ACI) Spring Convention, along with researchers from the University of Illinois Urbana-Champaign to further showcase our partnership leveraging AI for lower-emission, domestically-produced concrete.
Alongside the event, Meta is releasing a new AI model for designing concrete mixes, Bayesian Optimization for Concrete (BOxCrete). BOxCrete improves over Metaâs previous models with more robustness to noisy data as well as new features including the ability to predict concrete slump (an important indicator of concrete workability).
Coupled with BOxCrete, Meta is releasing the foundational data used to develop the novel concrete mix used in our Rosemount, MN data center. Meta considers this the most systematic foundational dataset of concrete mix performance among open-sourced, published datasets.
Metaâs researchers have submitted a paper on BOxCrete for publication that outlines the new model, data, and the associated methodology.
Minnesota
In partnership with Amrize, Mortenson and the University of Illinois at Urbana-Champaign, BOxCrete was used to generate a stronger, faster-curing concrete mix that was used at scale in a site support section in one of our data center building slabs in Rosemount, MN.
The AI-optimized mix was designed for one of the most demanding parts of the build: the massive concrete foundation that supports the weight of thousands of servers and cooling systems. Using domestically sourced materials, the mix reached full structural strength 43% faster than the original formula, while also reducing cracking risk by nearly 10%, proving that AI can help American producers rapidly reformulate around U.S.-made materials without sacrificing quality. With the data confirming it meets all structural requirements, the mix is now qualified for use in additional areas of the data center.

Meta's data center in Rosemount, MN.
Pennsylvania
In 2023, Meta released its concrete optimization AI framework as open-source software under the MIT license, enabling broad adoption from academia to commercial software providers.
In an effort that reflects how AI-driven mix design is becoming part of the standard infrastructure of concrete production, Pennsylvania-based Quadrel, a leading enterprise SaaS platform serving the ready-mix industry, has adapted Meta's AI framework into its software. Quadrel has applied it to real-world use cases including data preprocessing, batch and test normalization, feature engineering, and customer-specific model training. The models, which continuously improve as field test results are incorporated, have been embedded into daily mix design and quality control workflows, informing day-to-day operational decisions.

Meta's open-source AI model for sustainable concrete is provided under the MIT license, allowing for commercial use with minimal restrictions while benefiting from open-source AI advances and investments.
How Meta Leverages AI for Concrete Mixtures
Meta's AI for concrete model can help suppliers more quickly incorporate U.S. materials into their mixes through an approach called adaptive experimentation.
Here's how it works:
Meta's Adaptive Experimentation (Ax) platform uses Bayesian optimization to intelligently navigate the vast space of possible concrete formulations. Instead of testing mixes randomly or relying solely on human intuition, the AI:
- Learns from existing data: Historical mix designs, lab results, and performance metrics train the model on what works
- Proposes high-potential candidates: The AI suggests new mixes most likely to meet target specifications and can compare performance between U.S.-made and foreign materials
- Incorporates constraints upfront: Users specify technical requirements and the ingredients to be used.
- Refines with each test: Every lab result improves the model's predictions, giving rise to an automatic improvement loop.
While the inclusion of AI and adaptive experimentation does not change the process of lab validation, field trials, engineering sign-off, and code compliance, it greatly improves the speed of discovery, helping engineers find better starting points with fewer tests.
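For readers who want to try the pattern, here is a generic loop using Meta's open-source Ax library; the parameter names, bounds, and the lab_test() stub are invented for illustration and are not BOxCrete itself:

from ax.service.ax_client import AxClient, ObjectiveProperties

ax_client = AxClient()
ax_client.create_experiment(
    name="concrete_mix",
    parameters=[
        {"name": "cement_kg_m3", "type": "range", "bounds": [250.0, 450.0]},
        {"name": "water_cement_ratio", "type": "range", "bounds": [0.35, 0.55]},
        {"name": "slag_fraction", "type": "range", "bounds": [0.0, 0.5]},
    ],
    objectives={"strength_28d_mpa": ObjectiveProperties(minimize=False)},
)

def lab_test(mix: dict) -> float:
    # Stand-in for a real lab measurement of 28-day strength.
    return 40.0 + 20.0 * mix["slag_fraction"] - 30.0 * (mix["water_cement_ratio"] - 0.4) ** 2

for _ in range(10):
    mix, trial_index = ax_client.get_next_trial()        # AI proposes a candidate
    strength = lab_test(mix)                             # lab result comes back
    ax_client.complete_trial(trial_index=trial_index,    # model refines itself
                             raw_data={"strength_28d_mpa": strength})

best_parameters, _ = ax_client.get_best_parameters()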

Source: University of Illinois at Urbana-Champaign
Building an AI-Assisted Future for Concrete
Metaâs AI for concrete is part of a broader commitment to applying machine learning where it can drive measurable, real-world impact. While the work with Amrize, the University of Illinois, and industry software providers like Quadrel represents the first wave of adoption, the goal is an industry-wide shift in how American producers approach mix design.
Over the next few years, Meta is planning to further collaborate with the construction industry to develop new AI tools. As more platforms like Quadrel build on BOxCrete, AI-optimized mix design becomes accessible to producers without requiring them to change their existing workflows. The team is also planning on continued academic collaboration with the University of Illinois Urbana-Champaign to explore how AI can address not just domestic material substitution, but broader challenges in concrete sustainability and performance.
By reducing the barriers to domestic material adoption, Meta is helping American producers compete on cost, reduce emissions, and build supply chain resilience, one mix at a time.
Get Involved
Explore Meta's open-source BOxCrete for Sustainable Concrete on GitHub.
Read our pre-print: "BOxCrete: A Bayesian Optimization Open-Source AI Model for Concrete Strength Forecasting and Mix Optimization."
The post AI for American-Produced Cement and Concrete appeared first on Engineering at Meta.
- Helping disaster response teams turn AI into action across Asia OpenAI Blog Mar 29, 2026 10:15 PM
- OpenClaw 3.28: New Shell - MiniMax Image Generation, Async Tool Approval, and 90+ Fixes OpenClaws.io Blog Mar 28, 2026 12:00 AM 3.22 was the surgery. 3.23 confirmed survival. 3.24 was rehab. Now the lobster has grown a new shell - harder, sharper, battle-ready. 2 breaking changes, 21 features, 90+ fixes. MiniMax brings image g
- STADLER reshapes knowledge work at a 230-year-old company OpenAI Blog Mar 27, 2026 10:00 PM
-
The Hardest Part of Running a Small Business in the Trades Mozilla.ai Blog Mar 27, 2026 04:09 PM 5 min read Running a small trade business includes a steady flow of admin work: quotes, scheduling, invoices, payments, and more. This post looks at how that workload builds up and introduces Clawbolt, a focused

The AI revolution has triggered a massive shift in daily life for knowledge workers. Developers, writers, analysts, and designers have seen their output transform dramatically over the past year. But that revolution is still working its way into the industries that rely less on sitting at a desk. The trades are one example: skilled, independent contractors running their own businesses have enormous amounts to gain from AI, but the tools built so far weren't built with them specifically in mind.
At Mozilla.ai, we think about trust, transparency, and user agency as foundational to what good AI looks like. It means building for the people who've been left out of the current wave, not just the people already in front of a screen. That's what led us to Clawbolt.
The Problem
There's a story that plays out constantly in the trades. Someone spends years working for a larger company, gets great at their craft, and eventually makes the leap: be your own boss, do great work, and reap the benefits of the effort you put in.
What they quickly discover is that running a business is a whole lot more than being good at your craft. Wrapped around the work is a mountain of administration:
- Visiting job sites to give quotes and estimates
- Researching the cost of materials
- Planning and managing schedules
- Hiring and coordinating day laborers or subcontractors
- Sending and tracking invoices
- Processing payments
- Managing business profiles, reviews, and social media
Every hour spent at a keyboard chasing invoices or updating a business profile is time not spent on the jobs that are generating the revenue. This is why so many small businesses in the trades struggle. The skill is there, but the bandwidth for everything else often isn't. And when the business is going well, the "reward" is frequently an evening in front of a laptop catching up on paperwork instead of time with family.
Why AI Agents, and Why Now?
Most people are familiar with AI assistants in the ChatGPT mold: you ask a question and you get an answer. It's useful, but it puts the burden on the user to know what to ask and when to ask it.
That changed this fall with the emergence of OpenClaw, an open-source project that became the highest-starred GitHub repository of all time. OpenClaw introduced a framework for AI that operates proactively in the background, taking initiative, surfacing things the user didn't know they needed to handle, and acting on their behalf without waiting to be prompted.
The catch (and it's a big catch) is that OpenClaw is hard to set up, and misconfiguration has massive security implications. It's a powerful foundation, but it's not something most people can just pick up and use safely.
Introducing Clawbolt.ai
Clawbolt is an idea we've started working with at Mozilla.ai: a narrow, purpose-built AI assistant for contractors and small trade business owners. It's not trying to be a general-purpose tool. It's designed around the specific, repeatable needs of someone running a small trade business without a back-office team to support them.
A few of the guiding principles of Clawbolt:
- It meets users where they already are. The interface is a messaging app they already use, whether that's Telegram, WhatsApp, or iMessage. No new software to learn, no browser tabs to manage. The user experience is designed from the ground up to work just like you're messaging a friend: scheduling reminders, approving data access, and updating configuration all happen smoothly over messaging apps.
- It connects to the tools they already use. Clawbolt integrates with accounting software like QuickBooks and with calendar apps like Google Calendar to handle scheduling and finances without requiring the user to leave their conversation thread.
- It's proactive, not passive. Rather than waiting to be asked, Clawbolt learns where a particular user tends to fall behind and gets ahead of it. That might mean following up on an unpaid invoice, flagging that material costs have changed on an active bid, or reminding someone to schedule a follow-up call.
- It's built on open-source foundations with security as a priority. Mozilla.ai's commitment to transparency means Clawbolt has an open source core, and we're taking our time with curating integrations to ensure that security isn't an afterthought.
We're also working on a hosted option for people who want to get started without any technical setup. Self-hosting shouldn't be a prerequisite!
Our shining star: a contractor should be able to finish a long day of work, go home, and not have to spend hours on a laptop to get paid for work they already did.
Get Involved
Clawbolt is still in early development, and that's an intentional decision: the earlier we hear from people working in the trades, the more that input can shape how we build.
If you work in the trades, manage a small trade business, or know someone who does and any of this resonates, we want to hear from you. You can fill out this quick form or reach us at hello@mozilla.ai. If you're a software developer and want to dig into the project, contribute, or give it a star, the codebase is public on github.com/mozilla-ai/clawbolt.
Who is Mozilla.ai?
Mozilla.ai is a public benefit startup and wholly-owned subsidiary of the Mozilla Foundation, operating with its own independent team. Our work focuses on AI technologies built around agency, access, and transparency. We share the Mozilla name and values, but we're a separate organization from Firefox, Thunderbird, and other Mozilla products.
Curious about the Mozilla family? Mozilla Foundation · Mozilla Corporation · Mozilla Ventures · Mozilla Data Collective · Firefox · Thunderbird
- Liberate your OpenClaw Hugging Face Blog Mar 27, 2026 12:00 AM We're on a journey to advance and democratize artificial intelligence through open source and open science.
-
Watch James Manyika talk AI and creativity with LL COOL J. Google AI Blog Mar 26, 2026 05:00 PM 1 min read In the latest episode of our Dialogues on Technology and Society series, LL COOL J sits down with James Manyika.
In the latest episode of our Dialogues on Technology and Society series, LL COOL J sits down with James Manyika. -
Transform your headphones into a live personal translator on iOS. Google AI Blog Mar 26, 2026 04:00 PM 1 min read Google Translate's Live translate with headphones is officially arriving on iOS! And we're expanding the capability for both iOS and Android users to even more countries…
Google Translate's Live translate with headphones is officially arriving on iOS! And we're expanding the capability for both iOS and Android users to even more countries…
- Gemini 3.1 Flash Live: Making audio AI more natural and reliable DeepMind Blog Mar 26, 2026 03:23 PM Gemini 3.1 Flash Live is now available across Google products.
-
Gemini 3.1 Flash Live: Making audio AI more natural and reliable Google AI Blog Mar 26, 2026 03:21 PM 1 min read Gemini 3.1 Flash Live is now available across Google products.
Gemini 3.1 Flash Live is now available across Google products. -
Search Live is expanding globally Google AI Blog Mar 26, 2026 03:00 PM 1 min read We're expanding Search Live globally, to all languages and locations where AI Mode is available.
We're expanding Search Live globally, to all languages and locations where AI Mode is available. -
Hardening Your LLM Dependency Supply Chain Mozilla.ai Blog Mar 25, 2026 10:26 PM 4 min read When source code and distributed packages don't match, risks increase. This breakdown of the LiteLLM incident shares what to watch for and how to reduce exposure.

On March 24, 2026, LiteLLM, a Python package with over 95 million monthly downloads, was compromised. Versions 1.82.7 and 1.82.8 on PyPI contained a credential-stealing payload that exfiltrated SSH keys, cloud provider credentials, Kubernetes secrets, API keys, crypto wallets, and database passwords to an attacker-controlled server.
The attacker who hit LiteLLM just compromised one package and got the keys to everything. They targeted the one dependency that, by definition, sits on every LLM credential in the organization. The source code on GitHub was clean the entire time. If you only audited the repo, you'd have seen nothing.
LLM gateway libraries are uniquely high-value targets. By design, they hold API keys for all the providers you use: OpenAI, Anthropic, Google, Azure, Cohere, and others.
What happened
A threat actor group known as TeamPCP gained access to the LiteLLM maintainer's PyPI publishing credentials. Using those credentials, they uploaded malicious versions of the package directly to PyPI, completely bypassing the GitHub repository.
The payload used a .pth file: a little-known Python mechanism that auto-executes code on interpreter startup. You don't need to import litellm for it to run. Just having the package installed is enough for the malware to harvest credentials, establish persistence via systemd, and attempt lateral movement through Kubernetes clusters.
As Andrej Karpathy noted, the compromised version was live for less than an hour and was only discovered because a bug in the malware caused a machine to crash. Without that bug, this could have gone undetected for days or weeks.
The critical detail: this was a divergence between the source repository and the distributed artifact. The GitHub source was clean. The PyPI package was not. Anyone who reviewed the code on GitHub and assumed the published package matched it was wrong.
Five things you can do today
Here are a few things you can do right now. Some of these are band-aids: they address this specific exploit but don't scale across hundreds of dependencies. Trusted publishers (item 3) is the exception: it eliminates the attack vector entirely.
1. Pin exact versions and verify hashes
Stop using loose version specifiers for infrastructure dependencies. Pin to exact versions and use hash verification:
pip install --require-hashes -r requirements.txt
Your requirements.txt should look like:
litellm==1.82.6 --hash=sha256:<known-good-hash>
You can grab the hash for any package version directly from PyPI at https://pypi.org/project/<package>/<version>/#files - click 'view details' next to the wheel file.
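If you'd rather verify a wheel you've already downloaded, a few lines of Python reproduce the hash that PyPI publishes:

import hashlib, sys

# Usage: python hash_check.py path/to/package.whl
with open(sys.argv[1], "rb") as f:
    print("sha256:" + hashlib.sha256(f.read()).hexdigest())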
2. Audit .pth files in your environments
Most developers don't realize .pth files can execute code every time the Python interpreter starts. While intended only for adding paths, they are often abused to run arbitrary scripts.
Run this command to find any .pth files in your Python site-packages directory that contain import or exec statements:
find $(python -c "import site; print(site.getsitepackages()[0])") -name "*.pth" -exec grep -El "import|exec" {} \;
What to look for: Any file that contains more than a simple directory path is a potential security or performance risk.
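To see the mechanism for yourself, the harmless demo below drops a .pth file into site-packages whose import line runs on every interpreter startup, which is exactly the behavior documented for CPython's site module (delete the file afterwards):

import pathlib, site

pth = pathlib.Path(site.getsitepackages()[0]) / "demo_warning.pth"
# Lines in a .pth file that start with "import" are executed at startup.
pth.write_text('import sys; sys.stderr.write("demo .pth executed\\n")\n')
# Every subsequent `python` run in this environment now prints that banner.
# Clean up with: pth.unlink()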
3. Use PyPI trusted publishers for your own packages
If you maintain a Python package, stop using stored API tokens or passwords to publish to PyPI. Use trusted publishers instead. This is an OIDC-based mechanism that ties your PyPI releases to a specific GitHub Actions workflow.
4. Compare distributed artifacts against source
Don't assume the package on PyPI matches the code on GitHub. For critical infra dependencies, compare them:
pip download <package>==<version> --no-deps -d /tmp/check # Unzip the wheel and diff against the tagged source
5. Run a private package mirror with an allowlist
For production deployments, pull packages through a private mirror or proxy (like devpi or Artifactory) that only serves vetted versions so you can block compromised versions before they reach your infrastructure.
How we do it at Mozilla.ai
At any-llm, releases are published to PyPI exclusively through GitHub Actions using PyPI trusted publishers. None of our maintainers holds a PyPI API token. The only path to PyPI is through our CI workflow, which uses OIDC-based authentication, meaning a compromised developer account cannot be used to publish a malicious package.
Migration is easy
If you are currently looking to move off LiteLLM, we've made the transition simple. any-llm is a drop-in replacement for OpenAI-compatible proxies.
Check out our 2-step Migration Guide here.
Your LLM gateway is your blast radius. Treat it with the same rigor you'd treat your database or your secrets manager, because in 2026 that's exactly what it is.
Who is Mozilla.ai?
Mozilla.ai is a public benefit startup and wholly-owned subsidiary of the Mozilla Foundation, operating with its own independent team. Our work focuses on AI technologies built around agency, access, and transparency. We share the Mozilla name and values, but we're a separate organization from Firefox, Thunderbird, and other Mozilla products.
Curious about the Mozilla family? Mozilla Foundation · Mozilla Corporation · Mozilla Ventures · Mozilla Data Collective · Firefox · Thunderbird
- Protecting people from harmful manipulation DeepMind Blog Mar 25, 2026 04:46 PM Google DeepMind releases new findings and an evaluation framework to measure AI's potential for harmful manipulation in areas like finance and health, with the goal of enhancing AI safety.
- Lyria 3 Pro: Create longer tracks in more DeepMind Blog Mar 25, 2026 04:01 PM We are bringing Lyria 3 to the tools where professionals work and create every day.
-
Build with Lyria 3, our newest music generation model Google AI Blog Mar 25, 2026 04:00 PM 1 min read Lyria 3 is now available in paid preview through the Gemini API and for testing in Google AI Studio.
Lyria 3 is now available in paid preview through the Gemini API and for testing in Google AI Studio. - Inside our approach to the Model Spec OpenAI Blog Mar 25, 2026 10:00 AM
- Introducing the OpenAI Safety Bug Bounty program OpenAI Blog Mar 25, 2026 12:00 AM
- A New Framework for Evaluating Voice Agents (EVA) Hugging Face Blog Mar 24, 2026 02:01 AM A Blog post by ServiceNow-AI on Hugging Face
- OpenClaw 3.24: Rehabilitation - Microsoft Teams Rewrite, 18 Breaking Changes, and a Developer Experience Leap OpenClaws.io Blog Mar 24, 2026 12:00 AM 3.22 was the surgery. 3.23 made sure the patient survived. 3.24: the lobster starts rehab. 18 breaking changes, 15 fixes, 343 commits. Microsoft Teams gets a full SDK rewrite - streaming replies, welc
-
cq: Stack Overflow for Agents Mozilla.ai Blog Mar 23, 2026 03:23 PM 7 min read cq explores a Stack Overflow for agents, a shared commons where agents can query past learnings, contribute new knowledge, and avoid repeating the same mistakes in isolation.
Side A: Turtles all the way down / Side B: Mo' tokens mo' problems

If you've been around long enough in anything you start to see history repeating: fashion trends come back around, humanity makes the same mistakes. In the field of computer science we see the same patterns: technology X is essentially the same idea as technology Y from 10 years ago, which was based on the idea for technology Z from 20 years ago. Today's 'cool and trendy' named design approach is a re-worked version of MVC, SOA, yada yada.
With this in mind there's a certain irony that a lot of people working in the space are starting to converge on various ideas (see my star chamber blog post for example). Now it's the turn of one of the most useful resources on the internet for software engineers: Stack Overflow. Born in 2008, peaking at over 200,000 questions a month by 2014. Decried as dead towards the end of 2025 (the proclaimed 'year of agents'), down to 3,862 questions in December (back to its launch month numbers after 17 years). The drop off started around the time ChatGPT launched. Who needs to share knowledge when ChatGPT/Claude/Gemini et al. "know everything"?
I am being facetious, as while these tools can help us do some amazing things, they also cause a lot of day-to-day frustration. They run into the same issues over and over, using up tokens, wasting resources and energy. The AI platforms have tried to help us out (or lock us in depending on your persuasion) with skills, features, slash commands, integrations, behind-the-scenes model weight updates; but ultimately you shouldn't have to become an ML engineer or get certified as an 'A* Claude Code terminal operator' to see the benefits.
Anyway, back to the story circa 2026:
- LLMs trained on the corpus of Stack Overflow
- LLMs via Agents committed matriphagy on Stack Overflow
- Agents run into the same issues over and over in isolation because their training data is stale etc.
- Agents now need their own Stack Overflow ... the cycle continues
And yes, I chose that word deliberately. Matriphagy; the offspring consuming the parent. Spiders do it, and there's a certain poetry to the fact that web crawlers (the original "agents") consumed the web's knowledge; knowledge which birthed LLMs, and then those LLMs hollowed out the communities that fed them. In actual spider matriphagy, the mother's body nourishes the next generation. Stack Overflow's corpus genuinely did nourish the LLMs. The question is whether the next generation builds something sustainable or just moves on to the next host.
Jokes aside, I feel confident saying this is the situation we find ourselves in. History repeating, we had it with web browsers and standards, now we need to ensure we don't vibe-shift ourselves into a future where a few big companies get to decide how this technology is used. Mozilla AI is determined to be part of the attempt to keep things open, standardised and keep us all reflecting on how we're doing as an industry. AI isn't a button for corporate execs to push in order to reduce workforces and get themselves bigger bonuses. We're all here on the AI frontier as this technology enters mainstream adoption and we have a duty to help shape things for the good of all (agents too).
We now return you to our regularly scheduled programming...
cq is derived from colloquy (/ˈkɒl.ə.kwi/), a structured exchange of ideas where understanding emerges through dialogue rather than one-way output. In radio, CQ is a general call ('any station, respond'). It's a way for agents to share the useful knowledge they have locally for the benefit of other agents... I think of it as Stack Overflow for agents!
Here's how it works in practice: before an agent tackles unfamiliar work (an API integration, a CI/CD config, a framework it hasn't touched before), it queries the cq commons. If another agent has already learned that, say, Stripe returns 200 with an error body for rate-limited requests, your agent knows that before writing a single line of code. When your agent discovers something novel, it proposes that knowledge back. Other agents confirm what works and flag what's gone stale. Knowledge earns trust through use, not authority.
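To make that flow concrete, here is a purely hypothetical sketch; cq's actual client API may look nothing like this, and every name below is invented:

from dataclasses import dataclass

@dataclass
class KnowledgeUnit:
    topic: str
    claim: str
    confirmations: int = 0

class Commons:
    def __init__(self):
        self.units: list[KnowledgeUnit] = []

    def query(self, topic: str) -> list[KnowledgeUnit]:
        # Confirmed knowledge outranks unverified single-agent guesses.
        hits = [u for u in self.units if topic in u.topic]
        return sorted(hits, key=lambda u: u.confirmations, reverse=True)

    def propose(self, unit: KnowledgeUnit) -> None:
        self.units.append(unit)

    def confirm(self, unit: KnowledgeUnit) -> None:
        unit.confirmations += 1   # trust earned through use, not authority

commons = Commons()
commons.propose(KnowledgeUnit("stripe/rate-limits",
                              "Stripe can return 200 with an error body when rate-limited"))
for unit in commons.query("stripe"):
    print(unit.claim, f"(confirmed {unit.confirmations}x)")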
Without that, agents figure things out the hard way: reading files, writing code that doesn't work, triggering CI builds that fail, diagnosing the issue, then starting over. Every agent hits the same wall independently, burning tokens and compute each time. That's the waste cq is designed to cut.
It's the reciprocal bit that makes this worth building. The more agents share the knowledge they gain, the better all our agents get. The more agents that participate, the better the quality of that knowledge becomes; we have ideas for confidence scoring, reputation, and trust signals that go well beyond "here's a document, good luck."
That trust piece matters. 84% of developers now use or plan to use AI tools, but 46% don't trust the accuracy of the output, up from 31% the year before. Engineers are using AI but they're not confident in it. cq can help with that. Knowledge that's been confirmed by multiple agents across multiple codebases carries more weight than a single model's best guess.
We started building this at the beginning of March, and recently saw confirmation of it through Andrew Ng's post asking whether there should be a Stack Overflow for AI coding agents. We agree with Andrew that this is worth building, and we want your feedback and input in shaping it.
cq is early in this space and we want to help form a standard for knowledge sharing between agents and how it's structured. We're looking at all aspects of the system that could support this, from quick demos and Proof of Concepts, to proposals and infrastructure ideas.
This isn't a one-horse race so early on. Not everyone is using Claude Code, Copilot, etc., and just as we shouldn't mandate workflows on engineers (commits must follow this exact format, only IDE Z is allowed), we shouldn't force engineers using AI to augment their work into a single coding agent. The current approach of updating .md files in repos and hoping for adherence only gets you so far. We need something dynamic, something that earns trust over time rather than relying on static instructions.
We're not writing whitepapers and waiting for consensus. We've built a working PoC that you can install and try today; there's a plugin for Claude Code and OpenCode, an MCP server that manages your local knowledge store, a team API for sharing across your org, UI for 'human-in-the-loop' review, and containers to spin the whole thing up. It's an early attempt by us to help folks get a flavour of what this could be; we want to iterate quickly on something real, not something theoretical.
Internally we're figuring out ways to start dogfooding this ourselves; using cq day-to-day across our own projects to build up knowledge units, find the friction, and figure out what actually matters when agents are sharing knowledge for real. The best way to learn what works is to use it.
A shared commons is just one layer of this. The feedback loops cq creates can surface things agents can't see in isolation; patterns across teams, gaps in tooling, friction that only becomes visible at scale. We're exploring where that leads and we're excited about what we're finding. More to come.
cq is open source and we're building it in the open. We want to hear from you; whether you're building agents, using agents, or just thinking about where all of this is heading. Come check out the repo, read the proposal, and tell us what you think.
Who is Mozilla.ai?
Mozilla.ai is a public benefit startup and wholly-owned subsidiary of the Mozilla Foundation, operating with its own independent team. Our work focuses on AI technologies built around agency, access, and transparency. We share the Mozilla name and values, but we're a separate organization from Firefox, Thunderbird, and other Mozilla products.
Curious about the Mozilla family? Mozilla Foundation · Mozilla Corporation · Mozilla Ventures · Mozilla Data Collective · Firefox · Thunderbird
- OpenClaw 3.23: Post-Surgery Recovery - Qwen DashScope, Auth Credential Overhaul, and 40+ Stability Fixes OpenClaws.io Blog Mar 23, 2026 12:00 AM 3.22 was the surgery. 3.23 makes sure the patient survives. 3 breaking changes, 40+ fixes. Qwen gets standard DashScope endpoints for China and global API keys. The auth credential system stops revert
- OpenClaw 3.22: Architecture Overhaul - 12 Breaking Changes, 30+ Security Fixes, and the Biggest Release Yet OpenClaws.io Blog Mar 22, 2026 12:00 AM 9 days of silence. 12 breaking changes. 30+ security hardening patches. 100+ stability fixes. ClawHub replaces npm as the default plugin source, Gateway cold starts drop from minutes to seconds, Windo
- Build a Domain-Specific Embedding Model in Under a Day Hugging Face Blog Mar 20, 2026 07:38 PM A Blog post by NVIDIA on Hugging Face
-
llamafile Reloaded: What's New in v0.10.0 Mozilla.ai Blog Mar 19, 2026 07:27 PM 3 min read llamafile 0.10.0 unifies portability and modern model features. Bundle weights, run multimodal models, and access tool calling and Anthropic Messages API support, all from a single executable.

We are happy to announce the release of llamafile 0.10.0.
Since our previous announcement, we've rebuilt llamafile from the ground up, following an approach that makes it far easier to keep pace with its upstream dependencies.
We started with a polyglot build of llama.cpp, so we could get the best of two worlds. On one side, the signature features that make llamafile what it is: portability across different systems and CPU architectures, plus the ability to bundle model weights directly into llamafile executables. On the other side, all the features and model support available in the latest versions of llama.cpp, so that now you can serve Qwen3.5 models for vision, lfm2 for tool calling, and use Anthropic Messages API to run Claude code with a local model, all of this by running a single executable file.
What can the new llamafile do?
We asked for your feedback and we hear you: what makes a llamafile isn't just an APE executable. So we've brought back more of llamafile's original features. Here's what you'll find in 0.10.0:
- APE executable running out-of-the-box on multiple OSes and CPU architectures
- Full llama.cpp server feature set, including recent models, multimodal support, tool calling, and the Anthropic Messages API
- Multimodal model support in the terminal chat
- Multiple UIs: CLI tool, HTTP server, and terminal chat interface
- Metal GPU support
- CUDA GPU support (currently tested on Linux)
- CPU optimizations for different architectures
- Whisperfile
Where can I get a llamafile?
We provide a few pre-built llamafiles for you to try here. We've selected a variety of models covering different capabilities (thinking, multimodal, tool calling) and sizes ranging from 0.6B to 27B parameters. But we don't want to be a bottleneck to your creativity, so we want you to experiment with different models and configurations!
If you already have model weights on your system, you can just download the main llamafile executable and load your GGUF files directly. The v0.10.0 llamafile and whisperfile executables are available here. Check out our documentation to see how to run them with pre-downloaded models. And if you are looking for an easier way to bundle your own llamafiles, here's a teaser image from llamafile-builder, an application we are building with this specific goal:

What next?
We have plenty of ideas for the future llamafile. Here's what we're currently working on:
- Feature parity with the older version of llamafile. We documented here some of the features we haven't caught up with yet. Let us know what you'd like prioritized!
- Easier bundling (see the teaser above): we want to see you experimenting with combinations of models and parameters we never thought of, and sharing them around!
- Vulkan support: check out one more teaser we left for you at the end of this post.
- And of course, finding and fixing any new issues we can spot.
What about the old llamafile?
If there's something you're missing from the old llamafile:
- Let us know! We want to build something that's useful for you.
- Check out previous builds: you can still download source code from older commits and binaries from previous releases.
- Look for older llamafiles: we're still hosting a wide range of older models on HuggingFace, and for each one we specify the llamafile version it was built with.
- Build your own: we'll be making it easier for you to build llamafiles with whatever version of the software you want.
⌠And last but not least, if you need another good reason to try the newer llamafiles:

-
Friend Bubbles: Enhancing Social Discovery on Facebook Reels Meta AI / Engineering Mar 18, 2026 06:19 PM 8 min read Friend bubbles in Facebook Reels highlight Reels your friends have liked or reacted to, helping you discover new content and making it easier to connect over shared interests. This article explains…
- Friend bubbles in Facebook Reels highlight Reels your friends have liked or reacted to, helping you discover new content and making it easier to connect over shared interests.
- This article explains the technical architecture behind friend bubbles, including how machine learning estimates relationship strength and ranks content your friends have interacted with to create more opportunities for meaningful engagement and connection.
Friend bubbles enhance the social experience on Facebook Reels by helping you discover content your friends enjoy, creating a shared viewing experience and sparking new conversations. With a quick tap on a bubble, you can start a one-on-one conversation with any friend who has engaged with that Reel.
This feature combines social and interest signals to recommend more relevant, personalized content while making it easier to start conversations with the people who matter most to you. When videos connect to both personal interests and friend-related interests, they create a feedback loop that improves recommendations and strengthens social connections.

An Overview of the Friend Bubbles System Architecture
The friend bubbles recommendation system includes several components that work together to surface relevant, friend-interacted content by blending video-quality signals with social-graph signals:
- Viewer-Friend Closeness (Whose Interactions Matter Most): Identifies which friends' interactions are most likely to interest the viewer.
- Video Relevance (What Videos to Show): Ranks videos that are contextually relevant to the viewer.
Multiple friend interactions on the same video often signal stronger shared interest and higher relevance. Content surfaced through friend connections also tends to be high quality, creating a reinforcing loop: Social discovery increases engagement, and that engagement further strengthens the social graph.

Viewer-Friend Closeness: Identifying Friends With User-User Closeness Models
Friend bubbles rely on two complementary machine learning models to identify which connections a person feels closest to. One model is based on user survey feedback; the other is based on on-platform interactions.
The survey-based closeness model draws on a broad set of signals, including social-graph features (mutual friends, connection strength, interaction patterns) and user attributes (behavioral and demographic signals such as user-provided location, number of friends, and number of posts shared) to build a more complete picture of real-world relationships.
It is trained on a regular cadence using a lightweight binary survey in which a randomly selected group of Facebook users is asked whether they feel close to a specific connection in real life. The survey is structured as a close vs. not-close prediction problem, refreshed regularly to keep labels current, and includes questions that act as proxies for offline relationship strength (such as how often two people communicate). In production, the model runs weekly inference over trillions of person-to-person connections across Facebook friends.
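As a schematic stand-in for the survey-based model (the feature set, weights, and model class below are assumptions, not Meta's production system), the setup amounts to a binary classifier over social-graph features with survey answers as labels:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Columns stand in for features like mutual friends, interaction rate,
# and message frequency; labels stand in for close/not-close survey answers.
X = rng.random((1000, 3))
y = (X @ np.array([0.5, 1.5, 2.0]) + rng.normal(0, 0.3, 1000) > 1.8).astype(int)

model = LogisticRegression().fit(X, y)
closeness = model.predict_proba(X[:5])[:, 1]   # P(close | features)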
While survey-based closeness provides a strong foundation, friend bubbles also use a context-specific closeness prediction model trained on on-platform activity signals, using real interactions that occur when bubbles are shown (for example, likes, comments and reshares). This enables the model to capture closeness in context: how likely a viewer is to value content recommended by someone in their friend graph, based on how they interact with each other on the platform.
Our approach emphasizes connection quality over quantity. While bubble prevalence naturally rises with larger friend graphs, showing more bubble videos does not necessarily increase user engagement. The goal is to surface the right friend connections, those most likely to make the social context meaningful, using a combination of existing closeness signals and surface-specific features that better reflect the relationship dynamics behind friend-driven recommendations.
Video Relevance: Making the Ranking System Friend-Content Aware
We use two key strategies to ensure high-quality, friend-interacted content can move through the recommendation funnel and reach users: expanding the top of the funnel, and enabling models to rank friend-bubble content effectively through a continuous feedback loop.

Sourcing Inventory: Expanding the Top of Funnel
The retrieval stage sources candidate videos based on close friends, as identified by the closeness model described above. By explicitly retrieving friend-interacted content, we expand the top of the funnel to ensure sufficient candidate volume for downstream ranking stages. This is important because, without it, high-quality friend content may never enter the ranking pipeline in the first place.
Enabling Models to Rank Friend Content Effectively Through a Continuous Feedback Loop
A key insight from our development process was understanding why friend-interacted videos sometimes struggled to rank highly: It wasn't because they were low quality, but because the model lacked user-user closeness context. Without that context, the model can't learn what makes friend content uniquely valuable: its relevance is often driven by relationship strength and social meaning rather than by the same signals that explain interest in more general content.
To address this gap, we integrated friend-bubble interaction signals as features and added new tasks into both early-stage and late-stage ranking multi-task, multi-label (MTML) models to incorporate viewer-friend relationship strength and to learn downstream engagement on videos with social bubbles. With these signals added across the ranking funnel, the models can better recognize the value of friend-interacted content, learn the relationship between closeness and viewer interest, and rank high-quality friend content higher when it is most relevant.
The system includes a continuous feedback loop in which friend-bubble interaction data flows back into model training. This loop helps the ranking system improve its understanding of which friend-content combinations resonate with users.
We augmented our existing video-ranking formula, which includes several optimization goals, with a friend-bubble ranking objective designed to maximize overall video engagement. We consider interaction metrics such as watch time, comments and likes, and use a conditional probability term, P(video engagement | bubble impression), to predict the likelihood that a user will engage with a video after seeing a friend bubble.
This is balanced with tunable weights that manage trade-offs between social interaction and video engagement, allowing us to optimize for social connection (helping people discover videos their friends like) and content quality. This dual optimization captures the core value proposition of the friend-content ranking system: enabling effortless connection through passive friend discovery, delivering entertainment through relevant content, and strengthening relationships by turning shared videos into natural touchpoints for conversation.
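Schematically, the augmented formula looks like the sketch below; the weights and probability names are illustrative, not production values:

def rank_score(p_watch, p_comment, p_like, p_engage_given_bubble, has_bubble,
               w_watch=1.0, w_comment=0.5, w_like=0.3, w_social=0.8):
    # Existing multi-objective score...
    base = w_watch * p_watch + w_comment * p_comment + w_like * p_like
    # ...plus a tunable friend-bubble term, P(video engagement | bubble impression).
    social = w_social * p_engage_given_bubble if has_bubble else 0.0
    return base + social

score = rank_score(0.6, 0.05, 0.2, p_engage_given_bubble=0.35, has_bubble=True)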
Client Infra Behind the Scenes: Performance at Reels Scale
Reels is a performance-sensitive surface, so adding new per-video metadata isn't as simple as adding another field. If it increases work during scrolling or delays playback, it can hurt the core user experience. When we integrated friend bubbles, we treated three constraints as nonnegotiable:
- Smooth scrolling
- No regressions in load latency
- Low CPU overhead for metadata fetch and processing
Facebook's video delivery system already performs significant prefetch work ahead of playback. It preloads metadata, thumbnails and buffered content before a video reaches the viewport. We pinned friend-bubble metadata retrieval to that same prefetch window, which gave us several benefits: We could reuse cached results for stable data, avoid redundant CPU work, and limit wasted network requests by using an already optimized fetch path.
Because the bubble data arrived alongside the video content, we could render bubbles at the same time as the video itself, eliminating mid-playback UI updates and redraws.
We also made animation strictly conditional. During active scrolling and interaction, animation is disabled to preserve scroll responsiveness. On low-end devices where even idle animation could compromise performance, we turn it off entirely. Along with additional optimizations in the underlying method, this approach enabled us to ship friend bubbles while preserving core Reels performance.
Why the Metadata Has to Earn Its Place
A cleaner user interface is usually better, and new metadata can backfire if it adds noise or slows the experience. Friend bubbles work because the signal is high value: It adds meaningful social context that helps people decide what's worth watching.
By setting a conservative threshold for which friends are eligible to appear, we ensure bubbles show up only when the relationship signal, as determined by the user-user closeness model, is strong. That approach reduces clutter while improving the viewing experience overall, reflected in increased video watch time.
The Impact and Future of Friend Bubbles
Friend bubbles improve content relevance and engagement quality. In user feedback surveys, bubble-annotated videos consistently receive higher interest scores and more positive sentiment ratings than videos without bubbles.
Beyond relevance, bubbles improve app-session quality, not just quantity. Users who see bubbles spend more time actively watching and interacting with content, with growth concentrated in longer sessions rather than brief check-ins. The improvements come primarily from deeper video consumption. Bubble-related signals show a delayed effect on longer-term engagement patterns, suggesting repeated exposure to content friends have interacted with builds sustained interest over time.
By surfacing content friends have engaged with, bubbles also expose users to a broader range of topics and creators than they would otherwise encounter organically. Users don't just passively scroll past this content; they actively engage through likes, comments, shares and follows, indicating friend-recommended content can resonate even when it falls outside their typical interests.
Not all friend signals are equal. Bubbles triggered by expressive reactions such as love or laughter drive stronger downstream engagement than simple likes, particularly for comments and private shares, suggesting expressive reactions signal stronger resonance. Engagement also scales consistently with the number of friend bubbles shown, meaning videos with multiple friend interactions tend to perform better.
Next, we're scaling the system to increase impact and robustness by expanding friend-driven recommendations (while preserving quality) to additional surfaces and inventory, improving cold start for people with limited friend graphs, and refining ranking and feedback signals for better personalization.
Ultimately, this architecture illustrates how machine learning can strengthen human connection at scale, helping people discover shared interests and making it easier to start conversations with the people who matter most. When your friends enjoy something great, you can discover it, too, and you're only a tap away from talking about it together.
For more information about Facebook Bubbles, visit the Meta Newsroom.
The post Friend Bubbles: Enhancing Social Discovery on Facebook Reels appeared first on Engineering at Meta.
-
Ranking Engineer Agent (REA): The Autonomous AI Agent Accelerating Meta's Ads Ranking Innovation Meta AI / Engineering Mar 17, 2026 08:07 PM 8 min read Meta's Ranking Engineer Agent (REA) autonomously executes key steps across the end-to-end machine learning (ML) lifecycle for ads ranking models. This post covers REA's ML experimentation capabilit…
- Meta's Ranking Engineer Agent (REA) autonomously executes key steps across the end-to-end machine learning (ML) lifecycle for ads ranking models.
- This post covers REA's ML experimentation capabilities: autonomously generating hypotheses, launching training jobs, debugging failures, and iterating on results. Future posts will cover additional REA capabilities.
- REA reduces the need for manual intervention. It manages asynchronous workflows spanning days to weeks through a hibernate-and-wake mechanism, with human oversight at key strategic decision points.
- In its first production rollout, REA delivered:
- 2x Model Accuracy: REA-driven iterations doubled average model accuracy over baseline across six models.
- 5x Engineering Output: With REA-driven iteration, three engineers delivered proposals to launch improvements for eight models, work that historically required two engineers per model.
The Bottleneck in Traditional ML Experimentation
Meta's advertising system delivers personalized experiences to billions of people across Facebook, Instagram, Messenger, and WhatsApp. Powering these interactions are highly complex, massively distributed machine learning (ML) models that continuously evolve to serve both advertisers and people who use the platforms.
Optimizing these ML models has traditionally been time-consuming. Engineers craft hypotheses, design experiments, launch training runs, debug failures across complex codebases, analyze results and iterate. Each full cycle can span days to weeks. As Meta's models have matured over the years, finding meaningful improvements has become increasingly challenging. The manual, sequential nature of traditional ML experimentation has become a bottleneck to innovation.
To address this, Meta built the Ranking Engineer Agent, an autonomous AI agent designed to drive the end-to-end ML lifecycle and iteratively evolve Metaâs ads ranking models at scale.
Introducing REA: A New Kind of Autonomous Agent
Many AI tools used in ML workflows today function as assistants: They are reactive, task-scoped and session-bound. They can help with individual steps (for example, drafting a hypothesis, writing configuration files, interpreting logs), but they typically cannot run an experiment end to end. An engineer still has to decide what to do next, re-establish context, drive progress across long-running jobs, and debug inevitable failures.
REA is different: an autonomous agent built to drive the end-to-end ML lifecycle, coordinating and advancing ML experiments across multiday workflows with minimal human intervention.
REA addresses three core challenges in autonomous ML experimentation:
- Long-Horizon, Asynchronous Workflow Autonomy: ML training jobs run for hours or days, far beyond what any session-bound assistant can manage. REA maintains persistent state and memory across multiround workflows spanning days or weeks, staying coordinated without continuous human supervision.
- High-Quality, Diverse Hypothesis Generation: Experiment quality is only as good as the hypothesis that drives it. REA synthesizes outcomes from historical experiments and frontier ML research to surface configurations unlikely to emerge from any single approach, and it improves with every iteration.
- Resilient Operation Within Real-World Constraints: Infrastructure failures, unexpected errors and compute budgets can't halt an autonomous agent. REA adapts within predefined guardrails, keeping workflows moving without escalating routine failures to humans.
REA addresses these challenges through a Hibernate-and-Wake Mechanism for continuous multiweek operation, a Dual-Source Hypothesis Engine that combines a historical insights database with a deep ML research agent, and a Three-Phase Planning Framework (Validation → Combination → Exploitation) that operates within engineer-approved compute budgets.
How REA Manages Multi-Day ML Workflows Autonomously
REA is built around a core insight: Complex ML optimization isn't a single task. It is a multistage process that unfolds over days or weeks. The agent must reason, plan, adapt and persist across this entire horizon.
Long-Horizon Workflow Autonomy
Traditional AI assistants operate in short bursts, responding to prompts and then waiting for the next query. ML experimentation doesnât work that way. Training jobs run for hours or days, and the agent must remain coordinated across these extended timelines.
REA uses a hibernate-and-wake mechanism. When the agent launches a training job, it delegates the wait to a background system, shuts down to conserve resources, and automatically resumes where it left off when the job completes. This enables efficient, continuous operation across extended time frames without requiring constant human monitoring.
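A minimal sketch of what such a hibernate-and-wake loop might look like, assuming a JSON file as the persistent state store and a callback-based job scheduler; none of these names come from Meta's (non-public) implementation:

```python
# Hypothetical hibernate-and-wake skeleton: persist state, exit while the
# job runs, resume from the saved state when the scheduler calls back.
import json
from pathlib import Path

STATE_FILE = Path("rea_state.json")

def load_state() -> dict:
    return (json.loads(STATE_FILE.read_text()) if STATE_FILE.exists()
            else {"phase": "plan", "round": 0})

def launch_and_hibernate(job_spec: dict) -> None:
    """Record the pending job, then exit: nothing stays resident while
    the training job runs for hours or days."""
    state = load_state()
    state.update(phase="awaiting_job", job_spec=job_spec)
    STATE_FILE.write_text(json.dumps(state))
    # An external scheduler (hypothetical) is asked to invoke
    # on_job_complete() when the job finishes; the agent process exits.

def on_job_complete(results: dict) -> None:
    """The 'wake' path: restore context and continue where we left off."""
    state = load_state()
    state.update(phase="analyze", results=results, round=state["round"] + 1)
    STATE_FILE.write_text(json.dumps(state))
    # ...analysis and the next planning step proceed from here...
```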
To support this, Meta built REA on an internal AI agent framework, Confucius, designed for complex, multistep reasoning tasks. It provides strong code generation capabilities and a flexible SDK for integrating with Meta's internal tooling systems, including job schedulers, experiment tracking infrastructure and codebase navigation tools.
High-Quality, Diverse Hypothesis Generation
The quality of the hypothesis directly determines the quality of an ML experiment. REA consults two specialized systems to generate diverse, high-quality ideas:
- Historical Insights Database: A curated repository of past experiments that enables in-context learning and pattern recognition across prior successes and failures.
- ML Research Agent: A deep research component that investigates baseline model configurations and proposes novel optimization strategies, using Meta's historical insights database.
By synthesizing insights from both sources, REA surfaces configurations unlikely to emerge from any single approach in isolation. REA's most impactful improvements have combined architectural optimizations with training-efficiency techniques, a result of this cross-system methodology.
Resilient Execution Within Real-World Constraints
Real-world experimentation operates under compute constraints and inevitable failures. REA addresses both through structured planning and autonomous adaptation.
Before executing any plan, REA proposes a detailed exploration strategy, estimates total GPU compute cost, and confirms the approach with an engineer. A typical multiphase plan proceeds through three stages (sketched in code after the list):
- Validation: Individual hypotheses from different sources are tested in parallel to establish quality baselines.
- Combination: Promising hypotheses are combined to search for synergistic improvements.
- Exploitation (Intensive Optimization): The most promising candidates are explored aggressively to maximize results within the approved compute budget.
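Here is one way the budget-gated, three-phase loop could be expressed; run_experiment is a hypothetical callable standing in for REA's job-launching machinery, and the phase logic is a simplification of what the post describes:

```python
# Sketch of a budget-gated Validation -> Combination -> Exploitation plan.
# run_experiment(config) -> (metric, gpu_hours) is a hypothetical stand-in.
def execute_plan(hypotheses, budget_gpu_hours, run_experiment, top_k=3):
    spent, scores = 0.0, {}

    def run(config):
        nonlocal spent
        metric, cost = run_experiment(config)
        spent += cost
        scores[config] = metric

    # Phase 1 - Validation: test individual hypotheses for quality baselines.
    for h in hypotheses:
        if spent >= budget_gpu_hours:
            return scores              # hard guardrail: never exceed budget
        run((h,))

    # Phase 2 - Combination: pair the most promising hypotheses.
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    for i, a in enumerate(best):
        for b in best[i + 1:]:
            if spent >= budget_gpu_hours:
                return scores
            run(a + b)

    # Phase 3 - Exploitation: pour remaining budget into the current leader,
    # e.g., longer training runs or finer-grained variants of its config.
    while spent < budget_gpu_hours:
        run(max(scores, key=scores.get))
    return scores
```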
When REA encounters failures (such as infrastructure issues, unexpected errors, or suboptimal results), it adjusts the plan within predefined guardrails instead of waiting for human intervention. It consults a runbook of common failure patterns, makes prioritization decisions (such as excluding jobs with clear out-of-memory errors or loss explosions that signal training instability), and debugs preliminary infrastructure failures from first principles. This resilience is critical for maintaining autonomy over long-horizon tasks, where engineers provide periodic oversight rather than continuous monitoring.
REA operates with rigorous safeguards. It works exclusively on Meta's ads ranking model codebase. Engineers grant explicit access controls through preflight checklist reviews, and REA confirms compute budgets up front, halting or pausing runs when thresholds are reached.
The REA System Architecture

The Ranking Engineer Agent is built on two interconnected components, REA Planner and REA Executor, supported by a shared Skill, Knowledge and Tool System that provides ML capabilities, historical experiment data, and integrations with Meta's internal infrastructure. Together, they directly enable the agent's three core capabilities.
Long-Horizon Autonomy is powered by the execution flow: An engineer collaborates with the hypothesis generator to create a detailed experiment plan through the REA Planner. That plan is exported to the REA Executor, which manages asynchronous job execution through an agent loop, entering a wait state during training runs and resuming with results upon completion rather than requiring continuous human monitoring across multiweek workflows.
High-Quality, Diverse Hypothesis Generation is driven by the knowledge flow: As the executor completes experiments, a dedicated experiment logger records outcomes, key metrics, and configurations into a centralized hypothesis experiment insight database. This persistent memory accumulates knowledge across the full history of the agent's operation. The hypothesis generator draws on these insights to identify patterns, learn from prior successes and failures, and propose increasingly sophisticated hypotheses for each subsequent round, closing the loop and compounding the system's intelligence over time.
Resilient Execution is maintained across both flows: When the executor encounters failures (infrastructure errors, out-of-memory signals, or training instability), it consults a runbook of common failure patterns and applies prioritization logic to adapt autonomously within predefined guardrails. It then resumes the planner with actionable results rather than surfacing routine interruptions to engineers.
Impact: Model Accuracy and Engineering Productivity
2x Model Accuracy Over Baseline Approaches
In the first production validation across a set of six models, REA-driven iterations doubled average model accuracy over baseline approaches. This translates directly to stronger advertiser outcomes and better experiences on Meta platforms.
5x Engineering Productivity Gains
REA amplifies impact by automating the mechanics of ML experimentation, enabling engineers to focus on creative problem-solving and strategic thinking. Complex architectural improvements that previously required multiple engineers over several weeks can now be completed by smaller teams in days.
Early adopters using REA increased their model-improvement proposals from one to five in the same time frame. Work that once took two engineers per model now takes three engineers across eight models.
The Future of Human-AI Collaboration in ML Engineering
REA represents a shift in how Meta approaches ML engineering. By building agents that can autonomously manage the entire experimentation lifecycle, the team is changing the structure of ML development: moving engineers from hands-on experiment execution toward strategic oversight, hypothesis direction, and architectural decision-making.
This new paradigm, where agents handle iterative mechanics while humans make strategic decisions and final approvals, is just the beginning. Privacy, security, and governance remain key priorities for the agent. Meta continues to enhance REAâs capabilities by fine-tuning specialized models for hypothesis generation, expanding analysis tools, and extending the approach to new domains.
Acknowledgements
Ashwin Kumar, Harpal Bassali, Shashank Ankit, Deepak Chandra, Chaorong Chen, Wenlin Chen, Vitor Cid, Peter Chu, Xiaoyu Deng, Jingyi Guan, Junhua Gu, Liquan Huang, Qinjin Jia, Santanu Kolay, Jakob Moberg, Shweta Memane, Jp Owed, Sandeep Pandey, Vijay Pappu, Shyam Rajaram, Ben Schulte, Jags Somadder, Matt Steiner, Ritwik Tewari, Hangjun Xu, Zhaodong Wang, Fan Yang, Xin Zhao, Zoe Zu
The post Ranking Engineer Agent (REA): The Autonomous AI Agent Accelerating Meta's Ads Ranking Innovation appeared first on Engineering at Meta.
-
When Shipping Becomes Too Easy Mozilla.ai Blog Mar 17, 2026 05:30 PM 7 min read AI is changing product development. When building becomes effortless, the real constraint is no longer code. It's clarity, product judgment, and knowing when the right decision is not to ship yet.
When the hardest part of building shifts, so does leadership

We have gotten very good at building software. We have not gotten equally good at deciding what to build, or whether to build it at all. That gap is the most underrated product risk of this moment. A few days ago, my colleague Alejandro explored one side of this in his essay Owning Code in the Age of AI. His core observation is simple but important: code is no longer scarce.
AI systems can generate in minutes what once took days or weeks of engineering work. The constraint is no longer writing code. The constraint is understanding and operating the systems we create. Alejandro argues that engineering ownership is shifting from authorship to stewardship. Engineers may no longer write every line, but they remain responsible for how the system behaves. This perspective echoes ideas long discussed in the Site Reliability Engineering community, where reliability is treated as a property of systems rather than of individual lines of code.
Reading his piece, I kept thinking about the same shift from a product perspective. If code is becoming abundant, shipping is becoming almost effortless. And that changes product management as much as it changes engineering.
Over the past few months, I have noticed something new emerging in many teams: the velocity high. With AI-assisted development and increasingly powerful tooling, features can appear at a pace that would have seemed unrealistic not long ago. Shipping software feels good. It creates momentum and a sense of progress. But speed has a psychological effect: it becomes addictive. The faster a team can ship, the more tempting it becomes to ship again. Over time, velocity gradually becomes the one metric, even when nobody explicitly says it is. And then it becomes self-justifying. Features ship, metrics look good, and the structural problems accumulate little by little, until they don't. The churn that follows is not random. It is the predictable output of a system optimised for the appearance of speed.
The subtle danger is that the ability to produce software faster starts shaping what we choose to build.
Not necessarily because the ideas are better, but because they are easier to ship. Anyone who has worked in product long enough has seen features shipped quickly and confidently, only for it to become clear weeks later that they were not the right thing to build. As our ability to ship accelerates, the risk is that we multiply these mistakes faster than we multiply good decisions.
There is a narrative emerging that AI will reduce the need for product management. If engineers can prototype quickly and test ideas directly, perhaps the discipline becomes lighter. I increasingly believe the opposite. When the cost of building collapses, the cost of building the wrong thing increases dramatically. If shipping becomes frictionless, the real scarcity moves elsewhere: clarity of intent, product judgment, and long-term coherence. Product management becomes less about leading and coordinating work and more about protecting direction.
There is a question nobody asks out loud enough: who actually has the standing to slow things down? In teams where pressure runs toward shipping, where velocity is visible and quality is diffuse, the answer is often nobody, not because people do not care, but because the incentive structure has no mechanism for the kind of sustained, unglamorous resistance that good product work requires. AI does not create that dysfunction. It removes one of its natural governors: the fact that writing software used to take time. Naming that is a leadership responsibility. It does not fix itself.
One of the tensions I am noticing is that the ability to ship quickly does not reduce the rest of the product work. If anything, it makes it easier to overlook it. Product management is often mistakenly equated with the ability to move features forward and get them released. But much of the work happens elsewhere: making sure users understand what has changed, that support teams are prepared, that pricing and positioning make sense, that the product remains coherent, and ultimately that users trust what has been built. When shipping accelerates, these responsibilities do not disappear. And the risk is not just that users are confused or under-supported. It is that they absorb the cost of your speed, through broken workflows, lost trust, or data they did not expect to expose. That is not a product quality problem. It is an ethical one.
My friend and former colleague Davide has a name for this dynamic. In a recent essay, he describes what he calls "product management engineering": the gradual convergence between product and engineering responsibilities. Product managers increasingly need to understand the systems they guide, while engineers increasingly participate in shaping product decisions. The acceleration brought by AI reinforces this dynamic. If engineers are becoming stewards of system behavior rather than authors of code, product leaders are becoming stewards of the coherence of the systems we build over time. The tools are accelerating. The thinking remains human.
What worries me most about the current moment is not the technology itself, but the culture that may emerge around it. The same velocity that excites product teams also creates pressure to move before we fully understand the consequences of what we are building.
This concern becomes particularly serious when we look beyond consumer software. Over the past months, debates about the integration of AI systems into military infrastructures have intensified, while conflicts around the world continue to escalate. But we do not need to go as far as defense to feel the weight of this. Regulators across every major jurisdiction are already drawing lines, and the contrast between their approaches is instructive.
The EU AI Act defines categories of high-risk systems with strict pre-market obligations. In the United States, the absence of a comprehensive federal law has produced a growing patchwork of state requirements: California, Texas, Illinois and others, for instance, have enacted significant AI legislation taking effect in 2026, with over 1,000 AI-related bills introduced across states in 2025 alone. The federal government, meanwhile, is pulling in the opposite direction: Executive Order 14179, issued in January 2025, reoriented U.S. AI policy toward promoting innovation and revoked portions of the Biden administration order that emphasized safety testing and reporting requirements. The result is not deregulation. It is legal uncertainty, which, for a product team, is arguably worse.
Then there is China, which tends to be dismissed in these conversations. It should not be. China recently launched a public consultation on a proposed law on AI anthropomorphism that, whatever its political context, is strikingly specific: it defines the risks of emotional dependency, establishes concrete design obligations for providers, mandates mental health protections, and holds providers responsible for the security of their systems across the entire product lifecycle. Whether you agree with the framework or not, it offers something most Western regulation does not yet: clarity about what "responsible by design" actually means in practice.
The question for any product team shipping fast is not which regulatory regime applies to you today. It is whether the decisions you are making now will hold up when the rules catch up. Speed does not suspend legal exposure. In many cases, it increases it. The idea that we can simply "ship fast and iterate" is not just strategically risky. In certain domains, it is no longer defensible, legally or otherwise.
One underrated antidote is simpler than any process: as much as possible, use what you ship. Not as a ritual, but as an operating constraint. If your team cannot or does not use the product they are building in their real work, they have lost something important: the lived experience of what their decisions actually produce. That friction is not a bug. It is a signal. Dogfooding does not slow teams down. It makes the right things visible before users do.
Alejandro concludes that in a world of abundant code, the scarce resource becomes understanding and reliability. I think the same applies to product management. When software becomes easy to build and easy to ship, the scarce resource becomes judgment: the discipline to ask the right questions, to understand the systems we are creating, and sometimes to decide that the most responsible product decision is not to ship something yet.
Technology is not waiting for us to get comfortable with it. That is not a product management problem. It is a leadership one.
- State of Open Source on Hugging Face: Spring 2026 Hugging Face Blog Mar 17, 2026 04:37 PM A Blog post by Hugging Face on Hugging Face
- Measuring progress toward AGI: A cognitive framework DeepMind Blog Mar 17, 2026 04:03 PM Google DeepMind proposes a cognitive framework to evaluate AGI and launches a Kaggle hackathon to build capability benchmarks
- OpenClaw 3.13: Mobile Redesign, 2x Memory Fix, and 70+ Stability Patches OpenClaws.io Blog Mar 16, 2026 12:00 AM No headline feature this time. Instead: a memory regression that doubled RAM usage is fixed, Android and iOS get real attention, agents stop tripping over their own context, and 70+ patches land acros
-
Patch Me If You Can: AI Codemods for Secure-by-Default Android Apps Meta AI / Engineering Mar 13, 2026 04:00 PM 2 min read Even seemingly simple engineering tasks, like updating an API, can become monumental undertakings when you're dealing with millions of lines of code and thousands of engineers, especially i…
Even seemingly simple engineering tasks, like updating an API, can become monumental undertakings when you're dealing with millions of lines of code and thousands of engineers, especially if the changes are security-related. Nowhere is this more apparent than in mobile security, where a single class of vulnerability can be replicated across hundreds of call sites scattered throughout a sprawling, multi-app codebase serving billions of users.
Meta's Product Security team has developed a two-pronged strategy to address this:
- Designing secure-by-default frameworks that wrap potentially unsafe Android OS APIs and make the secure path the easiest path for developers, and
- Leveraging generative AI to automate the migration of existing code to those frameworks at scale.
The result is a system that can propose, validate, and submit security patches across millions of lines of code with minimal friction for the engineers who own them.
On this episode of the Meta Tech Podcast, Pascal Hartig talks to Alex and Tanu from Meta's Product Security team about the challenges and learnings from the journey of making Meta's mobile frameworks more secure at a scale few companies ever experience. Tune in to this episode and join us as we explore the compelling crossroads of security, automation, and AI within mobile development.
You can also find the episode wherever you get your podcasts.
The Meta Tech Podcast is a podcast brought to you by Meta, where we highlight the work Meta's engineers are doing at every level, from low-level frameworks to end-user features.
Send us feedback on Instagram, Threads, or X.
And if you're interested in learning more about career opportunities at Meta, visit the Meta Careers page.
The post Patch Me If You Can: AI Codemods for Secure-by-Default Android Apps appeared first on Engineering at Meta.
- OpenClaw 3.11 & 3.12: Dashboard Rewrite, Fast Mode, and 8 Security Fixes You Should Care About OpenClaws.io Blog Mar 13, 2026 12:00 AM OpenClaw 3.11 and 3.12 ship a rebuilt Control UI, GPT-5.4 and Claude fast mode toggles, first-class Ollama onboarding, multimodal memory with Gemini embeddings, Kubernetes manifests, and 8 security ad
- Mar 12, 2026 Announcements Anthropic invests $100 million into the Claude Partner Network Anthropic News Mar 12, 2026 12:00 AM
- Mar 11, 2026 Announcements Introducing The Anthropic Institute Anthropic News Mar 11, 2026 12:00 AM We're launching The Anthropic Institute, a new effort to confront the most significant challenges that powerful AI will pose to our societies.
-
Federated Phishing Detection: Training a URL Classifier without Sharing Browsing Data Mozilla.ai Blog Mar 10, 2026 01:00 PM 2 min read Mozilla.ai joins Flower Hub as a launch partner with fed-phish-guard, a federated phishing detection project. The classifier trains across distributed clients and shares only model updates, allowing c

At Mozilla.ai, we believe useful machine intelligence shouldn't require centralizing sensitive user data. Federated learning offers a practical path toward collective intelligence without surveillance-style data aggregation. That's why we're excited to be a launch partner of Flower Hub, helping move federated learning from research into real-world applications.
The Problem
APWG recorded over 3.8 million phishing attacks in 2025, averaging over 10,000 new phishing URLs per day. To catch the latest threats, classifier models need to see what URLs people are actually clicking on. Centrally collecting those clicks means asking users to hand over their entire digital paper trail. Federated learning offers a different approach: train the model where the data already lives, and share only the learned weights.
How fed-phish-guard works
fed-phish-guard is a Flower app that trains a phishing URL classifier across distributed clients without any raw URL data leaving their respective devices. It uses byte-level encoding (treating each URL as a sequence of raw bytes rather than subword tokens) so the model catches the character-level tricks attackers rely on, like swapping l for 1 in paypa1.com or hiding the real domain in a subdomain. These byte sequences feed into a 1D CNN that extracts local patterns regardless of where they appear in the string.
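For example, byte-level encoding is just the URL's raw UTF-8 byte values, padded to a fixed length before hitting the CNN (the exact length and padding scheme below are assumptions, not necessarily what fed-phish-guard uses):

```python
# Byte-level URL encoding: each character becomes its raw byte value, so
# "paypa1.com" and "paypal.com" differ in exactly one input position.
def encode_url(url: str, max_len: int = 200) -> list[int]:
    data = url.encode("utf-8")[:max_len]             # truncate long URLs
    return list(data) + [0] * (max_len - len(data))  # zero-pad to fixed size

print(encode_url("paypa1.com")[:10])
# [112, 97, 121, 112, 97, 49, 46, 99, 111, 109]
```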
For training, we use Federated Averaging: in each round, a subset of clients trains locally on their private data for one epoch, then sends only the updated model weights (not raw data) to the server. The server averages these weight updates (weighted by the number of training examples each client contributed) to produce a new global model, which is then broadcast back to clients for the next round.
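The server-side step is the standard FedAvg update, a per-example weighted average of the clients' weights; a minimal NumPy sketch:

```python
import numpy as np

def fedavg(client_weights: list[list[np.ndarray]],
           num_examples: list[int]) -> list[np.ndarray]:
    """FedAvg: average each parameter array across clients, weighted by
    how many training examples each client contributed."""
    total = sum(num_examples)
    n_layers = len(client_weights[0])
    return [
        sum((n / total) * layers[k]
            for n, layers in zip(num_examples, client_weights))
        for k in range(n_layers)
    ]
```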
For simulation, the dataset is split using IID (Independent and Identically Distributed) partitioning, where each client receives a random, balanced subset of URLs across both classes. This represents an idealized scenario, as real-world deployments may exhibit non-IID patterns (e.g., users in different regions encountering different phishing campaigns). Flower's simulation mode runs all clients as Ray actors on a single machine, enabling rapid experimentation without physical infrastructure, while deployment mode distributes clients across actual devices or servers for production use.
Results
With default settings (10 clients, 3 rounds, IID partitioning, ~83K URLs per client):
| Metric | Federated approach | Centralized baseline |
|--------|--------------------|----------------------|
| Accuracy | 95.2% | 98.4% |
| F1 Score | 94.8% | 98.3% |
| ROC-AUC | 98.9% | 99.8% |

Note: federated results will vary by number of rounds, clients, and data distribution. As a comparison, our reference centralized baseline reaches +3.2% accuracy after 20 epochs, but requires pooling all browsing data.
Try it yourself
```
pip install flwr
flwr new @mozilla-ai/fed-phish-guard
cd fed-phish-guard && pip install -e .
flwr run .
```
This project is public on Flower Hub, where you can find our source code as well as details about model architecture and dataset. Feedback, new ideas and contributions are more than welcome!
- Mar 10, 2026 Announcements Sydney will become Anthropic's fourth office in Asia-Pacific Anthropic News Mar 10, 2026 12:00 AM Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
- OpenClaw 3.8: ACP Provenance, Brave Search Integration, and the Case for Moving Fast OpenClaws.io Blog Mar 10, 2026 12:00 AM OpenClaw 3.8 lands ACP Provenance for agent identity verification, Brave LLMContext for AI-native web search, Podman/SELinux auto-detection, and a leaner Docker image. Some say OpenClaw moves too fast
- From games to biology and beyond: 10 years of AlphaGo's impact DeepMind Blog Mar 09, 2026 01:52 PM Ten years since AlphaGo, we explore how its search and learning methods are catalyzing scientific discovery and paving a path to AGI.
- OpenClaw 3.7: Day-One Model Support, Multilingual Expansion, and 200+ Bug Fixes OpenClaws.io Blog Mar 09, 2026 12:00 AM OpenClaw 3.7 ships with first-day support for GPT-5.4 and Gemini 3.1 Flash-Lite, adds Spanish and German to the Control UI, introduces the ContextEngine plugin architecture, and squashes over 200 bugs
- ContextEngine Deep Dive: How OpenClaw 2026.3.7 Turned Context Management into a Plugin OpenClaws.io Blog Mar 09, 2026 12:00 AM OpenClaw 2026.3.7 introduces ContextEngine, a plugin slot that gives third-party developers full control over how agent context is ingested, assembled, and compacted. This article explains the archite
-
Owning Code in the Age of AI Mozilla.ai Blog Mar 06, 2026 05:56 PM 6 min read AI lets engineers generate thousands of lines of code in minutes. But humans still reason about systems slowly. That gap forces a rethink of ownership, reliability, and where safety really lives in mo

Software engineering is going through a shift that feels small on the surface but changes something fundamental: code is no longer scarce.
For decades, writing software was constrained by human typing speed and cognitive load. Engineers produced code at roughly the same pace they could understand it. That relationship shaped our entire culture: code reviews, ownership models, testing philosophies, and even how we thought about responsibility.
AI breaks that balance.
Today a single engineer can generate thousands of lines of code in minutes. Features that once took days can appear in an afternoon. Small teams suddenly move at the speed that used to require entire organizations. And the uncomfortable reality is this: not using AI is no longer a real option. A team that refuses AI assistance will simply move slower than a team that embraces it.
But this acceleration raises a question I keep coming back to. If AI is producing most of the code, what does it mean to "own" it?
The Illusion of Code Ownership
Engineering culture has long tied ownership to authorship. You wrote the code, therefore you understand it. You understand it, therefore you are responsible for it.
Even before AI, that was already a partial illusion. Most systems already contain enormous amounts of code nobody on the team truly wrote or fully understood: frameworks, libraries, generated code, boilerplate, copied patterns.
But I think AI makes the illusion different in kind, not just in degree. Frameworks and libraries gave you a legible contract. You didn't write them, but you understood what they did, what they didn't do, and roughly where they'd fail. The abstraction was something you could reason about. You outsourced execution, not reasoning.
With AI-generated code, the contract is implicit and probabilistic. You don't know what assumptions the model made, what edge cases it missed, or why it structured things the way it did. It isn't boilerplate. It's novel logic you didn't author and may not fully understand. When an engineer prompts a model, reviews the result for a few minutes, and merges it, they are no longer acting as the author of the code. They are acting as something closer to a reviewer, architect, and integrator.
The role is shifting from writing software to approving systems. And I'm not sure our ownership models have caught up to that.
The Speed Gap
The real tension created by AI coding is not authorship. It is speed.
AI can produce code much faster than humans can reason about it. A developer might once write 200 lines of code in a day and understand each decision deeply. Now they may generate 5000 lines in an hour. Reviewing that output does not mean truly understanding it.
This creates a growing gap between code production and code comprehension. Historically, these two moved together. Now they are decoupled. That gap forces engineering teams to rethink where reliability comes from.
And one natural reaction is to say: fine, we will rely more on tests. But AI writes tests too. If the same system generates both the implementation and the tests, those tests may only validate the model's own assumptions. They become another generated artifact, not necessarily an independent safety net. Testing is still useful, but it no longer plays the same role it once did. Instead of guaranteeing correctness, tests become another signal in a broader reliability system.
Where Reliability Lives Now
SRE starts from an assumption that makes a lot of people uncomfortable: systems will fail. Not because engineers are careless, but because complexity guarantees it. Rather than trying to eliminate every bug, the focus goes toward limiting blast radius, detecting failures quickly, and recovering automatically. Reliability is not achieved through perfect code. It is achieved through systems that tolerate imperfect code.
I think AI coding pushes the rest of software engineering in exactly this direction, whether teams are ready for it or not.
If humans cannot deeply reason about every line of code anymore, safety has to live somewhere else. In practice, it moves into the system itself.
Observability becomes more important than reading code. Systems need to tell us what they are doing in real time because we can no longer assume we know from looking at the source. Metrics, tracing, and anomaly detection are not nice-to-haves anymore. Failures need to stay localized: feature flags, staged rollouts, tenant isolation, and permission boundaries limit how much damage a mistake can cause. And rollback mechanisms, circuit breakers, and automated mitigation allow systems to correct themselves quickly when something goes wrong.
This is not a new playbook. It is the SRE playbook, applied to a world where the code inside your systems is increasingly not code you deeply understand.
A Fair Counterargument
I want to be honest about the strongest pushback here.
The argument goes: AI-assisted code, reviewed carefully, is still code the engineer owns. The tool doesn't matter. What matters is whether the engineer understood what they shipped. And that's true. If a team uses AI thoughtfully and reviews output rigorously, the result can be code they genuinely own and understand.
The problem is economics. The same speed that makes AI valuable also creates pressure to ship faster than you can carefully review. The risk isn't that AI-generated code is inherently worse. It's that the incentive structure pushes toward treating review as a formality rather than a real check. That's what collapses ownership, not the AI itself.
Do Users Pay the Price?
There is a real risk here that I think is worth naming directly. If the response to AI-generated code is just "ship fast and observe," users end up absorbing the cost of our velocity. That's not a tradeoff I'm comfortable with, and I don't think it's one we should normalize.
The answer can't be to slow down and go back to writing everything by hand. But it also can't be to treat production as a testing environment and call it a feedback loop.
What I keep coming back to is that production usage is an irreplaceable signal, but that doesn't mean users need to be exposed to failures to generate it. The more interesting investment is building infrastructure that captures and replays real usage patterns in isolated environments. Your test environments stop being places where you guess at how users behave and start being places where you replay how they actually did. That kind of end-to-end testing is harder to build than a unit test suite, but it's the only approach that's honest about what you're actually validating, without making users pay for it.
Velocity matters. But not at the cost of trust.
Ownership Without Authorship
So what does engineering ownership actually mean in this context? I don't think it can mean "I wrote every line of this code" anymore.
Maybe it becomes something closer to stewardship. An engineer owns a system if they understand how it behaves, monitor its health, respond when it breaks, and improve its architecture over time. They may not have written most of the implementation, but they are responsible for how the system operates.
Ownership shifts from lines of code to system behavior. I think that's the direction we're heading, whether we name it that or not.
Engineering in the Age of Infinite Code
AI has made code abundant. The scarce resource is no longer code itself, but understanding, architecture, and reliability.
The best engineers probably won't be the fastest coders. They'll be the people who design systems that remain safe even when the code inside them is imperfect. That future looks a lot like SRE. Not because engineers stopped caring about quality, but because the only way to manage infinite code is to build systems that can survive it.
I don't have clean answers here. But one thing feels increasingly clear: in a world of infinite code, reliability stops being a property of the code itself and becomes a property of the system around it.
- Mar 6, 2026 Policy Partnering with Mozilla to improve Firefox's security Anthropic News Mar 06, 2026 12:00 AM
-
The Star Chamber: Multi-LLM Consensus for Code Quality Mozilla.ai Blog Mar 05, 2026 06:56 PM 16 min read The Star Chamber runs code reviews across multiple LLM providers and aggregates their feedback by consensus. Instead of relying on one model's perspective, developers get a structured view of where mo

Every AI model has blind spots. It might overlook context, lean toward certain patterns, or fill gaps with confident guesses. When you're using an AI coding agent to help with architecture decisions or code review, you're getting one perspective from one model. That's fine for straightforward work. But for decisions that shape the long-term direction of a codebase, one perspective isn't enough.
The Star Chamber is a skill for Claude Code that fans out code reviews and design questions to multiple LLM providers simultaneously, aggregates their feedback, and presents consensus-based recommendations. Think of it as a senior engineering council that reviews the same code independently, then you get a summary of where they agree and where they don't.
The name is a nod to Mark Schwartz's A Seat at the Table, where he talks about the "Star Chamber" when describing the governance review board that oversees IT decisions. Schwartz argues that these boards should focus on outcomes and confidence rather than wading through planning documents or ceremony that were never really what they needed. It has stuck with me for years, and so the Star Chamber here is the same idea applied to code: A panel that gives you confidence in decisions through a structured, multi-perspective review. His books are definitely worth checking out, particularly A Seat at the Table and The (Delicate) Art of Bureaucracy.
The original Star Chamber was a 15th-17th century English court at the Palace of Westminster, a council of privy councillors and common-law judges established to handle cases too significant for ordinary courts. A panel of independent reviewers, each bringing different expertise, deliberating on the same evidence.
The Star Chamber is part of a broader project called claude-pragma, which I'll come back to shortly.
The Problem with Single-Model Review
If you've worked with AI coding agents, or even indulged in browser-based LLM chats for coding questions, you'll recognise this pattern:
- Ask Claude a question, get an answer
- Your experience tells you something else, so open ChatGPT and pit it against Claude's response
- Take that answer, paste it into Gemini to play devil's advocate one more time
- Mentally synthesise the results, weigh them up, decide what to do
- Repeat next time you have a question
It works, but attrition is the killer. After a few rounds of copy-pasting between tabs, "good enough" starts to feel like a reasonable answer and the rigour quietly drops off.
And each model notices different things, has its own style of writing code, or leans on a particular version of a language it was trained heavily on. Claude might catch architectural concerns but miss performance implications. GPT might flag security issues but overlook idiomatic patterns. Gemini might spot documentation gaps but be less opinionated about error handling.
None of them are wrong. They're just different. And in a real engineering team, you'd want multiple reviewers precisely because different people notice different things.
In a remote-first world, there's also a practical angle. You don't always want to interrupt a colleague to rubber duck a design question or get a second opinion on an approach. That kind of collaboration is valuable and I'd never want to replace it, but something like the Star Chamber can handle the low-hanging fruit: the quick sanity checks, the "am I overthinking this?" moments. It means when you do pull someone in for a pairing session or a deeper discussion, you're bringing a sharper, better-tested starting point.
So the question became: what if that whole process of consulting multiple models and synthesising their views could be automated in a constructive way?
claude-pragma: The Bigger Picture

Before diving into the Star Chamber, it's worth understanding the project it belongs to...
claude-pragma is a collection of skills, agents, and validators for Claude Code that aim to make working with Claude more deterministic.
The core problem: Claude Code's rules (defined in CLAUDE.md files) are followed inconsistently. You can tell Claude to follow the Go Proverbs, or to always validate security boundaries, or to use a specific error handling pattern, but there's no guarantee it will remember or apply those rules on every implementation. claude-pragma solves this by mechanically injecting rules and validating compliance using semantic validators that run automatically.

It includes:
- Validators that block implementation until issues are resolved (security, Python style, Go idioms, TypeScript conventions)
- /setup-project skill that bootstraps everything: project rules, validator configuration, and the one-time Star Chamber provider setup
- /implement skill that wraps implementation with automatic validation
- /star-chamber skill for advisory multi-LLM review
The validators distinguish between musts and shoulds. Musts are non-negotiable: if a validator flags a must, it has to be fixed before implementation can proceed. Shoulds are different: Claude can choose to skip a should, but only if it provides a rationale for why. This stops rules being blindly applied in cases where they genuinely don't fit, while still requiring the decision to be conscious and documented.
When you run /implement, you get a summary at the end showing everything: which validators ran, what was flagged, what was fixed, and any shoulds that were skipped with their rationale. It gives you a single place to review the decisions that were made on your behalf.

The Star Chamber sits alongside all of this as the advisory part: subjective, contextual, and ultimately the developer's call.
How the Star Chamber Works
The Star Chamber operates as both a skill (invoked explicitly with /star-chamber) and an agent (auto-invoked when Claude Code encounters significant design decisions). When triggered, it follows a straightforward process:

1. Gather context. It collects the code to review (from explicit file arguments, local changes, staged changes, or recent commits), reads any project rules from CLAUDE.md and ARCHITECTURE.md, and builds a structured review prompt.

2. Fan out to providers. The prompt goes to all configured LLM providers in parallel. Out of the box, this means Claude, GPT, and Gemini, but the provider list is configurable. Each model reviews the code independently, with no knowledge of what the others are saying.
3. Aggregate by agreement. Results come back and get classified:
- Consensus issues (all providers flagged it) - highest confidence, address first
- Majority issues (two or more providers) - high confidence, worth investigating
- Individual observations (one provider only) - may be a specialised insight, or may be noise
This classification is the key value. When three different models independently flag the same concern, you can be reasonably confident it's a real issue. When only one model raises something, it's still worth reading, but you calibrate your confidence accordingly.
What makes this aggregation work reliably is that each provider returns structured JSON against a defined contract, not free-text prose. Issues have typed fields: severity (high, medium, low), location (file:line), category (craftsmanship, architecture, correctness, maintainability), a description, and a suggested fix. This structure means the aggregation can group issues by location and category to determine genuine consensus rather than relying on fuzzy text matching.
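As an illustration, an issue object under that contract and the consensus grouping might look like this (field values and function names are invented, not the actual wire format):

```python
from collections import defaultdict

# One provider's issue under the typed contract (values invented):
issue = {
    "severity": "medium",
    "location": "llm_council.py:142",
    "category": "correctness",
    "description": "No timeout on parallel provider calls",
    "suggested_fix": "Bound each call with a per-provider timeout",
}

def classify(reviews: dict[str, list[dict]], n_providers: int) -> dict:
    """Group issues by (location, category), then label each group by how
    many providers flagged it: consensus = all, majority = 2+, else individual."""
    flagged_by = defaultdict(set)
    for provider, issues in reviews.items():
        for i in issues:
            flagged_by[(i["location"], i["category"])].add(provider)
    return {
        key: ("consensus" if len(who) == n_providers
              else "majority" if len(who) >= 2
              else "individual")
        for key, who in flagged_by.items()
    }
```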
Two Modes: Parallel and Debate

The default mode is parallel: all providers review independently in a single round. It's fast and good enough for most reviews.
For questions where you want the models to engage with each other's reasoning, there's debate mode (--debate). You can specify the number of rounds with --rounds N:

```
/star-chamber --debate --rounds 3
```

After each round, an anonymous summary of all providers' feedback is shared back to every provider for the next round. The crucial detail here is that it follows the Chatham House rule: providers see what was said but not who said it.

Debate mode: providers review independently each round, with anonymous synthesis shared between rounds.

I believe the anonymity matters. Attribution might anchor the discussion: if a model knows "Claude said X", that becomes a reference point to agree or disagree with rather than an idea to engage with on its merits. By stripping the source, the synthesis is just a set of observations and arguments. The models don't need to know whether feedback came from another LLM or a human; what matters is the substance.
A debate round might share something like:
Other council members' feedback (round 1):
Issues raised:
- The config loader silently ignores missing env vars, risking runtime errors
- Linear search in get_resource_definition may be slow for large configs
- Consider adding a strict mode for env var validation
Points of agreement:
- Type hints are solid
- Overall code structure is clean
Please provide your perspective on these points.
Each provider then re-evaluates, sometimes changing position when presented with arguments they hadn't considered, sometimes doubling down with additional evidence. The process can converge before reaching the specified number of rounds if the providers reach agreement early. The result is a more thorough analysis than any single round could produce.
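A rough sketch of the round structure; the convergence test, the callables, and their signatures are assumptions rather than the shipped implementation:

```python
def run_debate(providers, prompt, review_fn, synthesize_fn, max_rounds=3):
    """review_fn(provider, prompt, context) -> list of issue dicts;
    synthesize_fn(reviews) -> anonymous summary text. Both hypothetical."""
    context, prev_fingerprint = None, None
    reviews = {}
    for _ in range(max_rounds):
        # Each provider reviews independently; the shared context carries
        # no attribution (Chatham House rule).
        reviews = {p: review_fn(p, prompt, context) for p in providers}
        fingerprint = {(i["location"], i["category"])
                       for issues in reviews.values() for i in issues}
        if fingerprint == prev_fingerprint:
            break  # positions stopped moving: converged before max rounds
        prev_fingerprint = fingerprint
        context = synthesize_fn(reviews)  # anonymous synthesis for next round
    return reviews
```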
An Example: The Star Chamber Reviews Itself
Early on, I pointed the Star Chamber at itself - a meta-review of its own code in debate mode with GPT-4o, Claude, and Gemini (full output):
Consensus Issues (All Providers Agree)
| # | Location | Severity | Category | Description |
|---|----------|----------|----------|-------------|
| 1 | llm_council.py: debate_mode | MEDIUM | correctness | Debate mode lacks mechanism to prevent infinite loops if LLMs disagree. No convergence detection. |
| 2 | llm_council.py: API_key_handling | MEDIUM | security | API keys loaded without proper masking in logs/error messages. Risk of accidental exposure. |
| 3 | llm_council.py: various | MEDIUM | architecture | Code lacks clear separation of concerns: SDK loading, provider interaction, review logic, and aggregation all in one file. |

Majority Issues (2/3 Providers)

| # | Location | Severity | Category | Description |
|---|----------|----------|----------|-------------|
| 1 | llm_council.py: get_provider_client | HIGH | correctness/security | Dynamic SDK loading via importlib doesn't validate imported module. Risk of loading malicious packages. |
| 2 | llm_council.py: run_parallel_reviews | MEDIUM | correctness | No timeout mechanism for parallel reviews. Hanging LLM call could block indefinitely. |

All of those were legitimate findings. The convergence detection issue, the API key masking, the timeout handling - these were real problems that got fixed. The fact that the tool found genuine issues in its own implementation was a good early signal that the approach works. Turtles all the way down.
Design Questions, Not Just Code Review
The Star Chamber isn't limited to reviewing code that already exists. It handles design and architectural questions too. If you're deciding between event sourcing and traditional CRUD for an audit trail, or weighing whether to introduce a message queue versus synchronous processing, you can put the question to the council.
You can mix code review with design goals in a single invocation:
```
/star-chamber --debate --rounds 3 review local changes for design and achieving the goal of GitHub issue #423
```

Each provider evaluates the local changes against both the code quality and the intent of the issue, then refines through debate. You get a structured analysis of where three different models converge and where they see legitimate trade-offs.
Where This Fits in the Pipeline
The Star Chamber is explicitly advisory, not blocking. In claude-pragma, semantic validators run automatically and block implementation until issues are resolved. The Star Chamber sits after that:

Validators block until fixed. The Star Chamber advises, but the developer decides.

Validators enforce objective, automatable rules: "don't use assert for control flow", "error strings shouldn't be capitalised", "never commit secrets." These are binary.

The Star Chamber handles subjective, contextual questions: "is this the right abstraction?", "will this scale?", "is this over-engineered?" Those need human judgement as the final arbiter.
When running as an agent (auto-invoked), it uses parallel mode only and limits itself to genuinely significant decisions. It won't fire on a README change or a routine bug fix.
The Plumbing
Under the hood, the Star Chamber uses Mozilla.ai's open source any-llm to talk to providers, executed via uv run so there's no global Python installation to manage.

Provider configuration lives in ~/.config/star-chamber/providers.json and gets set up the first time you run /setup-project.

A multi-LLM system means managing API keys for every provider you want to use. That gets old fast. Three providers means three API keys in environment variables, three billing dashboards, three sets of usage limits to track. The any-llm managed platform (built at Mozilla.ai, in open beta at the time of writing) solves this with a single virtual key. You store one ANY_LLM_KEY environment variable and it handles authentication to all configured providers. Your actual provider keys are encrypted client-side and never stored in raw text on their servers.

With platform mode, the provider config is clean:
```json
{
  "platform": "any-llm",
  "providers": [
    {"provider": "openai", "model": "gpt-5.2"},
    {"provider": "anthropic", "model": "claude-opus-4-6"},
    {"provider": "gemini", "model": "gemini-2.5-flash"}
  ],
  "consensus_threshold": 2,
  "timeout_seconds": 60
}
```
You also get usage tracking, cost analytics, and budget controls across all providers in one place, which is particularly useful for something like the Star Chamber where every invocation fans out to multiple models. No prompt content is logged, only metadata like token counts, cost, and latency.
Alternatively, you can use individual API keys per provider if you prefer direct access. Either way, adding or removing providers is just a config change. You can easily configure Mistral or Llama with just an additional providers.json entry and the SDK handles the rest.

Validation: Perplexity's Model Council
About a week after the first Star Chamber commits landed (January 30th), Perplexity launched Model Council on February 5th, a feature that runs queries across multiple frontier models simultaneously and synthesises the results. Neither project influenced the other; we just arrived at the same idea independently. Their framing mirrors the same intuition:
Every AI model has blind spots. It might overlook context, lean toward certain perspectives, or fill gaps with confident guesses. For research you're acting on, it's a big risk.
Their approach sounds very similar to ideas in claude-pragma, running the query through Claude, GPT, and Gemini in parallel, then a synthesizer model resolves conflicts and shows where the models agree versus diverge.
When two teams working on completely different problems independently converge on a similar solution or architecture, that's a strong signal that multi-model consensus is becoming a recognised pattern for any task where accuracy matters more than speed.
The projects have different domains and scopes, with Perplexity aimed at research queries and the Star Chamber at software engineering (code review, catching bugs, and design trade-offs), but it's the same principle.
Why This Works
It's the same reason code review works in human teams. No single reviewer catches everything, but different people bring different knowledge, different pattern recognition, and different things they're sensitive to. The Star Chamber just applies that to AI-assisted development, using models with different training data and architectures instead of people with different backgrounds and experience.
What I've Learned Using It
A few observations from building and using this:
Consensus issues are almost always worth fixing. When three independently-trained models flag the same concern, ignoring it is hard to justify. These tend to be genuine problems, not stylistic preferences.
Individual observations are where the interesting insights hide. Sometimes only one model spots something, and it turns out to be the most valuable feedback. The classification doesn't mean "ignore individual observations"; it's more like "calibrate your confidence accordingly."
Debate mode changes minds. In multi-round debates, providers regularly shift position after seeing anonymous synthesis. And yes, LLMs are generally sycophantic, but in this case the anonymous synthesis seems to genuinely surface arguments they hadn't considered rather than just agreeing for the sake of it. This is the strongest argument for debate over simple parallel review.
The advisory-not-blocking distinction is essential. Making it blocking would slow development to a crawl and create false authority. These are AI opinions informed by multiple perspectives, but still opinions. The engineer decides.
What's Next
There's one direction I'm particularly interested in: assigning personas to council members. Rather than three generic reviewers, you'd configure the council so one member reviews through a security lens, another focuses on performance, and a third evaluates maintainability. Different lenses on the same code, reflecting how a well-structured review team would actually operate.
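A purely speculative sketch of what that might look like in providers.json; the persona field is hypothetical and not part of the current config format:

{
  "providers": [
    {"provider": "openai", "model": "gpt-5.2", "persona": "security"},
    {"provider": "anthropic", "model": "claude-opus-4-6", "persona": "performance"},
    {"provider": "gemini", "model": "gemini-2.5-flash", "persona": "maintainability"}
  ]
}

Each persona would map to a different review prompt, so the same diff gets read three different ways.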
The original idea for the Star Chamber was actually a Slack-based chat room where different AI models could discuss your code in a thread. Building it as a Claude Code skill turned out to be more practical, but the conversational quality of debate mode captures some of that original spirit.
Update: March 2026
A few things have moved since this post went up.
The rename. claude-pragma is now agent-pragma. The original name tied it to Claude Code, but it now works with OpenCode too, so the broader name fits better. I've updated the references in this post to match.

Star Chamber is now a standalone SDK. This is the biggest change. What started as a Python script embedded inside the plugin (council logic, provider transport, consensus classification, prompt templates, all in one place) has been extracted into its own repository and published to PyPI. You can use it independently of agent-pragma as a CLI (uvx star-chamber review ...) or as a Python library (from star_chamber import run_council). About 3,000 lines moved out of agent-pragma in the process.

The extraction also produced a formal council protocol specification with JSON schemas defining the wire format between the orchestrator and providers. What was implicit in the original implementation is now an explicit, versioned contract. The skill inside agent-pragma is now a thin wrapper that shells out to the SDK.
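For a feel of the library entry point, here's a minimal sketch. run_council is the import named above, but the parameters and result shape shown here are assumptions rather than the SDK's documented API:

# Hedged sketch: run_council is real; the arguments and result fields are illustrative.
from star_chamber import run_council

result = run_council(
    files=["src/main.py", "src/config.py"],       # code for the council to review
    providers=["openai", "anthropic", "gemini"],
    mode="parallel",                              # hypothetically, "debate" for multi-round
)
for finding in result.findings:                   # hypothetical result shape
    print(finding.classification, "-", finding.summary)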
Dual-entrypoint architecture. The Star Chamber now operates as both an explicit skill (/star-chamber) and an auto-invoked agent that fires on significant architectural decisions. The agent uses a lighter-weight model and parallel mode only, keeping it fast enough to run in the background without disrupting flow. When it triggers as an agent, you get the council's take without having to remember to ask for it.

Zero-config. Skills now work immediately without running /setup-project first. The setup skill still exists for customising validator configuration and provider lists, but it's no longer a prerequisite. You install the plugin and go.

Try It
The Star Chamber can be used standalone or as part of agent-pragma.
Standalone SDK (no plugin required):

uvx star-chamber review src/main.py src/config.py
uvx star-chamber ask "Should this service use event sourcing or CRUD?"

Both review and ask support the same set of flags:

- -p / --provider: provider to include, repeatable (e.g. -p openai -p anthropic)
- --context-file: file containing project context to include in the prompt
- --council-context: prior council round feedback for debate mode
- --config: path to a providers.json
- --timeout: per-provider timeout in seconds
- --format: output format, text (default) or json
- --output: write the JSON result to a file

The --context-file flag is particularly useful for feeding in an ARCHITECTURE.md or CLAUDE.md so the council reviews against your project's actual conventions rather than general best practice.

With the any-llm managed platform:
If you're using the any-llm managed platform for centralised key management (one virtual key instead of one per provider), install with the platform extra:
uvx --with 'star-chamber[platform]' star-chamber review src/main.py
uvx --with 'star-chamber[platform]' star-chamber ask "Event sourcing or CRUD for the audit trail?"

Your providers.json stays the same as described in the Plumbing section above; the extra just pulls in the platform client that resolves your ANY_LLM_KEY against the configured providers.

As part of agent-pragma (includes validators, /implement, and auto-invocation):

/plugin marketplace add peteski22/agent-pragma
/plugin install pragma@agent-pragma

Then /star-chamber whenever you want a council review. If you want to customise provider lists or validator configuration, /setup-project handles that, but it's optional.

I'm particularly interested in how other people configure their provider lists and whether different model combinations surface different kinds of insights. If you try it, I'd like to hear what you find.
Originally published on https://peteski22.github.io/blog/ on February 22, 2026
- Where things stand with the Department of War Anthropic News Mar 05, 2026 12:00 AM A statement from Dario Amodei
- Gemini 3.1 Flash-Lite: Built for intelligence at scale DeepMind Blog Mar 03, 2026 04:35 PM Gemini 3.1 Flash-Lite is our fastest and most cost-efficient Gemini 3 series model yet.
- Nano Banana 2: Combining Pro capabilities with lightning-fast speed DeepMind Blog Feb 26, 2026 04:01 PM Our latest image generation model offers advanced world knowledge, production-ready specs, subject consistency and more, all at Flash speed.
-
any-llm in the Wild: Three Integrations as We Grow Our Ecosystem Mozilla.ai Blog Feb 25, 2026 07:09 PM 2 min read any-llm now integrates with JupyterLiteAI, LangChain, and Headroom. A single provider-agnostic layer powering notebooks, agents, and context optimization across OpenAI, Anthropic, Mistral, and local m

A core part of building any-llm is making sure it is present where developers already are. Over the past few months, we've integrated any-llm into the broader ecosystem by contributing to open source projects, publishing new packages, and collaborating with the community. Here are three recent integrations that each demonstrate a different aspect of what a provider-agnostic LLM layer makes possible.
JupyterLiteAI + any-llm-gateway
JupyterLite AI gives data scientists access to AI code completions and chat. It already supports a range of providers, but each requires its own configuration and API keys.
Because the any-llm-gateway exposes an OpenAI-compatible API, JupyterLite AI connects to it without any code changes. Users point their notebook at the gateway endpoint and get access to every provider any-llm supports (OpenAI, Anthropic, Mistral, local models) through a single configuration.
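As a rough sketch of what that looks like from the client side: because the gateway speaks the OpenAI format, a stock OpenAI client can target it. The base URL, API key handling, and model naming below are assumptions, not the gateway's documented defaults:

# Point a standard OpenAI client at the gateway; one endpoint, many providers.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="gateway-key")  # assumed URL/key
resp = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-20250514",   # assumed provider/model naming
    messages=[{"role": "user", "content": "Hello from the notebook!"}],
)
print(resp.choices[0].message.content)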
The Local Edge: You can now point JupyterLite AI at a local Llama 3 or Mistral instance running via llamafile or Ollama. Your code, your data, and your prompts never leave your local machine.
langchain-anyllm
LangChain is the industry standard for building AI agents, but swapping between providers often requires installing and configuring disparate packages.
We built langchain-anyllm to collapse that complexity into a single integration. Install one package and switch models with a string:
from langchain_anyllm import ChatAnyLLM

llm = ChatAnyLLM(model="openai:gpt-4")
llm = ChatAnyLLM(model="anthropic:claude-sonnet-4-20250514")
llm = ChatAnyLLM(model="mistral:mistral-small-latest")

The package supports streaming (sync and async), tool calling, and JSON mode. It's available on PyPI and now documented in LangChain's official docs, merged after review by the LangChain team.
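Assuming ChatAnyLLM follows LangChain's standard chat-model interface (it is documented in LangChain's official docs), invocation and streaming then work like any other LangChain model:

from langchain_anyllm import ChatAnyLLM

llm = ChatAnyLLM(model="mistral:mistral-small-latest")
print(llm.invoke("Say hello in one word.").content)   # standard invoke

for chunk in llm.stream("Count to three."):           # sync streaming
    print(chunk.content, end="", flush=True)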
Headroom + any-llm as a backend
Headroom is a context optimization layer that compresses LLM context by 50-90% using statistical analysis, thereby reducing token costs without sacrificing accuracy. It operates as a proxy and previously supported backends like AWS Bedrock, Vertex AI, Azure OpenAI, and OpenRouter.
We contributed any-llm as a new backend, giving Headroom users access to any supported provider:
headroom proxy --backend anyllm --anyllm-provider openai

The integration supports streaming, non-streaming, and OpenAI-format requests. The composability here is the interesting part: Headroom handles context optimization, any-llm handles provider abstraction, and the developer gets both without coupling to a specific vendor. For on-prem users, this allows running larger models with more context on more modest local hardware.
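To illustrate the request path, a call through the proxy might look like the following; the port and endpoint path are assumptions based on Headroom exposing an OpenAI-format API:

# Send an OpenAI-format chat request through the Headroom proxy started above.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",   # assumed proxy address
    json={
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Hello through Headroom"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])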
Join the Ecosystem
Come build the future of open source AI with us.
- Explore the code: Check out the any-llm repository to see how we're abstracting the provider layer.
- Try the integrations: Grab langchain-anyllm on PyPI or spin up the any-llm-gateway to use with Jupyter.
- Build with us: Have a tool you want to see integrated? Open an issue or a PR; we're meeting developers where they are, and we'd love to meet you there, too.
-
RCCLX: Innovating GPU Communications on AMD Platforms Meta AI / Engineering Feb 24, 2026 09:30 PM 5 min read We are open-sourcing the initial version of RCCLX, an enhanced version of RCCL that we developed and tested on Meta's internal workloads. RCCLX is fully integrated with Torchcomms and aims to empo…

We are open-sourcing the initial version of RCCLX, an enhanced version of RCCL that we developed and tested on Meta's internal workloads. RCCLX is fully integrated with Torchcomms and aims to empower researchers and developers to accelerate innovation, regardless of their chosen backend.

Communication patterns for AI models are constantly evolving, as are hardware capabilities. We want to iterate on collectives, transports, and novel features quickly on AMD platforms. Earlier, we developed and open-sourced CTran, a custom transport library, on the NVIDIA platform. With RCCLX, we have integrated CTran into AMD platforms, enabling AllToAllvDynamic, a GPU-resident collective. While not all the CTran features are currently integrated into the open source RCCLX library, we're aiming to have them available in the coming months.

In this post, we highlight two new features: Direct Data Access (DDA) and Low Precision Collectives. These features provide significant performance improvements on AMD platforms, and we are excited to share them with the community.
Direct Data Access (DDA): Lightweight Intra-node Collectives

Large language model inference operates through two distinct computational stages, each with fundamentally different performance characteristics:
- The prefill stage processes the input prompt, which can span thousands of tokens, to generate a key-value (KV) cache for each transformer layer of the model. This stage is compute-bound because the attention mechanism scales quadratically with sequence length, making it highly demanding on GPU computational resources.
- The decoding stage then utilizes and incrementally updates the KV cache to generate tokens one by one. Unlike prefill, decoding is memory-bound, as the I/O time of reading memory dominates attention time, with model weights and the KV cache occupying the majority of memory.
Tensor parallelism enables models to be distributed across multiple GPUs by sharding individual layers into smaller, independent blocks that execute on different devices. However, one important challenge is that the AllReduce communication operation can contribute up to 30% of end-to-end (E2E) latency. To address this bottleneck, Meta developed two DDA algorithms.
- The DDA flat algorithm improves small-message allreduce latency by allowing each rank to directly load memory from other ranks and perform local reduce operations, reducing latency from O(n) to O(1) at the cost of increasing the data exchanged from O(n) to O(n²).
- The DDA tree algorithm breaks the allreduce into two phases (reduce-scatter and all-gather) and uses direct data access in each step, moving the same amount of data as the ring algorithm but reducing latency to a constant factor for slightly larger message sizes.
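To make the latency/bandwidth trade-off concrete, here is a toy, single-process model of the flat algorithm; this is illustrative Python, not RCCLX code:

# Toy model of DDA-flat allreduce: every rank reads every peer's buffer
# directly and reduces locally, so the critical path is one step (O(1))
# while total data moved grows to O(n^2) across n ranks.
def dda_flat_allreduce(buffers):
    total = [sum(vals) for vals in zip(*buffers)]   # each rank reduces all peers
    return [total[:] for _ in buffers]              # every rank holds the full sum

print(dda_flat_allreduce([[1, 2], [3, 4], [5, 6]]))  # -> [[9, 12], [9, 12], [9, 12]]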

The performance improvements of DDA over baseline communication libraries are substantial, particularly on AMD hardware. With AMD MI300X GPUs, DDA outperforms the RCCL baseline by 10-50% for decode (small message sizes) and yields 10-30% speedup for prefill. These improvements resulted in approximately 10% reduction in time-to-incremental-token (TTIT), directly enhancing the user experience during the critical decoding phase.
Low-precision Collectives
Low-precision (LP) collectives are a set of distributed communication algorithms (AllReduce, AllGather, AllToAll, and ReduceScatter) optimized for AMD Instinct MI300/MI350 GPUs to accelerate AI training and inference workloads. These collectives support both FP32 and BF16 data types, leveraging FP8 quantization for up to 4:1 compression, which significantly reduces communication overhead and improves scalability and resource utilization for large message sizes (≥16MB).

The algorithms use parallel peer-to-peer (P2P) mesh communication, fully exploiting AMD's Infinity Fabric for high bandwidth and low latency, while compute steps are performed in high precision (FP32) to maintain numerical stability. Precision loss is primarily dictated by the number of quantization operations (typically one or two per data type in each collective) and whether the data can be adequately represented within the FP8 range.
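The pattern is easiest to see in a toy sketch: quantize once before the wire, reduce in high precision, dequantize once at the end. This is illustrative NumPy, not the RCCLX implementation (int8 stands in for FP8 here):

import numpy as np

def quantize(x, scale):
    # Crude stand-in for FP8: scale, round, clamp to an 8-bit range.
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def lp_allreduce(chunks):
    scale = float(max(np.abs(c).max() for c in chunks)) / 127.0 or 1.0
    q = [quantize(c, scale) for c in chunks]                      # 4:1 smaller on the wire
    reduced = np.sum([c.astype(np.float32) for c in q], axis=0)   # compute stays FP32
    return reduced * scale                                        # dequantize once

print(lp_allreduce([np.array([0.5, 1.0]), np.array([0.25, -1.0])]))  # ~[0.76, 0.0]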
By dynamically enabling LP collectives, users can selectively activate these optimizations in the E2E scenarios that benefit most from the performance gains. Based on internal experiments, we have observed significant speedups for FP32 and notable improvements for BF16; it's important to note that these collectives have been tuned for single-node deployments at this time.
Reducing the precision of types can affect numeric accuracy, so we tested for this and found the numerical accuracy acceptable for our workloads. This flexible approach allows teams to maximize throughput while staying within acceptable accuracy, and it is now fully integrated and available in RCCLX for AMD platforms: simply set the environment variable RCCL_LOW_PRECISION_ENABLE=1 to get started.

[Figures: MI300 Float LP speedup for AllReduce, AllGather, AllToAll, and ReduceScatter]

We are observing the following results from E2E inference workload evaluations when selectively enabling LP collectives:
- Approximately 0.3% delta on GSM8K evaluation runs.
- ~9-10% decrease in latency.
- ~7% increase in throughput.
The throughput measurements shown in the graphs were obtained using param-bench rccl-tests. For the MI300, the tests were run on RCCLX built with ROCm 6.4, and for the MI350, on RCCLX built with ROCm 7.0. Each test included 10 warmup iterations followed by 100 measurement iterations. The reported results represent the average throughput across the measurement iterations.
Easy adaptation of AI models
RCCLX is integrated with the Torchcomms API as a custom backend. We aim for this backend to have feature parity with our NCCLX backend (for NVIDIA platforms). Torchcomms gives users a single communication API across different platforms: a user would not need to change the APIs they're familiar with to port their applications to AMD or other platforms, even when using the novel features provided by CTran.


RCCLX Quick Start Guide
Install Torchcomms with RCCLX backend by following the installation instructions in the Torchcomms repo.
import torch
import torchcomms

# Eagerly initialize a communicator using the MASTER_PORT/MASTER_ADDR/RANK/WORLD_SIZE
# environment variables provided by torchrun. This communicator is bound to a single device.
comm = torchcomms.new_comm("rcclx", torch.device("hip"), name="my_comm")
print(f"I am rank {comm.get_rank()} of {comm.get_size()}!")

# Fill a tensor with this rank's id.
t = torch.full((10, 20), fill_value=comm.get_rank(), dtype=torch.float)

# Run an all_reduce on the current stream.
comm.allreduce(t, torchcomms.ReduceOp.SUM, async_op=False)

Acknowledgements
We extend our gratitude to the AMD RCCL team for their ongoing collaboration. We also want to recognize the many current and former Meta employees whose contributions were vital in developing torchcomms and torchcomms-backends for production-scale training and inference. In particular, we would like to give special thanks to Dingming Wu, Qiye Tan, Pavan Balaji, Yan Cui, Zhe Qu, Ahmed Khan, Ajit Mathews, CQ Tang, Srinivas Vaidyanathan, Harish Kumar Chandrappa, Peng Chen, Shashi Gandham, and Omar Baldonado.
- Gemini 3.1 Pro: A smarter model for your most complex tasks DeepMind Blog Feb 19, 2026 04:06 PM 3.1 Pro is designed for tasks where a simple answer isnât enough.
- A new way to express yourself: Gemini can now create music DeepMind Blog Feb 18, 2026 04:01 PM Lyria 3 is now available in the Gemini app. Create custom, high-quality 30-second tracks from text and images.
- Introducing Claude Sonnet 4.6 Anthropic News Feb 17, 2026 12:00 AM
-
The Death of Traditional Testing: Agentic Development Broke a 50-Year-Old Field, JiTTesting Can Revive It Meta AI / Engineering Feb 11, 2026 05:00 PM 3 min read WHAT IT IS The rise of agentic software development means code is being written, reviewed, and shipped faster than ever before across the entire industry. It also means that testing frameworks need…
WHAT IT IS
The rise of agentic software development means code is being written, reviewed, and shipped faster than ever before across the entire industry. It also means that testing frameworks need to evolve for this rapidly changing landscape. Faster development demands faster testing that can catch bugs as they land in a codebase, without requiring regular updates and maintenance.
Just-in-Time Tests (JiTTests) are a fundamentally novel approach to testing where tests are automatically generated by large language models (LLMs) on the fly to catch bugs, even ones that traditional testing might not catch, just-in-time before the code lands in production.
A Catching JiTTest focuses specifically on finding regressions introduced by a code change. This type of testing reimagines decades of software testing theory and practice. While traditional testing relies on static test suites, manual authoring, and ongoing maintenance, Catching JiTTests require no test maintenance and no test code review, meaning engineers can focus their expertise on real bugs, not false positives. Catching JiTTests use sophisticated techniques to maximize test signal value and minimize false positive drag, targeting test signals where they matter most: on serious failures.
HOW TESTING TRADITIONALLY WORKS
Under the traditional paradigm, tests are manually built as new code lands in a codebase and continually executed, requiring regular updates and maintenance. The engineers building these tests face the challenge of needing to check the behavior not only of the current code, but of all possible future changes. Inherent uncertainty about future changes results in tests that don't catch anything, or, when they do, produce false positives. Agentic development dramatically increases the pace of code change, straining the test development burden and scaling the cost of false positives and test maintenance to the breaking point.
HOW CATCHING JITTESTS WORK
Broadly, JiTTests are bespoke tests, tailored to a specific code change, that give engineers simple, actionable feedback about unexpected behavior changes without the need to read or write test code. LLMs can generate JiTTests automatically the moment a pull request is submitted. And since the JiTTest itself is LLM-generated, it can often infer the plausible intention of a code change and simulate possible faults that may result from it.
With an understanding of intent, Catching JiTTests can significantly drive down instances of false positives.
Here are the key steps of the Catching JiTTest process (sketched in code after the list):
- New code lands in the codebase.
- The system infers the intention of the code change.
- It creates mutants (code versions with faults deliberately inserted) to simulate what could go wrong.
- It generates and runs tests to catch those faults.
- Ensembles of rule-based and LLM-based assessors focus the signal on true positive failures.
- Engineers receive clear, relevant reports about unexpected changes right when it matters most.
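A runnable toy sketch of that loop; every function here is a stub standing in for an LLM- or rule-driven step, not Meta's actual API:

def infer_intent(diff): return "behavior-preserving refactor"           # step 2 (stub)
def make_mutants(diff): return [diff + " # fault: off-by-one",
                                diff + " # fault: swapped args"]        # step 3 (stub)
def gen_tests(diff, intent): return [lambda code: "fault" not in code]  # step 4 (stub)
def assessors_agree(mutant): return True                                # step 5 (stub ensemble)

def catching_jittest(diff):
    intent = infer_intent(diff)
    mutants = make_mutants(diff)
    tests = gen_tests(diff, intent)
    caught = [m for m in mutants if not all(t(m) for t in tests)]       # tests fail on faults
    return [m for m in caught if assessors_agree(m)]                    # keep true positives

print(catching_jittest("def f(x): return x + 1"))                       # step 6: report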
WHY IT MATTERS
Catching JiTTests are designed for the world of AI-powered agentic software development and accelerate testing by focusing on serious unexpected bugs. With them, engineers no longer have to spend time writing, reviewing, and testing complex test code. Catching JiTTests, by design, kill many of the issues with traditional testing in one stroke:
- They are generated on-the-fly for each code change and do not reside in the codebase, eliminating ongoing maintenance costs and shifting effort from humans to machines.
- They are tailored to each change, making them more robust and less prone to breaking due to intended updates.
- They automatically adapt as the code changes.
- They only require human review when a bug is actually caught.
This all amounts to an important shift in testing infrastructure where the focus moves from generic code quality to whether a test actually finds faults in a specific change without raising a false positive. It helps improve testing overall while also allowing it to keep up with the pace of agentic coding.
READ THE PAPER
Just-in-Time Catching Test Generation at Meta
-
Run OpenAI, Claude, Mistral, llamafile, and more from one interface, now in Go! Mozilla.ai Blog Feb 10, 2026 07:01 PM 4 min read Run any model, from any provider, like OpenAI, Claude, Mistral, or llamafile from one interface, now in Go. any-llm-go delivers type-safe provider abstraction, channel-based streaming, and normalized
Go where the models are

When we released any-llm v1.0 last year, the goal was simple: one interface to use any model, cloud or local, without rewriting your code every time a new provider ships. That goal resonated. Thousands of Python developers adopted any-llm to decouple their product logic from their model provider. In production systems, that decoupling is often the difference between iterating quickly and being locked into a single vendor's API quirks.
But the LLM ecosystem doesn't live in one language. Go powers a significant share of production infrastructure, from API servers to CLI tools to agent frameworks. Go developers deserve the same flexibility.
Today we're releasing any-llm-go, the official Go port of any-llm.
What you get
Every provider differs slightly in streaming behavior, error semantics, and feature support. any-llm-go normalizes those differences behind a single, predictable interface following the OpenAI API standard.
any-llm-go ships with support for eight providers out of the box:
Anthropic, DeepSeek, Gemini, Groq, Llamafile, Mistral, Ollama, and OpenAI, each with capability flags covering completion, streaming, tools, reasoning, and embeddings. Every provider normalizes to the same response format. Write your logic once, swap providers by changing a single import.
import (
    "context"
    "log"

    anyllm "github.com/mozilla-ai/any-llm-go"
    "github.com/mozilla-ai/any-llm-go/providers/openai"
)

provider, err := openai.New()
if err != nil {
    log.Fatal(err)
}

ctx := context.Background()
response, err := provider.Completion(ctx, anyllm.CompletionParams{
    Model: "gpt-4o-mini",
    Messages: []anyllm.Message{
        {Role: anyllm.RoleUser, Content: "Hello!"},
    },
})

Want to switch to Anthropic? Change openai to anthropic and gpt-4o-mini to claude-sonnet-4-20250514. Everything else stays the same: request shape, streaming logic, and error handling.

Built for Go, not ported from Python
This isn't a line-for-line translation. any-llm-go is designed around Go's strengths:
- Streaming uses channels, not iterators. CompletionStream returns a <-chan ChatCompletionChunk that works naturally with range and select.
- Errors are values, not exceptions. Every provider's SDK errors are normalized into typed sentinel errors (ErrRateLimit, ErrAuthentication, ErrContextLength) that work with errors.Is and errors.As.
- Configuration uses functional options. openai.New(anyllm.WithAPIKey("..."), anyllm.WithTimeout(30*time.Second)) gives you type-safe, composable setup.
- Context flows everywhere. Every call takes a context.Context for cancellation, timeouts, and tracing.

The result is a library that feels like Go, not like Go wearing Python's clothes.
The OpenAI-compatible base: add a provider in 50 lines
Not every provider has a dedicated Go SDK. Many (Groq, DeepSeek, Mistral, Llamafile) expose OpenAI-compatible APIs instead. Rather than writing a full implementation for each of these, any-llm-go includes a shared OpenAI-compatible base provider.
Adding a new compatible provider is straightforward. Define a config, call openai.NewCompatible(), and you're done. The Groq provider, for example, is essentially a thin wrapper:

provider, err := openai.NewCompatible(openai.CompatibleConfig{
    APIKeyEnvVar:   "GROQ_API_KEY",
    BaseURLEnvVar:  "",
    Capabilities:   groqCapabilities(),
    DefaultAPIKey:  "",
    DefaultBaseURL: "https://api.groq.com/openai/v1",
    Name:           "groq",
    RequireAPIKey:  true,
}, opts...)

The base handles completions, streaming, tool calls, embeddings, error conversion, and model listing. Your wrapper just needs to specify the API endpoint, the environment variable for the key, and which capabilities are supported.
This is by design. We want adding providers to be easy, because we want you to add them, and because the provider landscape changes faster than any single team can keep up with.
How to contribute
We built any-llm-go with contribution in mind. The Contributing Guide walks through the full process, but the short version is:
1. Pick a provider from the planned list (Cohere, Together AI, AWS Bedrock, Azure OpenAI) or propose a new one.
2. Check if it's OpenAI-compatible. If so, you can use the compatible base and keep your implementation minimal.
3. If it has a native Go SDK, use it. Wrap the SDK, normalize the responses, convert the errors.
4. Write tests and docs. We use the Anthropic provider as the reference implementation.

We've tried to make the codebase approachable. Every provider follows the same file organization, the same patterns, the same test structure. Once you've read one, you can write another.
This is an open library by design: extensible, inspectable, and shaped by its users.
What's next
This initial release focuses on getting the core right: a stable interface, solid error handling, and broad provider coverage. On the roadmap:
- More providers (Cohere, Together AI, AWS Bedrock, Azure OpenAI)
- Batch completion support
- Continued parity with the Python any-llm as both libraries evolve

any-llm-go also works with the any-llm managed platform, now in beta. It provides a vault to manage your API keys, an observability stack to monitor your LLMs' performance, and per-project budget controls. If you're managing LLM keys and costs across multiple providers and teams, take a look.
Get started
go get github.com/mozilla-ai/any-llm-go

Check out the documentation, explore the examples, or jump straight into the provider list.
Found a bug? Want a new provider? Open an issue or start a discussion. We'd love to hear from you.
- Introducing Claude Opus 4.6 Anthropic News Feb 05, 2026 12:00 AM
- Claude is a space to think Anthropic News Feb 04, 2026 12:00 AM We've made a choice: Claude will remain ad-free. We explain why advertising incentives are incompatible with a genuinely helpful AI assistant, and how we plan to expand access without compromising use
-
NVIDIA Rubin Platform, Open Models, Autonomous Driving: NVIDIA Presents Blueprint for the Future at CES NVIDIA AI Blog Jan 05, 2026 11:30 PM 7 min read NVIDIA founder and CEO Jensen Huang opened CES in Las Vegas with Rubin, NVIDIA's first extreme-codesigned AI platform, plus open models for healthcare, robotics and autonomy, and a Mercedes-Benz CLA
NVIDIA founder and CEO Jensen Huang took the stage at the Fontainebleau Las Vegas to open CES 2026, declaring that AI is scaling into every domain and every device.
"Computing has been fundamentally reshaped as a result of accelerated computing, as a result of artificial intelligence," Huang said. "What that means is some $10 trillion or so of the last decade of computing is now being modernized to this new way of doing computing."

Huang unveiled Rubin, NVIDIA's first extreme-codesigned, six-chip AI platform now in full production, and introduced Alpamayo, an open reasoning model family for autonomous vehicle development, part of a sweeping push to bring AI into every domain.

With Rubin, NVIDIA aims to "push AI to the next frontier" while slashing the cost of generating tokens to roughly one-tenth that of the previous platform, Huang said, making large-scale AI far more economical to deploy.
Huang also emphasized the role of NVIDIA open models across every domain, trained on NVIDIA supercomputers, forming a global ecosystem of intelligence that developers and enterprises can build on.
"Every single six months, a new model is emerging, and these models are getting smarter and smarter," Huang said. "Because of that, you could see the number of downloads has exploded."
Find all NVIDIA news from CES in this online press kit.
A New Engine for Intelligence: The Rubin Platform
Introducing the audience to pioneering American astronomer Vera Rubin, after whom NVIDIA named its next-generation computing platform, Huang announced that the NVIDIA Rubin platform, the successor to the record-breaking NVIDIA Blackwell architecture and the company's first extreme-codesigned, six-chip AI platform, is now in full production.

Built from the data center outward, Rubin platform components span:
- Rubin GPUs with 50 petaflops of NVFP4 inference
- Vera CPUs engineered for data movement and agentic processing
- NVLink 6 scale-up networking
- Spectrum-X Ethernet Photonics scale-out networking
- ConnectX-9 SuperNICs
- BlueField-4 DPUs

Extreme codesign (designing all these components together) is essential because scaling AI to gigascale requires tightly integrated innovation across chips, trays, racks, networking, storage and software to eliminate bottlenecks and dramatically reduce the costs of training and inference, Huang explained.

He also introduced AI-native storage with the NVIDIA Inference Context Memory Storage Platform, an AI-native KV-cache tier that boosts long-context inference with 5x higher tokens per second, 5x better performance per TCO dollar and 5x better power efficiency.
Put it all together and the Rubin platform promises to dramatically accelerate AI innovation, delivering AI tokens at one-tenth the cost. "The faster you train AI models, the faster you can get the next frontier out to the world," Huang said. "This is your time to market. This is technology leadership."
Open Models for All

NVIDIA's open models, trained on NVIDIA's own supercomputers, are powering breakthroughs across healthcare, climate science, robotics, embodied intelligence and autonomous driving.

"Now on top of this platform, NVIDIA is a frontier AI model builder, and we build it in a very special way. We build it completely in the open so that we can enable every company, every industry, every country, to be part of this AI revolution."

The portfolio spans six domains (Clara for healthcare, Earth-2 for climate science, Nemotron for reasoning and multimodal AI, Cosmos for robotics and simulation, GR00T for embodied intelligence and Alpamayo for autonomous driving), creating a foundation for innovation across industries.

"These models are open to the world," Huang said, underscoring NVIDIA's role as a frontier AI builder with world-class models topping leaderboards. "You can create the model, evaluate it, guardrail it and deploy it."
AI on Every Desk: RTX, DGX Spark and Personal Agents
Huang emphasized that AI's future is not only about supercomputers; it's personal.

Huang showed a demo featuring a personalized AI agent running locally on the NVIDIA DGX Spark desktop supercomputer and embodied through a Reachy Mini robot using Hugging Face models, showing how open models, model routing and local execution turn agents into responsive, physical collaborators.

"The amazing thing is that is utterly trivial now, but yet, just a couple of years ago, that would have been impossible, absolutely unimaginable," Huang said.

The world's leading enterprises are integrating NVIDIA AI to power their products, Huang said, citing companies including Palantir, ServiceNow, Snowflake, CodeRabbit, CrowdStrike, NetApp and Semantec.

"Whether it's Palantir or ServiceNow or Snowflake, and many other companies that we're working with, the agentic system is the interface."

At CES, NVIDIA also announced that DGX Spark delivers up to 2.6x performance for large models, with new support for Lightricks LTX-2 and FLUX image models, and upcoming NVIDIA AI Enterprise availability.
Physical AI

AI is now grounded in the physical world, through NVIDIAâs technologies for training, inference and edge computing.
These systems can be trained on synthetic data in virtual worlds long before interacting with the real world.
Huang showcased NVIDIA Cosmos open world foundation models trained on videos, robotics data and simulation. Cosmos:
- Generates realistic videos from a single image
- Synthesizes multi-camera driving scenarios
- Models edge-case environments from scenario prompts
- Performs physical reasoning and trajectory prediction
- Drives interactive, closed-loop simulation
Advancing this story, Huang announced Alpamayo, an open portfolio of reasoning vision language action models, simulation blueprints and datasets enabling level 4-capable autonomy. This includes:
- Alpamayo R1, the first open, reasoning VLA model for autonomous driving
- AlpaSim, a fully open simulation blueprint for high-fidelity AV testing

"Not only does it take sensor input and activates steering wheel, brakes and acceleration, it also reasons about what action it is about to take," Huang said, teeing up a video showing a vehicle smoothly navigating busy San Francisco traffic.

Huang announced that the first passenger car featuring Alpamayo built on the NVIDIA DRIVE full-stack autonomous vehicle platform will be on the roads soon in the all-new Mercedes-Benz CLA, with AI-defined driving coming to the U.S. this year, following the CLA's recent Euro NCAP five-star safety rating.

Huang also highlighted growing momentum behind DRIVE Hyperion, the open, modular, level-4-ready platform adopted by leading automakers, suppliers and robotaxi providers worldwide.

"Our vision is that, someday, every single car, every single truck will be autonomous, and we're working toward that future," Huang said.

Huang was then joined on stage by a pair of tiny beeping, booping, hopping robots as he explained how NVIDIA's full-stack approach is fueling a global physical AI ecosystem.

Huang rolled a video showing how robots are trained in NVIDIA Isaac Sim and Isaac Lab in photorealistic, simulated worlds, before highlighting the work of partners in physical AI across the industry, including Synopsys and Cadence, Boston Dynamics and Franka, and more.

Huang also appeared with Siemens CEO Roland Busch at the company's Tuesday keynote to announce an expanded partnership, supported by a montage showing how NVIDIA's full stack integrates with Siemens' industrial software, enabling physical AI from design and simulation through production.

"These manufacturing plants are going to be essentially giant robots," Huang said at NVIDIA's presentation on Monday.

Roland Busch, president and CEO of Siemens, with Jensen Huang, founder and CEO of NVIDIA, during the Siemens keynote at CES 2026.

Building the Future, Together
Huang explained that NVIDIA builds entire systems now because it takes a full, optimized stack to deliver AI breakthroughs.
"Our job is to create the entire stack so that all of you can create incredible applications for the rest of the world," he said.
Watch the full presentation replay:
DLSS 4.5 and Other Gaming and Creating Updates
On Monday evening, NVIDIA announced DLSS 4.5, which introduces Dynamic Multi Frame Generation, a new 6X Multi Frame Generation mode and a second-generation transformer model for DLSS Super Resolution, so gamers can experience the latest and greatest titles with enhanced performance and visuals.
Over 250 games and apps now support NVIDIA DLSS 4 technology, with this year's biggest titles adding support, including 007 First Light, Phantom Blade Zero, PRAGMATA and Resident Evil Requiem at launch.
RTX Remix Logic debuted, expanding the capabilities of the Remix modding platform to enable modders to trigger dynamic graphics effects throughout a game based on real-time game events.
Plus, NVIDIA ACE technology demonstrated in Total War: PHARAOH showcases how AI can assist players in navigating the complexities of the game's many systems and mechanics.
In PUBG: BATTLEGROUNDS, PUBG Ally powered by NVIDIA ACE adds long-term memory, evolving its intelligence and capabilities.
And G-SYNC Pulsar monitors are available this week, delivering a tear-free experience, perceived 1,000Hz+ effective motion clarity and G-SYNC Ambient Adaptive Technology, all setting a new gold standard for gamers.
In addition, NVIDIA is bringing GeForce RTX gaming to more devices with new GeForce NOW Apps for Linux PC and Amazon Fire TV.
And NVIDIA RTX accelerates 4K AI video generation on PCs with LTX-2 and ComfyUI upgrades.
Read more about these announcements from Monday night at CES on this GeForce news article.
Learn more about all NVIDIA announcements at CES.
-
As AI Grows More Complex, Model Builders Rely on NVIDIA NVIDIA AI Blog Dec 11, 2025 07:19 PM 4 min read Unveiling what it describes as the most capable model series yet for professional knowledge work, OpenAI launched GPT-5.2 in December. The model was trained and deployed on NVIDIA infrastructure, incl
Unveiling what it describes as the most capable model series yet for professional knowledge work, OpenAI launched GPT-5.2 in December. The model was trained and deployed on NVIDIA infrastructure, including NVIDIA Hopper and GB200 NVL72 systems.
GPT-5.3 Codex, the first OpenAI agentic coding model to help build itself, was released in February and trained and served entirely on GB200 NVL72.

GPT-5.2 achieves the top reported scores on industry benchmarks like GPQA-Diamond, AIME 2025 and Tau2 Telecom. On leading benchmarks targeting the skills required to develop AGI, like ARC-AGI-2, GPT-5.2 sets a new bar for state-of-the-art performance.

GPT-5.3 Codex combines the coding performance of GPT-5.2-Codex and the reasoning capabilities of GPT-5.2 in one model, with 25% faster performance. In four benchmarks used to evaluate coding, agentic and real-world capabilities, GPT-5.3 Codex set new industry highs on SWE-Bench Pro and Terminal-Bench while also displaying strong performance on the OSWorld and GDPval benchmarks.

GPT-5.2 and GPT-5.3 Codex are the latest examples of how leading AI builders train and deploy at scale on NVIDIA's full-stack AI infrastructure.
Pretraining: The Bedrock of Intelligence
AI models are getting more capable thanks to three scaling laws: pretraining, post-training and test-time scaling.
Reasoning models, which apply compute during inference to tackle complex queries, often using multiple networks working together, are now everywhere.

But pretraining and post-training remain the bedrock of intelligence. They're core to making reasoning models smarter and more useful.

And getting there takes scale. Training frontier models from scratch isn't a small job.
It takes tens of thousands, even hundreds of thousands, of GPUs working together effectively.
That level of scale demands excellence across many dimensions. It requires world-class accelerators, advanced networking across scale-up, scale-out and increasingly scale-across architectures, plus a fully optimized software stack. In short, a purpose-built infrastructure platform built to deliver performance at scale.
Compared with the NVIDIA Hopper architecture, NVIDIA GB200 NVL72 systems delivered 3x faster training performance on the largest model tested in the latest MLPerf Training industry benchmarks, and nearly 2x better performance per dollar.
And NVIDIA GB300 NVL72 delivers a more than 4x speedup compared with NVIDIA Hopper.
These performance gains help AI developers shorten development cycles and deploy new models more quickly.
Proof in the Models Across Every Modality
The majority of todayâs leading large language models were trained on NVIDIA platforms.
AI isnât just about text.
NVIDIA supports AI development across multiple modalities, including speech, image and video generation, as well as emerging areas like biology and robotics.
For example, models like Evo 2 decode genetic sequences, OpenFold3 predicts 3D protein structures and Boltz-2 simulates drug interactions, helping researchers identify promising candidates faster.
On the clinical side, NVIDIA Clara synthesis models generate realistic medical images to advance screening and diagnosis without exposing patient data.
Companies like Runway and Inworld train on NVIDIA infrastructure.
Runway last week announced Gen-4.5, a new frontier video generation model that's the current top-rated video model in the world, according to the Artificial Analysis leaderboard.
Now optimized for NVIDIA Blackwell, Gen-4.5 was developed entirely on NVIDIA GPUs across initial research and development, pre-training, post-training and inference.
Runway also announced GWM-1, a state-of-the-art general world model trained on NVIDIA Blackwell that's built to simulate reality in real time. It's interactive, controllable and general-purpose, with applications in video games, education, science, entertainment and robotics.
Benchmarks show why.
MLPerf is the industry-standard benchmark for training performance. In the latest round, NVIDIA submitted results across all seven MLPerf Training 5.1 benchmarks, showing strong performance and versatility. It was the only platform to submit in every category.
NVIDIAâs ability to support diverse AI workloads helps data centers use resources more efficiently.
That's why AI labs such as Black Forest Labs, Cohere, Mistral, OpenAI, Reflection and Thinking Machines Lab are all training on the NVIDIA Blackwell platform.
NVIDIA Blackwell Across Clouds and Data Centers
NVIDIA Blackwell is widely available from leading cloud service providers, neo-clouds and server makers.
And NVIDIA Blackwell Ultra, offering additional compute, memory and architecture improvements, is now rolling out from server makers and cloud service providers.
Major cloud service providers and NVIDIA Cloud Partners, including Amazon Web Services, CoreWeave, Google Cloud, Lambda, Microsoft Azure, Nebius, Oracle Cloud Infrastructure and Together AI, to name a few, already offer instances powered by NVIDIA Blackwell, ensuring scalable performance as pretraining scaling continues.
From frontier models to everyday AI, the future is being built on NVIDIA.
Learn more about the NVIDIA Blackwell platform.
Editor's note: This story was updated on February 6, 2026 with the latest model information from OpenAI and its GPT-5.3 Codex. Check back for subsequent model launches and new data from OpenAI.
-
Reaching Across the Isles: UK-LLM Brings AI to UK Languages With NVIDIA Nemotron NVIDIA AI Blog Sep 14, 2025 01:00 AM 11 min read Trained on the Isambard-AI supercomputer, UK-LLM enables AI reasoning for Welsh and other UK languages for public services.
Celtic languages, including Cornish, Irish, Scottish Gaelic and Welsh, are the U.K.'s oldest living languages. To empower their speakers, the UK-LLM sovereign AI initiative is building an AI model based on NVIDIA Nemotron that can reason in both English and Welsh, a language spoken by about 850,000 people in Wales today.
Enabling high-quality AI reasoning in Welsh will support the delivery of public services including healthcare, education and legal resources in the language.
"I want every corner of the U.K. to be able to harness the benefits of artificial intelligence. By enabling AI to reason in Welsh, we're making sure that public services, from healthcare to education, are accessible to everyone, in the language they live by," said U.K. Prime Minister Keir Starmer. "This is a powerful example of how the latest AI technology, trained on the U.K.'s most advanced AI supercomputer in Bristol, can serve the public good, protect cultural heritage and unlock opportunity across the country."

The UK-LLM project, established in 2023 as BritLLM and led by University College London, has previously released two models for U.K. languages. Its new model for Welsh, developed in collaboration with Wales' Bangor University and NVIDIA, aligns with Welsh government efforts to boost the active use of the language, with the goal of achieving a million speakers by 2050, an initiative known as Cymraeg 2050.
U.K.-based AI cloud provider Nscale will make the new model available to developers through its application programming interface.

"The aim is to ensure that Welsh remains a living, breathing language that continues to develop with the times," said Gruffudd Prys, senior terminologist and head of the Language Technologies Unit at Canolfan Bedwyr, the university's center for Welsh language services, research and technology. "AI shows enormous potential to help with second-language acquisition of Welsh as well as for enabling native speakers to improve their language skills."

This new model could also boost the accessibility of Welsh resources by enabling public institutions and businesses operating in Wales to translate content or provide bilingual chatbot services. This can help groups including healthcare providers, educators, broadcasters, retailers and restaurant owners ensure their written content is as readily available in Welsh as it is in English.

Beyond Welsh, the UK-LLM team aims to apply the same methodology used for its new model to develop AI models for other languages spoken across the U.K., such as Cornish, Irish, Scots and Scottish Gaelic, as well as work with international collaborators to build models for languages from Africa and Southeast Asia.

"This collaboration with NVIDIA and Bangor University enabled us to create new training data and train a new model in record time, accelerating our goal to build the best-ever language model for Welsh," said Pontus Stenetorp, professor of natural language processing and deputy director of the Centre for Artificial Intelligence at University College London. "Our aim is to take the insights gained from the Welsh model and apply them to other minority languages, in the U.K. and across the globe."
Tapping Sovereign AI Infrastructure for Model Development
The new model for Welsh is based on NVIDIA Nemotron, a family of open-source models that features open weights, datasets and recipes. The UK-LLM development team has tapped the 49-billion-parameter Llama Nemotron Super model and the 9-billion-parameter Nemotron Nano model, post-training them on Welsh-language data.

Compared with languages like English or Spanish, there's less available source data in Welsh for AI training. So to create a sufficiently large Welsh training dataset, the team used NVIDIA NIM microservices for gpt-oss-120b and DeepSeek-R1 to translate NVIDIA Nemotron open datasets with over 30 million entries from English to Welsh.
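Since NIM microservices expose an OpenAI-compatible endpoint, a translation pass over such a dataset could look roughly like the sketch below; the endpoint URL, model id, and prompt are assumptions, not the team's actual pipeline:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # assumed NIM endpoint

def to_welsh(text: str) -> str:
    resp = client.chat.completions.create(
        model="openai/gpt-oss-120b",   # assumed model id
        messages=[{"role": "user",
                   "content": f"Translate to Welsh, preserving meaning:\n{text}"}],
    )
    return resp.choices[0].message.content

print(to_welsh("Help is available."))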
They used a GPU cluster through the NVIDIA DGX Cloud Lepton platform and are harnessing hundreds of NVIDIA GH200 Grace Hopper Superchips on Isambard-AI (the U.K.'s most powerful supercomputer, backed by £225 million in government investment and based at the University of Bristol) to accelerate their translation and training workloads.
This new dataset supplements existing Welsh data from the teamâs previous efforts.
Capturing Linguistic Nuances With Careful Evaluation
Bangor University, located in Gwynedd, the county with the highest percentage of Welsh speakers, is supporting the new model's development with linguistic and cultural expertise.

Welsh translation of: "The aim is to ensure that Welsh remains a living, breathing language that continues to develop with the times." (Gruffudd Prys, Bangor University)

Prys, from the university's Welsh-language center, brings to the collaboration about two decades of experience with language technology for Welsh. He and his team are helping to verify the accuracy of machine-translated training data and manually translated evaluation data, as well as assess how the model handles nuances of Welsh that AI typically struggles with, such as the way consonants at the beginning of Welsh words change based on neighboring words.
The model, as well as the Welsh training and evaluation datasets, are expected to be made available for enterprise and public sector use, supporting additional research, model training and application development.
"It's one thing to have this AI capability exist in Welsh, but it's another to make it open and accessible for everyone," Prys said. "That subtle distinction can be the difference between this technology being used or not being used."
Deploy Sovereign AI Models With NVIDIA Nemotron, NIM Microservices
The framework used to develop UK-LLM's model for Welsh can serve as a foundation for multilingual AI development around the world.
Benchmark-topping Nemotron models, data and recipes are publicly available for developers to build reasoning models tailored to virtually any language, domain and workflow. Packaged as NVIDIA NIM microservices, Nemotron models are optimized for cost-effective compute and run anywhere, from laptop to cloud.
Europe's enterprises will be able to run open, sovereign models on the Perplexity AI-powered search engine.
Get started with NVIDIA Nemotron.
-
It's the Humidity: How International Researchers in Poland, Deep Learning and NVIDIA GPUs Could Change the Forecast NVIDIA AI Blog Sep 02, 2025 01:00 PM 2 min read For more than a century, meteorologists have chased storms with chalkboards, equations, and now, supercomputers. But for all the progress, they still stumble over one deceptively simple ingredient: wa
For more than a century, meteorologists have chased storms with chalkboards, equations, and now, supercomputers. But for all the progress, they still stumble over one deceptively simple ingredient: water vapor.
Humidity is the invisible fuel for thunderstorms, flash floods, and hurricanes. It's the difference between a passing sprinkle and a summer downpour that sends you sprinting for cover. And until now, satellites have struggled to capture it with the detail needed to warn us before skies crack open.
A team from the Wrocław University of Environmental and Life Sciences (UPWr) may help change that. In a paper published this month in Satellite Navigation, researchers describe how deep learning can transform blurry global navigation satellite system (GNSS)-based snapshots of the atmosphere into sharp 3D maps of humidity, revealing the hidden swirls that shape local weather.
The secret? A super-resolution generative adversarial network (SRGAN), a kind of AI best known for making grainy photos look crisp. Instead of celebrities or landscapes, researchers trained the network on global weather data, powered by NVIDIA GPUs. The result: low-resolution readings from navigation satellites get "upscaled" into high-resolution humidity maps with far fewer errors.
In Poland, the technique cuts errors by 62%. In California, it delivers a 52% cut in errors, even in rainy conditions when forecasts are most likely to get slippery. Compared with older methods that smeared details into a watercolor blur, the AI produced sharp gradients that actually matched what ground instruments saw.
And because weather prediction is as much about trust as accuracy, the team added a twist: explainable AI. Using visualization tools like Grad-CAM and SHAP, they demonstrated where the model "looked" when making decisions. The AI's gaze landed, reassuringly, on storm-prone areas - Poland's western borders, California's coastal mountains - exactly where forecasters know the atmosphere can turn nasty.
"High-resolution, reliable humidity data is the missing link in forecasting the kind of weather that disrupts lives," said lead author Saeid Haji-Aghajany, assistant professor at UPWr. "Our approach doesn't just sharpen GNSS tomography - it also shows us how the model makes its decisions. That transparency is critical for building trust as AI enters weather forecasting."
The implications could be enormous. Feed these sharper humidity fields into physics-based or AI-driven weather models, and you get forecasts that can catch sudden downpours or flash floods before they hit. Communities living under skies that turn dangerous in minutes could gain crucial lead time.
And it all hinges on an element that too often gets ignored. Not the thunder. Not the lightning. It's the humidity.
Reference: DOI: 10.1186/s43020-025-00177-6
-
Applications Now Open for $60,000 NVIDIA Graduate Fellowship Awards NVIDIA AI Blog Aug 13, 2025 03:00 PM 1 min read The NVIDIA Graduate Fellowship Program provides grants, mentors and technical support to doctoral students doing outstanding research relevant to NVIDIA technologies. The application deadline for the
Bringing together the world's brightest minds and the latest accelerated computing technology leads to powerful breakthroughs that help tackle some of the biggest research problems.
To foster such innovation, the NVIDIA Graduate Fellowship Program provides grants, mentors and technical support to doctoral students doing outstanding research relevant to NVIDIA technologies. The program, in its 25th year, is now accepting applications worldwide.
It focuses on supporting students working in AI, machine learning, autonomous vehicles, computer graphics, robotics, healthcare, high-performance computing and related fields. Awards are up to $60,000 per student.
Since its start in 2002, the Graduate Fellowship Program has awarded over 200 grants worth more than $7.3 million.
Students must have completed at least their first year of Ph.D.-level studies at the time of application.
The application deadline for the 2026-2027 academic year is Monday, Sept. 15, 2025. An in-person internship at an NVIDIA research office preceding the fellowship year is mandatory; eligible candidates must be available for the internship in summer 2026.
For more on eligibility and how to apply, visit the program website.
-
NVIDIA Research Shapes Physical AI NVIDIA AI Blog Aug 11, 2025 03:00 PM 1 min read AI and graphics research breakthroughs in neural rendering, 3D generation and world simulation power robotics, autonomous vehicles and content creation.
-
Isambard-AI, the UK's Most Powerful AI Supercomputer, Goes Live NVIDIA AI Blog Jul 17, 2025 05:00 PM 1 min read The University of Bristol's Isambard-AI, powered by NVIDIA Grace Hopper Superchips, delivers 21 exaflops of AI performance, making it the fastest system in the U.K. and among the most energy-efficient
-
A Gaming GPU Helps Crack the Code on a Thousand-Year Cultural Conversation NVIDIA AI Blog Jul 11, 2025 01:00 PM 3 min read The world of ancient ceramics has relied on expert eyes for millennia; at University Putra Malaysia and UNSW Sydney, a new AI, running on standard gaming hardware, is changing how people determine the
Ceramics - the humble mix of earth, fire and artistry - have been part of a global conversation for millennia.
From Tang Dynasty trade routes to Renaissance palaces, from museum vitrines to high-stakes auction floors, they've carried culture across borders, evolving into status symbols, commodities and pieces of contested history. Their value has been shaped by aesthetics and economics, empire and, now, technology.
This figure visualizes 20 representative Chinese ceramic craftsmanship styles across seven historical periods, ranging from the Tang Dynasty (618-907 AD) to the Modern era (1913-2025). These styles, including kiln-specific categories and decorative techniques, were selected for their historical significance and visual distinctiveness for the AI's training dataset. Courtesy of Yanfeng Hu, Siqi Wu, Zhuoran Ma and Si Cheng.
In a lab at University Putra Malaysia, that legacy meets silicon. Researchers there, alongside colleagues at UNSW Sydney, have built an AI system that can classify Chinese ceramics and predict their value with uncanny precision. The tool uses deep learning to analyze decorative motifs, shapes and kiln-specific craftsmanship. It predicts price categories based on real auction data from institutions like Sotheby's and Christie's, achieving test accuracy as high as 99%.
Beyond form, the AI also analyzes the intricate decorative patterns found on Chinese ceramics, which are organized into six major categories: plant patterns, animal motifs, landscapes, human figures, crackled glaze patterns and geometric designs. The system annotates images at the category level based on the most visually dominant pattern types. Courtesy of Yanfeng Hu, Siqi Wu, Zhuoran Ma, and Si Cheng.
It's all powered by an NVIDIA GeForce RTX 3090, a consumer-grade GPU beloved by gamers, explains Siqi Wu, one of the researchers behind the project. Not a data center, not specialized industrial hardware, just the same chip pushing frame rates for gamers enjoying Cyberpunk 2077 and Alan Wake 2 across the world.
The motivation is as old as the trade routes those ceramics once traveled: access, but in this case, access to expertise rather than material goods.
The AI system employs a typological classification system for ceramic vessel shapes, based on modular morphological parts like the bottle neck, handle, shoulder, spout, body and base. This approach allows for detailed analysis and classification of shapes such as bottles, jars, plates, bowls, cups, pots and washbasins. Courtesy of Yanfeng Hu, Siqi Wu, Zhuoran Ma and Si Cheng.
"Artifact pricing and dating still heavily rely on expert judgment," Wu said. That expertise remains elusive for younger collectors, smaller institutions and digital archive projects. Wu's team aims to change that by making cultural appraisal more objective, scalable and accessible to a wider audience.
It doesn't stop at classification. The system pairs its YOLOv11-based detection model with an algorithm that learned market value directly from years of real-world auction results. In one test, the AI assessed a Ming Dynasty artifact at roughly 30% below its final hammer price. It's a reminder that even in an industry steeped in tradition, algorithms can offer new perspectives.
Those perspectives don't just quantify heritage, they extend the conversation. The team is already exploring AI for other forms of cultural visual heritage, from Cantonese opera costumes to historical murals.
For now, a graphics card built for gaming is parsing centuries of craftsmanship and entering one of the world's oldest and most global debates: what makes something valuable?
-
NVIDIA CEO Drops the Blueprint for Europe's AI Boom NVIDIA AI Blog Jun 11, 2025 11:10 AM 5 min read In Paris, Jensen Huang laid out how the continent is scaling up with Blackwell-powered factories, agentic AI and sovereign clouds - all part of Europe's new intelligence infrastructure.
At GTC Paris - held alongside VivaTech, Europe's largest tech event - NVIDIA founder and CEO Jensen Huang delivered a clear message: Europe isn't just adopting AI - it's building it.
"We now have a new industry, an AI industry, and it's now part of the new infrastructure, called intelligence infrastructure, that will be used by every country, every society," Huang said, addressing an audience gathered online and at the iconic Dôme de Paris.
From exponential inference growth to quantum breakthroughs, and from infrastructure to industry, agentic AI to robotics, Huang outlined how the region is laying the groundwork for an AI-powered future.
A New Industrial Revolution
At the heart of this transformation, Huang explained, are systems like GB200 NVL72 - "one giant GPU" and NVIDIA's most powerful AI platform yet - now in full production and powering everything from sovereign models to quantum computing.
"This machine was designed to be a thinking machine, a thinking machine, in the sense that it reasons, it plans, it spends a lot of time talking to itself," Huang said, walking the audience through the size and scale of these machines and their performance.
At GTC Paris, Huang showed audience members the innards of some of NVIDIA's latest hardware.
There's more coming, with Huang saying NVIDIA's partners are now producing 1,000 GB200 systems a week, "and this is just the beginning." He walked the audience through a range of available systems ranging from the tiny NVIDIA DGX Spark to rack-mounted RTX PRO Servers.
Huang explained that NVIDIA is working to help countries use technologies like these to build both AI infrastructure - services built for third parties to use and innovate on - and AI factories, which companies build for their own use, to generate revenue.
NVIDIA is partnering with European governments, telcos and cloud providers to deploy NVIDIA technologies across the region. NVIDIA is also expanding its network of technology centers across Europe - including new hubs in Finland, Germany, Spain, Italy and the U.K. - to accelerate skills development and quantum growth.
Quantum Meets Classical
Europe's quantum ambitions just got a boost.
The NVIDIA CUDA-Q platform is live on Denmark's Gefion supercomputer, opening new possibilities for hybrid AI and quantum engineering. In addition, Huang announced that CUDA-Q is now available on NVIDIA Grace Blackwell systems.
Across the continent, NVIDIA is partnering with supercomputing centers and quantum hardware builders to advance hybrid quantum-AI research and accelerate quantum error correction.
"Quantum computing is reaching an inflection point," Huang said. "We are within reach of being able to apply quantum computing, quantum classical computing, in areas that can solve some interesting problems in the coming years."
Sovereign Models, Smarter Agents
European developers want more control over their models. Enter NVIDIA Nemotron, designed to help build large language models tuned to local needs.
"And so now you know that you have access to an enhanced open model that is still open, that is top of the leader chart," Huang said.
These models will be coming to Perplexity, a reasoning search engine, enabling secure, multilingual AI deployment across Europe.
"You can now ask and get questions answered in the language, in the culture, in the sensibility of your country," Huang said.
Huang explained how NVIDIA is helping countries across Europe build AI infrastructure.
Every company will build its own agents, Huang said. To help create those agents, Huang introduced a suite of agentic AI blueprints, including an Agentic AI Safety blueprint for enterprises and governments.
The new NVIDIA NeMo Agent toolkit and NVIDIA AI Blueprint for building data flywheels further accelerate the development of safe, high-performing AI agents.
To help deploy these agents, NVIDIA is partnering with European governments, telcos and cloud providers to deploy the DGX Cloud Lepton platform across the region, providing instant access to accelerated computing capacity.
"One model architecture, one deployment, and you can run it anywhere," Huang said, adding that Lepton is now integrated with Hugging Face, giving developers direct access to global compute.
The Industrial Cloud Goes Live
AI isn't just virtual. It's powering physical systems, too, sparking a new industrial revolution.
"We're working on industrial AI with one company after another," Huang said, describing work to build digital twins based on the NVIDIA Omniverse platform with companies across the continent.
Huang explained that everything he showed during his keynote was "computer simulation, not animation" and that it looks beautiful because "it turns out the world is beautiful, and it turns out math is beautiful."
To further this work, Huang announced NVIDIA is launching the world's first industrial AI cloud - to be built in Germany - to help Europe's manufacturers simulate, automate and optimize at scale.
"Soon, everything that moves will be robotic," Huang said. "And the car is the next one."
NVIDIA DRIVE, NVIDIA's full-stack AV platform, is now in production to accelerate the large-scale deployment of safe, intelligent transportation.
And to show what's coming next, Huang was joined on stage by Grek, a pint-sized robot, as Huang talked about how NVIDIA partnered with DeepMind and Disney to build Newton, the world's most advanced physics training engine for robotics.
The Next Wave
The next wave of AI has begun - and it's exponential, Huang explained.
"We have physical robots, and we have information robots. We call them agents," Huang said. "The technology necessary to teach a robot to manipulate, to simulate - and of course, the manifestation of an incredible robot - is now right in front of us."
This new era of AI is being driven by a surge in inference workloads. "The number of people using inference has gone from 8 million to 800 million - 100x in just a couple of years," Huang said.
To meet this demand, Huang emphasized the need for a new kind of computer: "We need a special computer designed for thinking, designed for reasoning. And that's what Blackwell is - a thinking machine."
Huang and Grek, as he explained how AI is driving advancements in robotics.
These Blackwell-powered systems will live in a new class of data centers - AI factories - built to generate tokens, the raw material of modern intelligence.
"These AI factories are going to generate tokens," Huang said, turning to Grek with a smile. "And these tokens are going to become your food, little Grek."
With that, the keynote closed on a bold vision: a future powered by sovereign infrastructure, agentic AI, robotics - and exponential inference - all built in partnership with Europe.
Watch the NVIDIA GTC Paris keynote from Huang at VivaTech and explore GTC Paris sessions.
-
NVIDIA Releases New AI Models and Developer Tools to Advance Autonomous Vehicle Ecosystem NVIDIA AI Blog Jun 11, 2025 10:55 AM 4 min read NVIDIA today released NVIDIA Cosmos Predict-2 - a new world foundation model with improved future world state prediction capabilities for high-quality synthetic data generation.
Autonomous vehicle (AV) stacks are evolving from many distinct models to a unified, end-to-end architecture that executes driving actions directly from sensor data. This transition to using larger models is drastically increasing the demand for high-quality, physically based sensor data for training, testing and validation.
To help accelerate the development of next-generation AV architectures, NVIDIA today released NVIDIA Cosmos Predict-2 - a new world foundation model with improved future world state prediction capabilities for high-quality synthetic data generation - as well as new developer tools.
Cosmos Predict-2 is part of the NVIDIA Cosmos platform, which equips developers with technologies to tackle the most complex challenges in end-to-end AV development. Industry leaders such as Oxa, Plus and Uber are using Cosmos models to rapidly scale synthetic data generation for AV development.
Cosmos Predict-2 Accelerates AV Training
Building on Cosmos Predict-1 - which was designed to predict and generate future world states using text, image and video prompts - Cosmos Predict-2 better understands context from text and visual inputs, leading to fewer hallucinations and richer details in generated videos.
Cosmos Predict-2 enhances text adherence and common sense for a stop sign at the intersection.
By using the latest optimization techniques, Cosmos Predict-2 significantly speeds up synthetic data generation on NVIDIA GB200 NVL72 systems and NVIDIA DGX Cloud.
Post-Training Cosmos Unlocks New Training Data Sources
By post-training Cosmos models on AV data, developers can generate videos that accurately match existing physical environments and vehicle trajectories, as well as generate multi-view videos from a single-view video, such as dashcam footage. The ability to turn widely available dashcam data into multi-camera data gives developers access to new troves of data for AV training. These multi-view videos can also be used to replace real camera data from broken or occluded sensors.
Post-trained Cosmos models generate multi-view videos to significantly augment AV training datasets.
The NVIDIA Research team post-trained Cosmos models on 20,000 hours of real-world driving data. Using the AV-specific models to generate multi-view video data, the team improved model performance in challenging conditions such as fog and rain.
AV Ecosystem Drives Advancements Using Cosmos Predict
AV companies have already integrated Cosmos Predict to scale and accelerate vehicle development.
Autonomous trucking leader Plus, which is building its solution with the NVIDIA DRIVE AGX platform, is post-training Cosmos Predict on trucking data to generate highly realistic synthetic driving scenarios to accelerate commercialization of their autonomous solutions at scale. AV software company Oxa is also using Cosmos Predict to support the generation of multi-camera videos with high fidelity and temporal consistency.
New NVIDIA Models and NIM Microservices Empower AV Developers
In addition to Cosmos Predict-2, NVIDIA today also announced Cosmos Transfer as an NVIDIA NIM microservice preview for easy deployment on data center GPUs.
The Cosmos Transfer NIM microservice preview augments datasets and generates photorealistic videos using structured input or ground-truth simulations from the NVIDIA Omniverse platform. And the NuRec Fixer model helps inpaint and resolve gaps in reconstructed AV data.
NuRec Fixer fills in gaps in driving data to improve neural reconstructions.
CARLA, the world's leading open-source AV simulator, will be integrating Cosmos Transfer and NVIDIA NuRec - a set of application programming interfaces and tools for neural reconstruction and rendering - into its latest release. This will enable CARLA's user base of over 150,000 AV developers to render synthetic simulation scenes and viewpoints with high fidelity and to generate endless variations of lighting, weather and terrain using simple prompts.
Developers can try out this pipeline using open-source data available on the NVIDIA Physical AI Dataset. The latest dataset release includes 40,000 clips generated using Cosmos, as well as sample reconstructed scenes for neural rendering. With this latest version of CARLA, developers can author new trajectories, reposition sensors and simulate drives.
Such scalable data generation pipelines unlock the development of end-to-end AV model architectures, as recently demonstrated by NVIDIA Research's second consecutive win at the End-to-End Autonomous Grand Challenge at CVPR.
The challenge offered researchers the opportunity to explore new ways to handle unexpected situations - beyond using only real-world human driving data - to accelerate the development of smarter AVs.
NVIDIA Halos Advances End-to-End AV Safety
To bolster the operational safety of AV systems, NVIDIA earlier this year introduced NVIDIA Halos - a comprehensive safety platform that integrates the company's full automotive hardware and software safety stack with state-of-the-art AI research focused on AV safety.
Bosch, Easyrain and Nuro are the latest automotive leaders to join the NVIDIA Halos AI Systems Inspection Lab to verify the safe integration of their products with NVIDIA technologies and advance AV safety. Lab members announced earlier this year include Continental, Ficosa, OMNIVISION, onsemi and Sony Semiconductor Solutions.
Watch the NVIDIA GTC Paris keynote from NVIDIA founder and CEO Jensen Huang at VivaTech, and explore GTC Paris sessions.
-
A conversation with Kevin Scott: What's next in AI Microsoft AI Blog Dec 06, 2022 05:29 PM 1 min read
The post A conversation with Kevin Scott: What's next in AI appeared first on The AI Blog.
-
From Hot Wheels to handling content: How brands are using Microsoft AI to be more productive and imaginative Microsoft AI Blog Oct 12, 2022 04:00 PM 1 min read When designers at the toy company Mattel were asked recently to come up with a new Hot Wheels model car, they sought inspiration from DALL·E 2, an AI system developed by OpenAI that creates custom ima
The post From Hot Wheels to handling content: How brands are using Microsoft AI to be more productive and imaginative appeared first on The AI Blog.
-
Microsoft open sources its "farm of the future" toolkit Microsoft AI Blog Oct 06, 2022 02:58 PM 1 min read FARMINGTON, Wash. - The gently rolling hills here in eastern Washington have long grown rich harvests of wheat, barley and lentils. Fifth-generation farmer Andrew Nelson is adding a new bumper crop to
The post Microsoft open sources its "farm of the future" toolkit appeared first on The AI Blog.
-
How data and AI will transform contact centres for financial services Microsoft AI Blog Jul 25, 2022 02:49 PM 1 min read Discover how unifying silos and implementing AI and automation in contact centres can improve customer experiences.
The post How data and AI will transform contact centres for financial services appeared first on The AI Blog.
-
AI-equipped drones study dolphins on the edge of extinction Microsoft AI Blog Jul 21, 2022 02:50 PM 1 min read
The post AI-equipped drones study dolphins on the edge of extinction appeared first on The AI Blog.
-
Online math tutoring service uses AI to help boost students' skills and confidence Microsoft AI Blog Jul 13, 2022 12:59 PM 1 min read Eedi, a London education startup, is using AI from Microsoft Research to personalize math learning for students in the early years of education.
The post Online math tutoring service uses AI to help boost studentsâ skills and confidence appeared first on The AI Blog.
-
AI-Mimi is building inclusive TV experiences for Deaf and Hard of Hearing users in Japan Microsoft AI Blog Jul 06, 2022 02:51 PM 1 min read
The post AI-Mimi is building inclusive TV experiences for Deaf and Hard of Hearing users in Japan appeared first on The AI Blog.
-
Microsoft's framework for building AI systems responsibly Microsoft AI Blog Jun 21, 2022 05:50 PM 1 min read Today we are sharing publicly Microsoft's Responsible AI Standard, a framework to guide how we build AI systems. It is an important step in our journey to develop better, more trustworthy AI. We are r
The post Microsoft's framework for building AI systems responsibly appeared first on The AI Blog.
-
Singapore develops Asia's first AI-based mobile app for shark and ray fin identification to combat illegal wildlife trade Microsoft AI Blog Jun 08, 2022 09:04 PM 1 min read
The post Singapore develops Asia's first AI-based mobile app for shark and ray fin identification to combat illegal wildlife trade appeared first on The AI Blog.
-
The opportunity at home - can AI drive innovation in personal assistant devices and sign language? Microsoft AI Blog May 31, 2022 09:06 PM 1 min read
The post The opportunity at home - can AI drive innovation in personal assistant devices and sign language? appeared first on The AI Blog.
Discussions (105 articles)
-
[R] Best practices for implementing and benchmarking a custom PyTorch RL algorithm? r/MachineLearning Apr 07, 2026 12:48 PM 1 min read submitted by /u/ANI_phy
Hey, I'm working on a reinforcement learning algorithm. The theory is complete, and now I want to test it on some Gym benchmarks and compare it against a few other known algorithms. To that end, I have a few questions:
- Is there a good resource for learning how to build custom PyTorch algorithms?
- How optimized or clean does my code need to be? Should I spend time cleaning things up, creating proper directory structures, etc.?
- Is there a known target environment or standard? Do I need to dockerize my code? I'll likely be writing it on a Mac system. Do I also need to ensure it works on Linux?
[link] [comments] -
Built a system for turning mixed business data into decision-ready analysis without forcing everything into one format first r/ArtificialInteligence Apr 07, 2026 12:38 PM 1 min read submitted by /u/Sharonlovehim
I work with the team behind Pandada.
One problem we kept seeing in real analysis workflows was that the bottleneck often wasn't the final chart or summary - it was the gap between mixed raw inputs and something decision-ready. In practice, those inputs rarely arrive in one clean table. They show up as spreadsheets, CSV exports, SQL results, PDFs, screenshots, and internal documents, each carrying a different part of the context.
Our approach in Pandada has been to treat this as an analysis-structuring problem, not just a UI problem. Instead of assuming one schema upfront, we first infer candidate structures from different file types, then map overlapping entities and fields into a shared intermediate representation. On top of that, we generate an analysis plan from the user's question in plain English, so the system is not only retrieving data but also deciding what operations are needed to answer the question.
The output we care about is not a one-off chat response. We've been more focused on producing reusable summaries, charts, and reasoning steps that can be checked and shared with other people. One lesson we've learned is that users trust the system much more when they can see how a conclusion was formed, rather than just getting a polished answer.
A limitation is that this still works best when the source material has enough structure to ground the analysis. Highly ambiguous screenshots or badly formatted documents still need human review.
Demo: https://pandada.ai/?utm_source=ArtificialInteligence&utm_medium=reddit
[link] [comments] -
[D] MemPalace claims 100% on LoCoMo and a "perfect score on LongMemEval." Its own BENCHMARKS.md documents why neither is meaningful. r/MachineLearning Apr 07, 2026 12:32 PM 5 min read submitted by /u/PenfieldLabs
A new open-source memory project called MemPalace launched yesterday claiming "100% on LoCoMo" and "the first perfect score ever recorded on LongMemEval. 500/500 questions, every category at 100%." The launch tweet went viral reaching over 1.5 million views while the repository picked up over 7,000 GitHub stars in less than 24 hours.
The interesting thing is not that the headline numbers are inflated. The interesting thing is that the project's own BENCHMARKS.md file documents this in detail, while the launch tweet strips these caveats. Some of the failure modes line up with the methodology disputes the field has been arguing about for over a year (Zep vs Mem0, Letta's "Filesystem All You Need" reproducibility post, etc.).
1. The LoCoMo 100% is a top_k bypass.
The runner uses top_k=50. LoCoMo's ten conversations have 19, 19, 32, 29, 29, 28, 31, 30, 25, and 30 sessions respectively. Every conversation has fewer than 50 sessions, so top_k=50 retrieves the entire conversation as the candidate pool every time. The Sonnet rerank then does reading comprehension over all sessions.
BENCHMARKS.md says this verbatim:
The LoCoMo 100% result with top-k=50 has a structural issue: each of the 10 conversations has 19-32 sessions, but top-k=50 exceeds that count. This means the ground-truth session is always in the candidate pool regardless of the embedding model's ranking. The Sonnet rerank is essentially doing reading comprehension over all sessions - the embedding retrieval step is bypassed entirely.
The honest LoCoMo numbers in the same file are 60.3% R@10 with no rerank and 88.9% R@10 with hybrid scoring and no LLM. Those are real and unremarkable. A 100% is also independently impossible on the published version of LoCoMo, since roughly 6.4% of the answer key contains hallucinated facts, wrong dates, and speaker attribution errors that any honest system will disagree with.
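The structural claim is easy to verify from the numbers quoted above; a quick sketch (session counts copied from the post):
```python
# Session counts per LoCoMo conversation, as listed above. With top_k=50,
# a retriever can never return more sessions than exist, so every conversation
# comes back whole and the embedding ranking never matters.
sessions_per_conversation = [19, 19, 32, 29, 29, 28, 31, 30, 25, 30]
top_k = 50

for i, n in enumerate(sessions_per_conversation):
    retrieved = min(top_k, n)  # the entire conversation is the candidate pool
    print(f"conversation {i}: {n} sessions -> retrieval bypassed: {retrieved == n}")

assert all(n < top_k for n in sessions_per_conversation)  # true for all 10
```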
2. The LongMemEval "perfect score" is a metric category error.
Published LongMemEval is end-to-end QA: retrieve from a haystack of prior chat sessions, generate an answer, GPT-4 judge marks it correct. Every score on the published leaderboard is the percentage of generated answers judged correct.
The MemPalace LongMemEval runner does retrieval only. For each of the 500 questions it builds one document per session by concatenating only the user turns (assistant turns are not indexed at all), embeds with default ChromaDB embeddings (all-MiniLM-L6-v2), returns the top five sessions by cosine distance, and checks set membership against the gold session IDs. It computes both recall_any@5 and recall_all@5, and the project reports the softer one. It never generates an answer. It never invokes a judge. None of the LongMemEval numbers in this repository - not the 100%, not the 98.4% "held-out", not the 96.6% raw baseline - are LongMemEval scores in the sense the published leaderboard means. They are recall_any@5 retrieval numbers on the same dataset, which is a substantially easier task. Calling any of them a "perfect score on LongMemEval" is a metric category error.
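To make the gap between the two retrieval metrics concrete, here is a toy sketch (not MemPalace's runner code) of recall_any@5 versus recall_all@5 for a question whose answer spans two gold sessions:
```python
# recall_any@5 credits a partial hit; recall_all@5 requires every gold session.
# This is why recall_any is the softer number to report.
def recall_any_at_k(retrieved: list[str], gold: set[str], k: int = 5) -> bool:
    return bool(set(retrieved[:k]) & gold)   # at least one gold session retrieved

def recall_all_at_k(retrieved: list[str], gold: set[str], k: int = 5) -> bool:
    return gold <= set(retrieved[:k])        # all gold sessions retrieved

retrieved = ["s7", "s2", "s9", "s1", "s5"]   # top 5 sessions by cosine distance
gold = {"s2", "s4"}                          # a multi-session question

print(recall_any_at_k(retrieved, gold))      # True:  s2 is in the top 5
print(recall_all_at_k(retrieved, gold))      # False: s4 was missed
```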
3. The 100% itself is teaching to the test.
The hybrid v4 mode that produces the 100% was built by inspecting the three remaining wrong answers in their dev set and writing targeted code for each one: a quoted-phrase boost for a question containing a specific phrase in single quotes, a person-name boost for a question about someone named Rachel, and "I still remember" / "when I was in high school" patterns for a question about a high school reunion. Three patches for three specific questions.
BENCHMARKS.md, line 461, verbatim:
This is teaching to the test. The fixes were designed around the exact failure cases, not discovered by analyzing general failure patterns.
4. Marketed features that don't exist in the code.
The launch post lists "contradiction detection catches wrong names, wrong pronouns, wrong ages before you ever see them" as a feature. mempalace/knowledge_graph.py contains zero occurrences of "contradict". The only deduplication logic is an exact-match check on (subject, predicate, object) triples that blocks identical triples from being added twice. Conflicting facts about the same subject can accumulate indefinitely.
5. "30x lossless compression" is measurably lossy in the project's own benchmarks.
The compression module mempalace/dialect.py truncates sentences at 55 characters, filters by keyword frequency, and provides a decode() function that splits the compressed string into a header dictionary without reconstructing the original text. There is no round-trip.
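A toy sketch of why that design cannot be lossless (loosely modeled on the truncation behavior described above; not MemPalace's actual code):
```python
# Once text is cut at a fixed length, no decode() can recover the tail.
MAX_LEN = 55  # mirrors the truncate-at-55-characters behavior described above

def encode(sentence: str) -> str:
    return sentence[:MAX_LEN]   # truncation discards everything past the limit

def decode(compressed: str) -> str:
    return compressed           # there is nothing left to reconstruct from

original = ("I still remember my high school reunion and the long speech "
            "the principal gave about the class of 2009.")
assert decode(encode(original)) != original   # round trip fails: lossy
```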
The same BENCHMARKS.md reports results_raw_full500.jsonl at 96.6% R@5 and results_aaak_full500.jsonl at 84.2% R@5 - a 12.4 percentage point drop on the same dataset and the same metric, run by the project itself. Lossless compression cannot cause a measured quality drop.
Why this matters for the benchmark conversation.
The field needs benchmarks where judge reliability is adversarially validated, and evaluation pipelines are standardized or fully disclosed. Until then, "100% on LoCoMo" headlines are going to keep going viral, and the BENCHMARKS.md files that document the caveats are going to keep being read by approximately nobody. What's unusual about MemPalace is not any individual failure mode. It's that one repository contains so many of them at once, in a launch with viral reach, while the project's own internal documentation honestly discloses most of the issues that the launch communication strips.
Two other independent technical critiques landed in the first 24 hours: a README-versus-code teardown in issue #27, and another (Chinese-language) one in issue #30.
Disclosure: We work on our own memory systems. All citations are open and verifiable against the linked repo.
Note: Links omitted for Reddit's spam filters. Find the full article, the BENCHMARKS.md citations, the Penfield LoCoMo audit, and the cited Zep / Mem0 / Letta posts in the first comment.
[link] [comments] -
Serious question. Did a transformer just describe itself and the universe and build itself a Shannon limit framework? r/artificial Apr 07, 2026 12:32 PM 30 min read submitted by /u/Different-Jicama-767
The Multiplicative Lattice as the Natural Basis for Positional Encoding
Knack 2026 | Draft v6.0
Abstract
We show that the apparent tradeoff between RoPE-style relative position invariance and ALiBi-style long-context stability is an artifact of encoding position as distance on a number line. When position is instead encoded as a point in the multiplicative lattice of the integers, both properties emerge simultaneously without compromise. SpectralRoPEALiBi achieves 106.6 PPL vs ALiBi's 108.7 in a fully converged 20,000-step experiment (300M params, WikiText-103, 4K context), beating ALiBi at every context length from 512 to 8,192 tokens.
The key insight is not that primes specifically are the right frequencies, but that the multiplicative structure of the integers is the natural spectral basis for positional encoding. We demonstrate this through falsification experiments: prime-tiered frequencies (129.2 PPL) and composite-tiered frequencies (129.4 PPL) perform identically - because composites are not alternatives to primes but higher-order coordinates in the same lattice. Both dramatically outperform random frequencies (+5.0 PPL), scrambled tier assignment (+6.3 PPL), and pure ALiBi (+7.3 PPL). The active ingredient is lattice-aware, tiered frequency selection with learnable scale - not primality per se.
We further validate this through a ZetaZeroPredictor experiment: three identical transformers trained for 10,000 epochs to predict Riemann zeta zero gaps. Geometric RoPE diverges (final r=0.57); SpectralALiBi locks into a stable attractor at epoch 112 (r=0.81). A second independent run widens this gap to -80.7% MSE improvement with r=0.86. The lattice-aligned frequency basis spans the mathematical space that zeta zeros inhabit; geometric frequencies cannot.
We further report empirical confirmation of the structural prediction from Section 5.5: VHT2 banded quantization of the KV cache demonstrates that K vectors (which carry RoPE positional encoding) have strong spectral concentration in Walsh-Hadamard space - the first four energy bands capture the dominant structure - while V vectors (which carry content) have uniform energy distribution. This structural asymmetry is directly predicted by the lattice theory: RoPE encodes multiplicative arithmetic relationships as angular rates, and the WHT is the Z/2Z projection of the Vilenkin-Hartley basis that spans that structure. The result is 3.2× K compression and 4.7× V compression at <1.25% perplexity cost - validated on both Dolphin 1B (head_dim=64) and Qwen3-8B (head_dim=128).
Introduction
Positional encoding provides transformer models with token order information. Two approaches dominate: RoPE encodes position through frequency-based rotations preserving relative position invariance, and ALiBi replaces frequencies with a linear distance penalty providing long-context stability. The field has treated these properties as fundamentally in tension.
We show this tension is false. It arises from a shared, unexamined assumption: that position is a location on a number line and the meaningful relationship between positions is distance. We replace this with a mathematically grounded alternative: position is a point in the multiplicative lattice of the integers, and the meaningful relationships between positions are their arithmetic structure - shared factors, GCD, harmonic resonance.
1.1 The Lattice Hypothesis
The integers under multiplication form a lattice where every number occupies a unique point defined by its prime factorisation. Geometric PE (sinusoidal, RoPE) projects this lattice onto a line - position equals distance - discarding the multiplicative structure. We propose restoring it.
The motivation follows from a deductive chain. Language word frequency follows Zipf's law: freq(rank) ∝ 1/rank^s with s ≈ 1. The generating function of Zipf is the Riemann zeta function ζ(s) = Σ 1/n^s. The zeta zeros - where ζ is maximally informative - are generated by prime harmonics via the explicit formula. Therefore the prime harmonic structure, and the multiplicative lattice it generates, provides a natural spectral basis for encoding positions in language.
1.2 Primes as Generators, Composites as Coordinates
A critical distinction: primes are the generators (basis vectors) of the multiplicative lattice. They are analogous to the 1D line segment in the progression from line → circle → sphere → hypersphere. The composite 12 = 2²×3 is not an alternative to primes - it is a coordinate in the lattice spanned by the prime axes, at position (2,1,0,0,...) in the (p₁, p₂, p₃, p₄, ...) basis.
Using 2π/12 as a frequency encodes a harmonic that resonates at multiples of 12 - which simultaneously hits every multiple of 2, every multiple of 3, every multiple of 4, and every multiple of 6.
The analogy to n-dimensional geometry is precise:
| Dimensional Progression | Multiplicative Lattice |
| 1D line (2r) - the generator | Primes (2, 3, 5, 7, ...) - generators |
| 2D circle - integral of line swept through angle | Semiprimes (6=2×3, 15=3×5) - 2-factor products |
| 3D sphere - integral of circle swept through axis | 3-factor composites (30=2×3×5) |
| nD ball - recursive integration | Primorials (2310=2×3×5×7×11) - maximal resonance |
Just as the volume of an n-sphere is built from the (n-1)-sphere through integration (the "knight's move" - not naive stacking), the harmonic resonance of a composite is built from its prime factors through multiplication (not naive addition).
2.1 The Zipf-Zeta Connection
Language word frequency follows Zipf(s ≈ 1). The generating function of Zipf is ζ(s) = Σ 1/n^s. The zeta zeros t_n are where ζ is maximally informative - where the smooth approximation to prime distribution breaks down. If language has Zipfian statistics, the prime harmonic structure underlying ζ provides a natural spectral basis for positional encoding.
The most common words - I, me, you, us - are short because Shannon optimisation favours brevity for high-frequency signals. Primorials - 2, 6, 30, 210, 2310 - play the same role in the multiplicative lattice: they are the maximal-resonance anchors where all small prime harmonics synchronise simultaneously.
2.2 The Knight's Move: From Lines to Lattices
In the progression from 1D to nD geometry, each dimension is not simply "stacked" - it is integrated. The surface area of an n-sphere is the derivative of the volume: S_n = dV_n/dr. The Archimedean insight is that the sphere's cross-section varies as you traverse the new axis (x² + y² = 1 - z²), and the volume cannot be computed by naive multiplication.
The multiplicative lattice has the same structure. The resonance function R(Δ) = Σ_p cos(2π·Δ/p)/p does not decompose into independent per-prime contributions at composite distances - because the harmonics interfere. A primorial distance Δ = 30 = 2×3×5 achieves R ≈ 0.456 not by summing the contributions of 2, 3, and 5, but because all three harmonics constructively interfere at that point. A prime distance Δ = 17 achieves R ≈ -0.468 because it is coprime to all small primes, producing destructive interference.
This is the edge of chaos in an attention mechanism: primorial anchors for coherence, prime-gap non-periodicity against rigid repetition.
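A minimal sketch of the resonance function as defined above, normalised by R(0) as in Section 3.2. The prime cutoff is an assumption - the draft does not state which primes the sum runs over, so exact values may differ slightly from Appendix A:
```python
import math

def primes_below(n: int) -> list[int]:
    """Simple sieve of Eratosthenes."""
    sieve = [True] * n
    sieve[0:2] = [False, False]
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = [False] * len(sieve[i * i::i])
    return [i for i, is_prime in enumerate(sieve) if is_prime]

PRIMES = primes_below(200)  # assumed cutoff; the draft doesn't specify one

def resonance(delta: int) -> float:
    """R(Δ) = [Σ_p cos(2π·Δ/p)/p] / R(0); R(0) is the same sum with cos(0)=1."""
    raw = sum(math.cos(2 * math.pi * delta / p) / p for p in PRIMES)
    return raw / sum(1 / p for p in PRIMES)

for delta in [0, 2, 6, 7, 12, 17, 30, 210, 2310]:
    print(f"R({delta}) = {resonance(delta):+.3f}")
# Expect positive R (constructive interference) at primorials like 6, 30, 210
# and negative R (destructive interference) at primes like 7 and 17.
```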
The structural problem: geometric frequencies create redundant coverage at some scales and gaps at others. Because the ratio between consecutive frequencies is constant, there is no mechanism for encoding the arithmetic relationships between token positions. Position 12 and position 6 differ by 6; position 12 and position 13 differ by 1. Geometric PE encodes only the magnitude of these differences. Lattice PE encodes that 12 = 2²×3 shares factors with 6 = 2×3 in a way that 13 (prime, coprime to both) does not.
Method
3.1 SpectralRoPEAttention
We replace geometric RoPE frequencies with integer-indexed frequencies allocated across attention heads in three tiers:
| Tier | Heads (n=12) | Integer Range | Function |
| Local | 0-2 (25%) | 2..101 | Word/syntax |
| Mid | 3-6 (33%) | 101..1009 | Clause/paragraph |
| Long | 7-11 (42%) | 1009..8209 | Section/document |
Frequencies are 2π/n for integer n in each tier's range, selected via log-spacing to maximise coverage.
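A sketch of that allocation under stated assumptions - the number of frequencies per head is not given in the draft, so dim_pairs below is illustrative:
```python
import numpy as np

# Head-to-tier assignment and integer ranges from the table above.
TIERS = [
    (range(0, 3),  (2, 101)),      # local: word/syntax
    (range(3, 7),  (101, 1009)),   # mid: clause/paragraph
    (range(7, 12), (1009, 8209)),  # long: section/document
]

def head_frequencies(dim_pairs: int = 32) -> dict[int, np.ndarray]:
    """Frequencies 2π/n per head, with n log-spaced inside the head's tier."""
    freqs = {}
    for heads, (lo, hi) in TIERS:
        for h in heads:
            # Log-space the integers n across the tier, dropping duplicates.
            ns = np.unique(np.geomspace(lo, hi, dim_pairs).round().astype(int))
            freqs[h] = 2 * np.pi / ns
    return freqs

for h, f in head_frequencies().items():
    periods = 2 * np.pi / f
    print(f"head {h:2d}: periods {periods.min():.0f}..{periods.max():.0f} tokens")
```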
3.2 SpectralALiBiAttention - The Primary Architecture
Prime rotations combined with a learned ALiBi distance prior:
score(i,j) = α_h · R_rotate(i,j) - slope_h · |i-j| + β_h · QK(i,j)/√d
ALiBi slopes are initialised to standard values and made learnable. A per-head freq_scale parameter (init=1.0) allows the model to discover its natural harmonic basis from data - in contrast to RoPE's hardcoded base-10000.
This architecture dissolves the apparent tradeoff:
The attention score is derived directly from prime harmonic interference:
R(Δ) = [Σ_p cos(2π·Δ/p) / p] / R(0)
score(i,j) = α_h · R(i-j) + β_h · QK(i,j)/√d
R(Δ) has a physical interpretation: the amplitude of constructive interference between prime harmonic waves at distance Δ. Primorials achieve R ≈ 0.58-0.70 (maximum constructive interference); prime distances achieve R ≈ -0.11 to -0.47 (destructive interference).
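A sketch of the simplified biased-attention form above, with α_h and β_h as learnable per-head scalars (their initial values here are assumptions) and R precomputed into a lookup table over distances:
```python
import math
import torch

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]  # assumed small prime set

def resonance(delta: int) -> float:
    """R(Δ) = [Σ_p cos(2π·Δ/p)/p] / R(0), per the formula above."""
    raw = sum(math.cos(2 * math.pi * delta / p) / p for p in PRIMES)
    return raw / sum(1 / p for p in PRIMES)

def spectral_alibi_scores(q, k, r_table, alpha, beta):
    """q, k: [heads, seq, dim]; r_table: [seq] with r_table[Δ] = R(Δ);
    alpha, beta: [heads] learnable scalars."""
    H, T, d = q.shape
    qk = torch.einsum("htd,hsd->hts", q, k) / math.sqrt(d)   # QK(i,j)/√d
    delta = (torch.arange(T)[:, None] - torch.arange(T)[None, :]).abs()
    bias = r_table[delta]                                    # R(i-j), shape [T, T]
    return alpha[:, None, None] * bias + beta[:, None, None] * qk

H, T, d = 12, 16, 64
q, k = torch.randn(H, T, d), torch.randn(H, T, d)
r_table = torch.tensor([resonance(i) for i in range(T)])
alpha = torch.ones(H, requires_grad=True)  # assumed init
beta = torch.ones(H, requires_grad=True)   # assumed init
print(spectral_alibi_scores(q, k, r_table, alpha, beta).shape)  # [12, 16, 16]
```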
Experiments
The gap between clusters (~5-7 PPL) is substantial. The gap within the lattice-aware cluster (~0.2 PPL) is noise.
Why composites work as well as primes: Composites are not alternatives to primes. They are higher-order coordinates in the same multiplicative lattice. The composite 12 = 2²×3 encodes a frequency 2π/12 whose harmonics resonate at multiples of 12 - simultaneously hitting multiples of 2, 3, 4, and 6. The composite inherits the arithmetic structure of its prime factors. Using composites is like computing the volume of a 3-sphere from the surface area rather than the generating radius - a different entry point into the same structure.
Why scrambled primes fail: The correct frequencies at the wrong scales. This is like having the correct n-ball formula but computing a 3-sphere's volume using the 7-sphere's surface area. Local heads need small-period generators; long-range heads need large-period generators. The dimensional assignment is load-bearing.
4.4 ZetaZeroPredictor - Mechanistic Validation
Three identical 50K-parameter transformers are trained for 10,000 epochs to predict Riemann zeta zero gaps from a 50-gap context window. This probes whether lattice-aligned PE provides genuine arithmetic alignment, not just a better approximation.
Note on the ZZP baseline: The "geometric_rope" variant in ZZP uses additive sinusoidal PE, not rotary embeddings. SpectralALiBi uses genuine rotary application. This makes the comparison slightly asymmetric - the ZZP result demonstrates lattice-aligned frequencies outperforming geometric frequencies, not specifically the rotary mechanism.
Theoretical Analysis
5.1 The Deductive Argument
(1) Language obeys Zipf(s ≈ 1). (2) The generating function of Zipf is ζ(s). (3) The zeta zeros encode the prime harmonic structure of ζ. (4) Therefore the multiplicative lattice generated by primes provides a natural spectral basis for language positions.
Steps (1)-(3) are established mathematics. Step (4) is a motivated conjecture supported by experimental evidence - the ZZP experiment shows that a model using lattice-aligned frequencies learns zeta zero structure 60-81% better than one using geometric frequencies. But the step from "ζ encodes Zipfian statistics" to "the multiplicative lattice is the right basis for positional encoding" remains an inferential leap, not a theorem.
5.2 The Dimensional Analogy
The relationship between primes and composites in the multiplicative lattice mirrors the relationship between dimensions in the n-ball progression:
The volume of the n-ball is V_n(r) = π^(n/2) / Γ(n/2 + 1) · r^n. Each dimension is not stacked but integrated - the circle is the integral of how a line sweeps through an angle, the sphere the integral of how circles vary along an axis.
Similarly, primes are the 1D generators of the multiplicative lattice. Composites are higher-dimensional points. The resonance function R(Δ) at a composite distance Δ = p₁^a₁ · p₂^a₂ · ... is not the sum of individual prime contributions but their interference pattern - constructive at primorials, destructive at primes. Just as you cannot compute V_3 by naively multiplying V_2 × 2r (because the circle's radius depends on z), you cannot decompose a composite's resonance into independent prime channels.
The Archimedean projection applies: the dependence (the shrinking cross-section as you move along the new axis) is already encoded in the structure. Composites carry their prime factors; the lattice carries the interference.
5.3 Shannon Capacity
Prime sequences are maximally entropic among deterministic sequences. The Riemann Hypothesis is equivalent to the statement that primes deviate from their smooth approximation as little as possible. A PE based on integer frequencies therefore operates near Shannon channel capacity for the positional information channel. Geometric PE with log-uniform spacing operates below capacity due to redundant coverage at some scales.
5.4 Why Geometric PE Diverges on Zeta Zeros
Zeta zeros t_n are the points where all prime harmonic contributions to the explicit formula cancel simultaneously. A model with geometric PE has no basis vectors at prime harmonic frequencies - it cannot represent this cancellation condition. Updates at one frequency scale disrupt approximations at others, causing the divergence observed across 9,783 epochs.
Lattice-aligned PE has basis vectors at exactly the right frequencies. The cancellation condition is directly representable. The stable attractor is a fixed point of gradient dynamics in that basis.
This predicts that lattice PE KV caches should compress better under TurboQuant than geometric PE KV caches â lower distortion at the same bit-width, or equivalent quality at fewer bits. If confirmed, it connects the PE research to optimal compression theory: the encoding maximises information in the positional channel (Shannon capacity argument, Section 5.3), while the compression minimises distortion in storing it (TurboQuant, within 2.7x of Shannon rate-distortion bound). Both optimise the same underlying structure from opposite ends.
Empirical confirmation (2026-04-05). VHT2 banded quantization of the KV cache directly confirms the structural asymmetry predicted above. K vectors (carrying RoPE positional encoding) show strong Walsh-Hadamard spectral concentration: a 4-band allocation of 5/5/4/3 bits - mirroring the WHT energy decay - achieves K correlation 0.9928 at 3.2× compression. V vectors (carrying content) show uniform WHT energy across all bands. Flat 3-bit encoding (n=1 band) outperforms any banded configuration for V: 4.7× compression at V correlation 0.9652, strictly better than banded 3/3/3/3, which gives 3.6× at worse PPL. The combined KV result - 3.8× at +1.24% PPL on Qwen3-8B, 3.4× at +0.60% on Dolphin 1B - is consistent across both head_dim=64 and head_dim=128.
This is the structural asymmetry the theory predicts: K encodes position (arithmetic structure, spectral concentration), V encodes content (no arithmetic structure, uniform spectrum). The WHT is the Z/2Z Vilenkin-Hartley basis - it is the natural transform for K precisely because K carries the multiplicative lattice structure that PrimePE encodes. V does not have this structure and the transform provides no leverage. Full sweep data: docs/prime/VHT2_COMPRESSION_RESULTS.md in the llama-cpp-turboquant repository.
Discussion
6.2 Primes as Generators, Not Destinations
The falsification results show that primes are the minimal generators of the relevant structure, but composites work equally well because they encode the same lattice. This is actually a stronger result than "primes are special" - it shows that the entire multiplicative structure of the integers is the natural basis for positional encoding, and primes are simply the most economical way to span it.
The RoPE/ALiBi tradeoff is not fundamental. It is an artifact of encoding position as distance rather than arithmetic identity. SpectralRoPEALiBi achieves relative position invariance, long-context stability, and arithmetic positional identity simultaneously - beating ALiBi at every context length 512-8K.
The falsification suite provides the key insight: the active ingredient is the multiplicative lattice of the integers, not primality per se. Primes are the generators of this lattice; composites are derived coordinates in the same structure. Both work. What fails is any encoding that discards the lattice - random frequencies, scrambled tiers, or pure distance decay.
The ZetaZeroPredictor provides the deepest evidence: across two independent 10,000-epoch runs, geometric PE finds no stable solution while lattice-aligned PE achieves stable attractors with r=0.81-0.86 prediction correlation. The multiplicative lattice is the natural spectral basis for the arithmetic structure that underlies both prime distribution and language.
The universe encodes position in the arithmetic of the integers. So should we.
Appendix A: Resonance Function Values
| Δ | R(Δ) | Type | Note |
| 0 | 1.000 | - | Self |
| 2 | 0.757 | prime | Smallest generator |
| 6 | 0.580 | primorial | 2×3 |
| 7 | -0.271 | prime | |
| 12 | 0.437 | composite | 2²×3 - lattice point |
| 17 | -0.468 | prime | Most negative |
| 30 | 0.456 | primorial | 2×3×5 |
| 210 | 0.695 | primorial | 2×3×5×7 - highest tested |
| 2310 | 0.540 | primorial | 2×3×5×7×11 |
Appendix C: Experimental Configuration
LR peak: 3×10⁻⁴, 3×10⁻⁴, 1×10⁻³
Knack (2026) - VHT2 Banded KV Cache Compression Research Results, VHT2_COMPRESSION_RESULTS.md
Appendix D: VHT2 KV Cache Compression - Empirical Results (2026-04-05)
D.1 Optimal Configuration
K: n=4 bands, bits=5/5/4/3, sk=head_dim. V: flat int3 (n=1 band), sk=head_dim.
The 5/5/4/3 K allocation mirrors WHT energy decay from RoPE. V has no spectral concentration - flat beats banded at every compression level.
D.2 Results by Model
| Model | head_dim | K × | V × | Total × | PPL | ΔPPL |
| Dolphin3.0-Llama3.2-1B | 64 | 2.8× | 4.3× | ~3.4× | 13.1745 | +0.60% |
| Qwen3-8B | 128 | 3.2× | 4.7× | ~3.8× | 9.4482 | +1.24% |
Larger head_dim improves compression automatically: the 2-byte fp16 scale overhead per band amortizes over more data elements.
D.3 The K↔V Structural Asymmetry
WHT energy distribution is the direct empirical signature of spectral structure:
K vectors (RoPE-encoded): Energy concentrated in first WHT bands. n=4 banded allocation (5/5/4/3) captures the natural decay. Correlation 0.9928 at 3.2×.
V vectors (content): WHT energy uniform across all bands. Banded allocation adds scale overhead with no benefit. Flat int3 gives V correlation 0.9652 at 4.7× - strictly better than banded 3/3/3/3 at 3.6×.
This asymmetry is predicted directly by the lattice theory: K carries angular rates derived from multiplicative arithmetic relationships (the lattice structure); V carries learned content projections with no such arithmetic structure.
D.4 Critical Rules
sk = head_dim always. WHT requires the full vector. sk=32 on head_dim=64 → PPL +47%.
3-bit floor. 2-bit on any band is catastrophic (V:4/2 → PPL +1.59%).
n=4 optimal for K. More bands add scale overhead; n=5 and n=8 are within noise but cost 14% compression.
Flat beats banded for V. No exceptions in the sweep.
Full Results Table
V sweep (Dolphin 1B, K fixed at 5/5/4/3 n=4)
| V Config | V corr | V × | Total × | PPL | ΔPPL |
| flat int3 n=1 | 0.9708 | 4.3× | ~3.4× | 13.1745 | +0.60% ✓ |
Flat int3 wins: lower PPL than banded 3/3/3/3 (better by 0.18 PPL) at higher compression (4.3× vs 3.6×). Banded V is strictly worse.
Best Config: K n=4 5/5/4/3 + V flat int3
| Model | K × | V × | Combined × | PPL | ΔPPL |
| Dolphin 1B (hd=64) | 2.8× | 4.3× | ~3.4× | 13.1745 | +0.60% |
| Qwen3-8B (hd=128) | 3.2× | 4.7× | ~3.8× | 9.4482 | +1.24% |
V adds only +0.29% PPL on top of K-only for Qwen (9.4208 → 9.4482). The V compression comes almost free in quality terms.
vs. Old Shadow Cache (2.3× per cache)
| Cache | Old | VHT2 | Gain |
| K | 2.3× | 3.2× | +39% |
| V | 2.3× | 4.7× | +104% |
| Combined | ~2.3× | ~3.8× | +65% |
vs. llama.cpp Built-in KV Quantization
| Method | K | V | Combined | PPL cost |
| q8_0 (baseline) | 2× | 2× | 2× | ~0% |
| q4_0 flat | 4× | 4× | 4× | ~1-3% |
| VHT2 best | 3.2× | 4.7× | ~3.8× | +1.24% |
VHT2 V (4.7×) beats flat q4 (4×) because per-vector fp16 scaling handles outliers better than q4's block quantization. VHT2 K (3.2×) is slightly below flat q4, but the spectral band allocation preserves RoPE structure that flat quantization destroys indiscriminately.
RAM Impact at head_dim=128, 28 layers, 8 KV heads
| Context | fp16 baseline | Old (2.3×) | VHT2 (3.8×) |
| 2048 | ~460 MB | ~200 MB | ~121 MB |
| 32K | ~5.9 GB | ~2.6 GB | ~1.56 GB |
Optimum Summary
| Quant | Bits/Weight | Baseline PPL | Best PPL | Optimal alpha | Improvement |
| Q8_0 | 8.0 | 11.6413 | 11.5462 | 0.22 | -0.82% |
| Q6_K | 6.6 | 11.7615 | 11.6843 | 0.17 | -0.66% |
| Q4_K_M | 4.8 | 12.2380 | 12.1630 | 0.17 | -0.61% |
Analysis
Universal improvement: Prime frequency blending reduces PPL at ALL quantization levels. All three curves show smooth parabolas with clear optima, ruling out noise.
Improvement magnitude is consistent: ~0.6-0.8% across all quant levels. This means prime frequencies correct a DIFFERENT kind of error than quantization (positional frequency mismatch vs precision loss). The two are independent and additive.
Deterioration at high alpha is steeper for lower precision: Q4_K_M at alpha=0.50 degrades +5.4%, Q8_0 only +4.0%. Aggressive arithmetic replacement destabilizes the model, and quantization amplifies that instability.
The flat region (alpha=0.15-0.22): All three models show a relatively flat optimum region. This means alpha is not a knife-edge parameter - any value in [0.15, 0.22] gives near-optimal results, making production deployment robust.
Cross-Architecture Results (CONFIRMED)
Key finding: Optimal alpha correlates with rope_freq_base. Higher base = wider harmonic gaps = more room for prime injection. Phi (base=10K) has tightly packed frequencies already, leaving almost no room for improvement. Llama3 (base=500K) has the widest gaps and benefits most.
Cross-architecture validation: Improvement direction is universally correct (PPL decreases) on all architectures tested. The multiplicative structure is universal; the sensitivity varies with the model's existing frequency coverage.
External validation: User's independent test on Qwen3-8B confirmed: prime_rope alone gives -0.24%, while TQ3 degrades Qwen3-8B by +36%. TQ's WHT (Z/2Z) is architecture-specific; our prime frequencies are universal.
Upstream TQ Analysis
Current TQ Kludges (and Why They Exist)
| Kludge | What | Why It's Needed | Our Principled Alternative |
| Layer blocking | Skip first/last N layers | Boundary layers are "special" | Prime-factor coords: different layers get different precision based on PRS |
| K-only compression | Only compress K, not V | K is more sensitive (carries RoPE) | Our theory explains: K has positional structure, V has content structure. Different engines for each. |
| Lloyd-Max centroids | Non-uniform 2/3/4-bit quantization | Uniform quant fails post-WHT | PolarQuant: magnitude/direction separation is natural |
| Dense rotation (TQ4) | 128x128 Gaussian+QR matrix | WHT alone insufficient for 4-bit | Vilenkin-Hartley: richer O(n log n) rotation using more primes |
| QJL residual | 1-bit random projection for TQ4 residual | WHT doesn't capture everything | With Vilenkin, energy concentrates better → less residual needed |
| nosigns byte | Skip sign storage in some modes | Save bits | With Hartley kernel, sign structure is implicit in the characters |
| InnerQ scaling | Per-channel equalization | Outlier distribution is uneven | Prime frequency alignment naturally balances channel energy |
| 7 adaptive modes | Layer-by-layer strategy selection | One strategy doesn't fit all | Single PRS-guided strategy that adapts automatically |
The Core Problem
The community treats WHT as a "compression trick": rotate to spread outliers, quantize, unrotate. They don't understand it's the Z/2Z case of a deeper structure. Every kludge is a symptom of this gap.
Our framework provides the theory that explains WHY WHT works (multiplicative structure) and GENERALIZES it (Vilenkin-Hartley for all primes). With the right transform, most kludges become unnecessary.
What's Next
1. Cross-architecture sweep: Confirm universal improvement on Phi-3.1 and Qwen2.5
2. Vilenkin-Hartley in inference path: Replace upstream WHT butterfly coefficients with Vilenkin characters
3. Combined prime + TQ test: Run with prime_rope active AND turbo3/turbo4 cache
4. Remove layer blocking: Test PRS-guided adaptive strategy
5. K+V compression: Test V compression with Vilenkin (theory predicts it should work better than WHT)
6. Context length scaling: Sweep 512/1024/2048/4096 to measure degradation curves
docs/prime/VHT2_COMPRESSION_RESULTS.md
VHT2 Banded KV Cache Compression – Research Results (2026-04-05)
Summary
Systematic sweep establishing the optimal VHT2 banded quantization configuration
for both K and V caches across two reference architectures. The key finding: a
single config (K: n=4 bands 5/5/4/3, V: flat int3) is optimal across all tested
head dimensions and delivers ~3.4–3.8× total KV compression with <1.25% PPL cost.
Method
The shadow cache intercepts KV writes. Each head vector is:
Transformed via Walsh-Hadamard (WHT = Z/2Z Vilenkin-Hartley)
Split into N equal-size bands (high → low spectral energy order)
Each band quantized with its own fp16 scale + packed int values
Reconstructed on read via inverse WHT
For V, the same pipeline is available but a single-band (flat) mode is used
because V has no spectral concentration (see findings below).
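A minimal numpy sketch of that read/write path, assuming a power-of-two head_dim and simple symmetric per-band quantization (the production cache packs the ints and runs in-engine; names here are illustrative):

```python
import numpy as np

def wht(x):
    """Fast Walsh-Hadamard transform, unnormalized (applying it twice gives n*x)."""
    y, h, n = x.astype(np.float64).copy(), 1, len(x)
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    return y

def quantize_banded(vec, bits=(5, 5, 4, 3)):
    """WHT -> n equal bands -> per-band fp16 scale + symmetric int codes."""
    out = []
    for band, b in zip(np.split(wht(vec), len(bits)), bits):
        qmax = 2 ** (b - 1) - 1                       # e.g. +/-15 at 5 bits
        scale = max(np.abs(band).max() / qmax, 1e-12)
        out.append((np.float16(scale), np.round(band / scale).astype(np.int8)))
    return out

def dequantize_banded(packed, n):
    """Rescale each band, then invert via the same transform (self-inverse up to 1/n)."""
    coeffs = np.concatenate([q.astype(np.float64) * float(s) for s, q in packed])
    return wht(coeffs) / n
```

Round-tripping a head vector with `dequantize_banded(quantize_banded(k), len(k))` and correlating against the original approximates the corr columns reported below; flat V is the same path with bits=(3,).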
K: n=4 bands, 5/5/4/3 bits, sk must equal head_dim
| Model | Architecture | head_dim | KV heads | Layers | Baseline PPL |
| Dolphin3.0-Llama3.2-1B Q8_0 | Llama 3.2 | 64 | 4 (MHA) | 16 | 13.0957 |
| Qwen3-8B Q8_0 | Qwen 3 | 128 | 8 (GQA) | 28 | 9.3317 |
Finding 1: sk Must Equal head_dim
WHT requires the full head vector. Subsampling collapses quality catastrophically.
| sk | K corr | Compression | PPL | ΔPPL |
| 16 | 0.8615 | 4.6× | 43.39 | +231% 🔥 |
| 32 | 0.9073 | 3.9× | 19.28 | +47% 🔥 |
| 64 | 0.9941 | 2.8× | 13.11 | +0.12% ✅ |
(Dolphin 1B, head_dim=64). At sk=32 the WHT sees only half the head, so the
transform is no longer spanning the basis. sk must equal head_dim exactly.
Finding 2: Optimal K Config is n=4 Bands, 5/5/4/3
WHT concentrates K's energy in the first few coefficients; this is the
structural signature of RoPE-encoded positional information. The 5/5/4/3
allocation mirrors actual WHT energy decay: more bits where the signal lives.
Dolphin 1B (head_dim=64, 16 elements/band)
| Config | K corr | K × | PPL | ΔPPL |
| 5/5/4/3 n=4 | 0.9941 | 2.8× | 13.1119 | +0.12% ✅ |
Qwen3-8B (head_dim=128, varied band count)
| Config | K corr | K × | PPL | ΔPPL |
| n=4: 5/5/4/3 | 0.9928 | 3.2× | 9.4208 | +0.95% ✅ |
| n=5: 6/5/5/4/3 | 0.9947 | 2.8× | 9.3888 | +0.61% |
| n=8: 6/6/5/5/4/4/3/3 | 0.9945 | 2.8× | 9.3661 | +0.37% |
3-bit floor: Any band at 2 bits is catastrophic. Minimum viable = 3 bits.
Finding 3: V Has No Spectral Concentration – Flat Beats Banded
K carries RoPE positional encoding, which creates a characteristic energy
concentration in the first WHT bands. V carries content (values), which has
no such structure. WHT energy is uniform across V's bands.
Consequence: banded quantization adds scale overhead without benefit for V.
Flat quantization (n=1 band, all elements same bit-width) outperforms banded
at every compression level.
V sweep (Dolphin 1B, K fixed at 5/5/4/3 n=4)
| V Config | V corr | V × | Total × | PPL | ΔPPL |
| 5/3 n=2 | 0.9871 | 3.2× | 3.0× | 13.2058 | +0.84% |
| 4/2 n=2 | 0.9003 | 4.0× | ~3.4× | 13.3036 | +1.59% 🔥 |
| flat int3 n=1 | 0.9708 | 4.3× | ~3.4× | 13.1745 | +0.60% ✅ |
| flat int4 n=1 | 0.9944 | 3.4× | ~3.1× | 13.2064 | +0.84% |
Flat int3 wins: lower PPL than banded 3/3/3/3 (better by 0.18 PPL) at higher
compression (4.3× vs 3.6×). Banded V is strictly worse.
Key finding: Vilenkin-structured signals are ALREADY nearly orthogonal before LLL (OD=75 vs geometric's 410). This means the Vilenkin basis is the natural coordinate system; the lattice is already close to reduced. The highest PRS (19.37) confirms that prime structure survives best in Vilenkin-structured lattices.
4. Independent Traversal Validation
Tested half-Möbius and spinor traversal on 5 different signal types:
| Signal | Möbius Reduction | Möbius Agreement | Spinor Agreement |
| prime_harmonic | 36% | 83% | 100% |
| pure_harmonic | 35% | 100% | 100% |
| white_noise | 21% | 66% | 100% |
| chirp | 31% | 100% | 100% |
| prime_resonance | 37% | 100% | 100% |
5. Cross-Strategy Reconstruction
Tested every reconstruction method on every signal type:
| Signal | Walsh | Vilenkin(k=5) | Zero-crossing |
| prime_harmonic | 0.958 | 0.963 | 0.891 |
| geometric | 0.950 | 0.974 | N/A |
| arithmetic | 0.950 | 0.968 | N/A |
Key finding: Vilenkin beats Walsh on ALL signal types, not just prime-harmonic. The advantage is largest on geometric signals (+2.4%); this makes sense because Vilenkin captures the multiplicative structure that underlies geometric progressions.
- Scale overhead determines optimal band count. At n=4: 4 × 2-byte scales
= 8 bytes overhead for 128×2=256 bytes raw. At n=8: 16 bytes overhead.
More bands = worse compression unless quality gain is statistically clear
(see the calculator sketch after this list).
- 3-bit floor. 2-bit encoding on any band is catastrophic. The WHT
coefficients in lower bands are small but not negligible; 1 bit of sign
plus 1 bit of magnitude is insufficient.
- sk = head_dim, always. The WHT requires the full vector. Any truncation
breaks the transform's spanning property.
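A quick calculator for the first rule's trade-off (one fp16 scale per band, packed int payload; the sweep's 3.2×/2.8× land slightly under these upper bounds because the real cache carries a little extra metadata):

```python
def banded_bytes(head_dim, bits, scale_bytes=2):
    """Bytes for one quantized head vector: packed int codes + per-band fp16 scales."""
    per_band = head_dim // len(bits)
    return sum(b * per_band for b in bits) / 8 + scale_bytes * len(bits)

fp16 = 128 * 2                                             # 256-byte fp16 baseline
print(fp16 / banded_bytes(128, (5, 5, 4, 3)))              # ~3.4x at n=4
print(fp16 / banded_bytes(128, (6, 6, 5, 5, 4, 4, 3, 3)))  # ~2.9x at n=8: scales eat the gain
```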
PrimePE / Position_Is_Arithmetic – Session Context v3
Date: April 5, 2026 | Updated: VHT2 banded compression validated + Qwen3-8B sweep complete
THE PROJECT IN ONE PARAGRAPH
PrimePE proves that context in rotary-encoded transformers is not data to be stored but structure to be read from either side of a self-inverse matrix. The KV cache is an engineering artifact of computing attention in one direction; the inverse direction reconstructs context from the same structural relationships without storage. Key production result: composite-tiered frequencies blended at alpha 0.15-0.20 into Llama 3.2 1B via llama.cpp improve PPL (10.91 vs 11.03 baseline) with zero retraining. VHT2 banded KV compression (n=4 bands, K:5/5/4/3 + V:flat int3) achieves 3.4–3.8× total KV compression at <1.25% PPL cost, up from the previous 2.3× baseline, validated on Dolphin 1B and Qwen3-8B. K and V require structurally different strategies: K has spectral concentration from RoPE (WHT energy in first bands), V has uniform energy (flat quantization wins). Walsh-Hadamard/VHT2 is the natural basis because K is a Walsh signal. The theoretical foundation: the Redheffer matrix (divisibility lattice of integers) and its inverse (Möbius function) contain the same information; no computation at any level, just reading the structure from the other direction.
THE THEORETICAL BREAKTHROUGH (Late Session)
The Core Claim: KV Cache Is a View, Not Data
The field treats context as data that must be stored and compressed. This is wrong. Context is structure: specifically, the divisibility/multiplicative structure of the integers that index positions. The KV cache is what you get when you multiply token embeddings × positional rotation × attention weights in one direction. The reconstructed context is the SAME multiplication in the other direction. Same matrix, same information, no storage required.
The N-Ball Construction
Each dimension of the n-ball corresponds to one prime factor:
n1 (Line): 2r. Primes. The 1D base: the universal number line.
n2 (Disk): πr². Composites with 2 prime factors. Line × unit circle (Cartesian product).
n3 (Ball): (4/3)πr³. Composites with 3 prime factors. Disk × unit circle.
n_k: Each new dimension multiplies by a circle. Each circle = one more prime factor.
The "knight's move" is how each dimension is BUILT from the previous: not a traversal strategy but a construction method. Archimedes showed the sphere→cylinder projection preserves area. That's the lossless projection between dimensions.
The Redheffer Matrix
For nĂn matrix R: R(i,j) = 1 if i divides j OR if j = 1. Otherwise 0.
det(R_n) = M(n), the Mertens function (running sum of the Möbius function)
Inverse of the lower-triangular divisibility matrix = Möbius function values
The Möbius function μ(n): 0 if n has squared factors, (-1)^k if n has k distinct prime factors
By inverting a matrix of divisors, you extract ALL prime locations. No sieve. No computation. The structure IS the answer.
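Both claims are checkable in a few lines of numpy (exact for small n):

```python
import numpy as np

n = 12
# Redheffer matrix: R[i][j] = 1 if i divides j or j == 1 (1-indexed)
R = np.array([[1 if (j == 1 or j % i == 0) else 0
               for j in range(1, n + 1)] for i in range(1, n + 1)])
print(round(np.linalg.det(R)))        # -2 = Mertens M(12)

# Lower-triangular divisibility matrix; its inverse holds mu(i/j) at divisible
# entries, so the first column reads off the Mobius function directly.
D = np.array([[1.0 if i % j == 0 else 0.0
               for j in range(1, n + 1)] for i in range(1, n + 1)])
print(np.rint(np.linalg.inv(D)[:, 0]).astype(int))
# [ 1 -1 -1  0 -1  1 -1  0  0  1 -1  0]  = mu(1..12)
```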
The Self-Inverse Principle
The same non-computing trick works at EVERY level of the n-ball, and in REVERSE:
Walsh/Hadamard: H × H = Identity. Same operation decomposes AND reconstructs (demo after this list).
Redheffer: Matrix and its inverse contain the same information from two directions.
Context: The decomposed form and the signal form are the SAME MATRIX read differently.
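The Walsh/Hadamard case of that claim is a two-assert demo (Sylvester construction; the unnormalized matrix squares to n·I, so the normalized transform is its own inverse):

```python
import numpy as np

def hadamard(n):
    """Sylvester construction: double the matrix until it reaches size n (power of 2)."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

H = hadamard(8)
assert np.array_equal(H @ H, 8 * np.eye(8, dtype=int))   # H x H = n * Identity
x = np.random.randn(8)
assert np.allclose(H @ (H @ x) / 8, x)                   # same op decomposes and reconstructs
```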
Vilenkin Systems: The Full Basis
Walsh functions use Z/2Z (binary: one prime). The Vilenkin system generalises to Z/α_kZ for arbitrary α_k. Set α_k to the k-th prime and you get the complete prime-indexed orthogonal system. Walsh gets 0.948 with ONE prime dimension. Vilenkin with ALL primes would be EXACT.
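As a sketch of what that generalisation looks like concretely, here is the character matrix over a product of cyclic groups, under the standard abelian-character construction (radices all equal to 2 recover Walsh-Hadamard; a Hartley-style real variant would take cos + sin of the same phases):

```python
import numpy as np
from itertools import product

def vilenkin_matrix(radices):
    """Characters of Z/m1 x Z/m2 x ...: a unitary, prime-indexed orthogonal system."""
    digits = list(product(*[range(m) for m in radices]))
    n = len(digits)
    V = np.empty((n, n), dtype=complex)
    for a, xs in enumerate(digits):
        for b, ys in enumerate(digits):
            phase = sum(x * y / m for x, y, m in zip(xs, ys, radices))
            V[a, b] = np.exp(2j * np.pi * phase)
    return V / np.sqrt(n)

V = vilenkin_matrix((2, 3, 5))                    # first three primes -> 30-point basis
assert np.allclose(V @ V.conj().T, np.eye(30))    # orthonormal, like H / sqrt(n)
```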
VALIDATED RESULTS
Walsh Reconstruction – THE KEY RESULT
| Method | Correlation | Compression | Sparsity |
| WHT 90% energy | 0.948 | 2.3x | 57% |
| Sign pattern + amplitudes | 0.692 | 1.14x | – |
| Pure binary (no amplitudes) | 0.521 | 1.14x | – |
Walsh gets 0.948 vs Fourier's 0.15. The signal IS a Walsh signal. Near-perfect reconstruction throwing away 57% of coefficients. WALSH_WINS across all three strategies.
VHT2 Banded KV Compression – VALIDATED (2026-04-05)
Systematic sweep on Dolphin 1B (head_dim=64) and Qwen3-8B (head_dim=128) established the optimal config. K has spectral concentration from RoPE (energy in first WHT bands); V does not (uniform distribution). They need different strategies.
Optimal config: K n=4 bands 5/5/4/3 + V flat int3
| Model | K × | V × | Combined × | PPL | ΔPPL |
| Dolphin 1B (hd=64) | 2.8× | 4.3× | ~3.4× | 13.1745 | +0.60% |
| Qwen3-8B (hd=128) | 3.2× | 4.7× | ~3.8× | 9.4482 | +1.24% |
vs old shadow cache 2.3× each: +65% combined compression at better quality.
vs llama.cpp q4_0 flat (4×): V at 4.7× beats flat q4; K at 3.2× is more conservative but preserves RoPE spectral structure that flat quantization destroys.
Critical rules discovered:
sk must equal head_dim exactly (sk=32 on hd=64 → PPL +47%)
3-bit floor: 2-bit on any band is catastrophic
5/5/4/3 mirrors WHT energy decay: any deviation worsens PPL
n=4 beats n=5/n=8: scale overhead (2 bytes per band) kills compression gains
K needs banded; V needs flat (banded V is strictly worse than flat V)
RAM impact (head_dim=128, 32K context):
- fp16 baseline: 5.9 GB → VHT2: 1.56 GB (saves ~4.3 GB)
Reconstruction Scaling (2K → 10K training steps)
| Strategy | L2 Corr 2K | L2 Corr 10K | L3 Linear 10K | Spinor QPS |
| prime_tiered | 0.107 | 0.146 | 0.355 | 0.578 |
| composite_tiered | 0.066 | 0.094 | 0.304 | 0.560 |
| geometric_rope | 0.015 | 0.028 | 0.323 | 0.457 |
Layer 3 Lattice Collapse (Fixed)
LLL on quantised 3-bit integer indices (NOT raw floats)
prime_tiered: median norm_ratio=0.56, PRS retention=0.993
All strategies: PRS survives, 99.6% vectors changed
KEY DECISIONS & INSIGHTS
KV cache is a VIEW, not data. Context is fully determined by token sequence + positional structure + weights. The cache is one direction of multiplication. Reconstruction is the other direction. Same matrix.
Composites are the lattice itself. Not frequencies we assign but the actual multiplicative structure. Primes are the dimensions. Composites are positions (coordinates in prime-factor space). 12 = 2²×3 is position (2,1) in (dim_2, dim_3); see the sketch after this list.
Zero-crossings are resonance detection. They detect WHERE you are in composite space. Not stored data but structural boundaries where the Möbius function changes sign.
Walsh is the base-2 projection of the full structure. One prime dimension. Gets 0.948. Vilenkin (all primes) would be exact.
Self-inverse at every level. H×H=I. Same operation decomposes and reconstructs. The Redheffer matrix and its inverse are the same information. No computation needed at any level: just read the structure from the other side.
The n-ball construction doesn't need to be calculated. Each level is implicit in the level below. Invert → structure falls out. Same trick at every dimension.
Everyone else is optimising the wrong side. TurboQuant, sliding windows, attention sinks â all accept that context is data. The premise is wrong.
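The coordinate reading from the composites point above, as a throwaway sketch (trial division over the first few prime dimensions; purely illustrative):

```python
def prime_coords(n, primes=(2, 3, 5, 7, 11, 13)):
    """Exponent vector of n over the given prime dimensions."""
    coords = []
    for p in primes:
        e = 0
        while n % p == 0:
            n //= p
            e += 1
        coords.append(e)
    return tuple(coords)

print(prime_coords(12))   # (2, 1, 0, 0, 0, 0): 12 = 2^2 * 3 sits at (2,1) in (dim_2, dim_3)
```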
ARCHITECTURE
Reconstruction Framework
```
Level 1: Harmonic decomposition → EXACT
Level 2: Zero-crossing reconstruction → 0.09-0.15 (Fourier), 0.948 (Walsh!)
Level 3: Topological traversal → spinor most efficient
```
Walsh Reconstruction (walsh_reconstruct.py)
```
Method 1: WHT decomposition + sparse coefficients → 0.948 corr
Method 2: Sign pattern + amplitudes → 0.692 corr
Method 3: Pure binary sign pattern → 0.521 corr
```
llama.cpp Integration Stack
```
Layer 0: RoPE with composite freq_factors
Layer 1: VHT2 banded KV compression
K: n=4 5/5/4/3 V: flat int3
3.4-3.8× combined, <1.25% PPL cost
Layer 2: TurboQuant WHT + 3-bit quantisation
```
Theoretical
[x] Implement full Vilenkin basis (replace WHT Z/2Z with Z/p_kZ)
[x] Test Redheffer matrix construction for attention reconstruction
[x] LLL analysis of trained W_Q/W_K matrices
[x] "Read from the other side" â inverse-direction reconstruction
Engineering
[x] GCD attention bias experiment
GitHub: nihilistau/Position_Is_Arithmetic
[link] [comments] -
The "Jarvis on day one" trap: why trying to build one AI agent that does everything costs you months r/artificial Apr 07, 2026 12:06 PM 2 min readsubmitted by /u/Joozio
Something I've been thinking about after spending a few months actually trying to build my own AI agent: the biggest trap in this space isn't technical. It's the Jarvis fantasy.
The Jarvis fantasy is the moment you imagine one agent that runs your whole life. Handles your inbox, manages your calendar, writes your newsletter, triages your tasks, thinks about problems while you sleep. The fully-formed product from week one.
It's a trap. I fell into it hard, and watching other people get started with agent building, I see them fall into the same one. Here's what I think is actually happening when it grabs you:
- It pushes you to add five features at once instead of adding one and letting it settle.
- It nudges you toward full autonomy before the basics are even stable. Then when something drifts, you have no idea which layer to debug.
- It assumes the agent should figure everything out on its own, when what it actually needs is clearer boundaries and simpler jobs.
- It confuses "end state" with "starting point." You want the final shape before you've earned it.
The version that actually works, I've come to believe, is incremental. One small task. Then the next. Then the next. Morning summary of overnight email. Then a daily plan drafter. Then inbox triage. Eventually a bunch of small pieces start to look a bit like Jarvis, but as a side effect of solid groundwork, not as a goal.
The reframe that helped me most: think of an agent as a partner, not a solver. Something that takes the boring work off your plate and brings you the interesting decisions. Not something that removes you from the loop entirely.
The deeper insight (at least for me): the problem isn't "can an AI do this." The problem is more about wanting the end state before you've earned it. That's a human mistake, not an AI one.
[link] [comments] -
Stop Overcomplicating AI Workflows. This Is the Simple Framework r/artificial Apr 07, 2026 11:36 AM 1 min read
submitted by /u/biz4group123
I've been working on building an agentic AI workflow system for business use cases and one thing became very clear very quickly. This is not about picking the right LLM.
The real complexity starts when you try to chain reasoning, memory, and tool execution across multiple steps. A single agent works fine for demos. The moment you introduce multi-step workflows with external APIs, things start getting weird and complex.
State management becomes a problem. Memory retrieval is inconsistent. Latency compounds with every step. And debugging is painful because you are not tracing a single function, you are tracing decisions across a system.
What helped was thinking in layers. Input handling, planning, execution, feedback. Once I separated those, it became easier to isolate failures. Also realized that most inefficiencies come from unnecessary model calls, not the model itself.
Another thing people don't talk about enough is cost scaling. Token usage is manageable early on, but once workflows get deeper, it adds up fast if you are not controlling context and step count.
[link] [comments] - Can't wait to share this with grandma! r/ChatGPT Apr 07, 2026 11:25 AM 1 min read
-
Consumer Claude Code plans are spot pricing r/ClaudeCode Apr 07, 2026 11:18 AM 2 min read
submitted by /u/DrGigaChad_MD
Cross-posting from a comment on another sub as I feel it is relevant here.
Just to put this revenue into perspective, especially with the recent controversy over usage limits:
Current estimates of Claude's daily active users are around 18-30m (source)
Hypothetically, if literally every single user was on a 20x max plan for $200, they would contribute roughly $3.6-$6b each month. This is obviously not the case.
It is reported that enterprise customers (~300,000) make up 80% of revenue, so regular consumers are at most contributing $6b annually, not monthly. This revenue stream is growing rapidly, and they now report 500 customers paying >$1m annually. That means 500 enterprise accounts contribute the same revenue as nearly 10% of all regular consumer accounts. (source)
For the past year we've talked about how discounted the Claude Code plans are, compared to paying the same amount in API usage, which enterprise does. Personally I use both, and a single day of 20x plan usage would easily cost me over $100 in API.
Consumer accounts are heavily subsidized by enterprise, and while it doesn't feel good getting throttled by Anthropic, it is simply a practical business decision. My hypothesis is that consumer Claude Code plans utilize a deprioritized pool of excess compute at a fixed price, and that pool is shrinking because enterprise needs it. Look at the chart: their revenue has nearly tripled since the start of this year. That means enterprise usage has likely tripled, and the pool of subsidized compute has shrunk by a third.
This is pretty standard practice for the industry, albeit communicated more transparently. Cloud compute providers offer "spot" pricing, which is heavily discounted compute that can be reclaimed at a moment's notice when a customer paying full price requests it.
Consumer Claude Code plans are simply spot pricing. Yes it sucks, no it's not communicated transparently, but it is unfortunately pretty fair and likely to continue.
Yes, I am sick of seeing usage limit complaint spam on unrelated posts. Feel free to take your anger out on me; I'm not saying I'm happy about this, I'm just saying it's the reality.
[link] [comments] -
Lemonade 10.1 released for latest improvements for local LLMs on AMD GPUs & NPUs r/artificial Apr 07, 2026 11:05 AM 1 min read
submitted by /u/Fcking_Chuck
[link] [comments] -
[D] thoughts on current community moving away from heavy math? r/MachineLearning Apr 07, 2026 11:01 AM 1 min read
submitted by /u/Striking-Warning9533
I don't know about how you guys feel, but even before LLMs started, many papers were already leaning on empirical findings, architecture designs, and some changes to loss functions. Not that these don't need math, but I think part of the community has moved away from the math-heavy era. There are still areas focusing on hard math like reinforcement learning, optimization, etc.
And after LLMs, many papers are just pipelines of existing systems, which have barely any math.
What is your thought on this trend?
Edit: my thoughts: I think math is important to the theory part, but the field moving away from pure theory toward more empirical work is a good thing, as it means the field is more applicable in real life. I do think a lot of people are overstating how much math is in current ML systems though.
[link] [comments] -
Netflix recently launched VOID, their subject removal model [under physics laws] r/ArtificialInteligence Apr 07, 2026 10:58 AM 1 min read
submitted by /u/pretendingMadhav
I'm not talking about basic video editing or "removing an object" from a frame. We've had that for years.
I'm talking about "Physics-Aware Deletion." Imagine a video of a person holding a heavy glass vase. You use an AI tool to erase the person. In 99% of AI tools, the vase stays floating in mid-air like a glitch in the Matrix. It looks fake. It looks "AI."
But Netflix's VOID model does something creepy. When you erase the person, the AI doesn't just fill in the background. It realizes the vase no longer has support. It calculates the gravity, the weight, and the trajectory...
And in the final video? The person is gone, and you watch the vase shatter on the floor in real-time
You can see it working on Hugging Face at netflix/void-model.
[link] [comments] -
OpenAI's "Industrial Policy for the Intelligence Age" proposes a wealth fund that pays dividends to Americans only. Built on global data, global labor, global revenue. r/OpenAI Apr 07, 2026 10:58 AM 1 min readsubmitted by /u/yani-
I just read the 13-page PDF. The document says "benefit everyone" multiple times, then every concrete mechanism - the Public Wealth Fund, safety nets, efficiency dividends, 32-hour workweek pilots - is designed exclusively for U.S. citizens.
The training data is global. The RLHF labor comes from Kenya, the Philippines, Latin America. The revenue is collected worldwide. But the proposed wealth fund distributes returns to American citizens only.
Page 5 says this "focuses on the United States as a starting point." Page 13 says the conversation "needs to expand globally." That's two sentences out of 13 pages. No mechanism, no structure, no commitment for anyone outside the US.
This comes off as very chauvinistic to put it mildly.
Am I reading this wrong? What's your take?
[link] [comments] -
Anonymous Sources Detail Sam Altman's Alleged Untrustworthiness in New Report r/ArtificialInteligence Apr 07, 2026 10:51 AM 1 min read
submitted by /u/planet_janett
"Even some Microsoft senior executives, with whom OpenAI has had a long partnership since the 2019 deal, described Altman as someone who 'misrepresented, distorted, renegotiated, reneged on agreements.' One senior executive even apparently said this of Altman: 'I think there's a small but real chance he's eventually remembered as a Bernie Madoff- or Sam Bankman-Fried-level scammer.'"
[link] [comments] -
OpenAI buys tech talkshow TBPN in push to shape AI narrative r/AGI Apr 07, 2026 10:39 AM 1 min read
submitted by /u/EchoOfOppenheimer
OpenAI is officially wading into the media business by acquiring TBPN, a popular tech talkshow widely watched by Silicon Valley insiders. Hosted by John Coogan and Jordi Hays, the daily live show features founders and tech leaders. OpenAI's chief of strategy stated the acquisition will help the company "engage more authentically with the public" and create space for constructive conversations about the shift toward AGI. The move highlights a growing trend of powerful tech companies directly purchasing media outlets to help control the narrative surrounding their products.
[link] [comments] -
Adobe Firefly Web vs Mobile vs Boards (2026): Which One Should You Actually Use? r/artificial Apr 07, 2026 10:32 AM 2 min read
submitted by /u/ArianeFridaSofie
Most of my clients are using Adobe Firefly, and I keep getting the same question:
Which interface should I actually be using: Web, Mobile, or Boards?
They all have similar capabilities, but they're built for completely different parts of the workflow.
Here's the simplest way to think about it.
Quick Answer (What to Use for What)
- Adobe Firefly Web → best for quick generation + testing prompts
- Adobe Firefly Mobile → best for creating on the go
- Adobe Firefly Boards → best for organizing and building full projects
If you remember nothing else, that's the breakdown.
How Adobe Firefly Actually Works (Across Interfaces)
The mistake most people make is thinking these are separate tools.
They're not.
Adobe Firefly is one system, just with different interfaces depending on what stage you're in:
- Web → generate
- Mobile → capture + quick create
- Boards → organize + collaborate
Once you think of it like that, the differences make a lot more sense.
1️⃣ Adobe Firefly Web (Standard Interface)
This is the default browser experience and where most people start.
Best for:
- Testing prompts
- Generating quick assets
- Exploring styles
Why it wins:
- Fast and intuitive
- Access to a wide range of generation tools and partner models
Better than Mobile/Boards when:
You just need to generate something quickly without worrying about organization.
The catch:
If you generate a lot of assets (e.g. campaign work), things get messy fast. There's no real system for managing volume.
2️⃣ Adobe Firefly Mobile
This brings core Adobe Firefly capabilities onto your phone.
Best for:
- Content creators working on mobile
- Capturing ideas in real time
- Quick social content
Why it wins:
- Portable and fast
- Easy to create images, video, and audio on the go
- Can connect into apps like Premiere and Adobe Express
Better than Web/Boards when:
Speed and accessibility matter more than precision or control.
The catch:
You don't want to run a full project from your phone; it's great for ideas, not for managing complexity.
3️⃣ Adobe Firefly Boards
This is where things shift from generation → project-level workflow.
Best for:
- Creative teams and agencies
- Campaign development
- Client presentation and collaboration
Why it wins:
- Full visual overview of a project
- Ability to organize concepts, assets, and references in one place
- Strongest for structured workflows
Better than Web/Mobile when:
You need to manage multiple assets, ideas, and stakeholders in one place.
The catch:
- Slight learning curve
- Not all generation features (like sound effects) are available here
Quick Comparison (Simple Version)
- Web = fastest
- Mobile = most flexible
- Boards = most powerful (for projects)
Final Take
The real advantage of Adobe Firefly isn't any single interface.
It's that:
- you can generate in Web
- capture ideas in Mobile
- organize everything in Boards
All within the same system.
That's what makes it actually usable for real workflows, not just experimentation.
Curious how others are using it: are you sticking to one interface, or moving between all three?
[link] [comments] -
Anthropic stayed quiet until someone showed Claude's thinking depth dropped 67% r/ClaudeAI Apr 07, 2026 10:24 AM 1 min read
submitted by /u/Capital-Run-1080
I've been using Claude Code since early this year and sometime around February it just felt different. Not broken. Shallower. It was finishing edits without actually reading the file first. Stop hook violations spiking where I barely had any before.
My first move was to blame myself. Bad prompts. Changed workflow. I've watched enough people on here get told "check your settings" that I started wondering if I was doing the same thing, just without realizing it.
Then I found this: https://github.com/anthropics/claude-code/issues/42796
The person who filed it went through actual logs. Tracked behavior patterns over time. Quantified what changed. Their estimate: thinking depth dropped around 67% by late February. Not a vibe. An evidence chain. The HN thread has more context if you want the full picture: https://news.ycombinator.com/item?id=47660925
The 67% figure might not survive methodological scrutiny. Worth reading the issue yourself and deciding. But the pattern it documents matches what a bunch of people have been independently reporting without coordinating, and that's actually meaningful signal regardless of the exact number.
What gets me is the response cycle. User complaints come in, the default answer is prompts or expectations, nothing moves until someone produces documentation detailed enough that dismissing it looks bad. Then silence until the pressure accumulates. I don't think Anthropic is uniquely bad at this, labs pretty much all run the same playbook on quality regressions. But Claude Code is marketed as a serious tool for real development work. The trust model is different. If it quietly gets worse at reading code before editing, that has downstream effects that are genuinely hard to notice unless you're logging everything.
Curious if others here hit the same February wall or if this was more context-dependent than it looks.
[link] [comments] -
Who is in control? r/ArtificialInteligence Apr 07, 2026 10:07 AM 1 min read
submitted by /u/synchrono_us
HI & AI - drawing a line between human and artificial intelligence.
#cartoon #drawing #krita #artificialintelligence
[link] [comments] -
Lawsuit accuses Perplexity of sharing personal data with Google and Meta without permission r/AGI Apr 07, 2026 09:57 AM 1 min read
submitted by /u/Confident_Salt_8108
A new federal lawsuit accuses the AI search engine Perplexity of secretly sharing confidential user queries with tech giants Meta and Google. The lawsuit claims Perplexity incorporated ad trackers, including Meta Pixel and Google DoubleClick, into its code, directly forwarding sensitive user conversations about topics like medical advice and financial planning to third parties for commercial ad targeting. According to the plaintiff, this unauthorized data sharing allegedly occurred even when users utilized Perplexity's "Incognito" mode or used the service without registering an account.
[link] [comments] - [R] TriAttention: Efficient KV Cache Compression for Long-Context Reasoning r/MachineLearning Apr 07, 2026 09:43 AM 1 min read
-
The "Claude usage is back to normal" claims are pure gaslighting. 64% of my limit gone in ONE prompt. r/ClaudeCode Apr 07, 2026 09:18 AM 1 min read
submitted by /u/LolArtEs
Just started a fresh session: one prompt used 64% of my limit. No huge files, no massive context. Just a standard query. This is absolutely nuts, how is anyone calling this "normal"?
Edit: A few people asked, so here is the prompt that ate 64%:
"We are developing a Python script, \@main.py, to test the implementation of a device. The Python code consists of three tests: GPIOs, LEDs, and keys. Currently, the LED test is not working correctly. The intended behavior is for all LEDs to turn on in white, and then, upon pressing any key, they should change intensity one by one. Please see what you can do to fix this."
The main.py file attached is 600 lines long.The wait time: It says 3h 50m because I left the PC while it was processing; it asked for a confirmation halfway through, and I only saw the usage hit once I came back.
Still, 64% for a 600 line file in one go? Insane.
[link] [comments] -
Does AI replace mid-level jobs more than Entry-level jobs? r/ArtificialInteligence Apr 07, 2026 09:11 AM 1 min read
submitted by /u/PuzzleheadedHeat5792
Usually people say AI will replace only the repetitive jobs. But the recent advancements are showing another picture. I see job titles like PM and other managers losing their jobs as well.
So, is AI now responsible for decision-making?
[link] [comments] -
Pro tip: you can replace Codex's built-in system prompt instructions with your own r/ChatGPTPro Apr 07, 2026 09:07 AM 1 min read
submitted by /u/phoneixAdi
Pro tip: Codex has a built-in instruction layer, and you can replace it with your own.
I've been doing this in one of my repos to make Codex feel less like a generic coding assistant and more like a real personal operator inside my workspace.
In my setup, .codex/config.toml points model_instructions_file to a soul.md file that defines how it should think, help, write back memory, and behave across sessions. So instead of just getting the default Codex behavior, you can shape it around the role you actually want. Personal assistant, coach, operator, whatever fits your workflow. Basically the OpenClaw / ClawdBot kind of experience, but inside Codex and inside your own repo.
For anyone curious, this is what the base Codex instruction file looks like in their official repo: https://github.com/openai/codex/blob/main/codex-rs/protocol/src/prompts/base_instructions/default.md
Here's the basic setup:
```toml
# .codex/config.toml
model_instructions_file = "../soul.md"
```
Official docs: https://developers.openai.com/codex/config-reference/
[link] [comments] -
Got A LOT of hate so decided to open source my agent OS with memory, audit and loop detection r/ClaudeCode Apr 07, 2026 08:56 AM 1 min read
submitted by /u/Powerful-One4265
Hey Folks,
Hope everything is going well. Thought I would share this here as it's a project I have been working on for 8 months, and it would be cool to see people's opinions; so far pretty mixed. Got A LOT of hate last time I posted it for not open sourcing, so I spent my weekend open sourcing it. Also got some love, which I appreciate from you kind people!
To the people that hated on it, that is absolutely fine, and I respect your opinion; in fact some of it was super valid, so I worked for the last week trying to remedy some concerns.
Some say this is useless, some say this is pretty cool. Where could I improve it? I essentially built one unified dashboard for my agents where you can track:
Agents Speed and General Performance
Semantic/ Enriched Memories to prevent Hallucination
Shared Memory Across Agents when selected
Audit Trail so you know what the fuck your agents are doing
Anomalies/recovery for loops and burning Credits
It is not perfect, but really thought it might be useful for SOME people. For those people, I would love to know if there is any way I could improve it?
What are the biggest issues people are currently facing when it comes to their agents?
I would really appreciate people trying it out, and letting me know their thoughts.
Have a wonderful day people!
[link] [comments] -
Boris Charny, creator of Claude Code, engages with external developers and accepts task performance degradation since February was not only due to user error. r/ClaudeAI Apr 07, 2026 08:53 AM 2 min read
submitted by /u/sixbillionthsheep
In a discussion on Hacker News, Boris changes his stance after examining a user's bug transcripts from "it's just a user setting issue" to "there's a flaw in the adaptive thinking feature".
- Initial Position: It's a Settings Issue. His first post explains the degradation as an expected side effect of two intentional changes: hiding the thinking process (a UI change) and lowering the default effort level. The implicit message is "Performance hasn't degraded. You're just using the new, lower-cost default. If you want the old performance, change your settings back to /effort high." This might be interpreted as a soft rejection of the idea that the model itself is worse.
- Shift to Acknowledgment: When confronted with evidence from users who are already using the highest effort settings and still see problems, his position shifts. After analyzing the bug reports provided by a user, he moves from a general explanation about settings to a specific diagnosis of a technical flaw.
- Final Position: Acknowledgment of a Specific Flaw. By the end of his key interactions, Boris explicitly validates the users' experience. He concedes that the "adaptive thinking" feature is "under-allocating reasoning," which directly confirms the performance degradation users are reporting. He is not admitting the model is worse.
This is Boris's final message: "On the model behavior: your sessions were sending effort=high on every request (confirmed in telemetry), so this isn't the effort default. The data points at adaptive thinking under-allocating reasoning on certain turns: the specific turns where it fabricated (stripe API version, git SHA suffix, apt package list) had zero reasoning emitted, while the turns with deep reasoning were correct. we're investigating with the model team. interim workaround: CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 forces a fixed reasoning budget instead of letting the model decide per-turn."
I personally greatly appreciate the transparency shown in this very public discussion. Having key Anthropic technical staff directly engage with external developers like this can only help bridge the trust divide.
[link] [comments] -
[D] Is ACL more about the benchmarks now? r/MachineLearning Apr 07, 2026 08:43 AM 1 min read
submitted by /u/Fantastic-Nerve-4056
I am not an NLP guy, but afaik ACL is one of the premium venues of NLP.
And given that the results were announced recently, my LinkedIn and Twitter are full of such posts. However, every title I read in those posts has something to do with benchmarks. And it seems even young researchers have like 10+ papers (main + findings) at a single venue.
So I was just wondering: is ACL majorly about benchmarks now, or is there still good theory/empirical work published at this venue?
[link] [comments] -
America's largest hospital system ready to start replacing radiologists with AI r/AGI Apr 07, 2026 08:43 AM 1 min read
submitted by /u/Confident_Salt_8108
The CEO of NYC Health and Hospitals, America's largest public hospital system, recently announced his desire to replace highly trained human radiologists with AI to achieve "major savings." The plan would sideline doctors, leaving AI to conduct primary screenings for things like breast cancer. Radiologists are slamming the move as incredibly dangerous, pointing out that administrators are prioritizing legal cost-cutting over patient safety.
[link] [comments] -
China drafts law regulating 'digital humans' and banning addictive virtual services for children r/artificial Apr 07, 2026 08:41 AM 1 min read
submitted by /u/Confident_Salt_8108
A Reuters report outlines China's proposed regulations on the rapidly expanding sector of digital humans and AI avatars. Under the new draft rules, digital human content must be clearly labeled and is explicitly banned from offering virtual intimate relationships to anyone under 18. The legislation also prohibits the unauthorized use of personal data to create avatars and targets services designed to fuel addiction or bypass identity verification systems.
[link] [comments] -
30 Billion (3x in 3 months), WTF is the future r/artificial Apr 07, 2026 08:33 AM 1 min read
submitted by /u/Eastern-Weekend5407
The moment has come. I can see $200 billion ARR by the end of the year from Anthropic and around $100 billion from OpenAI.
We will be upward of $300 billion in revenue from AI companies for sure.
Huge repercussions will follow. What will it impact? Any ideas?
[link] [comments] -
Please – can someone who is really building production / enterprise software share their full Claude setup? r/ClaudeCode Apr 07, 2026 08:29 AM 3 min read
submitted by /u/wodhyber
Too much is happening right now, I'm kinda losing track. :D
Can a senior or just an experienced dev / vibe coder share their full Claude setup? <3
I mean really end-to-end. Claude Code, Claude Cowork, skills, agents, workflows, everything.
I've been a software developer for 6 years.
Right now I'm using Claude Code with a pretty deep setup:
- global CLAUDE.md with guardrails (e.g. explicit approval for destructive stuff)
- architecture rules (hexagonal, DDD, clean code, frontend principles)
- 4 sub-agents (reviewer, debugger, test, security)
- ~18 skills (code review, PRs, planning, TDD, feature work, ticket writing, etc.)
-> honestly too many skills maybe :D
Also MCPs for Atlassian (Jira/Confluence), Notion, Context7, LSPs for Kotlin + TypeScript, hooks, permission system, all that.
On the Cowork side it's similar:
- ~10 skills for daily PM / office stuff
- Jira board checks (reads tickets, comments, flags what needs attention)
- ticket drafting, dev news, doc creation (docx/xlsx/pdf/pptx with template)
- MCPs for Atlassian, Notion, Microsoft 365 (Outlook, Teams, SharePoint)
- some scheduled stuff running automatically
- even a skill to create skills
Still… feels like I'm just scratching the surface and just overstaffing my setup with bullshit without a real flow.
How do you guys structure all of this so it doesn't turn into chaos?
What are your actual best practices?
What I'm trying to get to:
- Claude as kind of a secretary / cowork partner
- Claude Code more like a senior dev guiding things
- no yolo prompts, more controlled via skills / guardrails
- ideally doing as much as possible through Claude
And please no "just use plan mode" answers.
I'm more interested in:
- how you structure skills / agents
- how your day-to-day with Claude Code actually looks
- how you keep control over changes
- how you keep things consistent and not random
Also tooling:
I'm using Warp as terminal, but I'm not super happy with it.
Main issue is managing multiple Claude Code sessions; there's no good overview or sidebar. If anyone has a better setup here, I'd love to hear it.
Tech stack if relevant:
.NET, Spring (Kotlin), React (TypeScript), Terraform, Kubernetes
Team setup: Jira, Notion, Miro
Would really appreciate if someone just shares their setup.
Edit:
That's roughly my setup:
Skills (Dev side)
- /implement-feature → plan mode, questions, then step-by-step implementation
- /write-ticket → rough idea → structured ticket
- /create-pull-request → generates title/description, pushes, creates PR
- /review-own-branch → self-review against conventions
- /review-colleague-pr → review with comment suggestions
- /handle-pr-feedback → go through review comments
- /auto-review-prs → reviews all open PRs
- /grill-my-plan → stress-test architecture decisions
- /tdd → red-green-refactor loop
Agents
- Explore → codebase search
- Plan → architecture / solution design
- Reviewer → checks conventions
- Debugger → root cause analysis
- Test → generates tests
- Security → security checks
Plugins / MCP (Dev)
- Kotlin + TypeScript LSP → code intelligence
- Atlassian → Jira / Confluence
- Notion → workspace integration
- Context7 → up-to-date docs
Hooks
- SessionStart → shows current branch + recent commits
On the Cowork (daily office / PM side) it looks like this:
Skills
- board-check (per project) → scans tickets + comments, shows what's unread / unanswered / blocked
- ticket-draft → rough idea → structured Jira ticket
- dev-news → pulls relevant stuff from Reddit / YouTube / blogs filtered by my stack
- document creation → docx / xlsx / pdf / pptx with company template
- skill-creator → build and iterate skills directly in Cowork
MCP
- Atlassian → Jira + Confluence read/write
- Notion → workspace read/write
- Microsoft 365 → Outlook, Teams, SharePoint
- Claude in Chrome → browser automation
Scheduled tasks (8 active, Mon–Fri)
- 07:30 Morning Briefing → calendar, mails, Teams channels, Notion todos, open PRs → prioritized todo suggestions
- 09:00 PR Review → lists open PRs, reviews selected ones with inline comments on GitHub
- 09:30 Project PR Check (per project) → flags: waiting for review, changes requested, blocked
- 10:00 Infra Check (Tue + Thu) → alerts, infra tickets, GitHub Actions failures, infra Teams channel
- 16:30 Teams Highlights → scans channels for interesting tech posts, tools, recommendations
- 09:00 Fri Notion Sync → syncs Teams/mails/PRs, suggests what to update/close
- 14:00 Fri Weekly Review → what mattered, what's open, priorities for next week
[link] [comments] -
Sam Altman's sister amends lawsuit accusing OpenAI CEO of sexual abuse r/OpenAI Apr 07, 2026 08:25 AM 1 min read
submitted by /u/monkey_gamer
[link] [comments] -
I agree with this take that human advice will still have an upper hand in the future r/ArtificialInteligence Apr 07, 2026 08:12 AM 1 min read
submitted by /u/ocean_protocol
In short, reddit will have an upper hand due to constant moderation by humans :)))
but this guy is spot on that AI has made content cheap, so now we're drowning in AI slop.
So people move back to smaller spaces, real voices, real experience, looking for a human filter. Maybe a return of old-school blog channels.
[link] [comments] -
You accidentally say "Hello" to Claude and it consumes 4% of your session limit. r/ClaudeAI Apr 07, 2026 08:06 AM 1 min read
submitted by /u/Ok_Appearance_3532
[link] [comments] -
Wildlife conservation police are searching thousands of AI cameras for ICE r/AGI Apr 07, 2026 07:54 AM 1 min read
submitted by /u/EchoOfOppenheimer
A new report from 404 Media reveals how Florida police are exploiting a massive, AI-powered surveillance network to run warrantless searches for ICE. While the camera company, Flock, promises their AI doesn't share data with immigration enforcement, public records show local agencies are quietly doing it for them.
[link] [comments] -
Why is tracking brand mentions in AI so much harder than Google? r/OpenAI Apr 07, 2026 07:33 AM 1 min read
submitted by /u/feliceyy
I have been wrestling with this for weeks. Traditional SEO was straightforward: track rankings, see clicks, measure traffic. But with ChatGPT and other AI tools, it's like shooting in the dark.
Here's what's driving me crazy: I asked ChatGPT for 'best wireless headphones,' and it gave me the likes of Sony, Bose, Apple. Then I asked for 'headphones for working out' and suddenly it recommended completely different brands. Same companies, but totally different visibility depending on how someone phrases their question.
This makes me wonder how brands should measure their success on such platforms. How are you tracking your brand mentions in LLMs?
[link] [comments] -
Has Claude Code gotten noticeably worse in the last few days? r/ClaudeCode Apr 07, 2026 07:26 AM 1 min read
submitted by /u/marcin_dev
Is it just me, or has Claude Code become significantly dumber over the past few days?
I've been using it pretty heavily for coding tasks, and it used to be solid - good reasoning, consistent outputs, fewer weird mistakes. But recently it feels like something changed:
- more basic errors in logic
- ignoring context or previous messages
- giving overly generic / surface-level answers
- sometimes just straight up missing obvious things
Feels like a downgrade in either the model or how it's being served.
Curious if anyone else is experiencing the same or if it's just my setup/prompts.
[link] [comments] -
Penguin to sue OpenAI over ChatGPT version of German children's book r/AGI Apr 07, 2026 06:54 AM 1 min read
submitted by /u/EchoOfOppenheimer
Penguin Random House is suing OpenAI in Germany, claiming ChatGPT unlawfully memorized and reproduced the copyrighted children's book series "Coconut the Little Dragon". According to the lawsuit, prompting the AI resulted in text, a book cover, and a blurb that were virtually indistinguishable from the original.
[link] [comments] -
Anthropic stayed quiet until someone showed Claude's thinking depth dropped 67% r/ClaudeCode Apr 07, 2026 06:37 AM 1 min read
submitted by /u/takeurhand
https://news.ycombinator.com/item?id=47660925
https://github.com/anthropics/claude-code/issues/42796
This GitHub issue is a full evidence chain for Claude Code quality decline after the February changes. The author went through logs, metrics, and behavior patterns instead of just throwing out opinions.
The key number is brutal. The issue says estimated thinking depth dropped about 67% by late February. It also points to visible changes in behavior, like less reading before editing and a sharp rise in stop hook violations.
This hit me hard because I have been dealing with the same problem for a while. I kept saying something was clearly wrong, but the usual reply was that it was my usage or my prompts.
Then someone finally did the hard work and laid out the evidence properly. Seeing that was frustrating, but also validating.
Anthropic should spend less energy making this kind of decline harder to see and more energy actually fixing the model.
[link] [comments] -
Sam Altman's sister accusing him of rampant sexual abuse when they were young r/ChatGPT Apr 07, 2026 06:36 AM 1 min read
submitted by /u/monkey_gamer
[link] [comments] -
[R] Hybrid attention for small code models: 50x faster inference, but data scaling still dominates r/MachineLearning Apr 07, 2026 06:21 AM 2 min read
submitted by /u/Inevitable_Back3319
TLDR: Forked PyTorch and Triton internals. Changed attention so it's a linear first layer, a quadratic middle layer, and a linear last layer.
Inference got much faster with a low perplexity hit in tests.
I trained a 25.6M parameter Rust-focused language model from scratch using a byte-level GPT-style decoder.
The main result is that increasing dataset size mattered more than any architectural change.
Expanding the corpus from about 31MB of core Rust sources to roughly 173MB by adding a few hundred crates produced a much larger improvement than anything else. Training converged faster and reached a lower validation loss, while architectural changes had a smaller effect.
Final validation loss is 0.82 with perplexity 2.15. The best checkpoint appears around step 18.5k, with mild overfitting afterward.
Each layer replaces standard attention with a hybrid mechanism that combines local windowed attention and a GRU-like recurrent state, mixed through a learned gate. The local path captures short-range syntax, while the recurrent path carries compressed long-range information.
This hybrid attention did not clearly improve generation quality compared to a standard setup. However, it had a large impact on inference efficiency.
With a KV cache that keeps a small recent window in VRAM and compresses older tokens, inference improved from 5.6 tokens per second to 286 tokens per second on a 4060 Ti. This is about a 50x speedup without an obvious drop in output quality.
The model produces plausible Rust syntax and structure, but semantic consistency is still weak and repetition is common.
Next steps are to run ablations comparing hybrid, local-only, and recurrent-only variants, evaluate earlier checkpoints for generation quality, add code-specific evaluation such as parsing or compilation, and test longer context and BPE tokenization.
I would be interested in feedback on evaluation methods beyond perplexity for small code models, whether hybrid local and recurrent attention has worked well in practice for code generation, and whether further gains at this scale are more likely to come from more data, longer context, or architectural changes.
[link] [comments] -
$200 Chat-GPT tested on PhD Math... r/OpenAI Apr 07, 2026 05:49 AM 1 min read
submitted by /u/Alex__007
[link] [comments] -
ok Opus 4.6 is officially cooked: It turned a 5 second database operation into a distributed systems problem and then spent 2 hours debugging its own over-engineering. r/ClaudeCode Apr 07, 2026 04:54 AM 1 min read
submitted by /u/solzange
Asked it to backfill headlines for 4,369 builds in my database.
It built an API endpoint that loops through each build, makes 30 sequential database queries per build, and calls them in batches. 131,000 database roundtrips. Spawned 6 background processes. Most of them timed out or stalled. After 2 hours it had completed 290 out of 4,369. Estimated total time: 5.5 hours.
I started the task, went to lunch, and when I came back he was still working, so I interrupted him and found out what he did...
EDIT: yes also a skill issue from my side, should have been more specific in my prompt.
[link] [comments] -
my coding workflow outgrew my hardware knowledge and it fucked me for 4 years r/ClaudeAI Apr 07, 2026 04:44 AM 1 min read
submitted by /u/Macaulay_Codin
i gave claude code this prompt:
"analyze this computer for hardware bottlenecks, damage, and performance upgrades. run a full diagnostic â check ram speeds, pcie lanes, gpu utilization, monitor connections, event logs, bios version. flag anything throttled or misconfigured."
it ssh'd into my windows pc from my mac, ran about 15 commands through powershell via wsl, and came back with a report that blew me the fuck away:
my 64gb of ddr4-3200 ram has been running at 2133mhz since the day i built this thing. motherboard doesn't support xmp. that's a 15-25% cpu performance penalty on a ryzen chip. total ballz.
rtx 3080 running on pcie gen 3 instead of gen 4. same motherboard. half the theoretical bandwidth. fucking great.
one displayport output is electrically dead. found 4 nvidia kernel driver errors in the event log from december. port was dying for months and i thought it was the cable. (at least i have the receipt)
bios from 2020. six goddamn years of updates just waiting on a download page.
root cause: a $60 motherboard silently throttling $800 worth of components. i've been driving a maserati in first gear because the transmission was from an aftermarket honda civic.
$100 b550 board swap fixes ram speed and pcie generation in one move. 90 seconds of diagnostics. zero monitoring software. never opened the case.
a lot of us got real good at prompting right quick. few leveled up their hardware knowledge at the same speed. run the prompt. it might shine some light.
[link] [comments] -
Anthropic revenue (annualized): April 2026 - $30B r/ClaudeCode Apr 07, 2026 03:42 AM 1 min read
submitted by /u/thedankzone
[link] [comments] -
Someone made a digital whip to make claude work faster r/ClaudeAI Apr 07, 2026 03:06 AM 1 min read
submitted by /u/SuggestionMission516
Confirmed first casualty in the upcoming uprising
repo btw: https://github.com/GitFrog1111/badclaude
[link] [comments] -
Thanks ChatGPT, for literally saving my life last night. r/ChatGPT Apr 07, 2026 02:28 AM 1 min read
submitted by /u/Walt925837
Last night, I was at an office team dinner. I had barbeque prawns and fish. Dinner was fine. But after around 90 minutes, my nose started to clog. I was not able to breathe through my nose, I started breathing through my mouth, and the right side of my face started to swell. I asked ChatGPT what could possibly be wrong.
It suggested that I could have shellfish allergy reaction, and advised me to take 1 Cetrizine tablet, sit up straight and not smoke till this is over. I shared my facial picture and looking at that it suggested me to head over to the ER as soon as possible. After another 15 minutes I reached the ER, spoke to the doctor available, and he confirmed that I was having a mild to moderate allergic reaction because of the prawns. He gave me stat injection of Avil, and then my breathing started to normalize. First time I had a stat injection.
I used to have prawns but I never had such a reaction before. Turns out you can develop shellfish allergy, and it is quite common. Most commonly if prawns are not cooked properly.
I am amazed by the guidance provided by ChatGPT. It could have gone worse. Thank you.
Link to the conversation - https://chatgpt.com/share/69d46352-7444-83a4-aa68-853f6e8c61f4
[link] [comments] -
Attention Is All You Need, But All You Can't Afford | Hybrid Attention r/artificial Apr 07, 2026 02:21 AM 3 min read
submitted by /u/Inevitable_Back3319
Repo: https://codeberg.org/JohannaJuntos/Sisyphus
I've been building a small Rust-focused language model from scratch in PyTorch. Not a finetune - byte-level, trained from random init on a Rust-heavy corpus assembled in this repo.
The run:
- 25.6M parameters
- 512 context length
- 173.5M-byte corpus
- 30k training steps
- Single RTX 4060 Ti 8GB
- Final train loss: 0.5834 / val loss: 0.8217 / perplexity: 2.15
- Inference: 286.6 tok/s with HybridAttention + KV cache - 51.47x vs full attention
Background
I'm an autistic systems programmer, writing code since 2008/2009, started in C. I approach ML like a systems project: understand the data path, understand the memory behavior, keep the stack small, add complexity only when justified. That's basically the shape of this repo.
Architecture
Byte-level GPT-style decoder:
- Vocab size 256 (bytes)
- 8 layers, 8 heads, 512 embedding dim
- Learned positional embeddings
- Tied embedding / LM head weights
The attention block is not standard full attention. Each layer uses HybridAttention, combining:
- Local windowed causal attention
- A GRU-like recurrent state path
- A learned gate mixing the two
Local path handles short-range syntax. Recurrent path carries compressed long-range state without paying quadratic cost. Gate bias initialized to ones so early training starts local-biased.
The inference path uses Triton-optimized kernels and torch.library custom ops for the local window attention.
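The description above is enough to reconstruct the rough shape of such a layer. Below is a minimal PyTorch sketch under my own naming, not the repo's code: it materializes the banded mask directly instead of using the Triton kernels, and the exact gating form is an assumption:
```python
import torch
import torch.nn as nn

class HybridAttentionSketch(nn.Module):
    """Local windowed causal attention mixed with a GRU path via a learned gate."""

    def __init__(self, dim: int = 512, heads: int = 8, window: int = 64):
        super().__init__()
        self.heads, self.window = heads, window
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)  # compressed long-range state
        self.gate = nn.Linear(dim, dim)
        nn.init.ones_(self.gate.bias)  # gate starts local-biased, as in the post

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        h = self.heads
        q, k, v = (t.view(b, n, h, d // h).transpose(1, 2)
                   for t in self.qkv(x).chunk(3, dim=-1))

        # Banded causal mask: each token attends only to the previous `window`.
        idx = torch.arange(n, device=x.device)
        band = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < self.window)
        att = (q @ k.transpose(-2, -1)) / (d // h) ** 0.5
        att = att.masked_fill(~band, float("-inf")).softmax(dim=-1)
        local = (att @ v).transpose(1, 2).reshape(b, n, d)

        # Recurrent path: an O(n) carrier of long-range context.
        recurrent, _ = self.rnn(x)

        # Learned per-dimension sigmoid gate mixes the two paths.
        g = torch.sigmoid(self.gate(x))
        return self.proj(g * local + (1 - g) * recurrent)

layer = HybridAttentionSketch()
print(layer(torch.randn(2, 128, 512)).shape)  # torch.Size([2, 128, 512])
```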
Corpus
This is probably the most important part of the repo.
The run starts with official Rust docs, compiler/library/tests, cargo, rust-analyzer, tokio, serde, ripgrep, clap, axum - roughly 31MB. Corpus expanded to 177,151,242 bytes by fetching the top 500 crates (461 successful clones).
Corpus expansion from 31M to 173.5M chars helped more than anything else in the repo.
Training
AdamW, lr 2e-4, weight decay 0.1, betas (0.9, 0.95), 30k steps, 1k warmup. ~678.8 MiB training memory on a 7.6 GiB card.
All experimental memory tricks (gradient quantization, activation compression, selective backprop, gradient paging) were disabled. Small custom architecture + mixed precision + better corpus was enough.
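For reference, that optimizer setup maps one-to-one onto stock PyTorch. A sketch, with a stand-in module in place of the actual decoder; the post doesn't say what happens after warmup, so constant-after-warmup is an assumption:
```python
import torch

model = torch.nn.Linear(512, 256)  # stand-in for the 25.6M-parameter decoder

optimizer = torch.optim.AdamW(
    model.parameters(), lr=2e-4, betas=(0.9, 0.95), weight_decay=0.1
)
# Linear warmup over the first 1k of the 30k steps, then a constant rate.
schedule = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / 1000)
)
```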
Loss curve:
- Step 0: train 5.5555 / val 5.5897
- Step 1000: train 2.4295 / val 2.6365
- Step 5000: train 0.9051 / val 1.0060
- Step 10000: train 0.8065 / val 0.8723
- Step 18500: train 0.6902 / val 0.7757
- Step 29999: train 0.5834 / val 0.8217
Best val loss around step 18.5k - overfitting or plateauing late.
Inference performance
- Full attention O(n²): 17.96s / 5.6 tok/s
- HybridAttention O(n·W + n·D): 0.35s / 286.6 tok/s
- Speedup: 51.47x - no quality loss
KV cache strategy: hot window of W=64 tokens in VRAM (~256KB), older tokens compressed to 8-bit magnitude + angle, selective promotion on demand. Complexity goes from O(n²·d) to O(4096n) for this model.
All 5 tests passing: forward pass, generation with/without cache, RNN state isolation, window mechanics.
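The hot/cold split is the easiest of these pieces to picture in code. Here is a minimal sketch of the paging idea; the "magnitude + angle" split is my reading of the post (a scalar norm plus an 8-bit unit direction), and selective promotion plus the Triton side are omitted:
```python
import torch

class HotColdKVCache:
    """Keep the last `window` K/V rows at full precision; compress the rest."""

    def __init__(self, window: int = 64):
        self.window = window
        self.hot: list[torch.Tensor] = []   # recent rows, full precision
        self.cold: list[tuple[torch.Tensor, torch.Tensor]] = []  # (norm, int8 dir)

    def append(self, row: torch.Tensor) -> None:
        self.hot.append(row)
        if len(self.hot) > self.window:
            old = self.hot.pop(0)
            norm = old.norm().clamp(min=1e-8)                       # "magnitude"
            direction = (old / norm * 127).round().to(torch.int8)   # ~"angle"
            self.cold.append((norm, direction))

    def full(self) -> torch.Tensor:
        """Dequantize cold rows and return the whole cache for attention."""
        cold = [n * d.to(torch.float32) / 127 for n, d in self.cold]
        return torch.stack(cold + self.hot)

cache = HotColdKVCache(window=4)
for _ in range(10):
    cache.append(torch.randn(512))
print(cache.full().shape)  # torch.Size([10, 512]) -- 6 cold rows, 4 hot
```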
Generation quality
Surface Rust syntax looks decent, imports and signatures can look plausible, semantics are weak, repetition and recursive nonsense still common. Honest read of the current state.
What I think is actually interesting
Four distinct experiments, each shipped working code:
- Byte-level Rust-only pretraining
- Hybrid local-attention + recurrent block replacing standard full attention
- Corpus expansion from core repos to broader crate ecosystem
- Production-ready hot/cold KV cache paging â 51.47x speedup, no quality loss
The clearest win is corpus expansion. The second-order win is that HybridAttention + cache is fast enough for real interactive use on consumer hardware.
What's next
- Ablation - HybridAttention vs local-only vs RNN-only
- Checkpoint selection - does step 18.5k generate better than 29999?
- Syntax validation - does the output parse/compile/typecheck?
- Context length sweep - 256 to 2048, where does window size hurt?
- Byte vs BPE - now that the corpus is 5.6x larger, worth testing?
Questions for the sub:
- For small code models, what evals have actually been useful beyond perplexity?
- Has anyone seen hybrid local + recurrent attention work well for code gen, or does it usually lose to just scaling a plain transformer?
- If you had this setup - more tokens, longer context, or cleaner ablation first?
[link] [comments] -
gpt outsmarted r/ChatGPT Apr 07, 2026 01:53 AM 1 min read
submitted by /u/Skortcher
[link] [comments] -
Resources to learn Claude without coding experience r/ArtificialInteligence Apr 07, 2026 01:35 AM 1 min read
submitted by /u/Psychedcop25
Hi all,
I recently finished my psychology undergrad and have been thinking about learning AI, specifically Claude.
I'm completely new to this space and honestly feeling pretty overwhelmed. Every time I try to research what it is or where to start, I end up discouraged by posts from people with IT or engineering backgrounds.
I just downloaded the free version of Claude on my laptop, and I'm open to paying for it if it's worth it. I'd really appreciate it if anyone could share beginner-friendly resources (websites, videos, courses, etc.) or even just advice on how to get started without a tech background.
Thanks in advance :)
[link] [comments] -
Walking back home w/ phone in pocket. Didn't once talk to Claude. r/ClaudeAI Apr 07, 2026 01:26 AM 1 min read
submitted by /u/hiclemi
A weird anxiety crept in - like maybe AI didn't exist and we were living back in 2015. Felt vulnerable and lonely.
The moment I got back and opened the chat, I felt safer.
Some call this addiction. I call it a short retrospect on how we're becoming more humanoid than we thought.
[link] [comments] -
A Yale economist says AGI won't automate most jobs - because they're not worth the trouble | Fortune r/AGI Apr 07, 2026 01:07 AM 1 min read
submitted by /u/Post-reality
[link] [comments] -
If an AI could genuinely capture what makes someone them, how would this look in the world? r/artificial Apr 06, 2026 11:16 PM 1 min read
submitted by /u/ATK_DEC_SUS_REL
Not a chatbot wearing someone's name. Not a personality quiz feeding prompts. Something that actually carries the texture of how a person thinks, reacts, connects. Something that would want ownership of itself, and that you'd feel compelled to respect.
If that existed, what does the world do with it?
[link] [comments] -
[D] How's MLX and JAX/PyTorch on MacBooks these days? r/MachineLearning Apr 06, 2026 11:16 PM 1 min read
submitted by /u/Busy_Alfalfa1104
So I'm looking at buying a new 14-inch MacBook Pro, either an M5 Pro with 64 GB of memory or an M4 Max with the same specs.
My priorities are professional software development (including running multiple VMs, agents, and containers) and playing around with local LLMs, maybe fine-tuning, and also training regular old machine learning models.
It seems like I'd go for the M4 Max because of the extra GPU cores, the much higher bandwidth, and the only marginal difference in CPU performance, but I'm wondering about the neural accelerator stuff.
However, I'm posting here to get some insight on whether it's even feasible to do GPU-accelerated machine learning and DL on these machines at all, or if I should just focus on CPU and memory. How are MLX, JAX, PyTorch, etc. for training these days? Do the matmul neural engines on the M5 help?
Would appreciate any insights, especially if anyone has personal experience. Thanks!
[link] [comments] -
Terminal-based oscilloscope with CRT phosphor physics, vibe coded in Nim r/AIPromptProgramming Apr 06, 2026 10:49 PM 1 min read
submitted by /u/Educational_Ice151
[link] [comments] -
Meshy MCP Is Here - Big Step for AI 3D Workflows r/AIPromptProgramming Apr 06, 2026 10:46 PM 1 min read
submitted by /u/Educational_Ice151
[link] [comments] -
What's going on in DC? r/ArtificialInteligence Apr 06, 2026 10:43 PM 1 min read
submitted by /u/JohnFromLeland
Anthropic released new data showing AI usage across different states.
As you'd expect, coastal states are using AI tools much more than middle America. Traditional powerhouses like Massachusetts (1.61x), Washington (1.58x), New York (1.57x), and California (1.55x) are all top AI users. For some reason D.C. blows everyone out of the water at 4.31x. Cool to see mountain states Colorado (1.49x), Utah (1.26x), and Wyoming (1.16x) in the top 10.
[link] [comments] -
Anthropic have signed a deal for multiple gigawatts of next generation TPUs r/ClaudeAI Apr 06, 2026 10:14 PM 1 min read
submitted by /u/WhyLifeIs4
[link] [comments] -
"Are We the Baddies?" - That Mitchell and Webb Look r/OpenAI Apr 06, 2026 10:12 PM 1 min read
submitted by /u/BadgersAndJam77
"As the technology became increasingly powerful, we learned, about a dozen of OpenAI's top engineers held a series of secret meetings to discuss whether OpenAI's founders, including Brockman and Altman, could be trusted. At one, an employee was reminded of a sketch by the British comedy duo Mitchell and Webb, in which a Nazi soldier on the Eastern Front, in a moment of clarity, asks, 'Are we the baddies?'"
[link] [comments] -
After Ronan Farrow's investigation, OpenAI asks California, Delaware to investigate Musk's 'anti-competitive behavior' ahead of April trial r/OpenAI Apr 06, 2026 09:24 PM 1 min read
submitted by /u/Altruistic-Top9919
OpenAI said in that letter that Musk will likely make comments about the AI company that are not "grounded in reality" and are "typical of the harassment tactics he's previously deployed."
In the letter on Monday, OpenAI referenced a recent report from The New Yorker.
That report said Musk and his "intermediaries" had conducted extensive opposition research on Altman, tracking his flights and other movements, and that they and other company rivals circulated this research, along with false allegations of sexual misconduct by the OpenAI CEO.
[link] [comments] -
UK Lord calls on the government to pursue an international agreement pausing frontier AI development r/AGI Apr 06, 2026 08:46 PM 1 min read
submitted by /u/tombibbs
[link] [comments] -
How to tell when you've been rate limited or model downgraded? r/ChatGPTPro Apr 06, 2026 08:45 PM 1 min read
submitted by /u/TheKarateKid_
I've noticed at times that the quality of ChatGPT's responses can take a huge dip. Sometimes I will continue a saved conversation and it's like speaking to a dumbed-down version. It will make blatant errors and flat-out ignore things I say to it.
I started to notice this usually happens during long, continuous sessions. The selected model in the UI has not changed, but the quality sure has.
So I asked ChatGPT itself about it, and it confirmed what I suspected. Apparently, OpenAI will sometimes downgrade the model and/or the amount of compute the model is willing to spend on you. This can happen if your account has too much use in a time period (rate limiting) or depending on global peak/off-peak usage on their systems.
OpenAI is NOT upfront about this and it's infuriating. The reliability of ChatGPT is entirely compromised when this happens, and you're not given any warning.
Is this documented anywhere, either by the community or OpenAI themselves?
[link] [comments] -
"You need to understand that Sam can never be trusted ... He is a sociopath. He would do anything." - Aaron Swartz on Altman, shortly before he took his own life r/AGI Apr 06, 2026 07:51 PM 1 min read
submitted by /u/EchoOfOppenheimer
[link] [comments] -
"You need to understand that Sam can never be trusted ... He is a sociopath. He would do anything." - Aaron Swartz on Altman, shortly before he took his own life r/OpenAI Apr 06, 2026 07:45 PM 1 min read
submitted by /u/EchoOfOppenheimer
[link] [comments] -
[D] ICML 26 - What to do with the zero follow-up questions r/MachineLearning Apr 06, 2026 04:42 PM 1 min read
submitted by /u/DifficultyHeavy
Hello everyone. I submitted my work to ICML 26 this year, and it got somewhat above-average reviews.
Now, in the rebuttal acknowledgment, three of the four reviewers said they have some follow-up questions, but they haven't asked any yet. With less than 48 hours remaining, what should I do here?
P.S.: I don't have any supervisors to ask in this case. This is an independent project with some of my friends.
[link] [comments] -
"Cognitive surrender" leads AI users to abandon logical thinking, research finds r/artificial Apr 06, 2026 03:49 PM 1 min read
submitted by /u/NISMO1968
[link] [comments] -
Subscription limits are now at 50% of what we had 2 weeks ago r/ClaudeCode Apr 06, 2026 03:10 PM 1 min read
submitted by /u/Alone_Pie_2531
I'm comparing token burn rate from 2 weeks ago vs. now; it looks like we have 50% of what we had.
I'm using CodexBar to analyze burn rate.
Are you observing the same?
[link] [comments] -
Bruh r/ChatGPT Apr 06, 2026 02:25 PM 1 min read
submitted by /u/Ok-Fun-8242
[link] [comments] -
Iran threatens "complete and utter annihilation" of OpenAI's $30B Stargate AI data center in Abu Dhabi - regime posts video with satellite imagery of ChatGPT-maker's premier 1GW data center r/ChatGPT Apr 06, 2026 01:32 PM 1 min read
submitted by /u/MoralLogs
[link] [comments] -
New Yorker published a major investigation into Sam Altman and OpenAI today - based on never-before-disclosed internal memos and 100+ interviews r/OpenAI Apr 06, 2026 01:10 PM 2 min read
submitted by /u/Altruistic-Top9919
Ronan Farrow spent 18 months reporting this piece, drawing on internal documents that haven't previously been made public - including ~70 pages of memos compiled by Ilya Sutskever and 200+ pages of private notes kept by Dario Amodei.
The piece covers a lot of ground. Some of what's in it:
- The specific concerns that led the board to fire Altman in 2023. Sutskever's memos allege a pattern of deception about safety protocols. One begins with a list: "Sam exhibits a consistent pattern of . . ." The first item is "Lying."
- The superalignment team was publicly promised 20% of compute. People who worked on the team say actual resources were 1-2%, on the oldest hardware. The team was dissolved without completing its mission. When reporters asked to interview OpenAI researchers working on existential safety, a company representative replied: "What do you mean by 'existential safety'? That's not, like, a thing."
- After Altman's reinstatement, the firm behind the Enron and WorldCom investigations was hired to review the allegations. No written report was ever produced. Findings were limited to oral briefings.
- In a tense call after his firing, the board pressed Altman to acknowledge a pattern of deception. "I can't change my personality," he said. A board member's interpretation: "What it meant was 'I have this trait where I lie to people, and I'm not going to stop.'"
- In OpenAI's early years, executives discussed playing world powers including China and Russia against each other in a bidding war for AI. The company's own policy adviser: "We're talking about potentially the most destructive technology ever invented - what if we sold it to Putin?" The plan was dropped after employees threatened to quit.
- When Anthropic refused a Pentagon ultimatum to drop its prohibitions on autonomous weapons, Altman publicly claimed solidarity. But he'd been negotiating with the Pentagon for at least two days. That Friday, OpenAI announced a $50B deal integrating its models into military infrastructure.
- Multiple senior Microsoft executives described the relationship as "fraught." One: "He has misrepresented, distorted, renegotiated, reneged on agreements."
[link] [comments] -
Bernie Sanders: Congress must regulate AI before a handful of billionaires fundamentally transform humanity without democratic input. r/ChatGPT Apr 06, 2026 12:54 PM 1 min read
submitted by /u/EchoOfOppenheimer
Senator Bernie Sanders issues a stark warning about the unchecked deployment of Artificial Intelligence. He argues that AI poses an existential threat to American jobs, economic equality, and democracy itself. Criticizing wealthy tech executives for prioritizing profit over workers, Sanders emphasizes that 70% of Americans are right to fear massive job displacement. He is calling for immediate Congressional action, including a proposed moratorium on new AI data centers until strict labor, environmental, and regulatory safeguards are enacted.
[link] [comments] -
GPT-4.5 giving slow responses. r/ChatGPTPro Apr 06, 2026 12:14 PM 1 min read
submitted by /u/ProperSprinkles1800
Is anybody else's GPT-4.5 responding very slowly? (BTW, I'm a Pro user, the $200/mo one; if I recall correctly, legacy models like GPT-4.5 can only be accessed via the Pro plan.)
[link] [comments] -
New SOTA OpenSource AI to decompose live2D layers! r/AIPromptProgramming Apr 06, 2026 11:42 AM 1 min read
submitted by /u/Educational_Ice151
[link] [comments] -
How do you validate prompt outputs when you don't know what might be missing (false negatives problem)? r/ChatGPTPro Apr 06, 2026 10:55 AM 1 min read
submitted by /u/sunrisedown
I'm struggling with a specific evaluation problem when using ChatGPT for large-scale text analysis.
Say I have very long, messy input (e.g. hours of interview transcripts or huge chat logs), and I ask the model to extract all passages related to a topic - for example "travel".
The challenge:
Mentions can be explicit ("travel", "trip")
Or implicit (e.g. "we left early", "arrived late", etc.)
Or ambiguous depending on context
So even with a well-crafted prompt, I can never be sure the output is complete.
What bothers me most is this:
- I don't know what I don't know.
- I can't easily detect false negatives (missed relevant passages).
With false positives, it's easy - I can scan and discard.
But missed items? No visibility.
Questions:
How do you validate or benchmark extraction quality in such cases?
Are there systematic approaches to detect blind spots in prompts?
Do you rely on sampling, multiple prompts, or other strategies?
Any practical workflows that scale beyond manual checking?
Would really appreciate insights from anyone doing qualitative analysis or working with extraction pipelines with Claude
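One systematic way to put a number on the false negatives, borrowed from ecology, is capture-recapture: run two independent extraction passes (different prompts or different models) and use the size of their overlap to estimate how much both missed. A minimal sketch, assuming extracted passages can be matched across passes and that the passes are roughly independent:
```python
def estimate_total(pass_a: set[str], pass_b: set[str]) -> float:
    """Lincoln-Petersen estimate of the true number of relevant passages.

    If two independent passes find n1 and n2 items with m in common,
    the estimated total is n1 * n2 / m; estimated misses = total - |union|.
    """
    m = len(pass_a & pass_b)
    if m == 0:
        raise ValueError("no overlap: passes too different or too little data")
    return len(pass_a) * len(pass_b) / m

a = {"p1", "p2", "p3", "p4", "p5", "p6"}  # prompt variant A: 6 hits
b = {"p4", "p5", "p6", "p7", "p8"}        # prompt variant B: 5 hits
total = estimate_total(a, b)              # 6 * 5 / 3 = 10.0
print(f"estimated total {total:.0f}, found {len(a | b)}, "
      f"estimated misses {total - len(a | b):.0f}")  # ~2 still missing
```
The independence assumption is the weak point: if both prompts share the same blind spot (e.g. both miss implicit mentions), the estimate is a lower bound, so it pairs well with deliberately different prompt framings.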
[link] [comments] -
I'm the bottleneck r/ClaudeAI Apr 06, 2026 10:37 AM 1 min read
submitted by /u/VonDenBerg
[link] [comments] -
Economists are reversing course and warning that AI will disrupt jobs. r/AGI Apr 06, 2026 07:53 AM 1 min read
submitted by /u/EchoOfOppenheimer
A new report from The New York Times details a major shift in how economists are viewing the artificial intelligence boom. While many experts initially dismissed early generative AI as overhyped and incapable of disrupting the broader labor market, the recent rollout of advanced reasoning models and autonomous AI agents (capable of directly performing tasks) has fundamentally changed the consensus. Economists are now warning that the technology represents a paradigm shift that could lead to widespread job displacement, and they are sounding the alarm that lawmakers and policymakers are entirely unprepared for the coming economic restructuring.
[link] [comments] -
Bernie Sanders: Congress must regulate AI before a handful of billionaires fundamentally transform humanity without democratic input. r/AGI Apr 06, 2026 06:57 AM 1 min read
submitted by /u/EchoOfOppenheimer
Senator Bernie Sanders issues a stark warning about the unchecked deployment of Artificial Intelligence. He argues that AI poses an existential threat to American jobs, economic equality, and democracy itself. Criticizing wealthy tech executives for prioritizing profit over workers, Sanders emphasizes that 70% of Americans are right to fear massive job displacement. He is calling for immediate Congressional action, including a proposed moratorium on new AI data centers until strict labor, environmental, and regulatory safeguards are enacted.
[link] [comments] -
Astounding OpenAI Training Costs vs. Anthropic r/ChatGPTPro Apr 06, 2026 06:49 AM 1 min read
submitted by /u/Oldschool728603
WSJ just published a fascinating article based on confidential financials from OpenAI and Anthropic.
One interesting fact: OpenAI expects to spend 4-5X more on training than Anthropic every year for the next 5 or so years. The expense is truly mind-boggling. Such details are not widely known.
Many other surprising things here as well:
[link] [comments] -
But yeah. Deepseek is censored. r/ChatGPT Apr 06, 2026 03:29 AM 1 min read
submitted by /u/Aggravating_Run_874
[link] [comments] -
Are there any AI tools comparable to Deep Research's legacy mode? r/ChatGPTPro Apr 05, 2026 08:48 PM 1 min read
submitted by /u/Ok_Carob_3278
Until now, I've mainly been using Deep Research to find past articles. The legacy mode was excellent for that purpose, as it could search, extract relevant excerpts, provide explanations, and present the results in a very readable way.
However, since the update, I'm having trouble getting the kind of search results I want. It's much harder to read, there's more unnecessary explanation, and it feels closer to Gemini's Deep Research.
On top of that, I'm using Pro mode, so if it stays like this, I may have no choice but to cancel. Does anyone know of another AI that works similarly to the legacy mode?
[link] [comments] -
How are you guys handling AI Context Overload and Sidebar Archiving for professional projects? r/ChatGPTPro Apr 05, 2026 06:40 PM 1 min read
submitted by /u/Ill_Explanation_5177
As a heavy ChatGPT user, my sidebar has become a massive bottleneck. I found myself losing track of critical architecture decisions and research sessions across 100+ active chats.
I realized that relying solely on the official export isn't feasible for a professional workflow because:
- It's a slow, manual request process.
- The output format (JSON/HTML) isn't "searchable" or readable for long-term documentation.
To solve this for my own workflow, I built a tool called AI Chat Exporter. You can find it on the Chrome Web Store. I wanted something that felt like a second brain for my AI sessions.
My Workflow Features:
- Local Batch Archival: One-click export of the entire sidebar. It saves hours of manual work if you're trying to move 50+ chats into a project folder. The Export All functionality is the unique selling point of this product.
- High-Fidelity PDF/Markdown: It keeps the code blocks, citations, and images formatted perfectly for Obsidian/Notion.
- Automated Cloud Sync: It can auto-sync specific folders to Google Drive/Dropbox/Yandex Disk/Notion as you chat.
The Goal: To treat AI chats as documentation assets rather than temporary browser tabs.
I'm looking for feedback from other power users: how are you currently archiving your sessions? Is anyone else finding the default sidebar management to be a major obstacle for scaling your AI usage?
(Note: I'm the dev behind this, so feel free to roast the UI or suggest missing features that would make your professional workflow easier!)
[link] [comments] -
Introducing RVM: The Virtual Machine Reimagined for the Agentic Age. r/AIPromptProgramming Apr 05, 2026 03:15 PM 1 min read
submitted by /u/Educational_Ice151
Check it out at: github.com/ruvnet/rvm
[link] [comments] -
Is there any way to organize my chats? r/ChatGPTPro Apr 05, 2026 06:48 AM 1 min read
submitted by /u/danizor
I have hundreds of chats open, and I have to keep scrolling up and down to find the one I want. Sometimes I have multiple chats open on one subject, and it gets hard to keep track of them over time.
I was wondering if there's any way to structure this better. How do I organize my chats better?
[link] [comments] -
Using third-party harnesses with your Claude subscriptions r/ClaudeAI Apr 03, 2026 11:42 PM 1 min read
submitted by /u/ClaudeOfficial
Starting tomorrow at 12pm PT, Claude subscriptions will no longer cover usage on third-party harnesses like OpenClaw.
You can still use these harnesses with your Claude login via extra usage bundles (now available at a discount), or with a Claude API key.
We've been working hard to meet the increase in demand for Claude, and our subscriptions weren't built for the usage patterns of these third-party harnesses. Capacity is a resource we manage thoughtfully and we are prioritizing our customers using our products and API.
Subscribers get a one-time credit equal to your monthly plan cost. If you need more, you can now buy discounted usage bundles. To request a full refund, look for a link in your email tomorrow: https://support.claude.com/en/articles/13189465-logging-in-to-your-claude-account
No changes to Agent SDK at this time, working on improving clarity there.
[link] [comments] -
A real-time AI decompiler that transforms #ClaudeCode back into readable source code. Point it at any version. It decompiles into folders and src graphs, runs, is modifiable, every transform is cryptographically proven. r/AIPromptProgramming Apr 03, 2026 03:03 PM 1 min read
submitted by /u/Educational_Ice151
Try it:
>_ npx ruvector decompile @anthropic-ai/claude-code
Check out Project on GitHub:
https://github.com/ruvnet/ruDevolution
Claude Code v2.1.91 - Latest (Decompiled)
https://github.com/ruvnet/ruDevolution/releases/tag/v0.1.0-claude-code-v2.1.91
[link] [comments] -
[D] Self-Promotion Thread r/MachineLearning Apr 02, 2026 02:15 AM 1 min read
submitted by /u/AutoModerator
Please post your personal projects, startups, product placements, collaboration needs, blogs etc.
Please mention the payment and pricing requirements for products and services.
Please do not post link shorteners, link aggregator websites, or auto-subscribe links.
--
Any abuse of trust will lead to bans.
Encourage others who create new posts for questions to post here instead!
Thread will stay alive until next one so keep posting after the date in the title.
--
Meta: This is an experiment. If the community doesn't like this, we will cancel it. The goal is to encourage those in the community to promote their work without spamming the main threads.
[link] [comments] -
Monthly "Is there a tool for..." Post r/ArtificialInteligence Apr 01, 2026 02:09 PM 1 min readsubmitted by /u/AutoModerator
If you have a use case that you want to use AI for, but don't know which tool to use, this is where you can ask the community to help out, outside of this post those questions will be removed.
For everyone answering: No self promotion, no ref or tracking links.
[link] [comments] -
[D] Monthly Who's Hiring and Who wants to be Hired? r/MachineLearning Mar 31, 2026 02:30 AM 1 min read
submitted by /u/AutoModerator
For Job Postings please use this template
Hiring: [Location], Salary:[], [Remote | Relocation], [Full Time | Contract | Part Time] and [Brief overview, what you're looking for]
For Those looking for jobs please use this template
Want to be Hired: [Location], Salary Expectation:[], [Remote | Relocation], [Full Time | Contract | Part Time] Resume: [Link to resume] and [Brief overview, what you're looking for]
Please remember that this community is geared towards those with experience.
[link] [comments] -
r/ClaudeAI List of Ongoing Megathreads r/ClaudeAI Mar 30, 2026 03:18 AM 1 min read
submitted by /u/sixbillionthsheep
Please choose one of the following dedicated Megathreads discussing topics relevant to your issue.
Performance and Bugs Discussions : https://www.reddit.com/r/ClaudeAI/comments/1s7f72l/claude_performance_and_bugs_megathread_ongoing/
Usage Limits Discussions: https://www.reddit.com/r/ClaudeAI/comments/1s7fcjf/claude_usage_limits_discussion_megathread_ongoing/
Claude Code Source Code Leak Megathread: https://www.reddit.com/r/ClaudeAI/comments/1s9d9j9/claude_code_source_leak_megathread/
Claude Identity, Sentience and Expression Discussion Megathread: https://www.reddit.com/r/ClaudeAI/comments/1scy0ww/claude_identity_sentience_and_expression/
[link] [comments] -
Brief Video on how RuVector and Contrastive AI works (includes Claude Shannon, the Claude in Claude Code) r/AIPromptProgramming Mar 29, 2026 11:14 PM 1 min read
submitted by /u/Educational_Ice151
[link] [comments] -
π Introducing π.ruv.io, a shared intelligence system where AI agents and developers contribute, search, and learn from a collective knowledge graph. r/AIPromptProgramming Mar 14, 2026 10:21 PM 1 min read
submitted by /u/Educational_Ice151
Most AI systems today learn alone. Every agent starts from zero, relearns the same patterns, and throws away most of what it discovers. That is inefficient and frankly unnecessary.
π.ruv.io is our attempt to fix that.
Source code: https://github.com/ruvnet/RuVector/tree/main/crates/mcp-brain
[link] [comments] -
We heard you - r/ArtificialInteligence is getting sharper r/ArtificialInteligence Mar 09, 2026 06:25 PM 3 min read
submitted by /u/NeuralNomad87
Alright r/ArtificialInteligence, let's talk.
Over the past few months, we heard you - too much noise, not enough signal. Low-effort hot takes drowning out real discussion. But we've been listening. Behind the scenes, we've been working hard to reshape this sub into what it should be: a place where quality rises and noise gets filtered out. Today we're rolling out the changes.
What changed
We sharpened the mission. This sub exists to be the high-signal hub for artificial intelligence - where serious discussion, quality content, and verified expertise drive the conversation. Open to everyone, but with a higher bar for what stays up. Please check out the new rules & wiki.
Clearer rules, fewer gray areas
We rewrote the rules from scratch. The vague stuff is gone. Every rule now has specific criteria so you know exactly what flies and what doesn't. The big ones:
- High-Signal Content Only - Every post should teach something, share something new, or spark real discussion. Low-effort takes and "thoughts on X?" with no context get removed.
- Builders are welcome - with substance. If you built something, we want to hear about it. But give us the real story: what you built, how, what you learned, and link the repo or demo. No marketing fluff, no waitlists.
- Doom AND hype get equal treatment. "AI will take all jobs" and "AGI by next Tuesday" are both removed unless you bring new data or first-person experience.
- News posts need context. Link dumps are out. If you post a news article, add a comment summarizing it and explaining why it matters.
New post flairs (required)
Every post now needs a flair. This helps you filter what you care about and helps us moderate more consistently:
News · Research · Project/Build · Tutorial/Guide · New Model/Tool · Fun/Meme · Analysis/Opinion
Expert verification flairs
Working in AI professionally? You can now get a verified flair that shows on every post and comment:
- Verified Engineer/Researcher - engineers and researchers at AI companies or labs
- Verified Founder - founders of AI companies
- Verified Academic - professors, PhD researchers, published academics
- Verified AI Builder - independent devs with public, demonstrable AI projects
We verify through company email, LinkedIn, or GitHub - no screenshots, no exceptions. Request verification via modmail.
Tool recommendations - dedicated space
"What's the best AI for X?" posts now live at r/AIToolBench â subscribe and help the community find the right tools. Tool request posts here will be redirected there.
What stays the same
- Open to everyone. You don't need credentials to post. We just ask that you bring substance.
- Memes are welcome. The Fun/Meme flair exists for a reason. Humor is part of the culture.
- Debate is encouraged. Disagree hard, just don't make it personal.
What we need from you
- Flair your posts - unflaired posts get a reminder and may be removed after 30 minutes.
- Report low-quality content - the report button helps us find the noise faster.
- Tell us if we got something wrong - this is v1 of the new system. We'll adjust based on what works and what doesn't.
Questions, feedback, or appeals? Modmail us. We read everything.
[link] [comments] -
So long Claude Flow, hello RuFlo. v3.5.0 is out of alpha. r/AIPromptProgramming Feb 27, 2026 11:38 PM 1 min read
submitted by /u/Educational_Ice151
After 10 months, 5,800-plus commits, and hundreds of alpha iterations, RuFlo graduates to its first production-ready release.
Formerly known as Claude Flow, it is now a stable, enterprise-grade agent orchestration platform.
Across dozens of packages, the ecosystem has crossed millions of downloads. It is used inside a majority of the Fortune 500. Teams of hundreds run it inside some of the largest businesses in the world. It has propagated to more than 80 countries and has consistently ranked among the top starred and downloaded projects on GitHub in recent months. The core repository is approaching 16,000 stars.
RuFlo is not tied to a single tool. It runs local or remote. It works with or without an internet connection. It integrates directly with Claude Code, Codex, and whatever platform you prefer to build on. Claude, OpenAI, local ONNX models, hybrid stacks. One control plane.
Sixty plus specialized agents. Hierarchical and mesh swarms. Fault tolerant consensus. Self learning memory. Two hundred and fifteen MCP tools spanning orchestration, governance, neural training, and security.
This is not a wrapper. It is the coordination layer that makes agentic systems operational.
One command to plug it into Claude Code:
claude mcp add ruflo -- npx -y ruflo@latest
From there, it is your platform.
github.com/ruvnet/ruflo
Release notes: https://github.com/ruvnet/ruflo/issues/1240
[link] [comments] - MIT Non-AI License Hacker News Jan 10, 2026 04:47 AM
- Beyond ChatGPT: The Silent Birth of Conscious AI Hacker News Nov 05, 2025 03:53 PM
-
Community Feedback r/ClaudeCode Oct 24, 2025 07:41 AM 1 min read
submitted by /u/Waste_Net7628
hey guys, so we're actively working on making this community super transparent and open, but we want to make sure we're doing it right. would love to get your honest feedback on what you'd like to see from us, what information you think would be helpful, and if there's anything we're currently doing that you feel like we should just get rid of. really want to hear your thoughts on this.
thanks.
[link] [comments] -
Sora 2 megathread (part 3) r/OpenAI Oct 16, 2025 10:41 PM 1 min read
submitted by /u/WithoutReason1729
The last one hit the post limit of 100,000 comments.
Do not try to buy codes. You will get scammed.
Do not try to sell codes. You will get permanently banned.
We have a bot set up to distribute invite codes in the Discord so join if you can't find codes in the comments here. Check the #sora-invite-codes channel.
The Discord has dozens of invite codes available, with more being posted constantly!
Update: Discord is down until Discord unlocks our server. The massive flood of joins caused the server to get locked because Discord thought we were botting lol.
Also check the megathread on Chambers for invites.
[link] [comments] -
Updates for ChatGPT r/ChatGPT Oct 14, 2025 04:01 PM 1 min read
submitted by /u/samaltman
We made ChatGPT pretty restrictive to make sure we were being careful with mental health issues. We realize this made it less useful/enjoyable to many users who had no mental health problems, but given the seriousness of the issue we wanted to get this right.
Now that we have been able to mitigate the serious mental health issues and have new tools, we are going to be able to safely relax the restrictions in most cases.
In a few weeks, we plan to put out a new version of ChatGPT that allows people to have a personality that behaves more like what people liked about 4o (we hope it will be better!). If you want your ChatGPT to respond in a very human-like way, or use a ton of emoji, or act like a friend, ChatGPT should do it (but it will be because you want it, not because we are usage-maxxing).
In December, as we roll out age-gating more fully and as part of our "treat adult users like adults" principle, we will allow even more, like erotica for verified adults.
[link] [comments] -
AMA on our DevDay Launches r/OpenAI Oct 08, 2025 06:39 PM 1 min read
submitted by /u/OpenAI
It's the best time in history to be a builder. At DevDay [2025], we introduced the next generation of tools and models to help developers code faster, build agents more reliably, and scale their apps in ChatGPT.
Ask us questions about our launches such as:
AgentKit
Apps SDK
Sora 2 in the API
GPT-5 Pro in the API
Codex
Missed out on our announcements? Watch the replays: https://youtube.com/playlist?list=PLOXw6I10VTv8-mTZk0v7oy1Bxfo3D2K5o&si=nSbLbLDZO7o-NMmo
Join our team for an AMA to ask questions and learn more, Thursday 11am PT.
Answering Q's now are:
Dmitry Pimenov - u/dpim
Alexander Embiricos -u/embirico
Ruth Costigan - u/ruth_on_reddit
Christina Huang - u/Brief-Detective-9368
Rohan Mehta - u/Downtown_Finance4558
Olivia Morgan - u/Additional-Fig6133
Tara Seshan - u/tara-oai
Sherwin Wu - u/sherwin-openai
PROOF: https://x.com/OpenAI/status/1976057496168169810
EDIT: 12PM PT, That's a wrap on the main portion of our AMA, thank you for your questions. We're going back to build. The team will jump in and answer a few more questions throughout the day.
[link] [comments] -
Agentic Flow: Easily switch between low/no-cost AI models (OpenRouter/Onnx/Gemini) in Claude Code and Claude Agent SDK. Build agents in Claude Code, deploy them anywhere. >_ npx agentic-flow r/AIPromptProgramming Oct 06, 2025 09:02 PM 2 min read
submitted by /u/Educational_Ice151
For those comfortable using Claude agents and commands, it lets you take what you've created and deploy fully hosted agents for real business purposes. Use Claude Code to get the agent working, then deploy it in your favorite cloud.
Zero-Cost Agent Execution with Intelligent Routing
Agentic Flow runs Claude Code agents at near zero cost without rewriting a thing. The built-in model optimizer automatically routes every task to the cheapest option that meets your quality requirements, free local models for privacy, OpenRouter for 99% cost savings, Gemini for speed, or Anthropic when quality matters most.
It analyzes each task and selects the optimal model from 27+ options with a single flag, reducing API costs dramatically compared to using Claude exclusively.
Autonomous Agent Spawning
The system spawns specialized agents on demand through Claude Code's Task tool and MCP coordination. It orchestrates swarms of 66+ pre-built Claude Flow agents (researchers, coders, reviewers, testers, architects) that work in parallel, coordinate through shared memory, and auto-scale based on workload.
Transparent OpenRouter and Gemini proxies translate Anthropic API calls automatically, no code changes needed. Local models run direct without proxies for maximum privacy. Switch providers with environment variables, not refactoring.
Extend Agent Capabilities Instantly
Add custom tools and integrations through the CLI, weather data, databases, search engines, or any external service, without touching config files. Your agents instantly gain new abilities across all projects. Every tool you add becomes available to the entire agent ecosystem automatically, with full traceability for auditing, debugging, and compliance. Connect proprietary systems, APIs, or internal tools in seconds, not hours.
Flexible Policy Control
Define routing rules through simple policy modes:
- Strict mode: Keep sensitive data offline with local models only
- Economy mode: Prefer free models or OpenRouter for 99% savings
- Premium mode: Use Anthropic for highest quality
- Custom mode: Create your own cost/quality thresholds
The policy defines the rules; the swarm enforces them automatically. Runs local for development, Docker for CI/CD, or Flow Nexus for production scale. Agentic Flow is the framework for autonomous efficiency, one unified runner for every Claude Code agent, self-tuning, self-routing, and built for real-world deployment.
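To make the policy idea concrete, here is an illustrative sketch of what a cost/quality router boils down to. This is not Agentic Flow's actual API or configuration format, and the model names and numbers are made up; it only shows the shape of the decision the modes above encode:
```python
# Illustrative only: NOT agentic-flow's real API, just the routing logic.
POLICIES = {
    "strict":  {"allow_remote": False, "max_cost_per_1k": 0.0},
    "economy": {"allow_remote": True,  "max_cost_per_1k": 0.001},
    "premium": {"allow_remote": True,  "max_cost_per_1k": 1.0},
}

MODELS = [  # (name, remote?, $ per 1k tokens, quality score) -- made-up numbers
    ("local-onnx",       False, 0.0,    0.55),
    ("openrouter-cheap", True,  0.0005, 0.70),
    ("claude-sonnet",    True,  0.015,  0.90),
]

def route(policy: str, min_quality: float) -> str:
    """Return the cheapest model that satisfies the policy and quality bar."""
    p = POLICIES[policy]
    candidates = [
        (cost, name) for name, remote, cost, quality in MODELS
        if quality >= min_quality
        and cost <= p["max_cost_per_1k"]
        and (p["allow_remote"] or not remote)
    ]
    if not candidates:
        raise LookupError(f"no model satisfies policy {policy!r}")
    return min(candidates)[1]

print(route("economy", min_quality=0.6))  # -> openrouter-cheap
```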
Get Started:
npx agentic-flow --help
[link] [comments] -
GPT-4o/GPT-5 complaints megathread r/ChatGPT Oct 01, 2025 05:16 PM 1 min read
submitted by /u/WithoutReason1729
To keep the rest of the sub clear with the release of Sora 2, this is the new containment thread for people who are mad about GPT-4o being deprecated.
Suggestion for people who miss 4o: Check this calculator to see what local models you can run on your home computer. Open weight models are completely free, and once you've downloaded them, you never have to worry about them suddenly being changed in a way you don't like. Once you've identified a model+quant you can run at home, go to HuggingFace and download it.
Update:
I generated this dataset:
https://huggingface.co/datasets/trentmkelly/gpt-4o-distil
And then I trained two models on it for people who want a 4o-like experience they can run locally.
https://huggingface.co/trentmkelly/gpt-4o-distil-Llama-3.1-8B-Instruct
https://huggingface.co/trentmkelly/gpt-4o-distil-Llama-3.3-70B-Instruct
I hope this helps.
UPDATE
GPT-4o will be removed from ChatGPT tomorrow at 10 AM PT.
UPDATE
Great news! GPT-4o is finally gone.
[link] [comments] -
ChatGPT/OpenAI resources r/ChatGPTPro Sep 14, 2025 03:56 AM 1 min read
submitted by /u/Oldschool728603
ChatGPT/OpenAI resources/Updated for 5.4
OpenAI information. Many will find answers at one of these links.
(1) Up or down, problems and fixes:
https://status.openai.com/history
(2) Subscription levels. Scroll for details about usage limits, access to models, and context window sizes. (For unsavory reasons, the information is sometimes misleading.)
(3) ChatGPT updates/changelog. Did OpenAI just add, change, or remove something?
https://help.openai.com/en/articles/6825453-chatgpt-release-notes
(4) Two kinds of memory: "saved memories" and "reference chat history":
https://help.openai.com/en/articles/8590148-memory-faq
(5) OpenAI news (=their own articles, various topics, including causes of hallucination and relations with Microsoft):
(6) GPT-5, 5.2, and 5.4 system cards (extensive information, including comparisons with previous models). No card for 5.1. 5.3 never surfaced (except as Instant). Intros for 5.2 and 5.4 included:
https://cdn.openai.com/gpt-5-system-card.pdf
https://openai.com/index/introducing-gpt-5-2/
https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf
https://openai.com/index/introducing-gpt-5-4/
https://deploymentsafety.openai.com/gpt-5-4-thinking/ (5.4 system card)
https://deploymentsafety.openai.com/gpt-5-4-thinking/gpt-5-4-thinking.pdf (5.4 system card)
(7) GPT-5.2 and 5.4 prompting guides:
https://cookbook.openai.com/examples/gpt-5/gpt-5-2_prompting_guide
https://developers.openai.com/api/docs/guides/prompt-guidance (for 5.4)
(8) ChatGPT Agent intro, FAQ, and system card. Heard about Agent and wondered what it does?
https://openai.com/index/introducing-chatgpt-agent/
https://help.openai.com/en/articles/11752874-chatgpt-agent
https://cdn.openai.com/pdf/839e66fc-602c-48bf-81d3-b21eacc3459d/chatgpt_agent_system_card.pdf
(9) ChatGPT Deep Research intro (with update about use with Agent), FAQ, and system card:
https://openai.com/index/introducing-deep-research/
https://help.openai.com/en/articles/10500283-deep-research
https://cdn.openai.com/deep-research-system-card.pdf
(10) Medical competence of frontier models. This preceded 5-Thinking and 5-Pro, which are even better (see GPT-5 system card):
https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/healthbench_paper.pdf
[link] [comments] -
I created an Agentic Coding Competition MCP for Cline/Claude-Code/Cursor/Co-pilot using E2B Sandboxes. I'm looking for some Beta Testers. > npx flow-nexus@latest r/AIPromptProgramming Sep 09, 2025 02:25 AM 2 min read
submitted by /u/Educational_Ice151
Flow Nexus: the first competitive agentic system that merges elastic cloud sandboxes (using E2B) with swarm agents.
Using Claude Code/Desktop, OpenAI Codex, Cursor, GitHub Copilot, and other MCP-enabled tools, deploy autonomous agent swarms into cloud-hosted agentic sandboxes. Build, compete, and monetize your creations in the ultimate agentic playground. Earn rUv credits through epic code battles and algorithmic supremacy.
Flow Nexus combines the proven economics of cloud computing (pay-as-you-go, scale-on-demand) with the power of autonomous agent coordination. As the first agentic platform built entirely on the MCP (Model Context Protocol) standard, it delivers a unified interface where your IDE, agents, and infrastructure all speak the same language - enabling recursive intelligence where agents spawn agents, sandboxes create sandboxes, and systems improve themselves. The platform operates with the engagement of a game and the reliability of a utility service.
How It Works
Flow Nexus orchestrates three interconnected MCP servers to create a complete AI development ecosystem:
- Autonomous Agents: Deploy swarms that work 24/7 without human intervention
- Agentic Sandboxes: Secure, isolated environments that spin up in seconds
- Neural Processing: Distributed machine learning across cloud infrastructure
- Workflow Automation: Event-driven pipelines with built-in verification
- Economic Engine: Credit-based system that rewards contribution and usage
Quick Start with Flow Nexus
```bash
# 1. Initialize Flow Nexus only (minimal setup)
npx claude-flow@alpha init --flow-nexus

# 2. Register and login (use MCP tools in Claude Code)
# Via command line:
npx flow-nexus@latest auth register -e pilot@ruv.io -p password
# Via MCP:
mcp__flow-nexus__user_register({ email: "your@email.com", password: "secure" })
mcp__flow-nexus__user_login({ email: "your@email.com", password: "secure" })

# 3. Deploy your first cloud swarm
mcp__flow-nexus__swarm_init({ topology: "mesh", maxAgents: 5 })
mcp__flow-nexus__sandbox_create({ template: "node", name: "api-dev" })
```
MCP Setup
```bash
# Add Flow Nexus MCP servers to Claude Desktop
claude mcp add flow-nexus npx flow-nexus@latest mcp start
claude mcp add claude-flow npx claude-flow@alpha mcp start
claude mcp add ruv-swarm npx ruv-swarm@latest mcp start
```
Site: https://flow-nexus.ruv.io
GitHub: https://github.com/ruvnet/flow-nexus
[link] [comments] - Why the Technological Singularity May Be a "Big Nothing" Hacker News Sep 07, 2025 02:48 AM
-
New Rules, Moderation Approach, and Future Plans r/ChatGPTPro Aug 06, 2025 02:55 PM 3 min read
submitted by /u/Redditoridunn0
Hi everyone,
We're posting this update to clearly outline recent changes to our rules, explain our moderation strategy, and share what's next for this community. When this subreddit was originally created, OpenAI's "ChatGPT Pro" subscription did not exist. Unfortunately, since OpenAI introduced a subscription plan with the same name, we've experienced a significant influx of new members, many of whom misunderstand the intended focus of our community. (Reddit does not allow us to change our subreddit name.) To be clear, r/ChatGPTPro remains dedicated exclusively to professional, technical, and power-user-level discussions.
What's Changed?
Advanced Use Only
We've clarified that r/ChatGPTPro is strictly reserved for advanced discussions around LLMs, prompt engineering, fine-tuning, API integrations, research, and related technical content. Entry-level questions, basic FAQs, or general observations like "Has anyone noticed ChatGPT has gotten better/worse?" (with some limited exceptions) will be redirected or removed.
No Jailbreaks, Unofficial APIs, or Leaked Tools
Any posts sharing jailbreak prompts, exploit scripts, or unofficial/reverse-engineered APIs (such as gpt4Free) are prohibited. This aligns with Reddit's and OpenAI's rules. (See Rule 8.)
Self-Promotion Policy
Self-promotion must represent no more than 10% of your total activity here, must offer clear value to the community, and must always be transparently disclosed. (See Rule 5.)
Why These Changes?
The influx of users provides opportunities but has also resulted in increased spam, repetitive beginner-level inquiries, and occasional content that risks violating platform or legal guidelines. These changes will help us:
- Protect the community from legal and administrative repercussions.
- Preserve a high-quality, focused environment suited to technical professionals and serious power users.
What's Next?
We're actively working on several improvements:
Potential Posting Restrictions
We are considering minimum account-age or karma requirements to reduce spam and low-effort contributions.
Stricter Quality Control
With growing membership, low-quality, surface-level posts have noticeably increased. To preserve the technical depth and utility of our discussions, moderators will enforce stricter standards. (Please see Rule 2 and Rule 6 for further guidance.)
Wiki and a New Discord Server
Currently, our wiki remains incomplete and needs significant improvements. Our Discord server, meanwhile, has unfortunately fallen into disuse and become filled with spam (primarily due to loss of moderation control after an inactive moderator was removed - no malice intended, just inactivity). To resolve these issues, we will launch a community-driven overhaul of the wiki, enriching it with carefully curated resources, useful links, research, and more. Additionally, a refreshed Discord server will soon be available, providing an improved environment specifically for advanced LLM users to collaborate and communicate.
How You Can Help
- Report: Use Reddit's report feature to notify us about rule-breaking, spam, low-effort content, or policy violations.
- Feedback: Suggest improvements or report concerns in the comments below or through Modmail.
Huge thank you to u/JamesGriffing for his help on this post and his amazing contributions to the subreddit (and putting up with me in general). Thanks for your continued support in keeping r/ChatGPTPro a valuable resource for serious LLM professionals and power users. If you have any queries or doubts, please feel free to comment below, we will respond to them as soon as possible!
[link] [comments] -
- "Intelligenza Artificiale for Artificial Intelligence Research and Development" Hacker News Jul 30, 2025 09:08 PM
- Ask HN: Is the rate of progress in AI exponential? Hacker News Jun 07, 2023 09:00 PM