OpenAI has disbanded its Site Reliability Engineering team focused on research and training workloads.
The team was formed less than a year ago, and the company hired Todd Underwood to lead the effort. Underwood spent 14 years and nine months at Google, where he created the machine learning SRE group, and co-authored the O'Reilly book Reliable Machine Learning.
SREs are tasked with building and maintaining highly reliable and scalable software systems. The concept originated at Google, but has since spread across the IT industry.
"I have not been successful in my attempt to start an SRE team within the research organization at OpenAI," Underwood said in a LinkedIn post. "OpenAI has eliminated the reliability function in research and redistributed the individual contributors into the remaining engineering teams on the research platform organization. I’m no longer an employee of OpenAI."
He added: "There are a few things I’m really good at. But building a new reliability function inside of this particular frenetic research startup remotely turns out to not have been one of them. I have thoughts about why this didn’t go as well as it could have but they’re not really relevant here."
Underwood declined to comment further, and referred DCD to his public statement.
An employee at OpenAI who requested anonymity said that Underwood was the only member of the new team to be let go after the company's head of research platform, Tal Broda, decided to scrap the effort.
Broda is believed to have made the decision only recently, as the company was still posting job listings for the division last month.
The remaining members of the SRE research team have been moved elsewhere in the research organization, DCD understands, including the supercomputing team, hardware health team, runtime team, and post-training team.
The SRE team for the primary applied side, which predates the research group, continues unimpeded. It is led by Davin Bogan.
OpenAI and Tal Broda did not respond to requests for comment.
The move comes after notable turbulence at OpenAI, including the dramatic firing and rehiring of CEO Sam Altman last year, as well as high-profile departures from its superalignment safety team.
The superalignment team was promised 20 percent of OpenAI's vast computing power, but multiple former employees claimed that they were never given those resources. The team was disbanded in May.
The company is now seeking to raise billions of dollars from Apple, Nvidia, and others at a $100 billion valuation.
Earlier this year, it was revealed that OpenAI was set to spend some $7 billion on training and inference, and lose around $5 billion, over the full year. Training could cost around $3 billion, while inference costs could grow to $4 billion.