ChatGPT, an artificial intelligence chatbot, has exploded in popularity around the world since its launch, but what outsiders may not know is that ChatGPT owes much of its intelligence to an expensive supercomputer built for it by Microsoft.
Microsoft supercomputer uses tens of thousands of Nvidia GPUs
In 2019, when Microsoft invested $1 billion in ChatGPT developer OpenAI, it agreed to build a massive, cutting-edge supercomputer for the artificial intelligence research startup. The only problem: Microsoft didn’t have what OpenAI needed, nor was it entirely sure it could build something that big in the Azure cloud service without breaking it.
At the time, OpenAI was trying to train an increasingly large set of artificial intelligence programs, or “models,” which were absorbing ever-larger amounts of data and learning ever more parameters. These parameters are the variables an AI system learns through rounds of training and retraining. That meant OpenAI would need access to powerful cloud computing services for a long time.
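To make “parameters” concrete: they are the learned weights and biases inside a model, and their count grows quickly with model width and depth. A minimal sketch (illustrative only, not OpenAI's code) that counts the parameters of a toy fully connected network:

```python
def count_parameters(layer_sizes):
    """Count weights + biases for a fully connected network.

    layer_sizes: e.g. [768, 3072, 768] describes two layers,
    one mapping 768 -> 3072 units and one mapping 3072 -> 768.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # weight matrix entries
        total += n_out         # bias vector entries
    return total

# This toy network has a few million parameters; production language
# models reach hundreds of billions by stacking many much wider layers.
print(count_parameters([768, 3072, 768]))  # 4722432
```

Each of those parameters must be stored, updated, and synchronised during training, which is why parameter counts translate directly into demand for GPUs and memory.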
Tens of thousands of chips and hundreds of millions of dollars of investment
To overcome this challenge, Microsoft had to find ways to string together tens of thousands of NVIDIA A100 graphics chips (the workhorse for training AI models) and to rearrange how servers sit in the racks to prevent power failures. Scott Guthrie, Microsoft’s executive vice president of cloud computing and artificial intelligence, would not disclose the exact cost of the project but said it “could be more than” a few hundred million dollars.
“We’ve built a system architecture that can operate at hyper-scale and be reliable. That’s what makes ChatGPT possible,” said Nidhi Chappell, general manager of artificial intelligence infrastructure at Microsoft Azure. “It’s a model from which many, many other models will be derived.”
ChatGPT relies on supercomputer training
The technology helped OpenAI launch ChatGPT, which attracted more than 1 million users within days of its release last November and is now being incorporated into other companies’ business models, from one run by billionaire hedge fund founder Ken Griffin to grocery delivery company Instacart. As artificial intelligence tools such as ChatGPT attract growing interest from businesses and consumers, cloud service providers such as Microsoft, Amazon.com and Google will face increased pressure to ensure their data centres can deliver the massive computing power needed.
Microsoft now uses the same set of resources it built for OpenAI to train and run its own large artificial intelligence models, including the new Bing search bot that launched last month. Microsoft also sells the system to other customers. As part of Microsoft’s expanded partnership agreement with OpenAI for an additional $10 billion investment, the software giant is already working on the next generation of artificial intelligence supercomputers.
“We don’t want to make it a custom product, it started out as a custom product, but we always find ways to make it general-purpose so that anyone who wants to train large language models can take advantage of the same improvements,” Guthrie said in an interview, “and that can really help us become a more widely used AI cloud.”
Training a large AI model requires a large number of interconnected graphics processing units in one place, much like the AI supercomputer Microsoft assembled. Once a model is operational, answering all the queries users pose — a process called inference — requires a slightly different setup. Microsoft also deploys graphics chips for inference, but those thousands of processors are geographically scattered across the company’s more than 60 data centre regions. Microsoft said in a blog post on Monday that the company is now adding the latest Nvidia graphics chip, the H100, and the latest version of Nvidia’s InfiniBand networking technology, to its AI workloads so that data can be shared more quickly.
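The contrast between the two setups can be sketched in a few lines. Training pins every GPU in one tightly networked cluster, while inference traffic can be served from whichever nearby region has spare capacity. A hypothetical illustration (these region names and the routing rule are assumptions, not Microsoft's actual logic):

```python
def pick_region(free_gpus, user_latency_ms):
    """Choose the lowest-latency region that still has free GPUs.

    free_gpus: dict mapping region name -> count of idle GPUs
    user_latency_ms: dict mapping region name -> measured latency
    """
    candidates = [r for r, free in free_gpus.items() if free > 0]
    if not candidates:
        raise RuntimeError("no inference capacity available")
    return min(candidates, key=lambda r: user_latency_ms[r])

# The closest region ("westus") is full, so the request falls through
# to the next-fastest region that has idle GPUs.
free_gpus = {"westus": 0, "eastus": 12, "westeurope": 4}
latency = {"westus": 20, "eastus": 60, "westeurope": 140}
print(pick_region(free_gpus, latency))  # eastus
```

Training has no such flexibility: because the GPUs must exchange data constantly, they all have to live behind the same high-speed interconnect.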
Microsoft Azure Cloud Services
The new Bing search is still in preview, and Microsoft is gradually adding more users from the waiting list. Guthrie’s team meets daily with about two dozen employees they call the “pit crew,” after the group of mechanics who tune up race cars during a race. The team’s job is to figure out how to bring more computing capacity online quickly and to solve problems as they pop up.
“It’s very much a kind of a meetup, like, ‘Hey, anybody has a good idea, let’s put it on the table today and talk about it and figure out OK, can we save a few minutes here? Can we save a few hours? A few days?'” Guthrie said.
Small mistakes can lead to big problems
Cloud services rely on thousands of different parts and components — individual server parts, piping, concrete for buildings, different metals and minerals — and a delay or shortage of any one of them, no matter how small, can bring everything to a halt. Recently, the team had to deal with a shortage of cable trays, the basket-like contraptions that hold the cables coming off machines. So they designed a new cable tray that Microsoft could manufacture itself or source elsewhere. Guthrie said they are also working on ways to squeeze as many servers as possible into existing data centres around the world so they don’t have to wait for new buildings.
When OpenAI or Microsoft trains a large AI model, the work is split across all the GPUs at once, and at certain points the individual units need to talk to each other to share the work they have done. For the AI supercomputer, Microsoft had to make sure the network devices handling communication between all the chips could carry that load, and it had to develop software that makes full use of the GPUs and the networking gear. The company has now introduced software that can train models with tens of trillions of parameters.
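The exchange step described above is commonly implemented as an "all-reduce": each GPU computes gradients on its own shard of data, then all GPUs average their gradients so every copy of the model stays in sync. A minimal single-process simulation (illustrative only, not Microsoft's software — real systems use collective-communication libraries over the network):

```python
def all_reduce_mean(per_worker_grads):
    """Average gradient vectors element-wise across workers.

    per_worker_grads: list of gradient vectors, one per worker.
    Returns the averaged vector every worker would receive.
    """
    n_workers = len(per_worker_grads)
    return [sum(vals) / n_workers for vals in zip(*per_worker_grads)]

# Four simulated workers each computed a gradient on their own data shard.
grads = [
    [1.0, 2.0],
    [3.0, 0.0],
    [1.0, 4.0],
    [1.0, 2.0],
]
avg = all_reduce_mean(grads)
print(avg)  # [1.5, 2.0] — every worker applies the same averaged update
```

At cluster scale this averaging happens over millions of values per step, which is why the interconnect between chips — not just the chips themselves — becomes the bottleneck Microsoft had to engineer around.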
Because all the machines start up at the same time, Microsoft also has to consider where they are placed and where the power supplies sit. Otherwise, Guthrie says, you end up with the data centre version of what happens when you switch on a microwave, a toaster and a vacuum cleaner in the same kitchen at once.
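One common way to avoid that kitchen-circuit problem is to stagger start-ups so the combined draw never exceeds what a power feed can supply. A hypothetical sketch (the wattage figures and greedy batching rule are assumptions for illustration, not Microsoft's actual scheme):

```python
def batch_boots(server_draws_w, feed_capacity_w):
    """Greedily group server boot-ups into sequential batches.

    Each batch is filled until adding one more server would push the
    combined startup draw past the feed's capacity; that server then
    starts the next batch. Returns lists of server indices.
    """
    batches, current, current_w = [], [], 0
    for i, draw in enumerate(server_draws_w):
        if current and current_w + draw > feed_capacity_w:
            batches.append(current)
            current, current_w = [], 0
        current.append(i)
        current_w += draw
    if current:
        batches.append(current)
    return batches

# Six servers with startup draws in watts, sharing a 2000 W feed.
print(batch_boots([900, 800, 700, 600, 500, 400], 2000))
# [[0, 1], [2, 3, 4], [5]]
```

Spreading boot-ups over three waves keeps each wave under the feed's limit, at the cost of a slightly longer total start-up time.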
A new generation of supercomputers
The company must also ensure it can cool all those machines and chips, using techniques such as evaporation, outside air in cooler climates, and high-tech swamp coolers in hot climates, said Alistair Speirs, Microsoft’s director of global infrastructure for Azure.
Guthrie said Microsoft will continue to develop custom server and chip designs and find ways to optimize the supply chain to be as fast, efficient and cost-saving as possible.
“The model that amazes the world now is built on the supercomputers we started building a few years ago,” he said. “The new models will be built on the new supercomputer we are training on, which is bigger and more sophisticated.”