Alphabet Inc.’s Google on Tuesday unveiled new details about the supercomputers it uses to train artificial intelligence models, saying the systems are faster and more power-efficient than comparable systems from Nvidia.
Google has designed its own chip, the Tensor Processing Unit (TPU), which it uses for more than 90 percent of the company’s AI training work. The models trained this way can be used for tasks such as answering questions in human language or generating images.
Google’s TPU is now in its fourth generation. Google published a scientific paper on Tuesday detailing how it uses its own custom-developed optical switches to link more than 4,000 of the chips together into a supercomputer.
Improving those connections has become a key point of competition among companies building AI supercomputers, as the size of the so-called large language models that power technologies like Google’s Bard or OpenAI’s ChatGPT has exploded, meaning they’re too big to be stored on a single chip.
These models must be partitioned across thousands of chips, which must then work in concert for weeks or more to train the models. Google’s PaLM model, its largest publicly disclosed language model to date, was trained over 50 days by splitting it across two of the 4,000-chip supercomputers.
Google says its supercomputers can easily reconfigure the connections between the chips in real time, helping to avoid problems and improve performance.
In a blog post about the system, Google researcher Norm Jouppi and Google Distinguished Engineer David Patterson wrote, “Circuit switching made it easy for us to bypass faulty components. This flexibility even allows us to change the topology of the supercomputer interconnect to accelerate the performance of ML (machine learning) models.”
While Google is only now releasing details of its supercomputer, the system has been online inside the company since 2020, running at a data center in Mayes County, Oklahoma. Google said the startup Midjourney used the system to train its model, which generates images from text prompts.
Google said in the paper that, for systems of comparable size, its supercomputer is up to 1.7 times faster and 1.9 times more power-efficient than a system based on the Nvidia A100 chip. Google said it did not compare its fourth-generation product with Nvidia’s current flagship H100 chip because the H100 came to market after Google’s chip and was built with newer technology. Google hinted that it may be developing a new TPU to compete with the Nvidia H100.