The birth of the largest chip in history: 1.2 trillion transistors, 400,000 cores

Date: 2021-12-03 17:31:09

The largest chip in history is born! Dedicated to AI workloads, it has 1.2 trillion transistors, 400,000 cores, and a chip area of 46,225 square millimeters, 56.7 times the area of the largest current NVIDIA GPU. Training speed is greatly improved. AI is coming.

It has an area of 46,225 square millimeters, 1.2 trillion transistors, 400,000 cores, 18 gigabytes of on-chip memory, 9 PByte/s of memory bandwidth, and 100 Pbit/s of fabric bandwidth. This is the largest chip ever: the Cerebras Wafer Scale Engine!

This giant chip was launched by Cerebras Systems. After its release, three Chinese chip experts immediately commented on WeChat Moments:

Chip expert Tang Shan: "Hats off to Cerebras' giant chip, about 9 inches, or 22 cm, on each side. I remember drawing a similar comparison chart for an article I wrote earlier. Judging from Wired's article, it seems that Cerebras is stepping into the spotlight."

Yao Song, co-founder of Shenjian Technology: "Cerebras' Wafer-scale chip is indeed spectacular, with a unique aesthetic, just like seeing the magnificence of a giant cannon. I hope everything goes well for Andrew Feldman."

Wang Bing, Chief Strategy Officer of Orion Star: "A huge chip with 1.2 trillion transistors, the largest chip a 300 mm wafer can yield, challenges the limits of the chip industry. If it succeeds, it will inevitably disrupt the entire AI chip industry. But even with a variety of fault-redundancy techniques, mass-production yield will still be a huge challenge."

The Cerebras Wafer Scale Engine has 1.2 trillion transistors. Intel's first processor, the 4004, had 2,300 transistors in 1971, and a recent AMD processor has 32 billion transistors. Most chips are actually a collection of chips created on a 12-inch silicon wafer and mass-produced in a chip factory. The Cerebras Systems chip, by contrast, is a single chip interconnected across a single wafer. The interconnections are designed to keep everything running at high speed, so that all 1.2 trillion transistors work together as one.

This makes the Cerebras Wafer Scale Engine the largest processor ever built, and it is designed specifically for AI applications. The company discussed the design of the "world's largest" chip at the Hot Chips conference held at Stanford University this week.

Prior to this, Samsung had actually manufactured a flash memory chip, the eUFS, with 2 trillion transistors. But the Cerebras chip is built for processing, with 400,000 cores and a chip area of 46,225 square millimeters. That is 56.7 times the area of the largest Nvidia GPU, which measures 815 square millimeters and contains 21.1 billion transistors.
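The size comparison above can be checked with a few lines of arithmetic. The sketch below uses only figures quoted in this article and takes 46,225 square millimeters as the WSE area, since that is the value consistent with the stated 56.7x ratio. The variable names are illustrative.

```python
import math

# Figures quoted in the article.
WSE_AREA_MM2 = 46_225
GPU_AREA_MM2 = 815
WSE_TRANSISTORS = 1.2e12
GPU_TRANSISTORS = 21.1e9

# How many times larger the WSE is, by area and by transistor count.
area_ratio = WSE_AREA_MM2 / GPU_AREA_MM2
transistor_ratio = WSE_TRANSISTORS / GPU_TRANSISTORS

# Side length if the die is treated as a perfect square.
side_mm = math.sqrt(WSE_AREA_MM2)

print(f"area ratio:       {area_ratio:.1f}x")
print(f"transistor ratio: {transistor_ratio:.1f}x")
print(f"side length:      {side_mm:.0f} mm")
```

The two ratios come out close to each other, which suggests a similar transistor density on both chips, and the computed side length agrees with the "about 22 cm on each side" remark quoted earlier.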

Size comparison: the largest chip in history next to a tennis ball.

The WSE also contains 3,000 times more high-speed on-chip memory and has 10,000 times more memory bandwidth. The chip comes from a team led by Andrew Feldman, who previously founded the microserver company SeaMicro and sold it to AMD for $334 million. Sean Lie, co-founder and chief hardware architect of Cerebras Systems, will showcase the Cerebras Wafer Scale Engine at the Hot Chips conference. The Los Altos, California-based company has 194 employees.

Chip size matters a great deal in AI tasks, because a larger chip can process information faster and produce answers in less time. This reduces "training time", allowing researchers to test more ideas, use more data, and solve new problems. Google, Facebook, OpenAI, Tencent, Baidu, and many other companies argue that the fundamental limitation on AI today is that training a model takes too long. Shortening training time would therefore remove a main bottleneck to progress across the entire industry.

Of course, chip manufacturers usually don't make such large chips. A few impurities typically appear on a single wafer during manufacturing. If one impurity can cause a chip to fail, then several impurities on a wafer will knock out several chips. The manufacturing yield is simply the percentage of chips that actually work. If there is only one chip on the wafer, the probability that it contains an impurity is effectively 100%, and that impurity would disable the chip. However, the chip designed by Cerebras has redundancy, so a single impurity will not make the entire chip unusable.
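The yield argument above can be made concrete with the textbook Poisson defect model, in which a die with no redundancy works only if it contains zero defects. This is a minimal sketch and not Cerebras' actual yield methodology: the defect density, the number of spare cores, and the assumption that each defect disables exactly one core are all hypothetical values chosen for illustration.

```python
import math

def poisson_yield(area_mm2, defects_per_mm2):
    """Probability that a die of the given area has zero defects."""
    return math.exp(-defects_per_mm2 * area_mm2)

def redundant_yield(area_mm2, defects_per_mm2, spare_cores):
    """Probability that the defect count does not exceed the spare cores.

    Assumes each defect disables exactly one core (a simplification).
    """
    lam = defects_per_mm2 * area_mm2   # expected number of defects
    term = math.exp(-lam)              # P(exactly 0 defects)
    total = 0.0
    for k in range(spare_cores + 1):
        total += term                  # add P(exactly k defects)
        term *= lam / (k + 1)          # advance to P(exactly k + 1 defects)
    return min(total, 1.0)

DEFECT_DENSITY = 0.001   # hypothetical: 0.1 defects per square centimeter
GPU_AREA_MM2 = 815       # from the article
WSE_AREA_MM2 = 46_225    # from the article
SPARE_CORES = 4_000      # hypothetical: 1% of 400,000 cores

print("815 mm^2 die, no redundancy:   ", poisson_yield(GPU_AREA_MM2, DEFECT_DENSITY))
print("wafer-scale die, no redundancy:", poisson_yield(WSE_AREA_MM2, DEFECT_DENSITY))
print("wafer-scale die, spare cores:  ",
      redundant_yield(WSE_AREA_MM2, DEFECT_DENSITY, SPARE_CORES))
```

Under these assumed numbers, a wafer-sized die without redundancy essentially never works, while a modest pool of spare cores brings the probability of a usable chip close to one.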

A single wafer provides supercomputer-level computing power

"The Cerebras WSE was designed from the ground up for AI work. It contains fundamental innovations that advance the state of the art by solving decades-old technical challenges that limited chip size, such as yield, power delivery, and packaging. Every architectural decision was made to optimize performance for AI work. As a result, the Cerebras WSE delivers, depending on the workload, hundreds or thousands of times the performance of existing solutions at a small fraction of the power and space," said Andrew Feldman, CEO of Cerebras Systems.

These performance gains are achieved by accelerating every element of neural network training. A neural network is a multi-stage computational feedback loop. The faster inputs move through the loop, the faster the loop learns, that is, the shorter the training time. Inputs can be moved through the loop faster by accelerating both the computation and the communication within it.

The Cerebras WSE has 56.7 times the chip area of the largest current GPU. It provides more cores for computation and places more memory close to those cores, so the cores can run efficiently. Because this large number of cores and all of the memory sit on a single chip, all communication stays on the chip, with high bandwidth and low latency, so groups of cores can collaborate with maximum efficiency.

The 46,225 square millimeters of silicon in the Cerebras WSE contain 400,000 AI-optimized, no-cache, no-overhead compute cores and 18 gigabytes of local, distributed, ultra-fast SRAM. The memory bandwidth is 9 petabytes per second. The cores are linked by a fine-grained, all-hardware, on-chip mesh communication network that provides an aggregate bandwidth of 100 petabits per second. More cores, more local memory, and a low-latency, high-bandwidth fabric together form the best architecture for accelerating AI tasks.

"Although AI is used in a general sense, no two data sets or two AI tasks are the same. New AI workloads continue to emerge, and data sets continue to grow," Tirias Research principal analyst and founder Jim McGregor said in a statement.

"With the development of AI, chip and platform solutions are also evolving. The Cerebras WSE is an amazing engineering achievement in semiconductor and platform design, providing supercomputer-class compute capacity, high-performance memory, and bandwidth in a single wafer-scale solution."

Cerebras said that if they hadn't worked closely with Taiwan Semiconductor Manufacturing Company (TSMC) over the years, they would not have achieved this record-setting achievement. TSMC is the world's largest semiconductor foundry and is in a leading position in advanced process technology. The WSE chip is manufactured by TSMC using advanced 16nm process technology.

400,000 AI-optimized cores

The WSE contains 400,000 AI-optimized compute cores. These cores, called Sparse Linear Algebra Cores (SLAC), are flexible, programmable, and optimized for the sparse linear algebra that underpins all neural network computation. The programmability of SLAC ensures that the cores can run all neural network algorithms in the ever-changing field of machine learning.

Because the Sparse Linear Algebra Cores are optimized for neural network computation, they achieve the best utilization in the industry, often three or four times that of GPUs. In addition, the WSE cores include sparsity harvesting technology invented by Cerebras to accelerate performance on sparse workloads (workloads that contain zeros), such as deep learning.

Zeros are common in deep learning computation. Often, most of the elements in the vectors and matrices to be multiplied are zero. Yet multiplying by zero wastes silicon, power, and time, because it produces no new information.

GPUs and TPUs are dense execution engines, designed on the assumption that they never encounter a zero, so they multiply every element even when it is zero. When 50-98% of the data is zero, as often happens in deep learning, most multiplications are wasted. Because Cerebras' Sparse Linear Algebra Cores never multiply by zero, all zero data is filtered out and skipped in hardware, so that useful work can be done in its place.
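The difference between dense execution and zero skipping can be illustrated in software. The sketch below counts the multiplications performed by a dense dot product and by one that skips zero operands. It only illustrates the idea: the WSE does this filtering in hardware, and the function names and the 90% zero fraction are assumptions made for the example.

```python
import random

def dense_dot(a, b):
    """Dot product that multiplies every pair, zeros included."""
    total, mults = 0.0, 0
    for x, y in zip(a, b):
        total += x * y
        mults += 1
    return total, mults

def zero_skipping_dot(a, b):
    """Dot product that skips any pair in which an operand is zero."""
    total, mults = 0.0, 0
    for x, y in zip(a, b):
        if x == 0 or y == 0:
            continue   # a zero product carries no new information
        total += x * y
        mults += 1
    return total, mults

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 1000
# Hypothetical activations with roughly 90% zeros, and dense weights.
activations = [rng.uniform(-1, 1) if rng.random() > 0.9 else 0.0 for _ in range(n)]
weights = [rng.uniform(-1, 1) for _ in range(n)]

dense_total, dense_mults = dense_dot(activations, weights)
sparse_total, sparse_mults = zero_skipping_dot(activations, weights)

print("same result:", dense_total == sparse_total)
print("dense multiplications:        ", dense_mults)
print("zero-skipping multiplications:", sparse_mults)
```

Both versions produce the same sum, but the zero-skipping version performs only as many multiplications as there are nonzero pairs.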

On-chip memory 3,000 times larger than a GPU's

Memory is a key component of every computer architecture. Memory closer to the compute units means faster calculation, lower latency, and more efficient data movement. High-performance deep learning requires massive computation with frequent data access. This calls for the compute cores and the memory to be very close together, which is not the case in a GPU, where most of the memory is slow and far from the compute cores.

The Cerebras Wafer Scale Engine contains more cores and more local memory than any chip to date, with 18 GB of on-chip memory that its cores can access in one clock cycle. Taken together, the core-local memory on the WSE provides 9 petabytes per second of memory bandwidth. That is 3,000 times the on-chip memory and 10,000 times the memory bandwidth of the best GPU.
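Dividing the aggregate figures above by the core count gives a rough sense of what each core sees locally, and dividing by the stated ratios gives the GPU figures the comparison implies. This is back-of-the-envelope arithmetic on the article's numbers, assuming decimal units (1 GB = 10^9 bytes) and an even split of memory and bandwidth across cores, neither of which the article states.

```python
CORES = 400_000
ON_CHIP_MEMORY_BYTES = 18 * 10**9      # 18 GB, decimal units assumed
MEMORY_BANDWIDTH_BYTES_S = 9 * 10**15  # 9 PB/s

# Per-core share, assuming an even split across all cores.
memory_per_core = ON_CHIP_MEMORY_BYTES // CORES
bandwidth_per_core = MEMORY_BANDWIDTH_BYTES_S // CORES

# GPU figures implied by the "3,000 times" and "10,000 times" claims.
implied_gpu_memory = ON_CHIP_MEMORY_BYTES // 3_000
implied_gpu_bandwidth = MEMORY_BANDWIDTH_BYTES_S // 10_000

print(f"memory per core:       {memory_per_core / 1e3:.0f} KB")
print(f"bandwidth per core:    {bandwidth_per_core / 1e9:.1f} GB/s")
print(f"implied GPU on-chip:   {implied_gpu_memory / 1e6:.0f} MB")
print(f"implied GPU bandwidth: {implied_gpu_bandwidth / 1e9:.0f} GB/s")
```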

Unique communication structure with low latency and high bandwidth

The Swarm communication fabric is the inter-processor communication fabric used on the WSE. It achieves breakthrough bandwidth and low latency at a fraction of the power consumed by traditional communication techniques. Swarm provides a low-latency, high-bandwidth 2D grid that connects all 400,000 cores on the WSE, with an aggregate bandwidth of 100 petabits per second.

Routing, reliable message delivery, and synchronization are all handled in hardware. Each arriving message automatically activates its application handler. Swarm provides a unique, optimized communication path for each neural network: software configures the optimal path through the 400,000 cores, connecting processors according to the structure of the specific user-defined neural network being run.

A typical message traverses one hardware link with nanosecond latency. The aggregate bandwidth of a Cerebras WSE is 100 petabits per second. Communication software such as TCP/IP and MPI is not needed, so its performance penalties are avoided. The energy cost of communication in this architecture is well under 1 picojoule per bit, nearly two orders of magnitude lower than in GPUs. By combining huge bandwidth with extremely low latency, the Swarm communication fabric enables the Cerebras WSE to learn faster than any currently available solution.
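To give a feel for what a 2D grid with nanosecond-scale links implies, the sketch below counts hops between cores and walks a simple dimension-ordered route. This is not Swarm's routing, which the article describes as configured by software for each neural network. The 500 x 800 grid shape, the dimension-ordered scheme, and the 1 ns per-hop latency are all assumptions for illustration.

```python
ROWS, COLS = 500, 800    # hypothetical layout: 500 * 800 = 400,000 cores
HOP_LATENCY_NS = 1.0     # hypothetical per-link latency

def hops(src, dst):
    """Minimum number of links between two (row, col) cores on a 2D grid."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def dimension_ordered_route(src, dst):
    """List of cores visited, moving along the row first, then the column."""
    r, c = src
    path = [(r, c)]
    col_step = 1 if dst[1] > c else -1
    while c != dst[1]:
        c += col_step
        path.append((r, c))
    row_step = 1 if dst[0] > r else -1
    while r != dst[0]:
        r += row_step
        path.append((r, c))
    return path

corner_a, corner_b = (0, 0), (ROWS - 1, COLS - 1)
worst_case_hops = hops(corner_a, corner_b)

print("route (0,0) -> (2,3):", dimension_ordered_route((0, 0), (2, 3)))
print("corner-to-corner hops:", worst_case_hops)
print("corner-to-corner latency (ns):", worst_case_hops * HOP_LATENCY_NS)
```

Under these assumptions even the longest path across the grid stays in the low-microsecond range, and traffic between neighboring cores costs only a few hops.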
