Hardware options for Machine/Deep Learning

With last Friday’s talks on Machine Learning (ML), it is clear that we are currently in a boom of ML and AI SaaS products (like Lark, and even Haystack TV). Many of these applications are hosted in public clouds run by companies like Amazon, Google, and Microsoft. The details of the hardware deployments are hidden from users (and often from everyone but the engineers at these startups). With that in mind, I’ve chosen to discuss the various machine learning hardware options available in enterprise clouds today. This is meant to be a pseudo-technical introduction; no hardware background is expected. A figure from Microsoft[0] provides a nice overview of the hardware options available for ML today, which fall into four buckets:

  • CPU
  • GPU
  • FPGA
  • ASIC

CPUs

When most people talk about hardware processors, they mean CPUs. CPUs are often referred to as the “brains” of a computing system – whether that system is a phone, a tablet, a consumer laptop/desktop, or an enterprise server. They are extremely flexible in terms of programmability, and you can run all workloads reasonably well on them. They are often the first place researchers test new algorithms or machine learning methods; indeed, limited compute was a big reason neural networks never “took off” back when they were first researched in the 1940s and 50s: the computational requirements far exceeded what was possible at the time. Typical consumer CPUs have fewer than 10 cores, while server CPUs may go all the way up to 28. Intel is the dominant CPU manufacturer, well ahead of the alternatives (ARM-based designs, AMD, IBM POWER, Oracle SPARC, Fujitsu).

Today, CPUs are mostly used for Classic ML (i.e. not Deep Learning) and sometimes for Deep Learning Inference.
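
To make “Classic ML on a CPU” concrete, here is a minimal sketch (my own illustration; scikit-learn and the synthetic dataset are choices of mine, not something from the talk) of the kind of workload that trains comfortably on a handful of CPU cores:

```python
# Minimal sketch: a "classic ML" workload that a CPU handles comfortably.
# scikit-learn and the synthetic dataset are my own choices for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)   # fits in seconds on a laptop CPU
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```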

GPUs

GPUs are the next, and currently most widely used, hardware option for machine and deep learning. GPUs are designed for high parallelism and memory bandwidth (i.e. they can move more “stuff” from memory to the compute cores). A typical NVIDIA GPU has thousands of cores, allowing fast execution of the same operation across all of them. Machine and deep learning workloads boil down to a lot of matrix multiplications (“GEMM”) and convolutions, making GPUs a good option. Tim Dettmers[1] has a nice explanation of the difference between CPUs and GPUs: “You can visualize this as a CPU being a Ferrari and a GPU being a big truck. The task of both is to pick up packages from a random location A and to transport those packages to another random location B. The CPU (Ferrari) can fetch some memory (packages) in your RAM quickly while the GPU (big truck) is slower in doing that (much higher latency). However, the CPU (Ferrari) needs to go back and forth many times to do its job (location A -> pick up 2 packages -> location B … repeat) while the GPU can fetch much more memory at once (location A -> pick up 100 packages -> location B … repeat). So in other words the CPU is good at fetching small amounts of memory quickly (5 * 3 * 7) while the GPU is good at fetching large amounts of memory (Matrix multiplication: (A*B)*C). The best CPUs have about 50GB/s while the best GPUs have 750GB/s memory bandwidth.”
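
To make the GEMM point concrete, here is a rough sketch of my own (PyTorch is simply one framework choice, not something prescribed above) that times the same large matrix multiplication on the CPU and, if one is available, on a GPU; the actual speedup depends entirely on the hardware:

```python
# Sketch: the same large matrix multiplication ("GEMM") on CPU vs. GPU.
# PyTorch is my own choice here; timings vary wildly with hardware.
import time
import torch

def timed_matmul(device):
    a = torch.randn(4096, 4096, device=device)
    b = torch.randn(4096, 4096, device=device)
    if device == "cuda":
        torch.cuda.synchronize()      # make sure setup has finished
    start = time.perf_counter()
    c = a @ b                         # the GEMM itself
    if device == "cuda":
        torch.cuda.synchronize()      # wait for the GPU to finish
    return time.perf_counter() - start

print("CPU:", timed_matmul("cpu"), "s")
if torch.cuda.is_available():
    print("GPU:", timed_matmul("cuda"), "s")
```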

One problem with GPUs is that they are often far away from the main memory of the server, so sending all the data to the GPU takes time (it is usually done over a protocol called PCI Express, or PCIe, which connects devices inside the server much like Ethernet, Bluetooth, or WiFi connect devices over a network). To speed this up, companies like NVIDIA have come up with a faster interconnect called NVLink[2]. Other problems include scalability (you hit problems beyond about 16 GPUs) and the fact that GPUs are expensive (~$11k compared to ~$5k for a server CPU) and power-hungry (around 350W compared to 145W for a server CPU).
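
A quick way to see the interconnect cost (again a sketch of my own, with an arbitrary tensor size) is to time a host-to-device copy and derive an effective bandwidth, which on a PCIe-attached GPU will be far below the GPU’s own memory bandwidth:

```python
# Sketch: measure effective host-to-device copy bandwidth over PCIe/NVLink.
# The tensor size and the use of PyTorch are assumptions for illustration.
import time
import torch

if torch.cuda.is_available():
    x = torch.randn(256, 1024, 1024)          # ~1 GiB of float32 data
    torch.cuda.synchronize()
    start = time.perf_counter()
    x_gpu = x.to("cuda")                      # copy over the interconnect
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    gib = x.numel() * x.element_size() / 2**30
    print(f"copied {gib:.2f} GiB in {elapsed:.3f} s "
          f"(~{gib / elapsed:.1f} GiB/s effective bandwidth)")
```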

Almost all major clouds today have GPU instances for machine learning.

FPGAs

Field-Programmable Gate Arrays (FPGAs) are another type of hardware, designed to be reprogrammable. You can think of them as containing a bunch of tables (with input columns and output columns) that you define via software. For example, if you want to add two numbers, you can define one of these tables to take the two numbers as inputs and produce the sum as the output. FPGAs have more recently become a target appliance for machine learning researchers, and big companies like Microsoft and Baidu have invested heavily in them. FPGAs offer much higher performance per watt than GPUs: even though they cannot compete on raw performance, their power usage is much lower (often tens of watts). This metric is very important for IoT and self-driving-car type applications. Intel recently bought Altera, one of the two big FPGA vendors.
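
To make the “table of inputs and outputs” idea concrete, here is a toy Python sketch of my own (purely illustrative; real FPGA design uses hardware description languages such as Verilog or VHDL and vendor toolchains) of a 1-bit full adder expressed as a lookup table, roughly the kind of primitive an FPGA’s logic blocks implement:

```python
# Sketch: a 1-bit full adder expressed purely as a lookup table, roughly the
# kind of primitive (a "LUT") that FPGA logic blocks implement in hardware.
# This Python model is illustrative only; it is not how FPGA tooling works.
from itertools import product

# Enumerate every input combination once; afterwards, "computing" is just a
# table lookup, which is what makes FPGAs fast for fixed functions.
FULL_ADDER = {
    (a, b, carry_in): ((a + b + carry_in) % 2, (a + b + carry_in) // 2)
    for a, b, carry_in in product((0, 1), repeat=3)
}

sum_bit, carry_out = FULL_ADDER[(1, 1, 0)]
print(sum_bit, carry_out)   # -> 0 1  (1 + 1 = binary 10)
```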

FPGAs are also being used for inference, as they can provide quick results from a pre-trained machine learning model (stored in the FPGA’s memory).

ASICs

An application-specific integrated circuit (ASIC) is the least flexible, but highest performing, hardware option. ASICs are also the most efficient in terms of performance per dollar and performance per watt, but they require huge up-front investment and NRE (non-recurring engineering) costs, which makes them cost-effective only in large quantities. An ASIC can be designed for either training or inference, since you design its functionality up front and hard-code it (it cannot be changed later).

Google is the best example of successful machine learning ASIC deployment: their TPU1[3] targeted inference only, while their TPU2[4], or “Cloud TPU”, supports both training and inference. Each TPU2 chip delivers 45 TFLOPS (the headline metric for machine learning hardware: trillions of floating point operations per second) and ships on a board of four chips for 180 TFLOPS total, while the best GPU today is around 20 TFLOPS and is expected to reach about 120 TFLOPS next year.
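
As a back-of-envelope illustration of what those TFLOPS numbers mean (my own arithmetic, using only the peak figures quoted above; real utilization is always well below peak), here is the ideal-case time for a single large matrix multiplication:

```python
# Back-of-envelope: ideal time for one 4096 x 4096 x 4096 GEMM at the peak
# throughputs quoted above (45 TFLOPS per TPU2 chip, ~20 TFLOPS for a GPU).
# Real-world utilization is far below peak; this is illustration only.
M = N = K = 4096
flops = 2 * M * N * K                         # multiply-adds in C = A @ B

for name, peak_tflops in [("TPU2 chip", 45), ("GPU", 20)]:
    seconds = flops / (peak_tflops * 1e12)
    print(f"{name}: ~{seconds * 1e3:.1f} ms at peak")
```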

Summary

While there are many other hardware considerations for machine and deep learning, such as floating point precision, deployment cost, and R&D cost, these are the four major buckets you can expect to see powering current and future machine learning algorithms.

[0] https://channel9.msdn.com/Events/Build/2017/B8063

[1] https://www.quora.com/Why-are-GPUs-well-suited-to-deep-learning

[2] http://www.nvidia.com/object/nvlink.html

[3] https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu

[4] https://www.blog.google/topics/google-cloud/google-cloud-offer-tpus-machine-learning/


2 comments on “Hardware options for Machine/Deep Learning”

  1. Hey Gaurav,

    In terms of machine and deep learning I have been hoping to get more information about the hardware side of things for some time now. It seems like the natural progression of any particular use case is to start out on a CPU and progress (depending on the use case) to GPUs and FPGAs and then eventually to ASICs once there is enough money to act as an incentive for the initial investment.

    It seems like there is an incentive for ASICs but aside from those created by Google for their specific needs not many have been created as GPUs seem to dominate. I would imagine that this means the algorithms used in different use cases differ significantly enough that the programmability of a GPU is what is keeping it the HW of choice in the space. However, I don’t know much beyond the very bare bones of what the differences between use cases are and whether or not industry experts expect the ASIC to become prominent in the near future.

    If anyone has any knowledge about why an ASIC for deep learning would be very difficult to build for anything other than a very targeted application I would love to know more.

    1. Hey Kyle,

      In general I think you are right that the progression is CPUs -> GPUs/FPGAs -> ASICs. That said, I don’t think ASICs need to be tailored to the exact type of neural network. Specifically, all types of neural networks turn into matrix x matrix operations in the fully connected layers (matrix x vector if not batched). You then need primitives for pooling, convolutions, and custom transcendentals (ReLU vs sigmoid vs tanh, etc). The TPU1 paper from Google[1] has a good description of this. Notice in Figure 2 that they really only have a GEMM block, just like a GPU.
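
      (To make that concrete, here is a tiny numpy sketch of my own showing that a fully connected layer is literally one matrix multiply: matrix x matrix when the inputs are batched, matrix x vector for a single input.)

      ```python
      # Sketch: a fully connected layer is just a GEMM (matrix x matrix when
      # batched, matrix x vector for a single input). numpy is my own choice.
      import numpy as np

      W = np.random.randn(256, 128)      # weights: in_features x out_features
      x = np.random.randn(256)           # single input  -> matrix x vector
      X = np.random.randn(32, 256)       # batch of 32   -> matrix x matrix

      print((x @ W).shape)               # (128,)
      print((X @ W).shape)               # (32, 128)
      ```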

      Where ASICs beat GPUs is that they don’t have to support higher precision arithmetic, or the other HPC logic that GPUs must provide for non-Deep-Learning applications. At least, that seems to be the trend with the various startups…

      [1] https://drive.google.com/file/d/0Bx4hafXDDq2EMzRNcy1vSUxtcEk/view

