In the world of Artificial Intelligence (AI) and Machine Learning, NVIDIA has introduced a game-changer. The NVIDIA Tesla V100, with its advanced Volta architecture, is poised to redefine the landscape of High-Performance Computing (HPC) and Deep Learning. This article will provide an in-depth review of the Tesla V100, its features, and its potential impact on the AI and Machine Learning industry. To see how it compares with other options in the market, visit our comprehensive guide on the best graphics cards for deep learning.
The Birth of Volta: Tesla V100 and its Revolutionary Architecture
The NVIDIA Tesla V100 is the first GPU to be powered by the Volta architecture. Unveiled in 2017, this new architecture represents a significant advancement from the previous generation, Pascal. The Volta GV100 GPU is manufactured using TSMC’s 12nm FinFET process, and it features a colossal 21.1 billion transistors on a die size of 815mm². This makes the GV100 one of the largest silicon chips ever produced.
The full GV100 die carries 84 Streaming Multiprocessors (SMs), each consisting of 64 CUDA cores, for a staggering 5,376 CUDA cores in total, significantly more than its predecessors. The Tesla V100 product ships with 80 of those SMs enabled, which gives it 5,120 CUDA cores.
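The core counts follow directly from the SM figures. A minimal sketch of that arithmetic is shown here; the function and constant names are illustrative, and the 80-SM configuration of the shipping Tesla V100 is an assumption based on NVIDIA's published product specifications.

```python
# Core counts implied by the SM figures.
CUDA_CORES_PER_SM = 64   # FP32 CUDA cores per Volta SM (from the article)
FULL_GV100_SMS = 84      # SMs on the full GV100 die (from the article)
TESLA_V100_SMS = 80      # assumption: SMs enabled on the shipping Tesla V100


def cuda_cores(sm_count: int, cores_per_sm: int = CUDA_CORES_PER_SM) -> int:
    """Return the total number of FP32 CUDA cores for a given SM count."""
    return sm_count * cores_per_sm


full_die_cores = cuda_cores(FULL_GV100_SMS)   # 84 * 64
shipping_cores = cuda_cores(TESLA_V100_SMS)   # 80 * 64
print(f"Full GV100: {full_die_cores} cores, Tesla V100: {shipping_cores} cores")
```

Shipping a part with a few SMs disabled is a common way to improve yield on a die this large.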
The Tesla V100: A Deep Dive into Hardware Specifications
The Tesla V100 itself is a GPU; the hardware described in this section is a 4U server built to host it, measuring 448mm x 175.5mm x 850mm. The front of the chassis reveals two main compartments. The bottom compartment houses the GPU tray, while the top section, approximately 1U tall, contains the x86 compute portion of the server.
Front I/O includes a management port, two USB 3.0 ports, two SFP+ cages for 10GbE networking, and a VGA connector. Storage is provided by 8x 2.5″ hot-swap bays, all of which support SATA III at 6.0Gbps. The top four bays can alternatively take U.2 NVMe SSDs.
Internally, the system boasts additional storage. There is a dual M.2 SATA boot SSD module on a riser card next to the memory slots. These boot modules allow the front hot-swap bays to be kept open for higher-value storage.
The server operates on a dual Intel Xeon Scalable system with a full memory configuration. This means that each of the two CPUs can potentially take up to 12 DIMMs, for a total of 24 DIMMs.
Powering Up with High-Performance Computing
With a focus on High-Performance Computing (HPC), NVIDIA has equipped the Tesla V100 with impressive specs. The Tesla V100 delivers a whopping 15 teraflops of FP32, 30 teraflops of FP16, 7.5 teraflops of FP64, and a huge 120 teraflops for dedicated tensor operations.
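These headline numbers can be approximated from the core count and clock speed. The sketch below assumes 5,120 enabled CUDA cores (80 SMs x 64), one fused multiply-add (2 FLOPs) per core per clock, FP64 at half the FP32 rate, FP16 at twice the FP32 rate, and the 1455MHz peak clock quoted elsewhere in this article. The function and constant names are illustrative.

```python
# Rough peak-throughput estimate: cores x FLOPs per clock x clock rate.
V100_CUDA_CORES = 5120   # assumption: 80 enabled SMs x 64 cores each
PEAK_CLOCK_MHZ = 1455    # peak clock quoted in this article


def peak_tflops(cores: int, clock_mhz: float, flops_per_core_per_clock: float = 2.0) -> float:
    """Theoretical peak in TFLOPS, counting one FMA as 2 FLOPs by default."""
    return cores * flops_per_core_per_clock * clock_mhz * 1e6 / 1e12


fp32 = peak_tflops(V100_CUDA_CORES, PEAK_CLOCK_MHZ)
fp64 = fp32 / 2   # assumption: FP64 runs at half the FP32 rate
fp16 = fp32 * 2   # assumption: packed FP16 runs at twice the FP32 rate

print(f"FP32 ~{fp32:.1f}, FP64 ~{fp64:.1f}, FP16 ~{fp16:.1f} TFLOPS")
```

The results land just under the rounded 15, 7.5, and 30 teraflops figures, which is what you would expect from marketing numbers rounded up from the theoretical peak.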
The V100 provides 16GB of High Bandwidth Memory 2 (HBM2) running at an effective 1.75Gbps per pin on a 4096-bit bus, for roughly 900GB/sec of bandwidth. Despite its large die size, the V100 GPU still manages to run at a peak of 1455MHz.
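The bandwidth figure can be checked from the bus width and data rate. This sketch assumes the 1.75 figure is the effective per-pin data rate in gigabits per second; the function name is illustrative.

```python
def memory_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps_per_pin: float) -> float:
    """Peak memory bandwidth in GB/s: pins x Gbit/s per pin, divided by 8 bits per byte."""
    return bus_width_bits * data_rate_gbps_per_pin / 8


# 4096-bit HBM2 bus at an assumed effective 1.75 Gbit/s per pin
v100_bandwidth = memory_bandwidth_gb_s(4096, 1.75)
print(f"{v100_bandwidth:.0f} GB/s")
```

The result of 896GB/s is what the article rounds to 900GB/sec.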
The Tesla V100 also introduces the second generation of NVLink, NVIDIA’s proprietary interconnect that allows multiple GPUs to connect directly to each other with more bandwidth than the PCI Express 3.0 bus. NVLink 2 raises link bandwidth to 25GB/s in each direction (50GB/s bidirectional) and provides six NVLinks per GPU, compared to four on the previous GP100, for up to 300GB/s of aggregate bandwidth.
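To put the link counts in perspective, the aggregate NVLink bandwidth per GPU can be sketched as below. This assumes the 25GB/s figure applies per direction on each link, and that first-generation NVLink on the GP100 ran at 20GB/s per direction; both are assumptions drawn from NVIDIA's published specifications, and the function name is illustrative.

```python
def nvlink_aggregate_gb_s(links: int, gb_s_per_direction: float) -> float:
    """Total bidirectional NVLink bandwidth per GPU in GB/s."""
    return links * gb_s_per_direction * 2  # x2 for both directions


gv100_total = nvlink_aggregate_gb_s(links=6, gb_s_per_direction=25.0)
gp100_total = nvlink_aggregate_gb_s(links=4, gb_s_per_direction=20.0)  # assumed NVLink 1 rate

print(f"GV100: {gv100_total:.0f} GB/s, GP100: {gp100_total:.0f} GB/s")
```

Under these assumptions, the extra links and the faster signalling together nearly double the aggregate bandwidth available for GPU-to-GPU traffic.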
Unleashing the Power of Deep Learning with Tensor Cores
One of the standout features of the Tesla V100 is the introduction of Tensor cores, which are specifically designed for machine learning operations. The GV100 contains eight Tensor cores per SM, delivering a total of 120 TFLOPS for training and inference operations.
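The 120 TFLOPS figure can be reproduced approximately from the Tensor core count. The sketch below assumes 80 enabled SMs, 64 fused multiply-adds per Tensor core per clock (one 4x4x4 matrix operation), and the 1455MHz peak clock quoted in this article. These per-core figures and the names used are assumptions for illustration, not values stated in the text.

```python
TENSOR_CORES_PER_SM = 8              # from the article
FMA_PER_TENSOR_CORE_PER_CLOCK = 64   # assumption: one 4x4x4 matrix FMA per clock
FLOPS_PER_FMA = 2                    # one multiply plus one add


def tensor_tflops(sm_count: int, clock_mhz: float) -> float:
    """Theoretical peak Tensor core throughput in TFLOPS."""
    flops_per_clock = (
        sm_count * TENSOR_CORES_PER_SM * FMA_PER_TENSOR_CORE_PER_CLOCK * FLOPS_PER_FMA
    )
    return flops_per_clock * clock_mhz * 1e6 / 1e12


# assumption: 80 SMs enabled on the shipping Tesla V100
v100_tensor = tensor_tflops(sm_count=80, clock_mhz=1455)
print(f"~{v100_tensor:.1f} Tensor TFLOPS")
```

This comes out at roughly 119 TFLOPS, consistent with the rounded 120 TFLOPS headline number.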
The dedicated Tensor cores offer a 4x performance boost compared to Pascal for tasks that can utilize them. This theoretically makes the V100 a more effective performer than Google’s dedicated Tensor Processing Unit (TPU).
Tesla V100: Pricing and Performance Analysis
When it comes to pricing, the Tesla V100 offers a significant return on investment, particularly for AI and deep learning applications. The table below provides a quick overview of the Tesla V100 GPU price, performance, and cost-effectiveness:
| Tesla GPU model | Price | Double-Precision Performance (FP64) | Dollars per TFLOPS | Deep Learning Performance (TensorFLOPS or 1/2 Precision) | Dollars per DL TFLOPS |
|---|---|---|---|---|---|
| Tesla V100 PCI-E 16GB or 32GB | $10,664* ($11,458* for 32GB) | 7 TFLOPS | $1,523 ($1,637 for 32GB) | 112 TFLOPS | $95.21 ($102.30 for 32GB) |
* single-unit list price before any applicable discounts (ex: EDU, volume)
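The cost-effectiveness figures in the table are simply list price divided by peak throughput. The sketch below reproduces them, assuming the Tesla V100 PCI-E's published peaks of 7 TFLOPS (FP64) and 112 TFLOPS (Tensor); those two throughput values are assumptions taken from NVIDIA's specifications, and the function name is illustrative.

```python
def dollars_per_tflops(price_usd: float, tflops: float) -> float:
    """Cost per TFLOPS of peak throughput."""
    return price_usd / tflops


PRICE_16GB_USD = 10_664   # list price from the table
PRICE_32GB_USD = 11_458   # list price from the table
FP64_TFLOPS = 7.0         # assumption: V100 PCI-E peak FP64
TENSOR_TFLOPS = 112.0     # assumption: V100 PCI-E peak Tensor throughput

for label, price in (("16GB", PRICE_16GB_USD), ("32GB", PRICE_32GB_USD)):
    fp64_cost = dollars_per_tflops(price, FP64_TFLOPS)
    dl_cost = dollars_per_tflops(price, TENSOR_TFLOPS)
    print(f"{label}: ${fp64_cost:,.0f} per FP64 TFLOPS, ${dl_cost:,.2f} per DL TFLOPS")
```

Because the Tensor throughput is sixteen times the FP64 throughput, the cost per deep learning TFLOPS is correspondingly sixteen times lower.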
The Tesla V100 delivers a big advance in absolute performance within just a year of its predecessor. It maintains a similar price/performance value to the Tesla P100 for Double Precision Floating Point, albeit with a higher entry price. However, it offers dramatic absolute performance and price/performance gains for AI.
Real Application Performance over Raw $/FLOP Calculations
While the generalizations above are useful, it’s important to note that application performance differs dramatically from any simplistic FLOPS calculation. Factors such as device-to-device bandwidth, host-to-device bandwidth, GPU memory bandwidth, and code maturity all play significant roles in actual application performance.
NVIDIA’s own performance tests across real applications bear this out: some codes scale in line with the on-paper FLOPS gains, while others fall well short of them. It is therefore advisable not to base purchasing decisions strictly on raw $/FLOP calculations.
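One common way to see why FLOPS alone mislead is the roofline model, which caps attainable throughput at whichever is lower: peak compute, or memory bandwidth multiplied by the code's arithmetic intensity (FLOPs performed per byte moved). The article does not use this model itself; the sketch below is an illustration that plugs in the 15 TFLOPS FP32 and 900GB/sec figures quoted earlier, with hypothetical intensity values.

```python
def attainable_tflops(
    intensity_flop_per_byte: float,
    peak_tflops: float = 15.0,       # FP32 peak quoted in this article
    bandwidth_gb_s: float = 900.0,   # HBM2 bandwidth quoted in this article
) -> float:
    """Roofline estimate: min(compute roof, bandwidth x arithmetic intensity)."""
    memory_bound_tflops = bandwidth_gb_s * intensity_flop_per_byte / 1000.0
    return min(peak_tflops, memory_bound_tflops)


# Hypothetical arithmetic intensities, chosen only to illustrate the two regimes.
for name, intensity in (("bandwidth-bound code", 1.0), ("compute-bound code", 100.0)):
    print(f"{name}: {attainable_tflops(intensity):.1f} TFLOPS attainable")
```

With these inputs, a code that performs one FLOP per byte is limited to under 1 TFLOPS no matter how high the compute peak is, while a code at 100 FLOPs per byte can reach the full 15 TFLOPS. Real applications sit somewhere in between and are further affected by host-to-device and device-to-device transfers.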
Deploying the Tesla V100 in Production
The V100 will first appear inside NVIDIA’s bespoke compute servers. Eight of them will come packed inside the $150,000 DGX-1 rack-mounted server, which ships in the third quarter of 2017. A 250W PCIe slot version of the V100 is also in the works, as well as a half-height 150W card that’s likely to feature a lower clock speed and disabled cores.
In summary, the NVIDIA Tesla V100 is a formidable player in the AI and Machine Learning field. With its advanced Volta architecture, impressive hardware specs, and the introduction of Tensor cores, the V100 is set to revolutionize AI and HPC. While the high entry price may be a hurdle for some, the V100’s significant return on investment, particularly for AI and deep learning applications, makes it a worthwhile consideration for those in the industry.