IBM Deep Learning Breaks Through

LAKE WALES, Fla. — IBM Research has reported an algorithmic breakthrough for deep learning that comes close to the holy grail of ideal scaling efficiency: Its new distributed deep-learning (DDL) software delivers a nearly linear speedup with each added processor (see figure), and it is designed to keep delivering that speedup as whole servers, not just GPUs, are added to the cluster.

The aim “is to reduce the wait time associated with deep-learning training from days or hours to minutes or seconds,” according to IBM fellow and Think Blogger Hillery Hunter, director of the Accelerated Cognitive Infrastructure group at IBM Research.

Hunter notes in a blog post on the development that “most popular deep-learning frameworks scale to multiple GPUs in a server, but not to multiple servers with GPUs.” The IBM team “wrote software and algorithms that automate and optimize the parallelization of this very large and complex computing task across hundreds of GPU accelerators attached to dozens of servers,” Hunter adds.
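IBM has not detailed DDL's internals in the post, but the general pattern such software automates, data-parallel training with gradients synchronized across many GPUs on many servers, can be sketched with an off-the-shelf framework. The following minimal example uses PyTorch's DistributedDataParallel purely as an illustration; PyTorch, the toy model, and the launch setup are assumptions here, not IBM's Caffe-based implementation.

```python
# Illustrative sketch only: generic multi-GPU, multi-server data-parallel
# training (one process per GPU, gradients averaged across all processes
# after each backward pass). This is NOT IBM's DDL library.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Rank, world size, and LOCAL_RANK are normally set by a cluster
    # launcher such as torchrun on every participating server.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1000).cuda(local_rank)   # stand-in model
    model = DDP(model, device_ids=[local_rank])             # wrap for gradient averaging
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    for step in range(100):
        # Each process trains on its own shard of the data (random here).
        inputs = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        labels = torch.randint(0, 1000, (32,), device=f"cuda:{local_rank}")
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()          # gradient all-reduce across all GPUs happens here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The communication step hidden inside backward() is exactly where scaling efficiency is won or lost as GPU counts grow, and it is the part IBM says its DDL algorithms optimize.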

Graph shows nearly linear speedup of deep-learning training as GPUs are added. (Source: IBM)

IBM claims test results of 95 percent scaling efficiency for up to 256 Nvidia Tesla P100 GPUs spread across a 64-server cluster running the open-source Caffe deep-learning framework. The results were measured on an image-recognition training task but are expected to apply to similar learning workloads. IBM completed the training run, at that nearly linear scaling efficiency, in 50 minutes; Facebook Inc. previously achieved 89 percent efficiency with a 60-minute training time on the same data set.
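Scaling efficiency here is the usual ratio of measured speedup to the ideal linear speedup: at 95 percent efficiency, 256 GPUs provide roughly 0.95 × 256 ≈ 243 times the throughput of a single GPU, versus about 228× at Facebook's reported 89 percent.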

IBM is also claiming a validation accuracy record of 33.8 percent on 7.5 million images after just seven hours of training on the ImageNet-22k data set, compared with Microsoft Corp.'s previous record of 29.8 percent accuracy after 10 days of training on the same data set. IBM ran the training on its PowerAI platform, a 64-node Power8 cluster equipped with the 256 Nvidia GPUs, which delivers more than 2 petaflops of single-precision floating-point performance.

The company is making its DDL suite available free to any PowerAI platform user. It is also offering third-party developers a variety of application programming interfaces to let them select the underlying algorithms that are most relevant to their application.

— R. Colin Johnson, Advanced Technology Editor, EE Times
