IBM Research is looking to cut the time it takes deep learning systems to learn to recognize images and sounds, using a new software library that lets the technology work across multiple servers and GPUs.
On Tuesday, IBM researchers are releasing two blog posts that detail their research into what Big Blue calls its "Distributed Deep Learning" (DDL) software library. The library consists of multiple APIs that work across different open source machine learning frameworks, which then allows deep learning to scale across several servers running hundreds of GPUs.
The software is part of PowerAI, IBM's distribution platform for machine learning and artificial intelligence on its Power server systems. (See IBM Brings Open Databases to Private Cloud.)
One of the current problems with deep learning is that it takes a long time for the technology to "learn" to recognize different images and sounds. Part of that problem is getting the technology to scale up to take advantage of larger clusters of GPU-equipped servers, Sumit Gupta, IBM's vice president of HPC, AI and Analytics, wrote in one of the August 8 blog posts.
"At the crux of this problem has been the technical limitation that the popular open-source deep learning software frameworks do not run well across multiple servers," Gupta wrote. "So, while most data scientists are using servers with four or eight GPUs, they can't scale beyond that single node."
The example the researchers used is a neural network that took 16 days to train to recognize images on a single IBM Power "Minsky" server with four Nvidia GPU accelerators. When they applied the newly developed DDL software, the researchers were able to scale across tens of servers with hundreds of GPUs and reduce the training time to seven hours.
The IBM researchers created the software and algorithms using parallelization techniques that allow the training task to be spread across multiple GPUs at the same time. The idea is to take advantage of the GPU's architecture, which consists of many parallel cores.
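To put those figures in perspective, the short calculation below works out the implied speedup. It is only a back-of-the-envelope sketch: the 16-day and seven-hour figures come from the article, while the GPU count for the faster run is an assumption (the 256 GPUs the researchers say they ultimately ran on), so the efficiency figure should be read as illustrative.

```python
# Figures quoted in the article
baseline_hours = 16 * 24  # 16 days on one server with four GPUs
scaled_hours = 7          # reported training time with the DDL library

speedup = baseline_hours / scaled_hours

# Assumption (not stated for this specific run): the seven-hour result
# used 256 GPUs, compared with four GPUs in the baseline.
baseline_gpus = 4
assumed_scaled_gpus = 256
gpu_ratio = assumed_scaled_gpus / baseline_gpus
efficiency = speedup / gpu_ratio

print(f"{baseline_hours} hours -> {scaled_hours} hours: about {speedup:.0f}x faster")
print(f"{gpu_ratio:.0f}x more GPUs, so roughly {efficiency:.0%} scaling efficiency")
```

Under that assumption, 64 times as many GPUs delivered roughly a 55-fold reduction in training time.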
"But as GPUs get much faster, they learn much faster, and they have to share their learning with all of the other GPUs at a rate that isn't possible with conventional software," wrote Hillery Hunter, a research memory strategist and Director of the Systems Acceleration and Memory Department for IBM, in the other blog.
"This puts stress on the system network and is a tough technical problem," Hunter added. "Basically, smarter and faster learners (the GPUs) need a better means of communicating, or they get out of sync and spend the majority of time waiting for each other's results."
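The synchronization problem Hunter describes can be illustrated with a toy simulation of synchronous data-parallel training. This is a generic sketch, not IBM's DDL algorithm or API: the function names (`local_gradient`, `all_reduce_mean`, `train`) are hypothetical, the "GPUs" are plain Python lists, and the model is a single weight fitted to made-up data. The point is the step where every worker must exchange gradients before any of them can continue.

```python
from statistics import mean

def local_gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one data shard."""
    return mean(2 * (w * x - y) * x for x, y in shard)

def all_reduce_mean(values):
    """Stand-in for a collective operation: every worker receives the same average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)

def train(shards, steps=200, lr=0.01):
    weights = [0.0] * len(shards)  # one model replica per simulated GPU
    for _ in range(steps):
        # Each worker computes a gradient on its own slice of the data.
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        # Synchronization point: no worker can proceed until all gradients are shared.
        synced = all_reduce_mean(grads)
        weights = [w - lr * g for w, g in zip(weights, synced)]
    return weights

# Toy data for y = 3x, split evenly across four simulated workers
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
weights = train(shards)
print(weights)
```

In a real cluster the averaging step travels over the network, so the faster each GPU finishes its local work, the larger the share of time spent waiting on that exchange.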
In the end, IBM researchers were able to run various machine learning workloads across 256 GPUs.
For now, the new DDL software library is available in a technical preview. It is part of Version 4 of the PowerAI deep learning software distribution.
IBM is releasing the first set of APIs to work with TensorFlow, a machine learning framework developed by Google, as well as Caffe, another open source framework. (See Google's TPU Chips Beef Up Machine Learning.)
Later, IBM plans to add support for two other machine learning frameworks: Torch and Chainer.
— Scott Ferguson, Editor, Enterprise Cloud News. Follow him on Twitter @sferguson_LR.