Copyright © 2018 DataScience.US All Rights Reserved.
Distributed Deep Learning on the MapR Converged Data Platform
Deep learning is a class of machine learning algorithms that learns multiple levels of representation of the data by cascading many layers of nonlinear processing units, with each layer transforming the output of the previous one.
Recently, the deep learning field has gained a lot of traction, thanks to research breakthroughs made by commercial entities in the tech field and to general advances in parallel computing performance. Quite a few deep learning applications have surpassed human performance; famous examples include AlphaGo, image recognition, and autonomous driving.
In most practices, development of deep learning applications is done using a single DevBox with multiple GPU cards installed. Some larger organizations use dedicated High Performance Computing (HPC) clusters to develop and train deep learning applications. While these practices can achieve strong computational performance, they lack fault tolerance and create issues when moving data across different DevBoxes or clusters.
Distributed Deep Learning Quick Start Solution
The MapR Converged Data Platform provides a state-of-the-art distributed file system. With the MapR File System (MapR-FS), you gain a unique opportunity to put your deep learning development, training, and deployment closer to your data. MapR leverages open source container technology, such as Docker, and orchestration technology, such as Kubernetes, to deploy deep learning tools, like TensorFlow, in a distributed fashion. Meanwhile, since MapR-DB and MapR Streams are also tied closely to the file system, if you are developing a deep learning application on MapR, it is convenient to deploy your model by extending the MapR Persistent Application Client Container (PACC) to harness the distributed key-value storage of MapR-DB and the cutting-edge streaming technology of MapR Streams for different use cases.
The distributed deep learning Quick Start Solution we propose has three layers (see Figure 1 above). The bottom layer is the data layer, which is managed by the MapR File System (MapR-FS) service. You can create dedicated volumes for your training data. We also support many enterprise features like security, snapshots, and mirroring to keep your data secure and highly manageable in an enterprise setting.
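The dedicated training-data volumes described above can be provisioned with the standard MapR CLI. The sketch below is illustrative: the volume name `tf-data`, the mount path, and the snapshot name are assumptions, not values from this solution.

```shell
# Create a dedicated volume for training data, mounted at /tf-data in MapR-FS.
maprcli volume create -name tf-data -path /tf-data

# Snapshot the volume so a given training run's input data can be reproduced later.
maprcli volume snapshot create -volume tf-data -snapshotname before-training
```

These commands must be run against a live MapR cluster; mirroring is configured similarly through `maprcli volume` subcommands.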
The middle layer is the orchestration layer. In this example, we propose to use Kubernetes to manage the GPU/CPU resources and to launch parameter servers and training workers for deep learning tasks as pods. Starting from Kubernetes 1.6, you can manage cluster nodes with multiple GPU cards; you can also manage a heterogeneous cluster, using CPU nodes to serve the model while GPU nodes train it. You can even go a step further and label nodes by GPU generation, scheduling lower-priority tasks on older GPU cards and higher-priority tasks on newer ones.
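As a sketch of this orchestration layer, the pod spec below requests GPUs and pins a training worker to newer cards via a node label. The label name `gpu-generation`, its value, the image tag, and the host mount path are all assumptions for illustration. Note that Kubernetes 1.6 exposed GPUs under the alpha `alpha.kubernetes.io/nvidia-gpu` resource name; later device-plugin releases use `nvidia.com/gpu` as shown.

```yaml
# Hypothetical pod spec for one TensorFlow training worker.
apiVersion: v1
kind: Pod
metadata:
  name: tf-worker-0
spec:
  nodeSelector:
    gpu-generation: pascal        # assumed label: route high-priority work to newer cards
  containers:
  - name: tf-worker
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 2         # GPU scheduling via the NVIDIA device plugin
    volumeMounts:
    - name: training-data
      mountPath: /mapr
  volumes:
  - name: training-data
    hostPath:
      path: /mapr                 # assumed NFS mount of MapR-FS on each host
```

A parameter-server pod would look the same minus the GPU limit, typically scheduled onto CPU-only nodes with a different node selector.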
The top layer is the application layer, where we use TensorFlow as the deep learning tool. With the high-performance NFS capabilities of MapR-FS, it is easy to have TensorFlow checkpoint its variables and models directly into the MapR file system. This makes it easy to inspect the TensorFlow training process, harness the resulting models, and put them into deployment. The advantage of using container technology in the application layer is that we can control the versions of a deep learning model through the metadata of its container images: we can package the trained model into a Docker image, using image tags to carry the version information, with all dependencies and libraries already installed in the image. When deploying a deep learning model, we just specify which version we want to deploy, and there is no need to worry about dependencies.
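The checkpointing idea above can be sketched with TensorFlow's standard checkpoint API. The MapR-FS path in the comment is hypothetical; the sketch falls back to a temporary directory so it runs anywhere, and a trivial counter variable stands in for real model weights.

```python
import os
import tempfile

import tensorflow as tf

# On a MapR cluster this would point at an NFS mount of MapR-FS, e.g.
# "/mapr/my.cluster.com/tf-models" (hypothetical path). We fall back to a
# temporary directory so this sketch runs on any machine.
ckpt_dir = os.environ.get("CKPT_DIR", tempfile.mkdtemp())

# A trivial stand-in for model state: a counter we bump each "training" step.
step = tf.Variable(0, name="global_step")
ckpt = tf.train.Checkpoint(step=step)
manager = tf.train.CheckpointManager(ckpt, ckpt_dir, max_to_keep=3)

for _ in range(3):
    step.assign_add(1)   # stand-in for one training step
    manager.save()       # checkpoint persists to the shared volume

# Any node that mounts the same volume can restore the latest state.
restored_step = tf.Variable(0)
restored = tf.train.Checkpoint(step=restored_step)
restored.restore(manager.latest_checkpoint)
print(int(restored_step.numpy()))  # → 3
```

Because the checkpoint directory lives on MapR-FS rather than local disk, a failed worker can be rescheduled on any node and resume from the latest checkpoint.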