Summary

We designed and implemented a general framework for parallel training of deep convolutional neural networks (CNNs) to explore the power of parallelism. We tried both OpenMP on Xeon Phi for CPU parallelism and CUDA on an NVIDIA GeForce GTX 1080 for GPU parallelism. We used C++ as the programming language and designed an API that lets users specify their own CNN structure while the network runs in parallel. As a result, we achieved around 30x speedup with OpenMP and 50x speedup with CUDA when training a large CNN on images around 200x200; however, the parallel CNN framework was slower than our sequential C++ CNN when the image size was too small.

Background

Deep neural networks normally take a very long time to train and require large computation and memory overhead. Parallelizing the computation can greatly accelerate the training phase and makes more complex network structures feasible. Specifically, we mainly explore parallelism in convolutional networks, which use expensive convolutional matrix multiplication to compute features. A convolutional neural network consists of several kinds of layers that represent features; Fig 1 shows an example architecture (in this report we refer to subsampling as pooling) [1]:

  1. Input layer and output layer. The input and output layers are important parts of the whole network but are not as computationally expensive as the other layers. The input layer mainly loads data into the input matrix, and the output layer mainly computes the loss and the gradient for the previous layer. They can still benefit from parallelism, for example by having multiple threads load data or compute the loss/gradient in parallel.
  2. Convolutional layers. Convolutional layers apply a convolution operation to the input and pass the result to the next layer. Convolutional layers commonly share weights, referred to as filters or weight banks, which reduces the memory footprint and improves performance. This part is data-parallel and computationally expensive because each filter must be applied to every local region of the input matrix via matrix-matrix multiplication, so it can greatly benefit from parallelism. There is also locality in the convolutional multiplication.
  3. Pooling layers. Pooling layers combine the outputs of a group of cells from the previous layer into a single value. For example, max pooling takes the maximum value of each cluster of neurons in the prior layer, and average pooling takes the average value. This part is more lightweight than the convolutional layer because no matrix multiplication is involved, but it can also benefit from parallelism by splitting the computation over subregions of a single image. There is also locality in the pooling operation because pooling is carried out on each subregion of the input matrix.
  4. Fully connected layers. Fully connected layers connect every neuron in one layer to every neuron in the next layer; in principle this is the same as a traditional multi-layer perceptron. Fully connected layers normally require a large memory footprint for the weight matrix. This part is mainly matrix-matrix or matrix-vector multiplication and can benefit from parallelism as well.

There are also various settings of the network, such as different loss functions and different activation functions. We want a general convolutional neural network framework that lets users define the type and parameters of each layer. A convolutional neural network is computation intensive, but by sharing weights, parallelizing over images or parameters, and trading atomicity for parallelism, a large speedup can be expected. As for the workload, the dependencies are across training iterations and within updates to the same weights. The memory access pattern exhibits spatial locality: the convolutional multiplication only touches a local group of grid cells, and the same holds for the pooling layer.
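
For concreteness, the core operation of a convolutional layer can be written as follows (a minimal sketch assuming stride 1 and no padding; the notation here is ours, not taken from Fig 1):

$$ y_o(i,j) = b_o + \sum_{c=1}^{C}\sum_{u=0}^{k-1}\sum_{v=0}^{k-1} w_{o,c}(u,v)\, x_c(i+u,\, j+v), \qquad 0 \le i,j \le n-k, $$

where $x_c$ is input channel $c$ of width $n$, $w_{o,c}$ is the $k \times k$ filter connecting input channel $c$ to output channel $o$, and $b_o$ is a bias term. A max-pooling layer with window size $s$ similarly computes $p_o(i,j) = \max_{0 \le u,v < s} y_o(si+u, sj+v)$, which is where the spatial locality mentioned above comes from.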

Approach

First and foremost, we implemented a sequential version of the convolutional neural network framework in C++. We also implemented a Python version of the CNN using PyTorch and compared the gradient updates of our sequential CNN against the PyTorch version. We have two parallel implementations, one in OpenMP and one in CUDA, and we aim to compare their performance and see the differences between CPU and GPU parallelism. Both our OpenMP and CUDA approaches are based on model parallelism; that is, all workers train on the same batch of input data, and for each training example (an image in our applications) the computation is parallelized across the model. The reasons are explained in the subsections below.
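
To make the framework structure concrete, here is a minimal sketch of how layers are represented and chained; the names and signatures are illustrative assumptions, not the exact classes in our code:

```cpp
#include <memory>
#include <vector>

// Illustrative sketch only: a simplified view of how the framework chains layers.
struct Matrix {
    std::vector<float> data;      // row-major storage
    int rows = 0, cols = 0, channels = 0;
};

class Layer {
public:
    virtual ~Layer() = default;
    virtual Matrix forward(const Matrix& input) = 0;          // compute activations
    virtual Matrix backward(const Matrix& grad_output) = 0;   // gradient w.r.t. input
};

// Each concrete layer (convolution, pooling, fully connected) stores its own
// parameters as member variables, e.g. a convolution layer holds its filters.
using Network = std::vector<std::unique_ptr<Layer>>;
```

A user builds a network by pushing layer instances into the list in order; training walks the list forward through each layer and then backward in reverse.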

  1. Choice of model parallelism over data parallelism. When it came to the design decision for parallelizing the CNN in CUDA, the first approach we tried was to parallelize the computation over different images (data parallelism), i.e., train multiple images at the same time; however, we noticed that this was infeasible if we wanted to reuse our sequential CNN framework. In our sequential CNN, each type of network layer is a class; while constructing a CNN, we instantiate different layer classes and chain them together as a list, and the parameters of each layer are stored in that layer as member variables. Therefore, if we launched a CUDA kernel at the very beginning to spawn threads that handle different images, there would be no way for a thread to access the member variables of other instances or to call their host member functions, and because we cannot predict how many layers a CNN has, we cannot determine the signature of such a kernel. So we use model parallelism here in order to better reuse our sequential CNN code; doing so does not hurt training precision because there are no conflicting updates to the model parameters.
  2. Parallel CNN via OpenMP. We implemented our CPU-parallel version of the CNN on Xeon Phi and experimented with different input sizes, shown in the next section. OpenMP has no concept of thread blocks, and all parallel computation is executed by CPU threads, so our parallel strategy is relatively consistent across layer types. We tried two different strategies to parallelize the model. We first tried to parallelize the computation over all cells of the output matrix of each layer, but since the number of threads that can execute at the same time is very limited on a CPU compared to a GPU, this performed even worse than the sequential version. We then parallelized the computation over rows of the output matrix of each layer; that is, each thread computes a single row or a chunk of rows for the convolution multiplication and the normal matrix multiplication (see the OpenMP sketch after this list). The results are shown in the following section.
  3. Parallel CNN via CUDA Different parallel designs were used in different layers. Here are the details:
    1. Convolutional Layer. In the convolution forward pass, we launch a CUDA kernel where the number of blocks is the number of output channels and the number of threads per block is the size of each output matrix; therefore, each thread block is responsible for one output channel and each thread in the block is responsible for one pixel (cell) of the output matrix (see the CUDA convolution sketch after this list). At the beginning, we tried an approach where each thread block was responsible for one input channel and each thread for one filter, but it did not deliver much performance improvement, so we gave it up and used the approach above so that more threads could be utilized. The convolution backward pass is a bit trickier: the number of blocks is the maximum of the number of input channels and the number of filters, while the number of threads per block is the size of an input channel (because a filter cannot be bigger than the input channel). When calculating the gradient for the weights, each thread block is responsible for the gradient of one filter and each thread for one cell of the weight gradient matrix; the mapping changes when computing the gradient for the previous layer, where each thread block computes the gradient for one input matrix and each thread computes one cell of that matrix.
    2. Pooling Layer. In the pooling layer, the parallel design is almost the same as in the convolutional layer. In the pooling forward pass, each thread block computes one output matrix and each thread computes one cell of that matrix. In the pooling backward pass, the number of thread blocks equals the number of input channels, while the block size equals the size of the filter rather than the size of the input channel; because the stride in our pooling layer always equals the pooling size, each thread can compute the gradient of one block (of the same size as the pooling window) of the input channel.
    3. Fully Connected Layer. In the fully connected forward pass, the output matrix is always an n x 1 matrix, so we manually set the number of threads per block to 1024 and increase the number of thread blocks until the total number of threads exceeds n; each thread multiplies one row of the filter with all the input matrices (see the fully connected sketch after this list). The mapping changes in the fully connected backward pass: the number of thread blocks equals the number of input channels, the block size equals the size of the input matrix, and each thread block calculates the gradient for one input channel. As for the weight gradient, each thread block computes the gradient for a block of weights (chunked by the size of the input matrix) and each thread computes one column of the weight gradient.
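
Below is a minimal sketch of the row-parallel OpenMP strategy from item 2, for a single input channel with stride 1 and no padding; the function and variable names are illustrative, not taken from our code:

```cpp
#include <omp.h>
#include <vector>

// Illustrative sketch: each OpenMP thread computes a chunk of output rows.
// Assumes one input channel, stride 1, no padding, so out_n = n - k + 1.
void conv_forward_rows(const std::vector<float>& in, int n,
                       const std::vector<float>& filter, int k,
                       std::vector<float>& out) {
    int out_n = n - k + 1;
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < out_n; i++) {          // one output row per iteration
        for (int j = 0; j < out_n; j++) {
            float sum = 0.0f;
            for (int u = 0; u < k; u++)
                for (int v = 0; v < k; v++)
                    sum += filter[u * k + v] * in[(i + u) * n + (j + v)];
            out[i * out_n + j] = sum;
        }
    }
}
```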
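
The CUDA convolution forward mapping from item 3.1 (one block per output channel, one thread per output cell) looks roughly like the following sketch; the kernel name and memory layout are assumptions, and bias/activation are omitted:

```cuda
// Illustrative sketch of the block/thread mapping, not our exact kernel.
// Grid: one block per output channel; block: one thread per output cell.
// Assumes stride 1, no padding, out_n = n - k + 1; a real launch must keep
// out_n * out_n within the 1024-threads-per-block limit.
__global__ void conv_forward_kernel(const float* in, int n, int channels,
                                    const float* filters, int k,
                                    float* out, int out_n) {
    int oc = blockIdx.x;                  // output channel handled by this block
    int idx = threadIdx.x;                // output cell handled by this thread
    if (idx >= out_n * out_n) return;
    int i = idx / out_n, j = idx % out_n;

    float sum = 0.0f;
    for (int c = 0; c < channels; c++)    // accumulate over input channels
        for (int u = 0; u < k; u++)
            for (int v = 0; v < k; v++)
                sum += filters[((oc * channels + c) * k + u) * k + v]
                     * in[(c * n + (i + u)) * n + (j + v)];
    out[(oc * out_n + i) * out_n + j] = sum;
}

// Host-side launch (illustrative):
// conv_forward_kernel<<<num_output_channels, out_n * out_n>>>(
//     d_in, n, channels, d_filters, k, d_out, out_n);
```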
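
Similarly, a sketch of the fully connected forward pass from item 3.3, where blocks of 1024 threads cover the n output elements; again the names are illustrative:

```cuda
// Illustrative sketch: each thread multiplies one row of the weight matrix
// with the flattened input to produce one element of the n x 1 output.
__global__ void fc_forward_kernel(const float* in, const float* weights,
                                  float* out, int in_size, int out_size) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= out_size) return;
    float sum = 0.0f;
    for (int i = 0; i < in_size; i++)
        sum += weights[row * in_size + i] * in[i];
    out[row] = sum;
}

// Host-side launch (illustrative): 1024 threads per block, enough blocks to cover out_size.
// int blocks = (out_size + 1023) / 1024;
// fc_forward_kernel<<<blocks, 1024>>>(d_in, d_weights, d_out, in_size, out_size);
```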

Results

In our experiments, we compared the speed of the parallel OpenMP and CUDA versions of the CNN framework with the speed of the sequential C++ version. Since we do not do any batch processing in the parallel CNN, its accuracy is identical to that of the sequential CNN. Besides, our framework supports custom configuration of the CNN structure, so the accuracy of the CNN depends entirely on what the client configures; our framework is only responsible for speeding up training and testing. The sequential CNN was benchmarked on the GHC machines, which have 16 CPUs and 31 GB of memory; the OpenMP parallel CNN was benchmarked on the 60-core Xeon Phi 5110P with 8 GB of RAM, and the CUDA version was benchmarked on a GeForce GTX 1080, which has 2560 cores and 8 GB of memory.

  1. Speedup via OpenMP

For the speedup via OpenMP, we created five synthetic workloads; the only difference was the image size: 512x512, 200x200, 128x128, 64x64, and 32x32. The metric we used was the speedup of the parallel CNN over the sequential CNN; the graph below shows the total training-time speedup for OpenMP. Note that when the image is 512x512, the speedup reaches the core count of the Xeon Phi, which is a relatively good speedup over the sequential version, while for image sizes below 512x512 the speedup decreases. We used 50 threads for all image sizes. We believe this is because spawning threads has overhead, and splitting the work into smaller chunks of fewer rows exposes this overhead and yields a less satisfying speedup. When the image size drops to 32x32, the speedup is 0.95, meaning the parallel version performs worse than the sequential one, because each thread is responsible for less than one row of computation and the overhead and reduced locality hurt performance. We explain these reasons further in the following section.

  2. Speedup via CUDA and deeper analysis. We created five synthetic workloads; the only difference was the image size: 512x512, 200x200, 128x128, 64x64, and 32x32. The metric we used was the speedup of the parallel CNN over the sequential CNN; the graph below shows the total training-time speedup for CUDA:

The speedup of the parallel CNN increased significantly as the image size increased; we tried to benchmark at an even bigger scale, but the sequential CNN was too slow, so we gave up waiting. However, we should note that when the image size was smaller than 64x64, the sequential CNN beat the parallel CUDA CNN. Profiling the CUDA CNN showed that spawning threads on the GPU consumed a decent amount of time; moreover, a single GPU thread is weaker than a single CPU thread, so parallelizing the computation in each network layer did not help much when the image size was too small, because there was not enough parallel work to do. For such cases where the image size is small, we believe batch processing may be future work to improve training speed. In addition to the total speedup of the CNN, we also cared about how much improvement was achieved in each type of layer:

The convolution layer gained a huge speed improvement as the image size increased, even when the image size was 64x64. To break it down mathematically, the time complexity of the sequential convolution forward pass is O(n^2 k^2), where n is the width of the image and k is the width of the filter. In the sequential convolution backward pass, however, the time complexity increases dramatically because it uses both the input matrices and the gradient matrices from the next layer (the same size as the current output matrices) to compute the gradient matrices for its own input layer, so the time complexity grows to O(n^2 (n-k)^2), which is nearly O(n^4) when the filter is small; parallelism improves performance significantly for such computation-intensive tasks. The improvement from the parallelized pooling layer was almost the same as for the convolution layer.
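
As a rough sketch of where these counts come from (assuming stride 1, no padding, and a backward pass that, as described above, scans every output-gradient cell for every input cell):

$$ \text{forward: } (n-k+1)^2 \cdot k^2 = O(n^2 k^2), $$
$$ \text{backward: } n^2 \cdot (n-k+1)^2 = O\big(n^2 (n-k)^2\big) \approx O(n^4) \quad \text{for } k \ll n. $$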

However, the parallel version of the fully connected layer was always slower than the sequential version. We believe the major reason is that although the GeForce GTX 1080 offers a large number of GPU threads, we only used a small number of them in the fully connected layer in order to avoid synchronization overhead. In our fully connected layer design, the number of threads equals the number of rows in the filter, so that each thread can compute the output for an entire row of the filter and sum it up; however, the number of rows in the filter is much smaller than the size of the matrix, whereas in the other layers the number of threads is usually equal to the size of the matrix. Therefore, the overhead of spawning threads exceeded the benefit gained from parallelizing the fully connected layer in CUDA.

  3. Comparison between OpenMP and CUDA. We experimented with both parallel implementations and gained some insight into their differences. Intrinsically, CUDA should outperform OpenMP since the GPU has far more cores than the CPU and benefits from its design of thread blocks, threads, shared memory, etc., and this is reflected in our experiments. However, parallelizing our sequential program via CUDA also has limitations, for example communication between thread blocks, which makes it hard to implement data parallelism in this scenario without the help of the parameter server introduced in class. Parallelizing a convolutional neural network yields reasonable performance on the CPU with OpenMP, but the nature of convolution multiplication makes it very computationally expensive, and using the GPU together with a parameter server may be an ideal solution for speedup.

References

  1. https://en.wikipedia.org/wiki/Convolutional_neural_network
  2. Krizhevsky, Alex. “One weird trick for parallelizing convolutional neural networks.” arXiv preprint arXiv:1404.5997 (2014).
  3. Hegde, Vishakh, and Sheema Usmani. "Parallel and Distributed Deep Learning." Tech. report, Stanford University, June 2016. https://stanford.edu/~rezab/dao/projects_reports/hedge_usmani.pdf.

LIST OF WORK BY EACH STUDENT, AND DISTRIBUTION OF TOTAL CREDIT

The division of effort was 50%-50% in our team. We used pair programming to develop this project, and both of us understand every line of the code. This report is also a combined effort from both members.