ID2223 — Deep Learning on Big Data at KTH

Jim Dowling
9 min read · Jul 22, 2018


tl;dr ID2223 is an MSc course that marries data-parallel programming with deep learning. It has been given at KTH Stockholm since 2016 and now has over 120 students. It is the first course we are aware of in which students work on distributed deep learning problems with big datasets, ranging from several GB to hundreds of GB in size. This article covers the coursework in ID2223.

Large Scale Machine Learning and Deep Learning

ID2223 (Large Scale Machine Learning and Deep Learning) is given as a course in the EIT Data Science Master's program, but it also draws in students from the MSc in Machine Learning and the more data-engineering-oriented MSc programs in distributed systems and cloud computing. As such, the students' backgrounds can be classified into one of three groups:

(1) no machine learning, good data-engineering;

(2) good machine learning, no data-engineering;

(3) good machine learning, some data-engineering.

The coursework in ID2223 covered programming in both SparkML and TensorFlow. We had one lab exercise in Scala/SparkML and one in Python/TensorFlow (FashionMNIST). Students then worked in groups to solve a prediction problem on a large (at least multi-GB) dataset. Most students used deep learning, and an analysis of the frameworks they chose is presented later.

The course used hardware from the SICS ICE data center (including GPUs) and, for software, the Hops data platform (including Spark, Keras/TensorFlow, PyTorch, and OpenAI Gym), which is administered by Logical Clocks AB.

At the end of the course, the Stockholm Hadoop User Group (SHUG) invited us to present the course at a meetup in February 2018, attended by about 130 people. I presented an overview of the course and its learning goals, then Robin Andersson of Logical Clocks AB gave a live demo of the Hops platform used for programming with Spark/TensorFlow/Keras/PyTorch. Finally, the two students who won the best-project prize presented their work on predicting human activity using mobile phones, with a live demo.

Lab 1: Linear Regression on the Million Song Dataset with SparkML

This lab is an introduction to large-scale ML with SparkML and Scala. Students in group (1) had to familiarize themselves with ML concepts: linear regression, model selection, and SGD. Students in group (2) were challenged by programming in SparkML/Scala, and those in group (3) by distributed hyperparameter optimization. Students were given a tiny 1.9 MB subset of the Million Song Dataset, as well as a larger 1.9 GB subset and the full 200 GB dataset on www.hops.site. Most students worked on the tiny dataset on their laptops, many tried out the 1.9 GB dataset on www.hops.site, and a few managed to complete the bonus task on the full dataset on hops.site.
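The lab itself was done in SparkML, but the core task, linear regression trained with mini-batch SGD, can be sketched in plain NumPy. The dataset here is synthetic (the dimensions loosely echo the Million Song Dataset's 12 timbre-average features, and the learning rate and batch size are illustrative, not the lab's values):

```python
import numpy as np

# Sketch of Lab 1's core algorithm: linear regression with mini-batch SGD.
# Synthetic data stands in for the Million Song Dataset features/targets.
rng = np.random.default_rng(0)
n_samples, n_features = 1000, 12
X = rng.normal(size=(n_samples, n_features))
true_w = rng.normal(size=n_features)
y = X @ true_w + 0.1 * rng.normal(size=n_samples)  # noisy linear target

w = np.zeros(n_features)
lr, batch_size = 0.1, 32
for epoch in range(50):
    idx = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        b = idx[start:start + batch_size]
        # gradient of mean squared error on the mini-batch
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad

rmse = np.sqrt(np.mean((X @ w - y) ** 2))  # approaches the noise level (~0.1)
```

SparkML's LinearRegression does the same optimization, but distributes the gradient computation across the partitions of the dataset, which is what makes the 200 GB bonus task feasible.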

Analysis: This task was straightforward for students who had previously programmed in Scala (groups 1 and 3), acted as an introduction to ML for group 1, and was challenging for students in group 2 who had not programmed in Scala before. In the future, we may replace Scala with Python, given the improvements in the Python ecosystem for data engineering.

Lab 2: Image Classification on FashionMNist with TensorFlow (or Keras)

This lab is designed to be more challenging than the MNIST tutorial found on the TensorFlow website, and serves as an introduction to building ConvNets and working with .tfrecords files.
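Before reaching for TensorFlow, it helps to see what a single convolutional layer actually computes. The sketch below implements the core operation, a valid cross-correlation with stride 1 and no padding, on a FashionMNIST-sized image (the edge-detection kernel is just an illustration, not part of the lab):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation of a single-channel image with one kernel
    (stride 1, no padding) -- the core operation inside a ConvNet layer."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 3x3 vertical-edge kernel applied to a 28x28 image (FashionMNIST size)
image = np.zeros((28, 28))
image[:, 14:] = 1.0                       # right half bright: a vertical edge
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], float)
fmap = conv2d(image, sobel_x)             # responds strongly along the edge
```

A conv layer in TensorFlow applies many such kernels in parallel, with learned weights, over all input channels.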

The most accurate network, trained by two students, reached 95.38% accuracy. Its architecture had 3 conv layers, 1 ReLU layer, and 1 softmax layer, and was trained with standard SGD at a learning rate of 0.001: no dropout, no batch norm, no weight decay, nothing fancy. Keep it simple, stupid. It was trained on an Nvidia 1080 GPU. The next most accurate model, at 95.49% (slightly higher, but submitted after the deadline, so it didn't count!), was trained by a PhD student. It was an ensemble (trained on several GPUs) of networks with 4 conv layers + 1 FC layer, batch norm, image augmentation (rotation, flipping, random erasing), and dropout.
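One of the ensemble's augmentations, random erasing, is simple enough to sketch: blank out a randomly placed rectangle of the input image with noise, so the network cannot rely on any single region. The size limits below are illustrative, not the student's actual parameters:

```python
import numpy as np

def random_erase(img, rng, max_frac=0.3):
    """Random-erasing augmentation: overwrite a randomly placed rectangle
    (up to max_frac of each dimension) with uniform noise. Parameters here
    are illustrative."""
    h, w = img.shape
    eh = rng.integers(1, max(2, int(h * max_frac)))   # rectangle height
    ew = rng.integers(1, max(2, int(w * max_frac)))   # rectangle width
    top = rng.integers(0, h - eh + 1)
    left = rng.integers(0, w - ew + 1)
    out = img.copy()
    out[top:top + eh, left:left + ew] = rng.random((eh, ew))
    return out

rng = np.random.default_rng(42)
img = np.ones((28, 28))          # stand-in for a FashionMNIST image
aug = random_erase(img, rng)     # original img is left untouched
```

Rotation and flipping, the other two augmentations mentioned, are one-liners with `np.rot90` and `np.fliplr`.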

Analysis: FashionMNIST was a new dataset in 2017 when the course was given, but in 2018, there are many example notebooks for FashionMNIST available online, so we will replace it with a new dataset/tutorial.

The Group Project

The final piece of coursework was a project (worth 3 labs in terms of grade weighting). Students were allowed to do the project in groups (typically with 2 members), and they were required to solve a prediction problem on at least GB-sized datasets. A large number of datasets for machine learning were made available to students, and students could use those or a dataset of their choosing. The vast majority of projects selected were deep learning projects, see the Table below.

Students who did not choose the self-defined project option could work either with a 200GB+ dataset (NYC Taxi data) or with a smaller one, yahoo_i2 — Yahoo! Shopping Shoes Image Content, version 1.0. The NYC Taxi data project comprised three prediction tasks, and the idea was that students would need to use SparkML on Hops, as the dataset was so large. The Yahoo! shoes project was an image classification problem on a smaller dataset — an extension of the FashionMNIST problem with a new dataset (not available on the Internet!). Only two groups chose the Yahoo! shoes dataset, and one chose the NYC Taxi data project.

Platform Chosen by Student Groups

Students were free to choose the programming framework for their project. As can be seen below, most students chose to implement the project in TensorFlow. Some strong students chose PyTorch, others chose Keras/TensorFlow, and one group chose SparkML.

Datasets Chosen by Student Groups

Students chose a wide variety of datasets from computer vision, natural language processing, time-series data, and sound/music.

Datasets Chosen for Student Projects

Project Results

In the following sections, we show a few of the results by the students for different classes of prediction problems.

Know your Dataset

As many groups were working with new datasets without any backing publication, they first had to evaluate the quality of the training data: with poor-quality training data, it is hard to learn to predict anything. For example, the NIH X-Ray dataset was of poor quality. The students who worked with it built a ConvNet for multi-label classification of common thoracic diseases (it is a multi-label problem because some x-rays show signs of more than one disease).

The students examined the dataset quality using a confusion matrix, which showed that it is hard to train an accurate classifier when the labels are noisy:
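A confusion matrix like the one the students used is straightforward to compute. A minimal single-label sketch in NumPy (the three-class labels here are hypothetical, not the NIH data):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix C where C[i, j] = number of samples with true class i
    predicted as class j. Heavy off-diagonal mass signals label noise
    or a weak model."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Hypothetical 3-class example
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])
cm = confusion_matrix(y_true, y_pred, 3)
accuracy = np.trace(cm) / cm.sum()   # diagonal = correct predictions
```

For a genuinely multi-label problem like the NIH X-rays, one such matrix (or a 2x2 table) is computed per disease label.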

CNNs

Many students chose image classification problems and used convolutional neural networks (CNNs) to solve their prediction problems. With good, clean datasets this generally worked well, but noisier datasets, like the NIH X-Ray dataset, predictably produced less accurate classifiers.

LSTMs

Several groups used LSTMs to generate image captions, with the typical setup being a CNN encoder feeding an LSTM decoder:

Here are some graphs and sample images produced by students:

GANs

GANs were the talk of the NIPS conference in 2017, and many students wanted to do projects on them. I warned them that they might not manage to train their systems in time, even with our Nvidia 1080Ti GPUs. However, I was pleasantly surprised that 3 groups managed to demo working GANs. It seems that if you throw enough GPU compute at them, even with small datasets (Celeb-A, MNIST), you can get them to generate plausible images.

Changing hairstyles for celebrities from the Celeb-A dataset using GANs.

Best Project

Logical Clocks AB, who develop an Enterprise Hops platform, awarded the prize of an Nvidia 1070 8GB GPU to the best project. The best project was “Distributed LSTM training — Predicting Human Activities on Edge Devices”, by Kim Hammar and Konstantin Sozinov.

The students took the open-source Heterogeneity Activity Recognition dataset collected by Stisen et al., around 10 GB in size, and segmented it with sliding windows. They used an LSTM model, inspired by Venelin Valkov, and trained it with distributed training using TensorFlowOnSpark on hops.site. The slides are available here.
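The sliding-window segmentation step, which turns a raw sensor stream into fixed-length training sequences for the LSTM, can be sketched in NumPy. The window length and stride below are illustrative, not the students' actual values:

```python
import numpy as np

def sliding_windows(signal, window, stride):
    """Cut a (timesteps, channels) sensor stream into overlapping windows of
    shape (window, channels) -- standard preprocessing for sequence models
    on activity-recognition data. window/stride values are illustrative."""
    n = (len(signal) - window) // stride + 1
    return np.stack([signal[i * stride:i * stride + window] for i in range(n)])

# Hypothetical accelerometer stream: 1000 timesteps, 3 axes (x, y, z)
signal = np.arange(3000, dtype=float).reshape(1000, 3)
windows = sliding_windows(signal, window=128, stride=64)
# windows has shape (n_windows, 128, 3), ready to batch into an LSTM
```

Each window then gets the activity label of the segment it was cut from, turning the stream into a supervised sequence-classification dataset.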

Lessons Learned

As with all courses in this field, ID2223 is under constant development, both in course content and in the methods introduced. Distributed deep learning is an important part of the course, as is a systems-level understanding of machine learning pipelines. Here are a few practical lessons we learned providing a platform for students using Hops (with HDFS/TensorFlow/Spark, etc.):

  1. Reading and Writing files to/from HDFS. Pandas, SparkML, and TensorFlow all support HDFS natively, but when you try to do something outside those frameworks, HDFS is often not supported. PyTorch does not support HDFS natively, so reading datasets in HopsFS (HDFS in Hops) involves first reading Pandas dataframes and then converting them into NumPy arrays for use in PyTorch. This approach is only viable for datasets that fit on a single host, not for larger distributed datasets.
    When students used a library that does not support HDFS (such as pre-processing images with OpenCV), the solution was typically to copy files from HDFS to local storage and then work with the local files. This is inefficient, and only works up to the size of local scratch storage, but it works.
  2. How do I debug this thing on my laptop? Many students did not want to connect to a remote server for development; they wrote and tested code on their laptops, then changed the code to use HDFS instead of local files when working on larger datasets on the Hops cluster.
  3. Sharing GPUs. Hops supports GPUs-as-a-Resource. Predictably, demand for GPUs was slack in the first few weeks after students started projects, and then, in the two weeks before the project deadline, there was a large queue of applications waiting for free GPUs. We reduced the number of GPUs an application could use to 4 for the last week before the deadline, to prevent hogging.
  4. Identifying Root Causes of Bugs. Data-parallel systems are distributed and challenging to debug. A typical problem that confuses newcomers at first is Jupyter returning an error because, for example, Spark ran out of memory. Spark can run out of on-heap or off-heap memory, and each requires its own solution: increase spark.executor.memory and/or spark.driver.memory (for YARN cluster mode), or increase spark.executor.memoryOverhead and spark.memory.offHeap.size with spark.memory.offHeap.enabled=true. Another problem with YARN is that logs are only available for an application after the application completes. The solution in Hops is to write logs to logstash from the application, and then view the logs in real time using Kibana.
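The memory settings named in point 4 can be collected in spark-defaults.conf (or passed as --conf flags). The values below are purely illustrative and need tuning to the cluster and workload:

```properties
# Illustrative values only -- tune to your cluster
spark.executor.memory            8g     # on-heap memory per executor
spark.driver.memory              4g     # driver heap (YARN cluster mode)
spark.executor.memoryOverhead    2g     # off-heap overhead per executor
spark.memory.offHeap.enabled     true
spark.memory.offHeap.size        2g
```

Increasing spark.executor.memory addresses on-heap OOMs; the overhead and offHeap settings address off-heap OOMs, which YARN reports as containers killed for exceeding their memory limits.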

Acknowledgements

Thanks to the two TAs on the course, Alex Ormenisan and Kamal Hakimzadeh, and to others who helped out with grading labs (Salman Niazi). Alex, in particular, carried a heavy load. Thanks also to Robin Andersson, at Logical Clocks AB, who provided support on Gitter during the projects. And thanks to Tor Björn Minde, CEO of RISE SICS North, who provided cluster resources to the students on www.hops.site.

Written by Jim Dowling

@jim_dowling CEO of Logical Clocks AB. Associate Prof at KTH Royal Institute of Technology Stockholm, and Senior Researcher at RISE SICS.
