Monday, April 4, 2016

Review of "Scalability! but at what Cost?" co-authored with Lee Savoie (http://www.cs.arizona.edu/~lsavoie/)


This semester in a graduate seminar on programming languages (http://www.cs.arizona.edu/classes/cs620/spring16/), we are learning about and comparing various parallel programming models.  One paper we discussed as a class was McSherry, Isard, and Murray’s “Scalability! But at what COST?” (https://www.usenix.org/conference/hotos15/workshop-program/presentation/mcsherry).  This paper considers the question of how to measure the performance of parallel codes.  The generally accepted metric for measuring parallel performance is speedup, which is the single-threaded execution time divided by the parallel execution time.  When this is measured at several core counts, speedup gives an idea of the scalability of the code, but it says nothing about its absolute performance.  This paper suggests that a new metric is needed to take into account the overhead of parallelization compared to a simple, single-threaded implementation (i.e., not the single-threaded version of the parallelized code).  The class conclusion was that the Scalability! paper pointed out a key consideration, namely that new parallel programming models for graph computations should carefully select their baseline; however, the proposed COST metric is not necessary if all speedup is reported with respect to a single, simpler baseline.
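As a rough illustration, the following Python sketch computes speedup at several core counts.  The variable names and timings are made up for this post, not figures from the paper.

    # Speedup: single-threaded execution time divided by parallel execution time.
    # All timings below are hypothetical, for illustration only.
    single_thread_seconds = 120.0            # the parallel code run on one thread

    parallel_seconds = {2: 65.0, 4: 36.0, 8: 21.0, 16: 14.0}   # core count -> runtime

    for cores, runtime in sorted(parallel_seconds.items()):
        speedup = single_thread_seconds / runtime
        print(f"{cores:3d} cores: speedup = {speedup:.2f}")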

What is the problem?

Since computer clock rates have been stagnant for a decade, there has been an explosion of parallel hardware and distributed computing systems.  Leveraging such hardware resources without overly burdening programmers has become an important research and practical problem.  Some of the low-level implementation details that programmers are faced with include distributing data across resources, scheduling computation and communication amongst those resources, and gathering the results.  All of these details must be dealt with such that the resulting implementation is more efficient than a straightforward serial implementation; otherwise, why bother?

In the area of graph analytics, researchers in academia and industry have been developing programming frameworks that handle these details, thus enabling programmers to focus on specifying their algorithm.  The Scalability! paper observes that the way such solutions have been evaluated overemphasizes parallel scalability to the detriment of the bottom line, which is absolute performance.  Much like those of us who claim we lost weight after merely working off our extra holiday padding, comparing a parallel and distributed implementation to a single-threaded implementation that carries all the same programming-model abstraction overheads smacks of cheating.

Why should we care?


The adage that "that which is measured, improves" is alive and well in research.  This can sometimes lead to too much weight (pun intended) being placed on a subset of the applicable evaluation metrics, thus leading a sub-field of study off on a tangent.  In the sub-field that researches programming models, it is difficult to quantitatively or even qualitatively evaluate programmability.  Although performance in terms of execution time is much simpler to evaluate in comparison, it also has its pitfalls, such as properly reporting the variance in results (see “Scientific Benchmarking of Parallel Computing Systems” by Hoefler and Belli, http://htor.inf.ethz.ch/publications/index.php?pub=222).  The Scalability! paper is pointing out that many researchers have fallen into one of these pitfalls with parallel scalability, specifically failing to ensure a competitive baseline.  Such papers are important to help encourage discussion about what the evaluation metrics should be.  In other words, how do we know when a problem we are working on is solved?

As a side note, the Scalability! paper has been pointed out to reviewers at top conference venues as something to consider while making review decisions on papers.  This indicates that the research community does take such reconsideration of evaluation metrics seriously, and that ignoring such discussions may curtail one's ability to contribute to the research discussion (i.e., to publish).

Approach in the Scalability! paper


The Scalability! paper suggests using a new metric: COST, or the Configuration that Outperforms a Single Thread.  The COST of a parallel code is the number of cores that we must run that code on in order to outperform a reasonable serial implementation.  COST gives us an idea of the absolute performance of parallel code relative to the simpler alternative of serial code.  Thus, in addition to using speedup to gauge scalability, we can also measure COST to understand the tradeoffs between the overheads and benefits of our parallel implementation.
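To make the definition concrete, here is a small Python sketch that finds the COST of a parallel code given measured runtimes at several core counts.  Again, the timings are hypothetical rather than numbers from the paper.

    # COST: the smallest configuration (core count) at which the parallel code
    # outperforms a reasonable serial implementation.  Timings are hypothetical.
    serial_seconds = 30.0                    # simple single-threaded implementation

    parallel_seconds = {16: 90.0, 32: 55.0, 64: 34.0, 128: 24.0, 256: 19.0}

    def cost(serial_time, parallel_times):
        for cores in sorted(parallel_times):
            if parallel_times[cores] < serial_time:
                return cores
        return None  # unbounded ("infinite") COST: never beats the serial code

    result = cost(serial_seconds, parallel_seconds)
    print("COST =", result if result is not None else "unbounded")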

This paper demonstrates the utility of the COST metric by comparing the runtimes of simple, serial implementations of common problems running on a laptop to the runtimes of parallel codes solving the same problems on clusters. The laptop runtimes were measured directly, and the parallel runtimes were taken from published papers. Surprisingly, the COST for some of the systems was in the hundreds of cores. Some systems even had infinite COST, meaning that the parallel code could not outperform the serial code on any number of cores, which brings into question the usefulness of parallelizing those codes. Thus, the paper demonstrates that many published systems that have good scalability also have a high COST, which shows that speedup is insufficient by itself, and something like COST is needed as well.

CSc 620 Class Conclusions


The Scalability! paper asks whether, during the development of parallel programming models, we as a research community have lost sight of why we are parallelizing things in the first place.  This is an important question, but we also had some gripes with the paper.  First, new graph analytic programming models often aim to provide more than just programmability and performance.  Other features such as reliability, security, and ease of integration into an existing system should somehow be balanced against the overhead they introduce.  Second, we do not really need a new metric.  If the straightforward serial implementation (or, ideally, an optimized implementation) is used as the baseline, then it will be clear whether a line that “scales” well ever manages to exceed a speedup of 1 (see the sketch below).  And finally, we have a nitpick that not enough information was provided about the laptop on which the experiments were run.  The associated blog post (http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html) had more information, but one question we had was whether the laptop had significantly faster disk I/O than the cluster nodes in the published experiments.
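To illustrate the second point, here is a sketch of what reporting speedup against the simple serial baseline looks like; the timings are hypothetical.  A system whose curve "scales" nicely is only interesting once this number exceeds 1.

    # Speedup measured against the simple serial baseline rather than the
    # parallel system's own single-threaded run.  Timings are hypothetical.
    simple_serial_seconds = 30.0

    parallel_seconds = {1: 300.0, 16: 90.0, 64: 34.0, 256: 19.0}

    for cores, runtime in sorted(parallel_seconds.items()):
        baseline_speedup = simple_serial_seconds / runtime
        status = "faster than serial" if baseline_speedup > 1.0 else "still slower than serial"
        print(f"{cores:3d} cores: {baseline_speedup:.2f}x ({status})")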

We enjoyed the discussion, and we would like to thank the authors Frank McSherry, Michael Isard, and Derek Murray for sharing their research.