14th Compiler-Driven Performance Workshop

Wednesday, November 4, 2015

Hilton Suites Toronto/Markham Conference Centre

Associated with CASCON 2015


Program


10:00-10:30 The Known Object Table: Compile-time reasoning about object identities
Patrick Doyle, Christopher Black - IBM Canada

10:30-10:45 Coffee

10:45-11:15 Asymmetric Blockizing in Thread-Level Speculation
Alexander Krolik, Zhen Cao, Clark Verbrugge - McGill University
11:15-11:45 Loop Thread Level Speculation on IBM POWER8 and Intel TSX
Juan Salamanca Guillen (1), Guido Araujo (1), Jose Nelson Amaral (2) - (1) Universidade de Campinas, Brazil; (2) University of Alberta, Canada

11:45-11:55 Break

11:55-12:25 RECU: Rochester Elastic Cache Utility
Chencheng Ye (1,2), Jack Brock (1), Chen Ding (1), Hai Jin (2) - (1) University of Rochester; (2) Huazhong University of Science and Technology
12:25-12:55 RubyBOP: Safe Parallel Ruby
Chen Ding, Jacob Bisnett, Benjamin O'Halloran, Cesar De Hoyos, Brian Gernhardt - University of Rochester

13:00-14:00 Lunch

14:00-14:30 Free Launch: Optimizing GPU Dynamic Kernel Launches through Thread Reuse
Guoyang Chen, Xipeng Shen - North Carolina State University
14:30-15:00 SIMD and GPU Support in IBM Java Just-in-Time Compiler
Jimmy Kwa, Aayush Prakash, Alon Shalev Housfater, Kazuaki Ishizaki, Gita Koblents - IBM Canada
15:00-15:30 Data-centric Combinatorial Optimization of Parallel Code
Hao Luo, Guoyang Chen, Pengcheng Li, Chen Ding, Xipeng Shen - University of Rochester

15:30-15:45 Coffee

15:45-16:15 Giving New Capabilities to Language Runtimes
Mark Stoodley, Daryl Maier, Matthew Gaudet - IBM Canada


The Known Object Table: Compile-time reasoning about object identities

Patrick Doyle, Christopher Black - IBM Canada

Dynamic compilers have the potential to achieve tremendous performance improvements if they are permitted to specialize generated code for a particular object. Values of final primitive fields can be folded; object relationships can be inferred; and methods can be devirtualized and inlined. In this talk, we will describe the Known Object Table, the core data structure used by IBM's Java JIT compiler to reason about object identities and contents at compile time. Developed to support optimization of JSR 292 MethodHandle chains, the Known Object Table has the potential for broad application to other areas.
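
To make the idea concrete, here is a minimal C++ sketch of what such a table might look like; all names are invented for illustration, and this is not IBM's implementation. The point is that once an object's identity is pinned under a compile-time index, later passes can query the table to fold fields or devirtualize calls.

    #include <cstdint>
    #include <vector>

    struct Object;                      // stand-in for a VM heap object
    using KnownObjectIndex = int32_t;   // compile-time handle to one identity

    class KnownObjectTable {
        std::vector<Object*> objects_;  // references kept alive for the compilation
    public:
        // Record an object whose identity the compiler has proven, e.g. a
        // MethodHandle reachable from a constant. Same identity, same index.
        KnownObjectIndex add(Object* obj) {
            for (size_t i = 0; i < objects_.size(); ++i)
                if (objects_[i] == obj) return (KnownObjectIndex)i;
            objects_.push_back(obj);
            return (KnownObjectIndex)(objects_.size() - 1);
        }
        // Later passes query the pinned object to fold its final fields or
        // to devirtualize and inline calls against its exact class.
        Object* get(KnownObjectIndex i) const { return objects_[i]; }
    };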


Asymmetric Blockizing in Thread-Level Speculation

Alexander Krolik, Zhen Cao, Clark Verbrugge - McGill University

Software-based Thread-Level Speculation (TLS) requires that speculative threads executing ahead of normal execution be isolated from main memory until validation. The resulting read/write buffering, however, means speculative execution proceeds more slowly than unbuffered, non-speculative execution, resulting in poor parallel coverage of loops as the non-speculative thread catches up to and prematurely joins with its speculative children. In this work we investigate an "asymmetric blockizing" strategy, modifying speculative code generation to better partition loops, balancing the load under an assumed range of constant-factor slowdowns on speculative threads. An implementation within the LLVM-based MUTLS system shows that a significant benefit is possible for memory-intensive benchmarks, although it depends on relatively precise, per-benchmark estimates of the memory access rate that induces the buffering slowdown.
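
The load-balancing arithmetic itself is simple to sketch. Assuming, as in the abstract, one non-speculative thread at full speed and speculative threads slowed by a constant factor, a hypothetical block-size computation (in C++, not MUTLS code) looks like:

    #include <cstdio>
    #include <vector>

    // Give each thread a block proportional to its effective speed:
    // 1 for the non-speculative thread, 1/slowdown for speculative ones.
    std::vector<long> asymmetric_blocks(long iterations, int nthreads, double slowdown) {
        double total_speed = 1.0 + (nthreads - 1) / slowdown;
        std::vector<long> blocks(nthreads);
        long assigned = 0;
        for (int t = 1; t < nthreads; ++t) {
            blocks[t] = (long)(iterations * (1.0 / slowdown) / total_speed);
            assigned += blocks[t];
        }
        blocks[0] = iterations - assigned;  // non-speculative thread takes the rest
        return blocks;
    }

    int main() {
        // e.g. 1,000,000 iterations, 4 threads, speculation 3x slower:
        for (long b : asymmetric_blocks(1000000, 4, 3.0))
            printf("%ld\n", b);  // roughly 500000, 166666, 166666, 166666
    }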


Loop Thread Level Speculation on IBM POWER8 and Intel TSX

Juan Salamanca Guillen (1), Guido Araujo (1), Jose Nelson Amaral (2) - (1) Universidade de Campinas, Brazil; (2) University of Alberta, Canada

Thread-Level Speculation (TLS) is a hardware/software technique that enables the correct parallel execution of loops even in the presence of loop-carried dependences. However, TLS can lead to significant overhead if the loops have a high probability of loop-carried dependences. Efficient TLS requires hardware features to detect data conflicts, buffer speculative stores, and roll back transactions. Previous studies proposed advanced features such as ordered transactions in hardware, data forwarding, and multi-version caches. Current implementations of Hardware Transactional Memory (HTM), such as Intel TSX and IBM POWER8, support the three main features for TLS, but none of the advanced ones. Thus, it is important to evaluate TLS performance on these HTMs for loops identified as amenable by previous studies, and to investigate what features would be necessary to achieve speedups in these loops. In our work, we studied TLS on Intel TSX and IBM POWER8, parallelizing with TLS some loops from the SPEC CPU2006 benchmarks. The experimental evaluation shows that false sharing is one of the main causes of data conflicts in the evaluated benchmarks, and that it is not always possible to remove false sharing with strip mining, because other problems can appear: capacity aborts, or false sharing due to cache-line prefetching or non-sequential writes. The results show that TLS achieves speedups of up to 30% in some loops on POWER8, where we implemented ordered transactions with the tsuspend/tresume instructions. Our research suggests that future hardware should support a multi-version cache to overcome the WAR and WAW loop-carried dependences that are a source of data conflicts on current HTMs.
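
For readers unfamiliar with the mechanics, the skeleton below shows the per-strip transactional pattern on Intel TSX (RTM intrinsics from immintrin.h, compiled with -mrtm). It is a single-thread sketch of the strip-mining structure only: real TLS runs strips on several threads and, as the abstract argues, needs ordered commits that current HTMs do not provide.

    #include <immintrin.h>

    // Each strip of the loop runs as one hardware transaction; the strip
    // size trades off capacity aborts (too large) against false sharing
    // on cache lines shared between strips (too small).
    void tls_strip(double *a, long n, long strip) {
        for (long base = 0; base < n; base += strip) {
            long end = base + strip < n ? base + strip : n;
            unsigned status = _xbegin();
            if (status == _XBEGIN_STARTED) {
                for (long i = base; i < end; ++i)
                    a[i] = a[i] * 2.0 + 1.0;   // speculative body
                _xend();                        // commit the whole strip
            } else {
                // Abort path: data conflict, capacity overflow, or the
                // false-sharing effects described above. Re-execute the
                // strip non-speculatively.
                for (long i = base; i < end; ++i)
                    a[i] = a[i] * 2.0 + 1.0;
            }
        }
    }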


RECU: Rochester Elastic Cache Utility

Chencheng Ye (1,2), Jack Brock (1), Chen Ding (1), Hai Jin (2) - (1) University of Rochester; (2) Huazhong University of Science and Technology

When renting computing power, fairness and overall performance are both important to customers and service providers. However, strict fairness usually results in poor performance. In this paper, we study this trade-off: in our experiments, equal cache partitioning results in 131% higher miss ratios than optimal partitioning. In order to balance fairness and performance, we propose two elastic, or movable, cache allocation baselines: the Elastic Miss Ratio Baseline (EMB) and the Elastic Cache Space Baseline (ECB). Furthermore, we study optimal partitions for each baseline at different levels of elasticity, and show that EMB is more effective than ECB. We also classify programs from the SPEC CPU2006 benchmark suite by how they benefit or suffer under the elastic baselines, and identify the essential information customers and service providers need to choose a baseline.
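
As a toy illustration of the trade-off (our reading of an elastic miss-ratio baseline, not the paper's exact EMB formulation): start from the equal partition and accept any repartitioning in which no program's miss ratio rises more than a fixed elasticity above its equal-share value. With hypothetical miss-ratio curves:

    #include <cstdio>

    int main() {
        const int WAYS = 16;
        // mr[p][w]: miss ratio of program p when given w cache ways
        // (hypothetical curves, for illustration only).
        double mr[2][WAYS + 1];
        for (int w = 0; w <= WAYS; ++w) {
            mr[0][w] = 1.0 / (1 + w);        // cache-sensitive program
            mr[1][w] = 0.30 - 0.001 * w;     // nearly cache-insensitive program
        }
        double elasticity = 0.05;            // allowed miss-ratio increase
        double base0 = mr[0][WAYS / 2], base1 = mr[1][WAYS / 2];

        int best = WAYS / 2;
        double bestAvg = (base0 + base1) / 2;
        for (int w0 = 1; w0 < WAYS; ++w0) { // w0 ways to program 0, rest to 1
            double m0 = mr[0][w0], m1 = mr[1][WAYS - w0];
            if (m0 > base0 + elasticity || m1 > base1 + elasticity)
                continue;                    // repartition violates the baseline
            double avg = (m0 + m1) / 2;
            if (avg < bestAvg) { bestAvg = avg; best = w0; }
        }
        printf("program 0 gets %d of %d ways; average miss ratio %.3f\n",
               best, WAYS, bestAvg);
    }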


RubyBOP: Safe Parallel Ruby

Chen Ding, Jacob Bisnett, Benjamin O'Halloran, Cesar De Hoyos, Brian Gernhardt - University of Rochester

Parallel programming is difficult in Ruby because of the complexity of its interpreter.

This talk will present the design of safe parallel Ruby, which extends the Ruby language with parallel programming constructs that guarantee sequential semantics, i.e. no race conditions or non-determinism.

It will describe the programming interface and the safe implementation through process-based speculation, parallel heaps, and memory-based access monitoring. In addition to the solutions, the talk will show benefits of parallel Ruby such as modularity and simplicity of design.

Safe parallel Ruby adds two new constructs: a PPR (possibly parallel region) block and an Ordered block. Each block acts as a hint or suggestion to the interpreter: a PPR block suggests possible parallelism, while an Ordered block suggests a possible dependence. A programmer parallelizes a Ruby script by inserting these two types of hints, and the runtime uses speculation to ensure that a script with the hints produces the same result as it would without them. In the worst case, execution time is that of running the program sequentially.
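
To ground "process-based speculation" and "parallel heaps", the POSIX sketch below (in C++, under our own simplifying assumptions, not RubyBOP code) shows the core trick: fork() gives the speculative task a copy-on-write heap, so its writes are isolated for free, and the parent decides from the child's verdict whether to commit or re-execute. A real runtime also needs page-level access monitoring and a channel to copy committed state back.

    #include <cstdio>
    #include <sys/wait.h>
    #include <unistd.h>

    int main() {
        pid_t spec = fork();
        if (spec == 0) {
            // Child: speculative execution of the possibly parallel region.
            // fork() gave it a copy-on-write heap, so its writes are isolated.
            int conflict = 0;              // a real runtime monitors page accesses
            printf("speculating in process %d\n", (int)getpid());
            _exit(conflict ? 1 : 0);       // vote: commit or abort
        }
        // Parent: continues past the region in parallel with the child.
        printf("parent continues in process %d\n", (int)getpid());
        int status = 0;
        waitpid(spec, &status, 0);
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            printf("speculation succeeded; copy results back and commit\n");
        else
            printf("speculation aborted; re-execute the region sequentially\n");
    }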


Free Launch: Optimizing GPU Dynamic Kernel Launches through Thread Reuse

Guoyang Chen, Xipeng Shen - North Carolina State University

The support of dynamic kernel launches is a recent addition to GPU programming models. It helps meet the needs of applications with irregular or dynamic parallelism, and has demonstrated great benefits for GPU programming productivity in such applications. However, due to the massive parallelism of GPUs, the support is expensive, incurring large runtime overhead and seriously limiting use of the feature in practice.

In this work, we propose kernel launch removal, a compiler technique for addressing the runtime overhead of dynamic kernel launches. Through program transformations, the technique automatically reuses parent threads to handle the tasks that would otherwise be handled by newly created child threads. The reuse eliminates the dynamic kernel launches, and hence their large runtime overhead, providing speedups of up to hundreds of times for GPU kernels. Meanwhile, because it is an automatic code transformation, programmers can still write code with dynamic kernel launches, enjoying the productivity benefits of the feature without suffering the runtime overhead.
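
The flavor of the transformation can be sketched in CUDA C++ (illustrative code, not the paper's compiler output; dynamic launches require compute capability 3.5+ and relocatable device code). Before, each parent thread launches a child grid; after, the parent thread loops over the child's tasks itself:

    __device__ void process(int task) { /* per-task work (placeholder) */ }

    // Before: each parent thread dynamically launches a child grid.
    __global__ void child(int first, int ntasks) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t < ntasks) process(first + t);
    }
    __global__ void parent_dynamic(const int *first, const int *count, int nparents) {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= nparents || count[p] == 0) return;
        child<<<(count[p] + 63) / 64, 64>>>(first[p], count[p]); // costly launch
    }

    // After: the parent thread is reused for the child's tasks, so the
    // launch disappears (the paper's transformation also rebalances tasks
    // across parent threads; this shows only the simplest reuse).
    __global__ void parent_reuse(const int *first, const int *count, int nparents) {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= nparents) return;
        for (int t = 0; t < count[p]; ++t)
            process(first[p] + t);                                // no launch
    }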


SIMD and GPU Support in IBM Java Just-in-Time Compiler

Jimmy Kwa, Aayush Prakash, Alon Shalev Housfater, Kazuaki Ishizaki, Gita Koblents - IBM Canada

Supporting SIMD and GPUs in Java poses additional challenges compared with static languages such as C/C++ or Fortran. Java's dynamic nature, precise exception semantics, and write-once-run-everywhere requirement make it difficult to exploit these hardware features, regardless of whether we choose an implicit or an explicit approach. With the implicit approach, the compiler either optimizes known code patterns or applies auto-parallelization techniques to determine whether an arbitrary code sequence can be vectorized or executed on a GPU. With the explicit approach, the programmer has some control over which sections of code are parallelized, using a provided programming model. However, automatic parallelization is known to be both expensive and limited, and a new programming model might not be portable. In addition, both approaches need to handle Java exceptions properly.

In this presentation, we will describe our experience implementing SIMD and GPU support in the IBM Java 8 Just-in-Time compiler, discuss some new technical issues we encountered, and outline our future plans in this area.
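
One way to see why precise exceptions complicate vectorization is loop versioning, sketched here in C++ (a generic illustration of the idea, not the IBM JIT's internals): prove all bounds up front and run a check-free, vectorizable fast path, or keep per-iteration checks so an exception surfaces at exactly the right iteration.

    #include <cstddef>
    #include <stdexcept>

    void scale(float *a, size_t a_len, const float *b, size_t b_len, size_t n) {
        if (n <= a_len && n <= b_len) {
            for (size_t i = 0; i < n; ++i)   // no checks: compiler may use SIMD
                a[i] = b[i] * 2.0f;
        } else {
            for (size_t i = 0; i < n; ++i) { // slow path preserves exception point
                if (i >= a_len || i >= b_len)
                    throw std::out_of_range("array index");
                a[i] = b[i] * 2.0f;
            }
        }
    }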


Data-centric Combinatorial Optimization of Parallel Code

Hao Luo, Guoyang Chen, Pengcheng Li, Chen Ding, Xipeng Shen - University of Rochester

Memory performance is essential for tapping the full potential of the massive parallelism of GPUs, which has motivated recent efforts in GPU cache performance optimization. This paper presents a new technique to model and optimize GPU cache performance. The technique is based on a composable cache model: it accurately predicts the performance of all possible data placements of a program by profiling the program just once. It predicts the behavior of set-associative caches under different management policies, including LRU caches as well as sector caches. Evaluation on 13 GPU benchmarks shows that the technique has unprecedented time efficiency and consistently high accuracy (99.3%): it is a hundred times faster or four times as accurate as two previous techniques. When applied to guide data placement optimizations, the new model boosts the average speedup from 9% to 19%, opening new opportunities for understanding and improving GPU memory performance.
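
The "profile once, predict all" idea can be illustrated with a much simpler locality model than the paper's (a sketch, using time distance as a stand-in for the stack or footprint measures a real model would use): one pass over a trace builds a reuse-distance histogram, from which the miss count for any cache size is then read off without re-profiling.

    #include <cstdio>
    #include <unordered_map>
    #include <vector>

    int main() {
        std::vector<int> trace = {1, 2, 3, 1, 2, 4, 1, 3, 2, 4};

        std::unordered_map<int, long> last_use;  // datum -> last access time
        std::vector<long> reuse_hist;            // histogram over reuse distances
        for (long t = 0; t < (long)trace.size(); ++t) {
            auto it = last_use.find(trace[t]);
            if (it != last_use.end()) {
                long d = t - it->second;
                if ((size_t)d >= reuse_hist.size()) reuse_hist.resize(d + 1, 0);
                reuse_hist[d]++;
            }
            last_use[trace[t]] = t;
        }
        // Predict misses for several cache sizes from the single profile:
        // a reuse farther apart than the cache size is counted as a miss.
        for (long cache = 1; cache <= 4; ++cache) {
            long misses = (long)last_use.size();     // cold misses
            for (size_t d = 0; d < reuse_hist.size(); ++d)
                if ((long)d > cache) misses += reuse_hist[d];
            printf("cache %ld: predicted misses %ld\n", cache, misses);
        }
    }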


Giving New Capabilities to Language Runtimes

Mark Stoodley, Daryl Maier, Matthew Gaudet - IBM Canada

Imagine you have a virtual machine full of powerful technology: Just-in-Time compilation, Garbage Collection, VM monitoring, and more. What happens when you start to unlock the technology inside that virtual machine by separating out the language-specific parts? Then what happens when you plan to open-source these technologies? You get the ability to experiment!

This talk covers a set of experiments in which IBM has tested language-agnostic runtime technologies inside CRuby and other language runtimes, including a GC, a JIT, and more, all while still running real Ruby applications, including Rails. We share the results of these experiments, describe how we connected to CRuby, and start a conversation about how this may one day become part of anyone's language runtime.