Waterloo Programming Languages Seminars

Previous terms: F07 W07 F06 S06 W06

Current term:

Date	Speaker	Title (click for abstract)	Location
Mon. Jan. 14, 2008 at 1:30 pm	Ayelet Wasik	Features of a Multi-Threaded Memory Allocator	DC 2314
Wed. Jan. 30, 2008 at 1:00 pm	Abid Malik	Optimal Basic Block Scheduling with out Spilling for Multi-issue Processors	DC 2314
Thu. Feb. 07, 2008 at 1:30 pm	Patrick Lam	Statically Verifying Software Safety Properties with Tracematches	DC 2314
Wed. Feb. 13, 2008 at 2:30 pm	Ghulam Lashari	A review of auto-vectorization techniques	DC 2314
Thu. Feb. 14, 2008 at 1:30 pm	Patrick Lam	continuation of talk from Feb 7	DC 1304
Fri. Mar. 07, 2008 at 3:00 pm	José Nelson Amaral	Fixing Performance Bugs Created by Good Programmers	DC 1304
Wed. Mar. 19, 2008 at 1:00 pm	Abid Malik	Using Machine Learning for Iterative Compiler Optimization	DC 1331
Wed. Mar. 26, 2008 at 2:30 pm	Ghulam Lashari	Control Flow Emulation on Tiled SIMD Architectures	DC 2314
Thu. Apr. 10, 2008 at 1:30 pm	Nomair Naeem	TBA	DC 2314

All are welcome.

Talk announcements are also posted to a public mailing list.

To volunteer to give a talk, contact Ondřej Lhoták. Talks can range from informal to formal, and they can be about any of the following:

completed research
research that you plan to start ("half-baked ideas" to be critiqued by your peers)
interesting paper(s) you have read

The purpose is for us all to see what people are interested in, and motivate discussions.

Abstracts

Features of a Multi-Threaded Memory Allocator

Ayelet Wasik
Jan. 14, 2008
Multi-processor computers are becoming increasingly popular and are important for improving application performance. Providing high-performance memory management is important for multi-threaded programs. This thesis looks at memory allocation of dynamic allocation memory in concurrent C and C++ programs. The challenges facing the design of any memory allocator include minimizing fragmentation, and promoting good locality. A multi-threaded memory-allocator is also concerned with minimizing contention, providing mutual exclusion, avoiding false-sharing, and avoiding heap-blowup (a form of fragmentation).

Several potential features are identified in existing multi-threaded memory-allocators. These features include per-thread heaps with a global heap, object ownership, object containers, thread-local free-list buffers, remote free-lists, allocation buffers, and lock-free operations. When used in different combinations, these features can solve most of the challenges facing a multi-threaded memory-allocator. Through the use of a test suite composed of both single and multi-threaded benchmark programs, several existing memory allocators and a set of new allocators are compared. It is determined that different features address different multi-threaded issues in the memory allocator with respect to performance, scaling, and fragmentation. Finally, I make recommendations for a general purpose memory allocator.

Optimal Basic Block Scheduling with out Spilling for Multi-issue Processors

Abid Malik
Jan. 30, 2008
The run-time performance of modern processors depends heavily on the compiler optimization phases. Typical compiler phases include instruction scheduling, which maximizes instruction level parallelism (ILP), and register allocation, which minimizes data spills to external memory. If ILP is maximized with out considering register constraints, high register pressure may results, leading to increased spill code and reduced run-time performance. In this work, I present a framework that gives a schedule of instructions for a given basic block with the minimum schedule length and register pressure less than or equal to to the available physical registers if it exists. I evaluated my approach for various idealized and realistic architectures using the SPEC2000 Benchmark. Experimental results show that my approach is able to solve basic blocks as large as 50 instructions, which constitutes 97.496% of all basic blocks in the benchmark, within 10 minutes. This compares favorably to recent work on integrated code generation using integer programming.

Statically Verifying Software Safety Properties with Tracematches

Patrick Lam
Feb. 07, 2008
Tracematches enable developers to declaratively specify safety properties for Java programs. These properties can be either domain-specific properties (e.g. compiler passes are always executed in the correct order) or generic properties (e.g. linked lists are never modified during iteration). While it is possible to dynamically monitor tracematches, such an approach 1) carries runtime overhead and 2) does not suggest reasonable recovery actions. We investigate a static analysis that promises to reduce runtime overhead and, in some cases, can guarantee that a program will never violate certain safety properties (eliminating the need for recovery actions). Our approach propagates tracematch states through methods to ensure that unsafe states are never triggered during an execution, thereby enabling the elimination of runtime checks. Pointer analysis turns out to be important in the static analysis of tracematches, and I will discuss the various analyses needed to make our overall static analysis work. This is joint work with Eric Bodden and Laurie Hendren.

A review of auto-vectorization techniques

Ghulam Lashari
Feb. 13, 2008
Parallel processors like the multicore CPUs and GPUs have become a commodity in the recent years. The GPUs are well-suited for the data-level parallelism, while the processing model of the multicore CPUs is naturally suited to task-level parallelism. In addition each of the many CPU cores has a vector unit suited to data-level parallelism. The parallelization and vectorization techniques from 70s are again in focus.

We review auto-vectorization techniques: (1) the use of dependence to ensure reordering of the statements is safe, (2) a vectorization algorithm, and (3) a wealth of loop transformations that have been studied to uncover and enhance vector parallelism. The choice of appropriate loop transformations, that are profitable and do not conflict with other useful transformations, is challenging and important to find better parallelism. Another challenge for vectorization is control flow that introduces a new constraint for the transformations---not handled by data dependence.

Fixing Performance Bugs Created by Good Programmers

José Nelson Amaral
Mar. 07, 2008
Good programming and software engineering practices encourage the semantically-meaningful encapsulation of data into objects or data structures. These practices also encourage the design and implementation of relatively small, self contained, modules. While these practices increase programmer's productivity and reduce the cost of writing and maintaining software, they also may incur a significant performance penalty.

The semantically-meaningful encapsulation of data into aggregated data structures yields a data layout that diminishes the amount of locality in the data and can have a significant performance penalty in computer architectures with several levels of caches. Automatic data restructuring combined with memory pooling are two techniques that optimizing compilers can use to re-arrange the data layout in memory in order to regain the performance lost due to encapsulation. This talk describes recent results from data structure splitting in the IBM XL compiler suite.

José Nelson Amaral is a Professor, and the Associate Chair for Graduate Studies, in the Department of Computing Science at the University of Alberta. He received a Ph.D. in Electrical and Computer Engineering from the University of Texas at Austin in 1994. His current research focus on the design and implementation of optimizing compilers, parallel programming models, and high-performance computing. He has published extensively in these and other areas, and has participated in the organization and program committees of several international conferences. He is the 2008 Global Chair for Parallel and Distributed Programing for Euro-Par and the General and Program Chair for the 2008 Languages and Compilers for Parallel Computing (LCPC) workshop. His group has maintained intense collaboration with the IBM compiler development group in Toronto. In 2006 his team was selected for the Project Team of the Year award by the Centre for Advanced Studies.

Using Machine Learning for Iterative Compiler Optimization

Abid Malik
Mar. 19, 2008
Program-specific or function-specific optimization phase sequences are universally accepted to achieve better overall performance than any fixed optimization phase ordering. A number of heuristic phase order space search algorithms have been devised to find customized phase orderings achieving high performance for each functions. This, however, is at the cost of large numbers of evaluations of the program.

In the talk, I will review an approach by F. Agakov, G. Fursin and M. O Boyle. They use machine learning for iterative compiler optimization. The approach is relatively fast and is independent of search algorithm, search space or compiler infrastructure and scales gracefully with the compiler optimization space size.

Control Flow Emulation on Tiled SIMD Architectures

Ghulam Lashari
Mar. 26, 2008
Heterogeneous multi-core and streaming architectures such as the GPU, Cell, ClearSpeed, and Imagine processors have better power/performance ratios and memory bandwidth than traditional architectures. These types of processors are increasingly being used to accelerate compute-intensive applications. Their performance advantage is achieved by using multiple SIMD processor cores but limiting the complexity of each core, and by combining this with a simplified memory system. In particular, these processors generally avoid the use of cache coherency protocols and may even omit general-purpose caches, opting for restricted caches or explictly managed local memory.

We show how control flow can be emulated on such tiled SIMD architectures and how memory access can be organized to avoid the need for a general-purpose cache and to tolerate long memory latencies. Our technique uses streaming execution and multipass partitioning. Our prototype targets GPUs. On GPUs the memory system is deeply pipelined and caches for read and write are not coherent, so reads and writes may not use the same memory locations simultaneously. This requires the use of double-buffered streaming. We emulate general control flow in a way that is transparent to the programmer and include specific optimizations in our approach that can deal with double-buffering.