|Date|Speaker|Title|Location|
|---|---|---|---|
|Mon. Jan. 14, 2008 at 1:30 pm|Ayelet Wasik|Features of a Multi-Threaded Memory Allocator|DC 2314|
|Wed. Jan. 30, 2008 at 1:00 pm|Abid Malik|Optimal Basic Block Scheduling without Spilling for Multi-issue Processors|DC 2314|
|Thu. Feb. 07, 2008 at 1:30 pm|Patrick Lam|Statically Verifying Software Safety Properties with Tracematches|DC 2314|
|Wed. Feb. 13, 2008 at 2:30 pm|Ghulam Lashari|A review of auto-vectorization techniques|DC 2314|
|Thu. Feb. 14, 2008 at 1:30 pm|Patrick Lam|continuation of talk from Feb. 7|DC 1304|
|Fri. Mar. 07, 2008 at 3:00 pm|José Nelson Amaral|Fixing Performance Bugs Created by Good Programmers|DC 1304|
|Wed. Mar. 19, 2008 at 1:00 pm|Abid Malik|Using Machine Learning for Iterative Compiler Optimization|DC 1331|
|Wed. Mar. 26, 2008 at 2:30 pm|Ghulam Lashari|Control Flow Emulation on Tiled SIMD Architectures|DC 2314|
|Thu. Apr. 10, 2008 at 1:30 pm|Nomair Naeem|TBA|DC 2314|
All are welcome.
Talk announcements are also posted to a public mailing list.
To volunteer to give a talk, contact Ondřej Lhoták. Talks can range from informal to formal, and they can be about any of the following:
The purpose is for us all to see what people are interested in, and to motivate discussion.
Features of a Multi-Threaded Memory Allocator
Jan. 14, 2008
Several potential features are identified in existing multi-threaded memory allocators. These features include per-thread heaps with a global heap, object ownership, object containers, thread-local free-list buffers, remote free-lists, allocation buffers, and lock-free operations. When used in different combinations, these features can solve most of the challenges facing a multi-threaded memory allocator. Using a test suite composed of both single- and multi-threaded benchmark programs, several existing memory allocators and a set of new allocators are compared. It is determined that different features address different multi-threaded issues in the memory allocator with respect to performance, scaling, and fragmentation. Finally, I make recommendations for a general-purpose multi-threaded memory allocator.
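One of the features listed above, thread-local free-list buffers in front of a shared global heap, can be sketched in a simplified model. This is only an illustration of the idea (Python stands in for the C a real allocator would use; the class and constant names are invented for this sketch):

```python
import threading

class GlobalHeap:
    """Shared pool of free blocks, protected by a single lock."""
    def __init__(self, nblocks):
        self.lock = threading.Lock()
        self.free = list(range(nblocks))

    def take(self, n):
        with self.lock:
            batch, self.free = self.free[:n], self.free[n:]
            return batch

    def give(self, blocks):
        with self.lock:
            self.free.extend(blocks)

class ThreadHeap:
    """Per-thread heap: allocations hit the thread-local free list first,
    touching the global heap (and its lock) only to refill or flush."""
    BATCH = 8       # refill granularity: one lock acquisition per BATCH allocs
    FLUSH_AT = 16   # flush back to the global heap when the local list grows

    def __init__(self, global_heap):
        self.g = global_heap
        self.local_free = []

    def alloc(self):
        if not self.local_free:
            self.local_free = self.g.take(self.BATCH)
        return self.local_free.pop() if self.local_free else None

    def free(self, block):
        self.local_free.append(block)
        if len(self.local_free) > self.FLUSH_AT:
            self.g.give(self.local_free[:self.BATCH])
            del self.local_free[:self.BATCH]
```

Most allocations and frees then proceed without any synchronization, which is the point of the feature; the remote free-lists and lock-free operations mentioned in the abstract address the harder case of freeing a block on a different thread than allocated it, and are not modeled here.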
Optimal Basic Block Scheduling without Spilling for Multi-issue Processors
Jan. 30, 2008
The run-time performance of modern processors depends heavily on the compiler optimization phases. Typical compiler phases include instruction scheduling, which maximizes instruction-level parallelism (ILP), and register allocation, which minimizes data spills to external memory. If ILP is maximized without considering register constraints, high register pressure may result, leading to increased spill code and reduced run-time performance. In this work, I present a framework that, for a given basic block, finds a schedule with the minimum length and with register pressure no greater than the number of available physical registers, if such a schedule exists. I evaluated my approach for various idealized and realistic architectures using the SPEC2000 benchmarks. Experimental results show that my approach is able to solve basic blocks as large as 50 instructions, which constitute 97.496% of all basic blocks in the benchmark, within 10 minutes. This compares favorably to recent work on integrated code generation using integer programming.
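To make the tension concrete, the following sketch schedules a basic-block dependence DAG greedily while tracking register pressure, refusing to issue an instruction once live values would exceed the register limit. This is only a heuristic illustration of the problem, not the optimal enumeration the talk describes (which would backtrack or search rather than fail):

```python
def list_schedule(instrs, num_regs):
    """Greedy list scheduler for a basic block. `instrs` maps each
    instruction name to the set of instruction names it uses; each
    instruction is assumed to define exactly one value (illustrative model).
    Raises if the greedy choice would exceed `num_regs` live values."""
    scheduled, live = [], set()
    remaining = dict(instrs)
    # users[d] = instructions that consume the value defined by d
    users = {d: {i for i, u in instrs.items() if d in u} for d in instrs}
    while remaining:
        ready = [i for i, uses in remaining.items()
                 if all(u in scheduled for u in uses)]

        def pressure_after(i):
            # operands whose last remaining use is i die when i issues
            freed = {u for u in remaining[i]
                     if users[u] <= set(scheduled) | {i}}
            return len((live - freed) | {i})

        ready.sort(key=pressure_after)   # prefer pressure-reducing choices
        i = ready[0]
        if pressure_after(i) > num_regs:
            raise RuntimeError("schedule would spill")
        freed = {u for u in remaining[i] if users[u] <= set(scheduled) | {i}}
        live.difference_update(freed)
        live.add(i)
        scheduled.append(i)
        del remaining[i]
    return scheduled
```

With two registers the three-instruction block `c = f(a, b)` schedules fine; with one register it cannot be issued without spilling, which is exactly the situation the talk's framework resolves optimally.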
Statically Verifying Software Safety Properties with Tracematches
Feb. 07, 2008
Tracematches enable developers to declaratively specify safety properties for Java programs. These properties can be either domain-specific properties (e.g. compiler passes are always executed in the correct order) or generic properties (e.g. linked lists are never modified during iteration). While it is possible to dynamically monitor tracematches, such an approach 1) carries runtime overhead and 2) does not suggest reasonable recovery actions. We investigate a static analysis that promises to reduce runtime overhead and, in some cases, can guarantee that a program will never violate certain safety properties (eliminating the need for recovery actions). Our approach propagates tracematch states through methods to ensure that unsafe states are never triggered during an execution, thereby enabling the elimination of runtime checks. Pointer analysis turns out to be important in the static analysis of tracematches, and I will discuss the various analyses needed to make our overall static analysis work. This is joint work with Eric Bodden and Laurie Hendren.
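The generic property mentioned above ("linked lists are never modified during iteration") amounts to matching a regular pattern over a trace of events. Real tracematches are declarative AspectJ patterns matched per variable binding; the tiny hand-written monitor below (all names invented for this sketch) only illustrates the underlying state-machine idea that the static analysis propagates through methods:

```python
class ModifiedDuringIteration:
    """Trace monitor for one collection/iterator pair.
    Events: "iter" (iterator created), "update" (collection modified),
    "next" (iterator advanced). The violating pattern is:
        iter  update+  next
    i.e. advancing an iterator after its collection changed."""
    def __init__(self):
        self.state = "start"   # start -> iterating -> modified

    def event(self, e):
        if e == "iter":
            self.state = "iterating"
        elif e == "update" and self.state in ("iterating", "modified"):
            self.state = "modified"
        elif e == "next" and self.state == "modified":
            return "violation"
        return None
```

The static analysis in the talk tries to prove that the "violation" transition can never fire on any execution, which lets the compiled-in runtime checks be removed entirely.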
A review of auto-vectorization techniques
Feb. 13, 2008
Parallel processors such as multicore CPUs and GPUs have become commodities in recent years. GPUs are well suited to data-level parallelism, while the processing model of multicore CPUs is naturally suited to task-level parallelism. In addition, each of the many CPU cores has a vector unit suited to data-level parallelism. As a result, the parallelization and vectorization techniques of the 1970s are again in focus.
We review auto-vectorization techniques: (1) the use of dependence analysis to ensure that reordering statements is safe, (2) a vectorization algorithm, and (3) a wealth of loop transformations that have been studied to uncover and enhance vector parallelism. Choosing loop transformations that are profitable and do not conflict with other useful transformations is challenging and important for finding better parallelism. Another challenge for vectorization is control flow, which introduces a constraint on the transformations that data dependence alone does not capture.
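For point (1), the simplest case of a dependence test can be sketched directly. For a single-statement loop `a[i + w] = g(a[i + r])`, the loop-carried dependence distance is `w - r`, and executing a vector of iterations at once is safe only if the vector length does not exceed that distance. This is a deliberately minimal sketch (one statement, unit stride), not a full dependence framework:

```python
def max_safe_vector_length(write_offset, read_offset):
    """For the loop:  for i: a[i + write_offset] = g(a[i + read_offset])
    Returns the largest safe vector length, or None if any length is safe.

    A write at iteration i lands where the read of iteration
    i + (write_offset - read_offset) will look, so a positive distance d
    is a loop-carried flow dependence: at most d iterations may run at
    once. Distance <= 0 means no flow dependence is carried (reads happen
    before writes within a vector operation), so any length is safe."""
    d = write_offset - read_offset
    if d <= 0:
        return None
    return d
```

For example, `a[i+4] = a[i]` vectorizes with length at most 4, while `a[i] = a[i+1]` (an anti-dependence) vectorizes at any length because the vector loads complete before the stores.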
Fixing Performance Bugs Created by Good Programmers
José Nelson Amaral
Mar. 07, 2008
Good programming and software engineering practices encourage the semantically meaningful encapsulation of data into objects or data structures. These practices also encourage the design and implementation of relatively small, self-contained modules. While these practices increase programmers' productivity and reduce the cost of writing and maintaining software, they may also incur a significant performance penalty.
The semantically meaningful encapsulation of data into aggregated data structures yields a data layout that diminishes locality in the data and can carry a significant performance penalty on computer architectures with several levels of caches. Automatic data restructuring and memory pooling are two techniques that optimizing compilers can use to rearrange the data layout in memory and regain the performance lost to encapsulation. This talk describes recent results from data structure splitting in the IBM XL compiler suite.
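The layout change behind data structure splitting can be illustrated as the classic hot/cold field split. The compiler performs this on C structs; Python lists merely stand in for the two memory layouts here, and the field names are invented for the sketch:

```python
# Encapsulated ("array of structs") layout: the hot field `key` is
# interleaved with a cold payload, so a scan over keys also drags the
# payloads through the cache.
records_aos = [{"key": k, "payload": "x" * 64} for k in range(100)]

# After splitting: hot keys are contiguous, cold payloads live elsewhere.
keys = [r["key"] for r in records_aos]          # hot array
payloads = [r["payload"] for r in records_aos]  # cold array

def scan_split(keys, needle):
    """The hot loop now touches only the dense key array; a matching
    payload is fetched by index only when actually needed."""
    for i, k in enumerate(keys):
        if k == needle:
            return i
    return -1
```

The program's meaning is unchanged, which is why this is legal for the compiler to do automatically once it proves no code depends on the original layout.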
José Nelson Amaral is a Professor, and the Associate Chair for
Graduate Studies, in the Department of Computing Science at the
University of Alberta. He received a Ph.D. in Electrical and Computer
Engineering from the University of Texas at Austin in 1994. His
current research focuses on the design and implementation of optimizing
compilers, parallel programming models, and high-performance
computing. He has published extensively in these and other areas, and
has participated in the organization and program committees of several
international conferences. He is the 2008 Global Chair for Parallel
and Distributed Programming for Euro-Par and the General and Program
Chair for the 2008 Languages and Compilers for Parallel Computing
(LCPC) workshop. His group has maintained intense collaboration with
the IBM compiler development group in Toronto. In 2006 his team was
selected for the Project Team of the Year award by the IBM Centre for Advanced Studies.
Using Machine Learning for Iterative Compiler Optimization
Mar. 19, 2008
Program-specific or function-specific optimization phase sequences are widely accepted to achieve better overall performance than any fixed optimization phase ordering. A number of heuristic phase-order space search algorithms have been devised to find customized phase orderings that achieve high performance for each function. This, however, comes at the cost of a large number of evaluations of the program.
In this talk, I will review an approach by F. Agakov, G. Fursin, and M. O'Boyle, who use machine learning for iterative compiler optimization. The approach is relatively fast, is independent of the search algorithm, search space, and compiler infrastructure, and scales gracefully with the size of the compiler optimization space.
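The baseline that such an approach improves on can be sketched as plain random search over phase orderings, where each candidate ordering costs one compile-and-run of the program. The function and parameter names below are invented for illustration; the reviewed work's contribution is a learned model that focuses this search on promising orderings, cutting the number of expensive evaluations:

```python
import random

def search_phase_order(phases, evaluate, trials=50, seed=0):
    """Random search over optimization phase orderings.
    `evaluate` returns a cost for one ordering (in the real setting,
    the measured runtime of the program compiled with that ordering);
    every call is therefore expensive, which is what the talk's
    machine-learning approach aims to reduce."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(trials):
        order = phases[:]
        rng.shuffle(order)
        cost = evaluate(order)
        if cost < best_cost:
            best, best_cost = order, cost
    return best, best_cost
```

Even for a handful of phases the space of orderings grows factorially, so 50 trials sample only a sliver of it; a model trained on earlier programs can predict which orderings are worth evaluating at all.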
Control Flow Emulation on Tiled SIMD Architectures
Mar. 26, 2008
Heterogeneous multi-core and streaming architectures such as the GPU, Cell, ClearSpeed, and Imagine processors have better power/performance ratios and memory bandwidth than traditional architectures. These types of processors are increasingly being used to accelerate compute-intensive applications. Their performance advantage is achieved by using multiple SIMD processor cores but limiting the complexity of each core, and by combining this with a simplified memory system. In particular, these processors generally avoid the use of cache coherency protocols and may even omit general-purpose caches, opting for restricted caches or explicitly managed local memory.
We show how control flow can be emulated on such tiled SIMD architectures and how memory access can be organized to avoid the need for a general-purpose cache and to tolerate long memory latencies. Our technique uses streaming execution and multipass partitioning. Our prototype targets GPUs. On GPUs the memory system is deeply pipelined and caches for read and write are not coherent, so reads and writes may not use the same memory locations simultaneously. This requires the use of double-buffered streaming. We emulate general control flow in a way that is transparent to the programmer and include specific optimizations in our approach that can deal with double-buffering.
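The basic building block of control-flow emulation on SIMD hardware is predication: every lane evaluates both sides of a branch, and a per-lane select blends the results. The sketch below shows only this building block (lists stand in for SIMD lanes; names are invented); the talk's multipass partitioning generalizes the idea to loops and to tiles streamed across double-buffered rendering passes:

```python
def simd_select(cond, then_vals, else_vals):
    """Branch-free per-lane blend: each lane keeps the 'then' result
    where its predicate is true and the 'else' result where it is false."""
    return [t if c else e for c, t, e in zip(cond, then_vals, else_vals)]

def emulate_branch(lanes):
    # Per lane:  y = x * 2  if x > 0  else  -x
    # Both sides are computed for every lane; there is no branch.
    then_side = [x * 2 for x in lanes]
    else_side = [-x for x in lanes]
    return simd_select([x > 0 for x in lanes], then_side, else_side)
```

The cost is that divergent lanes do wasted work on the untaken side, which is why partitioning the program into passes, so that only the lanes that need a pass participate in it, matters for performance.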