|Date|Speaker|Title|Location|
|---|---|---|---|
|Mon. Jan. 14, 2008 at 1:30 pm|Ayelet Wasik|Features of a Multi-Threaded Memory Allocator|DC 2314|
|Wed. Jan. 30, 2008 at 1:00 pm|Abid Malik|Optimal Basic Block Scheduling without Spilling for Multi-issue Processors|DC 2314|
|Thu. Feb. 07, 2008 at 1:30 pm|Patrick Lam|Statically Verifying Software Safety Properties with Tracematches|DC 2314|
|Wed. Feb. 13, 2008 at 2:30 pm|Ghulam Lashari|A review of auto-vectorization techniques|DC 2314|
|Thu. Feb. 14, 2008 at 1:30 pm|Patrick Lam|continuation of talk from Feb. 7|DC 1304|
|Fri. Mar. 07, 2008 at 3:00 pm|José Nelson Amaral|Fixing Performance Bugs Created by Good Programmers|DC 1304|
|Wed. Mar. 19, 2008 at 1:00 pm|Abid Malik|Using Machine Learning for Iterative Compiler Optimization|DC 1331|
|Wed. Mar. 26, 2008 at 2:30 pm|Ghulam Lashari|Control Flow Emulation on Tiled SIMD Architectures|DC 2314|
|Thu. Apr. 10, 2008 at 1:30 pm|Nomair Naeem|TBA|DC 2314|
All are welcome.
Talk announcements are also posted to a public mailing list.
To volunteer to give a talk, contact Ondřej Lhoták. Talks can range from informal to formal, and they can be about any of the following:
The purpose is for us all to see what people are interested in, and to motivate discussion.
Features of a Multi-Threaded Memory Allocator
Jan. 14, 2008
Several potential features are identified in existing multi-threaded memory allocators. These features include per-thread heaps with a global heap, object ownership, object containers, thread-local free-list buffers, remote free-lists, allocation buffers, and lock-free operations. When used in different combinations, these features can solve most of the challenges facing a multi-threaded memory allocator. Using a test suite composed of both single- and multi-threaded benchmark programs, several existing memory allocators and a set of new allocators are compared. It is determined that different features address different multi-threaded issues in the memory allocator with respect to performance, scaling, and fragmentation. Finally, I make recommendations for a general-purpose multi-threaded memory allocator.
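One of the features listed above, thread-local free-list buffers in front of a shared global heap, can be sketched in a simplified model. This is only an illustration of the idea (Python stands in for the C a real allocator would use; the class and constant names are invented for this sketch):

```python
import threading

class GlobalHeap:
    """Shared pool of free blocks, protected by a single lock."""
    def __init__(self, nblocks):
        self.lock = threading.Lock()
        self.free = list(range(nblocks))

    def take(self, n):
        with self.lock:
            batch, self.free = self.free[:n], self.free[n:]
            return batch

    def give(self, blocks):
        with self.lock:
            self.free.extend(blocks)

class ThreadHeap:
    """Per-thread heap: allocations hit the thread-local free list first,
    touching the global heap (and its lock) only to refill or flush."""
    BATCH = 8       # refill granularity: one lock acquisition per BATCH allocs
    FLUSH_AT = 16   # flush back to the global heap when the local list grows

    def __init__(self, global_heap):
        self.g = global_heap
        self.local_free = []

    def alloc(self):
        if not self.local_free:
            self.local_free = self.g.take(self.BATCH)
        return self.local_free.pop() if self.local_free else None

    def free(self, block):
        self.local_free.append(block)
        if len(self.local_free) > self.FLUSH_AT:
            self.g.give(self.local_free[:self.BATCH])
            del self.local_free[:self.BATCH]
```

Most allocations and frees then proceed without any synchronization, which is the point of the feature; the remote free-lists and lock-free operations mentioned in the abstract address the harder case of freeing a block on a different thread than allocated it, and are not modeled here.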
Optimal Basic Block Scheduling without Spilling for Multi-issue Processors
Jan. 30, 2008
The run-time performance of modern processors depends heavily on the compiler optimization phases. Typical compiler phases include instruction scheduling, which maximizes instruction-level parallelism (ILP), and register allocation, which minimizes data spills to external memory. If ILP is maximized without considering register constraints, high register pressure may result, leading to increased spill code and reduced run-time performance. In this work, I present a framework that, for a given basic block, finds a schedule with the minimum length and with register pressure no greater than the number of available physical registers, if such a schedule exists. I evaluated my approach for various idealized and realistic architectures using the SPEC2000 benchmarks. Experimental results show that my approach is able to solve basic blocks as large as 50 instructions, which constitute 97.496% of all basic blocks in the benchmark, within 10 minutes. This compares favorably to recent work on integrated code generation using integer programming.
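To make the tension concrete, the following sketch schedules a basic-block dependence DAG greedily while tracking register pressure, refusing to issue an instruction once live values would exceed the register limit. This is only a heuristic illustration of the problem, not the optimal enumeration the talk describes (which would backtrack or search rather than fail):

```python
def list_schedule(instrs, num_regs):
    """Greedy list scheduler for a basic block. `instrs` maps each
    instruction name to the set of instruction names it uses; each
    instruction is assumed to define exactly one value (illustrative model).
    Raises if the greedy choice would exceed `num_regs` live values."""
    scheduled, live = [], set()
    remaining = dict(instrs)
    # users[d] = instructions that consume the value defined by d
    users = {d: {i for i, u in instrs.items() if d in u} for d in instrs}
    while remaining:
        ready = [i for i, uses in remaining.items()
                 if all(u in scheduled for u in uses)]

        def pressure_after(i):
            # operands whose last remaining use is i die when i issues
            freed = {u for u in remaining[i]
                     if users[u] <= set(scheduled) | {i}}
            return len((live - freed) | {i})

        ready.sort(key=pressure_after)   # prefer pressure-reducing choices
        i = ready[0]
        if pressure_after(i) > num_regs:
            raise RuntimeError("schedule would spill")
        freed = {u for u in remaining[i] if users[u] <= set(scheduled) | {i}}
        live.difference_update(freed)
        live.add(i)
        scheduled.append(i)
        del remaining[i]
    return scheduled
```

With two registers the three-instruction block `c = f(a, b)` schedules fine; with one register it cannot be issued without spilling, which is exactly the situation the talk's framework resolves optimally.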
Statically Verifying Software Safety Properties with Tracematches
Feb. 07, 2008
Tracematches enable developers to declaratively specify safety properties for Java programs. These properties can be either domain-specific properties (e.g. compiler passes are always executed in the correct order) or generic properties (e.g. linked lists are never modified during iteration). While it is possible to dynamically monitor tracematches, such an approach 1) carries runtime overhead and 2) does not suggest reasonable recovery actions. We investigate a static analysis that promises to reduce runtime overhead and, in some cases, can guarantee that a program will never violate certain safety properties (eliminating the need for recovery actions). Our approach propagates tracematch states through methods to ensure that unsafe states are never triggered during an execution, thereby enabling the elimination of runtime checks. Pointer analysis turns out to be important in the static analysis of tracematches, and I will discuss the various analyses needed to make our overall static analysis work. This is joint work with Eric Bodden and Laurie Hendren.
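The generic property mentioned above ("linked lists are never modified during iteration") amounts to matching a regular pattern over a trace of events. Real tracematches are declarative AspectJ patterns matched per variable binding; the tiny hand-written monitor below (all names invented for this sketch) only illustrates the underlying state-machine idea that the static analysis propagates through methods:

```python
class ModifiedDuringIteration:
    """Trace monitor for one collection/iterator pair.
    Events: "iter" (iterator created), "update" (collection modified),
    "next" (iterator advanced). The violating pattern is:
        iter  update+  next
    i.e. advancing an iterator after its collection changed."""
    def __init__(self):
        self.state = "start"   # start -> iterating -> modified

    def event(self, e):
        if e == "iter":
            self.state = "iterating"
        elif e == "update" and self.state in ("iterating", "modified"):
            self.state = "modified"
        elif e == "next" and self.state == "modified":
            return "violation"
        return None
```

The static analysis in the talk tries to prove that the "violation" transition can never fire on any execution, which lets the compiled-in runtime checks be removed entirely.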
A review of auto-vectorization techniques
Feb. 13, 2008
Parallel processors such as multicore CPUs and GPUs have become commodities in recent years. GPUs are well suited to data-level parallelism, while the processing model of multicore CPUs is naturally suited to task-level parallelism. In addition, each of the many CPU cores has a vector unit suited to data-level parallelism. As a result, the parallelization and vectorization techniques of the 1970s are again in focus.
We review auto-vectorization techniques: (1) the use of dependence analysis to ensure that reordering statements is safe, (2) a vectorization algorithm, and (3) a wealth of loop transformations that have been studied to uncover and enhance vector parallelism. Choosing loop transformations that are profitable and do not conflict with other useful transformations is challenging and important for finding better parallelism. Another challenge for vectorization is control flow, which introduces a constraint on the transformations that data dependence alone does not capture.
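For point (1), the simplest case of a dependence test can be sketched directly. For a single-statement loop `a[i + w] = g(a[i + r])`, the loop-carried dependence distance is `w - r`, and executing a vector of iterations at once is safe only if the vector length does not exceed that distance. This is a deliberately minimal sketch (one statement, unit stride), not a full dependence framework:

```python
def max_safe_vector_length(write_offset, read_offset):
    """For the loop:  for i: a[i + write_offset] = g(a[i + read_offset])
    Returns the largest safe vector length, or None if any length is safe.

    A write at iteration i lands where the read of iteration
    i + (write_offset - read_offset) will look, so a positive distance d
    is a loop-carried flow dependence: at most d iterations may run at
    once. Distance <= 0 means no flow dependence is carried (reads happen
    before writes within a vector operation), so any length is safe."""
    d = write_offset - read_offset
    if d <= 0:
        return None
    return d
```

For example, `a[i+4] = a[i]` vectorizes with length at most 4, while `a[i] = a[i+1]` (an anti-dependence) vectorizes at any length because the vector loads complete before the stores.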
Fixing Performance Bugs Created by Good Programmers
José Nelson Amaral
Mar. 07, 2008
Good programming and software engineering practices encourage the semantically meaningful encapsulation of data into objects or data structures. These practices also encourage the design and implementation of relatively small, self-contained modules. While these practices increase programmers' productivity and reduce the cost of writing and maintaining software, they may also incur a significant performance penalty.
The semantically meaningful encapsulation of data into aggregated data structures yields a data layout that diminishes locality in the data and can carry a significant performance penalty on computer architectures with several levels of caches. Automatic data restructuring and memory pooling are two techniques that optimizing compilers can use to rearrange the data layout in memory and regain the performance lost to encapsulation. This talk describes recent results from data structure splitting in the IBM XL compiler suite.
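The layout change behind data structure splitting can be illustrated as the classic hot/cold field split. The compiler performs this on C structs; Python lists merely stand in for the two memory layouts here, and the field names are invented for the sketch:

```python
# Encapsulated ("array of structs") layout: the hot field `key` is
# interleaved with a cold payload, so a scan over keys also drags the
# payloads through the cache.
records_aos = [{"key": k, "payload": "x" * 64} for k in range(100)]

# After splitting: hot keys are contiguous, cold payloads live elsewhere.
keys = [r["key"] for r in records_aos]          # hot array
payloads = [r["payload"] for r in records_aos]  # cold array

def scan_split(keys, needle):
    """The hot loop now touches only the dense key array; a matching
    payload is fetched by index only when actually needed."""
    for i, k in enumerate(keys):
        if k == needle:
            return i
    return -1
```

The program's meaning is unchanged, which is why this is legal for the compiler to do automatically once it proves no code depends on the original layout.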
José Nelson Amaral is a Professor, and the Associate Chair for
Graduate Studies, in the Department of Computing Science at the
University of Alberta. He received a Ph.D. in Electrical and Computer
Engineering from the University of Texas at Austin in 1994. His
current research focuses on the design and implementation of optimizing
compilers, parallel programming models, and high-performance
computing. He has published extensively in these and other areas, and
has participated in the organization and program committees of several
international conferences. He is the 2008 Global Chair for Parallel
and Distributed Programming for Euro-Par and the General and Program
Chair for the 2008 Languages and Compilers for Parallel Computing
(LCPC) workshop. His group has maintained intense collaboration with
the IBM compiler development group in Toronto. In 2006 his team was
selected for the Project Team of the Year award by the IBM Centre for Advanced Studies.
Using Machine Learning for Iterative Compiler Optimization
Mar. 19, 2008
Program-specific or function-specific optimization phase sequences are widely accepted to achieve better overall performance than any fixed optimization phase ordering. A number of heuristic phase-order space search algorithms have been devised to find customized phase orderings that achieve high performance for each function. This, however, comes at the cost of a large number of evaluations of the program.
In this talk, I will review an approach by F. Agakov, G. Fursin, and M. O'Boyle, who use machine learning for iterative compiler optimization. The approach is relatively fast, is independent of the search algorithm, search space, and compiler infrastructure, and scales gracefully with the size of the compiler optimization space.
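The baseline that such an approach improves on can be sketched as plain random search over phase orderings, where each candidate ordering costs one compile-and-run of the program. The function and parameter names below are invented for illustration; the reviewed work's contribution is a learned model that focuses this search on promising orderings, cutting the number of expensive evaluations:

```python
import random

def search_phase_order(phases, evaluate, trials=50, seed=0):
    """Random search over optimization phase orderings.
    `evaluate` returns a cost for one ordering (in the real setting,
    the measured runtime of the program compiled with that ordering);
    every call is therefore expensive, which is what the talk's
    machine-learning approach aims to reduce."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(trials):
        order = phases[:]
        rng.shuffle(order)
        cost = evaluate(order)
        if cost < best_cost:
            best, best_cost = order, cost
    return best, best_cost
```

Even for a handful of phases the space of orderings grows factorially, so 50 trials sample only a sliver of it; a model trained on earlier programs can predict which orderings are worth evaluating at all.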
Control Flow Emulation on Tiled SIMD Architectures
Mar. 26, 2008
Heterogeneous multi-core and streaming architectures such as the GPU, Cell, ClearSpeed, and Imagine processors have better power/performance ratios and memory bandwidth than traditional architectures. These types of processors are increasingly being used to accelerate compute-intensive applications. Their performance advantage is achieved by using multiple SIMD processor cores but limiting the complexity of each core, and by combining this with a simplified memory system. In particular, these processors generally avoid the use of cache coherency protocols and may even omit general-purpose caches, opting for restricted caches or explicitly managed local memory.
We show how control flow can be emulated on such tiled SIMD architectures and how memory access can be organized to avoid the need for a general-purpose cache and to tolerate long memory latencies. Our technique uses streaming execution and multipass partitioning. Our prototype targets GPUs. On GPUs the memory system is deeply pipelined and caches for read and write are not coherent, so reads and writes may not use the same memory locations simultaneously. This requires the use of double-buffered streaming. We emulate general control flow in a way that is transparent to the programmer and include specific optimizations in our approach that can deal with double-buffering.
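The basic building block of control-flow emulation on SIMD hardware is predication: every lane evaluates both sides of a branch, and a per-lane select blends the results. The sketch below shows only this building block (lists stand in for SIMD lanes; names are invented); the talk's multipass partitioning generalizes the idea to loops and to tiles streamed across double-buffered rendering passes:

```python
def simd_select(cond, then_vals, else_vals):
    """Branch-free per-lane blend: each lane keeps the 'then' result
    where its predicate is true and the 'else' result where it is false."""
    return [t if c else e for c, t, e in zip(cond, then_vals, else_vals)]

def emulate_branch(lanes):
    # Per lane:  y = x * 2  if x > 0  else  -x
    # Both sides are computed for every lane; there is no branch.
    then_side = [x * 2 for x in lanes]
    else_side = [-x for x in lanes]
    return simd_select([x > 0 for x in lanes], then_side, else_side)
```

The cost is that divergent lanes do wasted work on the untaken side, which is why partitioning the program into passes, so that only the lanes that need a pass participate in it, matters for performance.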