As an approach to promoting whole-system synergy on a heterogeneous
computing system, compilation of fine-grained SPMD-threaded code (e.g.,
GPU CUDA code) for multicore CPUs has recently drawn attention. This
paper concentrates on two important sources of inefficiency that limit
existing translators. The first is overly strong synchronization;
the second is thread-level partially redundant computation. We point
out that both kinds of inefficiency essentially stem from a single
cause: non-uniformity among threads. Based on that
observation, we present a thread-level dependence analysis, which
leads to a code generator with three novel features: an instance-level
instruction scheduler for synchronization relaxation, a graph pattern
recognition scheme for code shape optimization, and a fine-grained
analysis for thread-level partial redundancy removal. Experiments show
that the unified solution is effective in resolving both inefficiencies,
yielding speedups of up to a factor of 14.