Fran Allen: Compilers and Parallel Computing Systems

Fran Allen delivers Organick Lecture
Fran Allen delivers Organick Lecture
(click to enlarge)

Fran Allen was the Turing Award winner for 2006. This afternoon she's giving the University of Utah's Organick Memorial Lecture. I've reported on some of these in the past few years:

I try to come every year. I find it's something I'm inspired by each time.

The grand goal of high performance computers right now is a 1 petaflop machine. This requires 1,000,000 gigaflop processors. Wow. She shows a semilog plot of peak speed vs year introduced that is a linear line (Moore's law at work).

Much of Allen's work in the 80's and early 90's was around the PTRAN system of analysis for parallelism. The techniques are used, for example in the optimization stage of IBM's XL family of compilers.

Because more and more transistors are being placed on chips, they're using more and more energy--getting hotter. Part of the solution--which we're seeing play out--is multi-core chips. This requires parallelism to achieve the performance users expect. But making use of multi-codes requires that tasks be organized by either users or software to run in parallel.

By 2021, there will be chips with 1024 cores on them. Is parallelism the tool that will make al these ores useful? John Hennessey has called it the biggest challenge Computer Science has every faced. He has credentials that might make you believe him. Allen says that it's also the best opportunity that Computer Science has to improve user productivity, application performance and system integrity.

For parallel (superscalar, etc.) architectures, compilers--software--have been used to automatically manage scheduling tasks so that they can operate in parallel. What about those techniques will be useful in this new world of multi-cores?

Allen says we need to get rid of C--soon. C, as a language, doesn't provide enough information to the compiler for it to figure out interdependencies--making it hard to parallelize. Another way to look at it is that pointers allow programmers to build programs that can't be easily analyzed to find out which parts of the program can be executed at the same time.

Another factor that makes parallelization hard is data movement. Allen offers no silver bullet. The latency of data movement inhibits high performance.

The key is the right high level language that can effectively take advantage of the many good scheduling and pipelining algorithms that exist. If we don't start with the right high level language, those techniques will have limited impact.

She presents some research from Vivek Sarkar on compiling for parallelism. Only a small fraction of application developers are experts in parallelism. Expecting them to become such is unreasonable. The software is too complex and the primary bottleneck in the usage of parallel systems. X10 is an example of a language (object oriented) that tries to maximize the amount of automatic parallel optimization that can be done.

Major themes include cross-procedure parallelization, data dependency analysis, control dependency analysis, and then using those analyses to satisfy the dependencies while maximizing parallelism.

Useful parallelism depends on the run time behavior of the program (i.e. loop frequency, branch prediction, and node run times) and the parameters of the target multiprocessor. Finding the maximal parallelism isn't enough because it probably can't be efficiently mapped on the multiple cores or processors. There is a trade off between the partition cost and the run time. Finding the intersection gives the right level of parallelism--the level that is the most efficient use of available resources. Interprocedural analysis is the key to whole program parallelism.

One of the PTRAN analysis techniques was the transform the program into a functional equivalent that used static single assignment. This, of course, is what functional programming enthusiasts have been saying for years: one of functional programming's biggest advantages is that functional programs--those without mutation--are much more easily parallelized than imperative programs (including imperative-based object oriented languages).

There's a long list of transformations that can be done--everything from array padding to get easily handled dimensions to loop unrolling and interleaving. Doing most of these transformations well requires detailed knowledge of the machine--making it a better job for compilers than humans. Even then, the speedup is less than the number of additional processors applied o the job. That is, applying 4 processors doesn't get you a speedup of 4--more like 2.2. The speed up--at present--is asymptotic.