Technical Computing @ Microsoft: Lecture Series on the History of Parallel Computing

Scalable Parallel Computing on Many/Multicore Systems

This set of lectures reviews the application and programming model issues that one must address when chips arrive with 32-1024 cores and “scalable” approaches are needed to make good use of such systems. We will not discuss bit-level and instruction-level parallelism, i.e. what happens on a possibly special-purpose core, even though this is clearly important. We use science and engineering applications to drive the discussion, as we have substantial experience in these cases, but we are interested in the lessons for commodity client and server applications that will be broadly used 5-10 years from now.

We start with simple applications and algorithms from a variety of fields and identify the features that make it “obvious” that “all” science and engineering problems run well and scalably in parallel. We explain why, unfortunately, it is equally obvious that there is no straightforward way of expressing this parallelism. Parallel hardware architectures are described in enough detail to understand performance and algorithm issues and the need for cross-architecture compatibility; however, in these lectures we will just be users of hardware. We must understand what features of multicore chips we can and should exploit. We can explicitly use the shared memory between cores or just enjoy its implications for very fast inter-core control and communication. We note that parallel algorithm research has been hugely successful, although this success has reduced activity in an area that deserves new attention for the next generation of architectures.

The parallel software environment is discussed at several levels including the programming paradigm, runtime and operating system. The importance of libraries, templates, kernels (dwarfs) and benchmarks is stressed. The programming environment offers various tools including compilers with parallelism hints such as OpenMP; tuners like ATLAS; messaging models; and parallelism and distribution support as in HPF, the HPCS languages, Co-array Fortran, Global Arrays and UPC. We also discuss the relevance of important general ideas like object-oriented paradigms (as in, for example, Charm++), functional languages and Software Transactional Memory. Streaming, pipelining, co-ordination, services and workflow are placed in context. Examples discussed in the last category include CCR/DSS from Microsoft and the Common Component Architecture (CCA) from DoE. Domain-specific environments like Matlab and Mathematica are important as there is no universal programming silver bullet; one will need interoperable focused environments.
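As a small illustration of the compiler-hint style mentioned above, here is a minimal OpenMP sketch in C; the particular loop and reduction are chosen purely for illustration and are not taken from the lectures (compile with -fopenmp on gcc/clang):

    #include <stdio.h>
    #include <omp.h>

    /* The pragma is a hint that the loop iterations are independent
       and may be divided among the available cores. */
    int main(void) {
        const int n = 1000000;
        double sum = 0.0;

        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++) {
            sum += 1.0 / (i + 1.0);   /* stand-in workload: partial harmonic sum */
        }

        printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }

Without the -fopenmp flag the pragma is simply ignored and the same code runs serially, which is part of the appeal of the hint-based approach.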

We discuss performance analysis including speedup, efficiency, scaled speedup and Amdahl’s law. We show how to relate performance to the algorithm/application structure and the hardware characteristics. Applications will not achieve scalable parallelism accidentally but only if there is an understandable performance model. We review some of the many pitfalls for both performance and correctness; these include deadlocks, race conditions, nodes that are busy doing something else, and the difficulty of second-guessing automatic parallelization methods. We describe the formal sources of overhead, namely load imbalance and communication/control overhead, and the ways to reduce them. We relate the blocking used in caching to that used in parallel computation. We note that load balancing was the one part of parallel computing that turned out to be easier than expected.
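For reference, the performance measures named above can be written compactly as follows; the notation (P cores, serial fraction f) is ours, not taken from the lectures:

    % Speedup and efficiency on P cores
    S(P) = \frac{T(1)}{T(P)}, \qquad E(P) = \frac{S(P)}{P}
    % Amdahl's law: the serial fraction f bounds the speedup
    S(P) \le \frac{1}{f + (1-f)/P}
    % Gustafson's scaled speedup: the problem size grows with P
    S_{\mathrm{scaled}}(P) = f + (1-f)\,P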

We will mix simple idealized applications with “real problems”, noting that it is usually the simple problems that are the hardest, as they have poor computation to control/communication ratios. We will explain the parallelism in several application classes including, for science and engineering: Finite Difference, Finite Elements, FFT/Spectral, Meshes of all sorts, Particle Dynamics, Particle-Mesh, and Monte Carlo methods. Some applications like Image Processing, Graphics, Cryptography and Media coding/decoding have features similar to well understood science and engineering problems. We emphasize that nearly all applications are built hierarchically from more “basic applications” with a variety of different structures and natural programming models. Such application composition (co-ordination, workflow) must be supported with a common run-time. We contrast the decomposition needed in most “basic parallel applications” with the composition supported in workflow and Web 2.0 mashups. Looking at broader application classes, we should cover: Internet applications and services; artificial intelligence, optimization and machine learning; divide and conquer algorithms; tree structured searches like computer chess; and applications generated by the data deluge, including access, search, and the assimilation of data into simulations. There has been a growing interest in Complex Systems, whether for critical infrastructure like energy and transportation, the Internet itself, commodity gaming, computational epidemiology or the original war games. We expect Discrete Event Simulations (such as DoD’s High Level Architecture HLA) to grow in importance as they naturally describe complex systems and because they can clearly benefit from multicore architectures. We will discuss the innovative Sequential Dynamical Systems approach used in the EpiSims, TransSims and other critical infrastructure simulation environments. In all applications we need to identify the intrinsic parallelism and the degrees of freedom that can be parallelized, and distinguish small parallelism (local to a core) from large parallelism (scalable across many cores).
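As one concrete instance of the decomposition discussed above, here is a minimal sketch (our own illustration, not material from the lectures) of a 1-D finite difference relaxation in C, where the grid points are the degrees of freedom that parallelize across cores:

    #include <stdio.h>
    #include <string.h>

    #define N 1024          /* grid points: the parallel degrees of freedom */
    #define STEPS 100

    /* Geometric (domain) decomposition: each core can own a block of grid
       points and needs only its neighbours' boundary values each step. */
    int main(void) {
        static double u[N], unew[N];
        memset(u, 0, sizeof u);
        u[0] = 1.0;  u[N-1] = 0.0;            /* fixed boundary conditions */

        for (int step = 0; step < STEPS; step++) {
            #pragma omp parallel for           /* interior updates are independent */
            for (int i = 1; i < N - 1; i++)
                unew[i] = 0.5 * (u[i-1] + u[i+1]);
            unew[0] = u[0];  unew[N-1] = u[N-1];
            memcpy(u, unew, sizeof u);         /* advance to the next iteration */
        }

        printf("u[N/2] = %f after %d steps\n", u[N/2], STEPS);
        return 0;
    }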

We will not forget many critical non-technical issues, including “Who programs – everybody or just the Marine Corps?”; “The market for science and engineering was small but it will be large for general multicore”; “The program exists and can’t be changed”; and “What features will the next hardware/software release support, and how should I protect myself from change?”

We will summarize the lessons and relate them to application and programming model categories. In the last lecture, or at the end of all the lectures, we encourage you to bring your own multicore application or programming model so we can discuss examples that interest you.

Speaker Details

Geoffrey Charles Fox (812.219.4643, gcf@indiana.edu, http://www.infomall.org). Fox received a Ph.D. in Theoretical Physics from Cambridge University and is now professor of Computer Science, Informatics, and Physics at Indiana University. He is director of the Community Grids Laboratory of the Pervasive Technology Laboratories at Indiana University. He previously held positions at Caltech, Syracuse University and Florida State University. He has published over 550 papers in physics and computer science and been a major author on four books. Fox has worked in a variety of applied computer science fields with his work on computational physics evolving into contributions to parallel computing and now to Grid systems. He has worked on the computing issues in several application areas – currently focusing on Earthquake and Ice-sheet Science and Chemical Informatics. He is involved in several projects to enhance the capabilities of Minority Serving Institutions.

Date:
Speakers:
Geoffrey Fox
Affiliation:
Indiana University