|
The success of parallel computing in solving realistic computational applications relies on their efficient mapping and execution on large-scale multiprocessor architectures. When the algorithms and data structures corresponding to these problems are unstructured or dynamic in nature, efficient implementation on parallel machines offers considerable challenges.
Unstructured applications are characterized by irregular data-access patterns whereas dynamic mesh adaptation causes computational workloads to grow or shrink at run time. For such applications, dynamic load balancing is required in order to achieve algorithmic scaling on parallel machines. Our objectives were to implement various parallel versions of a dynamic unstructured algorithm and to criti-cally compare their performances in terms of run time, scalability, programmability, portability, and memory overhead.
A multithreaded version of a dynamic unstructured mesh-adaptation algorithm has been implemented on the Cray (formerly Tera) Multithreaded Architecture (MTA). Multithreaded machines can tolerate memory latency and utilize substantially more of their computing power by processing several threads of computation. For example, the MTA processors each have hardware support for up to 128 threads, and are therefore especially well suited for irregular and dynamic applications. Unlike traditional parallel machines, the MTA has a large uniform shared memory, no data cache, and is insensitive to data placement. Parallel programmability is significantly simplified since users have a global view of the memory, and need not be concerned with partitioning and load-balancing issues. Performance was compared with an MPI implementation on the T3E and the Origin2000, and a shared-memory directives-based implementation on the Origin2000.
A standard computational mesh simulating flow over an airfoil was used for our experiments to compare the three parallel architectures. The initial mesh, consisting of more than 28,000 triangles, was refined a total of five times to generate a mesh 45 times larger, as shown in figure 1. Performance by platform and programming paradigm is presented in figure 2. It is important to note that different parallel versions use different dynamic load-balancing strategies. The multithreaded implementation of the adaptation algorithm required adding a trivial amount of code to the original serial version and had little memory overhead. In contrast, the MPI version doubled the size of the code and required significant additional memory for the communication buffers. The simulation on the eight-processor MTA at San Diego Supercomputing Center using 100 threads per processor was almost ten times faster than that obtained on 160 processors of a T3E, and more than 100 times faster than that on a 64-processor Origin2000 running a directives-based version. Results indicate that multithreaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers.
Point of Contact: Rupak Biswas
(650) 604-4441
rbiswas@nas.nasa.gov
Back To Top
Previous Paper
Return to Advanced Space Transportation
Next Paper |
|
Fig. 1. A close-up view of the computational mesh that was geometrically refined in specific regions to better capture fine-scale phenomena.
|
|
Fig. 2. Performance of the dynamic unstructured mesh-adaptation algorithm by platform and programming paradigm. All times are in seconds; the fastest times are highlighted.
|
|