Types of Parallel Computers
There are many types of computers available today, from single-processor or 'scalar' computers to machines with vector processors to massively parallel computers with thousands of microprocessors. Each platform has its own unique characteristics, and understanding the differences is important to knowing how best to program each. The real trick, however, is to write programs that will run reasonably well on a wide range of computers.
Scalar computers
Your typical PC or workstation is a scalar, or
single-processor, computer. These use many different processors, and run a wide
variety of operating systems. Below is a list of some current processors and the
operating systems each can run. Linux is now available for almost all
processors.
Processor                    | Clock speed   | Operating systems
Intel Pentiums / AMD Athlons | up to 1.5 GHz | Linux, MS Windows, FreeBSD, NetBSD
Alpha 21264                  | up to 667 MHz | Tru64 Unix, Alpha Linux
IBM Power3                   | up to 375 MHz | AIX, Linux
Sun UltraSparc3              | up to 500 MHz | Solaris, Linux
SGI R12000                   | up to 400 MHz | IRIX
Motorola G4                  | up to 500 MHz | Apple Macintosh OS, Linux
Parallel vector processors
The great supercomputers of the past were built around custom vector processors, the expensive, high-performance masterpieces pioneered by Seymour Cray. There are currently only a few computers in production that still use vector processors, and all are parallel vector processors (PVPs) that combine a small number of vector processors within the same machine.
System       | Peak speed                   | Processors
Cray SV1     | 4.8 GFlops/processor (peak)  | up to 100 processors
Fujitsu VSX4 | 1.2? GFlops/processor (peak) | up to 64 processors
Vector processors operate on large vectors of data at the same time. When possible, the compiler automatically vectorizes the innermost loops, breaking the work into blocks, often 64 elements in size. The functional units are pipelined to stream through all 64 elements, producing results at a rate of roughly one per clock cycle once the pipeline is full, and the memory subsystem is optimized to keep the processors fed at this rate. Since the compilers do much of the optimization automatically, the user only needs to attack those problem areas where something prevents the compiler from vectorizing a loop.
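As an illustration, below is the kind of loop a vectorizing compiler handles well: a simple element-wise operation (a hypothetical saxpy routine, not from any particular code) with no dependences between iterations.

    /* A hypothetical, vectorizable loop: each iteration is independent,
       so the compiler can strip-mine it into blocks (often 64 elements)
       and stream each block through the pipelined functional units. */
    void saxpy(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* no loop-carried dependence */
    }

    /* By contrast, a recurrence such as y[i] = y[i-1] + x[i] carries a
       dependence from one iteration to the next and would block
       vectorization unless the loop is rewritten. */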
MPI, compiler directives, OpenMP, and pThreads packages can
be used to parallelize a program to run across multiple vector processors.
Shared-memory multiprocessors
Shared-memory multiprocessor (SMP) systems have two or more scalar processors that share the same memory and memory bus. This category includes everything from a dual-processor Intel PC to a 256-processor SGI Origin3000. Each processor may have its own cache on a dedicated bus, but all processors are connected to a common memory bus and memory bank.
In a well-designed system, the memory bus must be fast enough to keep data flowing to all the processors. Large data caches are also necessary, as they allow each processor to pull data into 'local' memory and crunch it while the other processors use the memory bus. Most current SMP systems meet these two design criteria. Early Pentium SMP systems, and the multiprocessor nodes of the Intel Paragon, did not, and therefore their additional processors were of relatively little use. Below are some other SMP systems and the number of processors each can have.
System            | Processors            | Operating systems
SGI Origin3000    | up to 256 MIPS R12000 | IRIX
SGI Origin2000    | up to 128 MIPS R10000 | IRIX
SGI Origin200     | 2-4 MIPS R10000       | IRIX
Compaq ES40       | 2-4 Alpha 21264       | Tru64 or Linux
Compaq DS20       | 2 Alpha 21264         | Tru64 or Linux
IBM 43p           | 1-2 Power3            | AIX or Linux
IBM 44p           | 2-4 Power3            | AIX or Linux
Intel Pentium PCs | 2-4 Pentium           | MS Windows, Linux, others
SMP systems can be programmed using several different
methods. A multithreading approach can be used where a single program is run on
the system. The program divides the work across the processors by spawning
multiple light-weight threads, each executing on a different processor and
performing part of the calculation. Since all threads share the same program
space, there is no need for any explicit communication calls.
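As a sketch of this approach (assuming POSIX threads; the array size and fixed thread count are chosen purely for illustration), each thread might sum its own slice of a shared array:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4           /* illustrative thread count */
    #define N 1000000            /* illustrative problem size */

    static double data[N];              /* shared by all threads      */
    static double partial[NTHREADS];    /* one partial sum per thread */

    /* Each thread sums its own contiguous slice of the shared array;
       no explicit communication is needed because memory is shared. */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        long chunk = N / NTHREADS;
        long lo = id * chunk;
        long hi = (id == NTHREADS - 1) ? N : lo + chunk;
        double sum = 0.0;
        for (long i = lo; i < hi; i++)
            sum += data[i];
        partial[id] = sum;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        double total = 0.0;

        for (long i = 0; i < N; i++)
            data[i] = 1.0;

        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);

        for (long t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);
            total += partial[t];
        }
        printf("sum = %f\n", total);    /* should print 1000000.000000 */
        return 0;
    }

With gcc, this would be compiled with the -pthread flag.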
Compilers are sophisticated enough to use multithreading to automatically parallelize some codes. Unfortunately, this does not happen very often. While multithreading is undoubtedly the most efficient way to program SMP systems, it is not easy to do manually, and there are many multithreading packages to choose from. OpenMP may be the most popular standard at the moment, but there are many vendor-specific packages and other standards such as POSIX threads (pthreads).
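For comparison, a minimal OpenMP sketch of the same kind of shared-memory summation (assuming an OpenMP-capable compiler; the array and its contents are again purely illustrative) needs only a directive:

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000            /* illustrative problem size */

    int main(void)
    {
        static double data[N];
        double total = 0.0;

        for (long i = 0; i < N; i++)
            data[i] = 1.0;

        /* The directive asks the compiler to divide the iterations among
           threads and combine the per-thread sums at the end. */
        #pragma omp parallel for reduction(+:total)
        for (long i = 0; i < N; i++)
            total += data[i];

        printf("sum = %f using up to %d threads\n",
               total, omp_get_max_threads());
        return 0;
    }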
In summary, multithreading may produce the most efficient code for SMP systems, but it may not be the easiest way to parallelize a code. It also may not be portable, even across SMP systems, unless a standard like OpenMP is chosen that is supported almost everywhere. Of even more concern is that a multithreaded code will not run on distributed memory systems.
In the message-passing paradigm, each processor is treated
as a separate computer running a copy of the same program. Each processor
operates on a different part of the problem, and data exchange is handled by
passing messages. More details about this approach will be presented in the next
section.
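As a minimal sketch of the paradigm (a hypothetical sum, with the work loop standing in for a real calculation on locally held data), an MPI program might look like:

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000            /* illustrative problem size */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Each process works on its own share of the iterations ... */
        double local = 0.0;
        for (long i = rank; i < N; i += nprocs)
            local += 1.0;        /* stand-in for real work on local data */

        /* ... and the partial results are combined by passing messages. */
        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %f from %d processes\n", total, nprocs);
        MPI_Finalize();
        return 0;
    }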
While message passing is not as efficient as multithreading on an SMP system, the resulting code is much more portable since message passing is the predominant method for programming distributed memory systems. Message-passing implementations on SMP systems are still very efficient since the data doesn't have to traverse a network as it does in distributed memory systems. It is simply copied from memory to memory, which occurs at high speed and with little overhead.
Distributed memory MPPs
The largest systems, the distributed memory MPP (massively parallel processor) computers, come in an even wider variety. However, these systems all share a few traits. They are made up of many individual nodes, each of which is essentially an independent computer in itself. In fact, in the case of workstation/PC clusters, each node literally is a complete computer.
Each node consists of at least one processor, its own
memory, and a link to the network that connects all nodes together. Aside from
these generalizations, distributed memory systems may look very different.
Traditional MPP systems may contain hundreds or thousands
of individual nodes. They have custom networks that are typically very fast with
relatively low latencies. The network topologies vary widely, from completely
connected systems to 2D and 3D meshes and fat trees. Below is a partial listing
of some of the more common MPP systems available today, and some of the
characteristics of each.
System        | Nodes            | Processor   | Topology      | Bandwidth   | Latency
Cray T3E      | up to 1000 nodes | Alpha 21164 | 3D torus      | 340 MB/sec  | 2-3 µs
IBM SP        | up to 512 nodes  | Power3      | Colony switch | 900? MB/sec | ??? µs
Intel Paragon | up to 1836 nodes | i860        | 2D mesh       | 130 MB/sec  | 100 µs
There have been many other large MPP systems over the last
decade. While some of the machines may remain in use, the companies that built them have gone out of business.
These include machines such as the Thinking Machines CM-5, the Kendall Square
systems, the nCube 2, and the Intel Paragon listed above. These systems soared
in popularity, then crashed just as fast. The reasons for this vary, from the
difficulty of surviving the high cost of using custom components (especially
processors) to having good technology but a poor business model.
MPP systems are programmed using message-passing libraries.
Most have their own custom libraries, but all current systems also support the
industry standard MPI message-passing
interface. This allows codes programmed in MPI to be portable across distributed
memory systems, and also SMPs as described previously. Unfortunately, the MPI
implementation is not always as efficient as the native communication library,
so there is still some temptation to use the native library at the expense of
program portability. The Cray T3E is an extreme case, where the MPI
implementation only achieves 160 MB/sec while the native SHMEM library can
deliver twice that rate.
Cluster computers
Distributed memory computers can also be built from scratch
using mass-produced PCs and workstations. These cluster computers are referred to by many names, from the poor man's supercomputer to COWs (clusters of workstations) and NOWs (networks of workstations).
They are much cheaper than traditional MPP systems, and
often use the same processors, but are more difficult to use since the network
capabilities are currently much lower. Cluster computers are also usually much
smaller, most often involving fewer than 100 computers. This is in part because
the networking and software infrastructure for cluster computing is less mature,
making it difficult to make use of very large systems at this time. Below is a
list of some local clusters and their characteristics, plus some other notable
systems from around the country.
Cluster     | Configuration                | Network                           | Bandwidth   | Latency | OS
Octopus     | 16-node Pentium III cluster  | Fast Ethernet                     | 8.5 MB/sec  | 100 µs  | Linux
ALICE       | 64-node dual-Pentium cluster | Fast Ethernet                     | 8.5 MB/sec  | ~100 µs | Linux
Gecoa       | 24-node Alpha 21164 cluster  | Gigabit Ethernet                  | 30 MB/sec   | ~100 µs | Linux
IBM Cluster | 52-processor Power3 cluster  | Gigabit Ethernet (+ Myrinet soon) | ~100 MB/sec | ~100 µs | AIX
C-PLANT     | ???-node Alpha 21164 cluster | Myrinet                           | ??? MB/sec  | ??? µs  | Tru64 Unix
One look at the communication rates and message latencies
shows that they are much worse than for traditional MPP systems. Obviously you
get what you pay for, but the networks for cluster computers are quickly closing
the gap.
It is therefore more difficult to get many programs to run
well on cluster computers. Many applications will not scale to as many nodes
due to the slower networking, and some codes simply will not run well at all and
must be limited to MPP systems.
There is also a wide range of capabilities illustrated by
the mixture of clusters above. This range starts with the ultra-cheap PC
clusters connected by Fast Ethernet. These systems can be built for tens of
thousands of dollars, but the slow speed of the Fast Ethernet interconnects
greatly limits the number of codes that can utilize this type of a system.
The workstation clusters cost more, but can handle faster
networking. Gigabit Ethernet is maturing to where it can deliver up to 100
MB/sec, albeit at fairly high latencies at this point. Custom solutions such as
Myrinet can deliver slightly faster rates at lower latencies, but also cost
more. These are therefore only appropriate for clusters made from more costly
workstations.
Network computing
Cluster computers are made from many computers, usually
identical, that are located in the same room. Heterogeneous clusters use
different computers, and are much more difficult to program because the
computing rates and even the communication rates may vary.
Network computing, or internet computing, is using a
heterogeneous mixture of geographically separated workstations to perform
calculations. The idea of using the spare cycles on desktop PCs or workstations
has been around for years. Unfortunately, it only works for a very limited
number of applications due to the very low communication rate of the network,
the large latencies, the differing CPU rates, and the need to allow for fault
tolerance.
The SETI@home project is probably the most famous application that can make use of any spare cycles on the internet. There are also commercial companies, such as Entropia, that help businesses do the same.
Metacomputing
Metacomputing is a similar idea, but with loftier goals.
Supercomputers that may be geographically separated can be combined to run the
same program. However, the goal in metacomputing is usually to provide very high
bandwidths between the supercomputers so that these connections do not produce a
bottleneck for the communications. Scheduling exclusive time on many
supercomputers at the same time can also pose a problem. This is still an area
of active research.
Distributed memory systems with SMP nodes
The memory subsystems of PCs and workstations are rapidly
improving to where they can support more processors. Large cache sizes and more
memory bandwidth are instrumental to this success. With these improvements, it
is becoming more common to find distributed memory systems built with SMP nodes.
This has been done in the past. The Intel Paragon is one example, where the MP nodes had two i860 compute processors in addition to a third i860 serving as a communication coprocessor. Unfortunately, the cache size and memory
bandwidth were so small, and the time to switch which processor had access to
main memory was so great, that the second processor was pretty much useless
except for heating the computer room.
Our IBM cluster is a current example: each node is a dual-processor IBM 43p computer, with a few nodes being 4-processor IBM 44p workstations. IBM MPP systems are currently being built with 16-way SMP nodes. Compaq's Wildfire systems are essentially built from 4-processor ES40s.
How best to program in this mixed environment of
distributed SMP systems is still a very open question. The most efficient method
would probably be to do message-passing between SMP nodes and multithreading
within, but this requires two levels of parallelization by two distinctly
different methods. This takes a lot of programming effort, and presents the same
difficulties as multithreading itself. The more common approach is to use
message-passing for everything. This is the easiest and most portable approach, but not the optimal choice. Research is still needed in this area.
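As a rough sketch of the hybrid approach (MPI between nodes, OpenMP threads within each node; the work loop and sizes are purely illustrative, not taken from any real code):

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000            /* illustrative problem size */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);              /* one MPI process per SMP node */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Divide the outer work among nodes using message passing ... */
        long chunk = N / nprocs;
        long lo = rank * chunk;
        long hi = (rank == nprocs - 1) ? N : lo + chunk;

        /* ... and divide the node's share among its processors by
           multithreading within the node. */
        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (long i = lo; i < hi; i++)
            local += 1.0;                    /* stand-in for real work */

        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %f (%d nodes, up to %d threads each)\n",
                   total, nprocs, omp_get_max_threads());
        MPI_Finalize();
        return 0;
    }

Note that all MPI calls in this sketch are made outside the threaded region, which keeps the two levels of parallelism from interfering with each other.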