MPI programming on T3E
We review a simple code to sample the skeletal basics of MPI programming.
We discuss early MPI experience and related issues on the T3E installed at
SERI. The general premise is that there is no serious problem with MPI on the
T3E. Since MPI hides all the hardware details from ordinary users, and the
efficiency of the MPI implementation on the T3E is Cray's responsibility, we
touch only on the generalities and look into some examples.
Contents
1. A sample code
2. Datatypes
3. Collective Communications
4. T3E Usage
5. MPI Usage
6. Performance Issues
7. Known bug
8. Examples
1. A sample code
The sample code on the next page is an SPMD (single program, multiple data)
code for matrix-matrix multiplication.
- RANK : identifies a process; 0, 1, 2, ..., N-1
- MPI_COMM_WORLD : all the processes in the MPI application
- Initialization & Finalization
  - MPI_INIT(ierr)
  - MPI_FINALIZE(ierr)
- Who am I? How many are they?
  - MPI_COMM_RANK(MPI_COMM_WORLD,mynode,ierr)
  - MPI_COMM_SIZE(MPI_COMM_WORLD,nprocs,ierr)
- Sending Messages
  - MPI_SEND(buffer,count,datatype,destination,tag,MPI_COMM_WORLD,ierr)
- Receiving Messages
  - MPI_RECV(buffer,maxcount,datatype,source,tag,MPI_COMM_WORLD,istatus,ierr)
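The six calls listed above are enough for a complete program. The following is a minimal sketch, not the matrix code referred to above: every rank other than 0 sends one integer to rank 0, which receives them in rank order. The program name, the tag value, and the payload (the square of the rank) are hypothetical choices made for illustration; mynode and nprocs follow the names used in the call list.

```fortran
program hello_sendrecv
  implicit none
  include 'mpif.h'
  integer, parameter :: tag = 1            ! hypothetical message tag
  integer :: mynode, nprocs, ierr, i, val
  integer :: istatus(MPI_STATUS_SIZE)

  ! Initialization
  call MPI_INIT(ierr)

  ! Who am I? How many are they?
  call MPI_COMM_RANK(MPI_COMM_WORLD, mynode, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  if (mynode == 0) then
    ! Rank 0 receives one integer from each of the other ranks, in rank order
    do i = 1, nprocs - 1
      call MPI_RECV(val, 1, MPI_INTEGER, i, tag, MPI_COMM_WORLD, istatus, ierr)
      print *, 'rank 0 received', val, 'from rank', i
    end do
  else
    ! Every other rank sends a single value to rank 0
    val = mynode * mynode
    call MPI_SEND(val, 1, MPI_INTEGER, 0, tag, MPI_COMM_WORLD, ierr)
  end if

  ! Finalization
  call MPI_FINALIZE(ierr)
end program hello_sendrecv
```

The same executable runs on every process; the branch on mynode is what gives each process its role.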
2. Datatypes
MPI_SEND(buffer,count,datatype,destination,tag,MPI_COMM_WORLD,ierr)
<type> buffer(*)
integer count, datatype, destination, tag, MPI_COMM_WORLD, ierr
- Datatypes
  - MPI_CHAR (char), MPI_INT (int),
  - MPI_FLOAT (float), MPI_DOUBLE (double)
  - MPI_UNSIGNED_CHAR, MPI_UNSIGNED_SHORT, MPI_UNSIGNED, MPI_UNSIGNED_LONG,
    MPI_LONG_DOUBLE, MPI_BYTE, MPI_PACKED
- MPI derived datatype, MPI packed datatype
  - contiguous : one array length, no displacement, one datatype
  - vector : one array length, one displacement, one datatype
  - indexed : multiple array lengths, multiple displacements, one datatype
  - structure : multiple everything
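As an illustration of the "vector" case above, the sketch below sends one row of a Fortran matrix. Because Fortran stores arrays column by column, a row is a set of single elements separated by a fixed stride, which is what MPI_TYPE_VECTOR describes. The matrix size, the fill values, and the names a, row, and rowtype are assumptions made for this example; it uses the Fortran type name MPI_REAL rather than the C names listed above, and it needs at least two processes.

```fortran
program row_vector
  implicit none
  include 'mpif.h'
  integer, parameter :: n = 4              ! hypothetical matrix size
  real :: a(n, n), row(n)
  integer :: mynode, nprocs, ierr, rowtype, i, j
  integer :: istatus(MPI_STATUS_SIZE)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, mynode, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  ! n blocks of 1 element each, with block starts n elements apart:
  ! this picks out one row of a column-major n x n matrix
  call MPI_TYPE_VECTOR(n, 1, n, MPI_REAL, rowtype, ierr)
  call MPI_TYPE_COMMIT(rowtype, ierr)

  if (nprocs >= 2) then
    if (mynode == 0) then
      do j = 1, n
        do i = 1, n
          a(i, j) = 10.0 * i + j
        end do
      end do
      ! Send row 2: one item of type rowtype, starting at a(2,1)
      call MPI_SEND(a(2, 1), 1, rowtype, 1, 0, MPI_COMM_WORLD, ierr)
    else if (mynode == 1) then
      ! Receive the same n reals into a contiguous buffer
      call MPI_RECV(row, n, MPI_REAL, 0, 0, MPI_COMM_WORLD, istatus, ierr)
      print *, 'row 2 =', row              ! the values 21, 22, 23, 24
    end if
  end if

  call MPI_TYPE_FREE(rowtype, ierr)
  call MPI_FINALIZE(ierr)
end program row_vector
```

The sender and receiver may describe the data differently as long as the sequence of basic types matches, here n reals on both sides.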
3. Collective Communications
- Broadcast
  - MPI_BCAST(buf, count, dtype, root, comm, ierror)
  - <type> buf(*)
  - integer count, dtype, root, comm, ierror
- Scatter/Gather
  - MPI_SCATTER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype,
    root, comm, ierror)
  - MPI_GATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype,
    root, comm, ierror)
  - <type> sendbuf(*), recvbuf(*)
  - integer sendcount, sendtype, recvcount, recvtype, root, comm, ierror
- Reduce
  - MPI_REDUCE(sendbuf, recvbuf, count, dtype, op, root, comm, ierror)
  - <type> sendbuf(*), recvbuf(*)
  - integer count, dtype, op, root, comm, ierror
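The four collectives above can be combined in one short program. This is a sketch under assumed data: rank 0 broadcasts a scale factor, scatters one integer to each rank, each rank multiplies its element, and the results are both gathered and summed on rank 0. The variable names and the values are hypothetical.

```fortran
program collectives
  implicit none
  include 'mpif.h'
  integer :: mynode, nprocs, ierr, i
  integer :: scale, mine, partial, total
  integer, allocatable :: work(:), gathered(:)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, mynode, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  allocate(work(nprocs), gathered(nprocs))
  work = 0
  gathered = 0
  scale = 0

  ! Only the root holds meaningful input data
  if (mynode == 0) then
    scale = 3
    do i = 1, nprocs
      work(i) = i
    end do
  end if

  ! Broadcast: every rank gets the root's value of scale
  call MPI_BCAST(scale, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)

  ! Scatter: rank r gets work(r+1) from the root
  call MPI_SCATTER(work, 1, MPI_INTEGER, mine, 1, MPI_INTEGER, &
                   0, MPI_COMM_WORLD, ierr)

  partial = scale * mine

  ! Gather: the root collects one partial result per rank, in rank order
  call MPI_GATHER(partial, 1, MPI_INTEGER, gathered, 1, MPI_INTEGER, &
                  0, MPI_COMM_WORLD, ierr)

  ! Reduce: the root gets the sum of all partial results
  call MPI_REDUCE(partial, total, 1, MPI_INTEGER, MPI_SUM, &
                  0, MPI_COMM_WORLD, ierr)

  if (mynode == 0) then
    print *, 'gathered =', gathered
    print *, 'total    =', total
  end if

  deallocate(work, gathered)
  call MPI_FINALIZE(ierr)
end program collectives
```

Note that sendcount in MPI_SCATTER and recvcount in MPI_GATHER are counts per process, not the total length of the root's array.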
4. T3E Usage
- single system image
  - login, telnet, and ftp : same as an ordinary computer
  - single file system
  - parallel I/O partition
  - UNIX shell
- compiler
  - f90 and cc
  - no explicit library linking needed for the communication libraries
    - MPI, PVM, SHMEM, HPF
  - f90 is compatible with the f77 standard
- execution : mpprun -n ? ./a.out
  - malleable or not
  - optional "mpirun" : for consistency with PVP, not recommended on T3E
- execution : NQS
  - script file : similar to C90
  - job monitoring
- MPP considerations
  - check-pointing is a "must"
  - avoid using a small number of nodes for a long period of time → node
    fragmentation
  - periodic maintenance
  - difficulty with job scheduling → generic problem
5. MPI Usage
- include 'mpif.h' (Fortran) or 'mpi.h' (C)
- blocking vs. non-blocking point-to-point calls
  - no message co-processor
  - complication due to stream buffers
  - MPI calls are stream safe
- bug fix : vendor's responsibility
- I/O
  - private I/O unit for each node and different file names for each node
  - unique I/O unit for each node and different file names for each node
  - single I/O unit for all the nodes and a different file pointer for each
    node (cf. PFS global mode)
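Two of the points above can be sketched together: a non-blocking exchange, and the first I/O scheme (the same unit number on every node, with a different file name per node). In this hypothetical example each rank passes its rank number to its right-hand neighbor on a ring and writes what it received to its own file. The file name pattern out.NNNN, the unit number 10, and the ring layout are assumptions made for illustration, not conventions of the T3E.

```fortran
program ring_io
  implicit none
  include 'mpif.h'
  integer :: mynode, nprocs, ierr, left, right
  integer :: sendval, recvval
  integer :: req(2)
  integer :: stats(MPI_STATUS_SIZE, 2)
  character(len=16) :: fname

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, mynode, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  right = mod(mynode + 1, nprocs)
  left  = mod(mynode - 1 + nprocs, nprocs)
  sendval = mynode

  ! Non-blocking calls return at once; the buffers must not be
  ! touched until the matching wait has completed
  call MPI_IRECV(recvval, 1, MPI_INTEGER, left, 0, MPI_COMM_WORLD, req(1), ierr)
  call MPI_ISEND(sendval, 1, MPI_INTEGER, right, 0, MPI_COMM_WORLD, req(2), ierr)

  ! ... work that does not use sendval or recvval could go here ...

  call MPI_WAITALL(2, req, stats, ierr)

  ! Same unit number on every node, different file name per node
  write(fname, '(a,i4.4)') 'out.', mynode
  open(unit=10, file=trim(fname), status='replace', action='write')
  write(10, *) 'rank', mynode, 'received', recvval, 'from rank', left
  close(10)

  call MPI_FINALIZE(ierr)
end program ring_io
```

With blocking MPI_SEND and MPI_RECV the same ring exchange would need care over call ordering to avoid deadlock; posting both requests first removes that concern.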
6. Performance Issues
- Timing
  - rtc() : real time clock tick (hardware clock)
    - one tick = 1/(4.5 x 10^8) sec (about 2.2 nsec)
  - UNIX clock
  - Cray utility : ja
  - writing disturbs cache coherence → defer writing to a later stage
- latency vs. bandwidth
  - wide fluctuation for small communications among processors
  - streaming
  - low latency for message passing
  - long messages and less frequent send-receive help the efficiency of the code
  - cache and streaming affect bandwidth
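The latency and bandwidth points above are usually checked with a ping-pong test between two processes. The sketch below is one such test under stated assumptions: it uses the portable MPI_WTIME timer rather than rtc(), assumes 8-byte double precision values, and picks the message sizes and repetition count arbitrarily. Following the timing advice above, all output is deferred until the measurements are finished.

```fortran
program pingpong
  implicit none
  include 'mpif.h'
  integer, parameter :: nsizes = 6, nreps = 100    ! hypothetical test sizes
  integer, parameter :: maxlen = 8**(nsizes - 1)   ! 32768 elements
  double precision :: buf(maxlen), t0, tsec(nsizes)
  integer :: mynode, nprocs, ierr, n, k, irep
  integer :: istatus(MPI_STATUS_SIZE)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, mynode, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
  buf = 0.0d0
  tsec = 0.0d0

  if (nprocs >= 2) then
    do k = 1, nsizes
      n = 8**(k - 1)                       ! message length in elements
      call MPI_BARRIER(MPI_COMM_WORLD, ierr)
      t0 = MPI_WTIME()
      do irep = 1, nreps
        if (mynode == 0) then
          call MPI_SEND(buf, n, MPI_DOUBLE_PRECISION, 1, k, MPI_COMM_WORLD, ierr)
          call MPI_RECV(buf, n, MPI_DOUBLE_PRECISION, 1, k, MPI_COMM_WORLD, istatus, ierr)
        else if (mynode == 1) then
          call MPI_RECV(buf, n, MPI_DOUBLE_PRECISION, 0, k, MPI_COMM_WORLD, istatus, ierr)
          call MPI_SEND(buf, n, MPI_DOUBLE_PRECISION, 0, k, MPI_COMM_WORLD, ierr)
        end if
      end do
      ! Average one-way time per message (a round trip is two messages)
      tsec(k) = (MPI_WTIME() - t0) / (2 * nreps)
    end do

    ! Write results only after all timing loops are done
    if (mynode == 0) then
      do k = 1, nsizes
        n = 8**(k - 1)
        if (tsec(k) > 0.0d0) then
          print *, 8*n, ' bytes:', tsec(k), ' s one-way,', 8.0d0*n/tsec(k), ' bytes/s'
        end if
      end do
    end if
  end if

  call MPI_FINALIZE(ierr)
end program pingpong
```

The smallest message gives an estimate of latency and the largest an estimate of bandwidth. Given the fluctuation noted above for small messages, a single run should be read as indicative only.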
7. Known bug
- MPI_CHAR does not work
  - avoid it
  - convert data type (character to integer)
  - send-receive MPI_INTEGER
  - convert data type (integer to character)
- STDOUT
- optimizing f90 compiler removes "un-used loops" in a short program
  - ???
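The three-step workaround for character data listed above can be sketched as follows, using the Fortran intrinsics ichar and char for the two conversions. The string contents, its length, and the variable names are hypothetical, and the example needs at least two processes.

```fortran
program char_workaround
  implicit none
  include 'mpif.h'
  integer, parameter :: n = 16             ! hypothetical string length
  character(len=n) :: text
  integer :: codes(n)
  integer :: mynode, nprocs, ierr, i
  integer :: istatus(MPI_STATUS_SIZE)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, mynode, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  if (nprocs >= 2) then
    if (mynode == 0) then
      text = 'hello from T3E'
      ! Step 1: character to integer
      do i = 1, n
        codes(i) = ichar(text(i:i))
      end do
      ! Step 2: send the codes as MPI_INTEGER
      call MPI_SEND(codes, n, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
    else if (mynode == 1) then
      call MPI_RECV(codes, n, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, istatus, ierr)
      ! Step 3: integer back to character
      do i = 1, n
        text(i:i) = char(codes(i))
      end do
      print *, 'received: ', trim(text)
    end if
  end if

  call MPI_FINALIZE(ierr)
end program char_workaround
```

This costs one integer per character on the wire, which is acceptable for short strings such as labels or file names.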
8. Examples
- Step 0. Write a program (examples in /settmp/samples/MPI/ )
- Step 1. Compile : f90, cc, CC
- Step 2. Run : interactively (mpprun) or through NQS