MPI (Message Passing Interface) is what we are here calling the ``universal assembly language'' for parallel computers. Implementations of MPI run on most parallel computers, including most shared-memory computers, most distributed-memory computers, and many of the more specialized, high-performance parallel computers (such as the Intel Paragon, the IBM SP-2, etc.). MPI defines a binding to both the FORTRAN and C~languages, since both languages are commonly used for high-performance computing. Here, we discuss only the C~binding. As we shall see, MPI has standardized on the FORTRAN convention for parameter passing, which allows input/output parameters. Thus, MPI often specifies a pointer argument that may contain user-defined input data on entry to the MPI routine. MPI may read that data, and it may also place output data in the buffer referenced by the pointer before returning. MPI routines are all defined to return a value of type \verb|int|, which is always interpreted as an error code (\verb|MPI_SUCCESS| indicates success).

In addition to MPI, there is another, earlier contender for universal assembly language, PVM (Parallel Virtual Machine), which is also in widespread use. At this time, it appears that MPI will become the more widely disseminated standard, although there have also been implementations both of MPI on top of PVM (QUOTE EXAMPLE) and of PVM on top of MPI (QUOTE EXAMPLE). There have also been proposals that some version of PVM's higher level features may form an optional library on top of MPI. Further, it is quite easy to translate simple invocations of parallel facilities between the two systems, since both are based on the idea of messages with a sender, a receiver, a message tag, and so on. [GIVE YEARS FOR THIS HISTORY, REFERENCES]

Historically, PVM grew up first as a single, major implementation [QUOTE WHICH IMPL, UTK], and rather than formally define a PVM standard, the PVM standard was defined operationally by the behavior of that implementation. There is a similar history for p4 [REFERENCE], one of several predecessors to MPI. In contrast, it was decided to build MPI first as a standard, with participation by many groups and individuals (including many with experience with p4 and PVM). MPICH was developed at about the same time, but as a {\it reference implementation} of the standard. A reference implementation is typically (and is for MPICH) a free implementation whose primary goal is correctness, and only secondarily speed. A reference implementation often serves as a mathematical analogue of an existence proof (albeit a constructive existence proof) that the standard is self-consistent and can be implemented using general methodology, without undue reliance on features particular to certain hardware or software systems. In practice, a reference implementation also often serves as a means of introducing newcomers to the field to the new concepts; hence, it will sometimes include examples and/or a tutorial. At the time of this writing, there are two major free implementations of MPI: MPICH and LAM. Both run on a large variety of machines. There are also several vendor implementations of MPI that have been fine-tuned to run faster on specific machines.

MPI specifies only an initialization command, \verb|MPI_Init()|, and a finalization command, \verb|MPI_Finalize()|, each of which must be invoked on each processor. \verb|MPI_Init()| must be invoked before all other MPI commands, except \verb|MPI_Initialized()|, and \verb|MPI_Finalize()| must be invoked after all other MPI commands.
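To fix ideas, the following is a minimal sketch (not drawn from any particular implementation) of the overall shape of an MPI program in~C:

\begin{verbatim}
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    /* MPI_Init() must precede every other MPI call except
       MPI_Initialized(); it may modify argc and argv.      */
    if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
        fprintf(stderr, "MPI_Init failed\n");
        return 1;
    }

    /* ... the parallel computation goes here ... */

    /* MPI_Finalize() must follow every other MPI call. */
    MPI_Finalize();
    return 0;
}
\end{verbatim}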
According to the standard, \verb|MPI_Init()| may not be re-invoked later to start a second MPI ``session'', perhaps with the same or a different number of nodes. Just as the C~language proper does not specify any I/O commands, but leaves such commands (\verb|printf()|, \verb|getc()|, etc.) to a C~library, MPI does not specify what \verb|MPI_Init()| must do to start an MPI session, nor how to specify which processors will take part in the session. This note will discuss the procgroup mechanism used by MPICH (LAM uses a different startup mechanism), which apparently originated with p4, an earlier system with goals similar to those of MPI.

=====

An MPI program is typically compiled by linking with an MPI library (typically a ``.a'' archive file, although any format acceptable to the linking loader is possible) and by including an MPI header file (a ``.h'' file).

We will refer to the original process, on the machine that first invokes MPI, as the {\it master}, and to the remote processes as {\it slaves}. The way in which \verb|MPI_Init()| starts the slave processes depends on the architecture: on distributed-memory systems it typically uses \verb|rsh|, passing implementation-specific arguments such as \verb|-p4amslave| followed by a host name and port number; on shared-memory systems it may create the additional processes or threads locally, for example with \verb|pthread_create()| [SAME MECHANISM FOR ALL VENDORS?]; and specialized, high-performance hardware may use its own specific mechanism. The procgroup file need exist only on the machine first invoking MPI.

This startup mechanism is one rationale for passing \verb|&argc| and \verb|&argv| to \verb|MPI_Init()|: a slave can reset its command line inside \verb|MPI_Init()| before giving control back to the user program. Hence, \verb|MPI_Init()| should be called even before the programmer's own command-line processing.

[MENTION the trick of modifying the command line (even \verb|argv[0]|) and then observing the current status of the program using only \verb|ps| or \verb|top|. NOTE that \verb|sendmail| uses this trick: ``sendmail: accepting connections on port 25''; compare with port 25/TCP in \verb|/etc/services|.]

If a password is needed on the remote host, \verb|MPI_Init()| will not stop to ask for one, but will often exit with a cryptic UNIX message, such as ``permission denied''. The manual page for \verb|rsh| (``man rsh'') describes the rules for when passwords are required. In particular, the user can provide a \verb|.rhosts| file in his home directory that will allow the remote command to run without a password.

Each line of a procgroup file names a host, a process count, and a program, with the first line reserved for the local machine:

\begin{verbatim}
local     0
HOSTNAME  #  PROGRAM
\end{verbatim}

Here \verb|local| refers to the machine first invoking MPI, and the 0 indicates that no slave processes, beyond the master itself, are to be started on the local machine. The count \verb|#| is usually 1, but on a shared-memory machine it may be larger than one, with the interpretation that it represents the number of threads in the single process. On high-performance hardware, \verb|#| may have the interpretation of how many nodes to attach. Note that it is often possible to have two MPI nodes on the same processor. For example, the following procgroup file sets up an MPI session consisting of the current process (\verb|local|) and two remote processes, both on \verb|shared_mem.podunk.edu| and each containing two threads:

\begin{verbatim}
local 0
shared_mem.podunk.edu 2 /home/me/myprogram
shared_mem.podunk.edu 2 /home/me/myprogram
\end{verbatim}

If one is programming MPI in an SPMD style, then one will typically choose the same program on all machines (\verb|/home/me/myprogram|). This is made particularly easy by a shared file system such as NFS (Network File System) or AFS (Andrew File System). However, if you are using machines of heterogeneous architectures, then you will need multiple binaries, and in that case you may need to specify different programs for different machines.

During initial debugging, it is wise to begin with a procgroup file such as:

\begin{verbatim}
local 0
localhost 1 /home/me/myprogram
localhost 1 /home/me/myprogram
\end{verbatim}

assuming that \verb|localhost| is defined at your site as an alias for the current machine. There are many ways to test this, but one of the surest is simply to try to execute a remote command, such as ``\verb|rsh localhost pwd|'', and see whether it works. Note that on some systems, \verb|localhost| may itself require a password to log in, in which case the \verb|.rhosts| mechanism may be useful.
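Returning to the rationale for passing \verb|&argc| and \verb|&argv|, the following sketch (the printed output is ours, not part of MPI) shows the recommended ordering: \verb|MPI_Init()| is called first, so that the implementation can remove its own startup arguments (such as the \verb|-p4amslave| arguments mentioned above) before the program examines its command line.

\begin{verbatim}
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int i;

    /* Call MPI_Init() before any command-line processing of our own;
       it may rewrite argc and argv to hide the implementation's
       startup arguments.                                             */
    MPI_Init(&argc, &argv);

    /* Only the program's own options should remain at this point.    */
    for (i = 1; i < argc; i++)
        printf("user option %d: %s\n", i, argv[i]);

    MPI_Finalize();
    return 0;
}
\end{verbatim}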
Although the MPI standard does not specify the internal format of an MPI message, it is useful to begin the description of MPI with an internal format consistent with the standard. Our MPI message format consists of the following fields:

\begin{verbatim}
 _________________________________________________________________
| Communicator | Tag | Source | Destination | Size | ... DATA ... |
 -----------------------------------------------------------------
\end{verbatim}

(The MPI standard speaks of {\it processes}; in this note we use the term {\it node} informally for the same concept.) The {\it source} and {\it destination} fields specify which MPI nodes are sending and receiving this message. Both fields are specified by a {\it rank}, which is encoded as an integer. MPI assigns a unique rank to each node: within a communicator, the ranks are the consecutive integers beginning at~0. In the implementations described here, the local node (the one that first invokes MPI) receives rank~0, and the remaining nodes receive consecutively increasing ranks.

The {\it size} field of our internal format specifies the size of the DATA field in bytes. The MPI standard defines only the related concept of {\it count} (discussed later), not size. The {\it DATA} field is, of course, arbitrary user-defined binary data whose length is given by the size field.

The {\it tag} field is a user-defined integer. The user can set this field according to his own requirements. For example, the user might use the tag field as a message sequence number, as a task-type number, or as an indication of the format of the data, or simply set it always to~0 and ignore that ``feature''.

The {\it communicator} field is analogous to a specified protocol for sockets (such as TCP, UDP, IP, DECnet, etc.). Distinct communicators imply distinct communication streams that are oblivious to each other. However, unlike protocols, the MPI programmer has the ability to create a new communicator, and thereby to establish a new communication stream defined by that communicator. To continue the analogy with sockets, the Ethernet standard wraps an outermost Ethernet header around all messages, while allowing packets with independent protocols to travel inside; most current drivers will read the packets of the desired protocol and ignore all packets of other protocols. Communicators will be made more precise later [IN WHAT SECTION?]. For now, it suffices in simple programs to set all communicators to the single predefined value, \verb|MPI_COMM_WORLD|.

=======================

Given an underlying internal format, which may or may not correspond to the one above, the MPI standard lays down several principles, of which the following two may be the most important:

1. {\it Non-overtaking of messages:} Given two messages of the same {\it message type} (same communicator, same tag, same source, same destination), the message sent earlier must also be received at the destination node earlier.

2. {\it Progress rule:} After a message~$M$ is sent, if the destination node repeatedly calls one of MPI's commands for receiving any of a set of messages that includes~$M$, then there is some upper bound on the number of messages that the destination may receive before receiving~$M$.
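Both principles are stated relative to whatever internal format the implementation chooses. For concreteness, the hypothetical internal format above might be declared in~C roughly as follows; this is a sketch only, since the standard does not prescribe any particular layout, and real implementations differ.

\begin{verbatim}
/* A hypothetical internal message header -- NOT part of the MPI
   standard; an implementation may lay out its messages however
   it likes.                                                      */
typedef struct {
    int communicator;   /* which communication stream             */
    int tag;            /* user-defined integer                   */
    int source;         /* rank of the sending node               */
    int destination;    /* rank of the receiving node             */
    int size;           /* length of the DATA field, in bytes     */
    /* ... followed by `size' bytes of user data ...              */
} msg_header_t;
\end{verbatim}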
=======================

Next, we consider the MPI commands \verb|MPI_Send()| and \verb|MPI_Recv()|. Along with \verb|MPI_Init()| and \verb|MPI_Finalize()|, these MPI library calls suffice to write almost any program that one would want to write with sockets (at least for a session of limited duration with known participants --- sockets, of course, have the further important task of allowing a computer to provide services on a ``well-known port'' to unknown remote nodes). Indeed, note that the names \verb|MPI_Send()| and \verb|MPI_Recv()| correspond in spelling to their socket analogues, \verb|send()| and \verb|recv()|. The calling syntax is:

\begin{verbatim}
MPI_Send(void *buf, int count, MPI_Datatype datatype,
         int dest, int tag, MPI_Comm comm)

MPI_Recv(void *buf, int count, MPI_Datatype datatype,
         int source, int tag, MPI_Comm comm, MPI_Status *status)
\end{verbatim}

We have already seen the \verb|dest|, \verb|source|, \verb|tag|, and \verb|comm| parameters in the earlier description of the internal format. \verb|buf| is a pointer to a user-defined buffer containing the DATA for the message. \verb|datatype| specifies an MPI datatype. MPI provides constants to denote the primitive data types; among the more common are \verb|MPI_INT|, \verb|MPI_FLOAT|, \verb|MPI_CHAR|, and \verb|MPI_BYTE|. In heterogeneous environments, it is important to carry out data translation, often using the XDR facility (SEE LIST OF TERMS). In particular, one should distinguish between \verb|MPI_BYTE|, for which the only data translation should be truncation or padding to a common size (8-bit bytes are by far the most common size today), and \verb|MPI_CHAR|, for which a character-set translation may be required (for example, among ASCII, EBCDIC, a national character set, or one of the ISO standards for international character sets, such as ISO~8859-1). MPI also provides for derived datatypes, such as struct and vector types, as will be discussed later. Finally, the \verb|count| parameter specifies the number of values of the given datatype to be sent. A count of~1 is common, but the count may be larger, to conveniently represent a vector of values, and it may even be~0, if only the tag information is of interest.

Only the \verb|status| parameter remains to be discussed. The status buffer can be used as either an input or an output parameter, depending on the MPI command; \verb|MPI_Recv()| uses it as an output parameter. A status buffer is always associated with a particular message that has already been sent (but may still be in a queue at the destination). The status buffer is a struct, which the MPI standard requires to contain at least two fields with the names \verb|MPI_SOURCE| and \verb|MPI_TAG|. Both fields are C~\verb|int|'s, representing the rank of the source and the tag, respectively. In addition, the MPI status buffer typically has one additional, internal field that encodes the size of the message (in bytes or some other convenient unit). The programmer can access this information through the MPI library call:

\begin{verbatim}
MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)
\end{verbatim}

If the programmer wishes to know the number of values in an associated message, the programmer must provide on input not only the associated status buffer for the message, but also the datatype that the programmer is expecting for the message. A typical implementation of \verb|MPI_Get_count()| will then execute a statement similar to \verb|return status->size/sizeof(datatype);|, although \verb|sizeof(datatype)| would be replaced by some implementation-specific expression, since \verb|datatype| is only an \verb|MPI_Datatype| handle and not a legal C~type.
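As a concrete sketch (the helper function \verb|exchange()| and its \verb|rank| parameter are our own, not part of MPI), rank~0 sends a small vector of ints to rank~1, which then queries the status buffer. We assume \verb|MPI_Init()| has already been called and that the session contains at least two nodes.

\begin{verbatim}
#include <mpi.h>
#include <stdio.h>

void exchange(int rank)
{
    if (rank == 0) {
        int vec[4] = { 10, 20, 30, 40 };
        /* count = 4 values of type MPI_INT, destination rank 1, tag 0 */
        MPI_Send(vec, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int buf[4], count;
        MPI_Status status;

        MPI_Recv(buf, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);

        /* The expected datatype must be supplied again to get a count. */
        MPI_Get_count(&status, MPI_INT, &count);
        printf("received %d ints from rank %d with tag %d\n",
               count, status.MPI_SOURCE, status.MPI_TAG);
    }
}
\end{verbatim}

Note that the receiver names the same datatype, \verb|MPI_INT|, that the sender used, both in \verb|MPI_Recv()| and again in \verb|MPI_Get_count()|.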
The requirement to specify the input parameter, \verb|datatype|, illustrates another important principle:
\begin{quote}
The MPI COUNT parameter is not the SIZE of the message.
\end{quote}
In fact, the MPI standard never mentions size. That field was defined only in our hypothetical internal implementation of an MPI message. The size of the data in the message is purely an internal implementation consideration (which may use units of bytes or any other convenient units). The count parameter is the number of values of type \verb|datatype| being sent or received in the corresponding message. It is up to the programmer to specify the same datatype both when sending and when receiving a message.

=====

MPI also provides variant forms of sending and receiving: the ``B'' (buffered) and ``I'' (immediate, or non-blocking) forms, such as \verb|MPI_Bsend()|, \verb|MPI_Isend()|, and \verb|MPI_Irecv()|. An implementation is free to define these so that they do essentially the same thing as \verb|MPI_Send()| and \verb|MPI_Recv()|. Hence, these forms provide opportunities for an MPI implementation to achieve greater efficiency, but an implementation is not required to provide greater efficiency.

==================================================

[Maybe a little more about the different kinds of sending and receiving (using I or B), but not much more. Then introduce the seven or so layers of the MPI standard: point-to-point, collective communication, \ldots]

The next consideration for a more sophisticated MPI program is the use of collective communication. Depending on the MPI implementation, the collective communication library calls may provide no efficiency advantage over the point-to-point library calls. Nevertheless, such calls have the potential for much greater efficiency, due to binary communication trees [MAKE SURE TO COVER THIS ELSEWHERE], parallel prefix computations, and so on. Note, however, that if all nodes are on a common Ethernet network, then there may be no advantage to a binary communication tree, since all messages are sequentialized on the network. Even in that situation, there may still be an advantage to parallel prefix techniques, since they may allow overlapping of communication and computation. The most basic collective communication calls are the following:

\begin{verbatim}
MPI_Bcast(void *buf, int count, MPI_Datatype datatype,
          int root, MPI_Comm comm)

MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype,
           void *recvbuf, int recvcount, MPI_Datatype recvtype,
           int root, MPI_Comm comm)

MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype,
            void *recvbuf, int recvcount, MPI_Datatype recvtype,
            int root, MPI_Comm comm)

MPI_Reduce(void *sendbuf, void *recvbuf, int count,
           MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
\end{verbatim}

The collective communication calls are posed in a form that encourages an SPMD style of programming. Each node that is a member of the given communicator must execute the same collective communication command with the same parameter values. The action will not be complete until all nodes have executed the command. Further, under some MPI implementations, all nodes executing the command may stall until the last node has joined the collective communication action.

In particular, the collective communication routines above all include a {\it root} parameter. The root parameter is an \verb|int| that specifies a rank (a distinguished node). In the case of \verb|MPI_Bcast()| (broadcast), the root is the node originating the message. The effect is as if multiple messages had been sent, one to every non-root node of the communicator. {\it (N.B.: The root node is not one of the destinations in a broadcast communication.)}

In the case of \verb|MPI_Scatter()|, the root's send buffer is divided into equal, consecutive pieces, one piece per node of the communicator, and each node receives its own piece in its receive buffer. \verb|MPI_Gather()| is the inverse: each node contributes the contents of its send buffer, and the root collects the contributions, in rank order, into consecutive pieces of its receive buffer. For these two calls, different sendtype and recvtype arguments are possible, so long as the type signatures of the data sent and received agree.

In the case of \verb|MPI_Reduce()|, an additional parameter, {\it op}, is defined. It is set to an MPI constant such as \verb|MPI_MAX|, \verb|MPI_SUM|, \verb|MPI_PROD|, \verb|MPI_LXOR|, \verb|MPI_LOR|, \verb|MPI_BOR|, etc. Further, it is possible to define new operations.
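As a small SPMD-style sketch of these calls (the function name and variables are ours), rank~0 broadcasts a problem size to every node, each node computes a local contribution, and the contributions are summed back onto rank~0:

\begin{verbatim}
#include <mpi.h>
#include <stdio.h>

/* Sketch: every node in the communicator must execute these same
   collective calls.  Assumes MPI_Init() has already been called.  */
void broadcast_and_sum(int rank)
{
    int n = 0;
    int local, global;

    if (rank == 0)
        n = 100;                /* only the root knows n initially  */

    /* After the broadcast, every node's copy of n holds 100.       */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    local = rank * n;           /* some local computation           */

    /* Sum the local values; only the root receives the result.     */
    MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %d\n", global);
}
\end{verbatim}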
Note that all of these predefined operations are associative, and therefore susceptible to parallel prefix or related techniques. All of them are also commutative, and hence the order in which the nodes participate in a reduce collective communication does not matter. A user-defined operation may or may not be commutative, but the user is required to state whether it is commutative, in order to give MPI the opportunity for more efficient communication.

{\bf Discuss scan, barrier, reduce-scatter, and the Allxxx versions of the above.}

===== Other layers of the MPI model =====

[REFER TO FULL MANUAL AND ON-LINE STANDARD, ETC., FOR DETAILS]

[SIMPLE MPI PROGRAM WITH TRIVIAL TAG, COMM, ETC., GOES SOMEWHERE]

[EXPLAIN COMMUNICATOR IN MORE DETAIL, NOW]

=====

Issues for MPI-2: interoperability, spawn (dynamic processes), parallel I/O, [WHAT ELSE? LIBRARIES?]. Note that the death of a process is not handled: the MPI standard allows (but does not require) the entire MPI job to die when one process dies. The rationale for not requiring recovery from this event appears to be that a major target for MPI is the specialized, high-performance parallel computers, and on many of these, when one node goes down, much or all of the computer goes down. Further, a large job on a dedicated parallel computer may face no competing jobs, in which case having one node die may be an infrequent event. Nevertheless, it should be noted that LAM does support queries about a processor going down and continuation of the job [DETAILS, WHAT COMMANDS?].

[INCLUDE LIST OF EXISTING IMPLEMENTATIONS]

[NOTES: The MPI standard can be extended to other languages besides C and FORTRAN. Such extensions exist for C++, LISP, and other languages.]

[NOTES: A strangeness: \verb|MPI_INT| and the other datatype constants do not have to be C compile-time constants; they can be implemented as C variables. This can be used as a debugging technique: by assigning distinct values to \verb|MPI_INT| during distinct sessions (perhaps assigning random values), one may be better able to debug situations in which the \verb|MPI_INT| value is accidentally confused with some other program value. Whether this is truly a help in debugging is unclear. However, it is clear that this design can catch an unwary programmer. For example, if \verb|MPI_INT| is a variable, then its value cannot be assigned until \verb|MPI_Init()| is called. Hence, any MPI program that refers to \verb|MPI_INT| before \verb|MPI_Init()| is called is in error under such an implementation. Unfortunately, this also means that \verb|MPI_INT| cannot be used in C initializers of static or global variables.]

[NOTES: The MPI commands of the communicator layer for creating and managing new communicators (\verb|MPI_Comm_create()|, \ldots) are considered to be a form of collective communication. Hence, under some MPI implementations, such commands may stall until all nodes involved in the indicated action have joined the action.]
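For instance, \verb|MPI_Comm_split()| is another of these collective communicator-creation calls. The following sketch (the function name is ours) divides the nodes of \verb|MPI_COMM_WORLD| into two halves, each with its own independent communication stream:

\begin{verbatim}
#include <mpi.h>

/* Sketch: split MPI_COMM_WORLD into two smaller communicators.
   Like MPI_Comm_create(), MPI_Comm_split() is collective, so every
   node of the old communicator must make the call.                 */
void split_into_halves(void)
{
    int rank;
    MPI_Comm half;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Nodes supplying the same "color" (here, rank mod 2) end up in
       the same new communicator; the "key" (here, the old rank)
       orders the ranks within it.                                    */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &half);

    /* ... messages sent on `half' are invisible to the other half ... */

    MPI_Comm_free(&half);
}
\end{verbatim}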