The Queen's University of Belfast

Parallel Computer Centre
How Does PVM Work?
A brief look into the entrails of PVM.
Sub Topics
- Design Considerations
- Components
- Messages
- PVM Daemon
- Libpvm
- Protocols
- Routing
- Task Environment
- Resource Limitations
Design Considerations
- Portable - avoids operating system and language features that are hard to replace where unavailable; the generic port is therefore as simple as possible but can be optimised
- Sockets used for interprocess communication
- Connection within a virtual machine is always possible via TCP or UDP protocols
- Multiprocessor machines which don't support sockets on the nodes have front-end processors which do
Components
TIDs
- TID consists of 4 fields
- TID fits into largest integer data type (32 bits)
- TID structure (32 bits):

      | S | G |    H (12 bits)    |         L (18 bits)         |
       31  30  29               18 17                          0

- S, G and H have global meaning
- H - host number relative to virtual machine
- S - addresses pvmds (historical; may be reclaimed into H or L)
- L - local task TIDs per pvmd
- G - GIDs used in multicasting
H Field
- Each pvmd configured with unique host number
- Each pvmd "owns" part of the TID address space
- Maximum number of hosts: 2^12 - 1 = 4095
- Global host table synchronises host numbers
- Host number 0 refers to local pvmd (or shadow pvmd') depending on context
L Field
- Local task ids
- Controlled by each pvmd
- 18 bits available, therefore 2^18 - 1 tasks on each host
- All bits zero means the pvmd itself
- In the generic version L is assigned from a counter and the pvmd maps L values to Unix process IDs (see the sketch below)
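For illustration of the layout only (applications should treat TIDs as opaque, as the summary below notes), a minimal C sketch assuming the bit positions shown above; the mask names are invented here and are not the pvm3 source macros:

    /* Illustrative masks derived from the stated field widths;
     * the real pvmd uses its own internal definitions. */
    #define TID_S 0x80000000u  /* bit 31: addresses a pvmd           */
    #define TID_G 0x40000000u  /* bit 30: multicast address (GID)    */
    #define TID_H 0x3ffc0000u  /* bits 18-29: host number (12 bits)  */
    #define TID_L 0x0003ffffu  /* bits 0-17: local task id (18 bits) */

    unsigned tid_host(unsigned tid)   { return (tid & TID_H) >> 18; }
    unsigned tid_local(unsigned tid)  { return tid & TID_L; }
    unsigned tid_is_gid(unsigned tid) { return (tid & TID_G) != 0; }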
Summary
- Tasks assigned TIDs by local pvmds without host-to-host communication
- Message routing by hierarchical naming
- Functions may return a mixed vector of error codes and legal TIDs (see table)
- TIDs should be opaque to applications - never attempt to predict TIDs (use groups or implement a name server)

Architecture Classes
- Architecture name is assigned to each machine implementation
- Architecture name used to locate executables (and pvm libraries)
- List of architecture names may (will) expand
- Machines with incompatible executables may have the same data representation
- Architecture names are mapped to data encoding numbers which are used to determine when encoding is required
PVM Daemon
- pvmds are owned by users and do not interact with those of other users
- pvmd - message router and controller
- pvmd provides authentication, process control and fault detection
- pvmds continue running after an application crashes to help with debugging
- First pvmd to start is designated as the master which then starts the other pvmds which are designated as slaves
- All pvmds are considered equal except:
- Reconfiguration requests are forwarded from slave to master
- Only the master can forcibly delete hosts
LIBPVM
- Shares address space with user's (possibly) buggy code - library routines may be affected (crashed, corrupted...)
- Parameter checking is minimal - pvmds are responsible
- Top level code is machine independent
- Low level code is modularised to permit machine specific optimization
Messages
Fragments and Databufs
- pvmd and libpvm manage message buffers - potentially a large dynamic amount of data
- Data is stored in data buffers and accessed via pointers which are passed around - e.g. a multicast message is not replicated, only the pointers
- A refcount is also passed with the pointer and pvm routines use this to decide when to free the data buffer
- Messages are composed without a maximum length. The pack functions allocate memory blocks for buffers and frag descriptors to chain the databufs together
- A frag descriptor struct frag holds:
- a pointer to a block of data fr_dat
- the length of the block fr_len
- a pointer to the databuf fr_buf and its total length fr_max (to prepend or append data)
- link pointers for chaining a list
- refcount - frag is deleted when refcount = 0
- refcount of frag at head of list applies to list and list is deleted when refcount = 0
libpvm
- Packing functions - 5 sets of encoder and decoder functions
- Each buffer has an associated set of functions
- The encoding parameter to pvm_mkbuf (or pvm_initsend) selects the encoder set (see the sketch below)
- Received messages set decoder functions via encoding field in header
- Common formats: raw and XDR (default)
- Inplace encoders pack descriptors of the data which is sent without being copied into the buffer - no inplace decoder
- Foo - encoding used for pvmd communication
- Alien decoder - encoder format doesn't match host's format - message is not unpacked only held or forwarded
- Message ids (MID) - integer index to message heap
- MID is recycled when buffer freed
- Heap grows as required (usually small unless application explicitly stores messages)
- Encoder/decoder vector used for speed (avoid case switch in each pack)
- Message storage in libpvm:

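As a concrete illustration of selecting an encoder set in a task (a minimal sketch using standard libpvm calls; the function and variable names are invented):

    #include "pvm3.h"

    /* PvmDataDefault = XDR (safe in a heterogeneous virtual machine),
     * PvmDataRaw     = no encoding (homogeneous hosts only),
     * PvmDataInPlace = descriptors only: the data is read at send time,
     *                  so it must not change before pvm_send() returns. */
    int send_values(int tid, int tag, int *values, int n)
    {
        pvm_initsend(PvmDataInPlace);
        pvm_pkint(values, n, 1);     /* n ints, stride 1 */
        return pvm_send(tid, tag);
    }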
pvmd
- Simple packing using foo - signed and unsigned integers, and strings
- Messages for the pvmd are reassembled and handled according to their source:
- local task
- remote pvmd
- local or remote special tasks (hoster, tasker, resource manager....)
- Message storage in pvmd:

Control Messages
- Sent to tasks as normal messages
- Tags set in reserved space
- Messages are not queued for receipt by the program and can't be used to attract the task's attention
- Control messages can cause asynchronous actions to be started
- Three control messages are used for direct routing (see later)
- connection request
- connection acknowledge
- task exit (non-existent)
- Other messages include:
- NOOP - do nothing
- OUTPUT - claim child's stdout
- SetTMask - change task's trace mask
- Future modifications may include - user definable control messages for PVM signal handlers
PVM Daemon
Startup
- pvmd configures itself as master or slave
- Creates and binds sockets
- Opens error log file /tmp/pvml.uid
- Master pvmd reads host file (if specified)
- Slave pvmds get their setup configuration from the master pvmd
- Pvmds enter a loop in a function called work() which:
- probes all sources of input
- receives and routes packets
- assembles messages to the pvmd and passes these to the appropriate entry points
Shutdown
- pvmd shuts down when:
- it is deleted from the virtual machine
- killed (via signal)
- loses contact with the master pvmd
- crashes (eg. bus error)
- Two actions are taken:
- kill all local tasks (SIGTERM)
- sends a message to all other pvmds
Host Table
- Describes the configuration of the virtual machine
- Various host tables exist - constructed from host descriptors
- Lists the name, address and communication state of each host
- Issued by the master pvmd and kept synchronised across the pvmds
- Each pvmd can autonomously delete hosts which become unreachable
- Hosts are added in a 3-phase commit operation to ensure global availability
- As the configuration changes, host descriptors are propagated throughout the virtual machine
- Host tables are used to manipulate the set of hosts; eg. select a machine when spawning a process
- Hostfile is parsed and a host table constructed
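A task can inspect the current host table with pvm_config(); a minimal sketch (program structure invented for illustration):

    #include <stdio.h>
    #include "pvm3.h"

    int main(void)
    {
        int nhost, narch, i;
        struct pvmhostinfo *hosts;

        pvm_mytid();                          /* enrol this task in PVM */
        pvm_config(&nhost, &narch, &hosts);   /* snapshot of the host table */
        for (i = 0; i < nhost; i++)
            printf("%s (%s) dtid=%x speed=%d\n",
                   hosts[i].hi_name, hosts[i].hi_arch,
                   hosts[i].hi_tid, hosts[i].hi_speed);
        pvm_exit();
        return 0;
    }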
Host Table Structure

Tasks
- pvmd maintains a list of its local tasks
- Tasks are stored in a doubly linked list - sorted by TID and process id
- Tasker:
- specialised debugger tasks which can be developed (simple template provided)
- introduced in 3.3
- starts (parents) other tasks
- spawn flags cannot enable/disable tasker
- must be explicitly started
Task Table Structure

Wait Contexts
- pvmd is not truly multi-threaded but performs operations concurrently
- Wait contexts (waitc) used to store current state when the context must be changed
- Where more than one phase of waiting is necessary waitcs are linked in a list
- Waitcs are sequentially numbered and this id is sent in the message header of replies - once reply is acknowledged waitc is discarded
- Waitcs associated with exiting (failing) hosts and tasks are retained until outstanding replies are cleared
Fault Detection and Recovery
- pvmd can recover from the loss of any foreign pvmd - except the master
- If a slave loses contact with the master it shuts down
- Virtual machine retains its integrity - does not fragment into partial virtual machines
- Fault tolerance is limited as the master must never crash - run master on most secure system
- Master cannot hand over to another pvmd and exit
- PVM 2 - failure of any pvmd crashed the virtual machine
pvmd'
- Shadow pvmd on master host
- Used to start slave pvmds
- this process can block for seconds (minutes) or hang at various points
- master should be free to respond to other requests
- pvmd' has host number 0
- Talks to master using the normal pvmd-pvmd mechanism (does not communicate with other pvmds or tasks)
- If pvmd' fails the normal detection/clean-up mechanism is used
Starting slave pvmds
- Goal: securely start a process on a remote processor
- Avoid typing passwords every time or storing passwords in a file (very dodgy)
- Possible solutions include inetd, rlogin and telnet
- rsh and rexec() are used
- rsh uses .rhosts or host.equiv files
- rexec() function compiled into pvm - supply password at run time or put it in .netrc file (bad idea)
- Example: task calls pvm_addhosts()
- request is passed to local pvmd
- if local pvmd is a slave it passes the request to the master
- master looks up IP addresses, adds the host table entry and sets the option before passing the request to pvmd'
- pvmd' starts the slave pvmd (multiple slaves may be started in parallel)
- new slave is committed to the virtual machine
- master finally acknowledges the addhosts request with the new host TID and the original function call completes. Note that the task was blocked during the entire process (a call sketch follows the commit steps below)
- Process for committing a new slave to the virtual machine:
- master broadcasts to all existing and new slaves
- slaves now know the new virtual machine configuration
- master waits for all slaves to acknowledge new configuration
- master broadcasts commit message to slaves which then start to use the new hosts table. Thus new hosts are not available until all pvmds know the new configuration
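A minimal sketch of the call that triggers the whole sequence above (the host name and wrapper function are hypothetical):

    #include <stdio.h>
    #include "pvm3.h"

    void grow_vm(void)
    {
        char *hosts[] = { "node2.example.ac.uk" };  /* hypothetical host */
        int   infos[1];

        /* Blocks until the new slave is committed (or the add fails);
         * returns the number of hosts added, with per-host status codes
         * (negative on error) left in infos[]. */
        if (pvm_addhosts(hosts, 1, infos) < 1)
            fprintf(stderr, "addhosts failed: %d\n", infos[0]);
    }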
Resource Manager
- RM is a pvm task responsible for making task and host placement decisions
- 3.3 introduced the RM interface - register a task as an RM with pvm_reg_rm()
- Simple scheduling in pvmd - explicit user placement (may) give maximum efficiency
- RM can use information such as host load averages or interact with external systems such as a queue manager
- The number of RMs can vary from one for the entire machine to one per pvmd
- RM intercepts libpvm function calls (such as addhost, delhost and spawn, or queries such as config, notify and task)
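Registration is a single call; a minimal sketch (error handling and the scheduler message loop are omitted, and the exact hostinfo handling follows the pvm3 distribution and may vary by version):

    #include "pvm3.h"

    void become_rm(void)
    {
        struct pvmhostinfo *hostp;

        if (pvm_reg_rm(&hostp) < 0)   /* register this task as the RM */
            pvm_perror("pvm_reg_rm");
        /* ...then service intercepted scheduler requests as messages... */
    }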
Libpvm
Language Support
- Written in C - supports C and C++ directly
- Fortran library libfpvm3.a written in C as a set of wrapper functions that conform to Fortran (77) call conventions
- Fortran/C linking: preprocess C code for Fortran library with m4 prior to compilation
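As an illustration of the wrapper style only (not the actual libfpvm3.a source; the real wrappers are generated with m4 to match each system's Fortran name-mangling convention):

    #include "pvm3.h"

    /* Fortran 77 passes all arguments by reference, and many
     * compilers append an underscore to external names. */
    void pvmfmytid_(int *tid)
    {
        *tid = pvm_mytid();
    }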
Interaction with daemon
- Connection: the daemon listens on a socket
- /tmp/pvmd.uid contains either the socket IP address and port number, or the path if a Unix-domain socket is used
- uid is the numeric user id under which pvm is running
- Spawned tasks are passed the socket address via PVMSOCK (see later) to save overhead
- Disconnection:
- pvm_exit - clean shutdown (reconnectable with new TID)
- Task exit - forced shutdown
- Reconnection: tasks use PID (Process ID) as identifier
- Task may not always be child of pvmd (intervening processes such as debuggers) therefore PVMEPID (see later) is used and task passes back real PID so that pvmd can control it
- A TCP socket is created and a connection dance is performed
- Task and pvmd must prove their identity: a file is created in /tmp, owned and writeable by the user, and each tries to write to the other's file. Note: this is very rudimentary security and is only as strong as the system's filesystem and the integrity of root.
Protocols
- TCP, UDP and Unix-domain sockets are used due to their availability (although other more appropriate protocols do exist)
- Three possible connections:
- pvmd to pvmd
- pvmd to task
- task to task
Message Headers

- Code - message type (tag)
- Encoding - always set to 1 (foo) by pvmd
- Wait Context ID - wait ID of the waitc (may be 0). Tasks such as the resource manager, tasker and hoster also use wait IDs.
- Checksum - reserved for future use
- Messages are sent in fragments, each with a fragment header. The message header is at the start of the first fragment.
Pvmd to pvmd
- UDP sockets are used
- UDP is unreliable (packets may be lost, duplicated or reordered) so an acknowledgment retry mechanism has been implemented
- UDP limits packet size - long messages are fragmented
- TCP rejected due to
- scalability (N-1 open connections required per pvmd)
- overhead - connection set-up and handshaking
- fault tolerance - simpler with UDP
- Packet header:

- TIDs are true source and final destination addresses (route independent)
- Sequence and Ack numbers start at 1 and increment to 65535 then cycle through 0
- Bit Fields:
- SOM(EOM) - start (end) of message, cleared for intervening fragments
- DAT - packet contains data - must be delivered (sequence number is valid)
- ACK - ack field is valid (future modifications may use a combination of DAT and ACK for greater efficiency)
- FIN - closing connection (on a panic shutdown pvmd tries to send a FIN and ACK to all its peers)
- Pvmd stores the state of connections to other hosts in its host table (see earlier)
- Outgoing packets are appended (FIFO) to the transmit queue (txp)
- Incoming packets are passed to send queues (usually a task) or reassembled into messages and delivered to the pvmd (no receive queue)
- Outstanding packets (sent but not ACKed) are queued (opq) - multiple queues improve efficiency
- Out of sequence packets are held (rxq) until they can be reordered and accepted
- Round trip time (rtt):
- Time difference between sending a packet and its ACK.
- On ACK a packet is removed from opq and discarded.
- A retry timer and count record are held on a per packet basis.
- The timer starts at 3*rtt and doubles to a max of 18 seconds.
- rtt is limited to a maximum of 9 seconds.
- Backoff occurs until at least 10 attempts to reach a host have failed.
- A packet expires after 3 minutes and the host is marked as unreachable (the retry policy is sketched below)
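A sketch of the retry policy just described (illustrative C, not the actual pvmd implementation):

    /* The retry timer starts at 3*rtt and doubles per retry, capped
     * at 18 s; rtt itself is capped at 9 s.  After about 3 minutes
     * of failed attempts the packet expires and the host is marked
     * unreachable. */
    double retry_timeout(double rtt, int nretries)
    {
        double t;

        if (rtt > 9.0)
            rtt = 9.0;
        t = 3.0 * rtt;
        while (nretries-- > 0 && t < 18.0)
            t *= 2.0;
        return (t > 18.0) ? 18.0 : t;
    }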
Pvmd to task and task to task
- Tasks talk to pvmds and other tasks via TCP sockets:
- TCP delivers data reliably but at a price
- UDP can lose packets (even within one host). This unreliability requires timers and interrupts but computing tasks may not be interrupted to perform I/O
- Reliability simplifies implementation - no sequence numbers; only SOM and EOM flags
- A TCP receive requires 2 reads: one for the header, one for the message body. (An optimization to include the next header in a message has been dropped.)
- V3.3 introduced Unix domain socket support for local communication (roughly twice as fast). Default NOUNIXDOM (see installation options)
- Pvmd to task packet header:

Message Routing
pvmd
- Messages are routed by destination address
- Messages to other pvmds are linked to packet descriptors and attached to send queue
- pvmd often sends loop-back style messages to itself (no packet descriptor)
- Messages to a pvmd are reassembled from packets in message reassembly buffers - one buffer for each local task and remote pvmd
- Packet routing
- From local tasks: pvmd reads header, creates a buffer and then chains the descriptor on to the queue for its destination
- Multicast: the descriptor is replicated and one copy placed on each relevant queue; when the last copy is sent the buffer is freed
- Refragmentation: messages are built to avoid fragmentation - in some cases a pvmd needs to fragment a packet for retransmission
Pvmds and foreign tasks
- Pvmds usually don't communicate with foreign tasks (those on another host)
- Messages are routed through the remote task's local pvmd
- Sending task's pvmd reassembles and receives the message and then sends it to the foreign pvmd which reassembles it, reads the destination and forwards the message to the task
libpvm
- Direct message routing permits direct task to task message passing via a TCP link. The pvmd forwarding overhead is avoided at a cost of the additional setup time.
- Implemented entirely in libpvm
- Direct routing is not the default - it must be enabled (PvmRouteDirect)
- If a direct route cannot be established the message is passed by the normal route
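Enabling direct routing from a task is a single option call; a minimal sketch (the wrapper function is invented for illustration):

    #include "pvm3.h"

    void use_direct_routes(void)
    {
        /* Ask libpvm to set up direct TCP links to peer tasks; if a
         * link cannot be established, messages fall back to the
         * normal route through the pvmds. */
        pvm_setopt(PvmRoute, PvmRouteDirect);
    }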
Multicasting
- Current implementation only routes multicast messages via the pvmd
- 1:N fanout minimises the effect of a host failure
- A multicast address (GID) is formed by setting the G bit of the TID
- The L field is incremented on each multicast so that a new address is created
- Pvmd creates a multicast descriptor and GID, sorts and validates the addresses, and copies the message to each local task and foreign pvmd
- Receiving pvmds copy the message to each local task
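A minimal multicast sketch from the sending task's side (the function name is invented; tids and the message tag are assumed to come from an earlier pvm_spawn):

    #include "pvm3.h"

    /* Send one integer to every task in tids[].  Inside PVM only the
     * buffer pointers are replicated; the data is not copied per
     * destination. */
    void bcast_value(int *tids, int ntask, int msgtag, int value)
    {
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&value, 1, 1);
        pvm_mcast(tids, ntask, msgtag);
    }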
Task Environment
Environment Variables
- Increased use of environment variables - may even be extended to machines which don't support such a concept
- Tasks may export any part of their environment to spawned tasks
- For example, PVM_EXPORT=DISPLAY:SHELL would export the variables DISPLAY and SHELL to spawned tasks (along with PVM_EXPORT itself); see the sketch after this list
- User-definable variables include PVM_ROOT, PVM_EXPORT, PVM_DPATH and PVM_DEBUGGER
- Variables set by PVM which should not be modified include PVMSOCK and PVMEPID
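A sketch of exporting part of the environment before spawning (the task name and wrapper function are hypothetical):

    #include <stdlib.h>
    #include "pvm3.h"

    void spawn_with_display(void)
    {
        int tid;

        /* Children inherit DISPLAY and SHELL (plus PVM_EXPORT itself). */
        putenv("PVM_EXPORT=DISPLAY:SHELL");
        pvm_spawn("worker", (char **)0, PvmTaskDefault, "", 1, &tid);
    }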
Standard Input and Output
- /dev/null opened for stdin
- Parent supplies stdout sink (TID, code)
- Output on stdout (and stderr) is read via a pipe by pvmd and packed into messages to be sent to TID with message tag code
- Output to TID zero is sent to the master pvmd to be written to its error log
- Use pvm_setopt to alter TID or code for processes about to be spawned
- TID may be: value inherited from parent; its own TID; or zero
Output
- Code may only be set if TID is set to its own TID
- Output may not be assigned to an arbitrary task
- There are 4 types of output message:
- spawn - a task has been spawned
- begin - first output from a task
- output - output from a task
- end - last output from a task
- Messages always contain TID and code
- Each task sends spawn, begin, zero or more output, end
- Begin, output and end are always in sequence as they originate from the same source
- Spawn originates at the pvmd of the parent so it may be out of sequence
- Output sink stops listening for output when end (EOF) is received
- Use pvm_catchout to collect output from child task(s) into a file
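A minimal sketch of collecting children's output in the parent (the task name and wrapper function are hypothetical):

    #include <stdio.h>
    #include "pvm3.h"

    void run_children(void)
    {
        int tids[4];

        pvm_catchout(stdout);   /* children's output is written here */
        pvm_spawn("worker", (char **)0, PvmTaskDefault, "", 4, tids);
        /* ...output messages are printed as they arrive, each line
         * tagged with the originating child's TID... */
    }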
Resource Limitations
Operating system and hardware limits are imposed on PVM applications. PVM avoids setting explicit limits and instead returns an error when resources are exhausted. On multi-user systems many limits vary dynamically.
- In pvmd
- The maximum number of tasks managed by one pvmd depends on the open file limit and the process creation limit
- Message buffering: messages accumulate (on a FIFO queue) until receiving tasks accept them. Pvmd will accept messages until it runs out of memory - consider sending/receiving balance in the design of tasks
- In tasks
- Direct routing consumes a file descriptor - hence open file limit applies to the number of direct connections
- Size of message is limited by memory:
- A message is packed into the buffer from data held elsewhere, therefore half the memory size is an upper limit
- In-place encoding extends half memory upper limit but the buffering at the receiving end is still a factor (may be no room to unpack the message)
- Many to one message passing may overload receiver
- Keeping messages using multiple buffers uses memory
- Some solutions: use smaller messages; eliminate bottlenecks; process messages in the order they are generated