The Queen's University of Belfast

Parallel Computer Centre
How Does PVM Work?
A brief look into the entrails of PVM.
Sub Topics
- Design Considerations
- Components
- Messages
- PVM Daemon
- Libpvm
- Protocols
- Routing
- Task Environment
- Resource Limitations
Design Considerations
- Portable - avoids operating system and language features that are hard to replace where unavailable; the generic port is therefore as simple as possible but can be optimised
- Sockets used for interprocess communication
- Connection within a virtual machine is always possible via TCP or UDP protocols
- Multiprocessor machines which don't support sockets on the nodes have front-end processors which do
Components
TIDs
- TID consists of 4 fields
- TID fits into largest integer data type (32 bits)
- TID structure (32 bits):

      | S | G |    H (12 bits)    |         L (18 bits)         |
       31  30  29               18 17                          0

- S, G and H have global meaning
- H - host number relative to virtual machine
- S - addresses pvmds (historical; may be reclaimed into H or L)
- L - local task TIDs per pvmd
- G - GIDs used in multicasting
H Field
- Each pvmd configured with unique host number
- Each pvmd "owns" part of the TID address space
- Maximum number of hosts: 2^12 - 1 = 4095
- Global host table synchronises host numbers
- Host number 0 refers to local pvmd (or shadow pvmd') depending on context
L Field
- Local task ids
- Controlled by each pvmd
- 18 bits available, therefore 2^18 - 1 tasks on each host
- All bits zero means the pvmd itself
- In the generic version L is assigned from a counter and the pvmd maps L values to Unix process IDs (see the sketch below)
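For illustration of the layout only (applications should treat TIDs as opaque, as the summary below notes), a minimal C sketch assuming the bit positions shown above; the mask names are invented here and are not the pvm3 source macros:

    /* Illustrative masks derived from the stated field widths;
     * the real pvmd uses its own internal definitions. */
    #define TID_S 0x80000000u  /* bit 31: addresses a pvmd           */
    #define TID_G 0x40000000u  /* bit 30: multicast address (GID)    */
    #define TID_H 0x3ffc0000u  /* bits 18-29: host number (12 bits)  */
    #define TID_L 0x0003ffffu  /* bits 0-17: local task id (18 bits) */

    unsigned tid_host(unsigned tid)   { return (tid & TID_H) >> 18; }
    unsigned tid_local(unsigned tid)  { return tid & TID_L; }
    unsigned tid_is_gid(unsigned tid) { return (tid & TID_G) != 0; }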
Summary
- Tasks assigned TIDs by local pvmds without host-to-host communication
- Message routing by hierarchical naming
- Functions may return a mixed vector of error codes and legal TIDs (see table)
- TIDs should be opaque to applications - never attempt to predict TIDs (use groups or implement a name server)

Architecture Classes
- Architecture name is assigned to each machine implementation
- Architecture name used to locate executables (and pvm libraries)
- List of architecture names may (will) expand
- Machines with incompatible executables may have the same data representation
- Architecture names are mapped to data encoding numbers which are used to determine when encoding is required
PVM Daemon
- pvmds are owned by users and do not interact with those of other users
- pvmd - message router and controller
- pvmd provides authentication, process control and fault detection
- pvmds continue running after an application crashes to help with debugging
- First pvmd to start is designated as the master which then starts the other pvmds which are designated as slaves
- All pvmds are considered equal except:
- Reconfiguration requests are forwarded from slave to master
- Only the master can forcibly delete hosts
LIBPVM
- Shares address space with user's (possibly) buggy code - library routines may be affected (crashed, corrupted...)
- Parameter checking is minimal - pvmds are responsible
- Top level code is machine independent
- Low level code is modularised to permit machine specific optimization
Messages
Fragments and Databufs
- pvmd and libpvm manage message buffers - potentially a large dynamic amount of data
- Data is stored in data buffers and accessed via pointers which are passed around - e.g. a multicast message is not replicated, only the pointers
- A refcount is also passed with the pointer and pvm routines use this to decide when to free the data buffer
- Messages are composed without a maximum length. The pack functions allocate memory blocks for buffers and frag descriptors to chain the databufs together
- A frag descriptor struct frag holds:
- a pointer to a block of data fr_dat
- the length of the block fr_len
- a pointer to the databuf fr_buf and its total length fr_max (to prepend or append data)
- link pointers for chaining a list
- refcount - frag is deleted when refcount = 0
- refcount of frag at head of list applies to list and list is deleted when refcount = 0
libpvm
- Packing functions - 5 sets of encoder and decoder functions
- Each buffer has an associated set of functions
- The encoding parameter to pvm_mkbuf (or pvm_initsend) selects the encoder set (see the sketch below)
- Received messages set decoder functions via encoding field in header
- Common formats: raw and XDR (default)
- Inplace encoders pack descriptors of the data which is sent without being copied into the buffer - no inplace decoder
- Foo - encoding used for pvmd communication
- Alien decoder - encoder format doesn't match host's format - message is not unpacked only held or forwarded
- Message ids (MID) - integer index to message heap
- MID is recycled when buffer freed
- Heap grows as required (usually small unless application explicitly stores messages)
- Encoder/decoder vector used for speed (avoid case switch in each pack)
- Message storage in libpvm:

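As a concrete illustration of selecting an encoder set in a task (a minimal sketch using standard libpvm calls; the function and variable names are invented):

    #include "pvm3.h"

    /* PvmDataDefault = XDR (safe in a heterogeneous virtual machine),
     * PvmDataRaw     = no encoding (homogeneous hosts only),
     * PvmDataInPlace = descriptors only: the data is read at send time,
     *                  so it must not change before pvm_send() returns. */
    int send_values(int tid, int tag, int *values, int n)
    {
        pvm_initsend(PvmDataInPlace);
        pvm_pkint(values, n, 1);     /* n ints, stride 1 */
        return pvm_send(tid, tag);
    }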
pvmd
- Simple packing using foo - signed and unsigned integers, and strings
- Messages for the pvmd are reassembled and handled according to their source:
- local task
- remote pvmd
- local or remote special tasks (hoster, tasker, resource manager....)
- Message storage in pvmd:

Control Messages
- Sent to tasks as normal messages
- Tags set in reserved space
- Messages are not queued for receipt by the program and can't be used to attract the task's attention
- Control messages can cause asynchronous actions to be started
- Three control messages are used for direct routing (see later)
- connection request
- connection acknowledge
- task exit (non-existent)
- Other messages include:
- NOOP - do nothing
- OUTPUT - claim child's stdout
- SetTMask - change task's trace mask
- Future modifications may include - user definable control messages for PVM signal handlers
PVM Daemon
Startup
- pvmd configures itself as master or slave
- Creates and binds sockets
- Opens error log file /tmp/pvml.uid
- Master pvmd reads host file (if specified)
- Slave pvmds get their setup configuration from the master pvmd
- Pvmds enter a loop in a function called work() which:
- probes all sources of input
- receives and routes packets
- assembles messages to the pvmd and passes these to the appropriate entry points
Shutdown
- pvmd shuts down when:
- it is deleted from the virtual machine
- killed (via signal)
- loses contact with the master pvmd
- crashes (eg. bus error)
- Two actions are taken:
- kill all local tasks (SIGTERM)
- sends a message to all other pvmds
Host Table
- Describes the configuration of the virtual machine
- Various host tables exist - constructed from host descriptors
- Lists the name, address and communication state of each host
- Issued by the master pvmd and kept synchronised across the pvmds
- Each pvmd can autonomously delete hosts which become unreachable
- Hosts are added in a 3-phase commit operation to ensure global availability
- As the configuration changes, host descriptors are propagated throughout the virtual machine
- Host tables are used to manipulate the set of hosts; eg. select a machine when spawning a process
- Hostfile is parsed and a host table constructed
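A task can inspect the current host table with pvm_config(); a minimal sketch (program structure invented for illustration):

    #include <stdio.h>
    #include "pvm3.h"

    int main(void)
    {
        int nhost, narch, i;
        struct pvmhostinfo *hosts;

        pvm_mytid();                          /* enrol this task in PVM */
        pvm_config(&nhost, &narch, &hosts);   /* snapshot of the host table */
        for (i = 0; i < nhost; i++)
            printf("%s (%s) dtid=%x speed=%d\n",
                   hosts[i].hi_name, hosts[i].hi_arch,
                   hosts[i].hi_tid, hosts[i].hi_speed);
        pvm_exit();
        return 0;
    }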
Host Table Structure

Tasks
- pvmd maintains a list of its local tasks
- Tasks are stored in a doubly linked list - sorted by TID and process id
- Tasker:
- specialised debugger tasks which can be developed (simple template provided)
- introduced in 3.3
- starts (parents) other tasks
- spawn flags cannot enable/disable tasker
- must be explicitly started
Task Table Structure

Wait Contexts
- pvmd is not truly multi-threaded but performs operations concurrently
- Wait contexts (waitc) used to store current state when the context must be changed
- Where more than one phase of waiting is necessary waitcs are linked in a list
- Waitcs are sequentially numbered and this id is sent in the message header of replies - once reply is acknowledged waitc is discarded
- Waitcs associated with exiting (failing) hosts and tasks are retained until outstanding replies are cleared
Fault Detection and Recovery
- pvmd can recover from the loss of any foreign pvmd - except the master
- If a slave loses contact with the master it shuts down
- Virtual machine retains its integrity - does not fragment into partial virtual machines
- Fault tolerance is limited as the master must never crash - run master on most secure system
- Master cannot hand over to another pvmd and exit
- PVM 2 - failure of any pvmd crashed the virtual machine
pvmd'
- Shadow pvmd on master host
- Used to start slave pvmds
- this process can block for seconds (minutes) or hang at various points
- master should be free to respond to other requests
- pvmd' has host number 0
- Talks to master using the normal pvmd-pvmd mechanism (does not communicate with other pvmds or tasks)
- If pvmd' fails the normal detection/clean-up mechanism is used
Starting slave pvmds
- Goal: securely start a process on a remote processor
- Avoid typing passwords every time or storing passwords in a file (very dodgy)
- Possible solutions include inetd, rlogin and telnet
- rsh and rexec() are used
- rsh uses .rhosts or host.equiv files
- rexec() function compiled into pvm - supply password at run time or put it in .netrc file (bad idea)
- Example: task calls pvm_addhosts()
- request is passed to local pvmd
- if local pvmd is a slave it passes the request to the master
- master looks up IP addresses, adds the host table entry and sets the option before passing the request to pvmd'
- pvmd' starts the slave pvmd (multiple slaves may be started in parallel)
- new slave is committed to the virtual machine
- master finally acknowledges the addhosts request with the new host TID and the original function call completes. Note that the task was blocked during the entire process (a call sketch follows the commit steps below)
- Process for committing a new slave to the virtual machine:
- master broadcasts to all existing and new slaves
- slaves now know the new virtual machine configuration
- master waits for all slaves to acknowledge new configuration
- master broadcasts commit message to slaves which then start to use the new hosts table. Thus new hosts are not available until all pvmds know the new configuration
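A minimal sketch of the call that triggers the whole sequence above (the host name and wrapper function are hypothetical):

    #include <stdio.h>
    #include "pvm3.h"

    void grow_vm(void)
    {
        char *hosts[] = { "node2.example.ac.uk" };  /* hypothetical host */
        int   infos[1];

        /* Blocks until the new slave is committed (or the add fails);
         * returns the number of hosts added, with per-host status codes
         * (negative on error) left in infos[]. */
        if (pvm_addhosts(hosts, 1, infos) < 1)
            fprintf(stderr, "addhosts failed: %d\n", infos[0]);
    }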
Resource Manager
- RM is a pvm task responsible for making task and host placement decisions
- 3.3 introduced the RM interface - register a task as an RM with pvm_reg_rm()
- Simple scheduling in pvmd - explicit user placement (may) give maximum efficiency
- RM can use information such as host load averages or interact with external systems such as a queue manager
- The number of RMs can vary from one for the entire machine to one per pvmd
- RM intercepts libpvm function calls (such as addhost, delhost and spawn, or queries such as config, notify and task)
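Registration is a single call; a minimal sketch (error handling and the scheduler message loop are omitted, and the exact hostinfo handling follows the pvm3 distribution and may vary by version):

    #include "pvm3.h"

    void become_rm(void)
    {
        struct pvmhostinfo *hostp;

        if (pvm_reg_rm(&hostp) < 0)   /* register this task as the RM */
            pvm_perror("pvm_reg_rm");
        /* ...then service intercepted scheduler requests as messages... */
    }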
Libpvm
Language Support
- Written in C - supports C and C++ directly
- Fortran library libfpvm3.a written in C as a set of wrapper functions that conform to Fortran (77) call conventions
- Fortran/C linking: preprocess C code for Fortran library with m4 prior to compilation
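As an illustration of the wrapper style only (not the actual libfpvm3.a source; the real wrappers are generated with m4 to match each system's Fortran name-mangling convention):

    #include "pvm3.h"

    /* Fortran 77 passes all arguments by reference, and many
     * compilers append an underscore to external names. */
    void pvmfmytid_(int *tid)
    {
        *tid = pvm_mytid();
    }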
Interaction with daemon
- Connection: the daemon listens on a socket
- /tmp/pvmd.uid contains either the socket IP address and port number, or the path if a Unix-domain socket is used
- uid is the numeric user id under which pvm is running
- Spawned tasks are passed the socket address via PVMSOCK (see later) to save overhead
- Disconnection:
- pvm_exit - clean shutdown (reconnectable with new TID)
- Task exit - forced shutdown
- Reconnection: tasks use PID (Process ID) as identifier
- Task may not always be child of pvmd (intervening processes such as debuggers) therefore PVMEPID (see later) is used and task passes back real PID so that pvmd can control it
- A TCP socket is created and a connection dance is performed
- Task and pvmd must prove their identity: a file is created in /tmp, owned and writeable by the user, and each tries to write to the other's file. Note: this is very rudimentary security and is only as strong as the system's filesystem and the integrity of root.
Protocols
- TCP, UDP and Unix-domain sockets are used due to their availability (although other more appropriate protocols do exist)
- Three possible connections:
- pvmd to pvmd
- pvmd to task
- task to task
Message Headers

- Code - message type (tag)
- Encoding - always set to 1 (foo) by pvmd
- Wait Context ID - wait ID of the waitc (may be 0). Tasks such as the resource manager, tasker and hoster also use wait IDs.
- Checksum - reserved for future use
- Messages are sent in fragments, each with a fragment header. The message header is at the start of the first fragment.
Pvmd to pvmd
- UDP sockets are used
- UDP is unreliable (packets may be lost, duplicated or reordered) so an acknowledgment retry mechanism has been implemented
- UDP limits packet size - long messages are fragmented
- TCP rejected due to
- scalability (N-1 open connections required per pvmd)
- overhead - connection set-up and handshaking
- fault tolerance - simpler with UDP
- Packet header:

- TIDs are true source and final destination addresses (route independent)
- Sequence and Ack numbers start at 1 and increment to 65535 then cycle through 0
- Bit Fields:
- SOM(EOM) - start (end) of message, cleared for intervening fragments
- DAT - packet contains data - must be delivered (sequence number is valid)
- ACK - ack field is valid (future modifications may use a combination of DAT and ACK for greater efficiency)
- FIN - closing connection (on a panic shutdown pvmd tries to send a FIN and ACK to all its peers)
- Pvmd stores the state of connections to other hosts in its host table (see earlier)
- Outgoing packets are appended (FIFO) to the transmit queue (txp)
- Incoming packets are passed to send queues (usually a task) or reassembled into messages and delivered to the pvmd (no receive queue)
- Outstanding packets (sent but not ACKed) are queued (opq) - multiple queues improve efficiency
- Out of sequence packets are held (rxq) until they can be reordered and accepted
- Round trip time (rtt):
- Time difference between sending a packet and its ACK.
- On ACK a packet is removed from opq and discarded.
- A retry timer and count record are held on a per packet basis.
- The timer starts at 3*rtt and doubles to a max of 18 seconds.
- rtt is limited to a maximum of 9 seconds.
- Backoff occurs until at least 10 attempts to reach a host have failed.
- A packet expires after 3 minutes and the host is marked as unreachable (the retry policy is sketched below)
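A sketch of the retry policy just described (illustrative C, not the actual pvmd implementation):

    /* The retry timer starts at 3*rtt and doubles per retry, capped
     * at 18 s; rtt itself is capped at 9 s.  After about 3 minutes
     * of failed attempts the packet expires and the host is marked
     * unreachable. */
    double retry_timeout(double rtt, int nretries)
    {
        double t;

        if (rtt > 9.0)
            rtt = 9.0;
        t = 3.0 * rtt;
        while (nretries-- > 0 && t < 18.0)
            t *= 2.0;
        return (t > 18.0) ? 18.0 : t;
    }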
Pvmd to task and task to task
- Tasks talk to pvmds and other tasks via TCP sockets:
- TCP delivers data reliably but at a price
- UDP can lose packets (even within one host). This unreliability requires timers and interrupts but computing tasks may not be interrupted to perform I/O
- Reliability simplifies implementation - no sequence numbers; only SOM and EOM flags
- A TCP receive requires 2 reads: one for the header, one for the message body. (An optimization to include the next header in a message has been dropped.)
- V3.3 introduced Unix domain socket support for local communication (roughly twice as fast). Default NOUNIXDOM (see installation options)
- Pvmd to task packet header:

Message Routing
pvmd
- Messages are routed by destination address
- Messages to other pvmds are linked to packet descriptors and attached to send queue
- pvmd often sends loop-back style messages to itself (no packet descriptor)
- Messages to a pvmd are reassembled from packets in message reassembly buffers - one buffer for each local task and remote pvmd
- Packet routing
- From local tasks: pvmd reads header, creates a buffer and then chains the descriptor on to the queue for its destination
- Multicast: the descriptor is replicated and one copy placed on each relevant queue; when the last copy is sent the buffer is freed
- Refragmentation: messages are built to avoid fragmentation - in some cases a pvmd needs to fragment a packet for retransmission
Pvmds and foreign tasks
- Pvmds usually don't communicate with foreign tasks (those on another host)
- Messages are routed through the remote task's local pvmd
- Sending task's pvmd reassembles and receives the message and then sends it to the foreign pvmd which reassembles it, reads the destination and forwards the message to the task
libpvm
- Direct message routing permits direct task to task message passing via a TCP link. The pvmd forwarding overhead is avoided at a cost of the additional setup time.
- Implemented entirely in libpvm
- Direct routing is not the default - it must be enabled (PvmRouteDirect)
- If a direct route cannot be established the message is passed by the normal route
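Enabling direct routing from a task is a single option call; a minimal sketch (the wrapper function is invented for illustration):

    #include "pvm3.h"

    void use_direct_routes(void)
    {
        /* Ask libpvm to set up direct TCP links to peer tasks; if a
         * link cannot be established, messages fall back to the
         * normal route through the pvmds. */
        pvm_setopt(PvmRoute, PvmRouteDirect);
    }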
Multicasting
- Current implementation only routes multicast messages via the pvmd
- 1:N fanout minimises the effect of a host failure
- A multicast address (GID) is formed by setting the G bit of the TID
- The L field is incremented on each multicast so that a new address is created
- Pvmd creates a multicast descriptor and GID, sorts and validates the addresses, and copies the message to each local task and foreign pvmd
- Receiving pvmds copy the message to each local task
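A minimal multicast sketch from the sending task's side (the function name is invented; tids and the message tag are assumed to come from an earlier pvm_spawn):

    #include "pvm3.h"

    /* Send one integer to every task in tids[].  Inside PVM only the
     * buffer pointers are replicated; the data is not copied per
     * destination. */
    void bcast_value(int *tids, int ntask, int msgtag, int value)
    {
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&value, 1, 1);
        pvm_mcast(tids, ntask, msgtag);
    }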
Task Environment
Environment Variables
- Increased use of environment variables - may even be extended to machines which don't support such a concept
- Tasks may export any part of their environment to spawned tasks
- For example, PVM_EXPORT=DISPLAY:SHELL would export the variables DISPLAY and SHELL to spawned tasks (along with PVM_EXPORT itself); see the sketch after this list
- User-definable variables include PVM_ROOT, PVM_EXPORT, PVM_DPATH and PVM_DEBUGGER
- Variables set by PVM which should not be modified include PVMSOCK and PVMEPID
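A sketch of exporting part of the environment before spawning (the task name and wrapper function are hypothetical):

    #include <stdlib.h>
    #include "pvm3.h"

    void spawn_with_display(void)
    {
        int tid;

        /* Children inherit DISPLAY and SHELL (plus PVM_EXPORT itself). */
        putenv("PVM_EXPORT=DISPLAY:SHELL");
        pvm_spawn("worker", (char **)0, PvmTaskDefault, "", 1, &tid);
    }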
Standard Input and Output
- /dev/null opened for stdin
- Parent supplies stdout sink (TID, code)
- Output on stdout (and stderr) is read via a pipe by pvmd and packed into messages to be sent to TID with message tag code
- Output to TID zero is sent to the master pvmd to be written to its error log
- Use pvm_setopt to alter TID or code for processes about to be spawned
- TID may be: value inherited from parent; its own TID; or zero
Output
- Code may only be set if TID is set to its own TID
- Output may not be assigned to an arbitrary task
- There are 4 types of output message:
- spawn - a task has been spawned
- begin - first output from a task
- output - output from a task
- end - last output from a task
- Messages always contain TID and code
- Each task sends spawn, begin, zero or more output, end
- Begin, output and end are always in sequence as they originate from the same source
- Spawn originates at the pvmd of the parent so it may be out of sequence
- Output sink stops listening for output when end (EOF) is received
- Use pvm_catchout to collect output from child task(s) into a file
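A minimal sketch of collecting children's output in the parent (the task name and wrapper function are hypothetical):

    #include <stdio.h>
    #include "pvm3.h"

    void run_children(void)
    {
        int tids[4];

        pvm_catchout(stdout);   /* children's output is written here */
        pvm_spawn("worker", (char **)0, PvmTaskDefault, "", 4, tids);
        /* ...output messages are printed as they arrive, each line
         * tagged with the originating child's TID... */
    }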
Resource Limitations
Operating system and hardware limits are imposed on PVM applications. PVM avoids setting explicit limits and instead returns an error when resources are exhausted. On multi-user systems many limits vary dynamically.
- In pvmd
- The maximum number of tasks managed by one pvmd depends on the open file limit and the process creation limit
- Message buffering: messages accumulate (on a FIFO queue) until receiving tasks accept them. Pvmd will accept messages until it runs out of memory - consider sending/receiving balance in the design of tasks
- In tasks
- Direct routing consumes a file descriptor - hence open file limit applies to the number of direct connections
- Size of message is limited by memory:
- A message is packed into the buffer from data held elsewhere, therefore half the memory size is an upper limit
- In-place encoding extends half memory upper limit but the buffering at the receiving end is still a factor (may be no room to unpack the message)
- Many to one message passing may overload receiver
- Keeping messages using multiple buffers uses memory
- Some solutions: use smaller messages; eliminate bottlenecks; process messages in the order they are generated