Abstract

 

Emerald is an object-oriented programming system, which makes execution of Emerald applications upon several computers connected by a network possible. Emerald supports the movement of individual objects and any contained processes between computers. This framework suggests implementing load distribution, so that the workload is distributed across the available processors.

When load distribution distributes processes and objects, inter-object communication must use the network. This is much slower than communication on a single computer, therefore load distribution should consider inter-object communication as well as utilization of processors.

Such load distribution is designed, implemented, and experimented with in order to test if: (1) Load distribution not considering communication can significantly reduce the elapsed wall-clock time of compute-intensive object-oriented applications. (2) Load distribution considering communication as well as load can reduce elapsed wall-clock execution time of non-compute-intensive applications.

The developed algorithms move objects and their contained processes from heavily-loaded computers to lightly-loaded computers. In periods when all computers are lightly-loaded, heavily-communicating objects are gathered on the same computer. Their placement is based on statistics describing the inter-object communication.

The experiment shows that: (1) Load distribution can significantly reduce elapsed wall-clock execution time of compute-intensive applications (2) Load distribution only considering load, not object communication, in object-oriented distributed programming systems may increase elapsed wall-clock time. (3) The chosen approach of considering both load and object communication cannot, in general, reduce elapsed wall-clock time of execution.

Key words: object-oriented programming systems, object mobility, Emerald, load distribution, inter-object communication, reducing network usage.

Appendix 3 contains a brief summary in Danish.

Table of Contents

1 Introduction 6

1.1 System 6

1.2 New Performance Possibilities 6

1.3 Theses and Objective 7

1.4 Experiment 7

1.5 Contributions 7

1.6 Thesis Organization 7

2 Environment of the Experiment: Emerald 8

2.1 Emerald at the Language Level 8

2.2 System 9

2.3 Object Implementation 10

2.4 Processes and their Implementation using Threads 10

2.5 Various Definitions 11

2.6 Assumptions 12

2.7 Summary 12

3 Overall Design 13

3.1 Design Criteria 13

3.2 Aims 13

3.3 Load Distribution in Operating or Programming Systems 14

3.4 Previous Work 15

3.5 A Strategy without Object Thrashing 15

3.6 Summary 17

4 Designing the Load Distribution Algorithm (LDA) 18

4.1 Design Method and Overview 18

4.2 Static and Dynamic Algorithms 18

4.3 Load Factor 19

4.4 Migration/Nonpreemptive Movement 20

4.5 Components of a Load-Distribution Algorithm 21

4.6 Selecting constants and Adapting Values 28

4.7 The LDA 30

4.8 Properties of the LDA 33

4.9 Summary 36

5 Designing the Remote Invocation Reduction Algorithm (RIRA) 37

5.1 Three Approaches 37

5.2 Selecting the RIRA 38

5.3 Designing the RIRA 39

5.4 The RIRA 43

5.5 Properties of the RIRA 43

5.6 Summary 44

6 Implementation 45

6.1 Program Modules 45

6.2 Load Distribution 46

6.3 Statistic Gathering 46

6.4 Reduction of Remote Invocations 49

6.5 Performance Considerations 49

6.6 Summary 50

7 Experiment 51

7.1 Contents of Experiment 51

7.2 Test Conditions 52

7.3 Performing the Tests 53

7.4 Statistic Gathering's impact on Invocation Performance 54

7.5 Compute-intensive Application 54

7.6 Compute-intensive Application with Large Data Area 57

7.7 Compute-intensive Application with Manual Load Distribution 59

7.8 Load varying application 60

7.9 Summary 61

8 Conclusion 63

8.1 Designed Algorithms 63

8.2 Confirmation or Denial of Theses 64

8.3 Future Work 64

8.4 Lessons Learned 65

8.5 Contributions 66

References 67

Appendix 1: Measurements behind figure 23 69

Appendix 2: Measurements behind figure 28 70

Appendix 3: Brief Danish Summary 71

Acknowledgements

 

Thanks to my advisor Eric Jul for his guidance during the writing of this thesis.

Thanks to Niels Elgaard Larsen for his competent and quick answers to subtle Emerald issues.

I am grateful to Peter Dickman for his thorough comments to two earlier versions of this thesis. His comments have increased the quality of the argumentation as well as the language considerably. Also, the punctuation in this final version of the thesis should be much closer to English punctuation than before.

Also, thanks to my brother-in-law, Allan Dystrup Nielsen for rewarding comments to earlier versions of this thesis.

Thanks goes to my fellow students in the study group on object-oriented kernels.

I thank my fiancee, Gitte, for her patience during my nights and weekends spent at the university. Also, thanks for the Danish-English dictionary I received when I decided to write this thesis in English. It has been indispensable.

Finally, I am grateful to my parents for their financial support during my (many) years as a student.

1 Introduction

This thesis describes an experiment in load distribution in Emerald. In this introduction, we discuss the background, theses, objective, experiment, and main contributions of the thesis. In section 1.6, an overview of the thesis is presented.

Through out the thesis we assume that the reader has some knowledge about distributed systems and object-oriented programming systems. However, previous knowledge of Emerald is not required; chapter 2 provides the necessary introduction.

1.1 System

Computer systems consisting of a moderate number, e.g., below 100, of computers connected by a network are fairly common. The computers can be of any kind and size, e.g., personal computers, workstations, or mainframes. The network is a relatively fast broadcast-network, e.g., an Ethernet.

The system may be connected to other such systems, but the connections to the other systems are often slow and the network might not be able to perform broadcasts on these systems.

We will only consider systems in which it is possible to perform broadcasts. We call these distributed systems.

It is possible to put several types of system software onto such hardware; in this thesis, we concentrate on distributed programming systems. Many distributed programming systems have been proposed and some implemented, e.g., Distributed Smalltalk [Bennett 90], and SR [Andrews 82]. Here we use the object-oriented distributed programming system Emerald [Hutchinson 87a].

1.2 New Performance Possibilities

Significant performance gains can be experienced by utilizing the large number of resources in distributed systems. Though the utilization of remote resources of any kind is of interest, in this thesis, we describe an experiment dealing only with the utilization of processors, called Load distribution.

In operating systems load distribution is often implemented by moving a program and all or some of its processes to a lightly-loaded node. We refer to this as task migration. In an object-oriented programming system, however, it is possible to move individual objects and the contained processes. The movable entity is smaller; this raises the problem that a program split up among several nodes must use the network intensively for communication between the objects. This can reduce performance although the system utilizes intensive parallelism.

In this thesis, we develop and test algorithms that address both utilization of parallelism and communication overhead in Emerald.

Our aims are:

o To distribute processes among processors when there is need for the processing power of multiple processors.

o To gather processes onto a few processors when the need for processing power is less than the overhead of communication.

There is a conflict between these two aims, because objects cannot be distributed and gathered at the same time.

1.3 Theses and Objective

We believe that:

(1) Load distribution can significantly reduce elapsed wall-clock time of compute-intensive object-oriented applications.

(2) Load distribution and reduction of communication can reduce elapsed wall-clock time of non-compute-intensive object-oriented applications.

Our objective is to perform an experiment to confirm or deny these theses. We elaborate on the theses in chapter 3 after we have described Emerald in chapter 2.

1.4 Experiment

An experiment is performed in order to test the theses. We do this by designing and implementing load distribution and communication reduction in Emerald. After that, we test the theses by executing Emerald applications upon the new system.

Thus, the experiment contains the following steps:

o Design of a load distribution algorithm.

o Design of a communication reduction algorithm.

o Limiting problems with the potential conflict of these two algorithms.

o Obtaining results from the experiment, so that the theses can be confirmed or denied.

1.5 Contributions

Our work has shown that:

o Load distribution in object-oriented distributed programming systems can significantly improve the performance of compute-intensive applications.

o Load distribution only considering load, not object communication, in object-oriented distributed programming systems may degrade performance seriously.

o Our approach to considering both load and object communication could not, in general, improve performance.

1.6 Thesis Organization

The thesis is organized as follows: chapter 2 contains a description of the environment in which we perform the experiment, i.e., Emerald. Further, the terminology used through out the thesis and assumptions about the system are described. Chapter 3 is devoted to a discussion of the overall design of the two algorithms. Chapter 4 contains the design of the load distribution algorithm and chapter 5 the design of the communication reduction algorithm. In chapter 6 we give a description of the implementation and chapter 7 contains a discussion and documentation of the performed experiment. Chapter 8 is the conclusion.

Chapter 6 can be skipped. Some footnotes contain terms and topics that require previous knowledge of Emerald; these may also be skipped. We cannot recommend skipping chapter 2, even for readers familiar with Emerald, because some of the terminology used is specific to this thesis.

2 Environment of the Experiment: Emerald

In this chapter we describe the environment in which the experiment is performed. This involves a description of relevant parts of Emerald. Also terminology specific to Emerald and this thesis is described. First, in section 2.1, relevant parts of Emerald are described at the language level. Second, in sections 2.2 through 2.4, Emerald is described at lower levels. Third, in section 2.5 various definitions are presented. Finally, section 2.6 contains assumptions about the system.

We do not describe the Emerald programming language or the complete Emerald system. Such descriptions can be found in [Hutchinson 87b] and [Jul 89] respectively.

2.1 Emerald at the Language Level

An Emerald system consists of a moderate number of workstations connected by a broadcast-network. We use the terms node or processor for workstation. Because all nodes communicate though the broadcast-network, they all communicate with one another at the same speed. The number of nodes may vary dynamically and there may only be a single node. All nodes are considered to be uniprocessors with the same processor architecture and they fail in a fail-stop manner. An Emerald kernel runs on each node in the system; it provides run-time support to Emerald applications that have been compiled to native machine code by the Emerald compiler.

An Emerald application consists of a number of objects described by object definitions. An object definition includes data structures, operations and a process definition; each of these three items are optional. An object encapsulates data structures, meaning that these can only be read or written from outside the object by performing invocations of the operations that the object provide. A process definition contains the process code; a process is created and begins executing the code of the process definition when the object is evaluated by the use of an object constructor. Several instances of the same object definition, and hence process, may be created by using several object constructors on the same object definition. Thus, Emerald supports dynamic object and process creation.

A programmer may indicate that an object is immutable, meaning that its state cannot change. Immutable objects are considered to be at all nodes. An object that is not immutable is resident at exactly one node at any given time.

Emerald applications synchronize through monitors and condition variables [Brinch Hansen 75].

Invocations can be local or remote. A local invocation is performed when the invoker and the invokee reside on the same node. A remote invocation is performed when they do not. Apart from the time required, it is transparent to the Emerald programmer whether an invocation is local or remote.

To some extent an Emerald programmer can control distribution of objects in an application: an object can be moved to a specified node by the use of a move statement; and the node where an object is resident can be found. Also, the programmer may specify an object, O1, to be attached to some other object, O2. This implies that if O2 moves to another node, O1 will move along. An Emerald program can obtain a list of all local nodes which currently are running Emerald kernels.

A user runs an Emerald application by starting a program, runec, from the Unix command line. runec calls the resident kernel, that loads the requested Emerald application from disk into memory. All objects in the application will initially reside on the node from which runec is executed.

2.2 System

Emerald, and several other distributed programming systems, e.g., Distributed Smalltalk [Bennett 90], Matchmaker [Jones 86], and Eden [Almes 85], are implemented as kernels that run as a process upon a conventional operating system in parallel with other unrelated processes. In Emerald, the underlying operating system is Unix and the Emerald kernel is a single Unix-process. Inside this Unix-process several threads of control are executing in pseudo-parallel implemented by an Emerald round-robin scheduler with time-slicing.

The threads of control are signal handlers, kernel tasks and threads. Signal handlers and kernel tasks are pieces of kernel code. Threads are a special part of Emerald user processes. We discuss threads in detail in section 2.4 below. A simplified structure of an Emerald system on a single node is shown in figure 1. The figure shows that the Emerald kernel is only a part of the entire Unix operating system. The Emerald kernel and all of its threads compete with other Unix processes for the node's resources.

The Emerald kernel, compiler and supporting programs, e.g., runec, are all programmed in C.

Emerald runs on the following architectures: VAX, Sun-3, Sparc and HP-9000, but on one system all kernels must run on the same architecture. In other aspects, however, all nodes do not have to be identical, e.g., they may not all be equipped with the same amount of memory. In this thesis only the Sparc architecture is used.

Emerald kernels on different nodes keep track of each other by sending messages though the network during start-up and termination. In Emerald terminology start-up is called boot.

Emerald supports reliable delivery of point-to-point messages, and unreliable delivery of broadcasts.

Emerald applications may be compiled and started at any time on any node. Therefore, the Emerald kernel cannot know about applications until they are executed.

2.3 Object Implementation

Emerald has four different, internal representations of objects. The representations reflect the fact that different objects do not require the same flexibility. By using four different representations large performance improvements are obtained. The four representations are: global objects, local objects, direct objects, and immutable objects. We describe these and related topics in the following.

The most flexible object representation is global object. These are the only objects that can move independently between nodes. Local objects reside inside global objects and always move with the enclosing global object. As previously described, immutable objects are considered to be resident at all nodes. They are copied between nodes and cannot be moved like global objects. Direct objects are simple immutable objects. They are used to represent integers and other simple build-in types or structures. The Emerald compiler determines if an object is represented as a global, a local, or a direct object. Its decision is transparent to the Emerald programmer.

When moving a global object from one node, S, to another, T, all threads executing in this global object on S are destroyed, and new, equivalent threads are created on T. It is not possible to migrate threads. The code of a moving object is not moved with the object. If the receiving node has not got the code, it is loaded from disk.

When we write that an object, S, invokes another object, T, we actually mean that a thread (see next section) executing code in S invokes an operation provided by T.

When we write that an object O has X interactions with node N, we mean that X is the sum of the number of invocations of O from all objects on N and the number of invocations O performs on all objects on N. Likewise, object O1 has X interactions with object O2 means that the sum of the invocations O1 performs on O2 and the invocations O2 performs on O1, is X.

2.4 Processes and their Implementation using Threads

We consider both the objects and the processes created by using several object constructors on the same object definition to be different.

The code of an Emerald process is executed by threads. A thread is created at the same time as the process. When a thread T1, on node N1, performs a remote invocation, T1 is stopped and a new thread, T2, is created on the remote node, N2, by sending it an invocation request. T2 may again perform a new remote invocation, be stopped, and so on. When T2 has executed the code of the invoked operation, it is destroyed, and T1 is resumed by sending an invocation reply to node N1. The remote invocation is terminated. Note that in this way a process may have several threads on different nodes; however, all but one are stopped.

In figure 2 an example of a process created on node N1 is shown. On this node, thread T1 is created. T1 is stopped when it performs a remote invocation of an object on node N2. On this node a new thread T2 is created. This thread invokes an object on node N3, and so on. Note that the process has two threads on node 2. All threads but T4 are stopped.

A thread that is not stopped need not be executing code, because it may be blocked waiting on some event. Possible events are monitor entry and (signal to a) blocked condition variable.

We call a thread that is either executing in a processor, or waiting for access to this (and nothing else), active. Also, we call an object in which at least one active thread is executing, active. Non-active threads and objects are passive.

A passive object, O, becomes active in any one of the following situations:

(1) O containing a process-section is created.

(2) O is invoked.

(3) An event, which a thread in O is waiting for, occurs.

(4) A remote invocation returns to O.

An active object, O, becomes passive in any one of the following situations:

(1) O reaches the end of its process-section.

(2) The only executing thread in O makes an invocation.

(3) The only executing thread in O starts waiting on an event.

2.5 Various Definitions

Emerald allows movement of active objects; this can be viewed as thread migration, and we use this term, though it is not possible to move threads in Emerald as previously described. In nonpreemptive movement only newly created processes are moved. Many operating systems only allow nonpreemptive moves due to the difficulty of collecting all state information of an executing process.

By processor thrashing we mean that threads are continuously migrated without performing any useful work. By object thrashing we mean that objects are being moved back and forth between nodes, because the optimal placement cannot be determined.

By speed-up on N nodes we mean the elapsed wall-clock time on execution on one processor divided by the elapsed wall-clock time on execution on N processors [Møller-Nielsen 87].

There are two possible definitions of run-time for execution time on a single processor: the wall-clock time using a sequential algorithm; and the wall-clock time using the same algorithm as on N processors, i.e., the parallel algorithm executing in pseudo-parallel on a single processor. As in [Møller-Nielsen 87] we use the run-time of the sequential algorithm.

The speed-up is a number in the interval ]0;N]. A speed-up below 1 means that the run-time is increased by using more processors. The closer the speed-up is to N, the larger the benefit of using several processors.

Processor starvation means that some processors are processing threads, while other are not, because there are not enough threads that need processing [Møller-Nielsen 87]. Thread starvation means that some thread "is being indefinitely delayed" [Ben-Ari 82].

In a distributed system, one cannot expect all nodes to be available all the time. A robust algorithm is able to continue even if some nodes become unavailable for some time. We define an algorithm implemented in Emerald to be robust if it is able to recover from Emerald kernel crashes (whether they are caused by errors in Emerald, node crashes, or other things).

Load factors are defined precisely in chapter 4. For now, we simply note that the Unix processor load indicates the number of Unix processes in the processor's ready-queue. Emerald load is the number of active threads; its value is kept in the variable cEmRunnable. In the rest of the thesis, we use this variable name when referring to Emerald load. Processor load is the sum of the Unix load and cEmRunnable. Further, we say that a processor is overloaded if its processor load exceeds some threshold. Likewise, system load indicates the load of all processors in the system. It cannot be calculated in practise because of network latency, but a possible theoretical definition could be: the average of all processor's individual loads at an instant.

2.6 Assumptions

We assume that the number of nodes is sufficiently limited that broadcasts can be used without seriously affecting performance.

Also, we assume that a remote invocation on average is slower than the equivalent local invocation.

We assume that the Emerald applications do not contain statements which move objects.

2.7 Summary

We have described the Emerald kernel's relation to the Unix operating system, the four object implementations, and how Emerald processes are implemented though threads. Further, we have presented definitions and assumptions about the Emerald system.

3 Overall Design If nobody uses it, there's a reason (fortune cookie)

In this chapter we discuss the overall design of the system. First, we determine which criteria to use when making design decisions. Second, we refine aims of the system. Third, we discuss at what level to implement load distribution; fourth, we discuss previous work; thereafter, we begin the design process. The first design issue is how object thrashing caused by conflicts between the two algorithms can be avoided. We present the problem, after that we discuss some solutions, and select the best.

3.1 Design Criteria

The design criteria are of great importance to the design. They will be used through out this and the following chapters when making design decisions.

Our objective with the algorithms are to make them useable for the experiment. They need not be optimal, but they should be able to increase performance. This is reflected in the criteria. For instance, if we have to select between a complex solution that we believe can improve performance significantly, or a simple solution with smaller performance possibilities we will sometimes choose the simple solution, because optimality is not vital. Obviously, the precise choice depends on the specific ratio between complexity and performance gains. Therefore, we cannot prioritize the criteria.

A problem is that we cannot implement and test all possibilities so that exact measures of their advantages and disadvantages can be estimated. Therefore, the selections are based on our beliefs. Obviously, we may be wrong, and this may affect the result of the experiment. We return to this in section 7.1, where we present the contents of the experiment.

The following criteria will be used:

(1) Performance. The main performance criterion is the wall-clock time of executing Emerald applications: application performance. Secondary performance criteria are: minimizing processor usage, network usage, and memory requirements. Further, the algorithm should impose a small overhead when it is not used.

(2) Complexity. This is closely connected to performance, because a complex algorithm often has larger overhead. We consider the required work to implement the algorithm as the complexity criterion.

We cannot always use the main criterion, because it is too difficult to estimate. In these cases, we use the secondary criteria.

3.2 Aims

When an Emerald application is loaded from disk into memory all objects are located on a single node, therefore there is no remote invocation. But if more than one kernel is booted, it may be advantageous to move some threads to these other nodes, thereby utilizing the parallelism of multiple processors. By doing this we introduce remote invocations. The main aims therefore contains a trade-off:

(1) We want to use the available processors to reduce the time used on computations; but this should

(2) not create so many remote invocations that the total run-time of the application is increased.

By total run-time, we mean the elapsed wall-clock time from when the application is started until it is finished. As a consequence of the trade-off, we can identify two separate reasons for moving an object from one node, Nx, to another, Ny:

(1) The load distribution reason: Ny is not as loaded as Nx.

(2) The reduction of remote invocations reason: the object has more interactions with Ny than with Nx.

Reason (1) only deals with active objects. It is worth noting that these two reasons are conflicting. (1) will distribute the objects among the nodes, while (2) will gather all related objects on a single node. A simple example can illustrate this problem.

Example

On node N1 we have two objects, each of which contains one thread: T1 and T2; node N2 is idle. The load distribution reason therefore suggests moving the object O1 containing the T1 thread to node N2. However, T2 often invokes an operation in O1. Therefore, the reduction of remote invocations reason depicts moving O1 back to N1. Object thrashing occurs. n

Note, that object thrashing is inherent in the problem. We must design a strategy in which the consequences are minimized.

In the following, we abbreviate Load Distribution Algorithm as LDA and Remote Invocations Reduction Algorithm as RIRA.

3.3 Load Distribution in Operating or Programming Systems

Previous research on load distribution has been done in operating systems running upon distributed systems. At least two reasons make it preferable to distribute load in the operating system instead of doing it in the programming system:

o The programming system has only partial control of the system, e.g., it cannot move all processes.

o The overhead may become large if several different programming systems all perform load distribution.

However, it is difficult to do this in our environment. First, it would be necessary to supply new information to the Unix kernel about objects and their threads. Second, to implement this would require changing the Unix kernel and we would therefore need exclusive control of some nodes. Third, it would require more time than we have available, because the Unix kernel is a more complex program than the Emerald kernel.

In other environments, e.g., environments with distributed operating systems, we believe that implementing load distribution at the operating system level is preferable and less cumbersome than in Unix.

3.4 Previous Work

[Chu 80] investigates the problem of utilizing parallelism under the constraint that the total running time should not be increased due to increased inter-process communication. The approach is to make an allocation of program modules to processors based on estimates of their communication with other modules. A method to perform an inter-module communications analysis of the program at compile-time is suggested.

An attempt is made to solve the allocation problem theoretically by graph-theory and by linear programming. [Chu 80] found linear programming to be the best approach. No implementation was attempted.

However, because the allocation of processors are done before the processing starts, data dependent communication cannot be considered. The algorithm does not react to the current system state, i.e., it is static. Also, the algorithm must have previous knowledge of all modules to be executed, and when that happens. None of this information is available in Emerald.

One could attempt to use either graph theory or linear programming when designing dynamic load distribution, but we believe that the processing time of such algorithms are a significant obstacle. Thus, the techniques described in [Chu 80] are not used in this thesis.

3.5 A Strategy without Object Thrashing

As previously described a major problem of the design is to avoid object thrashing caused by the conflict between the LDA and the RIRA. If object thrashing occurs, the system will waste time moving objects among nodes without experiencing the gains of either the LDA or the RIRA.

In this section, we suggest three ways to avoid object thrashing. We select the one we find best for the design.

3.5.1 Strategy 1: Active and Passive Objects

The rationale behind this strategy is that the LDA need only move active objects and the RIRA need only move passive objects. By dividing all objects into these two categories conflicts are presumably limited.

There is a significant problem in this strategy: as described in section 2.4, objects may change state from active to passive and from passive to active for several reasons, and these transitions may occur frequently. Therefore, although the LDA and the RIRA do not move the same objects at the same time, they may choose to move the same object to one node at one moment and to another node a moment later.

3.5.2 Strategy 2: Compute-intensive and Communication-intensive Periods

The rationale behind this strategy is that the need for reduction of remote invocations is low when the need for processing power is high. Therefore, the LDA and the RIRA need not be active at the same time, i.e., the LDA is activated and the RIRA deactivated when the system load exceeds some threshold and vice versa.

The problem with this strategy is that the LDA alone may distribute objects in a way that introduces many remote invocations. However, this should not be a significant problem, because the system load is high. Therefore, the processor utilization is more important than the communication overhead.

3.5.3 Strategy 3: Thorough Analysis

This strategy is based on the idea that a thorough run-time analysis of communication and processor usage can be used to give objects an optimal (or almost optimal) placement.

Two problems in this strategy are: first, it is necessary to compare:

o The increase in run-time due to making local invocations remote; to:

o The increase in run-time due to having threads share one processor.

In general, this comparison is extremely complicated to perform, because it depends on the processor usage of the threads, the number of invocations they perform, the network bandwidth, etc.

Second, the analysis will be time-consuming because of the complexity of determining an optimal placement. The complexity is mostly due to the fact that some or all nodes must have a total view of the system. This will require much information exchange inducing problems with network latency, network overhead, processor overhead, kernel crashes, etc.

3.5.4 Choosing a Strategy

We believe that strategy 1 has significant problems, because objects cannot properly be divided into active and passive for long periods. Thus, the strategy may result in object thrashing, thereby reducing processor and network performance. Because of this, we reject strategy 1 although its complexity is low.

Strategy 2 should be able to increase application performance for compute-intensive applications with only a small number of invocations. However, for compute-intensive applications with many invocations the strategy might reduce application performance, because the RIRA cannot gather objects having many interactions with each other. It is not possible to estimate how much performance is lost due to this because this is highly application dependent. The consequence of this may be that only some applications will experience performance gains. Strategy 2's complexity is low, which is an advantage.

Strategy 3 has the best capability for increasing performance, if two conditions are fulfilled:

(1) The required comparison can be made without inducing a large overhead.

(2) The overhead of determining the optimal placement can be kept low.

We believe that it is possible to fulfil condition (2), but we do not believe that it is possible to keep the overhead and complexity of the comparison, (1), sufficiently low. Thus, strategy 3 impose problems for both our selection criteria.

Having rejected strategy 1, we will determine if strategy 2 or 3 can be used. We do believe that strategy 2 may have problems, possibly increasing the run-time of some applications, but strategy 3 may have problems with all applications. Because of this we choose strategy 2.

Strategy 2 implies that the LDA and the RIRA can be designed as two separate algorithms.

3.6 Summary

We have determined which design criteria to use, namely application performance, processor overhead, network overhead, memory requirements, and complexity. Also, we have identified the trade-off between utilizing multiple processors and keeping the number of remote invocations low. This trade-off implies that there are two reasons for moving objects between nodes: load and communication structure. Conflict may occur because utilizing parallelism involves distributing objects among processors, while keeping the number of remote invocations low involves gathering objects. We discussed three strategies to minimize this conflict. The selected strategy distinguishes between periods with high system load, in which only the load reason is used, and periods with low system load, in which only communication is considered.

4 Designing the Load Distribution Algorithm (LDA)

In this chapter we design the Load Distribution Algorithm (LDA). We start with a short discussion of how to design the algorithm, after that we discuss the design. In section 4.7 we present the algorithm in pseudo-code; and in section 4.8 we discuss properties of the algorithm.

4.1 Design Method and Overview

There are at least two methods we can use to design the algorithm:

(1) We could discuss a number of existing algorithms and choose the one that suited our needs best.

(2) We could identify the issues that must be considered when designing an algorithm and develop an algorithm to fulfil our needs with respect to these issues.

We will choose the method which we believe provides the best performing algorithm with the least effort.

We believe that this is method (2), because the Emerald environment is characterized by several items which imply that existing algorithms may perform poorly, e.g. dynamic object and thread creation, small movable entities, and remote invocations. If we selected method (1) we would need to adapt the chosen algorithm to Emerald. We believe that less work is required and a better algorithm provided if it is designed by method (2) instead.

In [Shivaratri 92] the following issues are discussed:

o Static and dynamic algorithms.

o Load.

o Migration/nonpreemptive process movement.

o Components of a load distribution algorithm.

o Stability.

Each of these issues is discussed in a separate section apart from stability, which we discuss among other properties of the algorithm in section 4.8. Because selection of constants and values of the algorithm turns out to be a complex problem, we discuss this separately in section 4.6. Therefore, we devote the following five sections, 4.2 through 4.6, to these issues. We will not only obtain the issues from the mentioned article, but also the terminology, adapted to Emerald. In general the terminology in this field is rather confusing and contradictive. In the analysis we use results from other articles when the discussion makes this necessary.

4.2 Static and Dynamic Algorithms

All load distribution algorithms can be characterized as either static or dynamic.

Static algorithms do not use any statistics about the running system. All they use are some information about "the average behaviour of the system" [Eager 86]. Dynamic algorithms base their decisions upon statistics about the system.

A simple static algorithm for the Emerald system could be as follows: every time an application is started, a node is chosen at random, and the entire application is executed at that node. There is no statistic gathering in this scheme apart from knowledge of available nodes. Obviously, this algorithm can perform poorly, because the selected node sometimes will be the most heavily-loaded.

In general, the main advantage of static algorithms is that there is no risk of the statistic gathering overhead being greater than the benefit of the load distribution. However, the performance improvements experienced by static algorithms tend to be low: "they do not result in good performance in distributed computer systems" [Goscinski 91, p.485]; [Eager 88] reports the same. We have no reason to believe that static algorithms would perform better in Emerald. Therefore, we will not consider static algorithms.

Some dynamic algorithms are termed adaptive, meaning that they dynamically change parameters or policies as a consequence of the gathered statistics. Some concerns suggest that an algorithm should be adaptive. For instance, it is difficult for a strategy to work well both on a lightly-loaded and a heavily-loaded system. We return to this discussion in subsection 4.6.1.

4.3 Load Factor

In this section, we discuss how to identify a load factor. The load factor is used to distinguish lightly-loaded nodes from heavily-loaded.

4.3.1 Demands on the Load Factor

We have three demands on the load factor: first, the load factor must correlate to the elapsed wall-clock time of executing a thread. Second, for efficiency, the load factor should not be expensive to calculate. Third, the load factor must consider the processor load, not just the Emerald load.

4.3.2 Choosing an Appropriate Load Factor

Kunz has tested the impact of different load factors on performance of load distribution [Kunz 91]. Kunz's test conditions can briefly be described as: the only information transmitted over the network is the command name, because all workstations in the distributed system are homogeneous and diskless; only newly created commands are moved; the receiving workstation loads the program corresponding to the command from a shared file-server; the testing is done by generating an artificial workload; and all workstations use Unix. The objective of the load distribution is to minimize the mean task processing time. We believe that these test conditions makes Kunz's results useable to us.

In [Kunz 91] the following load factors, and some combinations thereof, are tested:

(1) Length of run queue.

(2) Size of free memory.

(3) Rate of context switches.

(4) Rate of system calls.

(5) Average run queue length over one minute.

(6) Amount of free processor time.

By length of run queue Kunz means the number of Unix processes waiting to be loaded into the processor and executed, plus one for the currently executing process (if any).

The tests show that all load factors lower the mean response time of the tasks, i.e., load distribution does improve performance whatever load factor is chosen.

Another result is that the load factor used is of great importance to the performance obtained. (1) provides the largest performance improvement, (5) the smallest. The combinations of load factors do not yield better results.

It might seem strange that (5) is much worse that (1), but Kunz suggests that (5) is bad because a period of one minute is too long when the average run time of the Unix processes in the artificial load is 20 seconds. He has not tested other time periods, because Unix does not directly provide these. We do not know the average processing time of threads in Emerald and therefore cannot use that in deciding what period should be used.

We are restrained by what Unix directly can provide, because we believe that the work required to change this exceeds the importance of getting the right load factor.

The load factor must be obtained from a protected part of the Unix kernel. In our Unix, the only option is to use a privileged program that provides the average run-queue length of the processor over the last minute. This program can be called from the Emerald kernel.

Because obtaining the average run-queue length though this program induces an overhead, the length is not recalculated on every call. A constant, C, can be set to the number of seconds in which the same value is returned. If C=0, the value is recalculated on every call. If C=60, it is only recalculated when a minute has passed since the calculation. We choose C=60 as a relatively arbitrary choice between limiting overhead and obtaining a recent value.

Because the Emerald kernel implements its own time-slicing of threads, these will only contribute one to the Unix load, although there may be several threads. Therefore, we need to add cEmRunnable to obtain the processor load.

Note that we add a one-minute average to a specific value obtained at a certain time. This is the best we can do, but it would have been better if both values were of the same type.

We believe that the obtained value will provide reasonable good performance. It is close to the current run queue length, because cEmRunnable reflects the immediate number of active threads.

4.4 Migration/Nonpreemptive Movement

In this section, we discuss if the algorithm should be migratory or not. We use the term thread generally even though the referenced articles use the term process.

4.4.1 The Benefits of Migration

Though migration is possible, it may not be worth using it. The advantage of migration is the possibility of moving threads at any time to off-load overloaded processors. The importance of this is great, when the system has long-running threads. On the other hand, the collection of state information of an executing thread imposes a non-trivial overhead. In [Eager 88] this problem is addressed. The most interesting conclusions, based on both theoretic studies and simulation, are:

(1) "There are likely no conditions [sic] under which migration could yield major performance improvements beyond those offered by non-migratory load sharing..." [Eager 88, p.71].

(2) "The benefits of migration are not limited by its cost..." [Eager 88, p.71].

Eager finds that migration is worthwhile, especially when there is a "high variability in both job service demands and the workload generation process", but even under these conditions the improvements due to using migration are modest. However, the benefit of load distribution in general is significant. Also [Leland 88] conclude that migration improves the run-time of long-running jobs, but does not improve the average run-time of all jobs. Conclusion (2) above is due to the fact that migration does improve performance when moving long-running threads. In these cases the cost of collecting thread state-information is not significant in comparison to the transfer cost.

Because threads in Emerald may be long-running, we believe that the load distribution algorithm should be migratory.

4.5 Components of a Load-Distribution Algorithm

The load-distribution algorithm needs four components:

o Transfer policy.

o Location policy.

o Selection policy.

o Information policy.

In the following, we discuss the first three of these in separate subsections. Information policy, i.e., selection of information to distribute between nodes, is not discussed in a separate subsection, but discussed with the other policies. First, in subsection 4.5.1, we decide if the migration should be initiated from the sender or the receiver.

4.5.1 Initiating Node

The question is whether migration should be initiated by an overloaded sender, sender-initiated, or by an underloaded receiver, receiver-initiated. An overloaded node will attempt to migrate some of its threads to other nodes, while an underloaded node will attempt to get threads from other nodes.

The main problem of receiver-initiated algorithms is that they waste system resources when all nodes are lightly-loaded. However, when the system load is high, sender-initiated algorithms put the algorithm administration on the highly-loaded sender, and worse: when all nodes are heavily-loaded, they waste system resources trying to find lightly-loaded nodes that are not there.

The problem, that receiver-initiated algorithms waste resources when all nodes are lightly-loaded, is important, because this will reduce performance for applications which do not experience any performance gains from the LDA. Methods exist to limit the disadvantage of senders searching for non-existing lightly-loaded nodes. We address this in subsection 4.6.1 and therefore make the algorithm sender-initiated.

4.5.2 Transfer Policy - Sending Node Selection

The transfer policy determines if a node is suitable to deliver a thread.

The essential issue when selecting such a node is that we want to off-load the node. We can determine if a node need to be off-loaded through a threshold policy: if a node has a load factor of Load, it needs to be off-loaded when:

where LoadThreshold is a threshold value. It is necessary to test that cEmRunnable is greater than zero, because the overload may be caused by other Unix processes; the LDA cannot do anything about this. Testing for cEmRunnable being greater than zero, instead of one, implies that it is possible to off-load a processor entirely of threads when it is overloaded by Unix processes.

This policy contains an obvious problem: if all nodes have exceeded their thresholds, there is no place to send threads. A solution is to extend the algorithm to include the load of other nodes in its decisions. The other nodes are potential receivers. In the next subsection these are considered.

However, first we consider when to perform the threshold test. Several strategies exist, including:

(1) Periodically.

(2) During remote invocations.

We consider it too difficult the estimate the impact of these strategies on application performance, therefore we use the following selection criterion:

o Minimization of processor overhead.

We believe that the memory requirements and complexity of all strategies to be approximately the same, and there is no network usage, so we will not consider these.

Strategy (1) does impose a constant (in time) processor overhead. This is an advantage; the disadvantage is that there is an overhead even if all kernels are idle. Strategy (2) induces an overhead during every remote invocation, which is a disadvantage, because we cannot put an upper limit to it. However, because the cost of remote invocations are large, this extra overhead is almost certainty insignificant; it only involves comparing two numbers.

The conclusion of this is that there is no important reason for choosing one strategy over the other. We have chosen strategy (1).

We need to determine a proper interval between tests. We return to that in subsection 4.6.3. For the rest of the thesis we refer to the interval as dt.

4.5.3 Location Policy - Receiving Node Selection

The location policy is used to find the node to receive a thread.

Let us imagine that the transfer policy described in subsection 4.5.2 has found that node Nx is overloaded. Now the location policy must select a node to off-load node Nx. The simplest is just to select a node, Ny, at random, and send a thread to that node. Surprisingly, this method can provide reasonably good performance [Eager 86], because the overhead is extremely small. However, it does present some problems: if Ny is overloaded, should it pass the thread on? What happens if all nodes are overloaded? In [Eager 86] these problems are overcome by setting a transfer limit. If a thread is passed on transfer limit times, the receiving node must process it.

However, [Eager 86] shows that performance can be improved by not sending the thread to an arbitrary node, but instead by probing randomly selected nodes. If a probed node would pass the thread on, another node will be probed, otherwise a thread is send to the first node. A probe limit prevents the algorithm from probing endlessly on an overloaded system. If no node is found, when probe limit nodes have been probed, the original node must itself continue executing the thread.

Finally [Eager 86] presents an algorithm that tries to select the "best" node. The algorithm probes a number of nodes and sends the thread to the node with the lowest load factor. This method does not present significant performance gains compared to the second algorithm due to higher overhead.

When selecting an algorithm we have to consider one important point: the results of [Eager 86] are about task migration, i.e., migrating self-contained threads that do not communicate with other threads. Therefore, the consequences of making inexpedient migrations are small. In Emerald, this is not the case. Undesirable migrations can reduce performance, even though the RIRA may help. Thus, we expect to use more information than was found to be necessary in [Eager 86]. Method three is therefore the best for our needs. However, because the number of nodes running Emerald is small (assumption of section 2.6), the probing can be done in parallel by broadcasts. This reduces the time used during probing significantly.

As a result of the broadcast the other nodes on the system must return their loads to the broadcasting node.

When the broadcasting node has received these replies, it can select the node with the lowest load and send threads to this node.

Because migration has a cost it will not pay to migrate a thread if the lowest loaded node's load are the same as the broadcasting node's load. Therefore, we choose a limit, LoadDiff: only if the replying node's load is LoadDiff below the broadcasting node's load should a thread be migrated.

A simple optimization can be made to this scheme: if a node can be certain that it cannot be selected as receiver it need not reply. If the node's load is OwnLoad and the broadcasting node's load is ReceivedLoad, this is the case, when:

 

When this situation occurs the node will not reply. But how does the broadcasting node then know when all replies have been received? The answer is that it cannot know that anyway, because:

o The number of kernels in the system may vary dynamically.

o Broadcasts are unreliable.

However, a limit must be put on how long the broadcasting node waits for replies. It is not possible to decide to wait until a certain number of replies have been received, because possibly there may be no replies at all. Instead we define a time limit. To simplify matters, we choose to use the same dt period as earlier.

Also in another case a node receiving a broadcast need not reply. This is the case if the node itself is overloaded, i.e., when:

If one node receives more than one reply with the same load factor, it selects the node from which the answer arrived latest. The reason is that, on average, this reply's load is closer to the current load, because it is likely to be more recent.

Also, out-dated information must be considered. When the broadcasting node receives the replies its own load could have dropped, so that

where CurrentLoad is the broadcasting node's load, when it inspects the replies and LowestLoad is the load of the lowest loaded node. In this case, migration is aborted.

When implementing the LDA one could avoid broadcasts when only a single kernel is active, because all kernels usually have knowledge of all other running kernels. Therefore, if a kernel detected that it was the only booted kernel, it could suspend the LDA until another kernel was booted. We have chosen not to implement this, but it is suggested as a possible (and simple) extension.

Another issue must be considered. If a node receives a great number of broadcasts, should it reply to them all? The risk is that it is selected as receiver for migration by a great number of broadcasting nodes and therefore becomes overloaded.

In the current scheme, the node cannot tell how many broadcasts it has replied to that are still being considered by the broadcasting nodes, because the broadcasting node does not inform the replying nodes that it has made a decision. One could provide such information but it would be costly, because more messages would be send. Further, if this information was lost, the replying node would be in an incorrect state. Another solution would be to set a time limit between the replies a node could send: if it is less than t seconds since a node last replied to a broadcast, it should not reply to any other broadcasts. The main disadvantage of this is that if the first broadcasting node does not migrate a thread anyway, the underloaded replying node's processing power is wasted.

It is difficult to compare the advantages and the disadvantages. We choose not to limit the number of replies a kernel may send.

Before proceeding, we present an example of the algorithm so far.

Example

In figure 3 an example of the message traffic in an system of four nodes is shown. We assume LoadDiff to be 1, cEmRunnable to be greater than 0 and node X's LoadThreshold to be 2. Because Node X's Load is larger than 1, it broadcasts its load.

Nodes 1 and 3, which have the same load, receive the broadcast and make the following test:

This turns out to be true, because . Thus, they both reply to node X that they are possible candidates for migration. Node 2 performs the same test, but the test fails and node 2 does not reply.

Assuming that node X's load is the same when a dt period has elapsed, the following situations are possible:

(1) No reply has been received.

(2) One reply has been received.

(3) Two replies have been received.

The replies can be lost due to network failures; they can be on their way; or the broadcast might not even have been received by nodes 1, 2 or 3 yet; or it may be lost. Node X cannot tell the difference, and just proceeds using the received replies (if any).

If no replies have been received, it does nothing. If one reply has been received, it selects an object and sends that to the replying node. If both replies have been received, it sends the object to the node from which the latest reply was received. n

4.5.4 Selection Policy - Thread Selection

Presented with an sending and a receiving node, the selection policy chooses which thread(s) to migrate.

Because Emerald cannot migrate threads as such, the algorithm must select an object with the thread it wants to migrate. When one object contains many threads, all the threads are moved. We will only move objects with a single active thread for two reasons. First, moving many threads may cause problems because the receiving node becomes overloaded. Second, in the current version of Emerald moving objects with several threads sometimes cause Emerald kernels to end abnormally.

Various factors can be considered when selecting a thread for migration. [Shivaratri 92] mentions:

(1) The transfer overhead, i.e., the cost of migration.

(2) The remaining run-time of the thread.

(3) The number of location-dependent system calls of the thread, i.e., resources on the original node on which the thread depends.

We will consider each of these in relation to threads in Emerald.

The cost of moving Emerald objects depend on:

(1) If the code is present on the receiving node.

(2) The size of the code (if it is not present).

(3) The size of the data area.

Clearly, if the code is resident on the receiving node, the data area is the most significant factor. In general, we do not know, if the code is resident or not, so we cannot use this in the decision. We could calculate a weighted average, such as:

where CostAverage is the average cost of migration, SizeData and SizeCode are the sizes (in bytes) of data area and code respectively. PNotPresent is the probability that the code is not present on the target node. C is some constant. PNotPresent can be based on whether, we have migrated such an object to the node earlier, or on the time since the Emerald kernel was booted.

The second issue is the remaining run-time. This means both the run-time of the thread executing in the moved object and also all threads which will later execute in the object (until it is again moved). We believe that this is impossible to estimate in general.

The third issue is the number of location-dependent system calls made by threads in the object. In Emerald, this is the number of interactions with the original node. This is a concern of the RIRA and could be a way to integrate the LDA and the RIRA. It is only a part of the RIRA, because it only deals with distribution of objects, not gathering.

A fourth approach is to select an object arbitrarily. Its advantage is that it has no overhead, but it may select the object with the largest transfer overhead, the shortest remaining run-time, or most interactions with the original node.

When selecting among these four alternatives we consider:

o Application performance.

o Complexity.

We believe that typical objects are small enough to fit in a single network packet. This is an assumption which may be incorrect. We base it on the fact that code is not transmitted. Given an effective packet size of about 1,800 bytes there is plenty of room for a relatively large data area. If this is the case, the transfer cost of moving an object is only slightly dependent on it size; and in the long run code will be available at any node. Based on this we do not believe that (1) can improve application performance.

Issue (2) cannot be used, because we believe it is impossible to estimate; and even if it could be done it would require excessive resources.

Issue (3) may be able to improve application performance. An example is shown in figure 4: we have two objects, O1 and O2. O1 has 5 interactions with node N2; O2 has 50. None of the objects have local interactions. When moving an object from N1 to N2, because node N1 is overloaded, the obvious selection is O2 (assuming that both objects have been created at the same time.)

The overhead and complexity of performing the selection based on RIRA statistics are small if the RIRA can provide the statistics necessary to perform the decision.

Now the decision is whether to select (3) or make an arbitrary thread selection. We base our decision on the fact that practice has shown that only a few threads are active in moveable objects, therefore the possibility of selecting the right object at random is large (when only a single object can be moved, it is 100%). Therefore we think, that the overhead imposed when selecting the right thread by (3) is greater than the advantage.

Thus, we will not use any criteria to select a certain object, but just select one arbitrarily.

4.6 Selecting constants and Adapting Values

The constants/values we need to determine are LoadThreshold, LoadDiff, and dt. For each we need to determine whether it should be a constant or adapt to system changes.

4.6.1 Selecting the Value of LoadThreshold

The simplest solution is to let LoadThreshold be a constant. However, this induces the following two problems:

(1) How to determine the value of the constant.

(2) If all nodes are overloaded, they will all exceed their thresholds, all broadcast their loads, but never receive any replies.

The only solution to problem (1) is to find the value by testing. However, this might not work well when system parameters are changed, e.g., other applications are run, more kernels are booted, a faster network is used, etc. Therefore, it turns out to be a serious problem.

The consequences of problem (2) will not be serious, because no object is migrated, but the continuous broadcasting does provide a non-negligible overhead; especially on the network. Because the system is overloaded this is particularly undesirable.

This suggests that LoadThreshold should not be a constant, but should depend on the system load, i.e., the LDA should be adaptive. In this way broadcasts can be avoided by increasing LoadThreshold. The LDA is turned off at high system load.

The problem remains of obtaining and distributing an approximation of the system load. One possibility is to let all nodes communicate more. However, this also induces a load on the network and this might be higher than the load imposed by letting LoadThreshold be constant. Further, it induces an overhead even if the system is not overloaded.

We therefore develop a method by which to obtain and distribute a threshold without using any more communication and only a slight processing overhead. The method has to make use of the messages that are already distributed in the system, however, we can add fields to the messages because that only induces a very small overhead as long as all data can be send in a single packet.

Also, the method must not depend on data arrival. It should work even if broadcasts are lost. We now develop an algorithm that fulfils these demands.

A broadcasting node has the best opportunities for obtaining the system load because it is the target of many network packets. If it receives no reply to a broadcast, it is an indication that the system is overloaded. Thus, it should increase its threshold.

We also need a way to decrease thresholds, so that they adapt to a falling system load. We cannot make the decrement depend on message arrivals because no LDA-messages are transmitted when the system is not overloaded. Decreasing the threshold can instead be done at every elapsed dt-period.

To distribute thresholds all broadcasts include them. Any node receiving a broadcast therefore updates its current threshold.

Now, the following problems must be solved:

o How much and when to increase the threshold.

o How much and when to decrease the threshold.

o If there should be lower and upper limits to the variation of LoadThreshold.

o How a node receiving a broadcast should update its current threshold.

The determination of the increment and decrement present the same problems as determining the value of a constant threshold. Further, creation of many threads within a short time is not uncommon in Emerald, especially when applications are started. Because of this, a node's load may rise rapidly, e.g. from 1 to 10 within only one millisecond. If we choose to increment the threshold by one, we need 10 iterations before the threshold is adjusted on an overloaded system (there is no problem if the system is not overloaded). Instead, we could choose to calculate the new threshold, NewLoadThreshold by the following formula:

where Increment is a constant.

In this way, the adjustment is much faster, because Load influences the new value.

Another possibility is to let the threshold increase in a exponential manner, e.g., by multiplying the last increment factor by 2, and adding this value to the threshold. When decreasing the threshold, the increment factor should be reset (e.g., to 1). We arbitrarily choose the last method, because we do not believe that there are significant reasons for choosing one over the other.

There should be an lower limit to the threshold, because there is obviously no reason to off-load a processor, which is not loaded at all. The lower limit is LoadDiff because if some processor is completely idle, i.e., has a load of zero, another processor will migrate to this processor, only if its load is at least LoadDiff + 0.

When a node receives a broadcast, it updates its threshold, simply by letting the new threshold be the simple average of the previous threshold and the one received in the broadcast. Many other updating schemes are possible. The rationale behind the chosen scheme is that both nodes have equal knowledge about the system LoadThreshold.

4.6.2 Selecting the Value of LoadDiff

The cost of migrating a given thread is not constant. It depends on the network load, the processor load, the thread's dependencies of the node it leaves, etc. Because LoadDiff should reflect the cost of migration, it should not be a constant either. However, it is almost impossible to let LoadDiff reflect all system parameters, and even if it can be done, it might not be worth the effort.

Therefore, we let LoadDiff be constant; we arbitrarily select the constant 1.

4.6.3 Selecting the Value of dt

dt can be a constant without problems. However, it might be possible to improve performance by varying dt according to the system activity.

We could choose to let the value depend on the current load: increased load, decreased dt. In this way more migrations could be possible. However, the current load alone is a bad measure. If, e.g., the system load is high, no migration is possible and decreasing dt only induces an extra overhead without improving performance.

Instead, we can let dt depend on the previous success of the LDA, i.e., follow the principle:

The more the LDA can migrate, the more it should offer to migrate.

This can improve performance, especially when many threads are created within a short time. However, we will make dt a constant, because of the problems of determining a proper way to adjust its value. We believe that the effort required to solve these problems are larger than the gains needed to make the LDA improve performance. However, we do believe that making dt vary according to the above principle will increase performance. Therefore, we suggest it as an extension.

The value of dt should be determined by the trade-off between:

o The overhead induced by each iteration.

o The time a node must wait until the LDA can migrate a threads.

Besides the overhead caused by each iteration, processor thrashing may also occur if dt is too small, because we cannot guarantee that the processor does any productive work between the iterations. We discuss this in the following section. The value of dt is arbitrarily set to one second.

4.7 The LDA

We now present the LDA; given are: a variable threshold, LoadThreshold, a constant load factor difference, LoadDiff, and a constant time interval, dt, as previously described.

The variable names used in the following pseudo-code should be self-explanatory. Variable names prefixed by Received are values obtained from network messages.

Every node running Emerald follows the algorithm outlined in the following figures. The algorithm is divided into three procedures:

Figure 5: Procedure LDATimeout called each time dt elapses.

Figure 6: Procedure LDAMessageHandler called each time a kernel receives a LDA network message.

Figure 7: Procedure LDAMigrateCheck to determine if migration should be done.

Besides these procedures, a number of less interesting, supporting procedures shown in figure 8 are used.

Notes to figure 5:

o A boolean variable WaitingForReplies, local to the LDA, indicates if the node has made a broadcast and is waiting for replies to this. Replies received after a decision (positive or negative) about migration has been made are rejected, line (5.12).

o When a dt-period elapses, a node can be in one of two states:

(1) A period ago, the node was overloaded. WaitingForReplies is true.

(2) A period ago, the node was not overloaded. WaitingForReplies is false.

If the current load does not exceed threshold, and there is no active thread, line (5.5), the node does not check if it is in state (1) or (2), it just clears WaitingForReplies and continues, line (5.12). If, on the other hand, the current load exceeds threshold and the node is in state (1) it considers migration, line (5.11). If in state (2) it makes a broadcast, line (5.18).

Notes to figure 6: when the node receives a broadcast, it updates its threshold and tests whether it should reply to the broadcasting node, line (6.7).

When it receives a reply, it checks, line (6.11), if it is expecting one. If not it simply rejects the reply, it is of no use. Otherwise it records the reply, i.e. the load and the node if it is the lowest load received yet.

Notes to figure 7: now, the node received at least one reply, it recalculates its own load, and tests if the lowest load is at least LoadDiff smaller than this new load. Therefore, the migration might fail if the load has fallen since the broadcast.

4.8 Properties of the LDA

In this section, we discuss properties of the algorithm. We start by investigating the LDA's behaviour in general. Then we discuss its behaviour when confronted with out-dated information. Thereafter, its robustness is discussed. In the final subsection, we briefly note some general properties.

4.8.1 Behaviour

In this subsection, we will investigate behaviour of the LDA under different load situations. Thereafter, we discuss its stability and possible thread starvation.

We assume that thresholds are distributed efficiently to all nodes, i.e., all nodes have the same threshold.

The LDA partitions any node into one of three categories: overloaded, underloaded, or neither of these, i.e., normally loaded. This is illustrated in figure 9. In total, eight situations are possible, because each category may be empty or non-empty. If we ignore normally loaded nodes, only four situations are possible.

In figure 10 we show these four possible situations and how they interact. We assume that no Unix processes or threads enter or leave the system.

In situation 0 the overloaded nodes off-load their threads to underloaded nodes. The system ends in situation 1, 2, or 3 dependent on the number of underloaded and overloaded nodes.

In situation 1, no underloaded node can receive threads from the overloaded nodes, therefore the threshold increases. The increasing of the threshold stops when no node is overloaded, situation 3, or some node becomes underloaded, situation 0, or if both these things occur simultaneously, situation 2.

In situation 2 the threshold decreases until some node becomes overloaded, situation 0 or 1, or until the minimal threshold is reached, situation 2.

In situation 3 the threshold decreases until some normally loaded node becomes overloaded and the system ends in situation 1, or until the minimal threshold is reached, situation 3.

Note, that there are cycles in this, e.g., situation 0 may become situation 1, that may become situation 0, and so on; this might indicate instability. In the following we discuss if this is the case. However, as long as the LDA can be used in the experiment, i.e., performance is not reduced significantly, we do not consider it vital that the LDA is indeed stable.

Various definitions of stability exist. They all originate from the fundamental definition: "a bounded input produces a bounded response" [Casavant 88].

[Goscinski 91, p.497] defines an algorithm to be stable, if:

(1) The response time to any 'reasonable' burst of thread arrivals, does not exceed some bound.

(2) An upper limit on the relative difference in load between two nodes can be determined.

(3) There is no processor thrashing.

In [Shivaratri 92], two perspectives on stability are mentioned: the queuing theoretic perspective and the algorithmic perspective. According to the first, an algorithm is instable if its overhead is higher than the processing of applications, because in that situation the processor queues will not be reduced. According to the algorithmic perspective, an algorithm is instable if it is not possible to guarantee that the algorithm does not make fruitless actions indefinitely.

Following [Goscinski 91] the LDA may not be stable because it may not meet criteria (1) and (3). As shown in figure 10, criterion (2) will, in the long run, be meet assuming that thresholds are efficiently distributed. According to the queuing theoretic perspective the LDA may be instable, if its overhead is larger than the processing of threads. We do not believe the overhead is that large, because the LDA has an iteration time of one second, and even though the overhead of collecting thread state information is large, this is only performed every other second. However, we have no evidence for this. We cannot determine if the LDA is stable according to the algorithmic perspective.

The conclusion is that the LDA may be instable, but several properties of the LDA suggest that this is not the case:

(1) It has a large iteration time.

(2) Only objects with a single active thread are moved every other second.

(3) Movements are aborted if the broadcasting node's load has fallen since the broadcast.

We cannot guarantee that thread starvation will not occur. Imagine a system of nodes where a migrated thread, T, always arrives at some node which is to select a thread for migration; assume that T is always selected. In this way T will never be processed. However, it is obvious that this is highly improbable and we consider it a purely theoretical problem.

4.8.2 Out-dated Information

The risk of adapting to out-dated statistics arise whenever these are passed over the network, i.e., loads and thresholds will not be current when they are received. The thresholds are non-critical, therefore we will not investigate the consequences of this further.

When a node broadcasts its load, this load might rise or fall until it receives the replies.

If the load rises it may not receive any replies, because the broadcasted load is not LoadDiff larger than any other node's load. Therefore, the required migration is delayed. However, this is not a serious consequence.

If the load falls it may receive more replies to its replies than it "should". However, because the broadcasting node checks its load again before migrating, this is not serious either.

A node replying to a broadcast can also experience a rise or fall in its load. Here the consequences are worse. If its load falls, it might not be chosen, even if it is now the lowest loaded node. However, if its load rises, it might receive threads, which it instantly will attempt to get rid of again. This is the worst case, because it can generate processor thrashing. In section 8.4, we discuss if this turned out to be a problem.

4.8.3 Robustness

If a kernel crashes all its resident objects are obviously unavailable until the kernel is rebooted; and even after a reboot they may be in an undefined state. However, this is not a property of the LDA, but of the Emerald system itself. Because the LDA does not depend on any global state information (not even knowledge of all nodes in the system) and because it is totally distributed, it is robust.

The LDA increases an application's vulnerability by distributing its objects among several nodes. A user may prefer robustness to efficiency. Therefore, if we were to implement a user-friendly LDA, an option to let a user turn off LDA should be provided. It is suggested as an extension.

A related problem is the shut down of the kernels. In Emerald, a user can at any time shut down a kernel, N. When the LDA is active, however, this shutting down might abort several applications, because some of their objects reside on N. In a user-friendly version of the LDA the shut down should ensure that all objects are moved, before the kernel is shut down. This is also suggested as an extension.

4.8.4 General Properties

The LDA is:

o Dynamic by reacting to the current processor load.

o Adaptive by dynamically varying thresholds and distributing these among nodes.

o Preemptive by migrating threads after they have begun execution.

o Sender-initiated because it is the sending node, which decides to move an object.

o Distributed because all processors execute the same LDA, i.e., there is no server, centralized database, etc.

o Instable? We cannot guarantee that the LDA is stable.

o Robust because it can continue after kernel crashes. However, the LDA increases the vulnerability of an application by distributing objects on several processors.

When a kernel is idle, the algorithm imposes only a slight overhead. The overhead is lines (5.3)-(5.5) and (5.19) of figure 5 at every dt-period. There is no overhead on the network when all kernels are idle.

On a system with only a single booted kernel the overhead is also small. The reason is that a heavily-loaded single kernel will never receive any replies to its broadcasts. Therefore, it will raise its threshold. Thus, there are long periods between broadcasts.

The LDA does not depend on any static information about the system, e.g., the number of nodes, threads and Emerald processes may vary dynamically.

4.9 Summary

We have designed a LDA that we find suitable for the Emerald system. The algorithm is dynamic, because static algorithms perform poorly in general; and adaptive to perform well under high system load. The LDA measures processor load by adding cEmRunnable to the one-minute Unix load average. It is migratory, because threads may be long-running. We made the LDA sender-initiated, because this does not impose an overhead when the LDA is not used. A node is overloaded when its load exceeds a threshold. The lowest loaded node is selected to receive threads. The moved objects are selected at random, because we do not believe that the overhead imposed by making a more complicated selection would pay-off. We made the LDA adaptive by letting it increase the period between broadcasts, when the system is overloaded. We found that the LDA is robust, but it may be instable.

5 Designing the

Remote Invocation Reduction Algorithm (RIRA)

In this chapter we design the reduction of remote invocations algorithm (RIRA). We do this by outlining three approaches that have been considered; discuss their advantages and disadvantages; and select one of them. The selected approach is then refined to become an algorithm. Finally, we discuss properties of the RIRA.

5.1 Three Approaches

By only evaluating three approaches we may overlook better approaches. We risk doing this for two reasons: the chosen approach need not be optimal, and we believe the three approaches are among the best possible (given the overall design of chapter 3).

The first approach is inspired by the Emerald language parameter passing construct: move. The second approach is based on statistic gathering. The third collects objects on a single node.

5.1.1 The Movement Approach

In this approach we almost avoid remote invocations by always moving the invoking object, S, on node NS, to the node, NT, of the invoked object, T (when they do not initially reside on the same node). When object S arrives it performs a local invocation of T.

We cannot avoid all remote invocations because T might move before S arrives at NT; either because the LDA moves T or because a thread in T invokes a remote object.

The rationale behind the approach is that when S invokes T, there is a high possibility that it will do so several times. If this is the case, co-locating the objects will reduce the number of remote invocations.

Implementing the movement approach is fairly simple, because all administration is done at node NS and no statistics are gathered. All that is required of NT is to accept S and perform the local invocation.

5.1.2 The Co-locating Approach

The previously described approach has no overall view of the system. It regards each remote invocation as a problem to be solved.

The co-locating approach records invocations and deals with these when an overall picture of the system can be made. Therefore, this approach is divided in two parts: the statistic gathering and the movement decision. The statistic gathering records invocations and stores these for later use. The movement part inspects the statistics and moves objects among nodes to reduce the number of remote invocations.

5.1.3 The Object Gathering Approach

The rationale behind this approach is that no statistic gathering is required, if all object can reside on a single node, N. In that case, we just move all objects to N at low system load. N could be the node with the most available memory.

The complexity of the approach lies in the selection of N, because distributed selections are cumbersome. The solution will require some coordination between the kernels and that can reduce performance.

5.2 Selecting the RIRA

We use the following criteria in the selection:

o Application performance.

o Complexity.

Whether the movement approach increases application performance is highly dependent on whether the assumption that objects invoke other objects several times in succession holds. If it does not, the movement approach will decrease application performance because of the many movements and because of instability. Instability can occur, e.g. if an object, T, continuously invokes two other objects on different nodes then T will move from node to node. Although its complexity is low, we do not believe that the movement approach in general can increase application performance.

Whether the co-locating approach increases application performance depends on two conditions:

(1) If statistics can predict future invocations.

(2) If statistic gathering overhead can be kept sufficiently low.

We do believe in (1), because we think that the invocation structure of most applications is rather simple and predictable.

The wall-clock time to perform a remote execution is about three orders of magnitude larger than the time to perform a local invocation [Juul 89]. Therefore, the statistic gathering overhead will be insignificant if only a few remote invocations can be made local. Thus, we believe that the co-locating approach may increase application performance.

The object gathering approach provides application performance gains when all object have been located on a single node, N. In that situation there is no remote invocation, and no statistic gathering overhead. This overhead cannot be avoided in the co-locating approach. However, in the periods around activation and deactivation the object gathering approach may reduce performance because:

o The mass-movement to/from N is costly.

o During the mass-movement to/from N the number of remote invocations may increase, because objects are momentarily separated.

o Overhead on processor and network to select node N.

Some other problems with this approach are:

o There may be no node with sufficient memory to keep all objects.

o The system is highly dependent on the availability of node N.

From this discussion, we believe that the co-locating approach provides the largest application performance gains and relatively modest complexity.

In the following section, we design the co-locating approach in detail.

5.3 Designing the RIRA

The following issues are discussed: what statistics to gather, when to inspect the statistics, how to use them, and when to initiate the RIRA.

5.3.1 What Statistics to Gather

We consider two types of statistics, that could be gathered:

(1) Object-object statistics: all objects record the number of interactions with all objects.

(2) Object-node statistics: all objects record the number of interactions with all nodes.

We chose one of these by considering:

o Application performance.

o Processor overhead.

o Memory requirements.

Object-object statistics offer the best possibilities for improving performance; they provide a way to identify groups of objects that always should be kept together because they have many interactions with each other. Object-node statistics do not directly provide this possibility, thereby they may separate members of such groups and induce a large number of remote invocations.

In figure 11 we shown a example with three objects, O1, O2, and O3, on two nodes. The lines between the objects indicates the number of interactions. Thus, with object-node statistics, objects O2 should be moved to node N2, or O3 should be moved to N1 to reduce the number of remote invocations. If it is decided to move O2 to N2, O1 and O2 are separated and must invoke each other remotely. However, the problem may be solved later, because O1 can be moved to N2.

With object-object statistics O1, O2, and O3 can be identified as a group, and be moved together immediately.

Another disadvantage with object-node statistics is that an object may never reach its final destination, because it has interactions with an object that moves from node to node. One can imagine a situation where an object hunts another object without ever catching up with it. This may result in object thrashing. Only two nodes are necessary for this to occur as shown in the following example.

Example

On node N1 is an object O1. Likewise, on node N2 is an object O2. O1 has most interactions with O2, O2 has most interactions O1, and both objects invokes one another. In this situation, the RIRA on node N1 will move O1 to node N2, and the RIRA on node N2 will move O2 to node N1. Furthermore, if the two kernel's RIRAs are synchronized, we cannot guarantee, that O1 and O2 will not continue to swap nodes forever. n

A necessary condition for this is that O1 and O2 continue to invoke each other at approximately the same instant and with the same rate. If, at an instant O1 and O2 reside on one node, the situation becomes stable.

They do not need to invoke each other at exactly the same instant. This is shown in figure 12. The figure shows that the statistic gathering for object O2 may be overlapped with object O1's movement. If the time required for movement is large compared to the time required for statistic gathering, the nodes need not be particularly synchronized.

Although we cannot prove that this would not be a problem to the RIRA, we consider it a theoretical problem, because we believe it to be highly improbable to occur often. Thus, we ignore it for practical reasons, not theoretical.

The main difference between the two types of statistics is, therefore, the time they require to co-locate objects. When the movements are complete, the results will be equivalent.

From this discussion, we do not believe, that the extra overhead on processor and memory to gather object-object statistics is repaid by the shorter latency until the final placement is obtained. We therefore chose to use object-node statistics.

Next, we must consider how to deal with "old" statistics. Three approaches can be used:

(1) Statistics are time-stamped.

(2) Statistics are cleared periodically.

(3) Statistics are kept.

By time-stamping statistics it would be possible to let new statistics have larger impact on movements than old statistics.

The periodic clearing prevents the RIRA from basing its decisions upon old statistics, however, both old and new statistics would be cleared (unless statistics were time-stamped in which case approach (2) would turn into approach (1)).

We do not believe that neither approach (1) or (2) would be able to improve application performance, because the consequences of using out-dated statistics is at most a larger latency until the correct placement is found, and because of the overhead involved in both approaches.

Thus, we keep statistics. However, we clear them when moving an object for two reasons, First, it is costly to move the statistics. Second, the statistics now have been used to make a decision. If we do not clear them, an object may be stuck on a node because a long time ago it had many interactions with that.

For each object, O, we therefore record the following statistics:

(1) The total number of interactions O has with local objects.

(2) The number of interactions O has with nodes N1, N2, ..., NN; one counter per node.

5.3.2 When and how to use the Statistics

We need to determine when to inspect the statistics and decide which objects to move and where to move these. The inspection can be done in several places, including:

(1) Periodically.

(2) During remote invocations.

(1) has the advantage that we can co-locate objects even before they have begun to invoke each other remotely. The disadvantage is that the periodically running through the statistics imposes an overhead, even if no remote invocation occurs. (2) has the main advantage that it only imposes an overhead, when the RIRA is required. The complexity of (1) is larger than that of (2), because there is no simple way in Emerald to traverse all objects on a node. We cannot estimate how much the two methods influence application performance.

We choose method (2) mainly because it is simpler to implement.

To make the statistics useful, we must assume that they are typical, i.e., the invocation structure does not suddenly change. On average this must be true. A problem is that all objects are not created at the same time, and that they move. With object-node statistics we cannot deal with the problem that an object has many recorded interaction with an object, O, on a node, from which O has recently moved. The implications of this depend on how many invocations are recorded between the movements; we briefly return to this below. Given that the invocation structure does not suddenly change, an object should be located on the node with which it has the most interactions. However, moving has a cost, therefore local interactions might be given priority over remote invocations so that an object is only moved if it has considerably more interactions with the remote node. It is difficult to determine if this strategy will improve or reduce application performance. It may improve performance by limiting movements, however, it may reduce performance by increasing the latency before objects are moved. Because of this, and the complexity of determining what ratio should exist between local and remote invocations, we chose not to give priority to local invocations.

A related problem is if a minimum number of interactions should be recorded before we consider movements. Again the advantage of selecting a minimum is that the number of movements can be reduced; the disadvantages are that it induces a latency, and that out-dated statistics may lead to an incorrect placement. However, in this case we consider it necessary to use a minimum value, because not doing this can mean that the algorithm behaves as described in subsection 5.1.1 on the movement approach, i.e., the invoking object always moves to the invoked object. We arbitrarily select a value of 8 interactions as minimum.

5.3.3 When to Initiate the RIRA

We want the RIRA to be active when the LDA is inactive. The LDA is inactive when no LDA message traffic is send, and all nodes LoadThreshold are minimal. The problem is that no node can know about all other node's LoadThreshold. The result is that some nodes will activate the RIRA, some will not. There are at least three solutions to this problem:

(1) Coordinate the RIRAs by messages.

(2) Only activate the RIRAs, when the LDA has been deactivated for some time.

(3) Ignore the problem.

We select one of this based on our believes about their impact on application performance, overhead on network, and complexity.

(1) has an overhead on the network, even before the RIRA has begun reducing remote invocations. Further, the coordination can be complex because the number of nodes may change. We believe that the main advantage of method (1) is that it delays the activation of the RIRA. But (2) does the same without imposing an overhead on the network. Both methods have the disadvantage that they impose a latency until the RIRA is activated. (3) has not got this problem, but does provide possibilities of object thrashing, because the LDA and the RIRA may be activated at approximately the same time.

Because we believe that object thrashing may influence application performance more than the RIRA latency we choose method (2).

The length of the time limit should be found as a trade-off between the cost of object thrashing and the cost of the latency until the RIRA is activated. We arbitrarily select the length to be 4 seconds.

In the following section, we describe the RIRA as it has been designed.

5.4 The RIRA

Most of the RIRA is dealing with statistics gathering and low level issues. We present these in chapter 6 on implementation. In this section we present, in pseudo-code, the part of the RIRA called during a remote invocation.

The procedure in figure 13 is called during a remote invocation of object Invokee by object Invoker on node N; the procedure is called on node N.

Procedure RIRACanMove, line (13.3), determines if the invoker can be moved. Procedure RIRATargetNode, line (13.5), inspects the statistics to find the node where Invoker should be placed. There are three possibilities:

(1) Insufficient statistics have been collected.

(2) The chosen node is node N.

(3) The chosen node is not node N.

Only in the third case is Invoker moved to TargetNode by a call to procedure Migrate.

5.5 Properties of the RIRA

In this section, we present properties of the algorithm.

The RIRA is:

o Dynamic by reacting to collected statistics about interactions objects have with nodes.

o Sender-initiated because it is the sending node which decides to move an object.

o Distributed because all processors execute the same RIRA, i.e., there is no server, centralized database, etc. All decisions are made by individual kernels only considering resident objects.

o Instable? There is a possibility that RIRAs on different kernels may induce object thrashing, however, in practise we believe this situation is improbable.

o Robust because it can continue after kernel crashes. Further, by gathering objects, the RIRA decreases an application's vulnerability to kernel crashes.

When a kernel is idle the algorithm imposes no overhead, because it is only activated during invocations.

When only a single kernel is booted, there is an overhead imposed by the statistic gathering during invocations.

The algorithm does not depend on any static information about the system, e.g., the number of nodes, threads and Emerald processes may vary dynamically.

Incorrect movements can be performed because an object's invocation structure has changed and old statistics therefore indicate an incorrect destination node.

5.6 Summary

We have designed the RIRA so that it migrates the invoking object during a remote invocation to the node with which it has the most interactions. The object is only moved if a certain number of interactions have been recorded. The RIRA is only active in periods when the LDA is inactive. The RIRA is dynamic, distributed and robust, but may induce object thrashing.

6 Implementation

In this chapter we present issues of the implementation of the LDA and the RIRA developed in previous chapters. The presentation is divided into five sections:

Section 6.1 Program modules.

Section 6.2 Load distribution.

Section 6.3 Statistic gathering.

Section 6.4 Reduction of remote invocations.

Section 6.5 Performance considerations.

6.1 Program Modules

In figure 14 four kernel modules are shown: the invocation protocol, the LDA, the RIRA and the statistic gathering. The box labelled various procedures represents various kernel procedures. For each module we show the procedures that are called from outside the module itself and calls to procedures in the LDA, RIRA and statistic gathering modules. An arrow from one module's procedure to another module's procedure indicates that the first calls the second, e.g., the figure shows that procedure TryInvoke of the invocation protocol calls both procedure RecordInInvocation and procedure RecordoutInvocation.

Figure 15 gives a short overview of the implemented changes to the Emerald system. Type specifies where the changes have been made. File indicates the name of the file in which the module is implemented. The column C/A, indicates whether the module is entirely implemented for this project, i.e., Added, or if we have modified an existing module, i.e., Changed. Several other small and uninteresting changes have been made to other kernel modules. Mainly these changes are initializations of data structures or calls of the LDA, RIRA or statistic gathering.

6.2 Load Distribution

The implementation follows the algorithms of section 4.7 closely. A single remark is given below.

To determine if an object can be moved requires a great number of tests. These are performed in procedure LDACanMove. Because this has presented large problems during the implementation, a large amount of test output is included in this procedure. To test, if exactly one thread is active in an object can be done in a single line:

but for the test output, we calculate the number of threads, even though this is more complicated and time consuming.

6.3 Statistic Gathering

In this section we describe how the data structure of global objects has been changed, we also describe changes to the Emerald kernel and compiler to update the data in the structure.

6.3.1 General Remark

The RIRA does not need to distinguish between the invocations an object, O, makes upon another object (outgoing invocations) and invocations of O (incoming invocations). However, we have implemented the statistic gathering so that these are recorded separately. The advantage is easier extensibility. The disadvantages are slightly more complex code and reduced performance.

6.3.2 Data Structure

We have extended the data structure of global objects, the GOD, to include statistics. The statistic record is shown in figure 16.

We record all remote invocations of a single object in a linked structure of these records. Local invocations, however, are handled specially for efficiency because local invocations are common and therefore even a small overhead can increase run-time significantly. Thus, special entries for local incoming and outgoing invocations are allocated directly in the GOD. These can be updated efficiently by compiled code.

Obviously, we do not need to record a node number, when the invocations are local. The fields added to the GOD are shown in figure 17. InInvocationsPtr is a pointer to a linked list of statistic records recording incoming remote invocations. OutInvocationsPtr is a pointer to a linked list of statistic records recording outgoing remote invocations.

In the following, we first describe how compiled code can update the fields during local invocations. Second, we describe how the kernel updates the structure during remote invocations.

6.3.3 Changes to the Compiler

In figure 18 we show some of the code generated by the Emerald compiler to perform an invocation of a global object.

The changes to the compiler involves increments of InInvocations and OutInvocations in line (18.5), where all local invocations are performed. We have made compiled code update the InInvocations of the invoked object and the OutInvocations of the invoking object.

In figure 19 we show the generated code in pseudo-code. b and g refer to processor registers, these are architecture independent names internal to the Emerald system.

Lines (19.2)-(19.5) update the incoming invocation, lines (19.6)-(19.9) the outgoing invocation. The address of the TargetGODP can be obtained at compile time without problems, line (19.2). Normally register b points to the current global object (the invoking object), but because we are in the middle of an invocation, register b has been pushed onto the stack and contains a scratch value. Therefore, in line (19.6) its value is fetched from the stack. The Sparc cannot increase a memory location, therefore the (cumbersome) loading and storing of registers is necessary.

6.3.4 Changes to the Emerald Kernel

The changes involves inserting calls to the procedures RecordInInvocation and RecordOutInvocation during remote invocations.

The statistic records are allocated with malloc(). When we move an object this storage is not reclaimed, and if the object is moved back, we allocate a new structure. Obviously, this is a shortcut and should be changed if the RIRA were to be used for other purposes than the experiment.

6.4 Reduction of Remote Invocations

Procedure RIRAInvocation finds the node on which the invoking object should reside, if there are sufficient statistics. If this node is not the current one the object is moved.

There is a problem with this scheme due to a deficiency in the current implementation of Emerald. We return to this in subsection 7.2.2.

6.5 Performance Considerations

Although performance is a prime concern when implementing load distribution, in general we have given priority to transparency and extendability, especially in the following places:

(1) The statistic gathering is more specific than needed, but the overhead occurs primarily during remote invocations. Therefore, the overhead is small compared to the time to perform a remote invocation.

(2) When many nodes are used the statistic gathering requires many calls to the time-consuming malloc() function. Note, however, that these calls are only performed during remote invocations that already are time-consuming. A less general data structure (e.g., linked arrays of records) could reduce the number of calls to malloc().

(3) The code generated to update the number of local invocations might be optimized because the compiler probably can supply scratch registers in a efficient way. We have chosen just to push used registers upon the stack.

6.6 Summary

In this chapter, we have described the program modules and their interactions. The LDA follows the design in chapter 4 closely. The main part of the chapter described how statistic gathering was implemented in the Emerald kernel and compiler. In general, transparency and extensibility have priority over efficiency in the implementation.

7 Experiment

In this chapter we experiment with the implemented kernel. The chapter contains the following: in the first section, we discuss the contents of the experiment; the second section contains a discussion of the test conditions; in the third section, we describe how the tests are performed; in sections 7.4 through 7.8, we present results of the experiment; section 7.9 is a conclusion.

7.1 Contents of Experiment

We divide this section in two: first, we describe what the experiment includes. Second, we explicitly mention several items that it does not include.

7.1.1 What the Experiment Includes

The purpose of the experiment is to test the theses of section 1.3 on page *. Therefore, we concentrate on the wall-clock time of executing Emerald applications. We cannot execute all possible Emerald applications, but must select some that represent interesting categories of applications.

We test some compute-intensive applications. Such applications should experience performance gains through the LDA.

An application that should experience gains from both the LDA and the RIRA must be communication-intensive in periods, while being compute-intensive in others. Therefore, we also test the system with such an application.

For the mentioned applications, we measure elapsed wall-clock time of executing them under three different conditions:

o Without LDA and RIRA: ordinary kernel.

o With LDA and without RIRA: LDA kernel.

o With both LDA and RIRA: LDA/RIRA kernel.

For one of the compute-intensive applications speed-up is also measured. For another we measure the impact of moving large objects.

Another interesting item to measure is how our load distribution performs compared to load distribution performed by the Emerald programmer. It is almost impossible for an Emerald programmer to perform the task of the RIRA, so we only test this manual load distribution with a compute-intensive application.

Before performing these experiments, we perform some low level tests to measure how much the statistic gathering alone affects the time to perform a local and a remote invocation.

7.1.2 What the Experiment does not Include

The purpose of the experiment is not to insure that the LDA/RIRA kernel behaves optimally. Therefore, we do not perform tests to optimize kernel constants. Neither do we perform tests where we vary these constants.

The crucial issue of the experiment is to perform a qualitative analysis. We will not perform a quantitative analysis, but briefly mention such results. Reasons for only performing a qualitative analysis are:

o A quantitative analysis will require a larger number of applications and preferably applications that perform useful work.

o The kernel is not optimized.

o A qualitative analysis is sufficient to confirm or deny our theses.

An ideal experiment should not only consist of executions of individual applications, but also test the performance in a real Emerald environment, i.e., one where several users start applications on different nodes. However, it is difficult to control such an environment so that results are comparable, such tests are, therefore, not performed. This also implies that all developed applications creates several Emerald processes, so that the LDA/RIRA has threads to move. In a real environment, non-parallel applications should also experience gains through the LDA/RIRA.

7.2 Test Conditions

Several problems exist when we want to measure elapsed wall-clock time of executing applications. In this section, we discuss these problems and their solutions.

7.2.1 No Exclusive Access to Nodes

The main problem is that the nodes on which we execute the applications differ in many ways: they run with different clock frequencies, have different amounts of memory, access to external storage, caches, etc. It is not possible to take all these factors into account when evaluating the measurements. Our solution involves two parts. First, we always execute runec from the same node. This node is the only one that we can be certain will be used by the Emerald application, therefore is should be the same during all executions. Second, we perform many executions of the same application and select the smallest run-time as the result. By choosing the smallest run-time, we rule out disturbing factors.

Another problem is that the nodes and the network are used by unrelated Unix processes. The solution, as before, is to perform many executions and select the smallest run-time. Also, we perform the executions when the nodes and network are lightly-loaded.

Despite this, some of our measurements are influenced by disturbing factors; this can be seen from figure 28. In the figure the disturbing factors (for the original kernel) are about 0.4% of the total run-time. In other measurements the degree of accuracy may be lower. However, because our conclusions are qualitative, and because even factors much larger than 0.4% should not change these, we believe that our results are significant.

7.2.2 Problems in the Current Version of Emerald

We developed test applications that we unfortunately could not use, because of problems with object mobility in the current version of the Emerald kernel. The problems are time-dependent, but the exact reason is currently unknown. The problems also exist in kernels without LDA/RIRA, but because object movements are much more frequent in a LDA/RIRA kernel, they are of greater importance in this thesis. The problems lead to the executing Emerald kernel aborting with a segmentation fault or bus error.

The result is that we cannot perform exactly the tests we want to. We have developed and would have liked to test an application (a simulated database manager), that in some periods is highly compute-intensive, in other highly communication-intensive, but has a more complex communication structure than the application used in section 7.8. Further, we have developed a communication-intensive application with a complex communication structure. We would have tested this application by executing it on a highly-loaded node to see if it would move to a lightly-loaded node. Also, a large distributed mail-system for Emerald is available, but the mobility problems also prevented us from using this.

In the specification of Emerald a forwarding protocol is used to locate objects, i.e., if object S on node NS remotely invokes T on node NT and S moves before the invocation reply returns, the invocation reply is forwarded to the node where S now resides.

In the current implementation of the Emerald kernel, there is no forwarding of invocation replies. When NS receives an invocation reply to a non-resident object, the reply is discarded. However, before NT sends its reply to NS, it checks if S is resident. In that case, the reply is passed to S directly; this is an optimization.

This implies that the RIRA works perfectly on two nodes, but not on three or more. We therefore cannot test the RIRA on more than two nodes. The consequence is that the LDA is tested much more thoroughly than the RIRA.

7.3 Performing the Tests

In the header of all tables in this chapter is written the number of executions, that have been performed for each of the indicated time measurements, e.g., for the results in figure 21 on page * a total of 36 runs have been made. This gives an impression of the uncertainty involved in the measurements.

The names of nodes used are Narfe, Modi, Gere, Garm, Freke, Trud, Rimfaxe, and Skinfaxe. They are all equipped with Sparc processors. We use the nodes in this order, i.e., when we require one node, we use Narfe, when we require two, we use Narfe and Modi, and so on.

The table columns named LDA Migrations and RIRA Migrations contain the number of migrations for the nodes in the test written as:

(XNarfe, XModi, XGere, XGarm, ...),

where XNarfe is the number of migrations from Narfe to some other node, and so on.

When performing tests with the ordinary kernel or the LDA kernel, i.e., with RIRA turned of, we compile the Emerald applications so that no statistic gathering is performed. In this way, the impact of statistic gathering is only put upon kernels that need it.

It should be noted that the applications we use in the tests are rather primitive and perform no "useful" work.

7.4 Statistic Gathering's impact on Invocation Performance

We have tested this by using an Emerald timing application programmed by Eric Jul in 1987.

The results are shown in figure 20. Both the LDA and the RIRA has been turned off during the tests.

The statistic gathering has no impact on the time to perform a remote invocation. The reason is that the time required for statistic gathering is insignificant compared to the total time required. For local invocations, the run-time is increased by 43% when gathering statistics. Without statistic gathering one can perform about 3000 local invocations in the time required for one remote invocation. With statistic gathering, this number is reduced to about 2100.

In the following, we investigate how this overhead affects applications at higher levels.

7.5 Compute-intensive Application

In this section, we first describe the application used, then we measure the benefit of using the LDA and RIRA, and finally we measure speed-up.

7.5.1 Application

We have developed two versions of an application which determines whether a given odd integer, N, is prime.

The application determines whether N is prime by dividing N by all odd numbers from 3 to ŻN . If N is divisible by any of these numbers, N is not prime, and the program terminates. When all divisions have completed, without N being divisible by any of them, N is prime.

Two versions of this application have been made. The sequential version is straightforward, having one Emerald process performing all divisions. The sequential version is used during the speed-up test. The parallel version is made by letting the central section of the application split the divisions up in intervals, after which each is done by a separate Emerald process. After an Emerald process has been created, it does not communicate with the central section until it has found N to be divisible by a number, or has performed all divisions in its interval. Therefore, given a large prime, the application is very compute-intensive.

For the testing, we choose two large primes as N and makes the central section create 10 Emerald processes to perform the divisions. 10 is chosen as a arbitrary number greater than the maximal number of nodes used in the speed-up test. We use a large prime (321,677,767) through out the tests in this section expect in one place, where we use a smaller prime (100,006,607).

7.5.2 The Benefit of LDA and RIRA

Figure 21 shows the results of executing the application using two nodes. Several things can be deduced.

First, using the LDA kernel reduces the run-time by 26%. Second, the RIRA is not used at all. This is not surprising, because the application performs few invocations of global objects and because the RIRA is not activated during high load. Third, the overhead of performing the statistic gathering is about 3.5%. Compared to the 43% increasing of local invocations of section 7.4, this is a good result. The LDA performs five migrations from Narfe. These are performed in the first few seconds of the run-time. During the execution, there are five threads executing on each processor. The five migrations from Modi are performed when the Emerald processes terminate, thereby off-loading Modi on which the Emerald processes have not reached as far as the Emerald processes on Narfe, because they have been migrated. All in all, this is exactly the way we want the LDA to work.

7.5.3 Speed-Up

In figure 22 we show the run-times and the calculated speed-up of executions on a number of nodes varying from 1 to 7.

Note that the run-time of the sequential algorithm is 6% lower than that of the parallel algorithm on a single node due to the time-slicing overhead imposed by processor-contention. Also note from the second column that all nodes in each test are used.

In figure 23 the speed-up from figure 22 is shown graphically together with the speed-up for the same application, when calculating on the smaller prime (a table of the run-times of this is shown in appendix 1). In the following we discuss figure 23. All numbers are calculated for the large prime, unless otherwise specified.

For the large prime, there is a continuous rise in speed-up when more processors are used, i.e., within the chosen interval, run-time is reduced, whenever more processors are added to the system. However, the relative improvement decreases, when more than four processors are used. For the smaller prime, the speed-up falls, when using 7 nodes, instead of 6. There may be several reasons for these results. First, some of the processors (Rimfaxe and Skinfaxe) are almost always loaded by unrelated Unix processes. Therefore, when these processors are used they may not improve run-time significantly, or even degrade it. Second, the LDA may use many resources moving objects to the lowest loaded node, when there are many nodes. The overhead of this decreases the (relative) improvement.

By using 7 nodes we can reduce run-time by 65%. We can make this percentage almost arbitrarily large by selecting a larger prime. However, there are three main reason that the speed-up is not larger than 65%.

First, with 10 Emerald processes the allocation of processors cannot be optimal in all cases.

Second, the LDA has a significant latency, because only one object is moved every other second. Let us consider the system with five nodes all being idle from the start. In that case, the LDA must move 8 objects. It then requires 16 seconds until all nodes are equally loaded.

Third, there is starvation, when Emerald processes terminate, because they do not all terminate at the same time. When five nodes are used, there is a period of about five seconds, where one or more nodes experience processor starvation, i.e., in these five seconds, fewer than five Emerald processes are running (we have measured this during the testing. The result is not documented.)

In total, these two last reasons imply that for 21 seconds, or 25%, of the total wall-clock time, not all nodes are used (on a system of five nodes).

The latency problem can be reduced by letting the LDA move several objects in each iteration, or by reducing the iteration time, e.g., by only waiting until a certain number of nodes have replied to a broadcast. The processor starvation problem is more complicated to handle, and it is a minor problem when several applications are executed at the same time, because the work-load in that case is more evenly distributed.

Even with these problems and deficiencies, the advantage of the LDA is obvious for compute-intensive applications.

7.6 Compute-intensive Application with Large Data Area

In this section, we test the impact of moving large objects with the LDA. First, we describe the application used, then we presents results from measurements of run-times under different conditions.

7.6.1 Application

We have developed an application which creates a number of Emerald processes that each count to a large integer; the counting processes are created by evaluating of an object P. Apart from the required synchronization at start-up and termination, the application performs no communication and therefore is highly compute-intensive.

Two versions of the application have been made. In the first, small data version, the data area of the P object consists of a single integer, i.e., four bytes. In the second, large data version, we have attached an array of 25,000 integers, i.e., a total size of 100,000 bytes to each P object.

7.6.2 The Impact of Moving Objects with Large Data Areas

Figure 24 shows the results of executing the two versions of the application using two nodes with 12 Emerald processes created. The RIRA is not used because the application is compute-intensive. This is expected and not shown in the figure.

For comparison we have shown the run-time of the application using the ordinary kernel.

The figure shows that performance is improved by the LDA for both versions. So even though about 1.3 megabytes are transferred across the network in the large data version, load distribution improves performance. The run-time of the large data version is 23% higher than that of the small data version. The reason for this is not primarily due to the cost of transferring data, because this is done in parallel with processing. However, there is a longer delay until migrated threads start execution on the remote node. Therefore it takes a longer time until the LDA has distributed the threads evenly, i.e., the processing power of the remote node is not used for a longer period than in the small data version. Furthermore, the LDA moves two more objects in the large data version. This may be due to the mentioned delay, i.e., the remote node does not report a larger load until all data of the migrated threads have been received and scheduled.

In figure 25 we shown the same results as in figure 24, but in this case only two Emerald processes are created. Here, the LDA only increases performance for the small data version. The reason is that the transfer cost now is significant compared to the total run-time, which is reduced. Further, the run-time of the large data version is increased significantly. The main issue of this is to note that load distribution can decrease performance significantly for objects with a large data area, when the run-time is short. However, in Emerald it is not in general possible to estimate remaining run-time.

The conclusion is that for compute-intensive applications of some length the size of the data area is not of great importance to the performance gains of the LDA. Even though we increase the required number of network packets considerable, the run-time only increases slightly.

7.7 Compute-intensive Application with Manual Load Distribution

In this section we test how well the LDA performs compared to an Emerald programmer's manual load distribution. First, we describe the application used, then we present the results of measurements of run-times under different conditions.

7.7.1 Application

We use essentially the same application as described in section 7.6.1. We only use the small data version, and instead make two new versions of the application; both versions create two Emerald processes. The plain version is exactly as in section 7.6.1, apart that the Emerald processes count to a larger number. In the manual version one of the Emerald processes is migrated by a move statement to a remote node just after creation.

7.7.2 Comparison

In figure 26 the results of executing the two versions of the application using two nodes are shown. The plain version is executed on a LDA/RIRA kernel, while the manual version is executed on an ordinary kernel. The RIRA is not used because the application is compute-intensive.

The figure shows that the Emerald programmer can obtain better performance than the LDA. This is expected for two reasons. First, in this case it is obvious to the programmer that exactly one object should be moved to a remote node so that the nodes are evenly loaded (provided that they are otherwise idle). The LDA performs four movements, which in this case are too many. Optimization of the LDA might solve this problem. Second, the manual movement occurs almost immediately after start-up, because the programmer does not need the latency of the LDA.

The conclusion is that in very simple cases, the Emerald programmer can outperform the LDA.

7.8 Load varying application

In this section, we present measurements of run-times with an application that in certain periods is compute-intensive, while being communication-intensive in others. First, we describe the application, then we present measurements.

7.8.1 Application

The application is a simulated card game. It consists primarily of three object definitions: a main object, a dealer object, and a player object. The dealer object has no process definition. The main object creates a number of players by evaluating the player object. After that, the main object waits for the game to end, to measure elapsed time. Thus, during most of the game, only Emerald processes created in the player objects are active.

A player follows the algorithm shown in figure 27. The players execute in parallel, because they pass the turn on to the next player and thereafter start to think about the next move. Interactions with the dealer are implemented as a number, Interactions, of invocations of the dealer object. The thinking is implemented as a for-loop counting to a number, X. X is varied by first increasing it until a certain maximum, WorkloadMaximum, is reached, then it is decreased until it reaches a certain minimum, WorkloadMinimum. During this, the players use the intermediate values of X. Thus, there is a stable, contiguous change in how compute-intensive the application is.

The application needs both the LDA and the RIRA, because the load changes: in periods with high load players should spread-out among processors; in periods with low load they should be on the node where the dealer object resides.

By varying Interactions, WorkLoadMaximum and WorkLoadMinimum the application behaves differently. We have chosen to keep WorkLoadMaximum and WorkLoadMinimum constant in the following, just varying Interactions. The two constants have been selected so that the RIRA migrates objects during periods with low load, and the LDA migrates objects during periods with high load, because this is what we want to investigate.

7.8.2 The Benefit of LDA and RIRA

In figure 28 we show the run-time of the application as a function of Interactions, under the three conditions described in subsection 7.1.1. The measurements that lie behind the curve are shown in appendix 2. Several interesting conclusions can be drawn from the figure.

First, for the ordinary kernel the run-time is independent upon Interactions. That the curve is falling is due to disturbance from unrelated Unix processes.

The run-time is independent of Interactions, because all invocations are local. Therefore the overhead of the interactions is insignificant compared to the counting to X.

Second, the LDA kernel worsens the performance compared to the ordinary kernel. Also, the more interactions, the worse it becomes, because the number of remote invocations increases. This is what we predicted and the reason for developing the RIRA.

Third, it can be seen that the RIRA cannot remedy this performance reduction in all situations. When there are six or more interactions, the ordinary kernel performs better than the LDA/RIRA kernel. At this point, the performance gain by utilizing parallelism becomes smaller than the overhead of making the invocations remote.

The load varying application is well-suited to illustrate the complexity of creating a load distribution algorithm that performs well under different conditions.

7.9 Summary

It has been shown that the LDA significantly reduces the run-time of compute-intensive applications whether or not they contain large data areas; unless they have a very short run-time. The RIRA cannot reduce their run-time, and the statistic gathering therefore imposes an overhead to no avail. In simple cases, an Emerald programmer can obtain a better performance using an ordinary kernel by doing the load distribution with move-statements in his application.

The LDA kernel imposes significant performance reductions (compared to the original kernel) on load varying applications. The problem is that the LDA distributes objects that in some periods communicate heavily.

The RIRA can only remedy this situation to a certain degree. The result is that under some conditions the original kernel has higher performance than the LDA/RIRA kernel. Although optimization of the RIRA might reduce this problem, it is possible that it could reoccur if other load varying applications were tested.

Load varying applications do pose significant problems for the designer of load distributing algorithms in object-oriented systems. We discuss possible solutions to this in the following chapter.

8 Conclusion

The conclusion consists of five sections. In section 8.1, we present the designed algorithms. In section 8.2, we discuss if the theses of section 1.3 on page * have been confirmed or denied. In section 8.3, we give suggestions for future work. Section 8.4 contains a discussion of what have been learned from the thesis. In section 8.5, we summarize in what ways this thesis has contributed to research of load distribution in object-oriented distributed systems.

8.1 Designed Algorithms

The load distribution algorithm (LDA) and communication reduction algorithm (RIRA) have been designed and implemented in the Emerald system.

The LDA works by letting each processor periodically check if it is overloaded by comparing its load to a threshold. When it is it makes a broadcast. Underloaded processors reply to this broadcast and the broadcasting node sends an object with one active thread to the lowest loaded processor. In this way load is distributed from overloaded processors to underloaded processors.

The LDA is:

o Dynamic by reacting to the current processor load.

o Adaptive by dynamically varying thresholds and distributing these among nodes.

o Preemptive by migrating threads after they have begun execution.

o Sender-initiated because it is the sending node which decides to move an object.

o Distributed because all processors execute the same LDA, i.e., there is no server, centralized database, etc. All decisions are made by individual kernels only considering resident objects.

o Instable? We cannot guarantee that the LDA is stable.

o Robust because it can continue after kernel crashes. However, the LDA increases the vulnerability of an application by distributing objects on several processors.

The LDA considers processor load, but not the different processing capacity in other aspects, e.g., available memory, clock frequency etc. The LDA arbitrarily selects objects to move and has a large latency because it only moves one object every other second on each node.

The RIRA moves the invoking object during a remote invocation to the node with which the object has the most interactions. It is only moved if a certain number of interactions have been recorded. Due to an implementation restriction in Emerald, this only works on one or two nodes.

The RIRA is:

o Dynamic by reacting to collected statistics about interactions.

o Sender-initiated because it is the sending node which decides to move an object.

o Distributed because all processors execute the same RIRA, i.e., there is no server, centralized database, etc. All decisions are made by individual kernels only considering resident objects.

o Instable? There is a possibility that RIRAs on different kernels may induce object thrashing, however, in practise we believe this situation is improbable.

o Robust because it can continue after kernel crashes. Further, by gathering objects, the RIRA decreases an application's vulnerability to kernel crashes. The RIRA is only active in periods where the LDA is inactive.

Problems caused by the potential conflict of the LDA and the RIRA have been reduced by letting them be activated interchangeably. The RIRA is activated at low system load when the LDA is idle. The consequence of this is that the LDA/RIRA system does not reduce communication on overloaded systems.

8.2 Confirmation or Denial of Theses

First, it must be noted that all applications we have used in the experiment are created especially for this, and otherwise perform no useful work. We have not tested the LDA and the RIRA on an Emerald environment with users on different nodes executing applications in parallel. All applications are self-contained and have been executed alone on the system.

The load distribution algorithm significantly reduces the run-time of compute-intensive applications. The improvements, however, are limited by processor starvation and latency in the algorithm. The performance gain is only slightly dependent on object size. In simple cases, an Emerald programmer can perform better load distribution than the algorithm. Communication reduction cannot improve run-time of compute-intensive applications and induces an overhead for statistic gathering. For local invocations this overhead is 43%. For remote invocations the overhead is insignificant.

As a non-compute-intensive applications we have developed an application that in certain periods is communication-intensive, while in others periods it is compute-intensive.

Non-compute-intensive applications can gain by load distribution and communication reduction. However, the results are ambiguous: there exist situations in which performance is reduced, when load distribution and communication reduction are introduced, compared to the original kernel. However, using load distribution without communication reduction reduces performance even further.

8.3 Future Work

Designing an algorithm that improves performance for all kinds of applications has proven to be complicated. We have four suggestions that might reduce the problem.

First, in the designed algorithm load distribution and communication reduction are not active at the same time. Changing this may solve the problem. However, the overhead and complexity of such a solution might be prohibitive.

Second, it may be possible to avoid the overhead at run-time by performing it a compile-time. Aspects of Emerald make this complicated, e.g., dynamic object and thread creation and that the number of nodes may vary dynamically. However, a compile-time analysis should in some cases be able to identify the main communication patterns of the application and from this deduce placement decisions, e.g., by automatically attaching objects to one another. A pre-processor should be able to do this; but a run-time load distributing algorithm should be added. This solution is a compromise between our dynamic approach and Chu's static approach [Chu 80].

Third, one could let the Emerald programmer give hints about communication patterns to the load distribution algorithm, e.g., through attachments. However, we do not believe that attachments are sufficient to resolve problems for all non-compute-intensive applications because compute-intensive periods make it necessary to distribute even objects attached to one another.

Fourth, it may be possible to generate such hints from collected statistics about previous executions of the application.

The existing LDA/RIRA kernel might be improved by:

o Optimization of constants.

o Reducing the LDA latency.

o Avoiding broadcasts when only a single kernel is booted.

o Allow users preferring robustness to efficiency to disable the LDA.

o Allow for controlled shut down of Emerald kernels.

8.4 Lessons Learned

In this section, we will briefly describe some of the experience we have gained during the writing of this thesis.

The design problem that required the most attention was how to reduce conflicts between the LDA and the RIRA. The intension was that they should both be active at the same time, and that the "optimal" placement of objects should be found by a clever algorithm. However, as the discussion in section 3.5.3 concluded, the complexity of this was prohibitive. When this was realized the division of objects into active and passive was for long considered a solution, although the problems with this were to some extent understood. Finally, it was decided to abandon the idea that the LDA and the RIRA should be active at the same time. In the final phase of the writing we realized that it would be possible to have the LDA and the RIRA cooperate to a certain degree without experiencing object thrashing. This can be done by limiting RIRA movements, so that it may only migrate to underloaded or normally loaded nodes; the LDA does the same so no conflict occurs.

The adaptive part of the LDA turned out to require more consideration than expected, because somehow all nodes must know about the global state of the system, i.e., the system load. This is not easily obtained without imposing a load on the network, which should be avoided. We have no idea how effective the chosen solution is.

During the final phase of the writing of this thesis we realized that the load distribution latency had large implications for the speed-up. Therefore, we did not correct the LDA, though this would have been simple; but it would have meant that all measurements of chapter 7 should be redone. This would imply several days (and nights) work. We rejected this because efficiency of the LDA/RIRA is not vital to the experiment.

The implementation and experiment encountered problems because of deficiencies and errors in the current version of the Emerald kernel. The mobility problem described in subsection 7.2.2 implied that several weeks were spent without progress in fruitless attempts to solve the problem. However, this is what must be expected from projects requiring implementation, and despite the problems it is a great advantage having access to a system that makes such implementations possible.

In the described version of the LDA, processor and object thrashing have not been experienced. We have executed a version where objects were moved even if they contained several threads. This version often caused kernels to terminate abnormally due to the mobility problems but apart from this we experienced processor thrashing with the card game application, because the dealer object was moved back and forth between nodes by the LDA.

One thing might have made this thesis more interesting, namely if several different strategies for load distribution and especially for communication reduction had been implemented and compared. This was our initial intention, but the work required to do it was prohibitive.

8.5 Contributions

Our work has shown that:

o Load distribution in object-oriented distributed programming systems can significantly improve the performance of compute-intensive applications.

o Load distribution only considering load, not object communication, in object-oriented distributed programming systems may degrade performance seriously.

o Our approach to considering both load and object communication could not, in general, improve performance.

References

 

[Almes 85] G.T.Almes et al: The Eden system: A Technical Review. IEEE Transactions on Software Engineering, SE-11(1). January 1985.

[Andrews 82] G.R.Andrews: The distributed programming language SR - mechanisms, design and implementations. Software-Practice and Experience, vol.12, 1982.

[Ben-Ari 82] M.Ben-Ari: Principles of Concurrent Programming. Prentice-Hall International, Inc., 1982.

[Bennett 90] J.K.Bennett: Experience With Distributed Smalltalk. Software- Practice and Experience, Vol. 20(2), 1990.

[Brinch Hansen 75] P.Brinch Hansen: The programming language Concurrent Pascal. IEEE Transactions on Software Engineering 1(2): 199-207, 1975.

[Casavant 88] Casavant et al: Effects of Response and Stability on Scheduling in Distributed Computing Systems. IEEE Transactions on Software Engineering, Vol.14, no.11, November 1988.

[Chu 80] W.W.Chu et al: Task Allocation in Distributed Data Processing, Computer, 13(11), 1980.

[Eager 86] D.L.Eager, E.D.Lazowska and J.Zahorjan: Adaptive Load Sharing in Homogeneous Distributed Systems. IEEE transaction on software engineering vol. SE-12, no.5, May 1986.

[Eager 88] D.L.Eager et al: The Limited Performance Benefits of Migrating Active Processes for Load Sharing. Proceedings of the 1988 ACM SIGMETRICS Conference on Measurements and Modelling of Computer Systems, Sante Fe, New Mexico, 1988.

[Goscinski 91] A.Goscinski: Distributed Operating Systems. The Logical Design. Addison-Wesley Publishers, 1991.

[Hutchinson 87a] N.C.Hutchinson: Emerald: An Object-Based Language for Distributed Programming. Ph.D.Thesis, Department of Computer Science, University of Washington, Washington, January 1987.

[Hutchinson 87b] N.C.Hutchinson et al: The Emerald Programming Language report. Technical Report 87-10-07, Dept. of Computer Science, University of Washington, Seattle, Washington, October 1987.

[Jones 86] M.B.Jones and R.F.Rashid: Mach and Matchmaker: Kernel and Language Support for Object-Oriented Distributed Systems. OOPSLA'86 Proceedings, 1986.

[Jul 89] Eric Jul: Object Mobility in a Distributed Object-Oriented System, Ph.D. Dissertation. University of Washington, 1989.

[Juul 89] N.C.Juul: STATUS REPORT, Timing the Emerald Implementation on VAXstation 2000. DIKU Report no.89/2, 1989.

[Kunz 91] T.Kunz: The Influence of Different Workload Descriptions on a Heuristic Load Balancing Scheme. IEEE Transactions on Software Engineering, Vol.17, No.7, July 1991.

[Leland 88] W.Leland et al: Load-Balancing Heuristics and Process Behaviour. Proceedings on the Performance '86 and ACM SIGMETRICS 1986.

[Møller-Nielsen 87] Møller-Nielsen et al: Problem-heap: A paradigm for multiprocessor algorithms. Parallel Computing 4, 1987.

[Shivaratri 92] N.G.Shivaratri et al: Load Distributing for Locally Distributed Systems. Computer. Vol.25, No.12, December 1992.

Appendix 1: Measurements behind figure 23

Results of executions of the prime determination applications with the prime 100,006,607 (5 runs).

ÚÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄ┐

Nodes LDA migrations RIRA migrations Time/s Speed-up

ÃÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄ┤

1 (0) (0) 51.6 1.00

ÃÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄ┤

2 (7,4) (0,0) 41.4 1.25

ÃÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄ┤

3 (6,1,1) (0,0,0) 32.0 1.61

ÃÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄ┤

4 (8,2,3,3) (0,0,0,0) 23.7 2.18

ÃÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄ┤

5 (9,1,1,0,1) (0,0,0,0,0) 23.1 2.23

ÃÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄ┤

6 (9,3,4,4,1,1) (0,0,0,0,0,0) 22.7 2.27

ÃÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄ┤

7 (9,2,2,4,0,2,2) (0,0,0,0,0,0,0) 24.1 2.14

ÀÄÄÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÁÄÄÄÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÄÙ

 

Appendix 2: Measurements behind figure 28

Results of executions of the card playing application on two nodes. All results in seconds (8 runs).

ÚÄÄÄÄÄÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ┐

Interactions Original kernel LDA kernel LDA/RIRA kernel

ÃÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ┤

1 118.0 119.8 107.7

ÃÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ┤

5 118.3 124.6 116.8

ÃÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ┤

8 118.5 129.2 120.6

ÃÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ┤

10 118.1 136.9 123.5

ÀÄÄÄÄÄÄÄÄÄÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ

Appendix 3: Brief Danish Summary

 

Følgende er den tekst,

som anvendtes ved annoncering af specialeforsvaret 17.06.93

 

Belastningsudjævning i Emerald: Et eksperiment

 

Emerald er et objekt-orienteret programmeringssystem, som muliggør udførelse af Emerald programmer på et system af datamater forbundet af et netværk. Emerald understøtter flytning af objekter og deres processer datamater imellem. Dette giver mulighed for at implementere belastningsudjævning, således at belastningen af processorerne fordeles, hvorved programmernes køretid reduceres.

Specialet beskriver design, implementation og et eksperiment med en belastningsudjævner i Emerald. Belastningsudjævneren placerer objekter på systemets datamater, således at disses processorer udnyttes jævnt. I perioder med lav belastning samles kommunikerende objekter, således at køretiden ikke øges, fordi objekterne må benytte netværket til kommunikation.

Eksperimentet viser, at belastningsudjævning kan reducere køretiden for processor-intensive Emerald programmer mærkbart. For ikke-processor-intensive programmer er det helt nødvendigt at samle kommunikerende objekter, da deres køretid ellers øges pga. øget behov for kommunikation over netværket. Samlingen af kommunikerende objekter, som den er implementeret, kan ikke i alle tilfælde reducere køretiden.