SAAPP: Background
Because of potential nondeterminism caused by the interleaving of execution, debugging parallel programs is considerably more difficult than debugging sequential programs. Nondeterminism is the possibility that a parallel program yields different outputs for different runs with the same input data [RAM85]. In addition to the problem of non-determinism and reproducibility, a traditional test setting suffers from lack of controllability and observability [GHL98].
Errors such as deadlocks, livelocks and race conditions are not usual for sequential programs but are all latent in parallel shared memory programs. Without a formal proof, the very existence of these errors can be very hard to find.
Taylor showed that, for Ada programs containing no branches or conditionally executed code, the problem of computing the possible execution orderings is NP-complete [TAY80,TAY83]. This result explains why formal methods are limited. Since the overall problem of verification of parallel progams is computationally hard this project will have an initial focus on race condition analysis. After this area is understood, a possible extension to other categories of errors is possible.
Although there are many mechanisms designed to prevent data races from occuring, such as Hoare's Monitor [HOA74], there are very few methods to detect them. In [NET91] Netzer claims that previous work on data race detection has only characterized races informally or not at all. Netzer presents a formal model for reasoning about race conditions. By formulating the race-detection problem in terms of the formal model proofs can be given regarding the complexity. For programs using synchronization powerful enough to implement two-process mutual exclusion, Netzer prove that the problem of deciding the necessary ordering relations is either NP-hard or co-NP-hard. For weaker synchronization, the problem can be decided in time linear in the number of events in the program execution.
Static and dynamic methods
In general, analysis of parallel processes can be either static (compile-time) or dynamic (run-time). Formal methods are inherently static and can be used for design and verification purposes. Likewise, semantic-based program analysis is static, whereas testing is dynamic.
Ignoring the limited size of tractable programs suitable for formal and semantic based analysis, one could still argue that the best time to ensure correct concurrent behavior is during the design phase, and not at compile- or runtime. On the contrary, there are several reasons why static and dynamic methods are useful for detecting errors in concurrent programs [SLO99]:
- There is a vast difference between the design and the implementation. A programmer will usually make mistakes in the implementation of the design.
- For very large and important systems (e.g. operating systems) there is a great benefit to having more than one method to check correctness, especially when more than one programmer is working on the project.
- Dynamic checking could be useful for foreign programs or objects in a large distributed system.
- The study of the techniques for data race detection could also be useful in designing new methods of data race prevention.
- Most methods do not allow for the protection of dynamically generated shared variables.
Static methods
Given the NP-completeness of the general race detection problem, two different approaches to static analysis have emerged [NET91]. Firstly, some methods traverse the space of all possible states that a program may enter. This state space can either be constructed explicitly, by building a graph, or implicitly, by constructing a representation of the state space (such as a formal language). In the general case, these methods have exponential time complexity, and in some cases, exponential space complexity as well.
Secondly, other static analysis methods perform a data flow analysis of the program to discover potential event orderings. These methods have polynomial time and space complexity but are less accurate than the state-space methods, sometime reporting races that the program could never exhibit.
Most static analysis methods for race detections assumes that all execution paths are possible. Under this assumption, it suffices to only examine the explicit synchronizations to determine whether sections of code may execute concurrently [NET91]. An exception is the work by Young and Taylor [YOT88], which employs symbolic execution to discover more information about possible execution paths.
Static analysis can also be used to complement dynamic methods. If the possibility of races between some sections of the program can be ruled out, there is no need for dynamic analysis for these parts.
Dynamic methods
Unlike static analysis, dynamic analysis detects the race condition exhibited by a particular execution of the program. Dynamic analysis is especially useful for debugging since precise information about manifestations of particular bugs is available.
Most of the previous work on dynamic race detection are based on Lamport's HappensBefore [LAM78] relationship. In these methods, a partial order of all events in a concurrent execution is created. First the order in which events occur within threads is noted. Then between threads, the order is determined by the properties of the synchronization objects that they access.
Typically, dynamic race detection methods use binary rewriting to instrument the program so that information about the program events is collected during execution. An event represents an execution instance of either a single synchronization (a synchronization event) or code executed between two synchronization events (a computation event). Two kinds of information is recorded: (1) for each event in the execution, the sets of shared memory locations that are read and written by the event (READ and WRITE sets), and (2) the relative order in which some events execute. The unavoidable intrusion inherent in all methods using binary rewriting gives rise to the probe effect [GAI86] possibly causing "Heisenbugs" [ZRK96] , i.e., by attempting to observe program behavior, the probes may inadvertently affect the outcome of the observation.
Methods based on the Happens-Before relation are difficult to implement efficiently because they require per thread information about concurrent access to shared variables. More importantly, tools based on the Happens-Before relation can miss data races because it is highly dependent on the interleaving produced by the scheduler. As for any dynamic test method, a large number of test cases are needed to test many possible interleavings.
Post-Mortem and On-the-Fly analysis
The main differences among dynamic race detection methods lie in the way information is collected and analyzed. Two approaches exist: post-mortem and on-the-fly analysis. Both post-mortem and on-the-fly methods usually instrument the program to collect information required for race detection. Post-mortem methods detects races after the execution has ended by analyzing trace files produced during the monitored execution. On-the-fly methods buffers partial trace information and detects races an on-going analysis as the program executes, without producing (possible very large) trace files. A thrust of previous work on post-mortem and on-the-fly methods has been to develop efficient encoding schemes for data collection.