Barrier (computer science)

In parallel computing, a barrier is a type of synchronization method. A barrier for a group of threads or processes in the source code means any thread/process must stop at this point and cannot proceed until all other threads/processes reach this barrier.

Many collective routines and directive-based parallel languages impose implicit barriers. For example, a parallel do loop in Fortran with OpenMP will not be allowed to continue on any thread until the last iteration is completed. This is in case the program relies on the result of the loop immediately after its completion. In message passing, any global communication (such as reduction or scatter) may imply a barrier.

Implementation

The basic barrier has mainly two variables, one of which records the pass/stop state of the barrier, the other of which keeps the total number of threads that have entered in the barrier. The barrier state was initialized to be "stop" by the first threads coming into the barrier. Whenever a thread enters, based on the number of threads already in the barrier, only if it is the last one, the thread sets the barrier state to be "pass" so that all the threads can get out of the barrier. On the other hand, when the incoming thread is not the last one, it is trapped in the barrier and keeps testing if the barrier state has changed from "stop" to "pass", and it gets out only when the barrier state changes to "pass". The following C++ code demonstrates this procedure.[1][2]

 1 struct barrier_type
 2 {
 3     // how many processors have entered the barrier
 4     // initialize to 0
 5     int arrive_counter;
 6     // how many processors have exited the barrier
 7     // initialize to p
 8     int leave_counter;
 9     int flag;
10     std::mutex lock;
11 };
12 
13 // barrier for p processors
14 void barrier(barrier_type* b, int p)
15 {
16     b->lock.lock();
17     if (b->leave_counter == p)
18     {
19         if (b->arrive_counter == 0) // no other threads in barrier
20         {
21             b->flag = 0; // first arriver clears flag
22         }
23         else
24         {
25             b->lock.unlock();
26             while (b->leave_counter != p); // wait for all to leave before clearing
27             b->lock.lock();
28             b->flag = 0; // first arriver clears flag
29         }
30     }
31     b->arrive_counter++;
32     int arrived = b->arrive_counter;
33     b->lock.unlock();
34     if (arrived == p) // last arriver sets flag
35     {
36         b->arrive_counter = 0;
37         b->leave_counter = 1;
38         b->flag = 1;
39     }
40     else
41     {
42         while (b->flag == 0); // wait for flag
43         b->lock.lock();
44         b->leave_counter++;
45         b->lock.unlock();
46     }
47 }

The potential problems are as follows:

When sequential barriers using the same pass/block state variable are implemented, a deadlock could happen in the first barrier whenever a thread reaches the second and there are still some threads have not got out of the first barrier.
Due to all the threads repeatedly accessing the global variable for pass/stop, the communication traffic is rather high, which decreases the scalability.

The following Sense-Reversal Centralized Barrier is designed to resolve the first problem. And the second problem can be resolved by regrouping the threads and using multi-level barrier, e.g. Combining Tree Barrier. Also hardware implementations may have the advantage of higher scalability.

Sense-Reversal Centralized Barrier

A Sense-Reversal Centralized Barrier solves the potential deadlock problem arising when sequential barriers are used. Instead of using the same value to represent pass/stop, sequential barriers use opposite values for pass/stop state. For example, if barrier 1 uses 0 to stop the threads, barrier 2 will use 1 to stop threads and barrier 3 will use 0 to stop threads again and so on.[3] The following C++ code demonstrates this.[1][4][2]

 1 struct barrier_type
 2 {
 3     int counter; // initialize to 0
 4     int flag; // initialize to 0
 5     std::mutex lock;
 6 };
 7 
 8 int local_sense = 0; // private per processor
 9 
10 // barrier for p processors
11 void barrier(barrier_type* b, int p)
12 {
13     local_sense = 1 - local_sense;
14     b->lock.lock();
15     b->counter++;
16     int arrived = b->counter;
17     if (arrived == p) // last arriver sets flag
18     {
19         b->lock.unlock();
20         b->counter = 0;
21         // memory fence to ensure that the change to counter
22         // is seen before the change to flag
23         b->flag = local_sense;
24     }
25     else
26     {
27         b->lock.unlock();
28         while (b->flag != local_sense); // wait for flag
29     }
30 }

Combining Tree Barrier

A Combining Tree Barrier is a hierarchical way of implementing barrier to resolve the scalability by avoiding the case that all threads are spinning at the same location.[3]

In k-Tree Barrier, all threads are equally divided into subgroups of k threads and a first-round synchronizations are done within these subgroups. Once all subgroups have done their synchronizations, the first thread in each subgroup enters the second level for further synchronization. In the second level, like in the first level, the threads form new subgroups of k threads and synchronize within groups, sending out one thread in each subgroup to next level and so on. Eventually, in the final level there is only one subgroup to be synchronized. After the final-level synchronization, the releasing signal is transmitted to upper levels and all threads get past the barrier.[4][5]

Hardware Barrier Implementation

The hardware barrier uses hardware to implement the above basic barrier model.[1]

The simplest hardware implementation uses dedicated wires to transmit signal to implement barrier. This dedicated wire performs OR/AND operation to act as the pass/block flags and thread counter. For small systems, such a model works and communication speed is not a major concern. In large multiprocessor systems this hardware design can make barrier implementation have high latency. The network connection among processors is one implementation to lower the latency, which is analogous to Combining Tree Barrier.[6]

References

Solihin, Yan (2015-01-01). Fundamentals of Parallel Multicore Architecture (1st ed.). Chapman & Hall/CRC. ISBN 978-1482211184.
"Implementing Barriers". Carnegie Mellon University.
Culler, David (1998). Parallel Computer Architecture, A Hardware/Software Approach. ISBN 978-1558603431.
Nanjegowda, Ramachandra; Hernandez, Oscar; Chapman, Barbara; Jin, Haoqiang H. (2009-06-03). Müller, Matthias S.; Supinski, Bronis R. de; Chapman, Barbara M. (eds.). Evolving OpenMP in an Age of Extreme Parallelism. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 42–52. doi:10.1007/978-3-642-02303-3_4. ISBN 9783642022845.
Nikolopoulos, Dimitrios S.; Papatheodorou, Theodore S. (1999-01-01). A Quantitative Architectural Evaluation of Synchronization Algorithms and Disciplines on ccNUMA Systems: The Case of the SGI Origin2000. Proceedings of the 13th International Conference on Supercomputing. ICS '99. New York, NY, USA: ACM. pp. 319–328. doi:10.1145/305138.305209. ISBN 978-1581131642.
N.R. Adiga, et al. An Overview of the BlueGene/L Supercomputer. Proceedings of the Conference on High Performance Networking and Computing, 2002.

[1]

"Parallel Programming with Barrier Synchronization". March 2012.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.

[:0-1] Solihin, Yan (2015-01-01). Fundamentals of Parallel Multicore Architecture (1st ed.). Chapman & Hall/CRC. ISBN 978-1482211184.

[impl-2] "Implementing Barriers". Carnegie Mellon University.

[:2-3] Culler, David (1998). Parallel Computer Architecture, A Hardware/Software Approach. ISBN 978-1558603431.

[:1-4] Nanjegowda, Ramachandra; Hernandez, Oscar; Chapman, Barbara; Jin, Haoqiang H. (2009-06-03). Müller, Matthias S.; Supinski, Bronis R. de; Chapman, Barbara M. (eds.). Evolving OpenMP in an Age of Extreme Parallelism. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 42–52. doi:10.1007/978-3-642-02303-3_4. ISBN 9783642022845.

[5] Nikolopoulos, Dimitrios S.; Papatheodorou, Theodore S. (1999-01-01). A Quantitative Architectural Evaluation of Synchronization Algorithms and Disciplines on ccNUMA Systems: The Case of the SGI Origin2000. Proceedings of the 13th International Conference on Supercomputing. ICS '99. New York, NY, USA: ACM. pp. 319–328. doi:10.1145/305138.305209. ISBN 978-1581131642.

[6] N.R. Adiga, et al. An Overview of the BlueGene/L Supercomputer. Proceedings of the Conference on High Performance Networking and Computing, 2002.

[7] "Parallel Programming with Barrier Synchronization". March 2012.

Parallel computing
General	Distributed computing Parallel computing Massively parallel Cloud computing High-performance computing Multiprocessing Manycore processor GPGPU Computer network Systolic array
Levels	Bit Instruction Thread Task Data Memory Loop Pipeline
Multithreading	Temporal Simultaneous (SMT) Speculative (SpMT) Preemptive Cooperative Clustered Multi-Thread (CMT) Hardware scout
Theory	PRAM model PEM Model Analysis of parallel algorithms Amdahl's law Gustafson's law Cost efficiency Karp–Flatt metric Slowdown Speedup
Elements	Process Thread Fiber Instruction window Array data structure
Coordination	Multiprocessing Memory coherency Cache coherency Cache invalidation Barrier Synchronization Application checkpointing
Programming	Stream processing Dataflow programming Models Implicit parallelism Explicit parallelism Concurrency Non-blocking algorithm
Hardware	Flynn's taxonomy SISD SIMD SIMT MISD MIMD Dataflow architecture Pipelined processor Superscalar processor Vector processor Multiprocessor symmetric asymmetric Memory shared distributed distributed shared UMA NUMA COMA Massively parallel computer Computer cluster Grid computer Hardware acceleration
APIs	Ateji PX Boost.Thread Chapel HPX Charm++ Cilk Coarray Fortran CUDA HIP Dryad C++ AMP Global Arrays MPI OpenMP OpenCL OpenHMPP OpenACC TPL PLINQ PVM POSIX Threads RaftLib UPC TBB ZPL
Problems	Automatic parallelization Deadlock Livelock Deterministic algorithm Embarrassingly parallel Parallel slowdown Race condition Software lockout Scalability Starvation
Category: parallel computing Media related to Parallel computing at Wikimedia Commons