Lock Cohorting:
A General Technique for Designing NUMA Locks
Aaron Schutza
COMP 522
November 6th, 2014
Outline
• Non-Uniform Memory Access (NUMA) architecture
• Differences between NUMA systems and uniform memory access
• Synchronization on NUMA architectures
• Lock cohorting
• A transformation for creating NUMA locks
• Different cohort lock designs
• Example of a cohort lock in action
• Empirical results
• Conclusion
NUMA Architectures
• Consequence of scaling memory bandwidth with
processor count
• Not all memory is equidistant to all cores
• Typically has system-wide cache coherence
• Memory access latency depends on the distance between
the core and data location
• Memory or cache in local socket
• Memory or cache in remote socket
• Multiple levels of memory locality
• 1 hop, 2 hops, 3 hops, …
A Multi-Socket NUMA System
Synchronization on NUMA
• Problem:
• Passing locks between threads on different sockets can be costly
• Overhead from passing lock and data it protects
• Data that was last accessed on a remote socket produces
long-latency cache misses
• Solution:
• Locks can be designed to improve locality of reference
• Encourage threads with mutual locality to acquire a given lock
consecutively
• Benefits:
• Reduces migration of locks between NUMA nodes
• Reduces cache misses
Example: Hierarchical Backoff Lock
• Test-and-test-and-set lock with a backoff scheme that reduces
cross-node contention on the lock variable
• Thread locality is used to tune the backoff delay
• When acquiring the lock, record the acquiring thread's node ID in the lock state
• When spin-waiting, compare the waiter's node ID with the holder's and back off
longer when the holder is on a remote node
• Limitations:
• Reduces lock migration only probabilistically
• Lots of invalidation traffic: costly for NUMA
• For more details see Radović & Hagersten HPCA 2003
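The backoff idea above can be sketched as follows. This is a minimal illustration, not the algorithm as published by Radović & Hagersten: the names `HboLock`, `backoff_cap_us`, and `hbo_demo`, the two backoff caps, and the use of `sleep_for` as the delay are all assumptions made for the example, and node IDs are passed in by the caller rather than discovered from the hardware.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

constexpr int kFree = -1;         // lock word value when nobody holds the lock
constexpr int kLocalCapUs = 8;    // assumed backoff cap when the holder is on my node
constexpr int kRemoteCapUs = 64;  // assumed (larger) cap when the holder is remote

// The backoff cap depends on whether the holder shares the waiter's node.
int backoff_cap_us(int holder_node, int my_node) {
    return holder_node == my_node ? kLocalCapUs : kRemoteCapUs;
}

class HboLock {
    std::atomic<int> holder_node_{kFree};  // node ID of the holder, or kFree
public:
    void lock(int my_node) {
        int delay_us = 1;
        for (;;) {
            int seen = holder_node_.load(std::memory_order_relaxed);  // "test"
            if (seen == kFree &&
                holder_node_.compare_exchange_strong(seen, my_node,
                                                     std::memory_order_acquire,
                                                     std::memory_order_relaxed))
                return;                                               // "test-and-set" won
            if (seen == kFree) continue;  // defensive: retry at once if the word looks free
            // `seen` is the holder's node ID: remote waiters back off longer.
            std::this_thread::sleep_for(std::chrono::microseconds(delay_us));
            delay_us = std::min(delay_us * 2, backoff_cap_us(seen, my_node));
        }
    }
    void unlock() {
        assert(holder_node_.load(std::memory_order_relaxed) != kFree);
        holder_node_.store(kFree, std::memory_order_release);
    }
};

// Hypothetical driver: threads alternate between two pretend nodes.
long hbo_demo(int threads_per_node, int iters) {
    HboLock lock;
    long counter = 0;  // protected by `lock`
    std::vector<std::thread> pool;
    for (int t = 0; t < 2 * threads_per_node; ++t)
        pool.emplace_back([&, t] {
            int node = t % 2;
            for (int i = 0; i < iters; ++i) { lock.lock(node); ++counter; lock.unlock(); }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

Because remote waiters merely sleep longer, a remote thread can still win the lock at any time, which is why lock migration is reduced only probabilistically.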
Lock Cohorting
• Use two levels of locks:
• Global locks
• Local locks, one for each socket or cluster (NUMA node)
• The first thread in a socket to acquire the local lock:
• Acquires the socket lock, then the global lock
• Passes the local lock to other waiters on the local node
• Eventually relinquishes the global lock to give other nodes a chance
• Recipe for NUMA-aware locks without special algorithms
• Cohorting can compose any kind of lock into a NUMA lock
• Augments properties of cohorted locks with locality preservation
• Benefits:
• Reduces average overhead of lock acquisition
• Reduces interconnect traffic for lock and protected data
Global and Local Lock Properties
• Global lock G:
• Thread-oblivious: acquiring thread can differ from releasing thread
• Globally available to all nodes of the system
• Local lock S:
• Cohort detection property: a thread releasing the lock can detect if
there are threads attempting to acquire the lock
• Records last state of release as global or local
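• Once S is acquired:
• S was last released locally → proceed to the critical section (the cohort still holds G)
• S was last released globally → acquire G first
• Upon release of S:
• IF alone? OR NOT may_pass_local → release globally (give up G as well)
• ELSE → release locally (keep G for the next cohort thread)
• may_pass_local bounds consecutive local handoffs so other nodes get a chance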
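The acquire and release rules above can be sketched as a small template. This is an illustrative sketch under stated assumptions, not the code from Dice et al.: the class names, the handoff limit of 64, the per-node pass counter used to compute `may_pass_local`, and the choice of a test-and-set global lock with ticket local locks are all made up for the example. Node IDs are supplied by the caller.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Thread-oblivious global lock: a plain test-and-set flag, so the releasing
// thread need not be the one that acquired it.
class GlobalSpinLock {
    std::atomic<bool> held_{false};
public:
    void lock() { while (held_.exchange(true, std::memory_order_acquire)) std::this_thread::yield(); }
    void unlock() { held_.store(false, std::memory_order_release); }
};

// Local ticket lock with cohort detection and a recorded release state.
class LocalTicketLock {
    std::atomic<unsigned> next_{0}, serving_{0};
    bool released_locally_ = false;  // touched only while holding the lock
public:
    bool lock() {  // returns true if the previous holder released locally
        unsigned me = next_.fetch_add(1, std::memory_order_relaxed);
        while (serving_.load(std::memory_order_acquire) != me) std::this_thread::yield();
        return released_locally_;
    }
    bool alone() const {  // holder only: is anyone queued behind me?
        return next_.load(std::memory_order_relaxed) -
               serving_.load(std::memory_order_relaxed) == 1;
    }
    void unlock(bool locally) {
        released_locally_ = locally;
        serving_.store(serving_.load(std::memory_order_relaxed) + 1, std::memory_order_release);
    }
};

template <class Global, class Local>
class CohortLock {
    Global global_;
    std::vector<Local> local_;  // one local lock per NUMA node
    std::vector<int> passes_;   // consecutive local handoffs per node
    int limit_;                 // assumed fairness bound behind may_pass_local
public:
    explicit CohortLock(int nodes, int limit = 64)
        : local_(nodes), passes_(nodes, 0), limit_(limit) {}

    void lock(int node) {
        assert(node >= 0 && node < static_cast<int>(local_.size()));
        if (!local_[node].lock()) global_.lock();  // global release: must take G
    }
    void unlock(int node) {
        bool may_pass_local = passes_[node] < limit_;
        if (may_pass_local && !local_[node].alone()) {
            ++passes_[node];
            local_[node].unlock(true);   // local release: the cohort keeps G
        } else {
            passes_[node] = 0;
            global_.unlock();
            local_[node].unlock(false);  // global release
        }
    }
};

// Hypothetical driver: thread t pretends to run on node t % nodes.
long cohort_demo(int nodes, int threads, int iters) {
    CohortLock<GlobalSpinLock, LocalTicketLock> lock(nodes);
    long counter = 0;  // protected by the cohort lock
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&, t] {
            for (int i = 0; i < iters; ++i) { lock.lock(t % nodes); ++counter; lock.unlock(t % nodes); }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

The global lock is released by whichever cohort thread happens to be last, which is why it has to be thread-oblivious.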
Lock Cohorting in Action
• Suppose L is a cohort lock built from global lock G and
socket locks S1 and S2
• t1 encounters the lock L first
[Figure: threads t1 to t8 on Node 1 and Node 2; t1 is acquiring L]
Lock Cohorting in Action
• To enter the critical section t1 must acquire S1 first
[Figure: t1 is acquiring S1]
Lock Cohorting in Action
• After t1 acquires S1, G must be acquired for t1 to enter
the critical section
[Figure: t1 holds S1 and is acquiring G]
Lock Cohorting in Action
• t1 acquires G and enters the critical section
• Subsequently t5 and t6 attempt to acquire L
[Figure: t1 holds S1 and G; t5 and t6 are acquiring L]
Lock Cohorting in Action
• t5 and t6 first compete to acquire S2; t5 wins
• t6 is added to S2’s cohort since it’s spinning on S2
• t5 spins on G
• Threads on node 2 wait until G is released
[Figure: t1 holds S1 and G; t5 holds S2 and is acquiring G; t6 is acquiring S2 (S2 cohort {t6})]
Lock Cohorting in Action
• Next, t2 and t3 encounter L and first seek S1
• t2 and t3 join S1’s cohort
[Figure: t1 holds S1 and G; t2 and t3 are acquiring S1 (S1 cohort {t2,t3}); t5 holds S2 and is acquiring G; t6 is acquiring S2 (S2 cohort {t6})]
Lock Cohorting in Action
• t1 finishes with the lock and sees S1’s cohort is not empty
• t1 releases S1 locally and G remains locked
• t2 acquires S1 and enters the critical section
[Figure: t2 holds S1 and G; t3 is acquiring S1 (S1 cohort {t3}); t5 holds S2 and is acquiring G; t6 is acquiring S2 (S2 cohort {t6})]
Lock Cohorting in Action
• t2 finishes the critical section and sees its cohort is still not
empty
• t2 releases S1 locally and t3 acquires it
[Figure: t3 holds S1 and G (S1 cohort {}); t5 holds S2 and is acquiring G; t6 is acquiring S2 (S2 cohort {t6})]
Lock Cohorting in Action
• t3 exits and sees an empty cohort for S1
• t3 releases S1 globally and then releases G
• The next acquisition of S1 will see that it was released
globally and that thread will seek G
[Figure: t3 is releasing G (S1 cohort {}); t5 holds S2 and is acquiring G; t6 is acquiring S2 (S2 cohort {t6})]
Lock Cohorting in Action
• Now t5 acquires G and enters the critical section
• Upon exiting it will see a non-empty cohort for S2 and
release it locally
[Figure: t5 holds S2 and G; t6 is acquiring S2 (S2 cohort {t6})]
Lock Cohorting in Action
• t6 then enters the critical section
• Upon exiting it will see an empty cohort for S2 and
globally release it
[Figure: t6 holds S2 and G (S2 cohort {})]
Lock Cohorting in Action
• t6 releases G; S1 and S2 are both left in the globally released
state
• "Cohorting" within a node ensures that a lock will not
unnecessarily migrate between nodes 1 and 2
[Figure: threads t1 to t8 idle; G, S1, and S2 are all free]
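The walkthrough above can be replayed with a small single-threaded model that counts how often G changes hands. This is a sketch under simplifying assumptions: the `CohortTrace` type and its fields are invented for the example, both the cohorts and the waiters on G are served in FIFO order, and `may_pass_local` is treated as always true. Node indices 0 and 1 stand for Node 1 and Node 2.

```cpp
#include <cassert>
#include <deque>
#include <vector>

struct CohortTrace {
    struct Node {
        int holder = 0;          // thread holding the socket lock, 0 = free
        std::deque<int> cohort;  // threads spinning on the socket lock
    };
    Node nodes[2];
    int g_owner = -1;            // node whose cohort holds G, -1 = free
    std::deque<int> g_waiters;   // nodes whose socket-lock holder spins on G
    std::vector<int> cs_order;   // threads in order of entering the critical section
    int global_acquisitions = 0;
    int local_handoffs = 0;

    void take_global(int n) {
        g_owner = n;
        ++global_acquisitions;
        cs_order.push_back(nodes[n].holder);
    }

    // Thread `thread` on node `n` tries to acquire L.
    void arrive(int thread, int n) {
        Node& s = nodes[n];
        if (s.holder != 0) { s.cohort.push_back(thread); return; }  // joins the cohort
        s.holder = thread;                                          // got the socket lock
        if (g_owner < 0) take_global(n); else g_waiters.push_back(n);
    }

    // The thread currently in the critical section releases L.
    void finish() {
        assert(g_owner >= 0);
        Node& s = nodes[g_owner];
        if (!s.cohort.empty()) {            // local release: G stays with the node
            s.holder = s.cohort.front();
            s.cohort.pop_front();
            ++local_handoffs;
            cs_order.push_back(s.holder);
            return;
        }
        s.holder = 0;                       // global release
        g_owner = -1;
        if (!g_waiters.empty()) {
            int n = g_waiters.front();
            g_waiters.pop_front();
            take_global(n);
        }
    }
};

// Replays the arrival order used in the slides.
CohortTrace replay_slides() {
    CohortTrace t;
    t.arrive(1, 0);                  // t1 takes S1, then G
    t.arrive(5, 1); t.arrive(6, 1);  // t5 takes S2 and waits for G; t6 joins S2's cohort
    t.arrive(2, 0); t.arrive(3, 0);  // t2 and t3 join S1's cohort
    for (int i = 0; i < 5; ++i) t.finish();
    return t;
}
```

In this replay the critical section is entered in the order t1, t2, t3, t5, t6, with two acquisitions of G and three local handoffs for five critical sections.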
Cohort Lock Designs
• C-BO-BO lock
• Global backoff (BO) lock and local backoff locks per node
• C-TKT-TKT lock
• Global ticket lock and local ticket (TKT) locks per node
• C-BO-MCS lock
• Global backoff lock and local Mellor-Crummey and Scott (MCS) locks per node
• C-MCS-MCS lock
• C-TKT-MCS lock
• Use of abortable locks in cohort designs needs extra
features to limit aborting while in a cohort:
• A-C-BO-BO lock
• A-C-BO-CLH lock (queue lock of Craig, Landin, & Hagersten)
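Each local lock type needs its own way to provide cohort detection. As one illustration, a backoff lock can be extended with a flag that waiters set while they spin. The sketch below is an assumption-laden example rather than the published C-BO-BO code: the names `CohortBoLock`, `successor_exists_`, `Release`, and `waiter_is_detected`, the three lock-word values, and the backoff constants are all chosen for the example.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

enum class Release { Global, Local };

constexpr int kFreeGlobal = 0;  // free, last release was global
constexpr int kFreeLocal = 1;   // free, last release was local (cohort keeps G)
constexpr int kBusy = 2;

class CohortBoLock {
    std::atomic<int> state_{kFreeGlobal};
    std::atomic<bool> successor_exists_{false};  // set by spinning waiters
public:
    // Returns how the previous holder released the lock.
    Release lock() {
        int delay_us = 1;
        for (;;) {
            int s = state_.load();
            if (s != kBusy && state_.compare_exchange_strong(s, kBusy)) {
                successor_exists_.store(false);  // remaining waiters set it again
                return s == kFreeLocal ? Release::Local : Release::Global;
            }
            if (!successor_exists_.load()) successor_exists_.store(true);
            std::this_thread::sleep_for(std::chrono::microseconds(delay_us));
            delay_us = std::min(delay_us * 2, 32);
        }
    }
    // alone?: no waiter has announced itself since the holder acquired.
    bool alone() const { return !successor_exists_.load(); }
    void unlock(Release how) {
        assert(state_.load() == kBusy);
        state_.store(how == Release::Local ? kFreeLocal : kFreeGlobal);
    }
};

// Hypothetical check: a waiting thread becomes visible to the holder and
// then observes the local release state.
bool waiter_is_detected() {
    CohortBoLock s;
    Release first = s.lock();        // uncontended: state was kFreeGlobal
    Release seen = Release::Global;
    std::thread waiter([&] { seen = s.lock(); s.unlock(Release::Global); });
    while (s.alone()) std::this_thread::yield();  // wait until the waiter is spinning
    s.unlock(Release::Local);
    waiter.join();
    return first == Release::Global && seen == Release::Local;
}
```

The flag can briefly read false while waiters exist, right after an acquire clears it. That only causes an unnecessary global release, never a local release with no one to receive it, because a waiter that has set the flag keeps spinning until it gets the lock.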
C-BO-MCS Lock in Action
• 1A acquires local MCS lock and then acquires the global
lock
Dice et al.
C-BO-MCS Lock in Action
• 2A acquires local MCS lock and then spins on the global
lock
• 1A enters the critical section
C-BO-MCS Lock in Action
• 1B and 1C add themselves to local MCS queue
C-BO-MCS Lock in Action
• 1A exits the critical section and sees that it points to 1B
• Because the MCS tail pointer is not null, 1A releases the
MCS lock and leaves the global lock untouched
• 1B is allowed to enter the critical section
C-BO-MCS Lock in Action
• 1B exits the critical section and sees that the queue points
to 1C
• The MCS lock is released locally and acquired by 1C
• 1C enters the critical section
C-BO-MCS Lock in Action
• 1C exits the critical section and sees that the MCS tail
pointer is null
• 1C then releases the global lock and the local lock
C-BO-MCS Lock in Action
• 2A acquires the global lock and the critical section passes
to the other cluster
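The C-BO-MCS sequence above can be sketched in code as follows. This is a simplified illustration, not the implementation from Dice et al.: all type and function names, the yield-based backoff, and the handoff limit of 64 are assumptions, and cohort detection is done here by checking whether the releasing thread's queue node has a successor. The release state is passed to the successor through its queue node.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

enum : int { kWait, kLocal, kGlobal };  // per-node handoff state

struct QNode {
    std::atomic<QNode*> next{nullptr};
    std::atomic<int> state{kWait};
};

class LocalMcs {
    std::atomic<QNode*> tail_{nullptr};
public:
    int lock(QNode* me) {  // returns kLocal or kGlobal
        me->next.store(nullptr);
        me->state.store(kWait);
        QNode* pred = tail_.exchange(me);
        if (pred == nullptr) return kGlobal;  // queue was empty: nobody passed us G
        pred->next.store(me);
        int s;
        while ((s = me->state.load()) == kWait) std::this_thread::yield();
        return s;
    }
    bool alone(const QNode* me) const { return me->next.load() == nullptr; }
    void unlock(QNode* me, int how) {
        QNode* succ = me->next.load();
        assert(how == kGlobal || succ != nullptr);  // local release needs a successor
        if (succ == nullptr) {
            QNode* expected = me;
            if (tail_.compare_exchange_strong(expected, nullptr)) return;  // no waiter
            while ((succ = me->next.load()) == nullptr) std::this_thread::yield();
        }
        succ->state.store(how);  // hand over the lock together with the release state
    }
};

class GlobalBo {  // test-and-test-and-set with growing backoff
    std::atomic<bool> held_{false};
public:
    void lock() {
        int spins = 1;
        while (held_.load() || held_.exchange(true)) {
            for (int i = 0; i < spins; ++i) std::this_thread::yield();
            if (spins < 64) spins *= 2;
        }
    }
    void unlock() { held_.store(false); }
};

class CBoMcs {
    GlobalBo global_;
    std::vector<LocalMcs> local_;  // one MCS queue per cluster
    std::vector<int> passes_;      // consecutive local handoffs per cluster
    int limit_;
public:
    explicit CBoMcs(int clusters, int limit = 64)
        : local_(clusters), passes_(clusters, 0), limit_(limit) {}
    void lock(int c, QNode* me) { if (local_[c].lock(me) == kGlobal) global_.lock(); }
    void unlock(int c, QNode* me) {
        if (!local_[c].alone(me) && passes_[c] < limit_) {
            ++passes_[c];
            local_[c].unlock(me, kLocal);   // successor inherits the global lock
        } else {
            passes_[c] = 0;
            global_.unlock();
            local_[c].unlock(me, kGlobal);
        }
    }
};

// Hypothetical driver: threads 0..k-1 form cluster 0, the next k cluster 1, and so on.
long c_bo_mcs_demo(int clusters, int threads_per_cluster, int iters) {
    CBoMcs lock(clusters);
    long counter = 0;  // protected by the cohort lock
    std::vector<std::thread> pool;
    for (int t = 0; t < clusters * threads_per_cluster; ++t)
        pool.emplace_back([&, t] {
            QNode me;  // per-thread queue node, reused across acquisitions
            int c = t / threads_per_cluster;
            for (int i = 0; i < iters; ++i) { lock.lock(c, &me); ++counter; lock.unlock(c, &me); }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```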
Empirical Results
• Dice et al. conduct experiments on benchmarks that test the
performance of each lock design
• A microbenchmark LBench is used as a representative
workload
• LBench launches identical threads
• Each thread loops as follows:
• Acquire central lock
• Access shared data in critical section
• Release lock
• ~4 µs of non-critical work
• Run on an Oracle T5440 series machine
• 256 hardware threads
• 4 NUMA clusters
• Evaluation shows that cohort locks outperform previous locks
by at least 60%
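The thread loop described above can be approximated with a short harness. This is a rough sketch in the spirit of LBench, not the benchmark itself: `run_lbench`, `LBenchResult`, the single shared counter standing in for the shared data, the busy-wait used for non-critical work, and the choice of `std::mutex` as the lock under test are all assumptions. The non-critical delay and run length are parameters.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

struct LBenchResult {
    long total_ops;       // loop iterations completed by all threads
    long shared_counter;  // value of the data updated in the critical section
    double ops_per_sec;   // throughput over the requested duration
};

// Lock must provide lock() and unlock().
template <class Lock>
LBenchResult run_lbench(int threads, std::chrono::milliseconds duration,
                        std::chrono::microseconds non_critical) {
    using Clock = std::chrono::steady_clock;
    Lock central;
    long shared_counter = 0;            // protected by `central`
    std::atomic<bool> stop{false};
    std::vector<long> ops(threads, 0);  // one slot per thread, written once at exit
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            long n = 0;
            while (!stop.load(std::memory_order_relaxed)) {
                central.lock();         // acquire central lock
                ++shared_counter;       // access shared data
                central.unlock();       // release lock
                auto until = Clock::now() + non_critical;
                while (Clock::now() < until) {}  // non-critical work (busy wait)
                ++n;
            }
            ops[t] = n;
        });
    }
    std::this_thread::sleep_for(duration);
    stop.store(true);
    for (auto& th : pool) th.join();
    long total = 0;
    for (long n : ops) total += n;
    assert(total == shared_counter);    // every iteration ran one critical section
    double seconds = std::chrono::duration<double>(duration).count();
    return {total, shared_counter, total / seconds};
}
```

A call such as `run_lbench<std::mutex>(4, std::chrono::milliseconds(50), std::chrono::microseconds(4))` runs four threads for 50 ms. Any lock type with `lock()` and `unlock()` can be substituted for comparison.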
Average Throughput vs. # of Threads
• These results use LBench
• Similar results were found for different LBench thread settings
• MCS is the baseline scalable lock: low performance without locality awareness
Dice et al.
Conclusions
• Lock cohorting yields an improvement over previous
NUMA-aware lock designs
• Powerful lock design
• No special locks required
• Versatility
• Can be extended to further layers of locality
• e.g., tile-based systems where locality is based on grid position
• Multiple levels of lock cohorts
• Performance scaling with thread count is better with lock
cohorting
References
Z. Radović and E. Hagersten. 2003. Hierarchical backoff locks for nonuniform
communication architectures. In HPCA-9, Anaheim, California, USA, Feb. 2003,
241–252.
D. Dice, V. J. Marathe, and N. Shavit. 2012. Lock cohorting: a general
technique for designing NUMA locks. In Proceedings of the 17th ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming (PPoPP '12). ACM,
New York, NY, USA, 247–256. DOI: 10.1145/2145816.2145848