CS MSc Thesis Presentation 5 June 2026
One Computer Science MSc thesis to be presented on 5 June
On Friday, 5 June, there will be a master's thesis presentation in Computer Science at Lund University, Faculty of Engineering.
The presentation will take place in E:2116.
Note to potential opponents: Register as an opponent to the presentation of your choice by sending an email to the examiner for that presentation (firstname [dot] lastname [at] cs [dot] lth [dot] se). Do not forget to specify which presentation you are registering for! Note that the number of opponents may be limited (often to two), so you might be forced to choose another presentation if you register too late. Registrations are individual, just as the oppositions are! Further instructions for opponents can be found on the LTH thesis project page.
10:15-11:00 in E:2116. N.B. No more opponents for this presentation.
- Presenter: Einar Bratthall
- Title: Optimization techniques for inter-process communication over shared memory on different CPU architectures
- Examiner: Flavius Gruian
- Supervisor: Jonas Skeppstedt (LTH)
This thesis investigates optimization techniques for shared-memory single-producer multiple-reader (SPMR) FIFO queues through a progressive series of 16 implementations, each isolating a specific design choice. The implementations range from a mutex-based baseline to mutex-free atomic designs employing per-consumer indices with strict cache-line isolation and batched publication. All designs are benchmarked on two production server platforms, Intel Xeon Gold 6444Y (x86-64) and NVIDIA Grace (ARM64), with 2 to 12 readers, measuring throughput, tail latency, batch sensitivity, and behavior under dynamic workloads.
The results demonstrate that mutex-free atomic synchronization yields 4–30× throughput improvement over mutex-based designs depending on implementation and reader count, and that within this design space, cache-line isolation and the choice of reader-acknowledgment instruction dominate performance. The best-performing design, per-consumer indices with cache-line isolation, achieves a 50th-percentile (P50) latency of 156 ns on x86. With dual-side batching (reader and writer) at 12 readers on ARM, the aggregate fan-out throughput reaches 1.9 billion message deliveries per second (155 M per reader). A key finding is that implementation rankings reverse between architectures: designs using locked read-modify-write atomics achieve only 30–50% of their x86 throughput on ARM, while plain-store-based designs achieve up to 3.1× higher throughput on ARM than on x86 at 12 readers. These rank reversals demonstrate that cross-architecture evaluation is essential for portable high-performance queue design.
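For readers unfamiliar with the design space, the idea of per-consumer indices with cache-line isolation can be illustrated with a minimal C++ sketch. This is not the thesis's implementation: the names `BroadcastQueue`, `PaddedIndex`, `try_push`, and `try_pop` are hypothetical, the 64-byte line size is an assumption, the queue lives in ordinary process memory rather than a shared-memory segment, and batching is omitted. Each reader acknowledges a message with a plain release store to its own index, which is possible because that index has exactly one writer.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size: 64 bytes is common, but not universal.
constexpr std::size_t kCacheLine = 64;

// One index per cache line, so that an acknowledgment from one reader
// never invalidates the line holding another reader's index.
struct alignas(kCacheLine) PaddedIndex {
    std::atomic<std::uint64_t> value{0};
};

// Hypothetical single-producer, multiple-reader broadcast queue:
// every reader sees every message, in FIFO order.
template <typename T, std::size_t Capacity, std::size_t Readers>
class BroadcastQueue {
    static_assert((Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Producer only. Fails if the slowest reader still occupies the slot.
    bool try_push(const T& item) {
        const std::uint64_t w = write_.value.load(std::memory_order_relaxed);
        for (const PaddedIndex& r : read_) {
            // Acquire pairs with the reader's release store in try_pop,
            // so the reader has finished copying before we overwrite.
            if (w - r.value.load(std::memory_order_acquire) >= Capacity) {
                return false;
            }
        }
        slots_[w & (Capacity - 1)] = item;
        // Release publishes the slot contents together with the new index.
        write_.value.store(w + 1, std::memory_order_release);
        return true;
    }

    // Reader `reader` only. Fails if that reader has consumed everything.
    bool try_pop(std::size_t reader, T& out) {
        assert(reader < Readers);
        const std::uint64_t r =
            read_[reader].value.load(std::memory_order_relaxed);
        if (r == write_.value.load(std::memory_order_acquire)) {
            return false;
        }
        out = slots_[r & (Capacity - 1)];
        // Acknowledge with a plain store rather than a read-modify-write:
        // this index is written by a single thread.
        read_[reader].value.store(r + 1, std::memory_order_release);
        return true;
    }

private:
    PaddedIndex write_;                      // written by the producer only
    std::array<PaddedIndex, Readers> read_;  // one line per reader
    std::array<T, Capacity> slots_{};
};
```

In this sketch the producer must scan every reader index before writing, so the cost of a push grows with the reader count; how such trade-offs play out on each architecture is the kind of question the thesis evaluates empirically.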
About the event
Location:
E:2116
Contact:
birger [dot] swahn [at] cs [dot] lth [dot] se