BEGIN:VCALENDAR
PRODID:-//eluceo/ical//2.0/EN
VERSION:2.0
CALSCALE:GREGORIAN
BEGIN:VEVENT
UID:6578e00f82557547b79523b05efdd8ba
DTSTAMP:20260609T084741Z
SUMMARY:CS MSc Thesis Presentation 5 June 2026
DESCRIPTION:Contact: birger.swahn@cs.lth.se\n\nOn Friday\, 5 June\, ther
 e will be a master's thesis presentation in Computer Science at Lund Uni
 versity\, Faculty of Engineering. The presentation will take place in E:
 2116.\n\nNote to po
 tential opponents: Register as an opponent to the presentation of your cho
 ice by sending an email to the examiner for that presentation (firstname.l
 astname@cs.lth.se). Do not forget to specify the presentation you register
  for! Note that the number of opponents may be limited (often to two)\, so
  you might be forced to choose another presentation if you register too la
 te. Registrations are individual\, just as the oppositions are! More instr
 uctions for opponents are found on the LTH thesis project page.\n\n10:15-
 11:00 in E:2116\nN.B. No more opponents for this presentation.\nPresent
 er: Einar Bratthall\nTitle: Optimization techniques for inter-process co
 mmunication over shared memory on different CPU architectures\nExaminer
 : Flavius Gruian\nSupervisor: Jonas Skeppstedt (LTH)\n\nThis thesis inves
 tigates optimization te
 chniques for shared-memory single-producer multiple-reader (SPMR) FIFO q
 ueues through a progressive series of 16 implementations\, each isolatin
 g a s
 pecific design choice. The implementations range from a mutex-based baseli
 ne to mutex-free atomic designs employing per-consumer indices with strict
  cache-line isolation and batched publication. All designs are benchmarked
  on two production server platforms\, Intel Xeon Gold 6444Y (x86-64) and N
 VIDIA Grace (ARM64)\, with 2 to 12 readers\, measuring throughput\, tail l
 atency\, batch sensitivity\, and behavior under dynamic workloads.\n\nTh
 e results demonstrate that mutex-free atomic synchronization yields a 4
 –30× throughput improvement over mutex-based designs\, depending on i
 mplementation and reader count\, and that within this design space\, cac
 he-line isolation and the choice of reader-acknowledgment instruction do
 minate performance. The best-performing design\, per-consumer indices wi
 th cache-line isolation\, achieves a 50th-percentile (P50) latency of 15
 6 ns on x86. With dual-side batching (reader and writer) at 12 readers o
 n ARM\, the aggregat
 e fan-out throughput reaches 1.9 billion message deliveries per second (15
 5 M per reader). A key finding is that implementation rankings reverse bet
 ween architectures: designs using locked read-modify-write atomics achieve
  only 30–50% of their x86 throughput on ARM\, while plain-store-based de
 signs achieve up to 3.1× higher throughput on ARM than on x86 at 12 reade
 rs. These rank reversals demonstrate that cross-architecture evaluation is
  essential for portable high-performance queue design.\n\nMore informatio
 n about the event: https://www.cs.lth.se/evenemang/cs-msc-thesis-presen
 tation-5-june-2026
DTSTART:20260605T081500Z
DTEND:20260605T090000Z
LOCATION:E:2116
END:VEVENT
END:VCALENDAR
