Architecture Support for Improving Bulk Memory Copying and Initialization Performance

Xiaowei Jiang, Yan Solihin, Li Zhao and Ravishankar Iyer

Bulk (large-region) memory copying and initialization is one of the most ubiquitous operations performed in current computer systems by both user applications and operating systems. While many current systems rely on a loop of loads and stores, there are proposals to introduce a single instruction to perform large-region memory copying. Such an instruction can improve performance by generating fewer TLB and cache accesses and requiring fewer pipeline resources. In this paper, however, we show that the key to significantly improving the performance of such instructions is removing the pipeline and cache bottlenecks of the code that follows them. We show that the bottlenecks arise due to (1) the pipeline being clogged by the copying instruction, (2) a lengthened critical path due to dependent instructions stalling while waiting for the copying to complete, and (3) the inability to specify (separately) the cacheability of the source and destination regions.

We propose FastBCI, an architecture support that achieves the granularity efficiency of a bulk copying/initialization instruction, but without its pipeline and cache bottlenecks. When applied to OS kernel buffer management, we show that on average FastBCI achieves speedups between 23% and 32%, which is roughly 3×–4× that of an alternative scheme, and 1.5×–2× that of a highly optimistic DMA engine with zero setup and interrupt overheads.
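For concreteness, the "loop of loads and stores" baseline mentioned above can be sketched in C as follows. This is only an illustrative sketch, not code from the paper: the function names copy_loop and init_loop and the 64-bit word granularity are assumptions made here, and an optimizing compiler may replace such loops with calls to library routines.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical baseline: copy n 64-bit words from src to dst.
 * Each iteration issues one load and one store, so the number of
 * TLB and cache accesses grows with the size of the region. */
void copy_loop(uint64_t *dst, const uint64_t *src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}

/* Hypothetical baseline: initialize n 64-bit words to zero,
 * one store per word. */
void init_loop(uint64_t *dst, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = 0;
    }
}
```

A single bulk copying or initialization instruction would replace each of these loops with one operation on the whole region, which is the source of the granularity efficiency discussed above.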
