The dataset I used for these tests is ERS006494. It contains 186073978 Illumina DNA sequences, each 75 nucleotides long.
I used 8 nodes, each with 2 Intel Xeon X5560 processors (8 cores per node) and 24 GiB of memory, so each job had 64 ranks. Storage was served by a Lustre file system. The software versions were GCC 4.7.2, Open MPI 1.6.3, Ray 606be2a7a710a226, and Ray Platform d78e7ec5037c9c9e8a0. The Host Channel Adapter was a Mellanox Technologies MT26428.
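The arithmetic behind the rank count can be sketched as follows. This is only a back-of-the-envelope check: it assumes one MPI rank per core and an even split of node memory across ranks, neither of which is stated explicitly above, and the variable names are purely illustrative.

```python
# Hardware and dataset figures quoted in the text.
NODES = 8
CORES_PER_NODE = 8          # 2 Intel Xeon X5560 processors per node
MEMORY_PER_NODE_GIB = 24
READS = 186073978
READ_LENGTH = 75            # nucleotides per read

# Assumption: one MPI rank per core.
ranks = NODES * CORES_PER_NODE

# Assumption: node memory is shared evenly by the ranks on that node.
memory_per_rank_gib = MEMORY_PER_NODE_GIB / CORES_PER_NODE

# Total size of the dataset in nucleotides.
total_nucleotides = READS * READ_LENGTH

print(f"ranks: {ranks}")
print(f"memory per rank: {memory_per_rank_gib} GiB")
print(f"total nucleotides: {total_nucleotides}")
```

Under these assumptions each rank has roughly 3 GiB of memory available while the job as a whole handles about 14 billion nucleotides.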
The strip command was used to remove unneeded symbols from the Ray executables. The complete command for link-time optimization (LTO) with GCC is available here. The template to run the jobs is available here.
Table 1: Comparison of running times with different compilation options.

| Compilation options | Running time |
|---|---|
| -Wall -std=c++98 -O3 -march=native | 7 hours, 14 minutes, 33 seconds |
| -Wall -std=c++98 -Os -march=native -flto -fwhole-program | 10 hours, 39 minutes, 26 seconds |
| -Wall -std=c++98 -O3 -march=native -flto -fwhole-program | 7 hours, 8 minutes, 36 seconds |
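Apparently, LTO makes no meaningful difference when running Ray on InfiniBand: with -O3, the run took 7 hours, 8 minutes, 36 seconds with LTO versus 7 hours, 14 minutes, 33 seconds without, a gap of about 1.4%. The -Os build with LTO was much slower, at 10 hours, 39 minutes, 26 seconds.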
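To put numbers on that comparison, here is a minimal sketch that converts the running times from Table 1 to seconds and reports each one relative to the plain -O3 build. The helper names (`to_seconds`, `relative_change`) and the shortened labels are my own, not part of Ray or its build scripts.

```python
# Running times from Table 1 as (hours, minutes, seconds).
# Labels are shortened; every build also used -Wall -std=c++98 -march=native.
TIMES = {
    "-O3": (7, 14, 33),
    "-Os -flto -fwhole-program": (10, 39, 26),
    "-O3 -flto -fwhole-program": (7, 8, 36),
}


def to_seconds(hms):
    """Convert an (hours, minutes, seconds) tuple to seconds."""
    hours, minutes, seconds = hms
    return hours * 3600 + minutes * 60 + seconds


def relative_change(baseline, other):
    """Percent change of `other` relative to `baseline` (negative means faster)."""
    return 100.0 * (other - baseline) / baseline


baseline = to_seconds(TIMES["-O3"])
for label, hms in TIMES.items():
    seconds = to_seconds(hms)
    change = relative_change(baseline, seconds)
    print(f"{label}: {seconds} s ({change:+.1f}% vs -O3)")
```

With -O3, LTO saves 357 seconds out of 26073, about 1.4%, while the -Os build with LTO takes about 47% longer than the -O3 baseline.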