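Differences between O3 and O0 optimization on Linux
Overview: While implementing a simulation of the RISC-V ISA P extension instructions, I ran into a strange problem: on Linux, the same program produced different results when compiled with -O3 and with -O0. Investigation showed that the problem is tied to how the compiler optimizes accesses made through short (int16_t) pointers. Whether this counts as a bug or as over-aggressive optimization is hard to say, but whatever optimizations a compiler applies, it should at least produce correct results. The problem is described below.
Linux test environment: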
$ uname -a
Linux oberon 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ c++ --version
c++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Test case:
#include <cstdint>  // uint64_t, int16_t, uint32_t
#include <cstdio>   // printf

int main() {
    uint64_t rd = 0;
    uint64_t rs1 = 0x0101010101;
    uint64_t rs2 = 0x1010101010;
    // View each 64-bit value as an array of 16-bit lanes.
    int16_t *rs1_p = (int16_t *)&rs1;
    int16_t *rs2_p = (int16_t *)&rs2;
    int16_t *rd_p = (int16_t *)&rd;
    for (uint32_t i = 0; i < 4; ++i) {
        // Lane-wise add, then print both the lane and the whole 64-bit result.
        *(rd_p + i) = *(rs1_p + i) + *(rs2_p + i);
        printf("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++\n");
        printf("---------------- rd_p= 0x%x ------------\n", *(rd_p + i));
        printf("---------------- rd = 0x%lx -------------\n", rd);
        printf("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++\n");
    }
    return 0;
}
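A brief explanation of the code above: the P extension provides SIMD instructions, which split two uint64_t operands into several uint8_t, uint16_t, or uint32_t elements and operate on them element by element. Here a for loop handles the computation, since the logic is identical for every element.
Admittedly this style is a bit odd. The same thing could be done with the builtin functions (intrinsics) of the x86 SSE instruction set, but that seemed like a hassle, so a for loop will do for now. As a result, pointers are needed here to get at int16_t-sized pieces of memory for the addition.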
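Addendum: as far as I could tell C++ allows this, though it appears to run afoul of the strict aliasing rule (accessing a uint64_t object through an int16_t pointer is undefined behavior under the standard). Corrections are welcome.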
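For comparison, here is a minimal sketch of the same lane-wise addition written so that no object is accessed through a pointer of an unrelated type. It copies the bytes into lane arrays with std::memcpy. The function name add16_lanes is hypothetical and not part of the original test case, and the lanes are declared as uint16_t rather than int16_t (my assumption) so that wraparound is well defined; the resulting bit pattern is the same.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the original test case): adds the four
// 16-bit lanes of rs1 and rs2 and returns the packed 64-bit result.
uint64_t add16_lanes(uint64_t rs1, uint64_t rs2) {
    uint16_t a[4], b[4], d[4];
    static_assert(sizeof a == sizeof rs1, "four 16-bit lanes must fill 64 bits");

    // Copy the bytes out instead of casting the pointer.
    std::memcpy(a, &rs1, sizeof a);
    std::memcpy(b, &rs2, sizeof b);

    for (int i = 0; i < 4; ++i) {
        // Unsigned lanes wrap around modulo 2^16 without carrying into the next lane.
        d[i] = static_cast<uint16_t>(a[i] + b[i]);
    }

    uint64_t rd = 0;
    std::memcpy(&rd, d, sizeof rd);
    return rd;
}
```

Calling add16_lanes(0x0101010101, 0x1010101010) with the operands from the test case should give 0x1111111111 regardless of optimization level.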
O3 & O0 build test:
On Linux, just go for it: simply build two ELF binaries, one with -O3 and one with -O0.
$ c++ -O3 test3.cpp -o test_O3
$ c++ -O0 test3.cpp -o test_O0
Results:
$ ./test_O0
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x1111 ------------
---------------- rd = 0x1111 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x1111 ------------
---------------- rd = 0x11111111 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x11 ------------
---------------- rd = 0x1111111111 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x0 ------------
---------------- rd = 0x1111111111 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
$ ./test_O3
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x3e0a ------------
---------------- rd = 0x0 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x5db6 ------------
---------------- rd = 0x0 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x5dcf ------------
---------------- rd = 0x0 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0xffffd128 ------------
---------------- rd = 0x0 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
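Analysis:
The -O3 result clearly differs from the -O0 result. My guess at the cause: -O3 eliminates intermediate variables and prefers values already held in registers over values in memory, so the stores made through the int16_t pointers are never reflected in the rd that gets printed.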
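One way to take the register-versus-memory question out of the picture is to compute each lane from the values themselves, using shifts and masks, so that no pointer casts are involved at all. The following is a minimal sketch under that assumption; the name add16_shift is hypothetical and the lanes are treated as unsigned 16-bit values with wraparound.

```cpp
#include <cstdint>

// Hypothetical helper: lane-wise 16-bit add using only shifts and masks.
uint64_t add16_shift(uint64_t rs1, uint64_t rs2) {
    uint64_t rd = 0;
    for (unsigned i = 0; i < 4; ++i) {
        const unsigned shift = 16 * i;
        const uint64_t a = (rs1 >> shift) & 0xFFFF;  // lane i of rs1
        const uint64_t b = (rs2 >> shift) & 0xFFFF;  // lane i of rs2
        // Mask the sum so a carry out of lane i never reaches lane i + 1.
        rd |= ((a + b) & 0xFFFF) << shift;
    }
    return rd;
}
```

With the operands from the test case, add16_shift(0x0101010101, 0x1010101010) evaluates to 0x1111111111, matching the -O0 output above.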
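Accesses through short pointers get optimized away by the compiler, so what happens with other types? I replaced int16_t with int8_t. I will not paste the code, since only the pointer type changes (and, judging by the output, the loop now runs eight times, once per byte). The results: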
$ ./test_O3
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x11 ------------
---------------- rd = 0x11 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x11 ------------
---------------- rd = 0x1111 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x11 ------------
---------------- rd = 0x111111 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x11 ------------
---------------- rd = 0x11111111 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x11 ------------
---------------- rd = 0x1111111111 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x0 ------------
---------------- rd = 0x1111111111 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x0 ------------
---------------- rd = 0x1111111111 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x0 ------------
---------------- rd = 0x1111111111 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
$ ./test_O0
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x11 ------------
---------------- rd = 0x11 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x11 ------------
---------------- rd = 0x1111 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x11 ------------
---------------- rd = 0x111111 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x11 ------------
---------------- rd = 0x11111111 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x11 ------------
---------------- rd = 0x1111111111 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x0 ------------
---------------- rd = 0x1111111111 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x0 ------------
---------------- rd = 0x1111111111 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x0 ------------
---------------- rd = 0x1111111111 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
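The byte-lane code is not shown above, so here is a minimal sketch of what such a version could look like. It is not the code that produced the output: it uses unsigned char instead of int8_t, which is my assumption, because unsigned char is one of the types the C++ standard explicitly allows for accessing the bytes of any object. On this platform int8_t is a typedef for signed char, which GCC likely also treats as a character type, and that would fit the matching -O3 and -O0 results. The name add8_lanes is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: adds the eight 8-bit lanes of rs1 and rs2.
uint64_t add8_lanes(uint64_t rs1, uint64_t rs2) {
    uint64_t rd = 0;
    // Byte access through unsigned char pointers is permitted for any object type.
    const unsigned char *rs1_p = reinterpret_cast<const unsigned char *>(&rs1);
    const unsigned char *rs2_p = reinterpret_cast<const unsigned char *>(&rs2);
    unsigned char *rd_p = reinterpret_cast<unsigned char *>(&rd);

    for (std::size_t i = 0; i < sizeof rd; ++i) {
        // The sum is computed as int, then truncated to one byte (wraps modulo 256).
        rd_p[i] = static_cast<unsigned char>(rs1_p[i] + rs2_p[i]);
    }
    return rd;
}
```

For the operands used in this post, add8_lanes(0x0101010101, 0x1010101010) gives 0x1111111111, the same final rd as in the output above.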
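Solution: volatile:
At this point the -O3 and -O0 results agree, with no over-optimization problem. Not being an expert, I cannot judge whether the earlier behavior is a compiler bug or simply the result of my own limited expertise, such as not knowing which compiler flags to add.
Still, a problem has to be solved. After all, neither the program nor I want to run away!!!
Ding: C++'s magic keyword volatile. See the relevant documentation for details.
Simply put, when it qualifies an object, it tells the compiler not to attempt strange optimizations on that object, because its value may change at any time, so every use should fetch it from memory. ("Memory" here covers both main memory and cache: cache coherence keeps the two consistent, so the access does not actually go to main memory unless there is a cache miss.)
Modified code: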
#include <cstdint>  // uint64_t, int16_t, uint32_t
#include <cstdio>   // printf

int main() {
    // volatile forces the compiler to reload these objects from memory on each use.
    volatile uint64_t rd = 0;
    volatile uint64_t rs1 = 0x0101010101;
    volatile uint64_t rs2 = 0x1010101010;
    // Note: these C-style casts also drop the volatile qualifier.
    int16_t *rs1_p = (int16_t *)&rs1;
    int16_t *rs2_p = (int16_t *)&rs2;
    int16_t *rd_p = (int16_t *)&rd;
    for (uint32_t i = 0; i < 4; ++i) {
        *(rd_p + i) = *(rs1_p + i) + *(rs2_p + i);
        printf("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++\n");
        printf("---------------- rd_p= 0x%x ------------\n", *(rd_p + i));
        printf("---------------- rd = 0x%lx -------------\n", rd);
        printf("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++\n");
    }
    return 0;
}
$ ./test_O3
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x1111 ------------
---------------- rd = 0x1111 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x1111 ------------
---------------- rd = 0x11111111 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x11 ------------
---------------- rd = 0x1111111111 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
---------------- rd_p= 0x0 ------------
---------------- rd = 0x1111111111 -------------
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++