Zhang Boyang
|
195d6fc1ad
|
Fix build tags for gccgo (#163)
|
2021-03-18 13:39:19 +01:00 |
Klaus Post
|
0e9e10435f
|
avx2: Add 64 bytes per loop processing (#128)
Not super clean benchmark run, but `BenchmarkGalois` is consistently faster.
```
benchmark old ns/op new ns/op delta
BenchmarkGalois128K-32 2551 2261 -11.37%
BenchmarkGalois1M-32 22492 21107 -6.16%
BenchmarkGaloisXor128K-32 2972 2808 -5.52%
BenchmarkGaloisXor1M-32 25181 23951 -4.88%
BenchmarkEncode10x2x10000-32 5081 4722 -7.07%
BenchmarkEncode100x20x10000-32 383800 346655 -9.68%
BenchmarkEncode17x3x1M-32 264806 263191 -0.61%
BenchmarkEncode10x4x16M-32 8337857 8376910 +0.47%
BenchmarkEncode5x2x1M-32 77119 73598 -4.57%
BenchmarkEncode10x2x1M-32 108424 102423 -5.53%
BenchmarkEncode10x4x1M-32 194427 184301 -5.21%
BenchmarkEncode50x20x1M-32 3870301 3747639 -3.17%
BenchmarkEncode17x3x16M-32 10617586 10602449 -0.14%
BenchmarkEncode_8x4x8M-32 3227254 3229451 +0.07%
BenchmarkEncode_12x4x12M-32 6841898 6847261 +0.08%
BenchmarkEncode_16x4x16M-32 11153469 11048738 -0.94%
BenchmarkEncode_16x4x32M-32 21947506 21826647 -0.55%
BenchmarkEncode_16x4x64M-32 43163608 42971338 -0.45%
BenchmarkEncode_8x5x8M-32 3856675 3780730 -1.97%
BenchmarkEncode_8x6x8M-32 4322023 4437109 +2.66%
BenchmarkEncode_8x7x8M-32 5011434 4959623 -1.03%
BenchmarkEncode_8x9x8M-32 6243694 6098824 -2.32%
BenchmarkEncode_8x10x8M-32 6724456 6657099 -1.00%
BenchmarkEncode_8x11x8M-32 7207693 7340332 +1.84%
BenchmarkEncode_8x8x05M-32 176877 172183 -2.65%
BenchmarkEncode_8x8x1M-32 309716 301743 -2.57%
BenchmarkEncode_8x8x8M-32 5498952 5489078 -0.18%
BenchmarkEncode_8x8x32M-32 22630195 22557074 -0.32%
BenchmarkEncode_24x8x24M-32 28488886 28220702 -0.94%
BenchmarkEncode_24x8x48M-32 56124735 54862495 -2.25%
BenchmarkVerify10x2x10000-32 9874 9356 -5.25%
BenchmarkVerify50x5x50000-32 175610 159735 -9.04%
BenchmarkVerify10x2x1M-32 331276 311726 -5.90%
BenchmarkVerify5x2x1M-32 265466 248075 -6.55%
BenchmarkVerify10x4x1M-32 701627 606420 -13.57%
BenchmarkVerify50x20x1M-32 4338171 4245635 -2.13%
BenchmarkVerify10x4x16M-32 12312830 11932698 -3.09%
BenchmarkReconstruct10x2x10000-32 1594 1504 -5.65%
BenchmarkReconstruct50x5x50000-32 95101 79558 -16.34%
BenchmarkReconstruct10x2x1M-32 38479 37225 -3.26%
BenchmarkReconstruct5x2x1M-32 30968 30013 -3.08%
BenchmarkReconstruct10x4x1M-32 81630 75350 -7.69%
BenchmarkReconstruct50x20x1M-32 1136952 1040156 -8.51%
BenchmarkReconstruct10x4x16M-32 685408 656484 -4.22%
BenchmarkReconstructData10x2x10000-32 1609 1486 -7.64%
BenchmarkReconstructData50x5x50000-32 87090 71512 -17.89%
BenchmarkReconstructData10x2x1M-32 31497 30347 -3.65%
BenchmarkReconstructData5x2x1M-32 23379 22611 -3.28%
BenchmarkReconstructData10x4x1M-32 63853 61035 -4.41%
BenchmarkReconstructData50x20x1M-32 1048807 966201 -7.88%
BenchmarkReconstructData10x4x16M-32 866658 892252 +2.95%
BenchmarkReconstructP10x2x10000-32 544 540 -0.74%
BenchmarkReconstructP10x5x20000-32 1242 1206 -2.90%
BenchmarkSplit10x4x160M-32 2735508 2743214 +0.28%
BenchmarkSplit5x2x5M-32 276232 288523 +4.45%
BenchmarkSplit10x2x1M-32 44389 45517 +2.54%
BenchmarkSplit10x4x10M-32 477282 460888 -3.43%
BenchmarkSplit50x20x50M-32 1608821 1602105 -0.42%
BenchmarkSplit17x3x272M-32 2035932 2034705 -0.06%
BenchmarkParallel_8x8x05M-32 346733 351837 +1.47%
BenchmarkParallel_20x10x05M-32 577127 586232 +1.58%
BenchmarkParallel_8x8x1M-32 722453 729294 +0.95%
BenchmarkParallel_8x8x8M-32 5717650 5817130 +1.74%
BenchmarkParallel_8x8x32M-32 22914260 24132696 +5.32%
BenchmarkStreamEncode10x2x10000-32 6703131 7141021 +6.53%
BenchmarkStreamEncode100x20x10000-32 38175873 39767386 +4.17%
BenchmarkStreamEncode17x3x1M-32 8920549 9218973 +3.35%
BenchmarkStreamEncode10x4x16M-32 21841702 21784898 -0.26%
BenchmarkStreamEncode5x2x1M-32 4088001 3247404 -20.56%
BenchmarkStreamEncode10x2x1M-32 5860652 5932381 +1.22%
BenchmarkStreamEncode10x4x1M-32 7555172 7589960 +0.46%
BenchmarkStreamEncode50x20x1M-32 30006814 30250054 +0.81%
BenchmarkStreamEncode17x3x16M-32 32757489 32818254 +0.19%
BenchmarkStreamVerify10x2x10000-32 6714996 6831093 +1.73%
BenchmarkStreamVerify50x5x50000-32 18525904 18761767 +1.27%
BenchmarkStreamVerify10x2x1M-32 5232278 5444148 +4.05%
BenchmarkStreamVerify5x2x1M-32 3673843 3755283 +2.22%
BenchmarkStreamVerify10x4x1M-32 7184419 7185293 +0.01%
BenchmarkStreamVerify50x20x1M-32 28441187 28574766 +0.47%
BenchmarkStreamVerify10x4x16M-32 8538440 8668614 +1.52%
benchmark old MB/s new MB/s speedup
BenchmarkGalois128K-32 51374.59 57976.36 1.13x
BenchmarkGalois1M-32 46620.03 49679.10 1.07x
BenchmarkGaloisXor128K-32 44106.22 46671.56 1.06x
BenchmarkGaloisXor1M-32 41641.82 43779.89 1.05x
BenchmarkEncode10x2x10000-32 19682.61 21176.81 1.08x
BenchmarkEncode100x20x10000-32 2605.52 2884.71 1.11x
BenchmarkEncode17x3x1M-32 67316.54 67729.50 1.01x
BenchmarkEncode10x4x16M-32 20121.74 20027.93 1.00x
BenchmarkEncode5x2x1M-32 67984.17 71236.47 1.05x
BenchmarkEncode10x2x1M-32 96710.29 102377.00 1.06x
BenchmarkEncode10x4x1M-32 53931.74 56894.82 1.05x
BenchmarkEncode50x20x1M-32 13546.44 13989.82 1.03x
BenchmarkEncode17x3x16M-32 26862.29 26900.64 1.00x
BenchmarkEncode_8x4x8M-32 20794.42 20780.27 1.00x
BenchmarkEncode_12x4x12M-32 22069.16 22051.88 1.00x
BenchmarkEncode_16x4x16M-32 24067.44 24295.58 1.01x
BenchmarkEncode_16x4x32M-32 24461.59 24597.04 1.01x
BenchmarkEncode_16x4x64M-32 24876.09 24987.40 1.00x
BenchmarkEncode_8x5x8M-32 17400.71 17750.24 1.02x
BenchmarkEncode_8x6x8M-32 15527.19 15124.46 0.97x
BenchmarkEncode_8x7x8M-32 13391.15 13531.04 1.01x
BenchmarkEncode_8x9x8M-32 10748.26 11003.58 1.02x
BenchmarkEncode_8x10x8M-32 9979.82 10080.80 1.01x
BenchmarkEncode_8x11x8M-32 9310.73 9142.48 0.98x
BenchmarkEncode_8x8x05M-32 23713.12 24359.50 1.03x
BenchmarkEncode_8x8x1M-32 27084.87 27800.50 1.03x
BenchmarkEncode_8x8x8M-32 12203.94 12225.89 1.00x
BenchmarkEncode_8x8x32M-32 11861.83 11900.28 1.00x
BenchmarkEncode_24x8x24M-32 21200.54 21402.01 1.01x
BenchmarkEncode_24x8x48M-32 21522.77 22017.95 1.02x
BenchmarkVerify10x2x10000-32 10127.24 10688.01 1.06x
BenchmarkVerify50x5x50000-32 28472.25 31301.75 1.10x
BenchmarkVerify10x2x1M-32 31652.63 33637.74 1.06x
BenchmarkVerify5x2x1M-32 19749.74 21134.27 1.07x
BenchmarkVerify10x4x1M-32 14944.92 17291.25 1.16x
BenchmarkVerify50x20x1M-32 12085.46 12348.87 1.02x
BenchmarkVerify10x4x16M-32 13625.80 14059.87 1.03x
BenchmarkReconstruct10x2x10000-32 62723.68 66470.81 1.06x
BenchmarkReconstruct50x5x50000-32 52575.87 62847.32 1.20x
BenchmarkReconstruct10x2x1M-32 272507.04 281685.84 1.03x
BenchmarkReconstruct5x2x1M-32 169299.03 174685.39 1.03x
BenchmarkReconstruct10x4x1M-32 128455.17 139161.42 1.08x
BenchmarkReconstruct50x20x1M-32 46113.48 50404.73 1.09x
BenchmarkReconstruct10x4x16M-32 244777.11 255561.72 1.04x
BenchmarkReconstructData10x2x10000-32 62160.46 67305.98 1.08x
BenchmarkReconstructData50x5x50000-32 57411.81 69917.97 1.22x
BenchmarkReconstructData10x2x1M-32 332909.82 345526.29 1.04x
BenchmarkReconstructData5x2x1M-32 224254.60 231868.74 1.03x
BenchmarkReconstructData10x4x1M-32 164216.61 171799.68 1.05x
BenchmarkReconstructData50x20x1M-32 49988.98 54262.82 1.09x
BenchmarkReconstructData10x4x16M-32 193585.15 188032.29 0.97x
BenchmarkReconstructP10x2x10000-32 183806.57 185284.57 1.01x
BenchmarkReconstructP10x5x20000-32 160985.46 165852.51 1.03x
BenchmarkParallel_8x8x05M-32 12096.63 11921.17 0.99x
BenchmarkParallel_20x10x05M-32 18168.91 17886.72 0.98x
BenchmarkParallel_8x8x1M-32 11611.28 11502.36 0.99x
BenchmarkParallel_8x8x8M-32 11737.14 11536.42 0.98x
BenchmarkParallel_8x8x32M-32 11714.78 11123.31 0.95x
BenchmarkStreamEncode10x2x10000-32 14.92 14.00 0.94x
BenchmarkStreamEncode100x20x10000-32 26.19 25.15 0.96x
BenchmarkStreamEncode17x3x1M-32 1998.28 1933.60 0.97x
BenchmarkStreamEncode10x4x16M-32 7681.28 7701.31 1.00x
BenchmarkStreamEncode5x2x1M-32 1282.50 1614.48 1.26x
BenchmarkStreamEncode10x2x1M-32 1789.18 1767.55 0.99x
BenchmarkStreamEncode10x4x1M-32 1387.89 1381.53 1.00x
BenchmarkStreamEncode50x20x1M-32 1747.23 1733.18 0.99x
BenchmarkStreamEncode17x3x16M-32 8706.79 8690.67 1.00x
BenchmarkStreamVerify10x2x10000-32 14.89 14.64 0.98x
BenchmarkStreamVerify50x5x50000-32 269.89 266.50 0.99x
BenchmarkStreamVerify10x2x1M-32 2004.05 1926.06 0.96x
BenchmarkStreamVerify5x2x1M-32 1427.08 1396.13 0.98x
BenchmarkStreamVerify10x4x1M-32 1459.51 1459.34 1.00x
BenchmarkStreamVerify50x20x1M-32 1843.41 1834.79 1.00x
BenchmarkStreamVerify10x4x16M-32 19649.04 19353.98 0.98x
```
|
2020-05-05 16:36:01 +02:00 |
Klaus Post
|
454fd91890
|
Maintenance updates. (#86)
* Add gcc go build tags.
* Update Travis.
* Fix typo
|
2018-11-12 13:25:55 +01:00 |
Klaus Post
|
f5e73dcfe2
|
Split blocks into size divisible by 16
Older systems (typically without AVX2) are more sensitive to misaligned loads and stores.
Add parameter to automatically set the number of goroutines.
name old time/op new time/op delta
Encode10x2x10000-8 18.4µs ± 1% 16.1µs ± 1% -12.43% (p=0.000 n=9+9)
Encode100x20x10000-8 692µs ± 1% 608µs ± 1% -12.10% (p=0.000 n=10+10)
Encode17x3x1M-8 1.78ms ± 5% 1.49ms ± 1% -16.63% (p=0.000 n=10+10)
Encode10x4x16M-8 21.5ms ± 5% 19.6ms ± 4% -8.74% (p=0.000 n=10+9)
Encode5x2x1M-8 343µs ± 2% 267µs ± 2% -22.22% (p=0.000 n=9+10)
Encode10x2x1M-8 858µs ± 5% 701µs ± 5% -18.34% (p=0.000 n=10+10)
Encode10x4x1M-8 1.34ms ± 1% 1.16ms ± 1% -13.19% (p=0.000 n=9+9)
Encode50x20x1M-8 30.3ms ± 4% 25.0ms ± 2% -17.51% (p=0.000 n=10+8)
Encode17x3x16M-8 26.9ms ± 1% 24.5ms ± 4% -9.13% (p=0.000 n=8+10)
name old speed new speed delta
Encode10x2x10000-8 5.45GB/s ± 1% 6.22GB/s ± 1% +14.20% (p=0.000 n=9+9)
Encode100x20x10000-8 1.44GB/s ± 1% 1.64GB/s ± 1% +13.77% (p=0.000 n=10+10)
Encode17x3x1M-8 10.0GB/s ± 5% 12.0GB/s ± 1% +19.88% (p=0.000 n=10+10)
Encode10x4x16M-8 7.81GB/s ± 5% 8.56GB/s ± 5% +9.58% (p=0.000 n=10+9)
Encode5x2x1M-8 15.3GB/s ± 2% 19.6GB/s ± 2% +28.57% (p=0.000 n=9+10)
Encode10x2x1M-8 12.2GB/s ± 5% 15.0GB/s ± 5% +22.45% (p=0.000 n=10+10)
Encode10x4x1M-8 7.84GB/s ± 1% 9.03GB/s ± 1% +15.19% (p=0.000 n=9+9)
Encode50x20x1M-8 1.73GB/s ± 4% 2.09GB/s ± 4% +20.59% (p=0.000 n=10+9)
Encode17x3x16M-8 10.6GB/s ± 1% 11.7GB/s ± 4% +10.12% (p=0.000 n=8+10)
|
2017-11-18 22:00:55 +01:00 |
Frank Wessels
|
3610933d2f
|
Use AVX2 SIMD assembly instructions in favor of BYTE sequences. (#73)
|
2017-11-18 16:17:10 +01:00 |
Klaus Post
|
985e396eec
|
Asmfmt.
|
2017-08-26 11:51:49 +02:00 |
chenzhongtao
|
d78bf472d8
|
add Update parity function (#60)
|
2017-08-20 11:42:39 +02:00 |
Frank
|
467733eb9c
|
Add generated byte assembler using asm2plan9s
Add recompilable assembler using asm2plan9s
|
2016-07-06 21:06:00 +02:00 |
frankw
|
d4000061f2
|
Removed unnecessary JMP instruction
|
2016-07-06 09:39:02 +02:00 |
klauspost
|
efb98c83c7
|
Update asmfmt.
|
2016-01-11 14:44:44 +01:00 |
klauspost
|
a3ee8967cb
|
asmfmt assembler.
|
2015-12-14 14:57:49 +01:00 |
klauspost
|
627f48f59e
|
Add AVX2 assembler functions.
Benchmarks were run on a VM, so the results are somewhat noisy.
benchmark old ns/op new ns/op delta
BenchmarkEncode10x2x10000-8 58372 47421 -18.76%
BenchmarkEncode100x20x10000-8 2635444 1550511 -41.17%
BenchmarkEncode17x3x1M-8 3885495 2231034 -42.58%
BenchmarkEncode10x4x16M-8 24180221 21467661 -11.22%
BenchmarkEncode5x2x1M-8 2395287 2261452 -5.59%
BenchmarkEncode10x2x1M-8 2571278 2566560 -0.18%
BenchmarkEncode10x4x1M-8 3396774 3431916 +1.03%
BenchmarkEncode50x20x1M-8 27004601 20325731 -24.73%
BenchmarkEncode17x3x16M-8 29671393 23668596 -20.23%
BenchmarkVerify10x2x10000-8 109730 101519 -7.48%
BenchmarkVerify50x5x50000-8 3904166 3101568 -20.56%
BenchmarkVerify10x2x1M-8 4398490 4721719 +7.35%
BenchmarkVerify5x2x1M-8 3174574 3296440 +3.84%
BenchmarkVerify10x4x1M-8 5247394 5346667 +1.89%
BenchmarkVerify50x20x1M-8 35742777 26154681 -26.83%
BenchmarkVerify10x4x16M-8 52873512 54931253 +3.89%
benchmark old MB/s new MB/s speedup
BenchmarkEncode10x2x10000-8 1713.14 2108.73 1.23x
BenchmarkEncode100x20x10000-8 379.44 644.95 1.70x
BenchmarkEncode17x3x1M-8 4587.78 7989.92 1.74x
BenchmarkEncode10x4x16M-8 6938.40 7815.11 1.13x
BenchmarkEncode5x2x1M-8 2188.83 2318.37 1.06x
BenchmarkEncode10x2x1M-8 4078.03 4085.53 1.00x
BenchmarkEncode10x4x1M-8 3086.98 3055.37 0.99x
BenchmarkEncode50x20x1M-8 1941.48 2579.43 1.33x
BenchmarkEncode17x3x16M-8 9612.38 12050.26 1.25x
BenchmarkVerify10x2x10000-8 911.32 985.03 1.08x
BenchmarkVerify50x5x50000-8 1280.68 1612.09 1.26x
BenchmarkVerify10x2x1M-8 2383.94 2220.75 0.93x
BenchmarkVerify5x2x1M-8 1651.52 1590.47 0.96x
BenchmarkVerify10x4x1M-8 1998.28 1961.18 0.98x
BenchmarkVerify50x20x1M-8 1466.84 2004.57 1.37x
BenchmarkVerify10x4x16M-8 3173.09 3054.22 0.96x
|
2015-12-14 14:12:09 +01:00 |
klauspost
|
dc9cd67c8c
|
PSHUFB is SSSE3 (Supplemental SSE3), not plain SSE3.
|
2015-06-24 16:57:38 +02:00 |
Klaus Post
|
f1c2cf4160
|
Don't use assembler on app engine.
|
2015-06-21 22:54:13 +02:00 |
Klaus Post
|
1388bd44c4
|
Remove comma. Apparently that is a problem on Go tip.
|
2015-06-21 21:27:32 +02:00 |
Klaus Post
|
5aa37c3492
|
Add AMD64 SSE3 Galois multiplication. Approximately 5-10x faster.
benchmark old MB/s new MB/s speedup
BenchmarkEncode10x2x10000 333.31 5827.17 17.48x
BenchmarkEncode10x2x10000-2 431.20 2802.53 6.50x
BenchmarkEncode10x2x10000-4 553.98 2432.95 4.39x
BenchmarkEncode10x2x10000-8 585.79 3469.61 5.92x
BenchmarkEncode100x20x10000 32.59 583.40 17.90x
BenchmarkEncode100x20x10000-2 59.52 726.70 12.21x
BenchmarkEncode100x20x10000-4 108.04 1363.25 12.62x
BenchmarkEncode100x20x10000-8 113.76 1274.62 11.20x
BenchmarkEncode17x3x1M 215.28 3141.85 14.59x
BenchmarkEncode17x3x1M-2 398.76 3650.12 9.15x
BenchmarkEncode17x3x1M-4 655.32 6071.11 9.26x
BenchmarkEncode17x3x1M-8 832.16 6616.47 7.95x
BenchmarkEncode10x4x16M 154.48 1357.30 8.79x
BenchmarkEncode10x4x16M-2 295.62 2377.92 8.04x
BenchmarkEncode10x4x16M-4 529.89 3519.49 6.64x
BenchmarkEncode10x4x16M-8 632.11 4521.90 7.15x
BenchmarkEncode5x2x1M 327.87 4879.09 14.88x
BenchmarkEncode5x2x1M-2 576.11 2599.20 4.51x
BenchmarkEncode5x2x1M-4 1043.65 3559.12 3.41x
BenchmarkEncode5x2x1M-8 1227.77 4255.34 3.47x
BenchmarkEncode10x2x1M 321.24 4574.68 14.24x
BenchmarkEncode10x2x1M-2 587.73 3100.28 5.28x
BenchmarkEncode10x2x1M-4 1101.96 4770.32 4.33x
BenchmarkEncode10x2x1M-8 1217.08 5812.17 4.78x
BenchmarkEncode10x4x1M 155.34 2037.27 13.11x
BenchmarkEncode10x4x1M-2 298.38 2470.97 8.28x
BenchmarkEncode10x4x1M-4 548.67 3603.15 6.57x
BenchmarkEncode10x4x1M-8 625.23 4827.42 7.72x
BenchmarkEncode50x20x1M 31.37 347.65 11.08x
BenchmarkEncode50x20x1M-2 59.81 713.28 11.93x
BenchmarkEncode50x20x1M-4 105.34 1175.47 11.16x
BenchmarkEncode50x20x1M-8 123.84 1491.91 12.05x
BenchmarkEncode17x3x16M 209.55 1861.59 8.88x
BenchmarkEncode17x3x16M-2 394.19 3331.73 8.45x
BenchmarkEncode17x3x16M-4 643.30 4942.74 7.68x
BenchmarkEncode17x3x16M-8 839.64 6213.43 7.40x
|
2015-06-21 21:23:22 +02:00 |