Commit Graph

16 Commits (jerasure-matrix)

Author SHA1 Message Date
Zhang Boyang 195d6fc1ad
Fix build tags for gccgo (#163) 2021-03-18 13:39:19 +01:00
Klaus Post 0e9e10435f
avx2: Add 64 bytes per loop processing (#128)
* avx2: Add 64 bytes per loop processing

Not super clean benchmark run, but `BenchmarkGalois` is consistently faster.

```
benchmark                                 old ns/op     new ns/op     delta
BenchmarkGalois128K-32                    2551          2261          -11.37%
BenchmarkGalois1M-32                      22492         21107         -6.16%
BenchmarkGaloisXor128K-32                 2972          2808          -5.52%
BenchmarkGaloisXor1M-32                   25181         23951         -4.88%
BenchmarkEncode10x2x10000-32              5081          4722          -7.07%
BenchmarkEncode100x20x10000-32            383800        346655        -9.68%
BenchmarkEncode17x3x1M-32                 264806        263191        -0.61%
BenchmarkEncode10x4x16M-32                8337857       8376910       +0.47%
BenchmarkEncode5x2x1M-32                  77119         73598         -4.57%
BenchmarkEncode10x2x1M-32                 108424        102423        -5.53%
BenchmarkEncode10x4x1M-32                 194427        184301        -5.21%
BenchmarkEncode50x20x1M-32                3870301       3747639       -3.17%
BenchmarkEncode17x3x16M-32                10617586      10602449      -0.14%
BenchmarkEncode_8x4x8M-32                 3227254       3229451       +0.07%
BenchmarkEncode_12x4x12M-32               6841898       6847261       +0.08%
BenchmarkEncode_16x4x16M-32               11153469      11048738      -0.94%
BenchmarkEncode_16x4x32M-32               21947506      21826647      -0.55%
BenchmarkEncode_16x4x64M-32               43163608      42971338      -0.45%
BenchmarkEncode_8x5x8M-32                 3856675       3780730       -1.97%
BenchmarkEncode_8x6x8M-32                 4322023       4437109       +2.66%
BenchmarkEncode_8x7x8M-32                 5011434       4959623       -1.03%
BenchmarkEncode_8x9x8M-32                 6243694       6098824       -2.32%
BenchmarkEncode_8x10x8M-32                6724456       6657099       -1.00%
BenchmarkEncode_8x11x8M-32                7207693       7340332       +1.84%
BenchmarkEncode_8x8x05M-32                176877        172183        -2.65%
BenchmarkEncode_8x8x1M-32                 309716        301743        -2.57%
BenchmarkEncode_8x8x8M-32                 5498952       5489078       -0.18%
BenchmarkEncode_8x8x32M-32                22630195      22557074      -0.32%
BenchmarkEncode_24x8x24M-32               28488886      28220702      -0.94%
BenchmarkEncode_24x8x48M-32               56124735      54862495      -2.25%
BenchmarkVerify10x2x10000-32              9874          9356          -5.25%
BenchmarkVerify50x5x50000-32              175610        159735        -9.04%
BenchmarkVerify10x2x1M-32                 331276        311726        -5.90%
BenchmarkVerify5x2x1M-32                  265466        248075        -6.55%
BenchmarkVerify10x4x1M-32                 701627        606420        -13.57%
BenchmarkVerify50x20x1M-32                4338171       4245635       -2.13%
BenchmarkVerify10x4x16M-32                12312830      11932698      -3.09%
BenchmarkReconstruct10x2x10000-32         1594          1504          -5.65%
BenchmarkReconstruct50x5x50000-32         95101         79558         -16.34%
BenchmarkReconstruct10x2x1M-32            38479         37225         -3.26%
BenchmarkReconstruct5x2x1M-32             30968         30013         -3.08%
BenchmarkReconstruct10x4x1M-32            81630         75350         -7.69%
BenchmarkReconstruct50x20x1M-32           1136952       1040156       -8.51%
BenchmarkReconstruct10x4x16M-32           685408        656484        -4.22%
BenchmarkReconstructData10x2x10000-32     1609          1486          -7.64%
BenchmarkReconstructData50x5x50000-32     87090         71512         -17.89%
BenchmarkReconstructData10x2x1M-32        31497         30347         -3.65%
BenchmarkReconstructData5x2x1M-32         23379         22611         -3.28%
BenchmarkReconstructData10x4x1M-32        63853         61035         -4.41%
BenchmarkReconstructData50x20x1M-32       1048807       966201        -7.88%
BenchmarkReconstructData10x4x16M-32       866658        892252        +2.95%
BenchmarkReconstructP10x2x10000-32        544           540           -0.74%
BenchmarkReconstructP10x5x20000-32        1242          1206          -2.90%
BenchmarkSplit10x4x160M-32                2735508       2743214       +0.28%
BenchmarkSplit5x2x5M-32                   276232        288523        +4.45%
BenchmarkSplit10x2x1M-32                  44389         45517         +2.54%
BenchmarkSplit10x4x10M-32                 477282        460888        -3.43%
BenchmarkSplit50x20x50M-32                1608821       1602105       -0.42%
BenchmarkSplit17x3x272M-32                2035932       2034705       -0.06%
BenchmarkParallel_8x8x05M-32              346733        351837        +1.47%
BenchmarkParallel_20x10x05M-32            577127        586232        +1.58%
BenchmarkParallel_8x8x1M-32               722453        729294        +0.95%
BenchmarkParallel_8x8x8M-32               5717650       5817130       +1.74%
BenchmarkParallel_8x8x32M-32              22914260      24132696      +5.32%
BenchmarkStreamEncode10x2x10000-32        6703131       7141021       +6.53%
BenchmarkStreamEncode100x20x10000-32      38175873      39767386      +4.17%
BenchmarkStreamEncode17x3x1M-32           8920549       9218973       +3.35%
BenchmarkStreamEncode10x4x16M-32          21841702      21784898      -0.26%
BenchmarkStreamEncode5x2x1M-32            4088001       3247404       -20.56%
BenchmarkStreamEncode10x2x1M-32           5860652       5932381       +1.22%
BenchmarkStreamEncode10x4x1M-32           7555172       7589960       +0.46%
BenchmarkStreamEncode50x20x1M-32          30006814      30250054      +0.81%
BenchmarkStreamEncode17x3x16M-32          32757489      32818254      +0.19%
BenchmarkStreamVerify10x2x10000-32        6714996       6831093       +1.73%
BenchmarkStreamVerify50x5x50000-32        18525904      18761767      +1.27%
BenchmarkStreamVerify10x2x1M-32           5232278       5444148       +4.05%
BenchmarkStreamVerify5x2x1M-32            3673843       3755283       +2.22%
BenchmarkStreamVerify10x4x1M-32           7184419       7185293       +0.01%
BenchmarkStreamVerify50x20x1M-32          28441187      28574766      +0.47%
BenchmarkStreamVerify10x4x16M-32          8538440       8668614       +1.52%

benchmark                                 old MB/s      new MB/s      speedup
BenchmarkGalois128K-32                    51374.59      57976.36      1.13x
BenchmarkGalois1M-32                      46620.03      49679.10      1.07x
BenchmarkGaloisXor128K-32                 44106.22      46671.56      1.06x
BenchmarkGaloisXor1M-32                   41641.82      43779.89      1.05x
BenchmarkEncode10x2x10000-32              19682.61      21176.81      1.08x
BenchmarkEncode100x20x10000-32            2605.52       2884.71       1.11x
BenchmarkEncode17x3x1M-32                 67316.54      67729.50      1.01x
BenchmarkEncode10x4x16M-32                20121.74      20027.93      1.00x
BenchmarkEncode5x2x1M-32                  67984.17      71236.47      1.05x
BenchmarkEncode10x2x1M-32                 96710.29      102377.00     1.06x
BenchmarkEncode10x4x1M-32                 53931.74      56894.82      1.05x
BenchmarkEncode50x20x1M-32                13546.44      13989.82      1.03x
BenchmarkEncode17x3x16M-32                26862.29      26900.64      1.00x
BenchmarkEncode_8x4x8M-32                 20794.42      20780.27      1.00x
BenchmarkEncode_12x4x12M-32               22069.16      22051.88      1.00x
BenchmarkEncode_16x4x16M-32               24067.44      24295.58      1.01x
BenchmarkEncode_16x4x32M-32               24461.59      24597.04      1.01x
BenchmarkEncode_16x4x64M-32               24876.09      24987.40      1.00x
BenchmarkEncode_8x5x8M-32                 17400.71      17750.24      1.02x
BenchmarkEncode_8x6x8M-32                 15527.19      15124.46      0.97x
BenchmarkEncode_8x7x8M-32                 13391.15      13531.04      1.01x
BenchmarkEncode_8x9x8M-32                 10748.26      11003.58      1.02x
BenchmarkEncode_8x10x8M-32                9979.82       10080.80      1.01x
BenchmarkEncode_8x11x8M-32                9310.73       9142.48       0.98x
BenchmarkEncode_8x8x05M-32                23713.12      24359.50      1.03x
BenchmarkEncode_8x8x1M-32                 27084.87      27800.50      1.03x
BenchmarkEncode_8x8x8M-32                 12203.94      12225.89      1.00x
BenchmarkEncode_8x8x32M-32                11861.83      11900.28      1.00x
BenchmarkEncode_24x8x24M-32               21200.54      21402.01      1.01x
BenchmarkEncode_24x8x48M-32               21522.77      22017.95      1.02x
BenchmarkVerify10x2x10000-32              10127.24      10688.01      1.06x
BenchmarkVerify50x5x50000-32              28472.25      31301.75      1.10x
BenchmarkVerify10x2x1M-32                 31652.63      33637.74      1.06x
BenchmarkVerify5x2x1M-32                  19749.74      21134.27      1.07x
BenchmarkVerify10x4x1M-32                 14944.92      17291.25      1.16x
BenchmarkVerify50x20x1M-32                12085.46      12348.87      1.02x
BenchmarkVerify10x4x16M-32                13625.80      14059.87      1.03x
BenchmarkReconstruct10x2x10000-32         62723.68      66470.81      1.06x
BenchmarkReconstruct50x5x50000-32         52575.87      62847.32      1.20x
BenchmarkReconstruct10x2x1M-32            272507.04     281685.84     1.03x
BenchmarkReconstruct5x2x1M-32             169299.03     174685.39     1.03x
BenchmarkReconstruct10x4x1M-32            128455.17     139161.42     1.08x
BenchmarkReconstruct50x20x1M-32           46113.48      50404.73      1.09x
BenchmarkReconstruct10x4x16M-32           244777.11     255561.72     1.04x
BenchmarkReconstructData10x2x10000-32     62160.46      67305.98      1.08x
BenchmarkReconstructData50x5x50000-32     57411.81      69917.97      1.22x
BenchmarkReconstructData10x2x1M-32        332909.82     345526.29     1.04x
BenchmarkReconstructData5x2x1M-32         224254.60     231868.74     1.03x
BenchmarkReconstructData10x4x1M-32        164216.61     171799.68     1.05x
BenchmarkReconstructData50x20x1M-32       49988.98      54262.82      1.09x
BenchmarkReconstructData10x4x16M-32       193585.15     188032.29     0.97x
BenchmarkReconstructP10x2x10000-32        183806.57     185284.57     1.01x
BenchmarkReconstructP10x5x20000-32        160985.46     165852.51     1.03x
BenchmarkParallel_8x8x05M-32              12096.63      11921.17      0.99x
BenchmarkParallel_20x10x05M-32            18168.91      17886.72      0.98x
BenchmarkParallel_8x8x1M-32               11611.28      11502.36      0.99x
BenchmarkParallel_8x8x8M-32               11737.14      11536.42      0.98x
BenchmarkParallel_8x8x32M-32              11714.78      11123.31      0.95x
BenchmarkStreamEncode10x2x10000-32        14.92         14.00         0.94x
BenchmarkStreamEncode100x20x10000-32      26.19         25.15         0.96x
BenchmarkStreamEncode17x3x1M-32           1998.28       1933.60       0.97x
BenchmarkStreamEncode10x4x16M-32          7681.28       7701.31       1.00x
BenchmarkStreamEncode5x2x1M-32            1282.50       1614.48       1.26x
BenchmarkStreamEncode10x2x1M-32           1789.18       1767.55       0.99x
BenchmarkStreamEncode10x4x1M-32           1387.89       1381.53       1.00x
BenchmarkStreamEncode50x20x1M-32          1747.23       1733.18       0.99x
BenchmarkStreamEncode17x3x16M-32          8706.79       8690.67       1.00x
BenchmarkStreamVerify10x2x10000-32        14.89         14.64         0.98x
BenchmarkStreamVerify50x5x50000-32        269.89        266.50        0.99x
BenchmarkStreamVerify10x2x1M-32           2004.05       1926.06       0.96x
BenchmarkStreamVerify5x2x1M-32            1427.08       1396.13       0.98x
BenchmarkStreamVerify10x4x1M-32           1459.51       1459.34       1.00x
BenchmarkStreamVerify50x20x1M-32          1843.41       1834.79       1.00x
BenchmarkStreamVerify10x4x16M-32          19649.04      19353.98      0.98x
```
2020-05-05 16:36:01 +02:00
Klaus Post 454fd91890
Maintenance updates. (#86)
* Add gcc go build tags.
* Update Travis.
* Fix typo
2018-11-12 13:25:55 +01:00
Klaus Post f5e73dcfe2 Split blocks into size divisible by 16
Older systems (typically without AVX2) are more sensitive to misaligned load+stores.

Add parameter to automatically set the number of goroutines.

name                  old time/op    new time/op    delta
Encode10x2x10000-8      18.4µs ± 1%    16.1µs ± 1%  -12.43%    (p=0.000 n=9+9)
Encode100x20x10000-8     692µs ± 1%     608µs ± 1%  -12.10%  (p=0.000 n=10+10)
Encode17x3x1M-8         1.78ms ± 5%    1.49ms ± 1%  -16.63%  (p=0.000 n=10+10)
Encode10x4x16M-8        21.5ms ± 5%    19.6ms ± 4%   -8.74%   (p=0.000 n=10+9)
Encode5x2x1M-8           343µs ± 2%     267µs ± 2%  -22.22%   (p=0.000 n=9+10)
Encode10x2x1M-8          858µs ± 5%     701µs ± 5%  -18.34%  (p=0.000 n=10+10)
Encode10x4x1M-8         1.34ms ± 1%    1.16ms ± 1%  -13.19%    (p=0.000 n=9+9)
Encode50x20x1M-8        30.3ms ± 4%    25.0ms ± 2%  -17.51%   (p=0.000 n=10+8)
Encode17x3x16M-8        26.9ms ± 1%    24.5ms ± 4%   -9.13%   (p=0.000 n=8+10)

name                  old speed      new speed      delta
Encode10x2x10000-8    5.45GB/s ± 1%  6.22GB/s ± 1%  +14.20%    (p=0.000 n=9+9)
Encode100x20x10000-8  1.44GB/s ± 1%  1.64GB/s ± 1%  +13.77%  (p=0.000 n=10+10)
Encode17x3x1M-8       10.0GB/s ± 5%  12.0GB/s ± 1%  +19.88%  (p=0.000 n=10+10)
Encode10x4x16M-8      7.81GB/s ± 5%  8.56GB/s ± 5%   +9.58%   (p=0.000 n=10+9)
Encode5x2x1M-8        15.3GB/s ± 2%  19.6GB/s ± 2%  +28.57%   (p=0.000 n=9+10)
Encode10x2x1M-8       12.2GB/s ± 5%  15.0GB/s ± 5%  +22.45%  (p=0.000 n=10+10)
Encode10x4x1M-8       7.84GB/s ± 1%  9.03GB/s ± 1%  +15.19%    (p=0.000 n=9+9)
Encode50x20x1M-8      1.73GB/s ± 4%  2.09GB/s ± 4%  +20.59%   (p=0.000 n=10+9)
Encode17x3x16M-8      10.6GB/s ± 1%  11.7GB/s ± 4%  +10.12%   (p=0.000 n=8+10)
2017-11-18 22:00:55 +01:00
Frank Wessels 3610933d2f Use AVX2 SIMD assembly instructions in favor of BYTE sequences. (#73)
* Use AVX2 SIMD assembly instructions in favor of BYTE sequences.
2017-11-18 16:17:10 +01:00
Klaus Post 985e396eec Asmfmt. 2017-08-26 11:51:49 +02:00
chenzhongtao d78bf472d8 add Update parity function (#60)
Add Update parity function
2017-08-20 11:42:39 +02:00
Frank 467733eb9c Add generated byte assembler using asm2plan9s
Add recompilable assembler using asm2plan9s
2016-07-06 21:06:00 +02:00
frankw d4000061f2 Removed unnecessary JMP instruction 2016-07-06 09:39:02 +02:00
klauspost efb98c83c7 Update asmfmt. 2016-01-11 14:44:44 +01:00
klauspost a3ee8967cb asmfmt assembler. 2015-12-14 14:57:49 +01:00
klauspost 627f48f59e Add AVX2 assembler functions.
Benchmarks on a VM (therefore a bit more noisy)

benchmark                             old ns/op     new ns/op     delta
BenchmarkEncode10x2x10000-8           58372         47421         -18.76%
BenchmarkEncode100x20x10000-8         2635444       1550511       -41.17%
BenchmarkEncode17x3x1M-8              3885495       2231034       -42.58%
BenchmarkEncode10x4x16M-8             24180221      21467661      -11.22%
BenchmarkEncode5x2x1M-8               2395287       2261452       -5.59%
BenchmarkEncode10x2x1M-8              2571278       2566560       -0.18%
BenchmarkEncode10x4x1M-8              3396774       3431916       +1.03%
BenchmarkEncode50x20x1M-8             27004601      20325731      -24.73%
BenchmarkEncode17x3x16M-8             29671393      23668596      -20.23%
BenchmarkVerify10x2x10000-8           109730        101519        -7.48%
BenchmarkVerify50x5x50000-8           3904166       3101568       -20.56%
BenchmarkVerify10x2x1M-8              4398490       4721719       +7.35%
BenchmarkVerify5x2x1M-8               3174574       3296440       +3.84%
BenchmarkVerify10x4x1M-8              5247394       5346667       +1.89%
BenchmarkVerify50x20x1M-8             35742777      26154681      -26.83%
BenchmarkVerify10x4x16M-8             52873512      54931253      +3.89%

benchmark                             old MB/s     new MB/s     speedup
BenchmarkEncode10x2x10000-8           1713.14      2108.73      1.23x
BenchmarkEncode100x20x10000-8         379.44       644.95       1.70x
BenchmarkEncode17x3x1M-8              4587.78      7989.92      1.74x
BenchmarkEncode10x4x16M-8             6938.40      7815.11      1.13x
BenchmarkEncode5x2x1M-8               2188.83      2318.37      1.06x
BenchmarkEncode10x2x1M-8              4078.03      4085.53      1.00x
BenchmarkEncode10x4x1M-8              3086.98      3055.37      0.99x
BenchmarkEncode50x20x1M-8             1941.48      2579.43      1.33x
BenchmarkEncode17x3x16M-8             9612.38      12050.26     1.25x
BenchmarkVerify10x2x10000-8           911.32       985.03       1.08x
BenchmarkVerify50x5x50000-8           1280.68      1612.09      1.26x
BenchmarkVerify10x2x1M-8              2383.94      2220.75      0.93x
BenchmarkVerify5x2x1M-8               1651.52      1590.47      0.96x
BenchmarkVerify10x4x1M-8              1998.28      1961.18      0.98x
BenchmarkVerify50x20x1M-8             1466.84      2004.57      1.37x
BenchmarkVerify10x4x16M-8             3173.09      3054.22      0.96x
2015-12-14 14:12:09 +01:00
klauspost dc9cd67c8c PSHUFB is S(upplemental)-SSE3, not plain SSE3. 2015-06-24 16:57:38 +02:00
Klaus Post f1c2cf4160 Don't use assembler on app engine. 2015-06-21 22:54:13 +02:00
Klaus Post 1388bd44c4 Remove comma. Apparently that is a problem on Go tip. 2015-06-21 21:27:32 +02:00
Klaus Post 5aa37c3492 Add AMD64 SSE3 Galois multiplication. Approximately 5-10x faster.
BenchmarkEncode10x2x10000         333.31       5827.17      17.48x
BenchmarkEncode10x2x10000-2       431.20       2802.53      6.50x
BenchmarkEncode10x2x10000-4       553.98       2432.95      4.39x
BenchmarkEncode10x2x10000-8       585.79       3469.61      5.92x
BenchmarkEncode100x20x10000       32.59        583.40       17.90x
BenchmarkEncode100x20x10000-2     59.52        726.70       12.21x
BenchmarkEncode100x20x10000-4     108.04       1363.25      12.62x
BenchmarkEncode100x20x10000-8     113.76       1274.62      11.20x
BenchmarkEncode17x3x1M            215.28       3141.85      14.59x
BenchmarkEncode17x3x1M-2          398.76       3650.12      9.15x
BenchmarkEncode17x3x1M-4          655.32       6071.11      9.26x
BenchmarkEncode17x3x1M-8          832.16       6616.47      7.95x
BenchmarkEncode10x4x16M           154.48       1357.30      8.79x
BenchmarkEncode10x4x16M-2         295.62       2377.92      8.04x
BenchmarkEncode10x4x16M-4         529.89       3519.49      6.64x
BenchmarkEncode10x4x16M-8         632.11       4521.90      7.15x
BenchmarkEncode5x2x1M             327.87       4879.09      14.88x
BenchmarkEncode5x2x1M-2           576.11       2599.20      4.51x
BenchmarkEncode5x2x1M-4           1043.65      3559.12      3.41x
BenchmarkEncode5x2x1M-8           1227.77      4255.34      3.47x
BenchmarkEncode10x2x1M            321.24       4574.68      14.24x
BenchmarkEncode10x2x1M-2          587.73       3100.28      5.28x
BenchmarkEncode10x2x1M-4          1101.96      4770.32      4.33x
BenchmarkEncode10x2x1M-8          1217.08      5812.17      4.78x
BenchmarkEncode10x4x1M            155.34       2037.27      13.11x
BenchmarkEncode10x4x1M-2          298.38       2470.97      8.28x
BenchmarkEncode10x4x1M-4          548.67       3603.15      6.57x
BenchmarkEncode10x4x1M-8          625.23       4827.42      7.72x
BenchmarkEncode50x20x1M           31.37        347.65       11.08x
BenchmarkEncode50x20x1M-2         59.81        713.28       11.93x
BenchmarkEncode50x20x1M-4         105.34       1175.47      11.16x
BenchmarkEncode50x20x1M-8         123.84       1491.91      12.05x
BenchmarkEncode17x3x16M           209.55       1861.59      8.88x
BenchmarkEncode17x3x16M-2         394.19       3331.73      8.45x
BenchmarkEncode17x3x16M-4         643.30       4942.74      7.68x
BenchmarkEncode17x3x16M-8         839.64       6213.43      7.40x
2015-06-21 21:23:22 +02:00