225 Commits (c81ca04b165501b59798c998c8c1d271ada86df8)

Author SHA1 Message Date
Michael Cook c81ca04b16
Sanity check error on SwapRows (#156) 2020-12-17 09:38:25 +01:00
Klaus Post 60143f4a15
Update README.md 2020-12-09 22:56:37 +01:00
Klaus Post 519603f6e1
Update packages (#154)
* Update packages

Update cpuid and clean up generated.
2020-12-09 22:56:01 +01:00
Klaus Post 653e76aa26
Faster AVX2 encoding (#153)
* Remove 50% of bounds checks when copying.
* Use RIP only addressing, free one register.

```
benchmark                                 old MB/s      new MB/s      speedup
BenchmarkGalois128K-32                    57663.49      58005.87      1.01x
BenchmarkGalois1M-32                      49479.31      49848.29      1.01x
BenchmarkGaloisXor128K-32                 46310.69      46501.88      1.00x
BenchmarkGaloisXor1M-32                   43804.86      43984.39      1.00x
BenchmarkEncode10x2x10000-32              25926.93      27457.75      1.06x
BenchmarkEncode100x20x10000-32            2635.82       2818.95       1.07x
BenchmarkEncode17x3x1M-32                 63215.11      61576.76      0.97x
BenchmarkEncode10x4x16M-32                19551.54      19505.07      1.00x
BenchmarkEncode5x2x1M-32                  79612.06      81985.14      1.03x
BenchmarkEncode10x2x1M-32                 121478.29     127739.41     1.05x
BenchmarkEncode10x4x1M-32                 70757.61      74423.67      1.05x
BenchmarkEncode50x20x1M-32                19811.96      20103.32      1.01x
BenchmarkEncode17x3x16M-32                27202.10      27825.34      1.02x
BenchmarkEncode_8x4x8M-32                 19029.04      19701.31      1.04x
BenchmarkEncode_12x4x12M-32               22449.87      22480.51      1.00x
BenchmarkEncode_16x4x16M-32               24536.74      24672.24      1.01x
BenchmarkEncode_16x4x32M-32               24381.34      24981.99      1.02x
BenchmarkEncode_16x4x64M-32               24717.69      25086.94      1.01x
BenchmarkEncode_8x5x8M-32                 16763.51      17154.04      1.02x
BenchmarkEncode_8x6x8M-32                 15067.22      15205.87      1.01x
BenchmarkEncode_8x7x8M-32                 13156.38      13589.40      1.03x
BenchmarkEncode_8x9x8M-32                 11363.74      11523.70      1.01x
BenchmarkEncode_8x10x8M-32                10359.37      10474.91      1.01x
BenchmarkEncode_8x11x8M-32                9627.07       9463.24       0.98x
BenchmarkEncode_8x8x05M-32                30104.80      32634.89      1.08x
BenchmarkEncode_8x8x1M-32                 36497.28      36425.88      1.00x
BenchmarkEncode_8x8x8M-32                 12186.19      11602.41      0.95x
BenchmarkEncode_8x8x32M-32                11670.72      11413.71      0.98x
BenchmarkEncode_24x8x24M-32               21709.83      21652.50      1.00x
BenchmarkEncode_24x8x48M-32               22494.40      22280.59      0.99x
BenchmarkVerify10x2x10000-32              10567.56      10483.91      0.99x
BenchmarkVerify50x5x50000-32              28102.84      27923.63      0.99x
BenchmarkVerify10x2x1M-32                 30298.33      30106.18      0.99x
BenchmarkVerify5x2x1M-32                  16115.91      15847.03      0.98x
BenchmarkVerify10x4x1M-32                 15382.13      14852.68      0.97x
BenchmarkVerify50x20x1M-32                8476.02       8466.24       1.00x
BenchmarkVerify10x4x16M-32                15101.03      15434.71      1.02x
BenchmarkReconstruct10x2x10000-32         26228.18      26960.19      1.03x
BenchmarkReconstruct50x5x50000-32         31091.42      30975.82      1.00x
BenchmarkReconstruct10x2x1M-32            58548.87      60281.92      1.03x
BenchmarkReconstruct5x2x1M-32             39499.23      41791.80      1.06x
BenchmarkReconstruct10x4x1M-32            41448.60      43053.15      1.04x
BenchmarkReconstruct50x20x1M-32           17185.99      17354.67      1.01x
BenchmarkReconstruct10x4x16M-32           18798.60      18847.43      1.00x
BenchmarkReconstructData10x2x10000-32     27208.48      27538.38      1.01x
BenchmarkReconstructData50x5x50000-32     32135.65      32078.91      1.00x
BenchmarkReconstructData10x2x1M-32        63180.19      67332.17      1.07x
BenchmarkReconstructData5x2x1M-32         47532.85      49932.17      1.05x
BenchmarkReconstructData10x4x1M-32        50059.14      52323.15      1.05x
BenchmarkReconstructData50x20x1M-32       26679.75      26714.11      1.00x
BenchmarkReconstructData10x4x16M-32       24854.99      24527.23      0.99x
BenchmarkReconstructP10x2x10000-32        115089.87     113229.75     0.98x
BenchmarkReconstructP10x5x20000-32        129838.75     132871.10     1.02x
BenchmarkParallel_8x8x64K-32              69951.43      69980.44      1.00x
BenchmarkParallel_8x8x05M-32              11752.94      11724.35      1.00x
BenchmarkParallel_20x10x05M-32            18553.93      18613.33      1.00x
BenchmarkParallel_8x8x1M-32               11639.19      11746.86      1.01x
BenchmarkParallel_8x8x8M-32               11799.36      11685.63      0.99x
BenchmarkParallel_8x8x32M-32              11510.94      11791.72      1.02x
BenchmarkParallel_8x3x1M-32               20268.92      20678.21      1.02x
BenchmarkParallel_8x4x1M-32               17616.05      17856.17      1.01x
BenchmarkParallel_8x5x1M-32               15590.87      15872.42      1.02x
BenchmarkStreamEncode10x2x10000-32        14917.08      15408.39      1.03x
BenchmarkStreamEncode100x20x10000-32      2014.81       2077.31       1.03x
BenchmarkStreamEncode17x3x1M-32           11839.37      12434.80      1.05x
BenchmarkStreamEncode10x4x16M-32          9151.14       9206.98       1.01x
BenchmarkStreamEncode5x2x1M-32            13598.55      13663.56      1.00x
BenchmarkStreamEncode10x2x1M-32           13192.91      13453.41      1.02x
BenchmarkStreamEncode10x4x1M-32           12109.90      12050.68      1.00x
BenchmarkStreamEncode50x20x1M-32          8640.73       8370.10       0.97x
BenchmarkStreamEncode17x3x16M-32          10473.17      10527.04      1.01x
BenchmarkStreamVerify10x2x10000-32        7032.23       7128.82       1.01x
BenchmarkStreamVerify50x5x50000-32        13023.46      13109.31      1.01x
BenchmarkStreamVerify10x2x1M-32           11941.63      11949.91      1.00x
BenchmarkStreamVerify5x2x1M-32            8029.93       8263.39       1.03x
BenchmarkStreamVerify10x4x1M-32           8137.82       8271.11       1.02x
BenchmarkStreamVerify50x20x1M-32          7378.87       7708.81       1.04x
BenchmarkStreamVerify10x4x16M-32          8973.18       8955.29       1.00x
```
2020-11-10 14:39:23 +01:00
Klaus Post 04d4482b55
Test gzip on s390x (#150)
* Upgrade CI to Go 1.15
* Test unzip
2020-09-01 13:14:21 +02:00
Klaus Post 11742a626c
Upgrade CI to Go 1.15 (#151)
Removes Go 1.12
2020-09-01 13:13:31 +02:00
Klaus Post 7daa20bf74
Generate AVX2 code (#141)
Replaces AVX2 up to 10x8 configurations with specific generated functions.

If code size is a concern `-tags=nogen` can be used.

Biggest speedup when not memory constrained.
```
benchmark                                old MB/s      new MB/s      speedup
BenchmarkEncode_8x5x8M                   5895.75       9648.18       1.64x
BenchmarkEncode_8x5x8M-4                 16773.41      17220.67      1.03x
BenchmarkEncode_8x5x8M-16                18263.12      17176.28      0.94x
BenchmarkEncode_8x6x8M                   5075.89       8548.39       1.68x
BenchmarkEncode_8x6x8M-4                 14559.83      15370.95      1.06x
BenchmarkEncode_8x6x8M-16                16183.37      15291.98      0.94x
BenchmarkEncode_8x7x8M                   4481.18       7015.60       1.57x
BenchmarkEncode_8x7x8M-4                 12835.35      13695.90      1.07x
BenchmarkEncode_8x7x8M-16                14246.94      13737.36      0.96x
BenchmarkEncode_8x8x05M                  5569.95       7947.70       1.43x
BenchmarkEncode_8x8x05M-4                17334.91      25271.37      1.46x
BenchmarkEncode_8x8x05M-16               29349.42      35043.36      1.19x
BenchmarkEncode_8x8x1M                   4830.58       7891.32       1.63x
BenchmarkEncode_8x8x1M-4                 17531.36      27371.42      1.56x
BenchmarkEncode_8x8x1M-16                29593.98      39241.09      1.33x
BenchmarkEncode_8x8x8M                   3953.66       6584.26       1.67x
BenchmarkEncode_8x8x8M-4                 11527.34      12331.23      1.07x
BenchmarkEncode_8x8x8M-16                12718.89      12173.08      0.96x
BenchmarkEncode_8x8x32M                  3927.51       6195.91       1.58x
BenchmarkEncode_8x8x32M-4                11490.85      11424.39      0.99x
BenchmarkEncode_8x8x32M-16               12506.09      11888.55      0.95x

benchmark                          old MB/s     new MB/s     speedup
BenchmarkParallel_8x8x64K          5490.24      6959.57      1.27x
BenchmarkParallel_8x8x64K-4        21078.94     29557.51     1.40x
BenchmarkParallel_8x8x64K-16       57508.45     73672.54     1.28x
BenchmarkParallel_8x8x1M           4755.49      7667.84      1.61x
BenchmarkParallel_8x8x1M-4         11818.66     12013.49     1.02x
BenchmarkParallel_8x8x1M-16        12923.12     12109.42     0.94x
BenchmarkParallel_8x8x8M           3973.94      6525.85      1.64x
BenchmarkParallel_8x8x8M-4         11725.68     11312.46     0.96x
BenchmarkParallel_8x8x8M-16        12608.20     11484.98     0.91x
BenchmarkParallel_8x3x1M           14139.71     17993.04     1.27x
BenchmarkParallel_8x3x1M-4         21805.97     23053.92     1.06x
BenchmarkParallel_8x3x1M-16        24673.05     23596.71     0.96x
BenchmarkParallel_8x4x1M           10617.88     14474.54     1.36x
BenchmarkParallel_8x4x1M-4         18635.82     18965.65     1.02x
BenchmarkParallel_8x4x1M-16        21518.12     20171.47     0.94x
BenchmarkParallel_8x5x1M           8669.88      11833.96     1.36x
BenchmarkParallel_8x5x1M-4         16321.00     17500.30     1.07x
BenchmarkParallel_8x5x1M-16        17267.16     17191.04     1.00x
```
2020-05-20 12:48:34 +02:00
Frank Wessels 01b307ec91
Minor refactor for arm NEON version using macros (#147)
* Minor refactor for arm NEON version using macros
2020-05-19 15:03:47 +02:00
Frank Wessels 6fbce20c81
Updated performance numbers for Graviton2 on ARM (#146) 2020-05-15 11:09:11 +02:00
Klaus Post e8fdfd6630
Update readme and re-allow s390x failure. 2020-05-14 14:29:53 +02:00
Klaus Post f338110979
Make sure assembler is formatted (#145)
* Make sure assembler is formatted
2020-05-14 12:04:55 +02:00
Frank Wessels 27f8a7b6bf
Small optimization to parallel82 for AVX512 by reducing the number of VSHUFI64X2 instructions in the core loop (#143) 2020-05-14 10:19:23 +02:00
Frank Wessels 2475ea7519
Use proper NEON assembly instructions for ARM (#144)
* Use proper NEON assembly instructions for ARM

* Updated performance numbers for ARM
2020-05-14 10:18:32 +02:00
Klaus Post c83b7b4a38
Allow s390x failures (#142)
The s390x seems quite unstable, so we allow failures on it.
2020-05-13 17:04:16 +02:00
Klaus Post cf8495259a
Add pure XOR for 1 parity (#138)
WithFastOneParityMatrix will switch the matrix to a simple xor if there is only one parity shard.
The PAR1 matrix already has this property so it has little effect there.
2020-05-13 11:10:58 +02:00
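The "pure XOR for 1 parity" idea in the commit above can be illustrated with a minimal sketch. This is not the library's code: `xorParity` and `recoverShard` are hypothetical names, and the sketch assumes all shards have equal length. With a single parity shard computed as the XOR of all data shards, any one lost data shard can be rebuilt by XOR-ing the parity with the surviving shards.

```go
package main

import "fmt"

// xorParity computes a single parity shard as the byte-wise XOR of all
// data shards. All shards are assumed to have the same length.
func xorParity(data [][]byte) []byte {
	if len(data) == 0 {
		return nil
	}
	parity := make([]byte, len(data[0]))
	for _, shard := range data {
		for i, b := range shard {
			parity[i] ^= b
		}
	}
	return parity
}

// recoverShard rebuilds the data shard at index missing by XOR-ing the
// parity shard with every surviving data shard.
func recoverShard(data [][]byte, parity []byte, missing int) []byte {
	out := make([]byte, len(parity))
	copy(out, parity)
	for idx, shard := range data {
		if idx == missing {
			continue // pretend this shard was lost
		}
		for i, b := range shard {
			out[i] ^= b
		}
	}
	return out
}

func main() {
	data := [][]byte{{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}
	parity := xorParity(data)
	fmt.Println("parity:   ", parity)
	fmt.Println("recovered:", recoverShard(data, parity, 1))
}
```

Because XOR needs no Galois field table lookups, this path is cheaper than a general matrix multiply, which is presumably why the option exists for the one-parity case.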
Frank Wessels d6d9fba4f9
Take vshufi64x2 out of main loop and initialize upfront (for parallel 81 only) (#139) 2020-05-13 10:59:26 +02:00
Frank Wessels d5afb5f48e
Faster arm64 implementation that does not use PMULL instruction (#140)
* Faster arm64 implementation that does not use PMULL instruction
* Add NEON version for sliceXor
2020-05-13 10:24:22 +02:00
Klaus Post 2df03bd4d1
Ci test more archs (#135)
* ci: test more architectures
2020-05-09 10:35:17 +02:00
Frank Wessels 2f8e50e65c
Better test coverage for AVX512 (parallel version) (#134) 2020-05-07 09:28:23 +02:00
Klaus Post 696c4018f8
bench: Fix reconstruct benchmarks (#133)
Always corrupt at least one shard and don't shuffle shards.
2020-05-06 15:42:49 +02:00
Klaus Post 151d8c7a05
Tweak concurrency (#132) 2020-05-06 15:42:30 +02:00
Klaus Post 96dc2a5aa4
Update README 2020-05-06 13:47:25 +02:00
Klaus Post 3067f8aed5
asmfmt 2020-05-06 12:36:43 +02:00
Frank Wessels 1b9e129671
Avx512 parallel81 (#131)
* AVX512 routine for 8x1 parallel processing (WIP)

* Testing and integration of Parallel81 assembly routine
2020-05-06 12:32:31 +02:00
Klaus Post cb7a0b5aef
Do fast by one multiplication (#130)
When multiplying by one we can use faster math.
2020-05-06 11:14:25 +02:00
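A rough sketch of the multiply-by-one shortcut described in the commit above, under stated assumptions: `gfMul`, `mulSlice` and `mulSliceXor` are hypothetical names, and the reduction polynomial 0x11d is an assumption made for this example. In GF(2^8), multiplying by 1 is the identity, so the plain variant can be a copy and the accumulating variant a plain XOR.

```go
package main

import "fmt"

// gfMul multiplies two GF(2^8) elements bit by bit, reducing by the
// polynomial 0x11d (an assumption for this sketch).
func gfMul(a, b byte) byte {
	var p byte
	for b > 0 {
		if b&1 != 0 {
			p ^= a
		}
		carry := a & 0x80
		a <<= 1
		if carry != 0 {
			a ^= 0x1d // low byte of 0x11d
		}
		b >>= 1
	}
	return p
}

// mulSlice writes c*in[i] to out[i]. out must be at least as long as in.
func mulSlice(c byte, in, out []byte) {
	if c == 1 {
		copy(out, in) // 1*x == x, so no field arithmetic is needed
		return
	}
	for i, v := range in {
		out[i] = gfMul(c, v)
	}
}

// mulSliceXor XORs c*in[i] into out[i]. out must be at least as long as in.
func mulSliceXor(c byte, in, out []byte) {
	if c == 1 {
		for i, v := range in {
			out[i] ^= v // 1*x == x, so just accumulate the input
		}
		return
	}
	for i, v := range in {
		out[i] ^= gfMul(c, v)
	}
}

func main() {
	in := []byte{0, 1, 2, 0x80, 0xff}
	out := make([]byte, len(in))
	mulSlice(1, in, out)
	fmt.Println(out)
	mulSlice(2, in, out)
	fmt.Println(out)
}
```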
Klaus Post 0e9e10435f
avx2: Add 64 bytes per loop processing (#128)
* avx2: Add 64 bytes per loop processing

Not super clean benchmark run, but `BenchmarkGalois` is consistently faster.

```
benchmark                                 old ns/op     new ns/op     delta
BenchmarkGalois128K-32                    2551          2261          -11.37%
BenchmarkGalois1M-32                      22492         21107         -6.16%
BenchmarkGaloisXor128K-32                 2972          2808          -5.52%
BenchmarkGaloisXor1M-32                   25181         23951         -4.88%
BenchmarkEncode10x2x10000-32              5081          4722          -7.07%
BenchmarkEncode100x20x10000-32            383800        346655        -9.68%
BenchmarkEncode17x3x1M-32                 264806        263191        -0.61%
BenchmarkEncode10x4x16M-32                8337857       8376910       +0.47%
BenchmarkEncode5x2x1M-32                  77119         73598         -4.57%
BenchmarkEncode10x2x1M-32                 108424        102423        -5.53%
BenchmarkEncode10x4x1M-32                 194427        184301        -5.21%
BenchmarkEncode50x20x1M-32                3870301       3747639       -3.17%
BenchmarkEncode17x3x16M-32                10617586      10602449      -0.14%
BenchmarkEncode_8x4x8M-32                 3227254       3229451       +0.07%
BenchmarkEncode_12x4x12M-32               6841898       6847261       +0.08%
BenchmarkEncode_16x4x16M-32               11153469      11048738      -0.94%
BenchmarkEncode_16x4x32M-32               21947506      21826647      -0.55%
BenchmarkEncode_16x4x64M-32               43163608      42971338      -0.45%
BenchmarkEncode_8x5x8M-32                 3856675       3780730       -1.97%
BenchmarkEncode_8x6x8M-32                 4322023       4437109       +2.66%
BenchmarkEncode_8x7x8M-32                 5011434       4959623       -1.03%
BenchmarkEncode_8x9x8M-32                 6243694       6098824       -2.32%
BenchmarkEncode_8x10x8M-32                6724456       6657099       -1.00%
BenchmarkEncode_8x11x8M-32                7207693       7340332       +1.84%
BenchmarkEncode_8x8x05M-32                176877        172183        -2.65%
BenchmarkEncode_8x8x1M-32                 309716        301743        -2.57%
BenchmarkEncode_8x8x8M-32                 5498952       5489078       -0.18%
BenchmarkEncode_8x8x32M-32                22630195      22557074      -0.32%
BenchmarkEncode_24x8x24M-32               28488886      28220702      -0.94%
BenchmarkEncode_24x8x48M-32               56124735      54862495      -2.25%
BenchmarkVerify10x2x10000-32              9874          9356          -5.25%
BenchmarkVerify50x5x50000-32              175610        159735        -9.04%
BenchmarkVerify10x2x1M-32                 331276        311726        -5.90%
BenchmarkVerify5x2x1M-32                  265466        248075        -6.55%
BenchmarkVerify10x4x1M-32                 701627        606420        -13.57%
BenchmarkVerify50x20x1M-32                4338171       4245635       -2.13%
BenchmarkVerify10x4x16M-32                12312830      11932698      -3.09%
BenchmarkReconstruct10x2x10000-32         1594          1504          -5.65%
BenchmarkReconstruct50x5x50000-32         95101         79558         -16.34%
BenchmarkReconstruct10x2x1M-32            38479         37225         -3.26%
BenchmarkReconstruct5x2x1M-32             30968         30013         -3.08%
BenchmarkReconstruct10x4x1M-32            81630         75350         -7.69%
BenchmarkReconstruct50x20x1M-32           1136952       1040156       -8.51%
BenchmarkReconstruct10x4x16M-32           685408        656484        -4.22%
BenchmarkReconstructData10x2x10000-32     1609          1486          -7.64%
BenchmarkReconstructData50x5x50000-32     87090         71512         -17.89%
BenchmarkReconstructData10x2x1M-32        31497         30347         -3.65%
BenchmarkReconstructData5x2x1M-32         23379         22611         -3.28%
BenchmarkReconstructData10x4x1M-32        63853         61035         -4.41%
BenchmarkReconstructData50x20x1M-32       1048807       966201        -7.88%
BenchmarkReconstructData10x4x16M-32       866658        892252        +2.95%
BenchmarkReconstructP10x2x10000-32        544           540           -0.74%
BenchmarkReconstructP10x5x20000-32        1242          1206          -2.90%
BenchmarkSplit10x4x160M-32                2735508       2743214       +0.28%
BenchmarkSplit5x2x5M-32                   276232        288523        +4.45%
BenchmarkSplit10x2x1M-32                  44389         45517         +2.54%
BenchmarkSplit10x4x10M-32                 477282        460888        -3.43%
BenchmarkSplit50x20x50M-32                1608821       1602105       -0.42%
BenchmarkSplit17x3x272M-32                2035932       2034705       -0.06%
BenchmarkParallel_8x8x05M-32              346733        351837        +1.47%
BenchmarkParallel_20x10x05M-32            577127        586232        +1.58%
BenchmarkParallel_8x8x1M-32               722453        729294        +0.95%
BenchmarkParallel_8x8x8M-32               5717650       5817130       +1.74%
BenchmarkParallel_8x8x32M-32              22914260      24132696      +5.32%
BenchmarkStreamEncode10x2x10000-32        6703131       7141021       +6.53%
BenchmarkStreamEncode100x20x10000-32      38175873      39767386      +4.17%
BenchmarkStreamEncode17x3x1M-32           8920549       9218973       +3.35%
BenchmarkStreamEncode10x4x16M-32          21841702      21784898      -0.26%
BenchmarkStreamEncode5x2x1M-32            4088001       3247404       -20.56%
BenchmarkStreamEncode10x2x1M-32           5860652       5932381       +1.22%
BenchmarkStreamEncode10x4x1M-32           7555172       7589960       +0.46%
BenchmarkStreamEncode50x20x1M-32          30006814      30250054      +0.81%
BenchmarkStreamEncode17x3x16M-32          32757489      32818254      +0.19%
BenchmarkStreamVerify10x2x10000-32        6714996       6831093       +1.73%
BenchmarkStreamVerify50x5x50000-32        18525904      18761767      +1.27%
BenchmarkStreamVerify10x2x1M-32           5232278       5444148       +4.05%
BenchmarkStreamVerify5x2x1M-32            3673843       3755283       +2.22%
BenchmarkStreamVerify10x4x1M-32           7184419       7185293       +0.01%
BenchmarkStreamVerify50x20x1M-32          28441187      28574766      +0.47%
BenchmarkStreamVerify10x4x16M-32          8538440       8668614       +1.52%

benchmark                                 old MB/s      new MB/s      speedup
BenchmarkGalois128K-32                    51374.59      57976.36      1.13x
BenchmarkGalois1M-32                      46620.03      49679.10      1.07x
BenchmarkGaloisXor128K-32                 44106.22      46671.56      1.06x
BenchmarkGaloisXor1M-32                   41641.82      43779.89      1.05x
BenchmarkEncode10x2x10000-32              19682.61      21176.81      1.08x
BenchmarkEncode100x20x10000-32            2605.52       2884.71       1.11x
BenchmarkEncode17x3x1M-32                 67316.54      67729.50      1.01x
BenchmarkEncode10x4x16M-32                20121.74      20027.93      1.00x
BenchmarkEncode5x2x1M-32                  67984.17      71236.47      1.05x
BenchmarkEncode10x2x1M-32                 96710.29      102377.00     1.06x
BenchmarkEncode10x4x1M-32                 53931.74      56894.82      1.05x
BenchmarkEncode50x20x1M-32                13546.44      13989.82      1.03x
BenchmarkEncode17x3x16M-32                26862.29      26900.64      1.00x
BenchmarkEncode_8x4x8M-32                 20794.42      20780.27      1.00x
BenchmarkEncode_12x4x12M-32               22069.16      22051.88      1.00x
BenchmarkEncode_16x4x16M-32               24067.44      24295.58      1.01x
BenchmarkEncode_16x4x32M-32               24461.59      24597.04      1.01x
BenchmarkEncode_16x4x64M-32               24876.09      24987.40      1.00x
BenchmarkEncode_8x5x8M-32                 17400.71      17750.24      1.02x
BenchmarkEncode_8x6x8M-32                 15527.19      15124.46      0.97x
BenchmarkEncode_8x7x8M-32                 13391.15      13531.04      1.01x
BenchmarkEncode_8x9x8M-32                 10748.26      11003.58      1.02x
BenchmarkEncode_8x10x8M-32                9979.82       10080.80      1.01x
BenchmarkEncode_8x11x8M-32                9310.73       9142.48       0.98x
BenchmarkEncode_8x8x05M-32                23713.12      24359.50      1.03x
BenchmarkEncode_8x8x1M-32                 27084.87      27800.50      1.03x
BenchmarkEncode_8x8x8M-32                 12203.94      12225.89      1.00x
BenchmarkEncode_8x8x32M-32                11861.83      11900.28      1.00x
BenchmarkEncode_24x8x24M-32               21200.54      21402.01      1.01x
BenchmarkEncode_24x8x48M-32               21522.77      22017.95      1.02x
BenchmarkVerify10x2x10000-32              10127.24      10688.01      1.06x
BenchmarkVerify50x5x50000-32              28472.25      31301.75      1.10x
BenchmarkVerify10x2x1M-32                 31652.63      33637.74      1.06x
BenchmarkVerify5x2x1M-32                  19749.74      21134.27      1.07x
BenchmarkVerify10x4x1M-32                 14944.92      17291.25      1.16x
BenchmarkVerify50x20x1M-32                12085.46      12348.87      1.02x
BenchmarkVerify10x4x16M-32                13625.80      14059.87      1.03x
BenchmarkReconstruct10x2x10000-32         62723.68      66470.81      1.06x
BenchmarkReconstruct50x5x50000-32         52575.87      62847.32      1.20x
BenchmarkReconstruct10x2x1M-32            272507.04     281685.84     1.03x
BenchmarkReconstruct5x2x1M-32             169299.03     174685.39     1.03x
BenchmarkReconstruct10x4x1M-32            128455.17     139161.42     1.08x
BenchmarkReconstruct50x20x1M-32           46113.48      50404.73      1.09x
BenchmarkReconstruct10x4x16M-32           244777.11     255561.72     1.04x
BenchmarkReconstructData10x2x10000-32     62160.46      67305.98      1.08x
BenchmarkReconstructData50x5x50000-32     57411.81      69917.97      1.22x
BenchmarkReconstructData10x2x1M-32        332909.82     345526.29     1.04x
BenchmarkReconstructData5x2x1M-32         224254.60     231868.74     1.03x
BenchmarkReconstructData10x4x1M-32        164216.61     171799.68     1.05x
BenchmarkReconstructData50x20x1M-32       49988.98      54262.82      1.09x
BenchmarkReconstructData10x4x16M-32       193585.15     188032.29     0.97x
BenchmarkReconstructP10x2x10000-32        183806.57     185284.57     1.01x
BenchmarkReconstructP10x5x20000-32        160985.46     165852.51     1.03x
BenchmarkParallel_8x8x05M-32              12096.63      11921.17      0.99x
BenchmarkParallel_20x10x05M-32            18168.91      17886.72      0.98x
BenchmarkParallel_8x8x1M-32               11611.28      11502.36      0.99x
BenchmarkParallel_8x8x8M-32               11737.14      11536.42      0.98x
BenchmarkParallel_8x8x32M-32              11714.78      11123.31      0.95x
BenchmarkStreamEncode10x2x10000-32        14.92         14.00         0.94x
BenchmarkStreamEncode100x20x10000-32      26.19         25.15         0.96x
BenchmarkStreamEncode17x3x1M-32           1998.28       1933.60       0.97x
BenchmarkStreamEncode10x4x16M-32          7681.28       7701.31       1.00x
BenchmarkStreamEncode5x2x1M-32            1282.50       1614.48       1.26x
BenchmarkStreamEncode10x2x1M-32           1789.18       1767.55       0.99x
BenchmarkStreamEncode10x4x1M-32           1387.89       1381.53       1.00x
BenchmarkStreamEncode50x20x1M-32          1747.23       1733.18       0.99x
BenchmarkStreamEncode17x3x16M-32          8706.79       8690.67       1.00x
BenchmarkStreamVerify10x2x10000-32        14.89         14.64         0.98x
BenchmarkStreamVerify50x5x50000-32        269.89        266.50        0.99x
BenchmarkStreamVerify10x2x1M-32           2004.05       1926.06       0.96x
BenchmarkStreamVerify5x2x1M-32            1427.08       1396.13       0.98x
BenchmarkStreamVerify10x4x1M-32           1459.51       1459.34       1.00x
BenchmarkStreamVerify50x20x1M-32          1843.41       1834.79       1.00x
BenchmarkStreamVerify10x4x16M-32          19649.04      19353.98      0.98x
```
2020-05-05 16:36:01 +02:00
Klaus Post abb309aca7
Fix stream allocations (#129)
Numbers speak for themselves:

```
benchmark                                old ns/op     new ns/op     delta
BenchmarkStreamEncode10x2x10000-32       4792420       7937          -99.83%
BenchmarkStreamEncode100x20x10000-32     38424066      473285        -98.77%
BenchmarkStreamEncode17x3x1M-32          8195036       1482191       -81.91%
BenchmarkStreamEncode10x4x16M-32         21356715      18051773      -15.47%
BenchmarkStreamEncode5x2x1M-32           3295827       412301        -87.49%
BenchmarkStreamEncode10x2x1M-32          5249011       798828        -84.78%
BenchmarkStreamEncode10x4x1M-32          6392974       904818        -85.85%
BenchmarkStreamEncode50x20x1M-32         29083474      7199282       -75.25%
BenchmarkStreamEncode17x3x16M-32         32451850      28036421      -13.61%
BenchmarkStreamVerify10x2x10000-32       4858416       12988         -99.73%
BenchmarkStreamVerify50x5x50000-32       17047361      377003        -97.79%
BenchmarkStreamVerify10x2x1M-32          4869964       887214        -81.78%
BenchmarkStreamVerify5x2x1M-32           3282999       591669        -81.98%
BenchmarkStreamVerify10x4x1M-32          5824392       1230888       -78.87%
BenchmarkStreamVerify50x20x1M-32         27301648      6204613       -77.27%
BenchmarkStreamVerify10x4x16M-32         8508963       18845695      +121.48%

benchmark                                old MB/s     new MB/s     speedup
BenchmarkStreamEncode10x2x10000-32       20.87        12599.82     603.73x
BenchmarkStreamEncode100x20x10000-32     26.03        2112.89      81.17x
BenchmarkStreamEncode17x3x1M-32          2175.19      12026.65     5.53x
BenchmarkStreamEncode10x4x16M-32         7855.71      9293.94      1.18x
BenchmarkStreamEncode5x2x1M-32           1590.76      12716.14     7.99x
BenchmarkStreamEncode10x2x1M-32          1997.66      13126.43     6.57x
BenchmarkStreamEncode10x4x1M-32          1640.20      11588.81     7.07x
BenchmarkStreamEncode50x20x1M-32         1802.70      7282.50      4.04x
BenchmarkStreamEncode17x3x16M-32         8788.80      10172.93     1.16x
BenchmarkStreamVerify10x2x10000-32       20.58        7699.20      374.11x
BenchmarkStreamVerify50x5x50000-32       293.30       13262.49     45.22x
BenchmarkStreamVerify10x2x1M-32          2153.15      11818.75     5.49x
BenchmarkStreamVerify5x2x1M-32           1596.98      8861.17      5.55x
BenchmarkStreamVerify10x4x1M-32          1800.32      8518.86      4.73x
BenchmarkStreamVerify50x20x1M-32         1920.35      8449.97      4.40x
BenchmarkStreamVerify10x4x16M-32         19717.11     8902.41      0.45x
```
2020-05-05 16:35:35 +02:00
Klaus Post dccac354fe
Add cross compilation (#127)
* Add cross compilation

Add 386 as 32 bit test, arm64 and ppc64le since they have assembly.
2020-05-04 21:19:49 +02:00
Klaus Post f525ef0450
Clean up build tags (#126)
Move non-amd64 code to a separate file and remove references in other files.

Fixes #125
2020-05-04 20:06:47 +02:00
Klaus Post a0556fddfa
Add go.sum as well. 2020-05-04 10:19:03 +02:00
Klaus Post de70cc155f
AVX512 parallel processing (#120)
Do concurrent processing in AVX512 mode and split jobs by cache size.
2020-05-04 09:17:40 +02:00
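As an illustration of "split jobs by cache size" from the commit above, here is a minimal sketch that is not taken from the library. `forEachChunk` is a hypothetical helper, and how the chunk size would be derived from the CPU cache size is left as an assumption. Each chunk is handed to its own goroutine so that a worker's working set stays small.

```go
package main

import (
	"fmt"
	"sync"
)

// forEachChunk splits the range [0, total) into pieces of at most chunk
// bytes and runs fn on each piece concurrently. It returns once every
// piece has been processed.
func forEachChunk(total, chunk int, fn func(start, end int)) {
	if chunk <= 0 {
		chunk = total // fall back to a single job
	}
	var wg sync.WaitGroup
	for start := 0; start < total; start += chunk {
		end := start + chunk
		if end > total {
			end = total
		}
		wg.Add(1)
		go func(s, e int) {
			defer wg.Done()
			fn(s, e)
		}(start, end)
	}
	wg.Wait()
}

func main() {
	buf := make([]byte, 10)
	// The chunk size of 4 stands in for a value derived from the cache size.
	forEachChunk(len(buf), 4, func(s, e int) {
		for i := s; i < e; i++ {
			buf[i] = byte(s) // each goroutine touches only its own range
		}
	})
	fmt.Println(buf)
}
```

The goroutines write to disjoint ranges of the buffer, so no locking is needed beyond the `sync.WaitGroup`.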
Klaus Post e920b5fec3
Add direct modules support (#124)
* Add direct modules support
* Add tests with various assembly disabled.
* Add Go 1.14 - remove 1.11
2020-05-03 21:53:25 +02:00
Klaus Post d069fb1019
Remove a bounds check in pure Go (#123)
Roughly 40% less time per operation on the pure Go Galois multiply (`BenchmarkGalois128K`, `BenchmarkGalois1M`).
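One common way to remove a bounds check in a pure Go lookup loop is sketched here. It is an assumption that the commit uses this exact pattern, and `lookupSlice` is a hypothetical name. Re-slicing `out` to `len(in)` before the loop gives the compiler a chance to prove that `out[i]` is always in range, and indexing a `[256]byte` array with a `byte` never needs a check.

```go
package main

import "fmt"

// lookupSlice writes table[in[i]] to out[i] for every input byte.
func lookupSlice(table *[256]byte, in, out []byte) {
	// One check up front: this panics if cap(out) < len(in). After it,
	// len(out) == len(in), so out[i] below needs no per-element check.
	out = out[:len(in)]
	for i, v := range in {
		out[i] = table[v] // v is a byte, so the array index is always valid
	}
}

func main() {
	var table [256]byte
	for i := range table {
		table[i] = byte(255 - i) // placeholder table for the example
	}
	in := []byte{0, 1, 2, 255}
	out := make([]byte, len(in))
	lookupSlice(&table, in, out)
	fmt.Println(out)
}
```

Whether a check is actually eliminated can be inspected with `go build -gcflags=-d=ssa/check_bce/debug=1`, which reports the bounds checks that remain.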

```
benchmark                                old ns/op     new ns/op     delta
BenchmarkParallel_8x8x05M-8              2990849       2763554       -7.60%
BenchmarkParallel_8x8x1M-8               4941575       5061619       +2.43%
BenchmarkParallel_8x8x8M-8               34257722      33192541      -3.11%
BenchmarkParallel_8x8x32M-8              143157262     131654688     -8.03%
BenchmarkGalois128K-8                    64201         38374         -40.23%
BenchmarkGalois1M-8                      507053        307236        -39.41%
BenchmarkGaloisXor128K-8                 63815         63157         -1.03%
BenchmarkGaloisXor1M-8                   506369        505641        -0.14%
BenchmarkEncode10x2x10000-8              96414         92781         -3.77%
BenchmarkEncode100x20x10000-8            3188549       3238299       +1.56%
BenchmarkEncode17x3x1M-8                 3741349       3633535       -2.88%
BenchmarkEncode10x4x16M-8                41628596      40306100      -3.18%
BenchmarkEncode5x2x1M-8                  724162        699137        -3.46%
BenchmarkEncode10x2x1M-8                 1451401       1423224       -1.94%
BenchmarkEncode10x4x1M-8                 2839382       2740249       -3.49%
BenchmarkEncode50x20x1M-8                68415407      67015156      -2.05%
BenchmarkEncode17x3x16M-8                53734221      51784418      -3.63%
BenchmarkEncode_8x4x8M-8                 16826004      16013691      -4.83%
BenchmarkEncode_12x4x12M-8               37544203      36392439      -3.07%
BenchmarkEncode_16x4x16M-8               66070450      69062838      +4.53%
BenchmarkEncode_16x4x32M-8               133905200     130529500     -2.52%
BenchmarkEncode_16x4x64M-8               281313400     265809900     -5.51%
BenchmarkEncode_8x5x8M-8                 20789000      19866553      -4.44%
BenchmarkEncode_8x6x8M-8                 25027385      25087290      +0.24%
BenchmarkEncode_8x7x8M-8                 29156578      28231372      -3.17%
BenchmarkEncode_8x9x8M-8                 37286413      37383431      +0.26%
BenchmarkEncode_8x10x8M-8                41722722      39786752      -4.64%
BenchmarkEncode_8x11x8M-8                45692118      43409812      -4.99%
BenchmarkEncode_8x8x05M-8                2358946       2298631       -2.56%
BenchmarkEncode_8x8x1M-8                 4551026       4357599       -4.25%
BenchmarkEncode_8x8x8M-8                 33596074      31951653      -4.89%
BenchmarkEncode_8x8x32M-8                135030488     127382850     -5.66%
BenchmarkEncode_24x8x24M-8               297317050     301777575     +1.50%
BenchmarkEncode_24x8x48M-8               611638100     596134400     -2.53%
BenchmarkVerify10x2x10000-8              103723        103523        -0.19%
BenchmarkVerify50x5x50000-8              2170780       2148170       -1.04%
BenchmarkVerify10x2x1M-8                 1693351       1676973       -0.97%
BenchmarkVerify5x2x1M-8                  997721        995888        -0.18%
BenchmarkVerify10x4x1M-8                 3354687       3296939       -1.72%
BenchmarkVerify50x20x1M-8                67491300      66890056      -0.89%
BenchmarkVerify10x4x16M-8                44195152      44356146      +0.36%
BenchmarkReconstruct10x2x10000-8         24720         23373         -5.45%
BenchmarkReconstruct50x5x50000-8         880988        858684        -2.53%
BenchmarkReconstruct10x2x1M-8            387655        368900        -4.84%
BenchmarkReconstruct5x2x1M-8             191067        175841        -7.97%
BenchmarkReconstruct10x4x1M-8            1040639       1004731       -3.45%
BenchmarkReconstruct50x20x1M-8           28507103      28467956      -0.14%
BenchmarkReconstruct10x4x16M-8           15829872      15225654      -3.82%
BenchmarkReconstructData10x2x10000-8     24369         23374         -4.08%
BenchmarkReconstructData50x5x50000-8     865039        852456        -1.45%
BenchmarkReconstructData10x2x1M-8        383240        366751        -4.30%
BenchmarkReconstructData5x2x1M-8         183644        170444        -7.19%
BenchmarkReconstructData10x4x1M-8        1010537       969151        -4.10%
BenchmarkReconstructData50x20x1M-8       28288428      28051051      -0.84%
BenchmarkReconstructData10x4x16M-8       15048840      14443250      -4.02%
BenchmarkReconstructP10x2x10000-8        3219          3122          -3.01%
BenchmarkReconstructP10x5x20000-8        23574         22704         -3.69%
BenchmarkSplit10x4x160M-8                2822150       2735071       -3.09%
BenchmarkSplit5x2x5M-8                   409699        311346        -24.01%
BenchmarkSplit10x2x1M-8                  43767         40247         -8.04%
BenchmarkSplit10x4x10M-8                 741097        566888        -23.51%
BenchmarkSplit50x20x50M-8                1913475       1682060       -12.09%
BenchmarkSplit17x3x272M-8                2059505       2095628       +1.75%
BenchmarkStreamEncode10x2x10000-8        8517255       5226284       -38.64%
BenchmarkStreamEncode100x20x10000-8      41903836      40969212      -2.23%
BenchmarkStreamEncode17x3x1M-8           12038007      14129765      +17.38%
BenchmarkStreamEncode10x4x16M-8          56512840      54821895      -2.99%
BenchmarkStreamEncode5x2x1M-8            5326508       3966411       -25.53%
BenchmarkStreamEncode10x2x1M-8           6924358       6589396       -4.84%
BenchmarkStreamEncode10x4x1M-8           9016080       8459049       -6.18%
BenchmarkStreamEncode50x20x1M-8          93583042      94021200      +0.47%
BenchmarkStreamEncode17x3x16M-8          76643714      74750193      -2.47%
BenchmarkStreamVerify10x2x10000-8        8311646       5162179       -37.89%
BenchmarkStreamVerify50x5x50000-8        19015944      18352626      -3.49%
BenchmarkStreamVerify10x2x1M-8           5738380       5441592       -5.17%
BenchmarkStreamVerify5x2x1M-8            3462751       3328057       -3.89%
BenchmarkStreamVerify10x4x1M-8           6735717       6381116       -5.26%
BenchmarkStreamVerify50x20x1M-8          29844543      29416921      -1.43%
BenchmarkStreamVerify10x4x16M-8          8512699       8375778       -1.61%

benchmark                                old MB/s     new MB/s     speedup
BenchmarkParallel_8x8x05M-8              1402.38      1517.72      1.08x
BenchmarkParallel_8x8x1M-8               1697.56      1657.30      0.98x
BenchmarkParallel_8x8x8M-8               1958.94      2021.81      1.03x
BenchmarkParallel_8x8x32M-8              1875.11      2038.94      1.09x
BenchmarkGalois128K-8                    2041.59      3415.64      1.67x
BenchmarkGalois1M-8                      2067.98      3412.93      1.65x
BenchmarkGaloisXor128K-8                 2053.92      2075.33      1.01x
BenchmarkGaloisXor1M-8                   2070.77      2073.76      1.00x
BenchmarkEncode10x2x10000-8              1037.19      1077.81      1.04x
BenchmarkEncode100x20x10000-8            313.62       308.80       0.98x
BenchmarkEncode17x3x1M-8                 4764.54      4905.91      1.03x
BenchmarkEncode10x4x16M-8                4030.21      4162.45      1.03x
BenchmarkEncode5x2x1M-8                  7239.93      7499.07      1.04x
BenchmarkEncode10x2x1M-8                 7224.58      7367.61      1.02x
BenchmarkEncode10x4x1M-8                 3692.97      3826.57      1.04x
BenchmarkEncode50x20x1M-8                766.33       782.34       1.02x
BenchmarkEncode17x3x16M-8                5307.84      5507.69      1.04x
BenchmarkEncode_8x4x8M-8                 3988.40      4190.72      1.05x
BenchmarkEncode_12x4x12M-8               4021.79      4149.07      1.03x
BenchmarkEncode_16x4x16M-8               4062.87      3886.83      0.96x
BenchmarkEncode_16x4x32M-8               4009.34      4113.02      1.03x
BenchmarkEncode_16x4x64M-8               3816.89      4039.51      1.06x
BenchmarkEncode_8x5x8M-8                 3228.09      3377.98      1.05x
BenchmarkEncode_8x6x8M-8                 2681.42      2675.01      1.00x
BenchmarkEncode_8x7x8M-8                 2301.67      2377.10      1.03x
BenchmarkEncode_8x9x8M-8                 1799.82      1795.15      1.00x
BenchmarkEncode_8x10x8M-8                1608.45      1686.71      1.05x
BenchmarkEncode_8x11x8M-8                1468.72      1545.94      1.05x
BenchmarkEncode_8x8x05M-8                1778.04      1824.70      1.03x
BenchmarkEncode_8x8x1M-8                 1843.23      1925.05      1.04x
BenchmarkEncode_8x8x8M-8                 1997.52      2100.33      1.05x
BenchmarkEncode_8x8x32M-8                1987.96      2107.31      1.06x
BenchmarkEncode_24x8x24M-8               2031.43      2001.41      0.99x
BenchmarkEncode_24x8x48M-8               1974.96      2026.32      1.03x
BenchmarkVerify10x2x10000-8              964.10       965.97       1.00x
BenchmarkVerify50x5x50000-8              2303.32      2327.56      1.01x
BenchmarkVerify10x2x1M-8                 6192.31      6252.79      1.01x
BenchmarkVerify5x2x1M-8                  5254.86      5264.53      1.00x
BenchmarkVerify10x4x1M-8                 3125.70      3180.45      1.02x
BenchmarkVerify50x20x1M-8                776.82       783.81       1.01x
BenchmarkVerify10x4x16M-8                3796.17      3782.39      1.00x
BenchmarkReconstruct10x2x10000-8         4045.30      4278.40      1.06x
BenchmarkReconstruct50x5x50000-8         5675.45      5822.87      1.03x
BenchmarkReconstruct10x2x1M-8            27049.21     28424.40     1.05x
BenchmarkReconstruct5x2x1M-8             27440.02     29815.96     1.09x
BenchmarkReconstruct10x4x1M-8            10076.27     10436.39     1.04x
BenchmarkReconstruct50x20x1M-8           1839.15      1841.68      1.00x
BenchmarkReconstruct10x4x16M-8           10598.45     11019.04     1.04x
BenchmarkReconstructData10x2x10000-8     4103.60      4278.25      1.04x
BenchmarkReconstructData50x5x50000-8     5780.09      5865.40      1.01x
BenchmarkReconstructData10x2x1M-8        27360.79     28590.95     1.04x
BenchmarkReconstructData5x2x1M-8         28549.19     30760.16     1.08x
BenchmarkReconstructData10x4x1M-8        10376.42     10819.53     1.04x
BenchmarkReconstructData50x20x1M-8       1853.37      1869.05      1.01x
BenchmarkReconstructData10x4x16M-8       11148.51     11615.96     1.04x
BenchmarkReconstructP10x2x10000-8        31068.70     32026.22     1.03x
BenchmarkReconstructP10x5x20000-8        8484.08      8808.93      1.04x
BenchmarkStreamEncode10x2x10000-8        11.74        19.13        1.63x
BenchmarkStreamEncode100x20x10000-8      23.86        24.41        1.02x
BenchmarkStreamEncode17x3x1M-8           1480.79      1261.58      0.85x
BenchmarkStreamEncode10x4x16M-8          2968.74      3060.31      1.03x
BenchmarkStreamEncode5x2x1M-8            984.30       1321.82      1.34x
BenchmarkStreamEncode10x2x1M-8           1514.33      1591.31      1.05x
BenchmarkStreamEncode10x4x1M-8           1163.01      1239.59      1.07x
BenchmarkStreamEncode50x20x1M-8          560.24       557.63       1.00x
BenchmarkStreamEncode17x3x16M-8          3721.28      3815.54      1.03x
BenchmarkStreamVerify10x2x10000-8        12.03        19.37        1.61x
BenchmarkStreamVerify50x5x50000-8        262.94       272.44       1.04x
BenchmarkStreamVerify10x2x1M-8           1827.30      1926.97      1.05x
BenchmarkStreamVerify5x2x1M-8            1514.08      1575.36      1.04x
BenchmarkStreamVerify10x4x1M-8           1556.74      1643.25      1.06x
BenchmarkStreamVerify50x20x1M-8          1756.73      1782.27      1.01x
BenchmarkStreamVerify10x4x16M-8          19708.46     20030.64     1.02x
```
2020-05-03 19:38:55 +02:00
Klaus Post 65df535980
Make single goroutine encodes more efficient (#122)
Calculate the optimal per round size to keep data in cache when not using WithAutoGoroutines.

```
λ benchcmp before.txt after.txt
benchmark                          old ns/op     new ns/op     delta
BenchmarkParallel_8x8x05M-16       675225        321053        -52.45%
BenchmarkParallel_20x10x05M-16     3471988       600740        -82.70%
BenchmarkParallel_8x8x1M-16        3948606       728093        -81.56%
BenchmarkParallel_8x8x8M-16        47361588      5976467       -87.38%
BenchmarkParallel_8x8x32M-16       195044200     24365474      -87.51%

benchmark                          old MB/s     new MB/s     speedup
BenchmarkParallel_8x8x05M-16       6211.71      13064.22     2.10x
BenchmarkParallel_20x10x05M-16     3020.10      17454.73     5.78x
BenchmarkParallel_8x8x1M-16        2124.45      11521.34     5.42x
BenchmarkParallel_8x8x8M-16        1416.95      11228.85     7.92x
BenchmarkParallel_8x8x32M-16       1376.28      11017.04     8.00x

```
2020-05-03 19:37:22 +02:00
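A minimal sketch of the per-round idea described in this commit: pick a slice of each shard small enough that one round's working set stays in cache, then walk the shards round by round. The names `perRoundSize` and `forEachRound`, the 64-byte alignment, and the cache size used in `main` are assumptions for illustration; the formula the commit actually uses is not shown in this log.

```go
package main

import "fmt"

// perRoundSize returns how many bytes of every shard to process per round so
// that shards*perRound fits in cacheBytes. Hypothetical helper: the rounding
// to 64 bytes and the lower bound are assumptions, not the library's code.
func perRoundSize(cacheBytes, shards int) int {
	const align = 64
	per := cacheBytes / shards
	per = per / align * align // round down to a multiple of align
	if per < align {
		per = align
	}
	return per
}

// forEachRound calls process for consecutive [start, end) windows of a shard
// of length shardLen and returns the number of rounds.
func forEachRound(shardLen, perRound int, process func(start, end int)) int {
	rounds := 0
	for start := 0; start < shardLen; start += perRound {
		end := start + perRound
		if end > shardLen {
			end = shardLen
		}
		process(start, end)
		rounds++
	}
	return rounds
}

func main() {
	per := perRoundSize(256<<10, 16) // assumed 256 KiB cache, 16 shards
	rounds := forEachRound(1<<20, per, func(start, end int) {
		// A real encoder would process bytes [start, end) of every shard here.
	})
	fmt.Println(per, rounds) // prints "16384 64"
}
```

Processing all shards for one window before moving to the next window is what keeps the inputs and outputs of a round resident in cache in this sketch.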
Frank Wessels 0b98f5350a
Refactor AVX512 code to use Go assembly instructions. (#121)
Additionally there is a small performance improvement using VPTERNLOGD (instead of two VPXORD instructions).
2020-05-03 13:43:52 +02:00
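Why one VPTERNLOGD can stand in for two XOR instructions can be checked with a small software model. The sketch below emulates the per-bit lookup of a ternary-logic instruction in plain Go and confirms that the truth table `0x96` yields a three-way XOR. The function `ternlog` is a hypothetical emulation for illustration; the operands and immediate used in the commit's assembly are not shown in this log.

```go
package main

import "fmt"

// ternlog models a three-input bitwise lookup on a 32-bit lane: at every bit
// position the three input bits form an index (0..7) into imm8, and the bit
// found there becomes the result bit.
func ternlog(a, b, c uint32, imm8 uint8) uint32 {
	var out uint32
	for i := uint(0); i < 32; i++ {
		idx := ((a>>i)&1)<<2 | ((b>>i)&1)<<1 | (c>>i)&1
		out |= uint32(imm8>>idx&1) << i
	}
	return out
}

func main() {
	a, b, c := uint32(0xdeadbeef), uint32(0x01234567), uint32(0x89abcdef)
	// 0x96 = 0b10010110 has ones at indexes 1, 2, 4 and 7, i.e. wherever an
	// odd number of input bits is set, which is exactly a ^ b ^ c.
	fmt.Println(ternlog(a, b, c, 0x96) == a^b^c) // prints "true"
}
```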
Klaus Post 17098a4f19
Use stream test options (#118) 2020-04-22 17:22:16 +02:00
Klaus Post c3634dce94
Use CPU cache to set minSplitSize (#117)
Use L1 cache size to set default split size.
2020-04-22 16:12:18 +02:00
Klaus Post d2cfcb8065
Add commandline arg to disable asm for tests. (#116)
* Add commandline test args
2020-04-22 15:38:21 +02:00
Klaus Post 0abe9de20c
Update tests (#115)
Don't create new slices.
2020-02-21 11:30:44 -08:00
Klaus Post 101092fa3b
Make AVX512 short tests (#114)
Tests are timing out. Use shorter tests for -short.
2020-01-18 14:50:31 +01:00
Klaus Post 70d6279761 Update travis script 2019-09-27 16:33:57 -07:00
Christian Muehlhaeuser 4681100338 Removed unused struct members (#106)
creads & cwrites both seem to be unused.
2019-09-27 16:31:11 -07:00
Christian Muehlhaeuser 993c27a5ba Avoid unnecessary conversion (#107)
No need to convert to byte here.
2019-09-27 16:30:54 -07:00
Andreas Auernhammer 1f1369aa84 limit capacity of shards to shard size (#109)
This commit limits the capacity of each shard, in
addition to its length, to the shard size.

Before this change, the following code behaves
unexpectedly:
```
shards := encoder.Split(buffer)
// ...
shards[0] = shards[0][:cap(shards[0])]
```

Instead of restoring the length of `shards[0]` to
the shard size, it assigns the entire memory of `buffer`
to `shards[0]`.
2019-09-27 16:30:26 -07:00
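The effect described in this commit can be sketched with Go's full slice expression `buf[low:high:max]`, which sets the capacity as well as the length. The helper `splitShards` below is hypothetical and much simpler than the library's `Split` (it assumes `len(buf)` is divisible by the shard count and does no padding); it only shows how capping the capacity makes the re-slice above behave as expected.

```go
package main

import "fmt"

// splitShards cuts buf into n equal shards. Hypothetical helper for
// illustration: it assumes len(buf) is divisible by n.
func splitShards(buf []byte, n int) [][]byte {
	size := len(buf) / n
	shards := make([][]byte, n)
	for i := range shards {
		// The third index limits cap(shards[i]) to size, so re-slicing a
		// shard can never reach into the memory of its neighbours.
		shards[i] = buf[i*size : (i+1)*size : (i+1)*size]
	}
	return shards
}

func main() {
	buf := make([]byte, 12)
	shards := splitShards(buf, 3)

	shards[0] = shards[0][:2]              // shrink the first shard
	shards[0] = shards[0][:cap(shards[0])] // restore it to the shard size

	fmt.Println(len(shards[0]), cap(shards[0])) // prints "4 4"
}
```

With a two-index expression `buf[i*size:(i+1)*size]` instead, `cap(shards[0])` would be 12 in this example and the restore step would expose the whole buffer.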
dssysolyatin 7890684129 Improve quick check for case when dataOnly is true (#105) 2019-06-25 16:30:44 +02:00
dssysolyatin ec2eb9fb8c Split: Reduce memory allocation (#103)
* [Split] Reduce memory allocation in Split function
2019-06-25 16:28:24 +02:00
Klaus Post 0883d2f011 Only enable AVX512 on AMD64
Fixes #102
2019-05-26 12:12:55 +02:00
Lennart Oldenburg a373324398 Fixed upper bound check for data shard cli argument in example encoders and file permission issue. (#98) 2019-04-07 17:36:31 +02:00
Klaus Post a9588190c0
Optimize pure Go version. (#96)
* Optimize pure Go version.
* Update docs. Add Go 1.12 CI

* Avoid dst bounds check when using noasm (roughly 40-50% faster).
* Convert multiply table to a slice whenever used.
* Split on 32 byte boundaries instead of 16 byte.
2019-03-08 10:49:27 +01:00
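The first two bullets above can be sketched as follows: re-slicing `out` to `len(in)` and turning the table row into a fixed-length slice give the compiler what it needs to prove the indexes are in range, so it may drop the per-iteration bounds checks. The names `mulTable`, `gfMul` and `mulSlice`, and the reduction polynomial `0x11d`, are assumptions for this sketch; the library's own tables and function signatures are not shown in this log.

```go
package main

import "fmt"

// mulTable[c][x] holds c*x in GF(2^8). Reduction polynomial 0x11d is an
// assumption made for this sketch.
var mulTable [256][256]byte

// gfMul multiplies two field elements bit by bit (shift-and-add).
func gfMul(a, b byte) byte {
	var p byte
	for b > 0 {
		if b&1 != 0 {
			p ^= a
		}
		carry := a & 0x80
		a <<= 1
		if carry != 0 {
			a ^= 0x1d // low byte of 0x11d
		}
		b >>= 1
	}
	return p
}

func init() {
	for c := 0; c < 256; c++ {
		for x := 0; x < 256; x++ {
			mulTable[c][x] = gfMul(byte(c), byte(x))
		}
	}
}

// mulSlice writes c*in[i] to out[i]; out must be at least as long as in.
func mulSlice(c byte, in, out []byte) {
	out = out[:len(in)]     // one check here instead of one per out[i]
	mt := mulTable[c][:256] // slice of known length 256; a byte index always fits
	for i, v := range in {
		out[i] = mt[v]
	}
}

func main() {
	in := []byte{1, 2, 0x80}
	out := make([]byte, len(in))
	mulSlice(2, in, out)
	fmt.Println(out) // prints "[2 4 29]"
}
```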
Klaus Post 09979cdf93 Start documentation with method name.
Replaces #92
2019-02-15 15:31:43 +01:00