Commit Graph

9 Commits (jerasure-matrix)

Author SHA1 Message Date
Klaus Post 3a82d28edb
Add GF16 AVX2, AVX512 and SSSE3 (#193)
* Add GF16 AVX2
* Add SSSE3 fallback.
* Fix reconstruction being skipped if first shard was empty.
* Combine lookups in pure Go
* Faster xor on pure Go.
* Add 4way butterfly AVX2.
* Add fftDIT4 avx2. Add avx512 version. Add noescape.
* Remove +build space. Do size varied 800x200 bench.
* Use VPTERNLOGD for avx512.
* Remove refMulAdd inner loop bounds checks. ~10-20% faster
2022-07-26 12:37:28 +02:00
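
The "remove inner loop bounds checks" item above refers to a common Go optimization. The following is a minimal sketch of the general idiom, not the library's actual `refMulAdd`: the names `mulAddTable` and `gfMul`, the 8-bit field, and the reduction constant 0x1d are assumptions made for illustration (the commit itself concerns GF16). Reslicing `dst` to the length of `src` and indexing a fixed-size array with a `byte` typically lets the compiler prove every index is in range.

```go
package main

import "fmt"

// gfMul multiplies two bytes in GF(2^8). The reduction constant 0x1d is an
// assumption for this sketch, not taken from the commit.
func gfMul(a, b byte) byte {
	var p byte
	for b != 0 {
		if b&1 != 0 {
			p ^= a
		}
		carry := a & 0x80
		a <<= 1
		if carry != 0 {
			a ^= 0x1d
		}
		b >>= 1
	}
	return p
}

// mulAddTable computes dst[i] ^= table[src[i]] for every byte of src.
// (Hypothetical helper, illustrating the idiom only.)
func mulAddTable(dst, src []byte, table *[256]byte) {
	// One up-front reslice: panics here if dst is too short, and tells the
	// compiler that len(dst) == len(src), so dst[i] needs no per-iteration check.
	dst = dst[:len(src)]
	for i, v := range src {
		// v is a byte and table has exactly 256 entries, so table[v] is
		// always in range.
		dst[i] ^= table[v]
	}
}

func main() {
	// Build a lookup table for multiplication by the constant 2.
	var table [256]byte
	for i := range table {
		table[i] = gfMul(byte(i), 2)
	}
	dst := []byte{1, 2, 3}
	src := []byte{0x80, 0, 1}
	mulAddTable(dst, src, &table)
	fmt.Printf("%#v\n", dst)
}
```

Whether the checks are really gone can be inspected with `go build -gcflags=-d=ssa/check_bce`, which reports the bounds checks that remain.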
Klaus Post 2f19c81be4
Reduce generated code (#185)
* Reduce generated code

Use a define (with hacks)
2022-03-24 13:25:40 +01:00
Klaus Post daf81ef0bd
Use VPTERNLOGD on GOAMD64=v4 (#182)
* Use VPTERNLOGD on GOAMD64=v4
* Bump to Go 1.18
2022-03-16 11:10:29 +01:00
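
For readers unfamiliar with VPTERNLOGD: it is an AVX-512 instruction that evaluates an arbitrary three-input boolean function, selected by an 8-bit immediate used as a truth table. The sketch below emulates one 32-bit lane in plain Go. That the commit uses it to fold two XORs into one instruction (immediate 0x96) is an assumption here, not something the message states; `ternlog32` is a hypothetical name.

```go
package main

import "fmt"

// ternlog32 emulates the per-bit behaviour of VPTERNLOGD on one 32-bit lane.
// For every bit position, the bits of a, b and c form a 3-bit index
// (a is the most significant bit) and the selected bit of imm is the result.
func ternlog32(a, b, c uint32, imm uint8) uint32 {
	var out uint32
	for bit := uint(0); bit < 32; bit++ {
		idx := ((a>>bit)&1)<<2 | ((b>>bit)&1)<<1 | (c>>bit)&1
		out |= uint32(imm>>idx&1) << bit
	}
	return out
}

func main() {
	a, b, c := uint32(0xF0F0F0F0), uint32(0xCCCCCCCC), uint32(0xAAAAAAAA)
	// 0x96 has bits 1, 2, 4 and 7 set: exactly the indices with an odd
	// number of ones, which makes it a three-way XOR.
	fmt.Printf("%08x\n", ternlog32(a, b, c, 0x96))
	fmt.Printf("%08x\n", a^b^c)
}
```

Both lines print the same value, since the 0x96 truth table is a three-way XOR.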
Klaus Post 1bb4d699e1
avx2: Improve speed when > 10 input or output shards. (#174)
Speeds include limiting the number of goroutines on all AVX2 paths.

Before/after
```
benchmark                                 old ns/op     new ns/op     delta
BenchmarkGalois128K-32                    2240          2240          +0.00%
BenchmarkGalois1M-32                      19578         18891         -3.51%
BenchmarkGaloisXor128K-32                 2798          2852          +1.93%
BenchmarkGaloisXor1M-32                   23334         23345         +0.05%
BenchmarkEncode2x1x1M-32                  34357         34370         +0.04%
BenchmarkEncode10x2x10000-32              3210          3093          -3.64%
BenchmarkEncode100x20x10000-32            362925        148214        -59.16%
BenchmarkEncode17x3x1M-32                 323767        224157        -30.77%
BenchmarkEncode10x4x16M-32                8376895       8376737       -0.00%
BenchmarkEncode5x2x1M-32                  68365         66861         -2.20%
BenchmarkEncode10x2x1M-32                 101407        93023         -8.27%
BenchmarkEncode10x4x1M-32                 171880        155477        -9.54%
BenchmarkEncode50x20x1M-32                3704691       3015047       -18.62%
BenchmarkEncode17x3x16M-32                10279233      10106658      -1.68%
BenchmarkEncode_8x4x8M-32                 3438245       3326479       -3.25%
BenchmarkEncode_12x4x12M-32               6632257       6581637       -0.76%
BenchmarkEncode_16x4x16M-32               10815755      10788377      -0.25%
BenchmarkEncode_16x4x32M-32               21029061      21507995      +2.28%
BenchmarkEncode_16x4x64M-32               42145450      43876850      +4.11%
BenchmarkEncode_8x5x8M-32                 4543208       3846378       -15.34%
BenchmarkEncode_8x6x8M-32                 5065494       4397218       -13.19%
BenchmarkEncode_8x7x8M-32                 5818995       4962884       -14.71%
BenchmarkEncode_8x9x8M-32                 6215449       6114898       -1.62%
BenchmarkEncode_8x10x8M-32                6923415       6610501       -4.52%
BenchmarkEncode_8x11x8M-32                7365988       7010473       -4.83%
BenchmarkEncode_8x8x05M-32                150857        136820        -9.30%
BenchmarkEncode_8x8x1M-32                 256722        254854        -0.73%
BenchmarkEncode_8x8x8M-32                 5547790       5422048       -2.27%
BenchmarkEncode_8x8x32M-32                23038643      22705859      -1.44%
BenchmarkEncode_24x8x24M-32               27729259      30332216      +9.39%
BenchmarkEncode_24x8x48M-32               53865705      61187658      +13.59%
BenchmarkVerify10x2x10000-32              8769          8154          -7.01%
BenchmarkVerify10x2x1M-32                 516149        476180        -7.74%
BenchmarkVerify5x2x1M-32                  443888        419541        -5.48%
BenchmarkVerify10x4x1M-32                 1030299       948021        -7.99%
BenchmarkVerify50x20x1M-32                7209689       6186891       -14.19%
BenchmarkVerify10x4x16M-32                17774456      17681879      -0.52%
BenchmarkReconstruct10x2x10000-32         3352          3256          -2.86%
BenchmarkReconstruct50x5x50000-32         166417        140900        -15.33%
BenchmarkReconstruct10x2x1M-32            189711        174615        -7.96%
BenchmarkReconstruct5x2x1M-32             128080        126520        -1.22%
BenchmarkReconstruct10x4x1M-32            273312        254017        -7.06%
BenchmarkReconstruct50x20x1M-32           3628812       3192474       -12.02%
BenchmarkReconstruct10x4x16M-32           8562186       8781479       +2.56%
BenchmarkReconstructData10x2x10000-32     3241          3116          -3.86%
BenchmarkReconstructData50x5x50000-32     162520        134794        -17.06%
BenchmarkReconstructData10x2x1M-32        171253        161955        -5.43%
BenchmarkReconstructData5x2x1M-32         102215        106942        +4.62%
BenchmarkReconstructData10x4x1M-32        225593        219969        -2.49%
BenchmarkReconstructData50x20x1M-32       2515311       2129721       -15.33%
BenchmarkReconstructData10x4x16M-32       6980308       6698111       -4.04%
BenchmarkReconstructP10x2x10000-32        924           937           +1.35%
BenchmarkReconstructP10x5x20000-32        1639          1703          +3.90%
BenchmarkSplit10x4x160M-32                4984993       4898045       -1.74%
BenchmarkSplit5x2x5M-32                   380415        221446        -41.79%
BenchmarkSplit10x2x1M-32                  58761         53335         -9.23%
BenchmarkSplit10x4x10M-32                 643188        410959        -36.11%
BenchmarkSplit50x20x50M-32                1843879       1647205       -10.67%
BenchmarkSplit17x3x272M-32                3684920       3613951       -1.93%
BenchmarkParallel_8x8x64K-32              7022          6630          -5.58%
BenchmarkParallel_8x8x05M-32              348308        348369        +0.02%
BenchmarkParallel_20x10x05M-32            575672        581028        +0.93%
BenchmarkParallel_8x8x1M-32               716033        697167        -2.63%
BenchmarkParallel_8x8x8M-32               5716048       5616437       -1.74%
BenchmarkParallel_8x8x32M-32              22650878      22098667      -2.44%
BenchmarkParallel_8x3x1M-32               406839        399125        -1.90%
BenchmarkParallel_8x4x1M-32               459107        463890        +1.04%
BenchmarkParallel_8x5x1M-32               527488        520334        -1.36%
BenchmarkStreamEncode10x2x10000-32        6013          5878          -2.25%
BenchmarkStreamEncode100x20x10000-32      503124        267894        -46.75%
BenchmarkStreamEncode17x3x1M-32           1561838       1376618       -11.86%
BenchmarkStreamEncode10x4x16M-32          19124427      17762582      -7.12%
BenchmarkStreamEncode5x2x1M-32            429701        384666        -10.48%
BenchmarkStreamEncode10x2x1M-32           801257        763637        -4.70%
BenchmarkStreamEncode10x4x1M-32           876065        820744        -6.31%
BenchmarkStreamEncode50x20x1M-32          7205112       6081398       -15.60%
BenchmarkStreamEncode17x3x16M-32          27182786      26117143      -3.92%
BenchmarkStreamVerify10x2x10000-32        13767         14026         +1.88%
BenchmarkStreamVerify50x5x50000-32        826983        690453        -16.51%
BenchmarkStreamVerify10x2x1M-32           1238566       1182591       -4.52%
BenchmarkStreamVerify5x2x1M-32            892661        806301        -9.67%
BenchmarkStreamVerify10x4x1M-32           1676394       1631495       -2.68%
BenchmarkStreamVerify50x20x1M-32          10877875      10037678      -7.72%
BenchmarkStreamVerify10x4x16M-32          27599576      30435400      +10.27%

benchmark                                 old MB/s      new MB/s      speedup
BenchmarkGalois128K-32                    58518.53      58510.17      1.00x
BenchmarkGalois1M-32                      53558.10      55507.44      1.04x
BenchmarkGaloisXor128K-32                 46839.74      45961.09      0.98x
BenchmarkGaloisXor1M-32                   44936.98      44917.46      1.00x
BenchmarkEncode2x1x1M-32                  91561.27      91524.11      1.00x
BenchmarkEncode10x2x10000-32              37385.54      38792.54      1.04x
BenchmarkEncode100x20x10000-32            3306.47       8096.40       2.45x
BenchmarkEncode17x3x1M-32                 64773.49      93557.14      1.44x
BenchmarkEncode10x4x16M-32                28039.15      28039.68      1.00x
BenchmarkEncode5x2x1M-32                  107365.88     109781.16     1.02x
BenchmarkEncode10x2x1M-32                 124083.62     135266.27     1.09x
BenchmarkEncode10x4x1M-32                 85408.99      94419.71      1.11x
BenchmarkEncode50x20x1M-32                19812.81      24344.67      1.23x
BenchmarkEncode17x3x16M-32                32642.93      33200.32      1.02x
BenchmarkEncode_8x4x8M-32                 29277.52      30261.21      1.03x
BenchmarkEncode_12x4x12M-32               30355.67      30589.14      1.01x
BenchmarkEncode_16x4x16M-32               31023.66      31102.39      1.00x
BenchmarkEncode_16x4x32M-32               31912.44      31201.82      0.98x
BenchmarkEncode_16x4x64M-32               31846.32      30589.65      0.96x
BenchmarkEncode_8x5x8M-32                 24003.28      28351.84      1.18x
BenchmarkEncode_8x6x8M-32                 23184.41      26707.91      1.15x
BenchmarkEncode_8x7x8M-32                 21623.86      25354.03      1.17x
BenchmarkEncode_8x9x8M-32                 22943.85      23321.13      1.02x
BenchmarkEncode_8x10x8M-32                21809.31      22841.68      1.05x
BenchmarkEncode_8x11x8M-32                21637.77      22735.06      1.05x
BenchmarkEncode_8x8x05M-32                55606.22      61311.47      1.10x
BenchmarkEncode_8x8x1M-32                 65351.80      65830.73      1.01x
BenchmarkEncode_8x8x8M-32                 24193.01      24754.07      1.02x
BenchmarkEncode_8x8x32M-32                23303.06      23644.60      1.01x
BenchmarkEncode_24x8x24M-32               29041.76      26549.54      0.91x
BenchmarkEncode_24x8x48M-32               29900.52      26322.51      0.88x
BenchmarkVerify10x2x10000-32              13685.12      14717.10      1.08x
BenchmarkVerify10x2x1M-32                 24378.43      26424.72      1.08x
BenchmarkVerify5x2x1M-32                  16535.79      17495.41      1.06x
BenchmarkVerify10x4x1M-32                 14248.35      15484.96      1.09x
BenchmarkVerify50x20x1M-32                10180.79      11863.85      1.17x
BenchmarkVerify10x4x16M-32                13214.53      13283.71      1.01x
BenchmarkReconstruct10x2x10000-32         35799.16      36854.89      1.03x
BenchmarkReconstruct50x5x50000-32         33049.47      39034.89      1.18x
BenchmarkReconstruct10x2x1M-32            66326.88      72061.06      1.09x
BenchmarkReconstruct5x2x1M-32             57308.21      58014.92      1.01x
BenchmarkReconstruct10x4x1M-32            53711.74      57791.66      1.08x
BenchmarkReconstruct50x20x1M-32           20227.09      22991.67      1.14x
BenchmarkReconstruct10x4x16M-32           27432.37      26747.32      0.98x
BenchmarkReconstructData10x2x10000-32     37030.86      38511.87      1.04x
BenchmarkReconstructData50x5x50000-32     33842.07      40802.85      1.21x
BenchmarkReconstructData10x2x1M-32        73475.57      77693.87      1.06x
BenchmarkReconstructData5x2x1M-32         71809.58      68635.57      0.96x
BenchmarkReconstructData10x4x1M-32        65073.27      66736.88      1.03x
BenchmarkReconstructData50x20x1M-32       29181.41      34464.76      1.18x
BenchmarkReconstructData10x4x16M-32       33649.09      35066.75      1.04x
BenchmarkReconstructP10x2x10000-32        129819.98     128086.76     0.99x
BenchmarkReconstructP10x5x20000-32        183073.89     176202.21     0.96x
BenchmarkParallel_8x8x64K-32              149327.33     158153.67     1.06x
BenchmarkParallel_8x8x05M-32              24083.89      24079.69      1.00x
BenchmarkParallel_20x10x05M-32            27322.20      27070.35      0.99x
BenchmarkParallel_8x8x1M-32               23430.78      24064.83      1.03x
BenchmarkParallel_8x8x8M-32               23480.86      23897.31      1.02x
BenchmarkParallel_8x8x32M-32              23701.99      24294.27      1.02x
BenchmarkParallel_8x3x1M-32               28351.11      28899.03      1.02x
BenchmarkParallel_8x4x1M-32               27407.34      27124.76      0.99x
BenchmarkParallel_8x5x1M-32               25842.27      26197.58      1.01x
BenchmarkStreamEncode10x2x10000-32        16629.76      17012.26      1.02x
BenchmarkStreamEncode100x20x10000-32      1987.58       3732.83       1.88x
BenchmarkStreamEncode17x3x1M-32           11413.34      12948.97      1.13x
BenchmarkStreamEncode10x4x16M-32          8772.66       9445.26       1.08x
BenchmarkStreamEncode5x2x1M-32            12201.21      13629.70      1.12x
BenchmarkStreamEncode10x2x1M-32           13086.64      13731.34      1.05x
BenchmarkStreamEncode10x4x1M-32           11969.16      12775.92      1.07x
BenchmarkStreamEncode50x20x1M-32          7276.61       8621.18       1.18x
BenchmarkStreamEncode17x3x16M-32          10492.40      10920.52      1.04x
BenchmarkStreamVerify10x2x10000-32        7264.00       7129.49       0.98x
BenchmarkStreamVerify50x5x50000-32        6046.07       7241.62       1.20x
BenchmarkStreamVerify10x2x1M-32           8466.05       8866.77       1.05x
BenchmarkStreamVerify5x2x1M-32            5873.31       6502.39       1.11x
BenchmarkStreamVerify10x4x1M-32           6254.95       6427.09       1.03x
BenchmarkStreamVerify50x20x1M-32          4819.76       5223.20       1.08x
BenchmarkStreamVerify10x4x16M-32          6078.79       5512.40       0.91x
```
2021-12-09 12:28:44 +01:00
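
As a reading aid for the two tables above, the `delta` and `speedup` columns can be reproduced from the raw numbers. This is a small sketch with hypothetical helper names, assuming the usual definitions: delta is the relative change in ns/op, and speedup is the new throughput divided by the old.

```go
package main

import "fmt"

// deltaPercent returns the relative change in ns/op, in percent.
// Negative means the new code is faster.
func deltaPercent(oldNs, newNs float64) float64 {
	return (newNs - oldNs) / oldNs * 100
}

// speedup returns the throughput ratio new/old; above 1 means faster.
func speedup(oldMBs, newMBs float64) float64 {
	return newMBs / oldMBs
}

func main() {
	// Values copied from the BenchmarkEncode100x20x10000-32 rows.
	fmt.Printf("delta:   %+.2f%%\n", deltaPercent(362925, 148214))
	fmt.Printf("speedup: %.2fx\n", speedup(3306.47, 8096.40))
}
```

For that benchmark this gives -59.16% and 2.45x, matching the table.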
Klaus Post 7761c8f7cd
Use Workflows (#169)
* Use Workflows
* Go 1.17 build tags
* Do races separately.
2021-09-01 18:55:02 +02:00
Klaus Post 7bd22796ec
Wider AVX2 loops and less usage. (#162)
* Experiment with 64 bytes/loop AVX2

* Only reduce when doing 64.

* Use no more than 8 goroutines for avx2 codegen.

```
name                         old speed      new speed      delta
Encode10x2x10000-32          33.3GB/s ± 0%  37.5GB/s ± 1%  +12.49%   (p=0.000 n=9+10)
Encode100x20x10000-32        3.79GB/s ± 5%  3.77GB/s ± 5%     ~     (p=0.853 n=10+10)
Encode17x3x1M-32             78.2GB/s ± 1%  76.0GB/s ± 6%     ~     (p=0.123 n=10+10)
Encode10x4x16M-32            28.3GB/s ± 0%  27.7GB/s ± 2%   -2.32%   (p=0.000 n=8+10)
Encode5x2x1M-32               112GB/s ± 1%   113GB/s ± 1%     ~     (p=0.796 n=10+10)
Encode10x2x1M-32              149GB/s ± 1%   129GB/s ± 3%  -13.24%   (p=0.000 n=9+10)
Encode10x4x1M-32             99.1GB/s ± 1%  91.5GB/s ± 3%   -7.74%  (p=0.000 n=10+10)
Encode50x20x1M-32            19.7GB/s ± 1%  19.8GB/s ± 1%     ~      (p=0.447 n=9+10)
Encode17x3x16M-32            33.4GB/s ± 0%  33.3GB/s ± 1%   -0.46%   (p=0.043 n=10+9)
Encode_8x4x8M-32             30.1GB/s ± 1%  29.4GB/s ± 3%   -2.31%  (p=0.000 n=10+10)
Encode_12x4x12M-32           30.6GB/s ± 0%  30.5GB/s ± 0%     ~      (p=0.720 n=10+9)
Encode_16x4x16M-32           31.5GB/s ± 0%  31.5GB/s ± 0%     ~      (p=0.497 n=10+9)
Encode_16x4x32M-32           31.9GB/s ± 0%  31.5GB/s ± 4%     ~     (p=0.165 n=10+10)
Encode_16x4x64M-32           32.4GB/s ± 0%  32.3GB/s ± 0%     ~       (p=0.321 n=9+8)
Encode_8x5x8M-32             28.4GB/s ± 0%  28.4GB/s ± 1%     ~      (p=0.237 n=10+8)
Encode_8x6x8M-32             27.0GB/s ± 0%  27.2GB/s ± 2%     ~     (p=0.075 n=10+10)
Encode_8x7x8M-32             26.0GB/s ± 1%  25.8GB/s ± 1%   -0.53%   (p=0.003 n=9+10)
Encode_8x9x8M-32             24.6GB/s ± 1%  24.4GB/s ± 1%   -0.63%  (p=0.000 n=10+10)
Encode_8x10x8M-32            23.7GB/s ± 1%  23.7GB/s ± 0%   +0.32%   (p=0.035 n=10+9)
Encode_8x11x8M-32            23.0GB/s ± 1%  22.8GB/s ± 0%   -0.59%    (p=0.000 n=9+8)
Encode_8x8x05M-32            66.4GB/s ± 1%  64.2GB/s ± 1%   -3.32%  (p=0.000 n=10+10)
Encode_8x8x1M-32             56.7GB/s ± 0%  75.7GB/s ± 2%  +33.55%    (p=0.000 n=9+9)
Encode_8x8x8M-32             24.9GB/s ± 0%  24.9GB/s ± 1%     ~      (p=0.146 n=8+10)
Encode_8x8x32M-32            23.8GB/s ± 0%  23.4GB/s ± 0%   -1.42%   (p=0.000 n=9+10)
Encode_24x8x24M-32           29.9GB/s ± 0%  29.9GB/s ± 0%     ~      (p=0.278 n=10+9)
Encode_24x8x48M-32           30.7GB/s ± 1%  30.7GB/s ± 0%     ~       (p=0.351 n=9+7)
StreamEncode10x2x10000-32    15.5GB/s ± 1%  16.5GB/s ± 0%   +6.53%   (p=0.000 n=10+9)
StreamEncode100x20x10000-32  2.09GB/s ± 1%  2.06GB/s ± 2%   -1.78%  (p=0.000 n=10+10)
StreamEncode17x3x1M-32       12.2GB/s ± 2%  12.3GB/s ± 1%   +1.19%   (p=0.008 n=10+9)
StreamEncode10x4x16M-32      8.68GB/s ± 0%  9.47GB/s ± 1%   +9.05%   (p=0.000 n=8+10)
StreamEncode5x2x1M-32        12.3GB/s ± 1%  13.2GB/s ± 1%   +7.61%  (p=0.000 n=10+10)
StreamEncode10x2x1M-32       11.5GB/s ± 4%  13.3GB/s ± 2%  +15.15%   (p=0.000 n=10+7)
```
2021-06-21 15:15:23 +02:00
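
The "no more than 8 goroutines" item above describes capping concurrency. A minimal sketch of one way to do this follows. It is not the library's code: `maxWorkers`, `capWorkers` and `xorParallel` are hypothetical names, and a plain XOR stands in for the AVX2 kernel.

```go
package main

import (
	"fmt"
	"sync"
)

// maxWorkers is a hypothetical constant mirroring the cap of 8 goroutines
// mentioned in the commit message.
const maxWorkers = 8

// capWorkers clamps a requested goroutine count to the range [1, maxWorkers].
func capWorkers(requested int) int {
	if requested < 1 {
		return 1
	}
	if requested > maxWorkers {
		return maxWorkers
	}
	return requested
}

// xorParallel XORs src into dst, splitting the work into contiguous chunks
// handled by at most maxWorkers goroutines. It returns how many goroutines
// were started.
func xorParallel(dst, src []byte, requested int) int {
	dst = dst[:len(src)]
	workers := capWorkers(requested)
	chunk := (len(src) + workers - 1) / workers // ceiling division
	var wg sync.WaitGroup
	started := 0
	for start := 0; start < len(src); start += chunk {
		end := start + chunk
		if end > len(src) {
			end = len(src)
		}
		wg.Add(1)
		started++
		go func(d, s []byte) {
			defer wg.Done()
			for i := range s {
				d[i] ^= s[i]
			}
		}(dst[start:end], src[start:end])
	}
	wg.Wait()
	return started
}

func main() {
	src := make([]byte, 100)
	dst := make([]byte, 100)
	for i := range src {
		src[i] = byte(i)
	}
	// 32 goroutines are requested, but only maxWorkers are started.
	fmt.Println(xorParallel(dst, src, 32))
}
```

The trade-off in the benchmark output above is visible in this shape of code: fewer goroutines mean less scheduling overhead on small inputs, at the cost of leaving cores idle on machines with more than 8 threads.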
Klaus Post 46e0559fe3
Upgrade avo to avoid bp (#166)
Skip spilling BP unless needed.
2021-04-26 11:40:31 +02:00
Klaus Post 9227782845
Avoid clobbering BP (#165)
* Avoid clobbering BP
* Spill to XMM register when available.
2021-04-06 17:47:52 +02:00
Klaus Post 519603f6e1
Update packages (#154)
* Update packages

Update cpuid and clean up generated.
2020-12-09 22:56:01 +01:00