reedsolomon-go/galois_gen_amd64.go

657 lines
24 KiB
Go
Raw Normal View History

// Code generated by command: go run gen.go -out ../galois_gen_amd64.s -stubs ../galois_gen_amd64.go -pkg=reedsolomon. DO NOT EDIT.
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
//go:build !appengine && !noasm && !nogen && gc
// +build !appengine,!noasm,!nogen,gc
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
package reedsolomon
// mulAvxTwo_1x1 takes 1 inputs and produces 1 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_1x1(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_1x1_64 takes 1 inputs and produces 1 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_1x1_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_1x2 takes 1 inputs and produces 2 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_1x2(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_1x2_64 takes 1 inputs and produces 2 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_1x2_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_1x3 takes 1 inputs and produces 3 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_1x3(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_1x3_64 takes 1 inputs and produces 3 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_1x3_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_1x4 takes 1 inputs and produces 4 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_1x4(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_1x5 takes 1 inputs and produces 5 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_1x5(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_1x6 takes 1 inputs and produces 6 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_1x6(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_1x7 takes 1 inputs and produces 7 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_1x7(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_1x8 takes 1 inputs and produces 8 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_1x8(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_1x9 takes 1 inputs and produces 9 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_1x9(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_1x10 takes 1 inputs and produces 10 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_1x10(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_2x1 takes 2 inputs and produces 1 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_2x1(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_2x1_64 takes 2 inputs and produces 1 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_2x1_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_2x2 takes 2 inputs and produces 2 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_2x2(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_2x2_64 takes 2 inputs and produces 2 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_2x2_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_2x3 takes 2 inputs and produces 3 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_2x3(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_2x3_64 takes 2 inputs and produces 3 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_2x3_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_2x4 takes 2 inputs and produces 4 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_2x4(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_2x5 takes 2 inputs and produces 5 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_2x5(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_2x6 takes 2 inputs and produces 6 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_2x6(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_2x7 takes 2 inputs and produces 7 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_2x7(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_2x8 takes 2 inputs and produces 8 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_2x8(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_2x9 takes 2 inputs and produces 9 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_2x9(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_2x10 takes 2 inputs and produces 10 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_2x10(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_3x1 takes 3 inputs and produces 1 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_3x1(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_3x1_64 takes 3 inputs and produces 1 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_3x1_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_3x2 takes 3 inputs and produces 2 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_3x2(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_3x2_64 takes 3 inputs and produces 2 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_3x2_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_3x3 takes 3 inputs and produces 3 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_3x3(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_3x3_64 takes 3 inputs and produces 3 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_3x3_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_3x4 takes 3 inputs and produces 4 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_3x4(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_3x5 takes 3 inputs and produces 5 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_3x5(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_3x6 takes 3 inputs and produces 6 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_3x6(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_3x7 takes 3 inputs and produces 7 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_3x7(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_3x8 takes 3 inputs and produces 8 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_3x8(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_3x9 takes 3 inputs and produces 9 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_3x9(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_3x10 takes 3 inputs and produces 10 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_3x10(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_4x1 takes 4 inputs and produces 1 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_4x1(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_4x1_64 takes 4 inputs and produces 1 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_4x1_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_4x2 takes 4 inputs and produces 2 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_4x2(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_4x2_64 takes 4 inputs and produces 2 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_4x2_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_4x3 takes 4 inputs and produces 3 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_4x3(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_4x3_64 takes 4 inputs and produces 3 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_4x3_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_4x4 takes 4 inputs and produces 4 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_4x4(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_4x5 takes 4 inputs and produces 5 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_4x5(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_4x6 takes 4 inputs and produces 6 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_4x6(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_4x7 takes 4 inputs and produces 7 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_4x7(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_4x8 takes 4 inputs and produces 8 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_4x8(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_4x9 takes 4 inputs and produces 9 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_4x9(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_4x10 takes 4 inputs and produces 10 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_4x10(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_5x1 takes 5 inputs and produces 1 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_5x1(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_5x1_64 takes 5 inputs and produces 1 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_5x1_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_5x2 takes 5 inputs and produces 2 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_5x2(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_5x2_64 takes 5 inputs and produces 2 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_5x2_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_5x3 takes 5 inputs and produces 3 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_5x3(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_5x3_64 takes 5 inputs and produces 3 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_5x3_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_5x4 takes 5 inputs and produces 4 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_5x4(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_5x5 takes 5 inputs and produces 5 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_5x5(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_5x6 takes 5 inputs and produces 6 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_5x6(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_5x7 takes 5 inputs and produces 7 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_5x7(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_5x8 takes 5 inputs and produces 8 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_5x8(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_5x9 takes 5 inputs and produces 9 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_5x9(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_5x10 takes 5 inputs and produces 10 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_5x10(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_6x1 takes 6 inputs and produces 1 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_6x1(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_6x1_64 takes 6 inputs and produces 1 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_6x1_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_6x2 takes 6 inputs and produces 2 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_6x2(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_6x2_64 takes 6 inputs and produces 2 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_6x2_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_6x3 takes 6 inputs and produces 3 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_6x3(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_6x3_64 takes 6 inputs and produces 3 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_6x3_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_6x4 takes 6 inputs and produces 4 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_6x4(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_6x5 takes 6 inputs and produces 5 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_6x5(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_6x6 takes 6 inputs and produces 6 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_6x6(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_6x7 takes 6 inputs and produces 7 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_6x7(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_6x8 takes 6 inputs and produces 8 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_6x8(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_6x9 takes 6 inputs and produces 9 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_6x9(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_6x10 takes 6 inputs and produces 10 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_6x10(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_7x1 takes 7 inputs and produces 1 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_7x1(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_7x1_64 takes 7 inputs and produces 1 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_7x1_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_7x2 takes 7 inputs and produces 2 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_7x2(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_7x2_64 takes 7 inputs and produces 2 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_7x2_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_7x3 takes 7 inputs and produces 3 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_7x3(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_7x3_64 takes 7 inputs and produces 3 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_7x3_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_7x4 takes 7 inputs and produces 4 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_7x4(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_7x5 takes 7 inputs and produces 5 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_7x5(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_7x6 takes 7 inputs and produces 6 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_7x6(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_7x7 takes 7 inputs and produces 7 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_7x7(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_7x8 takes 7 inputs and produces 8 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_7x8(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_7x9 takes 7 inputs and produces 9 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_7x9(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_7x10 takes 7 inputs and produces 10 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_7x10(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_8x1 takes 8 inputs and produces 1 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_8x1(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_8x1_64 takes 8 inputs and produces 1 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_8x1_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_8x2 takes 8 inputs and produces 2 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_8x2(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_8x2_64 takes 8 inputs and produces 2 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_8x2_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_8x3 takes 8 inputs and produces 3 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_8x3(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_8x3_64 takes 8 inputs and produces 3 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_8x3_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_8x4 takes 8 inputs and produces 4 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_8x4(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_8x5 takes 8 inputs and produces 5 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_8x5(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_8x6 takes 8 inputs and produces 6 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_8x6(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_8x7 takes 8 inputs and produces 7 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_8x7(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_8x8 takes 8 inputs and produces 8 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_8x8(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_8x9 takes 8 inputs and produces 9 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_8x9(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_8x10 takes 8 inputs and produces 10 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_8x10(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_9x1 takes 9 inputs and produces 1 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_9x1(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_9x1_64 takes 9 inputs and produces 1 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_9x1_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_9x2 takes 9 inputs and produces 2 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_9x2(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_9x2_64 takes 9 inputs and produces 2 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_9x2_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_9x3 takes 9 inputs and produces 3 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_9x3(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_9x3_64 takes 9 inputs and produces 3 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_9x3_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_9x4 takes 9 inputs and produces 4 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_9x4(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_9x5 takes 9 inputs and produces 5 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_9x5(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_9x6 takes 9 inputs and produces 6 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_9x6(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_9x7 takes 9 inputs and produces 7 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_9x7(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_9x8 takes 9 inputs and produces 8 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_9x8(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_9x9 takes 9 inputs and produces 9 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_9x9(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_9x10 takes 9 inputs and produces 10 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_9x10(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_10x1 takes 10 inputs and produces 1 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_10x1(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_10x1_64 takes 10 inputs and produces 1 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_10x1_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_10x2 takes 10 inputs and produces 2 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_10x2(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_10x2_64 takes 10 inputs and produces 2 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_10x2_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_10x3 takes 10 inputs and produces 3 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_10x3(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_10x3_64 takes 10 inputs and produces 3 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_10x3_64(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Generate AVX2 code (#141) Replaces AVX2 up to 10x8 configurations with specific generated functions. If code size is a concern `-tags=nogen` can be used. Biggest speedup when not memory constrained. ``` benchmark old MB/s new MB/s speedup BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x benchmark old MB/s new MB/s speedup BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x ```
2020-05-20 13:48:34 +03:00
// mulAvxTwo_10x4 takes 10 inputs and produces 4 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_10x4(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_10x5 takes 10 inputs and produces 5 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_10x5(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_10x6 takes 10 inputs and produces 6 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_10x6(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_10x7 takes 10 inputs and produces 7 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_10x7(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_10x8 takes 10 inputs and produces 8 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_10x8(matrix []byte, in [][]byte, out [][]byte, start int, n int)
Wider AVX2 loops and less usage. (#162) * Experiment with 64 bytes/loop AVX2 * Only reduce when doing 64. * Use no more than 8 goroutines for avx2 codegen. ``` name old speed new speed delta Encode10x2x10000-32 33.3GB/s ± 0% 37.5GB/s ± 1% +12.49% (p=0.000 n=9+10) Encode100x20x10000-32 3.79GB/s ± 5% 3.77GB/s ± 5% ~ (p=0.853 n=10+10) Encode17x3x1M-32 78.2GB/s ± 1% 76.0GB/s ± 6% ~ (p=0.123 n=10+10) Encode10x4x16M-32 28.3GB/s ± 0% 27.7GB/s ± 2% -2.32% (p=0.000 n=8+10) Encode5x2x1M-32 112GB/s ± 1% 113GB/s ± 1% ~ (p=0.796 n=10+10) Encode10x2x1M-32 149GB/s ± 1% 129GB/s ± 3% -13.24% (p=0.000 n=9+10) Encode10x4x1M-32 99.1GB/s ± 1% 91.5GB/s ± 3% -7.74% (p=0.000 n=10+10) Encode50x20x1M-32 19.7GB/s ± 1% 19.8GB/s ± 1% ~ (p=0.447 n=9+10) Encode17x3x16M-32 33.4GB/s ± 0% 33.3GB/s ± 1% -0.46% (p=0.043 n=10+9) Encode_8x4x8M-32 30.1GB/s ± 1% 29.4GB/s ± 3% -2.31% (p=0.000 n=10+10) Encode_12x4x12M-32 30.6GB/s ± 0% 30.5GB/s ± 0% ~ (p=0.720 n=10+9) Encode_16x4x16M-32 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.497 n=10+9) Encode_16x4x32M-32 31.9GB/s ± 0% 31.5GB/s ± 4% ~ (p=0.165 n=10+10) Encode_16x4x64M-32 32.4GB/s ± 0% 32.3GB/s ± 0% ~ (p=0.321 n=9+8) Encode_8x5x8M-32 28.4GB/s ± 0% 28.4GB/s ± 1% ~ (p=0.237 n=10+8) Encode_8x6x8M-32 27.0GB/s ± 0% 27.2GB/s ± 2% ~ (p=0.075 n=10+10) Encode_8x7x8M-32 26.0GB/s ± 1% 25.8GB/s ± 1% -0.53% (p=0.003 n=9+10) Encode_8x9x8M-32 24.6GB/s ± 1% 24.4GB/s ± 1% -0.63% (p=0.000 n=10+10) Encode_8x10x8M-32 23.7GB/s ± 1% 23.7GB/s ± 0% +0.32% (p=0.035 n=10+9) Encode_8x11x8M-32 23.0GB/s ± 1% 22.8GB/s ± 0% -0.59% (p=0.000 n=9+8) Encode_8x8x05M-32 66.4GB/s ± 1% 64.2GB/s ± 1% -3.32% (p=0.000 n=10+10) Encode_8x8x1M-32 56.7GB/s ± 0% 75.7GB/s ± 2% +33.55% (p=0.000 n=9+9) Encode_8x8x8M-32 24.9GB/s ± 0% 24.9GB/s ± 1% ~ (p=0.146 n=8+10) Encode_8x8x32M-32 23.8GB/s ± 0% 23.4GB/s ± 0% -1.42% (p=0.000 n=9+10) Encode_24x8x24M-32 29.9GB/s ± 0% 29.9GB/s ± 0% ~ (p=0.278 n=10+9) Encode_24x8x48M-32 30.7GB/s ± 1% 30.7GB/s ± 0% ~ (p=0.351 n=9+7) StreamEncode10x2x10000-32 15.5GB/s ± 1% 16.5GB/s ± 0% +6.53% (p=0.000 n=10+9) StreamEncode100x20x10000-32 2.09GB/s ± 1% 2.06GB/s ± 2% -1.78% (p=0.000 n=10+10) StreamEncode17x3x1M-32 12.2GB/s ± 2% 12.3GB/s ± 1% +1.19% (p=0.008 n=10+9) StreamEncode10x4x16M-32 8.68GB/s ± 0% 9.47GB/s ± 1% +9.05% (p=0.000 n=8+10) StreamEncode5x2x1M-32 12.3GB/s ± 1% 13.2GB/s ± 1% +7.61% (p=0.000 n=10+10) StreamEncode10x2x1M-32 11.5GB/s ± 4% 13.3GB/s ± 2% +15.15% (p=0.000 n=10+7) ```
2021-06-21 16:15:23 +03:00
// mulAvxTwo_10x9 takes 10 inputs and produces 9 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_10x9(matrix []byte, in [][]byte, out [][]byte, start int, n int)
// mulAvxTwo_10x10 takes 10 inputs and produces 10 outputs.
// The output is initialized to 0.
//go:noescape
func mulAvxTwo_10x10(matrix []byte, in [][]byte, out [][]byte, start int, n int)