OK -- once again, I have messed with the structure. My goal is for it to be flexible and efficient. It's similar to the stuff before, but better, because it makes things like Euclid's method much cleaner.
I think we're ready to hack.
typedef struct gf {
  gf_func_a_b multiply;
  gf_func_a_b divide;
  gf_func_a   inverse;
  gf_region   multiply_region;
  void        *scratch;
} gf_t;
We can beef it up later with buf-buf or buf-acc. The problem is that the paper is already bloated, so right now, I want to keep it lean.
The types of the procedures are big unions, so that they work with the following types of arguments:
typedef uint8_t   gf_val_4_t;
typedef uint8_t   gf_val_8_t;
typedef uint16_t  gf_val_16_t;
typedef uint32_t  gf_val_32_t;
typedef uint64_t  gf_val_64_t;
typedef uint64_t *gf_val_128_t;
typedef uint32_t  gf_val_gen_t;   /* The intent here is for general values <= 32 */
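To make the "big unions" idea concrete, here is a minimal sketch of a union of function pointers with one member per word size. Everything named demo_* is hypothetical and is not the real gf.h; the multiply routine is a plain shift-and-add in GF(2^4) using the polynomial 0x13.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature of the design: one union of function pointers,
   one member per word size. These are NOT the real gf.h definitions. */
struct demo_gf;

typedef union {
  uint8_t  (*w4)(struct demo_gf *gf, uint8_t a, uint8_t b);
  uint16_t (*w16)(struct demo_gf *gf, uint16_t a, uint16_t b);
} demo_func_a_b;

typedef struct demo_gf {
  demo_func_a_b multiply;
  void *scratch;
} demo_gf_t;

/* Shift-and-add multiply in GF(2^4), reducing by x^4 + x + 1 (0x13). */
static uint8_t demo_w4_multiply(demo_gf_t *gf, uint8_t a, uint8_t b)
{
  uint8_t p = 0;

  (void) gf;                      /* this simple method needs no scratch */
  assert(a < 16 && b < 16);
  while (b != 0) {
    if (b & 1) p ^= a;            /* add a when the low bit of b is set */
    a <<= 1;                      /* a = a * x */
    if (a & 0x10) a ^= 0x13;      /* reduce when the degree reaches 4 */
    b >>= 1;
  }
  return p;
}

/* Fill in the w4 member; callers then use gf->multiply.w4(gf, a, b). */
static void demo_init_w4(demo_gf_t *gf)
{
  gf->multiply.w4 = demo_w4_multiply;
  gf->scratch = NULL;
}
```

The caller picks the union member that matches its word size, so one struct layout serves every w without casts at the call site.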
To use one of these, you need to create one with gf_init_easy() or gf_init_hard(). Let's concentrate on the former:
extern int gf_init_easy(gf_t *gf, int w, int mult_type);
You pass it memory for a gf_t, a value of w and a variable that says how to do multiplication. The valid values of mult_type are enumerated in gf.h:
typedef enum {
  GF_MULT_DEFAULT,
  GF_MULT_SHIFT,
  GF_MULT_GROUP,
  GF_MULT_BYTWO_p,
  GF_MULT_BYTWO_b,
  GF_MULT_TABLE,
  GF_MULT_LOG_TABLE,
  GF_MULT_SPLIT_TABLE,
  GF_MULT_COMPOSITE
} gf_mult_type_t;
After creating the gf_t, you use its multiply method to multiply, using the union's fields to work with the various types. It looks easier than my explanation. For example, suppose you wanted to multiply 5 and 4 in GF(2^4). You can do it as in gf_54.c:
#include <stdio.h>
#include <stdlib.h>
#include "gf.h"

int main()
{
  gf_t gf;

  gf_init_easy(&gf, 4, GF_MULT_DEFAULT);
  printf("%d\n", gf.multiply.w4(&gf, 5, 4));
  exit(0);
}
If you wanted to multiply in GF(2^8), then you'd have to use 8 as a parameter to gf_init_easy(), and call the multiplier as gf.multiply.w8().
When you're done with your gf_t, you should call gf_free() on it so that it can free memory that it has allocated. We'll talk more about memory later, but if you create your gf_t with gf_init_easy, then it calls malloc(), and if you care about freeing memory, you'll have to call gf_free().
If you don't want to have the initialization call allocate memory, you can use gf_init_hard():
extern int gf_init_hard(gf_t *gf,
                        int w,
                        int mult_type,
                        int region_type,
                        int divide_type,
                        uint64_t prim_poly,
                        int arg1,
                        int arg2,
                        gf_t *base_gf,
                        void *scratch_memory);
The first three parameters are the same as gf_init_easy(). The region_type and divide_type arguments select how multiply_region and division are performed. Their values are also defined in gf.h. The region_type values are bit flags, so you can mix them (e.g. "DOUBLE" and "SSE"):
#define GF_REGION_DEFAULT      (0x0)
#define GF_REGION_SINGLE_TABLE (0x1)
#define GF_REGION_DOUBLE_TABLE (0x2)
#define GF_REGION_QUAD_TABLE   (0x4)
#define GF_REGION_LAZY         (0x8)
#define GF_REGION_SSE          (0x10)
#define GF_REGION_NOSSE        (0x20)
#define GF_REGION_STDMAP       (0x40)
#define GF_REGION_ALTMAP       (0x80)
#define GF_REGION_CAUCHY       (0x100)

typedef uint32_t gf_region_type_t;

typedef enum {
  GF_DIVIDE_DEFAULT,
  GF_DIVIDE_MATRIX,
  GF_DIVIDE_EUCLID
} gf_division_type_t;
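As a small illustration of mixing region_type values, the sketch below ORs two flags together and tests for them. The flag values are copied from the excerpt above; the demo_* helpers are hypothetical, and treating SSE together with NOSSE as inconsistent is my assumption, not something gf.h is known to enforce.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the gf.h excerpt above. */
#define GF_REGION_DEFAULT      (0x0)
#define GF_REGION_DOUBLE_TABLE (0x2)
#define GF_REGION_LAZY         (0x8)
#define GF_REGION_SSE          (0x10)
#define GF_REGION_NOSSE        (0x20)

typedef uint32_t gf_region_type_t;

/* Hypothetical helper: 1 if every bit of `flag` is set in `rt`. */
static int demo_region_has(gf_region_type_t rt, gf_region_type_t flag)
{
  return (rt & flag) == flag;
}

/* Hypothetical sanity check (an assumption): asking for SSE and NOSSE
   at the same time is contradictory. */
static int demo_region_is_consistent(gf_region_type_t rt)
{
  return !(demo_region_has(rt, GF_REGION_SSE) &&
           demo_region_has(rt, GF_REGION_NOSSE));
}

/* Build the "DOUBLE" and "SSE" combination mentioned in the text. */
static gf_region_type_t demo_double_sse(void)
{
  gf_region_type_t rt = GF_REGION_DOUBLE_TABLE | GF_REGION_SSE;

  assert(demo_region_is_consistent(rt));
  return rt;
}
```

A value built this way is what you would hand to the region_type parameter of gf_init_hard().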
You can change the primitive polynomial with prim_poly, give additional arguments with arg1 and arg2 and give a base Galois Field for composite fields. Finally, you can pass it a pointer to memory in scratch_memory. That way, you can avoid having gf_init_hard() call malloc().
There is a procedure called gf_scratch_size() that lets you know the minimum size for scratch_memory, depending on w, the multiplication type and the arguments:
extern int gf_scratch_size(int w,
                           int mult_type,
                           int region_type,
                           int divide_type,
                           int arg1,
                           int arg2);
You can specify default arguments in gf_init_hard():
gf_free() frees memory that was allocated with gf_init_easy() or gf_init_hard(). The recursive parameter is in case you use composite fields, and want to recursively free the base fields. If you pass scratch_memory to gf_init_hard(), then you typically don't need to call gf_free(). It won't hurt to call it, though.
UNIX> gf_mult 7 11 4 - Default
4
UNIX> gf_mult 7 11 4 SHIFT - - - Use shift
4
UNIX> gf_mult 7 11 4 LOG - - - Use logs
4
UNIX> gf_div 4 7 4 - Default
11
UNIX> gf_div 4 7 4 LOG - - - Use logs
11
UNIX> gf_div 4 7 4 LOG - EUCLID - Use Euclid instead of logs
11
UNIX> gf_div 4 7 4 LOG - MATRIX - Use Matrix inversion instead of logs
11
UNIX> gf_div 4 7 4 SHIFT - - - Default
11
UNIX> gf_div 4 7 4 SHIFT - EUCLID - Use Euclid (which is the default)
11
UNIX> gf_div 4 7 4 SHIFT - MATRIX - Use Matrix inversion instead of logs
11
UNIX> gf_mult 200 211 8 - The remainder are shift/Euclid
201
UNIX> gf_div 201 211 8
200
UNIX> gf_mult 60000 65111 16
63515
UNIX> gf_div 63515 65111 16
60000
UNIX> gf_mult abcd0001 9afbf788 32h
b0359681
UNIX> gf_div b0359681 9afbf788 32h
abcd0001
UNIX> gf_mult abcd00018c8b8c8a 9afbf7887f6d8e5b 64h
3a7def35185bd571
UNIX> gf_mult abcd00018c8b8c8a 9afbf7887f6d8e5b 64h
3a7def35185bd571
UNIX> gf_div 3a7def35185bd571 9afbf7887f6d8e5b 64h
abcd00018c8b8c8a
UNIX>

You can see all the methods with gf_methods. We have a lot of implementing to do:
UNIX> gf_methods
To specify the methods, do one of the following:
       - leave empty to use defaults
       - use a single dash to use defaults
       - specify MULTIPLY REGION DIVIDE

Legal values of MULTIPLY:
       SHIFT: shift
       GROUP g_mult g_reduce: the Group technique - see the paper
       BYTWO_p: BYTWO doubling the product.
       BYTWO_b: BYTWO doubling b (more efficient thatn BYTWO_p)
       TABLE: Full multiplication table
       LOG: Discrete logs
       LOG_ZERO: Discrete logs with a large table for zeros
       SPLIT g_a g_b: Split tables defined by g_a and g_b
       COMPOSITE k l [METHOD]: Composite field, recursively specify the
                               method of the base field in GF(2^l)

Legal values of REGION: Specify multiples with commas e.g. 'DOUBLE,LAZY'
       -: Use defaults
       SINGLE/DOUBLE/QUAD: Expand tables
       LAZY: Lazily create table (only applies to TABLE and SPLIT)
       SSE/NOSSE: Use 128-bit SSE instructions if you can
       CAUCHY/ALTMAP/STDMAP: Use different memory mappings

Legal values of DIVIDE:
       -: Use defaults
       MATRIX: Use matrix inversion
       EUCLID: Use the extended Euclidian algorithm.

See the user's manual for more information.
There are many restrictions, so it is better to simply use defaults in most cases.
UNIX>
UNIX> gf_unit w tests seed [METHOD]
UNIX> gf_time w tests seed size(bytes) iterations [METHOD]
The tests parameter is one or more of the following characters:
For example, testing the defaults with w=4:
UNIX> gf_unit 4 AV 1 LOG - -
Seed: 1
Testing single multiplications/divisions.
Testing Inversions.
Testing buffer-constant, src != dest, xor = 0
Testing buffer-constant, src != dest, xor = 1
Testing buffer-constant, src == dest, xor = 0
Testing buffer-constant, src == dest, xor = 1
UNIX> gf_unit 4 AV 1 SHIFT - -
Seed: 1
Testing single multiplications/divisions.
Testing Inversions.
No multiply_region.
UNIX>

There is no multiply_region() method defined for SHIFT. Thus, the procedures are NULL and the unit tester ignores them.
At the moment, I only have the unit tester working for w=4.
gf_time takes the size of an array (in bytes) and a number of iterations, and tests the speed of both single and region operations. The tests are:
UNIX> gf_time 4 A 1 102400 1024 LOG - -
Seed: 1
Multiply:   0.538126 s      185.830 Mega-ops/s
Divide:     0.520825 s      192.003 Mega-ops/s
Inverse:    0.631198 s      158.429 Mega-ops/s
Buffer-Const,s!=d,xor=0:    0.478395 s      209.032 MB/s
Buffer-Const,s!=d,xor=1:    0.524245 s      190.751 MB/s
Buffer-Const,s==d,xor=0:    0.471851 s      211.931 MB/s
Buffer-Const,s==d,xor=1:    0.528275 s      189.295 MB/s
UNIX> gf_time 4 A 1 102400 1024 LOG - EUCLID
Seed: 1
Multiply:   0.555512 s      180.014 Mega-ops/s
Divide:     5.359434 s       18.659 Mega-ops/s
Inverse:    4.911719 s       20.359 Mega-ops/s
Buffer-Const,s!=d,xor=0:    0.496097 s      201.573 MB/s
Buffer-Const,s!=d,xor=1:    0.538536 s      185.689 MB/s
Buffer-Const,s==d,xor=0:    0.485564 s      205.946 MB/s
Buffer-Const,s==d,xor=1:    0.540227 s      185.107 MB/s
UNIX> gf_time 4 A 1 102400 1024 LOG - MATRIX
Seed: 1
Multiply:   0.544005 s      183.822 Mega-ops/s
Divide:     7.602822 s       13.153 Mega-ops/s
Inverse:    7.000564 s       14.285 Mega-ops/s
Buffer-Const,s!=d,xor=0:    0.474868 s      210.585 MB/s
Buffer-Const,s!=d,xor=1:    0.527588 s      189.542 MB/s
Buffer-Const,s==d,xor=0:    0.473130 s      211.358 MB/s
Buffer-Const,s==d,xor=1:    0.529877 s      188.723 MB/s
UNIX> gf_time 4 A 1 102400 1024 SHIFT - -
Seed: 1
Multiply:   2.708842 s       36.916 Mega-ops/s
Divide:     8.756882 s       11.420 Mega-ops/s
Inverse:    5.695511 s       17.558 Mega-ops/s
UNIX>

At the moment, I only have the timer working for w=4.
gf_init_hard()'s first job is to set up the scratch. The scratch's type is gf_internal_t, defined in gf_int.h:
typedef struct {
  int mult_type;
  int region_type;
  int divide_type;
  int w;
  uint64_t prim_poly;
  int free_me;
  int arg1;
  int arg2;
  gf_t *base_gf;
  void *private;
} gf_internal_t;
All the fields are straightforward, with the exception of private. That is a (void *) which points to the implementation's private data.
Here's the code for gf_init_hard():
int gf_init_hard(gf_t *gf, int w, int mult_type,
                 int region_type, int divide_type,
                 uint64_t prim_poly, int arg1, int arg2,
                 gf_t *base_gf, void *scratch_memory)
{
  int sz;
  gf_internal_t *h;

  if (scratch_memory == NULL) {
    sz = gf_scratch_size(w, mult_type, region_type, divide_type, arg1, arg2);
    if (sz <= 0) return 0;
    h = (gf_internal_t *) malloc(sz);
    if (h == NULL) return 0;            /* allocation failed */
    h->free_me = 1;
  } else {
    h = scratch_memory;
    h->free_me = 0;
  }
  gf->scratch = (void *) h;
  h->mult_type = mult_type;
  h->region_type = region_type;
  h->divide_type = divide_type;
  h->w = w;
  h->prim_poly = prim_poly;
  h->arg1 = arg1;
  h->arg2 = arg2;
  h->base_gf = base_gf;
  /* Private data starts right after the gf_internal_t header.
     Cast to (char *), since arithmetic on (void *) is a GCC extension. */
  h->private = (void *) ((char *) gf->scratch + sizeof(gf_internal_t));

  switch(w) {
    case 4:   return gf_w4_init(gf);
    case 8:   return gf_w8_init(gf);
    case 16:  return gf_w16_init(gf);
    case 32:  return gf_w32_init(gf);
    case 64:  return gf_w64_init(gf);
    case 128: return gf_dummy_init(gf);
    default:  return 0;
  }
}
The first thing it does is determine if it has to allocate space for scratch. If it must, it uses gf_scratch_size() to figure out how big the space must be. It then sets gf->scratch to this space, and sets all of the fields of the scratch to the arguments in gf_init_hard(). The private pointer is set to the space just after the gf_internal_t at the start of gf->scratch. Again, it is up to gf_scratch_size() to make sure there is enough space for the scratch, and for all of the private data needed by the implementation.
Once the scratch is set up, gf_init_hard() calls the initialization routine for the given w. For w=4 that is gf_w4_init(). This is in gf_w4.c, and it is a simple dispatcher to the various initialization routines, plus it sets EUCLID and MATRIX if need be:
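The layout described above (a header followed by private data in one block of memory) can be sketched on its own. This is a simplified stand-in, not the real gf_internal_t: the demo_* names are hypothetical, the header is cut down to three fields, and caller-supplied memory is assumed to be suitably aligned for the header.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for gf_internal_t and an implementation's data. */
typedef struct {
  int w;
  int free_me;
  void *private_data;     /* the notes call this field `private` */
} demo_internal_t;

typedef struct {
  uint8_t table[16];
} demo_private_t;

/* Bytes needed for the header plus the private data that follows it. */
static size_t demo_scratch_size(void)
{
  return sizeof(demo_internal_t) + sizeof(demo_private_t);
}

/* Put the header at the start of `mem` and point private_data just past
   it. If mem is NULL, allocate the block and record that we own it. */
static demo_internal_t *demo_setup(void *mem, int w)
{
  demo_internal_t *h;
  int allocated = (mem == NULL);

  assert(w > 0);
  if (allocated) {
    mem = malloc(demo_scratch_size());
    if (mem == NULL) return NULL;
  }
  h = (demo_internal_t *) mem;
  h->w = w;
  h->free_me = allocated;
  h->private_data = (uint8_t *) mem + sizeof(demo_internal_t);
  return h;
}

/* Free only what demo_setup() allocated itself. */
static void demo_release(demo_internal_t *h)
{
  if (h != NULL && h->free_me) free(h);
}
```

Because free_me records who owns the block, calling the release routine on caller-supplied memory is harmless, which matches the remark that calling gf_free() "won't hurt".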
int gf_w4_init(gf_t *gf)
{
  gf_internal_t *h;

  h = (gf_internal_t *) gf->scratch;
  if (h->prim_poly == 0) h->prim_poly = 0x13;

  gf->multiply.w4 = NULL;
  gf->divide.w4 = NULL;
  gf->inverse.w4 = NULL;
  gf->multiply_region.w4 = NULL;

  switch(h->mult_type) {
    case GF_MULT_SHIFT:     if (gf_w4_shift_init(gf) == 0) return 0; break;
    case GF_MULT_LOG_TABLE: if (gf_w4_log_init(gf) == 0) return 0; break;
    case GF_MULT_DEFAULT:   if (gf_w4_log_init(gf) == 0) return 0; break;
    default: return 0;
  }
  if (h->divide_type == GF_DIVIDE_EUCLID) {
    gf->divide.w4 = gf_w4_divide_from_inverse;
    gf->inverse.w4 = gf_w4_euclid;
  } else if (h->divide_type == GF_DIVIDE_MATRIX) {
    gf->divide.w4 = gf_w4_divide_from_inverse;
    gf->inverse.w4 = gf_w4_matrix;
  }

  if (gf->inverse.w4 != NULL && gf->divide.w4 == NULL) {
    gf->divide.w4 = gf_w4_divide_from_inverse;
  }
  if (gf->inverse.w4 == NULL && gf->divide.w4 != NULL) {
    gf->inverse.w4 = gf_w4_inverse_from_divide;
  }
  return 1;
}
The code in gf_w4_log_init() sets up the log and antilog tables, and sets the multiply.w4, divide.w4 etc routines to be the ones for logs. The tables are put into the scratch's private area (h->private), which is typecast to a struct gf_logtable_data *:
struct gf_logtable_data {
  gf_val_4_t log_tbl[GF_FIELD_SIZE];
  gf_val_4_t antilog_tbl[GF_FIELD_SIZE * 2];
  gf_val_4_t *antilog_tbl_div;
};
.......
static int gf_w4_log_init(gf_t *gf)
{
  gf_internal_t *h;
  struct gf_logtable_data *ltd;
  int i, b;

  h = (gf_internal_t *) gf->scratch;
  ltd = h->private;

  ltd->log_tbl[0] = 0;

  ltd->antilog_tbl_div = ltd->antilog_tbl + (GF_FIELD_SIZE-1);
  b = 1;
  for (i = 0; i < GF_FIELD_SIZE-1; i++) {
    ltd->log_tbl[b] = (gf_val_8_t)i;
    ltd->antilog_tbl[i] = (gf_val_8_t)b;
    ltd->antilog_tbl[i+GF_FIELD_SIZE-1] = (gf_val_8_t)b;
    b <<= 1;
    if (b & GF_FIELD_SIZE) {
      b = b ^ h->prim_poly;
    }
  }

  gf->inverse.w4 = gf_w4_inverse_from_divide;
  gf->divide.w4 = gf_w4_log_divide;
  gf->multiply.w4 = gf_w4_log_multiply;
  gf->multiply_region.w4 = gf_w4_log_multiply_region;
  return 1;
}
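For a feel of what the EUCLID path does, here is a standalone sketch of inversion in GF(2^4) by one textbook formulation of the extended Euclidean algorithm over GF(2) polynomials, with division derived from it. It is not the code behind gf_w4_euclid() or gf_w4_divide_from_inverse(); all demo_* names are hypothetical, and the polynomial is the 0x13 default from gf_w4_init().

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PRIM_POLY 0x13u   /* x^4 + x + 1 */

/* Shift-and-add multiply in GF(2^4). */
static uint8_t demo_w4_multiply(uint8_t a, uint8_t b)
{
  uint8_t p = 0;

  assert(a < 16 && b < 16);
  while (b != 0) {
    if (b & 1) p ^= a;
    a <<= 1;
    if (a & 0x10) a ^= DEMO_PRIM_POLY;
    b >>= 1;
  }
  return p;
}

/* Degree of a GF(2) polynomial stored as bits; -1 for the zero polynomial. */
static int demo_degree(unsigned v)
{
  int d = -1;

  while (v != 0) { v >>= 1; d++; }
  return d;
}

/* Inverse of a nonzero element. Invariants modulo the polynomial:
   g1 * a == u and g2 * a == v. When u reaches 1, g1 is the inverse. */
static uint8_t demo_w4_euclid(uint8_t a)
{
  unsigned u = a, v = DEMO_PRIM_POLY, g1 = 1, g2 = 0, t;
  int j;

  assert(a != 0 && a < 16);
  while (u != 1) {
    j = demo_degree(u) - demo_degree(v);
    if (j < 0) {                  /* keep u as the higher-degree operand */
      t = u;  u = v;   v = t;
      t = g1; g1 = g2; g2 = t;
      j = -j;
    }
    u  ^= v  << j;                /* cancel the leading term of u */
    g1 ^= g2 << j;
  }
  return (uint8_t) g1;
}

/* a / b computed as a * b^-1. */
static uint8_t demo_w4_divide_from_inverse(uint8_t a, uint8_t b)
{
  return demo_w4_multiply(a, demo_w4_euclid(b));
}
```

With these definitions, dividing 4 by 7 gives 11, in line with the gf_div session earlier in the notes.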
And of course the individual routines use h->private to access the tables:
static inline gf_val_8_t gf_w4_log_multiply (gf_t *gf, gf_val_8_t a, gf_val_8_t b)
{
  struct gf_logtable_data *ltd;

  ltd = (struct gf_logtable_data *) ((gf_internal_t *) (gf->scratch))->private;
  return (a == 0 || b == 0) ? 0 : ltd->antilog_tbl[(unsigned)(ltd->log_tbl[a] + ltd->log_tbl[b])];
}
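To try the log/antilog idea without the rest of the library, here is a standalone sketch for GF(2^4) with the polynomial 0x13. The table construction mirrors the loop in gf_w4_log_init(), but the demo_* names, the flat struct, and the divide routine are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_FIELD_SIZE 16
#define DEMO_PRIM_POLY  0x13   /* x^4 + x + 1 */

typedef struct {
  uint8_t log_tbl[DEMO_FIELD_SIZE];
  uint8_t antilog_tbl[DEMO_FIELD_SIZE * 2];   /* doubled: no modulo needed */
} demo_logtable_t;

/* Walk the powers of 2 (the polynomial x), recording log and antilog. */
static void demo_log_init(demo_logtable_t *t)
{
  int i, b = 1;

  memset(t, 0, sizeof(*t));       /* log_tbl[0] is only a placeholder */
  for (i = 0; i < DEMO_FIELD_SIZE - 1; i++) {
    t->log_tbl[b] = (uint8_t) i;
    t->antilog_tbl[i] = (uint8_t) b;
    t->antilog_tbl[i + DEMO_FIELD_SIZE - 1] = (uint8_t) b;
    b <<= 1;
    if (b & DEMO_FIELD_SIZE) b ^= DEMO_PRIM_POLY;
  }
}

/* a * b = antilog(log a + log b); zero is handled separately. */
static uint8_t demo_log_multiply(const demo_logtable_t *t, uint8_t a, uint8_t b)
{
  assert(a < DEMO_FIELD_SIZE && b < DEMO_FIELD_SIZE);
  if (a == 0 || b == 0) return 0;
  return t->antilog_tbl[t->log_tbl[a] + t->log_tbl[b]];
}

/* a / b = antilog(log a - log b); adding 15 keeps the index non-negative. */
static uint8_t demo_log_divide(const demo_logtable_t *t, uint8_t a, uint8_t b)
{
  assert(a < DEMO_FIELD_SIZE && b != 0 && b < DEMO_FIELD_SIZE);
  if (a == 0) return 0;
  return t->antilog_tbl[t->log_tbl[a] + (DEMO_FIELD_SIZE - 1) - t->log_tbl[b]];
}
```

The doubled antilog table is what lets both routines skip a modulo: the largest index used is 14 + 15 = 29, which still falls inside the 32-entry array.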
Finally, it's important that the proper sizes are put into gf_w4_scratch_size() for each implementation:
int gf_w4_scratch_size(int mult_type, int region_type, int divide_type, int arg1, int arg2)
{
  int region_tbl_size;

  switch(mult_type)
  {
    case GF_MULT_DEFAULT:
    case GF_MULT_LOG_TABLE:
      return sizeof(gf_internal_t) + sizeof(struct gf_logtable_data) + 64;
      break;
    case GF_MULT_SHIFT:
      return sizeof(gf_internal_t);
      break;
    default:
      return -1;
  }
}
I hope that's enough explanation for y'all to start implementing. Let me know if you have problems -- thanks -- Jim
For example, the log techniques for w=4 are:
gf_w4_log_multiply()
gf_w4_log_divide()
gf_w4_log_multiply_region()
gf_w4_log_init()
SHIFT | Done - Jim |
BYTWO_p | Done - Jim |
BYTWO_b | Done - Jim |
BYTWO_p, SSE | Done - Jim |
BYTWO_b, SSE | Done - Jim |
Single TABLE | Done - Jim |
Double TABLE | Done - Jim |
Double TABLE, SSE | Done - Jim |
Quad TABLE | Done - Jim |
Lazy Quad TABLE | Done - Jim |
LOG | Done - Jim |
SHIFT | Done - Jim |
BYTWO_p | Done - Jim |
BYTWO_b | Done - Jim |
BYTWO_p, SSE | Done - Jim |
BYTWO_b, SSE | Done - Jim |
Single TABLE | Done - Kevin |
Double TABLE | Done - Jim |
Lazy Double TABLE | Done - Jim |
Split 2 1 (Half) SSE | Done - Jim |
Composite, k=2 | Done - Kevin (alt mapping not passing unit test) |
LOG | Done - Kevin |
LOG ZERO | Done - Jim |
SHIFT | Done - Jim |
BYTWO_p | Done - Jim |
BYTWO_b | Done - Jim |
BYTWO_p, SSE | Done - Jim |
BYTWO_b, SSE | Done - Jim |
Lazy TABLE | Done - Jim |
Split 4 16 No-SSE, lazy | Done - Jim |
Split 4 16 SSE, lazy | Done - Jim |
Split 4 16 SSE, lazy, alternate mapping | Done - Jim |
Split 8 16, lazy | Done - Jim |
Composite, k=2, stdmap recursive | Done - Kevin |
Composite, k=2, altmap recursive | Done - Kevin |
Composite, k=2, stdmap inline | Done - Kevin |
LOG | Done - Kevin |
LOG ZERO | Done - Kevin |
Group 4 4 | Done - Jim: I don't see a reason to implement others, although 4-8 will be faster, and 8 8 will have faster region ops. They'll never beat SPLIT. |
SHIFT | Done - Jim |
BYTWO_p | Done - Jim |
BYTWO_b | Done - Jim |
BYTWO_p, SSE | Done - Jim |
BYTWO_b, SSE | Done - Jim |
Split 2 32,lazy | Done - Jim |
Split 2 32, SSE, lazy | Done - Jim |
Split 4 32, lazy | Done - Jim |
Split 4 32, SSE,ALTMAP lazy | Done - Jim |
Split 4 32, SSE, lazy | Done - Jim |
Split 8 8 | Done - Jim |
Group, g_s == g_r | Done - Jim |
Group, any g_s and g_r | Done - Jim |
Composite, k=2, stdmap recursive | Done - Kevin |
Composite, k=2, altmap recursive | Done - Kevin |
Composite, k=2, stdmap inline | Done - Kevin |
SHIFT | Done - Jim |
BYTWO_p | - |
BYTWO_b | - |
BYTWO_p, SSE | - |
BYTWO_b, SSE | - |
Split 16 1 SSE, maybe lazy | - |
Split 8 1 lazy | - |
Split 8 8 | - |
Split 8 8 lazy | - |
Group | - |
Composite, k=2, alternate mapping | - |
SHIFT | Done - Will |
BYTWO_p | - |
BYTWO_b | - |
BYTWO_p, SSE | - |
BYTWO_b, SSE | - |
Split 32 1 SSE, maybe lazy | - |
Split 16 1 lazy | - |
Split 16 16 - Maybe that's insanity | - |
Split 16 16 lazy | - |
Group (SSE) | - |
Composite, k=?, alternate mapping | - |
CAUCHY Region (SSE XOR) | Done - Jim |
SHIFT | Done - Jim |
TABLE | Done - Jim |
LOG | Done - Jim |
BYTWO_p | Done - Jim |
BYTWO_b | Done - Jim |
Group, g_s == g_r | Done - Jim |
Group, any g_s and g_r | Done - Jim |
Split - do we need it? | Done - Jim |
Composite - do we need it? | - |
Split - do we need it? | - |
Logzero? | - |