Code structure as of 7/20/2012

written by Jim.

Ok -- once again, I have messed with the structure. My goal is a structure that is both flexible and efficient. It's similar to what we had before, but better because it makes things like Euclid's method much cleaner.

I think we're ready to hack.


Files


Prototypes and typedefs in gf.h

The main structure that users will see is in gf.h, and it is of type gf_t:

typedef struct gf {
  gf_func_a_b    multiply;
  gf_func_a_b    divide;
  gf_func_a      inverse;
  gf_region      multiply_region;
  void           *scratch;
} gf_t;

We can beef it up later with buf-buf or buf-acc. The problem is that the paper is already bloated, so right now, I want to keep it lean.

The types of the procedures are big unions, so that they work with the following types of arguments:

typedef uint8_t     gf_val_4_t;
typedef uint8_t     gf_val_8_t;
typedef uint16_t    gf_val_16_t;
typedef uint32_t    gf_val_32_t;
typedef uint64_t    gf_val_64_t;
typedef uint64_t    *gf_val_128_t;
typedef uint32_t    gf_val_gen_t;   /* The intent here is for general values <= 32 */
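
To make the "big unions" concrete, here is a minimal sketch of how such a union of function pointers could be declared and called. The names sketch_func_a_b, sketch_gf_t, sketch_w4_shift_multiply and sketch_w4_init are hypothetical and are not the definitions in gf.h. Only two widths are shown, and the multiplier is a plain shift-and-reduce that assumes the polynomial x^4 + x + 1 (0x13) and inputs in the range 0..15.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint8_t gf_val_4_t;
typedef uint8_t gf_val_8_t;

struct sketch_gf;   /* forward declaration so the pointers can refer to it */

/* One member per word size; a caller picks the member that matches w. */
typedef union {
  gf_val_4_t (*w4)(struct sketch_gf *gf, gf_val_4_t a, gf_val_4_t b);
  gf_val_8_t (*w8)(struct sketch_gf *gf, gf_val_8_t a, gf_val_8_t b);
} sketch_func_a_b;

typedef struct sketch_gf {
  sketch_func_a_b multiply;
  void           *scratch;
} sketch_gf_t;

/* Shift-and-reduce multiplication in GF(2^4), polynomial 0x13 (assumed). */
gf_val_4_t sketch_w4_shift_multiply(sketch_gf_t *gf, gf_val_4_t a, gf_val_4_t b)
{
  unsigned product = 0;
  int i;

  (void) gf;                                   /* no tables needed here */
  for (i = 0; i < 4; i++) {                    /* carry-less multiply */
    if (b & (1u << i)) product ^= ((unsigned) a << i);
  }
  for (i = 6; i >= 4; i--) {                   /* reduce bits 6 down to 4 */
    if (product & (1u << i)) product ^= (0x13u << (i - 4));
  }
  return (gf_val_4_t) product;
}

/* Fill in the w4 member; the w8 member is simply left unused. */
void sketch_w4_init(sketch_gf_t *gf)
{
  gf->scratch = NULL;
  gf->multiply.w4 = sketch_w4_shift_multiply;
}
```

With this layout, a call reads g.multiply.w4(&g, 7, 11) after sketch_w4_init(&g). Because all members of the union are pointers, only the member that was set for the chosen w may be called.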

To use a gf_t, you need to initialize it with gf_init_easy() or gf_init_hard(). Let's concentrate on the former:

extern int gf_init_easy(gf_t *gf, int w, int mult_type);

You pass it memory for a gf_t, a value of w and a variable that says how to do multiplication. The valid values of mult_type are enumerated in gf.h:

typedef enum {GF_MULT_DEFAULT,
              GF_MULT_SHIFT,
              GF_MULT_GROUP,
              GF_MULT_BYTWO_p,
              GF_MULT_BYTWO_b,
              GF_MULT_TABLE,
              GF_MULT_LOG_TABLE,
              GF_MULT_SPLIT_TABLE,
              GF_MULT_COMPOSITE } gf_mult_type_t;

After creating the gf_t, you use its multiply method to multiply, using the union's fields to work with the various types. It looks easier than my explanation. For example, suppose you wanted to multiply 5 and 4 in GF(2^4). You can do it as in gf_54.c:

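#include <stdio.h>
#include <stdlib.h>
#include "gf.h"

int main()
{
  gf_t gf;

  /* gf_init_hard() returns 0 on failure; gf_init_easy() is assumed to do the same. */
  if (gf_init_easy(&gf, 4, GF_MULT_DEFAULT) == 0) {
    fprintf(stderr, "gf_init_easy failed\n");
    exit(1);
  }
  printf("%d\n", gf.multiply.w4(&gf, 5, 4));   /* 5 * 4 in GF(2^4) */
  exit(0);
}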

If you wanted to multiply in GF(2^8), then you'd have to use 8 as a parameter to gf_init_easy, and call the multiplier as gf.multiply.w8().

When you're done with your gf_t, you should call gf_free() on it so that it can free the memory that it has allocated. We'll talk more about memory later; for now, note that gf_init_easy() calls malloc(), so if you care about freeing memory, you'll have to call gf_free().


Memory allocation

Each implementation of a multiplication technique keeps around its own data. For example, GF_MULT_TABLE keeps around multiplication and division tables, and GF_MULT_LOG_TABLE maintains log and antilog tables. This data is stored in the memory that scratch points to. My intent is that this memory is all that's required. In other words, the multiply(), divide(), inverse() and multiply_region() calls don't do any memory allocation. Moreover, gf_init_easy() only allocates one chunk of memory -- the one in scratch.

If you don't want to have the initialization call allocate memory, you can use gf_init_hard():

extern int gf_init_hard(gf_t *gf,
                        int w,
                        int mult_type,
                        int region_type,
                        int divide_type,
                        uint64_t prim_poly,
                        int arg1,
                        int arg2,
                        gf_t *base_gf,
                        void *scratch_memory);

The first three parameters are the same as in gf_init_easy(). The region_type and divide_type arguments specify how to perform multiply_region and division, respectively. Their values are also defined in gf.h. The region_type values are bit flags, so you can mix them with bitwise OR (e.g. "DOUBLE" and "SSE"):

#define GF_REGION_DEFAULT      (0x0)
#define GF_REGION_SINGLE_TABLE (0x1)
#define GF_REGION_DOUBLE_TABLE (0x2)
#define GF_REGION_QUAD_TABLE   (0x4)
#define GF_REGION_LAZY         (0x8)
#define GF_REGION_SSE          (0x10)
#define GF_REGION_NOSSE        (0x20)
#define GF_REGION_STDMAP       (0x40)
#define GF_REGION_ALTMAP       (0x80)
#define GF_REGION_CAUCHY       (0x100)

typedef uint32_t gf_region_type_t;

typedef enum { GF_DIVIDE_DEFAULT,
               GF_DIVIDE_MATRIX,
               GF_DIVIDE_EUCLID } gf_division_type_t;
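
Since the region_type values are distinct bits, they can be combined and tested with bitwise operators. The sketch below repeats three of the defines from above. The helpers sketch_double_sse(), sketch_region_has() and sketch_region_valid() are hypothetical, and the rule that SSE and NOSSE should not both be set is an assumption for illustration, not a check taken from the library.

```c
#include <stdint.h>

#define GF_REGION_DOUBLE_TABLE (0x2)
#define GF_REGION_SSE          (0x10)
#define GF_REGION_NOSSE        (0x20)

typedef uint32_t gf_region_type_t;

/* Combine "DOUBLE" and "SSE" into one region_type value. */
gf_region_type_t sketch_double_sse(void)
{
  return GF_REGION_DOUBLE_TABLE | GF_REGION_SSE;
}

/* Return 1 if flag is set in region_type, 0 otherwise. */
int sketch_region_has(gf_region_type_t region_type, gf_region_type_t flag)
{
  return (region_type & flag) != 0;
}

/* Hypothetical sanity check: SSE and NOSSE together are contradictory. */
int sketch_region_valid(gf_region_type_t region_type)
{
  return !(sketch_region_has(region_type, GF_REGION_SSE) &&
           sketch_region_has(region_type, GF_REGION_NOSSE));
}
```

The combined value from sketch_double_sse() is what one would pass as the region_type argument.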

You can change the primitive polynomial with prim_poly, give additional arguments with arg1 and arg2 and give a base Galois Field for composite fields. Finally, you can pass it a pointer to memory in scratch_memory. That way, you can avoid having gf_init_hard() call malloc().

There is a procedure called gf_scratch_size() that lets you know the minimum size for scratch_memory, depending on w, the multiplication type and the arguments:

extern int gf_scratch_size(int w,
                           int mult_type,
                           int region_type,
                           int divide_type,
                           int arg1,
                           int arg2);

You can specify default arguments in gf_init_hard():

If any argument is equal to its default, then default actions are taken (e.g. a standard primitive polynomial is used, or memory is allocated for scratch_memory). In fact, gf_init_easy() simply calls gf_init_hard() with the default parameters.

gf_free() frees memory that was allocated with gf_init_easy() or gf_init_hard(). The recursive parameter is in case you use composite fields, and want to recursively free the base fields. If you pass scratch_memory to gf_init_hard(), then you typically don't need to call gf_free(). It won't hurt to call it, though.


gf_mult and gf_div

For the moment, I have few things completely implemented, but that's because I want to be able to explain the structure, and how to specify methods. In particular, for w=4, I have implemented SHIFT and LOG. For w=8, 16, 32, 64 I have implemented SHIFT. For all w ≤ 32, I have implemented both Euclid's algorithm for inversion, and the matrix method for inversion. For w=64, it's just Euclid. You can test these all with gf_mult and gf_div. Here are a few calls:
UNIX> gf_mult 7 11 4                - Default
4
UNIX> gf_mult 7 11 4 SHIFT - -      - Use shift
4
UNIX> gf_mult 7 11 4 LOG - -        - Use logs
4
UNIX> gf_div 4 7 4                  - Default
11
UNIX> gf_div 4 7 4 LOG - -          - Use logs
11
UNIX> gf_div 4 7 4 LOG - EUCLID     - Use Euclid instead of logs
11
UNIX> gf_div 4 7 4 LOG - MATRIX     - Use Matrix inversion instead of logs
11
UNIX> gf_div 4 7 4 SHIFT - -        - Default
11
UNIX> gf_div 4 7 4 SHIFT - EUCLID   - Use Euclid (which is the default)
11
UNIX> gf_div 4 7 4 SHIFT - MATRIX   - Use Matrix inversion instead of Euclid
11
UNIX> gf_mult 200 211 8        - The remainder are shift/Euclid
201
UNIX> gf_div 201 211 8
200
UNIX> gf_mult 60000 65111 16
63515
UNIX> gf_div 63515 65111 16
60000
UNIX> gf_mult abcd0001 9afbf788 32h
b0359681
UNIX> gf_div b0359681 9afbf788 32h
abcd0001
UNIX> gf_mult abcd00018c8b8c8a 9afbf7887f6d8e5b 64h
3a7def35185bd571
UNIX> gf_div 3a7def35185bd571 9afbf7887f6d8e5b 64h
abcd00018c8b8c8a
UNIX> 
You can see all the methods with gf_methods. We have a lot of implementing to do:
UNIX> gf_methods
To specify the methods, do one of the following: 
       - leave empty to use defaults
       - use a single dash to use defaults
       - specify MULTIPLY REGION DIVIDE

Legal values of MULTIPLY:
       SHIFT: shift
       GROUP g_mult g_reduce: the Group technique - see the paper
       BYTWO_p: BYTWO doubling the product.
       BYTWO_b: BYTWO doubling b (more efficient thatn BYTWO_p)
       TABLE: Full multiplication table
       LOG:   Discrete logs
       LOG_ZERO: Discrete logs with a large table for zeros
       SPLIT g_a g_b: Split tables defined by g_a and g_b
       COMPOSITE k l [METHOD]: Composite field, recursively specify the
                               method of the base field in GF(2^l)

Legal values of REGION: Specify multiples with commas e.g. 'DOUBLE,LAZY'
       -: Use defaults
       SINGLE/DOUBLE/QUAD: Expand tables
       LAZY: Lazily create table (only applies to TABLE and SPLIT)
       SSE/NOSSE: Use 128-bit SSE instructions if you can
       CAUCHY/ALTMAP/STDMAP: Use different memory mappings

Legal values of DIVIDE:
       -: Use defaults
       MATRIX: Use matrix inversion
       EUCLID: Use the extended Euclidian algorithm.

See the user's manual for more information.
There are many restrictions, so it is better to simply use defaults in most cases.
UNIX> 
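
For readers who want to see what the EUCLID option computes, here is a sketch of inversion in GF(2^w) using the extended Euclidean algorithm on polynomials over GF(2). It is not the library's gf_w4_euclid. All function names are hypothetical, and the code assumes an irreducible polynomial given with its top bit set (0x13 for w=4). The small multiplier is only there so that the result can be checked for w=4.

```c
#include <stdint.h>

/* Degree of a polynomial over GF(2) stored as a bit mask; -1 for zero. */
int sketch_degree(uint32_t p)
{
  int d = -1;
  while (p != 0) { p >>= 1; d++; }
  return d;
}

/* Inverse of a modulo prim_poly; returns 0 for a == 0 (no inverse exists).
   Invariant: t0*a == r0 and t1*a == r1 (mod prim_poly). */
uint32_t sketch_euclid_inverse(uint32_t a, uint32_t prim_poly)
{
  uint32_t r0 = prim_poly, r1 = a, t0 = 0, t1 = 1, tmp;
  int shift;

  if (a == 0) return 0;
  while (r1 != 1) {
    shift = sketch_degree(r0) - sketch_degree(r1);
    if (shift < 0) {                 /* keep the larger degree in r0 */
      tmp = r0; r0 = r1; r1 = tmp;
      tmp = t0; t0 = t1; t1 = tmp;
      shift = -shift;
    }
    r0 ^= (r1 << shift);             /* cancel the leading term of r0 */
    t0 ^= (t1 << shift);             /* apply the same step to the cofactor */
  }
  return t1;
}

/* Shift-and-reduce multiply in GF(2^4) with 0x13, used only for checking. */
uint32_t sketch_gf16_multiply(uint32_t a, uint32_t b)
{
  uint32_t product = 0;
  int i;

  for (i = 0; i < 4; i++) if (b & (1u << i)) product ^= (a << i);
  for (i = 6; i >= 4; i--) if (product & (1u << i)) product ^= (0x13u << (i - 4));
  return product;
}
```

Division then follows as a multiplication by the inverse: sketch_gf16_multiply(4, sketch_euclid_inverse(7, 0x13)) gives 11, which agrees with the gf_div 4 7 4 calls above.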

gf_unit and gf_time

gf_unit.c is a unit tester, and gf_time.c is a time tester. They are called as follows:

UNIX> gf_unit w tests seed [METHOD] 
UNIX> gf_time w tests seed size(bytes) iterations [METHOD] 

The tests parameter is one or more of the following characters:

seed is a seed for srand48() -- using -1 defaults to the current time.

For example, here are tests of LOG (the default) and SHIFT with w=4:

UNIX> gf_unit 4 AV 1 LOG - -
Seed: 1
Testing single multiplications/divisions.
Testing Inversions.
Testing buffer-constant, src != dest, xor = 0
Testing buffer-constant, src != dest, xor = 1
Testing buffer-constant, src == dest, xor = 0
Testing buffer-constant, src == dest, xor = 1
UNIX> gf_unit 4 AV 1 SHIFT - -
Seed: 1
Testing single multiplications/divisions.
Testing Inversions.
No multiply_region.
UNIX> 
There is no multiply_region() method defined for SHIFT. Thus, the procedures are NULL and the unit tester ignores them.
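
To illustrate what the "buffer-constant" tests exercise, here is a sketch of a region multiply for w=4: every element of src is multiplied by a constant, and the result either overwrites dest (xor = 0) or is XORed into it (xor = 1). This is not the library's routine. The function names, the argument order, and the packing of two 4-bit elements per byte are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift-and-reduce multiply in GF(2^4), polynomial 0x13 (assumed). */
uint8_t sketch_w4_multiply(uint8_t a, uint8_t b)
{
  unsigned product = 0;
  int i;

  for (i = 0; i < 4; i++) if (b & (1u << i)) product ^= ((unsigned) a << i);
  for (i = 6; i >= 4; i--) if (product & (1u << i)) product ^= (0x13u << (i - 4));
  return (uint8_t) product;
}

/* Multiply both nibbles of every byte in src by c.
   do_xor == 0: dest[i] = product.  do_xor != 0: dest[i] ^= product.
   Works for src == dest because src[i] is read before dest[i] is written. */
void sketch_w4_multiply_region(const uint8_t *src, uint8_t *dest,
                               uint8_t c, size_t bytes, int do_xor)
{
  uint8_t table[16], prod;
  size_t i;
  int x;

  for (x = 0; x < 16; x++) table[x] = sketch_w4_multiply(c, (uint8_t) x);
  for (i = 0; i < bytes; i++) {
    prod = (uint8_t) ((table[src[i] >> 4] << 4) | table[src[i] & 0xf]);
    dest[i] = do_xor ? (uint8_t) (dest[i] ^ prod) : prod;
  }
}
```

The four combinations in the unit tester's output (src != dest or src == dest, xor = 0 or 1) correspond to the four ways this kind of routine can be called.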

At the moment, I only have the unit tester working for w=4.

gf_time takes the size of an array (in bytes) and a number of iterations, and tests the speed of both single and region operations. The tests are:

Here are some examples with SHIFT and LOG on my mac.
UNIX> gf_time 4 A 1 102400 1024 LOG - -
Seed: 1
Multiply:   0.538126 s      185.830 Mega-ops/s
Divide:     0.520825 s      192.003 Mega-ops/s
Inverse:    0.631198 s      158.429 Mega-ops/s
Buffer-Const,s!=d,xor=0:    0.478395 s      209.032 MB/s
Buffer-Const,s!=d,xor=1:    0.524245 s      190.751 MB/s
Buffer-Const,s==d,xor=0:    0.471851 s      211.931 MB/s
Buffer-Const,s==d,xor=1:    0.528275 s      189.295 MB/s
UNIX> gf_time 4 A 1 102400 1024 LOG - EUCLID
Seed: 1
Multiply:   0.555512 s      180.014 Mega-ops/s
Divide:     5.359434 s       18.659 Mega-ops/s
Inverse:    4.911719 s       20.359 Mega-ops/s
Buffer-Const,s!=d,xor=0:    0.496097 s      201.573 MB/s
Buffer-Const,s!=d,xor=1:    0.538536 s      185.689 MB/s
Buffer-Const,s==d,xor=0:    0.485564 s      205.946 MB/s
Buffer-Const,s==d,xor=1:    0.540227 s      185.107 MB/s
UNIX> gf_time 4 A 1 102400 1024 LOG - MATRIX
Seed: 1
Multiply:   0.544005 s      183.822 Mega-ops/s
Divide:     7.602822 s       13.153 Mega-ops/s
Inverse:    7.000564 s       14.285 Mega-ops/s
Buffer-Const,s!=d,xor=0:    0.474868 s      210.585 MB/s
Buffer-Const,s!=d,xor=1:    0.527588 s      189.542 MB/s
Buffer-Const,s==d,xor=0:    0.473130 s      211.358 MB/s
Buffer-Const,s==d,xor=1:    0.529877 s      188.723 MB/s
UNIX> gf_time 4 A 1 102400 1024 SHIFT - -
Seed: 1
Multiply:   2.708842 s       36.916 Mega-ops/s
Divide:     8.756882 s       11.420 Mega-ops/s
Inverse:    5.695511 s       17.558 Mega-ops/s
UNIX> 
At the moment, I only have the timer working for w=4.
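
As a worked check on how to read these numbers: the rates in the transcripts are consistent with dividing size * iterations by 2^20 and then by the elapsed time, so a "Mega" here appears to be 2^20 rather than 10^6. The helper below is a hypothetical illustration of that arithmetic, not code from gf_time.c.

```c
/* Throughput in units of 2^20 per second (MB/s for region operations,
   Mega-ops/s for single operations), as inferred from the transcripts. */
double sketch_rate(double size_bytes, double iterations, double seconds)
{
  return (size_bytes * iterations) / (1024.0 * 1024.0) / seconds;
}
```

For the first LOG run, 102400 * 1024 is exactly 100 * 2^20, so sketch_rate(102400, 1024, 0.478395) comes out at about 209.03, in line with the 209.032 MB/s reported for Buffer-Const,s!=d,xor=0.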

Walking you through LOG

To see how scratch is used to store data, let's look at what happens when you call gf_init_easy(&gf, 4, GF_MULT_LOG); First, gf_init_easy() calls gf_init_hard() with default parameters. This is in gf.c.

gf_init_hard()'s first job is to set up the scratch. The scratch's type is gf_internal_t, defined in gf_int.h:

typedef struct {
  int mult_type;
  int region_type;
  int divide_type;
  int w;
  uint64_t prim_poly;
  int free_me;
  int arg1;
  int arg2;
  gf_t *base_gf;
  void *private;
} gf_internal_t;

All the fields are straightforward, with the exception of private. That is a (void *) which points to the implementation's private data.

Here's the code for gf_init_hard():

int gf_init_hard(gf_t *gf, int w, int mult_type, 
                        int region_type,
                        int divide_type,
                        uint64_t prim_poly,
                        int arg1, int arg2,
                        gf_t *base_gf,
                        void *scratch_memory) 
{
  int sz;
  gf_internal_t *h;


  if (scratch_memory == NULL) {
    sz = gf_scratch_size(w, mult_type, region_type, divide_type, arg1, arg2);
    if (sz <= 0) return 0;
    h = (gf_internal_t *) malloc(sz);
    if (h == NULL) return 0;
    h->free_me = 1;
  } else {
    h = scratch_memory;
    h->free_me = 0;
  }
  gf->scratch = (void *) h;
  h->mult_type = mult_type;
  h->region_type = region_type;
  h->divide_type = divide_type;
  h->w = w;
  h->prim_poly = prim_poly;
  h->arg1 = arg1;
  h->arg2 = arg2;
  h->base_gf = base_gf;
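  /* Private data starts right after the header. Cast to (char *) because
     arithmetic on (void *) is a GNU extension, not standard C. */
  h->private = (void *) ((char *) gf->scratch + sizeof(gf_internal_t));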

  switch(w) {
    case 4: return gf_w4_init(gf);
    case 8: return gf_w8_init(gf);
    case 16: return gf_w16_init(gf);
    case 32: return gf_w32_init(gf);
    case 64: return gf_w64_init(gf);
    case 128: return gf_dummy_init(gf);
    default: return 0;
  }
}

The first thing it does is determine if it has to allocate space for scratch. If it must, it uses gf_scratch_size() to figure out how big the space must be. It then sets gf->scratch to this space, and sets all of the fields of the scratch to the arguments of gf_init_hard(). The private pointer is set to the space just after the gf_internal_t header, i.e. sizeof(gf_internal_t) bytes past gf->scratch. Again, it is up to gf_scratch_size() to make sure there is enough space for the scratch header and for all of the private data needed by the implementation.
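
The single-chunk layout can be sketched in isolation as follows. This is a simplified stand-in, not the library code: sketch_internal_t has only three fields, and sketch_scratch_size(), sketch_setup() and sketch_teardown() are hypothetical names. It assumes that caller-supplied memory is suitably aligned and at least sketch_scratch_size() bytes long.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
  int   w;
  int   free_me;        /* 1 if we called malloc() and must free() later */
  void *private_data;   /* points just past this header */
} sketch_internal_t;

/* Bytes needed: the header plus the implementation's private data. */
size_t sketch_scratch_size(size_t private_bytes)
{
  return sizeof(sketch_internal_t) + private_bytes;
}

/* Use caller memory if given, otherwise allocate one chunk. NULL on failure. */
sketch_internal_t *sketch_setup(int w, size_t private_bytes, void *scratch_memory)
{
  sketch_internal_t *h;

  if (scratch_memory == NULL) {
    h = (sketch_internal_t *) malloc(sketch_scratch_size(private_bytes));
    if (h == NULL) return NULL;
    h->free_me = 1;
  } else {
    h = (sketch_internal_t *) scratch_memory;
    h->free_me = 0;
  }
  h->w = w;
  h->private_data = (void *) ((char *) h + sizeof(sketch_internal_t));
  return h;
}

/* Free only what sketch_setup() allocated; caller memory is left alone. */
void sketch_teardown(sketch_internal_t *h)
{
  if (h != NULL && h->free_me) free(h);
}
```

The free_me flag is what makes it harmless to call the teardown routine on a structure that was built on caller-supplied memory.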

Once the scratch is set up, gf_init_hard() calls gf_w4_init(). This is in gf_w4.c, and it is a simple dispatcher to the various initialization routines, plus it sets EUCLID and MATRIX if need be:

int gf_w4_init(gf_t *gf)
{
  gf_internal_t *h;

  h = (gf_internal_t *) gf->scratch;
  if (h->prim_poly == 0) h->prim_poly = 0x13;

  gf->multiply.w4 = NULL;
  gf->divide.w4 = NULL;
  gf->inverse.w4 = NULL;
  gf->multiply_region.w4 = NULL;

  switch(h->mult_type) {
    case GF_MULT_SHIFT:     if (gf_w4_shift_init(gf) == 0) return 0; break;
    case GF_MULT_LOG_TABLE: if (gf_w4_log_init(gf) == 0) return 0; break;
    case GF_MULT_DEFAULT:   if (gf_w4_log_init(gf) == 0) return 0; break;
    default: return 0;
  }
  if (h->divide_type == GF_DIVIDE_EUCLID) {
    gf->divide.w4 = gf_w4_divide_from_inverse;
    gf->inverse.w4 = gf_w4_euclid;
  } else if (h->divide_type == GF_DIVIDE_MATRIX) {
    gf->divide.w4 = gf_w4_divide_from_inverse;
    gf->inverse.w4 = gf_w4_matrix;
  }

  if (gf->inverse.w4 != NULL && gf->divide.w4 == NULL) {
    gf->divide.w4 = gf_w4_divide_from_inverse;
  }
  if (gf->inverse.w4 == NULL && gf->divide.w4 != NULL) {
    gf->inverse.w4 = gf_w4_inverse_from_divide;
  }
  return 1;
}
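
The last two if statements fill in whichever of divide and inverse is missing. A sketch of what such derived routines might look like is below. The library's gf_w4_divide_from_inverse and gf_w4_inverse_from_divide are not shown in this document, so everything here is a hypothetical stand-in: the struct sketch_ops_t, the function names, and the inverse, which is computed as a^14 (valid in GF(2^4) because a^15 = 1 for nonzero a) rather than with Euclid or a matrix.

```c
#include <stddef.h>
#include <stdint.h>

struct sketch_ops;
typedef uint8_t (*sketch_op2)(struct sketch_ops *ops, uint8_t a, uint8_t b);
typedef uint8_t (*sketch_op1)(struct sketch_ops *ops, uint8_t a);

typedef struct sketch_ops {
  sketch_op2 multiply;
  sketch_op2 divide;
  sketch_op1 inverse;
} sketch_ops_t;

/* Shift-and-reduce multiply in GF(2^4), polynomial 0x13 (assumed). */
uint8_t sketch_multiply(sketch_ops_t *ops, uint8_t a, uint8_t b)
{
  unsigned product = 0;
  int i;

  (void) ops;
  for (i = 0; i < 4; i++) if (b & (1u << i)) product ^= ((unsigned) a << i);
  for (i = 6; i >= 4; i--) if (product & (1u << i)) product ^= (0x13u << (i - 4));
  return (uint8_t) product;
}

/* Stand-in inverse: a^14. Returns 0 for a == 0. */
uint8_t sketch_power_inverse(sketch_ops_t *ops, uint8_t a)
{
  uint8_t result = 1;
  int i;

  for (i = 0; i < 14; i++) result = ops->multiply(ops, result, a);
  return result;
}

/* a / b = a * inverse(b) */
uint8_t sketch_divide_from_inverse(sketch_ops_t *ops, uint8_t a, uint8_t b)
{
  return ops->multiply(ops, a, ops->inverse(ops, b));
}

/* inverse(a) = 1 / a */
uint8_t sketch_inverse_from_divide(sketch_ops_t *ops, uint8_t a)
{
  return ops->divide(ops, 1, a);
}

/* Same fill-in logic as the end of gf_w4_init(). */
void sketch_fill_missing(sketch_ops_t *ops)
{
  if (ops->inverse != NULL && ops->divide == NULL) ops->divide = sketch_divide_from_inverse;
  if (ops->inverse == NULL && ops->divide != NULL) ops->inverse = sketch_inverse_from_divide;
}
```

The point of the design is that an implementation only has to supply one of the two operations, and the dispatcher derives the other.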

The code in gf_w4_log_init() sets up the log and antilog tables, and sets the multiply.w4, divide.w4 etc routines to be the ones for logs. The tables are put into gf->scratch->private, which is typecast to a struct gf_logtable_data *:

struct gf_logtable_data {
    gf_val_4_t      log_tbl[GF_FIELD_SIZE];
    gf_val_4_t      antilog_tbl[GF_FIELD_SIZE * 2];
    gf_val_4_t      *antilog_tbl_div;
};
.......

static 
int gf_w4_log_init(gf_t *gf)
{
  gf_internal_t *h;
  struct gf_logtable_data *ltd;
  int i, b;

  h = (gf_internal_t *) gf->scratch;
  ltd = h->private;

  ltd->log_tbl[0] = 0;

  ltd->antilog_tbl_div = ltd->antilog_tbl + (GF_FIELD_SIZE-1);
  b = 1;
  for (i = 0; i < GF_FIELD_SIZE-1; i++) {
      ltd->log_tbl[b] = (gf_val_8_t)i;
      ltd->antilog_tbl[i] = (gf_val_8_t)b;
      ltd->antilog_tbl[i+GF_FIELD_SIZE-1] = (gf_val_8_t)b;
      b <<= 1;
      if (b & GF_FIELD_SIZE) {
          b = b ^ h->prim_poly;
      }
  }
    
  gf->inverse.w4 = gf_w4_inverse_from_divide;
  gf->divide.w4 = gf_w4_log_divide;
  gf->multiply.w4 = gf_w4_log_multiply;
  gf->multiply_region.w4 = gf_w4_log_multiply_region;
  return 1;
}

And of course the individual routines use h->private to access the tables:

static
inline
gf_val_8_t gf_w4_log_multiply (gf_t *gf, gf_val_8_t a, gf_val_8_t b)
{
  struct gf_logtable_data *ltd;
    
  ltd = (struct gf_logtable_data *) ((gf_internal_t *) (gf->scratch))->private;
  return (a == 0 || b == 0) ? 0 : ltd->antilog_tbl[(unsigned)(ltd->log_tbl[a] + ltd->log_tbl[b])];
}
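
The divide routine is not shown above, so here is a self-contained sketch of how the antilog_tbl_div pointer could be used for it. The table construction mirrors gf_w4_log_init(), but the struct and function names are hypothetical, the tables live in a caller-provided struct rather than in scratch, and returning 0 for division by zero is an assumption of this sketch.

```c
#include <stdint.h>

#define SKETCH_FIELD_SIZE 16      /* 2^4 */
#define SKETCH_PRIM_POLY  0x13    /* x^4 + x + 1 */

struct sketch_logtable_data {
  uint8_t  log_tbl[SKETCH_FIELD_SIZE];
  uint8_t  antilog_tbl[SKETCH_FIELD_SIZE * 2];
  uint8_t *antilog_tbl_div;       /* antilog_tbl shifted by FIELD_SIZE-1 */
};

void sketch_log_init(struct sketch_logtable_data *ltd)
{
  int i, b = 1;

  ltd->log_tbl[0] = 0;
  ltd->antilog_tbl_div = ltd->antilog_tbl + (SKETCH_FIELD_SIZE - 1);
  for (i = 0; i < SKETCH_FIELD_SIZE - 1; i++) {
    ltd->log_tbl[b] = (uint8_t) i;
    ltd->antilog_tbl[i] = (uint8_t) b;
    ltd->antilog_tbl[i + SKETCH_FIELD_SIZE - 1] = (uint8_t) b;  /* second copy */
    b <<= 1;
    if (b & SKETCH_FIELD_SIZE) b ^= SKETCH_PRIM_POLY;
  }
}

/* log(a) + log(b) is at most 28, so the doubled table avoids a modulo. */
uint8_t sketch_log_multiply(const struct sketch_logtable_data *ltd, uint8_t a, uint8_t b)
{
  if (a == 0 || b == 0) return 0;
  return ltd->antilog_tbl[ltd->log_tbl[a] + ltd->log_tbl[b]];
}

/* log(a) - log(b) lies in -14..14; the shifted pointer makes that a valid index. */
uint8_t sketch_log_divide(const struct sketch_logtable_data *ltd, uint8_t a, uint8_t b)
{
  int d;

  if (a == 0 || b == 0) return 0;   /* b == 0 has no answer; 0 is a placeholder */
  d = (int) ltd->log_tbl[a] - (int) ltd->log_tbl[b];
  return ltd->antilog_tbl_div[d];
}
```

Storing the antilog values twice trades a few bytes for the absence of a modulo or branch on every multiply and divide.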

Finally, it's important that the proper sizes are put into gf_w4_scratch_size() for each implementation:

int gf_w4_scratch_size(int mult_type, int region_type, int divide_type, int arg1, int arg2)
{
  int region_tbl_size;
  switch(mult_type)
  {
    case GF_MULT_DEFAULT:
    case GF_MULT_LOG_TABLE:
      return sizeof(gf_internal_t) + sizeof(struct gf_logtable_data) + 64;
      break;
    case GF_MULT_SHIFT:
      return sizeof(gf_internal_t);
      break;
    default:
      return -1;
   }
}

I hope that's enough explanation for y'all to start implementing. Let me know if you have problems -- thanks -- Jim


The initial structure has been set for w=4, 8, 16, 32 and 64, with implementations of SHIFT and EUCLID, and for w <= 32, MATRIX. There are some weird caveats:

Things we need to Implement: w=4

SHIFT Done - Jim
BYTWO_p Done - Jim
BYTWO_b Done - Jim
BYTWO_p, SSE Done - Jim
BYTWO_b, SSE Done - Jim
Single TABLE Done - Jim
Double TABLE Done - Jim
Double TABLE, SSE Done - Jim
Quad TABLE Done - Jim
Lazy Quad TABLE Done - Jim
LOG Done - Jim


Things we need to Implement: w=8

SHIFT Done - Jim
BYTWO_p Done - Jim
BYTWO_b Done - Jim
BYTWO_p, SSE Done - Jim
BYTWO_b, SSE Done - Jim
Single TABLE Done - Kevin
Double TABLE Done - Jim
Lazy Double TABLE Done - Jim
Split 2 1 (Half) SSE Done - Jim
Composite, k=2 Done - Kevin (alt mapping not passing unit test)
LOG Done - Kevin
LOG ZERO Done - Jim


Things we need to Implement: w=16

SHIFT Done - Jim
BYTWO_p Done - Jim
BYTWO_b Done - Jim
BYTWO_p, SSE Done - Jim
BYTWO_b, SSE Done - Jim
Lazy TABLE Done - Jim
Split 4 16 No-SSE, lazy Done - Jim
Split 4 16 SSE, lazy Done - Jim
Split 4 16 SSE, lazy, alternate mapping Done - Jim
Split 8 16, lazy Done - Jim
Composite, k=2, stdmap recursive Done - Kevin
Composite, k=2, altmap recursive Done - Kevin
Composite, k=2, stdmap inline Done - Kevin
LOG Done - Kevin
LOG ZERO Done - Kevin
Group 4 4 Done - Jim: I don't see a reason to implement others, although 4-8 will be faster, and 8 8 will have faster region ops. They'll never beat SPLIT.


Things we need to Implement: w=32

SHIFT Done - Jim
BYTWO_p Done - Jim
BYTWO_b Done - Jim
BYTWO_p, SSE Done - Jim
BYTWO_b, SSE Done - Jim
Split 2 32,lazy Done - Jim
Split 2 32, SSE, lazy Done - Jim
Split 4 32, lazy Done - Jim
Split 4 32, SSE,ALTMAP lazy Done - Jim
Split 4 32, SSE, lazy Done - Jim
Split 8 8 Done - Jim
Group, g_s == g_r Done - Jim
Group, any g_s and g_r Done - Jim
Composite, k=2, stdmap recursive Done - Kevin
Composite, k=2, altmap recursive Done - Kevin
Composite, k=2, stdmap inline Done - Kevin


Things we need to Implement: w=64

SHIFT Done - Jim
BYTWO_p -
BYTWO_b -
BYTWO_p, SSE -
BYTWO_b, SSE -
Split 16 1 SSE, maybe lazy -
Split 8 1 lazy -
Split 8 8 -
Split 8 8 lazy -
Group -
Composite, k=2, alternate mapping -


Things we need to Implement: w=128

SHIFT Done - Will
BYTWO_p -
BYTWO_b -
BYTWO_p, SSE -
BYTWO_b, SSE -
Split 32 1 SSE, maybe lazy -
Split 16 1 lazy -
Split 16 16 - Maybe that's insanity -
Split 16 16 lazy -
Group (SSE) -
Composite, k=?, alternate mapping -


Things we need to Implement: w=general between 1 & 32

CAUCHY Region (SSE XOR) Done - Jim
SHIFT Done - Jim
TABLE Done - Jim
LOG Done - Jim
BYTWO_p Done - Jim
BYTWO_b Done - Jim
Group, g_s == g_r Done - Jim
Group, any g_s and g_r Done - Jim
Split - do we need it? Done - Jim
Composite - do we need it? -
Split - do we need it? -
Logzero? -