Tech things by Robus

Typesafe, Generic & Extremely Fast Dictionary Implementation in C

Check out the project here. If you want to understand the motivation behind writing one, read on :)

Well, I had a few design goals in mind while implementing a Dictionary in C.

Safety, safety, safety

Trust me, it smells when you see void* as the value type in an API signature. Many languages rely on such a property, where a single API accepts the entire Cosmos. For example, you see something like this in Golang.


func Add(buffer interface{}) {}

Or, in C

void Add(void* buffer) {}

As a programmer, you never know the type of argument it expects. Is it an integer? Is it a character buffer? Is it a float?

Similarly, the implementation of such APIs is usually hairy, as they mostly rely on runtime reflection of some sort to operate on values by "trying" to figure out the type of the value being operated on. That carries a massive runtime overhead. Avoiding it completely is the biggest rationale behind writing the Dictionary library.

Also, let's not forget the pleasure of catching bugs at runtime that could have been caught during compilation, which is exactly what such dynamic APIs invite.

Generic

Although C does not have support for Generic Programming the way C++ or Rust do, it does have #define macros, which we can use to emulate generics. The implementation is quirky, but I think the resulting APIs are simple to reason about.

For instance, to define a Generic type that holds a Dictionary of int to char mappings:


#include "cdict.h"

CDict(int, char) cdict_int_char_t;

Similarly, to define a Generic type that holds a Dictionary of int to Node mappings, where Node is a user-defined struct:

#include "cdict.h"

typedef struct {
  int x;
  int y;
} Node;

// Define CDict of Integer -> Node
CDict(int, Node) cdict_int_node_t;

Overall, I am a big fan of Generics. That said, it is not always a good idea to litter code with Generics. Being careful is the key here.

Stupidly Simple

This is one of the hardest things to get right. The simplicity came at the cost of not-so-simple implementation details, but I think that is a tradeoff well worth it. I won't bother you with a long explanation and will let the code do the talking.

#include "cdict.h"

// Define generic type that holds `Dictionary of Integer to String`
typedef char* string;
CDict(int, string) cdict_int_string_t;

int main() {
  // Create instance of cdict_int_string_t
  cdict_int_string_t cdict;

  // Initialize
  cdict__init(&cdict);

  // Add key -> value pair
  cdict__add(&cdict, 1, "one");
  cdict__add(&cdict, 2, "two");  

  // Remove element from Dictionary
  cdict__remove(&cdict, 1);

  // Size
  size_t size = cdict__size(&cdict);

  // Clear
  cdict__clear(&cdict);

  // Free up the heap
  cdict__free(&cdict);
}

Monomorphization

This is the compiler's way of generating code to mitigate the runtime cost that comes with abstraction. In simpler terms, it creates a copy of the code for each Generic type that you define in your program. By making copies, it avoids dynamic dispatch and allows plain function calls with concrete types, as if there were no Generics in the first place. For example, if I use a CDict(int, string) and a CDict(char, string), then the generated binary will have two copies of the generated code for CDict: one for CDict(int, string) and another for CDict(char, string). Let's look at the pseudocode.

Given CDict(int, string)

CDict(int, string) cdict_int_string_t;

Compiler generates

typedef struct {
  // Key of type `int`
  int* key;
  // Value of type `string`
  string* value;
  ...
} cdict_int_string_t;

Given CDict of char -> string mapping

CDict(char, string) cdict_char_string_t;

Compiler generates

typedef struct {
  // Key of type `char`
  char* key;
  // Value of type `string`
  string* value;
  ...
} cdict_char_string_t;

I hope this helps. By doing so, we don't pay any price for our abstractions. On the downside, the binary size increases, but I think the tradeoff between performance and size is well worth it.

Others

  • I wanted to avoid the problems of Primary and Secondary clustering that come with Linear and Quadratic probing during collision resolution. Thus, CDict uses Double Hashing. There is an even better variant called "Enhanced Double Hashing". I might implement it later. Maybe.

  • CDict also allows you to use a Custom Comparator and a Custom Hash function by overriding the default implementations, which gives you greater control over your use case. It is super important to implement both the Comparator and the Hasher consistently for the correctness of your program. Read this StackOverflow post for more details on why that is the case. You can find an example in the README of how to implement one.

  • Oh yes, it should be extremely fast. It uses XXHash for hashing keys, which runs at RAM speed limits. Check out the benchmark that compares XXH3 with Murmur3, City64, etc.

  • Last but not least, for the fun of it. :)

Also, make sure to check out my other projects.

Thanks a ton!!
