Tech things by Robus

Tech things by Robus

Typesafe & Fast as F**k Set in C

Typesafe & Fast as F**k Set in C

Why even bother writing one?

Check out the project here. If you want to understand motivation behind writing one, you are welcome :)

Well, I had few design goals in my mind while implementing HashSet in C.

Safety, safety, safety

Trust me, it smells when you see void* as value type in your API signature. There are many languages that rely on such property where your API accepts the entire Cosmos. For example, you see something like this in Golang.


func Add(buffer interface{}) {}

Or, in C

function Add(void* buffer) {}

You never know, as a programmer, the type of argument it expects. Is it an integer? is it character buffer? Is it float?

Similarly, implementation of these API is usually hairy as these API mostly relies on runtime reflection of some sort to operate on values by "trying" to figure out the type on the value being operated on. It has massive runtime overhead associated with it. Avoiding it completely is the biggest rationale behind writing Cset lib.

Also, let's not forget the pleasure of catching bugs that could have been avoided during compilation which creeps up during runtime while using such dynamic APIs.

Generic

Although, C does not have support for Generic Programming compared to something like C++/Rust. It does have #define macros which we can use to emulate generics. Although the implementation is quirky, I think APIs are simple to reason about.

For instance, to define a Generic type that holds Set of int.


#include "cset.h"

Cset(int) cset_int_t;

Similarly, to define Generic type that holds Set of Node, where Node is user-defined struct.

#include "cset.h"

typedef struct {
  int x;
  int y;
} Node;

// Define Cset of Node
Cset(Node) cset_node_t;

Overall, I am a big fan of Generics. Remember, It does not always a good idea to litter code with Generics. Being careful is the key here.

Stupidly Simple

This is one of the hardest things to get right. The simplicity came at the cost of not so simple implementation detail. But I think that is a tradeoff well worth it. I won't bother much with my explanation here and let code do the speaking here.

#include "cset.h"

// Define generic type that holds `Set of int`
Cset(int) cset_int_t;

int main() {
  // Create instance of cset_int_t
  cset_int_t cset_int;

  // Initialize
  cset__init(&cset_int);

  // Add elements to set
  cset__add(&cset_int, 1);  
  cset__add(&cset_int, 2);

  // Remove element from set
  cset__remove(&cset_int, 1)

  // Size
  size_t size = cset__size(&cset_int);

  // Clear the set
  cset__clear(&cset_int);

  // Free up the heap
  cset__free(&cset_int);
}

Monomorphism

This is the compiler's way of generating code to mitigate the runtime cost that comes with abstraction. It, in simpler terms, creates copies for each Generic type that you define in your program. By making copies, it avoids dynamic dispatch and allows simple function calls with concrete types as if there were no Generic in the first place. For example, if I use a Cset(int) and a Cset(char) in my code, then the generated binary will have two copies of the generated code for Cset: one for Cset(int) and another for Cset(char). Let's look at the pseudocode.

Given Cset of int

Cset(int) cset_int_t;

Compiler generates

typedef struct Cset_int {
  // Buffer of type `int`
  int* buffer;
  .....
}

Given Cset of char

Cset(char) cset_char_t;

Compiler generates

typedef struct Cset_char {
  // Buffer of type `char` 
  char* buffer;
   .......
}

I hope this helps. By doing so, we don't pay any price to our abstractions. On the downside, our binary size increases, but I think the tradeoff between performance and size is well worth it.

Others

  • I wanted to avoid the problems of Primary and Secondary clustering that comes with Linear and Quadratic probing during collision resolution. Thus, Cset uses Double Hashing. There is an even better version of it called "Enhanced Double Hashing". I might later implement it. Maybe.

  • Cset also allows you to use Custom Comparator and Custom Hash function by overriding the default implementation which allows for greater control over your use-case. It is super important to implement both Comparator and Hasher for the correctness of your program. Read this StackOverflow post for more details on why is that the case. You can find the example in the README on how you can implement one.

  • Oh yes, it should be extremely fast. It uses XXHash for hashing keys that run at the RAM speed limits. Check out the benchmark that compares XXH3 with Murmur3, City64, etc.

  • Last but not least, for the fun of it. :)

Also, make sure to check out my other projects.

Follow me here

Reach out to me at

Thanks a ton!!

 
Share this