Collections


The making of cHashTable and cDictionary

So, I wanted to create an associative array for Visual DataFlex. These are the things I wanted my array to handle:

  • String keys of variable length.
  • Variant values (I wanna be able to store an array or a struct in my collection).
  • Insert and find operation times must be faster than O(n) on average.
  • Retrieving the keys in sorted order must be cheaper than sorting the whole collection (O(n log n)) every time.

After some wiki-time I found out that a hash table is good for all of the above besides the sorting. To get the sorting capabilities I had to look into implementing some kind of search tree, and that is when I stumbled upon the trie/digital tree. So I decided to create one hash table class and one trie/digital tree class: the former faster but without sorting, the latter slower but with sorting as a byproduct of the implementation.

Besides performance, I also wanted a simple and intuitive interface for working with the collections: just plain Set Value of/Get Value of methods without any ByRef parameters.

Implementing the hash table

A good hash table depends on its hashing method and its collision resolution. For the hashing I tried out several algorithms before settling on djb2. More complex ones exposed VDF's calculation speed issues, and simpler ones just did not get enough spread, which led to an unbalanced hash table. For the collision resolution I went with the closed addressing technique (separate chaining), which means that the lookup speed also depends on the time it takes to search through the bucket for the current hash index.

Illustration of the closed addressing technique
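
The hashing code itself isn't shown here, but the core of djb2 is tiny. Below is a rough sketch of the idea in DataFlex; my own illustration rather than the actual cHashTable source, with the Mod inside the loop only there to keep the running hash inside a signed 32-bit Integer, where classic djb2 relies on unsigned overflow.

Use VDFBase.pkg

Class cDjb2Sketch is a cObject
    // Maps a string key to a bucket index in the range 0..(iBucketCount - 1).
    Function BucketIndex String sKey Integer iBucketCount Returns Integer
        Integer iHash iPos
        Move 5381 to iHash
        For iPos from 1 to (Length(sKey))
            // hash = hash * 33 + ascii(next character), capped with Mod so the
            // running value never overflows a signed 32-bit Integer
            Move (Mod((iHash * 33) + Ascii(Mid(sKey, 1, iPos)), 16777213)) to iHash
        Loop
        Function_Return (Mod(iHash, iBucketCount))
    End_Function
End_Class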

The bucket search is an O(n) operation on average, so letting the buckets in a hash table grow large is not a good idea. To keep the buckets small I had to implement a resize/rehash step that occurs when the hash table is filled to a certain rate. Highly influenced by Java's hash table, I added three properties to my hash table class: MaxBucketSize = 64, MaxBucketCount = 128 and LoadFactor = 0.75. When the item count divided by MaxBucketSize*MaxBucketCount exceeds 0.75, a resize/rehash occurs. During the resize/rehash the MaxBucketCount is doubled and the MaxBucketSize is left untouched. By doing this, and having a hash method with good distribution, we never have to search through more than 64 entries no matter how big the hash table gets.
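
This is not the actual class code, but roughly how the trigger condition could look with those three properties (class and property names here are my own, not necessarily the ones used in cHashTable.pkg):

Use VDFBase.pkg

Class cLoadCheckSketch is a cObject
    Procedure Construct_Object
        Forward Send Construct_Object
        Property Integer piMaxBucketSize 64
        Property Integer piMaxBucketCount 128
        Property Number pnLoadFactor 0.75
        Property Integer piItemCount 0
    End_Procedure

    // True when itemCount / (MaxBucketSize * MaxBucketCount) > LoadFactor,
    // i.e. when the table is full enough that a resize/rehash should run.
    Function NeedsRehash Returns Boolean
        Integer iCapacity
        Move (piMaxBucketSize(Self) * piMaxBucketCount(Self)) to iCapacity
        Function_Return (piItemCount(Self) > (pnLoadFactor(Self) * iCapacity))
    End_Function

    // The rehash doubles the bucket count and leaves the bucket size alone,
    // so no bucket ever has to be scanned past 64 entries.
    Procedure GrowBuckets
        Set piMaxBucketCount to (piMaxBucketCount(Self) * 2)
        // ...re-insert every stored entry using the new bucket count here...
    End_Procedure
End_Class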

Yes, resizing/rehashing is not super fast... But feel free to modify the parameters to make the rehash occur less often. I found 64/128/0.75 to be a good all-round setting.

cHashTable usage guide
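
The original usage guide isn't reproduced here, but based on the Set Value of/Get Value of interface described above, basic usage presumably looks something like this (object and key names are just made-up examples):

Use cHashTable.pkg

Object oSettings is a cHashTable
End_Object

String sName
Variant vLogins

// Values are Variants, so strings, numbers, arrays and structs all fit.
Set Value of oSettings "user.name" to "Erik"
Set Value of oSettings "user.logins" to 42

// Reading them back is just as plain; no ByRef parameters involved.
Get Value of oSettings "user.name" to sName
Get Value of oSettings "user.logins" to vLogins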

Implementing the trie/digital tree

My digital tree is just a collection of node elements that can carry a value and 256 child nodes. Each child represents an ascii code. Look at the image below and you'll get the point.

Paths are found by recursively stepping down the tree. A 4 character key will be found on the 4th level, a 5 character key on the 5th level, and so on. This is very effective when you keep the key lengths short. For longer keys the paths get more complex and you might experience performance loss and some memory inefficiency.

The really cool thing about storing the keys like this is that you can automatically retrieve your keys in ascii-code order by scanning the tree from top to bottom, left to right.
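
To make the structure a bit more concrete, here is a rough sketch of the idea; my own illustration rather than the actual cDictionary source, and it uses integer indexes into a node array where the real class allocates handles:

Use VDFBase.pkg

// One node per key prefix: a value slot plus one child slot per ascii code.
// Children are indexes into one shared node array; 0 means "no child"
// (slot 0 holds the root, which is never a child itself).
Struct tTrieNode
    Boolean bHasValue
    Variant vValue
    Integer[256] iaChildren
End_Struct

Class cTrieSketch is a cObject
    // Follows the key one character per level, so a 4 character key is
    // resolved on the 4th level. Returns an empty value if the path breaks
    // off before the key ends.
    Function FindValue tTrieNode[] aNodes String sKey Returns Variant
        Integer iPos iNode iChar
        Move 0 to iNode
        For iPos from 1 to (Length(sKey))
            Move (Ascii(Mid(sKey, 1, iPos))) to iChar
            Move aNodes[iNode].iaChildren[iChar] to iNode
            If (iNode = 0) Function_Return ""
        Loop
        Function_Return aNodes[iNode].vValue
    End_Function

    // Visiting the 256 child slots in ascending order while walking the tree
    // top-bottom appends the stored keys to asKeys already sorted in
    // ascii-code order. (Passing aNodes by value on every call is wasteful;
    // the real class keeps its nodes as a member instead.)
    Procedure CollectKeys tTrieNode[] aNodes Integer iNode String sPrefix String[] ByRef asKeys
        Integer iChar iChild
        If (aNodes[iNode].bHasValue) Move sPrefix to asKeys[SizeOfArray(asKeys)]
        For iChar from 0 to 255
            Move aNodes[iNode].iaChildren[iChar] to iChild
            If (iChild <> 0) Send CollectKeys aNodes iChild (sPrefix + Character(iChar)) (&asKeys)
        Loop
    End_Procedure
End_Class

Starting the scan at the root with an empty prefix fills asKeys with every stored key already sorted, which is exactly the sorting-as-a-byproduct described above.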

I decided to name the trie/digital tree class cDictionary.

cDictionary usage guide
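
As with cHashTable, the guide itself isn't reproduced here, but the same Set Value of/Get Value of interface should apply. Since the values are Variants, a whole array (or a struct) should be able to go straight in; a hypothetical example:

Use cDictionary.pkg

Object oCache is a cDictionary
End_Object

String[] asTags
String[] asBack
Variant vValue

Move "vdf" to asTags[0]
Move "collections" to asTags[1]

// Variant values mean a whole array (or a struct) can go into the collection...
Move asTags to vValue
Set Value of oCache "tags" to vValue

// ...and be moved back out into a native array again.
Get Value of oCache "tags" to vValue
Move vValue to asBack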

Performance tests

  • All tests are done on an Intel Core i5 3320M CPU @ 2.6 GHz computer with 16 GB RAM, running VDF 17.0 on Windows 7 64-bit.
  • The tests are done by putting unique keys into the arrays.
  • The resulting data is produced by calculating an average out of 5 different runs for each setting.

The reference class

In the tests below you'll see that I keep mentioning a reference-class. This is an implementation of an associative array often seen in VDF applications, where the keys and the values are stored in two parallel property arrays. This design has two major flaws: first, it requires an O(n) key existence check every time a key is added or a value is fetched, and second, it doesn't take the copy-on-write principle into consideration. Both flaws show up clearly in the test results below.
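
For reference, this is roughly the pattern meant; my own reconstruction of the typical two-parallel-arrays approach rather than any particular codebase:

Use VDFBase.pkg

Class cReferenceArraySketch is a cObject
    Procedure Construct_Object
        Forward Send Construct_Object
        Property String[] pasKeys
        Property Variant[] pavValues
    End_Procedure

    Procedure Set Value String sKey Variant vValue
        String[] asKeys
        Variant[] avValues
        Integer iIndex
        Get pasKeys to asKeys                        // fetch both whole arrays...
        Get pavValues to avValues
        Move (SearchArray(sKey, asKeys)) to iIndex   // ...O(n) scan for the key...
        If (iIndex = -1) Move (SizeOfArray(asKeys)) to iIndex
        Move sKey to asKeys[iIndex]
        Move vValue to avValues[iIndex]
        Set pasKeys to asKeys                        // ...and write both back again
        Set pavValues to avValues
    End_Procedure

    Function Value String sKey Returns Variant
        String[] asKeys
        Variant[] avValues
        Integer iIndex
        Get pasKeys to asKeys
        Move (SearchArray(sKey, asKeys)) to iIndex   // O(n) scan on every read too
        If (iIndex = -1) Function_Return ""
        Get pavValues to avValues
        Function_Return avValues[iIndex]
    End_Function
End_Class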

Write test - 10,000 unique 8 character keys

The following graph shows how long it takes (ms) to write 100 entries as the array keeps on growing.


Write test - 10,000 unique 128 character keys

The following graph shows how long it takes (ms) to write 100 entries as the array keeps on growing.


Read test - 10,000 unique 8 character keys

The following graph shows how long it takes (ms) to read 100 entries as the array keeps on growing.


Read test - 10,000 unique 128 character keys

The following graph shows how long it takes (ms) to read 100 entries as the array keeps on growing.


Write test - 100,000 unique 8 character keys

The following graph shows how long it takes (ms) to write 1000 entries as the array keeps on growing.

Read test - 100,000 unique 8 character keys

The following graph shows how long it takes (ms) to read 1000 entries as the array keeps on growing.

Speed averages

Average time to read/write 1 entry, based on the 100,000 entries tests above.

Class             Write speed (ms)   Read speed (ms)
cDictionary       0.09               0.06
cHashTable        0.31               0.09
reference-class   11.26              2.83

Memory usage

How much memory is allocated when storing 100,000 entries with key lengths of 4-12 characters and 16 character string values?

Key length   cHashTable (MB)   cDictionary (MB)
4            21.49             267.32
6            21.19             562.59
8            23.99             857.79
10           22.26             1152.91
12           22.52             1448.09

Conclusions

Ok, let's look at the results:

You can clearly see the cHashTable rehashing and the speed gain after the rehash on all graphs. But the overall winner for the speed tests is the cDictionary...

...But look at the memory usage! Yep, cDictionary is totally out of control there. But when you think about the design, it's not that strange. I'm gonna try to modify the cDictionary class to use dynamically sized child node arrays instead; at the moment a handle[255] array is allocated for each new node. The cHashTable keeps its memory usage much better contained, which is a relief.

Anyhow...

In the 100,000-entry tests you can really see the O(1)-like behavior of both cHashTable and cDictionary. They both won by a landslide over the reference-class.

Bottom line...

Use the cHashTable class and wait until I've corrected the cDictionary's memory issues. With the cHashTable class you get a super fast associative array that can store any type of data(!). I'm very excited about all the cool implementations you could use it for. I'm thinking JSON parser, dynamic object properties, making a subclass with persistent storage, you name it.

Download

cHashTable.pkg
cDictionary.pkg (memory leak-ish version)
