Comparison Of Data Hashing Algorithms
ABSTRACT
I became interested in this topic when I learned that one needs to be very specific when writing key words to search for information on the Internet. So, I pursued to find out how information (data) is stored in computers. I began to focus on learning about methods for categorizing data for ease of search. The computer term for categorizing is called hashing. Three hashing methods are commonly used: chaining hash, additive rehash, and multiplicative rehash. The efficiency of a hashing method is measured by the number of collisions, found in grouping the data, and the number of comparisons, found in searching through the data. For this project I used three programs written in the C programming language and a data set of web addresses for teachers, a certain service, and students. My testing shows that the chaining hash method resulted in the fewest number of collisions and comparisons. More testing is needed with different types and larger data sets to arrive at more definitive conclusions.
Which of the three data hashing algorithms (chaining hash, additive rehash 3, and multiplicative rehash 3) results in the fewest number of collisions in grouping the data and the fewest number of comparisons in searching the data set?
Hashing is a computer science term that means categorizing. It is used to group data by their common characteristics. Each datum is assigned to a certain cell in an array. The number of cells in an array should be fairly close to the data size. The target cell is determined by a hashing function. Collisions can be used to test for the efficiency of hashing algorithms, and occur when the target cell is occupied. They are measured in a whole database. There are two main hashing methods, each of which responds to collisions in a different way. In the chaining hash method, if the target cell is occupied, a new cell is created for each of the subsequent datum. In the rehash methods, instead of building a new cell, the subsequent datum is re-assigned to some other cell in the array. The rehash number determines the cell numbers to which subsequent data are re-assigned. For example, in rehash 3, the rehash number is 3. Rehash can be done in two different ways: additive rehash and multiplicative rehash. In additive rehash, when a collision occurs, the cell for each of the subsequent data is determined by the following formula: C = A + R where C = the cell number for subsequent datum, A = the target cell, and R = the rehash number. In the multiplicative rehash method, when a collision occurs, the cell for each subsequent datum is determined by the following formula: C = A´R + 1 where C = the cell number for subsequent data, A = the target cell, and R = the rehash number. In rehash, all calculations are done in modulo arithmetic. The modulo-number is the array size.
An object in designing data storage systems is to make them efficient for searching. Comparisons occur when trying to find certain pieces of data by searching the array. The number of comparisons is the number of cells looked in to find the particular datum. In my project, I tested a database of four sets of web addresses. Set 1 was made up of teacher related addresses. Set 2 was made up of web addresses for a certain service. Set 3 was made up of student related web addresses. Set 4 was the union of the first three data sets. I got all of my data from a high school in Montgomery County.
The following examples illustrate the two hashing methods: chaining hash and additive rehash 3. The set {3z3, 5x3, 7x2, 7y, 9z, 10, 5} will be hashed. The size of array and modulo number is 7. In this case, the hashing function is that the data will go into the cell number that corresponds to the highest power of the variable in the term. For example, cell 3 will be the target for 3z3 and 5x3, and cell 0 will be the target for 10, and 5. The multiplicative rehash method is not illustrated since it is so similar to the additive rehash method.
First, 3z3 is placed in cell 3. The next datum, 5x3, is placed in a new cell below cell 3. Datum 7x2 goes into cell 2. 7y goes into cell 1. Then, 9z is placed in a new cell below cell 1. Datum 10 goes into cell 0 and then 5 goes into new cell below 0. The number of collisions is three.

The target cell for 3z3 is cell 3. Then
3y3 is placed in the cell A + R = 3 + 3 = 6. Datum 7x2 goes
into cell 2 and 7y goes into cell 1. Then 9z goes into cell 1 + 3 = 4. Datum
10 goes into cell 0. Then 5 goes into 0 + 3 = 3. This cell is full so
the datum moves into 3 + 3 = 6. It is also occupied. The datum 5 moves
into 6 + 3 = 9 2. The datum again needs to move into 2 + 3 = 5. There
are six collisions.
I hypothesize that if I were to test all three hashing methods mentioned, then I would find the chaining hash method to result in the fewest number of collisions and comparisons. My reasoning is that the chaining hash method seems to be much more orderly and organized. It should be easier to find each group of data.
MATERIALS
One computer
One C-program for the chaining hash method*
One C-program for the additive rehash method*
One C-program for the multiplicative rehash method*
One C compiler
One C editor
Four sets of web addresses, the fourth set being union of first three.* The first data set was web addresses for teachers, the second data set was web addresses for certain services, the third data set was web addresses for students, and the fourth data set was the collection (union) of the first three data sets.
· The programs and the data sets were provided by a high school student.
PROCEDURE
1. Choose a reasonable function to classify web addresses in programs. *
2. Gather web addresses (I used a collection of web addresses grouped into three sets).
3. Based on the size of data, decide on the size (number of cells) of the array.
4. Select rehash number that is relatively prime to the size of array.
5. Create programs in the “C” programming language for chaining hash, additive rehash, and multiplicative rehash algorithms.
6. Input set of web addresses.
7. Run program for chaining hash algorithm and first set of web addresses, and record number of collisions and comparisons.
8. Repeat step 7 for the second, third, and fourth (all combined) sets of web addresses.
9. Run program for additive rehash algorithm for all four data sets.
10. Run program for multiplicative rehash algorithm for all four data sets.
* All of the web addresses are in the form ____.mbhs.edu. My function is y´(x-3) where x is the number of letters in the blank and y is the assigned ASCII of the last letter.
RESULTS
Table 1: The number of comparisons and collisions for the three hashing methods applied to the four data sets.
|
Multiplicative Rehash |
Additive Rehash |
Chaining Hash |
|
|
Comparisons data set 1 |
77 |
79 |
18 |
|
Comparisons data set 2 |
45 |
48 |
31 |
|
Comparisons data set 3 |
98 |
95 |
55 |
|
Comparisons data set 4 |
4272 |
3677 |
3274 |
|
Collisions data set 4 |
4002 |
3407 |
3274 |
The collisions were only measured for the data set 4, which was a union of the first three. The number of collisions for the chaining hash method was 133 less than for the additive rehash 3 method, and 728 less than for the multiplicative rehash 3 method.
CONCLUSION
For all four data sets, the chaining hash method resulted in the fewest number of comparisons. For the combined data set 4 the chaining hash method resulted in the fewest number of collisions as well. My hypothesis turned out to be correct. However, I tested very specific sets of web addresses. It may not be possible to generalize the conclusions to other types and larger data sets. Despite that limitation, I have discussed the criteria for efficiency of data hashing methods and demonstrated how different hashing methods may be compared.
REFERENCES
Aitkin, Peter, and Jones, Bradley. Bestseller’s Edition: Teach Yourself C In 21 Days. Indiana: Sams Publishing, 1994.
Brain, Marshall. How Stuff Works. How C Programming Works. 9 December 1999.
Dale, Nell, and Lilly, Susan C. Pascal Plus: Data Structures. Massachusetts: D.C. Health and Company, 1995.
Kacker, Ravi. Personal Interview. 5 December 1999.
Schild, Herbert. The Complete Reference: Borland C++. California: Osborne McGraw-Hill, 1997.