Hashing may sound unfamiliar to most people, but it is actually a very common technique in computer science and is closely related to our daily lives. For example, when you turn in a paper for your English class, it is very likely that your instructor uses Turnitin to check the similarities between your article and the existing articles in its database. It would be very inefficient for Turnitin to search through every article in the entire database to see whether there is a match for each word or sentence pattern in your article. Instead, Turnitin uses a hash-based database that relies on several hash functions to calculate the possible storage locations of each word or sentence pattern in your article, and it only looks at those locations to see if a match exists. This method is already orders of magnitude faster than searching a non-hash-based database, such as an array-based one, but a technique called Single Hash can accelerate the process by an additional 80%.
Figure 1. Array-based vs. Hash-based Turnitin
The speed improvement brought by the Single Hash technique matters for two reasons. First, the improvement comes from simply designing something in a different way. This essentially gives us “free power” because we don’t need to buy any new hardware. Second, the development of the semiconductor industry, which most of today’s electronics are based on, has been slowing down in recent years. If we want to keep increasing computational power, software design solutions like Single Hash are among the best options we have for now.
Before going into how the magic of the Single Hash technique happens, let’s get some background on hash functions.
The primary purpose of hashing is to map a larger set of data into a smaller set of data for faster access and more efficient memory usage. To do this, hashing uses a one-directional function called a hash function, which creates a many-to-one relationship between the inputs, or keys, and the outputs, or hash values. Given a key, the hash function can tell us the hash value, but we cannot determine the key from the hash value alone because many keys can map to the same hash value. When more than one key maps to a single hash value, we call it a collision.
Figure 2. Key, Hash Value, and Collision Diagram
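To make the idea concrete, here is a minimal sketch in Python. The 10-slot table and the use of Python’s built-in hash() are purely illustrative assumptions; real systems use their own hash functions and much larger tables.

```python
# Minimal illustration of keys, hash values, and collisions.
# The 10-slot table and Python's built-in hash() are stand-ins for illustration only.
TABLE_SIZE = 10

def hash_value(key: str) -> int:
    # Map the key to a slot index in [0, TABLE_SIZE).
    return hash(key) % TABLE_SIZE

print(hash_value("apple"), hash_value("grape"))
# With only 10 slots and an unlimited number of possible keys, different keys
# are bound to share a slot sooner or later -- that shared slot is a collision.
# Knowing only the slot index, there is no way to recover the original key.
```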
Collisions slow down a hash-based data structure because we always need to spend extra time resolving them. So, an ideal hash function should spread out the hash values as evenly as possible over a given range to reduce the number of collisions. The traditional approach solves this problem by feeding the key into many different hash functions to produce a set of hash values and then using that set to determine the final hash value of the key.
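As a rough sketch of what this traditional approach looks like in code (the specific hash functions and table size here are illustrative assumptions, not the ones used in the paper):

```python
import hashlib

TABLE_SIZE = 1000  # assumed range for the final hash values

def traditional_hash_values(key: str) -> list[int]:
    """Run one key through several independent hash functions
    and return the resulting set of hash values."""
    values = []
    # Three cryptographic hashes stand in for the "many different hash functions".
    for algorithm in ("md5", "sha1", "sha256"):
        digest = hashlib.new(algorithm, key.encode("utf-8")).digest()
        full_value = int.from_bytes(digest, "big")
        # The final modulo squeezes the value into the table range,
        # throwing away everything the higher digits contributed.
        values.append(full_value % TABLE_SIZE)
    return values

print(traditional_hash_values("hello"))  # three values, each in 0..999
```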
In practice, this traditional approach does a good job of spreading out the hash values evenly. However, it also has several disadvantages. First, the smaller internal hash functions used in this approach are usually very complex in order to increase the randomness of the produced hash values. They therefore demand a lot of computational power from the CPU and are time-consuming to evaluate. Furthermore, to fit the output inside a given range, the final step of each hash function is often a modulo operation that trims off the higher digits. For example, when the output is 1739 and the range is 100, the final step would be 1739 modulo 100 = 39. In this final step, the information carried by, and the time spent computing, the higher digits 1 and 7 are wasted. To address this performance limitation and this waste of data and computational power, Guo et al. (2018) introduced a new hash function called Single Hash that makes full use of the information produced by a single hash function to generate a set of hash values. These hash values can then be combined in the same way as in the traditional approach and fit into the existing design of hash-based data structures.
Figure 4. Single Hash Technique with Only One Hash Function
Figure 5. Single Hash Formula with Annotations
Figure 6. Bit-Shifting Diagram and Exclusive-OR Truth Table
The key improvement of the Single Hash technique is replacing the complex and time-consuming hash functions with bit-shifting and exclusive-OR (XOR) operations, which are very fast to execute because they are basic operations common to most of today’s processors. These basic operations serve as several mini hash functions. The inputs to these mini hash functions are the lower bits and the higher bits of the single hash value h. The outputs of these mini hash functions form a set of hash values just like in the traditional approach, so we can use them directly in our existing hash-based implementations. The mathematical analysis in Guo et al.’s research shows that the Single Hash process also maintains the independence among the produced hash values by including all the higher bits that would have been discarded by the modulo operation in the traditional approach.
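The sketch below captures this idea in Python. It is not the exact formula from Guo et al. (that is what Figure 5 shows); the bit width, the number of derived values, and the way the high bits are chunked are assumptions made only to illustrate how shifts and XOR can turn one hash computation into several hash values.

```python
import hashlib

TABLE_BITS = 10                 # assumed: each hash value lies in [0, 2**10)
MASK = (1 << TABLE_BITS) - 1    # keeps only the lowest TABLE_BITS bits

def single_hash_style_values(key: str, k: int = 4) -> list[int]:
    # One call to a single hash function produces one big integer h.
    h = int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest(), "big")

    low = h & MASK              # the lower bits of h
    values = []
    for i in range(k):
        # Shift a different chunk of the higher bits down and mix it with
        # the lower bits using XOR -- the bits a plain modulo would discard
        # still contribute to every derived hash value.
        high_chunk = (h >> (TABLE_BITS * (i + 1))) & MASK
        values.append(low ^ high_chunk)
    return values

print(single_hash_style_values("hello"))  # k hash values from one hash computation
```

Because the per-value work is only a shift, a mask, and an XOR, generating extra hash values is nearly free compared with calling another full hash function.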
The experiment in Guo et al.’s research applies the traditional approach and the Single Hash technique to three common hash-based data structures to compare both the number of collisions and the speed. The results of the experiment support the mathematical analysis. In addition, they show that the Single Hash technique is between 10% and 400% (yes, 400%) faster than the traditional approach, depending on the data structure it is applied to and the number of hash values generated. In general, the more hash values we need, the more improvement we get from the Single Hash technique.
Now you should be pretty familiar with hashing in computer science and the Single Hash technique. Hopefully, companies like Turnitin will adopt this new technique soon to make similarity checking even faster. Like many other internal optimizations in computer science, the improvement brought by the Single Hash technique won’t be noticed by many people. However, we would never be able to use computational power as efficiently as we do today without these unseen optimizations that lie underneath your favorite websites, apps, and games.
By Shida Yang - Department of Electrical and Computer Engineering, University of Florida
Reference:
Guo, X., Zhao, C., Yang, T., Zou, L., Zhou, Y., Yan, Y., Li, X., & Cui, B. (2018, 15-17 January). Single Hash: Use one hash function to build faster hash based data structures. Paper presented at 2018 IEEE International Conference on Big Data and Smart Computing (BigComp). Retrieved April 10, 2019, from IEEE Xplore Digital Library. doi: 10.1109/BigComp.2018.00048