### A string hash problem

Earlier this week I was looking through various software developer jobs and I came across this one by Fog Creek. They've posted the following interesting problem, which is the subject of this post.

There are 16 major groups (surprise), and, looking closely, one notices that the left 8 groups of bins are a mirror image of those on the right. In other words, this hash function has an overwhelmingly biased image.

Thinking this could be related to polynomials, I started reading about string hash functions. It turns out we may write polynomials in a recursive manner, so

\begin{align}\sum_{i=0}^{n}a_{i}x^{i}&=a_{0}+a_{1}x+\cdots+a_{n}x^{n}\\&=a_{0}+x\left(a_{1}+x\left(a_{2}+\cdots+x\left(a_{n-1}+a_{n}x\right)\cdots\right)\right).\end{align}This is called Horner's method. Therefore, our hash algorithm above in expanded form is little more than the computation of a polynomial

\begin{align}h\left(x\right)&=a_{0}+a_{1}x+\cdots+a_{8}x^{8}+7\cdot x^{9},\end{align}evaluated at \(x=37\). Stated in this manner, it's at once easy to see that we're tasked with finding coefficients \(a_{0},\cdots,a_{8}\in\left[0,15\right]\subset\mathbb{Z}\) such that \(h\left(37\right)=945924806726376\). Thus, yes, there's \(16^{9}\) possible coefficient arrangements, but we can use what we know about polynomials to narrow down that search space.

First we remove the constant (I've called it a bias) \(7\cdot{37}^9\) from the resulting hash, which is also the smallest value of the hash function for a 9 character string consisting of

\begin{align}36192628160837&=\sum_{i=0}^{8}a_{i}37^{i}.\end{align}After a little more reading about the polynomial vector space, I noticed that we basically have a polynomial defined over a finite field \(\mathbb{Z}/16\mathbb{Z}\). Moreover, there's a

My solution, then, is to iterate from the largest to smallest degree polynomial basis and determine the largest coefficient a basis term (evaluated at \(x={37}\)) can have before, for instance \(37^i>36192628160837\), and subtract it. The code looks as follows.

Find a 9 letter string of characters that contains only letters from acdegilmnoprstuw such that the `hash(the_string)` is 945924806726376 if `hash` is defined by the following pseudo-code: Int64 hash (String s) { Int64 h = 7 String letters = "acdegilmnoprstuw" for (Int32 i = 0; i < s.length; i++) { h = (h * 37 + letters.indexOf(s[i])) } return h }I had no clue what to do when I first read this. In fact, my first attempt was to brute force my way through \({16}^{9}\sim 68\cdot {10}^9\) possibilities. Obviously we need a better method than that.

from typing import Optional from itertools import product alphabet = 'acdegilmnoprstuw' def hash(string: str) -> int: """ Quick implementation of the hash function. Parameters ---------- string : str String to be hashed. Returns ------- h : int Hash of the input string. """ h = 7 for character in string: h = h * 37 + alphabet.rfind(character) return h def bruteFindString(hashToFind: int) -> Optional[str]: """ Find the string by brute force (-- a really bad idea). Parameters ---------- hashToFind : int Hash of some string we'd like to find. Returns ------- string : Optional[str] String whose hash equals hashToFind or None if no string was found. """ for charSubset in product(alphabet, repeat=9): string = ''.join(charSubset) if hash(string) == hashToFind: return stringSomehow we need to find a bias in the hash algorithm and exploit it. The first clue came when I plotted a histogram of the range of this function for random n-character strings from

`alphabet`

.There are 16 major groups (surprise), and, looking closely, one notices that the left 8 groups of bins are a mirror image of those on the right. In other words, this hash function has an overwhelmingly biased image.

Thinking this could be related to polynomials, I started reading about string hash functions. It turns out we may write polynomials in a recursive manner, so

\begin{align}\sum_{i=0}^{n}a_{i}x^{i}&=a_{0}+a_{1}x+\cdots+a_{n}x^{n}\\&=a_{0}+x\left(a_{1}+x\left(a_{2}+\cdots+x\left(a_{n-1}+a_{n}x\right)\cdots\right)\right).\end{align}This is called Horner's method. Therefore, our hash algorithm above in expanded form is little more than the computation of a polynomial

\begin{align}h\left(x\right)&=a_{0}+a_{1}x+\cdots+a_{8}x^{8}+7\cdot x^{9},\end{align}evaluated at \(x=37\). Stated in this manner, it's at once easy to see that we're tasked with finding coefficients \(a_{0},\cdots,a_{8}\in\left[0,15\right]\subset\mathbb{Z}\) such that \(h\left(37\right)=945924806726376\). Thus, yes, there's \(16^{9}\) possible coefficient arrangements, but we can use what we know about polynomials to narrow down that search space.

First we remove the constant (I've called it a bias) \(7\cdot{37}^9\) from the resulting hash, which is also the smallest value of the hash function for a 9 character string consisting of

`aaaaaaaa`

, i.e. when \(\forall a_{i}=0\). Now, we need to find \(a_{i}\) in the equation\begin{align}36192628160837&=\sum_{i=0}^{8}a_{i}37^{i}.\end{align}After a little more reading about the polynomial vector space, I noticed that we basically have a polynomial defined over a finite field \(\mathbb{Z}/16\mathbb{Z}\). Moreover, there's a

*lot*of material online about finite (Galois) fields and polynomials. While writing this post, I found this set of slides to be really helpful and informative.My solution, then, is to iterate from the largest to smallest degree polynomial basis and determine the largest coefficient a basis term (evaluated at \(x={37}\)) can have before, for instance \(37^i>36192628160837\), and subtract it. The code looks as follows.

from operator import itemgetter def findString(hashToFind: int, letterSubsetLength: int) -> str: """ Find the string whose hash matches hashToFind. Parameters ---------- hashToFind : int Hash of some string we'd like to find. letterSubsetLength : int Number of letters used, subset of letters. Returns ------- hashedString : str String whose hash equals the hashToFind. Raises ------ ValueError letterSubsetLength is less than 1. """ if letterSubsetLength < 1: raise ValueError('Expected positive subset length') unBiasedHash = hashToFind - 7 * pow(37, letterSubsetLength) coefficients = [] for i in range(letterSubsetLength - 1, -1, -1): # Determine the largest value a polynomial coefficient # can be before the basis is larger than the hash power = pow(37, i) coeff = unBiasedHash // power coefficients.append(coeff) unBiasedHash -= coeff * power hashedString = ''.join(itemgetter(*coefficients)(alphabet)) return hashedStringOf course, this implementation assumes you'll give it the hash value of a string that was hashed with

`hash`

above, and you can insert all these checks to ensure you don't get an `IndexError`

or some other exception, but it solves this basic problem.$ ./hashProblem.py I think the answer is 'promenade'