## Tuesday, January 5, 2010

### Possible Collision Issue with Product.hashCode

First Off I am no expert at hashing algo's & perhaps this is some sort of Birthday Problem so I am not sure if what I am pointing out is actually a big issue. It would be interesting to see what would happen on a project that uses large hash maps with Products in it if the distribution at the bottom was improved.

Quick Summary:
Right now it seems like the hashCode for Products of Arity N is the same regardless of the actual class implementing it & that it doesn't distribute very well:

`case class Point(x : Int, y : Int)case class Point2(x : Int, y : Int)(1,2).hashCode == Point(1,2).hashCode == Point2(1,2).hashCode == 3405`

`Welcome to Scala version 2.8.0.r20331-b20100101020206 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_15).List((1,2), (2,1), Point(1,2), Point(2,1), Point2(1,2), Point2(2,1)).map(_.hashCode)res0: List[Int] = List(3405, 3445, 3405, 3445, 3405, 3445)`

In a little more depth: I think this behavior is much better than in 2.7.5 where the ordering of the product elements did not affect the hashCode. (Point(x,y) would clash with Point(y,x))

in 2.7.5:
`res0: List[Int] = List(1838716253, 1838716293, -1163646902, -1163646862, -270917823, -270917783)`

Digging a little further on what this means for a distribution =>

Keep in mind that a big problem is that java's Integer.hashCode == the number being hashed, which is going to cause problems.

Using this code to get rudimentary collision distribution:

`val tot = 500val ideal = tot * tot * 1.0val hashes = (for(x <- -(tot/2) until (tot/2); y <- -(tot/2) until (tot/2)) yield (x,y).hashCode ).toListval hset = Set() ++ hashesval hsSize = hset.size val hlSize = hashes.removeDuplicates.sizeprintln((ideal, hsSize, hsSize/ideal, hlSize, hlSize/ideal))`

Here is the 2.7.5 result (takes a few minutes on a Core i7 920, removeDuplicates is terribly slow):
`(250000.0, 20959,0.083836,20959,0.083836)`

and Here is the 2.8 nightly result (surprisingly the same distribution, that takes only a few seconds):

`(250000.0, 20959,0.083836,20959,0.083836)`

Both cases are going to collide 90+% of the time (notice that changing tot to 100 reduces the collision % quite significantly so I suspect there is a quadratic relationship at play here)

Running with a tot =100 gives:
`(10000.0,4159,0.4159,4159,0.4159)`
Still a 58% collide rate, but a lot better than the larger case.

Finally for the 3 Tuple cases with tot = 100
`(1000000.0,170578,0.170578,170578,0.170578)`

Perhaps I should look into a better case than just integers to see what the collision rates actually are like.

Ben Jackman said...

Here is the post I added to the Scala debate mailing list

Ben Jackman said...

Here are some hashes of strings:
scala> "a".hashCode
res20: Int = 97

scala> "b".hashCode
res21: Int = 98

scala> "\00".hashCode
res22: Int = 0

scala> "\01".hashCode
res23: Int = 1

scala> "\00\00".hashCode
res24: Int = 0

scala> "\00\01".hashCode
res25: Int = 1

scala> "\00b".hashCode
res26: Int = 98

Ben Jackman said...

I wonder if it would be possible to add a collisionRate method to hash tables to see how many of their keys are colliding.

Ben Jackman said...