Tuesday, January 5, 2010

Possible Collision Issue with Product.hashCode

First Off I am no expert at hashing algo's & perhaps this is some sort of Birthday Problem so I am not sure if what I am pointing out is actually a big issue. It would be interesting to see what would happen on a project that uses large hash maps with Products in it if the distribution at the bottom was improved.

Quick Summary:
Right now it seems like the hashCode for Products of Arity N is the same regardless of the actual class implementing it & that it doesn't distribute very well:


case class Point(x : Int, y : Int)
case class Point2(x : Int, y : Int)
(1,2).hashCode == Point(1,2).hashCode == Point2(1,2).hashCode == 3405



Welcome to Scala version 2.8.0.r20331-b20100101020206 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_15).
List((1,2), (2,1), Point(1,2), Point(2,1), Point2(1,2), Point2(2,1)).map(_.hashCode)
res0: List[Int] = List(3405, 3445, 3405, 3445, 3405, 3445)


In a little more depth: I think this behavior is much better than in 2.7.5 where the ordering of the product elements did not affect the hashCode. (Point(x,y) would clash with Point(y,x))

in 2.7.5:

res0: List[Int] = List(1838716253, 1838716293, -1163646902, -1163646862, -270917823, -270917783)



Digging a little further on what this means for a distribution =>

Keep in mind that a big problem is that java's Integer.hashCode == the number being hashed, which is going to cause problems.

Using this code to get rudimentary collision distribution:


val tot = 500
val ideal = tot * tot * 1.0
val hashes = (for(x <- -(tot/2) until (tot/2); y <- -(tot/2) until (tot/2)) yield (x,y).hashCode ).toList
val hset = Set() ++ hashes
val hsSize = hset.size
val hlSize = hashes.removeDuplicates.size
println((ideal, hsSize, hsSize/ideal, hlSize, hlSize/ideal))


Here is the 2.7.5 result (takes a few minutes on a Core i7 920, removeDuplicates is terribly slow):

(250000.0, 20959,0.083836,20959,0.083836)


and Here is the 2.8 nightly result (surprisingly the same distribution, that takes only a few seconds):

(250000.0, 20959,0.083836,20959,0.083836)

Both cases are going to collide 90+% of the time (notice that changing tot to 100 reduces the collision % quite significantly so I suspect there is a quadratic relationship at play here)

Running with a tot =100 gives:
(10000.0,4159,0.4159,4159,0.4159)
Still a 58% collide rate, but a lot better than the larger case.

Finally for the 3 Tuple cases with tot = 100
(1000000.0,170578,0.170578,170578,0.170578)


Perhaps I should look into a better case than just integers to see what the collision rates actually are like.

4 comments:

Ben Jackman said...

Here is the post I added to the Scala debate mailing list

Ben Jackman said...

Here are some hashes of strings:
scala> "a".hashCode
res20: Int = 97

scala> "b".hashCode
res21: Int = 98

scala> "\00".hashCode
res22: Int = 0

scala> "\01".hashCode
res23: Int = 1

scala> "\00\00".hashCode
res24: Int = 0

scala> "\00\01".hashCode
res25: Int = 1

scala> "\00b".hashCode
res26: Int = 98

Ben Jackman said...

I wonder if it would be possible to add a collisionRate method to hash tables to see how many of their keys are colliding.

Ben Jackman said...

follow up: http://scalide.blogspot.com/2010/01/follow-up-to-producthashcode-collisions.html