Quick Summary:
Right now it seems like the hashCode for Products of Arity N is the same regardless of the actual class implementing it & that it doesn't distribute very well:
case class Point(x : Int, y : Int)
case class Point2(x : Int, y : Int)
(1,2).hashCode == Point(1,2).hashCode == Point2(1,2).hashCode == 3405
Welcome to Scala version 2.8.0.r20331-b20100101020206 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_15).
List((1,2), (2,1), Point(1,2), Point(2,1), Point2(1,2), Point2(2,1)).map(_.hashCode)
res0: List[Int] = List(3405, 3445, 3405, 3445, 3405, 3445)
In a little more depth: I think this behavior is much better than in 2.7.5 where the ordering of the product elements did not affect the hashCode. (Point(x,y) would clash with Point(y,x))
in 2.7.5:
res0: List[Int] = List(1838716253, 1838716293, -1163646902, -1163646862, -270917823, -270917783)
Digging a little further on what this means for a distribution =>
Keep in mind that a big problem is that java's Integer.hashCode == the number being hashed, which is going to cause problems.
Using this code to get rudimentary collision distribution:
val tot = 500
val ideal = tot * tot * 1.0
val hashes = (for(x <- -(tot/2) until (tot/2); y <- -(tot/2) until (tot/2)) yield (x,y).hashCode ).toList
val hset = Set() ++ hashes
val hsSize = hset.size
val hlSize = hashes.removeDuplicates.size
println((ideal, hsSize, hsSize/ideal, hlSize, hlSize/ideal))
Here is the 2.7.5 result (takes a few minutes on a Core i7 920, removeDuplicates is terribly slow):
(250000.0, 20959,0.083836,20959,0.083836)
and Here is the 2.8 nightly result (surprisingly the same distribution, that takes only a few seconds):
(250000.0, 20959,0.083836,20959,0.083836)
Both cases are going to collide 90+% of the time (notice that changing tot to 100 reduces the collision % quite significantly so I suspect there is a quadratic relationship at play here)
Running with a tot =100 gives:
(10000.0,4159,0.4159,4159,0.4159)
Still a 58% collide rate, but a lot better than the larger case.
Finally for the 3 Tuple cases with tot = 100
(1000000.0,170578,0.170578,170578,0.170578)
Perhaps I should look into a better case than just integers to see what the collision rates actually are like.
4 comments:
Here is the post I added to the Scala debate mailing list
Here are some hashes of strings:
scala> "a".hashCode
res20: Int = 97
scala> "b".hashCode
res21: Int = 98
scala> "\00".hashCode
res22: Int = 0
scala> "\01".hashCode
res23: Int = 1
scala> "\00\00".hashCode
res24: Int = 0
scala> "\00\01".hashCode
res25: Int = 1
scala> "\00b".hashCode
res26: Int = 98
I wonder if it would be possible to add a collisionRate method to hash tables to see how many of their keys are colliding.
follow up: http://scalide.blogspot.com/2010/01/follow-up-to-producthashcode-collisions.html
Post a Comment