I am going to post a link to the Enum type that I wrote due to the limitations of the builtin Scala Version. I don't use the built in enumeration type anymore as I find it very limiting.
(source & test / example):
http://gist.github.com/282446
Originally I wrote this as answer to this question on StackOverflow, however felt that it would be useful here too.
Wednesday, January 20, 2010
Tuesday, January 5, 2010
Follow up to Product.hashCode collisions
After some experimentation here is the result that I came up. Note that you might need to increase the Heap Space to get it to run in the interpreter because of sort / removeDuplicates.
I posted it as a gist [http://gist.github.com/270117]
You should be able to just copy and paste it into a scala 2.8 interpreter
This yields a result of:
(ideal = 1000000.0, actual = 999880, uniqueRatio = 0.99988)
meaning a collision rate < .1%
I posted it as a gist [http://gist.github.com/270117]
You should be able to just copy and paste it into a scala 2.8 interpreter
This yields a result of:
(ideal = 1000000.0, actual = 999880, uniqueRatio = 0.99988)
meaning a collision rate < .1%
Possible Collision Issue with Product.hashCode
First Off I am no expert at hashing algo's & perhaps this is some sort of Birthday Problem so I am not sure if what I am pointing out is actually a big issue. It would be interesting to see what would happen on a project that uses large hash maps with Products in it if the distribution at the bottom was improved.
Quick Summary:
Right now it seems like the hashCode for Products of Arity N is the same regardless of the actual class implementing it & that it doesn't distribute very well:
In a little more depth: I think this behavior is much better than in 2.7.5 where the ordering of the product elements did not affect the hashCode. (Point(x,y) would clash with Point(y,x))
in 2.7.5:
Digging a little further on what this means for a distribution =>
Keep in mind that a big problem is that java's Integer.hashCode == the number being hashed, which is going to cause problems.
Using this code to get rudimentary collision distribution:
Here is the 2.7.5 result (takes a few minutes on a Core i7 920, removeDuplicates is terribly slow):
and Here is the 2.8 nightly result (surprisingly the same distribution, that takes only a few seconds):
Both cases are going to collide 90+% of the time (notice that changing tot to 100 reduces the collision % quite significantly so I suspect there is a quadratic relationship at play here)
Running with a tot =100 gives:
Still a 58% collide rate, but a lot better than the larger case.
Finally for the 3 Tuple cases with tot = 100
Perhaps I should look into a better case than just integers to see what the collision rates actually are like.
Quick Summary:
Right now it seems like the hashCode for Products of Arity N is the same regardless of the actual class implementing it & that it doesn't distribute very well:
case class Point(x : Int, y : Int)
case class Point2(x : Int, y : Int)
(1,2).hashCode == Point(1,2).hashCode == Point2(1,2).hashCode == 3405
Welcome to Scala version 2.8.0.r20331-b20100101020206 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_15).
List((1,2), (2,1), Point(1,2), Point(2,1), Point2(1,2), Point2(2,1)).map(_.hashCode)
res0: List[Int] = List(3405, 3445, 3405, 3445, 3405, 3445)
In a little more depth: I think this behavior is much better than in 2.7.5 where the ordering of the product elements did not affect the hashCode. (Point(x,y) would clash with Point(y,x))
in 2.7.5:
res0: List[Int] = List(1838716253, 1838716293, -1163646902, -1163646862, -270917823, -270917783)
Digging a little further on what this means for a distribution =>
Keep in mind that a big problem is that java's Integer.hashCode == the number being hashed, which is going to cause problems.
Using this code to get rudimentary collision distribution:
val tot = 500
val ideal = tot * tot * 1.0
val hashes = (for(x <- -(tot/2) until (tot/2); y <- -(tot/2) until (tot/2)) yield (x,y).hashCode ).toList
val hset = Set() ++ hashes
val hsSize = hset.size
val hlSize = hashes.removeDuplicates.size
println((ideal, hsSize, hsSize/ideal, hlSize, hlSize/ideal))
Here is the 2.7.5 result (takes a few minutes on a Core i7 920, removeDuplicates is terribly slow):
(250000.0, 20959,0.083836,20959,0.083836)
and Here is the 2.8 nightly result (surprisingly the same distribution, that takes only a few seconds):
(250000.0, 20959,0.083836,20959,0.083836)
Both cases are going to collide 90+% of the time (notice that changing tot to 100 reduces the collision % quite significantly so I suspect there is a quadratic relationship at play here)
Running with a tot =100 gives:
(10000.0,4159,0.4159,4159,0.4159)
Still a 58% collide rate, but a lot better than the larger case.
Finally for the 3 Tuple cases with tot = 100
(1000000.0,170578,0.170578,170578,0.170578)
Perhaps I should look into a better case than just integers to see what the collision rates actually are like.
Subscribe to:
Posts (Atom)