Wednesday, January 20, 2010

A Better Scala Enumeration Type

I am going to post a link to the Enum type that I wrote due to the limitations of the builtin Scala Version. I don't use the built in enumeration type anymore as I find it very limiting.

(source & test / example):

Originally I wrote this as answer to this question on StackOverflow, however felt that it would be useful here too.

Tuesday, January 5, 2010

Follow up to Product.hashCode collisions

After some experimentation here is the result that I came up. Note that you might need to increase the Heap Space to get it to run in the interpreter because of sort / removeDuplicates.

I posted it as a gist []

You should be able to just copy and paste it into a scala 2.8 interpreter
This yields a result of:

(ideal = 1000000.0, actual = 999880, uniqueRatio = 0.99988)
meaning a collision rate < .1%

Possible Collision Issue with Product.hashCode

First Off I am no expert at hashing algo's & perhaps this is some sort of Birthday Problem so I am not sure if what I am pointing out is actually a big issue. It would be interesting to see what would happen on a project that uses large hash maps with Products in it if the distribution at the bottom was improved.

Quick Summary:
Right now it seems like the hashCode for Products of Arity N is the same regardless of the actual class implementing it & that it doesn't distribute very well:

case class Point(x : Int, y : Int)
case class Point2(x : Int, y : Int)
(1,2).hashCode == Point(1,2).hashCode == Point2(1,2).hashCode == 3405

Welcome to Scala version 2.8.0.r20331-b20100101020206 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_15).
List((1,2), (2,1), Point(1,2), Point(2,1), Point2(1,2), Point2(2,1)).map(_.hashCode)
res0: List[Int] = List(3405, 3445, 3405, 3445, 3405, 3445)

In a little more depth: I think this behavior is much better than in 2.7.5 where the ordering of the product elements did not affect the hashCode. (Point(x,y) would clash with Point(y,x))

in 2.7.5:

res0: List[Int] = List(1838716253, 1838716293, -1163646902, -1163646862, -270917823, -270917783)

Digging a little further on what this means for a distribution =>

Keep in mind that a big problem is that java's Integer.hashCode == the number being hashed, which is going to cause problems.

Using this code to get rudimentary collision distribution:

val tot = 500
val ideal = tot * tot * 1.0
val hashes = (for(x <- -(tot/2) until (tot/2); y <- -(tot/2) until (tot/2)) yield (x,y).hashCode ).toList
val hset = Set() ++ hashes
val hsSize = hset.size
val hlSize = hashes.removeDuplicates.size
println((ideal, hsSize, hsSize/ideal, hlSize, hlSize/ideal))

Here is the 2.7.5 result (takes a few minutes on a Core i7 920, removeDuplicates is terribly slow):

(250000.0, 20959,0.083836,20959,0.083836)

and Here is the 2.8 nightly result (surprisingly the same distribution, that takes only a few seconds):

(250000.0, 20959,0.083836,20959,0.083836)

Both cases are going to collide 90+% of the time (notice that changing tot to 100 reduces the collision % quite significantly so I suspect there is a quadratic relationship at play here)

Running with a tot =100 gives:
Still a 58% collide rate, but a lot better than the larger case.

Finally for the 3 Tuple cases with tot = 100

Perhaps I should look into a better case than just integers to see what the collision rates actually are like.