I am going to post a link to the Enum type that I wrote due to the limitations of the builtin Scala Version. I don't use the built in enumeration type anymore as I find it very limiting.

(source & test / example):

http://gist.github.com/282446

Originally I wrote this as answer to this question on StackOverflow, however felt that it would be useful here too.

## Wednesday, January 20, 2010

## Tuesday, January 5, 2010

### Follow up to Product.hashCode collisions

After some experimentation here is the result that I came up. Note that you might need to increase the Heap Space to get it to run in the interpreter because of sort / removeDuplicates.

I posted it as a gist [http://gist.github.com/270117]

You should be able to just copy and paste it into a scala 2.8 interpreter

This yields a result of:

(ideal = 1000000.0, actual = 999880, uniqueRatio = 0.99988)

meaning a collision rate < .1%

I posted it as a gist [http://gist.github.com/270117]

You should be able to just copy and paste it into a scala 2.8 interpreter

This yields a result of:

(ideal = 1000000.0, actual = 999880, uniqueRatio = 0.99988)

meaning a collision rate < .1%

### Possible Collision Issue with Product.hashCode

First Off I am no expert at hashing algo's & perhaps this is some sort of Birthday Problem so I am not sure if what I am pointing out is actually a big issue. It would be interesting to see what would happen on a project that uses large hash maps with Products in it if the distribution at the bottom was improved.

Quick Summary:

Right now it seems like the hashCode for Products of Arity N is the same regardless of the actual class implementing it & that it doesn't distribute very well:

In a little more depth: I think this behavior is much better than in 2.7.5 where the ordering of the product elements did not affect the hashCode. (Point(x,y) would clash with Point(y,x))

in 2.7.5:

Digging a little further on what this means for a distribution =>

Keep in mind that a big problem is that java's Integer.hashCode == the number being hashed, which is going to cause problems.

Using this code to get rudimentary collision distribution:

Here is the 2.7.5 result (takes a few minutes on a Core i7 920, removeDuplicates is terribly slow):

and Here is the 2.8 nightly result (surprisingly the same distribution, that takes only a few seconds):

Both cases are going to collide 90+% of the time (notice that changing tot to 100 reduces the collision % quite significantly so I suspect there is a quadratic relationship at play here)

Running with a tot =100 gives:

Still a 58% collide rate, but a lot better than the larger case.

Finally for the 3 Tuple cases with tot = 100

Perhaps I should look into a better case than just integers to see what the collision rates actually are like.

Quick Summary:

Right now it seems like the hashCode for Products of Arity N is the same regardless of the actual class implementing it & that it doesn't distribute very well:

case class Point(x : Int, y : Int)

case class Point2(x : Int, y : Int)

(1,2).hashCode == Point(1,2).hashCode == Point2(1,2).hashCode == 3405

Welcome to Scala version 2.8.0.r20331-b20100101020206 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_15).

List((1,2), (2,1), Point(1,2), Point(2,1), Point2(1,2), Point2(2,1)).map(_.hashCode)

res0: List[Int] = List(3405, 3445, 3405, 3445, 3405, 3445)

In a little more depth: I think this behavior is much better than in 2.7.5 where the ordering of the product elements did not affect the hashCode. (Point(x,y) would clash with Point(y,x))

in 2.7.5:

res0: List[Int] = List(1838716253, 1838716293, -1163646902, -1163646862, -270917823, -270917783)

Digging a little further on what this means for a distribution =>

Keep in mind that a big problem is that java's Integer.hashCode == the number being hashed, which is going to cause problems.

Using this code to get rudimentary collision distribution:

val tot = 500

val ideal = tot * tot * 1.0

val hashes = (for(x <- -(tot/2) until (tot/2); y <- -(tot/2) until (tot/2)) yield (x,y).hashCode ).toList

val hset = Set() ++ hashes

val hsSize = hset.size

val hlSize = hashes.removeDuplicates.size

println((ideal, hsSize, hsSize/ideal, hlSize, hlSize/ideal))

Here is the 2.7.5 result (takes a few minutes on a Core i7 920, removeDuplicates is terribly slow):

(250000.0, 20959,0.083836,20959,0.083836)

and Here is the 2.8 nightly result (surprisingly the same distribution, that takes only a few seconds):

`(250000.0, 20959,0.083836,20959,0.083836)`

Both cases are going to collide 90+% of the time (notice that changing tot to 100 reduces the collision % quite significantly so I suspect there is a quadratic relationship at play here)

Running with a tot =100 gives:

`(10000.0,4159,0.4159,4159,0.4159)`

Still a 58% collide rate, but a lot better than the larger case.

Finally for the 3 Tuple cases with tot = 100

`(1000000.0,170578,0.170578,170578,0.170578)`

Perhaps I should look into a better case than just integers to see what the collision rates actually are like.

Subscribe to:
Posts (Atom)