Tuesday, January 13, 2009

The use of Java Sets

The Set interface requires that the underlying implementation can tell whether two objects added to it are the same, so that the second can be rejected (a Set is defined as only containing one copy of any given object).  Ensuring adherence to this requirement involves both the Set implementation and the mechanism by which the objects can be distinguished from one-another.  The two basic implementations that i use in my code, TreeSet and HashSet, handle this in different ways:

TreeSets are ordered b-trees (or something very similar) of objects.  Because they’re ordered, the objects they contain must implement the Comparable interface, or alternatively the TreeSet itself may have a Comparator member; in either case, a compareTo() method is called to compare objects to one another.  compareTo() returns -1, 0 or 1, depending on whether the first argument is less than, equal to or greater than the second; this is how the nodes in the tree are ordered, and how the TreeSet code can tell when two objects are equal.  You wouldn’t have noticed this if you’ve only been using TreeSets to store Strings, Integers, etc. because those classes all implement compareTo(), but if you create your own class and try to add instances to a TreeSet without having defined compareTo(), you’ll get an exception saying that your instance can’t be cast to Comparable.

HashSets are typically unordered, and thus don’t need the objects they contain to be Comparable.  What they do need is for those objects to implement hashCode(), which returns an int value that the HashSet maps to one of its buckets, and equals(), which it uses to compare objects in the same bucket.  Here too, the basic classes like String and Integer have appropriate definitions for these methods.  Unlike compareTo(), though, hashCode() and equals() are defined by Object and thus inherited by any class that doesn’t overwrite them.  Which brings us – finally – to my reason for writing this:

Object.hashCode(), at least in the current implementations of the JVM (1.6), returns a value based on an object’s address, rather than anything having to do with the object’s contents.  That means that two objects that are equal according the their class semantics will probably be placed in different buckets and probably not even compared to one-another.  Even if they end up in the same bucket and are compared, the Object.equals() method probably won’t be able to figure out that they’re really equivalent.  So, not only might you end up with multiple copies of the same data, you might not be able to find whether a given object is contained in the HashSet, since the lookup algorithm will have the same problems as insertion.
To get around these potential problems (and they’re only potential – they may never happen in real life) one should add hashCode() and equals() overrides for those data-model classes that get put into HashSets.  

The real pain in this issue is that you won't be notified if you didn't override those methods, cause they are inherited (non efficiently) from Object.

No comments: