Java Rumblings: collection-view

Showing posts with label collection-view. Show all posts

Friday, 27 May 2011

Collection views for Maps for iteration

The Collection-view methods allow a Map to be viewed as a Collection in three ways:

keySet: the Set of keys contained in the Map.
values: The Collection of values contained in the Map. This Collection is not a Set, as multiple keys can map to the same value.
entrySet: The Set of key-value pairs contained in the Map. The Map interface provides a small nested interface called Map.Entry that is the type of the elements in this Set.

The Collection-views provide the only means to iterate over a Map.

Iterating over the keys in a Map

Here's an example illustrating the standard idiom for iterating over the keys in a Map:

for (Iterator i=m.keySet().iterator(); i.hasNext(); )
    System.out.println(i.next());

Iterating over the values in Map

The idiom for iterating over values is analogous.

for (Iterator i=m.values().iterator(); i.hasNext(); )
    System.out.println(i.next());

Iterating over the key value pairs

Here's the idiom for iterating over key-value pairs:

for (Iterator i=m.entrySet().iterator(); i.hasNext(); ) {
    Map.Entry e = (Map.Entry) i.next();
    System.out.println(e.getKey() + 
                          ": " + e.getValue());
}

When first presented with these idioms, many people worry that they may be slow because the Map has to create a new Collection object each time a Collection-view operation is called. Rest easy: This is not the case. There's no reason that a Map can't always return the same object each time it is asked for a given Collection-view. This is precisely what all of the JDK's Map implementations do.
With all three Collection-views, calling an Iterator's remove operation removes the associated entry from the backing Map (assuming the Map supports element removal to begin with). With the entrySet view, it is also possible to change the value associated with a key, by calling a Map.Entry's setValue method during iteration (again, assuming the Map supports value modification to begin with). Note that these are the only safe ways to modify a Map during iteration; the behavior is unspecified if the underlying Map is modified in any other way while the iteration is in progress.
The Collection-views support element removal in all its many forms: the remove, removeAll, retainAll, and clear operations, as well as the Iterator.remove operation. (Yet again, this assumes that the backing Map supports element removal.)
The Collection-views do not support element addition under any circumstances. It would make no sense for the keySet and values views, and it's unnecessary for the entrySet view, as the backing Map's put and putAll provide the same functionality.

Sunday, 13 March 2011

General Purpose Implementations

The general-purpose implementations are summarized in the table below. The table highlights their regular naming pattern: names are all of the form <Implementation> <Interface>, where <Interface> is the core collection interface implemented by the class, and <Implementation> signifies the data structure underlying the implementation.

		Hash Table	Resizable Array	Balanced Tree	Linked List
		Implementations
Interfaces	Set	HashSet		TreeSet
	List		ArrayList		LinkedList
	Map	HashMap		TreeMap

JDK 1.2 provides two implementations of each interface (with the exception of Collection

, which has no direct implementations, but serves as a least common denominator for the other collection interfaces). In each case, one implementation is clearly the primary implementation: the one to use, all other things being equal. The primary implementations are HashSet, ArrayList and HashMap. Note that the SortedSet

and SortedMap

interfaces do not have rows in the table above. Each of these interfaces has one implementation and these implementations (TreeSet and TreeMap) are listed in the Set and Map rows. Not only do the implementations have consistent names, but they have consistent behavior as well. All of them implement all the optional operations contained in their interfaces. All permit null elements, keys and values. Each one is unsynchronized. All have fail-fast iterators, which detect illegal concurrent modification during iteration and fail quickly and cleanly, rather than risking arbitrary, non-deterministic behavior at an undetermined time in the future. All are Serializable, and all support a public clone method.
The fact that the new implementations are unsynchronized represents a break with the past: Vector and Hashtable were synchronized in versions of the JDK prior to 1.2. The new approach was taken because it was recognized that collections are frequently used in a manner where the synchronization is of no benefit. Such uses include single-threaded use, read-only use, and use as part of a larger data object that does its own synchronization. In general, it is good API design practice not to make users pay for a feature that they generally don't use. Further, unnecessary synchronization can result in deadlock under certain circumstances.
If you need a synchronized collection, the synchronization wrappers, described in the next section, allow any collection to be transformed into a synchronized collection. Thus, synchronization is optional for the new collection implementations where it was mandatory for the old.
As a rule of thumb, you should be thinking about the interfaces rather than the implementations. That is why there are no programming examples in this lesson. For the most part, the choice of implementation affects only performance. The preferred style, as was mentioned in the interfaces lesson, is to choose an implementation when a collection is created, and immediately assign the new collection to a variable of the corresponding interface type (or pass the collection to a method expecting an argument of the interface type). In this way, the program does not become dependent on any added methods in a given implementation, leaving the programmer or maintainer with the freedom to change implementations at the drop of a hat, if performance concerns dictate that it is the right thing to do.
The general purposes implementations are briefly discussed below. The performance of the implementations is described using words like constant, log, linear, n log(n) and quadratic. These words refer to the asymptotic upper bound on the time complexity of performing the operation. All of this is quite a mouthful, and it doesn't matter much if you don't know what it means. If you're interested, any good algorithms textbook explains this stuff. One thing to keep in mind is that this sort of performance metric has its limitations. Sometimes the nominally slower implementation may be faster for the collection size that you're actually using. When in doubt, measure the performance.

Set

The two general purpose Setimplementations are HashSetand TreeSet. It's very straightforward to decide which of these two to use. HashSet is much faster (constant time vs. log time for most operations), but offers no ordering guarantees. If you need to use the operations in the SortedSet, or in-order iteration is important to you, use TreeSet. Otherwise, use HashSet. It's a fair bet that you'll end up using HashSet most of the time. One thing worth keeping in mind about HashSet is that iteration is linear in the sum of the number of entries and the number of buckets (the capacity). Thus, it's important to choose an appropriate initial capacity if iteration performance is important. Choosing a capacity that's too high can waste space as well as time. The default initial capacity is 101, and that's often more that you need. The initial capacity may be specified using the int constructor. To allocate a HashSet whose initial capacity is 17:
Set s= new HashSet(17);
HashSets have one other "tuning parameter" called the load factor. If you care deeply about the space consumption of your HashSet, read the HashSet documentation for more information. Otherwise just live with the default. If you accept the default load factor but you do want to specify an initial capacity, pick a number that's about twice the size that you expect the Set to grow to. If your guess is way off, it may have to grow or you may waste a bit of space, but either way it's no big problem. If you know a prime number of about the right size, use it. If not, use an odd number. Or use an even number. It doesn't really matter much; these things might make the HashSet perform a wee bit better, but nothing to write home about.
TreeSet has no tuning parameters. With the exception of clone, neither HashSet nor TreeSet have any operations other than those required by their respective interfaces (Set and TreeSet).

List

The two general purpose Listimplementations are ArrayListand LinkedList. Most of the time, you'll probably use ArrayList. It offers constant time positional access, and it's just plain fast, because it does not have to allocate a node object for each element in the List, and it can take advantage of the native method System.arraycopy when it has to move multiple elements at once. Think of ArrayList as Vector without the synchronization overhead. If you frequently add elements to the beginning of the List, or iterate over the List deleting elements from its interior, you might want to consider LinkedList. These operations are constant time in a LinkedList but linear time in an ArrayList. But you pay a big price! Positional access is linear time in a LinkedList and constant time in an ArrayList. Furthermore, the constant factor for LinkedList is much worse. If you think that you want to use a LinkedList, measure the performance with both LinkedList and ArrayList. You may be surprised.
ArrayList has one tuning parameter, the initial capacity. It refers to the number of elements the ArrayList can hold before it has to grow. There's not much to say about it. The only ArrayList operations that are not required by List are ensureCapacity and trimToSize (which alter the excess capacity), and clone.
LinkedList has no tuning parameters, and seven optional operations, one of which is clone. The other six are addFirst, getFirst, removeFirst, addLast, getLast, and removeLast; I have very mixed feelings about them. They make it a bit more convenient to use a LinkedList as a queue or a double-ended queue (dequeue), but they prevent you from easily switching representations when you discover that ArrayList is faster.
If you need synchronization, a Vectorwill be slightly faster than an ArrayList synchronized with Collections.synchronizedList, but Vector has loads of legacy operations, so be extra careful to always manipulate the Vector with the List interface, or you'll be stuck with it for life.
If your List is fixed in size (that is, you'll never use remove, add or any of the bulk operations other than containsAll) you have a third option that's definitely worth considering. See Arrays.asList in the convenience implementations section.

Map

The two general purpose Mapimplementations are HashMapand TreeMap. The situation for Map is exactly analogous to Set. If you need SortedMap operations or in-order Collection-view iteration, go for TreeMap; otherwise, go for HashMap. Everything else in the Set section (above) also applies to Map so just re-read it. Completeness requires that we mention Hashtable. As with Vector and ArrayList, if you need synchronization, a Hashtable will be slightly faster than a HashMap synchronized with Collections.synchronizedMap. Again, Hashtable has loads of legacy operations, so be extra careful always to manipulate it with the Map interface, or you'll be stuck with it for life.

Saturday, 12 March 2011

Fancy Uses of Collection-Views: Map Algebra

When applied to the Collection-views, the bulk operations (containsAll, removeAll and retainAll) are a surprisingly potent tool. For starters, suppose you want to know whether one Map is a submap of another, that is, whether the first Map contains all of the key-value mappings in the second. The following idiom does the trick:

if (m1.entrySet().containsAll(m2.entrySet())) {
    ...
}

Along similar lines, suppose that you want to know if two Map objects contain mappings for all the same keys:

if (m1.keySet().equals(m2.keySet())) {
    ...
}

Suppose you have a map that represents a collection of attribute-value pairs, and two sets, representing required attributes and permissible attributes. (The permissible attributes include the required attributes.) The following snippet determines whether the attribute map conforms to these constraints, and prints a detailed error message if it doesn't:

boolean valid = true;
Set attributes = attributeMap.keySet();
if(!attributes.containsAll(requiredAttributes)) {
    Set missing = new HashSet(requiredAttributes);
    missing.removeAll(attributes);
    System.out.println("Missing required attributes: "+missing);
    valid = false;
}

if (!permissibleAttributes.containsAll(attributes)) {
    Set illegal = new HashSet(attributes);
    illegal.removeAll(permissibleAttributes);
    System.out.println("Contains illegal attributes: "+illegal);
    valid = false;
}

if (valid)
    System.out.println("OK");

Suppose you want to know all the keys common to two Map objects:

Set commonKeys = new HashSet(a.keySet());
commonKeys.retainAll(b.keySet());

A similar idiom gets you the common values, and the common key-value pairs. Extra care is needed if you get the common key-value pairs, as the elements of the resulting Set, which are Map.Entry objects, may become invalid if the Map is modified. All the idioms presented thus far have been "non-destructive": They don't modify the backing Map. Here are a few that do. Suppose you want to remove all the key-value pairs that one Map has in common with another:

m1.entrySet().removeAll(m2.entrySet());

Suppose you want to remove from one Map all the keys that have mappings in another:

m1.keySet().removeAll(m2.keySet());

And what happens when you start mixing keys and values in the same bulk operation? Suppose you have a Map called managers that maps each employee in a company to the employee's manager. We'll be deliberately vague about the types of the key and value objects. It doesn't matter, so long as they're the same. Now suppose you want to know who all the "individual contributors" are. (This is corporate-speak for employees who are not managers.) The following one-liner tells you exactly what you want to know:

Set individualContributors = new HashSet(managers.keySet());
individualContributors.removeAll(managers.values());

Suppose you want to fire all of the employees who report directly to some manager (we'll call him herbert):

Employee herbert = ... ;
managers.values().removeAll(Collections.singleton(herbert));

Note that this idiom makes use of Collections.singleton, a static factory method that returns an immutable Set with the single, specified element. Once you've done this, you may have a bunch of employees whose managers no longer work for the company (if any of herbert's direct-reports were themselves managers). The following code tells you all of the employees whose manager no longer works for the company:

Map m = new HashMap(managers);
m.values().removeAll(managers.keySet());
Set slackers = m.keySet();

This example is a bit tricky. First it makes a temporary copy of the Map. Then it removes from the temporary copy all entries whose (manager) value is a key in the original Map. Recall that the original Map contains an entry for every employee. Thus, the remaining entries in the temporary Map comprise all the entries from the original Map whose (manager) values are no longer employees. The keys in the temporary copy, then, represent precisely the employees that we're looking for. If you stare at the example for a little while, this should all become clear. If it doesn't, now would be a good time to get a steaming hot mug of freshly brewed Java. There are many, many more idioms like the ones contained in this section but it would be impractical and tedious to list them all. Once you get the hang of it, it's not that hard to come up with the right one when you need it.

Pages