10 Collection Types the Pitfall Ridden List Operation

10 Collection Types The Pitfall-Ridden List Operation #

Today, I’m going to talk about the pitfalls in List operations.

Niklaus Wirth, the father of Pascal, once proposed a famous formula “program = data structure + algorithm.” This shows the importance of data structures. Common data structures include List, Set, Map, Queue, Tree, Graph, and Stack, among which List, Set, Map, and Queue can be broadly categorized as collection data structures.

Modern programming languages generally provide implementations of various data structures for us to use out of the box. Java is no exception, for example, it provides various implementations of collection classes. Java’s collection classes include Map and Collection. Collection includes three subclasses: List, Set, and Queue, among which List is the most important and is used in all business code. Therefore, today I will focus on List and will not concentrate on the pitfalls of Map and other subclasses in Collection.

Today, we will discuss several aspects such as converting an array to a List collection, slicing operations on List, and performance issues in List searches, to talk about some of the most likely pitfalls you may encounter.

Three Pitfalls of Using Arrays.asList to Convert Data to List #

The various features of Stream in Java 8 greatly reduce the amount of code for various operations on collections (projection, filtering, conversion). Therefore, in business development, we often convert original arrays into List data structures to continue various Stream operations.

You may also think that using the Arrays.asList method can convert an array into a List with just one line of code, but it’s not that simple. Next, let’s take a look at the reasons behind it, as well as several pitfalls of using Arrays.asList to convert arrays into List.

In the following code, we initialize an int[] array with three numbers, and then use Arrays.asList to convert the array into a List:

int[] arr = {1, 2, 3};
List list = Arrays.asList(arr);
log.info("list:{} size:{} class:{}", list, list.size(), list.get(0).getClass());

However, the List initialized in this way is not the List we expected that contains three numbers. By looking at the log, we can see that the List actually contains an int array, and the entire List has only one element, which is an array of integers.

12:50:39.445 [main] INFO org.geekbang.time.commonmistakes.collection.aslist.AsListApplication - list:[[I@1c53fd30] size:1 class:class [I

The reason is that only int can be boxed to Integer, but it is impossible to box an int array into an Integer array. We know that the Arrays.asList method takes a generic parameter T variable arguments. In the end, the int array is treated as an object and becomes the generic type T:

public static <T> List<T> asList(T... a) {
    return new ArrayList<>(a);
}

Directly traversing such a List will inevitably result in bugs. There are two ways to fix this: if you are using Java 8 or above, you can use the Arrays.stream method to convert it. Otherwise, you can declare the int array as a wrapper type Integer array:

int[] arr1 = {1, 2, 3};
List list1 = Arrays.stream(arr1).boxed().collect(Collectors.toList());
log.info("list:{} size:{} class:{}", list1, list1.size(), list1.get(0).getClass());

Integer[] arr2 = {1, 2, 3};
List list2 = Arrays.asList(arr2);
log.info("list:{} size:{} class:{}", list2, list2.size(), list2.get(0).getClass());

After fixing the code, the log shows that the List has three elements, and the element type is Integer:

13:10:57.373 [main] INFO org.geekbang.time.commonmistakes.collection.aslist.AsListApplication - list:[1, 2, 3] size:3 class:class java.lang.Integer

As we can see, the first pitfall is that we cannot directly use Arrays.asList to convert primitive type arrays. So now that we have the correct List, can we use it like a normal List? Let’s continue reading.

After using Arrays.asList to convert a string array composed of three strings “1”, “2”, and “3” into a List, we modify the second character of the original string array to “4”, then add a string “5” to the List. What will the array and the List be?

String[] arr = {"1", "2", "3"};
List list = Arrays.asList(arr);
arr[1] = "4";
try {
    list.add("5");
} catch (Exception ex) {
    ex.printStackTrace();
}

log.info("arr:{} list:{}", Arrays.toString(arr), list);

We can see that there is an UnsupportedOperationException in the log. The operation of adding the string “5” to the List failed, and after the second element of the original array was changed from “2” to “4”, the second element of the List obtained by asList was also modified to “4”:

java.lang.UnsupportedOperationException
  at java.util.AbstractList.add(AbstractList.java:148)
  at java.util.AbstractList.add(AbstractList.java:108)
  at org.geekbang.time.commonmistakes.collection.aslist.AsListApplication.wrong2(AsListApplication.java:41)
  at org.geekbang.time.commonmistakes.collection.aslist.AsListApplication.main(AsListApplication.java:15)
13:15:34.699 [main] INFO org.geekbang.time.commonmistakes.collection.aslist.AsListApplication - arr:[1, 4, 3] list:[1, 4, 3]

Here, two more pitfalls are brought up.

The second pitfall is that the List returned by Arrays.asList does not support add and remove operations. The List returned by Arrays.asList is not the java.util.ArrayList we expected, but an inner class ArrayList of Arrays. The ArrayList inner class inherits from the AbstractList class and does not override the add method of the parent class. The implementation of the add method in the parent class is to throw an UnsupportedOperationException. The relevant source code is shown below:

public static <T> List<T> asList(T... a) {
    return new ArrayList<>(a);
}

private static class ArrayList<E> extends AbstractList<E> implements RandomAccess, java.io.Serializable {

    private final E[] a;

    ArrayList(E[] array) {
        a = Objects.requireNonNull(array);
    }
...
    @Override
    public E set(int index, E element) {

        E oldValue = a[index];
        a[index] = element;
        return oldValue;
    }
    ...
}

public abstract class AbstractList<E> extends AbstractCollection<E> implements List<E> {

...

public void add(int index, E element) {
        throw new UnsupportedOperationException();
    }
}

The third pitfall is that modifying the original array will affect the List we obtained. Looking at the implementation of ArrayList, we can see that it directly uses the original array. Therefore, we need to be particularly careful when passing the List obtained through Arrays.asList to other methods, as bugs can easily occur due to sharing the array and modifying it.

The fix is quite simple, just create a new ArrayList to initialize the List returned by Arrays.asList:

String[] arr = {"1", "2", "3"};
List list = new ArrayList(Arrays.asList(arr));
arr[1] = "4";

try {
    list.add("5");
} catch (Exception ex) {
    ex.printStackTrace();
}

log.info("arr:{} list:{}", Arrays.toString(arr), list);

After modifying the code to “decouple” the original array and the List, they will no longer affect each other. Also, because we are operating on the actual ArrayList, add will no longer throw an error:

13:34:50.829 [main] INFO org.geekbang.time.commonmistakes.collection.aslist.AsListApplication - arr:[1, 4, 3] list:[1, 2, 3, 5]

## Using List.subList for slicing can lead to OOM?

In business development, it is often necessary to slice a List, that is, to extract a portion of the elements to form a new List. We usually think of using the List.subList method. However, similar to the issue with Arrays.asList, the sub List returned by List.subList is not a normal ArrayList. This sub List can be considered as a view of the original List and will affect the original List. If not careful, this can lead to OOM issues. Let's analyze the pitfalls together.

As shown in the code below, a static List named "data" is defined to store a List of Integers, which means the members of "data" themselves contain multiple numbers in a List. In a loop of 1000 iterations, each time a sub List containing only one number is obtained from a List of 100,000 Integers using the subList method, and this sub List is added to the "data" variable:

private static List<List> data = new ArrayList<>();

private static void oom() { for (int i = 0; i < 1000; i++) { List rawList = IntStream.rangeClosed(1, 100000).boxed().collect(Collectors.toList()); data.add(rawList.subList(0, 1)); } }


You may think that the "data" variable ultimately only saves 1000 Lists with one element, so it shouldn't occupy much space, but an OOM occurs shortly after the program starts running:

Exception in thread “main” java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:3181) at java.util.ArrayList.grow(ArrayList.java:265)


The reason for the OOM is that the 1000 Lists with 100,000 elements in the loop are never garbage collected because they are always strongly referenced by the List returned by the subList method. So why does the returned sub List strongly reference the original List, and what is their relationship? Let's continue experimenting and observing the characteristics of this sub List.

First, initialize an ArrayList containing numbers 1 to 10, then use the subList method to retrieve 2, 3, 4; then remove the number 3 from this sub List and print the original ArrayList; finally, add the number 0 to the original ArrayList and iterate over the sub List to output all elements:

List list = IntStream.rangeClosed(1, 10).boxed().collect(Collectors.toList()); List subList = list.subList(1, 4); System.out.println(subList); subList.remove(1); System.out.println(list); list.add(0); try { subList.forEach(System.out::println); } catch (Exception ex) { ex.printStackTrace(); }


After running the code, the following output is obtained:

[2, 3, 4] [1, 2, 4, 5, 6, 7, 8, 9, 10] java.util.ConcurrentModificationException at java.util.ArrayList$SubList.checkForComodification(ArrayList.java:1239) at java.util.ArrayList$SubList.listIterator(ArrayList.java:1099) at java.util.AbstractList.listIterator(AbstractList.java:299) at java.util.ArrayList$SubList.iterator(ArrayList.java:1095) at java.lang.Iterable.forEach(Iterable.java:74)


We can see two phenomena:

The number 3 is deleted from the original List, indicating that deleting an element from the sub List affects the original List;

Trying to iterate over the sub List after adding the number 0 to the original List results in a ConcurrentModificationException.

Let's analyze the source code of ArrayList to understand why this is happening:

public class ArrayList extends AbstractList implements List, RandomAccess, Cloneable, java.io.Serializable {

protected transient int modCount = 0;

private void ensureExplicitCapacity(int minCapacity) { modCount++; // overflow-conscious code if (minCapacity - elementData.length > 0) grow(minCapacity); }

public void add(int index, E element) { rangeCheckForAdd(index); ensureCapacityInternal(size + 1); // Increments modCount!!

        ```
        System.arraycopy(elementData, index, elementData, index + 1,
                         size - index);
        elementData[index] = element;
        size++;
      }
    
      public List<E> subList(int fromIndex, int toIndex) {
        subListRangeCheck(fromIndex, toIndex, size);
        return new SubList(this, offset, fromIndex, toIndex);
      }
    
      private class SubList extends AbstractList<E> implements RandomAccess {
    
        private final AbstractList<E> parent;
        private final int parentOffset;
        private final int offset;
        int size;
    
        SubList(AbstractList<E> parent,
              int offset, int fromIndex, int toIndex) {
            this.parent = parent;
            this.parentOffset = fromIndex;
            this.offset = offset + fromIndex;
            this.size = toIndex - fromIndex;
            this.modCount = ArrayList.this.modCount;
        }
    
            public E set(int index, E element) {
                rangeCheck(index);
                checkForComodification();
                return l.set(index+offset, element);
            }
    
        public ListIterator<E> listIterator(final int index) {
                    checkForComodification();
                    ...
        }
    
        private void checkForComodification() {
            if (ArrayList.this.modCount != this.modCount)
                throw new ConcurrentModificationException();
        }
        ...
      }
    }
    

First, ArrayList maintains a field called modCount, which represents the number of structural modifications to the collection. A structural modification refers to a modification that changes the size of the List. Therefore, an add operation will definitely change the value of modCount.

Second, by analyzing the subList method from lines 21 to 24, we can see that the obtained List is actually an inner class called SubList, rather than a normal ArrayList. During initialization, it is passed in the "this" reference.

Third, by analyzing the code from lines 26 to 39, we can see that the parent field in the SubList class is the original List. When SubList is initialized, the elements of the original List are not copied and saved in a separate variable. We can think of SubList as a view of the original List, not an independent List. Modifications to either side will affect the other, and since SubList holds a strong reference to the original List, having a large number of such SubLists can lead to OutOfMemoryError.

Fourth, by analyzing the code from lines 47 to 55, we can see that when iterating over the SubList, an iterator is acquired and the modCount of the original ArrayList is compared to the current modCount of the SubList. After obtaining the SubList, we add an element to the original List and modify its modCount, so the equality check fails and a ConcurrentModificationException is thrown.

To avoid mutual interference between the SubList and the original List, there are two ways to fix it:

One is to not directly use the SubList returned by the subList method, but to create a new ArrayList and pass the SubList to its constructor to create an independent ArrayList.

The other is to use the skip and limit APIs of the Stream in Java 8 to skip elements in the stream and limit the number of elements, which can also achieve the slicing of a SubList.

    //Method 1:
    List<Integer> subList = new ArrayList<>(list.subList(1, 4));
    
    //Method 2:
    List<Integer> subList = list.stream().skip(1).limit(3).collect(Collectors.toList());

After making these fixes, the output will be as follows:

    [2, 3, 4]
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    2
    4

As you can see, deleting elements from the SubList no longer affects the original List, and modifications to the original List no longer result in List iteration exceptions.
## Make Suitable Data Structures Do Suitable Things

When introducing concurrent tools, I mentioned the importance of selecting the appropriate concurrent tools or containers based on the business scenario. When using the List collection class, there are also two common pitfalls if the usage scenario is not taken into account.

The first pitfall is not considering the trade-off between time and space when using data structures.

Firstly, let's define an Order class with only one field, orderId, of type int:

```java
@Data
@NoArgsConstructor
@AllArgsConstructor
static class Order {
    private int orderId;
}

Then, let’s define a listSearch method that takes two parameters, elementCount and loopCount, which initializes an ArrayList with elementCount Order objects and searches this ArrayList in a loop for loopCount times, randomly selecting an order number each time:

private static Object listSearch(int elementCount, int loopCount) {

    List<Order> list = IntStream.rangeClosed(1, elementCount).mapToObj(i -> new Order(i)).collect(Collectors.toList());

    IntStream.rangeClosed(1, loopCount).forEach(i -> {
        int search = ThreadLocalRandom.current().nextInt(elementCount);
        Order result = list.stream().filter(order -> order.getOrderId() == search).findFirst().orElse(null);
        Assert.assertTrue(result != null && result.getOrderId() == search);
    });
    return list;
}

Next, let’s define another method, mapSearch, which searches a random order number from a Map with elementCount elements for loopCount times. In the Map, the Key is the order number and the Value is the Order object:

private static Object mapSearch(int elementCount, int loopCount) {

    Map<Integer, Order> map = IntStream.rangeClosed(1, elementCount).boxed().collect(Collectors.toMap(Function.identity(), i -> new Order(i)));

    IntStream.rangeClosed(1, loopCount).forEach(i -> {
        int search = ThreadLocalRandom.current().nextInt(elementCount);
        Order result = map.get(search);
        Assert.assertTrue(result != null && result.getOrderId() == search);
    });
    return map;
}

We know that the time complexity of searching in an ArrayList is O(n), while the time complexity of the get operation in a HashMap is O(1). Therefore, if we need to search for a single value in a large List, using a HashMap, where the Key is the value to search and the Value is the original object, will have a significant performance advantage over using an ArrayList.

As shown in the following code, we perform 1000 searches on an ArrayList and a HashMap, each containing 1 million elements:

int elementCount = 1000000;

int loopCount = 1000;

StopWatch stopWatch = new StopWatch();
stopWatch.start("listSearch");

Object list = listSearch(elementCount, loopCount);
System.out.println(ObjectSizeCalculator.getObjectSize(list));
stopWatch.stop();
stopWatch.start("mapSearch");

Object map = mapSearch(elementCount, loopCount);
stopWatch.stop();
System.out.println(ObjectSizeCalculator.getObjectSize(map));
System.out.println(stopWatch.prettyPrint());

We can see that with just 1000 searches, the listSearch method takes 3.3 seconds, while the mapSearch method takes only 108 milliseconds.

“shell 20861992 72388672

StopWatch: running time = 3506699764 ns #

ns % 任务名称 #

3398413176 097% 列表搜索 108286588 003% 映射搜索

即使我们要搜索的不是单值而是条件区间,也可以尝试使用 HashMap 来进行“搜索性能优化”。如果你的条件区间是固定的话,可以提前把 HashMap 按照条件区间进行分组,Key 就是不同的区间。

的确,如果业务代码中有频繁的大 ArrayList 搜索,使用 HashMap 性能会好很多。类似地,如果要对大 ArrayList 进行去重操作,也不建议使用 contains 方法,而是可以考虑使用 HashSet 进行去重。说到这里,还有一个问题,使用 HashMap 是否会牺牲空间呢?

为此,我们使用 ObjectSizeCalculator 工具打印 ArrayList 和 HashMap 的内存占用,可以看到 ArrayList 占用内存 21M,而 HashMap 占用的内存达到了 72M,是 List 的三倍多。进一步使用 MAT 工具分析堆可以再次证明,ArrayList 在内存占用上性价比很高,77% 是实际的数据(如第 1 个图所示,16000000/20861992),而 HashMap 的“含金量”只有 22%(如第 2 个图所示,16000000/72386640)。

img

img

所以,在应用内存吃紧的情况下,我们需要考虑是否值得使用更多的内存消耗来换取更高的性能。这里我们看到的是平衡的艺术,空间换时间,还是时间换空间,只考虑任何一个方面都是不对的。

第二个误区是,过于迷信教科书的大 O 时间复杂度。

数据结构中要实现一个列表,有基于连续存储的数组和基于指针串联的链表两种方式。在 Java 中,有代表性的实现是 ArrayList 和 LinkedList,前者背后的数据结构是数组,后者则是(双向)链表。

在选择数据结构的时候,我们通常会考虑每种数据结构不同操作的时间复杂度,以及使用场景两个因素。查看这里,你可以看到数组和链表大 O 时间复杂度的显著差异:

对于数组,随机元素访问的时间复杂度是 O(1),元素插入操作是 O(n);

对于链表,随机元素访问的时间复杂度是 O(n),元素插入操作是 O(1)。

那么,在大量的元素插入、很少的随机访问的业务场景下,是不是就应该使用 LinkedList 呢?接下来,我们写一段代码测试下两者随机访问和插入的性能吧。

定义四个参数一致的方法,分别对元素个数为 elementCount 的 LinkedList 和 ArrayList,循环 loopCount 次,进行随机访问和增加元素到随机位置的操作:

//LinkedList访问
private static void linkedListGet(int elementCount, int loopCount) {
    List<Integer> list = IntStream.rangeClosed(1, elementCount).boxed().collect(Collectors.toCollection(LinkedList::new));
    IntStream.rangeClosed(1, loopCount).forEach(i -> list.get(ThreadLocalRandom.current().nextInt(elementCount)));
}

//ArrayList访问
private static void arrayListGet(int elementCount, int loopCount) {
    List<Integer> list = IntStream.rangeClosed(1, elementCount).boxed().collect(Collectors.toCollection(ArrayList::new));
    IntStream.rangeClosed(1, loopCount).forEach(i -> list.get(ThreadLocalRandom.current().nextInt(elementCount)));
}

//LinkedList插入
private static void linkedListAdd(int elementCount, int loopCount) {
    List<Integer> list = IntStream.rangeClosed(1, elementCount).boxed().collect(Collectors.toCollection(LinkedList::new));
    IntStream.rangeClosed(1, loopCount).forEach(i -> list.add(ThreadLocalRandom.current().nextInt(elementCount),1));
}

//ArrayList插入
private static void arrayListAdd(int elementCount, int loopCount) {
    List<Integer> list = IntStream.rangeClosed(1, elementCount).boxed().collect(Collectors.toCollection(ArrayList::new));
    IntStream.rangeClosed(1, loopCount).forEach(i -> list.add(ThreadLocalRandom.current().nextInt(elementCount),1));
}

测试代码如下,10万个元素,循环10万次:

int elementCount = 100000;

int loopCount = 100000;

StopWatch stopWatch = new StopWatch();
stopWatch.start("linkedListGet");
linkedListGet(elementCount, loopCount);
stopWatch.stop();
stopWatch.start("arrayListGet");
arrayListGet(elementCount, loopCount);
stopWatch.stop();
System.out.println(stopWatch.prettyPrint());

StopWatch stopWatch2 = new StopWatch();
stopWatch2.start("linkedListAdd");
linkedListAdd(elementCount, loopCount);
stopWatch2.stop();

stopWatch2.start("arrayListAdd");
arrayListAdd(elementCount, loopCount);
stopWatch2.stop();

System.out.println(stopWatch2.prettyPrint());

The running results may surprise you. In terms of random access, we can see the absolute advantage of ArrayList, which only takes 11 milliseconds, while LinkedList takes 6.6 seconds. This aligns with the time complexity we mentioned above. However, in terms of random insertion, LinkedList is unexpectedly defeated, taking 9.3 seconds, while ArrayList only takes 1.5 seconds:

---------------------------------------------
ns         %     Task name
---------------------------------------------
6604199591  100%  linkedListGet
011494583  000%  arrayListGet

StopWatch '': running time = 10729378832 ns
---------------------------------------------
ns         %     Task name
---------------------------------------------
9253355484  086%  linkedListAdd
1476023348  014%  arrayListAdd

After examining the source code of LinkedList, I found that the time complexity of the insertion operation is O(1) under the premise that you already have a pointer to the node to be inserted. However, in the implementation, we need to first traverse the linked list to locate the node, and then perform the insertion operation. The former also incurs overhead, and it is not possible to consider only the cost of the insertion operation itself:

public void add(int index, E element) {

    checkPositionIndex(index);
    if (index == size)
        linkLast(element);
    else
        linkBefore(element, node(index));
}

Node<E> node(int index) {
    // assert isElementIndex(index);
    if (index < (size >> 1)) {
        Node<E> x = first;
        for (int i = 0; i < index; i++)
            x = x.next;
        return x;
    } else {
        Node<E> x = last;
        for (int i = size - 1; i > index; i--)
            x = x.prev;
        return x;
    }
}

Therefore, for insertion operations, the time complexity of LinkedList is actually O(n). If you continue to conduct more experiments, you will find that LinkedList almost never outperforms ArrayList in various common scenarios.

Ironically, Joshua Bloch, the author of LinkedList, replied to someone on his Twitter saying, “Although I wrote LinkedList, I don’t use it, who would really use it?”

img

This tells us that there is often a gap between theory and practice. Please do not blindly believe in the theory from textbooks. It is best to test it in practice before drawing conclusions. Regardless of the algorithmic aspect, due to issues such as CPU cache and memory contiguity, the implementation of a linked list data structure is not performance-friendly, and it may not even be powerful in its best scenarios.

Key Takeaways #

Today, I shared several error cases related to List lists, all of which were caused by “assuming” without sufficient examination.

First, assuming that Arrays.asList() and List.subList() return ordinary, independent ArrayLists can lead to various strange problems when used.

What Arrays.asList() returns is an inner class ArrayList of Arrays, while what List.subList() returns is an inner class SubList of ArrayList. These two inner classes cannot be converted to ArrayList for use.

Arrays.asList() directly uses the original array, so it can be considered as sharing “storage”, and it does not support adding or removing elements. List.subList() directly references the original List, so it can also be considered as sharing “storage”, and performing structural modifications directly on the original List will cause exceptions in the SubList.

What is easy to overlook about Arrays.asList() and List.subList() is that the new List holds a reference to the original data, which may cause the original data to be unable to be garbage collected, eventually leading to an OOM.

Second, assuming that Arrays.asList() can always correctly convert all arrays into Lists. When passing in an array of primitive types, the elements of the List will be the array itself, not the elements in the array.

Third, assuming that searching any collection in memory is always fast, we encountered performance issues when searching large ArrayLists. We considered using the O(1) time complexity of HashMap for random lookups to optimize performance, but we also need to consider the cost of storage space in HashMap and balance time and space.

Fourth, assuming that linked lists are suitable for scenarios where elements need to be added or removed, and selecting LinkedList as the data structure. In real-world scenarios, read and write operations are generally balanced, and it is not possible to only perform operations on the head or tail objects, so there may not be a performance gain in 90% of the cases. It is recommended to perform performance testing before using to evaluate its performance.

The code used today is uploaded to GitHub, and you can click on this link to view it.

Reflection and Discussion #

Finally, I’ll leave you with two questions related to the pitfalls of removing elements with ArrayList.

If you call the remove method of an ArrayList with a type of Integer, does it produce the same result whether you pass in an Integer wrapper class number or an int primitive type number?

When iterating over a List and calling the remove method to delete elements, you often encounter the ConcurrentModificationException exception. What is the reason for this, and how can it be fixed?

Have you encountered any other pitfalls related to collection classes? I’m Zhu Ye, and I welcome you to leave your thoughts in the comments section. You are also welcome to share this article with your friends or colleagues and engage in discussion together.