19 Deep Understanding of Iterators and Generators

19 Deep Understanding of Iterators and Generators #

Hello, I’m Jingxiao.

When you first encountered Python, you might have written statements like for i in [2, 3, 5, 7, 11, 13]: print(i). The for loop in Python is easy to understand and much more concise and clear compared to statements like for (int i = 0; i < n; i ++) printf("%d\n", a[i]) in early versions of C++ and Java.

However, have you ever wondered what exactly happens when Python processes the for loop? What kind of objects can be enumerated using for loops?

In this lesson, we will delve into the implementation details of container types in Python and learn about something called iterators and generators.

Containers, Iterables, and Iterators You’ve Definitely Used #

The concept of containers is very easy to understand. As we’ve mentioned before, in Python, everything is an object, and the abstraction of an object is a class, while a collection of objects is a container.

Lists (list: [0, 1, 2]), tuples (tuple: (0, 1, 2)), dictionaries (dict: {0:0, 1:1, 2:2}), and sets (set: set([0, 1, 2])) are all containers. With containers, you can imagine them as units that hold multiple elements together. The difference between different containers lies in their internal data structure implementation. Then, you can choose different containers with different time and space complexities based on different scenarios.

All containers are iterable. Here, iteration is slightly different from enumeration. You can imagine iteration as going to buy apples, where the seller does not tell you how many apples they have in stock. So, every time you need to tell the seller that you want one apple, and then the seller takes action: either give you one apple or tell you that the apples are sold out. You don’t need to know how the seller arranges the apples in the warehouse, just like with a list where you don’t have to specify the index of an element. For containers such as dictionaries and sets, there is no concept of index. For example, if a dictionary is implemented using a hash table, all you need to know is that the next() function can retrieve all elements one by one without repetition or omission.

Strictly speaking, an iterator provides a next() method. After calling this method, you either get the next object in the container or receive a StopIteration error (indicating that the apples are sold out). You don’t need to specify the index of elements like in a list because containers like dictionaries and sets do not have indices. Instead, you only need to know that the next() function can retrieve all elements one by one.

An iterable object, returned by the iter() function, is used with the next() function to achieve iteration. The for in statement implicitly handles this process, so you only need to know what it roughly does.

Let’s take a look at the following code, which mainly shows you how to determine if an object is iterable. Of course, there is another approach: using isinstance(obj, Iterable).

def is_iterable(param):
    try:
        iter(param)
        return True
    except TypeError:
        return False

params = [
    1234,
    '1234',
    [1, 2, 3, 4],
    set([1, 2, 3, 4]),
    {1:1, 2:2, 3:3, 4:4},
    (1, 2, 3, 4)
]

for param in params:
    print('{} is iterable? {}'.format(param, is_iterable(param)))

The output is as follows:

1234 is iterable? False
'1234' is iterable? True
[1, 2, 3, 4] is iterable? True
{1, 2, 3, 4} is iterable? True
{1: 1, 2: 2, 3: 3, 4: 4} is iterable? True
(1, 2, 3, 4) is iterable? True

Based on this code, you can see that among the given types, all types except the number 1234 are iterable.

Generators, what are they? #

From what I know, many people may be unfamiliar with the concept of generators because in many commonly used languages there isn’t a corresponding model for generators.

Here, you only need to remember one thing: generators are lazy versions of iterators.

We know that in iterators, if we want to enumerate its elements, these elements need to be generated in advance. Let’s first look at the following simple example:

import os
import psutil

# Display the memory size occupied by the current Python program
def show_memory_info(hint):
    pid = os.getpid()
    p = psutil.Process(pid)

    info = p.memory_full_info()
    memory = info.uss / 1024. / 1024
    print('{} memory used: {} MB'.format(hint, memory))


def test_iterator():
    show_memory_info('initiating iterator')
    list_1 = [i for i in range(100000000)]
    show_memory_info('after iterator initiated')
    print(sum(list_1))
    show_memory_info('after sum called')

def test_generator():
    show_memory_info('initiating generator')
    list_2 = (i for i in range(100000000))
    show_memory_info('after generator initiated')
    print(sum(list_2))
    show_memory_info('after sum called')

%time test_iterator()
%time test_generator()

The above code demonstrates that when you declare an iterator, it is simple to generate a list containing one hundred million elements using [i for i in range(100000000)]. Each element is saved in memory after being generated. As you can see from the code, they occupy a huge amount of memory. If the memory is not enough, an OOM (Out of Memory) error will occur.

However, we do not need to simultaneously save so many things in memory. For example, when summing the elements, we only need to know what each element is at the moment of addition, and they can be discarded afterwards.

Therefore, the concept of generators emerges. The next variable is only generated when you call the next() function. The syntax for generators in Python is using parentheses, (i for i in range(100000000)), which initializes a generator.

In this way, you can clearly see that generators do not occupy a large amount of memory like iterators. They are only called when they are being used. Moreover, generators do not need to run the generation operation once during initialization. Compared to the test_iterator() function, the test_generator() function saves the process of generating one billion elements once, so it takes significantly less time.

At this point, you might say, “Generators are just like that. I have plenty of money, so I can just occupy more memory and computing resources. I’ll just spend more.”

Even if you are wealthy, please sit down and have a cup of tea before continuing to listen to me. This time, let’s implement a custom generator.

Generators, what else can they do? #

In mathematics, there is an identity: (1 + 2 + 3 + ... + n)^2 = 1^3 + 2^3 + 3^3 + ... + n^3. I’m sure you learned it in high school. Now, let’s verify the correctness of this formula. First, take a look at the code below. You can read it first, even if you don’t understand it, I will explain it in detail later.

def generator(k):
    i = 1
    while True:
        yield i ** k
        i += 1

gen_1 = generator(1)
gen_3 = generator(3)
print(gen_1)
print(gen_3)

def get_sum(n):
    sum_1, sum_3 = 0, 0
    for i in range(n):
        next_1 = next(gen_1)
        next_3 = next(gen_3)
        print('next_1 = {}, next_3 = {}'.format(next_1, next_3))
        sum_1 += next_1
        sum_3 += next_3
    print(sum_1 * sum_1, sum_3)

get_sum(8)

In this code, pay attention to the generator() function, which returns a generator.

The yield statement is the key. For beginners, you can think of it as follows: when the function reaches this line, the program pauses here and jumps out, but where does it jump to? The answer is the next() function. And what does i ** k do? It actually becomes the return value of next().

This way, every time the next(gen) function is called, the paused program is revived and continues to execute from the yield statement. Note that the local variable i is not cleared, but continues to accumulate. We can see that next_1 changes from 1 to 8, and next_3 changes from 1 to 512.

Intelligent people like you should have noticed that this generator can keep running indefinitely! Yes, in fact, an iterator is a finite set, while a generator can be an infinite set. I only need to call next(), and the generator will generate new elements based on the computation, and then return them to you, very convenient.

At this point, you must be eager to see more powerful examples, right?

Don’t worry, let’s look at another problem: given a list and a specified number, find the positions of the number in the list.

The following code should look familiar to you. It’s the conventional approach, iterating over each element and its index, checking and adding to the result list, and finally returning the result.

def index_normal(L, target):
    result = []
    for i, num in enumerate(L):
        if num == target:
            result.append(i)
    return result

print(index_normal([1, 6, 2, 4, 5, 2, 8, 6, 3, 2], 2))

So how can we do it using iterators? Without further ado, let’s look at the code.

def index_generator(L, target):
    for i, num in enumerate(L):
        if num == target:
            yield i

print(list(index_generator([1, 6, 2, 4, 5, 2, 8, 6, 3, 2], 2)))

Smart people like you should see the obvious difference, so I won’t explain it further. The only thing worth mentioning is that index_generator returns a Generator object, and we need to convert it to a list using list() before using print() to output it.

Let me say a few more words here. In the Python language specification, it has always been encouraged to achieve the same functionality with fewer and clearer code, because this can greatly improve the readability of code, reduce the chance of errors, and make it easier for others to quickly and accurately understand your intentions. However, it’s worth noting that the premise of “less code” here is clarity, not using more magic operations. Although it reduces the amount of code, it actually increases the difficulty of reading.

Now let’s get back to the point. Next, let’s look at another problem: given two sequences, determine if the first sequence is a subsequence of the second sequence. (LeetCode link: https://leetcode.com/problems/is-subsequence/)

Let’s first understand the problem itself. A sequence refers to a list, and a subsequence of a list means that the elements of one list appear in the second list in order, but they don’t have to be contiguous. For example, [1, 3, 5] is a subsequence of [1, 2, 3, 4, 5], but [1, 4, 3] is not.

To solve this problem, the conventional algorithm is the greedy algorithm. We maintain two pointers pointing to the beginning of the two lists, then we scan the second sequence, and if a number is the same as the one pointed to by the first pointer, we move the first pointer forward. If the first pointer moves past the last element of the first sequence, we return True, otherwise, we return False.

However, if we write this algorithm normally, it would probably take about ten lines or so. So what if we use iterators and generators?

def is_subsequence(a, b):
    b = iter(b)
    return all(i in b for i in a)

print(is_subsequence([1, 3, 5], [1, 2, 3, 4, 5]))
print(is_subsequence([1, 4, 3], [1, 2, 3, 4, 5]))

def is_subsequence(a, b):
    b = iter(b)
    print(b)

    gen = (i for i in a)
    print(gen)

    for i in gen:
        print(i)

    gen = ((i in b) for i in a)
    print(gen)

    for i in gen:
        print(i)

    return all(((i in b) for i in a))

print(is_subsequence([1, 3, 5], [1, 2, 3, 4, 5]))
print(is_subsequence([1, 4, 3], [1, 2, 3, 4, 5]))

First of all, the line b = iter(b) converts the list b into an iterator. I won’t explain why we do this for now.

The statement gen = (i for i in a) is easy to understand. It creates a generator that iterates over object a, so it can output 1, 3, 5. And (i in b) should be examined closely. Can you think of the for in statement?

That’s right, (i in b) is roughly equivalent to the following code:

while True:
    val = next(b)
    if val == i:
        yield True

This cleverly utilizes the characteristics of generators. When the next() function is called, it saves the current pointer. For example, consider the following example:

b = (i for i in range(5))

print(2 in b)
print(4 in b)
print(3 in b)

Output:

True
True
False

As for the all() function at the end, it’s quite simple. It is used to determine whether all elements of an iterator are True. If so, it returns True; otherwise, it returns False.

So we have elegantly solved this interview question. However, please be aware that it is best not to use this technique in interviews because your interviewer may not be familiar with the usage of generators and they may not have read my Geek Time column. However, you are now more proficient in this technical knowledge and its practical application in your work. Keep up the good work!

Summary #

In summary, today we discussed four different objects: containers, iterable objects, iterators, and generators.

Containers are iterable objects. By calling the iter() function on iterable objects, we can obtain an iterator. Iterators can use the next() function to retrieve the next element, thus supporting iteration.
Generators are a special type of iterator (note that the reverse is not true). Using generators allows you to write clearer code. By using generators effectively, you can reduce memory usage, optimize program structure, and improve program speed.
In Python 2, generators were an important implementation method for coroutines. However, with the introduction of the async/await syntax in Python 3.5, the way generators implement coroutines has become outdated. We will continue to delve into Python coroutines in the next lesson.

Thought-provoking question #

Finally, here’s a thought-provoking question for you. What happens if you continue calling next() on a generator after it has finished iterating? Can a generator be iterated multiple times?

Feel free to leave a comment and discuss with me. You are also welcome to share this article with your colleagues and friends, so that we can progress together through communication.