Generators and yield are used frequently in Python. In this article, let's discuss the basics of generators, the benefits they bring, and how to use
yield to create one.
Along the way, we'll study two concepts from computer science: lazy evaluation and streams.
First, we need to understand
iterables, because a generator
is, in essence, also an iterator.
In Python, an iterable is an object that can be iterated over, such as in a for-loop.
Most collection data structures, such as lists, tuples, and sets, are iterables. For example, we can create a list and iterate over it one element at a time:
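A minimal sketch of what this looks like (the list contents are illustrative):

```python
numbers = [10, 20, 30]

# the for-loop drives Python's iteration protocol for us
for number in numbers:
    print(number)
```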
We can also iterate over the characters in a string:
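For instance:

```python
# a string is also an iterable: the loop visits one character at a time
for char in "hello":
    print(char)
```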
The limitation of iterables
The limitation of iterables is that all the values must be stored in memory before we begin to iterate over them. This costs too much memory in some scenarios. A typical scenario is reading lines from a file:
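A sketch of the eager approach, using a small temporary file as a stand-in for a large one (the file name and contents here are illustrative):

```python
import os
import tempfile

# create a small demo file; imagine it were gigabytes instead
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w") as f:
    f.write("line 1\nline 2\nline 3\n")

# readlines() loads *every* line into memory as one big list
with open(path) as f:
    lines = f.readlines()

for line in lines:
    print(line.strip())
```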
Think about what happens if we read a large file, say one of 6 GB:
every line is held in memory at once while the content is loaded.
In most cases, though, we only want to iterate line by line to finish some data processing task, and we may even break out of the loop early, so loading all the lines into memory is unnecessary.
Could we instead read the data only as it is needed? Python introduced
generators to solve this problem.
A generator is also an iterator, but its key feature is lazy evaluation. Lazy evaluation is a classic concept in computer science, adopted by many programming languages such as Haskell. The core idea of lazy evaluation is call-by-need, and it can lead to a significant reduction in memory footprint.
A generator is an iterator that iterates by need: it does not calculate and store all the values at once, but generates them on the fly as we iterate.
There are two ways to create a
generator: a generator expression and a generator function.
A generator expression is similar to a list comprehension, except that it uses
parentheses (). Since a generator is an iterator, we can use the
next function to get the next item:
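A minimal sketch of a generator expression:

```python
# parentheses instead of brackets: a generator, not a list
squares = (x * x for x in range(5))

print(next(squares))  # 0
print(next(squares))  # 1
print(list(squares))  # [4, 9, 16], the remaining items
```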
The difference is that we don't compute all the values when creating the generator;
x*x is only calculated as we iterate over it.
To understand the difference, let’s run this code snippet:
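One way to demonstrate this, assuming a slow_square helper that sleeps for one second per value:

```python
import time

def slow_square(x):
    time.sleep(1)
    return x * x

start = time.time()
squares_list = [slow_square(x) for x in range(10)]  # eager: sleeps 10 times now
list_elapsed = time.time() - start
print(f"building the list took {list_elapsed:.1f}s")

start = time.time()
squares_gen = (slow_square(x) for x in range(10))   # lazy: no sleep yet
gen_elapsed = time.time() - start
print(f"building the generator took {gen_elapsed:.1f}s")
```

Building the list blocks for roughly 10 seconds, while building the generator returns almost instantly.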
As we can see from the result, creating the list costs about 10 seconds, because time.sleep(1) executes 10 times.
When we create the generator, however, time.sleep(1) does not run at all.
The other way to create a
generator is with a generator function: a function that contains the keyword
yield and therefore returns a generator when called.
Let's check out this
fib function, which returns a generator of the first N Fibonacci numbers:
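A sketch of such a fib function (here the sequence starts at 0, 1):

```python
def fib(count):
    """Yield the first `count` Fibonacci numbers."""
    a, b = 0, 1
    while count > 0:
        yield a
        a, b = b, a + b
        count -= 1

print(list(fib(10)))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```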
We can also use yield to rewrite the file reading program from above:
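A sketch of the lazy version (the small demo file is again a stand-in for a large one):

```python
import os
import tempfile

def read_lines(path):
    """Yield lines one at a time instead of loading the whole file."""
    with open(path) as f:
        for line in f:
            yield line

# demo with a small temp file
path = os.path.join(tempfile.mkdtemp(), "big.txt")
with open(path, "w") as f:
    f.write("line 1\nline 2\nline 3\n")

for line in read_lines(path):
    print(line.strip())
```

In fact, Python file objects are themselves lazy iterators over lines, so `for line in f:` alone already reads the file one line at a time.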
In this way, we don't load all the content into memory at once; instead, lines are read one at a time as we iterate.
With a generator, we can construct a data structure with infinitely many items. This kind of sequence of data elements is called a stream in computer science. It is powerful because it lets us express the mathematical concept of infinity in code.
Suppose we want a sequence of all the Fibonacci numbers. How can we achieve this?
We only need to remove the count parameter from the function above!
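Dropping the parameter turns fib into an infinite stream (a sketch):

```python
def fib():
    """Yield Fibonacci numbers forever: an infinite stream."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

all_fib_numbers = fib()
print(next(all_fib_numbers))  # 0
print(next(all_fib_numbers))  # 1
print(next(all_fib_numbers))  # 1
```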
Yes! We get a value that stands for all the Fibonacci numbers. Let's write a generic function to take the first n items from any stream:
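One possible take, built on itertools.islice; the infinite fib generator from above is repeated here so the snippet is self-contained:

```python
from itertools import islice

def take(stream, n):
    """Return the first n items of any iterator as a list."""
    return list(islice(stream, n))

def fib():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

all_fib_numbers = fib()
print(take(all_fib_numbers, 10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```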
Calling take(all_fib_numbers, 10) will return the first 10 Fibonacci numbers as a result.
Generators in Python are a powerful tool for delaying computation, saving both time and space. The core idea of lazy evaluation is: don't compute a value until you really need it. It also lets us express infinite sequences in our programs.