Coder's Cat

Python: Generator, Yield and Stream

2020-03-26

file:img/2020_03_26_python-generator-and-yield.org_20200326_203548.png

Generator and yield are used frequently in Python. In this article, let’s discuss some basics of generator, the benefit for generator, and how we use yield to create a generator.

At the same time, we study two concepts in computer science: lazy evaluation and stream.

Iterables

First, we need to know iterables before understanding generator. Because generator is also an iterator in essential.

In Python, iterable is an object which can be iterated over, such as in a for-loop.

Most collection data structures are iterables. Such as list, tuple, set. For example, we create a list and iterate it one by one:

lst = [1, 2, 3]
for i in lst:
print(i)

# 1
# 2
# 3

lst = [x+x for x in range(3)]
for x in lst:
print(x)
# 0
# 2
# 4

We can also iterate the characters in a string:

string = "cat"
for c in string:
print(c)

# c
# a
# t

The limitation of iterable

The limitation of iterable is that we need to store all the values in memory before we begin to iterate over them. This will cost too much memory in some scenarios. A typical scenario is reading lines from a file:

def file_reader(file_path):
fp = open(file_path)
return fp.read().split("\n")

Think about what will happen if we read a large file, such as a file with a size of 6 GB?

We need to save all the lines in memory when loading content from the file.

Actually, in most cases, we only want to iterate line by line to finish some data processing tasks. We maybe break out the loop in advance, loading all lines into memory does not necessary.

Could we just have a strategy of reading data by need? Python introduced generator to solve this problem.

Generator

A generator is also iterator, but its key feature is lazy evaluation. Lazy evaluation is a classic concept in computer science and adopted by many programming languages such as Haskell. The core idea of lazy evaluation is call-by-need. Lazy evaluation can lead to a reduction in memory footprint.

A generator is an iterator in the style of iterating by need. We will not calculate and store the values at once, but generate them on the fly when we are iterating.

We have two ways to create a generator, generator expression and generator function.

generator expression is similar with list comprehension, except we use (). Since generator is a iterator, we use next function to get the next item:

g1 = (x*x for x in range(10))
print(type(g1))
print(next(g1))
print(next(g1))

# <type 'generator'>
# 0
# 1

The difference here is we don’t compute all the values when creating the generator. x*x is calculated when we are iterating the generator.

To understand the difference, let’s run this code snippet:

>>> import timeit
>>> timeit.timeit('lst = [time.sleep(1) for x in range(5)]', number=2)
10.032547950744629

>>> timeit.timeit('lst = (time.sleep(1) for x in range(5))', number=2)
1.0013580322265625e-05

As we can see from the result, when we create an iterable, it costs about 10 seconds. Because we execute the time.sleep(1) 10 times.

But time.sleep(1) actually does not run when we are creating the generator.

Yield

Another way to create generator is by using generator function, we use keyword yield to return a generator in a function.

Let’s check out this fib function which returns a generator with N Fibonacci numbers:

def fib(cnt):
n, a, b = 0, 0, 1
while n < cnt:
yield a
a, b = b, a + b
n = n + 1

g = fib(10)
for i in range(10):
print g.next(),

# 0 1 1 2 3 5 8 13 21 34

Let’s use yield to rewrite above file reading program:

def file_reader(file_path):
for row in open(file_path, "r"):
yield row

for row in file_reader('./demo.txt'):
print(row),

In this way, we won’t load all the content into memory, but loading it by reading the lines.

Stream

With a generator, we could construct a data structure with infinite items. This kind of sequence of data elements is called Stream in computer science. It’s fancy because we could express the infinite concept in mathematical.

Suppose we want a sequence with all the Fibonacci numbers. How to achieve this?

We only need to remove the count parameter from the above function!

def all_fib():
n, a, b = 0, 0, 1
while True:
yield a
a, b = b, a + b
n = n + 1

all_fib_numbers = all_fib()

Yes! We get a variable which could stand for all the Fibonacci numbers. Let’s write a generic function to take n items from any stream.

def take(seq, n):
result = []
try:
for i in range(n):
result.append(next(seq))
except StopIteration:
pass
return result

So, take(all_fib_numbers, 10) will return the first 10 Fibonacci numbers as a result.

Conclusion

Generator in Python is a powerful tool for delaying computation and saving both time and space. The core idea of lazy evaluation is: compute the value until you really need it. This also helps us to express the infinite sequence concepts.

Join my Email List for more insights, It's Free!😋

Tags: Python