06 Python Black Box Input and Output #

Hello, I’m Jingxiao.

There was a popular saying on the forums at the turn of the century: on the Internet, no one knows you’re a dog. When the Internet was just taking off, a network cable connected to your home, and information flew through this high-speed cable to your screen. You responded to your friend’s messages quickly on your keyboard, and the information flew back into the complex virtual world through the network cable, and then into your friend’s home. Abstractly speaking, each computer is a black box, and with input and output, it has the necessary conditions for the operation of a Turing machine.

A Python program is also a black box: it receives data through the input stream and sends out processed data through the output stream. Behind the Python interpreter, is there a person hiding, or a Schrödinger’s cat? No one cares.

Well, without further ado, today let’s talk about Python’s input and output, starting from the basics.

Input and Output Basics #

The simplest and most direct input comes from keyboard operations, like in the example below.

name = input('your name:')
gender = input('you are a boy?(y/n)')

###### Input ######
your name:Jack
you are a boy?(y/n)y

welcome_str = 'Welcome to the matrix {prefix} {name}.'
welcome_dic = {
    'prefix': 'Mr.' if gender == 'y' else 'Mrs.',
    'name': name
}

print('authorizing...')
print(welcome_str.format(**welcome_dic))

########## Output ##########
authorizing...
Welcome to the matrix Mr. Jack.

The input() function pauses program execution and waits for keyboard input. Its argument is the prompt message, and its return value is always a string (str). Beginners often trip over this, as the next example shows. The print() function accepts strings, numbers, dictionaries, lists, and even instances of custom classes.

Now let’s look at the example below.

a = input()
1
b = input()
2

print('a + b = {}'.format(a + b))
########## Output ##############
a + b = 12
print('type of a is {}, type of b is {}'.format(type(a), type(b)))
########## Output ##############
type of a is <class 'str'>, type of b is <class 'str'>
print('a + b = {}'.format(int(a) + int(b)))
########## Output ##############
a + b = 3

Here, note that to explicitly convert a str to an int, use int(); to convert it to a floating-point number, use float(). When converting types in production code, remember to wrap the conversion in try-except (error and exception handling, which will be discussed in later articles).
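As a minimal sketch of what such a guarded conversion might look like (the helper name to_int and its default parameter are made up for illustration, not part of the lesson's code):

```python
def to_int(raw, default=None):
    """Convert a string to int; return `default` if it is not a valid integer."""
    try:
        return int(raw)
    except ValueError:
        # int() raises ValueError for text such as 'abc', '' or '1.5'
        return default


print(to_int('42'))      # 42
print(to_int(' 7 '))     # 7, int() ignores surrounding whitespace
print(to_int('1.5'))     # None
print(to_int('abc', 0))  # 0
```

Returning a default is only one possible policy. Depending on the program, re-prompting the user or reporting the error may be more appropriate.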

Python’s int type has no maximum value (in contrast, in C++, where int is typically 32 bits, the maximum value is 2147483647, and exceeding it causes overflow), but the float type still has limited precision. Apart from being careful in algorithm competitions, you also need to keep these properties in mind in production environments, to avoid bugs and even 0day vulnerabilities caused by misjudging boundary conditions.
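A quick way to see both properties for yourself is the short experiment below. It is only an illustration and assumes the usual CPython build, where float is an IEEE 754 double:

```python
import math
import sys

big = 2147483647 + 1          # no overflow: Python ints grow as needed
huge = 2 ** 100

print(big)                    # 2147483648
print(huge)                   # 1267650600228229401496703205376

# Floats have finite precision, so some results are only approximate.
print(0.1 + 0.2 == 0.3)                       # False
print(math.isclose(0.1 + 0.2, 0.3))           # True: compare with a tolerance instead
print(float(2 ** 53) == float(2 ** 53 + 1))   # True: the extra 1 is lost
print(sys.float_info.max)                     # largest finite float on this platform
```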

Let’s take an example from the cryptocurrency industry. Around 11:30 AM on April 23, 2018, the BEC token contract was hacked. Attackers exploited an integer overflow vulnerability in the BEC contract, a project run jointly by Meitu and a company called BeautyChain. They transferred a huge number of BEC tokens to two addresses, which triggered a massive sell-off on the market. The token’s value plummeted to almost zero, dealing a devastating blow to BEC trading.

This shows that although input and output and type handling seem simple, we must exercise extreme caution. After all, a significant proportion of security vulnerabilities arise from careless I/O processing.

File Input and Output #

Command line input and output are the most basic ways of interacting with Python, suitable for simple program interactions. However, for production-level Python code, most input and output comes from files, networks, and messages from other processes.

Next, let’s analyze in detail how to read and write a text file. Let’s assume we have a text file called “in.txt” with the following contents:

I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today.

I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.

I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together.

This is our hope. . . With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day. . . .

And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: "Free at last! Free at last! Thank God Almighty, we are free at last!"

Okay, let’s do a simple NLP (Natural Language Processing) task. If you’re not familiar with NLP, don’t worry, I’ll guide you step by step through this task.

First, let’s understand the basic steps of an NLP task, which are as follows:

Read the file.
Remove all punctuation marks and newline characters, and convert all uppercase letters to lowercase.
Merge identical words, count the frequency of each word, and sort them in descending order of frequency.
Output the result to the file “out.txt” line by line.

You can think about how to solve this problem using Python. Here, I provide you with my code and detailed comments. Let’s take a look together.

import re

# You don't need to pay too much attention to this function
def parse(text):
    # Use regular expressions to remove punctuation marks and newline characters
    text = re.sub(r'[^\w ]', ' ', text)

    # Convert to lowercase
    text = text.lower()
    
    # Generate a list of all words
    word_list = text.split(' ')
    
    # Remove empty words
    word_list = filter(None, word_list)
    
    # Generate a dictionary of words and their frequencies
    word_cnt = {}
    for word in word_list:
        if word not in word_cnt:
            word_cnt[word] = 0
        word_cnt[word] += 1
    
    # Sort by word frequency
    sorted_word_cnt = sorted(word_cnt.items(), key=lambda kv: kv[1], reverse=True)
    
    return sorted_word_cnt

with open('in.txt', 'r') as fin:
    text = fin.read()

word_and_freq = parse(text)

with open('out.txt', 'w') as fout:
    for word, freq in word_and_freq:
        fout.write('{} {}\n'.format(word, freq))

########## Output (omitting intermediate results) ##########

and 15
be 13
will 11
to 11
the 10
of 10
a 8
we 8
day 6

...

old 1
negro 1
spiritual 1
thank 1
god 1
almighty 1
are 1

You don’t need to pay too much attention to the specific implementation of the parse() function. All you need to know is that it converts the input text string into the sorted word frequency statistics we need. The sorted_word_cnt is a list of tuples.
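If you are curious, the same counting and sorting can also be expressed with collections.Counter from the standard library. The sketch below is an alternative written for illustration (parse_with_counter is a made-up name), not the lesson's original implementation:

```python
import re
from collections import Counter


def parse_with_counter(text):
    # Same cleaning as parse(): non-word characters become spaces, then lowercase
    text = re.sub(r'[^\w ]', ' ', text).lower()
    # split() with no argument drops empty strings, so no filter() is needed
    counts = Counter(text.split())
    # most_common() returns a list of (word, count) tuples, highest count first
    return counts.most_common()


sample = 'Free at last! Free at last! Thank God Almighty, we are free at last!'
print(parse_with_counter(sample)[:3])
# [('free', 3), ('at', 3), ('last', 3)]
```

Like sorted_word_cnt, the return value is a list of tuples, so the loop that writes out.txt would work unchanged.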

First, we need to understand some basics of file access. The way the operating system kernel handles files is relatively complex and involves concepts such as kernel mode, the virtual file system, locks, and pointers. I won’t go into detail on these topics; I’ll just cover the basic knowledge that is sufficient for our purposes.

We start by calling the open() function to obtain a file object, which acts as a pointer to the file. The first parameter specifies the file location (relative or absolute path). The second parameter is the mode: 'r' means reading, 'w' means writing, and 'r+' opens the file for both reading and writing (note that 'rw' is not a valid mode and raises a ValueError). 'a' is a less commonly used (but very useful) mode that means append: writes to a file opened in append mode start at the end of the existing content.
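To see the modes in action, here is a small experiment. It writes to a scratch file in a temporary directory (the file name demo.txt is arbitrary) so that nothing real is overwritten:

```python
import os
import tempfile

# A scratch file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

with open(path, 'w') as f:      # 'w' creates the file, or truncates an existing one
    f.write('first line\n')

with open(path, 'a') as f:      # 'a' keeps the content and writes at the end
    f.write('second line\n')

with open(path, 'r') as f:      # 'r' is read-only and is also the default mode
    content = f.read()

with open(path, 'r+') as f:     # 'r+' allows both reading and writing
    f.read()                    # after reading everything, the position is at the end
    f.write('third line\n')

with open(path) as f:
    lines = f.read().splitlines()

print(lines)   # ['first line', 'second line', 'third line']
```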

Here I want to mention that at Facebook, permission management in code is taken very seriously. If you only need to read a file, don’t request write permission. This reduces, to some extent, the risk that a bug poses to the whole system.

Okay, back to our topic. After obtaining the pointer, we can use the read() function to read the entire contents of the file. The code text = fin.read() means that it reads all the contents of the file into memory and assigns them to the variable text. This approach naturally has pros and cons:

The advantage is convenience. We can easily call the parse function to analyze the text later.
The disadvantage is that if the file is too large, reading it all at once may exhaust memory and crash the program.

In that case, we can pass a size argument to read() to cap how much is read in one call. We can also use the readline() function to read one line at a time. This approach is common in data cleaning for data mining and is very convenient when writing small programs. If the lines are independent of each other, it also reduces the pressure on memory. The write() function writes the string passed as its argument to the file, which is just as easy to understand.
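Both styles of incremental reading are sketched below on a tiny scratch file. The file name and the chunk size of 4 characters are arbitrary choices for the demonstration; real code would use a much larger chunk size:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'lines.txt')
with open(path, 'w') as f:
    f.write('alpha\nbeta\ngamma\n')

# Line by line: iterating over the file object yields one line at a time,
# much like calling readline() repeatedly
line_count = 0
with open(path, 'r') as fin:
    for line in fin:
        line_count += 1          # only the current line is held in memory

# Fixed-size chunks: read(size) returns at most `size` characters,
# and an empty string once the end of the file is reached
chunks = []
with open(path, 'r') as fin:
    while True:
        chunk = fin.read(4)
        if not chunk:
            break
        chunks.append(chunk)

print(line_count)        # 3
print(''.join(chunks))   # the original text, reassembled from the chunks
```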

Here, I need to briefly mention the with statement (which will be covered in detail later). Every open() should be paired with a close(): if you open a file, you should close it as soon as you are done with it. However, if you use the with statement, you don’t need to call close() explicitly. Once the code inside the with block finishes, close() is called automatically, which makes the code much cleaner.
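To make the comparison concrete, here is a rough sketch of the manual version next to the with version. It uses a scratch file in a temporary directory, and the file name is arbitrary:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Manual version: close() must run even if write() raises
fout = open(path, 'w')
try:
    fout.write('hello\n')
finally:
    fout.close()

# with version: the file is closed when the block exits, even on an exception
with open(path, 'r') as fin:
    text = fin.read()

print(fout.closed, fin.closed)   # True True
print(text)
```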

Finally, note that every I/O operation should have error handling. I/O can fail in many ways (a missing file, insufficient permissions, a full disk), and a robust program should handle these cases rather than crash, unless crashing is the intended behavior.
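As one possible shape for such handling, the sketch below wraps a file read in try-except. The helper name read_text and the choice to return None on failure are assumptions made for this illustration:

```python
def read_text(path):
    """Return the file's text, or None if it cannot be read."""
    try:
        with open(path, 'r') as fin:
            return fin.read()
    except FileNotFoundError:
        print('file not found: {}'.format(path))
    except PermissionError:
        print('no permission to read: {}'.format(path))
    except OSError as e:
        # Catch-all for other I/O failures, e.g. the path is a directory
        print('could not read {}: {}'.format(path, e))
    return None


print(read_text('no_such_file_for_demo.txt'))   # prints the message, then None
```

The more specific exceptions are listed first because FileNotFoundError and PermissionError are both subclasses of OSError.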

JSON Serialization and Practice #

Finally, let me talk about a knowledge point that is closely related to practical applications.

JSON (JavaScript Object Notation) is a lightweight data interchange format. Its design intent is to represent everything as strings in a well-defined format, which is convenient both for transmitting information over the Internet and for human readability (compared to some binary protocols). JSON is used very widely on today’s Internet, and every Python programmer should be proficient with it.

Imagine a scenario where you want to buy a certain amount of stocks from an exchange. In order to do that, you need to submit a series of parameters such as stock code, direction (buy/sell), order type (market/limit), price (if it’s a limit order), quantity, etc. However, these data contain strings, integers, floating-point numbers, and even boolean variables all mixed together, which is not convenient for the exchange to unpack.

So what should you do?

In fact, JSON can solve this problem. You can simply think of it as two black boxes:

The first one takes these miscellaneous information, such as a Python dictionary, as input and outputs a string.
The second one takes this string as input and outputs a Python dictionary containing the original information.

The specific code is as follows:

import json

params = {
    'symbol': '123456',
    'type': 'limit',
    'price': 123.4,
    'amount': 23
}

params_str = json.dumps(params)

print('after json serialization')
print('type of params_str = {}, params_str = {}'.format(type(params_str), params_str))

original_params = json.loads(params_str)

print('after json deserialization')
print('type of original_params = {}, original_params = {}'.format(type(original_params), original_params))

########## Output ##########

after json serialization
type of params_str = <class 'str'>, params_str = {"symbol": "123456", "type": "limit", "price": 123.4, "amount": 23}
after json deserialization
type of original_params = <class 'dict'>, original_params = {'symbol': '123456', 'type': 'limit', 'price': 123.4, 'amount': 23}

In the code above,

The json.dumps() function accepts basic data types in Python and serializes them into a string.
The json.loads() function accepts a valid string and deserializes it into basic data types in Python.

Isn’t it simple?
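One caveat is worth knowing: the round trip does not always give back exactly the same Python object, because JSON has fewer types than Python. The dictionary below is made up purely to illustrate this:

```python
import json

data = {
    'pair': (1, 2),     # tuple
    1: 'int key',       # non-string key
    'flag': True,
    'nothing': None,
}

encoded = json.dumps(data)
print(encoded)
# {"pair": [1, 2], "1": "int key", "flag": true, "nothing": null}

decoded = json.loads(encoded)
print(decoded)
# {'pair': [1, 2], '1': 'int key', 'flag': True, 'nothing': None}

print(decoded == data)   # False: the tuple became a list and the key 1 became '1'
```

Types that have no JSON equivalent at all, such as set, make json.dumps() raise a TypeError.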

But as always, please remember to add error handling. Otherwise, if you pass an invalid string to json.loads() without catching the resulting exception, the program will crash.
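A minimal sketch of that error handling might look like this. The helper name safe_loads and the decision to return None are illustrative assumptions; json.JSONDecodeError is the exception that json.loads() raises for malformed input:

```python
import json


def safe_loads(raw):
    """Parse a JSON string; return None instead of raising on malformed input."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as e:
        # e.msg, e.lineno and e.colno describe where parsing failed
        print('invalid JSON at line {}, column {}: {}'.format(e.lineno, e.colno, e.msg))
        return None


print(safe_loads('{"symbol": "123456", "amount": 23}'))
print(safe_loads("{'symbol': '123456'}"))   # single quotes are not valid JSON
```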

At this point, you may wonder how to write a JSON string to a file, or how to read one back from a file.

Yes, you can still use the open() function and the read()/write() methods mentioned earlier: read or write the string through memory first, and then do the JSON encoding or decoding. However, this is a bit cumbersome. The json module also provides json.dump() and json.load(), which work directly with file objects:

import json

params = {
    'symbol': '123456',
    'type': 'limit',
    'price': 123.4,
    'amount': 23
}

with open('params.json', 'w') as fout:
    json.dump(params, fout)

with open('params.json', 'r') as fin:
    original_params = json.load(fin)

print('after json deserialization')
print('type of original_params = {}, original_params = {}'.format(type(original_params), original_params))

########## Output ##########

after json deserialization
type of original_params = <class 'dict'>, original_params = {'symbol': '123456', 'type': 'limit', 'price': 123.4, 'amount': 23}

This way, we have implemented the process of reading and writing JSON strings simply and clearly. When developing a third-party application, you can use JSON to output the user’s personal configuration to a file for easy retrieval when the program starts automatically next time. This is also a mature practice widely used today.

So is JSON the only choice? Obviously not. It is just one of the most convenient choices for lightweight applications. As far as I know, Google has a similar tool called Protocol Buffers. Google has fully open-sourced it, so you can read up on how to use it.

Compared to JSON, the advantage of Protocol Buffers is that it produces a compact binary encoding, so it performs better. The trade-off is that the binary output is not human-readable. It is widely used in systems with strict performance requirements, such as TensorFlow.

Summary #

In this lesson, we mainly learned about Python’s regular I/O and file I/O. We also gained a basic understanding of JSON serialization and further mastered it through specific examples. Here are a few points that need to be emphasized:

Be cautious when performing I/O operations. Make sure to handle errors adequately and code carefully to prevent vulnerabilities.
When coding, make a reasonable estimate of memory usage and disk usage. This makes it easier to find the cause of errors when they occur.
JSON serialization is a convenient tool. Practice it more in real-world scenarios.
Keep the code as concise and clear as possible, even in the early stages of learning. Carry yourself like a seasoned engineer from the start.

Thought Questions #

Finally, I have two thought questions for you.

Question 1: Can you implement the NLP word count example again? This time, however, the in.txt file may be very large (which means you cannot read it into memory all at once), while the out.txt file will not be very large (which means there will be many repeated words).

Hint: You may need to read a certain length of strings each time for processing, and then read the next batch. However, if you simply divide the text by length, you may separate a word. So you need to handle this boundary case carefully.
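If you get stuck on the boundary case, the sketch below shows one possible way to handle it. It is not the reference answer; the function name, the chunk_size parameter, and the use of collections.Counter are all choices made for this illustration. The idea is to hold back the last piece of each chunk, since it may be an incomplete word, and prepend it to the next chunk:

```python
import io
import re
from collections import Counter


def count_words_in_chunks(fin, chunk_size=1024 * 1024):
    """Count words from a text file object without loading it all at once."""
    counts = Counter()
    leftover = ''                      # possibly incomplete word from the previous chunk
    while True:
        chunk = fin.read(chunk_size)
        if not chunk:
            break
        text = re.sub(r'[^\w ]', ' ', leftover + chunk).lower()
        words = text.split(' ')
        # The last piece may be cut mid-word, so hold it back for the next round.
        # If the chunk ended on a separator, this piece is simply ''.
        leftover = words.pop()
        counts.update(w for w in words if w)
    if leftover:
        counts[leftover] += 1          # flush the final word
    return counts


# A tiny in-memory "file" and an absurdly small chunk size, to force split words
demo = io.StringIO('Free at last! Free at last!\nThank God Almighty, we are free at last!')
result = count_words_in_chunks(demo, chunk_size=5)
print(result['free'], result['last'], result['almighty'])   # 3 3 1
```

With a real file you would pass the object returned by open('in.txt', 'r') and keep the default chunk size or tune it to the available memory.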

Question 2: You have probably used cloud storage services such as Baidu Netdisk or Dropbox. However, their space may be limited (e.g., 5GB). Suppose one day you plan to transfer 100GB of data from home to your company, but unfortunately you didn’t bring a USB flash drive. So you come up with an idea:

The home computer writes data to Dropbox in batches of no more than 5GB. As soon as the company computer detects the data, it copies it locally and deletes it from the cloud. Once the home computer detects that the current batch has been completely transferred to the company computer, it writes the next batch, and so on until all the data has been transferred.

Based on this idea, you plan to write a server.py at home and a client.py at the company to implement this requirement.

Hint: We assume that each file will not exceed 5GB.

You can synchronize the state by writing a control file (config.json). However, be careful when designing the state, as race conditions may occur.
You can also synchronize the state by directly detecting if the file has been generated or deleted, which is the simplest approach.

Don’t worry about the difficulty. Feel free to write down your thoughts, and I will also prepare the final code for you.

Feel free to write your answers in the comments section, and feel free to share this article with your colleagues and friends to learn together through thinking.