42 Data Types in the bufio Package - Part 1 #

Today, we will talk about another code package related to I/O operations, bufio. bufio is an abbreviation for “buffered I/O”. As the name suggests, the entities implemented in this code package all have built-in buffering for I/O operations.

The main data types in the bufio package are:

  1. Reader;
  2. Scanner;
  3. Writer and ReadWriter.

Like the data types in the io package, values of these types must wrap one or more values of simple I/O interface types when they are initialized. (Here, simple I/O interface types are the simple interfaces in the io package, such as io.Reader and io.Writer.)
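As a quick illustration of this wrapping, here is a minimal sketch that builds one value of each type around simple io values. The function name wrapDemo and the sample input are hypothetical and chosen only for this example.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// wrapDemo builds one value of each main bufio type around simple io values.
// It returns the first line read, the first word scanned, and the text that
// reached the underlying writer.
func wrapDemo(input string) (line, word, out string, err error) {
	// Reader wraps an io.Reader.
	reader := bufio.NewReader(strings.NewReader(input))
	if line, err = reader.ReadString('\n'); err != nil {
		return
	}

	// Scanner also wraps an io.Reader; here it splits the input into words.
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(bufio.ScanWords)
	if scanner.Scan() {
		word = scanner.Text()
	}
	if err = scanner.Err(); err != nil {
		return
	}

	// Writer wraps an io.Writer.
	var sink bytes.Buffer
	writer := bufio.NewWriter(&sink)

	// ReadWriter wraps an existing *Reader and *Writer.
	rw := bufio.NewReadWriter(reader, writer)
	if _, err = rw.WriteString("buffered io"); err != nil {
		return
	}
	if err = rw.Flush(); err != nil {
		return
	}
	return line, word, sink.String(), nil
}

func main() {
	fmt.Println(wrapDemo("hello world\nsecond line\n"))
}
```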

Next, we will discuss the bufio.Reader and bufio.Writer types through a series of questions (with a focus on the former). Today’s question is: What is the role of the buffer in a bufio.Reader value?

A typical answer to this question is as follows.

The buffer in a bufio.Reader value (referred to as a Reader value below) is actually a data storage intermediary that sits between the underlying reader and the read methods and their callers. The underlying reader refers to the io.Reader parameter value passed in when initializing this type of value.

The read methods of the Reader value generally first read data from its own buffer. At the same time, when necessary, they also pre-fetch a portion of data from the underlying reader and temporarily store it in the buffer for future use.

The benefit of having such a buffer is that it can reduce the execution time of the read methods in most cases. Although the read methods sometimes also need to fill the buffer, on the whole, the average execution time of the read methods is generally significantly shortened because of this.

Problem Analysis #

A zero value of the bufio.Reader type is not ready for immediate use, because the type contains some fields that need to be explicitly initialized. To make the internal process of its read methods easier to follow later on, I will briefly explain these fields here.

  1. buf: A field of type []byte, which represents the buffer as a byte slice. Although it is a slice type, its length is specified during initialization and remains unchanged afterwards.
  2. rd: A field of type io.Reader, which represents the underlying reader. The data in the buffer is copied from here.
  3. r: A field of type int, which represents the starting index for the next read from the buffer. We can call it the “read count”.
  4. w: A field of type int, which represents the starting index for the next write to the buffer. We can call it the “write count”.
  5. err: A field of type error. Its value is used to indicate any errors that occur while getting data from the underlying reader. After it is read or ignored, the value of this field is set to nil.
  6. lastByte: A field of type int, used to record the last byte read from the buffer. This value is used when a read is rolled back (an "unread" operation).
  7. lastRuneSize: A field of type int, used to record the number of bytes occupied by the last Unicode character read from the buffer. It is only assigned a meaningful value in the ReadRune method and is set to -1 in other cases.

The bufio package provides two functions for initializing Reader values, namely:

  • NewReader
  • NewReaderSize

Both functions return a value of type *bufio.Reader.

The NewReader function initializes a Reader value with a default buffer size. This default size is 4096 bytes, i.e., 4 KB. The NewReaderSize function allows the buffer size to be determined by the user.

Because the buffer size of the Reader value cannot be changed within its lifetime, there are times when some trade-offs need to be made. The NewReaderSize function provides a way to do this.
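The buffer size chosen by each function can be checked with the Size method of the Reader value. This is a minimal sketch; bufferSizes is a hypothetical helper, and the rounding up of very small sizes is an implementation detail I am assuming from observed behavior, so the code does not rely on a specific value for it.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// bufferSizes returns the buffer sizes of three Reader values: one with the
// default size, one with a large custom size, and one created with a very
// small requested size.
func bufferSizes() (def, custom, tiny int) {
	def = bufio.NewReader(strings.NewReader("data")).Size()
	custom = bufio.NewReaderSize(strings.NewReader("data"), 64*1024).Size()
	// Assumption: a very small request may be rounded up to an internal
	// minimum, so do not depend on getting exactly 1 byte here.
	tiny = bufio.NewReaderSize(strings.NewReader("data"), 1).Size()
	return def, custom, tiny
}

func main() {
	def, custom, tiny := bufferSizes()
	fmt.Println("default:", def)
	fmt.Println("custom: ", custom)
	fmt.Println("tiny:   ", tiny)
}
```

A larger buffer means fewer calls to the underlying reader at the cost of more memory per Reader value, which is the trade-off mentioned above.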

Among the read methods of the bufio.Reader type, both the Peek method and the ReadSlice method call an unexported method named fill. The purpose of the fill method is to fill the internal buffer. Let's focus on it for now.

The fill method first checks the read count of its value. If this count is not greater than 0, there are two possibilities.

One possibility is that all the bytes in the buffer are brand new, meaning they have not been read before. The other possibility is that the buffer has just been compressed.

Compressing the buffer includes two steps. The first step is to copy all the element values (or bytes) in the buffer within the range [read count, write count) to the beginning of the buffer.

For example, copy the byte in the buffer corresponding to the index represented by the read count to the position of index 0 and copy the byte immediately after it to the position of index 1, and so on.

This step does not have any side effects because it is based on two facts.

The first fact is that the bytes before the read count have already been read and will definitely not be read again, so it is safe to overwrite them.

The second fact is that, once the copy is done, every position after the copied bytes holds one of three things: a byte that has already been read, an unread byte that has already been copied toward the beginning of the buffer, or a zero (0x00) marking space that was never filled. Therefore, it is safe to write new bytes to these positions.

In the second step of compressing the buffer, the fill method sets the new write count to the difference between the original write count and the original read count. This difference represents the starting index for the first byte to be written after the compression.

In addition, the method sets the read count to 0. Obviously, after compression, reading bytes must start from the beginning of the buffer.

(Compression of the buffer in bufio.Reader)
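The two steps of compression can be sketched on a plain byte slice. The compact function below is a hypothetical stand-in written for this article, not the actual code of the fill method; it takes the read count and write count as parameters r and w.

```go
package main

import "fmt"

// compact slides the unread bytes buf[r:w] to the beginning of buf and
// returns the new read count and write count.
func compact(buf []byte, r, w int) (newR, newW int) {
	if r > 0 {
		// Step 1: copy the unread range [r, w) to the start of the buffer.
		copy(buf, buf[r:w])
		// Step 2: the new write count is the number of unread bytes,
		// and reading restarts from index 0.
		w -= r
		r = 0
	}
	return r, w
}

func main() {
	buf := []byte("abcdefgh")
	// Pretend 5 bytes were already read and all 8 positions were filled.
	r, w := compact(buf, 5, 8)
	fmt.Printf("buf=%q r=%d w=%d unread=%q\n", buf, r, w, buf[r:w])
}
```

After the call, the unread bytes "fgh" sit at the start of the slice, and the positions from index 3 onward still hold stale bytes that can be overwritten.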

In fact, when the fill method finds that the read count of its value is greater than 0, it compresses the buffer once. Afterwards, if there is still writable space in the buffer, the method fills it.

When filling the buffer, the fill method attempts to read enough bytes from the underlying reader and fill the space between the write count index and the end of the buffer as much as possible.

During this process, the method updates the write count promptly to ensure the correctness and order of filling. It also checks for any errors that occur while reading data from the underlying reader. If there is an error, it assigns the error value to the err field of its value and terminates the filling process.
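The fill method is unexported, but its effect can be observed from outside through the Peek and Buffered methods. The sketch below assumes a 16-byte buffer and a 32-byte in-memory source; fillDemo is a hypothetical name used only here.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// fillDemo returns the number of buffered bytes after an initial Peek,
// after reading 10 bytes, and after a second Peek that needs more data
// than is left in the buffer, plus the bytes seen by that second Peek.
func fillDemo() (first, afterRead, afterPeek int, peeked string, err error) {
	src := strings.NewReader("0123456789abcdefghijklmnopqrstuv") // 32 bytes
	r := bufio.NewReaderSize(src, 16)

	// The buffer is empty, so Peek has to fill it from the source.
	if _, err = r.Peek(1); err != nil {
		return
	}
	first = r.Buffered()

	// Consume 10 bytes; this advances the read count only.
	head := make([]byte, 10)
	if _, err = io.ReadFull(r, head); err != nil {
		return
	}
	afterRead = r.Buffered()

	// Only 6 unread bytes remain, so asking for 8 forces the unread bytes
	// to be moved to the front and the freed space to be refilled.
	var p []byte
	if p, err = r.Peek(8); err != nil {
		return
	}
	return first, afterRead, r.Buffered(), string(p), nil
}

func main() {
	fmt.Println(fillDemo())
}
```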

Okay, let’s pause here for now. In this question, I have summarized the basic structure of the bufio.Reader type, as well as some related functions and methods, and explained in detail the fill method of this type.

The fill method is an important component of the read process that we will explain later. At the very least, you should remember what this fill method roughly does.

Knowledge Expansion #

Question 1: When will the buffered data in a bufio.Writer be written to its underlying writer?

Let’s first take a look at the fields of the bufio.Writer type:

  1. err: This is a field of type error. Its value is used to represent any errors that occur while writing data to the underlying writer.
  2. buf: This is a field of type []byte, representing the buffer. Its length remains unchanged after initialization.
  3. n: This is a field of type int, representing the next index to write to in the buffer. We can call it the write count.
  4. wr: This is a field of type io.Writer, representing the underlying writer.

The bufio.Writer type has a method called Flush, which writes all the data temporarily stored in the buffer to the underlying writer. Once the data has been written to the underlying writer, the Flush method removes it from the buffer.

However, the removal here is sometimes only logical. Whether or not all the temporary data is successfully written, the Flush method handles it properly and ensures that no rewriting or omission occurs. The n field of this type plays an important role here.

All the data writing methods of a bufio.Writer value (referred to as a Writer value below) will call its Flush method when necessary.

For example, the Write method sometimes calls the Flush method after writing data to the buffer in order to make room for subsequent new data. The behavior of the WriteString method is similar.

Similarly, the WriteByte and WriteRune methods call the Flush method when there is not enough writable space in the buffer for new bytes or Unicode characters.

In addition, if the Write method finds that there are too many bytes to be written and the buffer is empty, it will bypass the buffer and write the data directly to the underlying writer.
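These cases can be observed by tracking how many bytes are waiting in the buffer and how many have reached the underlying writer. The following is a minimal sketch with a deliberately small 16-byte buffer; the snapshot type and the writerDemo function are hypothetical names for this example.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// snapshot records how many bytes are still buffered and how many have
// reached the underlying writer.
type snapshot struct {
	buffered, flushed int
}

// writerDemo performs a few writes on a Writer with a 16-byte buffer and
// returns a snapshot taken after each step.
func writerDemo() ([]snapshot, error) {
	var sink bytes.Buffer
	w := bufio.NewWriterSize(&sink, 16)
	var snaps []snapshot
	record := func() {
		snaps = append(snaps, snapshot{w.Buffered(), sink.Len()})
	}

	// Step 1: 10 bytes fit into the buffer, nothing reaches the sink.
	if _, err := w.Write(bytes.Repeat([]byte("a"), 10)); err != nil {
		return nil, err
	}
	record()

	// Step 2: 10 more bytes do not fit, so the buffer is filled, flushed,
	// and the remainder is buffered.
	if _, err := w.Write(bytes.Repeat([]byte("b"), 10)); err != nil {
		return nil, err
	}
	record()

	// Step 3: an explicit Flush empties the buffer.
	if err := w.Flush(); err != nil {
		return nil, err
	}
	record()

	// Step 4: a write larger than the empty buffer goes straight to the sink.
	if _, err := w.Write(bytes.Repeat([]byte("c"), 40)); err != nil {
		return nil, err
	}
	record()

	return snaps, nil
}

func main() {
	snaps, err := writerDemo()
	if err != nil {
		panic(err)
	}
	for i, s := range snaps {
		fmt.Printf("step %d: buffered=%d flushed=%d\n", i+1, s.buffered, s.flushed)
	}
}
```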

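The ReadFrom method, on the other hand, checks whether the underlying writer's type implements the io.ReaderFrom interface. If it does, and no data is waiting in the buffer, the method directly calls the ReadFrom method of the underlying writer and passes it the parameter value, so that the data held by the parameter value is written without going through the buffer.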

In summary, in normal circumstances, whenever the writable space in the buffer is insufficient to accommodate the new data that needs to be written, the Flush method will be called. Moreover, some methods of the bufio.Writer type sometimes try to take shortcuts and directly connect the data between the supply and demand sides by bypassing the buffer.

Once you understand these internal mechanisms, you can write your code more purposefully. Even so, the safest approach is to call the Flush method of the Writer value after you have written all your data to it.

Summary #

Today, we started with the question "What role does the buffer in bufio.Reader play?" and introduced some data types in the bufio package. In the next article, I will continue to expand on this question.

Please leave me a message and let’s discuss your thoughts on today’s content together. Thank you for listening, and see you again next time.
