14 File IO: Implementing Efficient and Correct File Read-Write Is Not Easy #

Today, let’s talk about how to implement efficient and correct file operations.

With the maturity and popularity of database systems, the need to perform file IO directly has become rarer and rarer, which has left many of us unfamiliar with the relevant APIs. When a requirement such as file export or third-party file reconciliation does come up, we can only piece together random code found online to get the job done, and when a performance issue or bug appears, we don’t know where to start.

In this article, I will start from three common problem areas: character encoding, buffering, and file handle release, and share how to solve the performance issues and bugs they cause in file operations. If you are not familiar with the file-operation APIs, you can refer to the introduction on the Oracle official website.

Ensuring Consistent Character Encoding for File Read/Write #

There was a project that needed to reconcile third-party account files periodically. The reconciliation originally ran on a single machine without any issues. To improve performance, it was changed to a dual-node setup in which each node processes a portion of the account data, but the newly added node always produced garbled text when reading Chinese characters from the files.

The code on both nodes was identical, so why did only the new node have this problem? Most likely, the code was written without paying attention to character encoding. Let’s analyze this further.

To simulate this scenario, we write “你好hi” to a text file named hello.txt using GBK encoding, then read the file directly as a byte array, convert it to a hexadecimal string, and write it to the log:

Files.deleteIfExists(Paths.get("hello.txt"));
Files.write(Paths.get("hello.txt"), "你好hi".getBytes(Charset.forName("GBK")));
log.info("bytes:{}", Hex.encodeHexString(Files.readAllBytes(Paths.get("hello.txt"))).toUpperCase());

The output is as follows:

13:06:28.955 [main] INFO org.geekbang.time.commonmistakes.io.demo3.FileBadEncodingIssueApplication - bytes:C4E3BAC36869

Although we see “你好hi” when we open the text file, the computer actually stores it as binary data according to certain rules. These rules are defined by character sets, which enumerate the mapping between characters and their binary representations. When reading and writing a file at the byte level, there is no need to worry about character encoding; but as soon as we read or write at the character level, we must specify which character set, that is, which encoding, to use.
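
To make the mismatch concrete, here is a small sketch of my own (not part of the original reconciliation code) that decodes the same GBK bytes with two different charsets; only the charset that was used for encoding recovers the text:

byte[] gbkBytes = "你好hi".getBytes(Charset.forName("GBK")); // C4E3BAC36869
// Decoding GBK bytes as UTF-8 cannot map C4E3/BAC3 to valid code points, so we get replacement characters
log.info("decoded as UTF-8: {}", new String(gbkBytes, StandardCharsets.UTF_8));
// Decoding with the original charset restores the text
log.info("decoded as GBK: {}", new String(gbkBytes, Charset.forName("GBK")));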

The code that caused the problem when reading the file was as follows:

char[] chars = new char[10];
String content = "";
try (FileReader fileReader = new FileReader("hello.txt")) {
    int count;
    while ((count = fileReader.read(chars)) != -1) {
        content += new String(chars, 0, count);
    }
}
log.info("result:{}", content);

As we can see, FileReader was used to read the file in character format, and the Chinese characters in the log output became garbled:

13:06:28.961 [main] INFO org.geekbang.time.commonmistakes.io.demo3.FileBadEncodingIssueApplication - result:���hi

Clearly, we did not specify the character set to use when reading characters from the file. According to the JDK documentation, FileReader reads the file using the default character set of the JVM, which is derived from the operating system. If we want to specify the character set, we need to use InputStreamReader together with FileInputStream instead.
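
Before fixing anything, it helps to confirm what FileReader is implicitly using. The one-liner below is my own diagnostic sketch; on JDK 8 the value is derived from the OS locale and can usually be overridden with the file.encoding system property:

// This is the charset FileReader falls back to when none is specified
log.info("default charset: {}", Charset.defaultCharset());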

Now that we understand this issue, fixing it is simple. As mentioned in the documentation, we can directly use FileInputStream to obtain the file stream, then use InputStreamReader to read the character stream and specify the character set as GBK:

private static void right1() throws IOException {

    char[] chars = new char[10];
    String content = "";

    try (FileInputStream fileInputStream = new FileInputStream("hello.txt");
        InputStreamReader inputStreamReader = new InputStreamReader(fileInputStream, Charset.forName("GBK"))) {
        int count;
        while ((count = inputStreamReader.read(chars)) != -1) {
            content += new String(chars, 0, count);
        }
    }

    log.info("result: {}", content);
}

From the log, we can see that the fixed code correctly reads “你好hi”:

13:06:28.963 [main] INFO org.geekbang.time.commonmistakes.io.demo3.FileBadEncodingIssueApplication - result: 你好hi

If you find this method cumbersome, you can use the readAllLines method of the Files class introduced in JDK 1.7, which allows you to read the file content with just one line of code:

log.info("result: {}", Files.readAllLines(Paths.get("hello.txt"), Charset.forName("GBK")).stream().findFirst().orElse(""));

However, this method will encounter an OutOfMemoryError when reading large files that exceed the memory size. Why is that?

Looking at the source code of the readAllLines method, we can see that it reads the entire contents of the file and stores it in a List. If the memory cannot accommodate this List, an OutOfMemoryError will occur:

public static List<String> readAllLines(Path path, Charset cs) throws IOException {

    try (BufferedReader reader = newBufferedReader(path, cs)) {
        List<String> result = new ArrayList<>();
        for (;;) {
            String line = reader.readLine();
            if (line == null)
                break;
            result.add(line);
        }
        return result;
    }
}

Is there a way to achieve on-demand streaming read, such as reading the file only when consuming a certain line, rather than reading the entire file into memory at once?

Of course there is. The solution is to use the lines method of the Files class. Next, I will tell you about some issues to keep in mind when using the lines method.

Use the static methods of the Files class for file operations, but be mindful of releasing file handles #

Unlike the readAllLines method, which returns a List, the lines method returns a Stream. This allows us to read and use the contents of the file as needed, instead of reading all the contents into memory at once, thus avoiding OOM (Out of Memory) errors.

Next, I will test this with a code snippet. We will read a file of 100,000,001 lines that occupies more than 4GB of disk space. With the JVM started with “-Xmx512m -Xms512m” to limit the maximum heap to 512MB, the file is far too large to be read into memory in one go, but the Files.lines method handles it.
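
The article does not show how test.txt was generated; a minimal sketch of my own (the exact content of the original file may have differed) that writes 100,000,001 UUID lines, producing a file of several gigabytes, could look like this:

// Generate test.txt line by line through a buffered writer, so the generator itself also stays within a small heap
try (BufferedWriter writer = Files.newBufferedWriter(Paths.get("test.txt"), StandardCharsets.UTF_8)) {
    for (long i = 0; i < 100_000_001L; i++) {
        writer.write(UUID.randomUUID().toString());
        writer.newLine();
    }
}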

In the code below, we first output the size of the file, then calculate the time difference between reading 200,000 lines and 2,000,000 lines, and finally read the file line by line to count the total number of lines:

// Output file size
log.info("file size:{}", Files.size(Paths.get("test.txt")));
StopWatch stopWatch = new StopWatch();
stopWatch.start("read 200,000 lines");

// Read 200,000 lines using Files.lines method
log.info("lines {}", Files.lines(Paths.get("test.txt")).limit(200_000).collect(Collectors.toList()).size());
stopWatch.stop();
stopWatch.start("read 2,000,000 lines");

// Read 2,000,000 lines using Files.lines method
log.info("lines {}", Files.lines(Paths.get("test.txt")).limit(2_000_000).collect(Collectors.toList()).size());
stopWatch.stop();
log.info(stopWatch.prettyPrint());
AtomicLong atomicLong = new AtomicLong();

// Count the total number of lines using Files.lines method
Files.lines(Paths.get("test.txt")).forEach(line -> atomicLong.incrementAndGet());
log.info("total lines {}", atomicLong.get());

The output results are as follows:

[Figure: StopWatch output showing the file size, the time taken to read 200,000 and 2,000,000 lines, and the total line count]

As we can see, the code successfully reads and counts the total number of lines in the file without encountering OOM errors. It takes 760ms to read 2,000,000 lines and only 267ms to read 200,000 lines. This demonstrates that the Files.lines method does not read the entire file at once, but rather reads as needed.

Now, do you see any issues with this code?

The problem is that the file is not closed after reading. We usually assume that static method calls do not involve resource release, as the method call ending naturally implies that the resource usage is complete and the API will release the resources. However, this is not the case for some methods in the Files class that return Streams. This is a serious issue that is often overlooked.

I once encountered a case in production where, after running for some time, the program began failing with “Too many open files” errors. We initially thought the maximum file handle limit set by the OS was too low, so we asked the operations team to raise it, but the problem persisted even after the limit was increased. Upon investigation, we found that file handles were not being released, and the issue was traced to the Files.lines method.

Let’s reproduce this problem by randomly writing 10 lines of data to a demo.txt file:

Files.write(Paths.get("demo.txt"),
IntStream.rangeClosed(1, 10).mapToObj(i -> UUID.randomUUID().toString()).collect(Collectors.toList()),
StandardCharsets.UTF_8, StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);

Then, let’s use the Files.lines method to read this file 1 million times and increment a counter for each line read:

LongAdder longAdder = new LongAdder();
IntStream.rangeClosed(1, 1_000_000).forEach(i -> {
    try {
        Files.lines(Paths.get("demo.txt")).forEach(line -> longAdder.increment());
    } catch (IOException e) {
        e.printStackTrace();
    }
});

log.info("total: {}", longAdder.longValue());

After running this code, you will immediately see the following error in the logs:

java.nio.file.FileSystemException: demo.txt: Too many open files
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)

By using the lsof command to view the files opened by the process, you can see that over 10,000 instances of the demo.txt file were opened:

lsof -p 63937
...
java    63902 zhuye *238r   REG                1,4      370         12934160647 /Users/zhuye/Documents/common-mistakes/demo.txt
java    63902 zhuye *239r   REG                1,4      370         12934160647 /Users/zhuye/Documents/common-mistakes/demo.txt
...
lsof -p 63937 | grep demo.txt | wc -l
   10007

In fact, the JDK documentation for Files.lines points out that the returned stream should be closed, typically by using it inside a try-with-resources statement, to ensure that the underlying file resources are released.

This is also easy to understand: with stream processing, if we don’t explicitly tell the program that we are done with the stream, it has no way of knowing, and it cannot decide on our behalf when to close the file.

The fix is simple: wrap the Stream in a try-with-resources statement:

LongAdder longAdder = new LongAdder();
IntStream.rangeClosed(1, 1_000_000).forEach(i -> {
    try (Stream<String> lines = Files.lines(Paths.get("demo.txt"))) {
        lines.forEach(line -> longAdder.increment());
    } catch (IOException e) {
        e.printStackTrace();
    }
});

log.info("total: {}", longAdder.longValue());

After making this modification, there are no more error logs because the file is properly closed. Since we read the file containing 10 lines of data 1 million times, the correct output is 10 million:

14:19:29.410 [main] INFO org.geekbang.time.commonmistakes.io.demo2.FilesStreamOperationNeedCloseApplication - total: 10000000

Looking at the source code of the lines method, you can see that the Stream’s close method registers a callback to close the BufferedReader and release resources:

public static Stream<String> lines(Path path, Charset cs) throws IOException {

    BufferedReader br = Files.newBufferedReader(path, cs);
    try {
        return br.lines().onClose(asUncheckedRunnable(br));
    } catch (Error|RuntimeException e) {
        try {
            br.close();
        } catch (IOException ex) {
            try {
                e.addSuppressed(ex);
            } catch (Throwable ignore) {}
        }
        throw e;
    }
}

private static Runnable asUncheckedRunnable(Closeable c) {
    return () -> {
        try {
            c.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    };
}

As the name suggests, BufferedReader buffers its character-stream reads. The idea behind buffering is to use an area of memory as a staging ground between the program and the underlying file, instead of operating on the file directly every time.

For example, when reading a file, a large chunk of data (say 8KB) is read into the buffer in one go, and subsequent reads are served straight from the buffer, avoiding a file I/O operation for every read. The same principle applies to writing: if we performed an I/O operation every time we wrote a few dozen bytes, writing a file of a few hundred megabytes would require millions of I/O operations and take an enormous amount of time.

Next, I will conduct a few experiments to explain the importance of using buffers and compare the performance of different usage methods for file reading and writing. This will help you understand and use buffers correctly.

Pay attention to buffer settings when reading and writing files #

I once encountered a case where a piece of code read a file, performed some simple processing, and wrote the result to another file. The developers used single-byte read and write methods, which made execution very slow; once system load increased, a run could take hours to finish.

Let’s simulate the implementation. Create a file and randomly write 1 million lines of data, with a file size of around 35MB:

Files.write(Paths.get("src.txt"),
IntStream.rangeClosed(1, 1000000).mapToObj(i -> UUID.randomUUID().toString()).collect(Collectors.toList())
, UTF_8, CREATE, TRUNCATE_EXISTING);

The file processing code written by the developers at that time was roughly like this: use FileInputStream to get a file input stream, then read one byte at a time using its read method, and finally write the processed result to another file using a FileOutputStream.

To simplify the logic and make it easier to understand, we won’t do any data processing here, but directly write the data from the original file to the target file, which is equivalent to file copying:

private static void perByteOperation() throws IOException {

    try (FileInputStream fileInputStream = new FileInputStream("src.txt");
         FileOutputStream fileOutputStream = new FileOutputStream("dest.txt")) {
        int i;
        while ((i = fileInputStream.read()) != -1) {
            fileOutputStream.write(i);
        }
    }
}

With this implementation, it took 190 seconds to copy a 35MB file.

Obviously, performing one I/O operation for each byte read or written is too costly. The solution is to use a buffer as an intermediate, read a certain amount of data from the original file into the buffer at once, and write a certain amount of data from the buffer to the target file at once.

After the improvement, we use a buffer of 100 bytes. We use the byte[] overload of FileInputStream to read a certain number of bytes of data at once, and use the byte[] overload of FileOutputStream to write a certain number of bytes of data from the buffer to the file at once:

private static void bufferOperationWith100Buffer() throws IOException {

    try (FileInputStream fileInputStream = new FileInputStream("src.txt");
         FileOutputStream fileOutputStream = new FileOutputStream("dest.txt")) {
        byte[] buffer = new byte[100];
        int len = 0;
        while ((len = fileInputStream.read(buffer)) != -1) {
            fileOutputStream.write(buffer, 0, len);
        }
    }
}

By using a buffer of only 100 bytes as an intermediate, the copying time for a 35MB file is reduced to 26 seconds, which is 7 times faster than without a buffer. If the buffer size is increased to 1000 bytes, the time can be further shortened to 342 milliseconds. As you can see, using a suitable buffer can significantly improve performance in file I/O processing.
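
The 1000-byte variant mentioned above is not shown separately; it differs only in the buffer size. A sketch (the method name is my own):

private static void bufferOperationWith1000Buffer() throws IOException {

    try (FileInputStream fileInputStream = new FileInputStream("src.txt");
         FileOutputStream fileOutputStream = new FileOutputStream("dest.txt")) {
        byte[] buffer = new byte[1000]; // ten times the previous buffer size, so one tenth of the read/write calls
        int len = 0;
        while ((len = fileInputStream.read(buffer)) != -1) {
            fileOutputStream.write(buffer, 0, len);
        }
    }
}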

You may say, it’s troublesome to manually create a buffer when implementing file reading and writing. Aren’t there BufferedInputStream and BufferedOutputStream that can handle buffered processing for input and output streams?

Yes, they internally implement a default buffer size of 8KB. However, when using BufferedInputStream and BufferedOutputStream, I still recommend using an additional buffer for reading and writing, rather than performing byte-by-byte operations just because they have internal buffers.

Next, I will write a piece of code to compare the performance of three methods of reading and writing one byte:

  1. Directly use BufferedInputStream and BufferedOutputStream;
  2. Use an additional 8KB buffer with BufferedInputStream and BufferedOutputStream;
  3. Directly use FileInputStream and FileOutputStream, and use an additional 8KB buffer.

// Use BufferedInputStream and BufferedOutputStream
private static void bufferedStreamByteOperation() throws IOException {
    try (BufferedInputStream bufferedInputStream = new BufferedInputStream(new FileInputStream("src.txt"));
         BufferedOutputStream bufferedOutputStream = new BufferedOutputStream(new FileOutputStream("dest.txt"))) {
        int i;
        while ((i = bufferedInputStream.read()) != -1) {
            bufferedOutputStream.write(i);
        }
    }
}

// Use an additional 8KB buffer with BufferedInputStream and BufferedOutputStream
private static void bufferedStreamBufferOperation() throws IOException {
    try (BufferedInputStream bufferedInputStream = new BufferedInputStream(new FileInputStream("src.txt"));
         BufferedOutputStream bufferedOutputStream = new BufferedOutputStream(new FileOutputStream("dest.txt"))) {
        byte[] buffer = new byte[8192];
        int len = 0;
        while ((len = bufferedInputStream.read(buffer)) != -1) {
            bufferedOutputStream.write(buffer, 0, len);
        }
    }
}

// Directly use FileInputStream and FileOutputStream with an 8KB buffer
private static void largerBufferOperation() throws IOException {
    try (FileInputStream fileInputStream = new FileInputStream("src.txt");
         FileOutputStream fileOutputStream = new FileOutputStream("dest.txt")) {
        byte[] buffer = new byte[8192];
        int len = 0;
        while ((len = fileInputStream.read(buffer)) != -1) {
            fileOutputStream.write(buffer, 0, len);
        }
    }
}

The results are as follows:

---------------------------------------------
ns         %     Task name
---------------------------------------------
1424649223  086%  bufferedStreamByteOperation
117807808  007%  bufferedStreamBufferOperation
112153174  007%  largerBufferOperation

As you can see, although the first method uses buffered streams, the byte-by-byte operation is still slow due to the large number of method calls, taking 1.4 seconds. The performance of the second and third methods is similar, taking around 110 milliseconds. Although the third method does not use buffered streams, it uses a buffer of the same size (8KB) as the default buffer size of the buffered streams.

At this point, you may be wondering, what is the significance of using BufferedInputStream and BufferedOutputStream if they are not faster than direct FileInputStream and FileOutputStream with a larger buffer size?

In fact, in this example I used a fixed-size buffer purely for demonstration. In real code, the amount of data read or written per call often varies, sometimes a few bytes and sometimes a few hundred. In that situation, having BufferedInputStream and BufferedOutputStream maintain a stable, fixed-size internal buffer as a second layer is very valuable.
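
For example, when a parser consumes a file in small, variable-size pieces (a length field here, a payload there), wrapping the raw stream in a BufferedInputStream keeps those small reads in memory. The following is a rough sketch under my own assumptions (the file name records.bin and the length-prefixed record layout are made up):

try (DataInputStream in = new DataInputStream(
        new BufferedInputStream(new FileInputStream("records.bin")))) {
    while (true) {
        int length;
        try {
            length = in.readInt(); // reads only 4 bytes, served from the 8KB internal buffer
        } catch (EOFException e) {
            break; // reached the end of the file
        }
        byte[] payload = new byte[length];
        in.readFully(payload); // variable-size read, still mostly served from the buffer
        // ... process the record ...
    }
}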

Finally, I would like to add that for file copying operations like this, if you want even higher performance, you can use the transferTo method of FileChannel to copy the data. On some operating systems (such as newer versions of Linux and UNIX), it can take advantage of DMA (direct memory access), so the data is transferred from the source to the destination directly over the bus, without being copied through user-space memory by the CPU:

private static void fileChannelOperation() throws IOException {
    FileChannel in = FileChannel.open(Paths.get("src.txt"), StandardOpenOption.READ);
    FileChannel out = FileChannel.open(Paths.get("dest.txt"), CREATE, WRITE);
    in.transferTo(0, in.size(), out);
}
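
One caveat about the snippet above: for brevity it never closes the two channels. Given the handle-leak discussion earlier, real code should release them, for example with try-with-resources (a variation of my own):

private static void fileChannelOperationWithClose() throws IOException {
    try (FileChannel in = FileChannel.open(Paths.get("src.txt"), StandardOpenOption.READ);
         FileChannel out = FileChannel.open(Paths.get("dest.txt"), CREATE, WRITE)) {
        in.transferTo(0, in.size(), out);
    }
}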

You can learn more about the transferTo method in this article.

While testing the performance of FileChannel, I will also run all the implementations in this section to compare the time taken to read and write a 35MB file.

---------------------------------------------
ns         %     Task name
---------------------------------------------
183673362265  098%  perByteOperation
2034504694  001%  bufferOperationWith100Buffer
749967898  000%  bufferedStreamByteOperation
110602155  000%  bufferedStreamBufferOperation
114542834  000%  largerBufferOperation
050068602  000%  fileChannelOperation

As you can see, the slowest method is reading and writing the file stream byte by byte, taking 183 seconds. The fastest method is using FileChannel.transferTo to forward the stream, taking 50 milliseconds. The difference in time taken between the two methods is a factor of 3600!

Key Points Review #

Today, I shared with you several important aspects of file reading and writing operations through three case studies.

First, if you need to read and write character streams, make sure that the character set of the file matches the character set of the character stream, otherwise it may result in garbled characters.

Second, when using some stream processing operations from the Files class, be sure to use try-with-resources to wrap the Stream, ensuring that the underlying file resources can be released and avoiding the problem of “too many open files”.

Third, when performing file byte stream operations, it is generally not necessary to perform byte-by-byte operations. Instead, use a buffer to perform batch reading and writing to reduce the number of IO operations and improve performance. You can consider using BufferedXXXStream for buffered input and output operations, or if you are aiming for ultimate performance, you can consider using FileChannel for stream forwarding.

Finally, I want to emphasize that file operations involve the implementation of the operating system and file system, and the JDK cannot guarantee logical consistency of all IO APIs on all platforms. When migrating code to a new operating system or file system, functional and performance testing should be conducted again.

The code used today is available on GitHub, and you can click on this link to view it.

Thinking and Discussion #

When using the Files.lines method for stream processing, we need to use try-with-resources to release resources. So, when using other methods in the Files class that return Stream wrapping objects for stream processing, such as the newDirectoryStream method returning DirectoryStream, and the list, walk, and find methods returning Stream, do they also have resource release problems?

Are the file copying, renaming, and deleting operations provided by the File class and Files class in Java atomic?

Have you encountered any pitfalls in file operations? I’m Zhu Ye. You are welcome to leave a comment in the comment section to share your thoughts. You are also welcome to share this article with your friends or colleagues to exchange ideas together.