
Reading a Large File Efficiently in Java

2016-12-20 13:13 | Posted by: zhangjf

Contents

1. Overview
2. Reading In Memory
3. Streaming Through the File
4. Streaming with Apache Commons IO
5. Conclusion

1. Overview

This tutorial will show how to read all the lines from a large file in Java in an efficient manner.

This article is part of the “Java – Back to Basic” tutorial here on Baeldung.

2. Reading In Memory

The standard way of reading the lines of the file is in-memory – both Guava and Apache Commons IO provide a quick way to do just that:

    // Guava
    Files.readLines(new File(path), Charsets.UTF_8);

    // Apache Commons IO (charset passed explicitly rather than relying on the platform default)
    FileUtils.readLines(new File(path), "UTF-8");

The problem with this approach is that all the lines of the file are kept in memory, which will quickly lead to an OutOfMemoryError if the file is large enough.

For example, reading a ~1 GB file:

    @Test
    public void givenUsingGuava_whenIteratingAFile_thenWorks() throws IOException {
        String path = ...
        Files.readLines(new File(path), Charsets.UTF_8);
    }

This starts off with a small amount of memory being consumed: (~12 MB consumed)

    [main] INFO org.baeldung.java.CoreJavaIoUnitTest - Total Memory: 128 Mb
    [main] INFO org.baeldung.java.CoreJavaIoUnitTest - Free Memory: 116 Mb

However, after the full file has been processed, we have at the end: (~2 GB consumed)

    [main] INFO org.baeldung.java.CoreJavaIoUnitTest - Total Memory: 2666 Mb
    [main] INFO org.baeldung.java.CoreJavaIoUnitTest - Free Memory: 490 Mb

This means that about 2.1 GB of memory is consumed by the process. The reason is simple: all the lines of the file are now stored in memory.
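
The article does not show how the Total Memory and Free Memory figures were logged. One plausible way, assuming they come from the JDK's Runtime API, is sketched below. The class name MemoryLog and its helper methods are hypothetical and not part of the original project. Used memory is total minus free, which for the run above gives 2666 - 490 = 2176 MB, or roughly 2.1 GB.

```java
class MemoryLog {

    private static final long MB = 1024L * 1024L;

    // Converts a byte count to whole megabytes (rounded down).
    static long toMb(long bytes) {
        return bytes / MB;
    }

    // Heap in use = heap currently reserved by the JVM minus its unused part.
    static long usedMb(Runtime runtime) {
        return toMb(runtime.totalMemory() - runtime.freeMemory());
    }

    // Hypothetical helper: prints figures in the same shape as the log lines above.
    static void logMemory() {
        Runtime runtime = Runtime.getRuntime();
        System.out.println("Total Memory: " + toMb(runtime.totalMemory()) + " Mb");
        System.out.println("Free Memory: " + toMb(runtime.freeMemory()) + " Mb");
        System.out.println("Used Memory: " + usedMb(runtime) + " Mb");
    }

    public static void main(String[] args) {
        logMemory();
    }
}
```

Calling logMemory() before and after reading the file gives a rough picture of heap growth. The numbers depend on when the garbage collector last ran, so they are best read as an indication rather than an exact measurement.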

It should be obvious by this point that keeping the contents of the file in memory will quickly exhaust the available memory, regardless of how much that actually is.

What's more, we usually don't need all of the lines in the file in memory at once. Instead, we just need to be able to iterate through each one, do some processing and throw it away. So this is exactly what we're going to do: iterate through the lines without holding them in memory.

3. Streaming Through the File

Let’s now look at a solution – we’re going to use a java.util.Scanner to run through the contents of the file and retrieve lines serially, one by one:

    FileInputStream inputStream = null;
    Scanner sc = null;
    try {
        inputStream = new FileInputStream(path);
        sc = new Scanner(inputStream, "UTF-8");
        while (sc.hasNextLine()) {
            String line = sc.nextLine();
            // System.out.println(line);
        }
        // note that Scanner suppresses exceptions
        if (sc.ioException() != null) {
            throw sc.ioException();
        }
    } finally {
        if (inputStream != null) {
            inputStream.close();
        }
        if (sc != null) {
            sc.close();
        }
    }

This solution will iterate through all the lines in the file, allowing each line to be processed without keeping a reference to it and, consequently, without keeping the lines in memory: (~150 MB consumed)

    [main] INFO org.baeldung.java.CoreJavaIoUnitTest - Total Memory: 763 Mb
    [main] INFO org.baeldung.java.CoreJavaIoUnitTest - Free Memory: 605 Mb
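
To make the "process each line and throw it away" idea concrete, here is a minimal self-contained sketch of the same Scanner loop, written with try-with-resources instead of the explicit finally block. The class name ScannerLineStats, the scan method and the sample data are assumptions made for illustration; they do not come from the original project. The per-line processing here simply counts lines and tracks the longest one, so only the current line is ever referenced.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Scanner;

class ScannerLineStats {

    // Streams the file line by line and returns {lineCount, longestLineLength}.
    // No line is retained after it has been processed.
    static long[] scan(Path file) throws IOException {
        long lines = 0;
        long longest = 0;
        try (InputStream in = Files.newInputStream(file);
             Scanner sc = new Scanner(in, "UTF-8")) {
            while (sc.hasNextLine()) {
                String line = sc.nextLine();
                lines++;
                longest = Math.max(longest, line.length());
            }
            // Scanner suppresses IOExceptions from the underlying stream; surface them.
            if (sc.ioException() != null) {
                throw sc.ioException();
            }
        }
        return new long[] {lines, longest};
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data, written to a temporary file for the demo.
        Path file = Files.createTempFile("scanner-demo", ".txt");
        try {
            Files.write(file, Arrays.asList("alpha", "be", "gamma ray"), StandardCharsets.UTF_8);
            long[] stats = scan(file);
            System.out.println("lines=" + stats[0] + ", longest=" + stats[1]);
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

With the three sample lines, the program prints lines=3, longest=9. For a real large file, the same scan method can be pointed at its path; memory use stays bounded by the longest single line rather than by the file size.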

4. Streaming with Apache Commons IO

The same can be achieved using the Commons IO library as well, by using the custom LineIterator provided by the library:

    LineIterator it = FileUtils.lineIterator(theFile, "UTF-8");
    try {
        while (it.hasNext()) {
            String line = it.nextLine();
            // do something with line
        }
    } finally {
        LineIterator.closeQuietly(it);
    }

Since the file is never held in memory in its entirety, this also results in fairly conservative memory consumption: (~190 MB consumed, going by the figures below)

    [main] INFO o.b.java.CoreJavaIoIntegrationTest - Total Memory: 752 Mb
    [main] INFO o.b.java.CoreJavaIoIntegrationTest - Free Memory: 564 Mb

5. Conclusion

This quick article shows how to process the lines of a large file iteratively, without exhausting the available memory, which proves quite useful when working with large files.

The implementation of all these examples and code snippets can be found in my GitHub project. It is an Eclipse-based project, so it should be easy to import and run as it is.
