Topstone Software Consulting

Data Lakes and the ORC File Format

Data, data everywhere

Most organizations generate valuable data. If collected and properly analyzed, this data can be a critical business asset.

The term "data lake" has become popular, but the popularity of the term has not resulted in a clear definition of what exactly a data lake is. Often a "data lake" refers to various streams of raw unanalyzed business data. For example, customer purchase transactions, content view information or financial transactions. This raw data is often written to a scalable cloud resource like Amazon Web Services S3. The idea behind a data lake is that once this data is available in a cloud resource it can be analyzed to provide business insight. The data can also serve as input to machine learning algorithms.

Often the data that is stored in the data lake consists of hundreds of thousands or millions of small JSON or CSV files. Processing these files individually can be very slow. The file names must be read and each file opened and processed. This can have a significant performance impact for data analysis systems like AWS Athena, where the entire data set is scanned for every query.

ORC to the rescue

The Optimized Row Columnar (ORC) file format was developed for the Apache Hive data warehouse. Hive, like AWS Athena, allows data, stored in ORC format, to be queried using SQL. ORC is a compressed file format, where it's columnar structure can provide better compression compared to formats like Parquet. The diagram below shows the relative file compression for (some) ORC files.

Source: ORCFile in HDP 2: Better Compression, Better Performance

The ORC file format was designed for the Hive data warehouse application. The highly compressed structure of ORC files means that execution of SQL queries read less data. The columnar structure of ORC files allows queries to be more efficient. A data lake processing stage that reads small JSON or CSV files and converts them into large ORC files, can dramatically improve data queries by systems like Hive, AWS Athena or Snowflake.

javaorc: a Java library for writing and reading ORC files

Leveraging the ORC file format in Java for data lake applications has required mastering ORC's complex vectorized column data structures. Topstone Software developed javaorc library to make writing and reading ORC files simple. The javaorc library is published on GitHub (see https://github.com/IanLKaplan/javaorc) under the Apache 2 open source license. The library is also available via Sonatype's Maven Central (see the GitHub README documentation).

Ian Kaplan, Topstone Software Consulting, June 2021