Apache Iceberg

Original author(s)	Ryan Blue, Daniel Weeks
Initial release	10 August 2017
Stable release	1.10.0 / 11 September 2025
Written in	Java, Scala, Python
Operating system	Cross-platform
Type	Data warehouse, Data lake
License	Apache License 2.0
Website	iceberg.apache.org

Apache Iceberg is a high-performance open-source format for large analytic tables. Iceberg enables the use of SQL tables for big data while making it possible for engines like Spark, Trino, Flink, Presto, Hive, Impala, StarRocks, Doris, and Pig to safely work with the same tables at the same time.^[1] Iceberg is released under the Apache License.^[2] Iceberg addresses the performance and usability challenges of Apache Hive tables in large and demanding data lake environments.^[3] Vendors currently supporting Apache Iceberg tables include Buster,^[4] CelerData, Cloudera, Crunchy Data,^[5] Dremio, IBM watsonx.data, IOMETE, Oracle,^[6] Snowflake, Starburst, Tabular,^[7] AWS,^[8] Google Cloud,^[9] and Databricks.^[10]

History

Iceberg was started at Netflix by Ryan Blue and Dan Weeks. Apache Hive was used by many different services and engines in the Netflix infrastructure. Hive was never able to guarantee correctness and did not provide stable atomic transactions.^[3] Many at Netflix avoided using these services and making changes to the data to avert unintended consequences from the Hive format.^[3] Ryan Blue set out to address three issues that faced the Hive table by creating Iceberg:^[3]^[11]

Ensure the correctness of the data and support ACID transactions.
Improve performance by enabling finer-grained operations at file granularity for optimal writes.
Simplify and abstract general operation and maintenance of tables.

Iceberg development started in 2017.^[12] The project was open-sourced and donated to the Apache Software Foundation in November 2018.^[13] In May 2020, the Iceberg project graduated to become a top-level Apache project.^[13]

Iceberg is used by multiple companies including Airbnb,^[14] Apple,^[3] Expedia,^[15] LinkedIn,^[16] Adobe,^[17] Lyft, and many more.^[18]

Technical details

Apache Iceberg operates by abstracting table metadata from the underlying data storage. It maintains metadata files that track snapshots, schema information, partition layouts, and data file locations, enabling efficient and atomic table operations.^[19]

At a high level, Iceberg organizes table data into snapshots. Each snapshot represents the state of the table at a particular point in time, allowing Iceberg to provide ACID-compliant transactional capabilities, including snapshot isolation, concurrent writes, and rollback functionality. The snapshot metadata is managed as a tree structure of manifest files and metadata files stored within the file system.^[20]
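The snapshot model described above can be illustrated with a small in-memory sketch. This is a toy model, not Iceberg's API: `ToyTable`, `Snapshot`, `CommitConflict`, and every method name are hypothetical, and a commit here is a single in-process pointer swap rather than an atomic update of a metadata file. It only shows how immutable snapshots make optimistic commits, time travel, and rollback straightforward.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of the table: the full set of visible data files."""
    snapshot_id: int
    parent_id: Optional[int]
    data_files: Tuple[str, ...]


class CommitConflict(Exception):
    """Raised when another writer committed after this writer started."""


class ToyTable:
    """Hypothetical table that tracks snapshots and a current pointer."""

    def __init__(self) -> None:
        self.snapshots: Dict[int, Snapshot] = {}
        self.current_id: Optional[int] = None
        self._next_id = 1

    def current_files(self) -> Tuple[str, ...]:
        if self.current_id is None:
            return ()
        return self.snapshots[self.current_id].data_files

    def append(self, new_files: List[str], base_id: Optional[int]) -> int:
        # Optimistic concurrency: succeed only if the table still points
        # at the snapshot this writer based its work on.
        if base_id != self.current_id:
            raise CommitConflict(f"expected {base_id}, found {self.current_id}")
        snap = Snapshot(
            snapshot_id=self._next_id,
            parent_id=self.current_id,
            data_files=self.current_files() + tuple(new_files),
        )
        self.snapshots[snap.snapshot_id] = snap
        self.current_id = snap.snapshot_id  # the "commit" is this pointer swap
        self._next_id += 1
        return snap.snapshot_id

    def files_as_of(self, snapshot_id: int) -> Tuple[str, ...]:
        """Time travel: read the file list of an older snapshot."""
        return self.snapshots[snapshot_id].data_files

    def rollback_to(self, snapshot_id: int) -> None:
        """Rollback: repoint the table; no snapshot is rewritten or deleted."""
        if snapshot_id not in self.snapshots:
            raise KeyError(snapshot_id)
        self.current_id = snapshot_id


table = ToyTable()
first = table.append(["a.parquet"], base_id=None)
second = table.append(["b.parquet"], base_id=first)
print(table.current_files())     # ('a.parquet', 'b.parquet')
print(table.files_as_of(first))  # ('a.parquet',)
table.rollback_to(first)
print(table.current_files())     # ('a.parquet',)
```

In this sketch a writer whose `base_id` is stale simply fails with `CommitConflict`. A real implementation may instead retry the commit against the latest table state.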

Iceberg commonly uses the Apache Parquet file format for storing the actual data because of its efficient columnar storage structure, which is optimized for analytical queries; data files may also be written in Apache ORC or Apache Avro. Parquet files in Iceberg store table rows in a compressed, column-oriented format, reducing storage costs and improving read performance through techniques such as predicate pushdown and column pruning. Iceberg references these data files in manifest files, which lets query engines quickly identify and access the relevant data during query execution.^[21]
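The two read optimizations named above, column pruning and predicate pushdown, can be shown with a toy in-memory columnar layout. This sketch is plain Python and does not use the Parquet format or any Iceberg API. The helpers `to_columns` and `scan` and the sample rows are hypothetical, and the sketch assumes every row has the same keys.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def to_columns(rows: List[Row]) -> Dict[str, List[Any]]:
    """Pivot row dicts into one list per column (a column-oriented layout)."""
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def scan(
    columns: Dict[str, List[Any]],
    wanted: List[str],
    predicate_column: str,
    min_value: Any,
) -> Dict[str, List[Any]]:
    """Return the wanted columns for rows where predicate_column >= min_value.

    Predicate pushdown: the filter is evaluated inside the scan, on the
    predicate column alone. Column pruning: only the predicate column and
    the wanted columns are ever read; all other columns stay untouched.
    """
    keep = [i for i, v in enumerate(columns[predicate_column]) if v >= min_value]
    return {name: [columns[name][i] for i in keep] for name in wanted}


rows = [
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 25.5},
    {"id": 3, "country": "US", "amount": 40.0},
]
columns = to_columns(rows)
print(scan(columns, wanted=["id"], predicate_column="amount", min_value=20.0))
# {'id': [2, 3]}
```

The `country` column is never read in this query, which is the saving a columnar file format gives over a row-oriented one.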

Apache Iceberg employs a multi-level metadata hierarchy for tracking table contents.^[22] At the top, a table metadata file (often metadata.json) stores table-level information such as the schema, partition specifications, the list of snapshots, and pointers to the current "root" snapshot.^[23] Each snapshot represents a consistent view of the table and is associated with a manifest list (an Avro file) that enumerates all manifest files for that snapshot. A manifest file is an index that lists a set of data files (e.g., Parquet files) along with metadata about each file, including row count, partition values, and column statistics such as minimum and maximum values. Manifests are small metadata files (often in Avro format) that split the table's metadata into segments, so entire manifests can be pruned when a query filters by partition, instead of requiring a single, giant file listing all data files.

Iceberg's metadata tree also provides a historical record of table changes: old snapshots and manifests are retained until they expire, which enables time travel. Query planning reads only the relevant manifest files rather than scanning all data files or directories. This avoids expensive operations such as directory listing and keeps metadata access efficient even for huge tables.
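A minimal sketch of how such a hierarchy lets a planner skip work is given below. It is an illustration under simplifying assumptions, not the Iceberg specification or library. The classes `DataFile`, `Manifest`, and `ManifestList`, the function `plan_scan`, and all file names are hypothetical. Each manifest is summarized here by a plain set of partition values, and column statistics are integer minimum and maximum values only.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class DataFile:
    path: str
    partition: str          # e.g. a date string
    row_count: int
    lower: Dict[str, int]   # per-column minimum value
    upper: Dict[str, int]   # per-column maximum value


@dataclass
class Manifest:
    path: str
    partitions: Set[str]    # summary of partition values in this manifest
    files: List[DataFile]


@dataclass
class ManifestList:
    manifests: List[Manifest]


def plan_scan(
    manifest_list: ManifestList, partition: str, column: str, value: int
) -> Tuple[List[str], int]:
    """Find data files that may hold rows with `column == value` in `partition`.

    Returns the selected file paths and the number of manifests opened.
    """
    selected: List[str] = []
    manifests_read = 0
    for manifest in manifest_list.manifests:
        if partition not in manifest.partitions:
            continue  # whole manifest pruned from its summary alone
        manifests_read += 1
        for data_file in manifest.files:
            if data_file.partition != partition:
                continue  # partition pruning at file level
            if not data_file.lower[column] <= value <= data_file.upper[column]:
                continue  # min/max statistics rule this file out
            selected.append(data_file.path)
    return selected, manifests_read


m1 = Manifest("m1.avro", {"2024-01-01"}, [
    DataFile("a.parquet", "2024-01-01", 100, {"id": 1}, {"id": 50}),
    DataFile("b.parquet", "2024-01-01", 100, {"id": 51}, {"id": 100}),
])
m2 = Manifest("m2.avro", {"2024-01-02"}, [
    DataFile("c.parquet", "2024-01-02", 100, {"id": 1}, {"id": 100}),
])
print(plan_scan(ManifestList([m1, m2]), "2024-01-01", "id", 75))
# (['b.parquet'], 1)
```

Only one of the two manifests is opened and only one of the three data files is selected, without listing a directory or reading any data file.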

References

[1] "Apache Iceberg". https://iceberg.apache.org/.
[2] "apache/iceberg GitHub License". The Apache Software Foundation. 5 October 2022. https://github.com/apache/iceberg/blob/master/LICENSE.
[3] Woodie, Alex (8 February 2021). "Apache Iceberg: The Hub of an Emerging Data Service Ecosystem?". https://www.datanami.com/2021/02/08/apache-iceberg-the-hub-of-an-emerging-data-service-ecosystem/.
[4] "Buster". https://www.buster.so/.
[5] Woodie, Alex (24 July 2024). "Crunchy Data Goes All-in With Postgres". https://www.datanami.com/2024/07/24/crunchy-data-goes-all-in-with-postgres/.
[6] Oracle Corporation (5 September 2025). "Query Apache Iceberg Tables". https://docs.oracle.com/en-us/iaas/autonomous-database-serverless/doc/query-external-data-apache-iceberg.html.
[7] "Vendors". https://iceberg.apache.org/vendors/.
[8] "Using Apache Iceberg tables – Amazon Athena". Amazon Web Services, Inc. https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg.html.
[9] "Google Cloud BigQuery tables for Apache Iceberg". Google Cloud, Inc. https://cloud.google.com/bigquery/docs/iceberg-tables.
[10] "What is Apache Iceberg in Databricks? | Databricks on AWS". 20 August 2025. https://docs.databricks.com/aws/en/iceberg/.
[11] "Iceberg at Netflix and Beyond with Ryan Blue, EPISODE 1654 Transcript". 7 March 2024. https://softwareengineeringdaily.com/wp-content/uploads/2024/02/SED1654-SED1654_Apache_Iceberg.txt.
[12] "Initial public release in apache/iceberg". https://github.com/apache/iceberg/commit/a5eb3f6ba171ecfc517a4f09ae9654e7d8ae0291.
[13] "Incubation Status Template - Apache Incubator". https://incubator.apache.org/projects/iceberg.html.
[14] Zhu, Ronnie (26 September 2022). "Upgrading Data Warehouse Infrastructure at Airbnb". https://medium.com/airbnb-engineering/upgrading-data-warehouse-infrastructure-at-airbnb-a4e18f09b6d5.
[15] Mathiesen, Christine (26 January 2021). "A Short Introduction to Apache Iceberg". https://medium.com/expedia-group-tech/a-short-introduction-to-apache-iceberg-d34f628b6799.
[16] "FastIngest: Low-latency Gobblin with Apache Iceberg and ORC format". https://engineering.linkedin.com/blog/2021/fastingest-low-latency-gobblin.
[17] Bremner, Jaemi (3 December 2020). "Iceberg at Adobe". https://blog.developer.adobe.com/iceberg-at-adobe-88cf1950e866.
[18] Data Council (17 July 2020). "Open Source Highlight: Apache Iceberg". https://www.datacouncil.ai/blog/apache-iceberg.
[19] "Apache Iceberg Documentation". https://iceberg.apache.org/docs/latest/.
[20] "Apache Iceberg Specification". https://iceberg.apache.org/spec/.
[21] "Apache Iceberg vs Parquet: File vs. Table Formats for Modern Data Lakes". https://www.decube.io/post/what-is-apache-iceberg-versus-parquet.
[22] "Apache Iceberg Specification". https://iceberg.apache.org/spec/.
[23] "A Hands-On Look at the Structure of an Apache Iceberg Table". 24 August 2022. https://www.dremio.com/blog/a-hands-on-look-at-the-structure-of-an-apache-iceberg-table/.

Original source: https://en.wikipedia.org/wiki/Apache Iceberg.
