
Using Iceberg Metadata Tables to Determine When Compaction Is Needed
Scheduling compaction at fixed intervals is better than not optimizing at all—but it can still lead to unnecessary compute spend or delayed maintenance. A smarter approach is to dynamically trigger compaction based on real-time metadata signals.
Apache Iceberg makes this possible with its powerful system of metadata tables, which expose granular details about files, snapshots, and manifests.
In this post, we’ll explore how to query these tables to:
- Detect small files
- Identify bloated partitions
- Spot manifest inefficiencies
- Automate event-driven compaction workflows
What Are Iceberg Metadata Tables?
Every Iceberg table automatically maintains a set of virtual tables that expose its internals. The most relevant for optimization include:
- files – List of all data files in the table, including size, partition, and metrics
- manifests – List of manifest files and the data files they reference
- snapshots – History of table changes and snapshot metadata
- history – Timeline of snapshot commits and their lineage
These tables can be queried like any other SQL table, making it easy to introspect your table’s health.
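For example, here's a quick look at a table's recent commit history in Spark SQL (my_table stands in for your fully qualified table name):
SELECT made_current_at, snapshot_id, is_current_ancestor
FROM my_table.history
ORDER BY made_current_at DESC
LIMIT 5;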
1. Detecting Small Files with the files Table
To identify partitions suffering from small file syndrome:
SELECT
  partition,
  COUNT(*) AS file_count,
  AVG(file_size_in_bytes) AS avg_size_bytes
FROM my_table.files
GROUP BY partition
HAVING COUNT(*) > 10 AND AVG(file_size_in_bytes) < 134217728; -- 128 MB
You can use this to:
- Trigger compaction on specific partitions (see the sketch below)
- Monitor trends in file size distribution over time
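For the first of those, here is a minimal sketch using Iceberg's Spark rewrite_data_files procedure, scoped to a single flagged partition (it assumes a Spark SQL session with an Iceberg catalog named catalog and a partition column event_date, both placeholder names):
-- Compact only the partition the files query flagged
-- "catalog", "db.my_table", and "event_date" are illustrative names
CALL catalog.system.rewrite_data_files(
  table => 'db.my_table',
  where => 'event_date = "2024-06-01"'
);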
2. Finding Fragmented or Stale Manifests
Metadata bloat often comes from having too many manifest files, or from manifests that each track only a handful of data files. Use the manifests table to explore:
SELECT
  COUNT(*) AS manifest_count,
  AVG(added_data_files_count) AS avg_files_per_manifest
FROM my_table.manifests;
Low averages can indicate fragmented manifests that are good candidates for rewriting.
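If that's what you see, Iceberg's Spark rewrite_manifests procedure consolidates them; a minimal call, again assuming an Iceberg catalog named catalog:
-- Consolidate small or fragmented manifests into fewer, larger ones
CALL catalog.system.rewrite_manifests('db.my_table');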
3. Tracking Snapshot Volume and Velocity
To see if snapshots are accumulating too fast (and increasing metadata overhead):
SELECT
  COUNT(*) AS snapshot_count,
  MIN(committed_at) AS first_snapshot,
  MAX(committed_at) AS latest_snapshot
FROM my_table.snapshots;
You can also inspect how many files each snapshot adds or removes to identify noisy patterns from ingestion jobs.
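Each row of the snapshots table carries a summary map of per-commit counters; here's a sketch of pulling file churn out of it (added-data-files and deleted-data-files are standard Iceberg summary keys, though not every operation populates both):
SELECT
  snapshot_id,
  operation,
  committed_at,
  CAST(summary['added-data-files'] AS INT) AS files_added,
  CAST(summary['deleted-data-files'] AS INT) AS files_deleted
FROM my_table.snapshots
ORDER BY committed_at DESC;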
4. Building a Health Score
By combining file count, file size, manifest count, and snapshot frequency, you can compute a “table health score”:
-- Example: High file count + small average size = poor health
WITH file_stats AS (
  SELECT COUNT(*) AS total_files, AVG(file_size_in_bytes) AS avg_file_size
  FROM my_table.files
),
manifest_stats AS (
  SELECT COUNT(*) AS total_manifests
  FROM my_table.manifests
)
SELECT
  total_files,
  avg_file_size,
  total_manifests,
  CASE
    WHEN avg_file_size < 67108864 AND total_files > 1000 THEN 'Needs compaction' -- 64 MB
    ELSE 'Healthy'
  END AS status
FROM file_stats, manifest_stats;
5. Triggering Compaction Automatically
Once you identify problematic patterns, you can wire up your orchestration layer to act:
- Use Airflow, Dagster, or dbt Cloud to run SQL-based checks
- When thresholds are breached, trigger Spark/Flink compaction jobs
- Track results and update monitoring dashboards
This ensures you optimize only when needed, keeping costs and latency low.
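As a sketch of that check-then-act loop, here are the two SQL statements an orchestrator task might run in sequence (the thresholds, names, and target file size are illustrative, not prescriptive):
-- Step 1: scheduled health check
SELECT COUNT(*) AS small_files
FROM my_table.files
WHERE file_size_in_bytes < 134217728; -- files under 128 MB

-- Step 2: run only if small_files breaches your threshold
CALL catalog.system.rewrite_data_files(
  table => 'db.my_table',
  options => map('target-file-size-bytes', '536870912') -- aim for ~512 MB files
);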
Benefits of Metadata-Driven Optimization
- Precision: Only touch affected partitions
- Efficiency: Avoid unnecessary compute jobs
- Responsiveness: React to real-time ingestion patterns
- Governance: Create audit trails for all compaction decisions
Summary
Apache Iceberg gives you visibility and control over your tables through metadata tables. By tapping into this metadata:
- You avoid blind scheduling of compaction
- You build smarter, more efficient optimization workflows
- You reduce both query latency and operational cost
In the next post, we’ll dive into partition evolution and layout pitfalls, and how to avoid undermining your compaction and clustering strategies when schemas or partitions change.