# Overview

Merged pull requests across Apache open-source projects: DataFusion, DataFusion-Comet, DataFusion-Ballista, Iceberg, and Ozone. Contributions span new features, code quality improvements, documentation fixes, and refactoring.

## Apache DataFusion

| PR | Title | Type | Date |
| --- | --- | --- | --- |
| #20412 | feat: support Spark-compatible `json_tuple` function | feat | 2026-02-20 |
| #20076 | chore: Add microbenchmark (compared to ExprOrExpr) | chore | 2026-01-30 |
| #19592 | feat: implement Spark `size` function for arrays and maps | feat | 2026-01-13 |
| #18443 | chore: Format examples in doc strings - spark, sql, sqllogictest, substrait | docs | 2025-11-07 |
| #18353 | chore: Format examples in doc strings - functions | docs | 2025-11-03 |
| #18357 | chore: Format examples in doc strings - physical expr, optimizer, and plan | docs | 2025-11-01 |
| #18335 | chore: Format examples in doc strings - catalog listing | docs | 2025-10-29 |
| #18358 | chore: Format examples in doc strings - proto, pruning, and session | docs | 2025-10-29 |
| #18354 | chore: Format examples in doc strings - macros and optimizer | docs | 2025-10-29 |
| #18338 | chore: Format examples in doc strings - datasource crates | docs | 2025-10-29 |
| #18340 | chore: Format examples in doc strings - expr | docs | 2025-10-29 |
| #18333 | chore: Format examples in doc strings - crate datafusion | docs | 2025-10-29 |
| #18336 | chore: Format examples in doc strings - common | docs | 2025-10-28 |
| #18339 | chore: Format examples in doc strings - execution | docs | 2025-10-28 |

## Apache DataFusion-Comet
### PR: apache/datafusion#20412

**Background**

DataFusion-Comet accelerates Spark queries by offloading execution to Apache DataFusion. For this to work, DataFusion needs to support the Spark built-in functions that Comet encounters. `json_tuple` is one of them: it is commonly used in ETL pipelines to extract fields from JSON columns without defining a full schema, and Comet had an open issue requesting it. Without native support, queries using `json_tuple` would fall back to Spark's own execution path, defeating the purpose of using Comet.
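To make the semantics concrete, here is a deliberately naive Java sketch of what `json_tuple` does: given a JSON string and a list of keys, it returns one value per key, with `null` for keys that are absent. The regex-based extraction below is purely illustrative (it only handles top-level string values) and is not the PR's actual implementation, which operates on Arrow arrays inside DataFusion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsonTupleDemo {
    // Naive illustration of json_tuple semantics: one result per requested
    // key, null when the key is missing. Real code uses a JSON parser.
    static List<String> jsonTuple(String json, String... keys) {
        List<String> out = new ArrayList<>();
        for (String key : keys) {
            Matcher m = Pattern
                .compile("\"" + Pattern.quote(key) + "\"\\s*:\\s*\"([^\"]*)\"")
                .matcher(json);
            out.add(m.find() ? m.group(1) : null);
        }
        return out;
    }

    public static void main(String[] args) {
        String json = "{\"name\":\"alice\",\"city\":\"berlin\"}";
        // Analogous to: SELECT json_tuple(col, 'name', 'age')
        System.out.println(jsonTuple(json, "name", "age")); // [alice, null]
    }
}
```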
### PR: apache/datafusion-ballista#1337

**Background**

Ballista is a distributed query engine built on DataFusion. It coordinates executors through a scheduler, with all inter-node communication going over gRPC. A previous PR (#115) had introduced gRPC timeout support, but all values were hard-coded. In production environments, different workloads require different timeout behavior: a long-running aggregation needs different settings than a quick metadata fetch. Without configuration options, the only recourse was to modify the source code directly.
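The general pattern behind such a change can be sketched as follows: read each timeout from configuration, falling back to the previously hard-coded value as the default. The key names and defaults here are invented for illustration and are not Ballista's actual option names.

```java
import java.util.Properties;

public class RpcTimeoutConfig {
    private final long connectTimeoutSeconds;
    private final long requestTimeoutSeconds;

    // Hypothetical config keys; defaults stand in for the old hard-coded values.
    RpcTimeoutConfig(Properties props) {
        this.connectTimeoutSeconds =
            Long.parseLong(props.getProperty("grpc.connect.timeout.seconds", "20"));
        this.requestTimeoutSeconds =
            Long.parseLong(props.getProperty("grpc.request.timeout.seconds", "60"));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // A long-running aggregation can raise just the request timeout...
        props.setProperty("grpc.request.timeout.seconds", "300");
        RpcTimeoutConfig cfg = new RpcTimeoutConfig(props);
        // ...while the connect timeout keeps its default.
        System.out.println(cfg.connectTimeoutSeconds + " " + cfg.requestTimeoutSeconds);
    }
}
```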
### PRs: #2028 and #2085 (part of tracking issue #2019)

**Background**

DataFusion-Comet translates Spark physical plans into DataFusion execution plans. The core of this translation lives in `QueryPlanSerde`, a Scala file responsible for serializing Spark expressions into Protocol Buffer messages. Over time, `QueryPlanSerde` had accumulated serialization logic for every expression type (comparisons, datetime operations, string functions, math) in a single file. This made navigation difficult, PR reviews cumbersome, and adding new expressions error-prone.
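The direction of such a refactoring can be sketched as a registry of per-family serializers looked up by expression kind, replacing one monolithic match over every expression type. The names below are illustrative only, not Comet's actual API, and the string return value stands in for a protobuf message.

```java
import java.util.HashMap;
import java.util.Map;

public class SerdeRegistryDemo {
    // Each expression family implements its own small serde,
    // instead of one giant serializer handling everything.
    interface ExprSerde {
        String serialize(String expr); // stand-in for protobuf serialization
    }

    public static void main(String[] args) {
        Map<String, ExprSerde> registry = new HashMap<>();
        registry.put("datetime", expr -> "DatetimeProto(" + expr + ")");
        registry.put("string", expr -> "StringProto(" + expr + ")");

        // Dispatch by family: adding a new family means adding a new entry,
        // not editing a monolithic file.
        System.out.println(registry.get("datetime").serialize("year(ts)"));
    }
}
```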
### PR: apache/iceberg#13217

**Background**

Apache Iceberg uses ErrorProne, Google's static analysis tool for Java, to catch bug patterns at compile time. Five warnings existed across the API, Flink, GCP, and Azure modules, each representing a different category of issue.

**Fixes**

1. `UnnecessaryParentheses` (ADLSFileIO)

   Removed unnecessary parentheses to align with the project's style conventions.

2. `ObjectsHashCodePrimitive` (DynamicRecordInternalSerializer)

   ```java
   // before: boxes the boolean into an Object
   Objects.hashCode(booleanValue)
   // after: operates directly on the primitive
   Boolean.hashCode(booleanValue)
   ```

   `Objects.hashCode()` takes an `Object`, so a primitive argument is autoboxed before the hash is computed. `Boolean.hashCode()` works on the primitive directly and avoids the unnecessary boxing.
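A quick check confirms the fix is behavior-preserving: both paths yield the same hash value, so the change only removes the boxing step at the call site.

```java
import java.util.Objects;

public class HashCodeDemo {
    public static void main(String[] args) {
        boolean value = true;
        int viaObjects = Objects.hashCode(value);   // autoboxes to Boolean first
        int viaPrimitive = Boolean.hashCode(value); // stays primitive
        System.out.println(viaObjects == viaPrimitive); // same hash either way
    }
}
```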