PR: apache/datafusion#20412

Background

DataFusion-Comet accelerates Spark queries by offloading execution to Apache DataFusion. For this to work, DataFusion needs to support the Spark built-in functions that Comet encounters. json_tuple is one of them — it is commonly used in ETL pipelines to extract fields from JSON columns without defining a full schema.

Comet had an open issue requesting json_tuple support. Without a native implementation, queries that use json_tuple fall back to Spark’s own execution path, which defeats the purpose of using Comet.

Design

DataFusion’s ScalarUDF interface returns exactly one value per row, but json_tuple conceptually produces multiple columns — one per requested key. To work within this constraint, the function was implemented to return a Struct where each field maps to a requested key:

json_tuple('{"a":1, "b":2}', 'a', 'b') -> {c0: 1, c1: 2}

Struct fields follow Spark’s naming convention: c0, c1, c2, etc. Comet then destructures the struct into separate columns on its end. This keeps the UDF interface unchanged while preserving the multi-column semantics.
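
As a rough illustration of the naming rule, the field names can be derived from the number of key arguments alone. This is a minimal std-only sketch, not the code in this PR; struct_field_names is a hypothetical helper name.

```rust
/// Hypothetical helper (not from the PR): builds the struct field names
/// c0, c1, ... for a given number of requested keys.
fn struct_field_names(num_keys: usize) -> Vec<String> {
    // Field i is named "c{i}", matching the Spark convention described above.
    (0..num_keys).map(|i| format!("c{i}")).collect()
}
```

For a call such as json_tuple(json, 'a', 'b') there are two keys, so struct_field_names(2) yields the names c0 and c1.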

Implementation

The function is registered as a variadic ScalarUDF that requires at least two arguments: the JSON string and one key. It:

  1. Infers the return Struct type from the number of key arguments
  2. Parses each row’s JSON string via serde_json
  3. Looks up each requested key and places the value into the corresponding struct field
  4. Returns NULL for the entire struct if the input is NULL, or NULL for individual fields when a key is absent

SELECT json_tuple('{"f1":"value1","f2":"value2"}', 'f1', 'f2');
-- {c0: value1, c1: value2}

SELECT json_tuple('{"f1":"value1"}', 'f1', 'f2');
-- {c0: value1, c1: NULL}

SELECT json_tuple(NULL, 'f1');
-- NULL
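
The NULL handling in the steps and examples above can be modeled per row roughly as follows. This is a simplified std-only sketch under stated assumptions: JSON parsing is left out, the already-parsed top-level object is represented as a HashMap of string values, and extract_fields and TupleRow are hypothetical names rather than identifiers from the PR.

```rust
use std::collections::HashMap;

/// One output row. The outer `None` models a NULL struct;
/// an inner `None` models a NULL field.
type TupleRow = Option<Vec<Option<String>>>;

/// Hypothetical per-row helper (not from the PR). `parsed` stands in for the
/// top-level JSON object of the row and is `None` when the input row is NULL.
fn extract_fields(parsed: Option<&HashMap<String, String>>, keys: &[&str]) -> TupleRow {
    // A NULL input produces a NULL struct.
    let object = parsed?;
    // Field i holds the value for keys[i], or NULL when that key is absent.
    Some(keys.iter().map(|k| object.get(*k).cloned()).collect())
}
```

With an object containing only f1, the keys f1 and f2 give a struct whose first field is value1 and whose second field is NULL, mirroring the second SQL example.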

Testing

Unit tests cover return type inference and edge cases such as insufficient arguments. A dedicated json_tuple.slt SQL logic test file was added with cases derived from Spark’s own JsonExpressionsSuite to ensure behavioral parity.
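
The insufficient-arguments edge case amounts to an arity check before the return type is built. The sketch below only illustrates that check; key_count and its error message are assumptions made for this example, not the PR's actual code or wording.

```rust
/// Hypothetical arity check (not from the PR). `arg_count` counts the JSON
/// argument plus all key arguments.
fn key_count(arg_count: usize) -> Result<usize, String> {
    if arg_count < 2 {
        // Illustrative message only; the real error text may differ.
        return Err(format!(
            "json_tuple expects a JSON string and at least one key, got {arg_count} argument(s)"
        ));
    }
    // Every argument after the first is a key, and each key becomes one struct field.
    Ok(arg_count - 1)
}
```

A unit test in this style would assert that key_count(3) returns 2 fields and that key_count(1) is rejected.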