Hi, I'm Raymond

I build data pipelines with Kafka, Spark, and DataFusion
I contribute to Apache open-source projects
I deploy and orchestrate with Docker and Kubernetes

Yu-Chuan Hung

Data Engineer | Apache Open Source Contributor

Data engineer building reliable data pipelines at scale. Experienced with Spark and DataFusion for batch and query processing, Kafka for real-time streaming, and Docker/Kubernetes for orchestration. Actively contributing to the Apache ecosystem — DataFusion, Comet, Ballista, Iceberg, and Ozone.

Skills

Projects

Apache DataFusion

Contributor 2025 - Present

Implemented Spark-compatible functions (json_tuple, size), formatted doc strings across the entire codebase, and added microbenchmarks.

apache datafusion

Apache DataFusion-Comet

Contributor 2025 - Present

Refactored QueryPlanSerde by extracting comparison and datetime expressions into separate modules with reusable traits.

apache datafusion

Apache DataFusion-Ballista

Contributor 2025

Made gRPC timeout configurations user-configurable across the distributed query engine with 9 new config options.

apache datafusion

Apache Iceberg

Contributor 2025

Fixed ErrorProne warnings across multiple modules, enforced test naming conventions, and corrected documentation.

apache

Recent Posts

Open Source Contributions

Overview

Merged pull requests across Apache open-source projects: DataFusion, DataFusion-Comet, DataFusion-Ballista, Iceberg, and Ozone. Contributions span new features, code quality improvements, documentation fixes, and refactoring.

Apache DataFusion

PR | Title | Type | Date
#20412 | feat: support Spark-compatible json_tuple function | feat | 2026-02-20
#20076 | chore: Add microbenchmark (compared to ExprOrExpr) | chore | 2026-01-30
#19592 | feat: implement Spark size function for arrays and maps | feat | 2026-01-13
#18443 | chore: Format examples in doc strings - spark, sql, sqllogictest, sibstrait | docs | 2025-11-07
#18353 | chore: Format examples in doc strings - functions | docs | 2025-11-03
#18357 | chore: Format examples in doc strings - physical expr, optimizer, and plan | docs | 2025-11-01
#18335 | chore: Format examples in doc strings - catalog listing | docs | 2025-10-29
#18358 | chore: Format examples in doc strings - proto, pruning, and session | docs | 2025-10-29
#18354 | chore: Format examples in doc strings - macros and optmizer | docs | 2025-10-29
#18338 | chore: Format examples in doc strings - datasource crates | docs | 2025-10-29
#18340 | chore: Format examples in doc strings - expr | docs | 2025-10-29
#18333 | chore: Format examples in doc strings - crate datafusion | docs | 2025-10-29
#18336 | chore: Format examples in doc strings - common | docs | 2025-10-28
#18339 | chore: Format examples in doc strings - execution | docs | 2025-10-28

Apache DataFusion-Comet

Monday, February 23, 2026 | 2 minutes Read

Implementing Spark-Compatible json_tuple in Apache DataFusion

PR: apache/datafusion#20412

Background

DataFusion-Comet accelerates Spark queries by offloading execution to Apache DataFusion. For this to work, DataFusion needs to support the Spark built-in functions that Comet encounters. json_tuple is one of them: it is commonly used in ETL pipelines to extract fields from JSON columns without defining a full schema. Comet had an open issue requesting this. Without native support, queries using json_tuple would fall back to Spark’s own execution path, defeating the purpose of using Comet.
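To make the behaviour concrete, here is a rough Python approximation of what a json_tuple-style function does: given a JSON string and a list of top-level field names, it returns one value per field, with nulls for missing fields or unparseable input. This is only an illustrative sketch of the general semantics, not the DataFusion or Spark implementation; the helper name and the exact string formatting of non-string values are assumptions.

```python
import json
from typing import Optional, Tuple


def json_tuple(json_str: Optional[str], *fields: str) -> Tuple[Optional[str], ...]:
    """Illustrative approximation: one output per requested top-level field."""
    nulls = tuple(None for _ in fields)
    if json_str is None:
        return nulls
    try:
        obj = json.loads(json_str)
    except json.JSONDecodeError:
        # Unparseable input yields a null for every requested field.
        return nulls
    if not isinstance(obj, dict):
        return nulls

    out = []
    for name in fields:
        value = obj.get(name)
        if value is None:
            out.append(None)  # missing key or JSON null
        elif isinstance(value, str):
            out.append(value)  # strings are returned without quotes
        else:
            # Assumption: other values are rendered back as compact JSON text.
            out.append(json.dumps(value, separators=(",", ":")))
    return tuple(out)


row = json_tuple('{"user": "ann", "age": 31}', "user", "age", "email")
```

No schema is declared up front: the caller only names the fields it wants, which is why the function is convenient in ETL jobs.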

Friday, February 20, 2026 | 2 minutes Read

Making gRPC Timeouts Configurable in Apache DataFusion-Ballista

PR: apache/datafusion-ballista#1337

Background

Ballista is a distributed query engine built on DataFusion. It coordinates executors through a scheduler, with all inter-node communication going over gRPC. A previous PR (#115) had introduced gRPC timeout support, but all values were hard-coded. In production environments, different workloads require different timeout behavior: a long-running aggregation needs different settings than a quick metadata fetch. Without configuration options, the only recourse was to modify the source code directly.
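The general idea of replacing hard-coded timeouts with user-configurable ones can be sketched in a few lines of Python. This is not Ballista's code (Ballista is written in Rust), and every option name and default value below is hypothetical, chosen only to illustrate the pattern of defaults plus validated overrides.

```python
from dataclasses import dataclass, fields, replace
from typing import Mapping


@dataclass(frozen=True)
class GrpcTimeouts:
    # Hypothetical option names and defaults, for illustration only.
    connect_timeout_secs: int = 20
    request_timeout_secs: int = 60
    keepalive_interval_secs: int = 300


def with_overrides(base: GrpcTimeouts, overrides: Mapping[str, str]) -> GrpcTimeouts:
    """Return a copy of `base` with user-supplied string settings applied."""
    known = {f.name for f in fields(base)}
    parsed = {}
    for key, raw in overrides.items():
        if key not in known:
            raise KeyError(f"unknown timeout option: {key}")
        value = int(raw)  # settings typically arrive as strings
        if value <= 0:
            raise ValueError(f"{key} must be positive, got {value}")
        parsed[key] = value
    return replace(base, **parsed)


defaults = GrpcTimeouts()
# A long-running aggregation might raise only the request timeout.
tuned = with_overrides(defaults, {"request_timeout_secs": "600"})
```

Options that are not overridden keep their defaults, so existing deployments behave as before while individual workloads can be tuned without touching the source.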

Saturday, November 1, 2025 | 2 minutes Read