Data engineer building reliable data pipelines at scale. Experienced with Spark and DataFusion for batch and query processing, Kafka for real-time streaming, and Docker and Kubernetes for containerization and orchestration. Actively contributing to the Apache ecosystem, including DataFusion, Comet, Ballista, Iceberg, and Ozone.
Implemented Spark-compatible functions (json_tuple, size), formatted doc strings across the entire codebase, and added microbenchmarks.
Refactored QueryPlanSerde by extracting comparison and datetime expressions into separate modules with reusable traits.
Made gRPC timeouts user-configurable across the distributed query engine by adding 9 new config options.
Fixed ErrorProne warnings across multiple modules, enforced test naming conventions, and corrected documentation.