Skip to main content

One post tagged with "python"

View All Tags

· 4 min read
Juri Petersen

In the vast landscape of data processing, efficiency and flexibility are important. However, navigating through a multitude of tools and languages often is a major inconvenience. Apache Wayang's upcoming Python API will allow you to seamlessly orchestrate data processing tasks without ever leaving the comfort of Python, irrespective of the underlying framework written in Java.

Expanding Apache Wayang's APIs

Apache Wayang's architecture decouples the process of planning from the resulting execution, allowing users to specify platform agnostic plans through the provided APIs.


wayang stack

Python's popularity and convenience for data processing workloads makes it an obvious candidate for a desired API. Previous APIs, such as the Scala API wayang-api-scala-java benefited from the interoperability of Java and Scala that allows to reuse objects from other languages to provide new interfaces. Accessing JVM objects in Python is possible through several libraries, but in doing so, future APIs in other programming languages would need similar libraries and implementations in order to exist. As a contrast to that, providing an API within Apache Wayang that receives input plans from any source and executes them within allows to create plans and submit them in any programming language. The following figure shows the architecture of pywayang:


pywayang stack

The Python API allows users to specify WayangPlans with UDFs in Python. pywayang then serializes the UDFs and constructs the WayangPlan in JSON format, preparing it to be sent to Apache Wayang's JSON API. When receiving a valid JSON plan, the JSON API uses the optimizer to construct an execution plan. However, since UDFs are defined in Python and thus need to be executed in Python as well, an operators function needs to be wrapped into a WrappedPythonFunction:

val mapOperator = new MapPartitionsOperator[Input, Output](
new MapPartitionsDescriptor[Input, Output](
new WrappedPythonFunction[Input, Output](
ByteString.copyFromUtf8(udf)
),
classOf[Input],
classOf[Output],
)
)

This wrapped functional descriptor allows to handle execution of UDFs in Python through a socket connection with the pywayang worker. Input data is sourced from the platform chosen by the optimizer and Apache Wayang handles routing the output data to the next operator.


A new API in any programming languages would have to specify two things:

  • A way to create plans that conform to a JSON format specified in the Wayang JSON API.
  • A worker that handles encoding and decoding of user defined functions (UDFs), as they need to be executed on iterables in their respective language. After that, the API can be added as a module in Wayang, so that operators will be wrapped and UDFs can be executed in the desired programming language.