Getting Started

Write your data pipeline once. Run it anywhere.

You write your pipeline against a single API, then decide how it runs. Point it at one engine and it runs there — or hand Wayang's cost-based optimizer the choice and let it pick the best platform for each step across your laptop, Apache Spark, Apache Flink, or a database, even splitting a single job across several. Either way, when your data outgrows one machine you don't rewrite anything — you just make another engine available.

This page gets you from zero to a running cross-platform pipeline in a few minutes.


How it works

Most data tools lock you into one engine. Pick Spark, and your code is Spark code forever. Outgrow it, or need a database in the mix, and you rewrite.

Wayang sits one level up. You write a pipeline against Wayang's API and register the engines you have — then it's your call. Want control? Register one engine and it runs there. Want it handled? Register several and let Wayang's cost-based optimizer pick the best one for each step:

A single pipeline, written once, feeds the Wayang optimizer, which routes each step to the best available engine — Local, Spark, Flink, Postgres, and others.

Same code on your laptop and on a 100-node cluster.


Quickstart

We'll run a word count locally first — no cluster, nothing to install on a server — then make Spark available with a one-line change. The pipeline itself never changes; only the set of engines you register does.

1. Run locally

import org.apache.wayang.core.api.Configuration;
import org.apache.wayang.core.api.WayangContext;
import org.apache.wayang.api.JavaPlanBuilder;
import org.apache.wayang.basic.data.Tuple2;
import org.apache.wayang.java.Java;
import java.util.Arrays;

public class WordCount {
    public static void main(String[] args) {
        // Register ONLY the local Java engine → runs on your machine, no cluster needed.
        WayangContext wayang = new WayangContext(new Configuration())
                .withPlugin(Java.basicPlugin());

        new JavaPlanBuilder(wayang)
                .withJobName("WordCount")
                .withUdfJarOf(WordCount.class)
                // Read the input file line by line.
                .readTextFile("file:///path/to/input.txt")
                // Split each line into words and drop empty tokens.
                .flatMap(line -> Arrays.asList(line.split("\\W+")))
                .filter(word -> !word.isEmpty())
                // Pair each lowercased word with a count of 1.
                .map(word -> new Tuple2<>(word.toLowerCase(), 1))
                // Sum the counts of identical words.
                .reduceByKey(Tuple2::getField0,
                        (t1, t2) -> new Tuple2<>(t1.getField0(), t1.getField1() + t2.getField1()))
                // Write one "word: count" line per word.
                .writeTextFile("file:///path/to/output.txt", t -> t.getField0() + ": " + t.getField1());
    }
}

The pipeline executes on your machine with the local Java engine. That is a good fit for development, tests, and small data.
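To check the output file of a run, it helps to know exactly what the pipeline computes. The sketch below reproduces the same steps (split on non-word characters, drop empty tokens, lowercase, count per word) with plain `java.util.stream` and no Wayang dependency. The class name `WordCountReference` and the `count` helper are illustrative only and are not part of Wayang.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class WordCountReference {

    // Mirrors the pipeline's steps: split, drop empty tokens, lowercase, sum per word.
    static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\W+")))
                .filter(word -> !word.isEmpty())
                .map(String::toLowerCase)
                // TreeMap keeps the words sorted so the printed result is stable.
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Hello, world!", "hello Wayang");
        // Same "word: count" format as the formatter passed to writeTextFile.
        count(lines).forEach((word, n) -> System.out.println(word + ": " + n));
        // Prints:
        // hello: 2
        // wayang: 1
        // world: 1
    }
}
```

The order of lines in the file Wayang writes may differ from this sorted output, so compare the set of lines, not their order.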

2. Run it on Spark

Now run the exact same pipeline on Spark instead of locally. You don't touch the pipeline — you change which platform you register: comment out Java and register Spark.

import org.apache.wayang.spark.Spark; // ← swap the import

// Same pipeline as before — only the registered platform changed.
WayangContext wayang = new WayangContext(new Configuration())
        // .withPlugin(Java.basicPlugin())  // ← comment out the local engine
        .withPlugin(Spark.basicPlugin());   // ← register Spark instead

Run it again. The same pipeline now executes on Spark — you changed where it runs without changing what it does. Switch to Flink or any other supported platform the same way: swap the import and the registered plugin.

3. Register both and let the optimizer choose

This is the point of Wayang. In practice you don't have to pick a platform at all: you register every engine you have and let the optimizer choose the best one for each step — even splitting a single job across engines. The pipeline is still the same; you just stop deciding where it runs.

// Register BOTH platforms. Wayang's optimizer decides which to use per step.
// This needs both imports: org.apache.wayang.java.Java and org.apache.wayang.spark.Spark.
WayangContext wayang = new WayangContext(new Configuration())
        .withPlugin(Java.basicPlugin())
        .withPlugin(Spark.basicPlugin());

Now Wayang owns the placement decision. For each operator it estimates the cost on every registered platform and picks the cheapest — keeping a small job entirely local, pushing a large one onto Spark, or mixing both within the same job as the data demands. You wrote the pipeline once in step 1; steps 2 and 3 only changed which engines were on the table.

On a tiny input you'll see it keep everything local (that's the optimizer working correctly, not ignoring Spark). The cross-platform decisions show up once the data is big enough for them to pay off. Read How Wayang chooses a platform for what drives those choices.
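The placement rule described above (estimate the cost on every registered platform, keep the cheapest) can be sketched in a few lines. This is a toy model for intuition only, not Wayang's optimizer: the `Platform` class, the startup and per-record costs, and the `cheapest` helper are all made-up assumptions, and the sketch ignores everything else a real optimizer would weigh, such as the cost of moving data between platforms.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class PlacementSketch {

    // Hypothetical cost model: a fixed startup cost plus a cost per input record.
    static final class Platform {
        final String name;
        final double startupCost;
        final double costPerRecord;

        Platform(String name, double startupCost, double costPerRecord) {
            this.name = name;
            this.startupCost = startupCost;
            this.costPerRecord = costPerRecord;
        }

        double estimate(long records) {
            return startupCost + costPerRecord * records;
        }
    }

    // Pick the registered platform with the lowest estimated cost for this input size.
    static Platform cheapest(List<Platform> registered, long records) {
        return registered.stream()
                .min(Comparator.comparingDouble(p -> p.estimate(records)))
                .orElseThrow(() -> new IllegalArgumentException("no platform registered"));
    }

    public static void main(String[] args) {
        // Made-up numbers: local has no startup cost, the cluster is cheaper per record.
        List<Platform> registered = Arrays.asList(
                new Platform("local", 0.0, 1.0),
                new Platform("cluster", 5_000.0, 0.01));

        for (long records : new long[] {100L, 1_000_000L}) {
            System.out.println(records + " records -> " + cheapest(registered, records).name);
        }
        // Prints:
        // 100 records -> local
        // 1000000 records -> cluster
    }
}
```

With these invented numbers the small input stays local because the cluster's startup cost dominates, and the large input moves to the cluster once the per-record savings outweigh it.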


Install

Replace WAYANG_VERSION below with the latest Maven release.

Maven:

<dependency>
    <groupId>org.apache.wayang</groupId>
    <artifactId>wayang-core</artifactId>
    <version>WAYANG_VERSION</version>
</dependency>
<dependency>
    <groupId>org.apache.wayang</groupId>
    <artifactId>wayang-basic</artifactId>
    <version>WAYANG_VERSION</version>
</dependency>
<dependency>
    <groupId>org.apache.wayang</groupId>
    <artifactId>wayang-api-scala-java</artifactId>
    <version>WAYANG_VERSION</version>
</dependency>
<!-- add one artifact per engine you want available -->
<dependency>
    <groupId>org.apache.wayang</groupId>
    <artifactId>wayang-java</artifactId>
    <version>WAYANG_VERSION</version>
</dependency>
<dependency>
    <groupId>org.apache.wayang</groupId>
    <artifactId>wayang-spark</artifactId>
    <version>WAYANG_VERSION</version>
</dependency>

Or build from source, which works even when no published artifact is available. Use the latest snapshot version as WAYANG_VERSION:

git clone https://github.com/apache/wayang.git
cd wayang
./mvnw clean install -DskipTests

To run on Spark, you'll need Apache Spark 3+ on your PATH.


Stuck on the first run? That's a documentation bug, and we want to know. Ask on the user mailing list.