Spark ML Pipelines
Dan S is a data engineer who is passionate about large-scale distributed
systems and streaming pipelines, and cares about producing clean,
robust, well-tested Scala / Spark code.
In this talk, I will present a uniform set of high-level APIs, built on
top of Spark DataFrames, that help developers create and tune practical
machine learning pipelines.
In machine learning, it is common to run a sequence of algorithms to
process data and learn from it. For example, a simple text document
processing workflow might include several stages:
1) Split each document’s text into words.
2) Convert each document’s words into a numerical feature vector.
3) Build a prediction model using the feature vectors and labels.
Spark ML represents such a workflow as a pipeline, which chains a
sequence of pipeline stages (transformers and estimators) to be run in
a specific order. We will use this simple workflow as a running,
practical example in this talk.
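As a preview, the three stages above can be sketched with Spark ML's Tokenizer, HashingTF, and LogisticRegression, following the pattern in the official Spark documentation. The toy data, the object name `TextPipeline`, and the parameter values (`numFeatures`, `maxIter`, `regParam`) are illustrative assumptions, not taken from the talk itself:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object TextPipeline {
  // Builds and fits the three-stage pipeline, then returns the
  // predicted labels for two unseen toy documents.
  def run(): Seq[Double] = {
    val spark = SparkSession.builder()
      .appName("TextPipeline")
      .master("local[*]")   // local mode for the sketch; a real job would submit to a cluster
      .getOrCreate()
    import spark.implicits._

    // Toy labeled training data: (id, text, label)
    val training = Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    ).toDF("id", "text", "label")

    // Stage 1: split each document's text into words
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")

    // Stage 2: hash each document's words into a numerical feature vector
    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")

    // Stage 3: learn a prediction model from the feature vectors and labels
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.001)

    // The Pipeline chains the stages; fit() runs them in order on the data
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(training)

    // Apply the fitted pipeline to unlabeled documents
    val test = Seq((4L, "spark i j k"), (5L, "mapreduce")).toDF("id", "text")
    val preds = model.transform(test)
      .select("prediction").as[Double].collect().toSeq
    spark.stop()
    preds
  }
}
```

Note that `fit()` on the Pipeline handles the whole sequence: the tokenizer and hasher transform the data, and the estimator is trained on the result, so there is no hand-written glue between stages.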