Spark ML Pipelines

Level: All

Dan S is a data engineer who is passionate about large-scale distributed
systems and streaming pipelines, and cares about producing clean,
robust, well-tested Scala / Spark code.

In this talk, I will present a uniform set of high-level APIs, built on
top of Spark DataFrames, that helps developers create and tune
practical machine learning pipelines.

In machine learning, it is common to run a sequence of algorithms to
process data and learn from it. For example, a simple text document
processing workflow might include several stages:

1) Split each document’s text into words.

2) Convert each document’s words into a numerical feature vector.

3) Build a prediction model using the feature vectors and labels.

Spark ML represents such a workflow as a Pipeline: a sequence of
pipeline stages (Transformers and Estimators) run in a specific order.
We will use this simple workflow as a practical example throughout the
talk.
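As a sketch of what that looks like in code, the three stages above map onto Spark ML's `Tokenizer`, `HashingTF`, and `LogisticRegression`, chained into a `Pipeline`. The training data below is hypothetical, made up purely for illustration:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  // Builds the three-stage pipeline described in the abstract.
  def buildPipeline(): Pipeline = {
    // Stage 1: split each document's text into words.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    // Stage 2: convert each document's words into a numerical feature vector.
    val hashingTF = new HashingTF()
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    // Stage 3: build a prediction model from the feature vectors and labels.
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.001)
    // The Pipeline runs the stages in order; fit() produces a PipelineModel.
    new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("PipelineSketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical labeled documents: (id, text, label).
    val training = spark.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    // Fitting the pipeline runs all three stages on the training data.
    val model = buildPipeline().fit(training)
    spark.stop()
  }
}
```

Calling `fit` on the `Pipeline` executes the tokenizer and the feature hasher as Transformers and trains the logistic regression Estimator, yielding a `PipelineModel` that can transform new documents end to end.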
