Process your big data in a blink using Spark


Dan Serban is a data engineer who occasionally teaches advanced functional programming as well as data engineering (using Spark as the big data framework).

This intensely hands-on, 2-hour workshop introduces Apache Spark, the open-source cluster computing framework whose in-memory processing can make analytics applications up to 100 times faster than technologies in wide deployment today. Versatile across many environments and built on a strong foundation in functional programming, Spark is known for how easily exploratory code can be scaled up to production-grade quality relatively quickly (REPL-driven development).

The plan is to start with a few publicly available datasets and gradually work our way through them until we extract some useful insights, gaining a deep understanding of Spark’s rich collections API in the process.

Time permitting, we are going to look at a very simple Spark Streaming example (stream of integers / moving average).
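
To make the moving-average idea concrete, here is a minimal sketch in plain Python. It is not Spark Streaming code and does not use any Spark API; it only illustrates the computation over a stream of integers. The function name `moving_average`, the window size of 3, and the sample input are assumptions made for illustration.

```python
from collections import deque
from typing import Iterable, Iterator


def moving_average(stream: Iterable[int], window: int) -> Iterator[float]:
    """Yield the mean of the most recent `window` values after each new value."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = deque()  # the values currently inside the window
    total = 0         # running sum of the values in `recent`
    for value in stream:
        recent.append(value)
        total += value
        if len(recent) > window:
            # Drop the oldest value once the window is full.
            total -= recent.popleft()
        yield total / len(recent)


# Hypothetical sample stream; a real stream would arrive incrementally.
averages = list(moving_average([1, 2, 3, 4, 5], window=3))
print(averages)  # [1.0, 1.5, 2.0, 3.0, 4.0]
```

Keeping a running sum means each new integer is handled in constant time, which is the same reason windowed computations on a live stream are usually updated incrementally rather than recomputed from scratch.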

During the workshop, participants are encouraged to exchange URLs and code snippets with one another via the issues section of this GitHub repository ( ).
