Dan Serban is a data engineer who occasionally teaches advanced functional programming as well as data engineering (using Spark as the big data framework).
This 2-hour, intensely hands-on workshop introduces Apache Spark, the open-source cluster computing framework whose in-memory processing can make some analytics workloads up to 100 times faster than disk-based engines such as Hadoop MapReduce. Spark runs in many environments and has a strong foundation in functional programming. It is known for making it easy to write exploratory code in the REPL and then scale that code up to production-grade quality relatively quickly (REPL-driven development).
The plan is to start with a few publicly available datasets and gradually work our way through them until we extract some useful insights, gaining a deep understanding of Spark’s rich collections API in the process.
Time permitting, we will also look at a very simple Spark Streaming example: a moving average over a stream of integers.
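To give a rough idea of the computation involved, here is a minimal plain-Python sketch of a moving average over a stream of integers. It does not use Spark Streaming and is not taken from the workshop materials; the function name `moving_average` and the window size of 3 are assumptions chosen only for illustration.

```python
from collections import deque
from typing import Iterable, Iterator


def moving_average(stream: Iterable[int], window: int) -> Iterator[float]:
    """Yield the mean of the most recent `window` values after each new value."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = deque(maxlen=window)  # the oldest value is dropped automatically
    total = 0
    for value in stream:
        if len(recent) == window:
            total -= recent[0]  # subtract the value that is about to be evicted
        recent.append(value)
        total += value
        yield total / len(recent)


# Hypothetical input standing in for a stream of integers.
averages = list(moving_average([1, 2, 3, 4, 5], window=3))
print(averages)  # [1.0, 1.5, 2.0, 3.0, 4.0]
```

Keeping a running total alongside a fixed-size window means each new value costs constant work, which is the same basic idea behind windowed aggregations in a streaming framework.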
During the workshop, participants are encouraged to exchange URLs and code snippets with one another via the issues section of this GitHub repository (https://github.com/dserban/SparkVoxxed).