Programme overview:
- Introduction to Spark
  - What is Spark?
  - Spark vs Hadoop
  - Spark with HDFS: quick overview
  - Spark on YARN: quick overview
- Basic building blocks in Spark
  - Introduction to Resilient Distributed Datasets (RDDs)
  - Spark shell
  - Overview of RDD operations
  - Key-value pair RDDs
  - Aggregating data with pair RDDs
  - Hands-on exercises:
    - Word count
- Writing and deploying Spark applications
  - Spark context
  - Building Spark applications
  - Submitting a Spark application to a cluster
  - Spark Web UI
  - Spark Config: important options
  - Logging, YARN log aggregation
  - Hands-on exercises:
    - Joining RDDs
- Spark on a cluster
  - RDD partitions: on HDFS, on the local filesystem, after a shuffle
  - Data locality
  - Execution model overview: stages, tasks, executors
  - RDD persistence
  - Fault tolerance
  - Hands-on exercises:
    - Spark SQL aggregations
- Spark use cases
  - Data analysis
  - Machine learning
  - Iterative algorithms
  - Hands-on exercises:
    - PageRank
- Spark performance tips
  - Controlling parallelism
  - Dealing with skewed data
  - Broadcast variables
  - Hands-on exercises:
    - Performance tuning challenge!
Prerequisites
- Ability to understand simple programs written in Scala or Java.
- Familiarity with the Linux command line.
Target Audience
Software developers with no previous Spark experience.
Details
Venue: Agile Actors HQ
Date: Saturday, May 20th
Time: 09:30 – 17:30
Price: €240 + 24% VAT