After more than a year publishing posts on Scalera, I think it’s time to scratch the surface of one of the most important tools in the Scala ecosystem. Of course, I’m talking about Spark.
Spark is a framework for parallelizing collections and the processing applied to them. That processing might be batch map-reduce tasks over different data sources (HDFS, Cassandra, ElasticSearch, …), stream processing of certain data flows (Kafka, Flume, …), or queries over different data sources in a unified way with a query language like SQL. But that’s too much rock&roll. Today we’ll cover the basic data type used in Spark: the RDD.
What’s an RDD?
The RDD type (Resilient Distributed Dataset) looks like many other Scala collections. However, it’s important to get to know some of its main features:
- It’s a distributed collection. This means that it’s partitioned among the different Spark nodes (known as workers).
- It’s immutable: when you apply a transformation to an RDD, you’re actually creating a new one.
- It’s lazily evaluated. With RDDs, we’re just defining the information flow; it isn’t evaluated when it’s defined, but when you apply an action to the RDD.
Besides, it’s good to know that you can perform two different types of operations on an RDD: transformations and actions.
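To make the lazy evaluation a bit more tangible, here is a minimal sketch. It assumes Spark 2.x or later (for `longAccumulator`) and a local master; the `LazinessSketch` object, the app name and the accumulator name are made up for illustration. It counts how many elements the `map` function has touched before and after an action is applied.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, not part of Spark.
object LazinessSketch {
  // Returns (elements processed before the action, elements processed after it).
  def run(): (Long, Long) = {
    // Assumption: a local master with two threads, just for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("laziness-sketch")
    val sc = new SparkContext(conf)
    try {
      val processed = sc.longAccumulator("processed")
      // Transformation: this only describes the flow, nothing runs yet.
      val doubled = sc.parallelize(List(1, 2, 3, 4)).map { n =>
        processed.add(1)
        n * 2
      }
      val beforeAction = processed.sum
      // Action: this is what triggers the evaluation of the map above.
      doubled.count()
      val afterAction = processed.sum
      (beforeAction, afterAction)
    } finally sc.stop()
  }
}
```

Calling `LazinessSketch.run()` should report that nothing was processed until `count` was invoked.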
Great, but how do I create them?
There are several ways to do so:
- Parallelizing an in-memory collection, like a list of values. To do so, we’ll use the parallelize method of the SparkContext.
val newRDD: RDD[Int] = sc.parallelize(List(1, 2, 3, 4))
- From some data source using, for example, the textFile function of the SparkContext.
val newRDD: RDD[String] = sc.textFile("myValues.txt") // textFile yields one String per line
- Applying a transformation to an existing RDD, which creates a new RDD from it.
val newRDD: RDD[String] = intValues.map(_.toString)
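The three approaches above can be put together in one self-contained sketch. The local master, the `RddCreationSketch` object and the temporary file that stands in for myValues.txt are assumptions made just to have something runnable.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of Spark.
object RddCreationSketch {
  def run(): (List[Int], List[String], List[String]) = {
    // Assumption: a local master, so no cluster is needed.
    val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-creation-sketch")
    val sc = new SparkContext(conf)
    try {
      // 1. Parallelizing an in-memory collection.
      val intValues: RDD[Int] = sc.parallelize(List(1, 2, 3, 4))

      // 2. Reading from a data source: a temporary file with one value per line.
      val path = Files.createTempFile("myValues", ".txt")
      Files.write(path, "1\n2\n3\n4\n".getBytes("UTF-8"))
      val lines: RDD[String] = sc.textFile(path.toString)

      // 3. Transforming an existing RDD into a new one.
      val asStrings: RDD[String] = intValues.map(_.toString)

      (intValues.collect().toList, lines.collect().toList, asStrings.collect().toList)
    } finally sc.stop()
  }
}
```

Note that `textFile` gives back lines of text, so turning them into numbers would take one more `map(_.toInt)`.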
What about that transformations stuff …?
Transformations define how the information flow will change by generating a new RDD. Some of these transformations are:
- map: applies a function to transform each element of the collection:
intValues.map(_.toString) // RDD[String]
- filter: selects the subset of elements that satisfy a given boolean expression.
- flatMap: besides applying the function as map does, it flattens the resulting collections:
textFile.map(_.split(" ")) // RDD[Array[String]], but ...
textFile.flatMap(_.split(" ")) // RDD[String]
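As a rough sketch of the three transformations side by side: the sample data, the even-number predicate used for `filter` and the `TransformationsSketch` object are all made up for illustration, and a local master is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, not part of Spark.
object TransformationsSketch {
  def run(): (List[String], List[Int], List[String]) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("transformations-sketch")
    val sc = new SparkContext(conf)
    try {
      val intValues = sc.parallelize(List(1, 2, 3, 4))
      val lines = sc.parallelize(List("hello spark", "hello rdd"))

      val strings = intValues.map(_.toString)    // RDD[String]
      val evens = intValues.filter(_ % 2 == 0)   // RDD[Int], keeps only even values
      val words = lines.flatMap(_.split(" "))    // RDD[String], one element per word

      // collect is an action; it is used here only to bring the results back.
      (strings.collect().toList, evens.collect().toList, words.collect().toList)
    } finally sc.stop()
  }
}
```

None of the three transformations modifies `intValues` or `lines`; each one yields a new RDD.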
You spoke about actions, didn’t you?
Actions allow us to evaluate the RDD and return some result. This way, the whole data flow that the RDD represents is evaluated. Some examples:
- count: it returns the total number of elements:
sc.parallelize(List(1, 2, 3, 4)).count //4
- collect: it returns the WHOLE collection inside an in-memory array:
sc.parallelize(List(1, 2, 3, 4)).collect // Array(1, 2, 3, 4)
So, beware! If the RDD size doesn’t fit into the driver’s assigned memory, the program will crash.
- saveAsTextFile: it writes the collection as text to the given path (a directory with one part file per partition).
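A small sketch of these three actions working together, assuming a local master and a temporary output directory (both chosen just for this example, as is the `ActionsSketch` object). The saved files are read back with `textFile` to check what was written.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, not part of Spark.
object ActionsSketch {
  def run(): (Long, List[Int], List[String]) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("actions-sketch")
    val sc = new SparkContext(conf)
    try {
      val values = sc.parallelize(List(1, 2, 3, 4))

      val total = values.count()               // number of elements, as a Long
      val inMemory = values.collect().toList   // the whole RDD in the driver's memory

      // saveAsTextFile expects a path that does not exist yet.
      val outDir = Files.createTempDirectory("rdd-actions-sketch").resolve("out").toString
      values.saveAsTextFile(outDir)

      // Read the part files back; sorted so the result does not depend on file order.
      val readBack = sc.textFile(outDir).collect().toList.sorted
      (total, inMemory, readBack)
    } finally sc.stop()
  }
}
```

Keep in mind that `collect` is only safe here because the RDD is tiny.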
So how can I apply all of this with Spark? We’ll find out in a week, with a practical case that combines Twitter and Spark Streaming to perform some basic analytics in an easy, simple, G-rated way. That way we’ll be able to put it to good use and feel both powerful and geeky at the same time.