Spark

For the latest updates, check out my Apache Spark exploration @ https://github.com/Mageswaran1989/aja/

Apache Spark (https://spark.apache.org/)

Backbone:

  • RDD
  • DataFrames

Jargon:

Application:- User program built on Spark. Consists of a driver program and executors on the cluster.

Application jar:- A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries, however; these will be added at runtime.

Driver Program:- The process running the main() function of the application and creating the SparkContext; in other words, the program/process responsible for running the job over the Spark engine.

Cluster manager:-  An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN)

Deploy mode:-  Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.

Worker Node:- Any node that can run application code in the cluster

Executor:- A process launched for an application on a worker node, which runs tasks and keeps data in memory or on disk across them; in other words, the process responsible for executing a task. Each application has its own executors.

Tasks:- Each stage has some tasks, one task per partition. One task is executed on one partition of data on one executor (machine).

Job:- A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); in other words, a piece of code which reads some input from HDFS or a local filesystem, performs some computation on the data and writes some output. You'll see this term used in the driver's logs.

Stages:- Jobs are divided into stages. Stages can be thought of as map or reduce stages (easier to understand if you have worked on Hadoop and want to correlate). Stages are divided at computational boundaries: all computations (operators) cannot be executed in a single stage, so the work happens over many stages.

DAG:- DAG stands for Directed Acyclic Graph; in the present context it is a DAG of operators.

Master:- The machine on which the Driver program runs

Slave:- The machine on which the Executor program runs

  • RDDs - a low-level API for expressing DAGs that will be executed in parallel by Spark workers
  • Catalyst - an internal library for expressing trees, used to build relational algebra and expression evaluation. There is also an optimizer and query planner that turns these logical concepts into RDD actions.
  • Tungsten - an internal optimized execution engine that can compile Catalyst expressions into efficient Java bytecode operating directly on serialized binary data. It also has efficient low-level data structures/algorithms, such as hash tables and sorting, that operate directly on serialized data. These are used by the physical nodes produced by the query planner (and run inside RDD operations on workers).
  • DataFrames - a user-facing API, similar to SQL/LINQ, for constructing dataflows that are backed by Catalyst logical plans
  • Datasets - a user-facing API, similar to the RDD API, for constructing dataflows that are backed by Catalyst logical plans
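
To make the split between these APIs concrete, here is a minimal sketch (assuming Spark 2.x with a local SparkSession; the Person case class and the sample data are made up) that expresses the same filter through the RDD, DataFrame and Dataset APIs:

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("api-demo").master("local[*]").getOrCreate()
import spark.implicits._

val people = Seq(Person("ann", 34), Person("bob", 19))

// RDD API: low level, no Catalyst/Tungsten optimization
val rddResult = spark.sparkContext.parallelize(people).filter(_.age > 21)

// DataFrame API: untyped rows backed by a Catalyst logical plan
val dfResult = people.toDF().filter($"age" > 21)

// Dataset API: typed like an RDD, planned and optimized like a DataFrame
val dsResult = people.toDS().filter(_.age > 21)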

RDD


Resilient Distributed Dataset

Spark is a distributed computing engine and its main abstraction is a resilient distributed dataset (RDD), which can be viewed as a distributed collection. Basically, RDD's elements are partitioned across the nodes of the cluster, but Spark abstracts this away from the user, letting the user interact with the RDD (collection) as if it were a local one.


An RDD is a big collection of data with the following properties:
  • Immutable
  • Distributed
  • Lazily Evaluated
  • Type Inferred
  • Cacheable
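
A small sketch of these properties in action (the input path below is just a placeholder):

val lines = sc.textFile("/tmp/input.txt")   // distributed: elements are split into partitions across the cluster
val upper = lines.map(_.toUpperCase)        // immutable: map returns a new RDD, `lines` itself never changes
                                            // lazily evaluated: no file has been read yet
val cached = upper.cache()                  // cacheable: keep computed partitions in memory after the first use
println(cached.count())                     // type inferred: these are RDD[String]; count() finally runs the work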

Not to get into too many details, but when you run different transformations on an RDD (map, flatMap, filter and others),
your transformation code (closure) is:

1. Serialized on the driver node,
2. Shipped to the appropriate nodes in the cluster,
3. Deserialized, and
4. Finally executed on the nodes

You can of course run this locally, but all those phases (apart from shipping over the network) still occur. [This lets you catch any bugs even before deploying to production.]
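
For example, in the hedged sketch below the value `threshold` lives in the driver JVM, and the filter closure that captures it goes through exactly those four steps before any node can run it:

val threshold = 10                       // a plain variable in the driver program
val numbers = sc.parallelize(1 to 100)   // an RDD spread over the cluster
val big = numbers.filter(_ > threshold)  // the closure captures `threshold` from the driver
big.collect()                            // action: the closure is serialized, shipped, deserialized and executed on the nodes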


From a high-level system engineering point of view:


1. Spark uses fast RPCs for task dispatching and scheduling.


2. Spark uses a threadpool for execution of tasks rather than a pool of JVM processes.


The above two combined enables Spark to schedule tasks as fast as milliseconds, whereas MR scheduling takes seconds and sometimes minutes in busy clusters.


3. Spark supports both checkpointing-based recovery (the way fault-tolerance is implemented in Hadoop MR) and lineage-based recovery. Unless faults are very common, lineage-based recovery is faster (because replicating state across the network is slow for high-throughput data flow applications); a short checkpoint/lineage sketch follows below.

4. Partially due to its academic roots, the Spark community often embraces novel ideas. One example of this is the use of a torrent-like protocol in Spark for doing one-to-all broadcast of data.
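
A rough sketch of point 3 (the checkpoint directory below is just a placeholder): Spark rebuilds lost partitions from the lineage by default, and only writes state out to stable storage when you explicitly ask for a checkpoint.

sc.setCheckpointDir("hdfs:///tmp/checkpoints")     // reliable storage for checkpointed RDDs
val base = sc.parallelize(1 to 1000000)
val derived = base.map(_ * 2).filter(_ % 3 == 0)   // lineage-based recovery: lost partitions can simply be recomputed
derived.checkpoint()                               // checkpoint-based recovery: materialize this RDD to stable storage
derived.count()                                    // the checkpoint is actually written when the first action runs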


Transformation


val textFile = sc.textFile(path)                          // transformation: just records the lineage, nothing is read yet
val strings = textFile.flatMap(line => line.split(","))   // transformation: builds another lazy RDD on top of the first




Compute
  • compute is the function that evaluates each partition of an RDD
  • compute is an abstract method of RDD
  • Each subclass of RDD, like MappedRDD or FilteredRDD, has to override this method
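
As a hedged sketch of what that contract looks like (the ConstantRDD and SlicePartition classes below are illustrative, not part of Spark), a custom RDD has to say what its partitions are and how to compute one of them:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A simple Partition descriptor carrying just its index
class SlicePartition(override val index: Int) extends Partition

// A toy RDD that produces the same constant value in every partition
class ConstantRDD(sc: SparkContext, value: Int, numSlices: Int)
    extends RDD[Int](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    (0 until numSlices).map(i => new SlicePartition(i)).toArray

  // compute() evaluates one partition and returns its elements as an iterator
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator.fill(5)(value)
}

val constants = new ConstantRDD(sc, 42, numSlices = 3)
println(constants.collect().mkString(","))   // 15 copies of 42, computed partition by partition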


Actions
val textFile = sc.textFile(path)                          // transformation (lazy)
val strings = textFile.flatMap(line => line.split(","))   // transformation (lazy)
strings.collect()                                         // action: triggers the job and returns the results to the driver




runJob API

  • The runJob API (on SparkContext) is the API used to implement actions
  • runJob lets you take each partition and evaluate it with a function, giving one result per partition
  • All Spark actions internally use the runJob API.
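
A hedged sketch of that idea: collect() is essentially runJob with a function that drains each partition's iterator.

val rdd = sc.parallelize(1 to 10, numSlices = 2)
val perPartition: Array[Array[Int]] =
  sc.runJob(rdd, (iter: Iterator[Int]) => iter.toArray)   // one result per partition
val collected: Array[Int] = perPartition.flatten          // this is what collect() gives you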


Spark Overview


Spark has a small code base and the system is divided into various layers. Each layer has some responsibilities. The layers are independent of each other.




  1. The first layer is the interpreter; Spark uses a Scala interpreter with some modifications.
  2. As you enter your code in the Spark console (creating RDDs and applying operators), Spark builds an operator graph.
  3. When the user runs an action (like collect), the graph is submitted to a DAG scheduler. The DAG scheduler divides the operator graph into (map and reduce) stages.
  4. A stage is comprised of tasks based on partitions of the input data. The DAG scheduler pipelines operators together to optimize the graph. For example, many map operators can be scheduled in a single stage. This optimization is key to Spark's performance. The final result of a DAG scheduler is a set of stages (a short word-count sketch follows below).
  5. The stages are passed on to the Task Scheduler. The task scheduler launches tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler doesn't know about dependencies among stages.
  6. The Worker executes the tasks on the slave machine. A new executor JVM is started per application. The worker knows only about the code that is passed to it.
Spark caches the data to be processed, allowing it to be up to 100 times faster than Hadoop MapReduce for some workloads. Spark uses Akka for multithreading, managing executor state, and scheduling tasks.
It uses Jetty to share files (jars and other files), for HTTP broadcast, and to run the Spark Web UI. Spark is highly configurable and is capable of utilizing components that already exist in the Hadoop ecosystem. This has allowed Spark to grow exponentially, and in a short time many organisations are already using it in production.
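
To make steps 3 to 5 concrete, here is a hedged word-count sketch (the input path is a placeholder): the flatMap and map are pipelined into one stage, while reduceByKey forces a shuffle and therefore a new stage.

val counts = sc.textFile("/tmp/input.txt")   // hypothetical input
  .flatMap(_.split(" "))
  .map(word => (word, 1))                    // narrow transformations: pipelined into a single stage
  .reduceByKey(_ + _)                        // wide transformation: shuffle, so a new stage begins
counts.collect()                             // action: one job, two stages
println(counts.toDebugString)                // prints the lineage; indentation marks the shuffle boundary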

The machine where the Spark application process (the one that creates the SparkContext) is running is the "Driver" node, with the process itself being called the Driver process. There is another machine where the Spark Standalone cluster manager is running, called the "Master" node. Alongside, on each of the machines in the cluster, there is a "Worker" process running which reports the available resources on its node to the "Master".

When the Driver process needs resources to run jobs/tasks, it asks the "Master" for them. The "Master" allocates the resources and uses the "Workers" running throughout the cluster to create "Executors" for the "Driver". The Driver can then run tasks in those "Executors".
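
A hedged sketch of how that request is shaped in code (the host name and values below are placeholders): the configuration on the SparkContext tells the "Master" how much memory and how many cores this application's "Executors" should get.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("resource-demo")
  .setMaster("spark://master-host:7077")   // hypothetical standalone Master URL
  .set("spark.executor.memory", "2g")      // memory per executor
  .set("spark.cores.max", "8")             // total cores this application may claim across the cluster
val sc = new SparkContext(conf)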


