Spark

Spark provides an interface for crunching,processing,mining and generally analysing big data.
Spark is used by python or Scala developers mostly
Spark had different models as follows

Spark Resilient Distributed Dataset(RDD) is the oldest models used

Very powerful and fast

Spark SQL and Dataframes API

We focus on data science

Spark Machine Learning
Spark Streaming

process realtime streaming data

use Dstreams and newer structured streaming libraries.

Spark is a parallel execution framework, which enables us to perform operations on big data sets in a highly parallel fashion.
Spark overcomes the following disadvantages of Hadoop

It uses map reduce framework to crunch big data sets.

It is very rigid model as it must do a map and a reduce process.

For complex requirements, we have to chain together, complex map reduce jobs.
After each map reduce in Hadoop The results have to be rewritten To disks And reloaded into the next map task.

Spark runs programmes up to 100X faster Then Hadoop Map Reduce in memory, or 10X faster on disk.
Spark allows us to build much richer jobs And we can use combination of operations, such as sort, filter, joins and Many other high-level operations.
Spark builds an execution plan. It has a graph of operations. We want to perform. It runs the execution plan only when we want results that are needed.

It can optimise our tasks by running Two independent parts of our process in parallel.
This is done by execution engine at runtime

Spark Program is distributed Across all cores On my development environment.
Running on Cluster

The code is uploaded to the driver node or master Node.
The driver node builds an execution plan.

The execution plan is a type of a graph.
The graph is the way to efficiently divide up the work so that it can be processed in parallel.

We have two or more worker nodes which are used to parallel process tasks in work.
Once the execution plan is complete, we load in big data or any file system such as Hadoop or Amazon S3.
Data is distributed across worker nodes and is also partitioned.

A partition is a block of data. A note will have multiple partitions deployed to it.

Driver will send across the network, any Java functions that needs to be executed against this data.
Functions reach the worker in words, and these are executed against the partitions.

Function is applied to each partition in the data node.

The function executing against a partition is called as a task.
A partition is just a data, but task is a function executing on a code.

RDD-Resilient Distributed Data

RDD is the data that we are working with.

This data is distributed across multiple partitions, which are distributed across multiple nodes.

If any of the node fails Then data present on note can be recovered and recreated.

This is done by distributed aspect of RDD.
If the big set of data is single array, then that is the resilient distributed data set.

We can visualise the execution plan using a Web user interface in spark.
The entire execution plan is executed only once terminal function is called.
The execution plan is also called as DAG-Directed Acyclic Graph

DAG Is a graph where there are connections between nodes on the graph.

Some of the basic operations in spark are

Spark is implemented in Scala, but there is API for python programming and Java programming.
Lambda function is the best way to work in Java Spark.
Scala Tuples

A Tuple is a collection of values
A Tuple can have many items inside it as we like
Tuples have very simple syntax and can be defined via placing comma separated values in round brackets.
The values have to be of same type.

Pairs

Allows us to have a pair of values in RDD.
Allows us to store values as key and value pairs.
Allow us to perform operations such as grouping by key
We can process log files on servers using pairs Of levels of loggings and logs using pairs.
A pair RDD is an RDD with two columns with any Java types.
The two columns are defined by key and value.
Difference between pair RDD and Map

Map can have only one instance of a key.
Pair can have multiple instance of the same key.
It is similar to a multi set also called as bag structure.

Java API does not have a similar data set.
Google guava Also had similar dataset as scala has which is used in Java.
Pair RDD Allows us to have many other methods which can perform operations based on key.

Flat Maps

Map takes a function and applies that function to every map in RDD.

For every value in input RDD there is a value in Output RDD

FlatMap gives multiple outputs for a given value or has a case where no value is returned.

For example if we have an RDD of sentences.
We may need to return words in a string

Flatmaps are used to return 0 or more values for a given singl value.

Returns a flat of values

Transformations

When we transform an RDD from one form to another.
Spark, lazily evaluates the transformation function’s.
Create new data sets based on existing ones.
They build a logical execution plan(DAG)
Spark can optimise the execution plan by combining multiple transformations before Executing them.
https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations

Actions

Trigger the computation and write a result, or write data to the storage.
Actions for the execution of the DAG plan.
https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions

Sorts and Coalesce

Sorts do not work with foreach.
RDD is processed in different partitions so partitions is a chunk of data present in RDD.
Multiple threads process partition in parallel or partitions are distributed on different physical nodes which process them in parallel.
To combine all data into one partition we use coalesce() method.
coalesce() allows us to specify how many partitions we need to define to process data.

If we want do coalesce(1) 1 then only one partition is created.
All data is combined into one partition.

Use of coalesce Is not recommended as we may get out of memory exception if we try to perform further action’s on the data.
Number of partitions Created is based on the size of the input data.
If we are working on a HDFS partition, the size of each partition is equal to the size of HDFS Block, which is 64MB in size.
When Working in spark, we do not need to worry about shuffling the data or even really know about partition for the results of an action to be correct.

We should know about shuffle for performance. This is very important, but it is not needed for the correctness of data.

We should get correct sorting, regardless of how spark has organised the data Into partitions.
“ForEach” after sorting, runs on each partition in parallel.
When we deploy spark Application on real cluster, then each partition is on different node.
On each node “println” Will be executing

We wouldn’t see any output on the driver.

Spark, when deployed on machine, deploys a thread On each core of the machine to process data on partitions.
“ForEach” Execute the lambda on each partition in parallel.
Any action, other than foreach will give us the right answers.

We can use functions like take() etc But we should be mindful of these actions as they may affect the performance.

Coalesce

After performing many transformations on our multibyte, multi partition, RDD, we have reached a point where we only have a small amount of data.
For the remaining transformations there is no point in continuing across 1000 partitions, any shuffles will be pointlessly expensive
Coalesce is a way of reducing the number of partitions, it is never used for correctness.
Shuffling data across partition’s Is a slow and expensive.
The lower, the partitions is the more effective that transformation are.

Collect()

Is generally used when we have finished and we want together a small RDD on the driver node, for example printing.
Only Call, if we are sure, RDD will fit into a single JVM RAM.
If the results are big, we will write to a HDFS file.

Joins

References

Search This Blog

Data Analytics

Spark

Comments

Post a Comment