Skip to main content
Spark
- Spark provides an interface for crunching,processing,mining and generally analysing big data.
- Spark is used by python or Scala developers mostly
- Spark had different models as follows
- Spark is a parallel execution framework, which enables us to perform operations on big data sets in a highly parallel fashion.
- Spark overcomes the following disadvantages of Hadoop
- It uses map reduce framework to crunch big data sets.
- It is very rigid model as it must do a map and a reduce process.
- For complex requirements, we have to chain together, complex map reduce jobs.
- After each map reduce in Hadoop The results have to be rewritten To disks And reloaded into the next map task.
- Spark runs programmes up to 100X faster Then Hadoop Map Reduce in memory, or 10X faster on disk.
- Spark allows us to build much richer jobs And we can use combination of operations, such as sort, filter, joins and Many other high-level operations.
- Spark builds an execution plan. It has a graph of operations. We want to perform. It runs the execution plan only when we want results that are needed.
- It can optimise our tasks by running Two independent parts of our process in parallel.
- This is done by execution engine at runtime
- Spark Program is distributed Across all cores On my development environment.
- Running on Cluster
- The code is uploaded to the driver node or master Node.
- The driver node builds an execution plan.
- The execution plan is a type of a graph.
- The graph is the way to efficiently divide up the work so that it can be processed in parallel.
- We have two or more worker nodes which are used to parallel process tasks in work.
- Once the execution plan is complete, we load in big data or any file system such as Hadoop or Amazon S3.
- Data is distributed across worker nodes and is also partitioned.
- A partition is a block of data. A note will have multiple partitions deployed to it.
- Driver will send across the network, any Java functions that needs to be executed against this data.
- Functions reach the worker in words, and these are executed against the partitions.
- Function is applied to each partition in the data node.
- The function executing against a partition is called as a task.
- A partition is just a data, but task is a function executing on a code.
- RDD-Resilient Distributed Data
- RDD is the data that we are working with.
- This data is distributed across multiple partitions, which are distributed across multiple nodes.
- If any of the node fails Then data present on note can be recovered and recreated.
- This is done by distributed aspect of RDD.
- If the big set of data is single array, then that is the resilient distributed data set.
- We can visualise the execution plan using a Web user interface in spark.
- The entire execution plan is executed only once terminal function is called.
- The execution plan is also called as DAG-Directed Acyclic Graph
- DAG Is a graph where there are connections between nodes on the graph.
- Some of the basic operations in spark are
- Spark is implemented in Scala, but there is API for python programming and Java programming.
- Lambda function is the best way to work in Java Spark.
- Scala Tuples
- A Tuple is a collection of values
- A Tuple can have many items inside it as we like
- Tuples have very simple syntax and can be defined via placing comma separated values in round brackets.
- The values have to be of same type.
- Pairs
- Allows us to have a pair of values in RDD.
- Allows us to store values as key and value pairs.
- Allow us to perform operations such as grouping by key
- We can process log files on servers using pairs Of levels of loggings and logs using pairs.
- A pair RDD is an RDD with two columns with any Java types.
- The two columns are defined by key and value.
- Difference between pair RDD and Map
- Java API does not have a similar data set.
- Google guava Also had similar dataset as scala has which is used in Java.
- Pair RDD Allows us to have many other methods which can perform operations based on key.
- Flat Maps
- Map takes a function and applies that function to every map in RDD.
- For every value in input RDD there is a value in Output RDD
- FlatMap gives multiple outputs for a given value or has a case where no value is returned.
- For example if we have an RDD of sentences.
- We may need to return words in a string
- Flatmaps are used to return 0 or more values for a given singl value.
- Transformations
- Actions
- Sorts and Coalesce
- Sorts do not work with foreach.
- RDD is processed in different partitions so partitions is a chunk of data present in RDD.
- Multiple threads process partition in parallel or partitions are distributed on different physical nodes which process them in parallel.
- To combine all data into one partition we use coalesce() method.
- coalesce() allows us to specify how many partitions we need to define to process data.
- If we want do coalesce(1) 1 then only one partition is created.
- All data is combined into one partition.
- Use of coalesce Is not recommended as we may get out of memory exception if we try to perform further action’s on the data.
- Number of partitions Created is based on the size of the input data.
- If we are working on a HDFS partition, the size of each partition is equal to the size of HDFS Block, which is 64MB in size.
- When Working in spark, we do not need to worry about shuffling the data or even really know about partition for the results of an action to be correct.
- We should know about shuffle for performance. This is very important, but it is not needed for the correctness of data.
- We should get correct sorting, regardless of how spark has organised the data Into partitions.
- “ForEach” after sorting, runs on each partition in parallel.
- When we deploy spark Application on real cluster, then each partition is on different node.
- On each node “println” Will be executing
- We wouldn’t see any output on the driver.
- Spark, when deployed on machine, deploys a thread On each core of the machine to process data on partitions.
- “ForEach” Execute the lambda on each partition in parallel.
- Any action, other than foreach will give us the right answers.
- We can use functions like take() etc But we should be mindful of these actions as they may affect the performance.
- Coalesce
- After performing many transformations on our multibyte, multi partition, RDD, we have reached a point where we only have a small amount of data.
- For the remaining transformations there is no point in continuing across 1000 partitions, any shuffles will be pointlessly expensive
- Coalesce is a way of reducing the number of partitions, it is never used for correctness.
- Shuffling data across partition’s Is a slow and expensive.
- The lower, the partitions is the more effective that transformation are.
- Collect()
- Is generally used when we have finished and we want together a small RDD on the driver node, for example printing.
- Only Call, if we are sure, RDD will fit into a single JVM RAM.
- If the results are big, we will write to a HDFS file.
- Joins
- References
Comments
Post a Comment