Spark

  • Spark provides an interface for crunching,processing,mining and generally analysing big data.
  • Spark is used by python or Scala developers mostly
  • Spark had different models as follows
  • Spark is a parallel execution framework, which enables us to perform operations on big data sets in a highly parallel fashion.
  • Spark overcomes the following disadvantages of Hadoop
    • It uses map reduce framework to crunch big data sets.
      • It is very rigid model as it must do a map and a reduce process.
    • For complex requirements, we have to chain together, complex map reduce jobs.
    • After each map reduce in Hadoop The results have to be rewritten To disks And reloaded into the next map task.
  • Spark runs programmes up to 100X faster Then Hadoop Map Reduce in memory, or 10X faster on disk.
  • Spark allows us to build much richer jobs And we can use combination of operations, such as sort, filter, joins and Many other high-level operations.
  • Spark builds an execution plan. It has a graph of operations. We want to perform. It runs the execution plan only when we want results that are needed.
    • It can optimise our tasks by running Two independent parts of our process in parallel.
    • This is done by execution engine at runtime
  • Spark Program is distributed Across all cores On my development environment.
  • Running on Cluster
    • The code is uploaded to the driver node or master Node.
    • The driver node builds an execution plan.
      • The execution plan is a type of a graph.
      • The graph is the way to efficiently divide up the work so that it can be processed in parallel.
    • We have two or more worker nodes which are used to parallel process tasks in work.
    • Once the execution plan is complete, we load in big data or any file system such as Hadoop or Amazon S3.
    • Data is distributed across worker nodes and is also partitioned.
      • A partition is a block of data. A note will have multiple partitions deployed to it.
    • Driver will send across the network, any Java functions that needs to be executed against this data.
    • Functions reach the worker in words, and these are executed against the partitions.
      • Function is applied to each partition in the data node.
    • The function executing against a partition is called as a task.
    • A partition is just a data, but task is a function executing on a code.
  • RDD-Resilient Distributed Data
    • RDD is the data that we are working with.
      • This data is distributed across multiple partitions, which are distributed across multiple nodes.
    • If any of the node fails Then data present on note can be recovered and recreated.
      • This is done by distributed aspect of RDD.
      • If the big set of data is single array, then that is the resilient distributed data set.
  • We can visualise the execution plan using a Web user interface in spark.
  • The entire execution plan is executed only once terminal function is called.
  • The execution plan is also called as DAG-Directed Acyclic Graph
    • DAG Is a graph where there are connections between nodes on the graph.
  • Some of the basic operations in spark are
  • Spark is implemented in Scala, but there is API for python programming and Java programming.
  • Lambda function is the best way to work in Java Spark.
  • Scala Tuples
    • A Tuple is a collection of values
    • A Tuple can have many items inside it as we like
    • Tuples have very simple syntax and can be defined via placing comma separated values in round brackets.
    • The values have to be of same type.
  • Pairs
    • Allows us to have a pair of values in RDD.
    • Allows us to store values as key and value pairs.
    • Allow us to perform operations such as grouping by key
    • We can process log files on servers using pairs Of levels of loggings and logs using pairs.
    • A pair RDD is an RDD with two columns with any Java types.
    • The two columns are defined by key and value.
    • Difference between pair RDD and Map
    • Java API does not have a similar data set.
    • Google guava Also had similar dataset as scala has which is used in Java.
    • Pair RDD Allows us to have many other methods which can perform operations based on key.
  • Flat Maps
    • Map takes a function and applies that function to every map in RDD.
      • For every value in input RDD there is a value in Output RDD
    • FlatMap gives multiple outputs for a given value or has a case where no value is returned.
      • For example if we have an RDD of sentences.
      • We may need to return words in a string
    • Flatmaps are used to return 0 or more values for a given singl value.
      • Returns a flat of values
  • Transformations
  • Actions
  • Sorts and Coalesce
    • Sorts do not work with foreach.
    • RDD is processed in different partitions so partitions is a chunk of data present in RDD.
    • Multiple threads process partition in parallel or partitions are distributed on different physical nodes which process them in parallel.
    • To combine all data into one partition we use coalesce() method.
    • coalesce() allows us to specify how many partitions we need to define to process data.
      • If we want do coalesce(1)  1 then only one partition is created.
      • All data is combined into one partition.
    • Use of coalesce Is not recommended as we may get out of memory exception if we try to perform further action’s on the data.
    • Number of partitions Created is based on the size of the input data.
    • If we are working on a HDFS partition, the size of each partition is equal to the size of HDFS Block, which is 64MB in size.
    • When Working in spark, we do not need to worry about shuffling the data or even really know about partition for the results of an action to be correct.
      • We should know about shuffle for performance. This is very important, but it is not needed for the correctness of data.
    • We should get correct sorting, regardless of how spark has organised the data Into partitions.
    • “ForEach” after sorting, runs on each partition in parallel.
    • When we deploy spark Application on real cluster, then each partition is on different node.
    • On each node “println” Will be executing
      • We wouldn’t see any output on the driver.
    • Spark, when deployed on machine, deploys a thread On each core of the machine to process data on partitions.
    • “ForEach” Execute the lambda on each partition in parallel.
    • Any action, other than foreach will give us the right answers.
      • We can use functions like take() etc But we should be mindful of these actions as they may affect the performance.
  • Coalesce
    • After performing many transformations on our multibyte, multi partition, RDD, we have reached a point where we only have a small amount of data.
    • For the remaining transformations there is no point in continuing across 1000 partitions, any shuffles will be pointlessly expensive
    • Coalesce is a way of reducing the number of partitions, it is never used for correctness.
    • Shuffling data across partition’s Is a slow and expensive.
    • The lower, the partitions is the more effective that transformation are.
  • Collect()
    • Is generally used when we have finished and we want together a small RDD on the driver node, for example printing.
    • Only Call, if we are sure, RDD will fit into a single JVM RAM.
    • If the results are big, we will write to a HDFS file.
  • Joins

  • References

Comments