Amazon EMR
- Amazon EMR gives us a real hardware cluster to run Spark jobs on.
- EMR is a billable service, especially when working with clusters.
- We deploy our Spark jobs on a cluster manager.
- Spark supports the Hadoop cluster manager (YARN).
- Hadoop is a collection of tools, one of which is MapReduce.
- HDFS (the Hadoop Distributed File System) is also part of Hadoop.
- EMR is Amazon's implementation of Hadoop in the cloud.
- We can take advantage of Amazon EMR clusters to run Spark jobs.
- Go to EMR and create a cluster.
- Creating a cluster will take some time, depending on the instance type selected.
- Packaging the Spark JAR for EMR
- Setting the master to “local[*]” makes the program run as a local multithreaded program, using as many threads as there are cores on the machine.
- When we run the program on a cluster with the above configuration, it runs only on the driver.
- The executor nodes will sit idle, doing nothing.
- Remove the setMaster configuration when deploying to a cluster.
- Store the input file on S3 and pass its URL to the “textFile” call.
- When uploading the Spark application, do not include the Spark distribution's library files, as they are already available on the Amazon EMR cluster.
- Also, check that the main class is set correctly in pom.xml.
- Add inbound firewall rules for the cluster in its security groups.
- Running Spark Jobs on EMR
- SSH into your cluster and download the Spark application JAR from S3.
- To run a Spark job on a cluster, use the “spark-submit” script.
- We can specify properties as arguments to spark-submit.
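As a rough illustration of how a spark-submit call is put together, the sketch below assembles the argument list in Java and prints it. The main class `com.example.WordCount`, the JAR name `wordcount.jar`, and the S3 path are hypothetical placeholders, not values from these notes. `--class` and `--conf key=value` are standard spark-submit options, and `spark.executor.memory` is used only as an example property.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SparkSubmitCommand {

    // Assembles the argument list: options first, then the application JAR,
    // then the arguments handed to the application's main method.
    public static List<String> build(String mainClass, Map<String, String> sparkConf,
                                     String jarPath, List<String> appArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("spark-submit");
        cmd.add("--class");
        cmd.add(mainClass);
        // Each Spark property is passed as its own "--conf key=value" pair.
        for (Map.Entry<String, String> e : sparkConf.entrySet()) {
            cmd.add("--conf");
            cmd.add(e.getKey() + "=" + e.getValue());
        }
        cmd.add(jarPath);
        cmd.addAll(appArgs);
        return cmd;
    }

    public static void main(String[] args) {
        // All values below are hypothetical placeholders.
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.executor.memory", "2g");
        List<String> cmd = build("com.example.WordCount", conf, "wordcount.jar",
                Arrays.asList("s3://my-bucket/input.txt"));
        System.out.println(String.join(" ", cmd));
    }
}
```

With these placeholder values the program prints `spark-submit --class com.example.WordCount --conf spark.executor.memory=2g wordcount.jar s3://my-bucket/input.txt`. No master is passed here, on the assumption that the cluster's own configuration supplies it.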
- Open the master public DNS of the Amazon EMR cluster in a browser on port 18080.
- This opens the Spark history server.
- It shows a list of Spark applications that have completed.
- We also get diagnostics for a job, such as:
- Duration
- The tasks of the job
- The Executors page shows information about the cluster:
- Its nodes
- Information about the nodes
- Performance of the nodes
- Remove the resources.
- The execution output shows the following information:
- Stage
- Number of tasks to be performed in the stage.
- A task is a unit of code executed against a single partition.
- [Stage 0 : (0+8) / 46]
- This means we have 46 partitions, and so 46 tasks in the stage.
- The first number inside the parentheses is the number of tasks completed, and the second is the number of tasks currently running.
- Above, 0 tasks have been completed and 8 are running.
- The number of tasks running at once depends on the number of executors and the CPU cores each one has. For example, with 2 executors of 4 cores each, 8 tasks run in parallel.
- [(40+6)/46]
- This means 40 tasks are completed and the last six are running, out of 46.
- Here, only six of the eight cores are in use.
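The progress notation described above can be checked with a little arithmetic. The following is a minimal sketch, not part of Spark: the class `StageProgress` and its methods are hypothetical helpers that parse the simplified `(completed+running) / total` form used in these notes and compute how many tasks are still waiting and how many task slots are idle. It assumes each task occupies one core, which is Spark's default.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StageProgress {
    // Matches "(completed+running) / total", e.g. "(0+8) / 46" or "(40+6)/46".
    private static final Pattern PROGRESS =
            Pattern.compile("\\((\\d+)\\s*\\+\\s*(\\d+)\\)\\s*/\\s*(\\d+)");

    public final int completed;
    public final int running;
    public final int total;

    public StageProgress(int completed, int running, int total) {
        this.completed = completed;
        this.running = running;
        this.total = total;
    }

    // Parses a progress line in the simplified form shown in these notes.
    public static StageProgress parse(String line) {
        Matcher m = PROGRESS.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("Not a progress line: " + line);
        }
        return new StageProgress(Integer.parseInt(m.group(1)),
                Integer.parseInt(m.group(2)), Integer.parseInt(m.group(3)));
    }

    // Tasks that have neither finished nor started yet.
    public int waiting() {
        return total - completed - running;
    }

    // Tasks that can run at once, assuming one core per task.
    public static int taskSlots(int executors, int coresPerExecutor) {
        return executors * coresPerExecutor;
    }

    // Slots with nothing to run, given the total number of slots.
    public int idleSlots(int slots) {
        return slots - running;
    }

    public static void main(String[] args) {
        int slots = taskSlots(2, 4); // 2 executors with 4 cores each: 8 slots
        StageProgress start = parse("[Stage 0 : (0+8) / 46]");
        StageProgress end = parse("[(40+6)/46]");
        System.out.println("start: waiting=" + start.waiting()
                + " idle=" + start.idleSlots(slots)); // waiting=38 idle=0
        System.out.println("end: waiting=" + end.waiting()
                + " idle=" + end.idleSlots(slots));   // waiting=0 idle=2
    }
}
```

At the start of the stage all 8 slots are busy and 38 tasks wait their turn. Near the end only 6 tasks remain, so 2 of the 8 cores have nothing left to do.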
- Terminate the cluster after testing to avoid ongoing costs.