There are several ways to submit a job to an Amazon EMR cluster, depending on the type of job and the tools you're using. Here are the most common methods:
1. Submitting a Job via the AWS Management Console
- Create or Access Your EMR Cluster:
  - Navigate to the EMR section of the AWS Management Console.
  - Create a new cluster or select an existing cluster.
- Add Steps to Your Cluster:
  - Go to the Steps tab within your cluster details.
  - Click Add step.
  - Choose the type of step you want to add (e.g., Spark, Hive, Hadoop).
  - Provide the required parameters for the step, such as the script location, input and output paths, and any arguments.
  - Click Add to submit the step.
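For reference, the console's Add step form asks for the same information you would pass to spark-submit on the command line. A filled-in example might look like this (field names can vary slightly between console versions, and the bucket and jar are placeholders):

```text
Step type:            Spark application
Name:                 Spark Program
Deploy mode:          Cluster
Application location: s3://my-bucket/spark-job.jar
Spark-submit options: --class org.apache.spark.examples.SparkPi
Arguments:            10
Action on failure:    Continue
```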
2. Submitting a Job via the AWS CLI
You can use the aws emr add-steps command to submit a job. Here’s an example for submitting a Spark job:
```bash
aws emr add-steps --cluster-id j-2AXXXXXXGAPLF \
  --steps Type=Spark,Name="Spark Program",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--class,org.apache.spark.examples.SparkPi,s3://my-bucket/spark-job.jar,10]
```
Explanation of the parameters:
- --cluster-id: The ID of your EMR cluster.
- --steps: The steps to be added to the cluster, where:
  - Type: The type of step (e.g., Spark, Hive, etc.).
  - Name: A name for the step.
  - ActionOnFailure: What to do if the step fails (CONTINUE, CANCEL_AND_WAIT, or TERMINATE_CLUSTER).
  - Args: Arguments for the step, such as the deploy mode, main class, and any other necessary parameters.
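The add-steps command prints the ID of the new step. To confirm that the step actually ran, you can check its state with list-steps or describe-step (the step ID below is a placeholder):

```bash
# List the cluster's steps and their states (PENDING, RUNNING, COMPLETED, FAILED, ...)
aws emr list-steps --cluster-id j-2AXXXXXXGAPLF

# Inspect a single step by ID
aws emr describe-step --cluster-id j-2AXXXXXXGAPLF --step-id s-XXXXXXXXXXXXX
```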
3. Submitting a Job via the EMR API
You can use the AWS SDKs to interact with the EMR API and submit jobs programmatically. Here’s an example using the Python SDK (boto3):
```python
import boto3

# Create an EMR client (region and credentials come from your AWS configuration)
emr_client = boto3.client('emr')

# command-runner.jar is EMR's generic step runner; here it invokes spark-submit
response = emr_client.add_job_flow_steps(
    JobFlowId='j-2AXXXXXXGAPLF',
    Steps=[
        {
            'Name': 'Spark Program',
            'ActionOnFailure': 'CONTINUE',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': [
                    'spark-submit',
                    '--deploy-mode', 'cluster',
                    '--class', 'org.apache.spark.examples.SparkPi',
                    's3://my-bucket/spark-job.jar',
                    '10',
                ],
            },
        },
    ],
)

# The response includes the IDs of the newly added steps under 'StepIds'
print(response)
```
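If you need to block until the step finishes (for example, in a deployment script), boto3 exposes a step_complete waiter. Here is a minimal sketch that continues from the response above:

```python
# Take the ID of the step we just submitted
step_id = response['StepIds'][0]

# Poll until the step completes; raises a WaiterError if it fails or times out
waiter = emr_client.get_waiter('step_complete')
waiter.wait(
    ClusterId='j-2AXXXXXXGAPLF',
    StepId=step_id,
    WaiterConfig={'Delay': 30, 'MaxAttempts': 60},  # check every 30s, up to 30 minutes
)
print(f'Step {step_id} completed')
```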
4. Submitting a Job via SSH
If you need more control or want to execute complex workflows, you can SSH into the master node of your EMR cluster and run commands manually:
- SSH into the Master Node:

```bash
ssh -i /path/to/your-key.pem hadoop@<master-node-dns>
```

- Run Your Job: For example, to submit a Spark job:

```bash
spark-submit --deploy-mode cluster --class org.apache.spark.examples.SparkPi s3://my-bucket/spark-job.jar 10
```
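Because a cluster-mode job runs under YARN, you can monitor it from the same SSH session with the standard YARN CLI (the application ID below is a placeholder):

```bash
# List running YARN applications; your Spark job appears here while it runs
yarn application -list

# Fetch the aggregated logs of a finished application
yarn logs -applicationId application_1234567890123_0001
```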
Summary
You can submit jobs to an EMR cluster through the AWS Management Console, the AWS CLI, the AWS SDKs, or by SSH-ing into the cluster's master node. Each method offers a different balance of control and convenience, so choose the approach that best fits your requirements.