Interacting with data in Amazon S3 from Amazon EMR can be done through the big data frameworks that EMR supports, such as Apache Hadoop, Apache Spark, Apache Hive, and Presto. Here are the common methods for working with data in S3:
1. Using Apache Hadoop
Apache Hadoop's file system shell (hadoop fs) can read and write S3 paths directly; on EMR, the s3:// scheme is backed by EMRFS (see section 6).
Example: Reading and Writing Data using Hadoop
# List objects under an S3 prefix
hadoop fs -ls s3://my-bucket/path/
# Copy data from S3 to the local file system
hadoop fs -copyToLocal s3://my-bucket/path/ /local/path/
# Copy data from the local file system to S3
hadoop fs -copyFromLocal /local/path/ s3://my-bucket/path/
2. Using Apache Spark
Spark has built-in support for reading data from and writing data to S3.
Example: Reading and Writing Data using Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("S3Example").getOrCreate()
# Reading data from S3 (header and inferSchema assume the CSV has a header row and typed columns)
df = spark.read.csv("s3://my-bucket/path/to/input.csv", header=True, inferSchema=True)
# Processing data
df_filtered = df.filter(df['column'] > 100)
# Writing data to S3
df_filtered.write.csv("s3://my-bucket/path/to/output/")
Submit the Spark job on EMR:
spark-submit --deploy-mode cluster s3://my-bucket/path/to/my-spark-job.py
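In practice, the S3 input and output locations are often passed to the script as arguments so the same job can be submitted against different paths. A minimal sketch of that pattern, assuming the script receives the input and output S3 paths as its two positional arguments (the paths and job name below are placeholders):
import sys
from pyspark.sql import SparkSession

# Usage: spark-submit my-spark-job.py <s3-input-path> <s3-output-path>
input_path, output_path = sys.argv[1], sys.argv[2]

spark = SparkSession.builder.appName("S3ParameterizedJob").getOrCreate()

# Read the input CSV from S3 (header/schema inference are assumptions about the file layout)
df = spark.read.csv(input_path, header=True, inferSchema=True)

# Write the result back to S3 as Parquet, overwriting any previous output
df.write.mode("overwrite").parquet(output_path)

spark.stop()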
3. Using Apache Hive
Hive can query data stored in S3 directly by defining external tables over S3 locations.
Example: Creating an External Table in Hive
CREATE EXTERNAL TABLE my_table (
  column1 STRING,
  column2 INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/path/to/data/';
Querying Data:
SELECT * FROM my_table WHERE column2 > 100;
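If the cluster is configured so that Spark and Hive share the same metastore (for example, via the AWS Glue Data Catalog), the same S3-backed external table can also be queried from Spark SQL. A minimal sketch, assuming the my_table definition above already exists:
from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark SQL see tables registered in the Hive metastore
spark = SparkSession.builder.appName("HiveTableFromSpark").enableHiveSupport().getOrCreate()

# Query the external table whose data lives in S3
result = spark.sql("SELECT * FROM my_table WHERE column2 > 100")
result.show()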
4. Using Presto
Presto allows interactive, low-latency querying of data stored in S3, typically through tables defined in the Hive metastore.
Example: Querying Data using Presto
SELECT * FROM hive.my_database.my_table WHERE column2 > 100;
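Queries are typically run from the presto-cli on the master node, but they can also be issued programmatically. The sketch below uses the PyHive client, which is an assumption (it is not installed on EMR by default), together with the default Presto coordinator port 8889 on the master node:
from pyhive import presto  # assumes PyHive is installed, e.g. pip install "pyhive[presto]"

# Connect to the Presto coordinator on the EMR master node (8889 is the default port on EMR)
conn = presto.connect(host="localhost", port=8889, catalog="hive", schema="my_database")
cursor = conn.cursor()
cursor.execute("SELECT * FROM my_table WHERE column2 > 100")
for row in cursor.fetchall():
    print(row)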
5. Using the AWS CLI
You can use the AWS CLI to move data between S3 and the cluster, for example in EMR steps or bootstrap actions.
Example: Copy Data to and from S3
# Download an object from S3 to the local file system
aws s3 cp s3://my-bucket/path/to/input.csv /local/path/
# Upload a local file to S3
aws s3 cp /local/path/output.csv s3://my-bucket/path/to/output/
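For programmatic access from Python (for example, in a bootstrap action or an EMR step), the AWS SDK for Python (boto3) provides the same copy operations. A minimal sketch, assuming the bucket and key names are placeholders:
import boto3

s3 = boto3.client("s3")

# Download an object from S3 to the local file system
s3.download_file("my-bucket", "path/to/input.csv", "/local/path/input.csv")

# Upload a local file back to S3
s3.upload_file("/local/path/output.csv", "my-bucket", "path/to/output/output.csv")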
6. EMRFS (EMR File System)
EMRFS is the S3 connector that lets EMR clusters address Amazon S3 as a Hadoop-compatible file system through the s3:// scheme.
Example: Configuration in Hadoop or Spark
In your Hadoop configuration files (e.g., core-site.xml) you can even set an S3 bucket as the default file system, although on EMR the default is normally HDFS and S3 paths are referenced explicitly:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>s3://my-bucket/</value>
  </property>
</configuration>
In Spark, you can specify S3 paths directly:
spark.read.text("s3://my-bucket/path/to/data.txt")
Summary
Interacting with data in Amazon S3 from Amazon EMR is straightforward thanks to the integration of big data frameworks like Hadoop, Spark, Hive, and Presto, which can read, write, and process data stored in S3 directly. The AWS CLI and EMRFS add further flexibility for moving and managing data between S3 and the cluster.