A company wants to process large-scale big data using open-source frameworks such as Apache Spark and Hadoop. Which AWS service is MOST suited for this requirement?

Correct answer: B

Explanation

This question asks which AWS service runs big data processing with open-source software (OSS) frameworks.

  • 1. 'Apache Spark and Hadoop': OSS big data frameworks = EMR
  • 2. 'large-scale big data': large-scale distributed processing = EMR

A: Incorrect

AWS Glue

AWS Glue is a serverless data integration service specialized for data extraction, transformation, and loading (ETL).

Although Glue uses Spark internally, its use case centers on ETL. The managed cluster service for running large-scale processing with freely chosen OSS frameworks such as Hadoop, Spark, and Hive is Amazon EMR, so this is incorrect.

B: Correct

Amazon EMR

This is correct. Amazon EMR is a service that runs open-source big data frameworks such as Apache Spark, Hadoop, Hive, and Presto on a managed cluster. It enables large-scale data processing, analytics, and ML preprocessing while minimizing the operational overhead of building and managing clusters.
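
As a rough illustration of what running open-source frameworks on a managed cluster can look like, the sketch below assembles the parameters one might pass to boto3's `run_job_flow` call to request a cluster with Spark, Hadoop, and Hive installed. The helper name `build_emr_request`, the cluster name, the release label, and the instance types and counts are placeholder assumptions, not recommendations; the role names are the EMR defaults and may differ in a given account.

```python
def build_emr_request(name="example-spark-cluster", core_nodes=2):
    """Return keyword arguments for boto3's EMR run_job_flow call.

    All values are illustrative assumptions; adjust them for a real account.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release; pick a current one
        # The OSS frameworks to install on the cluster.
        "Applications": [{"Name": app} for app in ("Spark", "Hadoop", "Hive")],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": core_nodes},
            ],
            # Shut the cluster down once no steps remain to run.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_request()
print([app["Name"] for app in request["Applications"]])

# With AWS credentials configured, the request could be submitted like this:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

The request is built as a plain dictionary so it can be inspected without AWS access; the actual submission is left as a comment because it creates billable resources.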

C: Incorrect

Amazon Kinesis Data Streams

Kinesis Data Streams is a service that ingests streaming data in real time and delivers it to multiple applications.

It handles the intake (ingestion) side of data; it is NOT a distributed processing platform for large-scale data using frameworks such as Spark and Hadoop, so this is incorrect.

D: Incorrect

Amazon Athena

Amazon Athena is a serverless analytical service that runs ad hoc SQL queries against data stored in S3.

Its primary use case is SQL-based querying; it is NOT a managed cluster platform for running large-scale data processing jobs using frameworks such as Spark and Hadoop, so this is incorrect.

Key Takeaway

'Spark/Hadoop' and 'big data processing' point to Amazon EMR. Distinguish by use case: SQL queries against S3 use Athena, ETL uses Glue, and stream ingestion uses Kinesis.
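
The takeaway can be pictured as a simple lookup from use case to service. The sketch below is only a memory aid: the dictionary keys and the `pick_service` helper are hypothetical names chosen for illustration, and real service selection involves more factors than a keyword match.

```python
# Hypothetical use-case-to-service mapping that mirrors the takeaway above.
SERVICE_BY_USE_CASE = {
    "spark/hadoop big data processing": "Amazon EMR",
    "sql queries against s3": "Amazon Athena",
    "etl": "AWS Glue",
    "stream ingestion": "Amazon Kinesis Data Streams",
}


def pick_service(use_case):
    """Return the service mapped to a use case, or 'unknown' if none matches."""
    return SERVICE_BY_USE_CASE.get(use_case.strip().lower(), "unknown")


print(pick_service("Spark/Hadoop big data processing"))
print(pick_service("ETL"))
```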