Amazon AWS Certified Data Engineer - Associate (DEA-C01) - Data-Engineer-Associate Exam Practice Test

Question 1

A data engineer must orchestrate a data pipeline that consists of one AWS Lambda function and one AWS Glue job. The solution must integrate with AWS services.
Which solution will meet these requirements with the LEAST management overhead?

A. Use an Apache Airflow workflow that is deployed on Amazon Elastic Kubernetes Service (Amazon EKS). Define a directed acyclic graph (DAG) in which the first task is to call the Lambda function and the second task is to call the AWS Glue job.
B. Use an Apache Airflow workflow that is deployed on an Amazon EC2 instance. Define a directed acyclic graph (DAG) in which the first task is to call the Lambda function and the second task is to call the AWS Glue job.
C. Use an AWS Glue workflow to run the Lambda function and then the AWS Glue job.
D. Use an AWS Step Functions workflow that includes a state machine. Configure the state machine to run the Lambda function and then the AWS Glue job.

Correct Answer: D
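
The state machine in option D can be sketched as an Amazon States Language definition built in Python. This is only an illustrative sketch, not a reference solution: the function name my-ingest-function and the job name my-glue-job are hypothetical, and deploying the definition would still require an IAM role and a call to the Step Functions API.

```python
import json

def build_state_machine(function_name: str, glue_job_name: str) -> dict:
    """Return an ASL definition that runs a Lambda function, then a Glue job."""
    return {
        "Comment": "Lambda then Glue (illustrative sketch)",
        "StartAt": "InvokeLambda",
        "States": {
            "InvokeLambda": {
                "Type": "Task",
                # Step Functions service integration for Lambda.
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": function_name},
                "Next": "RunGlueJob",
            },
            "RunGlueJob": {
                "Type": "Task",
                # The .sync suffix makes the state wait until the job run ends.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job_name},
                "End": True,
            },
        },
    }

# Hypothetical resource names, used only for illustration.
definition = build_state_machine("my-ingest-function", "my-glue-job")
print(json.dumps(definition, indent=2))
```

Both tasks use managed service integrations, so there is no Airflow scheduler, cluster, or instance to operate.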

Question 2

A data engineer configures a large number of AWS Glue jobs that all start up around the same time. All the jobs run for less than 1 hour in the same subnet of the same VPC. All the AWS Glue jobs run on a G.1X worker type.
Some of the jobs occasionally fail with the following error: "The specified subnet does not have enough free addresses to satisfy the request." What is the likely root cause of the error?

A. There are not enough IP addresses in the subnet.
B. The G.1X worker type cannot access the subnet.
C. There are not enough IP addresses in the VPC.
D. AWS Glue does not have the correct IAM permissions to add additional IP addresses to the subnet.

Correct Answer: A
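
The arithmetic behind option A can be checked with a short sketch. AWS reserves five addresses in every subnet. The subnet size, job count, and workers per job below are hypothetical, and the sketch assumes each Glue worker holds roughly one IP address while its job runs.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of each subnet.
AWS_RESERVED_PER_SUBNET = 5

def usable_addresses(cidr: str) -> int:
    """Addresses in a subnet that remain available for network interfaces."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET

def peak_demand(jobs: int, workers_per_job: int) -> int:
    """Assumption: each worker needs about one IP address while the job runs."""
    return jobs * workers_per_job

subnet = "10.0.1.0/24"                              # hypothetical subnet
available = usable_addresses(subnet)
needed = peak_demand(jobs=30, workers_per_job=10)   # hypothetical workload
shortfall = max(0, needed - available)
print(f"{available} usable, {needed} needed, shortfall {shortfall}")
```

With these illustrative numbers the subnet is 49 addresses short when all jobs start together, which matches the intermittent failures described in the question.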

Question 3

A company stores customer data in an Amazon S3 bucket. The company must permanently delete all customer data that is older than 7 years.

A. Configure an S3 Lifecycle policy to permanently delete objects that are older than 7 years.
B. Configure an S3 Lifecycle policy to enable S3 Object Lock on all objects that are older than 7 years.
C. Use Amazon Athena to query the S3 bucket for objects that are older than 7 years. Configure Athena to delete the results.
D. Configure an S3 Lifecycle policy to move objects that are older than 7 years to S3 Glacier Deep Archive.

Correct Answer: A
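
A lifecycle rule like the one in option A might look as follows. This is a sketch under stated assumptions: the rule ID is hypothetical, 7 years is approximated as 7 x 365 days, and the noncurrent-version setting only matters if the bucket has versioning enabled. The dictionary follows the shape that boto3's put_bucket_lifecycle_configuration accepts as LifecycleConfiguration.

```python
import json

RETENTION_DAYS = 7 * 365  # leap days are ignored in this sketch

def expiration_rule(days: int) -> dict:
    """Build a lifecycle configuration that expires every object after `days`."""
    return {
        "Rules": [
            {
                "ID": "expire-old-customer-data",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},          # empty prefix matches all objects
                "Expiration": {"Days": days},
                # On a versioned bucket, expiration only adds a delete marker,
                # so noncurrent versions need their own expiration.
                "NoncurrentVersionExpiration": {"NoncurrentDays": 1},
            }
        ]
    }

lifecycle = expiration_rule(RETENTION_DAYS)
print(json.dumps(lifecycle, indent=2))
```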

Question 4

A company needs to implement a data mesh architecture for trading, risk, and compliance teams. Each team has its own data but needs to share views. They have 1,000+ tables in 50 Glue databases. All teams use Athena and Redshift, and compliance requires full auditing and PII access control.

A. Create views in Athena for on-demand analysis. Use the Athena views in Amazon Redshift to perform cross-domain analytics. Use AWS CloudTrail to audit data access. Use AWS Lake Formation to establish fine-grained access control.
B. Use AWS Glue Data Catalog views. Use CloudTrail logs and Lake Formation to manage permissions.
C. Create materialized views and enable Amazon Redshift datashares for each domain.
D. Use Lake Formation to set up cross-domain access to tables. Set up fine-grained access controls.

Correct Answer: A

Question 5

A data engineer needs to analyze time-sensitive sales data. The company stores the data in an Amazon S3 bucket. The data engineer uses AWS Glue Data Catalog to access the data.
When performing the analysis, the data engineer notices that some records are missing or out of date.
What is the likely cause of these issues?

A. Versioning is not enabled on the S3 bucket.
B. Incorrect IAM roles are assigned to the AWS Glue jobs.
C. AWS Glue Data Catalog is not up to date with the latest S3 partition changes.
D. The AWS Glue job schedules overlap with one another.

Correct Answer: C

Question 6

A company is creating a new data pipeline to populate a data lake. A data analyst needs to prepare and standardize the data before a data engineering team can perform advanced data transformations. The data analyst needs a solution to process the data that does not require writing new code.
Which solution will meet these requirements with the LEAST operational effort?

A. Use AWS Glue Studio with data preparation recipe transformations. Ensure that the data engineers add additional transformations to complete the pipeline.
B. Use Amazon SageMaker Canvas and SageMaker Data Wrangler to write to a new dataset. Ensure that the data engineers add additional transformations to complete the pipeline by using AWS Glue.
C. Create a document that includes the data preparation rules. Ensure that the data engineers implement the rules in AWS Glue.
D. Use Python and Pandas in an AWS Glue Studio notebook. Ensure that the data engineers add additional transformations to complete the pipeline.

Correct Answer: A

Question 7

A company's data processing pipeline uses AWS Glue jobs and AWS Glue Data Catalog. All AWS Glue jobs must run in a custom VPC inside a private subnet. The company uses a NAT gateway to support outbound connections.
A data engineer needs to use AWS Glue to migrate data from an on-premises PostgreSQL database to Amazon S3. There is no current network connection between AWS and the on-premises environment.
However, the data engineer has updated the on-premises database to allow traffic from the custom VPC.
Which solution will meet these requirements?

A. Create a JDBC connection in AWS Glue with the database JDBC URL, username, and password.
B. Create a Simple Authentication and Security Layer (SASL) connection in AWS Glue to the on-premises database.
C. Create a JDBC connection in AWS Glue with a security group that allows TCP traffic to and from itself.
D. Create a JDBC connection in AWS Glue that uses a JDBC driver stored in Amazon S3. Retrieve the database URL, username, and password from AWS Secrets Manager.

Correct Answer: D

Question 8

A company needs to build an extract, transform, and load (ETL) pipeline that has separate stages for batch data ingestion, transformation, and storage. The pipeline must store the transformed data in an Amazon S3 bucket. Each stage must automatically retry failures. The pipeline must provide visibility into the success or failure of individual stages.
Which solution will meet these requirements with the LEAST operational overhead?

A. Schedule Apache Airflow directed acyclic graphs (DAGs) on Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate pipeline steps. Use Amazon Simple Queue Service (Amazon SQS) to ingest data. Use AWS Glue jobs to transform data and store the data in the S3 bucket.
B. Build an Amazon EventBridge-based pipeline that invokes AWS Lambda functions to perform each stage.
C. Chain AWS Glue jobs that perform each stage together by using job triggers. Set the MaxRetries field to 0.
D. Deploy AWS Step Functions workflows to orchestrate AWS Lambda functions that ingest data. Use AWS Glue jobs to transform the data and store the data in the S3 bucket.

Correct Answer: D
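
The automatic retry requirement maps onto the Retry and Catch fields of a Step Functions Task state. The helper below is a hedged sketch of that idea: the job name transform-job, the failure state name NotifyFailure, and the retry timings are all hypothetical.

```python
def with_retry_and_catch(task_state: dict, failure_state: str,
                         max_attempts: int = 3) -> dict:
    """Return a copy of a Task state with Retry and Catch fields added."""
    state = dict(task_state)
    # Retry any error with exponential backoff before giving up.
    state["Retry"] = [{
        "ErrorEquals": ["States.ALL"],
        "IntervalSeconds": 30,
        "MaxAttempts": max_attempts,
        "BackoffRate": 2.0,
    }]
    # After the retries are exhausted, route to a failure-handling state.
    state["Catch"] = [{"ErrorEquals": ["States.ALL"], "Next": failure_state}]
    return state

# Hypothetical transformation stage that runs a Glue job and waits for it.
transform = with_retry_and_catch(
    {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": "transform-job"},
        "End": True,
    },
    failure_state="NotifyFailure",
)
print(transform["Retry"], transform["Catch"])
```

Because each stage is its own state, the execution history shows which stage succeeded, retried, or failed.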

Question 9

A company needs to set up a data catalog and metadata management for data sources that run in the AWS Cloud. The company will use the data catalog to maintain the metadata of all the objects that are in a set of data stores. The data stores include structured sources such as Amazon RDS and Amazon Redshift. The data stores also include semistructured sources such as JSON files and .xml files that are stored in Amazon S3.
The company needs a solution that will update the data catalog on a regular basis. The solution also must detect changes to the source metadata.
Which solution will meet these requirements with the LEAST operational overhead?

A. Use Amazon Aurora as the data catalog. Create AWS Lambda functions that will connect to the data catalog. Configure the Lambda functions to gather the metadata information from multiple sources and to update the Aurora data catalog. Schedule the Lambda functions to run periodically.
B. Use the AWS Glue Data Catalog as the central metadata repository. Extract the schema for Amazon RDS and Amazon Redshift sources, and build the Data Catalog. Use AWS Glue crawlers for data that is in Amazon S3 to infer the schema and to automatically update the Data Catalog.
C. Use the AWS Glue Data Catalog as the central metadata repository. Use AWS Glue crawlers to connect to multiple data stores and to update the Data Catalog with metadata changes. Schedule the crawlers to run periodically to update the metadata catalog.
D. Use Amazon DynamoDB as the data catalog. Create AWS Lambda functions that will connect to the data catalog. Configure the Lambda functions to gather the metadata information from multiple sources and to update the DynamoDB data catalog. Schedule the Lambda functions to run periodically.

Correct Answer: C
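
A scheduled crawler that covers both S3 and JDBC sources could be described with a request like the one below. It is a sketch only: every name, ARN, and path is hypothetical, and the dictionary mirrors the keyword arguments of the AWS Glue CreateCrawler API without actually calling it.

```python
def crawler_request(name: str, role_arn: str, database: str,
                    s3_path: str, jdbc_connection: str, jdbc_path: str) -> dict:
    """Build CreateCrawler arguments for one S3 target and one JDBC target."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [{"Path": s3_path}],
            "JdbcTargets": [{"ConnectionName": jdbc_connection, "Path": jdbc_path}],
        },
        # Run every day at 02:00 UTC.
        "Schedule": "cron(0 2 * * ? *)",
        # Apply detected schema changes to the Data Catalog; log deletions.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }

# Hypothetical values for illustration.
request = crawler_request(
    name="catalog-refresh",
    role_arn="arn:aws:iam::123456789012:role/example-glue-crawler-role",
    database="central_catalog",
    s3_path="s3://example-bucket/raw/",
    jdbc_connection="example-rds-connection",
    jdbc_path="salesdb/public/%",
)
print(request["Schedule"])
```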

Question 10

A company maintains an Amazon Redshift provisioned cluster that the company uses for extract, transform, and load (ETL) operations to support critical analysis tasks. A sales team within the company maintains a Redshift cluster that the sales team uses for business intelligence (BI) tasks.
The sales team recently requested access to the data that is in the ETL Redshift cluster so the team can perform weekly summary analysis tasks. The sales team needs to join data from the ETL cluster with data that is in the sales team's BI cluster.
The company needs a solution that will share the ETL cluster data with the sales team without interrupting the critical analysis tasks. The solution must minimize usage of the computing resources of the ETL cluster.
Which solution will meet these requirements?

A. Unload a copy of the data from the ETL cluster to an Amazon S3 bucket every week. Create an Amazon Redshift Spectrum table based on the content of the ETL cluster.
B. Create materialized views based on the sales team's requirements. Grant the sales team direct access to the ETL cluster.
C. Create database views based on the sales team's requirements. Grant the sales team direct access to the ETL cluster.
D. Set up the sales team BI cluster as a consumer of the ETL cluster by using Redshift data sharing.

Correct Answer: D
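
The data sharing setup in option D comes down to a few SQL statements on each cluster. The sketch below only generates those statements as strings; the datashare, schema, and database names and the namespace placeholders are hypothetical and would have to be replaced with real values.

```python
from typing import List

def producer_statements(share: str, schema: str,
                        consumer_namespace: str) -> List[str]:
    """SQL to run on the ETL (producer) cluster."""
    return [
        f"CREATE DATASHARE {share};",
        f"ALTER DATASHARE {share} ADD SCHEMA {schema};",
        f"ALTER DATASHARE {share} ADD ALL TABLES IN SCHEMA {schema};",
        f"GRANT USAGE ON DATASHARE {share} TO NAMESPACE '{consumer_namespace}';",
    ]

def consumer_statement(database: str, share: str,
                       producer_namespace: str) -> str:
    """SQL to run on the BI (consumer) cluster."""
    return (f"CREATE DATABASE {database} FROM DATASHARE {share} "
            f"OF NAMESPACE '{producer_namespace}';")

# Placeholder namespace IDs; real ones are GUIDs assigned to each cluster.
producer_sql = producer_statements("etl_share", "public", "<consumer-namespace-id>")
consumer_sql = consumer_statement("etl_data", "etl_share", "<producer-namespace-id>")
for statement in producer_sql + [consumer_sql]:
    print(statement)
```

Queries against the shared database run on the consumer cluster's compute, which is why this option keeps load off the ETL cluster.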

Question 11

A data engineer is building a data pipeline. A large data file is uploaded to an Amazon S3 bucket once each day at unpredictable times. An AWS Glue workflow uses hundreds of workers to process the file and load the data into Amazon Redshift. The company wants to process the file as quickly as possible.
Which solution will meet these requirements?

A. Create an on-demand AWS Glue trigger to start the workflow. Create an AWS Database Migration Service (AWS DMS) migration task. Set the DMS source as the S3 bucket. Set the target endpoint as the AWS Glue workflow.
B. Create an event-based AWS Glue trigger to start the workflow. Configure Amazon S3 to log events to AWS CloudTrail. Create a rule in Amazon EventBridge to forward PutObject events to the AWS Glue trigger.
C. Create an on-demand AWS Glue trigger to start the workflow. Create an AWS Lambda function that runs every 15 minutes to check the S3 bucket for the daily file. Configure the function to start the AWS Glue workflow if the file is present.
D. Create a scheduled AWS Glue trigger to start the workflow. Create a cron job that runs the AWS Glue job every 15 minutes. Set up the AWS Glue job to check the S3 bucket for the daily file. Configure the job to stop if the file is not present.

Correct Answer: B
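
The EventBridge rule in option B needs an event pattern that matches the CloudTrail record of the upload. The following is a minimal sketch of such a pattern; the bucket name is hypothetical, and the rule target (the Glue workflow) is not shown.

```python
import json

def s3_put_object_pattern(bucket: str) -> dict:
    """Event pattern for S3 PutObject calls recorded by CloudTrail."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["s3.amazonaws.com"],
            "eventName": ["PutObject"],
            "requestParameters": {"bucketName": [bucket]},
        },
    }

pattern = s3_put_object_pattern("example-daily-upload-bucket")  # hypothetical bucket
print(json.dumps(pattern, indent=2))
```

One caveat worth checking in a real deployment: a large file sent as a multipart upload is recorded as CompleteMultipartUpload rather than PutObject, so the eventName list may need that entry as well.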

Question 12

A company runs an extract, transform, and load (ETL) job in AWS Glue. The job processes personally identifiable information (PII) data and writes logs to an Amazon CloudWatch Logs log group. A data engineer needs to mask PII data in the CloudWatch Logs log group.
Which solution will meet these requirements?

A. Call AWS Glue sensitive data detection APIs in the ETL job.
B. Run an Amazon Macie sensitive data discovery job.
C. Configure a data protection policy. Attach the policy to the CloudWatch log group.
D. Attach an AWS Glue security configuration to the ETL job.

Correct Answer: C
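
A CloudWatch Logs data protection policy is a JSON document with an audit statement and a de-identify statement that list the same data identifiers. The sketch below builds one for email addresses; the policy name is hypothetical and the choice of identifier is only an example, so the real set should match the PII the job actually logs.

```python
import json
from typing import List

# Managed data identifier for email addresses.
EMAIL = "arn:aws:dataprotection::aws:data-identifier/EmailAddress"

def data_protection_policy(identifiers: List[str]) -> dict:
    """Build a policy that audits and masks the given data identifiers."""
    return {
        "Name": "mask-pii-in-glue-logs",  # hypothetical policy name
        "Version": "2021-06-01",
        "Statement": [
            {
                "Sid": "audit",
                "DataIdentifier": identifiers,
                "Operation": {"Audit": {"FindingsDestination": {}}},
            },
            {
                "Sid": "redact",
                "DataIdentifier": identifiers,
                # Masks matching data when log events are viewed.
                "Operation": {"Deidentify": {"MaskConfig": {}}},
            },
        ],
    }

policy = data_protection_policy([EMAIL])
policy_document = json.dumps(policy)
print(policy_document)
```

The resulting string is what would be attached to the log group, for example as the policyDocument argument of boto3's put_data_protection_policy.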

Question 13

A company uses AWS Step Functions to orchestrate a data pipeline. The pipeline consists of Amazon EMR jobs that ingest data from data sources and store the data in an Amazon S3 bucket. The pipeline also includes EMR jobs that load the data to Amazon Redshift.
The company's cloud infrastructure team manually built a Step Functions state machine. The cloud infrastructure team launched an EMR cluster into a VPC to support the EMR jobs. However, the deployed Step Functions state machine is not able to run the EMR jobs.
Which combination of steps should the company take to identify the reason the Step Functions state machine is not able to run the EMR jobs? (Choose two.)

A. Verify that the Step Functions state machine code has all IAM permissions that are necessary to create and run the EMR jobs. Verify that the Step Functions state machine code also includes IAM permissions to access the Amazon S3 buckets that the EMR jobs use. Use Access Analyzer for S3 to check the S3 access properties.
B. Check the retry scenarios that the company configured for the EMR jobs. Increase the number of seconds in the interval between each EMR task. Validate that each fallback state has the appropriate catch for each decision state. Configure an Amazon Simple Notification Service (Amazon SNS) topic to store the error messages.
C. Check for entries in Amazon CloudWatch for the newly created EMR cluster. Change the AWS Step Functions state machine code to use Amazon EMR on EKS. Change the IAM access policies and the security group configuration for the Step Functions state machine code to reflect inclusion of Amazon Elastic Kubernetes Service (Amazon EKS).
D. Use AWS CloudFormation to automate the Step Functions state machine deployment. Create a step to pause the state machine during the EMR jobs that fail. Configure the step to wait for a human user to send approval through an email message. Include details of the EMR task in the email message for further analysis.
E. Query the flow logs for the VPC. Determine whether the traffic that originates from the EMR cluster can successfully reach the data providers. Determine whether any security group that might be attached to the Amazon EMR cluster allows connections to the data source servers on the informed ports.

Correct Answer: A,E
