A company's data engineer needs to optimize the performance of table SQL queries. The company stores data in an Amazon Redshift cluster. The data engineer cannot increase the size of the cluster because of budget constraints. The company stores the data in multiple tables and loads the data by using the EVEN distribution style. Some tables are hundreds of gigabytes in size. Other tables are less than 10 MB in size. Which solution will meet these requirements?
A. Keep using the EVEN distribution style for all tables. Specify primary and foreign keys for all tables.
B. Use the ALL distribution style for large tables. Specify primary and foreign keys for all tables.
C. Use the ALL distribution style for rarely updated small tables. Specify primary and foreign keys for all tables.
D. Specify a combination of distribution, sort, and partition keys for all tables.
Answer: C
Explanation: This solution meets the requirement of optimizing SQL query performance without increasing the size of the cluster. Using the ALL distribution style for rarely updated small tables copies the entire table to every node in the cluster, which eliminates the need for data redistribution during joins. This can improve query performance significantly, especially for frequently joined dimension tables. However, the ALL distribution style also increases storage usage and load time, so it is suitable only for small tables that are not updated frequently or extensively. Specifying primary and foreign keys for all tables helps the query optimizer generate better query plans and avoid unnecessary scans or joins. You can also use the AUTO distribution style to let Amazon Redshift choose the optimal distribution style based on table size and query patterns.
References:
Choose the best distribution style
Distribution styles
Working with data distribution styles
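As an illustration of option C, the sketch below uses the Amazon Redshift Data API (boto3) to create a small dimension table with DISTSTYLE ALL. The cluster, database, user, and table names are placeholders, not part of the question.

```python
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

ddl = """
CREATE TABLE dim_country (
    country_id   INTEGER PRIMARY KEY,
    country_name VARCHAR(100)
)
DISTSTYLE ALL;          -- copy the full table to every node for join locality
"""

response = rsd.execute_statement(
    ClusterIdentifier="etl-cluster",   # assumed cluster name
    Database="analytics",              # assumed database name
    DbUser="admin",                    # assumed database user
    Sql=ddl,
)
print(response["Id"])  # statement ID, useful for polling the result
```

The PRIMARY KEY constraint is informational in Redshift, but as the explanation notes it still helps the optimizer build better plans.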
Question # 52
A data engineer is configuring Amazon SageMaker Studio to use AWS Glue interactive sessions to prepare data for machine learning (ML) models. The data engineer receives an access denied error when the data engineer tries to prepare the data by using SageMaker Studio. Which change should the engineer make to gain access to SageMaker Studio?
A. Add the AWSGlueServiceRole managed policy to the data engineer's IAM user.
B. Add a policy to the data engineer's IAM user that includes the sts:AssumeRole action for the AWS Glue and SageMaker service principals in the trust policy.
C. Add the AmazonSageMakerFullAccess managed policy to the data engineer's IAM user.
D. Add a policy to the data engineer's IAM user that allows the sts:AddAssociation action for the AWS Glue and SageMaker service principals in the trust policy.
Answer: B
Explanation: This solution meets the requirement of gaining access to SageMaker Studio to use AWS Glue interactive sessions. AWS Glue interactive sessions provide an on-demand, serverless Apache Spark environment that you can use from SageMaker Studio to prepare data, and they rely on the AWS Glue Data Catalog for table metadata. To use AWS Glue interactive sessions, the data engineer's IAM user needs permissions to assume the AWS Glue service role and the SageMaker execution role. By adding a policy to the data engineer's IAM user that includes the sts:AssumeRole action for the AWS Glue and SageMaker service principals in the trust policy, the data engineer can grant these permissions and avoid the access denied error. The other options are not sufficient or necessary to resolve the error.
References:
Get started with data integration from Amazon S3 to Amazon Redshift using AWS Glue interactive sessions
Troubleshoot Errors - Amazon SageMaker
AccessDeniedException on sagemaker:CreateDomain in AWS SageMaker Studio, despite having SageMakerFullAccess
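The snippet below is a hedged sketch of the kind of identity policy the explanation describes: it attaches an inline policy to the data engineer's IAM user allowing sts:AssumeRole on assumed Glue and SageMaker execution role ARNs. All names and ARNs are placeholders.

```python
import json
import boto3

iam = boto3.client("iam")

# Illustrative identity policy: allow the data engineer's user to assume the
# execution roles that Glue interactive sessions and SageMaker Studio use.
# The role ARNs below are placeholders.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": [
                "arn:aws:iam::123456789012:role/GlueInteractiveSessionRole",
                "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
            ],
        }
    ],
}

iam.put_user_policy(
    UserName="data-engineer",              # assumed IAM user name
    PolicyName="AllowAssumeGlueSageMakerRoles",
    PolicyDocument=json.dumps(policy_document),
)
```

The corresponding execution roles would also need trust policies that allow the Glue and SageMaker service principals to assume them, which is what the answer's wording alludes to.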
Question # 53
A company stores petabytes of data in thousands of Amazon S3 buckets in the S3 Standard storage class. The data supports analytics workloads that have unpredictable and variable data access patterns. The company does not access some data for months. However, the company must be able to retrieve all data within milliseconds. The company needs to optimize S3 storage costs. Which solution will meet these requirements with the LEAST operational overhead?
A. Use S3 Storage Lens standard metrics to determine when to move objects to more cost-optimized storage classes. Create S3 Lifecycle policies for the S3 buckets to move objects to cost-optimized storage classes. Continue to refine the S3 Lifecycle policies in the future to optimize storage costs.
B. Use S3 Storage Lens activity metrics to identify S3 buckets that the company accesses infrequently. Configure S3 Lifecycle rules to move objects from S3 Standard to the S3 Standard-Infrequent Access (S3 Standard-IA) and S3 Glacier storage classes based on the age of the data.
C. Use S3 Intelligent-Tiering. Activate the Deep Archive Access tier.
D. Use S3 Intelligent-Tiering. Use the default access tier.
Answer: D
Explanation: S3 Intelligent-Tiering is a storage class that automatically moves objects between access tiers based on changing access patterns. With the default configuration, objects move between the Frequent Access and Infrequent Access tiers. Objects in the Frequent Access tier have the same performance and availability as S3 Standard, while objects in the Infrequent Access tier have the same performance and availability as S3 Standard-IA. S3 Intelligent-Tiering monitors the access patterns of each object and moves objects between the tiers accordingly, without operational overhead or retrieval fees. This solution optimizes S3 storage costs for data with unpredictable and variable access patterns, while ensuring millisecond latency for data retrieval. The other solutions are not optimal or relevant for this requirement. Using S3 Storage Lens standard metrics and activity metrics can provide insights into storage usage and access patterns, but they do not automate the data movement between storage classes. Creating S3 Lifecycle policies for the S3 buckets can move objects to more cost-optimized storage classes, but they require manual configuration and maintenance, and they may incur retrieval fees for data that is accessed unexpectedly. Activating the Deep Archive Access tier for S3 Intelligent-Tiering can further reduce storage costs for data that is rarely accessed, but it also increases the retrieval time to up to 12 hours, which does not meet the requirement of millisecond latency.
References:
S3 Intelligent-Tiering
S3 Storage Lens
S3 Lifecycle policies
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
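For illustration, here is a minimal boto3 sketch that moves objects in an assumed bucket into S3 Intelligent-Tiering with a lifecycle transition, leaving the default access tiers in place as option D requires. The bucket name and rule ID are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Illustrative lifecycle rule: transition existing and new objects in an
# assumed bucket to S3 Intelligent-Tiering (default access tiers only).
s3.put_bucket_lifecycle_configuration(
    Bucket="analytics-data-bucket",        # assumed bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "MoveToIntelligentTiering",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)
```

New objects can also be written directly to the class by passing StorageClass="INTELLIGENT_TIERING" on upload, so no further lifecycle tuning is needed.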
Question # 54
A company uses AWS Step Functions to orchestrate a data pipeline. The pipeline consists of Amazon EMR jobs that ingest data from data sources and store the data in an Amazon S3 bucket. The pipeline also includes EMR jobs that load the data to Amazon Redshift. The company's cloud infrastructure team manually built a Step Functions state machine. The cloud infrastructure team launched an EMR cluster into a VPC to support the EMR jobs. However, the deployed Step Functions state machine is not able to run the EMR jobs. Which combination of steps should the company take to identify the reason the Step Functions state machine is not able to run the EMR jobs? (Choose two.)
A. Use AWS CloudFormation to automate the Step Functions state machine deployment. Create a step to pause the state machine during the EMR jobs that fail. Configure the step to wait for a human user to send approval through an email message. Include details of the EMR task in the email message for further analysis.
B. Verify that the Step Functions state machine code has all IAM permissions that are necessary to create and run the EMR jobs. Verify that the Step Functions state machine code also includes IAM permissions to access the Amazon S3 buckets that the EMR jobs use. Use Access Analyzer for S3 to check the S3 access properties.
C. Check for entries in Amazon CloudWatch for the newly created EMR cluster. Change the AWS Step Functions state machine code to use Amazon EMR on EKS. Change the IAM access policies and the security group configuration for the Step Functions state machine code to reflect inclusion of Amazon Elastic Kubernetes Service (Amazon EKS).
D. Query the flow logs for the VPC. Determine whether the traffic that originates from the EMR cluster can successfully reach the data providers. Determine whether any security group that might be attached to the Amazon EMR cluster allows connections to the data source servers on the informed ports.
E. Check the retry scenarios that the company configured for the EMR jobs. Increase the number of seconds in the interval between each EMR task. Validate that each fallback state has the appropriate catch for each decision state. Configure an Amazon Simple Notification Service (Amazon SNS) topic to store the error messages.
Answer: B,D
Explanation: To identify the reason why the Step Functions state machine is not able to run the EMR jobs, the company should take the following steps:
Verify that the Step Functions state machine code has all IAM permissions that are necessary to create and run the EMR jobs. The state machine should have an IAM role that allows it to invoke the EMR APIs, such as RunJobFlow, AddJobFlowSteps, and DescribeStep. The state machine should also have IAM permissions to access the Amazon S3 buckets that the EMR jobs use as input and output locations. The company can use Access Analyzer for S3 to check the access policies and permissions of the S3 buckets. Therefore, option B is correct.
Query the flow logs for the VPC. The flow logs provide information about the network traffic to and from the EMR cluster that is launched in the VPC. The company can use the flow logs to determine whether the traffic that originates from the EMR cluster can successfully reach the data providers, such as Amazon RDS, Amazon Redshift, or other external sources. The company can also determine whether any security group that might be attached to the EMR cluster allows connections to the data source servers on the required ports. The company can use Amazon VPC Flow Logs or Amazon CloudWatch Logs Insights to query the flow logs. Therefore, option D is correct.
Option A is incorrect because it suggests using AWS CloudFormation to automate the Step Functions state machine deployment. While this is a good practice to ensure consistency and repeatability of the deployment, it does not help to identify the reason why the state machine is not able to run the EMR jobs. Moreover, creating a step to pause the state machine during the EMR jobs that fail and wait for a human user to send approval through an email message is not a reliable way to troubleshoot the issue. The company should use the Step Functions console or API to monitor the execution history and status of the state machine, and use Amazon CloudWatch to view the logs and metrics of the EMR jobs.
Option C is incorrect because it suggests changing the AWS Step Functions state machine code to use Amazon EMR on EKS. Amazon EMR on EKS is a service that allows you to run EMR jobs on Amazon Elastic Kubernetes Service (Amazon EKS) clusters. While this service has some benefits, such as lower cost and faster execution time, it does not support all the features and integrations that EMR on EC2 does, such as EMR Notebooks, EMR Studio, and EMRFS. Therefore, changing the state machine code to use EMR on EKS may not be compatible with the existing data pipeline and may introduce new issues.
Option E is incorrect because it suggests checking the retry scenarios that the company configured for the EMR jobs. While this is a good practice to handle transient failures and errors, it does not help to identify the root cause of why the state machine is not able to run the EMR jobs. Moreover, increasing the number of seconds in the interval between each EMR task may not improve the success rate of the jobs, and may increase the execution time and cost of the state machine. Configuring an Amazon SNS topic to store the error messages may help to notify the company of any failures, but it does not provide enough information to troubleshoot the issue.
References:
1: Manage an Amazon EMR Job - AWS Step Functions
2: Access Analyzer for S3 - Amazon Simple Storage Service
3: Working with Amazon EMR and VPC Flow Logs - Amazon EMR
4: Analyzing VPC Flow Logs with Amazon CloudWatch Logs Insights - Amazon Virtual Private Cloud
5: Monitor AWS Step Functions - AWS Step Functions
6: Monitor Amazon EMR clusters - Amazon EMR
7: Amazon EMR on Amazon EKS - Amazon EMR
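To make the flow-log check in option D concrete, here is a hedged CloudWatch Logs Insights sketch using boto3. It assumes the VPC flow logs are delivered to a log group named /vpc/flow-logs and that the EMR cluster lives in a 10.0.x.x range; both values are placeholders.

```python
import time
import boto3

logs = boto3.client("logs")

# Illustrative Logs Insights query against VPC flow logs: look for rejected
# traffic that originates from the EMR cluster's private IP range.
query = """
fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter action = "REJECT" and srcAddr like "10.0."
| sort @timestamp desc
| limit 50
"""

start = logs.start_query(
    logGroupName="/vpc/flow-logs",      # assumed flow log group name
    startTime=int(time.time()) - 3600,  # last hour
    endTime=int(time.time()),
    queryString=query,
)

# Poll until the query finishes, then print the matching flow records.
while True:
    result = logs.get_query_results(queryId=start["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result["results"]:
    print({field["field"]: field["value"] for field in row})
```

Rejected records pointing at the data source ports would indicate a security group or network ACL problem rather than an IAM one.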
Question # 55
A company maintains an Amazon Redshift provisioned cluster that the company uses for extract, transform, and load (ETL) operations to support critical analysis tasks. A sales team within the company maintains a Redshift cluster that the sales team uses for business intelligence (BI) tasks. The sales team recently requested access to the data that is in the ETL Redshift cluster so the team can perform weekly summary analysis tasks. The sales team needs to join data from the ETL cluster with data that is in the sales team's BI cluster. The company needs a solution that will share the ETL cluster data with the sales team without interrupting the critical analysis tasks. The solution must minimize usage of the computing resources of the ETL cluster. Which solution will meet these requirements?
A. Set up the sales team BI cluster as a consumer of the ETL cluster by using Redshift data sharing.
B. Create materialized views based on the sales team's requirements. Grant the sales team direct access to the ETL cluster.
C. Create database views based on the sales team's requirements. Grant the sales team direct access to the ETL cluster.
D. Unload a copy of the data from the ETL cluster to an Amazon S3 bucket every week. Create an Amazon Redshift Spectrum table based on the content of the ETL cluster.
Answer: A
Explanation: Redshift data sharing is a feature that enables you to share live data across different Redshift clusters without the need to copy or move data. Data sharing provides secure and governed access to data, while preserving the performance and concurrency benefits of Redshift. By setting up the sales team BI cluster as a consumer of the ETL cluster, the company can share the ETL cluster data with the sales team without interrupting the critical analysis tasks. The solution also minimizes the usage of the computing resources of the ETL cluster, as data sharing does not consume any storage space or compute resources from the producer cluster. The other options are either not feasible or not efficient. Creating materialized views or database views would require the sales team to have direct access to the ETL cluster, which could interfere with the critical analysis tasks. Unloading a copy of the data from the ETL cluster to an Amazon S3 bucket every week would introduce additional latency and cost, as well as create data inconsistency issues.
References:
Sharing data across Amazon Redshift clusters
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 2: Data Store Management, Section 2.2: Amazon Redshift
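A minimal sketch of the producer-side data sharing setup, run through the Redshift Data API; the datashare, cluster, database, and namespace values are placeholders.

```python
import boto3

rsd = boto3.client("redshift-data")

# Statements a producer (ETL) cluster admin might run; names are placeholders.
statements = [
    "CREATE DATASHARE etl_share;",
    "ALTER DATASHARE etl_share ADD SCHEMA public;",
    "ALTER DATASHARE etl_share ADD ALL TABLES IN SCHEMA public;",
    # Grant the share to the BI (consumer) cluster's namespace GUID.
    "GRANT USAGE ON DATASHARE etl_share "
    "TO NAMESPACE 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee';",
]

for sql in statements:
    rsd.execute_statement(
        ClusterIdentifier="etl-cluster",   # assumed producer cluster
        Database="etl_db",                 # assumed database
        DbUser="admin",                    # assumed database user
        Sql=sql,
    )
```

On the consumer (BI) cluster, an administrator would then run CREATE DATABASE ... FROM DATASHARE to query the shared objects alongside the sales team's own tables.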
Question # 56
A data engineer maintains custom Python scripts that perform a data formatting process that many AWS Lambda functions use. When the data engineer needs to modify the Python scripts, the data engineer must manually update all the Lambda functions. The data engineer requires a less manual way to update the Lambda functions. Which solution will meet this requirement?
A. Store a pointer to the custom Python scripts in the execution context object in a shared Amazon S3 bucket.
B. Package the custom Python scripts into Lambda layers. Apply the Lambda layers to the Lambda functions.
C. Store a pointer to the custom Python scripts in environment variables in a shared Amazon S3 bucket.
D. Assign the same alias to each Lambda function. Call each Lambda function by specifying the function's alias.
Answer: B
Explanation: Lambda layers are a way to share code and dependencies across multiple Lambda functions. By packaging the custom Python scripts into Lambda layers, the data engineer can update the scripts in one place, publish a new layer version, and apply it to all the Lambda functions that use the layer. This reduces the manual effort and ensures consistency across the Lambda functions. The other options are either not feasible or not efficient. Storing a pointer to the custom Python scripts in the execution context object or in environment variables would require the Lambda functions to download the scripts from Amazon S3 every time they are invoked, which would increase latency and cost. Assigning the same alias to each Lambda function would not help with updating the Python scripts, as an alias only points to a specific version of the Lambda function code.
References:
AWS Lambda layers
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 3: Data Ingestion and Transformation, Section 3.4: AWS Lambda
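A hedged sketch of the layer-based update flow with boto3: publish the shared scripts as a new layer version, then point each consuming function at it. The zip file, layer name, and function names are placeholders.

```python
import boto3

lam = boto3.client("lambda")

# Publish the shared formatting scripts as a new layer version. The zip is
# assumed to contain a python/ directory with the shared modules.
with open("formatting_scripts.zip", "rb") as f:   # assumed local artifact
    layer = lam.publish_layer_version(
        LayerName="data-formatting",               # assumed layer name
        Content={"ZipFile": f.read()},
        CompatibleRuntimes=["python3.12"],
    )

layer_arn = layer["LayerVersionArn"]

# Point each consuming function at the new layer version.
for function_name in ["format-orders", "format-clickstream"]:  # placeholders
    lam.update_function_configuration(
        FunctionName=function_name,
        Layers=[layer_arn],
    )
```

Because layer versions are immutable, the loop above is the single place where the rollout happens; the function code itself never changes.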
Question # 57
A company maintains multiple extract, transform, and load (ETL) workflows that ingest data from the company's operational databases into an Amazon S3 based data lake. The ETL workflows use AWS Glue and Amazon EMR to process data. The company wants to improve the existing architecture to provide automated orchestration and to require minimal manual effort. Which solution will meet these requirements with the LEAST operational overhead?
A. AWS Glue workflows
B. AWS Step Functions tasks
C. AWS Lambda functions
D. Amazon Managed Workflows for Apache Airflow (Amazon MWAA) workflows
Answer: A
Explanation: AWS Glue workflows are a feature of AWS Glue that enable you to create and visualize complex ETL pipelines using AWS Glue components, such as crawlers, jobs, triggers, and development endpoints. AWS Glue workflows provide automated orchestration and require minimal manual effort, as they handle dependency resolution, error handling, state management, and resource allocation for your ETL workflows. You can use AWS Glue workflows to ingest data from your operational databases into your Amazon S3 based data lake, and then use AWS Glue and Amazon EMR to process the data in the data lake. This solution will meet the requirements with the least operational overhead, as it leverages the serverless and fully managed nature of AWS Glue, and the scalability and flexibility of Amazon EMR. The other options are not optimal for the following reasons:
B. AWS Step Functions tasks. AWS Step Functions is a service that lets you coordinate multiple AWS services into serverless workflows. You can use AWS Step Functions tasks to invoke AWS Glue and Amazon EMR jobs as part of your ETL workflows, and use AWS Step Functions state machines to define the logic and flow of your workflows. However, this option would require more manual effort than AWS Glue workflows, as you would need to write JSON code to define your state machines, handle errors and retries, and monitor the execution history and status of your workflows.
C. AWS Lambda functions. AWS Lambda is a service that lets you run code without provisioning or managing servers. You can use AWS Lambda functions to trigger AWS Glue and Amazon EMR jobs as part of your ETL workflows, and use AWS Lambda event sources and destinations to orchestrate the flow of your workflows. However, this option would also require more manual effort than AWS Glue workflows, as you would need to write code to implement your business logic, handle errors and retries, and monitor the invocation and execution of your Lambda functions. Moreover, AWS Lambda functions have limitations on the execution time, memory, and concurrency, which may affect the performance and scalability of your ETL workflows.
D. Amazon Managed Workflows for Apache Airflow (Amazon MWAA) workflows. Amazon MWAA is a managed service that makes it easy to run open source Apache Airflow on AWS. Apache Airflow is a popular tool for creating and managing complex ETL pipelines using directed acyclic graphs (DAGs). You can use Amazon MWAA workflows to orchestrate AWS Glue and Amazon EMR jobs as part of your ETL workflows, and use the Airflow web interface to visualize and monitor your workflows. However, this option would have more operational overhead than AWS Glue workflows, as you would need to set up and configure your Amazon MWAA environment, write Python code to define your DAGs, and manage the dependencies and versions of your Airflow plugins and operators.
References:
1: AWS Glue Workflows
2: AWS Glue and Amazon EMR
3: AWS Step Functions
4: AWS Lambda
5: Amazon Managed Workflows for Apache Airflow
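For reference, a minimal boto3 sketch of a Glue workflow that chains two existing Glue jobs with an on-demand start trigger and a conditional trigger; all workflow, trigger, and job names are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Create a workflow that will group the triggers and jobs below.
glue.create_workflow(
    Name="etl-orchestration",
    Description="Ingest operational data into the S3 data lake",
)

# Start trigger: kicks off the ingestion job on demand.
glue.create_trigger(
    Name="start-ingest",
    WorkflowName="etl-orchestration",
    Type="ON_DEMAND",
    Actions=[{"JobName": "ingest-operational-db"}],
)

# Conditional trigger: run the transform job after the ingestion job succeeds.
glue.create_trigger(
    Name="after-ingest",
    WorkflowName="etl-orchestration",
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "JobName": "ingest-operational-db",
                "State": "SUCCEEDED",
            }
        ]
    },
    Actions=[{"JobName": "transform-to-parquet"}],
)
```

Glue then tracks dependency resolution and run state for the whole graph, which is the "minimal manual effort" the answer highlights.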
Question # 58
A company has multiple applications that use datasets that are stored in an Amazon S3 bucket. The company has an ecommerce application that generates a dataset that contains personally identifiable information (PII). The company has an internal analytics application that does not require access to the PII. To comply with regulations, the company must not share PII unnecessarily. A data engineer needs to implement a solution that will redact PII dynamically, based on the needs of each application that accesses the dataset. Which solution will meet the requirements with the LEAST operational overhead?
A. Create an S3 bucket policy to limit the access each application has. Create multiple copies of the dataset. Give each dataset copy the appropriate level of redaction for the needs of the application that accesses the copy.
B. Create an S3 Object Lambda endpoint. Use the S3 Object Lambda endpoint to read data from the S3 bucket. Implement redaction logic within an S3 Object Lambda function to dynamically redact PII based on the needs of each application that accesses the data.
C. Use AWS Glue to transform the data for each application. Create multiple copies of the dataset. Give each dataset copy the appropriate level of redaction for the needs of the application that accesses the copy.
D. Create an API Gateway endpoint that has custom authorizers. Use the API Gateway endpoint to read data from the S3 bucket. Initiate a REST API call to dynamically redact PII based on the needs of each application that accesses the data.
Answer: B
Explanation: Option B is the best solution to meet the requirements with the least operational overhead because S3 Object Lambda is a feature that allows you to add your own code to process data retrieved from S3 before returning it to an application. S3 Object Lambda works with S3 GET requests and can modify both the object metadata and the object data. By using S3 Object Lambda, you can implement redaction logic within an S3 Object Lambda function to dynamically redact PII based on the needs of each application that accesses the data. This way, you can avoid creating and maintaining multiple copies of the dataset with different levels of redaction.
Option A is not a good solution because it involves creating and managing multiple copies of the dataset with different levels of redaction for each application. This option adds complexity and storage cost to the data protection process and requires additional resources and configuration. Moreover, S3 bucket policies cannot enforce fine-grained data access control at the row and column level, so they are not sufficient to redact PII.
Option C is not a good solution because it involves using AWS Glue to transform the data for each application. AWS Glue is a fully managed service that can extract, transform, and load (ETL) data from various sources to various destinations, including S3. AWS Glue can also convert data to different formats, such as Parquet, which is a columnar storage format that is optimized for analytics. However, in this scenario, using AWS Glue to redact PII is not the best option because it requires creating and maintaining multiple copies of the dataset with different levels of redaction for each application. This option also adds extra time and cost to the data protection process and requires additional resources and configuration.
Option D is not a good solution because it involves creating and configuring an API Gateway endpoint that has custom authorizers. API Gateway is a service that allows you to create, publish, maintain, monitor, and secure APIs at any scale. API Gateway can also integrate with other AWS services, such as Lambda, to provide custom logic for processing requests. However, in this scenario, using API Gateway to redact PII is not the best option because it requires writing and maintaining custom code and configuration for the API endpoint, the custom authorizers, and the REST API call. This option also adds complexity and latency to the data protection process and requires additional resources and configuration.
References:
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
Introducing Amazon S3 Object Lambda – Use Your Code to Process Data as It Is Being Retrieved from S3
Using Bucket Policies and User Policies - Amazon Simple Storage Service
AWS Glue Documentation
What is Amazon API Gateway? - Amazon API Gateway
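A hedged sketch of the Object Lambda approach from option B: a Lambda handler that fetches the original object through the presigned URL S3 supplies, blanks an assumed PII column in a CSV, and returns the transformed object with WriteGetObjectResponse. The column position and redaction logic are illustrative only.

```python
import urllib.request
import boto3

s3 = boto3.client("s3")


def handler(event, context):
    """Illustrative S3 Object Lambda handler that redacts an assumed
    PII column (third CSV field) before returning the object."""
    ctx = event["getObjectContext"]

    # Fetch the original object through the presigned URL that S3 provides.
    with urllib.request.urlopen(ctx["inputS3Url"]) as response:
        original = response.read().decode("utf-8")

    # Naive redaction for illustration: blank out the third CSV column.
    redacted_lines = []
    for i, line in enumerate(original.splitlines()):
        fields = line.split(",")
        if i > 0 and len(fields) > 2:   # skip the header row
            fields[2] = "REDACTED"
        redacted_lines.append(",".join(fields))

    # Return the transformed object to the requesting application.
    s3.write_get_object_response(
        RequestRoute=ctx["outputRoute"],
        RequestToken=ctx["outputToken"],
        Body="\n".join(redacted_lines).encode("utf-8"),
    )
    return {"statusCode": 200}
```

The analytics application would read through the Object Lambda access point, while applications that legitimately need PII could keep using the bucket's standard access point, so only one copy of the data exists.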
Question # 59
A company stores daily records of the financial performance of investment portfolios in .csv format in an Amazon S3 bucket. A data engineer uses AWS Glue crawlers to crawl the S3 data. The data engineer must make the S3 data accessible daily in the AWS Glue Data Catalog. Which solution will meet these requirements?
A. Create an IAM role that includes the AmazonS3FullAccess policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Create a daily schedule to run the crawler. Configure the output destination to a new path in the existing S3 bucket.
B. Create an IAM role that includes the AWSGlueServiceRole policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Create a daily schedule to run the crawler. Specify a database name for the output.
C. Create an IAM role that includes the AmazonS3FullAccess policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Allocate data processing units (DPUs) to run the crawler every day. Specify a database name for the output.
D. Create an IAM role that includes the AWSGlueServiceRole policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Allocate data processing units (DPUs) to run the crawler every day. Configure the output destination to a new path in the existing S3 bucket.
Answer: B
Explanation: To make the S3 data accessible daily in the AWS Glue Data Catalog, the data engineer needs to create a crawler that can crawl the S3 data and write the metadata to the Data Catalog. The crawler also needs to run on a daily schedule to keep the Data Catalog updated with the latest data. Therefore, the solution must include the following steps:
Create an IAM role that has the necessary permissions to access the S3 data and the Data Catalog. The AWSGlueServiceRole policy is a managed policy that grants these permissions.
Associate the role with the crawler.
Specify the S3 bucket path of the source data as the crawler's data store. The crawler will scan the data and infer the schema and format.
Create a daily schedule to run the crawler. The crawler will run at the specified time every day and update the Data Catalog with any changes in the data.
Specify a database name for the output. The crawler will create or update a table in the Data Catalog under the specified database. The table will contain the metadata about the data in the S3 bucket, such as the location, schema, and classification.
Option B is the only solution that includes all these steps. Therefore, option B is the correct answer.
Option A is incorrect because it configures the output destination to a new path in the existing S3 bucket. This is unnecessary and may cause confusion, as the crawler does not write any data to the S3 bucket, only metadata to the Data Catalog.
Option C is incorrect because it allocates data processing units (DPUs) to run the crawler every day. This is also unnecessary, as DPUs are only used for AWS Glue ETL jobs, not crawlers.
Option D is incorrect because it combines the errors of options A and C. It configures the output destination to a new path in the existing S3 bucket and allocates DPUs to run the crawler every day, both of which are irrelevant for the crawler.
References:
2: Data Catalog and crawlers in AWS Glue - AWS Glue
3: Scheduling an AWS Glue crawler - AWS Glue
4: Parameters set on Data Catalog tables by crawler - AWS Glue
5: AWS Glue pricing - Amazon Web Services (AWS)
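To illustrate the steps above, a minimal boto3 sketch that creates a scheduled crawler matching option B; the role ARN, bucket path, database name, and cron expression are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Illustrative crawler definition. The role is assumed to have the
# AWSGlueServiceRole managed policy plus read access to the source bucket.
glue.create_crawler(
    Name="daily-portfolio-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="portfolio_db",                      # Data Catalog database
    Targets={"S3Targets": [{"Path": "s3://portfolio-records/daily/"}]},
    Schedule="cron(0 1 * * ? *)",                     # run daily at 01:00 UTC
)

# Start an immediate run instead of waiting for the schedule (optional).
glue.start_crawler(Name="daily-portfolio-crawler")
```

Note that only a database name is specified for the output; the crawler writes table metadata to the Data Catalog, not files back to S3.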
Question # 60
A company uses Amazon Athena for one-time queries against data that is in Amazon S3. The company has several use cases. The company must implement permission controls to separate query processes and access to query history among users, teams, and applications that are in the same AWS account. Which solution will meet these requirements?
A. Create an S3 bucket for each use case. Create an S3 bucket policy that grants permissions to appropriate individual IAM users. Apply the S3 bucket policy to the S3 bucket.
B. Create an Athena workgroup for each use case. Apply tags to the workgroup. Create an IAM policy that uses the tags to apply appropriate permissions to the workgroup.
C. Create an IAM role for each use case. Assign appropriate permissions to the role for each use case. Associate the role with Athena.
D. Create an AWS Glue Data Catalog resource policy that grants permissions to appropriate individual IAM users for each use case. Apply the resource policy to the specific tables that Athena uses.
Answer: B
Explanation: Athena workgroups are a way to isolate query execution and query history among users, teams, and applications that share the same AWS account. By creating a workgroup for each use case, the company can control the access and actions on the workgroup resource using resource-level IAM permissions or identity-based IAM policies. The company can also use tags to organize and identify the workgroups, and use them as conditions in the IAM policies to grant or deny permissions to the workgroup. This solution meets the requirements of separating query processes and access to query history among users, teams, and applications that are in the same AWS account.
References:
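To make option B concrete, a hedged boto3 sketch that creates a tagged workgroup for one use case and runs a query inside it; the workgroup name, tag, result bucket, and query are placeholders. An IAM policy could then allow athena actions only when the aws:ResourceTag/use-case value matches the caller's team.

```python
import boto3

athena = boto3.client("athena")

# Illustrative workgroup per use case, tagged so IAM policies can scope
# access with an aws:ResourceTag condition. All names are placeholders.
athena.create_work_group(
    Name="sales-bi",
    Description="Workgroup for the sales BI use case",
    Configuration={
        "ResultConfiguration": {
            "OutputLocation": "s3://athena-results-sales-bi/"  # assumed bucket
        },
        "PublishCloudWatchMetricsEnabled": True,
    },
    Tags=[{"Key": "use-case", "Value": "sales-bi"}],
)

# Run a query inside the workgroup; its execution and query history stay
# scoped to that workgroup.
athena.start_query_execution(
    QueryString="SELECT * FROM sales.orders LIMIT 10;",
    WorkGroup="sales-bi",
)
```

Each team sees only the query history of the workgroups it is allowed to use, which is the separation the question asks for.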