A company has a production AWS account that runs company workloads. The company's security team created a security AWS account to store and analyze security logs from the production AWS account. The security logs in the production AWS account are stored in Amazon CloudWatch Logs. The company needs to use Amazon Kinesis Data Streams to deliver the security logs to the security AWS account. Which solution will meet these requirements?
A. Create a destination data stream in the production AWS account. In the security AWS account, create an IAM role that has cross-account permissions to Kinesis Data Streams in the production AWS account.
B. Create a destination data stream in the security AWS account. Create an IAM role and a trust policy to grant CloudWatch Logs the permission to put data into the stream. Create a subscription filter in the security AWS account.
C. Create a destination data stream in the production AWS account. In the production AWS account, create an IAM role that has cross-account permissions to Kinesis Data Streams in the security AWS account.
D. Create a destination data stream in the security AWS account. Create an IAM role and a trust policy to grant CloudWatch Logs the permission to put data into the stream. Create a subscription filter in the production AWS account.
Answer: D
Explanation: Amazon Kinesis Data Streams is a service that enables you to collect, process, and analyze real-time streaming data. You can use Kinesis Data Streams to ingest data from various sources, such as Amazon CloudWatch Logs, and deliver it to downstream consumers and destinations such as Amazon S3 or Amazon Redshift. To use Kinesis Data Streams to deliver the security logs from the production AWS account to the security AWS account, you need to create a destination data stream in the security AWS account. This data stream receives the log data from the CloudWatch Logs service in the production AWS account. To enable this cross-account data delivery, you need to create an IAM role and a trust policy in the security AWS account. The IAM role defines the permissions that the CloudWatch Logs service needs to put data into the destination data stream, and the trust policy allows the CloudWatch Logs service to assume that role. Finally, you need to create a subscription filter in the production AWS account. A subscription filter defines the pattern to match log events and the destination to send the matching events to. In this case, the destination is the data stream in the security AWS account. This solution meets the requirement of using Kinesis Data Streams to deliver the security logs to the security AWS account. The other options are either not possible or not optimal. You cannot create a destination data stream in the production AWS account, as this would not deliver the data to the security AWS account. You cannot create a subscription filter in the security AWS account, as this would not capture the log events from the production AWS account.
References:
Using Amazon Kinesis Data Streams with Amazon CloudWatch Logs
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 3: Data Ingestion and Transformation, Section 3.3: Amazon Kinesis Data Streams
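For reference, a minimal boto3 sketch of this cross-account wiring. The profile names, account IDs, region, stream, role, and log group names are placeholder assumptions, not values from the question; this illustrates the documented flow (stream, role, and CloudWatch Logs destination in the security account, subscription filter in the production account) rather than a production-ready setup.

    import json
    import boto3

    # --- In the SECURITY account (placeholder account ID 222222222222) ---
    sec = boto3.Session(profile_name="security")     # assumed named profile
    kinesis = sec.client("kinesis")
    iam = sec.client("iam")
    logs_sec = sec.client("logs")

    kinesis.create_stream(StreamName="SecurityLogStream", ShardCount=1)

    # Trust policy: the CloudWatch Logs service assumes this role to write to the stream.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "logs.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }
    role = iam.create_role(RoleName="CWLtoKinesisRole",
                           AssumeRolePolicyDocument=json.dumps(trust_policy))
    iam.put_role_policy(
        RoleName="CWLtoKinesisRole",
        PolicyName="PutIntoStream",
        PolicyDocument=json.dumps({
            "Version": "2012-10-17",
            "Statement": [{"Effect": "Allow", "Action": "kinesis:PutRecord",
                           "Resource": "arn:aws:kinesis:us-east-1:222222222222:stream/SecurityLogStream"}],
        }),
    )

    # A CloudWatch Logs destination fronts the stream and is shared with the
    # production account (placeholder 111111111111).
    dest = logs_sec.put_destination(
        destinationName="SecurityLogsDestination",
        targetArn="arn:aws:kinesis:us-east-1:222222222222:stream/SecurityLogStream",
        roleArn=role["Role"]["Arn"],
    )
    logs_sec.put_destination_policy(
        destinationName="SecurityLogsDestination",
        accessPolicy=json.dumps({
            "Version": "2012-10-17",
            "Statement": [{"Effect": "Allow",
                           "Principal": {"AWS": "111111111111"},
                           "Action": "logs:PutSubscriptionFilter",
                           "Resource": dest["destination"]["arn"]}],
        }),
    )

    # --- In the PRODUCTION account: the subscription filter points at the destination ---
    prod_logs = boto3.Session(profile_name="production").client("logs")
    prod_logs.put_subscription_filter(
        logGroupName="/security/app-logs",   # placeholder log group
        filterName="ToSecurityAccount",
        filterPattern="",                    # empty pattern forwards every log event
        destinationArn=dest["destination"]["arn"],
    )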
Question # 22
A company is migrating on-premises workloads to AWS. The company wants to reduce overall operational overhead. The company also wants to explore serverless options. The company's current workloads use Apache Pig, Apache Oozie, Apache Spark, Apache HBase, and Apache Flink. The on-premises workloads process petabytes of data in seconds. The company must maintain similar or better performance after the migration to AWS. Which extract, transform, and load (ETL) service will meet these requirements?
A. AWS Glue
B. Amazon EMR
C. AWS Lambda
D. Amazon Redshift
Answer: A
Explanation: AWS Glue is a fully managed serverless ETL service that can handle petabytes of data in seconds. AWS Glue can run Apache Spark and Apache Flink jobs without requiring any infrastructure provisioning or management. AWS Glue can also integrate with Apache Pig, Apache Oozie, and Apache HBase using AWS Glue Data Catalog and AWS Glue workflows. AWS Glue can reduce the overall operational overhead by automating the data discovery, data preparation, and data loading processes. AWS Glue can also optimize the cost and performance of ETL jobs by using AWS Glue Job Bookmarking, AWS Glue Crawlers, and AWS Glue Schema Registry.
References:
AWS Glue
AWS Glue Data Catalog
AWS Glue Workflows
[AWS Glue Job Bookmarking]
[AWS Glue Crawlers]
[AWS Glue Schema Registry]
[AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide]
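As a rough illustration of how a serverless Glue Spark job is provisioned with boto3 (the job name, IAM role, script location, and worker settings below are placeholder assumptions):

    import boto3

    glue = boto3.client("glue")

    # Register a serverless Spark ETL job; Glue allocates workers only while the job runs.
    glue.create_job(
        Name="daily-etl",                                     # placeholder name
        Role="arn:aws:iam::123456789012:role/GlueJobRole",    # placeholder role
        Command={"Name": "glueetl",
                 "ScriptLocation": "s3://example-bucket/scripts/etl_job.py",
                 "PythonVersion": "3"},
        GlueVersion="4.0",
        WorkerType="G.1X",
        NumberOfWorkers=10,
    )

    # Start a run on demand; there is no cluster to create, patch, or tear down.
    run = glue.start_job_run(JobName="daily-etl")
    print(run["JobRunId"])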
Question # 23
A data engineering team is using an Amazon Redshift data warehouse for operational reporting. The team wants to prevent performance issues that might result from long-running queries. A data engineer must choose a system table in Amazon Redshift to record anomalies when a query optimizer identifies conditions that might indicate performance issues. Which table views should the data engineer use to meet this requirement?
A. STL_USAGE_CONTROL
B. STL_ALERT_EVENT_LOG
C. STL_QUERY_METRICS
D. STL_PLAN_INFO
Answer: B
Explanation: The STL_ALERT_EVENT_LOG table view records anomalies when the query optimizer identifies conditions that might indicate performance issues. These conditions include skewed data distribution, missing statistics, nested loop joins, and broadcasted data. The STL_ALERT_EVENT_LOG table view can help the data engineer identify and troubleshoot the root causes of performance issues and optimize the query execution plan. The other table views are not relevant for this requirement. STL_USAGE_CONTROL records the usage limits and quotas for Amazon Redshift resources. STL_QUERY_METRICS records the execution time and resource consumption of queries. STL_PLAN_INFO records the query execution plan and the steps involved in each query.
References:
STL_ALERT_EVENT_LOG
System Tables and Views
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
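A hedged example of how a data engineer might inspect these optimizer alerts through the Redshift Data API (the cluster identifier, database, and user below are placeholders):

    import time
    import boto3

    rsd = boto3.client("redshift-data")

    # Look for optimizer alerts (missing statistics, nested loops, large broadcasts, ...)
    # recorded for queries from the last day.
    sql = """
        SELECT query, event, solution, event_time
        FROM stl_alert_event_log
        WHERE event_time > dateadd(day, -1, getdate())
        ORDER BY event_time DESC;
    """
    stmt = rsd.execute_statement(
        ClusterIdentifier="reporting-cluster",   # placeholder
        Database="dev",                          # placeholder
        DbUser="admin",                          # placeholder
        Sql=sql,
    )

    # Poll until the statement finishes, then print the alert rows.
    while rsd.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(1)
    for row in rsd.get_statement_result(Id=stmt["Id"])["Records"]:
        print([col.get("stringValue", col.get("longValue")) for col in row])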
Question # 24
A media company wants to improve a system that recommends media content to customers based on user behavior and preferences. To improve the recommendation system, the company needs to incorporate insights from third-party datasets into the company's existing analytics platform. The company wants to minimize the effort and time required to incorporate third-party datasets. Which solution will meet these requirements with the LEAST operational overhead?
A. Use API calls to access and integrate third-party datasets from AWS Data Exchange.
B. Use API calls to access and integrate third-party datasets from AWS.
C. Use Amazon Kinesis Data Streams to access and integrate third-party datasets from AWS CodeCommit repositories.
D. Use Amazon Kinesis Data Streams to access and integrate third-party datasets from Amazon Elastic Container Registry (Amazon ECR).
Answer: A
Explanation: AWS Data Exchange is a service that makes it easy to find, subscribe to, and use third-party data in the cloud. It provides a secure and reliable way to access and integrate data from various sources, such as data providers, public datasets, or AWS services. Using AWS Data Exchange, you can browse and subscribe to data products that suit your needs, and then use API calls or the AWS Management Console to export the data to Amazon S3, where you can use it with your existing analytics platform. This solution minimizes the effort and time required to incorporate third-party datasets, as you do not need to set up and manage data pipelines, storage, or access controls. You also benefit from the data quality and freshness provided by the data providers, who can update their data products as frequently as needed [1, 2].
The other options are not optimal for the following reasons:
B. Use API calls to access and integrate third-party datasets from AWS. This option is vague and does not specify which AWS service or feature is used to access and integrate third-party datasets. AWS offers a variety of services and features that can help with data ingestion, processing, and analysis, but not all of them are suitable for the given scenario. For example, AWS Glue is a serverless data integration service that can help you discover, prepare, and combine data from various sources, but it requires you to create and run data extraction, transformation, and loading (ETL) jobs, which can add operational overhead [3].
C. Use Amazon Kinesis Data Streams to access and integrate third-party datasets from AWS CodeCommit repositories. This option is not feasible, as AWS CodeCommit is a source control service that hosts secure Git-based repositories, not a data source that can be accessed by Amazon Kinesis Data Streams. Amazon Kinesis Data Streams is a service that enables you to capture, process, and analyze data streams in real time, such as clickstream data, application logs, or IoT telemetry. It does not support accessing and integrating data from AWS CodeCommit repositories, which are meant for storing and managing code, not data.
D. Use Amazon Kinesis Data Streams to access and integrate third-party datasets from Amazon Elastic Container Registry (Amazon ECR). This option is also not feasible, as Amazon ECR is a fully managed container registry service that stores, manages, and deploys container images, not a data source that can be accessed by Amazon Kinesis Data Streams. Amazon Kinesis Data Streams does not support accessing and integrating data from Amazon ECR, which is meant for storing and managing container images, not data.
References:
1: AWS Data Exchange User Guide
2: AWS Data Exchange FAQs
3: AWS Glue Developer Guide
AWS CodeCommit User Guide
Amazon Kinesis Data Streams Developer Guide
Amazon Elastic Container Registry User Guide
Build a Continuous Delivery Pipeline for Your Container Images with Amazon ECR as Source
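A minimal sketch of exporting a subscribed AWS Data Exchange asset to Amazon S3 with boto3, assuming an active subscription; the data set, revision, and asset IDs, bucket, and key are placeholders, and the request shape follows the Data Exchange CreateJob API as I understand it:

    import boto3

    dx = boto3.client("dataexchange")

    # Export one asset of a subscribed revision straight into the analytics bucket.
    job = dx.create_job(
        Type="EXPORT_ASSETS_TO_S3",
        Details={
            "ExportAssetsToS3": {
                "DataSetId": "ds-1234567890abcdef",        # placeholder
                "RevisionId": "rev-1234567890abcdef",      # placeholder
                "AssetDestinations": [{
                    "AssetId": "asset-1234567890abcdef",   # placeholder
                    "Bucket": "analytics-third-party-data",
                    "Key": "thirdparty/behavior/latest.csv",
                }],
            }
        },
    )
    dx.start_job(JobId=job["Id"])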
Question # 25
A company uses an on-premises Microsoft SQL Server database to store financial transaction data. The company migrates the transaction data from the on-premises database to AWS at the end of each month. The company has noticed that the cost to migrate data from the on-premises database to an Amazon RDS for SQL Server database has increased recently. The company requires a cost-effective solution to migrate the data to AWS. The solution must cause minimal downtime for the applications that access the database. Which AWS service should the company use to meet these requirements?
A. AWS Lambda
B. AWS Database Migration Service (AWS DMS)
C. AWS Direct Connect
D. AWS DataSync
Answer: B
Explanation: AWS Database Migration Service (AWS DMS) is a cloud service that makes it possible to migrate relational databases, data warehouses, NoSQL databases, and other types of data stores to AWS quickly, securely, and with minimal downtime and zero data loss [1]. AWS DMS supports migration between 20-plus database and analytics engines, such as Microsoft SQL Server to Amazon RDS for SQL Server [2]. AWS DMS takes over many of the difficult or tedious tasks involved in a migration project, such as capacity analysis, hardware and software procurement, installation and administration, testing and debugging, and ongoing replication and monitoring [1]. AWS DMS is a cost-effective solution, as you only pay for the compute resources and additional log storage used during the migration process [2]. AWS DMS is the best solution for the company to migrate the financial transaction data from the on-premises Microsoft SQL Server database to AWS, as it meets the requirements of minimal downtime, zero data loss, and low cost.
Option A is not the best solution, as AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers, but it does not provide any built-in features for database migration. You would have to write your own code to extract, transform, and load the data from the source to the target, which would increase the operational overhead and complexity.
Option C is not the best solution, as AWS Direct Connect is a service that establishes a dedicated network connection from your premises to AWS, but it does not provide any built-in features for database migration. You would still need to use another service or tool to perform the actual data transfer, which would increase the cost and complexity.
Option D is not the best solution, as AWS DataSync is a service that makes it easy to transfer data between on-premises storage systems and AWS storage services, such as Amazon S3, Amazon EFS, and Amazon FSx for Windows File Server, but it does not support Amazon RDS for SQL Server as a target. You would have to use another service or tool to migrate the data from Amazon S3 to Amazon RDS for SQL Server, which would increase the latency and complexity.
References:
Database Migration - AWS Database Migration Service - AWS
What is AWS Database Migration Service?
AWS Database Migration Service Documentation
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
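A hedged sketch of the kind of AWS DMS task that keeps downtime small by combining a full load with ongoing change data capture; the endpoint and replication instance ARNs, task name, and table mappings below are placeholders:

    import json
    import boto3

    dms = boto3.client("dms")

    # Select every table in the (assumed) dbo schema of the source SQL Server database.
    table_mappings = {
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-dbo",
            "object-locator": {"schema-name": "dbo", "table-name": "%"},
            "rule-action": "include",
        }]
    }

    # Full load plus CDC so the application cutover window stays minimal.
    task = dms.create_replication_task(
        ReplicationTaskIdentifier="sqlserver-to-rds",
        SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SRC",   # placeholder
        TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TGT",   # placeholder
        ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:RI",    # placeholder
        MigrationType="full-load-and-cdc",
        TableMappings=json.dumps(table_mappings),
    )
    dms.start_replication_task(
        ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
        StartReplicationTaskType="start-replication",
    )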
Question # 26
A company has used an Amazon Redshift table that is named Orders for 6 months. The company performs weekly updates and deletes on the table. The table has an interleaved sort key on a column that contains AWS Regions. The company wants to reclaim disk space so that the company will not run out of storage space. The company also wants to analyze the sort key column. Which Amazon Redshift command will meet these requirements?
A. VACUUM FULL Orders
B. VACUUM DELETE ONLY Orders
C. VACUUM REINDEX Orders
D. VACUUM SORT ONLY Orders
Answer: C
Explanation: Amazon Redshift is a fully managed, petabyte-scale data warehouse service that enables fast and cost-effective analysis of large volumes of data. Amazon Redshift uses columnar storage, compression, and zone maps to optimize the storage and performance of data. However, over time, as data is inserted, updated, or deleted, the physical storage of data can become fragmented, resulting in wasted disk space and degraded query performance. To address this issue, Amazon Redshift provides the VACUUM command, which reclaims disk space and resorts rows in either a specified table or all tables in the current schema [1].
The VACUUM command has four options: FULL, DELETE ONLY, SORT ONLY, and REINDEX. The option that best meets the requirements of the question is VACUUM REINDEX, which re-sorts the rows in a table that has an interleaved sort key and rewrites the table to a new location on disk. An interleaved sort key is a type of sort key that gives equal weight to each column in the sort key, and stores the rows in a way that optimizes the performance of queries that filter by multiple columns in the sort key. However, as data is added or changed, the interleaved sort order can become skewed, resulting in suboptimal query performance. The VACUUM REINDEX option restores the optimal interleaved sort order and reclaims disk space by removing deleted rows. This option also analyzes the sort key column and updates the table statistics, which are used by the query optimizer to generate the most efficient query execution plan [2, 3].
The other options are not optimal for the following reasons:
A. VACUUM FULL Orders. This option reclaims disk space by removing deleted rows and resorts the entire table. However, this option is not suitable for tables that have an interleaved sort key, as it does not restore the optimal interleaved sort order. Moreover, this option is the most resource-intensive and time-consuming, as it rewrites the entire table to a new location on disk.
B. VACUUM DELETE ONLY Orders. This option reclaims disk space by removing deleted rows, but does not resort the table. This option is not suitable for tables that have any sort key, as it does not improve the query performance by restoring the sort order. Moreover, this option does not analyze the sort key column and update the table statistics.
D. VACUUM SORT ONLY Orders. This option resorts the entire table, but does not reclaim disk space by removing deleted rows. This option is not suitable for tables that have an interleaved sort key, as it does not restore the optimal interleaved sort order. Moreover, this option does not analyze the sort key column and update the table statistics.
References:
1: Amazon Redshift VACUUM
2: Amazon Redshift Interleaved Sorting
3: Amazon Redshift ANALYZE
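A short sketch of issuing the command through the Redshift Data API; the cluster identifier, database, and user are placeholder assumptions, and the follow-up ANALYZE is optional housekeeping rather than part of the required answer:

    import boto3

    rsd = boto3.client("redshift-data")

    # Re-sort the interleaved table, reclaim space from deleted rows,
    # and re-analyze the interleaved sort key in one pass.
    rsd.execute_statement(
        ClusterIdentifier="reporting-cluster",   # placeholder
        Database="dev",                          # placeholder
        DbUser="admin",                          # placeholder
        Sql="VACUUM REINDEX orders;",
    )

    # Optionally refresh planner statistics afterwards.
    rsd.execute_statement(
        ClusterIdentifier="reporting-cluster",
        Database="dev",
        DbUser="admin",
        Sql="ANALYZE orders;",
    )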
Question # 27
A company extracts approximately 1 TB of data every day from data sources such as SAP HANA, Microsoft SQL Server, MongoDB, Apache Kafka, and Amazon DynamoDB. Some of the data sources have undefined data schemas or data schemas that change. A data engineer must implement a solution that can detect the schema for these data sources. The solution must extract, transform, and load the data to an Amazon S3 bucket. The company has a service level agreement (SLA) to load the data into the S3 bucket within 15 minutes of data creation. Which solution will meet these requirements with the LEAST operational overhead?
A. Use Amazon EMR to detect the schema and to extract, transform, and load the data into the S3 bucket. Create a pipeline in Apache Spark.
B. Use AWS Glue to detect the schema and to extract, transform, and load the data into the S3 bucket. Create a pipeline in Apache Spark.
C. Create a PySpark program in AWS Lambda to extract, transform, and load the data into the S3 bucket.
D. Create a stored procedure in Amazon Redshift to detect the schema and to extract, transform, and load the data into a Redshift Spectrum table. Access the table from Amazon S3.
Answer: B
Explanation: AWS Glue is a fully managed service that provides a serverless data integration platform. It can automatically discover and categorize data from various sources, including SAP HANA, Microsoft SQL Server, MongoDB, Apache Kafka, and Amazon DynamoDB. It can also infer the schema of the data and store it in the AWS Glue Data Catalog, which is a central metadata repository. AWS Glue can then use the schema information to generate and run Apache Spark code to extract, transform, and load the data into an Amazon S3 bucket. AWS Glue can also monitor and optimize the performance and cost of the data pipeline, and handle any schema changes that may occur in the source data. AWS Glue can meet the SLA of loading the data into the S3 bucket within 15 minutes of data creation, as it can trigger the data pipeline based on events, schedules, or on demand. AWS Glue has the least operational overhead among the options, as it does not require provisioning, configuring, or managing any servers or clusters. It also handles scaling, patching, and security automatically.
References:
AWS Glue
[AWS Glue Data Catalog]
[AWS Glue Developer Guide]
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
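A hedged sketch of a Glue crawler that infers changing schemas from several sources and records them in the Data Catalog; the crawler name, connection names, table paths, IAM role, and 15-minute schedule are placeholder assumptions chosen to match the SLA in the question:

    import boto3

    glue = boto3.client("glue")

    # The crawler detects and versions the schema for each source; a 15-minute
    # schedule keeps the catalog (and downstream ETL jobs) within the ingestion SLA.
    glue.create_crawler(
        Name="multi-source-crawler",                            # placeholder
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder
        DatabaseName="ingest_catalog",
        Targets={
            "JdbcTargets": [{"ConnectionName": "sqlserver-conn", "Path": "sales/dbo/%"}],
            "DynamoDBTargets": [{"Path": "Orders"}],
            "MongoDBTargets": [{"ConnectionName": "mongodb-conn", "Path": "appdb/events"}],
        },
        Schedule="cron(0/15 * * * ? *)",
    )
    glue.start_crawler(Name="multi-source-crawler")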
Question # 28
A company stores data from an application in an Amazon DynamoDB table that operates in provisioned capacity mode. The workloads of the application have predictable throughput load on a regular schedule. Every Monday, there is an immediate increase in activity early in the morning. The application has very low usage during weekends. The company must ensure that the application performs consistently during peak usage times. Which solution will meet these requirements in the MOST cost-effective way?
A. Increase the provisioned capacity to the maximum capacity that is currently present during peak load times.
B. Divide the table into two tables. Provision each table with half of the provisioned capacity of the original table. Spread queries evenly across both tables.
C. Use AWS Application Auto Scaling to schedule higher provisioned capacity for peak usage times. Schedule lower capacity during off-peak times.
D. Change the capacity mode from provisioned to on-demand. Configure the table to scale up and scale down based on the load on the table.
Answer: C
Explanation: Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. DynamoDB offers two capacity modes for throughput capacity: provisioned and on-demand. In provisioned capacity mode, you specify the number of read and write capacity units per second that you expect your application to require. DynamoDB reserves the resources to meet your throughput needs with consistent performance. In on-demand capacity mode, you pay per request and DynamoDB scales the resources up and down automatically based on the actual workload. On-demand capacity mode is suitable for unpredictable workloads that can vary significantly over time [1].
The solution that meets the requirements in the most cost-effective way is to use AWS Application Auto Scaling to schedule higher provisioned capacity for peak usage times and lower capacity during off-peak times. This solution has the following advantages:
It allows you to optimize the cost and performance of your DynamoDB table by adjusting the provisioned capacity according to your predictable workload patterns. You can use scheduled scaling to specify the date and time for the scaling actions, and the new minimum and maximum capacity limits. For example, you can schedule higher capacity for every Monday morning and lower capacity for weekends [2].
It enables you to take advantage of the lower cost per unit of provisioned capacity mode compared to on-demand capacity mode. Provisioned capacity mode charges a flat hourly rate for the capacity you reserve, regardless of how much you use. On-demand capacity mode charges for each read and write request you consume, with no minimum capacity required. For predictable workloads, provisioned capacity mode can be more cost-effective than on-demand capacity mode [1].
It ensures that your application performs consistently during peak usage times by having enough capacity to handle the increased load. You can also use auto scaling to automatically adjust the provisioned capacity based on the actual utilization of your table, and set a target utilization percentage for your table or global secondary index. This way, you can avoid under-provisioning or over-provisioning your table [2].
Option A is incorrect because it suggests increasing the provisioned capacity to the maximum capacity that is currently present during peak load times. This solution has the following disadvantages:
It wastes money by paying for unused capacity during off-peak times. If you provision the same high capacity for all times, regardless of the actual workload, you are over-provisioning your table and paying for resources that you don't need [1].
It does not account for possible changes in the workload patterns over time. If your peak load times increase or decrease in the future, you may need to manually adjust the provisioned capacity to match the new demand. This adds operational overhead and complexity to your application [2].
Option B is incorrect because it suggests dividing the table into two tables and provisioning each table with half of the provisioned capacity of the original table. This solution has the following disadvantages:
It complicates the data model and the application logic by splitting the data into two separate tables. You need to ensure that the queries are evenly distributed across both tables, and that the data is consistent and synchronized between them. This adds extra development and maintenance effort to your application [3].
It does not solve the problem of adjusting the provisioned capacity according to the workload patterns. You still need to manually or automatically scale the capacity of each table based on the actual utilization and demand. This may result in under-provisioning or over-provisioning your tables [2].
Option D is incorrect because it suggests changing the capacity mode from provisioned to on-demand. This solution has the following disadvantages:
It may incur higher costs than provisioned capacity mode for predictable workloads. On-demand capacity mode charges for each read and write request you consume, with no minimum capacity required. For predictable workloads, provisioned capacity mode can be more cost-effective than on-demand capacity mode, as you can reserve the capacity you need at a lower rate [1].
It may not provide consistent performance during peak usage times, as on-demand capacity mode may take some time to scale up the resources to meet the sudden increase in demand. On-demand capacity mode uses adaptive capacity to handle bursts of traffic, but it may not be able to handle very large spikes or sustained high throughput. In such cases, you may experience throttling or increased latency.
References:
1: Choosing the right DynamoDB capacity mode - Amazon DynamoDB
2: Managing throughput capacity automatically with DynamoDB auto scaling - Amazon DynamoDB
3: Best practices for designing and using partition keys effectively - Amazon DynamoDB
[4]: On-demand mode guidelines - Amazon DynamoDB
[5]: How to optimize Amazon DynamoDB costs - AWS Database Blog
[6]: DynamoDB adaptive capacity: How it works and how it helps - AWS Database Blog
[7]: Amazon DynamoDB pricing - Amazon Web Services (AWS)
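A minimal sketch of option C with Application Auto Scaling scheduled actions; the table name, capacity numbers, and cron expressions are placeholder assumptions (write capacity only is shown, and a matching pair of actions would normally also be registered for read capacity):

    import boto3

    aas = boto3.client("application-autoscaling")

    TABLE = "table/Orders"                        # placeholder resource ID
    DIM = "dynamodb:table:WriteCapacityUnits"

    # Register the table's write capacity as a scalable target.
    aas.register_scalable_target(
        ServiceNamespace="dynamodb",
        ResourceId=TABLE,
        ScalableDimension=DIM,
        MinCapacity=25,
        MaxCapacity=100,
    )

    # Scale up ahead of the Monday-morning spike...
    aas.put_scheduled_action(
        ServiceNamespace="dynamodb",
        ScheduledActionName="monday-peak",
        ResourceId=TABLE,
        ScalableDimension=DIM,
        Schedule="cron(0 5 ? * MON *)",           # placeholder time (UTC)
        ScalableTargetAction={"MinCapacity": 500, "MaxCapacity": 2000},
    )

    # ...and back down for the quiet weekend.
    aas.put_scheduled_action(
        ServiceNamespace="dynamodb",
        ScheduledActionName="weekend-low",
        ResourceId=TABLE,
        ScalableDimension=DIM,
        Schedule="cron(0 0 ? * SAT *)",
        ScalableTargetAction={"MinCapacity": 5, "MaxCapacity": 25},
    )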
Question # 29
A company is planning to migrate on-premises Apache Hadoop clusters to Amazon EMR. The company also needs to migrate a data catalog into a persistent storage solution. The company currently stores the data catalog in an on-premises Apache Hive metastore on the Hadoop clusters. The company requires a serverless solution to migrate the data catalog. Which solution will meet these requirements MOST cost-effectively?
A. Use AWS Database Migration Service (AWS DMS) to migrate the Hive metastore into Amazon S3. Configure AWS Glue Data Catalog to scan Amazon S3 to produce the data catalog.
B. Configure a Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use AWS Glue Data Catalog to store the company's data catalog as an external data catalog.
C. Configure an external Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use Amazon Aurora MySQL to store the company's data catalog.
D. Configure a new Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use the new metastore as the company's data catalog.
Answer: A
Explanation: AWS Database Migration Service (AWS DMS) is a service that helps you migrate databases to AWS quickly and securely. You can use AWS DMS to migrate the Hive metastore from the on-premises Hadoop clusters into Amazon S3, which is a highly scalable, durable, and cost-effective object storage service. AWS Glue Data Catalog is a serverless, managed service that acts as a central metadata repository for your data assets. You can use AWS Glue Data Catalog to scan the Amazon S3 bucket that contains the migrated Hive metastore and create a data catalog that is compatible with Apache Hive and other AWS services. This solution meets the requirements of migrating the data catalog into a persistent storage solution and using a serverless solution. This solution is also the most cost-effective, as it does not incur any additional charges for running Amazon EMR or Amazon Aurora MySQL clusters. The other options are either not feasible or not optimal. Configuring a Hive metastore in Amazon EMR (option B) or an external Hive metastore in Amazon EMR (option C) would require running and maintaining Amazon EMR clusters, which would incur additional costs and complexity. Using Amazon Aurora MySQL to store the company's data catalog (option C) would also incur additional costs and complexity, as well as introduce compatibility issues with Apache Hive. Configuring a new Hive metastore in Amazon EMR (option D) would not migrate the existing data catalog, but create a new one, which would result in data loss and inconsistency.
References:
Using AWS Database Migration Service
Populating the AWS Glue Data Catalog
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 4: Data Analysis and Visualization, Section 4.2: AWS Glue Data Catalog
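A hedged sketch of the two halves of option A: a DMS target endpoint that lands the migrated metastore tables in S3, and a Glue crawler that builds the Data Catalog from that prefix. The bucket, folder, IAM roles, and identifiers are placeholders, and the DMS source endpoint and replication task are assumed to exist already.

    import boto3

    dms = boto3.client("dms")
    glue = boto3.client("glue")

    # DMS target endpoint that writes the migrated Hive metastore tables to S3.
    dms.create_endpoint(
        EndpointIdentifier="hive-metastore-s3-target",            # placeholder
        EndpointType="target",
        EngineName="s3",
        S3Settings={
            "BucketName": "company-metastore-migration",          # placeholder
            "BucketFolder": "hive-metastore",
            "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/DmsS3Role",
        },
    )

    # Crawler that scans the migrated data and produces the serverless data catalog.
    glue.create_crawler(
        Name="hive-metastore-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",    # placeholder
        DatabaseName="migrated_hive_catalog",
        Targets={"S3Targets": [{"Path": "s3://company-metastore-migration/hive-metastore/"}]},
    )
    glue.start_crawler(Name="hive-metastore-crawler")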
Question # 30
A company loads transaction data for each day into Amazon Redshift tables at the end of each day. The company wants to have the ability to track which tables have been loaded and which tables still need to be loaded. A data engineer wants to store the load statuses of Redshift tables in an Amazon DynamoDB table. The data engineer creates an AWS Lambda function to publish the details of the load statuses to DynamoDB. How should the data engineer invoke the Lambda function to write load statuses to the DynamoDB table?
A. Use a second Lambda function to invoke the first Lambda function based on Amazon CloudWatch events.
B. Use the Amazon Redshift Data API to publish an event to Amazon EventBridge. Configure an EventBridge rule to invoke the Lambda function.
C. Use the Amazon Redshift Data API to publish a message to an Amazon Simple Queue Service (Amazon SQS) queue. Configure the SQS queue to invoke the Lambda function.
D. Use a second Lambda function to invoke the first Lambda function based on AWS CloudTrail events.
Answer: B
Explanation: The Amazon Redshift Data API enables you to interact with your Amazon Redshift data warehouse in an easy and secure way. You can use the Data API to run SQL commands, such as loading data into tables, without requiring a persistent connection to the cluster. The Data API also integrates with Amazon EventBridge, which allows you to monitor the execution status of your SQL commands and trigger actions based on events. By using the Data API to publish an event to EventBridge, the data engineer can invoke the Lambda function that writes the load statuses to the DynamoDB table. This solution is scalable, reliable, and cost-effective. The other options are either not possible or not optimal. You cannot use a second Lambda function to invoke the first Lambda function based on CloudWatch or CloudTrail events, as these services do not capture the load status of Redshift tables. You can use the Data API to publish a message to an SQS queue, but this would require additional configuration and polling logic to invoke the Lambda function from the queue. This would also introduce additional latency and cost.
References:
Using the Amazon Redshift Data API
Using Amazon EventBridge with Amazon Redshift
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 2: Data Store Management, Section 2.2: Amazon Redshift
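A sketch of option B under stated assumptions: the load runs through the Redshift Data API with WithEvent=True, an EventBridge rule (not shown) matches the resulting statement status-change event and invokes the Lambda function, and the handler below records the status in DynamoDB. The cluster, database, statement name, DynamoDB table, and the event detail field names are placeholder assumptions.

    import boto3

    # --- In the loading process: run the COPY through the Data API and ask it to
    #     emit an EventBridge event when the statement finishes.
    rsd = boto3.client("redshift-data")
    rsd.execute_statement(
        ClusterIdentifier="warehouse",            # placeholder
        Database="dev",
        DbUser="loader",
        Sql="COPY sales FROM 's3://example-bucket/2024-01-31/' IAM_ROLE DEFAULT;",
        StatementName="load-sales",               # identifies which table is being loaded
        WithEvent=True,
    )

    # --- Lambda function invoked by the EventBridge rule for the status-change event.
    dynamodb = boto3.resource("dynamodb")
    status_table = dynamodb.Table("RedshiftLoadStatus")   # placeholder table

    def lambda_handler(event, context):
        detail = event.get("detail", {})
        status_table.put_item(Item={
            "statement_name": detail.get("statementName", "unknown"),  # assumed field names
            "statement_id": detail.get("statementId", ""),
            "state": detail.get("state", ""),     # e.g. FINISHED or FAILED
            "event_time": event.get("time", ""),
        })
        return {"recorded": True}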