A machine learning (ML) specialist must develop a classification model for a financial services company. A domain expert provides the dataset, which is tabular with 10,000 rows and 1,020 features. During exploratory data analysis, the specialist finds no missing values and a small percentage of duplicate rows. There are correlation scores of > 0.9 for 200 feature pairs. The mean value of each feature is similar to its 50th percentile. Which feature engineering strategy should the ML specialist use with Amazon SageMaker?
A. Apply dimensionality reduction by using the principal component analysis (PCA)algorithm.
B. Drop the features with low correlation scores by using a Jupyter notebook.
C. Apply anomaly detection by using the Random Cut Forest (RCF) algorithm.
D. Concatenate the features with high correlation scores by using a Jupyter notebook.
A Machine Learning Specialist is designing a scalable data storage solution for Amazon SageMaker. There is an existing TensorFlow-based model implemented as a train.py script that relies on static training data that is currently stored as TFRecords Which method of providing training data to Amazon SageMaker would meet the business requirements with the LEAST development overhead?
A. Use Amazon SageMaker script mode and use train.py unchanged. Point the AmazonSageMaker training invocation to the local path of the data without reformatting the trainingdata.
B. Use Amazon SageMaker script mode and use train.py unchanged. Put the TFRecorddata into an Amazon S3 bucket. Point the Amazon SageMaker training invocation to the S3bucket without reformatting the training data.
C. Rewrite the train.py script to add a section that converts TFRecords to protobuf andingests the protobuf data instead of TFRecords.
D. Prepare the data in the format accepted by Amazon SageMaker. Use AWS Glue orAWS Lambda to reformat and store the data in an Amazon S3 bucket.
A data scientist is using the Amazon SageMaker Neural Topic Model (NTM) algorithm to build a model that recommends tags from blog posts. The raw blog post data is stored in an Amazon S3 bucket in JSON format. During model evaluation, the data scientist discovered that the model recommends certain stopwords such as "a," "an,” and "the" as tags to certain blog posts, along with a few rare words that are present only in certain blog entries. After a few iterations of tag review with the content team, the data scientist notices that the rare words are unusual but feasible. The data scientist also must ensure that the tag recommendations of the generated model do not include the stopwords. What should the data scientist do to meet these requirements?
A. Use the Amazon Comprehend entity recognition API operations. Remove the detectedwords from the blog post data. Replace the blog post data source in the S3 bucket.
B. Run the SageMaker built-in principal component analysis (PCA) algorithm with the blogpost data from the S3 bucket as the data source. Replace the blog post data in the S3bucket with the results of the training job.
C. Use the SageMaker built-in Object Detection algorithm instead of the NTM algorithm forthe training job to process the blog post data.
D. Remove the stopwords from the blog post data by using the Count Vectorizer function inthe scikit-learn library. Replace the blog post data in the S3 bucket with the results of thevectorizer.
A Data Scientist received a set of insurance records, each consisting of a record ID, the final outcome among 200 categories, and the date of the final outcome. Some partial information on claim contents is also provided, but only for a few of the 200 categories. For each outcome category, there are hundreds of records distributed over the past 3 years. The Data Scientist wants to predict how many claims to expect in each category from month to month, a few months in advance. What type of machine learning model should be used?
A. Classification month-to-month using supervised learning of the 200 categories based onclaim contents.
B. Reinforcement learning using claim IDs and timestamps where the agent will identifyhow many claims in each category to expect from month to month.
C. Forecasting using claim IDs and timestamps to identify how many claims in eachcategory to expect from month to month.
D. Classification with supervised learning of the categories for which partial information onclaim contents is provided, and forecasting using claim IDs and timestamps for all other categories.
A Machine Learning Specialist uploads a dataset to an Amazon S3 bucket protected with server-side encryption using AWS KMS. How should the ML Specialist define the Amazon SageMaker notebook instance so it can read the same dataset from Amazon S3?
A. Define security group(s) to allow all HTTP inbound/outbound traffic and assign thosesecurity group(s) to the Amazon SageMaker notebook instance.
B. onfigure the Amazon SageMaker notebook instance to have access to the VPC. Grantpermission in the KMS key policy to the notebook’s KMS role.
C. Assign an IAM role to the Amazon SageMaker notebook with S3 read access to thedataset. Grant permission in the KMS key policy to that role.
D. Assign the same KMS key used to encrypt data in Amazon S3 to the AmazonSageMaker notebook instance.
A company provisions Amazon SageMaker notebook instances for its data science team and creates Amazon VPC interface endpoints to ensure communication between the VPC and the notebook instances. All connections to the Amazon SageMaker API are contained entirely and securely using the AWS network. However, the data science team realizes that individuals outside the VPC can still connect to the notebook instances across the internet. Which set of actions should the data science team take to fix the issue?
A. Modify the notebook instances' security group to allow traffic only from the CIDR rangesof the VPC. Apply this security group to all of the notebook instances' VPC interfaces.
B. Create an IAM policy that allows the sagemaker:CreatePresignedNotebooklnstanceUrland sagemaker:DescribeNotebooklnstance actions from only the VPC endpoints. Applythis policy to all IAM users, groups, and roles used to access the notebook instances.
C. Add a NAT gateway to the VPC. Convert all of the subnets where the AmazonSageMaker notebook instances are hosted to private subnets. Stop and start all of thenotebook instances to reassign only private IP addresses.
D. Change the network ACL of the subnet the notebook is hosted in to restrict access toanyone outside the VPC.
A data scientist is working on a public sector project for an urban traffic system. While studying the traffic patterns, it is clear to the data scientist that the traffic behavior at each light is correlated, subject to a small stochastic error term. The data scientist must model the traffic behavior to analyze the traffic patterns and reduce congestion How will the data scientist MOST effectively model the problem?
A. The data scientist should obtain a correlated equilibrium policy by formulating thisproblem as a multi-agent reinforcement learning problem.
B. The data scientist should obtain the optimal equilibrium policy by formulating thisproblem as a single-agent reinforcement learning problem.
C. Rather than finding an equilibrium policy, the data scientist should obtain accuratepredictors of traffic flow by using historical data through a supervised learning approach.
D. Rather than finding an equilibrium policy, the data scientist should obtain accuratepredictors of traffic flow by using unlabeled simulated data representing the new trafficpatterns in the city and applying an unsupervised learning approach.
A company is converting a large number of unstructured paper receipts into images. The company wants to create a model based on natural language processing (NLP) to find relevant entities such as date, location, and notes, as well as some custom entities such as receipt numbers. The company is using optical character recognition (OCR) to extract text for data labeling. However, documents are in different structures and formats, and the company is facing challenges with setting up the manual workflows for each document type. Additionally, the company trained a named entity recognition (NER) model for custom entity detection using a small sample size. This model has a very low confidence score and will require retraining with a large dataset. Which solution for text extraction and entity detection will require the LEAST amount of effort?
A. Extract text from receipt images by using Amazon Textract. Use the AmazonSageMaker BlazingText algorithm to train on the text for entities and custom entities.
B. Extract text from receipt images by using a deep learning OCR model from the AWSMarketplace. Use the NER deep learning model to extract entities.
C. Extract text from receipt images by using Amazon Textract. Use Amazon Comprehendfor entity detection, and use Amazon Comprehend custom entity recognition for customentity detection.
D. Extract text from receipt images by using a deep learning OCR model from the AWSMarketplace. Use Amazon Comprehend for entity detection, and use Amazon Comprehendcustom entity recognition for custom entity detection.
A machine learning specialist is developing a regression model to predict rental rates from rental listings. A variable named Wall_Color represents the most prominent exterior wall color of the property. The following is the sample data, excluding all other variables:
The specialist chose a model that needs numerical input data. Which feature engineering approaches should the specialist use to allow the regression model to learn from the Wall_Color data? (Choose two.)
A. Apply integer transformation and set Red = 1, White = 5, and Green = 10.
B. Add new columns that store one-hot representation of colors.
C. Replace the color name string by its length.
D. Create three columns to encode the color in RGB format.
E. Replace each color name by its training set frequency.
A company has set up and deployed its machine learning (ML) model into production with an endpoint using Amazon SageMaker hosting services. The ML team has configured automatic scaling for its SageMaker instances to support workload changes. During testing, the team notices that additional instances are being launched before the new instances are ready. This behavior needs to change as soon as possible. How can the ML team solve this issue?
A. Decrease the cooldown period for the scale-in activity. Increase the configuredmaximum capacity of instances.
B. Replace the current endpoint with a multi-model endpoint using SageMaker.
C. Set up Amazon API Gateway and AWS Lambda to trigger the SageMaker inferenceendpoint.
D. Increase the cooldown period for the scale-out activity.