
Databricks-Machine-Learning-Associate Databricks Certified Machine Learning Associate Exam Questions and Answers

Questions 4

A data scientist is utilizing MLflow Autologging to automatically track their machine learning experiments. After completing a series of runs for the experiment experiment_id, the data scientist wants to identify the run_id of the run with the best root-mean-square error (RMSE).

Which of the following lines of code can be used to identify the run_id of the run with the best RMSE in experiment_id?

A)

[Image: code for option A]

B)

[Image: code for option B]

C)

[Image: code for option C]

D)

[Image: code for option D]

Options:

A.

Option A

B.

Option B

C.

Option C

D.

Option D

Questions 5

A data scientist is developing a single-node machine learning model. They have a large number of model configurations to test as a part of their experiment. As a result, the model tuning process takes too long to complete. Which of the following approaches can be used to speed up the model tuning process?

Options:

A.

Implement MLflow Experiment Tracking

B.

Scale up with Spark ML

C.

Enable autoscaling clusters

D.

Parallelize with Hyperopt

Questions 6

A data scientist has replaced missing values in their feature set with each respective feature variable’s median value. A colleague suggests that the data scientist is throwing away valuable information by doing this.

Which of the following approaches can they take to include as much information as possible in the feature set?

Options:

A.

Impute the missing values using each respective feature variable's mean value instead of the median value

B.

Refrain from imputing the missing values in favor of letting the machine learning algorithm determine how to handle them

C.

Remove all feature variables that originally contained missing values from the feature set

D.

Create a binary feature variable for each feature that contained missing values indicating whether each row's value has been imputed

E.

Create a constant feature variable for each feature that contained missing values indicating the percentage of rows from the feature that was originally missing
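
One of the approaches listed above, keeping a binary "was imputed" flag next to the imputed column, can be sketched in plain Python. This is a minimal illustration rather than a Spark ML implementation; the helper name `impute_with_indicator` and the sample values are hypothetical.

```python
from statistics import median

def impute_with_indicator(values):
    """Median-impute a column and return (imputed, was_missing) lists.

    Hypothetical helper for illustration; None marks a missing value.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    # The indicator preserves the fact that a value was originally missing.
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing

ages = [34, None, 51, 29, None]  # hypothetical feature column
imputed, was_missing = impute_with_indicator(ages)
```

The indicator column lets a downstream model treat "originally missing" as a signal in its own right, which imputation alone would erase.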

Questions 7

Which of the following tools can be used to distribute large-scale feature engineering without the use of a UDF or pandas Function API for machine learning pipelines?

Options:

A.

Keras

B.

pandas

C.

PyTorch

D.

Spark ML

E.

Scikit-learn

Questions 8

A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrame train_df to train the model.

The Spark DataFrame train_df has the following schema:

[Image: schema of train_df]

The machine learning engineer shares the following code block:

[Image: code block]

Which of the following changes does the machine learning engineer need to make to complete the task?

Options:

A.

They need to call the transform method on train_df

B.

They need to convert the features column to be a vector

C.

They do not need to make any changes

D.

They need to utilize a Pipeline to fit the model

E.

They need to split the features column out into one column for each feature

Questions 9

A machine learning engineer is trying to scale a machine learning pipeline by distributing its feature engineering process.

Which of the following feature engineering tasks will be the least efficient to distribute?

Options:

A.

One-hot encoding categorical features

B.

Target encoding categorical features

C.

Imputing missing feature values with the mean

D.

Imputing missing feature values with the true median

E.

Creating binary indicator features for missing values
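
To see why some of these statistics distribute more easily than others, the sketch below compares a mean, which can be assembled from per-partition partial results, with an exact median, which needs the globally ordered data. The three "partitions" are hypothetical plain-Python lists, not Spark partitions.

```python
from statistics import median

# Hypothetical data split across three partitions (workers).
partitions = [[1, 2, 50], [3, 4, 60], [5, 6, 70]]

# Mean: each partition contributes only a small (sum, count) pair.
partials = [(sum(p), len(p)) for p in partitions]
total, n = map(sum, zip(*partials))
global_mean = total / n

# Exact median: requires the globally sorted values.
all_values = sorted(v for p in partitions for v in p)
true_median = median(all_values)

# Combining per-partition medians does not, in general, give the true median.
median_of_medians = median(median(p) for p in partitions)
```

Here the true median is 5 while the median of the partition medians is 4, so an exact median cannot be computed from small per-partition summaries the way a mean can.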

Questions 10

A data scientist uses 3-fold cross-validation and the following hyperparameter grid when optimizing model hyperparameters via grid search for a classification problem:

● Hyperparameter 1: [2, 5, 10]

● Hyperparameter 2: [50, 100]

Which of the following represents the number of machine learning models that can be trained in parallel during this process?

Options:

A.

3

B.

5

C.

6

D.

18
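
The arithmetic behind this grid can be checked with a short sketch. It only counts hyperparameter combinations and fold-level fits; how many of them actually run concurrently depends on the tuning tool and the cluster, which the sketch does not model. The dictionary keys are placeholder names.

```python
from itertools import product

grid = {"hyperparameter_1": [2, 5, 10], "hyperparameter_2": [50, 100]}
num_folds = 3

combinations = list(product(*grid.values()))
num_combinations = len(combinations)       # distinct hyperparameter settings
total_fits = num_combinations * num_folds  # one fit per setting per fold
```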

Questions 11

The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.

Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?

Options:

A.

Logistic regression

B.

Spark ML cannot distribute linear regression training

C.

Iterative optimization

D.

Least-squares method

E.

Singular value decomposition
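
As an illustration of what fitting a linear regression iteratively means, in contrast to solving it in one step by matrix decomposition, here is a minimal sketch that uses plain batch gradient descent as a stand-in. It is not Spark ML's actual solver; the function name, learning rate, and data are assumptions chosen for the example.

```python
def fit_linear_gd(xs, ys, lr=0.05, steps=2000):
    """Fit y = w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Each gradient is a sum over rows, so in a distributed setting every
        # partition could compute its share and the partial sums be added up.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = fit_linear_gd(xs, ys)
```

Because every step needs only sums over the rows, the per-step work splits naturally across partitions, whereas a decomposition of the full feature matrix does not scale as easily with the number of variables.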

Questions 12

A data scientist has developed a random forest regressor rfr and included it as the final stage in a Spark ML Pipeline pipeline. They then set up a cross-validation process with pipeline as the estimator in the following code block:

[Image: code block]

Which of the following is a negative consequence of including pipeline as the estimator in the cross-validation process rather than rfr as the estimator?

Options:

A.

The process will have a longer runtime because all stages of pipeline need to be refit or retransformed with each model

B.

The process will leak data from the training set to the test set during the evaluation phase

C.

The process will be unable to parallelize tuning due to the distributed nature of pipeline

D.

The process will leak data prep information from the validation sets to the training sets for each model

Questions 13

Which of the following machine learning algorithms typically uses bagging?

Options:

A.

Gradient boosted trees

B.

K-means

C.

Random forest

D.

Linear regression

E.

Decision tree
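
Bagging (bootstrap aggregating) means training several base models on bootstrap resamples of the data and combining their predictions. The sketch below shows only that mechanism, using a deliberately trivial base model (the sample mean) rather than trees; the function name and data are hypothetical.

```python
import random
from statistics import mean

def bagged_mean(ys, n_models=25, seed=0):
    """Bootstrap-aggregate a trivial base model (the sample mean)."""
    rng = random.Random(seed)
    base_predictions = []
    for _ in range(n_models):
        # Bootstrap: draw len(ys) rows with replacement.
        sample = [rng.choice(ys) for _ in ys]
        # "Train" the base model on the resample.
        base_predictions.append(mean(sample))
    # Aggregate: average the base models' predictions.
    return mean(base_predictions), base_predictions

targets = [3.0, 5.0, 4.0, 6.0, 100.0]  # hypothetical label values
ensemble_prediction, base_predictions = bagged_mean(targets)
```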

Questions 14

A data scientist wants to tune a set of hyperparameters for a machine learning model. They have wrapped a Spark ML model in the objective function objective_function and they have defined the search space search_space.

As a result, they have the following code block:

[Image: code block]

Which of the following changes do they need to make to the above code block in order to accomplish the task?

Options:

A.

Change SparkTrials() to Trials()

B.

Reduce num_evals to be less than 10

C.

Change fmin() to fmax()

D.

Remove the trials=trials argument

E.

Remove the algo=tpe.suggest argument

Questions 15

Which of the following machine learning algorithms typically uses bagging?

Options:

A.

Gradient boosted trees

B.

K-means

C.

Random forest

D.

Decision tree

Questions 16

A machine learning engineer is converting a decision tree from sklearn to Spark ML. They notice that they are receiving different results despite all of their data and manually specified hyperparameter values being identical.

Which of the following describes a reason that the single-node sklearn decision tree and the Spark ML decision tree can differ?

Options:

A.

Spark ML decision trees test every feature variable in the splitting algorithm

B.

Spark ML decision trees automatically prune overfit trees

C.

Spark ML decision trees test more split candidates in the splitting algorithm

D.

Spark ML decision trees test a random sample of feature variables in the splitting algorithm

E.

Spark ML decision trees test binned feature values as representative split candidates
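
The difference between testing every possible threshold and testing only bin boundaries can be shown with a small sketch. It is a simplified illustration of the idea, not the binning algorithm of any particular library; both function names and the `max_bins` value are assumptions.

```python
def exact_split_candidates(values):
    """Midpoints between every pair of adjacent distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

def binned_split_candidates(values, max_bins):
    """Approximate candidates: boundaries of roughly equal-count bins."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = {ordered[(i * n) // max_bins] for i in range(1, max_bins)}
    return sorted(cuts)

feature = [3, 7, 1, 9, 4, 8, 2, 6, 5, 10]  # hypothetical continuous feature
exact = exact_split_candidates(feature)
binned = binned_split_candidates(feature, max_bins=4)
```

With ten distinct values the exact approach yields nine candidate thresholds, while four bins yield only three, so the two approaches can select different splits on identical data.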

Questions 17

A data scientist has created two linear regression models. The first model uses price as a label variable and the second model uses log(price) as a label variable. When evaluating the RMSE of each model by comparing the label predictions to the actual price values, the data scientist notices that the RMSE for the second model is much larger than the RMSE of the first model.

Which of the following possible explanations for this difference is invalid?

Options:

A.

The second model is much more accurate than the first model

B.

The data scientist failed to exponentiate the predictions in the second model prior to computing the RMSE

C.

The data scientist failed to take the log of the predictions in the first model prior to computing the RMSE

D.

The first model is much more accurate than the second model

E.

The RMSE is an invalid evaluation metric for regression problems

Questions 18

A data scientist wants to explore summary statistics for Spark DataFrame spark_df. The data scientist wants to see the count, mean, standard deviation, minimum, maximum, and interquartile range (IQR) for each numerical feature.

Which of the following lines of code can the data scientist run to accomplish the task?

Options:

A.

spark_df.summary()

B.

spark_df.stats()

C.

spark_df.describe().head()

D.

spark_df.printSchema()

E.

spark_df.toPandas()

Questions 19

In which of the following situations is it preferable to impute missing feature values with their median value over the mean value?

Options:

A.

When the features are of the categorical type

B.

When the features are of the boolean type

C.

When the features contain a lot of extreme outliers

D.

When the features contain no outliers

E.

When the features contain no missing values
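
How differently the mean and the median respond to an extreme value is easy to check numerically. The figures below are hypothetical and serve only to show the effect on the value that would be used for imputation.

```python
from statistics import mean, median

incomes = [42, 45, 47, 50, 52, 5000]  # hypothetical, with one extreme outlier

fill_mean = mean(incomes)      # pulled far toward the outlier
fill_median = median(incomes)  # stays near the bulk of the data
```

The median here is 48.5, close to the typical values, while the mean is above 800 because of the single outlier.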

Questions 20

A machine learning engineer wants to parallelize the training of group-specific models using the Pandas Function API. They have developed the train_model function, and they want to apply it to each group of DataFrame df.

They have written the following incomplete code block:

[Image: incomplete code block]

Which of the following pieces of code can be used to fill in the above blank to complete the task?

Options:

A.

applyInPandas

B.

mapInPandas

C.

predict

D.

train_model

E.

groupedApplyIn

Questions 21

A data scientist is working with a feature set with the following schema:

[Image: schema of the feature set]

The customer_id column is the primary key in the feature set. Each of the columns in the feature set has missing values. They want to replace the missing values by imputing a common value for each feature.

Which of the following lists all of the columns in the feature set that need to be imputed using the most common value of the column?

Options:

A.

customer_id, loyalty_tier

B.

loyalty_tier

C.

units

D.

spend

E.

customer_id
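
Imputing with the most common value (the mode) can be sketched as follows. The schema image is not available here, so the sketch assumes, purely for illustration, a string-valued column with hypothetical tier names.

```python
from statistics import mode

# Hypothetical categorical column; None marks a missing value.
loyalty_tier = ["gold", None, "silver", "gold", None, "bronze"]

most_common = mode(v for v in loyalty_tier if v is not None)
imputed = [most_common if v is None else v for v in loyalty_tier]
```

A mean or median is undefined for unordered categories, which is why the mode is the usual choice for such columns.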

Questions 22

A data scientist has created a linear regression model that uses log(price) as a label variable. Using this model, they have performed inference and the predictions and actual label values are in Spark DataFrame preds_df.

They are using the following code block to evaluate the model:

regression_evaluator.setMetricName("rmse").evaluate(preds_df)

Which of the following changes should the data scientist make to evaluate the RMSE in a way that is comparable with price?

Options:

A.

They should exponentiate the computed RMSE value

B.

They should take the log of the predictions before computing the RMSE

C.

They should evaluate the MSE of the log predictions to compute the RMSE

D.

They should exponentiate the predictions before computing the RMSE
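
The effect of the scale on which RMSE is computed can be illustrated with a few numbers. This is a plain-Python sketch with hypothetical prices and predictions, not the Spark evaluator from the question; the `rmse` helper is defined here for the example.

```python
import math

def rmse(predictions, actuals):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared) / len(squared))

prices = [100.0, 200.0, 400.0]
# Hypothetical outputs of a model trained on log(price).
log_predictions = [math.log(110.0), math.log(190.0), math.log(420.0)]

# RMSE on the log scale: small, and not in price units.
rmse_log_scale = rmse(log_predictions, [math.log(p) for p in prices])

# RMSE on the price scale: exponentiate the predictions first.
rmse_price_scale = rmse([math.exp(lp) for lp in log_predictions], prices)
```

The log-scale value is well below 1, while the price-scale value is about 14.14, so the two numbers are not comparable without first putting predictions and labels on the same scale.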

Exam Name: Databricks Certified Machine Learning Associate Exam
Last Update: Nov 22, 2024
Questions: 74
