Databricks-Certified-Professional-Data-Engineer Databricks Certified Data Engineer Professional Exam Questions and Answers

Questions 4

A member of the data engineering team has submitted a short notebook that they wish to schedule as part of a larger data pipeline. Assume that the commands provided below produce the logically correct results when run as presented.

Databricks-Certified-Professional-Data-Engineer Question 4

Which command should be removed from the notebook before scheduling it as a job?

Options:

Cmd 2

Cmd 3

Cmd 4

Cmd 5

Cmd 6

Buy Now

Questions 5

The marketing team is looking to share data in an aggregate table with the sales organization, but the field names used by the teams do not match, and a number of marketing specific fields have not been approval for the sales org.

Which of the following solutions addresses the situation while emphasizing simplicity?

Options:

Create a view on the marketing table selecting only these fields approved for the sales team alias the names of any fields that should be standardized to the sales naming conventions.

Use a CTAS statement to create a derivative table from the marketing table configure a production jon to propagation changes.

Add a parallel table write to the current production pipeline, updating a new sales table that varies as required from marketing table.

Create a new table with the required schema and use Delta Lake's DEEP CLONE functionality to sync up changes committed to one table to the corresponding table.

Buy Now

Questions 6

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.

The silver_device_recordings table will be used downstream for highly selective joins on a number of fields, and will also be leveraged by the machine learning team to filter on a handful of relevant fields, in total, 15 fields have been identified that will often be used for filter and join logic.

The data engineer is trying to determine the best approach for dealing with these nested fields before declaring the table schema.

Which of the following accurately presents information about Delta Lake and Databricks that may Impact their decision-making process?

Options:

Because Delta Lake uses Parquet for data storage, Dremel encoding information for nesting can be directly referenced by the Delta transaction log.

Tungsten encoding used by Databricks is optimized for storing string data: newly-added native support for querying JSON strings means that string types are always most efficient.

Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.

By default Delta Lake collects statistics on the first 32 columns in a table; these statistics are leveraged for data skipping when executing selective queries.

Buy Now

Questions 7

A production cluster has 3 executor nodes and uses the same virtual machine type for the driver and executor.

When evaluating the Ganglia Metrics for this cluster, which indicator would signal a bottleneck caused by code executing on the driver?

Options:

The five Minute Load Average remains consistent/flat

Bytes Received never exceeds 80 million bytes per second

Total Disk Space remains constant

Network I/O never spikes

Overall cluster CPU utilization is around 25%

Buy Now

Questions 8

A Delta Lake table in the Lakehouse named customer_parsams is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.

Immediately after each update succeeds, the data engineer team would like to determine the difference between the new version and the previous of the table.

Given the current implementation, which method can be used?

Options:

Parse the Delta Lake transaction log to identify all newly written data files.

Execute DESCRIBE HISTORY customer_churn_params to obtain the full operation metrics for the update, including a log of all records that have been added or modified.

Execute a query to calculate the difference between the new version and the previous version using Delta Lake’s built-in versioning and time travel functionality.

Parse the Spark event logs to identify those rows that were updated, inserted, or deleted.

Buy Now

Questions 9

A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint 2.0/jobs/create.

Databricks-Certified-Professional-Data-Engineer Question 9

Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?

Options:

Three new jobs named "Ingest new data" will be defined in the workspace, and they will each run once daily.

The logic defined in the referenced notebook will be executed three times on new clusters with the configurations of the provided cluster ID.

Three new jobs named "Ingest new data" will be defined in the workspace, but no jobs will be executed.

One new job named "Ingest new data" will be defined in the workspace, but it will not be executed.

The logic defined in the referenced notebook will be executed three times on the referenced existing all purpose cluster.

Buy Now

Questions 10

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.

The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications.

The data engineer is trying to determine the best approach for dealing with schema declaration given the highly-nested structure of the data and the numerous fields.

Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?

Options:

The Tungsten encoding used by Databricks is optimized for storing string data; newly-added native support for querying JSON strings means that string types are always most efficient.

Because Delta Lake uses Parquet for data storage, data types can be easily evolved by just modifying file footer information in place.

Human labor in writing code is the largest cost associated with data engineering workloads; as such, automating table declaration logic should be a priority in all migration workloads.

Because Databricks will infer schema using types that allow all observed data to be processed, setting types manually provides greater assurance of data quality enforcement.

Schema inference and evolution on .Databricks ensure that inferred types will always accurately match the data types used by downstream systems.

Buy Now

Questions 11

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

Options:

Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.

Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.

Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB bytes, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.

Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB* 1024*1024/512), and then write to parquet.

Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.

Buy Now

Questions 12

The following table consists of items found in user carts within an e-commerce website.

Databricks-Certified-Professional-Data-Engineer Question 12

The following MERGE statement is used to update this table using an updates view, with schema evaluation enabled on this table.

Databricks-Certified-Professional-Data-Engineer Question 12

How would the following update be handled?

Options:

The update is moved to separate ''restored'' column because it is missing a column expected in the target schema.

The new restored field is added to the target schema, and dynamically read as NULL for existing unmatched records.

The update throws an error because changes to existing columns in the target schema are not supported.

The new nested field is added to the target schema, and files underlying existing records are updated to include NULL values for the new field.

Buy Now

Questions 13

The data governance team is reviewing code used for deleting records for compliance with GDPR. They note the following logic is used to delete records from the Delta Lake table named users.

Databricks-Certified-Professional-Data-Engineer Question 13

Assuming that user_id is a unique identifying key and that delete_requests contains all users that have requested deletion, which statement describes whether successfully executing the above logic guarantees that the records to be deleted are no longer accessible and why?

Options:

Yes; Delta Lake ACID guarantees provide assurance that the delete command succeeded fully and permanently purged these records.

No; the Delta cache may return records from previous versions of the table until the cluster is restarted.

Yes; the Delta cache immediately updates to reflect the latest data files recorded to disk.

No; the Delta Lake delete command only provides ACID guarantees when combined with the merge into command.

No; files containing deleted records may still be accessible with time travel until a vacuum command is used to remove invalidated data files.

Buy Now

Questions 14

A data engineer is testing a collection of mathematical functions, one of which calculates the area under a curve as described by another function.

Which kind of the test does the above line exemplify?

Options:

Integration

Unit

Manual

functional

Buy Now

Questions 15

A table is registered with the following code:

Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?

Options:

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.

All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.

Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.

The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.

Buy Now

Questions 16

Which statement describes Delta Lake Auto Compaction?

Options:

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 1 GB.

Before a Jobs cluster terminates, optimize is executed on all tables modified during the most recent job.

Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.

Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 128 MB.

Buy Now

Questions 17

A data engineer, User A, has promoted a new pipeline to production by using the REST API to programmatically create several jobs. A DevOps engineer, User B, has configured an external orchestration tool to trigger job runs through the REST API. Both users authorized the REST API calls using their personal access tokens.

Which statement describes the contents of the workspace audit logs concerning these events?

Options:

Because the REST API was used for job creation and triggering runs, a Service Principal will be automatically used to identity these events.

Because User B last configured the jobs, their identity will be associated with both the job creation events and the job run events.

Because these events are managed separately, User A will have their identity associated with the job creation events and User B will have their identity associated with the job run events.

Because the REST API was used for job creation and triggering runs, user identity will not be captured in the audit logs.

Because User A created the jobs, their identity will be associated with both the job creation events and the job run events.

Buy Now

Questions 18

A user wants to use DLT expectations to validate that a derived table report contains all records from the source, included in the table validation_copy.

The user attempts and fails to accomplish this by adding an expectation to the report table definition.

Which approach would allow using DLT expectations to validate all expected records are present in this table?

Options:

Define a SQL UDF that performs a left outer join on two tables, and check if this returns null values for report key values in a DLT expectation for the report table.

Define a function that performs a left outer join on validation_copy and report and report, and check against the result in a DLT expectation for the report table

Define a temporary table that perform a left outer join on validation_copy and report, and define an expectation that no report key values are null

Define a view that performs a left outer join on validation_copy and report, and reference this view in DLT expectations for the report table

Buy Now

Questions 19

When evaluating the Ganglia Metrics for a given cluster with 3 executor nodes, which indicator would signal proper utilization of the VM's resources?

Options:

The five Minute Load Average remains consistent/flat

Bytes Received never exceeds 80 million bytes per second

Network I/O never spikes

Total Disk Space remains constant

CPU Utilization is around 75%

Buy Now

Questions 20

The data engineer team has been tasked with configured connections to an external database that does not have a supported native connector with Databricks. The external database already has data security configured by group membership. These groups map directly to user group already created in Databricks that represent various teams within the company.

A new login credential has been created for each group in the external database. The Databricks Utilities Secrets module will be used to make these credentials available to Databricks users.

Assuming that all the credentials are configured correctly on the external database and group membership is properly configured on Databricks, which statement describes how teams can be granted the minimum necessary access to using these credentials?

Options:

‘’Read’’ permissions should be set on a secret key mapped to those credentials that will be used by a given team.

No additional configuration is necessary as long as all users are configured as administrators in the workspace where secrets have been added.

“Read” permissions should be set on a secret scope containing only those credentials that will be used by a given team.

“Manage” permission should be set on a secret scope containing only those credentials that will be used by a given team.

Buy Now

Questions 21

The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.

The following logic is used to process these records.

Which statement describes this implementation?

Options:

The customers table is implemented as a Type 3 table; old values are maintained as a new column alongside the current value.

The customers table is implemented as a Type 2 table; old values are maintained but marked as no longer current and new values are inserted.

The customers table is implemented as a Type 0 table; all writes are append only with no changes to existing values.

The customers table is implemented as a Type 1 table; old values are overwritten by new values and no history is maintained.

The customers table is implemented as a Type 2 table; old values are overwritten and new customers are appended.

Buy Now

Questions 22

An external object storage container has been mounted to the location /mnt/finance_eda_bucket.

The following logic was executed to create a database for the finance team:

After the database was successfully created and permissions configured, a member of the finance team runs the following code:

If all users on the finance team are members of the finance group, which statement describes how the tx_sales table will be created?

Options:

A logical table will persist the query plan to the Hive Metastore in the Databricks control plane.

An external table will be created in the storage container mounted to /mnt/finance eda bucket.

A logical table will persist the physical plan to the Hive Metastore in the Databricks control plane.

An managed table will be created in the storage container mounted to /mnt/finance eda bucket.

A managed table will be created in the DBFS root storage container.

Buy Now

Questions 23

All records from an Apache Kafka producer are being ingested into a single Delta Lake table with the following schema:

key BINARY, value BINARY, topic STRING, partition LONG, offset LONG, timestamp LONG

There are 5 unique topics being ingested. Only the "registration" topic contains Personal Identifiable Information (PII). The company wishes to restrict access to PII. The company also wishes to only retain records containing PII in this table for 14 days after initial ingestion. However, for non-PII information, it would like to retain these records indefinitely.

Which of the following solutions meets the requirements?

Options:

All data should be deleted biweekly; Delta Lake's time travel functionality should be leveraged to maintain a history of non-PII information.

Data should be partitioned by the registration field, allowing ACLs and delete statements to be set for the PII directory.

Because the value field is stored as binary data, this information is not considered PII and no special precautions should be taken.

Separate object storage containers should be specified based on the partition field, allowing isolation at the storage level.

Data should be partitioned by the topic field, allowing ACLs and delete statements to leverage partition boundaries.

Buy Now

Questions 24

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

The user_ltv table has the following schema:

email STRING, age INT, ltv INT

The following view definition is executed:

Databricks-Certified-Professional-Data-Engineer Question 24

An analyst who is not a member of the marketing group executes the following query:

SELECT * FROM email_ltv

Which statement describes the results returned by this query?

Options:

Three columns will be returned, but one column will be named "redacted" and contain only null values.

Only the email and itv columns will be returned; the email column will contain all null values.

The email and ltv columns will be returned with the values in user itv.

The email, age. and ltv columns will be returned with the values in user ltv.

Only the email and ltv columns will be returned; the email column will contain the string "REDACTED" in each row.

Buy Now

Questions 25

An hourly batch job is configured to ingest data files from a cloud object storage container where each batch represent all records produced by the source system in a given hour. The batch job to process these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field represents a unique key for the data, which has the following schema:

user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT

New records are all ingested into a table named account_history which maintains a full record of all data in the same schema as the source. The next table in the system is named account_current and is implemented as a Type 1 table representing the most recent value for each unique user_id.

Assuming there are millions of user accounts and tens of thousands of records processed hourly, which implementation can be used to efficiently update the described account_current table as part of each hourly batch job?

Options:

Use Auto Loader to subscribe to new files in the account history directory; configure a Structured Streaminq trigger once job to batch update newly detected files into the account current table.

Overwrite the account current table with each batch using the results of a query against the account history table grouping by user id and filtering for the max value of last updated.

Filter records in account history using the last updated field and the most recent hour processed, as well as the max last iogin by user id write a merge statement to update or insert the most recent value for each user id.

Use Delta Lake version history to get the difference between the latest version of account history and one version prior, then write these records to account current.

Filter records in account history using the last updated field and the most recent hour processed, making sure to deduplicate on username; write a merge statement to update or insert the

most recent value for each username.

Buy Now

Questions 26

A junior member of the data engineering team is exploring the language interoperability of Databricks notebooks. The intended outcome of the below code is to register a view of all sales that occurred in countries on the continent of Africa that appear in the geo_lookup table.

Before executing the code, running SHOW TABLES on the current database indicates the database contains only two tables: geo_lookup and sales.

Databricks-Certified-Professional-Data-Engineer Question 26

Which statement correctly describes the outcome of executing these command cells in order in an interactive notebook?

Options:

Both commands will succeed. Executing show tables will show that countries at and sales at have been registered as views.

Cmd 1 will succeed. Cmd 2 will search all accessible databases for a table or view named countries af: if this entity exists, Cmd 2 will succeed.

Cmd 1 will succeed and Cmd 2 will fail, countries at will be a Python variable representing a PySpark DataFrame.

Both commands will fail. No new variables, tables, or views will be created.

Cmd 1 will succeed and Cmd 2 will fail, countries at will be a Python variable containing a list of strings.

Buy Now

Questions 27

Which statement describes the default execution mode for Databricks Auto Loader?

Options:

New files are identified by listing the input directory; new files are incrementally and idempotently loaded into the target Delta Lake table.

Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; new files are incrementally and impotently into the target Delta Lake table.

Webhook trigger Databricks job to run anytime new data arrives in a source directory; new data automatically merged into target tables using rules inferred from the data.

New files are identified by listing the input directory; the target table is materialized by directory querying all valid files in the source directory.

Buy Now

Questions 28

The Databricks CLI is use to trigger a run of an existing job by passing the job_id parameter. The response that the job run request has been submitted successfully includes a filed run_id.

Which statement describes what the number alongside this field represents?

Options:

The job_id is returned in this field.

The job_id and number of times the job has been are concatenated and returned.

The number of times the job definition has been run in the workspace.

The globally unique ID of the newly triggered run.

Buy Now

Questions 29

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

Streaming DataFrame df has the following schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"

Code block:

Choose the response that correctly fills in the blank within the code block to complete this task.

Options:

to_interval("event_time", "5 minutes").alias("time")

window("event_time", "5 minutes").alias("time")

"event_time"

window("event_time", "10 minutes").alias("time")

lag("event_time", "10 minutes").alias("time")

Buy Now

Questions 30

Which Python variable contains a list of directories to be searched when trying to locate required modules?

Options:

importlib.resource path

,sys.path

os-path

pypi.path

pylib.source

Buy Now

Questions 31

A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're using a notebook versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown.

Which approach will allow this developer to review the current logic for this notebook?

Options:

Use Repos to make a pull request use the Databricks REST API to update the current branch to dev-2.3.9

Use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch.

Use Repos to checkout the dev-2.3.9 branch and auto-resolve conflicts with the current branch

Merge all changes back to the main branch in the remote Git repository and clone the repo again

Use Repos to merge the current branch and the dev-2.3.9 branch, then make a pull request to sync with the remote repository

Buy Now

Questions 32

What is the first of a Databricks Python notebook when viewed in a text editor?

Options:

%python

% Databricks notebook source

-- Databricks notebook source

//Databricks notebook source

Buy Now

Questions 33

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.

Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?

Options:

Scala is the only language that can be accurately tested using interactive notebooks; because the best performance is achieved by using Scala code compiled to JARs. all PySpark and Spark SQL logic should be refactored.

The only way to meaningfully troubleshoot code execution times in development notebooks Is to use production-sized data and production-sized clusters with Run All execution.

Production code development should only be done using an IDE; executing code against a local build of open source Spark and Delta Lake will provide the most accurate benchmarks for how code will perform in production.

Calling display () forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results.

The Jobs Ul should be leveraged to occasionally run the notebook as a job and track execution time during incremental code development because Photon can only be enabled on clusters launched for scheduled jobs.

Buy Now

Questions 34

Which statement describes integration testing?

Options:

Validates interactions between subsystems of your application

Requires an automated testing framework

Requires manual intervention

Validates an application use case

Validates behavior of individual elements of your application

Buy Now

Questions 35

Which is a key benefit of an end-to-end test?

Options:

It closely simulates real world usage of your application.

It pinpoint errors in the building blocks of your application.

It provides testing coverage for all code paths and branches.

It makes it easier to automate your test suite

Buy Now

Questions 36

A data engineer wants to join a stream of advertisement impressions (when an ad was shown) with another stream of user clicks on advertisements to correlate when impression led to monitizable clicks.

Databricks-Certified-Professional-Data-Engineer Question 36

Which solution would improve the performance?

Databricks-Certified-Professional-Data-Engineer Question 36

Options:

Option A

Option B

Option C

Option D

Buy Now

Exam Code: Databricks-Certified-Professional-Data-Engineer

Exam Name: Databricks Certified Data Engineer Professional Exam

Last Update: Nov 19, 2024

Questions: 120

PDF + Testing Engine

$56 ~~$159.99~~

Testing Engine

$42 ~~$119.99~~

PDF (Q&A)

$35 ~~$99.99~~

buy now Databricks-Certified-Professional-Data-Engineer pdf

Winter Sale - Special Limited Time 65% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: dpm65

dumpsmate logo

Contact Email:

Hot Vendors

Databricks-Certified-Professional-Data-Engineer Databricks Certified Data Engineer Professional Exam Questions and Answers

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options: