Databricks Certified Professional Data Engineer - Databricks-Certified-Professional-Data-Engineer Exam Practice Test

The data architect has mandated that all tables in the Lakehouse should be configured as external (also known as "unmanaged") Delta Lake tables.
Which approach will ensure that this requirement is met?
Correct Answer: A
The data science team has requested assistance in accelerating queries on free form text from user reviews.
The data is currently stored in Parquet with the below schema:
item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING
The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field.
A junior data engineer suggests converting this data to Delta Lake will improve query performance.
Which response to the junior data engineer's suggestion is correct?
Correct Answer: D
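To make the trade-off concrete, here is a minimal pure-Python sketch, assuming that file-level data skipping relies on per-file minimum and maximum column values. The sample reviews and the helper names (file_stats, contains_keyword) are hypothetical. It shows two files with identical min/max statistics for the review column, only one of which contains the keyword, so such statistics alone cannot decide a substring search.

```python
def file_stats(reviews):
    """Return the (min, max) string values a file-level statistic would record."""
    return min(reviews), max(reviews)


def contains_keyword(reviews, keyword):
    """Scan every review for the keyword, as a substring search must."""
    return any(keyword in review for review in reviews)


# Hypothetical sample data: two "files" of free-form review text.
file_a = ["amazing product", "got a refund quickly", "zero complaints"]
file_b = ["amazing product", "good value overall", "zero complaints"]

# Both files report the same min/max, yet only file_a mentions "refund".
same_stats = file_stats(file_a) == file_stats(file_b)
hits = (contains_keyword(file_a, "refund"), contains_keyword(file_b, "refund"))
```

Here same_stats is true while hits differs between the two files, which is why range statistics help equality and range filters far more than keyword matching on long text.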
A data engineer is designing a pipeline in Databricks that processes records from a Kafka stream where late-arriving data is common.
Which approach should the data engineer use?
Correct Answer: D
A developer has successfully configured credentials for Databricks Repos and cloned a remote Git repository.
They do not have privileges to make changes to the main branch, which is the only branch currently visible in their workspace.
Use Repos to pull changes from the remote Git repository; commit and push changes to a branch that appeared as the changes were pulled.
Correct Answer: D
A facilities-monitoring team is building a near-real-time Power BI dashboard off the Delta table device_readings:
* device_id STRING - unique sensor ID
* event_ts TIMESTAMP - ingestion timestamp (UTC)
* temperature_c DOUBLE - temperature in °C
* notes STRING
For each sensor, the team needs one row per non-overlapping 5-minute interval, offset by 2 minutes (for example, intervals like 00:02-00:07, 00:07-00:12, and so on), showing the average temperature in that slice.
The result must include each interval's start and end timestamps so downstream tools can plot time-series bars correctly. Which query satisfies the requirement?
Correct Answer: D
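The interval arithmetic in this question can be sketched in plain Python. This is an illustration of offset tumbling windows, not the query from the answer options; the function names and sample readings are hypothetical. In Spark, the window function in pyspark.sql.functions accepts a startTime argument for this kind of offset and returns a struct with start and end fields.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)   # interval length from the question
OFFSET = timedelta(minutes=2)   # intervals start at :02, :07, :12, ...
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_bounds(event_ts):
    """Return (start, end) of the offset 5-minute interval containing event_ts."""
    elapsed = event_ts - EPOCH - OFFSET
    start = EPOCH + OFFSET + (elapsed // WINDOW) * WINDOW
    return start, start + WINDOW


def average_by_window(readings):
    """Average temperature per (device_id, window start, window end)."""
    sums = {}
    for device_id, event_ts, temperature_c in readings:
        key = (device_id, *window_bounds(event_ts))
        total, count = sums.get(key, (0.0, 0))
        sums[key] = (total + temperature_c, count + 1)
    return {key: total / count for key, (total, count) in sums.items()}
```

A reading at 00:03 falls into the 00:02-00:07 interval, while a reading at exactly 00:07 opens the next interval, because each interval includes its start and excludes its end.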
A data architect has heard about Delta Lake's built-in versioning and time travel capabilities. For auditing purposes, they have a requirement to maintain a full record of all valid street addresses as they appear in the customers table.
The architect is interested in implementing a Type 1 table, overwriting existing records with new values and relying on Delta Lake time travel to support long-term auditing. A data engineer on the project feels that a Type 2 table will provide better performance and scalability.
Which piece of information is critical to this decision?
Correct Answer: A
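For readers less familiar with the two designs, the following is a minimal sketch of Type 2 behavior using plain Python dictionaries in place of table rows. The column names (start_date, end_date) and helper functions are assumptions for illustration only. Unlike a Type 1 overwrite, each address change closes the current row and appends a new one, so the history lives in the table itself rather than in retained table versions.

```python
from datetime import date


def apply_type2_change(history, customer_id, new_address, effective):
    """Close the customer's open row (if any) and append the new address version."""
    for row in history:
        if row["customer_id"] == customer_id and row["end_date"] is None:
            if row["address"] == new_address:
                return history  # nothing changed; keep the current row open
            row["end_date"] = effective
    history.append({
        "customer_id": customer_id,
        "address": new_address,
        "start_date": effective,
        "end_date": None,  # None marks the currently valid row
    })
    return history


def address_as_of(history, customer_id, as_of):
    """Return the address that was valid for the customer on the given date."""
    for row in history:
        if (row["customer_id"] == customer_id
                and row["start_date"] <= as_of
                and (row["end_date"] is None or as_of < row["end_date"])):
            return row["address"]
    return None
```

An audit query then becomes an ordinary filter on the validity dates, independent of how long old table versions are kept.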
A junior member of the data engineering team is exploring the language interoperability of Databricks notebooks. The intended outcome of the below code is to register a view of all sales that occurred in countries on the continent of Africa that appear in the geo_lookup table.
Before executing the code, running SHOW TABLES on the current database indicates the database contains only two tables: geo_lookup and sales.

Which statement correctly describes the outcome of executing these command cells in order in an interactive notebook?
Correct Answer: A
Which is a key benefit of an end-to-end test?
Correct Answer: B
A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max Durations for tasks in a particular stage show that the minimum and median times to complete a task are roughly the same, but the max duration is roughly 100 times as long as the minimum.
Which situation is causing increased duration of the overall job?
Correct Answer: A
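The duration pattern described here can be checked with a small heuristic. The sketch below is an illustration only: the function name, the threshold of 10, and the sample durations are assumptions, not values from the Spark UI. A stage where the slowest task far exceeds the median is commonly a sign that a few partitions hold much more data than the rest.

```python
from statistics import median


def straggler_ratio(task_durations_s):
    """Return max duration divided by median duration for one stage."""
    mid = median(task_durations_s)
    if mid == 0:
        return float("inf") if max(task_durations_s) > 0 else 1.0
    return max(task_durations_s) / mid


def looks_skewed(task_durations_s, ratio_threshold=10.0):
    """Flag a stage whose slowest task far exceeds the typical task."""
    return straggler_ratio(task_durations_s) >= ratio_threshold


# Hypothetical task durations in seconds for two stages.
balanced_stage = [10, 11, 10, 12, 13]
skewed_stage = [10, 11, 10, 12, 1000]
```

With these samples, only skewed_stage is flagged, since its slowest task runs about 90 times longer than its median task.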
A table named user_ltv is being used to create a view that will be used by data analysts on various teams.
Users in the workspace are configured into groups, which are used for setting up data access using ACLs.
The user_ltv table has the following schema:
email STRING, age INT, ltv INT
The following view definition is executed:

An analyst who is not a member of the marketing group executes the following query:
SELECT * FROM email_ltv
Which statement describes the results returned by this query?
Correct Answer: A
A streaming video analytics team ingests billions of events daily into a Unity Catalog-managed Delta table video_events. Analysts run ad-hoc point-lookup queries on columns like user_id, campaign_id, and region.
The team manually runs OPTIMIZE video_events ZORDER BY (user_id, campaign_id, region), but still sees poor performance on recent data and dislikes the operational overhead. The team wants a hands-off way to keep hot columns co-located as query patterns evolve.
Correct Answer: D
A Delta Lake table representing metadata about content posts from users has the following schema:
* user_id LONG
* post_text STRING
* post_id STRING
* longitude FLOAT
* latitude FLOAT
* post_time TIMESTAMP
* date DATE
Based on the above schema, which column is a good candidate for partitioning the Delta Table?
Correct Answer: D
Which REST API call can be used to review the notebooks configured to run as tasks in a multi-task job?
Correct Answer: B
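As an illustration of what reviewing task notebooks looks like in practice, the sketch below extracts notebook paths from a job definition. It assumes the payload shape returned by the Jobs API 2.1 jobs/get endpoint, where tasks sit under settings.tasks and notebook tasks carry a notebook_task.notebook_path field. The sample payload, job ID, and paths are hypothetical, and no network call is made.

```python
def notebook_paths(job_payload):
    """Map task_key to notebook path for every notebook task in a job definition."""
    tasks = job_payload.get("settings", {}).get("tasks", [])
    return {
        task["task_key"]: task["notebook_task"]["notebook_path"]
        for task in tasks
        if "notebook_task" in task  # skip non-notebook task types
    }


# Hypothetical job definition, shaped like a jobs/get response body.
sample_job = {
    "job_id": 123,
    "settings": {
        "name": "nightly_pipeline",
        "tasks": [
            {"task_key": "ingest",
             "notebook_task": {"notebook_path": "/Repos/etl/ingest"}},
            {"task_key": "transform",
             "notebook_task": {"notebook_path": "/Repos/etl/transform"}},
            {"task_key": "export",
             "spark_python_task": {"python_file": "dbfs:/scripts/export.py"}},
        ],
    },
}
```

Calling notebook_paths(sample_job) returns only the two notebook tasks and leaves out the Python-file task.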
A data engineer is creating a data ingestion pipeline to understand where customers are taking their rented bicycles during use. The engineer noticed that over time, data being transmitted from the bicycle sensors fails to include key details like latitude and longitude. Downstream analysts need both the clean records and the quarantined records available for separate processing.
The data engineer already has this code:
import dlt
from pyspark.sql.functions import expr

rules = {
    "valid_lat": "(lat IS NOT NULL)",
    "valid_long": "(long IS NOT NULL)"
}
quarantine_rules = "NOT({0})".format(" AND ".join(rules.values()))

@dlt.view
def raw_trips_data():
    return spark.readStream.table("ride_and_go.telemetry.trips")
How should the data engineer meet the requirements to capture good and bad data?
Correct Answer: D
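To see what the quarantine_rules expression evaluates to and how it divides records, here is a minimal pure-Python sketch that needs no Spark session. The is_quarantined and split_records helpers and the sample records are hypothetical stand-ins for the SQL predicate, with None playing the role of NULL. It is not the pipeline code from the answer options.

```python
rules = {
    "valid_lat": "(lat IS NOT NULL)",
    "valid_long": "(long IS NOT NULL)",
}
# Negate the conjunction of all rules: a record is bad if any rule fails.
quarantine_rules = "NOT({0})".format(" AND ".join(rules.values()))


def is_quarantined(record):
    """Python equivalent of quarantine_rules for dict records."""
    return not (record.get("lat") is not None and record.get("long") is not None)


def split_records(records):
    """Return (good, quarantined) lists so both remain available downstream."""
    good = [r for r in records if not is_quarantined(r)]
    bad = [r for r in records if is_quarantined(r)]
    return good, bad


# Hypothetical sensor records.
sample = [
    {"trip_id": 1, "lat": 52.1, "long": 4.3},
    {"trip_id": 2, "lat": None, "long": 4.4},
    {"trip_id": 3, "lat": 52.3, "long": None},
]
good, bad = split_records(sample)
```

The resulting expression is NOT((lat IS NOT NULL) AND (long IS NOT NULL)), so a record missing either coordinate lands in the quarantined set.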
A data engineer created a daily batch ingestion pipeline using a cluster with the latest DBR version to store banking transaction data, and persisted it in a MANAGED DELTA table called prod.gold.all_banking_transactions_daily. The data engineer is constantly receiving complaints from business users who query this table ad hoc through a SQL Serverless Warehouse about poor query performance. Upon analysis, the data engineer identified that these users frequently use high-cardinality columns as filters. The engineer now seeks to implement a data layout optimization technique that is incremental, easy to maintain, and can evolve over time.
Which command should the data engineer implement?
Correct Answer: C