Winter Sale - Special Limited Time 65% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: dpm65

Hot Vendors

Professional-Data-Engineer Google Professional Data Engineer Exam Questions and Answers

Questions 4

Your startup has never implemented a formal security policy. Currently, everyone in the company has access to the datasets stored in Google BigQuery. Teams have freedom to use the service as they see fit, and they have not documented their use cases. You have been asked to secure the data warehouse. You need to discover what everyone is doing. What should you do first?

Options:

A.

Use Google Stackdriver Audit Logs to review data access.

B.

Get the identity and access management IIAM) policy of each table

C.

Use Stackdriver Monitoring to see the usage of BigQuery query slots.

D.

Use the Google Cloud Billing API to see what account the warehouse is being billed to.

Buy Now
Questions 5

Your company is using WHILECARD tables to query data across multiple tables with similar names. The SQL statement is currently failing with the following error:

# Syntax error : Expected end of statement but got “-“ at [4:11]

SELECT age

FROM

bigquery-public-data.noaa_gsod.gsod

WHERE

age != 99

AND_TABLE_SUFFIX = ‘1929’

ORDER BY

age DESC

Which table name will make the SQL statement work correctly?

Options:

A.

‘bigquery-public-data.noaa_gsod.gsod‘

B.

bigquery-public-data.noaa_gsod.gsod*

C.

‘bigquery-public-data.noaa_gsod.gsod’*

D.

‘bigquery-public-data.noaa_gsod.gsod*`

Buy Now
Questions 6

You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics. Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded. The database must now store 100 times more patient records. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources. How should you adjust the database design?

Options:

A.

Add capacity (memory and disk space) to the database server by the order of 200.

B.

Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.

C.

Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join.

D.

Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports.

Buy Now
Questions 7

Your company is running their first dynamic campaign, serving different offers by analyzing real-time data during the holiday season. The data scientists are collecting terabytes of data that rapidly grows every hour during their 30-day campaign. They are using Google Cloud Dataflow to preprocess the data and collect the feature (signals) data that is needed for the machine learning model in Google Cloud Bigtable. The team is observing suboptimal performance with reads and writes of their initial load of 10 TB of data. They want to improve this performance while minimizing cost. What should they do?

Options:

A.

Redefine the schema by evenly distributing reads and writes across the row space of the table.

B.

The performance issue should be resolved over time as the site of the BigDate cluster is increased.

C.

Redesign the schema to use a single row key to identify values that need to be updated frequently in the cluster.

D.

Redesign the schema to use row keys based on numeric IDs that increase sequentially per user viewing the offers.

Buy Now
Questions 8

Your company is in a highly regulated industry. One of your requirements is to ensure individual users have access only to the minimum amount of information required to do their jobs. You want to enforce this requirement with Google BigQuery. Which three approaches can you take? (Choose three.)

Options:

A.

Disable writes to certain tables.

B.

Restrict access to tables by role.

C.

Ensure that the data is encrypted at all times.

D.

Restrict BigQuery API access to approved users.

E.

Segregate data across multiple tables or databases.

F.

Use Google Stackdriver Audit Logging to determine policy violations.

Buy Now
Questions 9

You have spent a few days loading data from comma-separated values (CSV) files into the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of click events. For convenience, you chose a simple schema where every field is treated as the STRING type. Now, you want to compute web session durations of users who visit your site, and you want to change its data type to the TIMESTAMP. You want to minimize the migration effort without making future queries computationally expensive. What should you do?

Options:

A.

Delete the table CLICK_STREAM, and then re-create it such that the column DT is of the TIMESTAMP type. Reload the data.

B.

Add a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate the numeric values from the column TS for each row. Reference the column TS instead of the column DT from now on.

C.

Create a view CLICK_STREAM_V, where strings from the column DT are cast into TIMESTAMP values. Reference the view CLICK_STREAM_V instead of the table CLICK_STREAM from now on.

D.

Add two columns to the table CLICK STREAM: TS of the TIMESTAMP type and IS_NEW of the BOOLEAN type. Reload all data in append mode. For each appended row, set the value of IS_NEW to true. For future queries, reference the column TS instead of the column DT, with the WHERE clause ensuring that the value of IS_NEW must be true.

E.

Construct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is the TIMESTAMP type. Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.

Buy Now
Questions 10

You work for a car manufacturer and have set up a data pipeline using Google Cloud Pub/Sub to capture anomalous sensor events. You are using a push subscription in Cloud Pub/Sub that calls a custom HTTPS endpoint that you have created to take action of these anomalous events as they occur. Your custom HTTPS endpoint keeps getting an inordinate amount of duplicate messages. What is the most likely cause of these duplicate messages?

Options:

A.

The message body for the sensor event is too large.

B.

Your custom endpoint has an out-of-date SSL certificate.

C.

The Cloud Pub/Sub topic has too many messages published to it.

D.

Your custom endpoint is not acknowledging messages within the acknowledgement deadline.

Buy Now
Questions 11

Your company built a TensorFlow neural-network model with a large number of neurons and layers. The model fits well for the training data. However, when tested against new data, it performs poorly. What method can you employ to address this?

Options:

A.

Threading

B.

Serialization

C.

Dropout Methods

D.

Dimensionality Reduction

Buy Now
Questions 12

You create an important report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. You notice that visualizations are not showing data that is less than 1 hour old. What should you do?

Options:

A.

Disable caching by editing the report settings.

B.

Disable caching in BigQuery by editing table details.

C.

Refresh your browser tab showing the visualizations.

D.

Clear your browser history for the past hour then reload the tab showing the virtualizations.

Buy Now
Questions 13

You are deploying 10,000 new Internet of Things devices to collect temperature data in your warehouses globally. You need to process, store and analyze these very large datasets in real time. What should you do?

Options:

A.

Send the data to Google Cloud Datastore and then export to BigQuery.

B.

Send the data to Google Cloud Pub/Sub, stream Cloud Pub/Sub to Google Cloud Dataflow, and store the data in Google BigQuery.

C.

Send the data to Cloud Storage and then spin up an Apache Hadoop cluster as needed in Google Cloud Dataproc whenever analysis is required.

D.

Export logs in batch to Google Cloud Storage and then spin up a Google Cloud SQL instance, import the data from Cloud Storage, and run an analysis as needed.

Buy Now
Questions 14

You are designing a basket abandonment system for an ecommerce company. The system will send a message to a user based on these rules:

    No interaction by the user on the site for 1 hour

    Has added more than $30 worth of products to the basket

    Has not completed a transaction

You use Google Cloud Dataflow to process the data and decide if a message should be sent. How should you design the pipeline?

Options:

A.

Use a fixed-time window with a duration of 60 minutes.

B.

Use a sliding time window with a duration of 60 minutes.

C.

Use a session window with a gap time duration of 60 minutes.

D.

Use a global window with a time based trigger with a delay of 60 minutes.

Buy Now
Questions 15

You are building new real-time data warehouse for your company and will use Google BigQuery streaming inserts. There is no guarantee that data will only be sent in once but you do have a unique ID for each row of data and an event timestamp. You want to ensure that duplicates are not included while interactively querying data. Which query type should you use?

Options:

A.

Include ORDER BY DESK on timestamp column and LIMIT to 1.

B.

Use GROUP BY on the unique ID column and timestamp column and SUM on the values.

C.

Use the LAG window function with PARTITION by unique ID along with WHERE LAG IS NOT NULL.

D.

Use the ROW_NUMBER window function with PARTITION by unique ID along with WHERE row equals 1.

Buy Now
Questions 16

Your company needs to upload their historic data to Cloud Storage. The security rules don’t allow access from external IPs to their on-premises resources. After an initial upload, they will add new data from existing on-premises applications every day. What should they do?

Options:

A.

Execute gsutil rsync from the on-premises servers.

B.

Use Cloud Dataflow and write the data to Cloud Storage.

C.

Write a job template in Cloud Dataproc to perform the data transfer.

D.

Install an FTP server on a Compute Engine VM to receive the files and move them to Cloud Storage.

Buy Now
Questions 17

You are architecting a data transformation solution for BigQuery. Your developers are proficient with SOL and want to use the ELT development technique. In addition, your developers need an intuitive coding environment and the ability to manage SQL as code. You need to identify a solution for your developers to build these pipelines. What should you do?

Options:

A.

Use Cloud Composer to load data and run SQL pipelines by using the BigQuery job operators.

B.

Use Dataflow jobs to read data from Pub/Sub, transform the data, and load the data to BigQuery.

C.

Use Dataform to build, manage, and schedule SQL pipelines.

D.

Use Data Fusion to build and execute ETL pipelines

Buy Now
Questions 18

You are a head of BI at a large enterprise company with multiple business units that each have different priorities and budgets. You use on-demand pricing for BigQuery with a quota of 2K concurrent on-demand slots per project. Users at your organization sometimes don’t get slots to execute their query and you need to correct this. You’d like to avoid introducing new projects to your account.

What should you do?

Options:

A.

Convert your batch BQ queries into interactive BQ queries.

B.

Create an additional project to overcome the 2K on-demand per-project quota.

C.

Switch to flat-rate pricing and establish a hierarchical priority model for your projects.

D.

Increase the amount of concurrent slots per project at the Quotas page at the Cloud Console.

Buy Now
Questions 19

Your company is selecting a system to centralize data ingestion and delivery. You are considering messaging and data integration systems to address the requirements. The key requirements are:

    The ability to seek to a particular offset in a topic, possibly back to the start of all data ever captured

    Support for publish/subscribe semantics on hundreds of topics

    Retain per-key ordering

Which system should you choose?

Options:

A.

Apache Kafka

B.

Cloud Storage

C.

Cloud Pub/Sub

D.

Firebase Cloud Messaging

Buy Now
Questions 20

You work for an airline and you need to store weather data in a BigQuery table Weather data will be used as input to a machine learning model. The model only uses the last 30 days of weather data. You want to avoid storing unnecessary data and minimize costs. What should you do?

Options:

A.

Create a BigQuery table where each record has an ingestion timestamp Run a scheduled query to delete all the rows with an ingestion timestamp older than 30 days.

B.

Create a BigQuery table partitioned by ingestion time Set up partition expiration to 30 days.

C.

Create a BigQuery table partitioned by datetime value of the weather date Set up partition expiration to 30 days.

D.

Create a BigQuery table with a datetime column for the day the weather data refers to. Run a scheduled query to delete rows with a datetime value older than 30 days.

Buy Now
Questions 21

You have several different unstructured data sources, within your on-premises data center as well as in the cloud. The data is in various formats, such as Apache Parquet and CSV. You want to centralize this data in Cloud Storage. You need to set up an object sink for your data that allows you to use your own encryption keys. You want to use a GUI-based solution. What should you do?

Options:

A.

Use Cloud Data Fusion to move files into Cloud Storage.

B.

Use Storage Transfer Service to move files into Cloud Storage.

C.

Use Dataflow to move files into Cloud Storage.

D.

Use BigQuery Data Transfer Service to move files into BigQuery.

Buy Now
Questions 22

You are collecting loT sensor data from millions of devices across the world and storing the data in BigQuery. Your access pattern is based on recent data tittered by location_id and device_version with the following query:

Professional-Data-Engineer Question 22

You want to optimize your queries for cost and performance. How should you structure your data?

Options:

A.

Partition table data by create_date, location_id and device_version

B.

Partition table data by create_date cluster table data by tocation_id and device_version

C.

Cluster table data by create_date location_id and device_version

D.

Cluster table data by create_date, partition by location and device_version

Buy Now
Questions 23

You are operating a streaming Cloud Dataflow pipeline. Your engineers have a new version of the pipeline with a different windowing algorithm and triggering strategy. You want to update the running pipeline with the new version. You want to ensure that no data is lost during the update. What should you do?

Options:

A.

Update the Cloud Dataflow pipeline inflight by passing the --update option with the --jobName set to the existing job name

B.

Update the Cloud Dataflow pipeline inflight by passing the --update option with the --jobName set to a new unique job name

C.

Stop the Cloud Dataflow pipeline with the Cancel option. Create a new Cloud Dataflow job with the updated code

D.

Stop the Cloud Dataflow pipeline with the Drain option. Create a new Cloud Dataflow job with the updated code

Buy Now
Questions 24

You need to set access to BigQuery for different departments within your company. Your solution should comply with the following requirements:

    Each department should have access only to their data.

    Each department will have one or more leads who need to be able to create and update tables and provide them to their team.

    Each department has data analysts who need to be able to query but not modify data.

How should you set access to the data in BigQuery?

Options:

A.

Create a dataset for each department. Assign the department leads the role of OWNER, and assign the data analysts the role of WRITER on their dataset.

B.

Create a dataset for each department. Assign the department leads the role of WRITER, and assign the data analysts the role of READER on their dataset.

C.

Create a table for each department. Assign the department leads the role of Owner, and assign the data analysts the role of Editor on the project the table is in.

D.

Create a table for each department. Assign the department leads the role of Editor, and assign the data analysts the role of Viewer on the project the table is in.

Buy Now
Questions 25

You have an Apache Kafka Cluster on-prem with topics containing web application logs. You need to replicate the data to Google Cloud for analysis in BigQuery and Cloud Storage. The preferred replication method is mirroring to avoid deployment of Kafka Connect plugins.

What should you do?

Options:

A.

Deploy a Kafka cluster on GCE VM Instances. Configure your on-prem cluster to mirror your topics to the cluster running in GCE. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.

B.

Deploy a Kafka cluster on GCE VM Instances with the PubSub Kafka connector configured as a Sink connector. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.

C.

Deploy the PubSub Kafka connector to your on-prem Kafka cluster and configure PubSub as a Source connector. Use a Dataflow job to read fron PubSub and write to GCS.

D.

Deploy the PubSub Kafka connector to your on-prem Kafka cluster and configure PubSub as a Sink connector. Use a Dataflow job to read fron PubSub and write to GCS.

Buy Now
Questions 26

You are responsible for writing your company’s ETL pipelines to run on an Apache Hadoop cluster. The

pipeline will require some checkpointing and splitting pipelines. Which method should you use to write the

pipelines?

Options:

A.

PigLatin using Pig

B.

HiveQL using Hive

C.

Java using MapReduce

D.

Python using MapReduce

Buy Now
Questions 27

You have enabled the free integration between Firebase Analytics and Google BigQuery. Firebase now

automatically creates a new table daily in BigQuery in the format app_events_YYYYMMDD. You want to

query all of the tables for the past 30 days in legacy SQL. What should you do?

Options:

A.

Use the TABLE_DATE_RANGE function

B.

Use the WHERE_PARTITIONTIME pseudo column

C.

Use WHERE date BETWEEN YYYY-MM-DD AND YYYY-MM-DD

D.

Use SELECT IF.(date >= YYYY-MM-DD AND date <= YYYY-MM-DD

Buy Now
Questions 28

You are developing an Apache Beam pipeline to extract data from a Cloud SQL instance by using JdbclO. You have two projects running in Google Cloud. The pipeline will be deployed and executed on Dataflow in Project A. The Cloud SQL instance is running jn Project B and does not have a public IP address. After deploying the pipeline, you noticed that the pipeline failed to extract data from the Cloud SQL instance due to connection failure. You verified that VPC Service Controls and shared VPC are not in use in these projects. You want to resolve this error while ensuring that the data does not go through the public internet. What should you do?

Options:

A.

Set up VPC Network Peering between Project A and Project B. Add a firewall rule to allow the peered subnet range to access all instances on the network.

B.

Turn off the external IP addresses on the Dataflow worker. Enable Cloud NAT in Project A.

C.

Set up VPC Network Peering between Project A and Project B. Create a Compute Engine instance without external IP address in Project B on the peered subnet to serve as a proxy server to the Cloud SQL database.

D.

Add the external IP addresses of the Dataflow worker as authorized networks in the Cloud SOL instance.

Buy Now
Questions 29

Your car factory is pushing machine measurements as messages into a Pub/Sub topic in your Google Cloud project. A Dataflow streaming job. that you wrote with the Apache Beam SDK, reads these messages, sends acknowledgment lo Pub/Sub. applies some custom business logic in a Doffs instance, and writes the result to BigQuery. You want to ensure that if your business logic fails on a message, the message will be sent to a Pub/Sub topic that you want to monitor for alerting purposes. What should you do?

Options:

A.

Use an exception handling block in your Data Flow’s Doffs code to push the messages that failed to be transformed through a side output

and to a new Pub/Sub topic. Use Cloud Monitoring to monitor the topic/num_jnacked_messages_by_region metric on this new topic.

B.

Enable retaining of acknowledged messages in your Pub/Sub pull subscription. Use Cloud Monitoring to monitor the

subscription/num_retained_acked_messages metric on this subscription.

C.

Enable dead lettering in your Pub/Sub pull subscription, and specify a new Pub/Sub topic as the dead letter topic. Use Cloud Monitoring to

monitor the subscription/dead_letter_message_count metric on your pull subscription.

D.

Create a snapshot of your Pub/Sub pull subscription. Use Cloud Monitoring to monitor the snapshot/numessages metric on this

snapshot.

Buy Now
Questions 30

You want to use a database of information about tissue samples to classify future tissue samples as either normal or mutated. You are evaluating an unsupervised anomaly detection method for classifying the tissue samples. Which two characteristic support this method? (Choose two.)

Options:

A.

There are very few occurrences of mutations relative to normal samples.

B.

There are roughly equal occurrences of both normal and mutated samples in the database.

C.

You expect future mutations to have different features from the mutated samples in the database.

D.

You expect future mutations to have similar features to the mutated samples in the database.

E.

You already have labels for which samples are mutated and which are normal in the database.

Buy Now
Questions 31

You are designing an Apache Beam pipeline to enrich data from Cloud Pub/Sub with static reference data from BigQuery. The reference data is small enough to fit in memory on a single worker. The pipeline should write enriched results to BigQuery for analysis. Which job type and transforms should this pipeline use?

Options:

A.

Batch job, PubSubIO, side-inputs

B.

Streaming job, PubSubIO, JdbcIO, side-outputs

C.

Streaming job, PubSubIO, BigQueryIO, side-inputs

D.

Streaming job, PubSubIO, BigQueryIO, side-outputs

Buy Now
Questions 32

You are running your BigQuery project in the on-demand billing model and are executing a change data capture (CDC) process that ingests data. The CDC process loads 1 GB of data every 10 minutes into a temporary table, and then performs a merge into a 10 TB target table. This process is very scan intensive and you want to explore options to enable a predictable cost model. You need to create a BigQuery reservation based on utilization information gathered from BigQuery Monitoring and apply the reservation to the CDC process. What should you do?

Options:

A.

Create a BigQuery reservation for the job.

B.

Create a BigQuery reservation for the service account running the job.

C.

Create a BigQuery reservation for the dataset.

D.

Create a BigQuery reservation for the project.

Buy Now
Questions 33

You use a dataset in BigQuery for analysis. You want to provide third-party companies with access to the same dataset. You need to keep the costs of data sharing low and ensure that the data is current. Which solution should you choose?

Options:

A.

Create an authorized view on the BigQuery table to control data access, and provide third-party companies with access to that view.

B.

Use Cloud Scheduler to export the data on a regular basis to Cloud Storage, and provide third-party companies with access to the bucket.

C.

Create a separate dataset in BigQuery that contains the relevant data to share, and provide third-party companies with access to the new dataset.

D.

Create a Cloud Dataflow job that reads the data in frequent time intervals, and writes it to the relevant BigQuery dataset or Cloud Storage bucket for third-party companies to use.

Buy Now
Questions 34

You are planning to migrate your current on-premises Apache Hadoop deployment to the cloud. You need to ensure that the deployment is as fault-tolerant and cost-effective as possible for long-running batch jobs. You want to use a managed service. What should you do?

Options:

A.

Deploy a Cloud Dataproc cluster. Use a standard persistent disk and 50% preemptible workers. Store data in Cloud Storage, and change references in scripts from hdfs:// to gs://

B.

Deploy a Cloud Dataproc cluster. Use an SSD persistent disk and 50% preemptible workers. Store data in Cloud Storage, and change references in scripts from hdfs:// to gs://

C.

Install Hadoop and Spark on a 10-node Compute Engine instance group with standard instances. Install the Cloud Storage connector, and store the data in Cloud Storage. Change references in scripts from hdfs:// to gs://

D.

Install Hadoop and Spark on a 10-node Compute Engine instance group with preemptible instances. Store data in HDFS. Change references in scripts from hdfs:// to gs://

Buy Now
Questions 35

A web server sends click events to a Pub/Sub topic as messages. The web server includes an event Timestamp attribute in the messages, which is the time when the click occurred. You have a Dataflow streaming job that reads from this Pub/Sub topic through a subscription, applies some transformations, and writes the result to another Pub/Sub topic for use by the advertising department. The advertising department needs to receive each message within 30 seconds of the corresponding click occurrence, but they report receiving the messages late. Your Dataflow job's system lag is about 5 seconds, and the data freshness is about 40 seconds. Inspecting a few messages show no more than 1 second lag between their event Timestamp and publish Time. What is the problem and what should you do?

Options:

A.

The advertising department is causing delays when consuming the messages. Work with the advertising department to fix this.

B.

Messages in your Dataflow job are processed in less than 30 seconds, but your job cannot keep up with the backlog in the Pub/Sub

subscription. Optimize your job or increase the number of workers to fix this.

C.

The web server is not pushing messages fast enough to Pub/Sub. Work with the web server team to fix this.

D.

Messages in your Dataflow job are taking more than 30 seconds to process. Optimize your job or increase the number of workers to fix this.

Buy Now
Questions 36

Your company has recently grown rapidly and now ingesting data at a significantly higher rate than it was previously. You manage the daily batch MapReduce analytics jobs in Apache Hadoop. However, the recent increase in data has meant the batch jobs are falling behind. You were asked to recommend ways the development team could increase the responsiveness of the analytics without increasing costs. What should you recommend they do?

Options:

A.

Rewrite the job in Pig.

B.

Rewrite the job in Apache Spark.

C.

Increase the size of the Hadoop cluster.

D.

Decrease the size of the Hadoop cluster but also rewrite the job in Hive.

Buy Now
Questions 37

You are designing the database schema for a machine learning-based food ordering service that will predict what users want to eat. Here is some of the information you need to store:

    The user profile: What the user likes and doesn’t like to eat

    The user account information: Name, address, preferred meal times

    The order information: When orders are made, from where, to whom

The database will be used to store all the transactional data of the product. You want to optimize the data schema. Which Google Cloud Platform product should you use?

Options:

A.

BigQuery

B.

Cloud SQL

C.

Cloud Bigtable

D.

Cloud Datastore

Buy Now
Questions 38

You work for an economic consulting firm that helps companies identify economic trends as they happen. As part of your analysis, you use Google BigQuery to correlate customer data with the average prices of the 100 most common goods sold, including bread, gasoline, milk, and others. The average prices of these goods are updated every 30 minutes. You want to make sure this data stays up to date so you can combine it with other data in BigQuery as cheaply as possible. What should you do?

Options:

A.

Load the data every 30 minutes into a new partitioned table in BigQuery.

B.

Store and update the data in a regional Google Cloud Storage bucket and create a federated data source in BigQuery

C.

Store the data in Google Cloud Datastore. Use Google Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Cloud Datastore

D.

Store the data in a file in a regional Google Cloud Storage bucket. Use Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Google Cloud Storage.

Buy Now
Questions 39

You work for a manufacturing plant that batches application log files together into a single log file once a day at 2:00 AM. You have written a Google Cloud Dataflow job to process that log file. You need to make sure the log file in processed once per day as inexpensively as possible. What should you do?

Options:

A.

Change the processing job to use Google Cloud Dataproc instead.

B.

Manually start the Cloud Dataflow job each morning when you get into the office.

C.

Create a cron job with Google App Engine Cron Service to run the Cloud Dataflow job.

D.

Configure the Cloud Dataflow job as a streaming job so that it processes the log data immediately.

Buy Now
Questions 40

Your company is loading comma-separated values (CSV) files into Google BigQuery. The data is fully imported successfully; however, the imported data is not matching byte-to-byte to the source file. What is the most likely cause of this problem?

Options:

A.

The CSV data loaded in BigQuery is not flagged as CSV.

B.

The CSV data has invalid rows that were skipped on import.

C.

The CSV data loaded in BigQuery is not using BigQuery’s default encoding.

D.

The CSV data has not gone through an ETL phase before loading into BigQuery.

Buy Now
Questions 41

You are choosing a NoSQL database to handle telemetry data submitted from millions of Internet-of-Things (IoT) devices. The volume of data is growing at 100 TB per year, and each data entry has about 100 attributes. The data processing pipeline does not require atomicity, consistency, isolation, and durability (ACID). However, high availability and low latency are required.

You need to analyze the data by querying against individual fields. Which three databases meet your requirements? (Choose three.)

Options:

A.

Redis

B.

HBase

C.

MySQL

D.

MongoDB

E.

Cassandra

F.

HDFS with Hive

Buy Now
Questions 42

Your company produces 20,000 files every hour. Each data file is formatted as a comma separated values (CSV) file that is less than 4 KB. All files must be ingested on Google Cloud Platform before they can be processed. Your company site has a 200 ms latency to Google Cloud, and your Internet connection bandwidth is limited as 50 Mbps. You currently deploy a secure FTP (SFTP) server on a virtual machine in Google Compute Engine as the data ingestion point. A local SFTP client runs on a dedicated machine to transmit the CSV files as is. The goal is to make reports with data from the previous day available to the executives by 10:00 a.m. each day. This design is barely able to keep up with the current volume, even though the bandwidth utilization is rather low.

You are told that due to seasonality, your company expects the number of files to double for the next three months. Which two actions should you take? (choose two.)

Options:

A.

Introduce data compression for each file to increase the rate file of file transfer.

B.

Contact your internet service provider (ISP) to increase your maximum bandwidth to at least 100 Mbps.

C.

Redesign the data ingestion process to use gsutil tool to send the CSV files to a storage bucket in parallel.

D.

Assemble 1,000 files into a tape archive (TAR) file. Transmit the TAR files instead, and disassemble the CSV files in the cloud upon receiving them.

E.

Create an S3-compatible storage endpoint in your network, and use Google Cloud Storage Transfer Service to transfer on-premices data to the designated storage bucket.

Buy Now
Questions 43

You are deploying a new storage system for your mobile application, which is a media streaming service. You decide the best fit is Google Cloud Datastore. You have entities with multiple properties, some of which can take on multiple values. For example, in the entity ‘Movie’ the property ‘actors’ and the property ‘tags’ have multiple values but the property ‘date released’ does not. A typical query would ask for all movies with actor= ordered by date_released or all movies with tag=Comedy ordered by date_released. How should you avoid a combinatorial explosion in the number of indexes?

Professional-Data-Engineer Question 43

Professional-Data-Engineer Question 43

Options:

A.

Option A

B.

Option B.

C.

Option C

D.

Option D

Buy Now
Questions 44

Given the record streams MJTelco is interested in ingesting per day, they are concerned about the cost of Google BigQuery increasing. MJTelco asks you to provide a design solution. They require a single large data table called tracking_table. Additionally, they want to minimize the cost of daily queries while performing fine-grained analysis of each day’s events. They also want to use streaming ingestion. What should you do?

Options:

A.

Create a table called tracking_table and include a DATE column.

B.

Create a partitioned table called tracking_table and include a TIMESTAMP column.

C.

Create sharded tables for each day following the pattern tracking_table_YYYYMMDD.

D.

Create a table called tracking_table with a TIMESTAMP column to represent the day.

Buy Now
Questions 45

You need to compose visualizations for operations teams with the following requirements:

Which approach meets the requirements?

Options:

A.

Load the data into Google Sheets, use formulas to calculate a metric, and use filters/sorting to show only suboptimal links in a table.

B.

Load the data into Google BigQuery tables, write Google Apps Script that queries the data, calculates the metric, and shows only suboptimal rows in a table in Google Sheets.

C.

Load the data into Google Cloud Datastore tables, write a Google App Engine Application that queries all rows, applies a function to derive the metric, and then renders results in a table using the Google charts and visualization API.

D.

Load the data into Google BigQuery tables, write a Google Data Studio 360 report that connects to your data, calculates a metric, and then uses a filter expression to show only suboptimal rows in a table.

Buy Now
Questions 46

MJTelco needs you to create a schema in Google Bigtable that will allow for the historical analysis of the last 2 years of records. Each record that comes in is sent every 15 minutes, and contains a unique identifier of the device and a data record. The most common query is for all the data for a given device for a given day. Which schema should you use?

Options:

A.

Rowkey: date#device_idColumn data: data_point

B.

Rowkey: dateColumn data: device_id, data_point

C.

Rowkey: device_idColumn data: date, data_point

D.

Rowkey: data_pointColumn data: device_id, date

E.

Rowkey: date#data_pointColumn data: device_id

Buy Now
Questions 47

MJTelco is building a custom interface to share data. They have these requirements:

    They need to do aggregations over their petabyte-scale datasets.

    They need to scan specific time range rows with a very fast response time (milliseconds).

Which combination of Google Cloud Platform products should you recommend?

Options:

A.

Cloud Datastore and Cloud Bigtable

B.

Cloud Bigtable and Cloud SQL

C.

BigQuery and Cloud Bigtable

D.

BigQuery and Cloud Storage

Buy Now
Questions 48

MJTelco’s Google Cloud Dataflow pipeline is now ready to start receiving data from the 50,000 installations. You want to allow Cloud Dataflow to scale its compute power up as required. Which Cloud Dataflow pipeline configuration setting should you update?

Options:

A.

The zone

B.

The number of workers

C.

The disk size per worker

D.

The maximum number of workers

Buy Now
Questions 49

Flowlogistic wants to use Google BigQuery as their primary analysis system, but they still have Apache Hadoop and Spark workloads that they cannot move to BigQuery. Flowlogistic does not know how to store the data that is common to both workloads. What should they do?

Options:

A.

Store the common data in BigQuery as partitioned tables.

B.

Store the common data in BigQuery and expose authorized views.

C.

Store the common data encoded as Avro in Google Cloud Storage.

D.

Store he common data in the HDFS storage for a Google Cloud Dataproc cluster.

Buy Now
Questions 50

You create a new report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. It is company policy to ensure employees can view only the data associated with their region, so you create and populate a table for each region. You need to enforce the regional access policy to the data.

Which two actions should you take? (Choose two.)

Options:

A.

Ensure all the tables are included in global dataset.

B.

Ensure each table is included in a dataset for a region.

C.

Adjust the settings for each table to allow a related region-based security group view access.

D.

Adjust the settings for each view to allow a related region-based security group view access.

E.

Adjust the settings for each dataset to allow a related region-based security group view access.

Buy Now
Questions 51

You need to compose visualization for operations teams with the following requirements:

    Telemetry must include data from all 50,000 installations for the most recent 6 weeks (sampling once every minute)

    The report must not be more than 3 hours delayed from live data.

    The actionable report should only show suboptimal links.

    Most suboptimal links should be sorted to the top.

    Suboptimal links can be grouped and filtered by regional geography.

    User response time to load the report must be <5 seconds.

You create a data source to store the last 6 weeks of data, and create visualizations that allow viewers to see multiple date ranges, distinct geographic regions, and unique installation types. You always show the latest data without any changes to your visualizations. You want to avoid creating and updating new visualizations each month. What should you do?

Options:

A.

Look through the current data and compose a series of charts and tables, one for each possible

combination of criteria.

B.

Look through the current data and compose a small set of generalized charts and tables bound to criteria filters that allow value selection.

C.

Export the data to a spreadsheet, compose a series of charts and tables, one for each possible

combination of criteria, and spread them across multiple tabs.

D.

Load the data into relational database tables, write a Google App Engine application that queries all rows, summarizes the data across each criteria, and then renders results using the Google Charts and visualization API.

Buy Now
Questions 52

Flowlogistic’s management has determined that the current Apache Kafka servers cannot handle the data volume for their real-time inventory tracking system. You need to build a new system on Google Cloud Platform (GCP) that will feed the proprietary tracking software. The system must be able to ingest data from a variety of global sources, process and query in real-time, and store the data reliably. Which combination of GCP products should you choose?

Options:

A.

Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage

B.

Cloud Pub/Sub, Cloud Dataflow, and Local SSD

C.

Cloud Pub/Sub, Cloud SQL, and Cloud Storage

D.

Cloud Load Balancing, Cloud Dataflow, and Cloud Storage

Buy Now
Questions 53

Flowlogistic is rolling out their real-time inventory tracking system. The tracking devices will all send package-tracking messages, which will now go to a single Google Cloud Pub/Sub topic instead of the Apache Kafka cluster. A subscriber application will then process the messages for real-time reporting and store them in Google BigQuery for historical analysis. You want to ensure the package data can be analyzed over time.

Which approach should you take?

Options:

A.

Attach the timestamp on each message in the Cloud Pub/Sub subscriber application as they are received.

B.

Attach the timestamp and Package ID on the outbound message from each publisher device as they are sent to Clod Pub/Sub.

C.

Use the NOW () function in BigQuery to record the event’s time.

D.

Use the automatically generated timestamp from Cloud Pub/Sub to order the data.

Buy Now
Questions 54

Which action can a Cloud Dataproc Viewer perform?

Options:

A.

Submit a job.

B.

Create a cluster.

C.

Delete a cluster.

D.

List the jobs.

Buy Now
Questions 55

Flowlogistic’s CEO wants to gain rapid insight into their customer base so his sales team can be better informed in the field. This team is not very technical, so they’ve purchased a visualization tool to simplify the creation of BigQuery reports. However, they’ve been overwhelmed by all the data in the table, and are spending a lot of money on queries trying to find the data they need. You want to solve their problem in the most cost-effective way. What should you do?

Options:

A.

Export the data into a Google Sheet for virtualization.

B.

Create an additional table with only the necessary columns.

C.

Create a view on the table to present to the virtualization tool.

D.

Create identity and access management (IAM) roles on the appropriate columns, so only they appear in a query.

Buy Now
Questions 56

What are the minimum permissions needed for a service account used with Google Dataproc?

Options:

A.

Execute to Google Cloud Storage; write to Google Cloud Logging

B.

Write to Google Cloud Storage; read to Google Cloud Logging

C.

Execute to Google Cloud Storage; execute to Google Cloud Logging

D.

Read and write to Google Cloud Storage; write to Google Cloud Logging

Buy Now
Questions 57

Dataproc clusters contain many configuration files. To update these files, you will need to use the --properties option. The format for the option is: file_prefix:property=_____.

Options:

A.

details

B.

value

C.

null

D.

id

Buy Now
Questions 58

To give a user read permission for only the first three columns of a table, which access control method would you use?

Options:

A.

Primitive role

B.

Predefined role

C.

Authorized view

D.

It's not possible to give access to only the first three columns of a table.

Buy Now
Questions 59

In order to securely transfer web traffic data from your computer's web browser to the Cloud Dataproc cluster you should use a(n) _____.

Options:

A.

VPN connection

B.

Special browser

C.

SSH tunnel

D.

FTP connection

Buy Now
Questions 60

How would you query specific partitions in a BigQuery table?

Options:

A.

Use the DAY column in the WHERE clause

B.

Use the EXTRACT(DAY) clause

C.

Use the __PARTITIONTIME pseudo-column in the WHERE clause

D.

Use DATE BETWEEN in the WHERE clause

Buy Now
Questions 61

To run a TensorFlow training job on your own computer using Cloud Machine Learning Engine, what would your command start with?

Options:

A.

gcloud ml-engine local train

B.

gcloud ml-engine jobs submit training

C.

gcloud ml-engine jobs submit training local

D.

You can't run a TensorFlow program on your own computer using Cloud ML Engine .

Buy Now
Questions 62

The Dataflow SDKs have been recently transitioned into which Apache service?

Options:

A.

Apache Spark

B.

Apache Hadoop

C.

Apache Kafka

D.

Apache Beam

Buy Now
Questions 63

Which methods can be used to reduce the number of rows processed by BigQuery?

Options:

A.

Splitting tables into multiple tables; putting data in partitions

B.

Splitting tables into multiple tables; putting data in partitions; using the LIMIT clause

C.

Putting data in partitions; using the LIMIT clause

D.

Splitting tables into multiple tables; using the LIMIT clause

Buy Now
Questions 64

Which of the following is not possible using primitive roles?

Options:

A.

Give a user viewer access to BigQuery and owner access to Google Compute Engine instances.

B.

Give UserA owner access and UserB editor access for all datasets in a project.

C.

Give a user access to view all datasets in a project, but not run queries on them.

D.

Give GroupA owner access and GroupB editor access for all datasets in a project.

Buy Now
Questions 65

What is the general recommendation when designing your row keys for a Cloud Bigtable schema?

Options:

A.

Include multiple time series values within the row key

B.

Keep the row keep as an 8 bit integer

C.

Keep your row key reasonably short

D.

Keep your row key as long as the field permits

Buy Now
Questions 66

Which Java SDK class can you use to run your Dataflow programs locally?

Options:

A.

LocalRunner

B.

DirectPipelineRunner

C.

MachineRunner

D.

LocalPipelineRunner

Buy Now
Questions 67

Which of the following is NOT true about Dataflow pipelines?

Options:

A.

Dataflow pipelines are tied to Dataflow, and cannot be run on any other runner

B.

Dataflow pipelines can consume data from other Google Cloud services

C.

Dataflow pipelines can be programmed in Java

D.

Dataflow pipelines use a unified programming model, so can work both with streaming and batch data sources

Buy Now
Questions 68

Does Dataflow process batch data pipelines or streaming data pipelines?

Options:

A.

Only Batch Data Pipelines

B.

Both Batch and Streaming Data Pipelines

C.

Only Streaming Data Pipelines

D.

None of the above

Buy Now
Questions 69

If you want to create a machine learning model that predicts the price of a particular stock based on its recent price history, what type of estimator should you use?

Options:

A.

Unsupervised learning

B.

Regressor

C.

Classifier

D.

Clustering estimator

Buy Now
Questions 70

Which of the following is not true about Dataflow pipelines?

Options:

A.

Pipelines are a set of operations

B.

Pipelines represent a data processing job

C.

Pipelines represent a directed graph of steps

D.

Pipelines can share data between instances

Buy Now
Questions 71

You want to use a BigQuery table as a data sink. In which writing mode(s) can you use BigQuery as a sink?

Options:

A.

Both batch and streaming

B.

BigQuery cannot be used as a sink

C.

Only batch

D.

Only streaming

Buy Now
Questions 72

When creating a new Cloud Dataproc cluster with the projects.regions.clusters.create operation, these four values are required: project, region, name, and ____.

Options:

A.

zone

B.

node

C.

label

D.

type

Buy Now
Questions 73

You are developing a software application using Google's Dataflow SDK, and want to use conditional, for loops and other complex programming structures to create a branching pipeline. Which component will be used for the data processing operation?

Options:

A.

PCollection

B.

Transform

C.

Pipeline

D.

Sink API

Buy Now
Questions 74

Your weather app queries a database every 15 minutes to get the current temperature. The frontend is powered by Google App Engine and server millions of users. How should you design the frontend to respond to a database failure?

Options:

A.

Issue a command to restart the database servers.

B.

Retry the query with exponential backoff, up to a cap of 15 minutes.

C.

Retry the query every second until it comes back online to minimize staleness of data.

D.

Reduce the query frequency to once every hour until the database comes back online.

Buy Now
Questions 75

Your company handles data processing for a number of different clients. Each client prefers to use their own suite of analytics tools, with some allowing direct query access via Google BigQuery. You need to secure the data so that clients cannot see each other’s data. You want to ensure appropriate access to the data. Which three steps should you take? (Choose three.)

Options:

A.

Load data into different partitions.

B.

Load data into a different dataset for each client.

C.

Put each client’s BigQuery dataset into a different table.

D.

Restrict a client’s dataset to approved users.

E.

Only allow a service account to access the datasets.

F.

Use the appropriate identity and access management (IAM) roles for each client’s users.

Buy Now
Questions 76

Your company’s on-premises Apache Hadoop servers are approaching end-of-life, and IT has decided to migrate the cluster to Google Cloud Dataproc. A like-for-like migration of the cluster would require 50 TB of Google Persistent Disk per node. The CIO is concerned about the cost of using that much block storage. You want to minimize the storage cost of the migration. What should you do?

Options:

A.

Put the data into Google Cloud Storage.

B.

Use preemptible virtual machines (VMs) for the Cloud Dataproc cluster.

C.

Tune the Cloud Dataproc cluster so that there is just enough disk for all data.

D.

Migrate some of the cold data into Google Cloud Storage, and keep only the hot data in Persistent Disk.

Buy Now
Exam Name: Google Professional Data Engineer Exam
Last Update: Nov 16, 2024
Questions: 330

PDF + Testing Engine

$56  $159.99

Testing Engine

$42  $119.99
buy now Professional-Data-Engineer testing engine

PDF (Q&A)

$35  $99.99
buy now Professional-Data-Engineer pdf
dumpsmate guaranteed to pass
24/7 Customer Support

DumpsMate's team of experts is always available to respond your queries on exam preparation. Get professional answers on any topic of the certification syllabus. Our experts will thoroughly satisfy you.

Site Secure

mcafee secure

TESTED 21 Nov 2024