

Computer Science Jul 23, 2022

Steps to do

This task involves interacting with the GCP console and a Dataproc cluster: I should log in to the master node and perform the following operations.

I have created a bucket, and that bucket contains multiple CSV files.

In the master VM instance we should do the following:

  1. Build a wrapper, i.e. a shell script.
  2. The shell script should call a Python file.
  3. That Python file should contain a DataFrame/spark-submit job that takes two input arguments: the CSV file name and the GCS location to read it from.
  4. After reading the CSV file, the DataFrame/spark-submit job should load it into a Hive table.
  5. Read the data from the Hive table and write it to an external table.
  6. Convert the data from the CSV table to Parquet format.
  7. Show the storage sizes of the CSV and Parquet tables.
  8. Run one aggregate query and show the time difference between CSV and Parquet.
  9. Run one join operation and show the difference between CSV and Parquet.
  10. Do the same for the Avro format.
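The steps above can be sketched as a single PySpark script. This is a minimal sketch, not a complete solution: the script name, the table names `sales_csv` and `sales_parquet`, and the `header` option are assumptions, since the original task does not name them. The wrapper shell script from steps 1–2 would simply run `spark-submit load_csv_to_hive.py <csv_file> <gcs_location>`.

```python
# Sketch of the Python file the shell wrapper would call via
#   spark-submit load_csv_to_hive.py <csv_file> <gcs_location>
# Table and file names below are hypothetical placeholders.
import time


def gcs_csv_path(gcs_location, csv_file):
    """Build the full GCS path of the input CSV from the two arguments."""
    return gcs_location.rstrip("/") + "/" + csv_file


def timed(fn):
    """Run fn and return (result, elapsed seconds), for the CSV-vs-Parquet timing."""
    start = time.time()
    result = fn()
    return result, time.time() - start


def main(argv):
    if len(argv) != 3:
        raise SystemExit("usage: spark-submit load_csv_to_hive.py <csv_file> <gcs_location>")
    csv_file, gcs_location = argv[1], argv[2]

    # pyspark is only available on the cluster, so import it inside main()
    from pyspark.sql import SparkSession
    spark = (SparkSession.builder
             .appName("csv-to-hive")
             .enableHiveSupport()
             .getOrCreate())

    # Steps 3-4: read the CSV from GCS and load it into a Hive table
    df = spark.read.option("header", "true").csv(gcs_csv_path(gcs_location, csv_file))
    df.write.mode("overwrite").saveAsTable("sales_csv")

    # Step 6: rewrite the same data as a Parquet table
    spark.table("sales_csv").write.mode("overwrite") \
        .format("parquet").saveAsTable("sales_parquet")

    # Step 7: DESCRIBE FORMATTED reports totalSize among the table details
    for t in ("sales_csv", "sales_parquet"):
        spark.sql(f"DESCRIBE FORMATTED {t}").show(truncate=False)

    # Step 8: the same aggregate query on each table, timed
    for t in ("sales_csv", "sales_parquet"):
        _, secs = timed(lambda: spark.sql(f"SELECT COUNT(*) FROM {t}").collect())
        print(t, "aggregate took", secs, "s")

# On the cluster, spark-submit would execute: main(sys.argv)
```

The join comparison (step 9) and the Avro run (step 10) follow the same pattern: repeat the `saveAsTable` call with `.format("avro")` and time a join against each table.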