Thursday, August 20, 2020

Process for Advanced Users to use Raw Data

Overall Plan

Reading Raw Data and converting it to a readable/formatted form

 Tools and Technologies:

  •     Azure Data Lake
  •     Azure Data Factory
  •     Python
  •     Jupyter Notebook

The purpose of this process is to enable advanced users to work with data in an integrated, data-friendly environment. As mentioned above, we allocate a sandbox space in the Azure environment for all the necessary operations.

For this example, we will work on converting a Parquet file to a CSV.

Step 1

To start, create two containers under your preferred storage account in the Azure portal:

raw – to land all the data in its raw form

sandbox – where the user performs all operations on the data

Step 2

Upload the parquet file that you want to convert to CSV into your raw container.

Step 3

Change the public access level of the container holding the parquet file to Container (anonymous read access), so the file can be read by URL.


Step 4

Copy the URL of the uploaded .parquet file.


Step 5

Open Jupyter Notebook in your web browser and create a new Python notebook.

Step 6

Type the following Python script inside Jupyter Notebook to read the uploaded file:

import pandas as pd

# Requires the pyarrow or fastparquet package to read parquet files
df = pd.read_parquet(r'<paste the copied path from Step 4 here>')
df.head()

 

Step 7

You should see the file's contents displayed as a pandas DataFrame.

 

Complete transformation of a parquet file to a readable format in Azure Data Lake

Step 1

Create a pipeline – drag and drop a Copy data activity.

Step 2

Select your parquet file that you uploaded to raw as the Source dataset.


Step 3

Select your CSV (DelimitedText) dataset as the Sink dataset and point it to your sandbox container.

Note:

When creating this text-based (CSV) dataset, please select the “First row as header” option.

Up to this point, you have selected your parquet file (source) from the raw environment, and you are about to convert it to text format and store it in sandbox for further analysis.
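Behind the pipeline designer, this conversion boils down to a single Copy activity with a parquet source and a delimited-text sink. The following is only a sketch of the JSON that Data Factory generates; the activity and dataset names ("ConvertParquetToCsv", "raw_parquet", "sandbox_csv") are placeholders, and the exact JSON in your factory will differ:

```json
{
  "name": "ConvertParquetToCsv",
  "type": "Copy",
  "inputs":  [ { "referenceName": "raw_parquet",  "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "sandbox_csv", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": { "type": "ParquetSource" },
    "sink":   { "type": "DelimitedTextSink" }
  }
}
```

You can inspect the real JSON for your own pipeline via the code view ({} button) in the Data Factory authoring UI.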

Step 4

Now you can run the pipeline to start the conversion.

Step 5

Once the process is completed, you can find the converted file, in .txt form, in your sandbox container.

Step 6

Download your file from the sandbox container.

Note:

In addition to the data-conversion method described above, the same result can be achieved in your local environment using Python and Jupyter Notebook.


Connecting Power BI with the source in Sandbox

Step 1

Open Power BI and sign in.

Step 2

Add a new data source: Get data -> More... -> Azure Blob Storage.

Step 3

Enter your storage account name as the Account name or URL.


Step 4

You should see your containers on the left side and the data sources on the right side of the screen.

Since we need to use the sources in sandbox, select sandbox and click Load.

Step 5

Under Fields, you will find the sandbox data source.

Step 6

We have to transform the sandbox data into an understandable format.

Click Data -> Home -> Transform Data -> Transform

 

Step 7

Click Binary next to the source file you want to analyze.

Step 8

Click Close & Apply

Step 9

Now you can visualize and work with the formatted data in Power BI.


Azure Data Factory – Automated Email Notifications using Logic Apps

Logic Apps allow you to easily create a workflow in the cloud without having to write much code. Since ADF has no built-in mechanism to send...