Overall Plan
Reading Raw Data and converting it to a readable/formatted form
- Azure Data Lake
- Azure Data Factory
- Python
- Jupiter Notebook
The purpose of this process is to enable High Level users to work with data in an integrated data friendly environment. As mentioned above, in Azure environment we are allocating a sandbox space to do all the necessary operations.
For this example, we will work on a converting a Parquet
File to a CSV.
Step 1
To start things with, we need to create two containers under
your preferred storage area in Azure portal
raw – To land all the
data in raw form
sandbox – Where the user is going to do all the operations with data
Step 2
Upload the parquet file, that you want to convert to csv into your raw container.
Step 3
Change the
access level of the parquet file to Container.
Step 4
Copy the
path of the uploaded .parquet file
Step 5
Open Jupyter
Notebook in your web browser and select Python.
Step 6
Type the below python script to read the uploaded file
inside Jupyter Notebook.
|
import
pandas as pd pd.read_parquet(r' <paste the copied path in Step 4 here>') |
Step 7
Complete transformation of a parquet file to a readable format in Azure
Data Lake
Step 1
Create a pipeline – drag and drop a copy Operation.
Step 2
Select your parquet file that you uploaded to raw as
the Source dataset.
Step 3
Select your csv (DelimeterdText) as the Sink Dataset
and point it out to your Sandbox container.
Note:
When creating this text based (csv) dataset please select
the “First row as header option”.
Up to this point, you have selected your parquet file
(source) from Raw environment and you are about to convert it to text format
and store in sandbox for further analysis.
Step 4
Now you can run the pipeline to start the conversion.
Step 5
Once the process is completed, you can fine the converted
file in form of .txt in your sandbox.
Step 6
Download your file from sandbox
Note:
Along with the above described data conversion method, the
same purpose can be achieved in your local environment using Python and Jupyter
Notebook.
Connecting PowerBI with the source in Sandbox
Step 1
Open and sign into PowerBI
Step 2
Get a new data source. Get data -> More..->Blob
Storage
Type your storage account name as Account name or URL
Step 4
You should see your containers on the left side and the data
sources on the right side of the screen.
Under Fields you will find the
sandbox as below:
Step 6
We have to transform the sandbox data into a understandable format.
Click Data -> Home ->
Transform Data -> Transform
Click Binary next to the
source file you want to analyze
Step 8
Step 9
Now you can visualize and work
with the formatted data in PowerBI.