When handling large files in Azure Data Lake, it can be challenging to divide the data into sets ahead of the analysis stages, so it is useful to have control over how the data is partitioned. Although ADF provides some straightforward partitioning options, the method described here adds a unique id to each individual row and then groups the rows by that id to write each partitioned file. This gives us control over both the number of partition files and the number of records per file.
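Before walking through the ADF steps, here is a minimal sketch in plain Python (not ADF expression syntax) of the arithmetic the method relies on: give every row a sequential id and bucket the rows by id modulo the requested file count. The function name and the sample values are illustrative assumptions, not part of the ADF setup.

from collections import defaultdict

def partition_rows(rows, filecount):
    # Assign each row a 1-based id and a partition number in 1..filecount.
    buckets = defaultdict(list)
    for row_id, row in enumerate(rows, start=1):
        partition = filecount if row_id % filecount == 0 else row_id % filecount
        buckets[partition].append(row)
    return buckets

buckets = partition_rows([f"record-{i}" for i in range(1, 101)], 5)
print({p: len(group) for p, group in sorted(buckets.items())})
# {1: 20, 2: 20, 3: 20, 4: 20, 5: 20}  (5 files, 20 records each)

This is exactly the control the data flow gives us: the filecount parameter bounds the number of files, and the records are spread evenly across them.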
Scenario
To keep the basic idea easy to follow, I am using a sample file containing only 12 records as the source to be partitioned into multiple files.
Environment
ADL –> ADF –> Mapping Data Flow –> Source –> Add Surrogate Key –> Assign Index –> Define partition in Sink.
Steps
1. Add a Mapping Data Flow.
2. Add two parameters to the Data Flow: row_index and filecount.
3. Point the source transformation at the created source dataset.
4. Add a Surrogate Key transformation after the source (the key column is named surkey, which the expression below refers to).
5. Add a Derived Column transformation.
6. Create a column and name it partition.
7. Add the expression below, which maps each surrogate key to a partition number from 1 to $filecount (the sketch after these steps works through what it produces for the 12-record sample):
iif((surkey % $filecount) == 0, $filecount, surkey % $filecount)
8. Add a Sink after the Derived Column.
9. Create a Sink dataset and point the Sink to it.
10. Click Settings on the Sink and set the output configuration.
11. Click Optimize and configure the partitioning: select Key as the Partition type and set the Key column to the partition column.
12. Add this Data Flow to a pipeline.
13. Add a parameter to the pipeline.
14. Set the parameter values (example values are shown in the sketch after these steps).
15. Execute the pipeline.
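To make the derived-column expression and the parameter values concrete, the short Python sketch below evaluates the same iif logic for the 12 surrogate keys of the sample file, using filecount = 4 as an assumed example value for the $filecount parameter (the actual value is whatever you pass to the pipeline; the row_index parameter presumably seeds the surrogate key's starting value, and the sketch simply assumes keys starting at 1). The distinct values of the partition column are what the Key partitioning in the Sink groups into separate files.

# Plain-Python walk-through of the derived-column expression above.
filecount = 4                                  # assumed example value for $filecount
for surkey in range(1, 13):                    # surrogate keys 1..12 for the sample file
    partition = filecount if surkey % filecount == 0 else surkey % filecount
    print(surkey, "->", partition)
# partition takes the values 1, 2, 3, 4, 1, 2, 3, 4, ... so Key partitioning
# in the Sink writes 4 files with 3 records each.

Changing the filecount value changes both the number of output files and the number of records per file, which is the control mentioned at the start of this post.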