When handling large files in Azure Data Lake, it can be challenging to divide the data into sets ahead of the analysis stages, so it is useful to have control over how the data is partitioned. Although ADF provides some straightforward partitioning options, the method described here adds a unique id to each individual row and then groups the rows by that id to write each partitioned file. This gives us control over both the number of partition files and the number of records per file.
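Before walking through the ADF steps, here is a minimal sketch in plain Python (not ADF expression syntax) of the arithmetic the method relies on: give every row a sequential id and bucket the rows by id modulo the requested file count. The function name and the sample values are illustrative assumptions, not part of the ADF setup.

from collections import defaultdict

def partition_rows(rows, filecount):
    # Assign each row a 1-based id and a partition number in 1..filecount.
    buckets = defaultdict(list)
    for row_id, row in enumerate(rows, start=1):
        partition = filecount if row_id % filecount == 0 else row_id % filecount
        buckets[partition].append(row)
    return buckets

buckets = partition_rows([f"record-{i}" for i in range(1, 101)], 5)
print({p: len(group) for p, group in sorted(buckets.items())})
# {1: 20, 2: 20, 3: 20, 4: 20, 5: 20}  (5 files, 20 records each)

This is exactly the control the data flow gives us: the filecount parameter bounds the number of files, and the records are spread evenly across them.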
Scenario
To keep the basic idea easy to follow, I am using a sample file containing only 12 records as the source to be partitioned into multiple files.
Environment
ADL –> ADF –> Mapping Data Flow –> Source –> Add Surrogate Key –> Assign Index –> Define partition in Sink.
Steps
1. Add a Mapping Data Flow.
2. Add two parameters to the Data Flow: row_index and filecount.
3. Point the source transformation at the created source dataset.
4. Add a Surrogate Key transformation after the source (the key column is named surkey, which the expression below refers to).
5. Add a Derived Column transformation.
6. Create a column and name it partition.
7. Add the expression below, which maps each surrogate key to a partition number from 1 to $filecount (the sketch after these steps works through what it produces for the 12-record sample):
iif((surkey % $filecount) == 0, $filecount, surkey % $filecount)
8. Add a Sink after the Derived Column.
9. Create a Sink dataset and point the Sink to it.
10. Click Settings on the Sink and set the output configuration.
11. Click Optimize and configure the partitioning: select Key as the Partition type and set the Key column to the partition column.
12. Add this Data Flow to a pipeline.
13. Add a parameter to the pipeline.
14. Set the parameter values (example values are shown in the sketch after these steps).
15. Execute the pipeline.
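To make the derived-column expression and the parameter values concrete, the short Python sketch below evaluates the same iif logic for the 12 surrogate keys of the sample file, using filecount = 4 as an assumed example value for the $filecount parameter (the actual value is whatever you pass to the pipeline; the row_index parameter presumably seeds the surrogate key's starting value, and the sketch simply assumes keys starting at 1). The distinct values of the partition column are what the Key partitioning in the Sink groups into separate files.

# Plain-Python walk-through of the derived-column expression above.
filecount = 4                                  # assumed example value for $filecount
for surkey in range(1, 13):                    # surrogate keys 1..12 for the sample file
    partition = filecount if surkey % filecount == 0 else surkey % filecount
    print(surkey, "->", partition)
# partition takes the values 1, 2, 3, 4, 1, 2, 3, 4, ... so Key partitioning
# in the Sink writes 4 files with 3 records each.

Changing the filecount value changes both the number of output files and the number of records per file, which is the control mentioned at the start of this post.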