We have several large CSV files in Azure Data Lake Store that were created using the Append method of the .NET API. Recently, we switched to ConcurrentAppend for performance reasons. Since ConcurrentAppend and Append cannot be used interchangeably on the same file, the switch required us to create a new folder structure, so that ConcurrentAppend would never touch a file created using Append.
However, our downstream application needs to load all data, both from before and after the switch. Instead of changing our application, we wanted to join the files using the Join-AzureRmDataLakeStoreItem cmdlet from the PowerShell SDK, but the documentation does not specify whether files joined this way can be written to by ConcurrentAppend after the join. I suspect that we will face issues, since we would be joining files created by both methods (maybe the join is not even possible).
So my questions are as follows:
Cost is a concern, which is why we prefer to use the PowerShell cmdlet if possible, and would like to avoid the last option.
At present, no append operations can be executed on a file after the join operation. We are working on a feature to remove this limitation, but for now appends to a concatenated file will not work.