Search code examples
regexcsvgoogle-apps-scripttimeout

Google Apps Script: REGEX to fix malformed pipe delimited csv file runs too slowly


I have a Google Apps Script that processes this "csv" file daily. The file is getting larger and it is starting to time out. The pipe delimited "csv" file includes new line and next line in the comments fields in some records. This causes those records to break before the true end of record. The following code removes the extraneous new line and next line when they are in the middle of a record and formats the data in a useful csv format. Is there a more efficient way to write this code?

Here's the snippet:

function cleanCSV(csvFileId){
//The file we receive has line breaks in the middle of the records, this removes the line breaks and converts the file to a csv.
        var content = DriveApp.getFileById(csvFileId).getBlob().getDataAsString();
        var identifyNewLine = content.replace(/\r\n\d{1,5}\|/g,"~~$&"); //This marks the beginning of a new record with double tildes before we can remove all the line breaks.
        var noReturnsContent = identifyNewLine.replace(/\r\n/g, ""); //Removes Returns
        var newContent = noReturnsContent.replace(/~~/g,"\r\n"); //returns one record per client
        var noEndQuote = newContent.replace(/'\|/g,"|"); // removes trailing single quote
        var csvContent = noEndQuote.replace(/\|'/g,"|"); // removes leading single quote
        //Logger.log(csvContent);
        var sheetId = DriveApp.getFolderById(csvFolderId).createFile(csvFileName, csvContent, MimeType.CSV).getId();
        return sheetId;
}

Here is a sample of the file:


Solution

  • The first three replace lines can be merged into one, you just want to remove all \r\n occurrences that are not followed with 1 to 5 digits and a |, .replace(/\r\n(?!\d{1,5}\|)/g,"").

    The last two replace lines can also be merged into one if you use alternaton, .replace(/'\||\|'/g,"|").

    Use

    function cleanCSV(csvFileId){
    //The file we receive has line breaks in the middle of the records, this removes the line breaks and converts the file to a csv.
        var content = DriveApp.getFileById(csvFileId).getBlob().getDataAsString();
        var newContent = content.replace(/\r\n(?!\d{1,5}\|)/g,""); // remove line endings not followed with 1-5 digits and |
    var csvContent = newContent.replace(/'\||\|'/g,"|"); // removes trailing/leading single quote
        //Logger.log(csvContent);
        var sheetId = DriveApp.getFolderById(csvFolderId).createFile(csvFileName, csvContent, MimeType.CSV).getId();
        return sheetId;
    }