What I have is an OCR'd Walmart receipt in a Google Document (Walmart allows you to email to yourself a .jpg version of your receipt, and this image can be opened with Google Docs, during which it applies OCR to extract text. The result is excellent with very few errors.)
Here is a link to the actual receipt from the OCR: https://docs.google.com/document/d/1zSV09UGajna4DPtrHUrB6F82NugpYjaftMjomoKVXpE/edit?usp=sharing
I have OCR'd hundreds of Walmart receipts using Google Docs. The OCR'd document suffers from some formatting issues, so I have written some code to help regularize it, as a part of a larger goal to get all of my receipts into a database.
While I am able to solve many of the formatting replacements, I am stuck trying to replace the percent sign that comes after "TAX 1" and "TAX 2" with a tab character, so that I can then push down the "TAX 2" entry to a new line. I noticed that the %
sign will always be followed by a newline character and then the actual numerical value of the tax (for both "TAX 1" and "TAX 2") on the next line:
Example OCR Text:
SUBTOTAL 126.61 TAX 1 6.750 %
7.78 TAX 2 2.000 %
0.23 TOTAL 134.62
Desired Output Text:
SUBTOTAL 126.61
TAX 1 6.750 % 7.78
TAX 2 2.000 % 0.23
TOTAL 134.62
Objective:
I can do this using the CTRL-F "Find and Replace" menu in the Google Docs UI, using regex options without any problem, but I cant write an Apps Script function to do the same. I have searched everywhere. I realize that the RegEx in GAS is limited. But I don't know enough to know if that is my problem AND what a workaround could be. Likewise, I don't know enough of RegEx to know if the limited version GAS supports is not the cause of my problem (e.g. am I overlooking something).
Here's the code excerpt I use for formatting:
var body = DocumentApp.getActiveDocument().getBody();
/**
* other formatting stuff
*/
//Find SUBTOTAL, remove the space before SUBTOTAL and move it down one line.
body.replaceText(' SUBTOTAL', '\n\nSUBTOTAL\t');
//Find TAX 1, remove the space before TAX 1 and move it down one line.
body.replaceText(' TAX 1', '\nTAX 1\t');
//Find TAX 2, remove the space before TAX 2 and move it down one line.
body.replaceText(' TAX 2', '\nTAX 2\t');
//Find TOTAL, replace it.
body.replaceText('TOTAL', '\nTOTAL\t');
//Find PERCENT SIGN AND ADD A NEWLINE AFTER IT, replace it all with a tab character.
body.replaceText("[ %\n]","\t");
The first 4 replaces work great. Its the last one (the percent sign) that doesnt work. I've tried to escape that percent sign like this:
body.replaceText("[ \%\n]","\t");
and
body.replaceText("[ \\%\n]","\t");
I've tried to remove the braces like this:
body.replaceText(" \%\n","\t");
and
body.replaceText(" \\%\n","\t");
But each gives different results, frankly - messing up the entire receipt text badly.
So the percent sign is the problem - I think.
How can I fix the formatting for the "TAX 1" and "TAX 2" lines?
Example fulltext OCR'd receipt: https://docs.google.com/document/d/1zSV09UGajna4DPtrHUrB6F82NugpYjaftMjomoKVXpE/edit?usp=sharing
You want to replace the text of shared Document to the values you want. The values you want can be retrieved by the script in my comment. If my understanding is correct, how about this?
In the sample script in my comment, the whole text is retrieved and replace to the values you want. When this situation is reflected to the Document, how about this sample script? When you use this script, please do as follows.
function myFunction() {
var body = DocumentApp.getActiveDocument().getBody();
var newText = body.getText()
.match(/(SUBTOTAL[\s\S]+?)VISA/)[1]
.replace(/TAX/g, "\nTAX")
.replace(/ TOTAL/g, "\nTOTAL\t")
.replace(/%\n/g, "%\t");
body.clear();
body.setText(newText);
}
If this was not what you want, I'm sorry.