We have a script that validates each line of a CSV file using regular expressions. Each line has 4 columns with a semicolon as the separator.
The code snippet that builds the regular expression for matching a CSV line is as follows:
strPattern = ""
strPattern &= "([^\;]{1,64})\;"
strPattern &= "([^\;]{0,64})\;"
strPattern &= "([^\;]{0,64})\;"
strPattern &= "([^\;]{0,64})"
The quantifier {0,64} above matches the preceding element at least 0 times, but not more than 64 times.
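For context, here is a minimal sketch of how such a pattern might be applied to a single line. It assumes the script is VB.NET (the &= operator suggests it) using System.Text.RegularExpressions; the anchors, the sample line, and the name CsvMatchDemo are added purely for illustration:

Imports System.Text.RegularExpressions

Module CsvMatchDemo
    Sub Main()
        ' Build the same 4-column pattern, anchored to the whole line for illustration
        Dim strPattern As String = "^"
        strPattern &= "([^\;]{1,64})\;"
        strPattern &= "([^\;]{0,64})\;"
        strPattern &= "([^\;]{0,64})\;"
        strPattern &= "([^\;]{0,64})$"

        Dim strLine As String = "alpha;beta;gamma;delta"
        Dim m As Match = Regex.Match(strLine, strPattern)
        If m.Success Then
            ' Groups(1)..Groups(4) hold the four column values
            Console.WriteLine(m.Groups(4).Value)   ' prints "delta"
        End If
    End Sub
End Module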
Now there is a requirement to increase the maximum value in the quantifier from {0,64} to {0,1256}. But then again, there is a chance that the number of characters in a column might exceed the maximum value of 1256.
So I was thinking of dropping the maximum value from the quantifier altogether, since we cannot predict how many characters a column might contain.
After excluding the maximum value, the script now looks like this:
strPattern = ""
strPattern &= "([^\;]{1,})\;"
strPattern &= "([^\;]{0,})\;"
strPattern &= "([^\;]{0,})\;"
strPattern &= "([^\;]{0,})"
The quantifier {0,} matches the preceding element zero or more times.
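As an aside, {1,} and {0,} are just the longhand forms of + and *, so the same unbounded pattern could equivalently be written as:

strPattern = ""
strPattern &= "([^\;]+)\;"
strPattern &= "([^\;]*)\;"
strPattern &= "([^\;]*)\;"
strPattern &= "([^\;]*)"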
I would like to know whether removing the maximum value from the quantifier would cause performance issues. A single CSV file might contain anywhere between 1,000 and 50,000 records, so I want to know if removing the maximum value will cause a substantial performance lag while processing thousands of CSV lines.
I don't have the required test data to see if this would result in performance issues.
So it would be great to hear from anyone who has experience using quantifiers without a maximum value.
If you are restricted from changing the way the CSV is parsed, it would seem you don't have much of a choice but to remove the maximum limit. The question to ask is: if a column were over 1256 characters, would you still want to match it? It looks like the answer is yes, you would.
So, that being said, the only option is to remove the limit. And now for my educated guess as to how performance will be impacted:
It probably won't be a problem.
You can always mock some data and test. You can always throw the question back to the business owner: "Will I need to support 4 quadrillion characters on a line?" If it goes to 10,000 characters wide, that should be fine; it's not all that big. If the regex is merely capturing the data, it will just read the line and hold it. No big deal. If you were doing negative lookarounds or nested quantifiers I would be more concerned; a negated class like [^\;]* can never match the delimiter itself, so there is no ambiguity for the engine to backtrack over, and matching stays roughly linear in the line length.
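If you do want numbers, here is a rough sketch of how you might mock data and time it, again assuming VB.NET. The name RegexBenchmark, the 2,000-character columns, and the 50,000 iterations are all made-up values; adjust them to your real worst case:

Imports System.Diagnostics
Imports System.Text.RegularExpressions

Module RegexBenchmark
    Sub Main()
        ' Unbounded 4-column pattern, anchored and pre-compiled
        Dim reCsv As New Regex("^([^\;]+)\;([^\;]*)\;([^\;]*)\;([^\;]*)$", RegexOptions.Compiled)

        ' Mock a worst-case line: four columns of 2,000 characters each
        Dim strCol As New String("x"c, 2000)
        Dim strLine As String = String.Join(";", strCol, strCol, strCol, strCol)

        ' Time 50,000 matches, the upper end of the stated file size
        Dim watch As Stopwatch = Stopwatch.StartNew()
        Dim matched As Integer = 0
        For i As Integer = 1 To 50000
            If reCsv.IsMatch(strLine) Then matched += 1
        Next
        watch.Stop()
        Console.WriteLine(matched & " matches in " & watch.ElapsedMilliseconds & " ms")
    End Sub
End Module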
I work with files with millions of rows, and performance is relative, right? Does a few seconds matter? A few minutes? Microseconds? If an 8-hour process adds 5 seconds, who cares? If a process that takes 1 second goes to 2 seconds but runs 80,000 times, then maybe it matters. :) Good luck!