I know this is a stupid question, but I feel like someone might want to know (or inform a <redacted> co-worker about) this. I am not attaching a specific programming language to this since I think it could apply to all of them. Correct me if I am wrong about that.
Main question: Is it faster and/or better to look for entries in a constant String
or in a List<String>
?
Details: Let's say that I want to see if a given extension is in a list of supported extensions. Which of the following is the best (in regards to programming style) and/or fastest:
const String SUPPORTED = ".exe.bin.sh.png.bmp" /*etc*/;
public static bool IsSupported(String ext){
ext = Normalize(ext);//put the extension in some expected state like lowercase
//String.Contains
return SUPPORTED.Contains(ext);
}
final static List<String> SUPPORTED = MakeAnImmutableList(".exe", ".bin", ".sh",
".png",".bmp" /*etc*/);
public static bool IsSupported(String ext){
ext = Normalize(ext);//put the extension in some expected state like lowercase
//List.Contains
return SUPPORTED.Contains(ext);
}
First, it is important to note that the solutions are not functionally equivalent. The substring search will return true for strings like x
and exe.bin
while the List<String>.contains()
will not. In that sense the List<String>
version is likely to be closer to the semantic you want. Any possible performance comparison should keep that in mind.
Now, on to performance.
From an asymptotic and algorithm complexity point of view, the List<String>.contains()
approach will be faster than the other alternative as the length of the strings grows. Conceptually, the String.contains
version needs to look for a match at each position in the SUPPORTED
String, while the List.contains()
version only needs to match starting at the start of each substring - as soon as finds a mismatch in the current candidate, it skips to the next. This is related to the above note that the options aren't functionally equivalent: the String.contains
options can in theory match a much wider universe of inputs, so it has to do more work before rejecting candidates.
Complexity-wise, this difference could be something like O(N)
for List.contains()
versus O(N^2)
for String.contains()
if you take N to be the number of candidates and assume each candidate has a bounded length, and to the String.contains()
method in the usual brute-force "look for a match starting at each position" algorithm. As it turns out, the Java String.contains()
implementation isn't exactly doing the basic O(N^2)
search, but it isn't doing Boyer-Moore either. In general you can expect that once the substrings get long enough, the List.String
approach will be faster.
At a closer to the metal perspective, both approaches have their advantages. The String.contains()
solution avoids the overhead of iterating over the List
elements: the entire call will be spent in the intrinsified String.contains
implementation, and all the char
s making up the entire SUPPORTED
Strings are contiguous, so this is memory-friendly. The List.contains()
approach will spend a lot of time doing the double-dereferencing needed to go from each List
element to the contained String
and then to the contained char[]
array, and this is likely to dominate if the strings you are comparing against are very short.
On other hand, the List.contains
solution ultimately calls into String.equals
which is likely implemented in terms of Arrays.equal(char[], char[])
which is heavily optimized with SSE and AVX intrinsics on x86 platforms and likely to be blazing fast, even compared to the optimized version of String.contains()
. So if the Strings become long, again expect List.contains()
to pull ahead.
All that said, there is a simple, canonical way to do this quickly: a HashSet<String>
with all the candidate strings. That's just a simple String.hash()
(which is cached and so often "free") and a single lookup in the hash table.