tl;dr; By a slim margin, the fastest way to check whether a filename matches one of a list of extensions is filename.endswith(extensions), where extensions is a tuple.
This turned out to be premature optimization. The context is that I want to check whether a filename ends with one of 6 file extensions. The list being ['.sym', '.dl_', '.ex_', '.pd_', '.dbg.gz', '.tar.bz2']. Meaning, it should return True for foo.sym or foo.dbg.gz. But it should return False for bar.exe or bar.gz.
I put together a little benchmark, ran it a bunch of times and looked at the results. Here are the functions I wrote:
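import re

extensions = ['.sym', '.dl_', '.ex_', '.pd_', '.dbg.gz', '.tar.bz2']
extensions_tuple = tuple(extensions)

def f1(filename):
    # Try each extension in turn.
    for each in extensions:
        if filename.endswith(each):
            return True
    return False

def f2(filename):
    # str.endswith accepts a tuple of suffixes.
    return filename.endswith(extensions_tuple)

# One pattern that matches any of the extensions at the end of the string.
regex = re.compile(r'({})$'.format(
    '|'.join(re.escape(x) for x in extensions)
))

def f3(filename):
    return bool(regex.findall(filename))

def f4(filename):
    return bool(regex.search(filename))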
The results are boring. But I guess that's a result too:
FUNCTION          MEDIAN     MEAN
f1  9543 times    0.0110ms   0.0116ms
f2  9523 times    0.0031ms   0.0034ms
f3  9560 times    0.0041ms   0.0045ms
f4  9509 times    0.0041ms   0.0043ms
For a list of ~40,000 realistic filenames (with result True 75% of the time), I ran each function 10 times. So, it means it took on average 0.0116ms to run f1 10 times here on my laptop with Python 3.6.
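The benchmark harness itself isn't shown here. As a rough sketch of how a comparison like this might be timed: the `benchmark` helper, the tiny made-up filename list and the use of `time.perf_counter` are all assumptions for illustration, not the code that produced the numbers above.

```python
import re
import statistics
import time

extensions = ['.sym', '.dl_', '.ex_', '.pd_', '.dbg.gz', '.tar.bz2']
extensions_tuple = tuple(extensions)
regex = re.compile(r'({})$'.format('|'.join(re.escape(x) for x in extensions)))


def f2(filename):
    return filename.endswith(extensions_tuple)


def f4(filename):
    return bool(regex.search(filename))


def benchmark(functions, filenames, repeat=10):
    """Hypothetical helper: time `repeat` calls per filename, in milliseconds."""
    timings = {func.__name__: [] for func in functions}
    for filename in filenames:
        for func in functions:
            start = time.perf_counter()
            for _ in range(repeat):
                func(filename)
            elapsed = time.perf_counter() - start
            timings[func.__name__].append(elapsed * 1000)
    # Report (median, mean) per function, like the table above.
    return {
        name: (statistics.median(times), statistics.mean(times))
        for name, times in timings.items()
    }


# Made-up sample data standing in for the realistic filenames.
filenames = ['foo.sym', 'foo.dbg.gz', 'bar.exe', 'bar.gz'] * 250
results = benchmark([f2, f4], filenames)
for name, (median, mean) in results.items():
    print('{}  median {:.4f}ms  mean {:.4f}ms'.format(name, median, mean))
```

Absolute numbers from a sketch like this will differ between machines and Python versions; only the relative ordering is worth reading into.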
More premature optimization
After looking into the data and thinking about how this will be used, I tried reordering the list of extensions so the most common one is first, the second most common second, etc. Then the performance improves a bit for f1 but slows down slightly for f3 and f4.
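One way to arrive at that ordering is to count how often each extension occurs in a sample of filenames. A minimal sketch, where the `order_by_frequency` helper and the sample list are made up for illustration:

```python
from collections import Counter

extensions = ['.sym', '.dl_', '.ex_', '.pd_', '.dbg.gz', '.tar.bz2']


def order_by_frequency(extensions, filenames):
    """Hypothetical helper: most frequently matched extension first."""
    counts = Counter()
    for filename in filenames:
        for ext in extensions:
            if filename.endswith(ext):
                counts[ext] += 1
                break
    # sorted() is stable, so extensions with equal counts keep their original order.
    return sorted(extensions, key=lambda ext: counts[ext], reverse=True)


# Made-up sample; real data would be the actual filenames being checked.
sample = ['a.sym', 'b.sym', 'c.sym', 'd.dbg.gz', 'e.dbg.gz', 'f.pd_', 'g.exe']
ordered = order_by_frequency(extensions, sample)
print(ordered)
```

This ordering only matters for a loop that returns on the first match, like f1.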
Conclusion
That .endswith(some_tuple) is neat and it's hair-splittingly faster. But really, this turned out to not make a huge difference in the grand scheme of things. On average it takes less than 0.001ms to do one filename match.
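One detail worth spelling out: str.endswith accepts a single string or a tuple of strings, so a list has to be converted first. A small illustration using the same extensions:

```python
extensions = ['.sym', '.dl_', '.ex_', '.pd_', '.dbg.gz', '.tar.bz2']

# A tuple of suffixes works.
print('foo.dbg.gz'.endswith(tuple(extensions)))
print('bar.gz'.endswith(tuple(extensions)))

# Passing the list directly raises TypeError.
list_rejected = False
try:
    'foo.sym'.endswith(extensions)
except TypeError as exc:
    list_rejected = True
    print('TypeError:', exc)
```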
Comments
Wow, nice! I didn't even know that `.startswith()/.endswith()` eat tuples!! 👍 Thanks!
But you didn't consider using `os.path.splitext()`? And then compare if in list?
What about lowercasing it before? To match accidentally upper cased extensions?
os.path.splitext will say the extension is .gz for both foo.tar.gz and foo.gz and I needed it to be more specific.
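To illustrate the difference (the filenames are just examples):

```python
import os.path

extensions = ('.sym', '.dl_', '.ex_', '.pd_', '.dbg.gz', '.tar.bz2')

for filename in ('foo.tar.gz', 'foo.gz', 'foo.dbg.gz', 'bar.gz'):
    root, ext = os.path.splitext(filename)
    # splitext only splits off the last suffix, so ext is '.gz' every time here,
    # while endswith can tell foo.dbg.gz apart from the others.
    print(filename, ext, filename.endswith(extensions))
```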
Lowercasing would be the same across the board.
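A sketch of what that could look like, assuming the extensions are stored in lowercase. The `matches` helper name is made up for illustration:

```python
extensions_tuple = ('.sym', '.dl_', '.ex_', '.pd_', '.dbg.gz', '.tar.bz2')


def matches(filename):
    """Hypothetical helper: case-insensitive suffix check."""
    # Lowercase once, then let endswith try every suffix in the tuple.
    return filename.lower().endswith(extensions_tuple)


print(matches('FOO.SYM'))
print(matches('Foo.Dbg.GZ'))
print(matches('BAR.EXE'))
```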
Yeah, that tuple trick on endswith is nice.
It helped me solve my problem! It also takes less code than I expected. Thanks!
Great solution. An extended problem is to process files ending in .xlsx, .xlsm, .xltm, .xltx when my list only has items like ('xls', 'xlt') or even (.xl). My thought is to do it in two steps: (1) use .endswith for the simple hits, then (2) take a second pass over my problem set, whatever the solution is.
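One possible approach to that extended problem, sketched under the assumption that matching the start of the final extension is what's wanted. The `is_excel_like` helper and the exact prefixes are illustrative only:

```python
import os.path

prefixes = ('.xls', '.xlt')


def is_excel_like(filename):
    """Hypothetical helper: does the last extension start with a known prefix?"""
    ext = os.path.splitext(filename)[1].lower()
    # startswith accepts a tuple, just like endswith.
    return ext.startswith(prefixes)


for name in ('report.xlsx', 'macro.xlsm', 'template.XLTM', 'notes.txt'):
    print(name, is_excel_like(name))
```

Since this works on the last extension only, it would not distinguish multi-part extensions like .dbg.gz; for those, the .endswith tuple check still applies.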