tl;dr: It's ciso8601.
I have a Python app that I'm working on. It has a cron job that downloads a listing of every single file in an S3 bucket. AWS S3 publishes an inventory manifest of .csv.gz files: you download the manifest, and for each hashhashash.csv.gz listed in it you download that file too. My program then reads these CSV files and ignores certain rows because they're beyond the retention period. It basically parses the ISO-formatted datetime string in each row, compares it with a cutoff datetime.datetime instance, and either skips the row quickly or lets it through for full processing.
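To make that concrete, here's a minimal sketch of the filtering step, using the straightforward strptime approach that the rest of this post tries to speed up. The file name, column index, and 90-day retention window are made up for illustration; the real values come from the app's config and the S3 inventory manifest.

import csv
import datetime
import gzip

RETENTION_DAYS = 90  # hypothetical retention window
cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=RETENTION_DAYS)

def rows_to_process(path):
    with gzip.open(path, 'rt') as f:
        for row in csv.reader(f):
            # Column 3 holds an ISO 8601 timestamp like '2017-09-21T12:54:24.000Z'
            last_modified = datetime.datetime.strptime(
                row[3], '%Y-%m-%dT%H:%M:%S.%fZ'
            )
            if last_modified < cutoff:
                continue  # older than the retention period, skip cheaply
            yield row  # recent enough, hand it over for full processing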
At the time of writing, it's roughly 160 .csv.gz files weighing a total of about 2GB. In total that's about 50 million rows of CSV, which means 50 million datetime parsings.
I admit, this cron job doesn't have to be super fast, and it's OK if it takes an hour since it's just a cron job running on a server in the cloud somewhere. But I would like to know: is there a way to speed up the date parsing? Doing it in Python 50 million times per day feels expensive.
Here's the benchmark:
import csv
import datetime
import random
import statistics
import time

import ciso8601


def f1(datestr):
    # Standard library strptime
    return datetime.datetime.strptime(datestr, '%Y-%m-%dT%H:%M:%S.%fZ')


def f2(datestr):
    # ciso8601's C implementation
    return ciso8601.parse_datetime(datestr)


def f3(datestr):
    # Manual slicing of the fixed-width ISO 8601 string
    return datetime.datetime(
        int(datestr[:4]),
        int(datestr[5:7]),
        int(datestr[8:10]),
        int(datestr[11:13]),
        int(datestr[14:16]),
        int(datestr[17:19]),
    )


# Assertions
assert f1(
    '2017-09-21T12:54:24.000Z'
).strftime('%Y%m%d%H%M') == f2(
    '2017-09-21T12:54:24.000Z'
).strftime('%Y%m%d%H%M') == f3(
    '2017-09-21T12:54:24.000Z'
).strftime('%Y%m%d%H%M') == '201709211254'


functions = f1, f2, f3
times = {f.__name__: [] for f in functions}

with open('046444ae07279c115edfc23ba1cd8a19.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        func = random.choice(functions)
        t0 = time.clock()
        func(row[3])
        t1 = time.clock()
        times[func.__name__].append((t1 - t0) * 1000)


def ms(number):
    return '{:.5f}ms'.format(number)


for name, numbers in times.items():
    print('FUNCTION:', name, 'Used', format(len(numbers), ','), 'times')
    print('\tBEST  ', ms(min(numbers)))
    print('\tMEDIAN', ms(statistics.median(numbers)))
    print('\tMEAN  ', ms(statistics.mean(numbers)))
    print('\tSTDEV ', ms(statistics.stdev(numbers)))
Yeah, it's a bit ugly but it works. Here's the output:
FUNCTION: f1 Used 111,475 times
    BEST   0.01300ms
    MEDIAN 0.01500ms
    MEAN   0.01685ms
    STDEV  0.00706ms
FUNCTION: f2 Used 111,764 times
    BEST   0.00100ms
    MEDIAN 0.00200ms
    MEAN   0.00197ms
    STDEV  0.00167ms
FUNCTION: f3 Used 111,362 times
    BEST   0.00300ms
    MEDIAN 0.00400ms
    MEAN   0.00409ms
    STDEV  0.00225ms
In summary:

f1: 0.01300 milliseconds
f2: 0.00100 milliseconds
f3: 0.00300 milliseconds
Or, if you compare each to the slowest (f1):

f1: baseline
f2: 13 times faster
f3: 6 times faster
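Scaled up to the whole job, those per-call numbers add up. Here's a rough back-of-envelope extrapolation (not a measurement), taking the mean times above and multiplying by the 50 million rows:

mean_ms = {'f1': 0.01685, 'f2': 0.00197, 'f3': 0.00409}
rows = 50 * 1000 * 1000
for name, per_call in sorted(mean_ms.items()):
    total_seconds = per_call * rows / 1000
    print('{}: ~{:.1f} minutes of pure parsing'.format(name, total_seconds / 60))

That works out to roughly 14 minutes of pure parsing per day with strptime, versus well under 2 minutes with ciso8601.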
UPDATE
If you know with confidence that you don't want or need timezone-aware datetime instances, you can use ciso8601.parse_datetime_unaware instead.
From the README:
"Please note that it takes more time to parse aware datetimes, especially if they're not in UTC. If you don't care about time zone information, use the parse_datetime_unaware method, which will discard any time zone information and is faster."
In my benchmark the strings I use look like this: 2017-09-21T12:54:24.000Z. I added another function to the benchmark that uses ciso8601.parse_datetime_unaware and it clocked in at exactly the same time as f2.
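For reference, the two calls side by side. This assumes the ciso8601 1.x API that was current when I wrote this; newer releases have dropped parse_datetime_unaware, so check your installed version:

import ciso8601

datestr = '2017-09-21T12:54:24.000Z'
ciso8601.parse_datetime(datestr)          # may come back timezone-aware because of the trailing 'Z'
ciso8601.parse_datetime_unaware(datestr)  # always naive; any time zone info is discarded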