
Fancy linkifying of text with Bleach and domain checks (with Python)

October 10, 2018
2 comments Python, Web development

Bleach is awesome. Thank you for it @willkg! It's a Python library for sanitizing text as well as "linkifying" text for HTML use. For example, consider this:

>>> import bleach
>>> bleach.linkify("Here is some text with a url.com.")
'Here is some text with a <a href="http://url.com" rel="nofollow">url.com</a>.'

Note that sanitizing is a separate thing, but if you're curious, consider this example:

>>> bleach.linkify(bleach.clean("Here is <script> some text with a url.com."))
'Here is &lt;script&gt; some text with a <a href="http://url.com" rel="nofollow">url.com</a>.'

With that output you can confidently interpolate that string straight into your HTML template.
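
For example, in a Django app you might do something like this before handing the string to a template (a minimal sketch; render_comment is a hypothetical helper, not part of Bleach):

import bleach
from django.utils.safestring import mark_safe


def render_comment(comment_text):
    # Strip/escape dangerous markup first, then turn URL-looking things into links.
    cleaned = bleach.clean(comment_text)
    linkified = bleach.linkify(cleaned)
    # The result is already escaped, so tell Django's template engine
    # not to escape it again.
    return mark_safe(linkified)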

Getting fancy

That's a great start but I wanted more. For one, I don't always want the rel="nofollow" attribute on all links, in particular for links that are within the site. Secondly, a lot of things look like a domain but aren't. For example, "This is a text.at the start" would naively become...:

>>> bleach.linkify("This is a text.at the start")
'This is a <a href="http://text.at" rel="nofollow">text.at</a> the start'

...because text.at looks like a domain.

So here is how I use it here on www.peterbe.com to linkify blog comments:


from urllib.parse import urlparse

import requests
from django.conf import settings
from requests.exceptions import ConnectionError


def custom_nofollow_maker(attrs, new=False):
    href_key = (None, u"href")

    if href_key not in attrs:
        return attrs

    if attrs[href_key].startswith(u"mailto:"):
        return attrs

    p = urlparse(attrs[href_key])
    if p.netloc not in settings.NOFOLLOW_EXCEPTIONS:
        # Before we add the `rel="nofollow"` let's first check that this is a
        # valid domain at all.
        root_url = p.scheme + "://" + p.netloc
        try:
            response = requests.head(root_url)
            if response.status_code == 301:
                redirect_p = urlparse(response.headers["location"])
                # If the only difference is that it redirects to https instead
                # of http, then amend the href.
                if (
                    redirect_p.scheme == "https"
                    and p.scheme == "http"
                    and p.netloc == redirect_p.netloc
                ):
                    attrs[href_key] = attrs[href_key].replace("http://", "https://")

        except ConnectionError:
            return None

        rel_key = (None, u"rel")
        rel_values = [val for val in attrs.get(rel_key, "").split(" ") if val]
        if "nofollow" not in [rel_val.lower() for rel_val in rel_values]:
            rel_values.append("nofollow")
        attrs[rel_key] = " ".join(rel_values)

    return attrs

html = bleach.linkify(text, callbacks=[custom_nofollow_maker])

This basically takes the default nofollow callback and extends it a bit.

By the way, here is the complete code I use for sanitizing and linkifying blog comments here on this site: render_comment_text.

Caveats

This is slow because it requires network IO every time a piece of text needs to be linkified (if it has domain-looking things in it), but that's best alleviated by only doing it once and either caching the result or persistently storing the cleaned and rendered output.

Also, the check uses try: requests.head() except requests.exceptions.ConnectionError: as the method to see if the domain works. I considered doing a whois lookup or something, but that felt a little wrong because just because a domain exists doesn't mean there's a website there. Either way, it could be that the domain/URL is perfectly fine but in that very unlucky instant you checked, your own server's internet or some other DNS lookup thing is busted. Perhaps wrapping it in a retry and doing try: requests.head() except requests.exceptions.RetryError: instead would be more robust.
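
If you wanted to go down that route, a retrying HEAD check could look roughly like this (just a sketch; the retry parameters are arbitrary and not what I actually use):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def head_with_retries(url, retries=3):
    session = requests.Session()
    retry = Retry(total=retries, backoff_factor=0.5, status_forcelist=[502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    try:
        return session.head(url)
    except (requests.exceptions.ConnectionError, requests.exceptions.RetryError):
        # Still failing after retries; treat the domain as unreachable.
        return None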

Lastly, the business logic I chose was to rewrite all http:// to https:// only if the URL http://domain does a 301 redirect to https://domain. So if the original link was http://bit.ly/redirect-slug it leaves it as is. Perhaps a fancier version would be to look at the domain name ending. For example HEAD http://google.com 301 redirects to https://www.google.com so you could use the fact that "www.google.com".endswith("google.com").
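
That fancier version could look something like this (just a sketch of the idea; the helper name is mine and I haven't adopted it):

from urllib.parse import urlparse


def acceptable_redirect(original_url, redirected_url):
    """Is this 301 a plain https upgrade, possibly adding a www. style prefix?"""
    p = urlparse(original_url)
    redirect_p = urlparse(redirected_url)
    same_domain = redirect_p.netloc == p.netloc or redirect_p.netloc.endswith(
        "." + p.netloc
    )
    # e.g. http://google.com -> https://www.google.com passes, but
    # http://bit.ly/redirect-slug -> https://somewhere-else.example.com does not.
    return redirect_p.scheme == "https" and same_domain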

UPDATE Oct 10 2018

Moments after publishing this, I discovered a bug where it would fail badly if the text contained a URL with an ampersand in it. Turns out, it was a known bug in Bleach. It only happens when you try to pass a filter to the bleach.Cleaner() class.

So I simplified my code and now things work. Apparently, using bleach.Cleaner(filters=[...]) is faster so I'm losing that. But, for now, that's OK in my context.

Also, in another later fix, I improved the function some more by avoiding non-HTTP links (with the exception of mailto: and tel:). Otherwise it would attempt to run requests.head('ssh://server.example.com') which doesn't make sense.
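
The gist of that guard is something like this (a simplified sketch; the helper name is mine, not from the actual code):

from urllib.parse import urlparse


def is_head_checkable(href):
    """Should we bother doing requests.head() on this link at all?"""
    if href.startswith(("mailto:", "tel:")):
        return False
    scheme = urlparse(href).scheme
    # Only plain http(s) (or scheme-less) links get the HEAD check;
    # things like ssh://server.example.com are left alone.
    return scheme in ("", "http", "https")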

The ideal number of workers in Jest

October 8, 2018
0 comments Python, React

tl;dr; Use --runInBand when running jest in CI and use --maxWorkers=3 on your laptop.

Running out of memory on CircleCI

We have a test suite that covers 236 tests across 68 suites and mainly runs a bunch of enzyme renderings of React components but also some plain old JavaScript function tests. We hit a problem where the tests utterly failed in CircleCI due to running out of memory. Several individual tests reported taking up to 45 seconds before the suite gave up or failed.
Turns out, jest tried to use 36 workers because the Docker OS it was running on was reporting 36 CPUs.


circleci@9e4c489cf76b:~/repo$ node
> var os = require('os')
undefined
> os.cpus().length
36

After forcibly setting --maxWorkers=2 to the jest command, the tests passed and it took 20 seconds. Yay!

But that got me thinking, what is the ideal number of workers when I'm running the suite here on my laptop? To find out, I wrote a Python script that would wrap the call CI=true yarn run test --maxWorkers=%(WORKERS) repeatedly and report which number is ideal for my laptop.
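
The script was essentially a loop like this (a reconstruction, not the exact script; the command string, the 1-8 range, and the single run per worker count are assumptions):

import subprocess
import time


def run_suite(workers):
    t0 = time.time()
    subprocess.run(
        "CI=true yarn run test --maxWorkers={}".format(workers),
        shell=True,
        check=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.time() - t0


results = {workers: run_suite(workers) for workers in range(1, 9)}
print("SORTED BY BEST TIME:")
for workers, seconds in sorted(results.items(), key=lambda pair: pair[1]):
    print(workers, "{:.2f}s".format(seconds))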

After leaving it running for a while it spits out this result:

SORTED BY BEST TIME:
3 8.47s
4 8.59s
6 9.12s
5 9.18s
2 9.51s
7 10.14s
8 10.59s
1 13.80s

The conclusion is vague. There is some benefit to using a small number greater than 1. If you attempt a bigger number it might backfire and take longer than necessary, and on top of that your laptop is likely to crawl and cough.

Notes and conclusions

django-pipeline and Zopfli

August 15, 2018
0 comments Python, Web development, Django

tl;dr; I wrote my own extension to django-pipeline that uses Zopfli to create .gz files from static assets collected in Django. Here's the code.

Nginx and Gzip

What I wanted was to continue to use django-pipeline which does a great job of reading a settings.BUNDLES setting and generating things like /static/js/myapp.min.a206ec6bd8c7.js. It has configurable options to not just make those files but also generate /static/js/myapp.min.a206ec6bd8c7.js.gz, which means that with gzip_static in Nginx, Nginx doesn't have to Gzip compress static files on-the-fly but can basically just read them from disk. Nginx doesn't care how the file got there, but an immediate advantage of preparing the file on disk is that the compression can be higher (smaller .gz files). That means smaller responses to be sent to the client and less CPU work needed from Nginx. Your job is to set gzip_static on; in your Nginx config (per location) and make sure every compressible file exists on disk with the same name but with the .gz suffix.

In other words, when the client does GET https://example.com/static/foo.js Nginx quickly does a read on the file system to see if there exists a ROOT/static/foo.js.gz and, if so, returns that. If the file doesn't exist, and you have gzip on; in your config, Nginx will read the ROOT/static/foo.js into memory, compress it (usually with a lower compression level) and return that. Nginx takes care of figuring out whether to do this, at all, dynamically by reading the Accept-Encoding header from the request.

Zopfli

The best solution today to generate these .gz files is Zopfli. Zopfli is slower than good old regular gzip but the files get smaller. To manually compress a file you can install the zopfli executable (e.g. brew install zopfli or apt install zopfli) and then run zopfli $ROOT/static/foo.js which creates a $ROOT/static/foo.js.gz file.

So your task is to build some pipelining code that generates a .gz version of every static file your Django server creates.
At first I tried django-static-compress which has an extension to regular Django staticfiles storage. The default staticfiles storage is django.contrib.staticfiles.storage.StaticFilesStorage and that's what django-static-compress extends.

But I wanted more. I wanted all the good bits from django-pipeline (minification, hashes in filenames, concatenation, etc.). Also, in django-static-compress you can't control the parameters to zopfli, such as the number of iterations. And with django-static-compress you have to install Brotli, which I can't use because I don't want to compile my own Nginx.

Solution

So I wrote my own little mashup. I took some ideas from how django-pipeline does regular gzip compression as a post-process step. And in my case, I never want to bother with any of the other files that are put into the settings.STATIC_ROOT directory from the collectstatic command.

Here's my implementation: peterbecom.storage.ZopfliPipelineCachedStorage. Check it out. It's very tailored to my personal preferences and usecase but it works great. To use it, I have this in my settings.py: STATICFILES_STORAGE = "peterbecom.storage.ZopfliPipelineCachedStorage"
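
Conceptually, the post-processing boils down to walking the collected files and shelling out to the zopfli executable (a simplified sketch; the real storage class hooks into django-pipeline instead of walking the directory like this):

import os
import subprocess


def zopfli_compress_static_root(static_root, extensions=(".js", ".css", ".svg", ".txt")):
    """Create a .gz sibling next to every compressible file under static_root."""
    for dirpath, _, filenames in os.walk(static_root):
        for filename in filenames:
            if not filename.endswith(extensions):
                continue
            path = os.path.join(dirpath, filename)
            # zopfli writes path + ".gz" next to the original file.
            # (Iterations and other parameters can be tuned with zopfli's flags.)
            subprocess.run(["zopfli", path], check=True)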

I know what you're thinking

Why not try to get this into django-pipeline or into django-static-compress? The answer is frankly laziness. Hopefully someone else can pick up this task. I have fewer and fewer projects where I use Django to handle static files. These days most of my projects are single-page-apps that are 100% static and use Django for XHR requests to get the data.

Django lock decorator with django-redis

August 14, 2018
4 comments Python, Web development, Django, Redis

Here's the code. It's quick-n-dirty but it works wonderfully:


import functools
import hashlib

from django.core.cache import cache
from django.utils.encoding import force_bytes


def lock_decorator(key_maker=None):
    """
    When you want to lock a function from more than 1 call at a time.
    """

    def decorator(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            if key_maker:
                key = key_maker(*args, **kwargs)
            else:
                key = str(args) + str(kwargs)
            lock_key = hashlib.md5(force_bytes(key)).hexdigest()
            with cache.lock(lock_key):
                return func(*args, **kwargs)

        return inner

    return decorator

How To Use It

This has saved my bacon more than once. I use it on functions that really need to be made synchronous. For example, suppose you have a function like this:


def fetch_remote_thing(name):
    try:
        return Thing.objects.get(name=name).result
    except Thing.DoesNotExist:
        # Need to go out and fetch this
        result = some_internet_fetching(name)  # Assume this is sloooow
        Thing.objects.create(name=name, result=result)
        return result

That function is quite dangerous because, if executed by two concurrent web requests for example, they will trigger two "identical" calls to some_internet_fetching. If the database didn't have the name already, that will most likely trigger two calls to Thing.objects.create(name=name, ...), which could lead to integrity errors, or, if it doesn't, the whole function breaks down because it assumes that there is only 1 or 0 of these Thing records.

Easy to solve, just add the lock_decorator:


@lock_decorator()
def fetch_remote_thing(name):
    try:
        return Thing.objects.get(name=name).result
    except Thing.DoesNotExist:
        # Need to go out and fetch this
        result = some_internet_fetching(name)  # Assume this is sloooow
        Thing.objects.create(name=name, result=result)
        return result

Now, thanks to Redis distributed locks, the function is always allowed to finish before it starts another one. All the hairy locking (in particular, the waiting) is implemented deep down in Redis which is rock solid.

Bonus Usage

Another use that has also saved my bacon is functions that aren't necessarily called with the same input argument but where each call is so resource intensive that you want to make sure only one of them runs at a time. Suppose you have a Django view function that does some resource intensive work and you want to stagger the calls so that only one runs at a time. Like this for example:


def api_stats_calculations(request, part):
    if part == 'users-per-month':
        data = _calculate_users_per_month()  # expensive
    elif part == 'pageviews-per-week':
        data = _calculate_pageviews_per_week()  # intensive
    elif part == 'downloads-per-day':
        data = _calculate_download_per_day()  # slow
    elif you == 'get' and the == 'idea':
        ...

    return http.JsonResponse({'data': data})

If you just put @lock_decorator() on this Django view function, and you have some (almost) concurrent calls to this function, for example from a uWSGI server running with threads and multiple processes, then it will not synchronize the calls.

The solution to this is to write your own function for generating the lock key, like this for example:


@lock_decorator(
    key_maker=lambda request, part: 'api_stats_calculations'
)
def api_stats_calculations(request, part):
    if part == 'users-per-month':
        data = _calculate_users_per_month()  # expensive
    elif part == 'pageviews-per-week':
        data = _calculate_pageviews_per_week()  # intensive
    elif part == 'downloads-per-day':
        data = _calculate_download_per_day()  # slow
    elif you == 'get' and the == 'idea':
        ...

    return http.JsonResponse({'data': data})

Now it works.

How Time-Expensive Is It?

Perhaps you worry that 99% of your calls to the function don't have the problem of concurrent calls at all. How much is the overhead of this lock costing you? I wondered that too and set up a simple stress test where I wrote a really simple Django view function. It looked something like this:


@lock_decorator(key_maker=lambda request: 'samekey')
def sample_view_function(request):
    return http.HttpResponse('Ok\n')

I started a Django server with uWSGI with multiple processes and threads enabled. Then I bombarded this function with a simple concurrent stress test and observed the requests per minute. The cost was extremely tiny and almost negligible (compared to not using the lock decorator). Granted, in this test I used Redis on redis://localhost:6379/0, but generally the conclusion was that the call is extremely fast and not something to worry too much about. Your mileage may vary, so do your own experiments for your context.

What's Needed

You need to use django-redis as your Django cache backend. I've blogged before about using django-redis, for example Fastest cache backend possible for Django and Fastest Redis configuration for Django.
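
For reference, the django-redis setup in settings.py looks something like this (the standard configuration from django-redis's docs; adjust LOCATION to point at your Redis server):

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://127.0.0.1:6379/0",
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
        },
    }
}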

django-html-validator now supports Django 2.x

August 13, 2018
0 comments Python, Web development, Django

django-html-validator is a Django project that can validate your generated HTML. It does so by sending the HTML to https://html5.validator.nu/ or you can start your own Java server locally with vnu.jar from here. The output is that you can have validation errors printed to stdout or you can have them put as .txt files in a temporary directory. You can also include it in your test suite and make it so that tests fail if invalid HTML is generated during rendering in Django unit tests.

The project seems to have become a lot more popular than I thought it would. It started as a one-evening-hack and because there was interest I wrapped it up in a proper project with "docs" and set up CI for future contributions.

I kind of forgot about the project since almost all my current projects generate JSON on the server and generate the DOM on-the-fly with client-side JavaScript, but apparently a lot of issues and PRs were filed related to making it work in Django 2.x. So I took the time last night to tidy up the tox.ini etc. and make the necessary compatibility fixes so it works with Django 1.8 up to Django 2.1. Pull request here.

Thank you all who contributed! I'll try to do a better job of noticing filed issues in the future.

Quick dog-piling (aka stampeding herd) URL stresstest

August 10, 2018
0 comments Python

Whenever you want to quickly bombard a URL with some concurrent traffic, you can use this:


import random
import time
import requests
import concurrent.futures


def _get_size(url):
    sleep = random.random() / 10
    # print("sleep", sleep)
    time.sleep(sleep)
    r = requests.get(url)
    # print(r.status_code)
    assert len(r.text)
    return len(r.text)


def run(url, times=10):
    sizes = []
    futures = []
    with concurrent.futures.ThreadPoolExecutor() as executor:
        for _ in range(times):
            futures.append(executor.submit(_get_size, url))
        for future in concurrent.futures.as_completed(futures):
            sizes.append(future.result())
    return sizes


if __name__ == "__main__":
    import sys

    print(run(sys.argv[1]))

It's really basic but it works wonderfully. It starts 10 concurrent threads that all hit the same URL at almost the same time.
I've been using this to stress test a local Django server while testing some atomic writes with the file system.

A good Django view function cache decorator for http.JsonResponse

June 20, 2018
0 comments Python, Web development, Django

I use this a lot. It has served me very well. The code:


import hashlib
import functools

import markus  # optional
from django.core.cache import cache
from django import http
from django.utils.encoding import force_bytes, iri_to_uri

metrics = markus.get_metrics(__name__)  # optional


def json_response_cache_page_decorator(seconds):
    """Cache only when there's a healthy http.JsonResponse response."""

    def decorator(func):

        @functools.wraps(func)
        def inner(request, *args, **kwargs):
            cache_key = 'json_response_cache:{}:{}'.format(
                func.__name__,
                hashlib.md5(force_bytes(iri_to_uri(
                    request.build_absolute_uri()
                ))).hexdigest()
            )
            content = cache.get(cache_key)
            if content is not None:

                # metrics is optional
                metrics.incr(
                    'json_response_cache_hit',
                    tags=['view:{}'.format(func.__name__)]
                )

                return http.HttpResponse(
                    content,
                    content_type='application/json'
                )
            response = func(request, *args, **kwargs)
            if (
                isinstance(response, http.JsonResponse) and
                response.status_code in (200, 304)
            ):
                cache.set(cache_key, response.content, seconds)
            return response

        return inner

    return decorator

To use it simply add to Django view functions that might return a http.JsonResponse. For example, something like this:


@json_response_cache_page_decorator(60)
def search(request):
    q = request.GET.get('q')
    if not q:
        return http.HttpResponseBadRequest('no q')
    results = search_database(q)
    return http.JsonResponse({
        'results': results,
    })

I use this instead of django.views.decorators.cache.cache_page() for a couple of reasons:

  • cache_page generates cache keys that don't contain the view function name.
  • cache_page tries to cache the whole http.HttpResponse instance which can't be serialized if you use the msgpack serializer.
  • cache_page also sends Cache-Control headers which is not always what you want.
  • With cache_page it's not possible to inject your own custom code, such as my usage of metrics.

Disclaimer: This snippet of code comes from a side-project that has a very specific set of requirements. They're rather unique to that project and I have a full picture of the needs. E.g. I know what specific headers matter and don't matter. Your project might be different. For example, perhaps you don't have markus to handle your metrics. Or perhaps you need to re-write the query string for something to normalize the cache key differently. Point being, take the snippet of code as inspiration when you too find that django.views.decorators.cache.cache_page() isn't good enough for your Django view functions.

GeneratorExit - How to clean up after the last yield in Python

June 7, 2018
7 comments Python

tl;dr; Use except GeneratorExit if your Python generator needs to know the consumer broke out.

Suppose you have a generator that yields things out. After each yield you want to execute some code that does something like logging or cleaning up. Here is one such trivialized example:

The Problem


def pump():
    numbers = [1, 2, 3, 4]
    for number in numbers:
        yield number
        print("Have sent", number)
    print("Last number was sent")


for number in pump():
    print("Got", number)

print("All done")

The output is, as expected:

Got 1
Have sent 1
Got 2
Have sent 2
Got 3
Have sent 3
Got 4
Have sent 4
Last number was sent
All done

In this scenario, the consumer of the generator (the for number in pump() loop in this example) gets everything the generator generates, so after the last yield the generator is free to do any last-minute activities which might be important, such as closing a socket or updating a database.

Suppose the consumer is getting a bit "impatient" and breaks out as soon as it has what it needed.


def pump():
    numbers = [1, 2, 3, 4]
    for number in numbers:
        yield number
        print("Have sent", number)
    print("Last number was sent")


for number in pump():
    print("Got", number)
    # THESE TWO NEW LINES
    if number >= 2:
        break

print("All done")

What do you think the output is now? I'll tell you:

Got 1
Have sent 1
Got 2
All done

In other words, the potentially important lines print("Have sent", number) and print("Last number was sent") never get executed! The generator's documentation could tell the consumer "Don't break! If you don't want me any more, raise a StopIteration". But that's not a feasible requirement.

The Solution

But! There is a better solution and that's to catch GeneratorExit exceptions.


def pump():
    numbers = [1, 2, 3, 4]
    try:
        for number in numbers:
            yield number
            print("Have sent", number)
    except GeneratorExit:
        print("Exception!")
    print("Last number was sent")


for number in pump():
    print("Got", number)
    if number == 2:
        break
print("All done")

Now you get what you might want:

Got 1
Have sent 1
Got 2
Exception!
Last number was sent
All done

Next Level Stuff

Note in the last example's output, it never prints Have sent 2 even though the generator really did send that number. Suppose that's an important piece of information, then you can reach that inside the except GeneratorExit block. Like this for example:


def pump():
    numbers = [1, 2, 3, 4]
    try:
        for number in numbers:
            yield number
            print("Have sent", number)
    except GeneratorExit:
        print("Have sent*", number)
    print("Last number was sent")


for number in pump():
    print("Got", number)
    if number == 2:
        break
print("All done")

And the output is:

Got 1
Have sent 1
Got 2
Have sent* 2
Last number was sent
All done

The * is just in case we wanted to distinguish between a break happening or not. Depends on your application.

Writing a custom Datadog reporter using the Python API

May 21, 2018
2 comments Python

Datadog is an awesome software-as-a-service where you can aggregate and visualize statsd metrics sent from an application. For visualizing timings you create a time series graph. It can look something like this:

Time series

This time series looks sane because it's timings recorded very, very frequently. But what if it happens very rarely, like once a day? Then the graph doesn't look very useful. See this example:

"Rare time" series

Not only is it happening rarely, the raw number is really quite hard to parse. I.e. what's 2.6 million milliseconds? (The answer is approximately 45 minutes.) So to solve that I used the Datadog API. Now I can get a metric of every single point in milliseconds and I can make a little data table with human-readable dates and times.

The end result looks something like this:

SCOPE: ENV:PROD
+-------------------------+------------------------+-----------------------+
|          WHEN           |        TIME AGO        |       TIME TOOK       |
+=========================+========================+=======================+
| Mon 2018-05-21T17:00:00 | 2 hours 43 minutes ago | 23 minutes 32 seconds |
+-------------------------+------------------------+-----------------------+
| Sun 2018-05-20T17:00:00 | 1 day 2 hours ago      | 20 seconds            |
+-------------------------+------------------------+-----------------------+
| Sat 2018-05-19T17:00:00 | 2 days 2 hours ago     | 20 seconds            |
+-------------------------+------------------------+-----------------------+
| Fri 2018-05-18T17:00:00 | 3 days 2 hours ago     | 2 minutes 24 seconds  |
+-------------------------+------------------------+-----------------------+
| Wed 2018-05-16T20:00:00 | 4 days 23 hours ago    | 38 minutes 38 seconds |
+-------------------------+------------------------+-----------------------+

It's not gorgeous and there are a lot of caveats but it's at least really easy to read. See the metrics.py code here.

I don't think you can run this code since you don't have the same (hardcoded) metrics but hopefully it can serve as an example to whet your appetite.
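
If you want to roll your own, the core of it is a query against the Datadog metrics API, roughly like this (a minimal sketch; the metric name and tag are made up and you need your own API/app keys):

import time

from datadog import initialize, api

initialize(api_key="YOUR_API_KEY", app_key="YOUR_APP_KEY")

# Hypothetical metric and scope; swap in whatever you send to Datadog.
QUERY = "max:myapp.slow_task.timing{env:prod}"

now = int(time.time())
result = api.Metric.query(start=now - 7 * 24 * 3600, end=now, query=QUERY)

for series in result.get("series", []):
    for timestamp_ms, value in series["pointlist"]:
        when = time.strftime("%a %Y-%m-%dT%H:%M:%S", time.localtime(timestamp_ms / 1000))
        # For this particular metric the value is in milliseconds.
        print(when, "{:.0f} seconds".format(value / 1000))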

What I'm going to do next, if I have time, is to run this as a Flask app instead that outputs an HTML table, on a Heroku app or something.

Rust > Go > Python ...to parse millions of dates in CSV files

May 15, 2018
10 comments Python

It all started so innocently. The task at hand was to download an inventory of every single file ever uploaded to a public AWS S3 bucket. The way that works is that you download the root manifest.json. It references a boat load of .csv.gz files. So to go through every single file uploaded to the bucket, you read the manifest.json, then download each and every .csv.gz file. Now you can parse these and do something with each row. An example row in one of the CSV files looks like this:

"net-mozaws-prod-delivery-firefox","pub/firefox/nightly/latest-mozilla-central-l10n/linux-i686/xpi/firefox-57.0a1.gn.langpack.xpi","474348","2017-09-21T13:08:25.000Z","0af6ce0b494d1f380a8b1cd6f42e36cb"

In the Mozilla Buildhub what we do is periodically do this, in Python (with asyncio), to spot if there are any files in the S3 bucket that we have potentially missed recording in a different database.
But of the 150 or so .csv.gz files, most are getting old, and in this particular application we can be fairly certain they're unlikely to be relevant and can be ignored. To come to that conclusion you parse each .csv.gz file, parse each row of the CSV, and extract the last_modified value (e.g. 2017-09-21T13:08:25.000Z) into a datetime.datetime instance. Now you can quickly decide if it's too old or recent enough to go through the other various checks.

So, the task is to parse 150 .csv.gz files totalling about 2.5GB with roughly 75 million rows. Basically parsing the date strings into datetime.datetime objects 75 million times.

Python

What this script does is it opens, synchronously, each and every .csv.gz file, parses each record's date and compares it to a constant ("Is this record older than 6 months or not?").

This is an extraction of a bigger system to just look at the performance of parsing all those .csv.gz files to figure out how many are old and how many are within 6 months. Code looks like this:


import datetime
import gzip
import csv
from glob import glob

cutoff = datetime.datetime.now() - datetime.timedelta(days=6 * 30)

def count(fn):
    count = total = 0
    with gzip.open(fn, 'rt') as f:
        reader = csv.reader(f)
        for line in reader:
            lastmodified = datetime.datetime.strptime(
                line[3],
                '%Y-%m-%dT%H:%M:%S.%fZ'
            )
            if lastmodified > cutoff:
                count += 1
            total += 1

    return total, count


def run():
    total = recent = 0
    for fn in glob('*.csv.gz'):
        if len(fn) == 39:  # filter out other junk files that don't fit the pattern
            print(fn)
            t, c = count(fn)
            total += t
            recent += c

    print(total)
    print(recent)
    print('{:.1f}%'.format(100 * recent / total))


run()

Code as a gist here.

Only problem. This is horribly slow.

To reproduce this, I took a sample of 38 of these .csv.gz files and ran the above code with CPython 3.6.5. It took 3m44s on my 2017 MacBook Pro.

Let's try a couple low-hanging fruit ideas:

  • PyPy 5.10.1 (based on 3.5.1): 2m30s
  • Using asyncio on Python 3.6.5: 3m37s
  • Using a thread pool on Python 3.6.5: 7m05s
  • Using a process pool on Python 3.6.5: 1m5s

Hmm... Clearly this is CPU bound and using multiple processes is the ticket. But what's really holding us back is the date parsing. From the "Fastest Python datetime parser" benchmark the trick is to use ciso8601. Alright, let's try that. Diff:

6c6,10
< cutoff = datetime.datetime.now() - datetime.timedelta(days=6 * 30)
---
> import ciso8601
>
> cutoff = datetime.datetime.utcnow().replace(
>     tzinfo=datetime.timezone.utc
> ) - datetime.timedelta(days=6 * 30)
14,17c18
<             lastmodified = datetime.datetime.strptime(
<                 line[3],
<                 '%Y-%m-%dT%H:%M:%S.%fZ'
<             )
---
>             lastmodified = ciso8601.parse_datetime(line[3])

Version with ciso8601 here.

So what originally took 3 and a half minutes now takes 18 seconds. I suspect that's about as good as it gets with Python.
But it's not too shabby. Parsing 12,980,990 date strings in 18 seconds. Not bad.
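
For reference, the process-pool variant is basically the same count() function from above fanned out over the files with concurrent.futures (a sketch of the approach, not my exact script):

import concurrent.futures
from glob import glob


def run():
    total = recent = 0
    # Same length-based filter as before to skip junk files.
    filenames = [fn for fn in glob('*.csv.gz') if len(fn) == 39]
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for t, c in executor.map(count, filenames):
            total += t
            recent += c
    print(total)
    print(recent)
    print('{:.1f}%'.format(100 * recent / total))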

Go

My Go is getting rusty but it's quite easy to write one of these so I couldn't resist the temptation:


package main

import (
    "compress/gzip"
    "encoding/csv"
    "fmt"
    "log"
    "os"
    "path/filepath"
    "strconv"
    "time"
)

func count(fn string, index int) (int, int) {
    fmt.Printf("%d %v\n", index, fn)
    f, err := os.Open(fn)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    gr, err := gzip.NewReader(f)
    if err != nil {
        log.Fatal(err)
    }
    defer gr.Close()

    cr := csv.NewReader(gr)
    rec, err := cr.ReadAll()
    if err != nil {
        log.Fatal(err)
    }
    var count = 0
    var total = 0
    layout := "2006-01-02T15:04:05.000Z"

    minimum, err := time.Parse(layout, "2017-11-02T00:00:00.000Z")
    if err != nil {
        log.Fatal(err)
    }

    for _, v := range rec {
        last_modified := v[3]

        t, err := time.Parse(layout, last_modified)
        if err != nil {
            log.Fatal(err)
        }
        if t.After(minimum) {
            count += 1
        }
        total += 1
    }
    return total, count
}

func FloatToString(input_num float64) string {
    // to convert a float number to a string
    return strconv.FormatFloat(input_num, 'f', 2, 64)
}

func main() {
    var pattern = "*.csv.gz"

    files, err := filepath.Glob(pattern)
    if err != nil {
        panic(err)
    }
    total := int(0)
    recent := int(0)
    for i, fn := range files {
        if len(fn) == 39 {
            // fmt.Println(fn)
            c, t := count(fn, i)
            total += t
            recent += c
        }
    }
    fmt.Println(total)
    fmt.Println(recent)
    ratio := float64(recent) / float64(total)
    fmt.Println(FloatToString(100.0 * ratio))
}

Code as as gist here.

Using go1.10.1 I run go build main.go and then time ./main. This takes just 20s which is about the time it took the Python version that uses a process pool and ciso8601.

I showed this to my colleague @mostlygeek who saw my scripts and did the Go version more properly with its own repo.
At first pass (go build filter.go and time ./filter) this one clocks in at 19s just like my naive initial hack. However if you run it as time GOPAR=2 ./filter it will use 8 workers (my MacBook Pro has 8 CPUs) and now it only takes 5.3s.

By the way, check out @mostlygeek's download.sh if you want to generate and download yourself a bunch of these .csv.gz files.

Rust

First @mythmon stepped up and wrote two versions. One single-threaded and one using rayon which will use all CPUs you have.

The version using rayon looks like this (single-threaded version here):


extern crate csv;
extern crate flate2;
#[macro_use]
extern crate serde_derive;
extern crate chrono;
extern crate rayon;

use std::env;
use std::fs::File;
use std::io;
use std::iter::Sum;

use chrono::{DateTime, Utc, Duration};
use flate2::read::GzDecoder;
use rayon::prelude::*;

#[derive(Debug, Deserialize)]
struct Row {
    bucket: String,
    key: String,
    size: usize,
    last_modified_date: DateTime<Utc>,
    etag: String,
}

struct Stats {
    total: usize,
    recent: usize,
}

impl Sum for Stats {
    fn sum<I: Iterator<Item=Self>>(iter: I) -> Self {
        let mut acc = Stats { total: 0, recent: 0 };
        for stat in  iter {
            acc.total += stat.total;
            acc.recent += stat.recent;
        }
        acc
    }
}

fn main() {
    let cutoff = Utc::now() - Duration::days(180);
    let filenames: Vec<String> = env::args().skip(1).collect();

    let stats: Stats = filenames.par_iter()
        .map(|filename| count(&filename, cutoff).expect(&format!("Couldn't read {}", &filename)))
        .sum();

    let percent = 100.0 * stats.recent as f32 / stats.total as f32;
    println!("{} / {} = {:.2}%", stats.recent, stats.total, percent);
}

fn count(path: &str, cutoff: DateTime<Utc>) -> Result<Stats, io::Error> {
    let mut input_file = File::open(&path)?;
    let decoder = GzDecoder::new(&mut input_file)?;
    let mut reader = csv::ReaderBuilder::new()
        .has_headers(false)
        .from_reader(decoder);

    let mut total = 0;
    let recent = reader.deserialize::<Row>()
        .flat_map(|row| row)  // Unwrap Somes, and skip Nones
        .inspect(|_| total += 1)
        .filter(|row| row.last_modified_date > cutoff)
        .count();

    Ok(Stats { total, recent })
}

I installed it like this (I have rustc 1.26 installed):

▶ cargo build --release --bin single_threaded
▶ time ./target/release/single_threaded ../*.csv.gz

That finishes in 22s.

Now let's try the one that uses all CPUs in parallel:

▶ cargo build --release --bin rayon
▶ time ./target/release/rayon ../*.csv.gz

That took 5.6s

That's roughly 3 times faster than the best Python version.

When chatting with my teammates about this, I "nerd-sniped" in another colleague, Ted Mielczarek who forked Mike's Rust version.

Compiling and running these two, I get 17.4s for the single-threaded version and 2.5s for the rayon one.

In conclusion

  1. Simplest Python version: 3m44s
  2. Using PyPy (for Python 3.5): 2m30s
  3. Using asyncio: 3m37s
  4. Using concurrent.futures.ThreadPoolExecutor: 7m05s
  5. Using concurrent.futures.ProcessPoolExecutor: 1m5s
  6. Using ciso8601 to parse the dates: 1m08s
  7. Using ciso8601 and concurrent.futures.ProcessPoolExecutor: 18.4s
  8. Novice Go version: 20s
  9. Go version with parallel workers: 5.3s
  10. Single-threaded Rust version: 22s
  11. Parallel workers in Rust: 5.6s
  12. (Ted's) Single-threaded Rust version: 17.4s
  13. (Ted's) Parallel workers in Rust: 2.5s

Most interesting is that this is not surprising. Of course it gets faster if you use more CPUs in parallel. And of course a C binary to do a critical piece in Python will speed things up. What I'm personally quite attracted to is how easy it was to replace the date parsing with ciso8601 in Python and get a more-than-double performance boost with very little work.

Yes, I'm perfectly aware that these are not scientific conditions and the list of disclaimers is long and boring. However, it was fun! It's fun to compare and contrast solutions like this. Don't you think?