Django 2.0 came out a couple of weeks ago. It now supports "conditional aggregation", which relies on a standard SQL feature (the FILTER clause) that I didn't even know about.
Before
So I have a Django app which has an endpoint that generates some human-friendly stats about the number of uploads (and their total size) in various time intervals.
First of all, this is how it sets up the time intervals:
import datetime

from django.utils import timezone

today = timezone.now()
# Midnight today (zero out the microseconds too).
start_today = today.replace(hour=0, minute=0, second=0, microsecond=0)
start_yesterday = start_today - datetime.timedelta(days=1)
# Base these on start_today so the time-of-day is zeroed as well.
start_this_month = start_today.replace(day=1)
start_this_year = start_this_month.replace(month=1)
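To make the boundaries concrete, here's roughly what they evaluate to (the date and times are made up for illustration): "today" runs from midnight until now, "yesterday" covers the full previous day, and the month and year buckets start on the 1st.

>>> today
datetime.datetime(2018, 1, 9, 14, 23, 7, 123456, tzinfo=<UTC>)
>>> start_today
datetime.datetime(2018, 1, 9, 0, 0, tzinfo=<UTC>)
>>> start_yesterday
datetime.datetime(2018, 1, 8, 0, 0, tzinfo=<UTC>)
>>> start_this_month
datetime.datetime(2018, 1, 1, 0, 0, tzinfo=<UTC>)
>>> start_this_year
datetime.datetime(2018, 1, 1, 0, 0, tzinfo=<UTC>)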
And then there's a little function that returns a dict for each of these time intervals:
from django.db.models import Sum

def count_and_size(qs, start, end):
    sub_qs = qs.filter(created_at__gte=start, created_at__lt=end)
    return {
        'count': sub_qs.count(),
        'total_size': sub_qs.aggregate(size=Sum('size'))['size'],
    }
numbers['uploads'] = {
    'today': count_and_size(upload_qs, start_today, today),
    'yesterday': count_and_size(upload_qs, start_yesterday, start_today),
    'this_month': count_and_size(upload_qs, start_this_month, today),
    'this_year': count_and_size(upload_qs, start_this_year, today),
}
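For context, the resulting structure looks something like this (all the counts and sizes here are invented for illustration):

{
    'uploads': {
        'today': {'count': 12, 'total_size': 409612},
        'yesterday': {'count': 30, 'total_size': 1024000},
        'this_month': {'count': 317, 'total_size': 9817276},
        'this_year': {'count': 1492, 'total_size': 40404043},
    }
}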
What you get is exactly 2 x 4 = 8 queries: one COUNT and one SUM for each time interval. E.g.
SELECT SUM("upload_upload"."size") AS "size"
FROM "upload_upload"
WHERE ("upload_upload"."created_at" >= ...);

SELECT COUNT(*) AS "__count"
FROM "upload_upload"
WHERE ("upload_upload"."created_at" >= ...);

...6 more queries...
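By the way, if you want to verify the query count yourself, here's one way to do it (a sketch using Django's test utilities, not something from the original code):

from django.db import connection
from django.test.utils import CaptureQueriesContext

# Capture every query executed inside the block, regardless of DEBUG.
with CaptureQueriesContext(connection) as context:
    numbers['uploads'] = {
        'today': count_and_size(upload_qs, start_today, today),
        'yesterday': count_and_size(upload_qs, start_yesterday, start_today),
        'this_month': count_and_size(upload_qs, start_this_month, today),
        'this_year': count_and_size(upload_qs, start_this_year, today),
    }

print(len(context.captured_queries))  # expect 8 with this version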
Middle
Oops. I think this code comes from a slightly rushed job. We can do the COUNT and the SUM at the same time for each query.
from django.db.models import Count, Sum

# New, improved count_and_size() function!
def count_and_size(qs, start, end):
    sub_qs = qs.filter(created_at__gte=start, created_at__lt=end)
    return sub_qs.aggregate(
        count=Count('id'),
        total_size=Sum('size'),
    )
numbers['uploads'] = {
    'today': count_and_size(upload_qs, start_today, today),
    'yesterday': count_and_size(upload_qs, start_yesterday, start_today),
    'this_month': count_and_size(upload_qs, start_this_month, today),
    'this_year': count_and_size(upload_qs, start_this_year, today),
}
Much better! Now there's only one query per time bucket, so 4 queries in total. E.g.
SELECT COUNT("upload_upload"."id") AS "count",
       SUM("upload_upload"."size") AS "total_size"
FROM "upload_upload"
WHERE ("upload_upload"."created_at" >= ...);

...3 more queries...
After
But we can do better than that! Instead, we use conditional aggregation. The syntax gets a bit hairy because there are so many keyword arguments, but I hope I've indented it nicely so it's easy to see how it works:
from django.db.models import Count, Q, Sum

def make_q(start, end):
    return Q(created_at__gte=start, created_at__lt=end)

q_today = make_q(start_today, today)
q_yesterday = make_q(start_yesterday, start_today)
q_this_month = make_q(start_this_month, today)
q_this_year = make_q(start_this_year, today)

aggregates = upload_qs.aggregate(
    today_count=Count('pk', filter=q_today),
    today_total_size=Sum('size', filter=q_today),

    yesterday_count=Count('pk', filter=q_yesterday),
    yesterday_total_size=Sum('size', filter=q_yesterday),

    this_month_count=Count('pk', filter=q_this_month),
    this_month_total_size=Sum('size', filter=q_this_month),

    this_year_count=Count('pk', filter=q_this_year),
    this_year_total_size=Sum('size', filter=q_this_year),
)
numbers['uploads'] = {
    'today': {
        'count': aggregates['today_count'],
        'total_size': aggregates['today_total_size'],
    },
    'yesterday': {
        'count': aggregates['yesterday_count'],
        'total_size': aggregates['yesterday_total_size'],
    },
    'this_month': {
        'count': aggregates['this_month_count'],
        'total_size': aggregates['this_month_total_size'],
    },
    'this_year': {
        'count': aggregates['this_year_count'],
        'total_size': aggregates['this_year_total_size'],
    },
}
Voila! One single query to get all those pieces of data.
The SQL sent to PostgreSQL looks something like this:
SELECT
  COUNT("upload_upload"."id") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "today_count",
  SUM("upload_upload"."size") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "today_total_size",
  COUNT("upload_upload"."id") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "yesterday_count",
  SUM("upload_upload"."size") FILTER (WHERE ("upload_upload"."created_at" >= ...)) AS "yesterday_total_size",
  ...
FROM "upload_upload";
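One gotcha worth knowing here: Sum returns NULL (i.e. None in Python) for a bucket where the filter matches no rows. If you'd rather get 0, one option (a sketch, assuming size is an integer field) is to wrap each Sum in Coalesce:

from django.db.models import Count, Sum
from django.db.models.functions import Coalesce

# Same aggregation as above, but empty buckets yield 0 instead of None.
aggregates = upload_qs.aggregate(
    today_count=Count('pk', filter=q_today),
    today_total_size=Coalesce(Sum('size', filter=q_today), 0),
    # ...and so on for the other buckets...
)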
Is this the best thing to do? I'm starting to have my doubts.
Watch Out!
When I take this now-single monster query for a spin with an EXPLAIN ANALYZE prefix, I notice something worrying!
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=74.33..74.34 rows=1 width=16) (actual time=0.587..0.587 rows=1 loops=1)
   ->  Seq Scan on upload_upload  (cost=0.00..62.13 rows=813 width=16) (actual time=0.012..0.210 rows=813 loops=1)
 Planning time: 0.427 ms
 Execution time: 0.674 ms
(4 rows)
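(For reference, in case you want to reproduce this: .aggregate() executes immediately, so you can't just print a lazy queryset's .query to get that SQL. One way, sketched here with a hypothetical helper, is to grab the raw SQL from connection.queries, which requires DEBUG=True, and feed it to EXPLAIN ANALYZE through a cursor:)

from django.db import connection

def explain_analyze(sql):
    # Hypothetical helper, not from the original post.
    with connection.cursor() as cursor:
        cursor.execute('EXPLAIN ANALYZE ' + sql)
        for (line,) in cursor.fetchall():
            print(line)

# With DEBUG=True, the most recent query is the big aggregate above:
explain_analyze(connection.queries[-1]['sql'])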
A sequential scan! That's terrible. The created_at column is indexed with a BTREE index, so why can't it use the index?
The short answer is: I don't know!
I've uploaded a reduced, but still complete, example demonstrating this in a gist. It's very similar to the example in the Stack Overflow question I asked.
So what did I do? I went back to the "middle" solution: one SELECT query per time bucket. So 4 queries in total, but at least all 4 are able to use an index.