There was a really interesting discussion on the django-users mailing list about the most efficient way to select random elements from a SQL database. I knew that using a plain RANDOM() in SQL can be very slow on big tables, but I didn't know by how much. I had to run a quick test!
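In Django, that RANDOM() is what order_by('?') turns into. If you want to see the exact SQL for yourself, you can print the query (the import path here is hypothetical; the model is the one from my benchmark below):

from myapp.models import SomeLargishModel  # hypothetical import path

print(SomeLargishModel.objects.order_by('?')[:1].query)
# On PostgreSQL this comes out roughly as:
#   SELECT ... FROM myapp_somelargishmodel ORDER BY RANDOM() LIMIT 1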
Cal Leeming discussed a snippet of his for paginating huge tables, which uses the MAX(id) aggregate function.
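I haven't seen his exact snippet, but the gist, as I understand it, is to use MAX(id) as a cheap stand-in for an expensive COUNT(*). A rough sketch of that idea (the function name and the density assumption are mine, not Cal's):

from django.db.models import Max

def approx_page_count(model, page_size=100):
    # MAX(id) is a quick index lookup, whereas COUNT(*) can take ages
    # on a huge table. Assumes ids are reasonably dense (few gaps).
    max_id = model.objects.aggregate(Max('id'))['id__max'] or 0
    return max_id // page_size + 1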
So, I did a little experiment on a table with 84,000 rows in it. That's realistic enough to matter, even though it's less than millions. How long would it take to select 10 random items, 10 times? The benchmark code looks like this:
from time import time

TIMES = 10

def using_normal_random(model):
    # order_by('?') becomes ORDER BY RANDOM() in SQL
    for i in range(TIMES):
        yield model.objects.all().order_by('?')[0].pk

t0 = time()
for i in range(TIMES):
    list(using_normal_random(SomeLargishModel))
t1 = time()
print(t1 - t0, "seconds")
Result:
41.8955321312 seconds
Nasty!! Running this, you'll also notice Postgres spiking your CPU like crazy. That's because ORDER BY RANDOM() has to generate a random value for every row and sort the entire table, just to return a single one.
A much better approach is to use Python's random.randint(1, <max ID>). It looks like this:
from time import time
from random import randint

from django.db.models import Max

def using_max(model):
    max_ = model.objects.aggregate(Max('id'))['id__max']
    i = 0
    while i < TIMES:
        try:
            # a pk lookup is a fast index scan; gaps in the id
            # sequence raise DoesNotExist and we simply try again
            yield model.objects.get(pk=randint(1, max_)).pk
            i += 1
        except model.DoesNotExist:
            pass

t0 = time()
for i in range(TIMES):
    list(using_max(SomeLargishModel))
t1 = time()
print(t1 - t0, "seconds")
Result:
0.63835811615 seconds
Much more pleasant!
UPDATE
A commenter, Ken Swift, asked: what if your requirement is to select 100 random items instead of just 10? Won't those 101 database queries (one aggregate plus at least 100 pk lookups) be more costly than a single query with a RANDOM()? The answer turns out to be no.
I changed the script to select 100 random items 1 time (instead of 10 items 10 times) and the times were roughly the same:
using_normal_random() took 41.4467599392 seconds
using_max() took 0.6027739048 seconds
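For reference, the change to the harness is trivial. My reconstruction of it (not the exact script I ran) looks like this:

TIMES = 100  # 100 random items...

t0 = time()
list(using_max(SomeLargishModel))  # ...selected in a single run
t1 = time()
print(t1 - t0, "seconds")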
And what about 1000 items, 1 time?
using_normal_random() took 204.685141802 seconds
using_max() took 2.49527382851 seconds
UPDATE 2
The algorithm for returning a generator has a couple of flaws:
- Can't pass in a QuerySet
- You get primary keys returned, not ORM instances
- You can't pass in the number of items you want
- Internally, it might randomly select a number already tried
Here's a much more complete function:
import random

from django.db.models import Max, Min

def random_queryset_elements(qs, number):
    assert number <= 10000, 'too large'
    max_pk = qs.aggregate(Max('pk'))['pk__max']
    min_pk = qs.aggregate(Min('pk'))['pk__min']
    ids = set()
    while len(ids) < number:
        next_pk = random.randint(min_pk, max_pk)
        # don't re-fetch a pk we've already yielded
        while next_pk in ids:
            next_pk = random.randint(min_pk, max_pk)
        try:
            found = qs.get(pk=next_pk)
            ids.add(found.pk)
            yield found
        except qs.model.DoesNotExist:
            pass
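Usage would look something like this (the model is just the one from my benchmark, and the filter field is hypothetical; any queryset should work, filtered or not):

qs = SomeLargishModel.objects.filter(some_flag=True)  # hypothetical field
for instance in random_queryset_elements(qs, 10):
    print(instance.pk, instance)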