Connecting with psycopg2 without a username and password

February 24, 2011
12 comments Python

My colleague Lukas and I banged our heads against this for much too long today. So, our SQLAlchemy is was configured like this:


ENV_DB_CONNECTION_DSN = postgresql://localhost:5432/mydatabase

And the database doesn't have a password (local) so I can log in to it like this on the command line:


$ psql mydatabase

Which assumes the username peterbe which is what I'm logged in. So, this is a shortcut for doing this:


$ psql mydatabase -U peterbe

Which, assumes a blank/empty password.

Truncated! Read the rest by clicking the link below.

Optimization of getting random rows out of a PostgreSQL in Django

February 23, 2011
48 comments Django

There was a really interesting discussion on the django-users mailing list about how to best select random elements out of a SQL database the most efficient way. I knew using a regular RANDOM() in SQL can be very slow on big tables but I didn't know by how much. Had to run a quick test!

Cal Leeming discussed a snippet of his to do with pagination huge tables which uses the MAX(id) aggregate function.

So, I did a little experiment on a table with 84,000 rows in it. Realistic enough to matter even though it's less than millions. So, how long would it take to select 10 random items, 10 times? Benchmark code looks like this:


TIMES = 10
def using_normal_random(model):
   for i in range(TIMES):
       yield model.objects.all().order_by('?')[0].pk

t0 = time()
for i in range(TIMES):
   list(using_normal_random(SomeLargishModel))
t1 = time()
print t1-t0, "seconds"

Result:


41.8955321312 seconds

Nasty!! Also running this you'll notice postgres spiking your CPU like crazy.

A much better approach is to use Python's random.randint(1, <max ID>). Looks like this:


 from django.db.models import Max
 from random import randint
 def using_max(model):
   max_ = model.objects.aggregate(Max('id'))['id__max']
   i = 0
   while i < TIMES:
       try:
           yield model.objects.get(pk=randint(1, max_)).pk
           i += 1
       except model.DoesNotExist:
           pass

t0 = time()
for i in range(TIMES):
   list(using_max(SomeLargishModel))
t1 = time()
print t1-t0, "seconds"

Result:


0.63835811615 seconds

Much more pleasant!

UPDATE

Commentator, Ken Swift, asked what if your requirement is to select 100 random items instead of just 10. Won't those 101 database queries be more costly than just 1 query with a RANDOM(). Answer turns out to be no.

I changed the script to select 100 random items 1 time (instead of 10 items 10 times) and the times were the same:


using_normal_random() took 41.4467599392 seconds
using_max() took 0.6027739048 seconds

And what about 1000 items 1 time:


using_normal_random() took 204.685141802 seconds
using_max() took 2.49527382851 seconds

UPDATE 2

The algorithm for returning a generator has a couple of flaws:

  1. Can't pass in a QuerySet
  2. You get primary keys returned, not ORM instances
  3. You can't pass in a number
  4. Internally, it might randomly select a number already tried

Here's a much more complete function:


 def random_queryset_elements(qs, number):
    assert number <= 10000, 'too large'
    max_pk = qs.aggregate(Max('pk'))['pk__max']
    min_pk = qs.aggregate(Min('pk'))['pk__min']
    ids = set()
    while len(ids) < number:
        next_pk = random.randint(min_pk, max_pk)
        while next_pk in ids:
            next_pk = random.randint(min_pk, max_pk)
        try:
            found = qs.get(pk=next_pk)
            ids.add(found.pk)
            yield found
        except qs.model.DoesNotExist:
            pass

Nice testimonial about django-static

February 21, 2011
0 comments Django

My friend Chris is a Django newbie who has managed to build a whole e-shop site in Django. It will launch on a couple of days and when it launches I will blog about it here too. He sent me this today which gave me a smile:

"I spent today setting up django_static for the site, and optimising it for performance. If there's one thing I've learned from you, it's optimisation.

So, my homepage is now under 100KB (was 330KB), and it loads in @5-6 seconds from hard refresh (was 13-14 seconds at its worst). And I just got a 92 score on Yslow. I do believe I have the fastest tea website around now, and I still haven't installed caching.

Wicked huh?"

He's talking about using django-static. Then I get another email shortly after with this:

"correction - I get 97 on YSlow if I use a VPN.

I just found that the Great Firewall tags extra HTTP requests onto every request I make from my browser, pinging a server in Shanghai with a PHP script which probably checks the page for its content or if its on some kind of blocked list. Cheeky buggers!"

It's that interesting! (Note: Chris is based in China but hosts the test site in the UK)

How I profile my Nginx + proxy pass server

February 16, 2011
3 comments Web development, Python

Like so many others you probably have an Nginx server sitting in front of your application server (Django, Zope, Rails). The Nginx server serves static files right off the filesystem and when it doesn't do that it proxy passes the request on to the backend. You might be using proxy_pass, uwsgi or fastcgi_pass or at least something very similar. Most likely you have an Nginx site configure something like this:


server {
   access_log /var/log/nginx/mysite.access.log;
   location ^~ /static/ {
       root /var/lib/webapp;
       access_log off;
   }
   location / {
       proxy_pass http://localhost:8000;
   }
}

What I do is that I add an access log directive that times every request. This makes it possible to know how long every non-trivial request takes for the backend to complete:


server {
   log_format timed_combined '$remote_addr - $remote_user [$time_local]  ' 
                             '"$request" $status $body_bytes_sent '
                             '"$http_referer" "$http_user_agent" $request_time';
   access_log /var/log/nginx/timed.mysite.access.log timed_combined;

   location ^~ /css/ {
       root /var/lib/webapp/static;
       access_log off;
   }
   location / {
       proxy_pass http://localhost:8000;
   }
}

Truncated! Read the rest by clicking the link below.

DoneCal homepage now able to do 10,000 requests/second

February 13, 2011
0 comments DoneCal

I've done some work refactoring the homepage of DoneCal so that it does no logic other than just serving HTML. What it used to do was some basic security checks and stuff so that it says "Hi Peter" and a log out link. Now all of that has been moved to one simple piece of AJAX call.

BEFORE:


# ab -n 1000 -c 10 http://donecal.com/
...
Requests per second:    353.65 [#/sec] (mean)

AFTER:


# ab -n 1000 -c 10 http://donecal.com/
...
Requests per second:    9796.78 [#/sec] (mean)

# ab -n 1000 -c 10 http://donecal.com/auth/logged_in.json
...
Requests per second:    3756.25 [#/sec] (mean)

The reason why loading the index.html can be so fast is because I'm using Nginx directly. In my Nginx config I have to not use the static file if the request isn't a GET request or if it has a query string. I'll need to remove that stuff too and then it means that I can push the index.html file out to my AWS CloudFront CDN using a CNAME.

DoneCal is my first web application that is this Javascript heavy. It raises the bar in terms of optimal HTTP optimization to get the best user experience possible. I love learning this new way of working.

EditDistanceMatcher - NodeJS script for doing edit distance 1 matching

February 5, 2011
0 comments JavaScript

I needed a very basic spell correction string matcher in my current NodeJS project so I wrote a simple class called EditDistanceMatcher that compares a string against another string and matches if it's 1 edit distance away. With it you can do things like Google search's "Did you mean: poop?" when you search for pop.

Note, this code doesn't check popularity of correct words (e.g. "pop" might appear much more often than "poop" so it'll suggest "pop" if you enter "poup"). Anyway this simple snippet from the unit tests will reveal how it works:


     /* The match() method */
     var edm = new EditDistanceMatcher(["peter"]);
     // edm.match returns an array and remember,
     // in javascript ['peter'] == ['peter'] => false
     test.equal(edm.match("petter").length, 1);
     test.equal(edm.match("petter")[0], 'peter');
     test.equal(edm.match("junk").length, 0);

     /* the is_matched() method */
     var edm = new EditDistanceMatcher(["peter"]);
     test.equal(typeof edm.is_matched('petter'), 'boolean');
     test.equal(typeof edm.is_matched('junk'), 'boolean');
     test.ok(edm.is_matched("petter"));
     test.ok(!edm.is_matched("junk"));

The most basic use case is if you have a quiz and you want to accept some spelling mistakes. "What's the capital of Sweden?; STOKHOLM; Correct!"

For the unlazy this NodeJS code can very easily be used in a browser by simply removing the exports stuff.

edit_distance.js

tests/test_edit_distance.js

Note! I wrote this in an airport lounge so I'm sure it can be improved lots more.

DoneCal on MumbaiMirror

February 3, 2011
1 comment DoneCal

Here's a nice write up about DoneCal on MumbaiMirror

"All in all, DoneCal is one of those Web 2.0 tools that you wouldn’t really miss if it wasn’t around, but once you use it, you can’t go back."

They don't make a link to DoneCal which I suspect is some sort of half assed attempt to avoid too many outgoing links. They've strangely spent time writing about another web page but can't make a link to it. If I've learned anything from Google is that the ultimate mantra of SEO is: don't try to be smarter than us, just write great content and let us worry about ranking.

If these guys are worried about that, why don't they use a rel="nofollowup" attribute on the link?

DoneCal.com international visitors

January 21, 2011
0 comments DoneCal

DoneCal.com international visitors For the first time in my life I've launched a web site/app that isn't mostly popular in the United States. Yay!(?) Not that I care or that it matters but it's worth noting. For some reason it's currently most popular in France, followed closely by China and United States is not till the 5th place.

Of the United States visitors I'm not surprised the California bunch is more prominent. The service is quite new and quite technically interesting for people in the industry so I guess a lot of those visitors are Silicon Valley type folks.

Fastest "boolean SQL queries" possible with Django

January 14, 2011
5 comments Django

For those familiar with the Django ORM they know how easy it is to work with and that you can do lots of nifty things with the result (QuerySet in Django lingo).

So I was working report that basically just needed to figure out if a particular product has been invoiced. Not for how much or when, just if it's included in an invoice or not.

Truncated! Read the rest by clicking the link below.

django-static version 1.5 automatically taking care of imported CSS

January 11, 2011
1 comment Django

I just released django-static 1.5 (github page) which takes care of optimizing imported CSS files.

To explain, suppose you have a file called foo.css and do this in your Django template:


{% load django_static %}
<link href="{% slimfile "/css/foo.css" %}"
  rel="stylesheet" type="text/css" />

And in foo.css you have the following:


@import "bar.css";
body {
   background-image: url(/images/foo.png);
}

And in bar.css you have this:


div.content {
   background-image: url("bar.png");
}

The outcome is the following:


# foo.css
@import "/css/bar.1257701299.css";
body{background-image:url(/images/foo.1257701686.png)}

# bar.css
div.content{background-image:url("/css/bar.1257701552.png")}

In other words not only does it parse your CSS content and gives images unique names you can set aggressive caching headers on, it will also unfold imported CSS files and optimize them too.

I think that's really useful. You with one single setting (settings.DJANGO_STATIC=True) you can get all your static resources massaged and prepare for the best possible HTTP optimization. Also, it's all automated so you never need to run any build scripts and the definition of what static resources to use (and how to optimize them) is all defined in the template. This I think makes a lot more sense than maintaining static resources in a config file.

The coverage is 93% and there is an example app to look at in the if you prefer that over a README.