I just ordered tea from the Min River Tea Farm

February 27, 2011
0 comments Misc. links

I just ordered myself one bag of Jasmine Pearls from The Min River Tea Farm that my friend Chris has recently launched.

As soon as I get my tea I'm going to take a picture of myself drinking it and send in my pic so that £1 gets donated, by Chris, to the Mind UK charity.

If you live in the UK and love genuinely sourced Chinese teas do check it out. The ordering process is lovingly easy and safe. Although I'm an Earl Grey fan myself, I love having some good jasmine tea available at home without having to worry about the caffeine in Earl Grey tea keeping me awake.

Best of luck to Chris and his new site! Please take the time to browse and read about his teas. If you're outside the UK and you want a bag, just send him an email.

Eloquent Javascript by Marijn Haverbeke

February 25, 2011
0 comments JavaScript

What a lovely title for a book! I wanted to read a book about the proper way to write Javascript but I couldn't wait any longer for John Resig's Secrets of the JavaScript Ninja which isn't out in print yet (also a great title by the way).

Eloquent Javascript begins very lightly with the basics of Javascript programming: variables, scope, data structures and control flow. To be perfectly honest I didn't read it very carefully but I believe I did pick up a thing or two at least. The chapter on error handling was useful but the really interesting chapters were "Functional Programming" and "Object-Oriented Programming". What I love about Marijn's style of writing is that he starts very, very simple and builds up the code to be better and better, based not on what you can do but on why you should do it. As you go along you can then immediately snap up what the benefits are for yourself. Sometimes it's brevity and sometimes it's faster performance. My only criticism, if I'm allowed one, is that the jargon is quite a lot to keep up with, especially around "constructors" and "prototypes", which are sometimes easy to forget (especially if you're from another language where these things mean different things).

I'm not great at it but I already knew how to write modules and "classes" so ultimately there wasn't a whole lot to take away from it to be honest. Some tricks, such as the inheritance function which Marijn introduced, were neat and that might be something I'll copy. Nevertheless, this book showed and educated me in why we do things as modules and stuff, which I genuinely appreciated.

Thanks for a great book Marijn! Keep up the good work!

Connecting with psycopg2 without a username and password

February 24, 2011
12 comments Python

My colleague Lukas and I banged our heads against this for much too long today. So, our SQLAlchemy was configured like this:


ENV_DB_CONNECTION_DSN = postgresql://localhost:5432/mydatabase

And the database doesn't have a password (local) so I can log in to it like this on the command line:


$ psql mydatabase

Which assumes the username peterbe, which is what I'm logged in as. So, this is a shortcut for doing this:


$ psql mydatabase -U peterbe

Which assumes a blank/empty password.
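To make that psql fallback concrete, here's a small Python sketch: if the DSN carries no username, use the operating system user instead. Note that effective_user and its os_user parameter are made-up names for illustration only. They are not part of psycopg2 or SQLAlchemy, and this says nothing about how those libraries resolve the user themselves.

```python
import getpass
from urllib.parse import urlsplit


def effective_user(dsn, os_user=None):
    """Return the username in the DSN, else fall back to the OS user (as psql does)."""
    username = urlsplit(dsn).username
    if username:
        return username
    # os_user is a hypothetical override so the fallback can be tested
    return os_user if os_user is not None else getpass.getuser()


# No username in the DSN, so the OS user is used
print(effective_user("postgresql://localhost:5432/mydatabase", os_user="peterbe"))
# An explicit username in the DSN wins
print(effective_user("postgresql://lukas@localhost:5432/mydatabase", os_user="peterbe"))
```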


Optimization of getting random rows out of a PostgreSQL in Django

February 23, 2011
48 comments Django

There was a really interesting discussion on the django-users mailing list about the most efficient way to select random elements out of a SQL database. I knew using a regular RANDOM() in SQL can be very slow on big tables but I didn't know by how much. Had to run a quick test!

Cal Leeming discussed a snippet of his for paginating huge tables which uses the MAX(id) aggregate function.

So, I did a little experiment on a table with 84,000 rows in it. Realistic enough to matter even though it's less than millions. So, how long would it take to select 10 random items, 10 times? Benchmark code looks like this:


from time import time

TIMES = 10

def using_normal_random(model):
   # order_by('?') becomes ORDER BY RANDOM(), which sorts the whole table each time
   for i in range(TIMES):
       yield model.objects.all().order_by('?')[0].pk

t0 = time()
for i in range(TIMES):
   list(using_normal_random(SomeLargishModel))
t1 = time()
print("%s seconds" % (t1 - t0))

Result:


41.8955321312 seconds

Nasty!! Also running this you'll notice postgres spiking your CPU like crazy.

A much better approach is to use Python's random.randint(1, <max ID>). Looks like this:


from time import time
from random import randint
from django.db.models import Max

def using_max(model):
   # One aggregate query for the highest ID, then cheap primary key lookups
   max_ = model.objects.aggregate(Max('id'))['id__max']
   i = 0
   while i < TIMES:
       try:
           yield model.objects.get(pk=randint(1, max_)).pk
           i += 1
       except model.DoesNotExist:
           # No row with that ID (a gap in the sequence), so try another one
           pass

t0 = time()
for i in range(TIMES):
   list(using_max(SomeLargishModel))
t1 = time()
print("%s seconds" % (t1 - t0))

Result:


0.63835811615 seconds

Much more pleasant!

UPDATE

A commenter, Ken Swift, asked: what if your requirement is to select 100 random items instead of just 10? Won't those 101 database queries be more costly than just 1 query with a RANDOM()? The answer turns out to be no.

I changed the script to select 100 random items 1 time (instead of 10 items 10 times) and the times were the same:


using_normal_random() took 41.4467599392 seconds
using_max() took 0.6027739048 seconds

And what about 1000 items 1 time:


using_normal_random() took 204.685141802 seconds
using_max() took 2.49527382851 seconds

UPDATE 2

The algorithm for returning a generator has a couple of flaws:

  1. Can't pass in a QuerySet
  2. You get primary keys returned, not ORM instances
  3. You can't pass in a number
  4. Internally, it might randomly select a number already tried

Here's a much more complete function:


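import random
from django.db.models import Max, Min

def random_queryset_elements(qs, number):
    # Note: this never terminates if `number` exceeds the number of rows in qs
    assert number <= 10000, 'too large'
    max_pk = qs.aggregate(Max('pk'))['pk__max']
    min_pk = qs.aggregate(Min('pk'))['pk__min']
    ids = set()  # primary keys already yielded
    while len(ids) < number:
        next_pk = random.randint(min_pk, max_pk)
        while next_pk in ids:
            next_pk = random.randint(min_pk, max_pk)
        try:
            found = qs.get(pk=next_pk)
            ids.add(found.pk)
            yield found
        except qs.model.DoesNotExist:
            # No row with that primary key; pick another one
            pass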
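To play with the idea without a database, here's a minimal sketch in plain Python. It assumes a dict stands in for the table (the keys are primary keys with gaps in them). The names random_rows, table and rng are made up for illustration. Unlike the function above, it remembers every key it has tried, misses included, and it refuses to look for more rows than exist.

```python
import random


def random_rows(table, number, rng=random):
    """Yield `number` distinct random rows from `table`, a dict of pk -> row."""
    if number > len(table):
        raise ValueError("asked for more rows than the table holds")
    if number == 0:
        return
    min_pk, max_pk = min(table), max(table)
    tried = set()  # every primary key picked so far, hits and misses alike
    found = 0
    while found < number:
        pk = rng.randint(min_pk, max_pk)
        if pk in tried:
            continue  # never "query" the same key twice
        tried.add(pk)
        if pk in table:  # stands in for qs.get(pk=...) succeeding
            found += 1
            yield table[pk]


# A pretend table where every third primary key is missing
table = {pk: "row-%d" % pk for pk in range(1, 101) if pk % 3}
print(list(random_rows(table, 5)))
```

Tracking the misses costs a little memory, but it means a sparse table never gets asked about the same missing key twice.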

Nice testimonial about django-static

February 21, 2011
0 comments Django

My friend Chris is a Django newbie who has managed to build a whole e-shop site in Django. It will launch in a couple of days and when it launches I will blog about it here too. He sent me this today which gave me a smile:

"I spent today setting up django_static for the site, and optimising it for performance. If there's one thing I've learned from you, it's optimisation.

So, my homepage is now under 100KB (was 330KB), and it loads in @5-6 seconds from hard refresh (was 13-14 seconds at its worst). And I just got a 92 score on Yslow. I do believe I have the fastest tea website around now, and I still haven't installed caching.

Wicked huh?"

He's talking about using django-static. Then I get another email shortly after with this:

"correction - I get 97 on YSlow if I use a VPN.

I just found that the Great Firewall tags extra HTTP requests onto every request I make from my browser, pinging a server in Shanghai with a PHP script which probably checks the page for its content or if its on some kind of blocked list. Cheeky buggers!"

Isn't that interesting! (Note: Chris is based in China but hosts the test site in the UK)

How I profile my Nginx + proxy pass server

February 16, 2011
3 comments Web development, Python

Like so many others you probably have an Nginx server sitting in front of your application server (Django, Zope, Rails). The Nginx server serves static files right off the filesystem and when it doesn't do that it proxy passes the request on to the backend. You might be using proxy_pass, uwsgi or fastcgi_pass or at least something very similar. Most likely you have an Nginx site configured something like this:


server {
   access_log /var/log/nginx/mysite.access.log;
   location ^~ /static/ {
       root /var/lib/webapp;
       access_log off;
   }
   location / {
       proxy_pass http://localhost:8000;
   }
}

What I do is that I add an access log directive that times every request. This makes it possible to know how long every non-trivial request takes for the backend to complete:


# log_format is only allowed in the http context, so it goes outside server {}
log_format timed_combined '$remote_addr - $remote_user [$time_local]  '
                          '"$request" $status $body_bytes_sent '
                          '"$http_referer" "$http_user_agent" $request_time';

server {
   access_log /var/log/nginx/timed.mysite.access.log timed_combined;

   location ^~ /css/ {
       root /var/lib/webapp/static;
       access_log off;
   }
   location / {
       proxy_pass http://localhost:8000;
   }
}
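Once the log has a $request_time at the end of every line, picking out the slow requests is a small scripting job. Here's a rough Python sketch, not necessarily how I'd do it in production: LINE_RE and slowest_requests are made-up names, the regular expression assumes the timed_combined format exactly as written above, and the sample lines are invented for the example.

```python
import re

# Mirrors the timed_combined log_format; $request_time is the last field
LINE_RE = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\]\s+'
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)" '
    r'(?P<request_time>\d+(?:\.\d+)?)\s*$'
)


def slowest_requests(lines, top=3):
    """Return the `top` slowest (seconds, request) pairs, slowest first."""
    timed = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that don't follow the format
        timed.append((float(match.group("request_time")), match.group("request")))
    return sorted(timed, reverse=True)[:top]


# Invented sample lines, only here to show the shape of the log
sample = [
    '127.0.0.1 - - [16/Feb/2011:10:00:00 +0000]  "GET / HTTP/1.1" 200 5120 "-" "curl/7.21" 0.012',
    '127.0.0.1 - - [16/Feb/2011:10:00:01 +0000]  "GET /search?q=tea HTTP/1.1" 200 20480 "-" "curl/7.21" 1.384',
    '127.0.0.1 - - [16/Feb/2011:10:00:02 +0000]  "POST /login HTTP/1.1" 302 0 "-" "curl/7.21" 0.251',
]
for seconds, request in slowest_requests(sample, top=2):
    print("%.3f  %s" % (seconds, request))
```

In real use you'd pass it an open file handle on the timed access log instead of the sample list.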


DoneCal homepage now able to do 10,000 requests/second

February 13, 2011
0 comments DoneCal

I've done some work refactoring the homepage of DoneCal so that it does no logic other than just serving HTML. What it used to do was some basic security checks and stuff so that it says "Hi Peter" and a log out link. Now all of that has been moved to a single simple AJAX call.

BEFORE:


# ab -n 1000 -c 10 http://donecal.com/
...
Requests per second:    353.65 [#/sec] (mean)

AFTER:


# ab -n 1000 -c 10 http://donecal.com/
...
Requests per second:    9796.78 [#/sec] (mean)

# ab -n 1000 -c 10 http://donecal.com/auth/logged_in.json
...
Requests per second:    3756.25 [#/sec] (mean)

The reason loading the index.html can be so fast is that Nginx serves it directly as a static file. My Nginx config currently has to skip the static file if the request isn't a GET request or if it has a query string. I'll need to remove that stuff too, and then I can push the index.html file out to my AWS CloudFront CDN using a CNAME.

DoneCal is my first web application that is this Javascript heavy. It raises the bar in terms of HTTP optimization to get the best user experience possible. I love learning this new way of working.

EditDistanceMatcher - NodeJS script for doing edit distance 1 matching

February 5, 2011
0 comments JavaScript

I needed a very basic spell correction string matcher in my current NodeJS project so I wrote a simple class called EditDistanceMatcher that compares a string against another string and matches if it's 1 edit distance away. With it you can do things like Google search's "Did you mean: poop?" when you search for pop.

Note, this code doesn't rank by how popular the correct words are (e.g. "pop" might appear much more often than "poop", in which case a popularity-aware matcher would suggest "pop" if you enter "poup"). Anyway this simple snippet from the unit tests will reveal how it works:


     /* The match() method */
     var edm = new EditDistanceMatcher(["peter"]);
     // edm.match returns an array and remember,
     // in javascript ['peter'] == ['peter'] => false
     test.equal(edm.match("petter").length, 1);
     test.equal(edm.match("petter")[0], 'peter');
     test.equal(edm.match("junk").length, 0);

     /* the is_matched() method */
     var edm = new EditDistanceMatcher(["peter"]);
     test.equal(typeof edm.is_matched('petter'), 'boolean');
     test.equal(typeof edm.is_matched('junk'), 'boolean');
     test.ok(edm.is_matched("petter"));
     test.ok(!edm.is_matched("junk"));

The most basic use case is if you have a quiz and you want to accept some spelling mistakes. "What's the capital of Sweden?; STOKHOLM; Correct!"
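For anyone who'd rather see the idea in Python, here's a rough sketch of edit distance 1 matching. It is not a port of edit_distance.js: OneEditMatcher and edits1 are made-up names, the alphabet is assumed to be lowercase ASCII, and counting a swap of two neighbouring letters as a single edit is my assumption here.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """Return the strings reachable from `word` with one delete, swap, replace or insert."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


class OneEditMatcher:
    def __init__(self, words):
        self.words = {word.lower() for word in words}

    def match(self, candidate):
        # Replacing a letter with itself reproduces the input, so exact
        # matches are returned too.
        return sorted(edits1(candidate.lower()) & self.words)

    def is_matched(self, candidate):
        return bool(self.match(candidate))


quiz = OneEditMatcher(["Stockholm"])
print(quiz.is_matched("STOKHOLM"))  # one missing letter is forgiven
print(quiz.match("junk"))
```

Generating every one-edit variant of the input and intersecting it with the known words keeps the lookup cheap when the word list is large.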

For the unlazy this NodeJS code can very easily be used in a browser by simply removing the exports stuff.

edit_distance.js

tests/test_edit_distance.js

Note! I wrote this in an airport lounge so I'm sure it can be improved lots more.

DoneCal on MumbaiMirror

February 3, 2011
1 comment DoneCal

Here's a nice write up about DoneCal on MumbaiMirror

"All in all, DoneCal is one of those Web 2.0 tools that you wouldn’t really miss if it wasn’t around, but once you use it, you can’t go back."

They don't link to DoneCal, which I suspect is some sort of half-assed attempt to avoid too many outgoing links. Strangely, they've spent time writing about another web page but can't make a link to it. If I've learned anything from Google it's that the ultimate mantra of SEO is: don't try to be smarter than us, just write great content and let us worry about ranking.

If these guys are worried about that, why don't they use a rel="nofollow" attribute on the link?

DoneCal.com international visitors

January 21, 2011
0 comments DoneCal

For the first time in my life I've launched a web site/app that isn't mostly popular in the United States. Yay!(?) Not that I care or that it matters but it's worth noting. For some reason it's currently most popular in France, followed closely by China, and the United States doesn't show up until 5th place.

Of the United States visitors I'm not surprised the California bunch is more prominent. The service is quite new and quite technically interesting for people in the industry so I guess a lot of those visitors are Silicon Valley type folks.