Building a database of UK house prices


The UK Land Registry keeps a record of the date and price of every house sale in the country and makes this data available on its website. It provides an online tool for searching entries related to individual properties and even has a SPARQL interface, but here we download the raw data and load it into a PostgreSQL database with Python.

First, get the "complete transaction" data in comma-separated (csv) format from this page. At the time of writing, this file, pp-complete.csv, is 3.2 GB in size (there seems to be no compressed version available).
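If you'd rather script the download than fetch the file through a browser, a minimal sketch in Python follows. The URL here is a placeholder (substitute the actual link from the Land Registry's data page), and the file is streamed to disk in chunks since it is far too large to hold in memory.

# download_pp.py
import urllib.request

# Placeholder URL: substitute the real link from the Land Registry page.
PP_URL = 'https://example.com/pp-complete.csv'

with urllib.request.urlopen(PP_URL) as response:
    with open('pp-complete.csv', 'wb') as f:
        while True:
            chunk = response.read(1 << 20)    # read 1 MB at a time
            if not chunk:
                break
            f.write(chunk)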

If you don't already have it, download and install PostgreSQL.

We won't try to insert all of the fields from the price-paid data file into our database, but will instead limit ourselves to the following:

  • price: Sale price stated on the transfer deed
  • date_of_transfer: Date when the sale was completed, as stated on the transfer deed
  • postcode: Eight-character postcode of the property
  • paon: Primary Addressable Object Name, e.g. house name and/or number
  • saon: Secondary Addressable Object Name. If there is a sub-building, for example where the building is divided into flats, there will be a SAON
  • street
  • locality
  • town_city: Town or city name
  • district
  • county

Open a PostgreSQL session from the command line with psql and create and connect to a new database called house_prices:

# CREATE DATABASE house_prices;
CREATE DATABASE
# \c house_prices
You are now connected to database "house_prices" as user "christian".
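(If you'd rather stay in Python, psycopg2 can create the database too; note that CREATE DATABASE can't run inside a transaction, so autocommit must be switched on first. The credentials below are placeholders.)

import psycopg2

# Connect to the default maintenance database to issue CREATE DATABASE.
conn = psycopg2.connect(database='postgres', user='<USERNAME>',
                        password='<PASSWORD>')
conn.autocommit = True    # CREATE DATABASE cannot run inside a transaction
cursor = conn.cursor()
cursor.execute('CREATE DATABASE house_prices;')
conn.close()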

Create a table, pp, with columns representing the above fields:

house_prices=# CREATE TABLE pp (
                 id serial primary key,
                 price integer NOT NULL,
                 date_of_transfer date NOT NULL,
                 postcode char(8) NOT NULL,
                 PAON varchar(100) NULL,
                 SAON varchar(100) NULL,
                 street varchar(200) NULL,
                 locality varchar(200) NULL,
                 town_city varchar(100) NULL,
                 district varchar(100) NULL,
                 county varchar(60) NULL
               );
CREATE TABLE

Also add indexes to some of the columns to speed up searching:

house_prices=# CREATE INDEX ON pp (price);
CREATE INDEX
house_prices=# CREATE INDEX ON pp (postcode);
CREATE INDEX
house_prices=# CREATE INDEX ON pp (county);
CREATE INDEX
house_prices=# CREATE INDEX ON pp (date_of_transfer);
CREATE INDEX

Now to upload the data. The following Python script enables one to do this in chunks using the psycopg2 Python-PostgreSQL Database Adapter. Doing it in one go might take longer than you'd like, so set nstart and nend to the line numbers (starting at 1!) in pp-complete.csv that you'd like to upload.

# upload_pp.py
import psycopg2
from collections import namedtuple
import sys
import csv

# to upload the first million lines set nstart and nend as follows:
nstart = 1
nend   = 1000000

conn = psycopg2.connect(database='house_prices', user='<USERNAME>',
                        password='<PASSWORD>')
cursor = conn.cursor()

Field = namedtuple('Field', 'name index type max_length')

def parse_date(s):
    # The CSV stores dates as 'YYYY-MM-DD 00:00'; keep the date part only.
    return s[:10]

def strip_nonascii(s):
    # Replace any non-ASCII characters with a space.
    return ''.join([c if ord(c) < 128 else ' ' for c in s])

price = Field('price', 1, int, None)
date_of_transfer = Field('date_of_transfer', 2, parse_date, None)
postcode = Field('postcode', 3, str, 8)
paon = Field('paon', 7, str, 100)
saon = Field('saon', 8, strip_nonascii, 100)
street = Field('street', 9, str, 200)
locality = Field('locality', 10, str, 200)
town_city = Field('town_city', 11, str, 100)
district = Field('district', 12, str, 100)
county = Field('county', 13, str, 60)

output_fields = [price, date_of_transfer, postcode, paon, saon, street,
                 locality, town_city, district, county]
field_names = [field.name for field in output_fields]
cs_field_names = ', '.join(field_names)
# The INSERT query is the same for every row, so build it once.
query = 'INSERT INTO pp ({0:s}) VALUES ({1:s});'.format(
            cs_field_names, ','.join(['%s'] * len(output_fields)))

pp_name = 'pp-complete.csv'

with open(pp_name, encoding='cp1252') as pp_file:
    pp_reader = csv.reader(pp_file)

    for row in pp_reader:
        if pp_reader.line_num < nstart:
            continue
        if pp_reader.line_num > nend:
            break

        data = []
        for field in output_fields:
            s = row[field.index]
            field_value = field.type(s)
            # Truncate to the declared maximum column width, if any.
            if field.max_length:
                field_value = field_value[:field.max_length]
            data.append(field_value)

        cursor.execute(query, data)

        # Report progress every 10,000 lines.
        if not pp_reader.line_num % 10000:
            print(pp_reader.line_num)
conn.commit()

There are a few things to note:

  • We store information about each field to be uploaded to the database in a namedtuple.
  • The date of transfer is stored in pp-complete.csv in the format 'YYYY-MM-DD 00:00' (don't ask why), so to get a string corresponding to the date only, we slice this field as s[:10].
  • The Land Registry folk don't state a character encoding for their csv files, but given the UK government's apparent fondness for Excel files in so many areas of its open data activities, one can guess cp1252 might work. This turns out to be important (for the current version of pp-complete.csv) because of some stray non-ASCII characters which Python 3 chokes on when it tries to interpret them using its default UTF-8 encoding. We strip out these characters from the field they appear in using the function strip_nonascii.
  • Don't forget to conn.commit() when we're done, or the data won't be INSERTed at all!
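As an aside: if the row-by-row INSERTs prove too slow, PostgreSQL's COPY machinery is usually much faster, and psycopg2 exposes it through cursor.copy_from. A rough sketch, reusing nstart, nend, pp_name, output_fields, field_names, cursor and conn from the script above (note it doesn't escape tabs or backslashes in field values, so treat it as a starting point rather than a drop-in replacement):

import io

# Buffer the selected fields as tab-separated text in memory...
buf = io.StringIO()
with open(pp_name, encoding='cp1252') as pp_file:
    pp_reader = csv.reader(pp_file)
    for row in pp_reader:
        if pp_reader.line_num < nstart:
            continue
        if pp_reader.line_num > nend:
            break
        buf.write('\t'.join(str(field.type(row[field.index]))
                            for field in output_fields) + '\n')
buf.seek(0)

# ...and bulk-load it with COPY.
cursor.copy_from(buf, 'pp', sep='\t', columns=field_names)
conn.commit()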

To test the database, upload the first 6 million rows (say) and find the most expensive house sold in this subset:

house_prices=# select max(price) from pp;
   max    
----------
 54959000
(1 row)

house_prices=# select * from pp where price=54959000;
   id    |  price   | date_of_transfer | postcode | paon | saon |   street    | locality | town_city |        district        |         county         
---------+----------+------------------+----------+------+------+-------------+----------+-----------+------------------------+------------------------
 1405652 | 54959000 | 2012-03-26       | SW10 9SU | 20   |      | THE BOLTONS |          | LONDON    | KENSINGTON AND CHELSEA | KENSINGTON AND CHELSEA
(1 row)

This property was sold for £54,959,000 in March 2012.
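The same queries can, of course, be run from Python with psycopg2 (the credentials are again placeholders):

import psycopg2

conn = psycopg2.connect(database='house_prices', user='<USERNAME>',
                        password='<PASSWORD>')
cursor = conn.cursor()

# Find the highest price in the uploaded subset...
cursor.execute('SELECT max(price) FROM pp;')
max_price, = cursor.fetchone()

# ...and retrieve the corresponding record(s).
cursor.execute('SELECT * FROM pp WHERE price = %s;', (max_price,))
for record in cursor.fetchall():
    print(record)
conn.close()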
