The UK Land Registry keeps a record of the date and price of every house sale in the country and makes this data available on its website. It provides an online tool for searching entries related to individual properties and even has a SPARQL interface, but here we download the raw data and load it into a PostgreSQL database with Python.
First, get the "complete transaction" data in comma-separated (csv) format from the Land Registry's price paid data download page. At the time of writing, this file, pp-complete.csv, is 3.2 GB in size (there seems to be no compressed version available).
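If you prefer to script the download, here is a minimal sketch that streams the file to disk in chunks rather than holding 3.2 GB in memory; the URL below is a placeholder, so substitute the actual pp-complete.csv link from the download page:

import urllib.request

# Placeholder URL: substitute the pp-complete.csv link from the Land
# Registry's price paid data page.
url = 'https://example.com/pp-complete.csv'

with urllib.request.urlopen(url) as response, \
     open('pp-complete.csv', 'wb') as f:
    # Stream the ~3.2 GB file to disk in 1 MB chunks.
    while True:
        chunk = response.read(1024 * 1024)
        if not chunk:
            break
        f.write(chunk)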
If you don't already have it, download and install PostgreSQL.
We won't try to insert all of the fields from the price-paid data file into our database, but will instead limit ourselves to the following (an illustrative csv row showing where these fields sit is given after this list):
price: Sale price stated on the transfer deed
date_of_transfer: Date when the sale was completed, as stated on the transfer deed
postcode: Eight-character postcode of the property
paon: Primary Addressable Object Name, e.g. house name and/or number
saon: Secondary Addressable Object Name: if there is a sub-building, for example because the building is divided into flats, there will be a SAON
street: Street name
locality: Locality name
town_city: Town or city name
district: District name
county: County name
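For orientation, here is an illustrative row in the pp-complete.csv layout (invented values, not a real record), assuming the column order documented for the price paid data; the zero-based indices shown are the ones the upload script below relies on:

import csv
import io

# One invented line in the pp-complete.csv format.
line = ('"{ILLUSTRATIVE-GUID}","250000","2012-03-26 00:00","SW10 9SU","T",'
        '"N","F","20","","THE BOLTONS","","LONDON",'
        '"KENSINGTON AND CHELSEA","KENSINGTON AND CHELSEA","A","A"')
row = next(csv.reader(io.StringIO(line)))
print(row[1], row[2], row[3])    # price, date_of_transfer, postcode
print(row[7], row[13])           # paon, county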
Open a PostgreSQL session from the command line with psql and create and connect to a new database called house_prices:
# CREATE DATABASE house_prices;
CREATE DATABASE
# \c house_prices
You are now connected to database "house_prices" as user "christian".
Create a table, pp, with columns representing the above fields:
house_prices=# CREATE TABLE pp (
id serial primary key,
price integer NOT NULL,
date_of_transfer date NOT NULL,
postcode char(8) NOT NULL,
paon varchar(100) NULL,
saon varchar(100) NULL,
street varchar(200) NULL,
locality varchar(200) NULL,
town_city varchar(100) NULL,
district varchar(100) NULL,
county varchar(60) NULL
);
CREATE TABLE
Also add indexes to some of the columns to speed up searching:
house_prices=# CREATE INDEX ON pp (price);
CREATE INDEX
house_prices=# CREATE INDEX ON pp (postcode);
CREATE INDEX
house_prices=# CREATE INDEX ON pp (county);
CREATE INDEX
house_prices=# CREATE INDEX ON pp (date_of_transfer);
CREATE INDEX
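If you'd rather set up the table and indexes from Python than at the psql prompt, here is a minimal equivalent sketch with psycopg2 (substitute your own credentials):

import psycopg2

conn = psycopg2.connect(database='house_prices', user='<USERNAME>',
                        password='<PASSWORD>')
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS pp (
        id serial primary key,
        price integer NOT NULL,
        date_of_transfer date NOT NULL,
        postcode char(8) NOT NULL,
        paon varchar(100) NULL,
        saon varchar(100) NULL,
        street varchar(200) NULL,
        locality varchar(200) NULL,
        town_city varchar(100) NULL,
        district varchar(100) NULL,
        county varchar(60) NULL
    );""")
# Index the columns we expect to search on.
for column in ('price', 'postcode', 'county', 'date_of_transfer'):
    cursor.execute('CREATE INDEX ON pp ({});'.format(column))
conn.commit()
conn.close()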
Now to upload the data. The following Python script enables one to do this in chunks using the psycopg2 Python-PostgreSQL Database Adapter. Doing it in one go might take longer than you'd like, so set nstart and nend to the line numbers (starting at 1!) in pp-complete.csv that you'd like to upload.
# upload_pp.py
import csv
from collections import namedtuple

import psycopg2

# To upload the first million lines, set nstart and nend as follows:
nstart = 1
nend = 1000000

conn = psycopg2.connect(database='house_prices', user='<USERNAME>',
                        password='<PASSWORD>')
cursor = conn.cursor()

# Each Field bundles a database column name, its index within a csv row,
# a function converting the raw string into the value to insert, and the
# maximum field length from the table definition (not enforced here).
Field = namedtuple('Field', 'name index type max_length')

def parse_date(s):
    # The date is given as 'YYYY-MM-DD 00:00'; keep the date part only.
    return s[:10]

def strip_nonascii(s):
    # Replace any non-ASCII characters with a space.
    return ''.join([c if ord(c) < 128 else ' ' for c in s])

price = Field('price', 1, str, None)
date_of_transfer = Field('date_of_transfer', 2, parse_date, None)
postcode = Field('postcode', 3, str, 8)
paon = Field('paon', 7, str, 100)
saon = Field('saon', 8, strip_nonascii, 100)
street = Field('street', 9, str, 200)
locality = Field('locality', 10, str, 200)
town_city = Field('town_city', 11, str, 100)
district = Field('district', 12, str, 100)
county = Field('county', 13, str, 60)

output_fields = [price, date_of_transfer, postcode, paon, saon, street,
                 locality, town_city, district, county]
field_names = [field.name for field in output_fields]
cs_field_names = ', '.join(field_names)
query = 'INSERT INTO pp ({0:s}) VALUES ({1:s});'.format(
            cs_field_names, ','.join(['%s'] * len(output_fields)))

pp_name = 'pp-complete.csv'
with open(pp_name, encoding='cp1252') as pp_file:
    pp_reader = csv.reader(pp_file)
    for row in pp_reader:
        # Skip rows before nstart and stop after nend (numbered from 1).
        if pp_reader.line_num < nstart:
            continue
        if pp_reader.line_num > nend:
            break
        # Extract and convert the fields we want from this row.
        data = []
        for field in output_fields:
            s = row[field.index]
            field_value = field.type(s)
            data.append(field_value)
        cursor.execute(query, data)
        # Report progress every 10000 rows.
        if not pp_reader.line_num % 10000:
            print(pp_reader.line_num)
conn.commit()
There are a few things to note:
The namedtuple, Field, gathers together everything we need to know about each database column: its name, its index within a row of the csv file, the function used to turn the raw string into the value to insert, and the maximum field length from the table definition (kept for reference; the script does not enforce it).
The date of transfer is stored in pp-complete.csv in the format 'YYYY-MM-DD 00:00' (don't ask why), so to get a string corresponding to the date only, we slice this field as s[:10].
The Land Registry folk don't state a character encoding for their csv files, but given the UK government's apparent fondness for Excel files in so many areas of its open data activities, one can guess cp1252 might work. This turns out to be important (for the current version of pp-complete.csv) because of some stray non-ASCII characters which Python 3 chokes on when it tries to interpret them using its default utf-8 encoding. We strip out these characters from the field they appear in using the function strip_nonascii.
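For example, with the two helper functions defined in the script above:

>>> parse_date('2012-03-26 00:00')
'2012-03-26'
>>> strip_nonascii('FLAT 1, CAFÉ HOUSE')
'FLAT 1, CAF  HOUSE'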
Don't forget to call conn.commit() when the upload is done, or the data won't be INSERTed at all!
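If the one-row-at-a-time INSERTs prove too slow, psycopg2 can send the rows in batches. Here is a sketch using psycopg2.extras.execute_values, reusing conn, cursor, csv, output_fields, cs_field_names and pp_name from upload_pp.py above; unlike the script, it uploads the whole file without the nstart/nend windowing:

from psycopg2.extras import execute_values

query = 'INSERT INTO pp ({}) VALUES %s'.format(cs_field_names)
batch, batch_size = [], 10000
with open(pp_name, encoding='cp1252') as pp_file:
    for row in csv.reader(pp_file):
        batch.append([field.type(row[field.index])
                      for field in output_fields])
        if len(batch) == batch_size:
            # execute_values expands the single %s placeholder into a
            # list of value tuples, sending many rows per statement.
            execute_values(cursor, query, batch)
            batch = []
    # Flush any remaining partial batch.
    if batch:
        execute_values(cursor, query, batch)
conn.commit()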
To test the database, upload the first 6 million rows (say) and find the most expensive house sold in this subset:
house_prices=# select max(price) from pp;
max
----------
54959000
(1 row)
house_prices=# select * from pp where price=54959000;
id | price | date_of_transfer | postcode | paon | saon | street | locality | town_city | district | county
---------+----------+------------------+----------+------+------+-------------+----------+-----------+------------------------+------------------------
1405652 | 54959000 | 2012-03-26 | SW10 9SU | 20 | | THE BOLTONS | | LONDON | KENSINGTON AND CHELSEA | KENSINGTON AND CHELSEA
(1 row)
This property was sold for £54,959,000 in March 2012.
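The same check can, of course, be made from Python rather than at the psql prompt; a minimal sketch with psycopg2 (credentials as before):

import psycopg2

conn = psycopg2.connect(database='house_prices', user='<USERNAME>',
                        password='<PASSWORD>')
cursor = conn.cursor()
cursor.execute('SELECT max(price) FROM pp;')
max_price, = cursor.fetchone()
# Fetch the full record(s) for the most expensive sale.
cursor.execute('SELECT * FROM pp WHERE price = %s;', (max_price,))
print(cursor.fetchone())
conn.close()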