Hacking the BCN housing market

By Tudor Barbu tudor.barbu@skyscanner.net

...not so long ago in a galaxy not so far away

my landlady told me to vacate the premises within a month :)

I like to do things better; that's why I usually wait until I'm older and wiser to do them.


Finding a good flat in Barcelona is hard


Especially when you are a guiri (a foreigner) who does not speak Spanish and/or Catalan

What am I doing wrong?

...a good question to start with

Starting point:

  • Clearly defined search criteria (location, budget, number of rooms)
  • A conversation cheatsheet in Spanish
  • Very limited time frame

The problem(s)

  • I would spend too much time going through the same results
  • I would usually call after the flat had already been rented out
  • the agent would suggest another flat, but that particular conversation was not on my cheatsheet

The solution(s)

  • prevent me from going over the same flats over and over again
  • help me avoid duplicates
  • pick the best price

While €20 is not a lot of money...

...it can buy you a lot of churros :)

Prevent me from going over the same flats over and over again

Store the links in a DB with a rejected flag, and only click through the non-rejected ones

  • id           primary key
  • external_id  134370443 (unique key)
  • rejected     1/0 (index)
  • url          http://fotocasa.es/vivienda/barcelona-capital/ascensor-valencia-80-134370443

No fancy interface needed, Sequel Pro FTW
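The table above can be sketched directly in SQL. A minimal stand-in using SQLite instead of MySQL (only the column names come from the slide; everything else is illustrative):

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL table described above
db = sqlite3.connect(':memory:')
db.execute("""
    CREATE TABLE house (
        id          INTEGER PRIMARY KEY,
        external_id INTEGER UNIQUE,
        rejected    INTEGER,
        url         TEXT
    )
""")
db.execute("CREATE INDEX idx_rejected ON house (rejected)")

db.execute(
    "INSERT INTO house (external_id, rejected, url) VALUES (?, ?, ?)",
    (134370443, 0,
     'http://fotocasa.es/vivienda/barcelona-capital/ascensor-valencia-80-134370443'))

# "only click through the non-rejected ones"
todo = db.execute("SELECT url FROM house WHERE rejected = 0").fetchall()
```

Flipping `rejected` to 1 after a visit (or a disappointing phone call) is a one-line UPDATE in Sequel Pro.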

Scrapying the pages

Why Scrapy?

I'm *not* a Python programmer!

I did some Python projects in the past and needed to get back up to speed with it

Scrapy is probably the most popular scraping framework

  • First result in Google for "scraping framework"
  • Popular means most problems have been already dealt with by someone else
  • Just go to StackOverflow

Using Scrapy

Scrapy architecture

or the quick solution

$ scrapy runspider house.py

Initial implementation

import scrapy
import peewee
from playhouse.db_url import connect

# local MySQL, credentials elided in the slides
db = connect('mysql://root:@')

class House(peewee.Model):
    external = peewee.IntegerField(db_column='external_id', unique=True)
    rejected = peewee.IntegerField(index=True, null=True)
    url      = peewee.CharField()

    class Meta:
        database = db

class HouseSpider(scrapy.Spider):
    name       = 'HouseSpider'
    start_urls = [
        # the fotocasa.es search result pages (elided in the slides)
    ]

    def parse(self, response):
        for li in response.css('div#photo-content ul.listPhotos li'):
            url, external_id = self.extract_url_info(li.css(
                    'div.property-information a.property-location'
                    '::attr(href)').extract()[0])

            house = House(external=external_id, url=url)
            try:
                house.save()
            except peewee.IntegrityError:
                pass  # already in the DB; the unique key filters out repeats

    def extract_url_info(self, long_url):
        url = long_url[0:long_url.find('?')]
        external_id = url[url.rfind('-') + 1:]

        return url, external_id

...and it worked on the first try!

which I must admit was a bit unexpected

How about the second part?

  • eliminate duplicates
  • find the smallest price

What cannot be used to detect duplicates

  • external_id
  • geo position
  • area (sqm)
  • other attributes that are not obvious during a visit

What can be used to detect duplicates

  • number of rooms
  • images
  • street name(ish)

...or a combination of the three

How to find out that two images are similar

Can't use cryptographic hashes because of the avalanche effect: a tiny change in the input flips roughly half the output bits
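A quick illustration, using MD5 as the hash: two inputs that differ by a single byte produce digests that differ in almost every hex digit, so two near-identical photos of the same flat get completely unrelated hashes.

```python
import hashlib

# Two "images" that differ by a single byte
a = hashlib.md5(b'photo of a flat').hexdigest()
b = hashlib.md5(b'photo of a flat.').hexdigest()

# Count the hex digits where the two digests differ
diff = sum(x != y for x, y in zip(a, b))
print(diff, 'of', len(a), 'digits differ')
```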

Mechanical Turk: high cost in time & money



Amazing resource :)

...with a lot of "copy-paste"-able content
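The technique in question is perceptual hashing. A minimal difference-hash (dHash) sketch in pure Python, operating on a plain 2-D list of grayscale values (a real version would first resize each photo to a 9x8 grayscale grid with Pillow), plus the `hamming_distance` helper the spider uses later:

```python
def dhash(pixels):
    """Difference hash: one bit per horizontally adjacent pixel pair.

    pixels: rows of grayscale values; a 9x8 grid yields a 64-bit hash.
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left < right else 0)
    return bits

def hamming_distance(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count('1')

# Two near-identical tiny "images" (slight brightness change):
# the gradient directions survive, so the hashes land close together
img1 = [[10, 200, 30], [40, 50, 220]]
img2 = [[12, 199, 30], [40, 52, 221]]

d = hamming_distance(dhash(img1), dhash(img2))
```

Because dHash encodes brightness gradients rather than raw bytes, resized or re-compressed copies of the same photo stay within a small Hamming distance of each other.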

New and improved

    # needs: import re, urllib, cStringIO, plus a Picture model and the
    # hamming_distance() helper, defined elsewhere in the script

    def parse(self, response):
        for li in response.css('div#photo-content ul.listPhotos li'):
            url, external_id = self.extract_url_info(li.css(
                    'div.property-information a.property-location'
                    '::attr(href)').extract()[0])

            try:
                House.get(House.external == external_id)
            except peewee.DoesNotExist:
                # not seen before: fetch the detail page for the full data
                yield scrapy.Request(url,
                        callback=self.parse_secondary_request,
                        meta={
                            'url' : url,
                            'external_id' : external_id
                        })

    def parse_secondary_request(self, response):
        rooms = response.css('#litRooms b::text').extract()[0]

        try:
            bathrooms = response.css('#litBaths b::text').extract()[0]
        except IndexError:
            bathrooms = 0

        # keep the street name: skip the title prefix, stop at the comma
        street = response.css('h1.property-title::text').extract()[0].strip()
        comma  = street.find(',')
        street = street[9:comma if comma != -1 else len(street)]

        price = int(re.search(r'\d+',
                response.css('#priceContainer::text').extract()[0]).group())

        try:
            floor = int(re.search(r'\d+',
                    response.css('#litFloor::text').extract()[0]).group())
        except IndexError:
            floor = 0

        hashes = self.get_image_hashes(response)
        houses = House.select(House, Picture).join(Picture).where(
             House.rooms == rooms,
             House.bathrooms == bathrooms,
             House.floor == floor
        )

        def find(houses, hashes):
            # a house is a duplicate if any of its stored pictures is close
            # enough to any of the freshly scraped ones
            for house in houses:
                for picture in house.pictures:
                    for hash in hashes:
                        if hamming_distance(picture.hash, hash) <= 10:
                            return house
            return None

        if houses.count() == 0:
            duplicate = None
        else:
            duplicate = find(houses, hashes)

        # insert if the flat is new, or if this listing is cheaper
        insert = duplicate is None or duplicate.price > price

        if insert:
            house = House(
                bathrooms = bathrooms,
                floor     = floor,
                price     = price,
                rooms     = rooms,
                external  = response.meta['external_id'],
                street    = street,
                url       = response.meta['url']
            )
            house.save()
            for hash in hashes:
                picture = Picture(
                    house = house.id,
                    hash  = hash
                )
                picture.save()

    def get_image_hashes(self, response):
        hashes = []
        for img in response.css('.carousel_slide img'):
            path = img.xpath('@data-src').extract()
            if len(path) == 0:
                path = img.xpath('@src').extract()

            path = path[0]
            image = cStringIO.StringIO(urllib.urlopen(path).read())
            # perceptual hash computation elided in the slides, e.g.
            # hashes.append(dhash(Image.open(image)))

        return hashes

Did it work?


...kind of

Some stats

  • 174 lines of spaghetti code
  • took around 3-4 hours to develop (learning curve)
  • would take a Python dev < 30 min

Automating tasks


If This Then That
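One way to wire the scraper into IFTTT is its Webhooks service: fire an event whenever a new flat is inserted and let an applet turn it into a push notification or SMS. A sketch in Python 3; the event name and key below are placeholders for your own applet:

```python
import json
import urllib.request

def notify_new_flat(url, price, event='new_flat', key='YOUR_IFTTT_KEY'):
    """Build the IFTTT Webhooks request announcing a freshly inserted flat."""
    endpoint = ('https://maker.ifttt.com/trigger/%s/with/key/%s'
                % (event, key))
    # IFTTT passes value1..value3 through to the applet's action
    payload = json.dumps({'value1': url, 'value2': str(price)}).encode()
    return urllib.request.Request(
        endpoint, data=payload,
        headers={'Content-Type': 'application/json'})

# fire it right after house.save():
#     urllib.request.urlopen(notify_new_flat(house.url, price))
```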

Feedback welcome