…is the name of the presentation I gave last week at the Barcelona Python Meetup, held for the first time on Skyscanner's premises. It was supposed to be a 10-minute lightning talk about my adventure finding a flat in Barcelona, but it got stretched out to around 20 minutes.
Once upon a time…
…my otherwise amazing landlady kicked me out 🙂 and I had a month to find a new flat. Since I'm currently procrastinating reading a book on how to fight procrastination, it's quite clear that I did not start looking for a flat right away; instead I waited until the deadline was close enough to force me into full panic mode. A panic amplified by the realisation that while finding a place in Barcelona is generally hard, it's incredibly hard if you don't speak Spanish. (Note: learning Spanish is on my to-do list, somewhere after the aforementioned book on procrastination 🙂 )
After a few days of unsuccessful attempts, I started to look for ways of optimising the process.
Where to start?
The best place to start is usually by analysing the current process, seeing where the bottlenecks are and optimising from there. I was using fotocasa, and I had a clear set of search criteria – price, location, number of rooms – that I would always punch in, plus a conversation cheatsheet in Spanish – an already translated text that I would parrot to the letting agent. My process was simple: go through the list of flats until I found one that I liked and call the number. Results ranged from no answer, to the landlord realising I'm a foreigner and trying to scam me into paying the guiri tax, but most often I would be told that the flat had just been rented out.
And here is the first bottleneck: I was spending too much time going through the same results over and over again. If I didn't want to end up on the streets – not that bad, since the weather is generally friendly in Barcelona, but still – I needed to be the first person calling, not the one losing time browsing the results list.
All I needed was a list of all the flats where I could cross out the ones I didn't like once and for all, while keeping the list updated with every new ad matching my criteria.
Scrapy to the rescue
I chose Scrapy because it's the most popular scraping framework – first result in Google for this keyword – and when something is this popular, it means that for any problem I might have, somebody else has already posted a solution online.
Off topic: I can't stop wondering what the losses to the software industry would be if StackOverflow ever went offline.
That being said, Scrapy is an awesome, professional-level framework. But I went for the quick and dirty solution:
$ scrapy runspider house.py
which allows me to run a spider without creating a project.
That, together with peewee, a lightweight DB access abstraction layer, helped me solve the first problem. I would crawl the site, get the links to the flats and store them in the database with a true/false rejected flag. Once I decided I didn't want a flat, I would mark it as rejected and never see it again. Easy to do, since fotocasa uses unique IDs in their URLs. For example, the correct URL:
displays the same page as a made up one:
as long as the ID 136576807 stays in place. This was enough to make sure each entry would only be stored once.
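The storage side can be sketched with the standard library's sqlite3 in place of peewee (the table layout, helper names and URLs below are my own illustration, not the original script):

```python
import sqlite3

# In-memory DB for the sketch; the real script would use a file on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS flats (
        id INTEGER PRIMARY KEY,   -- the unique ID taken from the listing URL
        url TEXT NOT NULL,
        rejected INTEGER NOT NULL DEFAULT 0
    )"""
)

def store_flat(flat_id, url):
    """Insert a listing once; re-crawling the same ID is a no-op."""
    conn.execute(
        "INSERT OR IGNORE INTO flats (id, url) VALUES (?, ?)", (flat_id, url)
    )
    conn.commit()

def reject(flat_id):
    """Cross a flat out once and for all."""
    conn.execute("UPDATE flats SET rejected = 1 WHERE id = ?", (flat_id,))
    conn.commit()

def fresh_flats():
    """Everything not yet rejected, i.e. the list still worth browsing."""
    rows = conn.execute("SELECT url FROM flats WHERE rejected = 0")
    return [url for (url,) in rows]

store_flat(136576807, "https://example.com/flat-136576807")
store_flat(136576807, "https://example.com/flat-136576807")  # crawled again: ignored
store_flat(123456789, "https://example.com/flat-123456789")
reject(123456789)
print(fresh_flats())  # only the flat that was not rejected remains
```

The primary key on the ID is what does the deduplication work: `INSERT OR IGNORE` silently drops any entry already seen, so re-crawling the site costs nothing.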
The hard part
But what about multiple entries for the same flat, posted by different agencies? This gets even trickier, since depending on the agency, the price can be lower 🙂 Not by much, but still.
This is the same flat, just that the one on the left is €20 cheaper. And from my own experience, I can tell you that that buys you a lot of churros 🙂
In this case, finding duplicates is *really* difficult, because real estate agents tend to inflate the numbers when they post something online, to make it more appealing – especially for things that are not obvious during a visit, like the surface in square metres. Nobody (without OCD) will notice that a flat is just 75sqm instead of 80sqm.
Of course, if the ad says 3 rooms and there’s only one, people might notice.
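Since the advertised numbers drift between agencies, exact matching on the attributes is hopeless; one workaround is a loose, tolerance-based comparison. A minimal sketch – the thresholds and the listing fields are invented for illustration, not taken from my script:

```python
def probably_same_flat(a, b, price_tol=30, sqm_tol=0.10):
    """Treat two listings as a likely duplicate when price and surface
    agree within loose tolerances. Rooms must match exactly, since a
    3-room ad for a 1-room flat is the kind of lie visitors notice."""
    if a["rooms"] != b["rooms"]:
        return False
    if abs(a["price"] - b["price"]) > price_tol:
        return False
    # Surface is compared relatively: 75 sqm vs 80 sqm falls within 10%.
    return abs(a["sqm"] - b["sqm"]) <= sqm_tol * max(a["sqm"], b["sqm"])

ad1 = {"price": 900, "sqm": 80, "rooms": 2}
ad2 = {"price": 880, "sqm": 75, "rooms": 2}  # same flat, €20 cheaper
print(probably_same_flat(ad1, ad2))  # True
```

This only flags *candidate* duplicates, of course – two genuinely different flats in the same building would also match, which is why the images end up being the deciding factor.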
The best way of telling whether two ads point to the same flat is by analysing the images. Which is more difficult than it initially seems. I couldn't use any regular hashing algorithms due to the avalanche effect, and Mechanical Turk was out of the question in my case, due to cost & time constraints.
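To see why cryptographic hashes are useless here and what a perceptual hash does instead, here is a stdlib-only toy: SHA-256 of two nearly identical inputs differs completely (the avalanche effect), while a tiny average hash of two nearly identical "images" – just flat lists of grayscale pixel values standing in for real image data – differs in few or no bits:

```python
import hashlib

# Avalanche effect: one changed byte, a completely different digest.
a = hashlib.sha256(b"pixels of flat 136576807").hexdigest()
b = hashlib.sha256(b"pixels of flat 136576808").hexdigest()
print(a[:16], b[:16])  # no resemblance at all

def average_hash(pixels):
    """Toy perceptual hash: one bit per pixel, set when the pixel is
    brighter than the image's mean. Real implementations downscale and
    grayscale the image first; here 'pixels' is already a small list."""
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]

def hamming(h1, h2):
    """Number of differing bits; small distance ~ same picture."""
    return sum(x != y for x, y in zip(h1, h2))

img = [10, 200, 30, 180, 20, 220, 40, 190]
brighter = [p + 5 for p in img]                   # re-encoded copy of the same photo
different = [200, 10, 180, 30, 220, 20, 190, 40]  # an unrelated photo

print(hamming(average_hash(img), average_hash(brighter)))   # 0: same picture
print(hamming(average_hash(img), average_hash(different)))  # 8: every bit flipped
```

Brightening every pixel shifts the mean by the same amount, so the bit pattern – and therefore the hash – is unchanged, which is exactly the robustness a duplicate detector needs and a cryptographic hash deliberately destroys.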
So I did what any normal person would do: research. And by research I mean clicking on the first result in Google, which happened to be this one, which proved to be an amazing resource with a lot of “copy-paste”-able content 🙂
In the end, after 3-4 hours of work, I ended up with the script described in this gist. I'm not sure how effective it actually is at detecting duplicates via images, because I got lucky on the second try and found an awesome flat. The only one I actually visited.
PS: If you want to see the slides, they’re available here. Just press <space> to navigate!