How did the idea of a product like OpenCage Data come up?
Well originally we were part of London-based real estate search engine Nestoria (also built in Perl). In that business we processed millions of real estate listings from various countries every day. Along with lots of data cleansing, one of the things we needed to do was geocode them. Depending on the country and the quality of the data that was a real challenge, and various off-the-shelf solutions we looked at either weren’t up to the task or were prohibitively expensive.
At the same time, OpenStreetMap had been founded in London and there was a very active local community that we were participating in. So we saw open data getting better and better, and eventually we built our own internal solution on top of different open data geocoders. Various customers asked us how we did the geocoding, so in 2014, as an experiment, we put the solution on the web under the OpenCage brand as a beta service for others to experiment with.
In 2015, Nestoria was sold and the buyers took a different approach (more keyword search than geographic search), and thus weren’t interested in OpenCage. So Marc Tobias (one of the Nestoria team members) and I took it over. We kept tweaking the service based on the feedback of the original customers, and eventually moved it out of beta and offered various pricing tiers depending on usage. It’s grown steadily from there.
How does it work?
Well technically we are actually a meta-geocoding service.
Users send a request to our geocoding API; we error-check it, pass it to the different geocoders we run behind the scenes, get back the results, merge and deduplicate them, clean things up (including formatting the address correctly), and add in relevant information that helps make developers’ lives easier - things like local currency info, timezone info, Wikidata references, etc. Finally we send all that back as JSON or XML.
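The merge-and-deduplicate step can be sketched roughly like this. To be clear, this is a toy illustration, not OpenCage's actual code: the hash layout, the field names, and the coordinate-rounding threshold are all assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy sketch: merge candidate results from several upstream geocoders and
# drop ones that point at (almost) the same coordinates. Illustrative only.
sub merge_results {
    my @candidates = @_;
    my %seen;
    my @merged;
    for my $r (@candidates) {
        # Treat results within ~0.001 degrees of each other as duplicates
        # by rounding the coordinates into a dedupe key.
        my $key = sprintf '%.3f,%.3f', $r->{lat}, $r->{lng};
        next if $seen{$key}++;
        push @merged, $r;
    }
    # Highest-confidence results first.
    return sort { $b->{confidence} <=> $a->{confidence} } @merged;
}

my @results = merge_results(
    { lat => 52.5170, lng => 13.3888, confidence => 9, source => 'geocoder_a' },
    { lat => 52.5171, lng => 13.3892, confidence => 7, source => 'geocoder_b' },
    { lat => 48.8566, lng =>  2.3522, confidence => 8, source => 'geocoder_b' },
);
# The second candidate is a near-duplicate of the first and gets dropped.
printf "%s (%.4f, %.4f)\n", $_->{source}, $_->{lat}, $_->{lng} for @results;
```

In a real pipeline the dedupe key would be far smarter than rounded coordinates, but the shape of the step - normalize, key, skip repeats, rank - is the same.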
Clients can call the API directly via curl or similar, but there are also many SDKs and libraries for different languages, including of course Geo::Coder::OpenCage for Perl 5 (there’s also a Perl 6 OpenCage library).
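A direct call looks something like the sketch below. The endpoint shape matches OpenCage's public docs; the API key is a placeholder and the coordinates are just an example.

```shell
#!/bin/sh
# Build a reverse-geocoding request URL (no network call made here;
# the key is a placeholder you would replace with your own).
API_KEY="YOUR-API-KEY"
# "q" is a "lat,lng" pair; %2C is a URL-encoded comma.
URL="https://api.opencagedata.com/geocode/v1/json?q=51.5074%2C-0.1278&key=${API_KEY}"
echo "$URL"
# With a real key you would then run:  curl "$URL"
```

The same request with `q=some free text address` does forward geocoding instead; the JSON response comes back in either case.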
What’s your personal backstory with Perl?
Like a lot of people I picked up Perl in the mid-nineties when I was learning how to make a webpage. CGI FTW! After university (where I studied and hated civil engineering) fate took me to yahoo.de during the internet 1.0 boom. I worked on lots of different content products like news, sports, finance, etc, all in Perl. Basically they all involved manipulating text to eventually generate web pages - a task Perl of course does very well.
I learned an immense amount, and frankly I am a bit embarrassed when I think about how primitive our way of working was - no real test infrastructure, minimal version control, etc. It was very much “just fucking do it”. Later when we started Nestoria the modern Perl movement was gaining momentum and we had very solid practices - automated tests, continuous integration, etc. We also made time to contribute back to the Perl ecosystem: via code, by sponsoring and speaking at events like the London Perl Workshop, and by training younger developers at the start of their careers. We also used to run a “Module of the Month” feature on the Nestoria dev blog where we sponsored various CPAN modules we were using.
Anyway, coming back to your question, when we started OpenCage we did it in Perl and it’s still that way today. It works very well for us and keeps getting more and more solid, which is critical when you’re a smaller team trying to do a lot. As with every Perl project, we stand on the shoulders of CPAN giants. Thank you, Perl contributors!
Tell us more about the team at OpenCage Data
The founding team is myself and Marc Tobias. Legally OpenCage is a UK company, but we have a very distributed team. The foundation of our service though is the millions of open data contributors around the world. We’re proud to be a corporate member of the OpenStreetMap Foundation as a small way to help give back to that community, and we run #geomob, the location based service developer meetup in London. Anyone interested in the geo space - and it is pretty fascinating technically - should come along for the talks, conversations, and free beers.
What’s the tech stack behind it all?
Perhaps surprisingly these days we host our own machines rather than use something like AWS. Basically it’s just much cheaper given the huge volume of data we have to have in memory (literally the whole world) - and it’s changing all the time.
So we have some load balancers, then Linux servers running Apache. A client request goes into good old mod_perl, we do some authentication using the service Kong (which we highly recommend), and then, depending on the query, we pass it to the various geocoders we have set up - for example OpenStreetMap’s Nominatim (to which Marc Tobias is a major contributor), Data Science Toolkit, and others. There’s a full list on our site.
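The Apache piece of that wiring can be pictured with a minimal mod_perl2 configuration fragment. This is purely illustrative - the path and handler module name are hypothetical, not OpenCage's actual config:

```apache
# Illustrative only: route requests under /geocode into a mod_perl handler.
<Location /geocode>
    SetHandler perl-script
    PerlResponseHandler My::Geocoder::Handler
</Location>
```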
The technical challenges are about reliability (ie never being down), latency, freshness of data, and of course correctness.
Forward geocoding in particular can be very hard as the query is basically free text, and can be in any language (or mix of languages - fun!). As you have seen if you’ve ever spent time working with databases, people have A LOT of garbage in their data, and we see that when they try to geocode it. The simple case of someone sending us a well formatted address is fairly straightforward, but a lot of noise gets thrown at us as well, so it’s about trying to do the best we possibly can with that.
What is your ideal customer persona?
It’s a surprisingly diverse group actually. Companies from all over the world. It’s big companies and tiny start-ups. For a lot of them it’s that they have a continual stream of addresses they want geocoded (ie forward geocoding). But another big group is vehicle tracking companies. The cost of a GPS tracking device has fallen and fallen, and so they are gathering tons of coordinates all the time and eventually need those turned into human readable locations (ie reverse geocoding). What’s interesting is that there are very different requirements in terms of what level of granularity users want, how critical speed of response is, how much they care about freshness of data, etc. And we’re continually hearing of new use cases. Basically we’re back end infrastructure - digital plumbing, if you will - and lots of people need that to build upon.
Any client names you may want to mention?
Sure, we’re fortunate to be trusted by some well known global brands like BMW, Toyota, Diageo, and Bosch. But also organizations that are only active in one region or country, or tiny start-ups just getting started. Recently the Norwegian police force started working with us which was nice.
How many users do you have at the moment?
Our model is the standard freemium model - there is a usage-restricted free tier that people can use as long as they like, or if they need more they can upgrade to a paid plan. As is typical in that model, the vast majority of users stay on the free plan and a few percent become customers. I don't track the free number that closely as lots of people sign up, do their geocoding and then never come back (you either need geocoding or you don’t) and after six dormant months or so we delete their account. In total though it’s been tens of thousands.
What do you provide beyond what the other geocoders on the market offer?
Great question. We provide the flexibility of open data with the reliability and robustness of an enterprise partner. And we do that at a very affordable price.
Really we have two main competitors. The first is Google, the best known name in the world in maps and geo APIs. They have a fantastic product, of course, but have three weaknesses. First up - they are very expensive at higher volumes. Secondly, you have to agree to their terms and conditions, which many find restrictive. For example you can only display the data they return on a Google map. Open data gives you more options. Finally, lots of companies feel like they already have enough of a dependency on Google.
The other big competitor is companies thinking they should just run their own geocoding infrastructure, by, for example, setting up their own copy of something like OSM’s Nominatim. It depends on your precise use case of course, but in most situations it’s much more cost effective to have us run it for you. Also, our results tend to be better because we’re running multiple geocoders. One thing we see people underestimate is that they need not just to set it all up but then to keep both the software and the data up to date. So it’s an ongoing maintenance challenge, and eventually many realize they have enough ops issues in their life already. Anyone thinking of going the self-hosting route should chat with us. Our most loyal customers are teams that wasted time and energy on doing it themselves before switching to us.
What’s the plan for the future?
The pleasure and pain of geocoding is that the world is always changing (I write that as someone living in Barcelona, Spain/Republic of Catalonia - as just one recent example). So there’s ALWAYS more to do. But the constant is that it’s about listening to the customers, understanding their needs, and then finding a solution.
One thing we’re always working on is making it simpler for developers to use our API. We have Perl well covered, but there are lots of smaller or newer languages we’d love to have libraries for. We’re happy to pay anyone who contributes a library or SDK.