5 November 2009 | Richard Chow
Google has just announced a free GPS navigation service with the latest version of Android. It’s a classic Google bargain: they provide the desired content — in this case, phone-based maps and turn-by-turn directions — and the user will (eventually) see ads.
The lingering question, of course, is what happens to your data.
Why should you be concerned about your “location trace”? The EFF has an overview, On Locational Privacy, and How to Avoid Losing it Forever. You can deduce a lot about people from their locational traces: where they sleep, work, and play, what stores and restaurants they like, who they spend time with, and more.
Privacy should be the biggest concern for users of location-based social networking apps like Foursquare, Google Latitude, Loopt, and others. For example, will these companies store and analyze your location traces to figure out what ads to show you as part of their business model? [Maybe not; Google Latitude claims to overwrite historical log data whenever new data comes in.]
Location traces are only the beginning of the story. People create many types of contextual data, such as phone logs, web history, e-mail, search engine queries, and more. Let’s consider a different model — one which radically shifts the balance of power from the corporations to the consumer.
The idea: enhance privacy by taking advantage of the insecurity of contextual data.
For example:
I don’t necessarily recommend these approaches as described; it may not be practical to multiply operating expenditures by an order of magnitude or more to support privacy. But this approach might be appropriate at certain times or in specialized domains like the military.
The ability to generate convincing fake contextual traces might also be useful in social applications. Suppose you want to conceal your Vegas trip, yet don’t want to go off the location-app grid, which might arouse suspicion. You would just need to generate one convincing fake trace and substitute that for your actual trace.
We predict fraudsters will be the first adopters of technology to fake contextual data, and that this will drive better techniques for detecting fake contextual data (which will, in turn, improve the ability to create fake contextual data: a classic arms race). Click fraudsters have already been engaging in a primitive form of faking search engine queries. Consider the version of click fraud where publishers target ads to certain locations, and fraudsters need to “appear” in those locations to get the ads. Their trace to the location must look realistic to foil fraud detection algorithms.
At PARC, we’ve been experimenting with creating convincing fake location traces. It’s not a trivial task. For example, you can’t just splice in something from a database of past traces if you assume that the parties trying to detect the fakes have access to the same data and are at least as knowledgeable as the parties trying to commit the forgery.
We have developed an algorithm where we fake a driving trace leveraging Google Maps, by:
One tricky challenge is simulating noise errors in the GPS signal. Errors in an actual GPS signal seem to drift – for example, see this plot of an actual trace. When the signal is off a bit, it tends to stay off in the same direction for a little while.
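To make the drift idea concrete, here is a minimal sketch of correlated noise, assuming a first-order autoregressive (AR(1)) model. This is an illustration of the general technique, not the algorithm from our paper; the function names and the values of sigma and rho are hypothetical placeholders.

```python
import random


def drifting_noise(n, sigma=3.0, rho=0.95, seed=None):
    """Return n error values (in metres) from an AR(1) process.

    sigma is the long-run standard deviation of the error and rho is the
    correlation between consecutive readings. Both defaults are assumed
    placeholder values, not measurements from real GPS hardware.
    """
    rng = random.Random(seed)
    # Scale the fresh noise so the long-run spread stays at sigma.
    innovation_sd = sigma * (1 - rho ** 2) ** 0.5
    error = rng.gauss(0, sigma)
    values = []
    for _ in range(n):
        values.append(error)
        # Most of the previous error carries over, so the offset lingers.
        error = rho * error + rng.gauss(0, innovation_sd)
    return values


def lag1_autocorr(xs):
    """Sample correlation between each value and the one that follows it."""
    mean = sum(xs) / len(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    drifting = drifting_noise(2000, rho=0.95, seed=1)
    independent = drifting_noise(2000, rho=0.0, seed=1)
    print("drifting lag-1 autocorrelation:", round(lag1_autocorr(drifting), 2))
    print("independent lag-1 autocorrelation:", round(lag1_autocorr(independent), 2))
```

With rho near 1 the error stays off in the same direction for a while, and with rho set to 0 each reading gets independent noise. Applying one such series to the east offset and another to the north offset of each point would be one way to perturb a synthetic trace.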
By the way, faking contextual data only works if you convincingly fake all the data in concert (e.g., a GPS trace showing you at work would not match accelerometer data consistent with driving).
See our full paper for more details of what we did. Others have also developed methods for faking location traces of a car trip, such as John Krumm of Microsoft Research and Pravin Shankar et al. from Rutgers.
PARC is also exploring the idea that contextual data implicitly authenticates you. This is a privacy problem because you have to worry not only about what the data tells others, but also about the fact that the data itself fingerprints you.
We don’t know how to make guarantees about how convincing a fake is. Perhaps the best option is peer review (as with cipher evaluation). If you’d like to try distinguishing our fake traces from real ones, download this zip file. The file contains 6 traces of the same route with 2 fakes; the readings in each trace are 5 seconds apart. Let us know which two you think are fake, and why.
Nice post. Thanks for pointing to my work. I like how you’ve thought beyond just the technical aspects of this solution for privacy.
I also like your answer to why we can’t just use previously gathered GPS traces as the fake traces. I’m sometimes asked this question (e.g. by reviewers), and I never had a very good answer until reading your post. You’re right that attackers could have access to the same historical traces as anyone else, rendering the traces useless for this purpose. Good thinking!
Very interesting post, thanks for pointing to my work.
I agree that evaluation of “fake locations” or SybilQueries is an important problem. This would have to be done as a combination of user study/peer review, as well as machine learning/clustering techniques, and is a great avenue for future research.
I took a look at your traces, and my guess is that the fake traces are 08042008.kml and 08062008.kml. The rather naive reasoning is that the CDF of the distance between consecutive points in these two traces looks different from that of the other 4 (see this figure).
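For readers who want to try a similar comparison, here is a rough sketch of the idea. It is not the exact analysis behind the figure. It assumes each trace has already been read into a list of (latitude, longitude) pairs; the helper names are hypothetical, and the maximum gap between two empirical CDFs is just one possible way to put a number on "looks different".

```python
import bisect
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, spherical approximation


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def step_distances(trace):
    """Distances between consecutive points of a trace."""
    return [haversine_m(a, b) for a, b in zip(trace, trace[1:])]


def empirical_cdf(values):
    """Sorted (value, fraction of samples <= value) pairs."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


def max_cdf_gap(a, b):
    """Largest vertical gap between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in a + b:
        fa = bisect.bisect_right(a, x) / len(a)
        fb = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(fa - fb))
    return gap
```

Computing max_cdf_gap on the step distances of every pair of traces gives a small table; traces whose gaps to all the others are consistently large would be the ones to look at first.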
Thanks for the comments, everybody!
Pravin, your guess is right. You’ve pointed out that some adjustments are needed in the distributions used by the coordinate-generation algorithm.
What’s tricky is that ideally we don’t want the algorithm to depend on data from actual trips along the route. The advantage of the algorithm is not needing this data (otherwise, we might have used John Krumm’s algorithm). One thought is to divide up roads into some categories like “freeway” and “residential”, and have distributions for each category. Of course, there might need to be a time-component to the category also, e.g. “freeway-during-rushhour”…
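As a rough sketch of that thought, assuming a simple normal speed model per category (the category names follow the examples above, but every number is a made-up placeholder, not a value fitted to real trips):

```python
import random

SAMPLE_INTERVAL_S = 5  # readings in the posted traces are 5 seconds apart

# Hypothetical (mean, standard deviation) of speed in m/s for each category.
# These are placeholders for illustration, not measured values.
SPEED_MODEL = {
    "freeway": (29.0, 3.0),
    "freeway-during-rushhour": (9.0, 5.0),
    "residential": (11.0, 2.5),
}


def sample_step_m(category, rng):
    """Draw the distance covered in one sampling interval on a road category."""
    mean, sd = SPEED_MODEL[category]
    speed = max(0.0, rng.gauss(mean, sd))  # a car cannot have negative speed
    return speed * SAMPLE_INTERVAL_S


def sample_steps(categories, seed=None):
    """Draw one step distance for each road category along a route."""
    rng = random.Random(seed)
    return [sample_step_m(category, rng) for category in categories]


if __name__ == "__main__":
    route = ["residential"] * 3 + ["freeway"] * 3
    print([round(d, 1) for d in sample_steps(route, seed=0)])
```

Drawing each step independently like this ignores the fact that real speed changes smoothly from one reading to the next, so a fuller version would probably need correlated draws as well.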
November 6th, 2009 at 9:59pm
Posted by Senthil
Nice one, Richard! Interesting read.