This blog is an excerpt of an article that currently appears on infosecisland.com.
Written in collaboration with Yunhui Long.
In this day and age, when companies make money from user data, consumer privacy is at stake. Incessant identity theft and phishing attacks, along with revelations about mass government surveillance, have made consumers increasingly anxious about privacy. Consumers have thus come to prefer products and services with stronger privacy postures. To this end, two major privacy technologies have gained immense attention: end-to-end encryption and differential privacy. While both technologies strive to protect user privacy, interestingly, when put together, the whole is smaller than the sum of its parts.
Firstly, what exactly are end-to-end encryption and differential privacy?
What is End-to-end Encryption?
End-to-end encryption (E2EE) is a popular privacy technology for instant-messaging services. With this technology, only the communicating users can read the messages. Technically speaking, this works by encoding the sender’s message in such a way that only the receiver has the key to decode it.
Just in the past three years, various messaging apps have implemented end-to-end encryption. Notably, this shield not only protects users from external eavesdroppers, but also ensures that even the company offering the instant-messaging service cannot access the data.
What is Differential Privacy?
Intuitively, differential privacy is a technique that can reveal interesting patterns in a large dataset while still protecting the privacy of individual data entries.
To understand the technique, consider a database of the salaries of all software engineers in Silicon Valley. Suppose an analyst is allowed to query the average salary; denote this value by avg1. Now suppose a new item v is added to the database, and let the new average be avg2. The analyst can easily deduce v just from avg1 and avg2: v = avg2 * (N+1) - avg1 * N, where N is the number of salaries in the original database.
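The arithmetic behind this leak can be checked with a small sketch. The salary figures below are made up purely for illustration, and the variable names simply mirror the symbols used in the paragraph above.

```python
def mean(values):
    """Exact (un-noised) average, as released to the analyst."""
    return sum(values) / len(values)

# Hypothetical salaries; not real data.
salaries = [120_000, 135_000, 150_000, 110_000]
n = len(salaries)

avg1 = mean(salaries)                   # average before the new entry
new_salary = 142_000                    # the value v the analyst should not learn
avg2 = mean(salaries + [new_salary])    # average after the new entry

# The analyst only sees avg1, avg2 and N, yet recovers v exactly.
recovered = avg2 * (n + 1) - avg1 * n
print(round(recovered) == new_salary)
```

Two exact averages taken one entry apart are enough to expose that entry, which is the scenario differential privacy is meant to prevent.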
Differential privacy avoids such scenarios. More specifically, differential privacy is a statistical learning tool that works by adding carefully computed mathematical noise to the statistical aggregate. In the above example, the noise term added to the average salary does not allow the analyst to learn information about the exact salary of any individual software engineer. The noise term is large enough to mask individual data items, but small enough to allow any patterns in the dataset to appear.
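As a rough illustration of "adding noise to the aggregate", here is a minimal sketch of a noisy average using Laplace noise, one common choice of noise distribution. The function names, the salary bounds, and the privacy parameter `epsilon` are assumptions made for this example and do not come from the article. The sketch also assumes that the number of entries is public and that two neighboring databases differ in the value of a single entry.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_average(values, lower, upper, epsilon, rng):
    """Release an average with Laplace noise (illustrative sketch only).

    Values are clipped to [lower, upper] so that changing one entry can move
    the average by at most (upper - lower) / n, which sets the noise scale.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    true_avg = sum(clipped) / len(clipped)
    sensitivity = (upper - lower) / len(clipped)
    return true_avg + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical usage: 10,000 made-up salaries, assumed bounds, epsilon = 0.5.
rng = random.Random(42)
salaries = [rng.uniform(80_000, 200_000) for _ in range(10_000)]
print(noisy_average(salaries, 0, 300_000, 0.5, rng))
```

With many entries the noise scale is small relative to the average, so the overall pattern survives, while the contribution of any single salary is masked. A smaller `epsilon` means more noise and stronger protection.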
Differential Privacy to Protect User Privacy
Until recently, differential privacy had been a topic of theoretical research without much application in real-world scenarios. Yet it can clearly bring significant value to the table: in today’s consumer-driven economy, it is crucial for businesses to learn from and adapt to consumer behavior. Collecting and studying patterns in consumer data has thus become a key ingredient for success and survival. The ability to extract patterns from large datasets while still protecting the privacy of individual data points seems to be a boon.
An application of differential privacy: Consider a company C providing an end-to-end encrypted instant-messaging service. A desirable feature of such a service is smart autocomplete. To provide the basic feature, all the data that C needs is an English dictionary. Now consider a smarter autocomplete that also suggests trending slang terms even before you have heard of them. These suggestions are based on a population’s messaging behavior, so this feature clearly needs consumer data. In such a case, differential privacy might be used to collect and process consumer data while still preserving individual privacy.
Methodologies for implementing differential privacy: Unfortunately, because differential privacy has largely been confined to theoretical research, there isn’t much work on how to employ it in practice. Thus, the exact methodology for implementing this technology at large scale is unclear. One particularly interesting question is which methodology one should use to sample the noise terms. There are two major methodologies in the literature:
- A prevalent methodology is to first collect the exact data points, compute an aggregate (such as a total count or an average) of the collected data points, and then add noise to the aggregate. This requires users to send their exact information to the company C. Thus, while this methodology protects user privacy from the public, the company still gets access to the exact user information. This is undesirable.
- Thankfully, there is another methodology which, although less prevalent, seems to fit the bill: it involves adding noise to the data points at the user end, before the data is sent to C’s cloud storage. C then aggregates the noised data points. This helps preserve some privacy from C too. A significant piece of research in this area came from Google Research: the RAPPOR methodology, which involves the so-called tools of ‘hashing’ and ‘sub-sampling’.
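The second methodology, where noise is added on the user's side, can be sketched with classic randomized response, the textbook technique that RAPPOR builds on. This is not RAPPOR itself. The function names and the probability `p_truth` are assumptions made for illustration. Each user holds one private bit (for example, "did I use a given word this week?") and reports it truthfully only with probability `p_truth`.

```python
import random

def randomize_bit(true_bit, p_truth, rng):
    """Runs on the user's device: report the true bit with probability
    p_truth, otherwise report the flipped bit."""
    return true_bit if rng.random() < p_truth else 1 - true_bit

def estimate_fraction(reports, p_truth):
    """Runs at company C: undo the known bias of the noisy reports to
    estimate the fraction of users whose true bit is 1."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)

# Hypothetical population: 30% of 10,000 users have a true bit of 1.
rng = random.Random(7)
true_bits = [1] * 3_000 + [0] * 7_000
reports = [randomize_bit(b, 0.75, rng) for b in true_bits]
print(estimate_fraction(reports, 0.75))
```

C never sees a trustworthy individual answer, because any single report may have been flipped, yet the population-level estimate is still close to the true fraction.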
However, the devil is in the details.
The Devil in the Details
Detail 1: RAPPOR-like techniques require C to know the set of candidate strings whose usage frequency it is computing. Concretely, in the case of trending slang, C would need to already know the slang terms that users are sending through the messaging service in order to determine their frequency.
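A deliberately simplified sketch shows why the candidate set matters. Here users report only a hash bucket for each word, and the noise step is omitted for clarity, so this is neither RAPPOR nor differentially private. The bucket count, the function names, and the example words are all hypothetical.

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 2 ** 16  # hypothetical size of the hash range

def bucket_of(word):
    """Map a word to a bucket index; this is all a user would report."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS

def count_candidates(reported_buckets, candidates):
    """Runs at company C: frequencies can only be read off for words that
    C already knows, because C must hash each candidate itself."""
    counts = Counter(reported_buckets)
    return {word: counts[bucket_of(word)] for word in candidates}

# Users report buckets for words C may never have seen.
reports = [bucket_of(w) for w in ["yeet", "yeet", "sus"]]

# C can count "yeet" only because it is on C's candidate list;
# "sus" stays invisible to C until someone adds it to that list.
print(count_candidates(reports, ["yeet", "lol"]))
```

The reported buckets alone do not tell C which new words are trending. C has to supply the candidate words from some other source.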
Detail 2: Recall that the conversations are end-to-end encrypted. Also recall that the objective of end-to-end encryption is to leave no door through which C can obtain user data (so that C cannot be compelled to reveal user data, even by government surveillance warrants).
The devil: Detail 1 implies that C needs to look into user data, while Detail 2 recalls that the data is already encrypted. In other words, to learn the candidate strings used in differential privacy techniques, the company may need to see the unencrypted content of individuals’ conversations, which goes against the very intention of end-to-end encryption.
This shows that the two privacy technologies are fundamentally in tension. In fact, we have seen that one can seriously undermine the other. Thus, any methodology that makes these privacy technologies work together will be highly non-trivial and ground-breaking.
In summary, although end-to-end encryption and differential privacy each offer strong user privacy protection, the two technologies interact in interesting ways, with one fundamentally undermining the effect of the other. In this light, while differential privacy is a promising tool, implementing and deploying it while retaining the privacy guarantees of end-to-end encryption is challenging.