Data is now the crux of the Internet economy: it holds the promise of profitable insights and bears the commercial destiny of online corporations. As a result, organizations such as Google and Facebook are gathering big data in increasingly large data centers, hoping to monetize it both now and in the future.
Although data is extremely easy to copy and share, little of it filters out of corporate data troves. Corporations retain exclusive control over the data they collect, spawning entirely new “digital divides”. In fact, this data stratification tends to have an intoxicating effect: companies start collecting big data out of fear of being outperformed, which drives the economics of relentless surveillance.
At the other end of the spectrum, many advocate open data initiatives, arguing that certain data should be publicly shared. Data sharing promises better services for end users by reducing data scarcity and spurring open innovation. Cities and governments eagerly follow this model, but sharing corporate data faces massive roadblocks, mainly due to concerns about security and privacy, liability, and competition.
Take security as an example. The fallout from data leaks can be devastating, as the recent headlines about Target and Adobe demonstrate. Privacy is a related issue, since customers might not want their data to serve purposes they did not expect. Another problem is competition. Corporations do not want to make it easy for others to compete against them in the marketplace, and prefer to impose substantial switching costs through vendor lock-in. In fact, even users have limited access to their own data: after years of effort, Google’s data liberation group still hasn’t emancipated numerous Google services.
Ideally, data sharing should come with controls that limit the chance of negative externalities while preserving its benefits. In other words, data must be shared responsibly. Existing data sharing arrangements rely on vetting processes to build trust with potential data partners. In clinical trials, for example, data sharing has been shown to improve the value of analytics. Such arrangements depend on non-disclosure agreements and memoranda of understanding, which tend to be slow and costly, and are thus seldom satisfactory.
At PARC, we consider a new paradigm: privacy-preserving data sharing. We have developed a first-generation secure collaboration technology that addresses the aforementioned challenges through cryptography and achieves private, real-time data fusion.
Data Sharing for Cyber Threat Mitigation
One prominent application of our technology is cyber threat mitigation. For years, the U.S. government and standardization groups such as ITU-T and IETF have promoted data sharing for this purpose. The idea is that sharing security data among corporations will help thwart online attacks, since cyber threat mitigation mechanisms work best with extensive information about attackers.
So far, these data sharing efforts have had limited success. Last year, the U.S. House of Representatives proposed legislation in the form of the Cyber Intelligence Sharing and Protection Act (CISPA). The bill was heavily disputed and stalled because it granted broad immunity to data sharing entities (i.e., legal concerns) and took a generous view of what data could be shared and with whom (i.e., security, privacy, and competition concerns). The debate continues this year with the Cybersecurity Information Sharing Act (CISA), currently under discussion in Congress.
Our approach relies on a privacy-preserving analysis of data sharing risks and benefits. We design cryptographic protocols based on secure multi-party computation that identify promising commercial partners for data sharing by calculating data similarity over encrypted data. This reduces competitive risk, because it identifies partners that share similar interests and properties, and it protects security and privacy, because it lets organizations make informed data sharing decisions before revealing any of their data. Finally, we control data sharing with the identified partners by sharing only the specific data subsets that maximize benefits.
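To make "similarity over encrypted data" more concrete, here is a minimal sketch of one well-known way to compare two sets without exchanging them in the clear: Diffie-Hellman-style commutative blinding. This is not the PARC protocol, which the text above does not specify. The function names, the use of Jaccard similarity, and the toy modulus are all assumptions made for illustration, and both parties are simulated in a single process.

```python
import hashlib
import math
import secrets

# Toy modulus: the Mersenne prime 2**127 - 1. It keeps the example short
# but is NOT a secure parameter choice (assumption for illustration only).
P = 2**127 - 1


def hash_to_group(item: str) -> int:
    """Map an item (e.g. an IP address) to a nonzero residue modulo P."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def new_secret() -> int:
    """Pick a secret exponent coprime to P - 1, so that exponentiation is a
    bijection and blinding cannot merge two distinct hashed values."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def blind(values, secret):
    """Raise each value to the secret exponent; sorting hides the input order."""
    return sorted(pow(v, secret, P) for v in values)


def blinded_jaccard(items_a, items_b):
    """Estimate set similarity from doubly blinded values only."""
    a, b = new_secret(), new_secret()
    # Round 1: each party blinds its own hashed items with its own secret.
    a_once = blind((hash_to_group(x) for x in set(items_a)), a)
    b_once = blind((hash_to_group(y) for y in set(items_b)), b)
    # Round 2: each party blinds the other's values with its own secret.
    # Because (h**a)**b == (h**b)**a mod P, equal items end up equal.
    a_twice = set(blind(a_once, b))
    b_twice = set(blind(b_once, a))
    common = len(a_twice & b_twice)
    union = len(a_twice) + len(b_twice) - common
    return common / union if union else 0.0
```

In this sketch the parties learn how many items they have in common, and each other's set sizes, but not the items themselves. A real deployment would need a vetted cryptographic library, a standardized group, and a careful security analysis.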
We have tested our approach on predicting the presence of malicious activity in online traffic, and initial results show that data sharing improves prediction success by about 100% on average. For example, two companies can query each other for common sources of attacks on their web infrastructure. If the companies have many attackers in common (i.e., high similarity), they enter a data sharing phase and exchange information about these attackers. With that information, they can enrich their predictive models to mitigate future attacks.
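The two-phase workflow in this example can be sketched as follows. The sketch works on plaintext sets purely to show the control flow; in the approach described here, the similarity step would run over encrypted data. The AttackRecord fields, the choice of Jaccard similarity, and the 0.3 threshold are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackRecord:
    """Hypothetical log entry describing one attack source."""
    source_ip: str
    attack_type: str
    hits: int


def similarity(sources_a, sources_b):
    """Jaccard similarity between two sets of attacker addresses."""
    union = sources_a | sources_b
    return len(sources_a & sources_b) / len(union) if union else 0.0


def records_to_share(own_records, partner_sources, threshold=0.3):
    """Phase 1: measure overlap. Phase 2: share only records about common
    attackers, and only if the overlap reaches the (assumed) threshold."""
    own_sources = {r.source_ip for r in own_records}
    if similarity(own_sources, partner_sources) < threshold:
        return []  # low similarity: reveal nothing
    common = own_sources & partner_sources
    return [r for r in own_records if r.source_ip in common]


# Usage with documentation-only IP addresses.
company_a = [
    AttackRecord("192.0.2.1", "sql-injection", 40),
    AttackRecord("192.0.2.2", "port-scan", 12),
    AttackRecord("198.51.100.7", "brute-force", 5),
]
company_b_sources = {"192.0.2.1", "192.0.2.2", "203.0.113.9"}
shared = records_to_share(company_a, company_b_sources)
```

Here the two companies have two of four distinct attackers in common, so the similarity is 0.5 and company A shares only its two records about the common attackers.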
Technology alone cannot solve every problem, but it can yield novel tradeoffs between transparency and security: it is possible to design data sharing approaches that address security, privacy, and competitiveness concerns. In brief, if we dare to make innovative use of big data troves, security and privacy techniques can help us devise novel and responsible ways to share data.