An Apple software engineer recently revealed that the company is expanding its use of differential privacy to cover both web browsing and health data, using the technique to process millions of pieces of information from device users every day.
Differential privacy has so far flown largely under the radar, so we thought it would be a good time to look at what it does and how it works – and to ask how comfortable you feel about its wider use by Apple …
Prior to the development of the technique, tech companies with access to large volumes of data faced a fundamental dilemma. If you collect and analyze that data, it can be of tremendous value in helping you understand what your customers do and what they want – and allow you to provide a better service as a result. If you take data analysis right down to the level of individual users, you can offer a highly personalized service – but at a potential cost to their privacy.
This is the approach that Google has taken, and it’s why Google is some way ahead of Apple when it comes to things like identifying travel plans from emailed e-tickets and proactively letting you know when it’s time to leave for the airport.
If, on the other hand, you decide that user privacy is more important than data-mining, then your customers will feel comforted by the fact that you’re not mining all their data – but the downside may be less intelligent services.
This is the approach that Apple has historically taken.
What is differential privacy?
Differential privacy is a potential solution to this dilemma. It’s a method of collecting and analyzing large volumes of data from individuals, but processing it in a way that ensures nothing can be tied back to any one individual. You can’t use it to deliver fully personalized services the way Google does, but you can use aggregated learning to deliver an all-round better service to your customers.
The WSJ gave an example of how the technique could be used in a survey about illegal drug use. If you anonymously ask 100 people whether they use marijuana, and you also ask them a bunch of other questions, then there’s the risk that a combination of answers could identify individuals.
For example, if you also asked those people what color car they drive, then there may be only one person in that survey who drives a blue car. If someone answered yes to smoking marijuana and also said that their car is blue, then we can work out who they are even though the data is theoretically anonymous.
Real-life examples would obviously be more complex – involving millions of people and many more than two data items – but the same principle applies. For example, Netflix uses anonymous IDs to log our TV and movie preferences, but an analysis by researchers at the University of Texas showed that just a little knowledge about an individual can be enough to de-anonymize the data.
What differential privacy does is to add a certain amount of mathematical ‘noise’ to the data you collect so that you can no longer know with certainty anything about any specific individual.
In the drug survey example, question 1 for 90 of the people would be whether they smoke marijuana. For the other 10, question 1 would instead be ‘Flip a coin and answer Yes if it comes up heads.’ Then if we see that our blue-car driver answered Yes to question 1, we can no longer state that he’s a drug user – he may be one of the people who got the coin-flip version of the survey.
The dummy questions need to be ones with known response rates (50/50 in the case of a coin flip), and there’s some clever mathematics involved in ensuring that your data analysis is accurate, but the net result is that you can still, within a margin of error, determine what percentage of people smoke pot without identifying any of them.
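To make that concrete, here’s a toy simulation of the coin-flip scheme described above. The numbers are made up for illustration – they’re not Apple’s actual parameters – but they show how a known noise rate can be subtracted from the aggregate result even though no individual answer can be trusted on its own.

```swift
import Foundation

// A minimal sketch of the coin-flip ('randomized response') idea described above.
// All numbers here are illustrative assumptions, not Apple's actual parameters:
// 90% of respondents answer the real question, 10% just report a coin flip.

let truthfulFraction = 0.9   // share of people who answer the real question
let coinYesRate = 0.5        // a fair coin comes up heads half the time

// Simulate one survey response, given whether the person really smokes.
func respond(trulySmokes: Bool) -> Bool {
    if Double.random(in: 0..<1) < truthfulFraction {
        return trulySmokes        // truthful answer
    } else {
        return Bool.random()      // coin flip: answer Yes on heads
    }
}

// Simulate a survey of n people, where trueRate of them actually smoke.
func runSurvey(n: Int, trueRate: Double) -> Double {
    var yesCount = 0
    for _ in 0..<n {
        let trulySmokes = Double.random(in: 0..<1) < trueRate
        if respond(trulySmokes: trulySmokes) { yesCount += 1 }
    }
    return Double(yesCount) / Double(n)
}

// Because the noise model is known, it can be subtracted from the aggregate:
// observedYes ≈ truthfulFraction * trueRate + (1 - truthfulFraction) * coinYesRate
func estimateTrueRate(observedYesRate: Double) -> Double {
    (observedYesRate - (1 - truthfulFraction) * coinYesRate) / truthfulFraction
}

let observed = runSurvey(n: 100_000, trueRate: 0.12)
print("Observed yes rate:   \(observed)")
print("Estimated true rate: \(estimateTrueRate(observedYesRate: observed))")
```

With enough respondents the estimate lands close to the true 12% rate, yet any single ‘Yes’ could simply be the result of a coin flip – which is exactly the point.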
In the case of health data, Apple would know, for example, how many iPhone owners have a particular body mass index, but it wouldn’t know who any of them are.
How comfortable are you with Apple’s approach?
Apple started using differential privacy with the launch of iOS 10. When you opt in to sending Diagnostic and Usage Data, Apple applies differential privacy to that data.
The move wasn’t without controversy. My colleague Greg Barbosa wrote a piece about Apple not making it clear to users how the data was being used, and a cryptography professor from Johns Hopkins questioned whether Apple’s approach was truly safe.
The problem, he says, is that there’s an inevitable trade-off between the accuracy of the data you collect and the privacy of individuals. In other words, the more mathematical noise you introduce in order to protect privacy, the less accurate your data.
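As a rough illustration of that trade-off – again using made-up numbers rather than Apple’s actual parameters – here’s how the margin of error on the coin-flip survey sketched earlier widens as a larger share of answers are replaced with noise:

```swift
import Foundation

// Rough illustration of the privacy/accuracy trade-off, using the same toy
// coin-flip survey as before (illustrative numbers only, not Apple's).

let n = 100_000.0        // number of responses collected
let trueRate = 0.12      // the real proportion we're trying to estimate

for truthfulFraction in [0.9, 0.5, 0.25, 0.1] {
    // Expected observed yes-rate under the coin-flip scheme
    let expectedYes = truthfulFraction * trueRate + (1 - truthfulFraction) * 0.5
    // Standard error of the observed rate, scaled up by the de-noising step
    let standardError = sqrt(expectedYes * (1 - expectedYes) / n) / truthfulFraction
    let marginOfError = 1.96 * standardError   // ~95% confidence half-width
    print(String(format: "truthful answers %.0f%% -> margin of error ±%.2f%%",
                 truthfulFraction * 100, marginOfError * 100))
}
```

With 90% truthful answers the estimate is accurate to within a fraction of a percentage point; with only 10% truthful answers the margin of error balloons to several points – better privacy, worse data.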
But all the signs point to Apple erring on the side of caution, coming down firmly on the privacy side of this trade-off. Apple is, it says, looking only for ‘general patterns.’
It has so far used the data to improve things like auto-correct suggestions.
While one academic without specific knowledge of Apple’s implementation has questioned it, another who has had at least a ‘quick peek’ at the technology believes the company’s approach is sound.
However, it’s fair to say that the company is now moving into more sensitive areas by analyzing web browsing habits and – in particular – health data. This kind of use is likely to put pressure on Apple to share details of its approach more widely than it has to date.
How comfortable do you feel about Apple expanding data analysis using differential privacy to collect web browsing and health data? Please take our poll, and share your thoughts in the comments.