
How the Facebook API led to the Cambridge Analytica Fiasco

How weak API terms of service, lack of transparency, and permissive API scopes led to the Facebook-Cambridge Analytica scandal

The Facebook-Cambridge Analytica data scandal from earlier this year was not about a data breach. Nothing was hacked. It was more nuanced than that. Think: permissive API scopes, a lack of awareness about the data being accessed via your friends, a lack of contingency around that shared data, and poor API terms-of-service enforcement by Facebook. Collectively, these oversights allowed distant third parties to capitalize on the data being disseminated and use it to manipulate and influence content on a user’s Facebook feed. This post explores the history of the Facebook API and how it led to what happened with Cambridge Analytica.

“Protecting people’s information is at the heart of everything we do, and we require the same from people who operate apps on Facebook.”
Paul Grewal, VP & Deputy General Counsel, Facebook, March 16, 2018

It’s not uncommon for an API to have substandard terms-of-service enforcement. Organizations that make APIs available to developers need to deploy a variety of both manual and automated checks to ensure compliance with their policies. There is no simple technical solution for enforcing privacy rules on data accessed via an API.

The solution would be better API onboarding and scope-request due diligence, better API management, better app monitoring and validation, and a machine-readable API terms-of-service format, tied to API management, that could automatically enforce data governance and flag policy infringements. In the near future, I see a solution that stores every data-share contract in a public ledger, making it easier to trace what happens to personal data in relation to the “smart contract” tied to it. Enforcement is another issue altogether. But at least there would be measures to prevent, or at least limit, what happened to the data shared with Cambridge Analytica, among others.

In the French documentary Unfair Game: How Trump Manipulated America, creator Thomas Huchon describes how UK-based Cambridge Analytica helped the Trump team focus targeted communication on undecided voters in Democratic-leaning states where it seemed far-fetched that the state could flip Republican. He outlines how this shift resulted from a butterfly effect, stemming from a seemingly innocuous ‘personality game’ played by one million Facebook users. This opened the door to tens of millions of accounts that could be marketed to by specialist lobby groups.

Platform Thinking: API Design, multi-level access control, and OAuth scopes are keys to safe and programmable interactions

Among Facebook’s responsibilities is warning users about what happens to their data when developers access the platform via its APIs, and about what Facebook does with users’ data when they use Facebook Login. In 2010, Facebook redesigned its platform and added Open Graph tools, including Graph API v1.0. With these tools, developers could now see the social connections between people, and the connections people have based on their interests and likes. The feature of this Graph API that led to the 2018 Cambridge Analytica scandal was the newfound ability for a developer to access each user’s friends list, and all of their friends’ data, with the consent of just one user. And all of this happened unbeknownst to those friends, provided their privacy settings were lax.
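To make that concrete, here is a minimal sketch (in Python, using the requests library) of the kind of call a v1.0-era app could make once a single user had accepted its permission dialog. The endpoint behavior described here no longer exists, and the token, field list, and app behavior are illustrative rather than a reproduction of any specific app:

```python
import requests

GRAPH = "https://graph.facebook.com/v1.0"
# Illustrative token, obtained after ONE user accepted the app's permission
# dialog, which in that era could include friends_* extended permissions.
ACCESS_TOKEN = "user-granted-token"

# The consenting user's own profile: this part is unremarkable.
me = requests.get(f"{GRAPH}/me", params={"access_token": ACCESS_TOKEN}).json()

# The design flaw: the same token could also enumerate that user's friends
# and request profile fields about them, even though those friends never saw
# a consent screen (subject only to their own privacy settings).
friends = requests.get(
    f"{GRAPH}/me/friends",
    params={
        "access_token": ACCESS_TOKEN,
        "fields": "id,name,birthday,likes,location",
    },
).json()

for friend in friends.get("data", []):
    print(friend.get("name"), "- profile data pulled without direct consent")
```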

When designing an API-powered platform to scale, you must be deliberate about the design of the programmable interactions that will occur on your assets. An important part of API design is access-level management: who is able to see what, and what actions they can perform on that data. Facebook’s rules in 2010 were too permissive. If you were a user with 500 friends, it took only one of those friends granting a third-party app access to his or her friend list for that data to be stored by the third-party app without your direct consent.
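To illustrate that access-level question, here is a hedged sketch of a permission check. The model below is hypothetical, not Facebook’s actual implementation; it simply contrasts a scope that exposes only the consenting user’s own data with one that also exposes their friends’ data:

```python
from dataclasses import dataclass

@dataclass
class Grant:
    user_id: str       # the user who clicked "yes" on the permission dialog
    scopes: set[str]   # the scopes that user granted to the app

def can_read(grant: Grant, owner_id: str, field: str) -> bool:
    """Hypothetical access check: who can see what, and about whom."""
    if owner_id == grant.user_id:
        # Reading the consenting user's own data requires a user_* scope.
        return f"user_{field}" in grant.scopes
    # Reading a *friend's* data: under the 2010-2014 model, a friends_* scope
    # granted by someone else was enough. A stricter design would also require
    # a grant from the person the data is actually about.
    return f"friends_{field}" in grant.scopes

grant = Grant(user_id="alice", scopes={"user_likes", "friends_likes"})
print(can_read(grant, "alice", "likes"))  # True: Alice consented
print(can_read(grant, "bob", "likes"))    # True, yet Bob never consented
```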

The data we’re referring to is everything that would appear in the “About” section of your profile: actions, activities, birthday, check-ins, history, events, games activity, groups, hometown, interests, likes, location, photo tags, photos, relationship details, religion, politics... the list goes on. In other words, more than enough data for a third party to profile and target you, based on the actions of a single user. And it’s likely that one of your friends did nothing more at the app’s permission window than click ‘yes’ so they could take that funny personality quiz or IQ test.

This allowed applications to scale to millions of users very quickly. They exploited Facebook user data and the Facebook platform to spread virally. It seems clear that Facebook knew this, since it was the reason Facebook shut down Twitter’s access to its friend-finding features, and later did the same to Google, Vine, and Yandex.

Cut to 2014. Facebook shut down its Graph API to everybody, including the ability to look up friends’ data. This occurred on the heels of some astronomical growth, during which time the company onboarded developers with access to these highly valuable assets. And even though Facebook shut down the Graph API, developers with contracts already signed could still access it for one year to avoid breakage.

The original intent of the Graph API was to help developers make links between users based on interests, and to create new connections (i.e., if you have four friends who like the same sushi restaurant, why not propose that you all go there together?). This is how Cambridge Analytica was able to mine data from 50 million users, stemming from the consent of only 1 million users of a third-party app. But in spite of the Graph API’s permissive scopes and the lack of explicit permission rules, the real problem is Facebook’s lack of platform and API terms-of-service enforcement.
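A rough back-of-the-envelope sketch of that amplification; the friend count and overlap factor are assumptions chosen to illustrate the order of magnitude, not Cambridge Analytica’s actual figures:

```python
# How roughly 1 million consenting users can expose tens of millions of
# profiles through their friend lists. All numbers here are assumptions.
consenting_users = 1_000_000
avg_friends = 200          # assumed average friend count
overlap_discount = 0.25    # assume friend lists overlap heavily

raw_friend_links = consenting_users * avg_friends           # 200 million links
unique_profiles = int(raw_friend_links * overlap_discount)  # ~50 million people
print(f"~{unique_profiles // 1_000_000} million profiles reachable")
```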

Enforcing platform policies and API Terms of Service: the biggest lie of the programmable economy

Every platform that offers APIs publishes a document explaining the rules for interacting programmatically with the platform via its APIs, in a section called “API Terms of Service”, often under the platform policy rules. In these rules, the platform explains what you are and are not allowed to do with the platform and its data. It covers revocation of access, data privacy, data export, the data storage required to comply with platform rules, the user experience the platform defends, and the law.

Let’s say that Facebook protected its own interests first, and checked on users’ interests second. In fact, in 2012, Facebook had a policy stating that:

“Competing social networks: (a) You may not use Facebook Platform to export user data into a competing social network without our permission; (b) Apps on Facebook may not integrate, link to, promote, distribute, or redirect to any app on any other competing social network”

Because of the Twitter, Vine, and Google cases shared earlier, Facebook updated their Terms of Service with this:

“Reciprocity and Replicating core functionality: (a) Reciprocity: Facebook Platform enables developers to build personalized, social experiences via the Graph API and related APIs. If you use any Facebook APIs to build personalized or social experiences, you must also enable people to easily share their experiences back with people on Facebook. (b) Replicating core functionality: You may not use Facebook Platform to promote, or to export user data to, a product or service that replicates a core Facebook product or service without our permission”

That means Facebook actively protected the value of its assets by obliging third-party platforms to give back at least equivalent functionality. They call it the ‘reciprocity’ rule. They could do this because it was possible to check for these features on the third-party apps. For other types of data-mining companies, the context was different.

Because Facebook’s focus was on competing social networks, they did not focus enough on companies like Cambridge Analytica, and the thousands like it that were built during the 2010-2014 Graph-API-including-friends-data era. The reason is that they couldn’t do it technically: when you request API access to a third-party platform, you must declare that whatever you are doing follows the rules set forth in the terms of service. With the number of apps created on Facebook between 2010 and 2014, the validation was almost always automatic, so you could declare a video game when you were actually preparing to collect data for a future presidential election.

But what can Facebook do if an app accesses third-party data with user consent, stores it, then duplicates it on another server in another data center? Technically, nothing. If Facebook manages to discover it by chance, it can sue the fraudulent companies, but without any proof, that’s difficult.

Regulation will only keep honest people honest about users’ data

Regulation can be an important part of the solution, as Mark Zuckerberg proposed in his latest interview. For example, Europe introduced new privacy legislation in 2018 called the GDPR (General Data Protection Regulation). Under the GDPR, companies that violate its rules on how personal data is collected and handled can face fines of up to €20M or 4% of annual worldwide turnover, whichever is higher. This is not a technical approach, but the European Parliament hopes the threat will be big enough for companies to respect it. Still, fraudulent companies may use proxy companies to access this data from countries where privacy regulation is more passive. It is better to have a technical solution in addition to regulation.
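The fine formula itself is simple arithmetic; a quick sketch (the revenue figure is hypothetical):

```python
def max_gdpr_fine(annual_worldwide_turnover_eur: float) -> float:
    """GDPR Article 83(5): up to EUR 20M or 4% of annual worldwide turnover,
    whichever is higher."""
    return max(20_000_000, 0.04 * annual_worldwide_turnover_eur)

# A hypothetical company with EUR 40 billion in annual turnover:
print(f"EUR {max_gdpr_fine(40_000_000_000):,.0f}")  # EUR 1,600,000,000
```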

Technical solutions are required to establish API trust

As Kin Lane, API Evangelist and former Presidential Innovation Fellow on APIs under Barack Obama, states in his post, the first step is to be better at identifying who wants access to your users’ data, and to have independent third-party reviewers check the legitimacy of developers and companies at the onboarding stage. Having a robust application review process is a good start. The second step would be stronger ties with a robust API management solution: knowing who is doing what, and managing that with rate limiting (Facebook today allows up to 100 million requests a day without verification). That said, volume is a very simple indicator. Watching for suspicious patterns can be much more telling, as they can be indicators of machine crawlers. Identifying companies that receive significantly more data than the average could be a way of detecting intent to store data rather than support the user experience.
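As a sketch of what that kind of monitoring could look like behind an API management layer, here is a minimal outlier check; the metric, app names, and threshold are assumptions, not a description of Facebook’s actual tooling:

```python
from statistics import median

# Hypothetical per-app metric exported by an API management layer:
# how many profile fields each app actually pulls per day.
fields_pulled_per_day = {
    "sushi-finder": 120_000,
    "birthday-reminders": 95_000,
    "iq-quiz": 48_000_000,   # orders of magnitude above its peers
    "photo-frames": 60_000,
}

typical = median(fields_pulled_per_day.values())

# Volume alone is a crude signal, but apps pulling far more data than a
# typical app are good candidates for a manual review of intent.
for app, volume in fields_pulled_per_day.items():
    if volume > 10 * typical:
        print(f"flag for review: {app} ({volume:,} fields/day)")
```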

Facebook could also create a user-privacy-management certification that vouches for specific companies’ behavior with Facebook user data. This would allow Facebook to conduct more due diligence to ensure that third-party companies are not merely proxies used by others to maliciously obtain user data. Companies could be incentivized to get certified by receiving higher request limits.

In a recent discussion with Kin Lane, he also recommended monitoring at the app level, a little like the Apple App Store does by manually checking every app. This would allow Facebook to verify that the data flow corresponds to the application’s user experience and is therefore legitimate. All apps would then be available in a public directory, and the review-process data could be public too. This audit could be performed when the app reaches a threshold number of API calls, or a certain number of OAuth tokens.
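A minimal sketch of that trigger logic, with made-up thresholds: once an app crosses a call-volume or token-count threshold, it gets queued for the kind of manual, App Store-style review described above.

```python
# Hypothetical thresholds; real values would be tuned by the platform.
REVIEW_AFTER_API_CALLS = 10_000_000   # lifetime API calls
REVIEW_AFTER_OAUTH_TOKENS = 100_000   # distinct users who granted access

def needs_manual_review(total_api_calls: int, oauth_tokens_issued: int,
                        already_reviewed: bool) -> bool:
    """Queue an app for human review once it reaches meaningful scale."""
    if already_reviewed:
        return False
    return (total_api_calls >= REVIEW_AFTER_API_CALLS
            or oauth_tokens_issued >= REVIEW_AFTER_OAUTH_TOKENS)

# An app with modest call volume but 150,000 users' tokens gets reviewed:
print(needs_manual_review(2_500_000, 150_000, already_reviewed=False))  # True
```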

In the context of API terms of service, an interesting technical idea developed by Tyler Singletary (head of platform at Klout at the time) was a machine-readable API terms-of-service document. If you marry that with your API management solution, it becomes programmatically easier to detect this kind of data mining from the quantity of data stored, and then to revoke the access of the third party collecting it.
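One way to imagine such a machine-readable document, loosely inspired by that idea: the schema below is entirely hypothetical, but it shows how an API management layer could evaluate a clause like “no bulk storage of friends’ data” against an app’s observed behavior and revoke access on violation.

```python
# Hypothetical machine-readable ToS clauses, attached to an app's API
# credentials inside an API management layer.
terms_of_service = {
    "allowed_purposes": ["personalization", "social_recommendation"],
    "may_store_friend_data": False,
    "max_records_stored": 10_000,
}

def violates_tos(tos: dict, declared_purpose: str,
                 records_stored: int, stores_friend_data: bool) -> bool:
    """Evaluate an app's observed behavior against its machine-readable ToS."""
    if declared_purpose not in tos["allowed_purposes"]:
        return True
    if stores_friend_data and not tos["may_store_friend_data"]:
        return True
    return records_stored > tos["max_records_stored"]

# An app declared as a quiz, observed storing 50 million friend profiles:
if violates_tos(terms_of_service, "personality_quiz", 50_000_000, True):
    print("revoke API credentials and escalate for investigation")
```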

Following Lawrence Lessig’s idea that code is law (and law is code), the legal contract (data use and policy) could be tied to the interface contracts (APIs) directly in the code. That would be a first step toward monitoring API terms of service programmatically and would provide a better way of detecting data mining. (Enforcement is still required based on the findings uncovered during monitoring.)

Another step would be to timestamp and digitally sign every data-sharing interaction between a user and an application, and store those permission tokens in a shared ledger. The idea is that every data-share transaction would be unique and traceable.
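A minimal sketch of such a signed, timestamped data-share record, using Python’s standard library for hashing and signing; a real system would use asymmetric signatures and an append-only ledger, so everything below is illustrative:

```python
import hashlib
import hmac
import json
import time

PLATFORM_SIGNING_KEY = b"hypothetical-platform-secret"

def record_data_share(user_id: str, app_id: str, fields: list[str]) -> dict:
    """Create a unique, timestamped, signed record of one data-share event."""
    record = {
        "user_id": user_id,
        "app_id": app_id,
        "fields": sorted(fields),
        "timestamp": int(time.time()),
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["signature"] = hmac.new(PLATFORM_SIGNING_KEY, payload,
                                   hashlib.sha256).hexdigest()
    # In the idea described above, this record would then be appended to a
    # shared ledger so any later use of the data can be traced back to it.
    return record

print(record_data_share("alice", "iq-quiz-app", ["likes", "birthday"]))
```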

But let’s be clear: even if we can enforce the legal contract, we will never be able to get the data back by technically enforcing it. It is a little like sharing a secret with a third party in an offline discussion. Even if you sign an NDA, the third party can always pass the secret on to someone else. Once data is out, it is out. If you happen to discover it, you can sue them in court. This scandal should remind us to stay digitally vigilant. It reminds me of an old Arab proverb, written about words but adapted here to data: the data you keep secret is your slave; the data you share is your master.