Facebook / Cambridge Analytica, the Equifax data breach, and political email hacks are just a few examples of the data misuse and ethics violations that constantly make headlines. All but the most hardened hacker would agree that breaking into a database containing secured personal information is both illegal and morally repugnant. But in this modern era of Big Data and analytics, circumstances can be much murkier. Though there are rules governing how data is collected, stored and used, there is no central code or Hippocratic Oath for Data Scientists. Therefore, it is our responsibility to think frequently and deeply about how we interact with data.
We break down data ethics into four facets: ownership, consent, privacy, and transparency.
- Who owns data: is it the collector or the collectee? What does it mean to own data? This is an issue of control. Ownership means that one can determine what data is stored and can move or delete it as desired.
- We define consent as a set of permissions to use data, including who, what, when, and for what purpose. Consent should be limited in scope and relate only to permissions explicitly granted. Consent should also be revocable at any time.
- Data privacy relates to the idea that only authorized entities have access to personal data, governed by the terms of consent. As hacks and data breaches can happen, a more ambiguous question is what constitutes a reasonable attempt at privacy. If data is anonymized, how anonymous is it really? Can identity still be determined with some effort?
- The final aspect of data ethics is transparency: what is the data being used for, and how is it being manipulated?
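The definition of consent above (who, what, when, and for what purpose, limited in scope and revocable at any time) can be made concrete as a small data structure. What follows is only a minimal sketch in Python, not a description of any particular system: the `Consent` class, its field names, and the example values such as `analytics-team` and `churn-model` are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Consent:
    """One grant of permission (hypothetical structure, for illustration only)."""
    grantee: str              # who may use the data
    data_fields: frozenset    # what data is covered
    purpose: str              # for what purpose
    expires_at: datetime      # when the permission lapses
    revoked_at: Optional[datetime] = None

    def revoke(self, now: Optional[datetime] = None) -> None:
        """Record that the data owner has withdrawn consent."""
        self.revoked_at = now or datetime.now(timezone.utc)

    def permits(self, grantee: str, data_field: str, purpose: str,
                now: Optional[datetime] = None) -> bool:
        """Allow a use only if it matches the grant exactly and is still active."""
        now = now or datetime.now(timezone.utc)
        if self.revoked_at is not None and now >= self.revoked_at:
            return False
        return (
            grantee == self.grantee
            and data_field in self.data_fields
            and purpose == self.purpose
            and now < self.expires_at
        )


# Example values are made up; a fixed timestamp keeps the checks repeatable.
check_time = datetime(2025, 6, 1, tzinfo=timezone.utc)
consent = Consent(
    grantee="analytics-team",
    data_fields=frozenset({"email", "zip_code"}),
    purpose="churn-model",
    expires_at=datetime(2030, 1, 1, tzinfo=timezone.utc),
)

# A use that matches the grant is allowed.
allowed = consent.permits("analytics-team", "email", "churn-model", now=check_time)

# The same data for a different purpose falls outside the granted scope.
out_of_scope = consent.permits("analytics-team", "email", "marketing", now=check_time)

# Once consent is revoked, the previously allowed use is refused.
consent.revoke(now=check_time)
after_revocation = consent.permits("analytics-team", "email", "churn-model", now=check_time)
```

The point of the sketch is that anything not explicitly granted is denied by default, which mirrors the idea that consent should relate only to permissions explicitly given.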
As data scientists, we spend a good deal of time designing and optimizing algorithms to analyze data. Because humans are biased, data is biased, and therefore algorithms can be biased as well. Here is a concrete example of gender bias in the deep learning algorithms behind Google Translate. I translated “She is a doctor” from English into Turkish, a language without grammatical gender. Then, I translated it back into English, yielding “He is a doctor.” The same did not happen for “She is a nurse.” As a field, we need to keep this in mind and go to great lengths to avoid all types of bias in our analyses.
What is being done to govern the proper use of data? In the United States, there are field-specific regulations such as HIPAA (Health Insurance Portability and Accountability Act) and the Fair Credit Reporting Act. Recently, the European Union enacted a more sweeping set of regulations, the General Data Protection Regulation (GDPR), which establishes strict rules on how the personal data of people in the EU can be used, regardless of where in the world the data-collecting entity is located. Generally, GDPR requires strict privacy settings and opt-in rather than opt-out consent, gives the person whose data is being collected control over it, and requires full disclosure of how data is used. Violators can be hit with very significant fines: up to 20 million euros or 4% of worldwide annual revenue, whichever is higher.
However, advances in data science often outpace such regulatory actions. So, what does all this mean for Pandata? As a data science consulting company, we work with a plethora of datasets from a variety of clients at varying levels of data maturity. While some breaches are due to negligence or nefarious acts, many are the result of careless errors that can happen when one forgets that data is more than just numbers that live in the cloud. We look at data protection from a global perspective and take the initiative to go beyond what regulations stipulate. In addition to data protection training, all team members, whether they work directly with data or not, are involved in ethical discussions around the appropriate use and implications of our projects. It is critically important to have these conversations with our clients too. Companies that have strong internal ethics and transparent procedures are the best equipped to avoid errors and, therefore, the penalties tied to regulation. They also win the trust of their customer base.
Need help understanding your industry’s data-related regulations, or not sure where to begin when it comes to data ethics? Contact us at firstname.lastname@example.org.
Hannah Arnson is a Data Scientist at Pandata.