PyCon is the largest annual gathering of the Python user
community, a group to which Pandata enthusiastically belongs. Several members
of Team Panda were able to participate in this year’s PyCon, conveniently held
in Cleveland. While there were too many interesting talks to recap, we wanted
to highlight a few that sparked the most thought and discussion.
Ethics in Data Science is a topic that is often at the forefront
of our minds at Pandata. Two talks by J. Henry Hinnefeld and Manojit Nandi on
model fairness stood out. Hinnefeld's talk stressed that it is not enough to
build models with extra constraints to solve the fairness question; the
subtleties of fairness need a human touch. Different groups experience
different realities, or ground truths. The question becomes: is fairness
equality or equity? Equality is treating everyone the same. Equity is giving
everyone what they need to be successful. A model can treat everyone equally,
but if people do not start from the same point, the model is not equitable,
and by that view it is unfair. Not only do different groups have different
ground truth conditions, but data can also be unintentionally skewed by human
bias. Taking these issues into consideration when designing a model is an
important ethical responsibility.
The talk by Nandi focused on how data and algorithms can be used ethically. Math is not racist, but the way we use algorithms to decide, for example, who gets a bank loan can be. How can we take steps to mitigate this problem? Ethics training for data scientists is essential: awareness that there are biases in data and in how we interpret outcomes is the first step to combating harm. Diversity on data science teams brings key insights from the perspective of underrepresented groups. There are also software packages that have been developed to address bias mathematically, including AI Fairness 360 (AIF360) and FairTest.
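To make the bank loan example concrete, here is a minimal sketch of one common fairness check, often called demographic parity: compare approval rates across groups when everyone is held to the same threshold. The applicant data, group labels, threshold, and function names are all made up for illustration, and this is not the API of any package mentioned above.

```python
from collections import defaultdict


def approval_rates(records, threshold):
    """Return the fraction of applicants approved in each group.

    records: iterable of (group_label, score) pairs.
    threshold: a single cutoff applied to every applicant ("equality").
    """
    approved = defaultdict(int)
    total = defaultdict(int)
    for group, score in records:
        total[group] += 1
        if score >= threshold:
            approved[group] += 1
    return {group: approved[group] / total[group] for group in total}


def demographic_parity_gap(rates):
    """Difference between the highest and lowest group approval rates."""
    return max(rates.values()) - min(rates.values())


# Hypothetical applicants: (group, credit-style score). Illustrative only.
applicants = [
    ("A", 720), ("A", 690), ("A", 650), ("A", 610),
    ("B", 660), ("B", 630), ("B", 600), ("B", 580),
]

rates = approval_rates(applicants, threshold=640)
gap = demographic_parity_gap(rates)
print(rates, gap)
```

The same cutoff is applied to everyone, yet the two groups end up with very different approval rates. Whether that gap counts as unfair is exactly the equality-versus-equity judgment that still needs a human.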
There were myriad technical talks, ranging from code testing to building solvers to data visualization and everything in between. An interesting and relevant talk by William Horton focused on using GPUs in Python code.
Moore’s law refers to the observation that the number of transistors on an
integrated circuit doubles roughly every two years, which has historically
brought steady gains in processing speed. However, as transistors on computer
chips continue to shrink, progress in line with Moore’s law has begun to slow
due to fundamental limitations of physics. Yet the scale of data continues to
expand rapidly, creating bottlenecks in the processing speed of workloads.
These bottlenecks stem from the limited number of cores and threads in current
CPU offerings, which caps how many processes can run in parallel at one time.
In comes a new contender – one previously dedicated to providing
gamers with smoother experiences in video games – the GPU.
GPUs can contain thousands of smaller
cores in comparison to the double-digit core count of current CPU offerings.
They are also built to specialize in certain types of computations common in
data science. Surely this must require large changes to code, you might think.
Not necessarily. Software like CUDA, a parallel computing platform developed
by NVIDIA, can be used in multiple ways as a “drop-in” within existing code.
CUDA implementations offer a rich environment that gives you a wealth of
control over how GPU resources are utilized, which can decrease the processing
time of code dramatically. Libraries like PyCUDA and Numba take this one step
further and allow developers to write code in Python that is “translated” in
order to run efficiently on a GPU with minimal refactoring of code. With
GPU-enabled code running in the range of 30x faster (or more!), we
truly are entering a new age of processing that only seems to be getting better
and better as time goes on.
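As a rough illustration of what writing a GPU kernel in Python can look like, here is a minimal sketch using Numba's CUDA support. It is not taken from the talk. The function names (`saxpy`, `_saxpy_kernel`), the block size of 256 threads, and the NumPy fallback for machines without Numba or a CUDA-capable GPU are all assumptions made for this example.

```python
import numpy as np

# Numba and a CUDA-capable GPU are optional here: without them the example
# falls back to plain NumPy so it still runs on any machine.
try:
    from numba import cuda
    GPU_AVAILABLE = cuda.is_available()
except Exception:
    cuda = None
    GPU_AVAILABLE = False


if GPU_AVAILABLE:
    @cuda.jit
    def _saxpy_kernel(a, x, y, out):
        # Each GPU thread handles one element of the arrays.
        i = cuda.grid(1)
        if i < x.size:  # guard threads that fall past the end of the array
            out[i] = a * x[i] + y[i]


def saxpy(a, x, y):
    """Compute a * x + y element-wise, on the GPU when one is available."""
    if not GPU_AVAILABLE:
        return a * x + y  # CPU fallback using NumPy

    out = np.empty_like(x)
    threads_per_block = 256  # illustrative choice, not a tuned value
    # Enough blocks to cover every element (ceiling division).
    blocks = (x.size + threads_per_block - 1) // threads_per_block
    # Numba copies the NumPy arrays to the GPU and back automatically.
    _saxpy_kernel[blocks, threads_per_block](a, x, y, out)
    return out


x = np.arange(5, dtype=np.float32)
y = np.ones(5, dtype=np.float32)
result = saxpy(np.float32(2.0), x, y)
print(result)
```

For arrays this small the GPU will not be faster, since copying data to and from the device dominates the runtime. Speedups like those described above generally show up only on much larger workloads.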
Numerous Python hobbyist talks were fun to attend for the craftier
Pandas. Talks on using Python to generate cross-stitch patterns by Katie
McLaughlin, making music in Python by Jessica Garson, and using signal
processing in Python to generate music for and control a player piano by JP
Bader gave us many ideas for projects to pursue in our spare time.
One of the most fascinating lessons of PyCon was that Python is more than just a coding language or professional tool; it is almost a way of life. The undercurrent of inclusiveness, intellectual curiosity, and passion that permeated the conference was inspiring. As a team, we walked away with an array of new tools and techniques for addressing model fairness and optimizing code, and we look forward to applying them to client work. There also might be a Python-generated Panda cross-stitch in our future.
Co-written by Pandata team members Hannah Arnson, Lead Data Scientist, Julie Novic, Data Analyst, and Joseph Homrocky, Data Engineer.