What’s In a Data Science Toolkit?

What makes up the nuts and bolts of a data science toolkit? What are the technical requirements, and how do we choose among them?

Operating Systems

The foundation of the toolkit is the operating system. The options are Linux, Windows, and macOS. While all share many similarities, Linux’s various distributions are favored by a large share of programmers, though many data scientists and analysts work in macOS or Windows instead.

Programming Languages

Data science requires a statistical programming language for machine learning and AI. While there are several statistical programming languages, the two dominant ones in the field are Python and R. The data science capabilities of both languages are rapidly expanding through the addition of libraries: packages of prewritten code designed to perform specific, integral tasks. Python is quickly becoming the programming language of industry because it is not exclusively a statistical language; it is also used in software and web development. R is primarily a statistical language, though new libraries are expanding its utility into web development. The two are functionally equivalent, and either can be incorporated into a data science environment. Choosing between them is often a matter of programmer preference and background.
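
As a quick illustration of what those libraries provide, here is a minimal sketch in Python that fits a linear regression with scikit-learn; the library choice and the toy data are ours, invented for demonstration:

    # A minimal sketch: fitting a linear model with scikit-learn.
    # Assumes scikit-learn and NumPy are installed; data is invented.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Toy data: years of experience vs. salary (entirely hypothetical).
    X = np.array([[1], [3], [5], [7], [9]])
    y = np.array([40_000, 52_000, 61_000, 75_000, 88_000])

    model = LinearRegression()
    model.fit(X, y)

    # Predict the salary for 4 years of experience.
    print(model.predict(np.array([[4]])))

The equivalent few lines in R would lean on lm(); the point is the workflow, not the syntax.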

Languages like SQL (Structured Query Language) are specific to a particular type of computing, in this case, databases. They allow data scientists and data engineers to interact with data in databases (and data warehouses) such as SQL Server, MySQL, Google’s BigQuery, and Apache Hive.
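
To give a feel for that interaction, here is a minimal sketch using Python’s built-in sqlite3 module; the table and rows are invented for illustration:

    # A minimal sketch of querying a database with SQL from Python.
    # sqlite3 ships with the standard library; the table is invented.
    import sqlite3

    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    cur = conn.cursor()

    cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    cur.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("North", 1200.0), ("South", 950.0), ("North", 430.0)],
    )

    # SQL lets us ask structured questions of the data.
    cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
    print(cur.fetchall())  # [('North', 1630.0), ('South', 950.0)]

    conn.close()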

Databases

Databases come in many flavors, and choosing the one that is right for your company depends on several factors:

  • How standard is your data? That is, do all entries contain the
    same columns and types of data?
  • Do you have multiple sources of data?
  • Do you have different blueprints (called schemas) for how each
    source of data is stored in the database?
  • How much data do you have?
  • Is your data static, or does it update frequently?

SQL databases like SQL Server, MySQL, SQLite, and Google Cloud SQL are relational databases. In a relational database, pieces of information are stored separately and connected by a key; these connections are the relations that give the database structure its name. Relational databases are useful when dealing with data whose structure will not change, as well as when sets of data (tables) have some sort of relation between each other. A good example is a company organizing departments and employees: there is a relation between each employee and the department they work in, as sketched below.
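
A minimal sketch of that departments-and-employees example, again with Python’s built-in sqlite3 module; the schema and rows are invented:

    # A minimal sketch of a relational schema: employees reference
    # departments through a key. Schema and data are invented.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    cur.execute("CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT)")
    cur.execute(
        "CREATE TABLE employees ("
        "  id INTEGER PRIMARY KEY,"
        "  name TEXT,"
        "  dept_id INTEGER REFERENCES departments(id))"
    )

    cur.execute("INSERT INTO departments VALUES (1, 'Analytics')")
    cur.execute("INSERT INTO employees VALUES (1, 'Ada', 1)")

    # The JOIN follows the key: this is the 'relation' in action.
    cur.execute(
        "SELECT employees.name, departments.name "
        "FROM employees JOIN departments ON employees.dept_id = departments.id"
    )
    print(cur.fetchall())  # [('Ada', 'Analytics')]

    conn.close()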

The alternative to a relational database is called NoSQL. Examples include MongoDB, Amazon’s DynamoDB, Google’s Cloud Datastore, and Microsoft’s Azure Cosmos DB. These databases store data in a flexible format and do not impose the relational constraints of standard SQL databases, so new columns can be added without restructuring the database. All the related information sits in the same table (or document), which makes the data easy to access. NoSQL databases are best suited for situations where structures change often, for example when new businesses are acquired or data collection strategies shift frequently. You will also want a NoSQL database when tying together multiple databases with similar but not identical structures.
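
The sketch below illustrates that flexibility, assuming a MongoDB instance running locally and the pymongo driver installed; the documents are invented:

    # A minimal sketch of NoSQL schema flexibility, assuming a MongoDB
    # instance at localhost and the pymongo driver. Data is invented.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    collection = client["company"]["employees"]

    # Two documents with different fields: no restructuring required.
    collection.insert_one({"name": "Ada", "department": "Analytics"})
    collection.insert_one({"name": "Grace", "skills": ["Python", "SQL"],
                           "remote": True})

    for doc in collection.find():
        print(doc)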

Data Lake vs Data Warehouse

Where you store your data depends on what type of data you have. A “Data Lake” is used when all you have is raw, unprocessed data, often with varying structures and no relations between one source and another. A “Data Warehouse” is similar to a Data Lake, but it stores structured or relational data consolidated from many sources, not just one.
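
A small sketch of the contrast, using only Python’s standard library; the files, records, and table are invented stand-ins for real lake and warehouse storage:

    # A minimal sketch: raw files of varying shape (lake) vs. a
    # processed, query-ready table (warehouse). All data invented.
    import json, os, sqlite3

    # Data lake: raw files dumped as-is, each with its own structure.
    os.makedirs("lake", exist_ok=True)
    with open("lake/events_a.json", "w") as f:
        json.dump({"user": "ada", "clicked": "signup"}, f)
    with open("lake/events_b.json", "w") as f:
        json.dump({"visitor_id": 42, "page": "/pricing", "ms": 130}, f)

    # Data warehouse: cleaned and structured before it is stored.
    conn = sqlite3.connect("warehouse.db")  # stands in for a real warehouse
    conn.execute("CREATE TABLE IF NOT EXISTS events (user TEXT, action TEXT)")
    conn.execute("INSERT INTO events VALUES ('ada', 'signup')")
    conn.commit()
    conn.close()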

Hadoop Data Science Environments

Hadoop is an open-source software framework that lets commodity hardware work together, dividing a workload into multiple processes that run simultaneously across multiple machines. This allows companies to perform computations on a scale that would otherwise require building a supercomputer.
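
The canonical illustration is a word count. Below is a minimal sketch of a mapper and reducer in the style of Hadoop Streaming, which pipes data through scripts via standard input and output; the map/reduce split is what lets Hadoop spread the work across machines (cluster configuration and input paths are assumed):

    #!/usr/bin/env python3
    # A minimal Hadoop Streaming-style word count. Run with 'mapper'
    # or 'reducer' as the argument; Hadoop handles sorting the map
    # output and distributing both phases across machines.
    import sys


    def mapper():
        # Emit "word<TAB>1" for every word on standard input.
        for line in sys.stdin:
            for word in line.split():
                print(f"{word}\t1")


    def reducer():
        # Hadoop sorts mapper output by key, so counts for the same
        # word arrive consecutively and can be summed in one pass.
        current, count = None, 0
        for line in sys.stdin:
            word, n = line.rsplit("\t", 1)
            if word != current:
                if current is not None:
                    print(f"{current}\t{count}")
                current, count = word, 0
            count += int(n)
        if current is not None:
            print(f"{current}\t{count}")


    if __name__ == "__main__":
        mapper() if sys.argv[1] == "mapper" else reducer()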

Hadoop is used when a company wishes to perform data analysis on a very large scale of data, a terabyte or more, while remaining extremely cost-effective. It is also used when the sources of data vary across the data set.

The data world is very large and full of options, and there is no “one option suits all.” Dealing with data is a careful balancing act that requires hours of research into the structure and sources behind that data. Choosing the right setup and toolkit can make or break your project.  Contact hello@pandata.co to learn more about setting up the right data science toolkit.

Joseph Homrocky is a Data Engineer and Julie Novic is a Data Analyst, both at Pandata.
