What makes up the nuts and bolts of a data science toolkit? What are the technical requirements and how do we know how to choose among them?
The foundation of the toolkit is the operating system the computer runs. The options are Linux, Windows, and macOS. While all three share many similarities, Linux’s various distributions are favored by a large share of programmers. Many data scientists and analysts, though not all, work on macOS or Windows instead.
Data science requires a statistical programming language to do machine learning and AI. While there are several statistical programming languages, the two dominant ones in the field are Python and R. The data science capabilities of both languages are rapidly expanding through the addition of libraries, which are packages of existing code designed to perform specific, integral tasks. Python is quickly becoming the programming language of industry because it is not exclusively a statistical programming language and is used in software and web development as well. R is primarily a statistical language, though new libraries are expanding its utility into web development. For most data science work the two languages are functionally equivalent, and both can be incorporated into a data science environment. Choosing between them is often a matter of programmer preference and background.
Languages like SQL (Structured Query Language) are specific to a type of computing, in this case, databases. They allow Data Scientists and Data Engineers to interact with data in databases (and data warehouses) such as SQL Server, MySQL, Google’s BigQuery and Apache’s Hive.
Databases come in many flavors, and choosing the one that is right for your company depends on several factors:
- How standard is your data? That is, do all entries contain the same columns and types of data?
- Do you have multiple sources of data?
- Do you have different blueprints (called schemas) for how each source of data is stored in the database?
- How much data do you have?
- Is your data input static, or does it update frequently?
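The first question in the list above can be checked mechanically on a sample of records. What follows is a minimal sketch, assuming the records are already loaded as Python dictionaries; the function name `schema_profile` and the sample records are hypothetical and only serve as illustration.

```python
from collections import Counter


def schema_profile(records):
    """Count how many records share each (column, type) layout."""
    layouts = Counter()
    for record in records:
        # A layout is the sorted set of column names paired with value types.
        layout = tuple(
            sorted((key, type(value).__name__) for key, value in record.items())
        )
        layouts[layout] += 1
    return layouts


# Hypothetical sample: the third record carries an extra column.
sample = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
    {"id": 3, "name": "Linus", "team": "kernel"},
]

profile = schema_profile(sample)
is_uniform = len(profile) == 1
print(f"distinct layouts: {len(profile)}, uniform: {is_uniform}")
```

If a sample like this yields more than one layout, that is a hint the data may not be standard enough for a fixed table structure without some cleanup first.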
SQL databases like SQL Server, MySQL, SQLite, and Google Cloud SQL are relational databases. In a relational database, pieces of information are stored separately and connected by a key. These connections are the relations that give this database structure its name. Relational databases are useful when the structure of the data will not change, and when sets of data (tables) are related to one another. A good example is a company organizing departments and employees: there is a relation between employees and the department they work in.
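The department and employee example can be sketched with SQLite, which ships with Python's standard library. This is an illustrative sketch only: the table names, column names, and sample rows are assumptions, not a recommended schema.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # have SQLite enforce the relation

conn.execute("CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE employees ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " department_id INTEGER NOT NULL REFERENCES departments(id))"
)

# Hypothetical sample data.
conn.executemany(
    "INSERT INTO departments (id, name) VALUES (?, ?)",
    [(1, "Engineering"), (2, "Analytics")],
)
conn.executemany(
    "INSERT INTO employees (name, department_id) VALUES (?, ?)",
    [("Ada", 1), ("Grace", 1), ("Florence", 2)],
)

# The key department_id is the relation that connects the two tables.
rows = conn.execute(
    "SELECT e.name, d.name FROM employees AS e"
    " JOIN departments AS d ON d.id = e.department_id"
    " ORDER BY e.name"
).fetchall()

for employee, department in rows:
    print(employee, "works in", department)

conn.close()
```

Each department name is stored once, and every employee row points to it through the key, so renaming a department means updating a single row.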
The alternative to a relational database is called NoSQL. Examples include MongoDB, Google’s Cloud Datastore and Microsoft’s Azure Cosmos DB. These databases store data in a flexible format and do not impose the relational constraints of standard SQL databases. This allows new columns (fields) to be added without restructuring the database. All the related information is stored together rather than split across tables, which makes the data easy to access. NoSQL databases are best suited for situations where new businesses are acquired or data collection strategies change frequently. You will want a NoSQL database when tying together multiple databases with similar but not identical structures.
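The flexible format can be illustrated without any database at all. The sketch below uses a plain Python list of dictionaries as a stand-in for a document collection; it is not the API of MongoDB or any other NoSQL product, and the sources, field names, and the `find` helper are hypothetical.

```python
import json

# Two hypothetical sources with similar but not identical structures.
source_a = [{"customer": "Acme", "email": "ops@acme.example"}]
source_b = [{"customer": "Globex", "phone": "555-0100", "region": "EU"}]

# The "collection" is just a list of documents; no shared schema is enforced.
collection = []
for document in source_a + source_b:
    # Round-trip through JSON to store an independent, JSON-compatible copy.
    collection.append(json.loads(json.dumps(document)))

# A later document can carry a new field without restructuring earlier ones.
collection.append(
    {"customer": "Initech", "email": "it@initech.example", "tier": "gold"}
)


def find(docs, field):
    """Return the documents that contain the given field."""
    return [doc for doc in docs if field in doc]


with_email = [doc["customer"] for doc in find(collection, "email")]
print(with_email)
```

The trade-off is that queries have to tolerate missing fields, since nothing guarantees that every document has the same shape.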
Data Lake vs Data Warehouse
Where you store your data depends on what type of data you have. A “Data Lake” is used when all you have is raw, unprocessed data, often with varying structures that have no relations to one another. A “Data Warehouse” is similar to a Data Lake, but it stores structured or relational data from many sources, not just one.
Hadoop Data Science Environments
Hadoop is an open-source software framework that lets commodity hardware work together, dividing a workload into multiple processes that run simultaneously across multiple machines. This allows companies to perform computations on a scale that previously was not possible without building a supercomputer.
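The idea of dividing a workload and combining the partial results can be sketched on a single machine. The example below follows the map and reduce pattern that Hadoop's MapReduce component is built around, but it does not use Hadoop or its API: threads stand in for machines, and the function names and input lines are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Map step: count words within one chunk of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


lines = [
    "big data needs big clusters",
    "clusters split big jobs",
    "jobs run in parallel",
]

# One chunk per line here; each chunk is handled by a separate worker.
chunks = [[line] for line in lines]

with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(map_chunk, chunks))

word_counts = reduce_counts(partial_counts)
print(word_counts["big"], word_counts["clusters"], word_counts["jobs"])
```

In a real cluster the chunks would live on different machines and the framework would handle scheduling and failures, which this sketch leaves out entirely.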
Hadoop is used when a company wishes to perform data analysis on a very large volume of data, a terabyte or more, while still being extremely cost-effective. It is also used when the sources of data vary across the data set.
The data world is very large and full of options, and there is no “one option suits all.” Dealing with data is a careful balancing act that requires hours of research into the structure and sources behind that data. Choosing the right setup and toolkit can make or break your project. Contact firstname.lastname@example.org to learn more about setting up the right data science toolkit.
Joseph Homrocky is a Data Engineer and Julie Novic is a Data Analyst, both at Pandata.