DataScience.US
A Data Professionals Community

Data Science for Newbies: An Introductory Tutorial Series for Software Engineers

This post summarizes and links to the individual tutorials in an introductory series on data science for newbies. The series is practical, focuses mainly on tools, and is written by a software engineer from a software engineering perspective.


Editor’s note: This is an overview of a multi-part tutorial on data science for newbies. The author has given the series a different, tongue-in-cheek title; take it in stride and recognize that the series’ approach and content are a fresh look at getting started with various aspects of data science from a software engineering perspective.

The PySpark console.
Part 1: Getting Started

To do serious statistics with Python, use a proper distribution such as the one provided by Continuum Analytics. A manual installation of all the needed packages (Pandas, NumPy, Matplotlib, etc.) is possible, but beware the complexities and convoluted package dependencies. In this article we’ll use the Anaconda Distribution. Installation under Windows is straightforward, but avoid running multiple Python installations side by side (for example, Python 3 and Python 2 in parallel). It’s best to let Anaconda’s Python binary be your standard Python interpreter.
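A quick way to confirm which interpreter is actually in use, and whether the packages mentioned above can be found, is a small check like the one below. This is only a sketch and not part of the original tutorial; the helper name `describe_environment` is made up for illustration, and it relies solely on the standard library.

```python
import importlib.util
import sys

# Import names of the packages mentioned in the text.
REQUIRED = ("pandas", "numpy", "matplotlib")


def describe_environment(packages=REQUIRED):
    """Return the active interpreter path, its version, and package availability.

    Hypothetical helper for illustration; it only looks packages up,
    it does not import them.
    """
    available = {
        name: importlib.util.find_spec(name) is not None for name in packages
    }
    return {
        "executable": sys.executable,
        "version": sys.version.split()[0],
        "packages": available,
    }


if __name__ == "__main__":
    info = describe_environment()
    print("Interpreter:", info["executable"])
    print("Python version:", info["version"])
    for name, found in info["packages"].items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If the printed interpreter path does not point into the Anaconda installation, another Python is probably first on the `PATH`.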

Part 2: Analyzing Reddit Comments & Querying Databases

Patterns are everywhere, but many of them can’t be recognized immediately. This is one of the reasons why we dig deep holes in our databases, data warehouses, and other silos. In this article we’ll use a few more methods from Pandas’ DataFrames and generate plots. We’ll also create pivot tables and query an MS SQL database via ODBC. SQLAlchemy will be our helper in this case, and we’ll see that even Losers like us can easily merge and filter SQL tables without touching SQL syntax. No matter the task, you always need a powerful toolset in the first place, such as the Anaconda Distribution we’ll be using here. Our data sources will be things like JSON files containing Reddit comments or SQL databases like Northwind. Many ’90s kids used Northwind to learn SQL.
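To give a rough idea of what merging, filtering, and pivoting without SQL syntax can look like, here is a minimal Pandas sketch. It is not taken from the tutorial: the two small DataFrames are hypothetical stand-ins for Northwind-style tables, with invented column names and values, and they are built in memory rather than loaded from MS SQL via ODBC and SQLAlchemy as the article describes.

```python
import pandas as pd

# Hypothetical stand-ins for two Northwind-style tables (invented data).
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "country": ["Germany", "USA", "Germany"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12, 13],
    "customer_id": [1, 1, 2, 3],
    "year": [1996, 1997, 1997, 1997],
    "freight": [32.0, 11.5, 65.0, 41.5],
})

# Rough equivalent of an SQL INNER JOIN on customer_id.
joined = orders.merge(customers, on="customer_id", how="inner")

# Rough equivalent of a WHERE clause: keep only the orders from 1997.
orders_1997 = joined[joined["year"] == 1997]

# Pivot table: total freight per country (rows) and year (columns).
freight_by_country = joined.pivot_table(
    index="country",
    columns="year",
    values="freight",
    aggfunc="sum",
    fill_value=0,
)

print(orders_1997)
print(freight_by_country)
```

With real data, the same `merge`, boolean filter, and `pivot_table` calls would apply unchanged once the tables have been read into DataFrames.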

Reddit comment ratings
Highest-rated comments for all available subreddits.
Part 2 Addendum: Playing SQL with DataFrames

So, let’s talk about a few features from Pandas I forgot to mention in the last two articles…
