Copyright © 2018 DataScience.US All Rights Reserved.
The Art of Data Science: The Skills You Need and How to Get Them
The meteoric growth of available data has precipitated the need for data scientists to leverage that surplus of information.
This spotlight has caused many industrious people to wonder, “Can I be a data scientist, and what skills would I need?” The answer to the first question is yes – regardless of your prior experience and education, this role is accessible to motivated individuals looking to meet this challenge. As for the second question, the necessary skills (some formal and some more… artful) and how to acquire them, based on my experience as a data scientist, are enumerated below.
1. Knowing The Algorithms
To be a data scientist, you need to know how and when to apply an appropriate machine-learning algorithm. Period. Here’s where to develop and hone those skills:
- The Basics – one of my favorite books on this subject is Machine Learning in Action. It covers the main categories of algorithms (classification, regression, clustering, etc.) and provides Python code so you can experiment with the examples yourself. I recommend working through a similar book first to get a firm foundation.
- Advanced – the best place to find out what’s hot in data science is the website Kaggle. The competitions democratize data science – all that matters is which method achieves the best results. Problems and methods are discussed in forums and the winners share their approaches in blogs. If you set up a free account and follow the competitions for 6-8 months, you will get loads of real-life machine learning experience.
2. Extracting Good Features
Applying the appropriate algorithm doesn’t guarantee performance. You need to provide the method with the right inputs (note: in some situations the raw data will suffice). This is commonly referred to as “feature engineering”. You need to be ready for any potential scenario – and practice makes perfect – but here are some common variable types you’ll need to be aware of:
- The Basics – you’ll learn a lot about feature engineering from the Kaggle competitions. You will also want to keep an eye on social media for posts about good feature development. Two Twitter feeds I recommend are @DataScienceCtrl and @analyticbridge. You can also read other data scientists’ takes on acquiring the skills for free.
- Risk tables – translating non-numeric data into risk can be effective when you’re dealing with many categories of varying size. Here’s a blog on building smoothed risk tables.
- Text – when you’re dealing with a text field supplied by the user, anything goes (spelling, formatting, morphology, etc.) – as an example, think about how many variations of your address will result in you getting your mail. Maybe 50? There are entire books written on how to deal with text (here’s one of my favorites). Techniques are generally classified under the heading of NLP (Natural Language Processing). Common methods for translating raw text into features include TF-IDF, token analyzers (e.g. Lucene) and one-hot encoding.
- Composite Features – data science borrows heavily from other fields, often crafting features from the principles of statistics, information theory, biodiversity, etc. A very handy tool to have in your arsenal is the Log Likelihood Ratio. It’s almost like building a mini-model into one feature – rather than have a method discover this from the raw data, present a calculated test statistic which encapsulates the behavior.
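To make two of the feature types above more concrete, here is a minimal Python sketch of a smoothed risk table and a log likelihood ratio feature. It is one plausible way to build them, not the method from the linked posts. The function names, the `prior_weight` smoothing parameter, and the choice of the G-test form of the statistic for a 2x2 table of counts are all my own assumptions for illustration.

```python
from collections import defaultdict
import math


def smoothed_risk_table(categories, labels, prior_weight=20.0):
    """Map each category to a smoothed estimate of P(label == 1).

    prior_weight is a hypothetical smoothing strength: it behaves like that
    many pseudo-observations at the overall rate, so small categories are
    pulled toward the global average while large ones keep their own rate.
    """
    overall_rate = sum(labels) / len(labels)
    counts = defaultdict(int)
    positives = defaultdict(int)
    for category, label in zip(categories, labels):
        counts[category] += 1
        positives[category] += label
    table = {
        category: (positives[category] + prior_weight * overall_rate)
        / (counts[category] + prior_weight)
        for category in counts
    }
    # Categories never seen in training can fall back to overall_rate.
    return table, overall_rate


def log_likelihood_ratio(k11, k12, k21, k22):
    """G-test statistic for a 2x2 table of counts.

    Computes 2 * sum(observed * ln(observed / expected)), where expected
    counts assume the row and column variables are independent.
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = rows[i] * cols[j] / total
                g += o * math.log(o / expected)
    return 2.0 * g
```

In practice you would build the risk table on training data only, then look up each record’s category (falling back to the overall rate for unseen values) to produce the feature. The log likelihood ratio can be computed per record from whatever pair of counts describes the behavior you care about.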
3. Demonstrating the Value
Sites like Kaggle are invaluable for loading your data science toolbox with effective methods. And new tools are becoming available every day, like Google’s TensorFlow. But the real art of data science is in applying these tools to address a business challenge – otherwise they’re just impressive equations on a chalkboard. The only real way to learn this art is practice, practice and more practice. Here are a few things to keep an eye on while you’re practicing:
- Reason codes – how will your solution be used? It doesn’t happen in every case, but sometimes interpreting the results of your model is critical to realizing the value. For instance, if a high score gets put into a queue for human review, the reviewer might need to know why this particular transaction scored high. Crafting good features and selecting an algorithm that yields reason codes are critical for interpreting the results.
- Operational considerations – the evaluation metric must be chosen carefully when estimating the value of your solution. Generally, there is a cost associated with taking action on a high score. Is there a limit to how many actions you can take (such as the maximum number of units a human team can manually review)? Is there a cost associated with making a bad decision? False positives can be a problem and should be factored into the metrics. The situation presented by these scenarios may make the standard model metrics obsolete.
- Generalizing the results – a model that is trained to great acclaim is worthless if performance doesn’t hold up in real life. For each of your features, ask yourself, “Will this information be available when my model needs it?” If not, the impressive performance you attained in the lab won’t be achieved when it counts. For example, if you’re trying to estimate hospital readmission today and you trained a model using notes from the previous day, make sure those notes will be available quickly enough that the features could get a value.
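The operational considerations above can be turned into a simple business-facing metric. The sketch below is only an illustration under assumed conditions: a review team can handle a fixed number of cases, every review has a cost, and only true positives return value. The function name, parameter names and cost model are hypothetical and would need to be adapted to your own situation.

```python
def net_value_at_capacity(scores, labels, capacity, value_per_catch, cost_per_review):
    """Estimate the net value of reviewing the highest-scoring cases.

    Assumptions (illustrative only): the team reviews at most `capacity`
    cases, each review costs `cost_per_review`, and each true positive
    (label == 1) that gets reviewed returns `value_per_catch`.
    """
    # Rank cases from highest to lowest score and keep what the team can handle.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    reviewed = ranked[:capacity]

    true_positives = sum(label for _, label in reviewed)
    false_positives = len(reviewed) - true_positives
    precision = true_positives / len(reviewed) if reviewed else 0.0
    net_value = true_positives * value_per_catch - len(reviewed) * cost_per_review

    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "precision": precision,
        "net_value": net_value,
    }
```

Comparing two candidate models on a number like `net_value` at your real review capacity can lead to a different choice than comparing them on a standard metric computed over every score.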