Copyright © 2018 DataScience.US All Rights Reserved.
Rescuing Data Science and Big Data Projects
This post was written after participating in data science and big data projects over the past few years, and after researching and talking to other people about why their projects succeeded or failed.
Most of the engagements I’ve personally been involved in were already in progress before I was called in to help rescue the project. It should come as no surprise that the main issues are non-technical.
Every decision in a big data or data science project needs to be tied to a business requirement or a question the business wants answered. Too many projects flounder in the first few months investigating the matrix of overlapping (and sometimes confusing) technologies needed to build a solution. All the vendors are out there screaming, “Pick me, pick me!” Or your friend tells you what he is using on his [supposedly] successful project. The point is that it is really hard to choose which technologies you will use, so plan some time to investigate. I have my own opinion on which products should be used, but I’ll save that for another article.
Setting executive-level expectations is important in a data science project. It may not be the first thing that needs to happen in the project, but I’m putting it up front to help set the rhythm for the rest of the article.
When you do get in front of the right audience, here are some points to cover when talking to an executive about what a successful data science project looks like:
- The project has a team with the right set of skills
- The project needs time to build in accuracy checks to verify answers
- The project needs an executive champion
- Executives need to know how to validate the answers
- Executives cannot request unrealistic results
An example of unrealistic results is taken from an actual project where a three-year forecast was requested when the team only had six months of data.
Creating a method of validating answers from your data science models is often one of the most difficult tasks in the project. If it is a financially oriented project, the accounting and financial systems are often good sources for validation. The technical team needs to take a first pass at validation and needs to be prepared to explain to the business audience how the answer was derived and how it was validated. Executives like to have the process explained to them; it becomes a way for them to start trusting the answers coming from the new system. Executives can also serve as a reality check – they can often rely on their experience to give an intuitive double-check on whether the answer seems realistic.
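As an illustration of a first-pass validation against a financial system, here is a minimal sketch that reconciles model totals with ledger totals period by period. The function name, the data layout (dictionaries keyed by period) and the 2% tolerance are all assumptions made for this example, not something taken from a particular project.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    period: str
    model_total: float
    ledger_total: float
    relative_gap: float
    within_tolerance: bool


def reconcile(model_totals, ledger_totals, tolerance=0.02):
    """Compare model output with ledger totals, period by period.

    `tolerance` is a hypothetical acceptance threshold (2% here);
    the business should agree on the real value.
    """
    results = []
    for period in sorted(model_totals):
        if period not in ledger_totals:
            # No ledger figure to compare with; skip rather than guess.
            continue
        model_value = model_totals[period]
        ledger_value = ledger_totals[period]
        if ledger_value == 0:
            # Avoid dividing by zero: only an exact match counts as agreement.
            gap = 0.0 if model_value == 0 else float("inf")
        else:
            gap = abs(model_value - ledger_value) / abs(ledger_value)
        results.append(
            CheckResult(period, model_value, ledger_value, gap, gap <= tolerance)
        )
    return results


if __name__ == "__main__":
    # Made-up figures, for illustration only.
    model = {"2018-01": 101.0, "2018-02": 150.0}
    ledger = {"2018-01": 100.0, "2018-02": 120.0}
    for result in reconcile(model, ledger):
        status = "ok" if result.within_tolerance else "INVESTIGATE"
        print(result.period, f"{result.relative_gap:.1%}", status)
```

A report like this also gives the team something concrete to walk executives through when explaining how an answer was validated.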
The right set of skills on a project is a combination of traditional IT/Development skills plus Data Science skills.
Example: When a data science project is given to an IT team, the result is often a good, solid, well-performing infrastructure with a clean and efficient data flow. However, this team usually struggles with applying the right algorithms to the business problems. Another strong skill supplied by someone from IT is project leadership; they tend to know more about running and leading a successful project than a data scientist.
When a project is given to a pure data science team, the result is typically very good answers to the business questions, because data science people often know the right algorithms to apply to the different questions. However, the infrastructure is often fragile and poorly managed, and there are frequent complaints of poor performance.
The recommended approach is to assemble a team with a good mix of skills. Here is a list of the types of skills needed:
Business expert / Domain Expert / Knowledge worker
- Needs to frame the questions properly
- Needs to have skill to be able to tell the story when the answers arrive
IT / Data engineering expert
- Knows the data sources and uses of the data
- Architects the data movement processes
- Incorporates master data
- Cleanses the data
- Establishes the metadata repository
- Helps validate the answers
- Usually the performance expert
Data Scientist
- Needs ability to know what algorithms to apply to business questions
- Needs to understand the data sources and meanings
- Needs ability to tell the story when answers arrive
- Should have some coding background
- Must follow proper coding and software engineering disciplines
Data Visualization / Consumption expert
- Can be the same Business Expert mentioned earlier
- Different tools can introduce architectural changes earlier in the stack
- Different tools produce different query patterns
Architect & project leader
- Understands all aspects of the project
- Realizes the value of having a well-rounded team
Note: During a proof-of-concept (POC) phase of a project it is not necessary to have all these skill sets. It is up to the project leader to determine which skills are needed. The other skills will be needed when the model is operationalized.
Proper software engineering techniques need to be followed. Often the data science people have not been exposed to source control tools and need to be taught the fundamentals. Everything should be kept under source control, e.g. R scripts, Python scripts, and so on. This also extends to the automated deployment scripts created by the infrastructure team, something usually overlooked by the operations and deployment teams.
The next challenge is teaching the data science team the value of multiple environments, i.e. the traditional dev, test and production environments. Then teach them how to deploy to each environment. Don’t forget to teach them the golden rule of ‘never change production code directly’: check the code out of the repository, then deploy and test in the proper environment.
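One common way to keep the same code moving unchanged from dev to test to production is to select settings by environment instead of editing scripts in place. The sketch below assumes a hypothetical `APP_ENV` environment variable and made-up database names; it illustrates the idea and is not a prescribed layout.

```python
import os

# Hypothetical per-environment settings; names and values are placeholders.
SETTINGS = {
    "dev": {"database": "sales_dev", "allow_schema_changes": True},
    "test": {"database": "sales_test", "allow_schema_changes": True},
    "production": {"database": "sales_prod", "allow_schema_changes": False},
}


def load_settings(environ=None):
    """Return (environment name, settings) for the current environment.

    The environment is read from the assumed APP_ENV variable and
    defaults to "dev", so nobody targets production by accident.
    """
    environ = os.environ if environ is None else environ
    name = environ.get("APP_ENV", "dev")
    if name not in SETTINGS:
        raise ValueError(f"Unknown environment: {name!r}")
    return name, SETTINGS[name]


if __name__ == "__main__":
    env_name, settings = load_settings()
    print(f"Running against {settings['database']} ({env_name})")
```

Because the script itself never changes between environments, the version in the repository is the version that runs in production.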
“Data that is loved tends to survive.” – Kurt Bollacker, Data Scientist
Successfully creating a model is not the end of the project. Experience has shown that the data science team tends to jump to the next business question and begin building the next model. Collecting and understanding the data to feed into a model is complex and takes long enough that it is understandable when a successful model is construed as the end of the project. Professional discipline must be exercised to operationalize the model, which is one of the most important goals of the project. Many project plans do not account for the amount of time it takes to move from a successful model to full production.
Don’t believe the first answer that is produced by a model. Figure out a way to validate the answers. Question whether the training set used was the right one. Build time into the project plan for creating validation methods or processes – it is a step often left out of the original project plans.
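One simple validation habit is to hold back part of the data and check that the model beats a naive baseline on it. The sketch below is a generic holdout check in plain Python; the function names, the 25% holdout fraction and the "always predict the training mean" baseline are assumptions chosen for illustration.

```python
import random


def split_holdout(rows, holdout_fraction=0.25, seed=42):
    """Shuffle (x, y) rows and split them into training and holdout sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def beats_baseline(train, holdout, predict):
    """True if `predict` has a lower holdout error than a naive baseline.

    The baseline always predicts the mean of the training targets.
    """
    baseline = sum(y for _, y in train) / len(train)
    actual = [y for _, y in holdout]
    model_error = mean_absolute_error(actual, [predict(x) for x, _ in holdout])
    baseline_error = mean_absolute_error(actual, [baseline] * len(actual))
    return model_error < baseline_error


if __name__ == "__main__":
    # Synthetic data where y = 2x, so a correct model is easy to write.
    rows = [(x, 2.0 * x) for x in range(20)]
    train, holdout = split_holdout(rows)
    print(beats_baseline(train, holdout, predict=lambda x: 2.0 * x))
```

A model that cannot beat such a baseline on data it has not seen is a signal to revisit the training set before anyone presents the answer.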
Learn how to explain the answers. Prepare presentations for the executives and the business community that will benefit from this project. The Business Expert, who is one of the key team players, should play a big part in developing the validation processes as well as in preparing the presentations.
Don’t believe all the industry hype. A data science project should be undertaken because it is the best way to address certain business challenges, not because it is the cool trend and everyone else is doing it. Many BI visualization tools can be used to quickly arrive at answers to business questions without going through a separate modeling and training process. And the data doesn’t always need to be copied to HDFS to do data science. Some R and Python based products can read directly from some of the more popular relational databases.
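To illustrate reading directly from a relational database, the sketch below uses Python’s built-in `sqlite3` module as a stand-in for whatever database a project already has. The `orders` table, its columns and the sample rows are made up for the example; a real project would connect to its existing system with the appropriate driver.

```python
import sqlite3
import statistics


def build_demo_db():
    """Create an in-memory database with a small, made-up orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("east", 120.0), ("east", 80.0), ("west", 200.0)],
    )
    return conn


def mean_amount_by_region(conn):
    """Query the table in place and compute the mean order amount per region."""
    rows = conn.execute(
        "SELECT region, amount FROM orders ORDER BY region"
    ).fetchall()
    by_region = {}
    for region, amount in rows:
        by_region.setdefault(region, []).append(amount)
    return {region: statistics.mean(values) for region, values in by_region.items()}


if __name__ == "__main__":
    connection = build_demo_db()
    print(mean_amount_by_region(connection))
    connection.close()
```

The analysis runs against the data where it already lives, with no separate copy step.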
It is not a big data problem. Let’s turn our attention to big data projects for this next point. Too many projects have jumped to a Hadoop-based solution only to discover that it is more expensive and slower than their current technology. Example: a recent customer went back to their relational system when we showed them that putting a clustered column-store index on their big fact tables gave them a 7x-10x improvement in query times, with a data+index footprint that was 8x smaller. In my opinion, Hadoop is becoming a niche player (Hmmm, sounds like another blog topic!).
I’ve seen customers bypass in-memory or scale-out technologies that are available from their current relational database system in favor of a Hadoop-based solution. I would recommend at least exploring, or doing a POC with, what is available today to extend your current system. You might find yourself going a lot further with less cost and less time invested.
I’ve also seen projects where the requirements point to a Big Compute solution but the customer has jumped on the Hadoop/Big Data bandwagon. My advice is to learn how to tell the difference between Big Compute projects and Big Data projects. Don’t worry if you get it wrong the first time: it is fairly easy to switch to a Big Compute solution, because it can usually use the same data source as your Big Data project, and experience has taught us that collecting and shaping the data correctly is a big part of the project.
In conclusion, successful data science and big data projects are done by a well-rounded team, sponsored by an executive, and have a direct impact on business decisions that need to be made. Don’t forget to spend some time doing investigations into patterns and products before launching an official project.