Real Data Scientists Do It Themselves

September 11, 2014

Demand for data science professionals is far outpacing supply. Much of the gap is filled by professionals from other domains who double up as data scientists. While these professionals are experts in their respective domains, they rely on academic training they received a decade or two ago. Because statistical techniques were not in vogue in the workplaces they went on to join, the techniques were not learnt with much interest or rigor. Nor were the techniques subjected to the variety and volume of data sets available today, so their reliability and validity were never stress tested.

These pseudo data scientists, relying on their training, some quick wiki reads, or newsletters, are aware of various statistical techniques and can even explain them conceptually to others, but they lack the rigor and the know-how to execute these techniques themselves.

The rigor challenge does not pose much of a threat. If you are unaware of a problem, it does not exist for you. As long as it does not exist, you know everything necessary to project yourself as an expert. Also, the client does not know much and has no choice but to rely on you. For example, whatever the distribution of the data, you assume it is normally distributed and apply standard parametric tests, because you do not know how to treat a negative binomial or zero-inflated Poisson distribution. It is important to understand that as more and more data is captured on more and more variables, many of these variables contain NULL or zero values. This ill treatment of data leads to unreliable p-values, t statistics, or F statistics and hence to false positives, which again does not pose much of a threat, because you rely on tools that overwhelm you with choices of input methods and you pick the default option. Seriously, how many people really bother about the several dozen options available for factoring and rotation in factor analysis? There are many more examples to cite, but I want to keep the post short. I will touch upon them when I talk about specific techniques in subsequent posts.
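To make the zero-heavy data point concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption for illustration: the simulated "visits" variable, the rate of 3, the 60% share of structural zeros, and the helper names are all hypothetical. It simulates zero-inflated counts and compares them with what a plain Poisson distribution of the same mean would imply. It is a quick diagnostic, not a substitute for fitting a proper zero-inflated or negative binomial model.

```python
import math
import random
import statistics


def sample_poisson(rate, rng):
    """Draw one Poisson variate with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_zero_inflated(n, rate, zero_share, rng):
    """With probability zero_share emit a structural zero, else a Poisson count."""
    return [0 if rng.random() < zero_share else sample_poisson(rate, rng)
            for _ in range(n)]


def count_diagnostics(values):
    """Compare the data with what a plain Poisson of the same mean would imply."""
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values)
    return {
        "mean": mean,
        "variance": variance,
        "dispersion": variance / mean,  # about 1 for a plain Poisson
        "observed_zero_share": values.count(0) / len(values),
        "poisson_zero_share": math.exp(-mean),  # P(X = 0) for Poisson(mean)
    }


# Hypothetical data: 5000 simulated counts, 60% of them structural zeros.
rng = random.Random(2014)
visits = sample_zero_inflated(n=5000, rate=3.0, zero_share=0.6, rng=rng)
report = count_diagnostics(visits)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

A dispersion well above 1, together with far more zeros than a Poisson of the same mean predicts, is a hint that neither the normal assumption nor a plain Poisson fits, and that a zero-inflated or negative binomial model deserves a look before any test statistic is trusted.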

Now comes the challenge of executing the method or technique in software. That is solved by hiring recent graduates with general management degrees who hold several certificates from open courses or short-term crash courses that focus on hands-on use of a tool. This new breed was born in the era of point and click, in which the computer blurts out output in a few milliseconds, output they are not responsible for interpreting since there is a pseudo data scientist to do that. These small successes in getting an analysis to run in software lead them to believe that they are data scientists in the making. At times, the computer throws an error message such as "data type mismatch" or "need numeric data". The pseudo assistant data scientist checks and finds that the variable holds string data. Easy transform and recode commands are available to convert string data into numeric data, and so they recode the strings into consecutive numbers starting at 1, which makes the variable look like interval-scale data. It is important to know that even if the values are numeric, the data may still be nominal. After the transformation the analysis runs fine and, hurray, the problem is solved.
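The recoding trap can be shown in a few lines. The sketch below is illustrative only: the "regions" variable and both helper functions are hypothetical, and it uses only the Python standard library. It recodes the same nominal variable to consecutive integers under two different category orders, then contrasts that with indicator (one-hot) columns, which is one common way to keep nominal data nominal.

```python
from statistics import fmean

# Hypothetical nominal variable: the labels have no natural order.
regions = ["south", "north", "west", "north", "east", "south", "north"]


def integer_codes(values, order):
    """Recode labels to 1..k following an arbitrary category order."""
    lookup = {label: i for i, label in enumerate(order, start=1)}
    return [lookup[v] for v in values]


def one_hot(values):
    """One 0/1 indicator column per category; no order is implied."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


alphabetical = integer_codes(regions, ["east", "north", "south", "west"])
compass = integer_codes(regions, ["north", "east", "south", "west"])

# The "average region" depends entirely on how the labels were numbered.
print(fmean(alphabetical), fmean(compass))

# Indicator columns give category shares, which do not depend on any numbering.
shares = {c: fmean(col) for c, col in one_hot(regions).items()}
print(shares)
```

The two integer codings give different means for identical data, which is the sign that any statistic treating the codes as interval-scale numbers is an artifact of the recoding and not a property of the data.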

Overall, the talent gap is filled fast. The tried and tested, decades-old techniques based on ordinary linear models continue to be used. No heed is paid to adopting newer developments in ensemble learning and nonlinear models such as SVMs and random forests, or to methods such as logit, probit, tobit, and ridge regression, which extend or regularize the linear model. The issue with the newer models is the difficulty of interpreting their outputs, which requires a far more rigorous understanding and experience of applying them to different data sets.

The field of data science is evolving fast. Any fast-evolving field has pseudo experts and incorrect application of concepts. Fundamental, untenable mistakes are committed at different stages of data analysis without anyone even realizing it.

Some universities are adapting to the new needs. Harvard has started teaching programming in MBA courses. Many universities have launched specialized degree programmes offering rigorous training in data science. In my interactions with data scientists at leading data companies, I found that even very senior executives carry out the entire cycle of data analysis by themselves, from pulling the data out of databases to analyzing it to preparing the final report. The approach is to have a small number of high-performing senior data scientists accomplish the task. One criterion at a large data company for a lean data science team was that no more than two large pizzas should be needed to feed the entire team at lunch or dinner. Real data companies do not have teams with large hierarchies. Anyone aspiring to be a data scientist has to upskill in all aspects to be successful, and most importantly, doing it yourself is the key to success.

Real data scientists need an understanding of the business context, conceptual grounding in the techniques, know-how of the tools and methods to execute those techniques, and the ability to present insights to the end user.
