INTRODUCTION TO CRITICAL THINKING IN STATISTICS FOR DATA SCIENCE

Do vaccines cause autism? Is climate change taking place? If so, are human activities to blame? How do we predict the risk of the next killer virus? How safe are GMO foods? These are questions most of us care deeply about, and the answers, or the further inquiry these questions prompt, have enormous social impact. Upon seeing these questions, it is natural to ask: how do we proceed? Where do our investigations begin? This is where statistics shines. In my opinion, the greatest utility of statistics is that it provides tools which enable us to analyze complex issues and gain insight. I wish someone had given me a similar overview of statistics before I was thrust into standard deviation calculations and t-tests.

The book titled The Evolution of Physics by Einstein and Infeld is one of my favorite books. I like the way Einstein and Infeld introduce the reader to the notion of vectors by illustrating how the problem of understanding motion inspired the vector concept. Similarly, in Naked Statistics, Charles Wheelan draws attention to four kinds of problems that motivate the use of statistical tools. These four problems are:

  • Description and Comparison: How do we compute a number or measure that gives us a satisfactory summary of an event?
  • Inference: How do we reason about a population from taking samples of data from the population?
  • Quantifying Risk: Uncertainty is a fact of life. How do we quantify multiple outcomes and what actions can we take to mitigate risk?
  • Identifying Important Relationships: Do vaccines cause autism?

Luckily for us, great mathematicians have invented tools and techniques for attacking problems which fall into any of the categories listed above. If you haven’t listened to the wonderful talk Statistics for Hackers by Jake VanderPlas, please take the time to do so. In that talk, Jake outlines general strategies for investigating statistical questions, which tend to fall into one of the four problem types mentioned above.

Now that we are satisfied with the “why” of statistics, we, as budding Data Scientists and Machine Learning Engineers, must be clear on what critical thinking in statistics means. Critical thinking is about analyzing and evaluating the validity of arguments and claims. It is needed in statistics because numbers and statistical evidence are often used to support the conclusions of arguments.

The essence of critical thinking in statistics is asking the right questions. When presented with an argument involving statistical evidence, many questions can be raised. Here, I’ll present what I think are the most obvious questions one should ask.

Many arguments involve generalizations about heterogeneous groups. Arguments of this sort tend to draw conclusions about the whole group from samples taken from it. In scenarios such as this, we must ask ourselves: How large is the sample? How was the information collected from the sample? Is the sample representative of the population?
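To make the "how large is the sample?" question concrete, here is a minimal sketch of the usual 95% margin of error for a sample proportion. It assumes a simple random sample and the normal approximation; the function name `margin_of_error` and the proportion of 0.5 are illustrative choices of mine, not something taken from a particular study.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion.

    Assumes a simple random sample and the normal approximation.
    z=1.96 is the standard normal quantile for 95% confidence.
    """
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

# Illustrative sample sizes: the margin shrinks with the square root of n.
for n in (100, 1000, 10000):
    print(n, round(margin_of_error(0.5, n), 3))
```

Note that a larger sample only narrows this margin. It does nothing to fix a sample that was collected badly or is not representative of the population, which is why the other two questions still matter.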

Closely related to the issue of sample representativeness is the base rate fallacy, which rears its head often in the medical arena. Usually, the sensitivity, or true positive rate (TPR), of a test for disease X is presented, and you want to estimate the likelihood that a person has disease X given a positive test result. In cases like this, it is important to ask for both the base rate of disease X in the population and the false positive rate (FPR) of the test. The FPR tells us how often the test flags people who do not have the disease, so together with the base rate it determines how much a positive result can be trusted. We cannot generalize using the TPR alone.
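A small worked example may help here. The sketch below applies Bayes' theorem to combine the base rate, TPR, and FPR into the probability of disease given a positive test. The numbers (1% base rate, 99% TPR, 5% FPR) are hypothetical values chosen for illustration, and `posterior_given_positive` is just a name I made up.

```python
def posterior_given_positive(base_rate, tpr, fpr):
    """P(disease | positive test), computed with Bayes' theorem."""
    true_positives = tpr * base_rate            # sick and flagged
    false_positives = fpr * (1.0 - base_rate)   # healthy but flagged
    return true_positives / (true_positives + false_positives)

# Hypothetical numbers, for illustration only.
p = posterior_given_positive(base_rate=0.01, tpr=0.99, fpr=0.05)
print(f"{p:.3f}")  # prints 0.167
```

Even with a 99% TPR, only about one in six positive results corresponds to a real case under these assumed numbers, because healthy people vastly outnumber sick ones and a small fraction of them still test positive.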

In addition, you must also ask if the statistic is true. This question is particularly pressing when a statistic is presented with too much precision. For example, if someone tells you that there were exactly 1,789,521 lightning strikes in 2016, you should raise an eyebrow.

Of course, how could I forget the whole “correlation proves causation” logical fallacy? A lot has been said about this and you can find good explanations all over the internet. I mention it here because it draws attention to the difference between an observational study and an experiment. The essential difference is that in an experiment, the researcher controls which subjects receive a treatment. There are many excellent resources available on well-designed experiments if you wish to explore the topic further.
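The difference between the two study types can be illustrated with a toy simulation. In the sketch below, a hidden variable I call `health` drives the outcome, and the treatment itself has no effect at all. Everything here (the variable names, the normal distributions, the sample size) is an assumption made for illustration, not a model of any real study.

```python
import random
from statistics import mean

def difference_in_means(assign_by_choice, n=10_000, seed=0):
    """Mean outcome of treated minus untreated subjects.

    The treatment has NO effect on the outcome in this simulation.
    """
    rng = random.Random(seed)
    treated, untreated = [], []
    for _ in range(n):
        health = rng.gauss(0, 1)  # hidden confounder
        if assign_by_choice:
            # Observational study: healthier subjects tend to opt in.
            takes_treatment = health + rng.gauss(0, 1) > 0
        else:
            # Experiment: the researcher assigns treatment by coin flip.
            takes_treatment = rng.random() < 0.5
        outcome = health + rng.gauss(0, 1)  # depends on health only
        (treated if takes_treatment else untreated).append(outcome)
    return mean(treated) - mean(untreated)

observational_gap = difference_in_means(assign_by_choice=True)
experimental_gap = difference_in_means(assign_by_choice=False)
print(f"observational gap: {observational_gap:.2f}")
print(f"experimental gap:  {experimental_gap:.2f}")
```

In the observational setting the treated group looks much better off, even though the treatment does nothing, because healthier subjects selected themselves into it. Random assignment breaks that link, so the gap shrinks to roughly zero.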

I cannot be encyclopedic in my coverage of what to look out for in arguments involving statistics. The matter of misleading with statistics is well documented in popular books. There are other details to look out for, such as reading graphs carefully and watching out for unlabeled axes, interpreting statistics about distributions, and understanding hypothesis tests. I hope you found this post helpful. You can check out the books listed below to learn more.

 

 
