statistics, kaggle

Why is some portion of statistics not used in data science?


I have learned statistics, including the mean, median, and mode, and various tests such as the Z-test, F-test, and chi-square test. But when I take part in difficult numeric prediction challenges on Kaggle and other platforms, I hardly see anyone using statistical tests like the Z, F, or chi-square tests, or normalization of data. All we use are boxplots and bar plots to look at the mean, median, mode, etc.

My question is: where are these tests an integral part of data science, and for what sort of problems are they mainly designed? Are they mainly for research?

What portion of statistics should ideally be used in a data science problem, and why is only some portion used when all of statistics is said to be a must for data science?

I am asking about tests and other statistics, not the algorithms.


Solution

  • You're most likely to see statistical hypothesis testing in data science if you're looking at something like A/B testing, where your goal is to determine whether there is a reliable difference between two samples and the size of that difference.

    Kaggle competitions, by contrast, are mostly supervised learning problems rather than hypothesis-testing problems, which is why you don't see people using things like chi-squared tests there. (That makes sense: if you have ten people run the same hypothesis test on the same dataset, they should all get pretty much the same answer, which would make for a pretty uninteresting competition.)

    Personally, I think it's good to be familiar with both statistical hypothesis testing and machine-learning techniques, since they have different uses. Hope that helps! :)
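
To make the A/B testing case above a bit more concrete, here is a minimal sketch of a two-proportion Z-test using only the Python standard library. The function name `two_proportion_z_test` and the conversion counts are made up for illustration, and the normal approximation it relies on assumes reasonably large samples. It reports both things an A/B test is after: the size of the difference and how reliable that difference is.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    Hypothetical helper for illustration; assumes samples are large enough
    for the normal approximation to be reasonable.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


# Made-up counts: variant A converts 200 of 2000 visitors, variant B 250 of 2000.
diff, z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(f"difference={diff:.3f}, z={z:.2f}, p={p_value:.4f}")
```

Anyone who runs this on the same counts gets the same numbers, which is the point made above about why this kind of analysis would not make for much of a competition.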