Why raw data are important

Raw data are important in validating scientific work. Even so simple an operation as smoothing by time-averaging can have counter-intuitive effects, such as Simpson’s Paradox.

For a simple and homey example, here are the batting averages of Derek Jeter and David Justice in 1995, 1996, and 1997:

In 1995, Jeter had 12 hits in 48 at-bats, for an average of .250. Justice beat him with 104 hits in 411 times up for .253.

In 1996, Jeter hit 183 times in 582 tries for .314; Justice hit 45 out of 140 for .321, winning again.

In 1997, Jeter was 190 for 654, averaging .291. Justice got 163 hits in 495 tries for a whopping .329 average.

In other words, each year Justice out-hit Jeter according to standard stats.

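But for the three years as a whole, Jeter hit 385 out of 1284 for an overall average of .300, whereas Justice’s numbers were 312 of 1046, for an average of only .298. That is, for the three years combined, Jeter had a higher average than Justice. The reversal comes from the uneven at-bat counts: 1995, the weakest year for both players, accounts for 411 of Justice’s 1046 at-bats but only 48 of Jeter’s 1284.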
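For readers who want to check the arithmetic, here is a minimal sketch in Python that recomputes the yearly and combined averages from the hit and at-bat counts quoted above. The names SEASONS, yearly_average, combined_average, and leader are made up for this illustration; only the numbers come from the post.

```python
from fractions import Fraction

# (hits, at-bats) per season, copied from the figures quoted above.
SEASONS = {
    "Jeter": {1995: (12, 48), 1996: (183, 582), 1997: (190, 654)},
    "Justice": {1995: (104, 411), 1996: (45, 140), 1997: (163, 495)},
}


def yearly_average(player, year):
    """Batting average for one season, as an exact fraction."""
    hits, at_bats = SEASONS[player][year]
    return Fraction(hits, at_bats)


def combined_average(player):
    """Pooled average: total hits over total at-bats across all seasons."""
    total_hits = sum(hits for hits, _ in SEASONS[player].values())
    total_at_bats = sum(at_bats for _, at_bats in SEASONS[player].values())
    return Fraction(total_hits, total_at_bats)


def leader(jeter_avg, justice_avg):
    """Name of the player with the higher of the two averages."""
    return "Jeter" if jeter_avg > justice_avg else "Justice"


# Season by season, then all three seasons pooled.
for year in sorted(SEASONS["Jeter"]):
    jeter = yearly_average("Jeter", year)
    justice = yearly_average("Justice", year)
    print(f"{year}: Jeter {float(jeter):.3f}, Justice {float(justice):.3f}, "
          f"higher: {leader(jeter, justice)}")

jeter_all = combined_average("Jeter")
justice_all = combined_average("Justice")
print(f"Combined: Jeter {float(jeter_all):.3f}, Justice {float(justice_all):.3f}, "
      f"higher: {leader(jeter_all, justice_all)}")
```

The sketch uses exact fractions for the comparisons so that rounding to three decimals plays no part in deciding who comes out ahead; the conversion to float is only for display. Note that the combined figure is total hits over total at-bats, not the mean of the three yearly averages.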

Counter-intuitive? Yes. Should scientists be allowed to get away with releasing only averaged, interpreted, adjusted, or otherwise massaged data? No.
