Naked Statistics — Charles Wheelan

Naked Statistics is an engaging book, grounded in real-world examples and written in a simple, playful style. It is an enjoyable read that adds context to theoretical concepts often detached from daily activities.

Books like this are valuable because they bring some levity to an often dry discipline. By complementing your education with works that simplify the subject matter, you gain better clarity in the concepts once they are presented to you.

There are 13 chapters plus a conclusion, ranging from descriptive statistics to probability, with chapters on data, polling, and regression along the way.

For each chapter I took bullet points of the key ideas, highlighting one to three of them that I wished to develop.

Chapter 1 — What’s the point?

a) The Gini index is a tool for comparison

b) The point is that statistics helps us process data

This remark may come across as obvious, but it is helpful to remind ourselves why we use statistics in the first place. Given both the volume and the variety of data we deal with, the discipline requires us to identify and apply the mathematical methods appropriate to the data at hand. This may be one of the most challenging parts of applied statistics or mathematics.

c) Descriptive statistics exist to simplify, which always implies some loss of nuance or detail

d) Any model to deal with risk must have probability as its foundation

e) Probability can be used to catch cheats in some situations

f) Statistics is a lot like good detective work. The data yield clues and patterns that can ultimately lead to meaningful conclusions

What we seek with statistics is insight, or elements of the truth. No graph or illustration provides the full picture, only fragments of it. As a detective would, we rely on our interpretation to build a larger understanding of what is really happening.

g) Regression analysis is the tool that enables research to isolate a relationship between two variables

h) Statistical analysis is more like good detective work

i) You can lie with statistics

Chapter 2 — Descriptive Statistics

a) Descriptive statistics are the numbers and calculations we use to summarize raw data

b) The average income in America is not equal to the income of the average American

c) Descriptive statistics can be like online dating profiles: technically accurate and yet pretty darn misleading

You must be wary of the simplification of data. For example, a mean can be technically accurate, yet because of the potential influence of outliers it may not give a clear picture.

d) The sensitivity of the mean to outliers is why we should not gauge the economic health of the American middle class by looking at per capita income

e) Median signals the “middle of a distribution”

f) Neither the median nor the mean is hard to calculate
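
To make the outlier point concrete, here is a minimal Python sketch (the income figures are invented purely for illustration) showing how a single extreme value drags the mean upward while barely moving the median:

```python
from statistics import mean, median

# Invented neighborhood incomes, in dollars
incomes = [42_000, 48_000, 51_000, 55_000, 60_000]
print(mean(incomes), median(incomes))   # 51200 51000 -- the two agree closely

# One extremely high earner moves in: the mean explodes, the median barely moves
incomes.append(1_000_000_000)
print(mean(incomes), median(incomes))   # ~166,709,333 vs 53000
```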

Chapter 3 — Deceptive Description

a) Distinction between precision and accuracy

b) The most precise measurements or calculations should be checked against common sense

This is an error most people can fall into.

c) Descriptive statistics can suffer from a lack of clarity over what exactly we are trying to define, describe, or explain

d) Pay attention to the unit of analysis. Who or what is being described and is that different from the “who” or “what” being described by someone else?

e) “The median isn’t the message”

This comes from the famed evolutionary biologist Stephen Jay Gould, who was diagnosed with a cancer that had a median survival time of eight months. He eventually died of another illness twenty years later. All this to say that the median is a powerful measurement, often used alongside the mean, but it shouldn’t be taken too conclusively. Different factors influence how the median should be interpreted; in his case, his youth and the quality of his treatment skewed his chances to the right of the distribution. More information: https://journalofethics.ama-assn.org/article/median-isnt-message/2013-01

f) Percentages don’t lie but can exaggerate. One way to make growth look explosive is to use percentage change to describe some change relative to a very low starting point.

g) Small percentage of an enormous sum can be a big number
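
A quick sketch of both points, with invented numbers: percentage change looks explosive from a tiny base, while a small percentage of a huge sum is still a large absolute amount.

```python
# Invented figures for illustration
sales_before, sales_after = 200, 600          # growth from a very low starting point
pct_change = 100 * (sales_after - sales_before) / sales_before
print(pct_change)                             # 200.0 -- "explosive" growth on a tiny base

budget = 700_000_000_000                      # a hypothetical enormous sum
print(0.01 * budget)                          # "just 1%" is still 7 billion
```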

h) “You had better be darn sure that what you are measuring is really what you are trying to measure”

i) Any index will be sensitive to how it is constructed

j) Statistical malfeasance has very little to do with bad math. If anything, impressive calculations can obscure nefarious motives

Chapter 4 — Correlation

a) Correlation does not imply causation

This is an error people often make when trying to understand the relationship between two variables. Correlation means an interdependence between variable quantities, while causation means one event is the result of the occurrence of another. An example would be that driving at higher speeds correlates with having more accidents, but the correlation by itself does not prove the cause.

b) The correlation coefficient has two attractive characteristics: it is a single number ranging from -1 to 1, and it has no units attached to it
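
As a minimal illustration (the height and weight values below are made up), NumPy's corrcoef returns exactly that single, unitless number:

```python
import numpy as np

# Invented heights (cm) and weights (kg) for six people
heights = np.array([160, 165, 170, 175, 180, 185])
weights = np.array([55, 60, 66, 70, 77, 82])

r = np.corrcoef(heights, weights)[0, 1]   # Pearson correlation coefficient
print(round(r, 3))                        # close to 1: strong positive association, no units
```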

Chapter 7 — The Importance of Data

a) Data is to statistics what a good offensive line is to a star quarterback. In front of every star quarterback is a good group of blockers

b) A data sample should be representative of some larger group or population

c) Inferences made from reasonably large, properly drawn samples can be every bit as accurate as attempting to elicit the same information from the entire population

d) The easiest way to gather a representative sample of a larger population is to select some subset of that population randomly

e) A representative sample is a fabulously important thing, for it opens the door to some of the most powerful tools that statistics has to offer. Getting a good sample is harder than it looks. Many of the most egregious statistical assertions are caused by good statistical methods applied to bad samples, not the opposite. Size matters and bigger is better. A bigger sample will not make up for errors in its composition or “bias”
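
A minimal sketch of drawing a simple random sample in Python (the "population" here is just a list of made-up ID numbers):

```python
import random

# A toy population: ID numbers 1 through 10,000
population = list(range(1, 10_001))

# A simple random sample of 500 drawn without replacement;
# every member of the population has the same chance of being picked
sample = random.sample(population, k=500)
print(len(sample), sample[:5])
```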

f) What we ask of data is that they provide some source of comparison. Is a new medicine more effective than the current treatment?

What to keep in mind is that we want groups that are similar except for one distinct difference (in this case, the treatment). This is the role of control groups and experimental groups.

g) Randomization is useful for creating treatment and control groups that differ only in that one group is getting the treatment and the other is not

h) Longitudinal data sets are the research equivalent of a Ferrari

This is a rare type of research because it requires discipline and tracking the same sample across different points in time. A famous longitudinal study is the Grant Study, which attempts to uncover the secret to living a good life by asking Harvard students (starting in 1942) every two years about their physical and mental well-being.

i) The research equivalent of a Toyota is a cross-sectional data set, which is a collection of data gathered at a single point in time

j) Behind every important study there are good data that made the analysis possible.

k) Some of the most egregious statistical mistakes involve lying with data

There is an interesting article, “How to lie with bad data,” which invokes the famous saying “garbage in, garbage out.” Simply put, bad data can produce results that have no validity at all. The data collection phase is crucial for any statistical project and requires careful data cleaning.

l) Selection bias: bias introduced by how the sample is selected (a non-random or unrepresentative subset of the data)

m) Publication bias: positive findings are more likely to be published than negative findings

n) Recall bias: memory is not a great source of data. Recall bias is one reason that longitudinal studies are often preferred to cross-sectional studies; in a longitudinal study the data are collected contemporaneously

o) Survivorship bias: when some or many of the observations fall out of the sample, changing the composition of the observations that are left and therefore affecting the results of any analysis

p) Getting good data is a lot harder than it seems

Chapter 8 — The Central Limit Theorem

a) Central limit theorem is a combination of probability and proper sampling

b) The core principle underlying the central limit theorem is that a large, properly drawn sample will resemble the population from which it is drawn

c) The central limit theorem enables us to make these inferences:

d) If we have detailed information about some population, then we can make powerful inferences about any properly drawn sample from that population

e) If we have detailed information about a properly drawn sample (mean and standard deviation), we can make strikingly accurate inferences about the population from which that sample was drawn

f) The central limit theorem enables us to calculate the probability that a particular sample was drawn from a given population

g) If we know the underlying characteristics of two samples, we can infer whether both samples were likely drawn from the same population

h) According to the central limit theorem, the sample means for any population will be distributed roughly as a normal distribution around the population mean.

i) The population from which the samples are being drawn does not have to have a normal distribution for the sample means to be distributed normally

j) The standard error is the standard deviation of the sample means

k) For the central limit theorem to apply, the sample size needs to be relatively large (over 30 as a rule of thumb)

I simply find this sample-size rule useful technical knowledge for statistical experiments. The minimum sample size of 30 is cited by other practitioners as well.

l) If you draw large, random samples from any population, the means of those samples will be distributed normally around the population mean

m) Most sample means will lie reasonably close to the population mean; the standard error is what defines “reasonably close”
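
A small simulation makes points h) through m) tangible. The population below is deliberately non-normal (exponential), yet the means of repeated samples pile up in a roughly normal shape around the population mean, with a spread close to the theoretical standard error. The specific numbers (sample size 50, 5,000 samples) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A deliberately non-normal population: exponential with mean ~1 and sd ~1
population = rng.exponential(scale=1.0, size=1_000_000)

n = 50  # sample size, comfortably above the rule-of-thumb minimum of 30
sample_means = [rng.choice(population, size=n).mean() for _ in range(5_000)]

print(np.mean(sample_means))            # close to the population mean (~1.0)
print(np.std(sample_means))             # observed spread of the sample means ...
print(population.std() / np.sqrt(n))    # ... roughly sigma / sqrt(n), the standard error
```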

Chapter 9 — Inference

a) The most likely explanation is not always the right explanation. Statistical inference is the process by which the data speak to us, enabling us to draw meaningful conclusions.

Quantitative data can be used for descriptive or inferential statistics. Statistical inference is the process of analysing whether the results from your sample translate to the population. Some common types of tests include the t-test, ANOVA, regression, correlation, and chi-square; a short worked example follows the bullets below.

b) Statistical inference comes from the marriage of data and probability

c) One of the most common tools in statistical inference is hypothesis testing

As with the tests mentioned above, hypothesis testing allows you to assess whether your results are generalizable to the broader population. Hypothesis testing is everywhere.

d) Statistics alone cannot prove anything; instead, we use statistical inference to accept or reject explanations based on their relative likelihood

e) The alternative hypothesis is a conclusion that must be true if we can reject the null hypothesis

f) Researchers often create a null hypothesis in hopes of being able to reject it.

g) One of the most common thresholds that researchers use for rejecting a null hypothesis is 5 percent
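
A minimal sketch tying these ideas together: a two-sample t-test (one of the tests mentioned earlier) on invented data, using the 5 percent threshold as the decision rule. The group means, spreads, and sizes are assumptions chosen purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Invented test scores for a control group and a treatment group
control = rng.normal(loc=70, scale=10, size=40)
treated = rng.normal(loc=75, scale=10, size=40)

# Null hypothesis: the two groups share the same population mean
t_stat, p_value = stats.ttest_ind(treated, control)
print(t_stat, p_value)

# The common 5 percent threshold for rejecting the null hypothesis
print("reject the null" if p_value < 0.05 else "fail to reject the null")
```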

h) The distinction between correlation and causation is crucial to the proper interpretation of statistical results

Chapter 10 — Polling

a) The power of polling stems from the same source as our previous sampling examples: the central limit theorem.

I had never thought about the relationship between polling and the CLT. Certainly the most insightful comment in a pretty dry chapter.

b) The standard error will fall as the sample size gets larger, since n is in the denominator. The standard error also tends to fall when the underlying population is less varied, since its standard deviation sits in the numerator.

c) When we solicit public opinion, the phrasing of the question and the choice of language can matter enormously. The real challenge of polling is twofold: finding and reaching the proper sample, and eliciting information from that representative group in a way that accurately reflects what its members believe
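
For a poll, the standard error of a proportion is sqrt(p(1-p)/n), so it shrinks as the sample grows. A quick sketch with an invented poll result:

```python
import math

# Invented poll: 1,000 respondents, 52% favor candidate A
n, p_hat = 1_000, 0.52

se = math.sqrt(p_hat * (1 - p_hat) / n)      # standard error of the sample proportion
margin = 1.96 * se                           # approximate 95% margin of error

print(round(se, 4))                          # ~0.0158
print(round(100 * margin, 1))                # ~3.1 percentage points
```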

Chapter 11 — Regression Analysis

a) The hard part of regression analysis is determining which variables ought to be considered in the analysis and how that can best be done. Regression analysis is easy to use but hard to use well.

b) Regression analysis has the amazing capacity to isolate a statistical relationship that we care about.

c) At its core, regression analysis seeks to find the “best fit” for a linear relationship between two variables

d) For a regression coefficient, you will generally be interested in three things: sign, size, and significance

e) One rule of thumb is that the coefficient is likely to be statistically significant when the coefficient is at least twice the size of the standard error

A coefficient that is more than about twice as large as its standard error will generally be statistically significant at p < 0.05 (if a p-value is less than 0.05, the result is judged “significant”; if it is greater than 0.05, it is judged “not significant”).

f) Multiple regression analysis is the best tool we have for finding meaningful patterns in large, complex data sets

Multiple regression is a statistical technique to understand the relationship between one dependent variable and several independent variables.
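
Here is a minimal sketch of a multiple regression using statsmodels on synthetic data. The variables (height, exercise) and their true coefficients are invented; the point is only to show where the sign, size, and significance of each coefficient come from, and how the “twice the standard error” rule of thumb lines up with the p-values.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Synthetic data: weight explained by height and weekly exercise hours
n = 200
height = rng.normal(170, 10, n)       # cm
exercise = rng.normal(5, 2, n)        # hours per week
weight = 0.8 * height - 1.5 * exercise + rng.normal(0, 5, n)  # kg, plus noise

X = sm.add_constant(np.column_stack([height, exercise]))
model = sm.OLS(weight, X).fit()

print(model.params)    # sign and size of each coefficient
print(model.bse)       # standard errors
print(model.pvalues)   # coefficients roughly 2x their SE come out significant at ~5%
```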

g) A high proportion of all important research done in the social sciences over the past half century draws on regression analysis

Chapter 12 — Common Regression Mistakes

a) Top 7 abuses of regression analysis

b) Using regression to analyse a nonlinear relationship

c) Regression analysis is meant to be used when the relationship between variables can be expressed using a linear equation

d) Correlation does not equal causation

e) Reverse causality

f) Omitted variable bias

g) Highly correlated explanatory variables (multicollinearity)

h) Extrapolating beyond the data

i) Data mining (too many variables)

j) Clever researchers can always build a theory after the fact for why some curious variable that is really just nonsense turns up as statistically significant

This, I believe, is more common than one thinks.

k) The accepted convention is to reject a null hypothesis when we observe something that would happen by chance only 1 in 20 times or less if the null hypothesis were true.

l) Designing a good regression equation (figuring out what variables should be examined and where the data should come from) is more important than the underlying statistical calculations.

This is an important principle to keep in mind. I highlighted it because it is a fundamental principle to follow through on.

m) Regression analysis builds only a circumstantial case (any association between two variables is like a fingerprint at the scene of the crime)

Chapter 13 — Program Evaluation

a) The most common approaches for isolating a treatment effect:

b) Randomized, controlled experiments

This is certainly the type of experiment I have witnessed the most. Contrary to my initial belief, they can be expensive.

c) Natural experiment

d) Non-equivalent control

e) Differences in differences

Overall, this book is a great companion for anyone studying statistics. Its fun, grounded approach brings some levity to a discipline that can be dry at times.

Chapters 5, 5½, and 6 were omitted. My focus in studying this book was to improve my understanding of statistics, and the topics of chapters 5 through 6 were probability.
