In this vignette, we’ll walk through conducting a \(\chi^2\) (chi-squared) test of independence and a chi-squared goodness of fit test using
infer. We’ll start out with a chi-squared test of independence, which can be used to test the association between two categorical variables. Then, we’ll move on to a chi-squared goodness of fit test, which tests how well the distribution of one categorical variable can be approximated by some theoretical distribution.
Throughout this vignette, we’ll make use of the
gss dataset supplied by
infer, which contains a sample of data from the General Social Survey. See
?gss for more information on the variables included and their source. Note that this data (and our examples on it) are for demonstration purposes only, and will not necessarily provide accurate estimates unless weighted properly. For these examples, let’s suppose that this dataset is a representative sample of a population we want to learn about: American adults. The data looks like this:
## Rows: 500 ## Columns: 11 ## $ year <dbl> 2014, 1994, 1998, 1996, 1994, 1996, 1990, 2016, 2000, 1998, 2… ## $ age <dbl> 36, 34, 24, 42, 31, 32, 48, 36, 30, 33, 21, 30, 38, 49, 25, 5… ## $ sex <fct> male, female, male, male, male, female, female, female, femal… ## $ college <fct> degree, no degree, degree, no degree, degree, no degree, no d… ## $ partyid <fct> ind, rep, ind, ind, rep, rep, dem, ind, rep, dem, dem, ind, d… ## $ hompop <dbl> 3, 4, 1, 4, 2, 4, 2, 1, 5, 2, 4, 3, 4, 4, 2, 2, 3, 2, 1, 2, 5… ## $ hours <dbl> 50, 31, 40, 40, 40, 53, 32, 20, 40, 40, 23, 52, 38, 72, 48, 4… ## $ income <ord> $25000 or more, $20000 - 24999, $25000 or more, $25000 or mor… ## $ class <fct> middle class, working class, working class, working class, mi… ## $ finrela <fct> below average, below average, below average, above average, a… ## $ weight <dbl> 0.8960, 1.0825, 0.5501, 1.0864, 1.0825, 1.0864, 1.0627, 0.478…
To carry out a chi-squared test of independence, we’ll examine the association between income and educational attainment in the United States.
college is a categorical variable with values
no degree, indicating whether or not the respondent has a college degree (including community college), and
finrela gives the respondent’s self-identification of family income—either
far below average,
far above average, or
DK (don’t know).
This is what the relationship looks like in the sample data:
If there were no relationship, we would expect to see the purple bars reaching to the same height, regardless of income class. Are the differences we see here, though, just due to random noise?
First, to calculate the observed statistic, we can use
The observed \(\chi^2\) statistic is 30.6825. Now, we want to compare this statistic to a null distribution, generated under the assumption that these variables are not actually related, to get a sense of how likely it would be for us to see this observed statistic if there were actually no association between education and income.
generate the null distribution in one of two ways—using randomization or theory-based methods. The randomization approach permutes the response and explanatory variables, so that each person’s educational attainment is matched up with a random income from the sample in order to break up any association between the two.
Note that, in the line
specify(college ~ finrela) above, we could use the equivalent syntax
specify(response = college, explanatory = finrela). The same goes in the code below, which generates the null distribution using theory-based methods instead of randomization.
To get a sense for what these distributions look like, and where our observed statistic falls, we can use
We could also visualize the observed statistic against the theoretical null distribution. Note that we skip the
calculate() steps when using the theoretical approach, and that we now need to provide
method = "theoretical" to
To visualize both the randomization-based and theoretical null distributions to get a sense of how the two relate, we can pipe the randomization-based null distribution into
visualize(), and further provide
method = "both".
Either way, it looks like our observed test statistic would be really unlikely if there were actually no association between education and income. More exactly, we can calculate the p-value:
## # A tibble: 1 x 1 ## p_value ## <dbl> ## 1 0
Thus, if there were really no relationship between education and income, the probability that we would see a statistic as or more extreme than 30.6825 is approximately 0.
Note that, equivalently to the steps shown above, the package supplies a wrapper function,
chisq_test, to carry out Chi-Squared tests of independence on tidy data. The syntax goes like this:
## # A tibble: 1 x 3 ## statistic chisq_df p_value ## <dbl> <int> <dbl> ## 1 30.7 5 0.0000108
Now, moving on to a chi-squared goodness of fit test, we’ll take a look at the self-identified income class of our survey respondents. Suppose our null hypothesis is that
finrela follows a uniform distribution (i.e. there’s actually an equal number of people that describe their income as far below average, below average, average, above average, far above average, or that don’t know their income.) The graph below represents this hypothesis:
It seems like a uniform distribution may not be the most appropriate description of the data–many more people describe their income as average than than any of the other options. Lets now test whether this difference in distributions is statistically significant.
First, to carry out this hypothesis test, we would calculate our observed statistic.
The observed statistic is 487.984. Now, generating a null distribution, by just dropping in a call to
# generating a null distribution, assuming each income class is equally likely null_distribution_gof <- gss %>% specify(response = finrela) %>% hypothesize(null = "point", p = c("far below average" = 1/6, "below average" = 1/6, "average" = 1/6, "above average" = 1/6, "far above average" = 1/6, "DK" = 1/6)) %>% generate(reps = 1000, type = "simulate") %>% calculate(stat = "Chisq")
Again, to get a sense for what these distributions look like, and where our observed statistic falls, we can use
This statistic seems like it would be really unlikely if income class self-identification actually followed a uniform distribution! How unlikely, though? Calculating the p-value:
## # A tibble: 1 x 1 ## p_value ## <dbl> ## 1 0
Thus, if each self-identified income class was equally likely to occur, the probability that we would see a distribution like the one we did is approximately 0.
Again, equivalently to the steps shown above, the package supplies a wrapper function,
chisq_test, to carry out Chi-Squared goodness of fit tests on tidy data. The syntax goes like this:
## # A tibble: 1 x 3 ## statistic chisq_df p_value ## <dbl> <dbl> <dbl> ## 1 488. 5 3.13e-103