My last post talked a bit about analyzing regression results and thinking statistically in general. Today I read a blog post about a psychology article that fits that theme well, so I’m going to talk about it a bit. No sports, but more stats talk. Maybe a discussion that doesn’t involve Wins Produced or APM will allow everyone to think a bit more rationally?
First, let me say that this paper is not in my area of expertise; I don’t study evolutionary psych, mating/sexual behavior, or humor. But the data analysis is pretty straightforward, and that’s what I’ll be looking at.
The first issue that Psycasm brings up is the sample size. He says it’s potentially too big. If you agree with this, you probably also enjoy watching only the playoffs and believe that Shaq hits his free throws when they count (ok, there was a little sports). His concern actually fits in with what I’ve said previously; with a big sample you are very likely to find significant effects if they are there, because the bar for significance becomes so low. In this case, a correlation of .15 tested as significant. Psycasm then said the result would be more believable if the sample size were smaller. This is patently ridiculous. Barring measurement error, the correlation should stay around .15 (looking at the borderline case; other correlations were higher) with a smaller sample. That same correlation would not be significant in a smaller sample, and presumably the relationship would then be rejected, even though it is the same relationship.
This thinking fits in with what Phil was complaining about in his post and what I tried to address last time; the significance of your result is not as critical as many people (and academia) make it out to be. Finding a significant result in a small sample could be ‘more impressive’ because there was a higher bar of evidence to get over, but it’s also a less certain result and more likely to be affected by luck. It also ignores the issue of thinking about the strength of the results themselves; how do you feel about a correlation of .15? Given the topic here it seems reasonably sized to me, but in an absolute sense it isn’t that great. That should be the topic at hand, not whether the sample size was too big to find an actual result. If you think .15 is a good-sized correlation, it doesn’t matter if it’s significant or not. Note that Psycasm’s suggestion of using a directional hypothesis doesn’t help much here; a one-tailed test (as opposed to two-tailed) would halve the p value and make significance easier to reach, but 1) you’re still thinking about p values instead of importance, 2) I imagine the authors used a two-tailed test anyway, and 3) this isn’t the critical part of the analysis.
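To make the sample-size point concrete, here’s a quick sketch of my own (not from the paper) showing how the same correlation of .15 clears the significance bar at n = 300 but not at n = 30. It uses the standard t-statistic for testing a correlation against zero, compared against the rough large-sample .05 critical value of about 2:

```python
import math

def corr_t_stat(r, n):
    """t-statistic for testing a Pearson correlation against zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = 0.15
# As a rough rule, |t| > 2 is significant at the .05 level for
# moderate-to-large samples (the exact cutoff depends on n).
print(round(corr_t_stat(r, 300), 2))  # 2.62: clears the bar
print(round(corr_t_stat(r, 30), 2))   # 0.8: nowhere close
```

The correlation is identical in both cases; only the yardstick moved. That’s exactly why “it would be more believable with a smaller sample” gets things backwards.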
Psycasm’s next major issue is with the raters and how the ratings were used. He thought there should be more raters and an even gender split, since part of the paper was about sex differences. More raters might have pinned down the caption humor better, but it also adds more noise. The raters all need to be checked against each other and would bring their own biases to the rating process (which will come up again in a minute). As it turns out, 4 of the raters were men and 2 were women. I doubt the results would have turned out much different if it had been 3 and 3, or 5 and 5. Since most captions were unfunny, the best rating for each cartoon was used and each subject got a humor score based on the average of their best rating across three cartoons and six raters. That seems fine to me, although you obviously lose some information. They could have fit some kind of fancy Poisson model to all the data, but whatever. The important part is the reliability of that measure, which the authors say is decent. This gets back to Psycasm’s criticism of the raters; if the raters are pretty consistent, then the concern about gender differences in the raters is a moot point. In fact, if there were equal men and women the ratings might become more inconsistent if the men and women disagree. It would be nice if the authors told us if the male and female raters disagreed, but since their reliability is higher than in some other papers, presumably the raters’ genders didn’t matter too much.
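For clarity, here’s what that scoring scheme looks like as code. This is my own reconstruction from the description above, with made-up numbers: take each subject’s best rating (across the six raters) for each cartoon, then average those bests across the three cartoons.

```python
def humor_score(ratings_by_cartoon):
    """Best-of-raters per cartoon, averaged across cartoons.

    ratings_by_cartoon: one inner list of the six raters' scores
    for each of the subject's three captions.
    """
    best = [max(cartoon) for cartoon in ratings_by_cartoon]
    return sum(best) / len(best)

# Hypothetical subject: six raters' scores for each of three cartoons.
subject = [
    [0, 1, 0, 2, 0, 0],  # best = 2
    [1, 0, 0, 0, 1, 0],  # best = 1
    [0, 0, 3, 0, 0, 0],  # best = 3
]
print(humor_score(subject))  # 2.0
```

You can see where the information loss happens: five of the six ratings per cartoon are thrown away, which is why the reliability of the resulting measure is the thing to check.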
Now we can get to the actual data analysis. Psycasm notes that in the raw correlations, humor is not correlated with mating success and says that it’s surprising given how the paper has been relayed in the media. You can devote whole books to media issues (and people have), but the important part is that the raw correlations *are expected* to have weird relationships. The assumption is that the variables are inter-correlated, which means that the raw correlations are next to useless. The authors of the paper did the right next step, which was to run a stepwise regression on the data. This is where the magic happened, as the only variable to significantly predict mating success was humor. BUT, the R squared was only .03. Now we’ve come full circle. The result that the paper will ultimately focus on is based on a significant result with little oomph – only 3% of the variance in mating success is attributable to humor ability. I’m going to guess that 20% is personality besides humor, 40% looks, and 37% being in the same room as drunk people.
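The core move in stepwise (forward) selection is simple enough to sketch in a few lines. This is a toy illustration, not the authors’ actual procedure (a real stepwise regression also checks significance at each step and considers predictors jointly); with a single predictor, R squared is just the squared correlation, so one forward step amounts to picking the candidate with the highest r². Variable names and data are made up:

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def forward_step(candidates, y):
    """One step of forward selection: the predictor whose simple
    regression gives the highest R^2 (= r^2 for one predictor)."""
    return max(candidates.items(), key=lambda kv: pearson_r(kv[1], y) ** 2)

# Made-up data: 'humor' tracks the outcome, the others don't.
y = [1, 2, 3, 4, 5, 6, 7, 8]
candidates = {
    "humor":  [1, 2, 3, 4, 5, 6, 7, 8],
    "height": [3, 1, 4, 1, 5, 9, 2, 6],
    "income": [2, 7, 1, 8, 2, 8, 1, 8],
}
name, x = forward_step(candidates, y)
print(name)  # humor
```

For a sense of scale on the paper’s result: an R squared of .03 in a simple regression corresponds to a correlation of about sqrt(.03) ≈ .17, which is why “significant but small” is the right summary.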
The final bit is a structural equation modeling (SEM) analysis of the data which suggests that more intelligent people have higher humor scores, and higher humor scores lead to more mating success. The authors don’t report if they tried any other models, so it could be that other chains are just as likely. But it justifies the title of the paper: humor reveals intelligence (if you see someone being funny, they are probably also smart) and predicts mating success (that’s what the model says!). As a side note, the Wikipedia page for SEM notes that it’s important to have a big enough sample size to get reliable results, which may be another reason to find Psycasm’s criticism of the sample size odd.
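Real SEM fits all the paths simultaneously with dedicated software, but the chain the model suggests can be caricatured as two simple regressions: intelligence predicting humor, and humor predicting mating success. This is purely my illustration on made-up numbers, chosen so both paths come out positive:

```python
def slope(x, y):
    """Least-squares slope of a simple regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Made-up data consistent with the chain the SEM suggests.
intelligence = [1, 2, 3, 4, 5, 6]
humor        = [2, 2, 4, 5, 5, 7]   # tracks intelligence
mating       = [1, 1, 2, 3, 3, 4]   # tracks humor

print(slope(intelligence, humor) > 0)  # True: path a is positive
print(slope(humor, mating) > 0)        # True: path b is positive
```

Two positive paths are consistent with the proposed chain, but note that the reverse ordering (or a third variable driving both) would produce the same pattern here, which is exactly why it matters that the authors didn’t report trying other models.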
I’ve come down generally on the side of the paper here, but I want to emphasize that I’m not necessarily agreeing with their conclusions (I didn’t even read the discussion section, to be honest) or what the general media has said about it. And Psycasm has some good points about other ways the article could be improved or the research generally made better. But as far as the data analysis goes, it seems on the up and up to me for an academic article. A large sample size is good; it makes it more likely you’ll get significant results, notably in correlations, but it also helps to account for noise. The real issue is how you interpret your results, and significance is not the best way. If there’s a flaw in the paper’s process, it’s that they don’t note the small size of the connection between humor and mating success. Everything else being equal, I’m sure it is helpful to be funnier. But it isn’t going to help you much according to their data.