Safer Science
Simine Vazire
At the last ARP conference in Charlotte, I organized a symposium called “Safer Science: How to improve the quality and replicability of personality research”. I got the idea for this symposium from various discussions I’d had about the “replicability crisis”. First, I served on an APS taskforce on this issue, during which my primary role was to throw a yellow flag every time the group made a recommendation that applied only to experimental designs. This experience made me reflect on which of the challenges our field faces are specific to experimental (i.e., mostly social and cognitive) psych, and which are unique to non-experimental research. It’s easy to get haughty about not dropping conditions or not peeking at your data when you work with large-scale correlational datasets. But we personality researchers are also susceptible to capitalizing on chance in other ways (it’s easy to do when our datasets have 7,000 variables).
The purpose of the Safer Science symposium was to stimulate discussion about what personality researchers can do to improve the quality of our research. I used the phrase “safer science” because I believe that science, like sex, is never completely safe – there will always be errors, false positives, and even fraud, and we should not delude ourselves that we can eradicate these. However, we can always do better. Some people seem to find the replicability crisis depressing – I find it uplifting because the popularity of the reform initiatives shows that we want to do better. This is what progress looks like. This is the only way a field becomes stronger. Rather than pointing fingers, personality researchers should join in.
There are signs of change everywhere. JRP has new submission guidelines emphasizing power and transparency, and encouraging replication studies. SPSP is in the midst of similar changes. APS has already opened its doors to pre-registered replication reports, and is about to put in place new submission guidelines for Psychological Science. NSF is holding a workshop on replicability. It’s a brave new world.
What does this mean for personality research? The talks in the Safer Science symposium shed light on some issues for us to keep in mind as we try to do better science. Here are a few highlights (with some editorializing from me):
- Replication studies are vital to our field’s scientific health, and should be encouraged. Replicators should not be treated as attackers – many are not out to tear anyone down but honestly want to know if they can replicate (and sometimes extend) a published finding. Likewise, even well-powered failures to replicate should not automatically be seen as a mark against the original researcher – some amount of error is expected by chance and should be chalked up to bad luck. (Funder talk)
- Personality journals tend to publish studies with larger sample sizes than social or social/personality journals do (JP came out on top, followed by JRP), and a journal’s N-Pact Factor (the average sample size of the studies it publishes) is negatively correlated with its Impact Factor. (Vazire & Fraley talk)
- Meta-analyses can give us an idea of what effect sizes to expect when planning our sample sizes, but because of publication bias, these estimates are likely inflated. When planning our studies and when interpreting published studies, we should keep in mind that the true effect size is likely smaller than the reported effect size. The smaller the sample, the more inflated the reported effect size is likely to be. (Srivastava talk)
- Papers that include multiple under-powered studies that all report significant findings are statistically improbable. Thus, when such a paper is published, it likely means that there were other studies that ‘didn’t work’ (the file drawer), or that other creative techniques were used to achieve a string of significant results from under-powered studies (the simulation sketch just after this list illustrates this point and the previous one). (Schimmack talk)
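To make those last two points concrete, here is a minimal Monte Carlo sketch of my own (not taken from the talks). The numbers in it – a true correlation of r = .20, studies of N = 50, a two-tailed test at alpha = .05 – are illustrative assumptions, not estimates from any particular literature. It shows how much the studies that happen to reach significance overestimate the effect, and how unlikely it is that five such studies would all come out significant.

```python
# A rough Monte Carlo sketch (my own illustration, not from the talks).
# Assumed numbers: true correlation r = .20, studies of N = 50,
# a two-tailed test at alpha = .05, and 20,000 simulated studies.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_r, n, alpha, n_sims = 0.20, 50, 0.05, 20_000
cov = [[1, true_r], [true_r, 1]]

sig_estimates = []
for _ in range(n_sims):
    # Simulate one study: draw N pairs from a bivariate normal with correlation true_r.
    x, y = rng.multivariate_normal([0, 0], cov, size=n).T
    r, p = stats.pearsonr(x, y)
    if p < alpha:  # only the "significant" studies get published
        sig_estimates.append(r)

power = len(sig_estimates) / n_sims
print(f"Empirical power at N = {n}: {power:.2f}")
print(f"Mean r among significant studies: {np.mean(sig_estimates):.2f} (true r = {true_r})")
print(f"Chance that 5 independent studies are all significant: {power ** 5:.3f}")
```

Under these assumptions, power comes out at only around .3, so the ‘published’ (significant) estimates average well above the true r of .20, and a clean sweep of five significant studies should happen only a couple of times in a thousand.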
So where do we go from here? First, we shouldn’t get too comfortable. None of the journals we examined reached the average sample size Sanjay recommended: 180 for 80% power (to detect an r of .20), or 300 for 90% power. The average sample size at JP – the journal that came out on top – is 178, at JRP it is 128, and at JPSP:PPID it is 122 (Fraley & Vazire, 2013). We are still falling short of adequate power. So lesson #1 is that we should, whenever possible, take the time to increase our sample sizes.
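For anyone who wants to check this kind of arithmetic, here is a small power sketch of my own, using the Fisher z approximation for a two-tailed correlation test at alpha = .05. Sanjay may well have used a different method or effect size in his talk, so treat r = .20 and the exact Ns below as assumptions; they land near, but not exactly on, the round numbers above.

```python
# A minimal power sketch using the Fisher z approximation for a two-tailed
# test of a correlation at alpha = .05. The assumed effect size (r = .20)
# and the journal Ns are illustrative, taken from the discussion above.
import numpy as np
from scipy.stats import norm

def power_for_r(r, n, alpha=0.05):
    """Approximate power to detect a true correlation r with sample size n."""
    z_crit = norm.ppf(1 - alpha / 2)
    z_effect = np.arctanh(r) * np.sqrt(n - 3)  # Fisher z, scaled by its standard error
    return norm.sf(z_crit - z_effect)

def n_for_power(r, target, alpha=0.05):
    """Smallest n whose approximate power reaches the target."""
    n = 4
    while power_for_r(r, n, alpha) < target:
        n += 1
    return n

for journal, n in [("JP", 178), ("JRP", 128), ("JPSP:PPID", 122)]:
    print(f"{journal}: average N = {n}, power to detect r = .20 is about {power_for_r(0.20, n):.2f}")
print("N needed for 80% power:", n_for_power(0.20, 0.80))
print("N needed for 90% power:", n_for_power(0.20, 0.90))
```

By this approximation, even the best of these average Ns buys less than 80% power for a true effect of r = .20, which is the sense in which we are still falling short.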
Second, we should remember to take the time to attempt to replicate our own and each other’s work. And we shouldn’t take an attempted replication as a sign of mistrust. Indeed, as Funder put it, we should be flattered if someone deems our finding important enough to be worthy of attempted replication.
Notice that I used the phrase “take the time” in both of these recommendations. That’s because if we are to increase our sample sizes and replicate our results, it is going to take time. And time is something that none of us feel like we have an abundance of. So what are we to do? One option is to put all of our studies on mTurk, or start doing only self-report or vignette studies. That is my idea of research dystopia. Yes, we need larger samples and more replication, but not at the expense of methodological rigor – we need to continue using multiple methods, sampling non-college students, coding actual behavior, and tracking people over time. So there’s only one solution. We need to join the slow science movement.
References
Fraley, R. C., & Vazire, S. (2013). The N-Pact Factor: Evaluating the quality of empirical journals with respect to sample size and statistical power. Manuscript under review.