Where do the numbers published in scientific articles come from?

Brief summary

Where do the numbers published in scientific articles come from and how often are they correct? Recently we discovered that it can be a lot more difficult to answer this question than one might hope. Our attempt to reproduce values reported in 35 articles published in the journal Cognition revealed analysis pipelines peppered with errors. I outline some elements of a reproducible workflow that may help to mitigate these problems in future research.

This is the second of two posts on our recent paper: “Data availability, reusability, and analytic reproducibility: Evaluating the impact of a mandatory open data policy at the journal Cognition”. The first part is available here.


Most quantitative research involves a data analysis pipeline: a series of steps that wrangle, filter, and sort the raw data, feed it into algorithms, and generate the statistics, tables, and visualizations that are eventually published in scientific articles. A minimum credibility threshold we might expect all scientific articles to surpass is that any reported values can be recovered if we repeat the original data analysis pipeline. This is known as analytic (or computational) reproducibility. During psychology’s ongoing credibility crisis, much attention has been paid to the replicability of findings, i.e., whether similar findings are observed when a study repeats the methods of a previous study and collects new data. Analytic reproducibility, on the other hand, has been largely overlooked and has not previously been investigated systematically in our field.

In my last blog post I described how my colleagues and I began a study to examine the impact of a mandatory open data policy introduced at the journal Cognition. Many datasets were made available, but a substantial number did not appear to be reusable because they were incomplete or poorly documented.

Assessing analytic reproducibility

In the second phase of our study, we took a closer look at 35 articles and tried to reproduce a set of “target outcomes” - roughly two paragraphs of descriptive and inferential statistics - associated with a key finding. These 35 articles were deemed to have reusable datasets in phase 1, so they had already passed several quality checks. We only selected target outcomes produced by the most ‘straightforward’ analyses – the usual suspects from undergraduate psychology statistics training (e.g., means, standard deviations, correlations, confidence intervals, t-tests, ANOVAs). Values were considered ‘non-reproducible’ if they differed from the corresponding values output by our own analysis by more than a pre-defined margin of error (10%).
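
To make that criterion concrete, here is a minimal sketch of the kind of comparison involved. The function and the numbers are purely illustrative, not our actual checking code, and they assume the margin is computed relative to the reported value:

```r
# Minimal sketch: flag a reported value as non-reproducible if it differs
# from the value we obtained by more than 10% of the reported value.
# (Illustrative only - not our actual checking code.)
check_value <- function(reported, reproduced, tolerance = 0.10) {
  abs(reported - reproduced) / abs(reported) > tolerance
}

check_value(reported = 0.52, reproduced = 0.53)  # FALSE: within the margin
check_value(reported = 0.52, reproduced = 0.61)  # TRUE: flagged for follow-up
```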

Despite extensive efforts by our team, we initially encountered at least one reproducibility error in 24 of the 35 articles (69%). In each of these cases, we e-mailed the original authors, explaining our project and providing a link to the pre-registered protocol. We summarized the problems we had encountered, linked to an R Markdown report rendered to HTML that detailed our re-analysis attempt (e.g., here), and asked for clarification or additional information that might enable us to complete the analysis successfully. We were delighted to receive prompt and helpful replies from 100% of the authors we contacted, sometimes with a cheery word of encouragement about the goals of our project.

The additional information provided by authors resolved some issues, typically by correcting errors in the data files or by supplementing/clarifying the analysis specifications reported in the published paper. Ultimately, all target values were reproducible without any author assistance in 11 articles (31%), all values were reproducible with author assistance in a further 11 articles (31%), and at least one non-reproducible value remained in 13 articles (37%) despite author assistance. In total, we attempted to reproduce 1324 individual values, of which 64 (5%) remained non-reproducible after author assistance was provided (Figure 1). Importantly, based on a case-by-case assessment of the pattern of errors in each article, we did not observe any clear indication that the original conclusions were seriously affected.


Figure 1: Post-author assistance status of all 1324 values checked for reproducibility as a function of article and value type (n = count/proportion; ci = confidence interval; misc = miscellaneous; M = mean/median; df = degrees of freedom; es = effect size; test = test statistic; p = p-value; sd/se = standard deviation/standard error). Article colours represent the overall outcome: not fully reproducible despite author assistance (red), reproducible with author assistance (orange), and reproducible without author assistance (green).

Reproducibility, bananas, and flat pack furniture

Reproducibility checks can quickly drive you bananas - every discrepancy between your own analysis and the target outcomes leads to a spiral of self-doubt as you start wondering whether it is your own analysis or the original analysis that is erroneous. There was rarely a clear recipe to follow: the analysis scripts for the target studies were usually not available, and the specification of the original analyses in the paper itself was frequently unclear, incomplete, or incorrect.

One thing that was hard to convey in the paper is just how time-consuming and mentally exhausting these reproducibility checks were. The process generally involved extensive collaborative detective work between several members of our team and the original authors to figure out exactly where the values reported in the papers had come from. We estimate that most checks requiring author assistance absorbed 5–25 person-hours of work (from our team) and took many months of back-and-forth communication. And this was just for a couple of paragraphs of results from the most straightforward analyses.

In the paper, we drew the following analogy:

Conducting an analytic reproducibility check without an analysis script is rather like assembling flat pack furniture without an instruction booklet: one is given some materials (categorized and labelled with varying degrees of helpfulness), and a diagram of the final product, but is missing the step-by-step instructions required to convert one into the other. In many cases, we found it necessary to call the helpline, and, while the response was helpful, the process became excessively time-consuming and frustrating.

To keep ourselves motivated, we sometimes worked side-by-side during a series of ‘reproducathons’ (Figure 2). We found this to be much more efficient and manageable than working in isolation. Plus Mike gave us pizza and coffee.


Figure 2: Photos from our reproducathons, taken by [Gustav Nilsonne](https://twitter.com/GustavNilsonne).

As we discussed the problems we were encountering, we naturally reflected on our own analysis pipelines and realised how vulnerable they might be to the same problems. I started to gain a new perspective on the project: the errors we were discovering were an inevitable byproduct of the inherent fallibility of human beings. Scientists are human and humans make errors - that’s not surprising. What’s surprising is that, as a field, we’re not really doing anything about it.

What should we do?

Human error is inevitable. But that doesn’t mean we cannot take steps to reduce how often errors occur and how much damage they do. In fact, there are sub-fields of psychology dedicated to studying human error and ways to prevent it (human factors/cognitive ergonomics). We can also learn a great deal from our colleagues in computationally intensive disciplines, who have developed strategies such as ‘defensive programming’ to minimize the chance of error in computational work and maximize reproducibility. Jeff Rouder, Julia Haaf, and Hope Snyder have a great paper that outlines how principles of error reduction established in human factors research can be applied to analysis pipelines in psychology.
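
To give a flavour of what defensive programming looks like in this context, here is a hypothetical sketch (the file and column names are invented, and this is not code from our project): the script asserts the properties the data must satisfy before any statistics are computed, so a malformed data file stops the pipeline rather than silently producing wrong numbers.

```r
# Hypothetical sketch of defensive programming in an analysis script:
# fail loudly if the data do not look the way the analysis assumes.
raw <- read.csv("data/experiment1.csv")  # invented file name

stopifnot(
  nrow(raw) > 0,                                         # the file contains data
  all(c("subject", "condition", "rt") %in% names(raw)),  # expected columns exist
  !any(is.na(raw$rt)),                                   # no missing reaction times
  all(raw$rt > 0)                                        # reaction times are plausible
)

# Only once the checks pass do we compute the statistics to be reported.
condition_means <- aggregate(rt ~ condition, data = raw, FUN = mean)
```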

In our project, we adopted a number of principles and tools to try to mitigate error and ensure reproducibility (Figure 3). This arrangement may not work for everyone, but it illustrates that creating an entirely reproducible analysis pipeline is possible. I won’t deny that there is some upfront cost in setting up such a pipeline, but I believe the long-term benefits far outweigh these short-term costs. Whilst we can never eliminate human error from the research process, I sleep easier at night knowing that we have implemented a system that reduces the likelihood of errors occurring. Additionally, because we built this pipeline ‘in the open’ from the very beginning, everything has been organized and documented with the assumption that someone may want to verify or re-use aspects of it. This has already proved immensely useful: we were able to get up and running quickly with a similar reproducibility project simply by adapting our existing pipeline.

The core principle of our analysis pipeline was transparency. We made all raw data publicly available on the Open Science Framework, along with codebooks (documentation) to facilitate re-use and verification by independent scientists (and our future selves). We attempted to report all details of our analyses comprehensively in the paper, but we also shared the analysis scripts, because it can be difficult to capture analysis specifications in sufficient detail in regular prose. We adopted a ‘co-piloting’ scheme in which a minimum of two team members cross-checked every analysis. We used GitHub for code collaboration and version control, which helped us track the provenance of bugs and avoid a confusing multiplicity of document versions. All of our analyses were written in a literate programming style using R Markdown, which enabled us to interleave regular prose with analysis code. Using knitr, we could convert the R Markdown into analysis reports, in HTML, Word, or PDF format, for every single reproducibility check. Frederik Aust’s package papaja enabled us to convert the R Markdown for the main paper into a smart-looking APA-formatted manuscript, ready to be dropped straight onto the BITSS preprint server.
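
For readers unfamiliar with this toolchain, rendering such a report is a single function call once the R Markdown source exists (the file name below is hypothetical):

```r
# Knit a literate-programming source file into a shareable report
# (file name is hypothetical).
rmarkdown::render("reproducibility_check_article_01.Rmd",
                  output_format = "html_document")

# The same source can be rendered to other formats, e.g. for co-pilots
# who prefer to comment on a Word document.
rmarkdown::render("reproducibility_check_article_01.Rmd",
                  output_format = "word_document")

# For the manuscript itself, papaja provides APA-formatted output formats,
# e.g. setting `output: papaja::apa6_pdf` in the document's YAML header.
```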


Figure 3: Elements of our reproducible analysis pipeline.

The last piece of the puzzle was to ensure that others could re-run our analysis code. This is not straightforward, because whether code runs successfully depends on idiosyncratic features of the local computational environment (i.e., operating system version, R version, package versions). That environment is likely to be different on someone else’s computer, on your own computer in a few weeks, and potentially on no one’s computer in a few years. One way to address this issue is to create a ‘software container’ that specifies the exact computational environment originally used to run the analyses. We used a platform called Code Ocean, which made this process relatively straightforward. When you visit the software container for our paper on Code Ocean and click the ‘run’ button, it takes our raw data and analysis code and re-generates the final paper from scratch. Give it a try with your next project; it feels like the future.
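
If a full container feels like overkill for your project, you can get part of the way there from within R itself. The sketch below uses sessionInfo() and the renv package, which were not part of our pipeline (Code Ocean handled the environment for us) but illustrate the same idea of recording and pinning the computational environment:

```r
# Record the computational environment so readers can see exactly what ran.
sessionInfo()        # prints the R version, OS, and loaded package versions

# One lightweight option (not used in our project) is to pin package
# versions with renv, so collaborators can recreate the same library.
install.packages("renv")
renv::init()         # set up a project-local library and lockfile
renv::snapshot()     # record exact package versions in renv.lock
# A collaborator can later call renv::restore() to rebuild that library.
```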

Conclusion

The outcome of our analytic reproducibility checks was both encouraging and concerning. It is a relief that none of the errors we found appear to seriously undermine the conclusions drawn in the published papers. However, on our first pass, 69% of the articles had at least one non-reproducible value. That’s pretty concerning. Extensive detective work was required to identify the provenance of reported values, and sometimes, even with assistance from the original authors, it was not possible to say where the reported values had come from. Because we only assessed a small subset of the most straightforward analyses, from articles whose data had already been judged available and in-principle reusable, our findings very likely underestimate the extent of reproducibility issues in the rest of the literature.

Neither original authors nor independent scientists interested in re-using data should have to spend their time on extensive back-and-forth exchanges just to establish the analytic reproducibility of published findings. We should be able to take that for granted. It is in everyone’s best interests to adopt reproducible analysis pipelines: they make research more efficient, less error-prone, and more credible.


Thanks to Mike Frank for feedback on an earlier version of this post.
