Are open data actually reusable?

Brief summary

Many efforts are underway to promote data sharing in psychology; however, it is currently unclear whether the in-principle benefits of data availability are being realized in practice. In a recent study, we found that a mandatory open data policy introduced at the journal Cognition led to a substantial increase in available data, but that a considerable portion of these data were not reusable. For data to be reusable, they need to be clearly structured and well documented. Open data alone will not be enough to achieve the benefits envisioned by proponents of data sharing.

This is the first installment of a two-part blog discussing our recent paper: “Data availability, reusability, and analytic reproducibility: Evaluating the impact of a mandatory open data policy at the journal Cognition”. The second installment is here.


Back in 2015 I spent a couple of months visiting the Center for Open Science in Virginia. After spending a few days reveling in the novelty of their treadmill desks, I thought I should make myself useful and joined a meta-science project being led by Mallory Kidwell. The goal was to evaluate the effectiveness of the ‘open badges’ scheme introduced the previous year at the journal Psychological Science. The scheme was simple: after a publication decision had been made, authors were asked whether they would voluntarily make data (and/or materials) publicly available, in return for which the journal would post a colourful badge on the paper to signal adoption of this open research practice. Open badges sounded daft to me, but open data sounded brilliant, so I decided to go along for the ride.

Many hours of data extraction and falling off treadmills later, we found that there was a substantial increase in data availability after open badges were introduced. In the comparison journals, by contrast, data were hardly ever shared. Perhaps badges were not so daft after all. This was an observational study, so we should be cautious about drawing straightforward causal conclusions [1], but here I want to surface an aspect of the study that is often overlooked: we didn’t just record data availability; we also examined whether shared data were actually reusable (in principle). We found that often they were not: a substantial proportion (50/111, 45%) of reportedly available data (from studies with and without badges) was in fact unavailable, incomplete, incorrect, or insufficiently documented. If open data are not actually reusable, then this obviously undermines all of the useful things that we could do with them. Recently observed trends towards increased data sharing are encouraging, but may be partly illusory if it turns out that many of these data are not reusable.

A new study of data reusability

In order to investigate this issue further, Mike Frank and I thought up a study that capitalized on the introduction of an open data policy at the journal Cognition in March 2015. Unlike the open badges scheme, this policy made data sharing mandatory: all authors had to make their data files publicly available online before their articles would be published. We decided to examine 591 articles published in Cognition between March 2014 and March 2017 and to extract information about data availability and reusability, following a coding scheme similar to that of Kidwell et al.

Mike and I realized that we could not undertake this endeavour alone, so we sounded the open science conch and summoned meta-scientists from across the realm to join us in our quest (Figure 1). Although I had worked in a large collaborative team like this before on Mallory’s badges project, I had never led one. It turned out to be one of the most rewarding parts of the project. Although my attempts to motivate the team with bizarre animal photographs may have been more amusing to me than anyone else, everyone was happy to put in a ton of work anyway, and we all learned a lot from each other along the way. I’ve since bumped into some of my co-authors at conferences and met them face-to-face for the first time.

The 'Cognition Open Data' team.

Figure 1: The ‘Cognition Open Data’ team.

Ultimately, we evaluated 417 articles submitted before Cognition’s open data policy was introduced and 174 articles submitted afterwards. Policy introduction was associated with a substantial increase in the number of articles reporting available data, although 38/174 (22%) of post-policy articles did not comply (Figure 2).

Figure 2: Proportion of articles with data available statements as a function of submission date across the assessment period. Solid red lines represent predictions of an interrupted time series analysis segmented by pre-policy and post-policy periods. The dashed red line estimates, based on the pre-policy period, the trajectory of data available statement inclusion if the policy had no effect. Confidence bands (red) indicate 95% CIs.

Fairly good news so far, but how many of these data sets were actually reusable? In cases where a data set was reportedly available, we attempted to download and open it (accessibility), recorded whether all data needed for evaluation and reproduction of the research had been shared (completeness), and noted whether the data set was sufficiently well documented (understandability). Fortunately, most data sets were accessible, but the other findings of this exercise (Figure 3) are rather troubling. Only 23/103 (22%) of accessible data sets in the pre-policy period appeared to be reusable (both complete and understandable). There was substantial improvement in the post-policy period; however, still only 85/133 (64%) of accessible data sets were deemed reusable.

Figure 3: Was shared data actually reusable?

Most scientific data are unlikely to be available, and for most available data, reusability is unknown. Our findings suggest that promoting data availability without attending to the issue of data reusability could be misguided. Our initial visual assessment of the data files is likely to underestimate the problem: if we had actually tried to re-use the data in practice, it seems likely that additional issues would have emerged. Indeed, when we did try to re-use the data for a subset of 35 articles, in an attempt to establish their analytic reproducibility, several issues with the data files emerged that we hadn’t previously spotted. We will discuss those findings in more detail in the second installment.

So what should we do?

If available data are not well curated then this seriously undermines their utility. That’s why we’ve recommended that journals with open data policies also provide guidance to authors on how to adequately prepare their data files to maximise their reusability. Curating data is not as straightforward as it might seem. It requires some imagination to see that the variable naming scheme that seems completely logical to you in the present moment might make no sense to one of your colleagues, or even to yourself, in a few months’ time.

Here are some tips we have learned for improving the reusability of data files (please also share your own tips in the comments section below):

  • if there are no countervailing constraints (e.g., privacy issues), share the rawest possible digital version of the data. In cases where a great deal of pre-processing is required (e.g., eye-tracking data), it is also extremely helpful to share ‘basic level’ data, i.e., data summarised at the most relevant unit of analysis (in psychology, this is typically the participant level).

  • organise data logically, for example in ‘tidy’ format, where each variable is a column and each observation is a row (see the sketch after this list).

  • if it is not too cumbersome, use human-legible long forms for column headings and group codes. For example, rather than “sbj” use “subject” or “participant id”, and rather than “1” and “2” write out “male” and “female”. If this gets messy, you can provide a translation of your shorthand in a separate data dictionary/codebook (see the next tip).

  • provide a data dictionary/codebook to explain your data. There are nifty automated tools that will generate codebooks for you, for example the R package dataMaid, or the codebook package, which can be used with R, SPSS, and Stata.

  • share data in a machine-readable and human-readable format that does not rely on proprietary software. Comma-separated values (CSV) is a good default.

  • accompany your data with ‘metadata’: the who, what, where, why, and when of the data collection process.
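
To make some of these tips concrete, here is a minimal sketch in R (the language of the codebook packages mentioned above). The data set, variable names, and values are all hypothetical, invented purely for illustration:

```r
library(dplyr)  # data manipulation
library(readr)  # plain-text CSV export

# Hypothetical trial-level data from a two-condition reaction-time
# experiment: the "rawest possible digital version" of the data.
# The layout is already 'tidy': each column is a variable and each
# row is one observation (here, a single trial).
raw <- tibble::tibble(
  participant_id = rep(c("p01", "p02"), each = 4),
  condition      = rep(c("congruent", "incongruent"), times = 4),
  rt_ms          = c(512, 640, 498, 655, 530, 610, 505, 647),
  accuracy       = c(1, 1, 1, 0, 1, 1, 0, 1)
)
# Human-legible names throughout: "participant_id" rather than "sbj",
# "congruent"/"incongruent" rather than "1"/"2", units in "rt_ms".

# 'Basic level' data: the same data summarised at the most relevant
# unit of analysis (here, the participant).
basic <- raw |>
  group_by(participant_id, condition) |>
  summarise(mean_rt_ms   = mean(rt_ms),
            prop_correct = mean(accuracy),
            .groups = "drop")

# Share both versions as CSV: machine- and human-readable, and not
# reliant on proprietary software.
write_csv(raw, "trials_raw.csv")
write_csv(basic, "participant_means.csv")

# A codebook can then be generated automatically, e.g. with dataMaid
# (see the package documentation for current usage):
# dataMaid::makeCodebook(raw)
```

A short README alongside these files, covering the who, what, where, why, and when of data collection, would round out the metadata.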

Conclusion

It is encouraging to hear about new efforts to increase data availability; however, the findings of our recent study suggest that the issue of data reusability needs additional attention. If available data are not reusable, then the in-principle benefits of data sharing will not be realized in practice.

This is the first installment of a two-part blog. The second installment focuses on lessons learned from our assessment of analytic reproducibility and is available here.

Thanks to Mike Frank for feedback on an earlier version of this post.


  1. Recently there has been some thought-provoking discussion about the badges study initiated by a couple of blog posts by Hilda Bastian, here and here.
