In 2000, two undisclosed neuroscience journals opened their database to an interesting study, which was subsequently published in Brain : Rothwell and Martyn set out to determine the ‘reproducibility’ of the assessments of submitted articles by independent reviewers. They found, not surprisingly, that the recommendations of the reviewers had a strong influence on the acceptance of the articles. However, there was no or only little agreement between reviewers regarding priory. The agreement between reviewers regarding recommendation (accept, reject, revise) was also not better than chance.
Two recent publication have picked up this thread, and found rather horrifying results:
In Science this week John Bohannon reports the results of an interesting experiment. He deliberately faked completely flawed studies reporting the anticancer effects of non-existing phytodrugs, following the template:
‘Molecule X from lichen species Y inhibits the growth of cancer cell Z. To substitute for those variables, [he] created a database of molecules, lichens, and cancer cell lines and wrote a computer program to generate hundreds of unique papers. Other than those differences, the scientific content of each paper [was] identical.’
The studies included ethical problems, reported results that were not reflected in the experiments, the study design was wrong, etc. He then submitted them to 304 open access journals. 157 accepted it for publication! While this may reflect more a problem of some open access journals which are dedicated to so called ‘predatory publishing’ (to skim off publication fees from willing authors), some journals were published by respectable publishers.
Eyre-Walker and Stoletzki in the same week published an article in PLOS Biol, comparing peer review, impact factor, and number of citations to assess the ‘merit’ of a paper. They use a dataset of 6500 articles (e.g. from the F1000 database) for which they had post publication peer review by at least two authors. Again, just like in the Rothwell and Martyn Study, agreement between reviewers was not much better than chance (r2 of 0,07). The score of the assessors also very weakly correlated with the number of citations drawn by those articles (2=0,06). They summarize that ‘we have shown that none of the measures of scientific merit that we have investigated are reliable.’
What follows from all this? A good to-do list can be found in the editoral accompanying the Eyre-Walker & Stoletzky article. Eisen et al. advocate multidimensional assessment tools (‘altmetrics’), but for now ‘Do what you can today; help disrupt and redesign the scientific norms around how we assess, search, and filter science.’
Rothwell PM, Martyn CN (2000) Reproducibility of peer review in clinical neuroscience. Is agreement between reviewers any greater than would be expected by chance alone? Brain.123 ( Pt 9):1964-9.
Bohannon J (2013) Who’s afraid of peer review? Science 342:60-65
Eyre-Walker A, Stoletzki N (2013) The Assessment of Science: The Relative Merits of Post-Publication Review, the Impact Factor, and the Number of Citations. PLoS Biol 11(10): e1001675. doi:10.1371/journal.pbio.1001675
Eisen JA, MacCallum CJ, Neylon C (2013) Expert Failure: Re-evaluating Research Assessment. PLoS Biol 11(10): e1001677. doi:10.1371/journal.pbio.1001677