If the -rxiv preprint servers had impact factors, what would they be?

J.P. Smith
5 min read · Jun 17, 2019


Here I look at three prominent non-peer-reviewed online preprint platforms (bioRxiv, PsyArXiv, and SocArXiv) to estimate what their impact factors would be if they had them.

The formula for calculating impact factor for year n is B/C, where:

n = Most recent complete year (i.e. last year)

A = all citations in year n

B = a subset of A consisting only of citations to articles published in years n-1 or n-2

C = all citable items published in years n-1 and n-2 combined

I should start by noting that here, n will be 2018.
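As a minimal sketch, the B/C definition above can be written as a one-line function. The name `impact_factor` and the example numbers are hypothetical, chosen only to illustrate the arithmetic; they do not come from any real source.

```python
def impact_factor(b: int, c: int) -> float:
    """Return B/C, where B is the number of citations in year n to items
    published in years n-1 and n-2, and C is the number of citable items
    published in those two years."""
    if c <= 0:
        raise ValueError("C (citable items) must be positive")
    return b / c


# Hypothetical numbers, for illustration only: 150 citations in 2018
# to 100 items published in 2016-2017.
print(round(impact_factor(150, 100), 3))
```

Note that A does not appear in the calculation at all; only the subset B matters.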

Let’s start with SocArXiv. Google Scholar doesn’t make it easy to figure out how many citations a given source received in any given year, but it does make it easy to figure out how many times the articles a source published in a specified time range have been cited in total. So if you enter source:“socarxiv” into Google Scholar and restrict the range to 2016 (n-2) through 2017 (n-1), you get 578 results. This gives us the value of C, assuming:

  1. That no articles published in the desired source are excluded from this search
  2. That no articles not published in the desired source are included in this search
  3. That the number displayed at the top of the page is an exact count of the results

The first two assumptions seem reasonable enough, but the third seems questionable, given that the number of results Google displays is often inaccurate. Indeed, in the case of SocArXiv (man that capitalization is confusing), the actual number of results, found by scrolling to the final page, appears to be 627, so I will now define C as 627.

What about A and B? As I noted earlier, it is not possible, AFAIK, to find the number of times a given source has been cited in a given year on GS. But we know that the total number of citations so far to SocArXiv articles published in 2016–2017 is 344. This sets an upper limit on B, which only includes citations made in works published in 2018. So if we wanted to be generous, we could pretend that B = 344, ignoring the fact that many of these citations are in papers or other works published before (or after) 2018, meaning they shouldn’t count toward the IF. The generous impact factor of SocArXiv is therefore:

344/627 = 0.549

How many of these B = 344 citations, exactly, are in the year 2018 as required? I could look at every single paper with one or more citations and count how many of those citations are in 2018, but I don’t feel like it because that would take too long. So instead I will estimate this by extrapolating from a small number of the included papers.

Consider this paper, the first one to show up in my SocArXiv source search, which has 49 citations. Only 5 of them (10%) are in 2018. For the second paper in my search, 0 of its 37 citations are in 2018; for the third, 10 of 33 (30%); for the fourth, 18 of 32 (56%); and for the fifth, 5 of 18 (28%). Summing the numerators and denominators for these five papers (which together have 169 citations, or almost half of all SocArXiv citations during this time period) gives an estimated 38/169 = 22% of all citations that we should actually count toward an IF. 22% of 344, rounded to the nearest whole number, is 76, so the realistic IF of SocArXiv is…

76/627 = 0.121
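The extrapolation above can be sketched in a few lines of Python. The function name `estimated_if` is hypothetical. The rounding steps (share to a whole percent, then B to a whole number) are written to mirror the steps described in the text, and the data are the five SocArXiv papers listed there.

```python
def estimated_if(sample, total_citations, citable_items):
    """Extrapolate B from a sample of papers.

    sample: list of (citations_in_2018, all_citations) pairs, one per paper.
    Returns (share in whole percent, estimated B, estimated IF).
    """
    in_year = sum(hits for hits, _ in sample)
    all_cites = sum(total for _, total in sample)
    # Round the share to a whole percent, as the text does.
    share_pct = round(100 * in_year / all_cites)
    # Apply that share to all citations and round to a whole citation count.
    b = round(total_citations * share_pct / 100)
    return share_pct, b, round(b / citable_items, 3)


# The five SocArXiv papers from the text: (citations in 2018, total citations)
socarxiv_sample = [(5, 49), (0, 37), (10, 33), (18, 32), (5, 18)]
print(estimated_if(socarxiv_sample, total_citations=344, citable_items=627))
# → (22, 76, 0.121)
```

Keep in mind that the five papers are the first search results rather than a random sample, so this is a rough estimate, not a measurement.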

Notably, even the generous IF for SocArXiv presented here is not very impressive — to say nothing of the realistic one.

Now let’s look at PsyArXiv. It has a total of 541 papers during this time period. Thus, C = 541. The total number of citations is 299. So the generous IF of PsyArXiv is:

299/541 = 0.553

As before, we will look at the first 5 results. The first has been cited 51 times, of which 21 are in 2018. The second has been cited 39 times, of which 6 are in 2018. The third is actually a republished version of a book chapter, and it has 32 citations, of which 14 are in 2018. Number four has been cited 22 times, of which 8 are in 2018. And finally, number five has been cited 21 times, of which only 1 is in 2018. The total numerator is 21+6+14+8+1 = 50, and the total number of citations is 51+39+32+22+21 = 165. Note that these 5 papers combined account for a majority (about 55%) of PsyArXiv’s citations during this time period. Extrapolating as we did previously, we should count only 50/165, or about 30%, of all included citations. That is, we are assuming that 30% of the total 299 citations, or about 90, are in 2018.

Anyway, this means the realistic IF of PsyArXiv is:

90/541 = 0.166
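One quick check on the extrapolation is how much of each server's citation total the five sampled papers cover. The sketch below recomputes that coverage for SocArXiv and PsyArXiv from the numbers in the text; the helper name `sample_coverage` is hypothetical.

```python
def sample_coverage(sample_totals, total_citations):
    """Fraction of all citations that belong to the sampled papers."""
    return sum(sample_totals) / total_citations


# Total citation counts of the five sampled papers, as given in the text
socarxiv = sample_coverage([49, 37, 33, 32, 18], total_citations=344)
psyarxiv = sample_coverage([51, 39, 32, 22, 21], total_citations=299)

print(f"SocArXiv: {socarxiv:.1%}, PsyArXiv: {psyarxiv:.1%}")
# → SocArXiv: 49.1%, PsyArXiv: 55.2%
```

High coverage means the sample directly determines about half of the estimate, but it says nothing about whether the remaining, less-cited papers have a similar share of 2018 citations.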

Finally, bioRxiv. It looks like a total of 987 papers were published in bioRxiv during this time period, so C = 987. The total number of citations is 2,402, meaning that the generous IF of bioRxiv is:

2402/987 = 2.434

Paper 1 has 168 cites, of which 20 are in 2018.

Paper 2 has 160 cites, of which 99 are in 2018.

Paper 3 has 130 cites, of which 58 are in 2018.

Paper 4 has 83 cites, of which 30 are in 2018.

And lastly, paper 5 has 76 cites, of which 42 are in 2018.

In total, an estimated 249 (20+99+58+30+42) of these 617 citations, or about 40%, are in the year 2018. Extrapolating, 40% of the total 2,402 citations, or about 961, are in 2018. So the realistic IF of bioRxiv is:

961/987 = 0.974
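To see all three servers side by side, here is a rough sketch that recomputes the generous IFs and then the realistic ones without rounding the 2018 share to a whole percent first. The function name `unrounded_if` and the dictionary layout are hypothetical; every number comes from the text.

```python
# (citations in 2018, total citations) for the five papers sampled per server
samples = {
    "SocArXiv": [(5, 49), (0, 37), (10, 33), (18, 32), (5, 18)],
    "PsyArXiv": [(21, 51), (6, 39), (14, 32), (8, 22), (1, 21)],
    "bioRxiv": [(20, 168), (99, 160), (58, 130), (30, 83), (42, 76)],
}
# (total citations to 2016-2017 papers, number of 2016-2017 papers, i.e. C)
totals = {
    "SocArXiv": (344, 627),
    "PsyArXiv": (299, 541),
    "bioRxiv": (2402, 987),
}


def unrounded_if(sample, total_citations, citable_items):
    """Realistic IF using the exact sample share, with no intermediate rounding."""
    share = sum(hits for hits, _ in sample) / sum(total for _, total in sample)
    return total_citations * share / citable_items


for name, sample in samples.items():
    cites, items = totals[name]
    generous = cites / items
    realistic = unrounded_if(sample, cites, items)
    print(f"{name}: generous {generous:.3f}, realistic {realistic:.3f}")
```

Skipping the intermediate rounding gives roughly 0.123, 0.167, and 0.982, each within 0.01 of the figures above, so the rounding does not change the overall picture.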

In conclusion, even the most highly cited of these three (bioRxiv) is cited much less frequently per article than almost any highly regarded journal, and certainly less than the really prestigious ones like Nature. This seems to be a good thing, because we should generally be citing peer-reviewed sources much more often than non-peer-reviewed ones. It should also be pointed out that these estimated IFs are substantially inflated, because they include all Google Scholar citations, and GS is a very inclusive database, certainly compared to the Web of Science, which is what the JCR impact factors are based on. Many citations included here would therefore not have counted toward a conventional JCR-based impact factor, not least because many of them came from other non-peer-reviewed preprints.
