The NIPS Experiment

Examining the Repeatability of Peer Review

In [2]:

```
# rerun Fernando's script to load in css
%run talktools
```

In [3]:

```
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom
from IPython.display import HTML, display
```

- NIPS is the premier outlet for machine learning.
- Particularly at the interface with Neuroscience and Cognitive Science

- In 2014 NIPS had:
  - 1678 submissions
  - 414 accepted papers
  - 20 oral presentations
  - 62 spotlight presentations
  - 331 poster presentations
  - 19 papers rejected without review

- To review papers we had:
  - 1474 active reviewers (1133 in 2013)
  - 92 area chairs (67 in 2013)
  - 2 program chairs

- How consistent was the process of peer review?
- What would happen if you independently reran it?

- We selected ~10% of NIPS papers to be reviewed twice, independently.
- 170 papers were reviewed by two separate committees.
- Each committee was 1/2 the size of the full committee.
- Reviewers allocated at random
- Area Chairs allocated to ensure distribution of expertise

Deadline 6th June

- three weeks for paper *bidding* and allocation
- three weeks for *review*
- two weeks for discussion and *adding/augmenting* reviews/reviewers
- one week for *author rebuttal*
- two weeks for *discussion*
- one week for *teleconferences* and *final decisions*
- one week cooling off

Decisions sent 9th September

- To check public opinion before experiment: scicast question

| | | Committee 1 | |
|---|---|---|---|
| | | Accept | Reject |
| Committee 2 | Accept | 22 | 22 |
| | Reject | 21 | 101 |

4 papers rejected or withdrawn without review.

*inconsistency*: 43/166 = **0.259** - proportion of decisions that were not the same

*accept precision* = $0.5 \times 22/44$ + $0.5 \times 22/43$ = **0.506** - probability any accepted paper would be accepted in a rerunning (so the probability it would be rejected is 0.494)

*reject precision* = $0.5\times 101/(22+101)$ + $0.5\times 101/(21 + 101)$ = **0.825** - probability any rejected paper would be rejected in a rerunning (so the probability it would be accepted is 0.175)

*agreed accept rate* = 22/101 = **0.218** - ratio between agreed accepted papers and agreed rejected papers (as a fraction of all agreed decisions, 22/123 = 0.179).
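The statistics above can be recomputed directly from the 2×2 table. The sketch below is a minimal illustration: the helper name `agreement_stats` and its dictionary keys are assumptions made for this example. Precision is taken here as the probability that the other committee makes the same decision, averaged over the two committees; its complement is the probability that the decision flips.

```python
import numpy as np

def agreement_stats(table):
    """Summarise a 2x2 decision table (hypothetical helper, for illustration).

    table[i, j] counts papers where committee 2 made decision i and
    committee 1 made decision j (0 = accept, 1 = reject).
    """
    table = np.asarray(table, dtype=float)
    both_accept = table[0, 0]
    both_reject = table[1, 1]
    total = table.sum()

    # Off-diagonal cells are the papers the committees disagreed on.
    inconsistency = (table[0, 1] + table[1, 0]) / total
    # Fraction of each committee's accepts that the other also accepted,
    # averaged over the two committees (row sums: committee 2, column sums: committee 1).
    accept_precision = 0.5 * both_accept / table[0, :].sum() \
        + 0.5 * both_accept / table[:, 0].sum()
    # Same construction for rejects.
    reject_precision = 0.5 * both_reject / table[1, :].sum() \
        + 0.5 * both_reject / table[:, 1].sum()

    return {
        "inconsistency": inconsistency,
        "accept_precision": accept_precision,
        "reject_precision": reject_precision,
        # Ratio of agreed accepts to agreed rejects.
        "agreed_accept_ratio": both_accept / both_reject,
        # Agreed accepts as a fraction of all agreed decisions.
        "agreed_accept_fraction": both_accept / (both_accept + both_reject),
    }

stats = agreement_stats([[22, 22], [21, 101]])
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Keeping both the ratio and the fraction form of the agreed accept statistic makes it easier to compare against baselines quoted either way.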

Public reaction after experiment documented here

Open Data Science (see Heidelberg Meeting)

NIPS was run in a very open way. Code and blog posts all available!

Reaction triggered by this blog post.

| | | Committee 1 | |
|---|---|---|---|
| | | Accept | Reject |
| Committee 2 | Accept | 10.4 (1 in 16) | 31.1 (3 in 16) |
| | Reject | 31.1 (3 in 16) | 93.4 (9 in 16) |

For a random committee we expect:

- *inconsistency* of 3 in 8 (37.5%)
- *accept precision* of 1 in 4 (25%)
- *reject precision* of 3 in 4 (75%)
- *agreed accept rate* of 1 in 10 (10%)

The actual committee's accept precision is markedly better, at about 50%.
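The random-committee figures follow from treating each committee's decision as an independent draw with an accept probability of 1 in 4. That rate is an assumption inferred from the 1/16, 3/16 and 9/16 cells in the table. A rough sketch, with variable names chosen for illustration:

```python
import numpy as np

accept_rate = 0.25   # assumed per-committee accept probability (1 in 4)
num_papers = 166     # papers that received two independent decisions

p = np.array([accept_rate, 1 - accept_rate])
# Independence: joint probability of each (committee 2, committee 1) pair
# is the outer product of the two marginal distributions.
joint = np.outer(p, p)
expected_counts = joint * num_papers

inconsistency = joint[0, 1] + joint[1, 0]                       # 3 in 8
accept_precision = joint[0, 0] / joint[0, :].sum()              # 1 in 4
reject_precision = joint[1, 1] / joint[1, :].sum()              # 3 in 4
agreed_accept_rate = joint[0, 0] / (joint[0, 0] + joint[1, 1])  # 1 in 10

print(np.round(expected_counts, 1))
print(inconsistency, accept_precision, reject_precision, agreed_accept_rate)
```

Note that the agreed accept rate here is a fraction of agreed decisions (1 out of 1 + 9), which is what gives 1 in 10.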

In [12]:

```
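# Binomial model for the number of accepts out of 340 decisions
# (170 papers x 2 committees) at an accept probability of 0.23.
rv = binom(340, 0.23)
x = np.arange(60, 120)
fig, ax = plt.subplots(figsize=(10,5))
ax.bar(x, rv.pmf(x))
display(HTML('<h3>Number of Accepted Papers for p = 0.23</h3>'))
# Red line: accepts observed across both committees (44 + 43 = 87).
ax.axvline(87,linewidth=4, color='red')
plt.show()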
```

- How reliable is the consistent accept score?

In [15]:

```
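# Binomial model for the number of consistent accepts among the
# 166 papers with two decisions, at probability 0.13 (about 22/166).
rv = binom(166, 0.13)
x = np.arange(10, 30)
fig, ax = plt.subplots(figsize=(10,5))
ax.bar(x, rv.pmf(x))
display(HTML('<h3>Number of Consistent Accepts given p=0.13</h3>'))
# Red line: the 22 consistent accepts actually observed.
ax.axvline(22,linewidth=4, color='red')
plt.show()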
```

- Multinomial distribution with three outcomes.
- Uniform Dirichlet prior.
- (doesn't account for implausibility of 'active inconsistency')
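For reference, these are the standard Dirichlet-multinomial results that such a model relies on. With observed counts $k_i$ summing to $n$ and prior parameters $\alpha_i$ summing to $\alpha_0$, the posterior is Dirichlet with parameters $\alpha_i + k_i$, so

$$
\mathbb{E}[p_i \mid k] = \frac{\alpha_i + k_i}{\alpha_0 + n},
\qquad
\mathrm{Var}[p_i \mid k] = \frac{(\alpha_i + k_i)\,(\alpha_0 + n - \alpha_i - k_i)}{(\alpha_0 + n)^2\,(\alpha_0 + n + 1)}.
$$

With the uniform prior ($\alpha_i = 1$, $\alpha_0 = 3$) and counts of 22 consistent accepts, 43 inconsistent decisions and 101 consistent rejects ($n = 166$), the posterior mean probability of a consistent accept is $23/169 \approx 0.136$.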

In [25]:

```
def posterior_mean_var(k, alpha):
    """Compute the mean and variance of the Dirichlet posterior."""
    alpha_0 = alpha.sum()
    n = k.sum()
    m = (k + alpha)
    m /= m.sum()
    # Variance of a Dirichlet with parameters alpha + k:
    # a_i * (a_0 - a_i) / (a_0**2 * (a_0 + 1)), where a_0 = alpha_0 + n.
    v = (alpha+k)*(alpha_0 - alpha + n - k)/((alpha_0+n)**2*(alpha_0+n+1))
    return m, v

# Counts: consistent accepts, inconsistent decisions, consistent rejects.
k = np.asarray([22, 43, 101])
alpha = np.ones((3,))
m, v = posterior_mean_var(k, alpha)
outcome = ['consistent accept', 'inconsistent decision', 'consistent reject']
for i in range(3):
    display(HTML("<h4>Probability of " + outcome[i] +' ' + str(m[i]) + "+/-" + str(2*np.sqrt(v[i])) + "</h4>"))
```

In [17]:

```
def sample_precisions(k, alpha, num_samps):
    """Helper function to sample from the posterior distribution of accept,
    reject and inconsistent probabilities and compute other statistics of interest
    from the samples."""
    # Each row is one posterior draw of (accept, inconsistent, reject) probabilities.
    p = np.random.dirichlet(k+alpha, size=num_samps)
    # Factors of 2 appear because inconsistent decisions
    # are being accounted for across both committees.
    ap = 2*p[:, 0]/(2*p[:, 0]+p[:, 1])
    rp = 2*p[:, 2]/(p[:, 1]+2*p[:, 2])
    aa = p[:, 0]/(p[:, 0]+p[:, 2])
    return ap, rp, aa

ap, rp, aa = sample_precisions(k, alpha, 10000)
print(ap.mean(), '+/-', 2*np.sqrt(ap.var()))
print(rp.mean(), '+/-', 2*np.sqrt(rp.var()))
print(aa.mean(), '+/-', 2*np.sqrt(aa.var()))
```

In [19]:

```
fig, ax = plt.subplots(1, 3, figsize=(15, 5))
_ = ax[0].hist(ap, 20)
_ = ax[0].set_title('Accept Precision')
ax[0].axvline(0.25, linewidth=4)
_ = ax[1].hist(rp, 20)
_ = ax[1].set_title('Reject Precision')
ax[1].axvline(0.75, linewidth=4)
_ = ax[2].hist(aa, 20)
_ = ax[2].set_title('Agreed Accept Rate')
ax[2].axvline(0.10, linewidth=4)
```


- For parallel-universe NIPS we expect between 38% and 64% of the presented papers to be the same.
- For random-parallel-universe NIPS we only expect 25% of the papers to be the same.
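One way to arrive at a range like the one quoted for the parallel-universe case is to take a central interval of posterior samples of the accept precision. The sketch below is an assumption about how such a range might be computed, not a record of the original calculation. The seed, the 95% level and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)       # fixed seed, illustrative only
counts = np.array([22, 43, 101])     # consistent accept, inconsistent, consistent reject
prior = np.ones(3)                   # uniform Dirichlet prior

# Posterior draws of the three outcome probabilities.
samples = rng.dirichlet(counts + prior, size=10000)
# Fraction of one committee's accepts that the other committee also accepts.
accept_precision = 2 * samples[:, 0] / (2 * samples[:, 0] + samples[:, 1])

lower, upper = np.percentile(accept_precision, [2.5, 97.5])
print(f"accept precision: {accept_precision.mean():.2f} "
      f"(95% interval {lower:.2f} to {upper:.2f})")
```

A percentile interval is used here rather than mean plus or minus two standard deviations; for a posterior this close to symmetric the two should give similar ranges.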

- Error types:
  - type I error: accepting a paper which should be rejected.
  - type II error: rejecting a paper which should be accepted.

- Controlling for error:
  - many reviewer discussions can be summarised as *subjective* opinions about whether controlling for type I or type II is more important.
  - with low accept rates, type I errors are *much* more common.

- Normally in such discussions we believe there is a clear underlying boundary.
- For conferences there is no clear separation point; there is a spectrum of paper quality.
- Should be explored alongside *paper scores*.