The NIPS Experiment

Examining the Repeatability of Peer Review

[Neil Lawrence](http://inverseprobability.com/) [@lawrennd](http://twitter.com/lawrennd)

with Corinna Cortes (Google, NYC)

MLPM Summer School

21st September 2015

In [1]:
# rerun Fernando's script to load in css
%run talktools
In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom
from IPython.display import HTML, display

NIPS

  • NIPS is a premier outlet for machine learning.
  • Particularly at the interface with Neuroscience and Cognitive Science.

NIPS in Numbers

  • In 2014 NIPS had:
    • 1678 submissions
    • 414 accepted papers
    • 20 oral presentations
    • 62 spotlight presentations
    • 331 poster presentations
    • 19 papers rejected without review

NIPS in Numbers

  • To review papers we had:
    • 1474 active reviewers (1133 in 2013)
    • 92 area chairs (67 in 2013)
    • 2 program chairs

The NIPS Experiment

  • How consistent was the process of peer review?
  • What would happen if you independently reran it?

The NIPS Experiment

  • We selected ~10% of NIPS papers to be reviewed twice, independently.
  • 170 papers were reviewed by two separate committees.
    • Each committee was 1/2 the size of the full committee.
    • Reviewers were allocated at random
    • Area Chairs were allocated to ensure a distribution of expertise

Decision Making Timeline

Deadline 6th June

  1. three weeks for paper bidding and allocation
  2. three weeks for review
  3. two weeks for discussion and adding/augmenting reviews/reviewers
  4. one week for author rebuttal
  5. two weeks for discussion
  6. one week for teleconferences and final decisions
  7. one week cooling off

Decisions sent 9th September

Speculation

Results

                          Committee 1
                        Accept    Reject
Committee 2   Accept      22        22
              Reject      21       101

4 papers were rejected or withdrawn without review, leaving the 166 papers shown in the table.

Summarizing the Table

  • inconsistency: 43/166 = 0.259
    • proportion of decisions that were not the same
  • accept precision: $0.5 \times 22/44 + 0.5 \times 22/43 = 0.506$
    • probability that an accepted paper would also be accepted in a rerunning
  • reject precision: $0.5\times 101/(22+101) + 0.5\times 101/(21 + 101) = 0.825$
    • probability that a rejected paper would also be rejected in a rerunning
  • agreed accept rate: $22/(22+101) = 0.179$
    • proportion of consistently decided papers that both committees accepted
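
These summary statistics can be recomputed directly from the table above; the following is a minimal sketch (the array layout and variable names are illustrative, not part of the original analysis code):

import numpy as np

# Rows: committee 2 (accept, reject); columns: committee 1 (accept, reject).
table = np.array([[22., 22.],
                  [21., 101.]])

n = table.sum()  # 166 papers reviewed by both committees
inconsistency = (table[0, 1] + table[1, 0]) / n
# Average over the two committees of P(accepted again | accepted).
accept_precision = 0.5 * table[0, 0] / table[0, :].sum() + 0.5 * table[0, 0] / table[:, 0].sum()
# Average over the two committees of P(rejected again | rejected).
reject_precision = 0.5 * table[1, 1] / table[1, :].sum() + 0.5 * table[1, 1] / table[:, 1].sum()
agreed_accept = table[0, 0] / (table[0, 0] + table[1, 1])

print('inconsistency %.3f' % inconsistency)        # 0.259
print('accept precision %.3f' % accept_precision)  # 0.506
print('reject precision %.3f' % reject_precision)  # 0.825
print('agreed accept rate %.3f' % agreed_accept)   # 0.179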

Reaction After Experiment

A Random Committee @ 25%

                              Committee 1
                          Accept            Reject
Committee 2   Accept   10.4 (1 in 16)    31.1 (3 in 16)
              Reject   31.1 (3 in 16)    93.4 (9 in 16)

Stats for Random Committee

For a random committee we would expect:

  • an inconsistency of 3 in 8 (37.5%)
  • an accept precision of 1 in 4 (25%)
  • a reject precision of 3 in 4 (75%)
  • an agreed accept rate of 1 in 10 (10%)

The actual committee's accept precision, at about 50%, is markedly better than the 25% expected at random; the sketch below shows where the random-committee numbers come from.
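
The expected values above follow from assuming each committee accepts papers independently at random with probability 1/4; a minimal sketch over the 166 papers with two decisions (variable names are illustrative):

p = 0.25   # accept probability of a purely random committee
n = 166    # papers with decisions from both committees

# Expected 2x2 table for two independent random committees.
both_accept = p * p * n              # 10.4 papers (1 in 16)
single_accept = p * (1 - p) * n      # 31.1 papers in each off-diagonal cell (3 in 16)
both_reject = (1 - p) * (1 - p) * n  # 93.4 papers (9 in 16)

inconsistency = 2 * p * (1 - p)               # 3 in 8 = 0.375
accept_precision = p                          # 1 in 4 = 0.25
reject_precision = 1 - p                      # 3 in 4 = 0.75
agreed_accept = p * p / (p * p + (1 - p)**2)  # 1 in 10 = 0.1

print('inconsistency %.3f' % inconsistency)        # 0.375
print('accept precision %.2f' % accept_precision)  # 0.25
print('reject precision %.2f' % reject_precision)  # 0.75
print('agreed accept rate %.2f' % agreed_accept)   # 0.10

Because the two committees are independent, the probability that an accepted paper is accepted again is simply the accept rate p, and likewise for rejects.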

Uncertainty: Accept Rate

In [3]:
# Binomial model for the total number of accept decisions made across the two
# committees: 340 decisions (two committees x 170 papers), with an assumed
# accept probability of p = 0.23.
rv = binom(340, 0.23)

x = np.arange(60, 120)
fig, ax = plt.subplots(figsize=(10,5))
ax.bar(x, rv.pmf(x))
display(HTML('<h3>Number of Accepted Papers for p = 0.23</h3>'))
# The red line marks the 87 accept decisions actually made (44 + 43).
ax.axvline(87, linewidth=4, color='red')
plt.show()

[Figure: Number of Accepted Papers for p = 0.23]

Uncertainty: Accept Precision

  • How reliable is the consistent accept score?
In [4]:
# Binomial model for the number of consistent accepts among the 166 papers
# reviewed by both committees, with an assumed probability of p = 0.13.
rv = binom(166, 0.13)
x = np.arange(10, 30)
fig, ax = plt.subplots(figsize=(10,5))
ax.bar(x, rv.pmf(x))
display(HTML('<h3>Number of Consistent Accepts given p=0.13</h3>'))
# The red line marks the 22 consistent accepts observed in the experiment.
ax.axvline(22, linewidth=4, color='red')
plt.show()

[Figure: Number of Consistent Accepts given p = 0.13]

Bayesian Analysis

  • A multinomial distribution over three outcomes: consistent accept, inconsistent decision, consistent reject.
  • Uniform Dirichlet prior.
    • (doesn't account for the implausibility of 'active inconsistency')
In [5]:
def posterior_mean_var(k, alpha):
    """Compute the mean and variance of the Dirichlet posterior Dir(alpha + k)."""
    alpha_0 = alpha.sum()
    n = k.sum()
    m = (k + alpha)
    m /= m.sum()
    # Dirichlet variance: a_i*(a_0 - a_i)/(a_0^2*(a_0 + 1)) with a_i = alpha_i + k_i.
    v = (alpha + k)*(alpha_0 + n - alpha - k)/((alpha_0 + n)**2*(alpha_0 + n + 1))
    return m, v

k = np.asarray([22, 43, 101])
alpha = np.ones((3,))
m, v = posterior_mean_var(k, alpha)
outcome = ['consistent accept', 'inconsistent decision', 'consistent reject']
for i in range(3):
    display(HTML("<h4>Probability of " + outcome[i] + " %.3f +/- %.3f" % (m[i], 2*np.sqrt(v[i])) + "</h4>"))

Probability of consistent accept 0.136 +/- 0.053

Probability of inconsistent decision 0.260 +/- 0.067

Probability of consistent reject 0.604 +/- 0.075

In [6]:
def sample_precisions(k, alpha, num_samps):
    """Helper function to sample from the posterior distribution of accept, 
    reject and inconsistent probabilities and compute other statistics of interest 
    from the samples."""

    p = np.random.dirichlet(k + alpha, size=num_samps)
    # Factors of 2 appear because inconsistent decisions 
    # are being accounted for across both committees.
    ap = 2*p[:, 0]/(2*p[:, 0] + p[:, 1])  # accept precision
    rp = 2*p[:, 2]/(p[:, 1] + 2*p[:, 2])  # reject precision
    aa = p[:, 0]/(p[:, 0] + p[:, 2])      # agreed accept rate
    return ap, rp, aa

ap, rp, aa = sample_precisions(k, alpha, 10000)
print ap.mean(), '+/-', 2*np.sqrt(ap.var())
print rp.mean(), '+/-', 2*np.sqrt(rp.var())
print aa.mean(), '+/-', 2*np.sqrt(aa.var())
0.508753122542 +/- 0.128980361541
0.822081388624 +/- 0.0531283853908
0.184137068656 +/- 0.0694158213505
In [7]:
fig, ax = plt.subplots(1, 3, figsize=(15, 5))
# Posterior histograms of the three statistics; the vertical lines mark the
# values expected from a random committee accepting 25% of papers.
_ = ax[0].hist(ap, 20)
_ = ax[0].set_title('Accept Precision')
_ = ax[0].axvline(0.25, linewidth=4)
_ = ax[1].hist(rp, 20)
_ = ax[1].set_title('Reject Precision')
_ = ax[1].axvline(0.75, linewidth=4)
_ = ax[2].hist(aa, 20)
_ = ax[2].set_title('Agreed Accept Rate')
_ = ax[2].axvline(0.10, linewidth=4)

Conclusion

  • For a parallel-universe NIPS we would expect between 38% and 64% of the presented papers to be the same (from the sampled accept precision of roughly 0.51 +/- 0.13).
  • For a random-parallel-universe NIPS we would expect only 25% of the papers to be the same.

Discussion

  • Error types:
    • a type I error is accepting a paper that should have been rejected.
    • a type II error is rejecting a paper that should have been accepted.
  • Controlling for error:
    • many reviewer discussions can be summarised as subjective opinions about whether controlling for type I or type II errors is more important.
    • with low accept rates, type I errors are much more common: here roughly half of the accepted papers would not have been accepted by the other committee, whereas only about 18% of the rejected papers would have been accepted.
  • Normally in such discussions we assume there is a clear underlying decision boundary.
  • For conferences there is no clear separation point; there is a spectrum of paper quality.
  • This should be explored alongside the paper scores.