The NIPS Experiment
Neil D. Lawrence
RADIANT Meeting, University of Zurich, Switzerland
NeurIPS in Numbers
- To review papers we had:
- 1474 active reviewers (1133 in 2013)
- 92 area chairs (67 in 2013)
- 2 program chairs
NeurIPS in Numbers
- In 2014 NeurIPS had:
- 1678 submissions
- 414 accepted papers
- 20 oral presentations
- 62 spotlight presentations
- 331 poster presentations
- 19 papers rejected without review
The NeurIPS Experiment
- How consistent was the process of peer review?
- What would happen if you independently reran it?
The NeurIPS Experiment
- We selected ~10% of NeurIPS papers to be reviewed twice, independently.
- 170 papers were reviewed by two separate committees.
- Each committee was 1/2 the size of the full committee.
- Reviewers were allocated at random.
- Area chairs were allocated to ensure a distribution of expertise.
Timeline for NeurIPS
- Submission deadline 6th June
- three weeks for paper bidding and allocation
- three weeks for review
- two weeks for discussion and adding/augmenting reviews/reviewers
- one week for author rebuttal
- two weeks for discussion
- one week for teleconferences and final decisions
- one week cooling off
- Decisions sent 9th September
Speculation
NeurIPS Experiment Results
- 4 papers rejected or withdrawn without review.
Reaction After Experiment
A Random Committee @ 25%
|                     | Committee 1: Accept | Committee 1: Reject |
|---------------------|---------------------|---------------------|
| Committee 2: Accept | 10.4 (1 in 16)      | 31.1 (3 in 16)      |
| Committee 2: Reject | 31.1 (3 in 16)      | 93.4 (9 in 16)      |
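The expected counts in the table above follow from simple independence arithmetic. The sketch below (a minimal illustration, assuming a 25% accept rate for each committee and 166 jointly reviewed papers, i.e. 170 minus the 4 handled without review) reproduces them; it is not part of the original analysis.

```python
# Sketch: expected 2x2 cell counts for two committees that each accept 25%
# of 166 jointly reviewed papers, assuming their decisions are independent.
n_papers = 166   # 170 selected minus 4 rejected/withdrawn without review
p_accept = 0.25  # assumed per-committee accept rate

expected = {
    ("accept", "accept"): n_papers * p_accept * p_accept,              # 1 in 16
    ("accept", "reject"): n_papers * p_accept * (1 - p_accept),        # 3 in 16
    ("reject", "accept"): n_papers * (1 - p_accept) * p_accept,        # 3 in 16
    ("reject", "reject"): n_papers * (1 - p_accept) * (1 - p_accept),  # 9 in 16
}

for cell, count in expected.items():
    print(cell, round(count, 1))
# -> roughly 10.4, 31.1, 31.1 and 93.4, matching the table above
```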
NeurIPS Experiment Results
|                     | Committee 1: Accept | Committee 1: Reject |
|---------------------|---------------------|---------------------|
| Committee 2: Accept | 22                  | 22                  |
| Committee 2: Reject | 21                  | 101                 |
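As a quick illustration (a sketch over the numbers in the table above, not part of the committees' own analysis), the overlap between the two accept lists can be computed directly:

```python
# Observed 2x2 table from the experiment (values taken from the slide).
accept_accept = 22    # accepted by both committees
accept_reject = 22    # accepted by committee 2, rejected by committee 1
reject_accept = 21    # rejected by committee 2, accepted by committee 1
reject_reject = 101   # rejected by both

total = accept_accept + accept_reject + reject_accept + reject_reject  # 166

# Of the papers one committee accepted, what fraction did the other also accept?
c1_accepts = accept_accept + reject_accept   # committee 1 accepted 43
c2_accepts = accept_accept + accept_reject   # committee 2 accepted 44

print(accept_accept / c1_accepts)  # ~0.51
print(accept_accept / c2_accepts)  # 0.50
# Roughly half of one committee's accepted papers were rejected by the other,
# against the 25% overlap expected from purely random committees.
```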
A Random Committee @ 25%
|                     | Committee 1: Accept | Committee 1: Reject |
|---------------------|---------------------|---------------------|
| Committee 2: Accept | 10                  | 31                  |
| Committee 2: Reject | 31                  | 93                  |
Conclusion
- For parallel-universe NIPS we expect between 38% and 64% of the presented papers to be the same.
- For random-parallel-universe NIPS we only expect 25% of the papers to be the same.
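The 38% to 64% range above comes from the original analysis of the experiment. As a rough, hedged reconstruction only, a simple normal-approximation binomial interval on the observed accept overlap gives an interval of similar width:

```python
# Rough illustration (an assumption, not the original calculation): a
# normal-approximation 95% interval on the observed accept-list overlap.
import math

overlap = 22        # papers accepted by both committees
n_accepted = 44     # papers accepted by committee 2 (22 + 22)

p_hat = overlap / n_accepted
se = math.sqrt(p_hat * (1 - p_hat) / n_accepted)
low, high = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"{low:.0%} to {high:.0%}")   # roughly 35% to 65%

# By comparison, two independent committees at a 25% accept rate would agree
# on only 25% of their accepted papers in expectation.
```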
Discussion
- Error types:
- Type I error: accepting a paper that should be rejected.
- Type II error: rejecting a paper that should be accepted.
- Controlling for error:
- Many reviewer discussions can be summarised as subjective opinions about whether controlling for type I or type II errors is more important.
- With low accept rates, type I errors are much more common.
- Normally in such discussions we believe there is a clear underlying boundary.
- For conferences there are no clear separation points; there is a spectrum of paper quality.
- This spectrum should be explored alongside paper scores.