We selected ~10% of NeurIPS papers to be reviewed twice, independently.
* 170 papers were reviewed by two separate committees.
* Each committee was half the size of the full committee.
* Reviewers were allocated at random.
* Area Chairs were allocated to ensure a distribution of expertise.
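A minimal sketch of this duplication protocol. Everything below except the 170 duplicated papers is a hypothetical placeholder (the IDs, pool sizes, and split are illustrative, not the actual CMT assignment):

```python
import random

random.seed(2014)

# Hypothetical paper and reviewer pools -- sizes and IDs are illustrative.
papers = [f"paper_{i:04d}" for i in range(1700)]
duplicated = random.sample(papers, k=170)  # ~10% get two committees

reviewers = [f"reviewer_{i:04d}" for i in range(1400)]
random.shuffle(reviewers)  # reviewers allocated at random
half = len(reviewers) // 2
committee_a, committee_b = reviewers[:half], reviewers[half:]
```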
Notes on the Timeline for NeurIPS
AC recruitment (3 waves):
* 17/02/2014
* 08/03/2014
* 09/04/2014
We requested names of reviewers from ACs in two waves:
* 25/03/2014
* 11/04/2014
Reviewer recruitment (4 waves):
* 14/04/2014
* 28/04/2014
* 09/05/2014
* 10/06/2014 (note this is after the submission deadline … lots of area chairs asked for reviewers after the deadline!). We invited them en masse.
* 06/06/2014 Submission Deadline
* 12/06/2014 Bidding Open for Area Chairs (this was delayed by CMT issues)
* 17/06/2014 Bidding Open for Reviewers
* 01/07/2014 Start Reviewing
* 21/07/2014 Reviewing Deadline
* 04/08/2014 Reviews to Authors
* 11/08/2014 Author Rebuttal Due
* 25/08/2014 Teleconferences Begin
* 30/08/2014 Teleconferences End
* 01/09/2014 Preliminary Decisions Made
* 09/09/2014 Decisions Sent to Authors
Decision Making Timeline
* Deadline: 6th June
* Three weeks for paper bidding and allocation
* Three weeks for review
* Two weeks for discussion and adding/augmenting reviews/reviewers
For a random committee (each committee independently accepting papers at the overall accept rate of about 1 in 4) we expect:
* inconsistency of 3 in 8 (37.5%)
* accept precision of 1 in 4 (25%)
* reject precision of 3 in 4 (75%)
* an agreed accept rate of 1 in 10 (10%).
The actual committees' accept precision was markedly better, at 50%.
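These baseline figures follow from treating each random committee as an independent coin flip with accept probability q = 1/4; a quick check of the arithmetic (q is the only assumption here):

```python
q = 0.25  # assumed accept rate for the random-committee baseline

inconsistency = 2 * q * (1 - q)   # P(committees disagree)   = 3/8
accept_precision = q              # P(B accepts | A accepts) = 1/4
reject_precision = 1 - q          # P(B rejects | A rejects) = 3/4
# Of the papers the committees agree on, the fraction both accept:
agreed_accept = q**2 / (q**2 + (1 - q) ** 2)                # = 1/10

print(inconsistency, accept_precision, reject_precision, agreed_accept)
# -> 0.375 0.25 0.75 0.1
```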
Uncertainty: Accept Rate
Uncertainty: Accept Precision
How reliable is the consistent accept score?
Bayesian Analysis
* Multinomial distribution with three outcomes: both accept, both reject, disagree.
* Uniform Dirichlet prior.
* (This doesn't account for the implausibility of 'active inconsistency', i.e. committees disagreeing more often than chance.)
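A sketch of this analysis in code. The outcome counts below are illustrative placeholders of roughly the right shape, not the exact experimental tallies, so the interval it prints will not exactly match the published one:

```python
import numpy as np

rng = np.random.default_rng(1)

# Outcome counts on the duplicated papers: (both accept, disagree,
# both reject). Placeholder values, not the exact experimental tallies.
counts = np.array([16, 43, 107])

# Uniform Dirichlet(1, 1, 1) prior + multinomial likelihood gives a
# Dirichlet(counts + 1) posterior; draw samples from it.
p_aa, p_dis, p_rr = rng.dirichlet(counts + 1, size=200_000).T

# A single committee's accept rate is p_aa + p_dis / 2 (on average it
# accepts half of the disagreements), so the accept precision is:
precision = p_aa / (p_aa + p_dis / 2)

print(np.percentile(precision, [2.5, 50.0, 97.5]))  # 95% credible interval
```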
Conclusion
* For parallel-universe NIPS we expect between 38% and 64% of the presented papers to be the same.
* For random-parallel-universe NIPS we would expect only 25% of the papers to be the same.
Discussion
Error types:
* A type I error is accepting a paper that should be rejected.
* A type II error is rejecting a paper that should be accepted.
Controlling for error:
* Many reviewer discussions can be summarised as subjective opinions about whether controlling for type I or type II errors is more important.
* With low accept rates, type I errors are much more common.
Normally in such discussions we believe there is a clear underlying decision boundary.
For conferences there is no clear separation point; there is a spectrum of paper quality.
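A small simulation makes this concrete, as a sketch under assumed parameters: latent paper quality lies on a continuum, each committee observes quality plus independent reviewing noise, and each accepts its top quarter. The noise scale below is an assumption, not something estimated from the experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

n_papers = 100_000
noise_scale = 1.0  # assumed reviewing noise, relative to the quality spread
quality = rng.normal(size=n_papers)  # a spectrum of quality, no clear gap

def committee_decisions(quality, rng, accept_rate=0.25):
    """Score with independent noise, then accept the top `accept_rate`."""
    scores = quality + noise_scale * rng.normal(size=quality.size)
    return scores > np.quantile(scores, 1.0 - accept_rate)

a = committee_decisions(quality, rng)
b = committee_decisions(quality, rng)

print("inconsistency:    ", np.mean(a != b))
print("accept precision: ", np.mean(a & b) / np.mean(a))
```

Setting the noise to zero recovers a perfectly consistent committee; making it very large recovers the random-committee baseline above, so varying it shows how a quality spectrum plus reviewing noise trades off against consistency.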