# The NeurIPS Experiment

**Swiss National Science Foundation** on May 10, 2022

#### Abstract

In 2014, along with Corinna Cortes, I was Program Chair of the Neural Information Processing Systems conference. At the time, when wondering about innovations for the conference, Corinna and I decided it would be interesting to test the consistency of reviewing. With this in mind, we randomly selected 10% of submissions and had them reviewed by two independent committees. In this talk I will briefly review the construction of the experiment, explain how the NeurIPS review process worked and talk about what I felt the implications for reviewing were, vs what the community reaction was.

# Introduction

The NIPS experiment was an experiment to determine the consistency of the review process. After receiving papers, we selected 10% that would be independently rereviewed. The idea was to determine how consistent the decisions of the two independent committees would be. In 2014 NIPS received 1678 submissions and we selected 170 for the experiment. These papers are referred to below as ‘duplicated papers.’

To run the experiment, we created two separate committees within the NIPS program committee. The idea was that the two separate committees would review each duplicated paper independently and the results would then be compared.

## NeurIPS in Numbers

In 2014 the NeurIPS conference had 1474 active reviewers (up from 1133 in 2013), 92 area chairs (up from 67 in 2013) and two program chairs, Corinna Cortes and me.

The conference received 1678 submissions and presented 414 accepted papers, of which 20 were presented as talks in the single-track session, 62 were presented as spotlights and 331 papers were presented as posters. Of the 1678 submissions, 19 papers were rejected without review.

## The NeurIPS Experiment

The objective of the NeurIPS experiment was to determine how consistent the process of peer review is. One way of phrasing this question is to ask: what would happen to submitted papers in the conference if the process was independently rerun?

For the 2014 conference, to explore this question, we selected \(\approx 10\%\) of submitted papers to be reviewed twice, by independent committees. This led to 170 papers being selected from the conference for dual reviewing. For these papers the program committee was divided into two. Reviewers were placed randomly on one side of the committee or the other. For Program Chairs we also engaged in some manual selection to ensure we had expert coverage in all the conference areas on both sides of the committee.

## Timeline for NeurIPS

Chairing a conference starts with recruitment of the program committee, which is usually done in a few stages. The primary task is to recruit the area chairs. We sent out our program committee invites in three waves.

- 17/02/2014
- 08/03/2014
- 09/04/2014

By recruiting area chairs first, you can involve them in recruiting reviewers. We requested names of reviewers from ACs in two waves.

- 25/03/2014
- 11/04/2014

In 2014, this wasn’t enough to obtain the requisite number of reviewers, so we used additional approaches. These included lists of previous NeurIPS authors. For each individual we were looking for at least two previously published papers from NeurIPS and other leading ML venues like ICML, AISTATS, COLT, UAI, etc. We made extensive use of DBLP for verifying each potential reviewer’s publication track record.

- 14/04/2014
- 28/04/2014
- 09/05/2014
- 10/06/2014 (note this is after deadline … lots of area chairs asked for reviewers after the deadline!). We invited them en masse.

- 06/06/2014 Submission Deadline
- 12/06/2014 Bidding Open for Area Chairs (this was *delayed* by CMT issues)
- 17/06/2014 Bidding Open for Reviewers
- 01/07/2014 Start Reviewing
- 21/07/2014 Reviewing deadline
- 04/08/2014 Reviews to Authors
- 11/08/2014 Author Rebuttal Due
- 25/08/2014 Teleconferences Begin
- 30/08/2014 Teleconferences End
- 01/09/2014 Preliminary Decisions Made
- 09/09/2014 Decisions Sent to Authors

## Paper Scoring and Reviewer Instructions

The instructions to reviewers for the 2014 conference are still available online here.

To keep the quality of reviews high, we tried to keep the load low. We didn’t assign any reviewer more than 5 papers; most reviewers received 4 papers.

## Quantitative Evaluation

Reviewers give a score of between 1 and 10 for each paper. The program committee will interpret the numerical score in the following way:

10: Top 5% of accepted NIPS papers, a seminal paper for the ages.

I will consider not reviewing for NIPS again if this is rejected.

9: Top 15% of accepted NIPS papers, an excellent paper, a strong accept.

I will fight for acceptance.

8: Top 50% of accepted NIPS papers, a very good paper, a clear accept.

I vote and argue for acceptance.

7: Good paper, accept.

I vote for acceptance, although would not be upset if it were rejected.

6: Marginally above the acceptance threshold.

I tend to vote for accepting it, but leaving it out of the program would be no great loss.

5: Marginally below the acceptance threshold.

I tend to vote for rejecting it, but having it in the program would not be that bad.

4: An OK paper, but not good enough. A rejection.

I vote for rejecting it, although would not be upset if it were accepted.

3: A clear rejection.

I vote and argue for rejection.

2: A strong rejection. I’m surprised it was submitted to this conference.

I will fight for rejection.

1: Trivial or wrong or known. I’m surprised anybody wrote such a paper.

I will consider not reviewing for NIPS again if this is accepted.

Reviewers should NOT assume that they have received an unbiased sample of papers, nor should they adjust their scores to achieve an artificial balance of high and low scores. Scores should reflect absolute judgments of the contributions made by each paper.

## Impact Score

The impact score was an innovation introduced in 2013 by Ghahramani and Welling that we retained for 2014. Quoting from the instructions to reviewers:

Independently of the Quality Score above, this is your opportunity to identify papers that are very different, original, or otherwise potentially impactful for the NIPS community.

There are two choices:

2: This work is different enough from typical submissions to potentially have a major impact on a subset of the NIPS community.

1: This work is incremental and unlikely to have much impact even though it may be technically correct and well executed.

Examples of situations where the impact and quality scores may point in opposite directions include papers which are technically strong but unlikely to generate much follow-up research, or papers that have some flaw (e.g. not enough evaluation, not citing the right literature) but could lead to new directions of research.

## Confidence Score

Reviewers also give a confidence score between 1 and 5 for each paper. The program committee will interpret the numerical score in the following way:

5: The reviewer is absolutely certain that the evaluation is correct and very familiar with the relevant literature.

4: The reviewer is confident but not absolutely certain that the evaluation is correct. It is unlikely but conceivable that the reviewer did not understand certain parts of the paper, or that the reviewer was unfamiliar with a piece of relevant literature.

3: The reviewer is fairly confident that the evaluation is correct. It is possible that the reviewer did not understand certain parts of the paper, or that the reviewer was unfamiliar with a piece of relevant literature. Mathematics and other details were not carefully checked.

2: The reviewer is willing to defend the evaluation, but it is quite likely that the reviewer did not understand central parts of the paper.

1: The reviewer’s evaluation is an educated guess. Either the paper is not in the reviewer’s area, or it was extremely difficult to understand.

## Qualitative Evaluation

All NIPS papers should be good scientific papers, regardless of their specific area. We judge whether a paper is good using four criteria; a reviewer should comment on all of these, if possible:

Quality

Is the paper technically sound? Are claims well-supported by theoretical analysis or experimental results? Is this a complete piece of work, or merely a position paper? Are the authors careful (and honest) about evaluating both the strengths and weaknesses of the work?

Clarity

Is the paper clearly written? Is it well-organized? (If not, feel free to make suggestions to improve the manuscript.) Does it adequately inform the reader? (A superbly written paper provides enough information for the expert reader to reproduce its results.)

Originality

Are the problems or approaches new? Is this a novel combination of familiar techniques? Is it clear how this work differs from previous contributions? Is related work adequately referenced? We recommend that you check the proceedings of recent NIPS conferences to make sure that each paper is significantly different from papers in previous proceedings. Abstracts and links to many of the previous NIPS papers are available from http://books.nips.cc

Significance

Are the results important? Are other people (practitioners or researchers) likely to use these ideas or build on them? Does the paper address a difficult problem in a better way than previous research? Does it advance the state of the art in a demonstrable way? Does it provide unique data, unique conclusions on existing data, or a unique theoretical or pragmatic approach?

## NeurIPS Experiment Results

The results of the experiment were as follows. Of the 170 papers, 4 had to be withdrawn or were rejected without completing the review process. For the remaining 166, the ‘confusion matrix’ for the two committees’ decisions is shown in the table below.

| | Committee 1: Accept | Committee 1: Reject |
|---|---|---|
| **Committee 2: Accept** | 22 | 22 |
| **Committee 2: Reject** | 21 | 101 |

## Summarizing the Table

There are a few ways of summarizing the numbers in this table as percentages or probabilities. First, the inconsistency: the proportion of decisions that were not the same across the two committees. The decisions were inconsistent for 43 out of 166 papers, or 0.259 as a proportion. This number is perhaps a natural way of summarizing the figures if you are submitting your paper and wish to know an estimate of the probability that your paper would have received different decisions from the two committees. Secondly, the accept precision: if you are attending the conference and looking at any given paper, then you might want to know the probability that the paper would have been rejected in an independent rerunning of the conference. We can estimate this for Committee 1’s conference as 21/(22+21) = 0.49 (49%) and for Committee 2’s conference as 22/(22+22) = 0.5 (50%). Averaging the two estimates gives us 49.5%. Finally, the reject precision: if your paper was rejected from the conference, you might like an estimate of the probability that the same paper would be rejected again if the review process had been independently rerun. That estimate is 101/(22+101) = 0.82 (82%) for Committee 1 and 101/(21+101) = 0.83 (83%) for Committee 2, or on average 82.5%. A final quality estimate might be the ratio of consistent accepts to consistent rejects, or the agreed accept rate, 22/101 = 0.22 (22%).

*inconsistency* = 43/166 = **0.259** - proportion of decisions that were not the same

*accept precision* = \(0.5 \times 22/44\) + \(0.5 \times 21/43\) = **0.495** - probability any accepted paper would be rejected in a rerunning

*reject precision* = \(0.5\times 101/(22+101)\) + \(0.5\times 101/(21 + 101)\) = **0.825** - probability any rejected paper would be rejected in a rerunning

*agreed accept rate* = 22/101 = **0.218** - ratio between agreed accepted papers and agreed rejected papers.
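The four summaries can be recomputed directly from the confusion matrix. The following is a minimal sketch, assuming the table layout above (rows are Committee 2's decisions, columns are Committee 1's); the variable names are illustrative and not taken from the original analysis.

```python
import numpy as np

# Rows: Committee 2 (accept, reject); columns: Committee 1 (accept, reject).
confusion = np.array([[22, 22],
                      [21, 101]])

total = confusion.sum()                       # papers that completed both processes
both_accept = confusion[0, 0]
both_reject = confusion[1, 1]
disagree = confusion[0, 1] + confusion[1, 0]

# Proportion of papers where the two committees disagreed.
inconsistency = disagree / total

# Probability an accepted paper is rejected by the other committee,
# averaged over the two committees' accept lists.
c1_accepts = confusion[:, 0].sum()
c2_accepts = confusion[0, :].sum()
accept_precision = (0.5 * confusion[1, 0] / c1_accepts
                    + 0.5 * confusion[0, 1] / c2_accepts)

# Probability a rejected paper is rejected again by the other committee.
c1_rejects = confusion[:, 1].sum()
c2_rejects = confusion[1, :].sum()
reject_precision = (0.5 * both_reject / c1_rejects
                    + 0.5 * both_reject / c2_rejects)

# Ratio of agreed accepts to agreed rejects.
agreed_accept_rate = both_accept / both_reject

for name, value in [("inconsistency", inconsistency),
                    ("accept precision", accept_precision),
                    ("reject precision", reject_precision),
                    ("agreed accept rate", agreed_accept_rate)]:
    print(f"{name}: {value:.3f}")
```

Keeping the counts in one array makes it easy to see which row or column sum each denominator comes from.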

## Reviewer Calibration

Calibration of reviewers is the process where different interpretations of the reviewing scale are addressed. The tradition of calibration goes at least as far back as John Platt’s Program Chairing, and included a Bayesian model by Ge, Welling and Ghahramani at NeurIPS 2013.

## Reviewer Calibration Model

In this notebook we deal with reviewer calibration. Our assumption is that the score from the \(j\)th reviewer for the \(i\)th paper is given by \[ y_{i,j} = f_i + b_j + \epsilon_{i, j} \] where \(f_i\) is the ‘objective quality’ of paper \(i\) and \(b_j\) is an offset associated with reviewer \(j\). \(\epsilon_{i,j}\) is a subjective quality estimate which reflects how a specific reviewer’s opinion differs from other reviewers (such differences in opinion may be due to differing expertise or perspective). The underlying ‘objective quality’ of the paper is assumed to be the same for all reviewers and the reviewer offset is assumed to be the same for all papers.

If we have \(n\) papers and \(m\) reviewers, then this implies \(n\) + \(m\) + \(nm\) values need to be estimated. Naturally this is too many, and we can start by assuming that the subjective quality is drawn from a normal density with variance \(\sigma^2\) \[ \epsilon_{i, j} \sim N(0, \sigma^2) \] which reduces us to \(n\) + \(m\) + 1 parameters. Further we can assume that the objective quality is also normally distributed with mean \(\mu\) and variance \(\alpha_f\), \[ f_i \sim N(\mu, \alpha_f) \] this now reduces us to \(m\)+3 parameters. However, we only have approximately \(4m\) observations (4 papers per reviewer) so parameters may still not be that well determined (particularly for those reviewers that have only one review). We, therefore, finally, assume that reviewer offset is normally distributed with zero mean, \[ b_j \sim N(0, \alpha_b), \] leaving us only four parameters: \(\mu\), \(\sigma^2\), \(\alpha_f\) and \(\alpha_b\). Combined together these three assumptions imply that \[ \mathbf{y} \sim N(\mu \mathbf{1}, \mathbf{K}), \] where \(\mathbf{y}\) is a vector of stacked scores, \(\mathbf{1}\) is the vector of ones and the elements of the covariance function are given by \[ k(i,j; k,l) = \delta_{i,k} \alpha_f + \delta_{j,l} \alpha_b + \delta_{i, k}\delta_{j,l} \sigma^2, \] where \(i\) and \(j\) are the index of the first paper and reviewer and \(k\) and \(l\) are the index of the second paper and reviewer. The mean is easily estimated by maximum likelihood and is given as the mean of all scores.
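To make the covariance function concrete, here is a small sketch that builds \(\mathbf{K}\) entry by entry from the Kronecker-delta formula. The list of (paper, reviewer) pairs and the variance values are hypothetical, chosen only for illustration, and the sketch assumes each paper and reviewer pairing appears at most once.

```python
import numpy as np

# Hypothetical toy data: each review is a (paper index, reviewer index) pair.
reviews = [(0, 0), (0, 1), (1, 0), (1, 2), (2, 1), (2, 2)]

# Illustrative variances (assumed values, not the fitted ones).
alpha_f, alpha_b, sigma2 = 1.0, 0.25, 1.0


def covariance(reviews, alpha_f, alpha_b, sigma2):
    """k(i,j;k,l) = d_ik*alpha_f + d_jl*alpha_b + d_ik*d_jl*sigma2."""
    n = len(reviews)
    K = np.zeros((n, n))
    for a, (i, j) in enumerate(reviews):
        for b, (k, l) in enumerate(reviews):
            same_paper = float(i == k)
            same_reviewer = float(j == l)
            K[a, b] = (same_paper * alpha_f
                       + same_reviewer * alpha_b
                       + same_paper * same_reviewer * sigma2)
    return K


K = covariance(reviews, alpha_f, alpha_b, sigma2)
print(K)
```

Two reviews of the same paper share `alpha_f`, two reviews by the same reviewer share `alpha_b`, and the noise term only appears on the diagonal.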

It is convenient to reparametrize slightly into an overall scale \(\alpha_f\), and normalized variance parameters, \[ k(i,j; k,l) = \alpha_f\left(\delta_{i,k} + \delta_{j,l} \frac{\alpha_b}{\alpha_f} + \delta_{i, k}\delta_{j,l} \frac{\sigma^2}{\alpha_f}\right) \] which we rewrite to give two ratios: the offset/signal ratio, \(\hat{\alpha}_b\), and the noise/signal ratio, \(\hat{\sigma}^2\), \[ k(i,j; k,l) = \alpha_f\left(\delta_{i,k} + \delta_{j,l} \hat{\alpha}_b + \delta_{i, k}\delta_{j,l} \hat{\sigma}^2\right). \] The advantage of this parameterization is that it allows us to optimize \(\alpha_f\) directly (with a fixed-point equation) and it will be very well determined. This leaves us with two free parameters that we can explore on a grid. It is in these parameters that we expect the remaining underdeterminedness of the model. We expect \(\alpha_f\) to be well determined because the negative log likelihood is now, up to an additive constant, \[ \frac{|\mathbf{y}|}{2}\log\alpha_f + \frac{1}{2}\log \left|\hat{\mathbf{K}}\right| + \frac{1}{2\alpha_f}\mathbf{y}^\top \hat{\mathbf{K}}^{-1} \mathbf{y}, \] where \(|\mathbf{y}|\) is the length of \(\mathbf{y}\) (i.e. the number of reviews) and \(\hat{\mathbf{K}}=\alpha_f^{-1}\mathbf{K}\) is the scale normalized covariance. This negative log likelihood is easily minimized to recover \[ \alpha_f = \frac{1}{|\mathbf{y}|} \mathbf{y}^\top \hat{\mathbf{K}}^{-1} \mathbf{y}. \] A Bayesian analysis of this parameter is possible with gamma priors, but it would merely show that this parameter is extremely well determined (the degrees of freedom parameter of the associated Student-\(t\) marginal likelihood scales with the number of reviews, which will be around \(|\mathbf{y}| \approx 6,000\) in our case).
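The closed-form update for \(\alpha_f\) can be checked numerically. The sketch below uses a randomly generated positive definite matrix as a stand-in for \(\hat{\mathbf{K}}\) (an assumption for illustration, not the conference data) and confirms that the negative log likelihood is no lower at nearby values of \(\alpha_f\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical normalized covariance: any symmetric positive definite matrix works here.
n = 50
A = rng.standard_normal((n, n))
K_hat = A @ A.T / n + np.eye(n)

# Draw centred scores with an assumed scale.
true_alpha_f = 2.0
y = rng.multivariate_normal(np.zeros(n), true_alpha_f * K_hat)


def neg_log_likelihood(alpha_f, y, K_hat):
    """Negative log likelihood, omitting the constant (|y|/2) log(2 pi)."""
    _, logdet = np.linalg.slogdet(K_hat)
    quad = y @ np.linalg.solve(K_hat, y)
    return 0.5 * len(y) * np.log(alpha_f) + 0.5 * logdet + 0.5 * quad / alpha_f


# Closed-form minimizer: alpha_f = y^T K_hat^{-1} y / |y|.
alpha_f_hat = y @ np.linalg.solve(K_hat, y) / len(y)

for scale in (0.9, 1.0, 1.1):
    value = neg_log_likelihood(scale * alpha_f_hat, y, K_hat)
    print(f"alpha_f = {scale:.1f} * alpha_f_hat, nll = {value:.4f}")
```

Setting the derivative of the objective with respect to \(\alpha_f\) to zero gives \(|\mathbf{y}|/(2\alpha_f) = \mathbf{y}^\top \hat{\mathbf{K}}^{-1} \mathbf{y}/(2\alpha_f^2)\), which rearranges to the expression used for `alpha_f_hat`.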

So, we propose to proceed as follows. Set the mean from the reviews (\(\mu\)) and then choose a two-dimensional grid of parameters for reviewer offset and diversity. For each parameter choice, optimize to find \(\alpha_f\) and then evaluate the likelihood. In the worst case this will require us to invert \(\hat{\mathbf{K}}\), but if the reviewer paper groups are disconnected, it can be done a lot quicker. The next stage is to load in the reviews for analysis.
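The grid procedure described above might be sketched as follows. Everything here is illustrative: the toy conference, the simulated scores, the grid ranges and the helper name `profile_nll` are assumptions for the example, not the settings used for the real data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy conference: 40 papers, 30 reviewers, 3 reviews per paper.
n_papers, n_reviewers, per_paper = 40, 30, 3
paper_idx = np.repeat(np.arange(n_papers), per_paper)
reviewer_idx = np.concatenate([rng.choice(n_reviewers, per_paper, replace=False)
                               for _ in range(n_papers)])
X1 = np.eye(n_papers)[paper_idx]         # review-by-paper indicators
X2 = np.eye(n_reviewers)[reviewer_idx]   # review-by-reviewer indicators

# Simulate scores from the model with assumed parameters.
f = rng.normal(0.0, 1.0, n_papers)
b = rng.normal(0.0, 0.5, n_reviewers)
scores = 5.0 + f[paper_idx] + b[reviewer_idx] + rng.normal(0.0, 1.0, len(paper_idx))
y = scores - scores.mean()               # remove the maximum likelihood mean


def profile_nll(alpha_b_hat, sigma2_hat):
    """Set alpha_f in closed form, then return (negative log likelihood, alpha_f)."""
    K_hat = X1 @ X1.T + alpha_b_hat * (X2 @ X2.T) + sigma2_hat * np.eye(len(y))
    _, logdet = np.linalg.slogdet(K_hat)
    quad = y @ np.linalg.solve(K_hat, y)
    alpha_f = quad / len(y)
    # At the optimum the quadratic term equals |y|/2.
    nll = 0.5 * len(y) * np.log(2 * np.pi * alpha_f) + 0.5 * logdet + 0.5 * len(y)
    return nll, alpha_f


grid_b = np.linspace(0.05, 2.0, 20)      # offset/signal ratios to try
grid_s = np.linspace(0.05, 3.0, 30)      # noise/signal ratios to try
nll = np.array([[profile_nll(ab, s2)[0] for s2 in grid_s] for ab in grid_b])

best = np.unravel_index(nll.argmin(), nll.shape)
best_alpha_b_hat, best_sigma2_hat = grid_b[best[0]], grid_s[best[1]]
best_alpha_f = profile_nll(best_alpha_b_hat, best_sigma2_hat)[1]
print(best_alpha_b_hat, best_sigma2_hat, best_alpha_f)
```

This version solves against the full matrix at every grid point; exploiting disconnected reviewer and paper groups would replace that with one smaller solve per group.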

## Fitting the Model

```
import cmtutils as cu
import os
import pandas as pd
import numpy as np
import GPy
from scipy.sparse.csgraph import connected_components
from scipy.linalg import solve_triangular
```

`date = '2014-09-06'`

## Loading in the Data

```
filename = date + '_reviews.xls'
reviews = cu.CMT_Reviews_read(filename=filename)
papers = list(sorted(set(reviews.reviews.index), key=int))
reviews.reviews = reviews.reviews.loc[papers]
```

The maximum likelihood solution for \(\mu\) is simply the mean quality of the papers, which is easily computed.

```
mu = reviews.reviews.Quality.mean()
print("Mean value, mu = ", mu)
```

## Data Preparation

We take the reviews, which are indexed by the paper number, and create a new data frame that is indexed by paper id and email combined. From these reviews we tokenize the `PaperID` and the `Email` to extract two matrices that can be used in the creation of covariance matrices. We also create a target vector, which is the mean-centred vector of scores.

```
r = reviews.reviews.reset_index()
r.rename(columns={'ID':'PaperID'}, inplace=True)
r.index = r.PaperID + '_' + r.Email
X1 = pd.get_dummies(r.PaperID)
X1 = X1[sorted(X1.columns, key=int)]
X2 = pd.get_dummies(r.Email)
X2 = X2[sorted(X2.columns, key=str.lower)]
y = reviews.reviews.Quality - mu
```
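As a small illustration of the tokenization step, the sketch below applies `pd.get_dummies` to a hypothetical four-review data frame and shows how the resulting indicator matrices give the paper and reviewer covariance structure. The data and email addresses are made up, and `dtype=float` is passed because recent pandas versions return boolean indicators by default.

```python
import pandas as pd

# Hypothetical reviews: two papers, three reviewers.
r = pd.DataFrame({
    "PaperID": ["1", "1", "2", "2"],
    "Email": ["a@x.org", "b@x.org", "a@x.org", "c@x.org"],
    "Quality": [6, 7, 4, 5],
})
r.index = r.PaperID + "_" + r.Email

# One indicator column per paper and per reviewer.
X1 = pd.get_dummies(r.PaperID, dtype=float)
X2 = pd.get_dummies(r.Email, dtype=float)

# Outer products mark which pairs of reviews share a paper or a reviewer.
K_f = X1.to_numpy() @ X1.to_numpy().T
K_b = X2.to_numpy() @ X2.to_numpy().T

# Mean-centred target vector.
y = r.Quality - r.Quality.mean()

print(X1)
print(K_f)
print(K_b)
```

Each row of `X1` and `X2` contains a single one, so the products contain a one exactly where two reviews share a paper (`K_f`) or a reviewer (`K_b`).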

### Constructing the Model in GPy

Having reduced the model to two parameters, I was hopeful I could set parameters broadly by hand. My initial expectation was that `alpha_b` and `sigma2` would both be less than 1, but some playing with parameters showed this wasn’t the case. Rather than waste further time, I decided to use our `GPy` software (see below) to find a maximum likelihood solution for the parameters.

Model construction firstly involves constructing covariance functions for the model and concatenating `X1` and `X2` to a new input matrix `X`.

```
X = X1.join(X2)
kern1 = GPy.kern.Linear(input_dim=len(X1.columns), active_dims=np.arange(len(X1.columns)))
kern1.name = 'K_f'
kern2 = GPy.kern.Linear(input_dim=len(X2.columns), active_dims=np.arange(len(X1.columns), len(X.columns)))
kern2.name = 'K_b'
```

Next, the covariance function is used to create a Gaussian process regression model with `X` as input and `y` as target. The covariance function is given by \(\mathbf{K}_f + \mathbf{K}_b\).

```
model = GPy.models.GPRegression(X, y.to_numpy()[:, np.newaxis], kern1+kern2)
model.optimize()
```

Now we can check the parameters of the result.

```
print(model)
print(model.log_likelihood())
```

```
Name : GP regression
Objective : 10071.679092815619
Number of Parameters : 3
Number of Optimization Parameters : 3
Updates : True
Parameters:
GP_regression. | value | constraints | priors
sum.K_f.variances | 1.2782303448777643 | +ve |
sum.K_b.variances | 0.2400098787580176 | +ve |
Gaussian_noise.variance | 1.2683656892796749 | +ve |
-10071.679092815619
```

### Construct the Model Without GPy

The answer from the GPy solution is introduced here, alongside the code where the covariance matrices are explicitly created (above they are created using GPy’s high level code for kernel matrices, which may be less clear on the details).

```
# set parameter values to ML solutions given by GPy.
alpha_f = model.sum.K_f.variances
alpha_b = model.sum.K_b.variances/alpha_f
sigma2 = model.Gaussian_noise.variance/alpha_f
```

Now we create the covariance functions based on the tokenized paper IDs and emails.

```
K_f = np.dot(X1, X1.T)
K_b = alpha_b*np.dot(X2, X2.T)
K = K_f + K_b + sigma2*np.eye(X2.shape[0])
Kinv, L, Li, logdet = GPy.util.linalg.pdinv(K) # since we have GPy loaded in use their positive definite inverse.
y = reviews.reviews.Quality - mu
alpha = np.dot(Kinv, y)
yTKinvy = np.dot(y, alpha)
alpha_f = yTKinvy/len(y)
```

Since we have removed the data mean, the log likelihood we are interested in is the likelihood of a multivariate Gaussian with covariance \(\mathbf{K}\) and mean zero. This is computed below.

```
ll = 0.5*len(y)*np.log(2*np.pi*alpha_f) + 0.5*logdet + 0.5*yTKinvy/alpha_f
print("negative log likelihood: ", ll)
```

### Review Quality Prediction

Now we wish to predict the bias-corrected scores for the papers. That involves considering a variable \(s_{i,j} = f_i + \epsilon_{i,j}\), which is the score with the bias removed. That variable has covariance matrix \(\mathbf{K}_s=\mathbf{K}_f + \sigma^2 \mathbf{I}\), and the cross covariance between \(\mathbf{y}\) and \(\mathbf{s}\) is also given by \(\mathbf{K}_s\). This means we can compute the posterior distribution of the scores as follows:

```
# Compute mean and covariance of quality scores
K_s = K_f + np.eye(K_f.shape[0])*sigma2
s = pd.Series(np.dot(K_s, alpha) + mu, index=X1.index)
covs = alpha_f*(K_s - np.dot(K_s, np.dot(Kinv, K_s)))
```
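The posterior computation is standard Gaussian conditioning, which a toy example can make explicit. The sketch below works with unnormalized variances and made-up scores (all values are assumptions for illustration) and also checks a useful identity: the posterior mean of the bias-corrected scores plus the posterior mean of the reviewer offsets reproduces the observed scores.

```python
import numpy as np

# Hypothetical toy data: 3 papers, 3 reviewers, 2 reviews per paper.
paper_idx = np.array([0, 0, 1, 1, 2, 2])
reviewer_idx = np.array([0, 1, 0, 2, 1, 2])
X1 = np.eye(3)[paper_idx]
X2 = np.eye(3)[reviewer_idx]

alpha_f, alpha_b, sigma2 = 1.0, 0.5, 1.0     # assumed variances
scores = np.array([7.0, 6.0, 5.0, 3.0, 6.0, 4.0])
mu = scores.mean()
y = scores - mu

K_f = alpha_f * (X1 @ X1.T)
K_b = alpha_b * (X2 @ X2.T)
K = K_f + K_b + sigma2 * np.eye(len(y))
K_s = K_f + sigma2 * np.eye(len(y))          # covariance of s = f + epsilon

# Condition on y: mean = mu + K_s K^{-1} y, cov = K_s - K_s K^{-1} K_s.
Kinv_y = np.linalg.solve(K, y)
s_mean = mu + K_s @ Kinv_y
s_cov = K_s - K_s @ np.linalg.solve(K, K_s)

# Posterior mean of the reviewer offsets, for comparison.
b_mean = K_b @ Kinv_y

print(s_mean)
print(b_mean)
```

Because \(\mathbf{K} = \mathbf{K}_s + \mathbf{K}_b\), the two posterior means sum to \(\mu + \mathbf{y}\), so calibration only redistributes each observed score between paper and reviewer.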

### Monte Carlo Simulations for Probability of Acceptance

We can now sample from this posterior distribution of bias-adjusted scores jointly, to get a set of scores for all papers. For this set of scores, we can perform a ranking and accept the top 400 papers. This gives us a sampled conference. If we do that 1,000 times then we can see how many times each paper was accepted to get a probability of acceptance.

`number_accepts = 420 # 440 because of the 10% replication`

```
# place this in a separate box, because sampling can take a while.
samples = 1000
score = np.random.multivariate_normal(mean=s, cov=covs, size=samples).T
# Use X1 which maps papers to paper/reviewer pairings to get the average score for each paper.
paper_score = pd.DataFrame(np.dot(np.diag(1./X1.sum(0)), np.dot(X1.T, score)), index=X1.columns)
```

Now we can compute the probability of acceptance for each of the sampled rankings.

```
prob_accept = ((paper_score>paper_score.quantile(1-(float(number_accepts)/paper_score.shape[0]))).sum(1)/1000)
prob_accept.name = 'AcceptProbability'
```
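The Monte Carlo idea can be seen in isolation on a toy posterior. This sketch assumes a hypothetical set of 20 papers with independent posterior uncertainty and uses an explicit ranking per sample, rather than the quantile threshold used above, so that exactly `n_accepts` papers are accepted in every sampled conference.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical posterior over per-paper scores.
n_papers, n_accepts, n_samples = 20, 5, 2000
post_mean = np.linspace(3.0, 8.0, n_papers)
post_cov = 0.5 * np.eye(n_papers)

# Each row is one sampled conference.
samples = rng.multivariate_normal(post_mean, post_cov, size=n_samples)

# Rank papers within each sample and mark the top n_accepts as accepted.
order = np.argsort(-samples, axis=1)
accepted = np.zeros(samples.shape, dtype=bool)
np.put_along_axis(accepted, order[:, :n_accepts], True, axis=1)

# Fraction of sampled conferences in which each paper was accepted.
prob_accept = accepted.mean(axis=0)
print(np.round(prob_accept, 3))
```

Papers whose posterior mean sits near the cut-off end up with intermediate probabilities, which is what defines the grey area discussed next.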

Now that we have the probability of acceptance, we can decide on the boundaries of the grey area. These are set in `lower` and `upper`. The grey area is those papers that will be debated most heavily during the teleconferences between program chairs and area chairs.

```
lower=0.1
upper=0.9
grey_area = ((prob_accept>lower) & (prob_accept<upper))
print('Number of papers in grey area:', grey_area.sum())
```

```
import matplotlib.pyplot as plt
import cmtutils.plot as plot
```

```
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
print('Expected Papers Accepted:', prob_accept.sum())
_ = prob_accept.hist(bins=40, ax=ax)
ma.write_figure(directory="./neurips", filename="probability-of-accept.svg")
```

## Some Sanity Checking Plots

Here is the histogram of the reviewer scores after calibration.

```
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
s.hist(bins=100, ax=ax)
_ = ax.set_title('Calibrated Reviewer Scores')
ma.write_figure(directory="./neurips", filename="calibrated-reviewer-scores.svg")
```

### Adjustments to Reviewer Scores

We can also compute the posterior distribution for the adjustments to the reviewer scores.

```
# Compute mean and covariance of review biases
b = pd.Series(np.dot(K_b, alpha), index=X2.index)
covb = alpha_f*(K_b - np.dot(K_b, np.dot(Kinv, K_b)))
```

```
reviewer_bias = pd.Series(np.dot(np.diag(1./X2.sum(0)), np.dot(X2.T, b)), index=X2.columns, name='ReviewerBiasMean')
reviewer_bias_std = pd.Series(np.dot(np.diag(1./X2.sum(0)), np.dot(X2.T, np.sqrt(np.diag(covb)))), index=X2.columns, name='ReviewerBiasStd')
```

Here is a histogram of the mean adjustment for the reviewers.

```
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
reviewer_bias.hist(bins=100, ax=ax)
_ = ax.set_title('Reviewer Calibration Adjustments Histogram')
ma.write_figure(directory="./neurips", filename="reviewer-calibration-adjustments.svg")
```

Export a version of the bias scores for use in CMT.

```
bias_export = pd.DataFrame(data={'Quality Score - Does the paper deserves to be published?':reviewer_bias,
                   'Impact Score - Independently of the Quality Score above, this is your opportunity to identify papers that are very different, original, or otherwise potentially impactful for the NIPS community.':pd.Series(np.zeros(len(reviewer_bias)), index=reviewer_bias.index),
                   'Confidence':pd.Series(np.zeros(len(reviewer_bias)), index=reviewer_bias.index)})
cols = bias_export.columns.tolist()
cols = [cols[2], cols[1], cols[0]]
bias_export = bias_export[cols]
#bias_export.to_csv(os.path.join(cu.cmt_data_directory, 'reviewer_bias.csv'), sep='\t', header=True, index_label='Reviewer Email')
```

## Sanity Check

As a sanity check Corinna suggested it makes sense to plot the average raw score for the papers vs the probability of accept, just to ensure nothing weird is going on. To clarify the plot, I’ve actually plotted raw score vs log odds of accept.

```
raw_score = pd.Series(np.dot(np.diag(1./X1.sum(0)), np.dot(X1.T, r.Quality)), index=X1.columns)
prob_accept[prob_accept==0] = 1/(10*samples)
prob_accept[prob_accept==1] = 1-1/(10*samples)
```
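The adjustment above exists because the log odds are infinite when a sampled probability is exactly 0 or 1. A minimal sketch of the same idea as a reusable function follows; the name `log_odds` is illustrative, and clipping is used here in place of the explicit replacement of zeros and ones, which gives the same result when probabilities are multiples of `1/n_samples`.

```python
import numpy as np


def log_odds(prob, n_samples):
    """Log odds of acceptance, with probabilities kept away from 0 and 1."""
    eps = 1.0 / (10 * n_samples)
    p = np.clip(np.asarray(prob, dtype=float), eps, 1.0 - eps)
    return np.log(p) - np.log1p(-p)


values = log_odds([0.0, 0.25, 0.5, 1.0], n_samples=1000)
print(values)
```

A paper never accepted in 1,000 samples is treated as if it had probability 1/10,000, which keeps it on the plot without pretending the estimate is exact.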

```
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
ax.plot(raw_score, np.log(prob_accept) - np.log(1-prob_accept), 'rx')
ax.set_title('Raw Score vs Log odds of accept')
ax.set_xlabel('raw score')
_ = ax.set_ylabel('log odds of accept')
ma.write_figure(directory="./neurips", filename="raw-score-vs-log-odds.svg")
```

## Calibration Quality Sanity Checks

```
s.name = 'CalibratedQuality'
r = r.join(s)
```

We can also look at a scatter plot of the review quality vs the calibrated quality.

```
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
ax.plot(r.Quality, r.CalibratedQuality, 'r.', markersize=10)
ax.set_xlim([0, 11])
ax.set_xlabel('original review score')
_ = ax.set_ylabel('calibrated review score')
ma.write_figure(directory="./neurips", filename="calibrated-review-score-vs-original-score.svg")
```

## Correlation of Duplicate Papers

For NeurIPS 2014 we experimented with duplicate papers: we pushed papers through the system twice, exposing them to different subsets of the reviewers. The first thing we’ll look at is the duplicate papers. Firstly, we identify them by matching on title.

```
filename = date + '_paper_list.xls'
papers = cu.CMT_Papers_read(filename=filename)
duplicate_list = []
# Series.items() replaces iteritems(), which was removed in pandas 2.0.
for ID, title in papers.papers.Title.items():
    if int(ID)>1779 and int(ID) != 1949:
        pair = list(papers.papers[papers.papers['Title'].str.contains(papers.papers.Title[ID].strip())].index)
        pair.sort(key=int)
        duplicate_list.append(pair)
```

Next, we compute the correlation coefficients for the duplicated papers for the average impact and quality scores.

```
quality = []
calibrated_quality = []
accept = []
impact = []
confidence = []
for duplicate_pair in duplicate_list:
    quality.append([np.mean(r[r.PaperID==duplicate_pair[0]].Quality), np.mean(r[r.PaperID==duplicate_pair[1]].Quality)])
    calibrated_quality.append([np.mean(r[r.PaperID==duplicate_pair[0]].CalibratedQuality), np.mean(r[r.PaperID==duplicate_pair[1]].CalibratedQuality)])
    impact.append([np.mean(r[r.PaperID==duplicate_pair[0]].Impact), np.mean(r[r.PaperID==duplicate_pair[1]].Impact)])
    confidence.append([np.mean(r[r.PaperID==duplicate_pair[0]].Conf), np.mean(r[r.PaperID==duplicate_pair[1]].Conf)])
quality = np.array(quality)
calibrated_quality = np.array(calibrated_quality)
impact = np.array(impact)
confidence = np.array(confidence)
quality_cor = np.corrcoef(quality.T)[0, 1]
calibrated_quality_cor = np.corrcoef(calibrated_quality.T)[0, 1]
impact_cor = np.corrcoef(impact.T)[0, 1]
confidence_cor = np.corrcoef(confidence.T)[0, 1]
print("Quality correlation: ", quality_cor)
print("Calibrated Quality correlation: ", calibrated_quality_cor)
print("Impact correlation: ", impact_cor)
print("Confidence correlation: ", confidence_cor)
```

```
Quality correlation: 0.54403674862622
Calibrated Quality correlation: 0.5455958618174274
Impact correlation: 0.26945269236041036
Confidence correlation: 0.3854251559444674
```

## Correlation Plots

To visualize the quality score correlation, we plot the group 1 papers against the group 2 papers. Here we add a small amount of jitter to help visualize points that would otherwise fall on the same position.

Similarly for the calibrated quality of the papers.
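The jitter step could look something like the sketch below. The scores are hypothetical, the `jitter` helper and its `scale` default are illustrative choices rather than the settings used for the original figures, and the plotting call is left as a comment so the example runs without a display.

```python
import numpy as np

rng = np.random.default_rng(4)


def jitter(values, scale=0.1, rng=rng):
    """Add small uniform noise so coincident points separate on a scatter plot."""
    values = np.asarray(values, dtype=float)
    return values + rng.uniform(-scale, scale, size=values.shape)


# Hypothetical mean quality scores for the same papers from the two committees.
group_1 = np.array([5.0, 5.0, 6.33, 4.0, 5.0, 7.0])
group_2 = np.array([5.0, 5.0, 5.67, 4.33, 5.0, 6.0])

x, y = jitter(group_1), jitter(group_2)
# With matplotlib available, ax.plot(x, y, 'r.') would draw the jittered scatter.
print(np.column_stack([x, y]))
```

Mean scores take only a few distinct values when each paper has three or four integer reviews, so without jitter many duplicated papers would be drawn on top of one another.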

```
# Apply Laplace smoothing to accept probabilities before incorporating them.
revs = r.join((prob_accept+0.0002)/1.001, on='PaperID').join(reviewer_bias, on='Email').join(papers.papers['Number Of Discussions'], on='PaperID').join(reviewer_bias_std, on='Email').sort_values(by=['AcceptProbability','PaperID', 'CalibratedQuality'], ascending=False)
revs.set_index(['PaperID'], inplace=True)
def len_comments(x):
    return len(x.Comments)
revs['comment_length']=revs.apply(len_comments, axis=1)
# Save the computed information to disk
#revs.to_csv(os.path.join(cu.cmt_data_directory, date + '_processed_reviews.csv'), encoding='utf-8')
```

## Conference Simulation

Given the realization that roughly 50% of the score seems to be ‘subjective’ and 50% of the score seems to be ‘objective,’ then we can simulate the conference and see what it does for the accept precision for different probability of accept.

To explore the effect of the subjective scoring on the accept precision we construct a simple simulation that scores hypothetical papers with random values drawn from a Gaussian density. Each paper has an underlying objective score (shared across the hypothetical reviewers), and then alongside it there are Gaussian variables drawn independently at random to represent the subjectivity of the hypothetical reviewers.

Each paper is rated by two independent committees, and the papers are reordered to accept the top \(x\)% where \(x\) is our chosen accept rate. We can then use sample based estimates for the resulting accept precision.

In these experiments the scores are taken to be 50% subjective and 50% objective, in line with the results we see from the NeurIPS 2014 calibration model. We vary the number of reviewers in the simulation to see the effect of increasing reviewers on the accept precision.

`import numpy as np`

We simulate `num_papers` hypothetical papers; here we’ve set this to be 100000. The `subjectivity_portion` gives how much of the score for each paper is subjective.

```
num_papers = 100000
subjectivity_portion = 0.5
```

```
accept_rates = [0.05, 0.1, 0.15, 0.2, 0.25,
                0.3, 0.35, 0.4, 0.45, 0.5,
                0.55, 0.6, 0.65, 0.7, 0.75,
                0.8, 0.85, 0.9, 0.95, 1.0]
all_accepts = []
for num_reviewers in range(1,7):
    consistent_accepts = []
    for accept_rate in accept_rates:
        objective = (1-subjectivity_portion)*np.random.randn(num_papers)
        subjective_0 = subjectivity_portion*np.random.randn(num_papers, num_reviewers).mean(1)
        subjective_1 = subjectivity_portion*np.random.randn(num_papers, num_reviewers).mean(1)
        score_0 = objective + subjective_0
        score_1 = objective + subjective_1

        accept_0 = score_0.argsort()[:int(num_papers*accept_rate)]
        accept_1 = score_1.argsort()[:int(num_papers*accept_rate)]

        consistent_accept = len(set(accept_0).intersection(set(accept_1)))
        consistent_accepts.append(consistent_accept/(num_papers*accept_rate))
        print('Percentage consistently accepted: {prop}'.format(prop=consistent_accept/(num_papers*accept_rate)))

    all_accepts.append(consistent_accepts)
all_accepts = np.array(all_accepts)
consistent_accepts = np.array(consistent_accepts)
accept_rate = np.array(accept_rate)
```
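One way to see why more reviewers help in this simulation is to look at the correlation between the two committees' scores. Under the simulation's assumptions (an objective part scaled by `1 - subjectivity_portion` and a subjective part scaled by `subjectivity_portion` and averaged over reviewers), that correlation has a simple closed form. The sketch below, with the illustrative helper `committee_correlation`, compares it with an empirical estimate.

```python
import numpy as np


def committee_correlation(subjectivity_portion, num_reviewers):
    """Correlation between two committees' mean scores in the simulation model."""
    objective_var = (1 - subjectivity_portion) ** 2
    subjective_var = subjectivity_portion ** 2 / num_reviewers
    return objective_var / (objective_var + subjective_var)


rng = np.random.default_rng(5)
num_papers, p = 100000, 0.5

for n in (1, 3, 6):
    objective = (1 - p) * rng.standard_normal(num_papers)
    score_0 = objective + p * rng.standard_normal((num_papers, n)).mean(1)
    score_1 = objective + p * rng.standard_normal((num_papers, n)).mean(1)
    empirical = np.corrcoef(score_0, score_1)[0, 1]
    print(n, committee_correlation(p, n), empirical)
```

With a 50/50 split the expression reduces to \(n/(n+1)\) for \(n\) reviewers per committee, so each extra reviewer gives a diminishing gain in agreement between committees.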

Plotting these results shows how the accept precision changes as we vary the accept rate and the number of reviewers for a conference where reviewers are 50% subjective.

A second plot shows the accept rate against the gain in accept precision we have over the random committee.
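The simulation can also be wrapped in a small function, which makes the comparison with a random committee explicit. This is a minimal sketch rather than the code used for the plots: the name `accept_precision` and the use of a seeded `rng` are assumptions made here, and it accepts the highest-scoring papers directly. For a random committee, the expected accept precision is simply the accept rate, so the gain is the difference between the two.

```python
import numpy as np

def accept_precision(num_papers, num_reviewers, accept_rate,
                     subjectivity_portion=0.5, rng=None):
    """Fraction of committee 0's accepts that committee 1 also accepts."""
    rng = np.random.default_rng() if rng is None else rng
    num_accept = int(num_papers * accept_rate)
    # Shared objective score for every paper.
    objective = (1 - subjectivity_portion) * rng.standard_normal(num_papers)
    # Each committee adds its own averaged reviewer noise.
    scores = [
        objective + subjectivity_portion
        * rng.standard_normal((num_papers, num_reviewers)).mean(axis=1)
        for _ in range(2)
    ]
    # Accept the highest-scoring papers in each committee.
    accepts = [set(np.argsort(-s)[:num_accept]) for s in scores]
    return len(accepts[0] & accepts[1]) / num_accept

rng = np.random.default_rng(0)
accept_rate = 0.25
for num_reviewers in (1, 3, 6):
    precision = accept_precision(20000, num_reviewers, accept_rate, rng=rng)
    # Gain over a random committee, whose expected precision is accept_rate.
    print(num_reviewers, round(precision, 3), round(precision - accept_rate, 3))
```

Averaging over more reviewers reduces the subjective variance, so the precision rises with `num_reviewers`, but it stays below one at any accept rate short of 100%.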

## Where do Rejected Papers Go?

One facet we can explore is the final fate of papers that are rejected by the conference.

Of the 1,678 papers submitted to NeurIPS 2014, only 414 were presented at the final conference. To trace the fate of the rejected papers, we searched Semantic Scholar for evidence of all 1,264 of them. We looked for papers with similar titles where the NeurIPS submission’s contact author was also in the author list. We were able to track down 680 papers.
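To make the matching rule concrete, here is a minimal sketch of how a candidate paper might be compared with a submission. It is not the procedure we actually ran: the function names, the dictionary fields, the 0.8 similarity threshold and the example titles are all made up for illustration, and the title comparison uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher

def normalise(text):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())

def is_probable_match(submission, candidate, threshold=0.8):
    """Hypothetical rule: similar title AND contact author in the author list."""
    similarity = SequenceMatcher(
        None, normalise(submission["title"]), normalise(candidate["title"])
    ).ratio()
    authors = {normalise(name) for name in candidate["authors"]}
    return similarity >= threshold and normalise(submission["contact_author"]) in authors

# Made-up records, for illustration only.
submission = {"title": "Fast Inference for Sparse Gaussian Processes",
              "contact_author": "A. Researcher"}
candidate = {"title": "Fast inference in sparse Gaussian processes",
             "authors": ["A. Researcher", "B. Colleague"]}
print(is_probable_match(submission, candidate))
```

Requiring both conditions keeps false matches down: retitled papers still match if the title stays close, while unrelated papers with similar titles are excluded by the author check.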

This code analyzes those 680 papers, extracting their final publication venue using the Semantic Scholar API.

`%pip install cmtutils`

`import cmtutils.nipsy as nipsy`

```
import os
import yaml
```

```
with open(os.path.join(nipsy.review_store, nipsy.outlet_name_mapping), 'r') as f:
    mapping = yaml.load(f, Loader=yaml.FullLoader)

date = "2021-06-11"

citations = nipsy.load_citation_counts(date=date)
decisions = nipsy.load_decisions()
nipsy.augment_decisions(decisions)
joindf = nipsy.join_decisions_citations(decisions, citations)

joindf['short_venue'] = joindf.venue.replace(mapping)
```

Of the 680 papers, 177 were only found on arXiv, 76 were found as PDFs online without a publication venue and 427 were published in other venues. The outlets that received ten or more papers from this group were AAAI (72 papers), AISTATS (57 papers), ICML (33 papers), CVPR (17 papers), Later NeurIPS (15 papers), JMLR (14 papers), IJCAI (14 papers), ICLR (13 papers) and UAI (11 papers). Opinion about the quality of these different outlets will vary between individuals, but from our perspective all of these outlets are ‘top-tier’ for machine learning and related areas. Other papers appeared at less prestigious outlets, and citation scores were also recorded for papers that remained available only on arXiv. Note that there is likely a bias towards outlets that have a submission deadline shortly after NeurIPS decisions are public, e.g. the submission deadline for AAAI 2015 was six days after NeurIPS decisions were sent to authors. AISTATS has a submission deadline one month after.
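The counts quoted above can be checked with a few lines of arithmetic. The sketch below uses only the figures stated in the text; the variable names are introduced here for illustration.

```python
submitted, presented = 1678, 414
rejected = submitted - presented

# Where the tracked-down rejected papers were found.
found = {"arXiv only": 177, "PDF online, no venue": 76, "other venues": 427}

# Outlets that received ten or more of the rejected papers.
top_outlets = {"AAAI": 72, "AISTATS": 57, "ICML": 33, "CVPR": 17,
               "Later NeurIPS": 15, "JMLR": 14, "IJCAI": 14, "ICLR": 13, "UAI": 11}

tracked = sum(found.values())
print(f"rejected: {rejected}, tracked down: {tracked} ({tracked / rejected:.1%})")
print(f"published elsewhere: {found['other venues'] / tracked:.1%} of tracked papers")
print(f"in outlets with ten or more papers: {sum(top_outlets.values())} "
      f"of {found['other venues']}")
```

So a little over half of the rejected papers were tracked down, and more than half of those published in other venues went to one of the nine outlets listed.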

A Sankey diagram showing where papers submitted to the conference ended up is shown below.

## Impact of Papers Seven Years On

Now we look at the actual impact of the published papers, using the Semantic Scholar database to track citations.

```
import cmtutils as cu
import cmtutils.nipsy as nipsy
import cmtutils.plot as plot
```

```
import pandas as pd
import numpy as np
```

`papers = cu.Papers()`

https://proceedings.neurips.cc/paper/2014

`UPDATE_IMPACTS = False # Set to True to download impacts from Semantic Scholar`

The impact of the different papers is downloaded from Semantic Scholar using their REST API. This can take some time, and they also throttle the calls. At the moment the code below doesn’t handle the throttling correctly. However, by default it will load the cached version of the citation scores from the given date.

```
if UPDATE_IMPACTS:
    from datetime import datetime
    date = datetime.today().strftime('%Y-%m-%d')
else:
    date = "2021-06-11"
```

```
# Rerun to download impacts from Semantic Scholar
if UPDATE_IMPACTS:
    semantic_ids = nipsy.load_semantic_ids()
    citations_dict = citations.to_dict(orient='index')
    # Need to be a bit cleverer here. Semantic scholar will throttle this call.
    sscholar = nipsy.download_citation_counts(citations_dict=citations_dict, semantic_ids=semantic_ids)
    citations = pd.DataFrame.from_dict(citations_dict, orient="index")
    citations.to_pickle(date + '-semantic-scholar-info.pickle')
else:
    citations = nipsy.load_citation_counts(date=date)
```
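One common way to cope with a throttled API is to retry with exponential backoff. The sketch below is generic and is not part of `cmtutils`: `ThrottledError`, `call_with_backoff` and the `fetch` callable are hypothetical names, and how a throttled response is actually signalled would depend on the client code making the request.

```python
import random
import time

class ThrottledError(Exception):
    """Hypothetical signal that the remote API asked us to slow down."""

def call_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), waiting exponentially longer after each throttled attempt."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except ThrottledError:
            if attempt == max_retries - 1:
                raise  # Give up after the final attempt.
            # Exponential backoff with a little jitter to avoid synchronised retries.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))

# Made-up fetch that is throttled twice before succeeding.
attempts = {"count": 0}

def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ThrottledError()
    return {"status": "ok"}

delays = []
print(call_with_backoff(flaky_fetch, sleep=delays.append), len(delays))
```

Passing `sleep` as an argument keeps the example fast to run and easy to test; in real use the default `time.sleep` would pause between attempts.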

The final decision sheet provides information about what happened to all of the papers.

```
decisions = nipsy.load_decisions()
nipsy.augment_decisions(decisions)
```

This is joined with the citation information to provide our main ability to understand the impact of these papers.

`joindf = nipsy.join_decisions_citations(decisions, citations)`

### Correlation of Quality Scores and Citation

Our first study will be to check the correlation between the quality scores of papers and how many times the papers have been cited in practice. In the plot below, rejected papers are given as crosses and accepted papers are given as dots. We include all papers, whether published in a venue or just available through arXiv or other preprint servers. We show the published/non-published quality scores and \(\log_{10}(1+\text{citations})\) for all papers in the plot below. In the plot we show each point corrupted by some Laplacian noise and remove the axes. The idea is to give a sense of the distribution rather than reveal the score of a particular paper.

The correlation seems strong, but of course, we are looking at papers which were accepted and rejected by the conference. This is dangerous, as it is quite likely that presentation at the conference may provide some form of lift to the papers’ numbers of citations. So, the right thing to do is to look at the groups separately.

Looking at the accepted papers only shows a very different picture. There is very little correlation between accepted papers’ quality scores and the number of citations they receive.

Conversely, looking at rejected papers only, we do see a slight trend, with higher scoring papers achieving more citations on average. This, combined with the lower average number of citations in the rejected paper group, alongside their lower average scores, explains the correlation we originally observed.
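The way a pooled correlation can arise from differences between groups is easy to reproduce with synthetic data. The sketch below uses made-up numbers, not the NeurIPS scores: two groups with no relationship at all between score and citations inside each group, but with higher averages on both in the "accepted" group. It shows the extreme case, whereas in our data the rejected group does show a slight trend.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Made-up groups: score and log-citations are independent within each group,
# but the accepted group has a higher mean on both.
rejected_score = rng.normal(4.5, 0.7, n)
rejected_cites = rng.normal(0.8, 0.5, n)
accepted_score = rng.normal(6.5, 0.7, n)
accepted_cites = rng.normal(1.6, 0.5, n)

def corr(x, y):
    """Pearson correlation between two one-dimensional arrays."""
    return np.corrcoef(x, y)[0, 1]

pooled = corr(np.concatenate([rejected_score, accepted_score]),
              np.concatenate([rejected_cites, accepted_cites]))
print("rejected only:", round(corr(rejected_score, rejected_cites), 3))
print("accepted only:", round(corr(accepted_score, accepted_cites), 3))
print("pooled:", round(pooled, 3))
```

The pooled correlation is driven entirely by the gap between the group means, which is why the groups need to be examined separately.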

Welling and Ghahramani introduced an “impact” score in NeurIPS 2013, so we might expect the impact score to show correlation. And indeed, despite the lower range of the score (a reviewer can score either 1 or 2) we do see *some* correlation, although it is relatively weak.

Finally, we also looked at the correlation between the *confidence* score and the impact. Here the correlation is somewhat stronger. Why should confidence be an indicator of higher citations? A plausible explanation is that there is a confounder driving both variables. For example, it might be that papers which are easier to understand (due to elegance of the idea, or quality of exposition) inspire greater reviewer confidence and increase the number of citations.

```
for column in ["average_quality", "average_impact", "average_confidence"]:
    cor = []
    for i in range(1000):
        ind = bootstrap_index(joindf.loc[joindf.accept])
        cor.append(joindf.loc[ind][column].corr(np.log(1+joindf.loc[ind]['numCitedBy'])))
    cora = np.array(cor)
    rho = cora.mean()
    twosd = 2*np.sqrt(cora.var())
    print("{column}".format(column=column.replace("_", " ")))
    print("Mean correlation is {rho} +/- {twosd}".format(rho=rho, twosd=twosd))
```
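The loop above relies on a `bootstrap_index` helper that is not shown in this section. As a rough illustration of what the bootstrap is doing, here is a self-contained sketch on plain arrays. The function name `bootstrap_correlation` and the synthetic data are assumptions made for this example and are not taken from `cmtutils`.

```python
import numpy as np

def bootstrap_correlation(x, y, num_samples=1000, rng=None):
    """Return the mean and two standard deviations of bootstrapped correlations."""
    rng = np.random.default_rng() if rng is None else rng
    x, y = np.asarray(x), np.asarray(y)
    cors = np.empty(num_samples)
    for i in range(num_samples):
        # Resample paper indices with replacement, keeping (x, y) pairs together.
        ind = rng.integers(0, len(x), size=len(x))
        cors[i] = np.corrcoef(x[ind], y[ind])[0, 1]
    return cors.mean(), 2 * cors.std()

# Synthetic, strongly correlated data for illustration only.
rng = np.random.default_rng(0)
x = rng.standard_normal(500)
y = x + 0.3 * rng.standard_normal(500)
rho, twosd = bootstrap_correlation(x, y, rng=rng)
print("Mean correlation is {rho:.3f} +/- {twosd:.3f}".format(rho=rho, twosd=twosd))
```

Resampling whole papers, rather than scores and citations separately, preserves the pairing that the correlation measures, and the spread across resamples gives the error bar.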

## Conclusion

Under the simple model we have outlined, we can be confident that there is inconsistency between two independent committees, but the level of inconsistency is much less than we would find for a random committee. If we accept that the bias introduced by the Area Chairs knowing when they were dealing with duplicates was minimal, then if we were to revisit the NIPS 2014 conference with an independent committee we would expect between **38% and 64% of the presented papers to be the same**. If the conference were run at random, we would only expect 25% of the papers to be the same.

It’s apparent from comments and speculation about what these results mean that some people might be surprised by the size of this figure. However, it only requires a little thought to see that this figure is likely to be large for any highly selective conference if there is even a small amount of inconsistency in the decision-making process. This is because once the conference has chosen to be ‘highly selective’ then, by definition, only a small percentage of papers are to be accepted. Now if we think of a type I error as accepting a paper which should be rejected, such errors are easier to make because, again by definition, many more papers should be rejected. Type II errors (rejecting a paper that should be accepted) are less likely because (by setting the accept rate low) there are fewer papers that should be accepted in the first place. When there is a difference of opinion between reviewers, it does seem that many of the arguments can be distilled down to a subjective opinion about whether controlling for type I or type II errors is more important. Further, normally when discussing type I and type II errors we believe that the underlying system of study is genuinely binary: e.g., diseased or not diseased. However, for conferences the accept/reject boundary is not a clear separation point; there is a continuum (or spectrum) of paper quality (as there also is for some diseases), and the decision boundary often falls in a region of very high density.

I would prefer a world where a conference is no longer viewed as a proxy for research quality. The true test of quality is time. In the current world, papers from conferences such as NeurIPS are being used to judge whether a researcher is worthy of a position at a leading company, or whether a researcher gets tenure. This is problematic and damaging for the community. Reviewing is an inconsistent process, but that is not a bad thing. It is far worse to have a reviewing system that is consistently wrong than one which is inconsistently wrong.

My own view of a NeurIPS paper is inspired by the Millennium Galleries in Sheffield. There, among the exhibitions, they sometimes have work done by apprentices in their ‘qualification.’ Sheffield is known for knives, and the work of the apprentices in making knives is sometimes very intricate indeed. But it does lead to some very impractical knives. NeurIPS seems to be good at judging technical skill, but not impact. And I suspect the same is true of many other meetings. So, a publication at NeurIPS does seem to indicate that the author has some of the skills required, but it does not necessarily imply that the paper will be impactful.

## Thanks!

For more information on these subjects and more you might want to check the following resources.

- twitter: @lawrennd
- podcast: The Talking Machines
- newspaper: Guardian Profile Page
- blog: http://inverseprobability.com