Paper Allocation for NIPS

With luck we will release papers to reviewers early next week. The paper allocations are being refined by area chairs at the moment.

Corinna and I thought it might be informative to give details of the allocation process we used, so I’m publishing it here. Note that this automatic process just gives the initial allocation. The current stage we are in is moving papers between Area Chairs (in response to their comments) whilst they also do some refinement of our initial allocation. If I find time I’ll also tidy up the Python code that was used and publish it as well (in the form of an IPython notebook).

I wrote the process down in response to a query from Geoff Gordon. So the questions I answer are imagined questions from Geoff. If you like, you can picture Geoff asking them like I did, but in real life, they are words I put into Geoff’s mouth.

How did you allocate the papers?

We ranked all paper-reviewer matches by a similarity and allocated each paper-reviewer pair from the top of the list, rejecting an allocation if the reviewer had a full quota, or the paper had a full complement of reviewers.
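As a rough illustration of the greedy pass just described, here is a minimal sketch. It is not the code we ran: the function name, the array layout (papers as rows, reviewers as columns) and the toy numbers are all assumptions made for the example.

```python
import numpy as np

def greedy_allocate(similarity, reviewer_quota, reviewers_per_paper):
    """Walk down the ranked list of (paper, reviewer) pairs, best match first.

    similarity: 2-D array with papers as rows and reviewers as columns
    (an assumed layout for this sketch).
    """
    n_papers, n_reviewers = similarity.shape
    reviewer_load = np.zeros(n_reviewers, dtype=int)
    paper_load = np.zeros(n_papers, dtype=int)
    allocation = []

    # Flat indices of every pair, sorted by decreasing similarity.
    order = np.argsort(similarity, axis=None)[::-1]
    for flat_index in order:
        paper, reviewer = np.unravel_index(flat_index, similarity.shape)
        if reviewer_load[reviewer] >= reviewer_quota:
            continue  # reviewer already has a full quota
        if paper_load[paper] >= reviewers_per_paper:
            continue  # paper already has a full complement of reviewers
        allocation.append((int(paper), int(reviewer)))
        reviewer_load[reviewer] += 1
        paper_load[paper] += 1
    return allocation

# Toy example: two papers, two reviewers, one slot each.
toy_similarity = np.array([[0.9, 0.1],
                           [0.8, 0.7]])
toy_allocation = greedy_allocate(toy_similarity, reviewer_quota=1, reviewers_per_paper=1)
```

In the toy example, reviewer 0 is the best match for both papers but can only take one, so paper 1 falls through to reviewer 1.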

How was the similarity computed?

The similarity consisted of the following weighted components.

s_p = 0.25 * primary subject match

s_s = 0.25 * bag of words match between primary and secondary subjects

m = 0.5 * TPMS score (rescaled to be between 0 and 1)

So how were the bids used?

Each of the similarity scores was multiplied by 1.5^b where b is the bid. The bid values were: “eager” b=2, “willing” b=1, “in a pinch” b=-1, “not willing” b=-2, and no bid b=0. So the final score used in the ranking was (s_p+s_s+m)*1.5^b.
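To make the arithmetic concrete, here is a small sketch of the final score. The weights and bid exponents are the ones given above; the function name, argument names and the use of `None` for "no bid" are assumptions for illustration, and each raw match is assumed to lie between 0 and 1.

```python
# Bid exponents as given in the text; the dictionary keys are illustrative labels,
# with None standing in for "no bid".
BID_EXPONENT = {"eager": 2, "willing": 1, "in a pinch": -1, "not willing": -2, None: 0}

def final_score(primary_match, subject_match, tpms_score, bid=None):
    """Combine the three weighted components and scale by 1.5 ** b."""
    s_p = 0.25 * primary_match   # primary subject match
    s_s = 0.25 * subject_match   # bag of words match on subjects
    m = 0.5 * tpms_score         # TPMS score, assumed already rescaled to [0, 1]
    return (s_p + s_s + m) * 1.5 ** BID_EXPONENT[bid]

# A perfect match on every component with an "eager" bid scores 1.0 * 1.5 ** 2.
best_case = final_score(1.0, 1.0, 1.0, bid="eager")
```

So an "eager" bid multiplies a score by 2.25 and a "not willing" bid divides it by 2.25, which means a bid can move a pair a long way up or down the ranking without ever overriding the similarity completely.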

But how did you deal with the fact that different reviewers used the bidding in different ways?

The rows and columns were crudely normalized by the *square root* of their standard deviations.
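One possible reading of that normalization is sketched below. The details are assumptions: the sketch takes a papers-by-reviewers score matrix, computes row and column standard deviations from the unnormalized matrix, and divides each entry by the square root of both. The order of operations and the handling of constant rows or columns are guesses, not a record of what was run.

```python
import numpy as np

def crude_normalize(scores):
    """Divide each entry by sqrt(row std) and sqrt(column std).

    Both standard deviations are taken from the original matrix
    (an assumption of this sketch).
    """
    scores = np.asarray(scores, dtype=float)
    row_std = scores.std(axis=1, keepdims=True)
    col_std = scores.std(axis=0, keepdims=True)
    # Leave constant rows/columns unscaled rather than dividing by zero.
    row_std[row_std == 0] = 1.0
    col_std[col_std == 0] = 1.0
    return scores / np.sqrt(row_std) / np.sqrt(col_std)

toy_scores = np.array([[1.0, 3.0],
                       [2.0, 6.0]])
toy_normalized = crude_normalize(toy_scores)
```

Using the square root rather than the standard deviation itself only partly equalizes the spread, so reviewers who bid enthusiastically are damped without being flattened entirely.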

So what about conflicting papers?

Conflicting papers were given similarities of -inf.
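A minimal sketch of how such a mask can work, assuming a boolean conflict matrix with the same shape as the similarity matrix (the variable names and toy values are illustrative):

```python
import numpy as np

similarity = np.array([[0.9, 0.4],
                       [0.2, 0.8]])
# True where the reviewer (column) has a conflict with the paper (row).
conflict = np.array([[True, False],
                     [False, False]])

# Conflicted pairs get -inf, so they sort to the very bottom of the ranking.
masked = np.where(conflict, -np.inf, similarity)

# Rank all pairs, best first, and keep only those with a finite score.
ranked = np.argsort(masked, axis=None)[::-1]
eligible = [int(i) for i in ranked if np.isfinite(masked.flat[i])]
```

Here the pair with the highest raw similarity (paper 0, reviewer 0) is conflicted, so it never appears among the eligible pairs.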

How did you ensure that ‘expertise’ was evenly distributed?

We split the reviewing body into two groups. The first group of ‘experts’ were those people with two or more NIPS papers since 2007 (thanks to Chris Hiestand for providing this information). This was about 1/3 of the total reviewing body. We allocated these reviewers first to a maximum of one ‘expert’ per paper. We then allocated the remainder of the reviewing body to the papers up to a maximum of 3 reviewers per paper.
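The two passes can be sketched as follows. This is an illustration under assumptions rather than the original notebook: the function name, the `is_expert` boolean array and the toy data are made up for the example.

```python
import numpy as np

def allocate_in_two_stages(similarity, is_expert, reviewer_quota, reviewers_per_paper=3):
    """Greedy allocation: experts first (at most one per paper), then everyone else."""
    n_papers, n_reviewers = similarity.shape
    reviewer_load = np.zeros(n_reviewers, dtype=int)
    paper_load = np.zeros(n_papers, dtype=int)
    allocation = []
    order = np.argsort(similarity, axis=None)[::-1]  # best pairs first

    # Stage 1: experts only, capped at one reviewer per paper.
    # Stage 2: remaining reviewers, up to the full complement per paper.
    for use_experts, paper_cap in ((True, 1), (False, reviewers_per_paper)):
        for flat_index in order:
            paper, reviewer = np.unravel_index(flat_index, similarity.shape)
            if bool(is_expert[reviewer]) != use_experts:
                continue
            if reviewer_load[reviewer] >= reviewer_quota or paper_load[paper] >= paper_cap:
                continue
            allocation.append((int(paper), int(reviewer)))
            reviewer_load[reviewer] += 1
            paper_load[paper] += 1
    return allocation

# Toy example: reviewers 0 and 1 are 'experts', each reviewer takes one paper,
# and each paper gets two reviewers.
toy_similarity = np.array([[0.9, 0.8, 0.3, 0.6],
                           [0.7, 0.2, 0.5, 0.4]])
toy_experts = np.array([True, True, False, False])
toy_allocation = allocate_in_two_stages(toy_similarity, toy_experts,
                                        reviewer_quota=1, reviewers_per_paper=2)
```

In the toy run, expert 1 ends up on paper 1 despite a low similarity, because the better-matched expert is already full. That is the price of insisting on one expert per paper.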

One or more of my papers has fewer than three reviewers. How did that happen?

When forming the ranking to allocate papers, we only retained papers scoring in the upper half. This was to ensure that we didn’t drop too far down the rank list. After passing through the rank list of scores once, some papers were still left unallocated.
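One way to read "the upper half" is as a cut at the median of all finite scores, so that only pairs above the median enter the rank list. The sketch below uses that interpretation; the threshold choice, function name and toy values are assumptions.

```python
import numpy as np

def upper_half_pairs(similarity):
    """Return (paper, reviewer) pairs scoring above the median, best first."""
    finite = similarity[np.isfinite(similarity)]
    threshold = np.median(finite)  # assumed cut-off for the 'upper half'
    papers, reviewers = np.nonzero(similarity > threshold)
    order = np.argsort(similarity[papers, reviewers])[::-1]
    return [(int(papers[i]), int(reviewers[i])) for i in order]

toy_similarity = np.array([[0.9, 0.1],
                           [0.3, 0.7]])
toy_rank_list = upper_half_pairs(toy_similarity)
```

Any paper whose good matches are all taken by the time the list runs out is simply left short, which is how a paper can end up with fewer than three reviewers.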

But you didn’t leave unallocated papers to area chairs, did you?

No, we needed all papers to have an area chair, so for area chairs we continued to allocate these ‘inappropriate papers’ to the best matching area chair with remaining quota, but for reviewers we left these allocations ‘open’ because we felt manual intervention was appropriate here.

Was anything else different about area chair allocation?

Yes, we found there was a tendency for high bidding area chairs to fill up their allocation quickly vs low bidding area chairs, meaning low bidding/similarity area chairs weren’t getting a very good allocation. To try and distribute things more evenly, we used a ‘staged quota’ system. We started by allocating area chairs five papers each, then ten, then fifteen, etc. This meant that even if an area chair had the top 25 similarities in the overall list, many of those papers would still be matched to other reviewers. Our crude normalization was also designed to prevent this tendency. Perhaps a better idea still would be to rank similarities on a per reviewer basis and use this as the score instead of the similarity itself, although we didn’t try this.
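A staged quota can be sketched as repeated greedy passes with a rising cap. The version below assumes one area chair per paper and uses made-up names and toy numbers (with a stage size of 1 instead of 5 to keep the example small); it is an illustration, not the code we ran.

```python
import numpy as np

def staged_allocate(similarity, final_quota, stage_size=5):
    """Assign one area chair per paper, raising every chair's cap in stages."""
    n_papers, n_chairs = similarity.shape
    chair_load = np.zeros(n_chairs, dtype=int)
    assigned = np.full(n_papers, -1)  # -1 means 'no area chair yet'
    order = np.argsort(similarity, axis=None)[::-1]  # best pairs first

    # Caps rise as stage_size, 2 * stage_size, ... and never exceed final_quota.
    for stage_quota in range(stage_size, final_quota + stage_size, stage_size):
        cap = min(stage_quota, final_quota)
        for flat_index in order:
            paper, chair = np.unravel_index(flat_index, similarity.shape)
            if assigned[paper] >= 0 or chair_load[chair] >= cap:
                continue
            assigned[paper] = chair
            chair_load[chair] += 1
    return assigned.tolist()

# Toy example: chair 0 is the best match for every paper.
toy_similarity = np.array([[0.9, 0.5],
                           [0.8, 0.1],
                           [0.7, 0.2],
                           [0.6, 0.3]])
toy_assignment = staged_allocate(toy_similarity, final_quota=2, stage_size=1)
```

Within each stage, a chair who has hit the current cap must wait, which gives the other chairs a turn at their own best remaining papers before the cap is raised.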

Did you do the allocations for the bidding in the same way?

Yes, we did bidding allocations in a similar way, apart from two things. Firstly, the similarity score was different: we didn’t have a separate match to the primary keyword. This led to problems for reviewers who had one dominant primary keyword and many less important secondary keywords. Secondly, the allocated papers were distributed in a different way. Each paper was allocated (for bidding) to those area chairs who were in the top 25 scores for that paper. This led to quite a wide variety in the number of papers you saw for bidding, but each paper was (hopefully) seen at least 25 times.
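The per-paper selection for area chair bidding amounts to a top-k lookup along each row of the score matrix. A minimal sketch, with a hypothetical function name and toy data (and k=2 rather than 25):

```python
import numpy as np

def bidding_lists(similarity, top_k=25):
    """For each paper (row), return the indices of its top_k area chairs."""
    k = min(top_k, similarity.shape[1])
    # Sort each row in descending order and keep the first k columns.
    top = np.argsort(similarity, axis=1)[:, ::-1][:, :k]
    return {paper: [int(chair) for chair in row] for paper, row in enumerate(top)}

toy_similarity = np.array([[0.2, 0.9, 0.5],
                           [0.7, 0.1, 0.4]])
toy_lists = bidding_lists(toy_similarity, top_k=2)

# Count how many papers each chair sees; this is what ends up uneven.
papers_seen = np.bincount([c for chairs in toy_lists.values() for c in chairs],
                          minlength=toy_similarity.shape[1])
```

Because the selection is per paper with no cap per chair, every paper is guaranteed its viewers, while the number of papers any one chair sees depends entirely on how often they land in a top-k list.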

That’s for area chairs, was it the same for the bidding allocation for reviewers?

No, for reviewers, we wanted to restrict the number of papers that each reviewer would see. We decided each reviewer should see a maximum of 25 papers, so we did something more similar to the ‘preliminary allocation’ that’s just been sent out. We went down the list allocating a maximum of 25 papers per reviewer, and ensuring each paper was viewed by 17 different reviewers.

Why did you do it like this? Did you do any research into this?

We never (originally) intended to intervene so heavily in the allocation system, but with this year’s record numbers of submissions and reviewers the CMT database was failing to allocate. This, combined with time differences between Sheffield/New York/Seattle, was causing delays in getting papers out for bidding, so at one stage we split the load into Corinna working with CMT to achieve an allocation and Neil working on coding the intervention described above. The intervention was finished first. Once we had a rough and ready system working for bids we realised we could have more fine control over the allocation than we’d get with CMT (for example trying to ensure that each paper got at least one ‘expert’), so we chose to continue with our approach. There may certainly be better ways of doing this.

How did you check the quality of the allocation?

The main approach we used for checking allocation quality was to check the allocation of an area chair whose domain we knew well, and ensure that the allocation made sense, i.e. we looked at the list of papers and judged whether it made ‘sense’.

That doesn’t sound very objective, isn’t there a better way?

We agree that it’s not very objective, but then again people seem to evaluate topic models like that all the time, and a topic model is a key part of this system (the TPMS matching service). The other approach was to wait until people complained about their allocation. There were only a few very polite complaints at the bidding stage, but these led us to realise we needed to upweight the similarities associated with the primary keyword. We found that some people chose one very dominant primary keyword, and many less important secondary keywords. These reviewers were not getting a very focussed allocation.

How about the code?

The code was written in Python using pandas in the form of an IPython notebook.

And finally …

Thanks to all the reviewers and area chairs for their patience with the system and particular thanks to Laurent Charlin (TPMS) and the CMT support for their help getting everything uploaded.