Obviously (cross-)IMPs work best (in terms of identifying the best pair) if we assume that big swings (such as making or not making a slam) reflect skill more than small swings (such as overtricks or part-score sacs). To some extent this will be the case, if only because a big swing can be the aggregate of several good decisions, for example first jamming opps' auction effectively, then doubling them, and then defending well.
On the other hand, MPs work best if some boards allow skill to translate into bigger swings than other boards do.
In addition, I thought that MPs work better in large fields, but I am not sure whether that is really true.
So I ran some simulations of a 27-board Mitchell (27*1 board, 9*3 or 3*9) in which I assumed that the raw score on a board was normally distributed with a mean value of
E(rawscore[board,nspair,ewpair]) = skillfactor[board] * (strength[ns]-strength[ew])
where the skillfactor was gamma distributed across boards with a shape parameter which I allowed to vary between simulations. I set rate = shape so that the average skill factor stays constant between sims (mean 1, variance 1/shape).
The variance of the raw score was gamma distributed across boards, independent of the skill factor.
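To make the model concrete, here is a minimal sketch of how the raw scores for one board could be drawn. It is not my actual simulation code. The function names, the pairing of NS pair i against EW pair i, and the parameters of the variance distribution (`var_shape`, `var_scale`) are assumptions made up for illustration; only the structure (gamma skill factor with rate = shape, independent gamma variance, normal raw score) follows the description above.

```python
import random


def draw_skill_factor(shape, rng):
    """Gamma skill factor with rate = shape, i.e. scale = 1/shape, so the mean is 1."""
    return rng.gammavariate(shape, 1.0 / shape)


def simulate_board(strength_ns, strength_ew, shape, rng,
                   var_shape=2.0, var_scale=1.0):
    """Return one raw NS score per table for a single board.

    Table i is assumed to seat NS pair i against EW pair i.
    var_shape and var_scale are hypothetical placeholder values.
    """
    skill = draw_skill_factor(shape, rng)
    # Noise variance is gamma distributed and drawn independently of the skill factor.
    sd = rng.gammavariate(var_shape, var_scale) ** 0.5
    return [rng.gauss(skill * (ns - ew), sd)
            for ns, ew in zip(strength_ns, strength_ew)]


rng = random.Random(2024)
ns = [rng.gauss(0.0, 1.0) for _ in range(9)]  # assumed strength distribution
ew = [rng.gauss(0.0, 1.0) for _ in range(9)]
print(simulate_board(ns, ew, shape=0.1, rng=rng))
```

With a small shape such as 0.1 most boards get a skill factor near zero and a few get a very large one; with shape 10 nearly all boards get a factor close to 1.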
Before calculating IMPs and MPs I rounded the raw scores to the nearest multiple of 50 to allow for ties at matchpoints (the rounding was also applied for IMP scoring, for a fairer comparison). I used Butler scoring without outlier removal.
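The scoring step for one board can be sketched roughly as follows. Again this is an illustration and not my original code: the function names are made up, the datum is taken as the plain mean of all scores on the board (no outlier removal) and rounded to the nearest 10, which is an assumption, and matchpoints are counted as 1 per pair beaten and 0.5 per tie. The IMP conversion uses the standard IMP scale.

```python
from bisect import bisect_right

# Lower bounds (in points) of the standard IMP scale for 1, 2, ..., 24 IMPs.
IMP_BOUNDS = [20, 50, 90, 130, 170, 220, 270, 320, 370, 430, 500, 600,
              750, 900, 1100, 1300, 1500, 1750, 2000, 2250, 2500, 3000,
              3500, 4000]


def round_to_50(score):
    """Round a raw score to the nearest multiple of 50."""
    return 50 * round(score / 50)


def to_imps(diff):
    """Convert a score difference to signed IMPs."""
    imps = bisect_right(IMP_BOUNDS, abs(diff))
    return imps if diff >= 0 else -imps


def butler_imps(ns_scores):
    """NS Butler IMPs on one board: each score against the mean of all scores."""
    # Assumption: datum rounded to the nearest 10, no outliers removed.
    datum = 10 * round(sum(ns_scores) / len(ns_scores) / 10)
    return [to_imps(s - datum) for s in ns_scores]


def matchpoints(ns_scores):
    """NS matchpoints on one board: 1 per score beaten, 0.5 per tie."""
    return [sum(1.0 if s > t else 0.5 if s == t else 0.0
                for j, t in enumerate(ns_scores) if j != i)
            for i, s in enumerate(ns_scores)]


rounded = [round_to_50(x) for x in [112.0, 95.3, -61.0]]
print(rounded, butler_imps(rounded), matchpoints(rounded))
```

The EW results are just the mirror image: negate the IMPs, and take (number of tables - 1) minus the NS matchpoints.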
The average Spearman correlations between strength and IMP scoring were (as a function of the shape parameter of the skill factor distribution and the number of tables):
shape   3 tables   9 tables   27 tables
0.1     0.756      0.833      0.864
1       0.877      0.929      0.952
10      0.920      0.960      0.979

For MPs:

shape   3 tables   9 tables   27 tables
0.1     0.744      0.844      0.880
1       0.869      0.931      0.957
10      0.920      0.962      0.980
So it looks like for large values of the shape parameter (i.e. the skill factor is roughly the same for all boards) it doesn't matter which scoring you use, and this holds regardless of field size. But for more heterogeneous sets of boards (low values of the skill factor shape parameter), MPs are better for large fields and Butler is better for small fields, with a break-even point somewhere halfway between 3 and 9 tables.
Based on 9000 sims, using both the ew and the ns ranking so 18000 data points per parameter combination.
Maybe I should have a go with correlated noise and skill factor, which is probably realistic. This would favour matchpoints, I would think.
Of course this is all based on a huge number of simplifications and assumptions. It would be cool if someone could do a similar analysis of real data.