Low Liquidity Compensation for Reputation Systems
A question of liquidity: when is 4.0 > 5.0? When enough people say it is!
(F. Randall Farmer, Yahoo! Community Analyst, 2007)
Consider the following problem with simple averages: it is mathematically unreasonable to compare two similar targets whose averages are made from significantly different numbers of inputs. Suppose the first target has only three ratings averaging 4.667 stars, which after rounding displays as 5 stars, and you compare that score to a target with a much greater number of inputs, say 500, averaging 4.4523 stars, which after rounding displays as only 4 stars. The second target, the one with the lower average, better reflects the true consensus of the inputs; there just isn't enough information on the first target to be sure of anything. Most simple-average displays with too few inputs shift the burden of evaluating the reputation to users by displaying the number of inputs alongside the simple average, usually in parentheses, like this: 5 stars (3).
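To make the arithmetic of the example concrete, here is a quick sketch (assuming whole-star display rounding; the variable names are ours):

```python
# Two targets from the example above.
ratings_few = [5, 5, 4]                    # 3 ratings
mean_few = sum(ratings_few) / len(ratings_few)   # 4.667

mean_many = 4.4523                         # mean of 500 ratings

# Whole-star display rounding hides how much confidence backs each score.
print(round(mean_few), round(mean_many))   # 5 and 4
```

The target backed by only three inputs displays the higher score, which is exactly the problem the liquidity compensation below addresses.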
But pawning off the interpretation of averages on users doesn't help when you're ranking targets on the basis of those averages: a lone rating on a brand-new item will put the item at the top of any ranked results it appears in. This effect is inappropriate and should be compensated for.
We need a way to adjust the ranking of an entity based on the quantity of ratings. Ideally, an application performs these calculations on the fly so that no additional storage is required.
We provide the following solution: a high-performance liquidity compensation algorithm to offset variability in very small sample sizes. It's used on Yahoo! sites to which many new targets are added daily, with the result that, often, very few ratings are applied to each one.
r = m - a + l * a

where r is the liquidity-compensated rank, m is the SimpleMean, a is the AdjustmentFactor, and l is the LiquidityWeight:

l = min(max((n - f) / c, 0), 1) * 2

where n is the NumRatings, f is the LiquidityFloor, and c is the LiquidityCeiling. Combined into a single expression:

r = m - a + min(max((n - f) / c, 0.00), 1.00) * 2.00 * a
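The combined formula translates directly into a single nonrecursive calculation. A minimal sketch in Python (the function and parameter names are ours; the mean is assumed normalized to [0, 1]):

```python
def liquidity_adjusted_rank(mean, num_ratings,
                            adjustment=0.10,   # a: AdjustmentFactor
                            floor=10,          # f: LiquidityFloor
                            ceiling=60):       # c: LiquidityCeiling
    """Compensate a simple mean for low input liquidity.

    Assumes `mean` is normalized to the range [0, 1].
    """
    # LiquidityWeight l: 0 with too few ratings, up to 2 when fully liquid.
    weight = min(max((num_ratings - floor) / ceiling, 0.0), 1.0) * 2
    return mean - adjustment + weight * adjustment
```

With the suggested constants, the two targets from the opening example (normalized means of 0.9334 from 3 ratings versus 0.8905 from 500) swap places in a ranking: the first scores 0.8334, the second roughly 0.9905.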
This formula produces a curve seen in the figure below. Though a more mathematically continuous curve might seem appropriate, this linear approximation can be done with simple nonrecursive calculations and requires no knowledge of previous individual inputs.
Suggested initial values for a, c, and f (assuming normalized inputs):
- a = 0.10
This constant is the fractional amount to remove from the score before adding back in effects based on input volume. For many applications, such as 5-star ratings, it should be within the range of integer rounding error: in this example, if the AdjustmentFactor is set much higher than 10%, a lot of 4-star entities will be ranked ahead of 5-star ones. If it's set much lower, it may not have the desired effect.
- f = 10
This constant is the minimum number of inputs required before ratings have a positive effect on the rank. In an ideal environment, this number is between 5 and 10, and our experience with large systems indicates that it should never be set lower than 3. Higher numbers help mitigate abuse and yield a better representation of the consensus of opinion.
- c = 60
This constant is the threshold beyond which additional inputs no longer earn a weighting bonus; past it, we trust the average to be representative of the optimum score. This number must not be lower than 30, the minimum sample size that statistics generally requires for a t-score. Note that the t-score cutoff of 30 assumes unmanipulated (read: random) data.
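With these suggested constants, the LiquidityWeight traces the piecewise-linear curve described above: zero at or below the floor, rising linearly, and capped at 2 once n reaches f + c. A quick check (sketch, our function name):

```python
def liquidity_weight(num_ratings, floor=10, ceiling=60):
    # Piecewise-linear weight: 0 at or below the floor,
    # capped at 2 once num_ratings reaches floor + ceiling.
    return min(max((num_ratings - floor) / ceiling, 0.0), 1.0) * 2

for n in (0, 10, 40, 70, 500):
    print(n, liquidity_weight(n))
```

At n = 40 the weight is 1.0 (the simple mean is displayed unadjusted), and the full bonus is reached at n = 70; beyond that, more inputs change nothing.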
We encourage you to consider other values for a, c, and f, especially if you have any data on the characteristics of your sources and their inputs.