« Abusing FICO | Main | Low Liquidity Compensation for Reputation Systems »

Ratings Bias Effects

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week's essay is excerpted from Chapter 4: Building Blocks and Reputation Tips. It uses our experience with Yahoo! data to share some thoughts surrounding user ratings bias, and how to overcome it. You may be surprised by our recommendations.

Figure: Some Yahoo! Sites Ratings Distribution: "One of these things is not like the other. One of these things just doesn't belong."

Some Yahoo! Sites Ratings Distribution: "One of these things is not like the other. One of these things just doesn't belong."

This figure shows the graphs of 5-star ratings from nine different Yahoo! sites with all the volume numbers redacted. We don't need them, since we only want to talk about the shapes of the curves.

Eight of these graphs have what is known to reputation system aficionados as J-curves- where the far right point (5 Stars) has the very highest count, 4-Stars the next, and 1-Star a little more than the rest. Generally, a J-curve is considered less-than ideal for several reasons: The average aggregate scores all clump together between 4.5 to 4.7 and therefore they all display as 4- or 5-stars and are not-so-useful for visually sorting between options. Also, this sort of curve begs the question: Why use a 5-point scale at all? Wouldn't you get the same effect with a simpler thumbs-up/down scale, or maybe even just a super-simple favorite pattern?

The outlier amongst the graphs is for Yahoo! Autos Custom (which is now shut down) where users were rating the car-profile pages created by other users - has a W-curve. Lots of 1, 3, and 5 star ratings and a healthy share of 4 and 2 star as well. This is a healthy distribution and suggests that "a 5-point scale is good for this community".

But why was Autos Custom's ratings so very different from Shopping, Local, Movies, and Travel?

The biggest difference is most likely that Autos Custom users were rating each other's content. The other sites had users evaluating static, unchanging or feed-based content in which they don't have a vested interest.

In fact, if you look at the curves for Shopping and Local, they are practically identical, and have the flattest J hook - giving the lowest share of 1-stars. This is a direct result of the overwhelming use-pattern for those sites: Users come to find a great place to eat or vacuum to buy. They search, and the results with the highest ratings appear first and if the user has experienced that object, they may well also rate it - if it is easy to do so - and most likely will give 5 stars (see the section called “First Mover Effects”). If they see an object that isn't rated, but they like, they may also rate and/or review, usually giving 5-stars - otherwise why bother - so that others may share in their discovery. People don't think that mediocre objects are worth the bother of seeking out and creating internet ratings. So the curves are the direct result of the product design intersecting with the users goals. This pattern - I'm looking for good things so I'll help others find good things - is a prevalent form of ratings bias. An even stronger example happens when users are asked to rate episodes of TV shows - Every episode is rated 4.5 stars plus or minus .5 stars because only the fans bother to rate the episodes, and no fan is ever going to rate an episode below a 3. Look at any popular running TV show on Yahoo! TV or [another site].

Looking more closely at how Autos Custom ratings worked and the content was being evaluated showed why 1-stars were given out so often: users were providing feedback to other users in order to get them to change their behavior. Specifically, you would get one star if you 1) Didn't upload a picture of your ride, or 2) uploaded a dealer stock photo of your ride. The site is Autos Custom, after all! The 5-star ratings were reserved for the best-of-the-best. Two through Four stars were actually used to evaluate quality and completeness of the car's profile. Unlike all the sites graphed here, the 5-star scale truly represented a broad sentiment and people worked to improve their scores.

There is one ratings curve not shown here, the U-curve, where 1 and 5 stars are disproportionately selected. Some highly-controversial objects on Amazon see this rating curve. Yahoo's now defunct personal music service also saw this kind of curve when introducing new music to established users: 1 star came to mean "Never play this song again" and 5 meant "More like this one, please". If you are seeing U-curves, consider that the 1) users are telling you something other than what you wanted to measure is important and/or 2) you might need a different rating scale.


TrackBack URL for this entry: