
August 26, 2009

Tag, You're It!

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week's entry shares some news about the production and design of our book, and asks for your help in pushing it forward to completion.

This is very exciting for us. We're close enough to draft-complete that our wonderful editor at O'Reilly, Mary, went ahead and pulled the trigger on our cover-art! Earlier this week, she shared this with us, passed along from O'Reilly Creative Director Edie Freedman.

Figure: The cover comp for the book (build_web_reputation_sys_comp.jpg)

We love it. It's a beautiful parrot, and I really like the timeless, classic appeal of it. (To be honest, I can't believe that no animal cover has featured a parrot before now. But it's true.)

For those of you paying attention, yes we did share your animal suggestions with the creative team at ORA. They enjoyed them immensely and then—true to the admonition that leads off that page—set them aside and picked this big beautiful bird. (Or 'boo-wuh' as my toddler son says after repeat viewings on Daddy's laptop.) We're pleased with the end result, and excited to see the book coming thismuchcloser to reality.

However… we still need your help! All O'Reilly books feature a tagline, and we need some good suggestions. To accompany the written proposal for the book, Randy found a fun little "O'Reilly cover generator" somewhere online, and the tagline he provided to that was "Ratings, Reviews and Karma, Oh My!" That effort was only semi-facetious—it does highlight some of the principal patterns and methods discussed in the book. Not easy to do in 2 lines of text.

So, please—if you've been following the progress of the book and have some ideas about a tagline, we'd love to hear your thoughts. Please leave a comment on this page.

August 19, 2009

Low Liquidity Compensation for Reputation Systems

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week's essay is excerpted from Chapter 4: Building Blocks and Reputation Tips. This tip provides a solution to an age-old problem with ratings.
 

A question of liquidity:

"When is 4.0 > 5.0? When enough people say it is!"

  --F. Randall Farmer, Yahoo! Community Analyst, 2007

Consider the following problem with simple averages: it is mathematically unreasonable to compare two similar targets with averages made from significantly different numbers of inputs. For the first target, suppose that there are only three ratings averaging 4.667 stars, which after rounding displays as 5 stars, and you compare that average score to a target with a much greater number of inputs, say 500, averaging 4.4523 stars, which after rounding displays as only 4½ stars. The second target, the one with the lower average, better reflects the true consensus of the inputs, since there just isn't enough information on the first target to be sure of anything. Most simple-average displays with too few inputs shift the burden of evaluating the reputation to users by displaying the number of inputs alongside the simple average, usually in parentheses, like this: 4½ stars (142).

But pawning off the interpretation of averages on users doesn't help when you're ranking targets on the basis of averages: a lone rating on a brand-new item will put the item at the top of any ranked results it appears in. This effect is inappropriate and should be compensated for.

We need a way to adjust the ranking of an entity based on the quantity of ratings. Ideally, an application performs these calculations on the fly so that no additional storage is required.

We provide the following solution: a high-performance liquidity compensation algorithm to offset variability in very small sample sizes. It's used on Yahoo! sites to which many new targets are added daily, with the result that, often, very few ratings are applied to each one.

  • RankMean r = SimpleMean m - AdjustmentFactor a + LiquidityWeight l * AdjustmentFactor a
  • LiquidityWeight l = min(max((NumRatings n - LiquidityFloor f) / LiquidityCeiling c, 0), 1) * 2
  • Or, compactly: r = m - a + min(max((n - f) / c, 0.00), 1.00) * 2.00 * a

This formula produces the curve seen in the figure below. Though a smoother, more mathematically continuous curve might seem appropriate, this linear approximation can be computed with simple nonrecursive calculations and requires no knowledge of previous individual inputs.

Figure: The effects of the liquidity compensation algorithm

Suggested initial values for a, c, and f (assuming normalized inputs):

  • AdjustmentFactor
    • a = 0.10

This constant is the fractional amount to remove from the score before adding back in effects based on input volume. For many applications, such as 5-star ratings, it should stay within the range of integer rounding error: in this example, if the AdjustmentFactor is set much higher than 10%, a lot of 4-star entities will be ranked above 5-star ones. If it's set much lower, it may not have the desired effect.

  • LiquidityFloor
    • f = 10

This constant is the minimum number of inputs required before they have any positive effect on the rank. In an ideal environment, this number is between 5 and 10, and our experience with large systems indicates that it should never be set lower than 3. Higher numbers help mitigate abuse and produce a more representative consensus of opinion.

  • LiquidityCeiling
    • c = 60

This constant is the threshold beyond which additional inputs no longer earn a weighting bonus. In short, past this point we trust the average to be representative of the optimum score. This number must not be lower than 30, which in statistics is the minimum sample size required for a meaningful t-score. Note that the t-score cutoff of 30 assumes data that is unmanipulated (read: random). We encourage you to consider other values for a, c, and f, especially if you have any data on the characteristics of your sources and their inputs.
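If it helps to see the whole algorithm in one place, here is a minimal sketch in Python using the suggested starting values above. The function and variable names are our own, and the score is assumed to be normalized to the 0.0-1.0 range (for example, stars divided by 5):

    def rank_mean(simple_mean, num_ratings,
                  adjustment_factor=0.10,  # a: suggested starting value
                  liquidity_floor=10,      # f: suggested starting value
                  liquidity_ceiling=60):   # c: suggested starting value
        """Liquidity-compensated ranking score: r = m - a + l * a.

        simple_mean is assumed to be normalized to 0.0-1.0 (e.g., stars / 5.0).
        The result is meant for ranking, not for display.
        """
        # LiquidityWeight l runs from 0.0 (too few inputs) up to 2.0 (fully liquid).
        liquidity_weight = min(max((num_ratings - liquidity_floor) / liquidity_ceiling, 0.0), 1.0) * 2.0
        return simple_mean - adjustment_factor + liquidity_weight * adjustment_factor

    # The two targets from the example above, normalized by dividing by 5:
    few = rank_mean(4.667 / 5.0, 3)      # about 0.83 -- penalized for low liquidity
    many = rank_mean(4.4523 / 5.0, 500)  # about 0.99 -- fully liquid, so it ranks higher

Note that a fully liquid target gets back twice the AdjustmentFactor, so its ranking score ends up a little above its simple mean and can exceed 1.0. That's fine for ranking, which is the only thing this value is used for.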

August 12, 2009

Ratings Bias Effects

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week's essay is excerpted from Chapter 4: Building Blocks and Reputation Tips. It uses our experience with Yahoo! data to share some thoughts surrounding user ratings bias, and how to overcome it. You may be surprised by our recommendations.

Figure: Some Yahoo! Sites Ratings Distribution: "One of these things is not like the other. One of these things just doesn't belong."


This figure shows the graphs of 5-star ratings from nine different Yahoo! sites with all the volume numbers redacted. We don't need them, since we only want to talk about the shapes of the curves.

Eight of these graphs have what are known to reputation system aficionados as J-curves: the far-right point (5 stars) has the highest count, 4 stars the next highest, and 1 star a little more than the rest. Generally, a J-curve is considered less than ideal for several reasons. The average aggregate scores all clump together between 4.5 and 4.7, so they all display as 4 or 5 stars and are not so useful for visually sorting between options. Also, this sort of curve raises the question: why use a 5-point scale at all? Wouldn't you get the same effect with a simpler thumbs-up/down scale, or maybe even just a super-simple favorite pattern?

The outlier among the graphs is Yahoo! Autos Custom (now shut down), where users were rating the car-profile pages created by other users. It has a W-curve: lots of 1-, 3-, and 5-star ratings, and a healthy share of 4- and 2-star ratings as well. This is a healthy distribution and suggests that a 5-point scale is a good fit for this community.

But why were Autos Custom's ratings so very different from those for Shopping, Local, Movies, and Travel?

The biggest difference is most likely that Autos Custom users were rating each other's content. On the other sites, users were evaluating static, unchanging, or feed-based content in which they had no vested interest.

In fact, the curves for Shopping and Local are practically identical and have the flattest J-hook, giving the lowest share of 1-star ratings. This is a direct result of the overwhelming use pattern for those sites: users come to find a great place to eat or a vacuum to buy. They search, the results with the highest ratings appear first, and if a user has experienced that object, they may well rate it too (if it is easy to do so), most likely giving 5 stars (see the section called “First Mover Effects”). If they come across an object that isn't rated but that they like, they may also rate and/or review it, usually giving 5 stars (otherwise, why bother?) so that others may share in their discovery. People don't think mediocre objects are worth the bother of seeking out and rating on the internet. So the curves are the direct result of the product design intersecting with the users' goals.

This pattern, "I'm looking for good things, so I'll help others find good things," is a prevalent form of ratings bias. An even stronger example happens when users are asked to rate episodes of TV shows: every episode is rated 4.5 stars, plus or minus half a star, because only the fans bother to rate the episodes, and no fan is ever going to rate an episode below a 3. Look at any popular running TV show on Yahoo! TV or [another site].

Looking more closely at how Autos Custom ratings worked, and at what content was being evaluated, showed why 1-star ratings were given out so often: users were providing feedback to other users in order to get them to change their behavior. Specifically, you would get one star if you 1) didn't upload a picture of your ride, or 2) uploaded a dealer stock photo of your ride. The site is Autos Custom, after all! The 5-star ratings were reserved for the best of the best. Two through four stars were actually used to evaluate the quality and completeness of the car's profile. Unlike on all the other sites graphed here, the 5-star scale truly represented a broad range of sentiment, and people worked to improve their scores.

There is one ratings curve not shown here: the U-curve, in which 1 and 5 stars are disproportionately selected. Some highly controversial objects on Amazon see this rating curve. Yahoo!'s now-defunct personal music service also saw this kind of curve when introducing new music to established users: 1 star came to mean "Never play this song again" and 5 meant "More like this one, please." If you are seeing U-curves, consider that 1) users may be telling you that something other than what you wanted to measure is important, and/or 2) you might need a different rating scale.
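For what it's worth, here's a rough diagnostic sketch (our own code, with made-up thresholds that are purely illustrative) of how you might flag which of these shapes your own 5-star histograms are taking:

    def classify_curve(counts):
        """Very rough shape check for a 5-bucket star-rating histogram.

        counts[0] is the number of 1-star ratings ... counts[4] is 5-star.
        The thresholds are illustrative guesses, not tuned or published values.
        """
        total = sum(counts)
        if total == 0:
            return "no data"
        ones, twos, threes, fours, fives = [c / total for c in counts]

        # U-curve: both extremes dominate and the middle is hollow.
        if ones > 0.25 and fives > 0.25 and (twos + threes + fours) < 0.3:
            return "U-curve: the scale may be measuring something you didn't intend"
        # J-curve: 5 > 4 > 3, with most of the mass at the top.
        if fives > fours > threes and fives > 0.4:
            return "J-curve: typical positive ratings bias"
        # W-curve: meaningful mass at 1, 3, and 5 stars.
        if min(ones, threes, fives) > 0.1:
            return "W-curve: healthy use of the full scale"
        return "other"

    print(classify_curve([5, 3, 4, 20, 68]))     # J-curve
    print(classify_curve([18, 12, 20, 15, 35]))  # W-curve
    print(classify_curve([45, 4, 3, 5, 43]))     # U-curve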

August 07, 2009

Abusing FICO

Today's NY Times has a reputation-related article that I find distressing—Another Hurdle for the Jobless: Credit Inquiries:

Once reserved for government jobs or payroll positions that could involve significant sums of money, credit checks are now fast, cheap and used for all manner of work. Employers, often winnowing a big pool of job applicants in days of nearly 10 percent unemployment, view the credit check as a valuable tool for assessing someone’s judgment.
This is, basically, the tale of a formalized reputation score (your FICO, or credit score) that has run amok, and may have outgrown its original intent.

Figure: Factors in a FICO credit score

In Chapter 1 of Building Web Reputation Systems, we discuss the differences between local reputations and global reputations. Local reputations are highly specific to the context in which they're earned and typically have greater value as a decision-making device when they're evaluated within that context. This makes a certain sense when you think about it: John may have earned the 'Top Selling Realtor' award at his agency for three years running, but—when pulled over by the police for erratic driving—he'd be foolish to proffer up his lucite trophies as evidence of his upstanding, sober character. The one reputation has no currency in the other context.

There is a certain gray area, however, where local reputations do have some fungibility between contexts: places where the perceived differences between contexts are minimal enough that the reputation earned in one should still count for something in the other. So if John takes his Top Realtor awards with him to a regional real-estate conference? Well, they still kinda count for something. The context differs in scope—geography—but the applied domain—real estate—is the same. People in the new context can understand the reputation well enough to evaluate it and appreciate the effort (the inputs) that went into earning it.

Or perhaps John puts his awards on his LinkedIn profile to impress local small business owners with the chamber of commerce. The domain's not real estate anymore, but enough other facets of the contexts (local geography, a shared 'business professional' context) give his reputation some meaning to that audience. So locally-earned reputations can be immensely valuable when applied in related contexts.

An extremely limited number of reputations find utility in a great number of contexts. We call these reputations global reputations. [Note: this terminology is largely of our own devising, and we're not convinced that local vs. global carries exactly the right connotations. We welcome your suggestions.] Global reputations may, in fact, be an illusion. Our tendency seems to be to want to believe that some objective measure of a person's worth can be tabulated, stored and transferred with ease.

This, of course, is not the case. The further the interpretation of a reputation strays afield from the context in which it was earned (and the purposes for which it was originally created), the less reliable that reputation is. This is the crux of the problem with FICO. Think for a moment about the original intent of your credit report. It is a tool for credit lenders to predict how valuable a customer you may or may not be to them. It is a highly contextual and specific score, and it considers inputs specific to that evaluation: do you pay your bills on time? do you use the credit you've been given? how much? how frequently? do you have a lot of credit? too much?

The answers to these questions have great utility to credit-lenders, and rolling them up into an easily accessible score has definitely contributed to FICO's wild success as a credit-lending device. Unfortunately, this level of abstraction also gives the illusion that the FICO score has great portability between contexts. It's a numerical score, so it looks like an objective measure. The temptation is simply too great to believe that a high credit score equates to good citizenship, trustworthiness, high moral fiber and love of God & country.

I think you can see how misled and ridiculous this belief is, but as evidence that it's a fundamentally flawed assumption, consider the following: even within its own context, FICO is basically a subjective measure. It was created to inform the purposes of a specific set of evaluators—credit lenders. Remember, when a credit underwriter evaluates your score, the question that they're really hoping to answer is: will I make money off of this customer? Only as a facet or component of that question do they care about the related one: is this person responsible with their money?

A high credit score doesn't necessarily mean what many folks seem to think it means about a person's financial stability or decision-making prowess. There are any number of extremely wise patterns of behavior that a person might engage in that could result in a poor (or no) credit score: saving most of your money, paying for purchases only in cash, never taking out a line of credit. These behaviors might make you a less-than-optimal credit customer, but would anyone truly argue that they make you a bad person? Or financially irresponsible? Of course not.

So, judging the quality of a person's character on their credit report? That's a fool's errand.

The Times article points out the truly insidious problem with basing hiring determinations on a person's credit score:

“How do you get out from under it?” asked Matthew W. Finkin, a law professor at the University of Illinois, who fears that the unemployed and debt-ridden could form a luckless class. “You can’t re-establish your credit if you can’t get a job, and you can’t get a job if you’ve got bad credit.”
This is exactly the danger that our draft chapter points out. This misapplication of your credit rating creates a feedback loop: a situation in which the inputs into the system (in this case, your employment) are dependent in some part upon the output from the system. Why are feedback loops bad? Well, as the Times points out—feedback loops are self-perpetuating and, once started, nigh-impossible to break. And, much like in music production (Neil Young notwithstanding), feedback loops are generally to be avoided because they muddy the fidelity of the signal. (We'll be talking more about feedback loops in the yet-to-be-drafted Chapter 10 of our book.)

Remember, a credit score—like any reputation—was created to serve a purpose. Generally, to provide information to make determinations within a specific context. Straying too far from that context, or applying it in ways that feed back into the system, erodes confidence in the system itself. (It's out of scope for—though perfectly aligned with—this discussion, but some blame the misapplication of FICO into the field of mortgage approvals for playing a significant role in the subprime mortgage collapse.)

August 05, 2009

Polish & Predictability

Just a quick note. One development that Randy and I are excited about: we've solicited the help of a fantastic copy editor, Cate de Heer, to provide a third set of eyes on our draft chapters. Cate's help is, of course, in addition to that of our superb O'Reilly editor, Mary Treseler.

We're hoping that early and ongoing copy improvements will help immensely in the latter stages of the book's development (which are rapidly approaching)—by the time our technical reviewers start their reviews, hopefully they won't be distracted by our errant commas (and my overabundant exclamation points!). Cate has already delivered revisions to Chapter 1, and we published them wholesale on the wiki, so please do check them out.

(And, if you're curious to see what a difference a thoughtful copy-edit can make to your writing, you can compare the current version with any version preceding the 'Copy Edits' checkin.)

Also, we're making an earnest attempt to be more regular with our blog-publishing schedule. So, starting next Wednesday, we'll be posting at least one meaty-sized essay on reputation matters every week. We're calling it Reputation Wednesday. It's a small, regular event mostly designed to keep Randy and me honest, and to get us off our butts to push some of the thinking that we're putting into the book out there for conversation.

We hope that it will make some of the concepts more accessible to folks that may not have time to dive into the wiki. I'll also point out that—if you haven't done so already—now would be a wonderful time to subscribe to the feed for this blog. We promise to fill it up. I swear.

August 03, 2009

Chapters 5 and 7 Ready for Review

We've been busy boys over here on BuildingReputation.com, though—unless you were paying careful attention—you might not have noticed. We've fallen victim to something that strikes all authors (we suspect) and have been so busy drafting, outlining, writing & revising that we've had a hard time keeping up with this site, and promoting the ongoing progress on the book.

The good news is, we're more than half-way to Draft Complete status. (Check out that sidebar on the wiki. 7 chapters down, only 5 to go!) So we have a plan for returning our attentions to this site, and growing the audience here that will continue to feed insight and course-corrections into the remaining chapters. (That's the theory anyway. Have we mentioned that this Unbook stuff is harder than we thought it would be?)

The really good news is this: Chapters Five and Seven are now draft-complete and ready for your review. A bit about each…

Chapter 5: Common Reputation Models

Chapter 5 is a pivotal point in the book, and represents something of a transition from the theoretical & abstract visual grammar of the early chapters to a real-world, applied demonstration of that grammar.

We look at some 'common' reputation models (tho' it's a recurring argument of ours throughout the book that there may not truly be any effective, common, 'off-the-rack' reputation models; all of them require some modification and combination to suit your specific context). Here, for example, 'Robust Karma' shows how to get the right mix of participant quality and activity in your karma model…


Robust Karma

When needed, Quality Karma and Participation Karma can be mixed into one score representing the combined value of a user's contributions. Each application decides how much weight each component gets in the final calculation. Often these karma scores are not displayed to users but are used only for internal ranking, for highlighting or attention, and as search-ranking influence factors; see Chap_5-Keep_Your_Barn_Door_Closed later in this chapter for common reasons for this secrecy. But even when displayed, robust karma has the advantage of encouraging users both to contribute the best stuff (as evaluated by their peers) and to do it often.

When negative factors are mixed into robust karma, it is particularly useful for customer care staff - both to highlight users that have become abusive or are decreasing content value, and to potentially provide an increased level of service in the case of a service event. This karma helps find the best of the best and the worst of the worst.

Figure_5-V: Robust Karma combines multiple other karma scores, usually Qualitative and Quantitative, for simplicity's sake at the cost of obscuring detail.
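To make the blend described above a little more concrete, here's a small sketch in Python. The weights and names are placeholders of our own choosing (the chapter leaves the mix up to each application), and it simply shows where a negative factor could be subtracted in:

    def robust_karma(quality_karma, participation_karma,
                     quality_weight=0.7, participation_weight=0.3,
                     abuse_penalty=0.0):
        """Blend Quality Karma and Participation Karma into one internal score.

        All inputs are assumed to be normalized to 0.0-1.0. The weights are
        placeholders; each application chooses its own mix. abuse_penalty shows
        where negative factors could be folded in.
        """
        score = (quality_karma * quality_weight
                 + participation_karma * participation_weight
                 - abuse_penalty)
        # Clamp so a large penalty can't push the score out of range.
        return min(max(score, 0.0), 1.0)

    # A well-regarded, active contributor vs. an active but abusive one:
    robust_karma(0.9, 0.8)                     # 0.87: a candidate for highlighting
    robust_karma(0.4, 0.9, abuse_penalty=0.5)  # 0.05: a candidate for customer care attention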

Then we move on to something we know you're gonna like: we explore some well-known and high-profile reputation models that lie behind a couple of the Web's powerhouses. eBay Seller Reputation is perhaps the most-observed (and emulated) reputation system going today, and if you anticipate designing any type of marketplace or trust-based karma system, it is well worth a read to understand exactly how reputation works on eBay.

Similarly, we do a deep dive into how Flickr's Interestingness model ensures a stream of high-quality and reliably enjoyable photos for the Explore section of that site. This stuff is required reading for social software architects & designers. Trust us.

Chapter 7: Objects, Inputs, Scope, and Mechanism

This is the practitioners' chapter. In Chapter 6, we've asked a lot of the foundational questions about your intended reputation program: What do you hope to achieve? What behaviors are you trying to encourage? Discourage? How will you measure your progress?

In Chapter 7, with those goals firmly in mind, we show you how to start architecting your reputation system. We discuss how to identify the objects in your application that should accrue reputation, and give some guidance on determining which reputation inputs you should pay attention to. If you're curious about "Thumbs or Stars?" or "How is a 'Favorite' different from a 'Like'?" then we try to provide some guidance in this chapter.

And, as we're careful to remind, reputation takes place within a context, and an important facet of that context is its scope: how wide-ranging (or specific) should earned reputations be to the 'location' and context that they're earned in? Then, finally, we suggest a number of ordering mechanisms that show you how Objects, Inputs & Scope all combine to serve those goals that we established in Chapter 6.

Chapter 7 is a doozy of a chapter, folks. It almost begs to be its own book. But, please, don't be intimidated—there's a lot of good raw material in there, but we still really do need your help and criticism to tease a great chapter out of it. Your comments are always welcome.