Now we're going to start putting our simple reputation building blocks from Chapter_3 to work. Let's look at some actual reputation models to understand how the claims, inputs, and processes described in the last chapter can be combined to model a target entity's reputation.
In this chapter, we'll name and describe a number of simple and broadly-deployed reputation models, such as vote-to-promote, simple ratings, and points. You probably have some degree of familiarity with these patterns by simple virtue of being an active participant online. You see them all over the place-they're the bread and butter of today's social Web. Later in the chapter, we'll show you how to combine these simple models and expand upon them to make real-world models.
Understanding how these simple models combine to form more complete ones will help you identify them when you see them in the wild. All of this will become important later in the book, as you start to design and architect your own tailored reputation models.
At their very simplest, some of the models we present below are really no more than fancified reputation primitives: counters, accumulators, and the like. Notice, however, that just because these models are simple doesn't mean that they're not useful. Variations on the favorites-and-flags, voting, ratings-and-reviews, and karma models are abundant on the Web, and the operators of many sites find that, at least in the beginning, these simple models suit their needs perfectly.
The favorites-and-flags model excels at identifying outliers in a collection of entities. The outliers may be exceptional either for their perceived quality or for their lack of same. The general idea is this: give your community controls for identifying or calling attention to items of exceptional quality (or exceptionally low quality).
These controls may take the form of explicit votes for a reputable entity, or they may be more subtle implicit indicators of quality (such as the ability to bookmark content or send a link to it to a friend). A count of the number of times these controls are accessed forms the initial input into the system; the model uses that count to tabulate the entities' reputations.
In its simplest form, a favorites-and-flags model can be implemented as a simple counter (Figure_4-1). When you start to combine counters into more complex models, you'll probably need the additional flexibility of a reversible counter.
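To make the mechanics concrete, here is a minimal sketch of such a reversible counter in Python. The class and field names are ours, purely for illustration; they don't correspond to any particular site's implementation.

```python
class ReversibleCounter:
    """Counts favorite/flag events and supports reversing them later."""

    def __init__(self):
        self.count = 0
        self.sources = set()  # remember who counted, so the count can be reversed

    def add(self, source_id):
        """Record one favorite/flag from a user; ignore duplicates."""
        if source_id not in self.sources:
            self.sources.add(source_id)
            self.count += 1

    def reverse(self, source_id):
        """Undo a previous favorite/flag (user changed mind, or abuse cleanup)."""
        if source_id in self.sources:
            self.sources.remove(source_id)
            self.count -= 1


favorites = ReversibleCounter()
favorites.add("user_42")
favorites.add("user_99")
favorites.reverse("user_42")
print(favorites.count)  # -> 1
```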
The favorites-and-flags model has three variants.
The vote-to-promote model, a variant of the favorites-and-flags model, has been popularized by crowd-sourced news sites such as Digg, Reddit, and Yahoo! Buzz. In a vote-to-promote system, a user promotes a particular content item in a community pool of submissions. This promotion takes the form of a vote for that item, and items with more votes rise in the rankings to be displayed with more prominence.
Vote-to-promote differs from this-or-that voting (see Chap_4-This-or-that-vote) primarily in the degree of boundedness around the user's options. Vote-to-promote registers an opinion about a reputable entity within a large, potentially unbounded set (sites like StumbleUpon, for instance, have the entire Web as their candidate pool of potential objects).
Counting the number of times that members of your community bookmark a content item can be a powerful method for tabulating content reputation. This method provides a primary value (see Chap_6-Sidebar_Provide_a_Primary_Value ) to the user: bookmarking an item gives the user persistent access to it, and the ability to save, store, or retrieve it later. And, of course, it also provides a secondary value to the reputation system.
Unfortunately, there are many motivations in user-generated content applications for users to abuse the system. So it follows that reputation systems play a significant role in monitoring and flagging bad content. This is not that far removed from bookmarking the good stuff. The most basic type of reputation model for abuse moderation involves keeping track of the number of times the community has flagged something as abusive. Craigslist uses this mechanism, and sets a custom threshold for each ad placed, on a per-user, per-category, and even per-city basis, though the value and the formulation are always kept secret from users.
Typically, once a certain threshold is reached, either human agents (staff) act upon the content directly or some piece of application logic determines the proper automated outcome: remove the “offending” item; properly categorize it (for instance, add an “adult content” disclaimer to it); or add it to a prioritized queue for human agent intervention.
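A hedged sketch of that threshold logic might look like the following; the threshold values, the Item fields, and the review queue are invented for illustration.

```python
from dataclasses import dataclass, field

ABUSE_THRESHOLD = 5    # flags at which the item is hidden automatically
REVIEW_THRESHOLD = 3   # flags at which the item is queued for human review

@dataclass
class Item:
    flags: set = field(default_factory=set)  # user ids who flagged this item
    hidden: bool = False

def on_abuse_flag(item, flagger_id, review_queue):
    """Record one abuse flag per user and act once thresholds are crossed."""
    item.flags.add(flagger_id)
    count = len(item.flags)
    if count >= ABUSE_THRESHOLD:
        item.hidden = True               # automated outcome: hide the listing
    elif count >= REVIEW_THRESHOLD and item not in review_queue:
        review_queue.append(item)        # escalate to human agents
```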
If you give your users options for expressing their opinion about something, you are giving them a vote. A very common use of the voting model (which we've referenced throughout this book) is to allow community members to vote on the usefulness, accuracy, or appeal of something.
To differentiate from more open-ended voting schemes like vote-to-promote, it may help to think of these types of actions as “this-or-that” voting: choosing from the most attractive option within a bounded set of possibilities. (See Figure_4-2 .)
It's often more convenient to store that reputation statement back as a part of the reputable entity that it applies to, making it easier, for example, to fetch and display a “Was this review helpful?” score. (See Figure_2-7 .)
When an application offers users the ability to express an explicit opinion about the quality of something, it typically employs a ratings model. (Figure_4-3 ) There are a number of different scalar-value ratings: stars, bars, “HotOrNot,” or a 10-point scale. (We'll discuss how to choose from amongst the various types of ratings inputs in Chap_6-Choosing_Your_Inputs .) In the ratings model, ratings are gathered from multiple individual users and rolled up as a community average score for that target.
Some ratings are most effective when they travel together. More complex reputable entities frequently require more nuanced reputation models, and the ratings-and-review model, Figure_4-4 , allows users to express a variety of reactions to a target. While each rated facet could be stored and evaluated as its own specific reputation, semantically that wouldn't make much sense-it's the review in its entirety that is the primary unit of interest.
In the reviews model, a user gives a target a series of ratings and provides one or more freeform text opinions. Each individual facet of a review feeds into a community average.
For some applications, you may want a very specific and granular accounting of user activity on your site. The points model, Figure_4-5 , provides just such a capability. With points, your system counts up the hits, actions, and other activities that your users engage in and keeps a running sum of the awards.
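A points model really is just a table of award values and a running sum per user, as in this illustrative sketch (the actions and point values here are made up):

```python
# Illustrative point awards; real values depend entirely on your application.
POINT_VALUES = {
    "post_comment": 2,
    "upload_photo": 5,
    "daily_login": 1,
}

user_points = {}  # user_id -> running total of awards

def award_points(user_id, action):
    """Add the fixed award for this action to the user's running sum."""
    user_points[user_id] = user_points.get(user_id, 0) + POINT_VALUES.get(action, 0)

award_points("alice", "upload_photo")
award_points("alice", "post_comment")
print(user_points["alice"])  # -> 7
```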
This is a tricky model to get right. In particular, you face two dangers:
If your application's users must actually surrender part of their own intrinsic value in order to obtain goods or services, you will be punishing your best users, and you'll quickly lose track of people's real relative worth. Your system won't be able to tell the difference between truly valuable contributors and those who are just good hoarders and never spend the points allotted to them.
Far better to link the two systems but allow them to remain independent of each other: a currency system for your game or site should be orthogonal to your reputation system. Regardless of how much currency exchanges hands in your community, each user's underlying intrinsic karma should be allowed to grow or decay uninhibited by the demands of commerce.
A karma model is reputation for users. In Chapter 2, Chap_2-Mixing_Models_adding_Karma , we explained that a karma model usually is used in support of other reputation models to track or create incentives for user behavior. All the complex examples later in this chapter (Chap_4-Combining_the_Simple_Models ) generate and/or use a karma model to help calculate a quality score for other purposes, such as search ranking, content highlighting, or selecting the most reputable provider.
There are two primitive forms of karma models: models that measure the amount of user participation and models that measure the quality of contributions. When these types of karma models are combined, we refer to the combined model as robust. Including both types of measures in the model gives the highest scores to the users who are both active and produce the best content.
Counting socially and/or commercially significant events by content creators is probably the most common type of participation karma model. This model is often implemented as a point system (Chap_4-Points), in which each action is worth a fixed number of points and the points accumulate. A participation karma model looks exactly like Figure_4-5, where the input event represents the number of points for the action and the source of the activity becomes the target of the karma.
There is also a negative participation karma model, which counts how many bad things a user does. Some people call this model strikes, after the three-strikes rule of American baseball. Again, the model is the same, except that the application interprets a high score inversely.
A quality-karma model, such as eBay's seller feedback Chap_4-eBay_Merchant_Feedback_Karma model, deals solely with the quality of contributions by users. In a quality-karma model, the number of contributions is meaningless unless it is accompanied by an indication of whether each contribution is good or bad for business. The best quality-karma scores are always calculated as a side effect of other users evaluating the contributions of the target.
In the eBay example, a successful auction bid is the subject of the evaluation, and the results roll up to the seller: if there is no transaction, there should be no evaluation. For a detailed discussion of this requirement, see Karma Is Complex, Chap_7-Displaying_Karma . Look ahead to Figure_4-6 for a diagram of a combined ratings-and-reviews and quality-karma model.
By itself, a participation-based karma score is inadequate to describe the value of a user's contributions to the community: we will caution time and again throughout the book that rewarding simple activity is an impoverished way to think about user karma. However, you probably don't want a karma score based solely on quality of contributions either. In that case, you may find your system rewarding cautious contributors: ones who, out of a desire to keep their quality ratings high, contribute only to “safe” topics, or who, once they've attained a certain quality ranking, stop contributing altogether to protect that ranking.
What you really want to do is to combine quality-karma and participation-karma scores into one score-call it robust karma. The robust-karma score represents the overall value of a user's contributions: the quality component ensures some thought and care in the preparation of contributions, and the participation side ensures that the contributor is very active, that she's contributed recently, and (probably) that she's surpassed some minimal thresholds for user participation-enough that you can reasonably separate the passionate, dedicated contributors from the fly-by post-then-flee crowd.
The weight you'll give to each component depends on the application. Robust-karma scores often are not displayed to users, but may be used instead for internal ranking or flagging, or as factors influencing search ranking; see Chap_4-Keep_Your_Barn_Door_Closed , below, for common reasons for this secrecy. But even when karma scores are displayed, a robust-karma model has the advantage of encouraging users both to contribute the best stuff (as evaluated by their peers) and to do it often.
When negative factors are included in factoring robust-karma scores, it is particularly useful for customer care staff-both to highlight users who have become abusive or users whose contributions decrease the overall value of content on the site, and potentially to provide an increased level of service to proven-excellent users who become involved in a customer service procedure. A robust-karma model helps find the best of the best and the worst of the worst.
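As a rough sketch only, and assuming both component scores are already normalized to the range 0.0 to 1.0, a robust-karma roll-up might look something like this. The weights, penalty, and participation threshold are placeholders, not recommendations.

```python
def robust_karma(participation, quality, negative_events,
                 w_quality=0.7, w_participation=0.3, penalty=0.05,
                 min_participation=0.1):
    """Blend quality and participation karma, subtracting abuse penalties.

    participation and quality are assumed normalized to 0.0..1.0.
    """
    if participation < min_participation:
        return 0.0  # not enough activity to judge quality reliably
    score = w_quality * quality + w_participation * participation
    score -= penalty * negative_events
    return max(0.0, min(1.0, score))

print(robust_karma(participation=0.6, quality=0.9, negative_events=1))  # -> 0.76
```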
By themselves, the simple models described above are not enough to demonstrate a typical deployed large-scale reputation system in action. Just as the ratings-and-reviews model is a recombination of the simpler atomic models that we described in Chapter_3 , most reputation models combine multiple smaller, simpler models into one complex system.
A model is inextricably linked to the application in which it's embedded. For example, in Flickr, eBay, or Yahoo! Movies, every input is tied to a specific application feature (which itself depends on a specific object model and set of interactions). So to truly copy a model, you would also need to copy, wholesale, large parts of the interface and business logic from the application in which the model is embedded. This is probably not a recommended action.
The entire latter half of Building Web Reputation Systems shows how to design a system specific to your own product and context. You'll see better results for your application if you learn from the models presented in this chapter, then set them aside.
Eventually, a site based on a simple reputation model, such as the ratings-and-reviews model, is bound to become more complex. Probably the most common reason for increasing complexity is this progression: as an application becomes more successful, it becomes clear that some of the site's users produce higher-quality reviews. These quality contributions begin to significantly increase the value of the site to end users and to the site operator's bottom line. As a result, the site operator looks for ways to recognize these contributors, increase the search ranking value of their reviews, and generally provide incentives for this value-generating behavior. Adding a karma reputation model to the system is a common approach to reaching those goals.
The simplest way to introduce a quality-karma score to a simple ratings-and-reviews reputation system is to introduce a “Was this helpful?” feedback mechanism that visiting readers may use to evaluate each review.
The example in Figure_4-7 is a hypothetical product reputation model, and the reviews focus on 5-star ratings in the categories “overall” , “service” , and “price.” These specifics are for illustration only and are not critical to the design. This model could just as well be used with thumb ratings and any arbitrary categories like “sound quality” or “texture.”
The combined ratings-and-reviews with karma model has two inputs: the review (a compound claim) and the was-this-helpful vote. From these inputs, the community rating averages, the WasThisHelpful ratio, and the reviewer quality-karma rating are generated on the fly. Pay careful attention to the sources and targets of the inputs of this model: they are not the same users, nor are their ratings targeted at the same entities.
The model can be described as follows.
The review itself is a compound claim that includes the reviewer's ratings and freeform comments, plus a WasThisHelpful ratio, which is initialized to 0 out of 0 and is never actually modified by the reviewer but derived from the was-this-helpful votes of readers.

This model has only three processes, or outputs, and is pretty straightforward. Note, however, the split shown for the was-this-helpful vote, where the message is duplicated and sent both to the was-this-helpful process and to the process that calculates reviewer quality karma. The more complex the reputation model, the more common this kind of split becomes.
Besides indicating that the same input is used in multiple places, a split also offers the opportunity to do parallel and/or distributed processing-the two duplicate messages take separate paths and need not finish at the same time, or at all.
Because users may need to revise their ratings and the site operator may wish to cancel the effects of ratings by spammers and other abusive behavior, the effects of each review are reversible. This is a simple reversible average process, so it's a good idea to consider the effects of bias and liquidity when calculating and displaying these averages (See Chap_3-Craftsman_Tips ).
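Here's a minimal sketch of a reversible average, assuming the original rating value is available when a reversal or revision arrives; the class is illustrative, not part of any framework.

```python
class ReversibleAverage:
    """Running average that can be corrected when ratings are revised or revoked."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, rating):
        self.total += rating
        self.count += 1

    def remove(self, rating):
        """Reverse a previously added rating (must pass the original value)."""
        self.total -= rating
        self.count -= 1

    def revise(self, old_rating, new_rating):
        """Replace a previous rating without changing the count."""
        self.total += new_rating - old_rating

    @property
    def value(self):
        return self.total / self.count if self.count else None
```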
The was-this-helpful process keeps track of the total (T) number of votes and the count of positive (P) votes. It stores the output claim in the target review as the WasThisHelpful ratio claim, with the value P out of T.

Policies differ for cases when a reviewer is allowed to make significant changes to a review (for example, changing a formerly glowing comment into a terse “This sucks now!”). Many site operators simply revert all the was-this-helpful votes and reset the ratio. Even if your model doesn't permit edits to a review, for abuse mitigation purposes, this process still needs to be reversible.
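A sketch of that process, with our own names for the claim and its operations, might look like this:

```python
class HelpfulRatio:
    """Tracks positive votes (P) out of total votes (T) for a review."""

    def __init__(self):
        self.positive = 0  # P
        self.total = 0     # T

    def vote(self, helpful: bool):
        self.total += 1
        if helpful:
            self.positive += 1

    def reverse_vote(self, helpful: bool):
        """Undo a vote, e.g. when a voter is identified as abusive."""
        self.total -= 1
        if helpful:
            self.positive -= 1

    def reset(self):
        """Many operators clear all votes when the review changes significantly."""
        self.positive = 0
        self.total = 0

    def claim(self):
        return f"{self.positive} out of {self.total}"
```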
Karma models, especially public karma models, are subject to massive abuse by users interested in personal status or commercial gain. For that reason, this process must be reversible.

Now that we have a community-generated quality-karma claim for each user (at least those who have written a review noteworthy enough to invite helpful votes), you may notice that this model doesn't use that score as an input or weight in calculating other scores. This configuration is a reminder that reputation models all exist within an application context; the most appropriate use for this score will be determined by your application's needs.
Perhaps you will keep the quality-karma score as a corporate reputation, helping to determine which users should get escalating customer support. Perhaps the score will be public, displayed next to every one of a user's reviews as a status symbol for all to see. It might even be personal, shared only with each reviewer, so that reviewers can see what the overall community thinks of their contributions. Each of these choices has different ramifications, which we discuss in Chapter_6 in detail.
eBay is home to the Internet's most well-known and studied user reputation, or karma, system: seller feedback. Its reputation model, like most others that are several years old, is complex and continuously adapting to new business goals, changing regulations, improved understanding of customer needs, and the never-ending need to combat reputation manipulation through abuse. See Appendix_B for a brief survey of relevant research papers about this system and Chapter_9 for further discussion of the continuous evolution of reputation systems in general.
Rather than detail the entire feedback karma model here, we'll focus on claims that are from the buyer and about the seller. An important note about eBay feedback is that buyer claims exist in a specific context: a market transaction-a successful bid at auction for an item listed by a seller. This specificity leads to a generally higher quality-karma score for sellers than they would get if anyone could just walk up and rate a seller without even demonstrating that they'd ever done business with them; see Chap_1-implicit_reputation .
Figure_4-8 illustrates the seller feedback karma reputation model, which is made out of typical model components: two compound buyer input claims (seller feedback and detailed seller ratings, or DSRs) and several roll-ups of the seller's karma: community feedback ratings (a counter), feedback level (a named level), positive feedback percentage (a ratio), and the power seller rating (a label).
The context for the buyer's claims is a transaction identifier: the buyer may not leave any feedback before successfully placing a winning bid on an item listed by the seller in the auction market. Presumably, the feedback primarily describes the quality and delivery of the goods purchased. A buyer may provide two different sets of compound claims, seller feedback and detailed seller ratings, and the limits on each vary.
eBay displays an extensive set of karma scores for sellers: the amount of time the seller has been a member of eBay; color-coded stars; percentages that indicate positive feedback; more than a dozen statistics tracking past transactions; and lists of testimonial comments from past buyers or sellers. This is just a partial list of the seller reputations that eBay puts on display.
The full list of displayed reputations almost serves as a menu of reputation types present in the model. Every process box represents a claim displayed as a public reputation to everyone, so to provide a complete picture of eBay seller reputation, we'll simply detail each output claim separately:
It is fairly common for a buyer to change this score, within some time limitations, so this effect must be reversible. Sellers spend a lot of time and effort working to change negative and neutral ratings to positive ratings to gain or to avoid losing a power seller rating. When this score changes, it is then used to calculate the feedback level.
The positive and negative ratings are used to calculate the positive feedback percentage.
eBay added these categories as a new, separate reputation model only recently, because including them as factors in the overall seller feedback ratings diluted the overall quality of seller and buyer feedback. Sellers could end up in disproportionate trouble just because of a bad shipping company or a delivery that took a long time to reach a remote location. Likewise, buyers were bidding low prices only to end up feeling gouged by shipping and handling charges. Fine-grained feedback allows one-off small problems to be averaged out across the DSR community averages instead of being translated into red-star negative scores that poison trust overall. Fine-grained feedback for sellers is also actionable by them and motivates them to improve, since these DSR scores make up half of the power seller rating.
Though the context for the buyer's claims is a single transaction or history of transactions, the context for the aggregate reputations that are generated is trust in the eBay marketplace itself. If the buyers can't trust the sellers to deliver against their promises, eBay cannot do business. When considering the roll-ups, we transform the single-transaction claims into trust in the seller, and-by extension-that same trust rolls up into eBay. This chain of trust is so integral and critical to eBay's continued success that they must continuously update the marketplace's interface and reputation systems.
The popular online photo service Flickr uses reputation to qualify new user submissions and track user behavior that violates Flickr's terms of service. Most notably, Flickr uses a completely custom reputation model called “interestingness” for identifying the highest-quality photographs submitted from the millions uploaded every week. Flickr uses that reputation score to rank photos by user and, in searches, by tag.
Interestingness is also the key to Flickr's “Explore” page, which displays a daily calendar of the photos with the highest interestingness ratings, and users may use a graphical calendar to look back at the worthy photographs from any previous day. It's like a daily leaderboard for newly-uploaded content.
As with all the models we describe in this book, we've taken some liberties to simplify the model for presentation-specifically, the patent mentions various weights and ceilings for the calculations without actually prescribing any particular values for them. We make no attempt to guess at what these values might be. Likewise, we have left out the specific calculations.
We do, however, offer two pieces of advice for anyone building similar systems: there is no substitute for gathering historical data when you are deciding how to clip and weight your calculations, and-even if you get your initial settings correct-you will need to adjust them over time to adapt to the use patterns that will emerge as the direct result of implementing reputation. (See Chap_9-Emergent_Effects_and_Defects )
Figure_4-9 has two primary outputs: photo interestingness and interesting photographer karma. Everything else feeds into those two key claims.
Of special note in this model is the existence of a karma loop (represented in the figure by a dashed pipe). A user's reputation score influences how much “weight” his or her opinion carries when evaluating others' work (commenting on it, favoriting it, or adding it to groups): photographers with higher interestingness karma on Flickr have a greater voice in determining what constitutes “interesting” on the site.
Each day, Flickr generates and stores a list of the top 500 most interesting photos for the “Explore” page. It also updates the current interestingness score of each and every photo each time one of the input events occurs. Here we illustrate a real-time model for that update, though it isn't at all clear that Flickr actually does these calculations in real time, and there are several good reasons to consider delaying that action. See Chap_4-Keep_Your_Barn_Door_Closed , later in this chapter.
Since there are four main paths through the model, we've grouped all the inputs by the kind of reputation feedback they represent: viewer activities, tagging, flagging, and republishing. Each path provides a different kind of input into the final reputations.
Flickr tags include idiosyncratic labels such as 2009, me, Randy, Bryce, Fluffy, and cameraphone, along with the expected descriptive categories of wedding, dog, tree, landscape, purple, tall, and irony (which sometimes means “made of iron!”). Tagging gets special treatment in a reputation model because users must apply extra effort to tag an object, and determining whether one tag is more likely to be accurate than another requires complicated computation. Likewise, certain tags, though popular, should not be considered for reputation purposes at all. Tags make their own quantitative contribution to interestingness, but they are also considered viewer activities, so the input is split into both paths.
The smart reputation designer can, in fact, leverage this unfortunate truth. Build a corporate-user “porn probability” reputation into your system-one that identifies content with a high (or too-high) velocity of attention and puts it in a prioritized queue for human agents to review.
Generally, four things determine a Flickr photo's interestingness (represented by the four parallel paths in Figure_4-9 ): the viewer activity score, which represents the effect of viewers taking a specific action on a photo; tag relatedness, which represents a tag's similarity to others associated with other tagged photos; the negative feedback adjustment, which reflects reasons to downgrade or disqualify the tag; and group weighting, which has an early positive effect on reputation with the first few events.
Each qualifying viewer action starts with a viewer activity value of 0.5, because the process is likely to increase it. The process reads the interesting-photographer karma of the user taking the action (not the person who owns the photo) and increases the viewer activity value by some weighting amount before passing it on to the next process. As a simple example, we'll suggest that the increase in value will be a maximum of 0.25, with no effect for a viewer with no karma and 0.25 for a hypothetical awesome user whose every photo is beloved by one and all. The resulting score will be in the range 0.5 to 0.75. We assume that this interim value is not stored in a reputation statement for performance reasons.

The next process takes that interim score (0.5 to 0.75) and determines the relationship strength of the viewer to the photographer. The patent indicates that a stronger relationship should grant a higher weight to any viewer activity. Again, for our simple example, we'll add up to 0.25 for a mutual first-degree relationship between the users. Lower values can be added for one-way (follower) relationships or even relationships as members of the same Flickr groups. The result is now in the range of 0.5 to 1.0 and is ready to be added into the historical contributions for this photo.

An accumulator then sums these weighted event values (each in the range 0.5 to 1.0). It seems likely that this score is the primary basis for interestingness. The patent indicates that each sum is marked with a time stamp to track changes in viewer activity score over time. The sum is then denormalized against the available range, from 0.5 to the maximum known viewer activity score, to produce an output from 0.0 to 1.0, which represents the normalized accumulated score stored in the reputation system so that it can be used to recalculate photo interestingness as needed.
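Using the same made-up weights from the walkthrough above (none of which come from Flickr), the viewer activity path might be sketched like this:

```python
def viewer_activity_event(viewer_karma, relationship_strength):
    """Both inputs assumed normalized to 0.0..1.0; result falls in 0.5..1.0."""
    score = 0.5                             # every qualifying action starts here
    score += 0.25 * viewer_karma            # up to +0.25 for the viewer's karma
    score += 0.25 * relationship_strength   # up to +0.25 for viewer/owner ties
    return score

def normalized_viewer_activity(accumulated_sum, max_known_sum):
    """Denormalize against the range 0.5..max_known_sum, yielding 0.0..1.0."""
    if max_known_sum <= 0.5:
        return 0.0
    return max(0.0, min(1.0, (accumulated_sum - 0.5) / (max_known_sum - 0.5)))

events = [viewer_activity_event(0.8, 1.0), viewer_activity_event(0.0, 0.2)]
print(normalized_viewer_activity(sum(events), max_known_sum=10.0))
```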
Flickr Tag Relatedness
[0032] As part of the relatedness computation, the statistics engine may employ a statistical clustering analysis known in the art to determine the statistical proximity between metadata (e.g., tags), and to group the metadata and associated media objects according to corresponding cluster. For example, out of 10,000 images tagged with the word “Vancouver,” one statistical cluster within a threshold proximity level may include images also tagged with “Canada” and “British Columbia.” Another statistical cluster within the threshold proximity may instead be tagged with “Washington” and “space needle” along with “Vancouver.” Clustering analysis allows the statistics engine to associate “Vancouver” with both the “Vancouver-Canada” cluster and the “Vancouver-Washington” cluster. The media server may provide for display to the user the two sets of related tags to indicate they belong to different clusters corresponding to different subject matter areas, for example.
This is a good example of a black-box process that may be calculated outside of the formal reputation system. Such processes are often housed on optimized machines or are run continuously on data samples in order to give best-effort results in real time.
For our model, we assume that the output will be a normalized score from 0.0 (no confidence) to 1.0 (high confidence) representing how likely it is that the tag is related to the content. The simple average of all the scores for the tags on a photo is stored in the reputation system so that it can be used to recalculate photo interestingness as needed.
For illustration, let's say that it would only take five abuse reports to do the most damage possible to a photo's reputation. Using this math, each abuse report event would be worth 0.2. Negative feedback can be thought of as a reversible accumulator with a maximum value of 1.0.
Flickr official forum posts indicate that for the first five or so actions, this value quickly increases to its maximum value (1.0 in our system). After that, it stabilizes, so this process is also a simple accumulator, adding 0.2 for every event and capping at 1.0.
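Both the negative-feedback and group-weighting paths therefore behave like a capped (and reversible) accumulator. A sketch, using our illustrative step of 0.2 and ceiling of 1.0:

```python
class CappedAccumulator:
    """Adds a fixed step per event, capped at a ceiling; reversible for cleanup."""

    def __init__(self, step=0.2, cap=1.0):
        self.step = step
        self.cap = cap
        self.events = 0

    def add_event(self):
        self.events += 1

    def remove_event(self):
        self.events = max(0, self.events - 1)

    @property
    def value(self):
        return min(self.cap, self.events * self.step)

abuse = CappedAccumulator()
for _ in range(7):
    abuse.add_event()
print(abuse.value)  # -> 1.0 (capped after five reports)
```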
All of the inputs to the photo interestingness process arrive as claims in the range 0.0 to 1.0 and represent either positive (viewer activity score, tag relatedness, group weighting) or negative (negative feedback) effects on the claim.

The exact formulation for this calculation is not detailed in any documentation, nor is it clear that anyone who doesn't work for Flickr understands all its subtleties. But for illustration purposes, we propose this drastically simplified formulation: photo interestingness is made up of 20% each of group weighting and tag relatedness, plus 60% viewer activity score, minus negative feedback. A common early modification to a formulation like this is to increase the positive percentages enough that no minor component is required for a high score. For example, you could increase the 60% viewer activity score to 80% and then cap the result at 1.0 before applying any negative effects. A copy of this claim value is stored in the same high-performance database as the rest of the search-related metadata for the target photo.
The Flickr model is undoubtedly complex and has spurred a lot of discussion and mythology in the photographer community on Flickr.
It's important to reinforce the point that all of this computational work is in support of three very exact contexts: interestingness works specifically to influence photos' search rank on the site, their display order on user profiles, and ultimately whether or not they're featured on the site-wide “Explore” page. It's the third context, Explore, that introduces one more important reputation mechanic: randomization.
Each day's photo interestingness calculations produce a ranked list of photos. If the content of the “Explore” page were 100% determined by those calculations, it could get boring: first-mover effects (see Chap_3-First_Mover_Effects) predict that you would probably always see the same photos by the same photographers at the top of the list. Flickr lessens this effect by including a random factor in the selection of the photos:
Each day, the top 500 photos appear in randomized order. In theory, the photo with the 500th-ranked photo interestingness score could be displayed first and the one with the highest photo interestingness score could be displayed last. The next day, if they're still on the top-500 list, they could both appear somewhere in the middle.
This system has two wonderful effects:
What's truly wonderful is that this randomness doesn't harm Explore's efficacy in the least: given the scale and activity of the Flickr community, each and every day there are more than enough high-quality photos to fill a 500-photo list. Jumbling up the order for display doesn't detract from the experience of browsing them by one whit.
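Mechanically, the Explore mechanic amounts to ranking, truncating to 500, and shuffling the slice for display. Here is a sketch under those assumptions; the function and parameter names are ours, not Flickr's.

```python
import random

def explore_page(photos, interestingness, size=500):
    """Rank by interestingness, keep the top `size`, then shuffle for display."""
    ranked = sorted(photos, key=lambda p: interestingness[p], reverse=True)
    daily_list = ranked[:size]
    random.shuffle(daily_list)   # display order no longer reveals exact rank
    return daily_list
```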
As a business owner on today's Web, probably the greatest thing about social media is that the users themselves create the media from which you, the site operator, capture value. This means, however, that the quality of your site is directly related to the quality of the content created by your users.
This can present problems. Sure, the content is cheap-but you usually get what you pay for, and you will probably need to pay more to improve the quality. Additionally, some users have a different set of motivations than you might prefer.
We offer design advice to mitigate potential problems with social collaboration, and suggestions for specific nontechnical solutions.
As illustrated in the real-life models above, reputation can be a successful motivation for users to contribute large volumes of content and/or high-quality content to your application. At the very least, reputation can provide critical money-saving value to your customer care department by allowing users to prioritize the bad content for attention and likewise flag power users and content to be featured.
But mechanical reputation systems, of necessity, are always subject to unwanted or unanticipated manipulation: they are only algorithms, after all. They cannot account for the many, sometimes conflicting, motivations for users' behavior on a site. One of the strongest motivations of users who invade reputation systems is commercial. Spam invaded email. Marketing firms invade movie review and social media sites. And drop-shippers are omnipresent on eBay.
eBay drop-shippers put the middleman back into the online market: they are people who resell items that they don't even own. It works roughly like this: a seller lists an item that they don't actually have in stock; when a buyer wins and pays for the auction, the seller orders the item from a third-party drop-shipper, who ships it directly to the buyer, and the seller pockets the difference in price.
This model of doing business was not anticipated by the eBay seller feedback karma model, which only includes buyers and sellers as reputation entities. Drop-shippers are a third party in what was assumed to be a two-party transaction, and they cause the reputation model to break in various ways:
In effect, the seller can't make the order right with the customer without refunding the purchase price in a timely manner. This puts them out-of-pocket for the price of the goods along with the hassle of trying to recover the money from the drop-shipper.
But a simple refund alone sometimes isn't enough for the buyer! No, depending on the amount of perceived hassle and effort this transaction has cost them, they are still likely to rate the transaction negatively overall. (And rightfully so: once it's become evident that a seller is working through a drop-shipper, many of their excuses and delays start to ring very hollow.) So a seller may, at this point, have laid out a lot of their own time and money to rectify a bad transaction, only to still suffer the penalties of a red star.
What option does the seller have left to maintain their positive reputation? You guessed it-a payoff. Not only will a concerned seller eat the price of the goods-and any shipping involved-but they will also pay an additional cash bounty (typically up to $20.00) to get buyers to flip a red star to green.
What is the cost of clearing negative feedback on drop-shipped goods? The cost of the item + $20.00 + lost time in negotiating with the buyer. That's the cost that reputation imposes on drop-shipping on eBay.
The lesson here is that a reputation model will be reinterpreted by users as they find new ways to use your site. Site operators need to keep a wary eye on the specific behavior patterns they see emerging and adapt accordingly. Chapter_9 provides more detail and specific recommendations for prospective reputation modelers.
You will-at some point-be faced with a decision about how open to be (or not be) about the details of your reputation system. Exactly how much of your models' inner workings should you reveal to the community? Users inevitably will want to know:
This decision is not at all trivial: if you err on the side of extreme secrecy, you risk damaging your community's trust in the system that you've provided. Your users may come to question its fairness or-if the inner workings remain too opaque-they may flat-out doubt the system's accuracy.
Most reputation-intensive sites today attempt at least to alleviate some of the community's curiosity about how content reputations and user reputations are earned. It's not like you can keep your system a complete secret.
Equally bad, however, is divulging too much detail about your reputation system to the community. This is probably the more common mistake among site designers, especially in the early stages of deploying the system and growing the community. As an example, consider the highly specific breakdown of actions on the Yahoo! Answers site, and the points rewarded for each (see Figure_4-10).
Why might this breakdown be a mistake? For a number of reasons. Assigning overt point values to specific actions goes beyond enhancing the user experience and starts to directly influence it. Arguably, it may tip right over into the realm of dictating user behavior, which generally is frowned upon.
A detailed breakdown also arms the malcontents in your community with exactly the information they need to deconstruct your model. And they won't even need to guess at things like relative weightings of inputs into the system: the relative value of different inputs is right there on the site, writ large. Try, instead, to use language that is clear and truthful without necessarily being comprehensive and exhaustively complete.
Some language that is both understandably clear and appropriately vague
The exact formula that determines medal-achievement will not be made public (and is subject to change) but, in general, it may be influenced by the following factors: community response to your messages (how highly others rate your messages); the amount of (quality) contributions that you make to the boards; and how often and accurately you rate others' messages.
Staying vague does not mean, of course, that some in your community won't continue to wonder, speculate, and talk among themselves about the specifics of your reputation system. Algorithm gossip has become something of a minor sport on collaborative sites like Digg and YouTube.
For some participants, guessing at the workings of reputations like “highest rated” or “most popular” is probably just that-an entertaining game and nothing more. Others, however, see only the benefit of any insight they might be able to gain into the system's inner workings: greater visibility for themselves and their content; more influence within the community; and the greater currency that follows both. (See Chap_5-Egocentric_Incentives .)
Here are some helpful strategies for masking the inner workings of your reputation models and algorithms.
Time is on your side. Or it can be, in one of a couple of ways. First, consider the use of time-based decay in your models: recent actions “count for” more than actions in the distant past, and the effects of older actions decay (lessen) over time. Incorporating time-based decay has several benefits.
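One common way to implement such decay, offered only as a sketch, is to weight each event by an exponential half-life; the half-life value below is an arbitrary example.

```python
import time

HALF_LIFE_DAYS = 30.0  # arbitrary example; tune against your own historical data

def decayed_score(events, now=None):
    """events: iterable of (timestamp_seconds, value) pairs."""
    now = now or time.time()
    score = 0.0
    for ts, value in events:
        age_days = (now - ts) / 86400.0
        # Each halving of influence takes HALF_LIFE_DAYS.
        score += value * 0.5 ** (age_days / HALF_LIFE_DAYS)
    return score

recent = (time.time() - 86400, 1.0)        # yesterday: counts nearly in full
old = (time.time() - 90 * 86400, 1.0)      # three months ago: contributes ~0.125
print(decayed_score([recent, old]))
```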
It's also beneficial to delay the results of newly triggered inputs. If a reasonable window of time exists between the triggering of an input (marking a photo as a favorite, for instance) and the resulting effect on that object's reputation (moving the photo higher in a visible ranking), it confounds a gaming user's ability to do easy what-if comparisons (particularly if the period of delay is itself unpredictable).
When the reputation effects of various actions are instantaneous, you've given the gamers of your system a powerful analytic tool for reverse-engineering your models.
We've already cautioned that it's important to keep your system flexible (see Chap_9-Plan_For_Change ). That's not just good advice from a technical standpoint, but from a social and strategic one as well. Put simply: leave yourself enough wiggle room to adjust the impact of different inputs in the system (add new inputs, change their relative weightings, or eliminate ones that were previously considered). That flexibility gives you an effective tool for confounding gaming of the system. If you suspect that a particular input is being exploited, you at least have the option of tweaking the model to compensate for the abuse. You will also want the flexibility of introducing new types of reputations to your site (or retiring ones that are no longer serving a purpose.)
It is tricky, however, to enact changes like these without affecting the social contract you've established with the community. Once you've codified a certain set of desired behaviors on your site, some users will (understandably) be upset if the rug gets pulled out from under them. This risk is yet another argument for avoiding disclosure of too many details about the mechanics of the system, or for downplaying the system's importance.
The first section of this book (Chapters 1-4) was focused on reputation theory:
Along the way, we sprinkled in practitioner's tips to share what we've learned from existing reputation systems to help you understand what could, and already has, gone wrong.
Now you're prepared for the second section of the book: applying this theory to a specific application-yours. Chapter_5 starts the project off with three basic questions about your application design. In haste, many projects skip over one or more of these critical considerations-the results are often very costly.