
chapter_9 [2009/11/18 18:22]
randy
chapter_9 [2009/12/01 14:49] (current)
randy submitted for publisher review
===== Application Integration, Testing & Tuning =====
If you've been following the steps provided in <html><a href="/doku.php?id=Chapter_5">Chapter_5</a>&nbsp;</html>through <html><a href="/doku.php?id=Chapter_8">Chapter_8</a>&nbsp;</html>, you know your goals, have a diagram of your reputation model with initial calculations formulated, and have a handful of screen mock-ups showing how you will gather, display, and otherwise use reputation to increase the value of your application. You have ideas and plans; now it is time to reduce it all to code and start seeing how it all works together.
<html><a name='Chap_9-Application_Integration'></a></html>
==== Integrating With Your Application ====
A reputation system does not exist in a vacuum; it is a small machine in your larger application. There are many fine-grained connections between it and your various data sources, such as logs, event streams, the identity database, the entity database, and your high-performance data store. Wiring it up will most likely require custom programming to connect your reputation engine to subsystems that were never connected before.

This step is often overlooked in scheduling, but it may take up a significant amount of your total project development time. Small tuning adjustments are usually required once the inputs are actually hooked up in a release environment. This chapter will help you understand how to plan for connecting the reputation engine to your application, and what final decisions you will need to make about your reputation model.
Besides, who knows your reputable entities better than the application team? They build the software that gives your entities meaning. Engaging these key stakeholders early allows them to contribute to the model design and prepares them for the nature of the coming changes.
Don't wait to share details about the reputation model design process until after screen mocks are distributed to engineering for scheduling estimates-there's too much happening on the reputation back-end that isn't represented in those images.
</note>
<html><a href="/doku.php?id=Appendix_A">Appendix_A</a>&nbsp;</html>contains a deeper, architecture-oriented look at how to define the reputation framework: the software environment for executing your reputation model. Any plan to implement your model will require significant software engineering, so sharing that resource with the team is essential. Reviewing the framework requirements will lead to many questions from the implementation team about specific trade-offs related to issues such as scalability, reliability, and shared data. The answers will put constraints on your development schedule and the application's capabilities. One lesson is worth repeating here: the process boxes in the reputation model diagram are a notational convenience and //advisory//-they are not implementation requirements.
<note tip>
This challenge requires that the reputation model implementation be resilient in the face of missing inputs. One simple strategy is to give the reputation processes that handle inputs reasonable default values for every input. Inferred Karma (<html><a href="/doku.php?id=Chapter_6#Chap_6-Inferred_Karma">Chap_6-Inferred_Karma</a>&nbsp;</html>) is an example. This approach also copes well if a previously reliable source of inputs becomes inactive, either through a network outage or simply a localized application change.
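The defaults strategy can be sketched in a few lines. This is an illustrative example only, not the book's implementation; all field names, weights, and the scoring formula are assumptions made up for the sketch:

```python
# Sketch: a reputation process that tolerates missing inputs by falling
# back to reasonable defaults. All names and weights are illustrative.

DEFAULTS = {
    "account_age_days": 0,     # assume a brand-new user if the identity db is silent
    "confirmed_email": False,  # assume unconfirmed when the signal is absent
    "prior_posts": 0,
}

def inferred_karma(event):
    """Compute a karma score from an input event, substituting defaults
    for any signals the sending subsystem failed to provide."""
    age = event.get("account_age_days", DEFAULTS["account_age_days"])
    confirmed = event.get("confirmed_email", DEFAULTS["confirmed_email"])
    posts = event.get("prior_posts", DEFAULTS["prior_posts"])

    # Placeholder weighting; real weights come from tuning the model.
    score = min(age, 365) / 365 * 0.5
    score += 0.2 if confirmed else 0.0
    score += min(posts, 100) / 100 * 0.3
    return round(score, 3)
```

Because every lookup has a default, a completely empty event still produces a valid (if conservative) score, so a dead input source degrades the model gracefully instead of breaking it.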
Explicit inputs, such as ratings and reviews, take much longer to implement because they have significant user-interface components. Consider the overhead of something as simple as a thumbs-up/thumbs-down voting model: what does it look like if the user hasn't voted? What if they want to change their vote? What if they want to remove their vote altogether?
For models with many explicit reputation inputs, all of this work can cause a waterfall effect on testing the model: waiting until the user interface is done to test the model causes the testing period to be very short because of management pressure to deliver new features-“The application //looks// ready, why haven't we shipped?”
We found that getting a primitive user interface in place quickly for testing is essential. Our voting example above can be quickly represented in a web application as two text links, “Vote Yes” and “Vote No”, with text next to them representing the tester's previous vote: “(You [haven't] voted [Yes|No].)” Trivial to implement, no art requirements, no mouse-overs, no compatibility testing, no accessibility review, no pressure to ship early, but completely functional. This approach allows the reputation team to test the input flow and the functionality of the model. This sort of development interface is also amenable to robotic regression testing.
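The whole primitive widget, including vote change and removal, fits in a handful of lines. This is a hypothetical sketch (an in-memory dict standing in for real storage, plain strings standing in for links), not production code:

```python
# Sketch of the bare-bones voting interface described above: two text
# links plus a previous-vote status string. Storage is an in-memory
# dict for illustration; a real app would persist votes per user.

votes = {}  # user_id -> "Yes" or "No"

def cast_vote(user_id, choice):
    """Record a vote, change an existing one, or (with None) remove it."""
    if choice is None:
        votes.pop(user_id, None)
    elif choice in ("Yes", "No"):
        votes[user_id] = choice

def render_widget(user_id):
    """Return the plain-text widget: the two links plus vote status."""
    status = votes.get(user_id)
    if status is None:
        return "Vote Yes | Vote No  (You haven't voted.)"
    return "Vote Yes | Vote No  (You voted %s.)" % status
```

Because the interface is pure text in and text out, a regression-test robot can drive `cast_vote` and check `render_widget` output directly, with no browser automation required.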
=== Applied Outputs ===
The simplest output is reflecting explicit reputation back to users-showing their star rating for a camera back to them when they visit that camera again in the future, or on their profile for others to see. The next level of output is the display of roll-ups, such as the average rating from all users for that camera. The specific patterns for these are discussed in detail in <html><a href="/doku.php?id=Chapter_7">Chapter_7</a>&nbsp;</html>. Unlike the case with integrating inputs, these outputs can be simulated easily by the reputation implementation team on their own, so there isn't a dependency on other application teams to determine whether a roll-up result is accurate. One useful practice while debugging a model is to log every input along with the changes to the roll-ups it generated, giving a historical view of the model's state over time.
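The input-plus-roll-up logging practice can be sketched like this, using a simple-average roll-up as the example; the class and log format are assumptions for illustration:

```python
# Sketch: a simple-average roll-up that logs every input alongside the
# roll-up change it caused. The logger name and format are illustrative.
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("reputation")

class AverageRollup:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    @property
    def average(self):
        return self.total / self.count if self.count else 0.0

    def add_rating(self, entity_id, rating):
        before = self.average
        self.count += 1
        self.total += rating
        # One log line per input yields a replayable history of the
        # model's state over time, which is invaluable when debugging.
        log.debug("entity=%s input=%s avg %.3f -> %.3f",
                  entity_id, rating, before, self.average)
        return self.average
```

Replaying the resulting log against a fixed model version makes it easy to spot exactly which input first pushed a roll-up to an unexpected value.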
But, as we detailed in <html><a href="/doku.php?id=Chapter_8">Chapter_8</a>&nbsp;</html>, these explicit displays of reputation aren't usually the most interesting or valuable: using reputation to identify and filter the best (and worst) reputable entities in your application is. Using reputation output to perform these tasks is more deeply integrated with the application; for example, search results may be ranked by a combination of a keyword search and a reputation score, or a user's report of TOS-violating content might compare the karma of the content's author to that of the reporter. These context-specific uses require tight integration with the application.
This leads to an unusual suggested implementation strategy-code the complex reputation uses //first//. Get the skeleton reputation-influenced search results page working even before the real inputs are built. Inputs are easy to simulate, the reputation model needs to be debugged, and the application-side weights used for the search will need tuning. This approach will also quickly expose any scaling sensitivities in the system-in web applications, search tends to consume the most resources by far. Save the fiddling over the screen presentation of roll-ups for last.
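A reputation-influenced ranking like the one described can be sketched as a weighted blend of the two scores. The weights and tuple layout here are placeholder assumptions; the whole point of building this skeleton early is that these weights will need tuning:

```python
# Sketch: blending keyword relevance with a reputation score when
# ranking search results. The 0.7/0.3 split is a placeholder to be
# tuned, not a recommendation.

def ranked_results(results, relevance_weight=0.7, reputation_weight=0.3):
    """results: iterable of (entity_id, relevance, reputation) tuples,
    each score normalized to 0.0-1.0. Returns entity ids, best first."""
    def combined(item):
        _, relevance, reputation = item
        return relevance_weight * relevance + reputation_weight * reputation
    return [entity for entity, _, _ in
            sorted(results, key=combined, reverse=True)]
```

With simulated inputs feeding the reputation side, this skeleton lets the team exercise the expensive search path, and surface its scaling behavior, long before real users supply any data.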
<html><a name='Chap_9-Beware-Feedback-Loops'></a></html>
This mis-application of your credit rating creates a //feedback loop//. This is a situation in which the inputs into the system (in this case, your employment) are dependent in some part upon the output from the system.
Why are feedback loops bad? Well, as the Times points out-feedback loops are self-perpetuating and, once started, nigh-impossible to break. Much like in music production (Jimi Hendrix notwithstanding), feedback loops are generally to be avoided because they muddy the fidelity of the signal.
<html><a name='Chap_9-Plan_For_Change'></a></html>
Also pay some heed to the manner in which you introduce new reputation-related features to your community.
  * Have your community manager announce the features on your product blog, along with a solicitation for public feedback and input. That last part is important-though these may be feature additions or changes like any other, oftentimes they are fundamentally transformative to the experience of engaging with your application. Make sure that people know they have a voice in the process and that their opinion counts.
  * Be careful to be simultaneously clear-in describing what the new features are-and vague in describing exactly how they work. You want the community to become familiar with these fundamental changes to their experience, so that they're not surprised or, worse, offended when they first encounter them in the wild. But you //don't// want everyone immediately running out to “kick the tires” of the new system, poking, prodding, and trying to earn reputation to satisfy their “thirst for first.” See <html><a href="/doku.php?id=Chapter_5#Chap_5-The_Quest_For_Mastery">Chap_5-The_Quest_For_Mastery</a>&nbsp;</html>.
  * There is a certain class of changes that you probably shouldn't announce at all. Low-level tweaking of your system-the addition of a new input, re-adjusting the weightings of factors in a reputation model-can usually be done on an ongoing basis and, for the most part, silently. (This is not to say that your community won't notice, however - do a web search on “YouTube most popular algorithm” to see just how passionately and closely that community scrutinizes every reputation-related tweak.)
<html><a name='Chap_9-Testing'></a></html>
==== Testing Your System ====
As with all new software deployments, several phases of testing are recommended: bench testing, environmental testing (aka Alpha), and pre-deployment testing (aka Beta). Note that we don't mean web-Beta, which has come to mean deployed applications that users can assume to be unreliable-we mean pre- or limited-deployment testing.
A well-coded reputation model should function with simulated inputs. This allows the reputation implementation team to confirm that messages flow through the model correctly and provides a means to test the accuracy of the calculations and the performance of the system.
Rushed development budgets often cause project staff to skip this step to save time and instead focus the extra engineering resources on rigging the inputs or implementing a new output-after all, there's nothing like real data to let you know if everything's working properly, right? In the case of reputation model implementations, this assumption has proven both false and costly every single time we've seen it deployed. Bench testing would have saved hundreds of thousands of dollars in effort on the Yahoo! Shopping Top Reviewer karma project.
<box blue 75% round>
** Bench Test Your Model With the Data You Already Have. Always. ** ** Bench Test Your Model With the Data You Already Have. Always. **
Over several weeks and dozens of meetings, the team defined the model using a prototype of the graphical grammar presented in this book. The final version was very similar to the model presented in <html><a href="/doku.php?id=Chapter_4#Chap_4-User_Reviews_with_Karma">Chap_4-User_Reviews_with_Karma</a>&nbsp;</html>. The weighting constants were carefully debated and set to favor quality, with a quality score worth 4 times the value of writing a review. The team also planned to give back-dated credit to reviewers by writing an input simulator that read the current ratings-and-reviews database and ran the entries through the reputation model.
The planning took so long that the implementation schedule was crushed-the only way to get it to deployment in time was to code it quickly and enable it immediately. No bench testing, no analysis of the model or of the back-dated input simulator. The application team made sure the pages loaded and the inputs all got sent, and then pushed it live in early October.
The good news was that everything was working. The bad news? It was //really// bad-every single user on the Top Reviewer 100 list had something in common. They all wrote dozens or hundreds of CD reviews. All music users, all the time. Most of the reviews were “I liked it” or “SUX0RZ” and the helpful scores almost didn't figure into the calculation at all. It was too late to change anything significant in the model and so the project failed to accomplish its goal.
A simple bench test with the currently available data would have revealed the fatal flaw in the model: the presumed reputation context was just plain //wrong//-there is no such thing as a global “Yahoo! Shopping” context for karma. The team should have implemented per-product-category reviewer karma: who writes the best digital camera reviews? Who contributes the classical CD reviews that others find the most helpful?
</box>
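The flaw in the sidebar's model is easy to see in code. The sketch below is illustrative, not the actual Yahoo! implementation: the 4:1 quality-to-review weighting comes from the story above, and keeping a separate score per product category is the fix the team should have shipped. All names are assumptions:

```python
# Sketch of a Top Reviewer karma calculation: quality (helpful votes)
# is weighted 4x the act of writing a review, per the sidebar above.
# Scoring per product category, instead of globally, prevents prolific
# low-quality reviewers in one category from topping a global list.
from collections import defaultdict

QUALITY_WEIGHT = 4.0   # value of one helpful vote on a review
REVIEW_WEIGHT = 1.0    # value of simply writing a review

def reviewer_karma(reviews):
    """reviews: list of (category, helpful_votes), one entry per review
    a user wrote. Returns {category: karma} for that user."""
    karma = defaultdict(float)
    for category, helpful_votes in reviews:
        karma[category] += REVIEW_WEIGHT + QUALITY_WEIGHT * helpful_votes
    return dict(karma)
```

Running a bench test of this function over the existing reviews database would have shown immediately that users with piles of zero-helpful CD reviews still out-scored careful reviewers under a single global context.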
Besides accuracy and determining the suitability of the model for its intended purposes, one of the most important benefits of bench testing is stress testing of performance. Almost by definition, initial deployment of a model will be incremental-smaller amounts of data are easier to track and debug, and there are fewer people to disappoint if the new feature doesn't always work or is a bit messy. In fact, bench testing is the only time the reputation team will be able to accurately predict the performance of the model under stress until long after deployment, when some peak usage brings it to the breaking point, potentially disabling your application.
Do not count on the next two testing phases to stress test your model. They won't-that isn't what they are for.
Professional-grade testing frameworks, usually driven by scripting languages such as JavaScript or PHP, are available as open source and as commercial packages. Use one to automate simulated inputs to your reputation model code, as well as to simulate the reputation output events of a typical application, such as searches, profile displays, and leaderboards. Establish target performance metrics and test various normal- and peak-operational load scenarios. Run it until it breaks, and either tune the system and/or establish operational contingency plans with the application engineers. For example: say that hitting the reputation database for a large number of search results is limited to one hundred requests per second, and the application team expects that to be sufficient for the next few months-after which either another database request processor will be deployed, or the application will get more performance by caching common searches in memory.
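The core of such a load test is just a measured replay loop. Here is a minimal sketch (the handler, event shape, and the 100 requests/second budget are all hypothetical, echoing the example above); a real harness would add concurrency, warm-up, and percentile latency reporting:

```python
# Sketch: a minimal bench-test harness that replays simulated events
# against a model function and reports sustained throughput versus a
# target budget (e.g., a hypothetical 100 requests/second limit).
import time

def measure_throughput(handler, events, target_rps):
    """Feed every simulated event to `handler`, then compare the
    achieved events/second against the target. Returns (rps, ok)."""
    events = list(events)
    start = time.perf_counter()
    for event in events:
        handler(event)
    elapsed = time.perf_counter() - start
    rps = len(events) / elapsed if elapsed > 0 else float("inf")
    return rps, rps >= target_rps
```

Re-running this harness with load shapes observed during Beta (see below) keeps the performance prediction honest as real usage patterns emerge.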
After bench-testing has begun and there is some confidence that the reputation model code is stable enough for the application team to develop against, crude integration can begin in earnest. As suggested in <html><a href="/doku.php?id=Chapter_9#Chap_9-Rigging_Inputs">Chap_9-Rigging_Inputs</a>&nbsp;</html>, application developers should go for breadth (getting all the inputs and outputs quickly inserted) instead of depth (getting a single reputation score input/output working well). Once this reputation scaffolding is in place, both the application team and the reputation team can test the characteristics of the model in its actual operating environment.
Also, any formal or informal testing staff that are available can start using the new reputation features while they are still in development, allowing for feedback about both calculation and presentation. This is when the fruits of the reputation designer's labor begin to manifest-an input leads to a calculation, which leads to some valuable change in the application's output. It is most likely that this phase will find minor problems in calculation and presentation, while it is still inexpensive to fix them.
Depending on the size and duration of this testing phase, initial reputation model tuning may be possible. One word of warning, though: testers at this phase, even if they are from outside your formal organization, are not usually representative of your post-deployment users, so be careful what conclusions you draw about their reputation behavior. Someone who is drawing a paycheck, or was given special-status access, is NOT a typical user, unless your application is for a corporate intranet.
== Performance: Testing Scale ==
Although the maximum throughput of the reputation system should have been determined during the bench testing phase, engaging a large number of users during the Beta test will reveal a much more realistic picture of the expected use patterns in deployment. The shapes of peak usage, the distribution of inputs, and especially the reputation query rates should be measured, and the bench tests should be re-run using these observations. This should be done at least twice-halfway through the Beta, and a week or two before deployment-especially as more testers are added over time.
Reputation systems change the way applications display content. Those changes add elements to the user interface that require additional space, demand that users learn new behaviors, and change the flow of the application significantly. A good example of this effect is when search URL reputation (page ranking) replaced hand-built directories as the primary method for finding content on the web.
When a reputation-enabled application enters pre-deployment testing, tracking the actions of users (their clicks, their evaluations, their content contributions, even their eye movements) provides important information to optimize the effectiveness of the model and the application as a whole.
== Feedback: Evaluating Customer Satisfaction ==
Despite our focus on measuring the performance and flow of user interaction, we'd like to caution that pure quantitative testing can lead to faulty conclusions about the effectiveness of your application, especially if the metrics are not as positive as you expected. Everyone knows that when metrics are bad, all they tell you is that you've done something wrong, not what it is. But that is also often true for good metrics: a lot of page views doesn't always mean you have a healthy or profitable product. Sometimes it's quite the contrary; controversial objects generate a lot of heat (in the form of online discussion) but can create negative value for the provider.
In the beta phase, explicit feedback is required to help understand how the application, and especially the reputation system, is perceived by the users. Besides multiple opt-in feedback channels, such as email or message boards, guided surveys are strongly recommended. In our experience, opt-in message formats don't accurately represent the opinions of the largest group of users, the //lurkers//: those that only consume reputation and never explicitly evaluate anything. At least in applications that are primarily advertising supported, the lurkers actually produce the largest chunk of revenue.
Every change to the reputation model and application should be measured against all corporate and success-related metrics. Resist the desire to tune things unless you have a specific goal to change one or more of your most important metrics.
<box blue 75% round>
** Beware Excessive Tuning: The Hawthorne Effect **
But the story of this effect gets even more interesting and relevant.
The Wikipedia entry for the Hawthorne effect has a rather large section entitled //Interpretations and criticisms of the Hawthorne studies//, which quotes many modern scholars challenging many of the details in Landsberger's analysis, made two decades after the fact. From that entry: “A psychology professor at the University of Michigan, Dr. Richard Nisbett, calls the Hawthorne effect 'a glorified anecdote.' 'Once you've got the anecdote,' he said, 'you can throw away the data.'”
The existence of significant questions surrounding this effect reinforces the fact that, when it comes to human behavior, there is a tendency to over-extrapolate from the available data while ignoring all of the factors that aren't quantitatively measured, or even measurable.
This problematic simplified extrapolation can also happen while tuning reputation models: It's easy to say “Oh! I know what they're doing.” That's fine as far as it goes, but the school of hard knocks has taught us that for every behavior you posit, there are at least two more you are missing. If you really want to know how your model is working, you'll have to do both qualitative and quantitative research. Use your metrics to create groups based on similar activity patterns and then reach out and //ask// them why they do what they do.
</box>
== Model Tuning ==
  * Establishing the pattern that reputation can, and will, change over time helps set expectations with the early-adoption users. Getting them used to changes will make future tuning less disruptive to the community.
</note>
Much of the tuning to reputation models will be opaque to end users. For example, corporate reputations (internal-only) such as Spammer-IP can be tuned and re-tuned with impunity; in fact, they should be tuned regularly to compensate for improved knowledge and as abusers learn to work their way around the system.
When tuning, an A-B test in which the proposed changes and the old model run side by side would be ideal, but most application and metrics environments make this cost-prohibitive. Alternatively, when tuning a reputation model, keep a backup snapshot of both the model code //and// the values of the critical metrics of the original for comparison. If, after a few days or weeks, the tuned model under-performs against the previous version, it will be less painful to return to the backup.
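The snapshot-and-compare approach can be sketched in a few lines. The metric names, values, and the 5% tolerance below are our own illustrative assumptions, not a prescription:

```python
def snapshot(model_version, metrics):
    """Record a model identifier alongside the critical metrics it produced."""
    return {"model": model_version, "metrics": dict(metrics)}

def underperforms(baseline, candidate, tolerance=0.05):
    """Return the names of metrics on which the tuned model fell more than
    `tolerance` (5% by default) below the saved baseline snapshot."""
    worse = []
    for name, old_value in baseline["metrics"].items():
        new_value = candidate["metrics"].get(name, 0.0)
        if new_value < old_value * (1.0 - tolerance):
            worse.append(name)
    return worse

# Hypothetical snapshots taken before and after a model change
before = snapshot("model-v1", {"daily_votes": 1000, "flags_resolved": 200})
after = snapshot("model-v2", {"daily_votes": 1020, "flags_resolved": 150})
regressed = underperforms(before, after)   # ["flags_resolved"]
```

If `regressed` is non-empty after a reasonable observation window, that is the signal to fall back to the backed-up model rather than keep tuning forward.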
There are a number of application- and reputation-system-related problems that will probably only come to light under real use, from a community of real users, over some extended duration of time. You might see hints of these misunderstandings or mis-comprehensions during early-stage user testing, but you'll have little means of gauging their severity until you analyze the data in bulk. Forgive us an extended example, again from the world of Yahoo! Answers, but it illustrates the kind of back-and-forth tune-observe-tune rhythm that you may need to fall into to truly be responsive in improving the performance of the reputation-related elements of your application.
Once upon a time, the Yahoo! Answers interface featured a simple, plain “Star” mechanism associated with a question. The original design intent for the star was to act as a sort of lightweight endorsement of an item, somewhat akin to Facebook's “Like” control. Star-vote totals were to be displayed next to questions in listings, and also feed into a “Most Starred” widget (actually, a tab on a widget, displayed alongside Recent and Popular questions) at the top of the site. When viewing a particular question, you could see a listing of all other users that had Starred that question.
As a convenience for users, there was one more feature: Answers would keep a list of all the questions that //you// had Starred, and display those on your Profile for others to see (or, if you opted to keep them private, for your eyes only). It was this final feature that may have tipped the utility for some users away from seeing stars primarily as a voting mechanism and toward seeing them as a kind of quasi-bookmark for questions.
Up to this point, a user's profile had only ever displayed questions that she'd asked or answered; there was no facility for saving an arbitrary question posed by anyone on the site. Stars finally gave this functionality to Answers users. One might think, this shouldn't be a problem, right? It's just a convenient and emergent use of the Star feature. William Gibson said “The street finds its own use for things.” (See more about emergence in <html><a href="/doku.php?id=Chapter_9#Chap_9-Emergent_Effects_and_Defects">Chap_9-Emergent_Effects_and_Defects</a>&nbsp;</html>.)
But the ancillary, downstream reputation effects of those star-votes were still being compiled, and still being applied to some very prominent listings on the site. Remember, those star-votes completely determined the placement of questions in the Most Starred listing. Over time, a disconcerting effect started to take place: users who were, in good faith, reporting bad content as //abusive// (see <html><a href="/doku.php?id=Chapter_8#Chap_8-Report_Abuse">Chap_8-Report_Abuse</a>&nbsp;</html>) would subsequently Star those very same questions, to save them for later review. (Probably to come back later and determine whether or not their complaints had been acted upon by Yahoo moderators.)
As a result, the “Most Starred” tab, featured at a high and prominent level of the site, was, with alarming regularity, filling up with the absolute //worst// content on the site! In fact, the worst of the worst: this was the stuff that users felt strongly enough about to report it to Yahoo. And, given the unbearable time-lags between reporting and moderation on Answers in those days, these horrible questions were actually being //rewarded// with higher visibility on the site for a prolonged period of time.
The Star feature had backfired entirely. When measured against the original metrics laid out for the project (to encourage easier identification of high-quality content on the site) it was evident that a redesign was called for.
In response, the Answers team put some features in place to actually facilitate the report-then-save behavior that they were noticing, but to do so in a way that did not have downstream reputation ramifications. The approach was two-pronged: first, they clarified the purpose and intent of the star-vote (adding the simple label “Interesting!” to the Star button was a huge improvement); second, they provided a different facility for saving a question, one intended to be personal-only and not displayed back to the community (and with no observable downstream reputation ramifications). “Watchlists” on Answers now let a user mark something for future reference, but don't assume any specific judgment about the quality of the question being watched. (<html><a href="#Figure_9-1">Figure_9-1</a>&nbsp;</html>.)
<html><a name="Figure_9-1"><center></html>// Figure_9-1: By giving users a simple, private Watchlist, the Answers designers responded to the needs of Abuse Reporters who wanted to check back in on bad content. //<html></center></a></html>
<html><center><img width="65%" src="http://buildingreputation.com/lib/exe/fetch.php?media=Figure_9-1.png"/></center></html>
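One way to see the Answers fix is as routing two different user intents to two different stores, only one of which feeds reputation. This is purely our own sketch with invented names, not Yahoo!'s implementation:

```python
# Public star-votes feed rankings; private "watch" saves do not.
star_counts = {}    # question_id -> public star-vote tally (feeds rankings)
watchlists = {}     # user_id -> set of privately watched question_ids

def handle_event(user_id, question_id, action):
    if action == "star":       # explicit quality endorsement
        star_counts[question_id] = star_counts.get(question_id, 0) + 1
    elif action == "watch":    # bookmark only: no downstream reputation effect
        watchlists.setdefault(user_id, set()).add(question_id)

def most_starred(limit=10):
    """The ranking sees only deliberate endorsements,
    never items merely saved for later review."""
    return sorted(star_counts, key=star_counts.get, reverse=True)[:limit]
```

Under this separation, a reporter who "watches" an abusive question to check on it later contributes nothing to the Most Starred listing.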
Conditions like these are unlikely to be uncovered during early-stage design and planning, nor will their gravity be easily assessed from small-scale user testing. These are truly application tweaks that will only start to come to light under the load of a public beta. (Though they may crop up again at any time once the application is in production! Stay nimble, keep an eye on metrics, and pay attention to how folks are actually using the features you've provided.)
<html><a name='Chap_9-Emergent_Effects_and_Defects'></a></html>
== Emergent Effects and Emergent Defects ==
It's quite possible that, even during the beta period of your deployment, you're noticing some strange effects starting to take hold. Perhaps content items are rising in the ranks that don't entirely seem… deserving somehow. Or maybe you're noticing a predominance of a certain kind of content at the expense of other types. What you're seeing is the character of your community shaking itself out, finding its edges and defining itself. Tread carefully before deciding how (and if) to intervene.
Check out Delicious's //Popular Bookmarks// ranking for any given week: we bet you'll see a whole lot of “Top N” blog articles. (See <html><a href="#Figure_9-2">Figure_9-2</a>&nbsp;</html>) Why might this be? Technology essayist Paul Graham posits that it may be the users of the service, and their motivational mindset, that explain it: “Delicious users are collectors, and a list of n things seems particularly collectible because it's a collection itself.” (Graham explores the “List of N Things” phenomenon in some depth at [[http://www.paulgraham.com/nthings.html|]].) The preponderance of lists on Delicious is a natural offshoot of its context of use, an emergent effect, and is probably //not// one that you would worry about, nor try to control in any way.
<html><a name="Figure_9-2"><center></html>// Figure_9-2: What are people saving on Delicious? Lists, lists and more lists… (and there's nothing wrong with that.) //<html></center></a></html>
<html><center><img width="65%" src="http://buildingreputation.com/lib/exe/fetch.php?media=Figure_9-2.png"/></center></html>
But you may also be seeing the effects of some design decisions that you've made, and you may want to tweak those designs now, before wider deployment. Blogger and social media maven Muhammad Saleem noticed one such problem with voting on socially-driven news sites like Digg:
We are beginning to see a trend where people make assumptions about the contents of an article based on the meta-data associated with the submission rather than reading the article itself. Based on these (oft-flawed) assumptions, people then vote for or against the stories, and even comment on the stories without having read the stories themselves.
</blockquote>
We've noticed a similar tendency in some community-voting sites we've worked on at Yahoo! and have come to consider behavior like this to be a type of emergent //defect//: behavior that is home-grown within the community and may even become a //de facto// standard for interacting, but is not necessarily valued. In fact, it's basically a //bug//, and a failing of your system or, more likely, your user interface design.
In instances like these, you should consider tweaking your design to encourage the proper and appropriate use of the controls you're providing. In some ways, it's not surprising that Digg users are voting on articles based on only surface appraisals: the application's very design in fact encourages this. (See <html><a href="#Figure_9-3">Figure_9-3</a>&nbsp;</html>.)
<html><a name="Figure_9-3"><center></html>// Figure_9-3: The design of Digg enables (one might argue, encourages) voting for articles at a high level of the site. This excerpted screen is the front page of Digg: users can vote for (Digg) an article, or against (bury) it, with no need to read further. //<html></center></a></html>
<html><center><img width="65%" src="http://buildingreputation.com/lib/exe/fetch.php?media=Figure_9-3.png"/></center></html>
Of course, one should not presuppose that the Digg folks think of this behavior (if it's even as widespread as Saleem indicates) as a defect. Again, it's a careful balance between the actual observed behavior of users using your system and your own predetermined goals and aspirations for the application.
It's quite possible that Digg feels that high voting levels (even if some percentage of those votes are from uninformed users) are important enough to promote voting at higher and higher levels of the site. From a brand perspective alone, it certainly would be odd to visit Digg.com and not see a single place to Digg something up, right?
<html><a name='Chap_9-Defending_Against_Emergent_Defects'></a></html>
It's hard to anticipate all emergent defects until they… well… emerge. But there are certainly some good principles of design that you can follow that may defend your system against some of the most common ones.
  * //Encourage consumption//: if your system's reputations are intended to capture the quality of a piece of content, then you should make a good-faith attempt to ensure that users are qualified to make that assessment. Some examples:
    * Early on in its lifetime, Apple's iPhone App Store allowed //any// visitor to rate an application, whether they'd purchased it or not! You can probably see the potential for bad data to arise from this situation. A subsequent release addressed this problem, ensuring that only users who'd installed the program would have a voice. It doesn't guarantee perfection, but a gating mechanism for rating does help dampen noise.
    * Digg and other social voting sites provide a toolbar that follows logged-in users out to external sites, encouraging them to actually read linked articles before clicking the toolbar-provided voting mechanism. Your application could even //require// an interaction like this for a vote to be counted. (More likely, you'll simply want to weight votes more heavily when they're cast in a guaranteed-better fashion like this.)
    * Think of ways to check for consumption in a media-specific way. For videos, for example, perhaps you should give more weight to opinions cast about a video only once the user has passed a certain time-threshold of viewing (or, perhaps, disable voting mechanisms altogether until that time).
-  * //Avoid Ambiguous Controls//-- Try not to lard too much input overhead onto reputable entities, and try to keep the purpose and primary value of each clear, concise and non-conflicting. If your design already calls for a Bookmarking or Favorites features, then carefully consider whether or not you also need a Thumbs Up or 'I Like It.'+  * //Avoid Ambiguous Controls//-Try not to lard too much input overhead onto reputable entities, and try to keep the purpose and primary value of each clear, concise and non-conflicting. If your design already calls for a Bookmarking or Favorites features, then carefully consider whether or not you also need a Thumbs Up or 'I Like It.'
In any event, provide some cues to users about the utility of those controls: are they strictly for expressing an opinion? Sharing with a friend? Saving for later? The downstream effects may, in fact, be that one control does //all three// of these things, but sometimes it's better to suggest clear and consistent uses for controls than let the community muddle along, inventing its own utilities and rationales for things. If a secondary or tertiary use for a control emerges, then consider formalizing that function as a new feature. In any event, provide some cues to users about the utility of those controls: are they strictly for expressing an opinion? Sharing with a friend? Saving for later? The downstream effects may, in fact, be that one control does //all three// of these things, but sometimes it's better to suggest clear and consistent uses for controls than let the community muddle along, inventing its own utilities and rationales for things. If a secondary or tertiary use for a control emerges, then consider formalizing that function as a new feature.
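The consumption checks described above can be reduced to a simple weighting function. Here is a minimal sketch in Python; the function name, the thresholds, and the linear-scaling scheme are all hypothetical choices for illustration, not a prescribed design.

```python
# Sketch: gate (or down-weight) a video rating based on how much of the
# video the voter actually watched. The thresholds below are invented
# placeholders -- tune them against your own community's data.

MIN_WATCH_FRACTION = 0.25    # ignore votes cast before 25% viewed
FULL_WEIGHT_FRACTION = 0.75  # votes count fully after 75% viewed

def vote_weight(seconds_watched, video_length):
    """Return a 0.0-1.0 weight for this user's vote."""
    if video_length <= 0:
        return 0.0
    fraction = seconds_watched / video_length
    if fraction < MIN_WATCH_FRACTION:
        return 0.0  # gated out entirely: treat as probable noise
    if fraction >= FULL_WEIGHT_FRACTION:
        return 1.0
    # Scale linearly between the gating and full-weight thresholds.
    return (fraction - MIN_WATCH_FRACTION) / (FULL_WEIGHT_FRACTION - MIN_WATCH_FRACTION)
```

The same shape works for other media: substitute scroll depth for articles, or listen time for audio, and weight the resulting vote before it enters the reputation model.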
<html><a name='Chap_9-Exclusivity-Thresholds'></a></html>
Many of the benefits that we've discussed for tracking reputation (the ability to highlight good contributions and contributors, the ability to 'tag' user profiles with awards or recognition, even the simple ability to motivate contributors to excel) can be undermined if you make one simple mistake with your reputation system: being //too generous// with positive reputations. In particular, if you hand out reputations at the higher end of the spectrum too widely, then they will no longer be seen as valuable and rare achievements. You'll also lose the ability to call out great content in long listings: if everything is marked as special, then nothing will stand out.
It's probably okay to wait until the tuning phase to address the question of distribution thresholds. You'll need to make some calculations -- based on available data for current use of the application -- to determine how heavily or lightly to weight certain inputs into the system. A good example is the Gold/Silver/Bronze medal system that we developed at Yahoo! to reward active, quality contributors to UK Sports Message Boards.
We knew that we wanted certain inputs to factor into users' badge-holder reputations: the number of posts a user had made, how well-received (highly-rated) those posts were by the community, and so on. But, at first, our guesses at the appropriate thresholds for these activities were just that -- guesses.
Take, for instance, one input that was included to indicate dedication to the community: the number of posts that a user had rated. (In general, we caution against simple activity-level indicators for karma, but remember -- this is but one input into the model, weighted appropriately against other quality indicators like community response to your own postings.) We arbitrarily settled on the following minimum thresholds for badge-earners:
  * //Bronze Badge// -- 5 posts rated
  * //Silver Badge// -- 20 posts rated
  * //Gold Badge// -- 100 posts rated
These were simply stabs in the dark -- placeholders, really -- that we fully expected to tune as we got closer to deployment.
And, in fact, once we'd done an in-depth calculation of projected badge numbers in the community (based on Message Board activity levels that were already evident //before// the addition of badges), we realized that these estimates were way too low. We would be giving out millions of Bronze badges and, heck, still thousands of Golds. This felt way too liberal, given the goals of the project: to identify and reward //only// the most active and valued contributors to the boards.
By the time the feature went into production, these minimum thresholds for rating others' postings were made //much// higher (orders of magnitude higher) and, in fact, it was several months before the first message board Gold badge actually surfaced in the wild! We considered that a good thing, and perfectly in line with the business and community metrics we'd laid out at the project's outset.
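The kind of projection that exposed our too-low thresholds takes only a few lines of code, if you already have per-user activity counts from the pre-badge application. The sketch below is illustrative only: the function name, the sample data, and the threshold values are all invented for the example.

```python
# Sketch: before launch, project how many users would earn each badge
# under candidate thresholds, using activity data you already have.

def project_badges(posts_rated_per_user, thresholds):
    """Count how many users clear each badge's minimum threshold.

    posts_rated_per_user: list of per-user counts of posts rated
    thresholds: dict mapping badge name -> minimum posts rated
    """
    return {
        badge: sum(1 for n in posts_rated_per_user if n >= minimum)
        for badge, minimum in thresholds.items()
    }

# A tiny stand-in for the real per-user activity histogram:
activity = [0, 1, 3, 6, 8, 12, 25, 40, 75, 150, 300]

# With placeholder thresholds, most active users qualify for something...
low = project_badges(activity, {"Bronze": 5, "Silver": 20, "Gold": 100})
# ...with much higher thresholds, the top badges stay genuinely scarce.
high = project_badges(activity, {"Bronze": 50, "Silver": 200, "Gold": 1000})
```

Run the projection against real historical data before launch, and keep re-running it as activity grows -- the right thresholds for month one are rarely right for month twelve.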
<box blue 75% round>
** So… How Much is Enough? **

  * //Is this karma (people reputation), or content reputation?// Be more mindful of the distribution of karma. It's probably okay to have an overabundance of “Trophy-winning videos” floating around your site, but too many top-flight experts risk devaluing the reward altogether.
  * //Honor the presentation pattern.// Some distribution thresholds will be super-easy to calibrate -- if you're honoring the Top 100 Reviewers on your site, then the number of users awarded //should// be fairly self-evident. It's only with more-ambiguous patterns that thresholds will need to be actively tuned and massaged to get the desired distributions.
  * //Power-law is your friend.// When in doubt, try to award reputations along a power-law distribution. (See [[http://en.wikipedia.org/wiki/Power_law|]].) Great reputations should be rare, good ones scarce, and mediocre ones the norm. This mimics the natural properties of most networks, so your reputations should reflect those values as well.
</box>
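If you'd rather derive thresholds from a desired distribution than guess at point values, you can invert the problem: decide what fraction of the population should earn each award, then read the cutoff score off the ranked population. A sketch, in which the function name and the target fractions are hypothetical choices:

```python
# Sketch: derive award thresholds from a target distribution, so that
# great reputations stay rare and good ones stay scarce.

def threshold_for_top(scores, top_fraction):
    """Score a user must meet or beat to land in the top `top_fraction`."""
    ranked = sorted(scores, reverse=True)
    cutoff_index = max(int(len(ranked) * top_fraction) - 1, 0)
    return ranked[cutoff_index]

# Stand-in for real reputation scores; use your live population here.
scores = list(range(1, 1001))

gold = threshold_for_top(scores, 0.01)    # top 1% of users
silver = threshold_for_top(scores, 0.05)  # top 5%
bronze = threshold_for_top(scores, 0.20)  # top 20%
```

Recomputing thresholds periodically from the live population keeps the distribution stable even as overall activity (and therefore raw scores) inflates over time.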
=== Tuning for the Future ===
There are sometimes pleasant surprises when implementing reputation systems for the first time. When users begin to interact with reputation-powered applications, the very nature of the application can change significantly: it often becomes communal, as control of the reputable entities shifts from the company to the people.
This shift from a content-centric application to a community-centric one often inspires new application designs built on the lessons drawn from the existing reputation system. Simply put, if reputation works well for one application, all of the other related applications will want to integrate it, yesterday!
Though new reputation models can be added only as fast as they can be developed, tested, integrated, and deployed, new uses for //existing// reputations can be released by the application team without coordination and almost instantaneously -- they already have access to the reputation API calls. This suggests that the reputation team should continuously optimize for performance against its internal metrics. Expect significant growth, especially in the number of reputation queries. Even if the primary application, as originally implemented, doesn't grow daily users at an unexpected rate, expect the application team to add new types of uses, such as more reputation-weighted searches, or to add more pages that display a reputation score.
Tuning reputation systems for ROI, behavior, and future improvements is a never-ending process. If you stop this required maintenance, the entire system //will// lose value as it becomes abused, slow, non-competitive, broken, and eventually irrelevant.

==== Learning By Example ====
It's one thing to describe and critique reputation systems after they've already been deployed. It's another to prescribe a detailed set of steps recommended for new practitioners, as we have done in this book.

<blockquote Warrior Proverb>
Talk is easy; action is difficult. But, action is easy; true understanding is difficult!
</blockquote>
The lessons we present here are the direct result of many attempts -- some successful, some failed -- at reputation system development and deployment. The book is the result of successive refinement of those lessons, especially as we refined them at Yahoo!. <html><a href="/doku.php?id=Chapter_10">Chapter_10</a>&nbsp;</html>is our proof-in-the-pudding that this methodology works in practice; it covers each step as we applied it during the development of a community moderation reputation model for Yahoo! Answers.