By Joe Schulz
Courtesy of StickyMinds

As any software developer or tester knows, it’s impossible to identify every user issue in a controlled test environment. For mobile apps especially, there is a near-limitless number of permutations―combinations of devices, firmware, operating systems, and networks. As a result, it simply isn’t feasible to test every possible scenario. It’s also impossible for lab testers who are already familiar with an app’s anticipated behavior to experience it from the perspective of a neophyte.

In my experience, one of the most effective ways to address this problem is crowdsourced testing. Crowdsourced testing, also called crowd testing, has been around for many years. Now, though, the power of databases and analytics makes it possible for vendors to create highly targeted pools of up to thousands of testers who meet specific criteria, and to build and maintain accurate profiles and ratings for those testers over time.

In such a scenario, crowd testers are neither random nor unpredictable. Because crowd testers are usually paid or otherwise compensated, and because their future work hinges on producing accurate results, they have a vested interest in performing the tests or demonstrations promptly and correctly. (Testers usually report the majority of defects within the first few hours of a test.) This approach gives enterprises that engage in crowdsourced testing greater confidence in the reliability of the outcomes.

Working with the Crowd

In my experience, a crowdsourced testing project essentially follows this process: A crowdsourced testing vendor works with a client (the app developer) to define target criteria (demographic, psychographic, geographic, etc.) for the crowd based on the type of input the client wants. The client and vendor also determine whether seasoned testers, neophyte users, or a combination will be most effective. Reputable, experienced testers often provide the most reliable results the fastest, but “newbies” are the best candidates for user-experience tests: they notice complicated or confusing features that an experienced user would intuitively work around and never report as a defect.

The vendor then preselects individuals from its tester database who fit the criteria. This group is asked to use the app for a defined period of time―perhaps by performing specific routines, or test cases―and to report on their experiences and impressions, including any problems or missing functions. The vendor’s in-house testers validate some or all of the reported problems by replicating them in the testing lab, ensuring that an issue wasn’t caused by a crowd tester’s device or by user error.

Often, the client and the vendor work together, with the client providing validation feedback that allows the vendor to further fine-tune the performance ratings of participating testers. The last step is for the client to prioritize the defects―by variables such as commonality (the percentage of testers who found them), device penetration, and severity―and define the list of issues it wishes to resolve.
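As a rough illustration, that prioritization step can be sketched in a few lines of code. Everything here is hypothetical―the field names, the weights, and the sample defects are invented for the example, not taken from any particular vendor’s tooling:

```python
from dataclasses import dataclass

@dataclass
class Defect:
    description: str
    reporters: int       # number of crowd testers who reported it
    pool_size: int       # total testers in the crowd
    device_share: float  # fraction of target devices affected (0 to 1)
    severity: int        # 1 (cosmetic) through 5 (blocker)

    @property
    def commonality(self) -> float:
        # Percentage of the pool that found this defect, as a fraction.
        return self.reporters / self.pool_size

def priority_score(d: Defect) -> float:
    # Weighted blend of the three variables named in the article;
    # the weights are an arbitrary choice for illustration.
    return 0.4 * d.commonality + 0.3 * d.device_share + 0.3 * (d.severity / 5)

defects = [
    Defect("Login button unresponsive", 180, 500, 0.6, 5),
    Defect("Logo misaligned on tablets", 25, 500, 0.1, 1),
]
for d in sorted(defects, key=priority_score, reverse=True):
    print(f"{priority_score(d):.2f}  {d.description}")
```

The output is a ranked list the client can walk down when deciding which issues to resolve first.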

I’ve worked on dozens of such projects with pretty impressive results. Following are two examples.

Financial App Defects Lower Brand Allegiance

About a year ago, I worked with a large financial institution that was having problems with the latest version of its mobile app and had received numerous bad reviews on social media sites. The problems stemmed mainly from incompatibilities across different mobile devices. The bank’s reputation, and the app’s, deteriorated as a result, reversing the positive opinion the first version had earned and making it more difficult to attract new customers.

Because of time and financial constraints, the bank had traditionally tested on only a few key devices. The prevailing wisdom was that if an app worked on an iPhone 5S, it should work on all iOS devices. They took the same approach to Android, testing on only one device from Samsung and one from Motorola. Their testing was fairly thorough but largely manual, and the lack of automation further limited their device sample size.

I helped them craft a crowdsourced test for five key use cases. It was a short test, only five days long, with a pool of five hundred testers. We designed the crowd to emphasize diversity across devices. Consequently, even though this test was neither extensive nor intensive compared to the potential scope and depth of crowdsourced testing, we covered more than two hundred hardware and operating system combinations beyond those their internal testing had reached.

The crowdsourced test returned more than nine hundred unique defects across the pool of devices. Most were due to minor variations in devices and operating system versions. After validating and prioritizing the defects, the bank identified twenty-six specific improvements that resolved 80 percent of the high-priority defects. These updates were rolled out over three releases in a twelve-week span. As a result, the app’s rating improved 250 percent over that timeframe―bringing it very close to the high rating of the original version.

Retail App Problems Cause Big Headaches

I also worked with an online retailer whose mobile app was poorly received at its initial introduction. The app drew mostly one-star and two-star ratings, with most reviews citing a lack of functionality compared with expectations. The organization was surprised because much of the functionality reviewers cited as missing had, in fact, been delivered in the initial release.

In response, I helped them craft a crowdsourced test focused on general app usage. The crowd was a pool of two hundred users spanning a wide range of ages and economic backgrounds. We focused only on iOS users and deliberately did not give them detailed test instructions. Instead, we specified high-level goals and asked the users to determine how best to accomplish each task. They were asked to document every attempt, including failed ones.

The result was a wealth of reports highlighting problems with the basic user interface. The desired functionality was available, but users often had difficulty locating the appropriate options to invoke it. Documenting these failed attempts enabled the retailer to pinpoint the most common user expectations and then redesign its interface to match them.
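That pinpointing step amounts to simple frequency counting over the reported navigation paths. Here is a minimal sketch of the idea; the menu paths below are invented for illustration and do not come from the retailer’s actual data:

```python
from collections import Counter

# Each report records the menu path a tester tried before giving up.
# These sample paths are hypothetical.
failed_attempts = [
    ("Account", "Settings"),
    ("Account", "Settings"),
    ("Home", "Search"),
    ("Account", "Profile"),
    ("Account", "Settings"),
]

# Counting where users *expected* the feature to live reveals the most
# common assumption, which the interface redesign can then honor.
expectations = Counter(failed_attempts)
most_common_path, count = expectations.most_common(1)[0]
print(f"{count} of {len(failed_attempts)} testers looked under "
      f"{' > '.join(most_common_path)}")
```

Sorting the full tally, rather than taking only the top entry, would give the designers a ranked map of user expectations across the whole interface.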

We repeated the test before the client delivered the redesigned user interface to the public. The second round showed marked improvement in some areas, but in others it was evident that some of the updates hadn’t fixed the problem. After a second round of updates, the new user interface was delivered to the app store, and its average rating jumped to around three and a half stars.

The Where and When of Crowdsourced Testing

These examples highlight how crowdsourced testing can enable enterprises to fix issues that completely perplex or escape in-house testers. Although crowdsourced testing can be beneficial at some level on nearly every development project, there are several scenarios where I specifically recommend it:

  • App server load tests: To see how an app server will perform when hundreds of disparate users on myriad devices try to connect
  • Challenging geographic conditions: To determine if (or how badly) limited cellular coverage or the presence of impediments such as brick buildings or mountains impacts app performance
  • Device-specific defect identification: To identify bugs and defects on less commonly used devices
  • Network variations: To explore how the app operates and negotiates the handshake across multiple network types (Wi-Fi and multiple cellular modes)
  • Old, new, or unusual equipment: To provide a mechanism for testing on devices that are hard to find or very expensive without the effort and expense of equipment acquisition
  • User experience issues: To pinpoint functions that work but are not especially intuitive or obvious
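For the app server load scenario in particular, the core idea is many disparate clients connecting at once. The toy sketch below approximates only the concurrency aspect with a stubbed request function―a real crowd test uses real devices over real networks, and the 50-user figure and 10 ms delay are arbitrary stand-ins:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def simulated_request(user_id: int) -> float:
    """Stand-in for one tester's app session; returns elapsed time."""
    start = time.perf_counter()
    time.sleep(0.01)  # placeholder for a real HTTP round trip
    return time.perf_counter() - start

# Fire 50 "users" at once, the way hundreds of crowd testers would
# converge on a pre-production server during a load test window.
with ThreadPoolExecutor(max_workers=50) as pool:
    latencies = list(pool.map(simulated_request, range(50)))

print(f"max latency: {max(latencies) * 1000:.1f} ms")
```

In a real engagement, the per-session latencies and failure counts reported back by the crowd are what reveal whether the server degrades gracefully under simultaneous load.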

Finally, crowdsourcing is valuable not only before the final release of code to production but also in the early stages of testing. Crowd testers uncover defects faster than most in-house testers, and companies can run multiple crowdsourced tests to decide whether allowing a minor or subjective flaw to go uncorrected will negatively influence user acceptance.

Of course, companies must have the resources on their pre-production servers to handle the load of the testing, and they must have sufficient security to protect both the servers and the crowd testers’ devices.