Just Say No to More End-to-End Tests

Wednesday, April 22, 2015 23:10 PM

by Mike Wacker

At some point in your life, you can probably recall a movie that you and your friends all wanted to see, and that you and your friends all regretted watching afterwards. Or maybe you remember that time your team thought they’d found the next "killer feature" for their product, only to see that feature bomb after it was released.

Good ideas often fail in practice, and in the world of testing, one pervasive good idea that often fails in practice is a testing strategy built around end-to-end tests.

Testers can invest their time in writing many types of automated tests, including unit tests, integration tests, and end-to-end tests, but this strategy invests mostly in end-to-end tests that verify the product or service as a whole. Typically, these tests simulate real user scenarios.

End-to-End Tests in Theory 

While relying primarily on end-to-end tests is a bad idea, one could certainly convince a reasonable person that the idea makes sense in theory.

To start, number one on Google's list of ten things we know to be true is: "Focus on the user and all else will follow." Thus, end-to-end tests that focus on real user scenarios sound like a great idea. Additionally, this strategy broadly appeals to many constituencies:
  • Developers like it because it offloads most, if not all, of the testing to others. 
  • Managers and decision-makers like it because tests that simulate real user scenarios can help them easily determine how a failing test would impact the user. 
  • Testers like it because they often worry about missing a bug or writing a test that does not verify real-world behavior; writing tests from the user's perspective often avoids both problems and gives the tester a greater sense of accomplishment. 

End-to-End Tests in Practice 

So if this testing strategy sounds so good in theory, then where does it go wrong in practice? To demonstrate, I present the following composite sketch based on a collection of real experiences familiar to both myself and other testers. In this sketch, a team is building a service for editing documents online (e.g., Google Docs).

Let's assume the team already has some fantastic test infrastructure in place. Every night:
  1. The latest version of the service is built. 
  2. This version is then deployed to the team's testing environment. 
  3. All end-to-end tests then run against this testing environment. 
  4. An email report summarizing the test results is sent to the team.

The deadline is approaching fast as our team codes new features for their next release. To maintain a high bar for product quality, they also require that at least 90% of their end-to-end tests pass before features are considered complete. Currently, that deadline is one day away:

Days Left | Pass % | Notes
    1     |   5%   | Everything is broken! Signing in to the service is broken. Almost all tests sign in a user, so almost all tests failed.
    0     |   4%   | A partner team we rely on deployed a bad build to their testing environment yesterday.
   -1     |  54%   | A dev broke the save scenario yesterday (or the day before?). Half the tests save a document at some point in time. Devs spent most of the day determining if it's a frontend bug or a backend bug.
   -2     |  54%   | It's a frontend bug, devs spent half of today figuring out where.
   -3     |  54%   | A bad fix was checked in yesterday. The mistake was pretty easy to spot, though, and a correct fix was checked in today.
   -4     |   1%   | Hardware failures occurred in the lab for our testing environment.
   -5     |  84%   | Many small bugs hiding behind the big bugs (e.g., sign-in broken, save broken). Still working on the small bugs.
   -6     |  87%   | We should be above 90%, but are not for some reason.
   -7     | 89.54% | (Rounds up to 90%, close enough.) No fixes were checked in yesterday, so the tests must have been flaky yesterday.

Analysis 

Despite numerous problems, the tests ultimately did catch real bugs.

What Went Well 

  • Customer-impacting bugs were identified and fixed before they reached the customer.

What Went Wrong 

  • The team completed their coding milestone a week late (and worked a lot of overtime). 
  • Finding the root cause for a failing end-to-end test is painful and can take a long time. 
  • Partner failures and lab failures ruined the test results on multiple days. 
  • Many smaller bugs were hidden behind bigger bugs. 
  • End-to-end tests were flaky at times. 
  • Developers had to wait until the following day to know if a fix worked or not. 

So now that we know what went wrong with the end-to-end strategy, we need to change our approach to testing to avoid many of these problems. But what is the right approach?

The True Value of Tests 

Typically, a tester's job ends once they have a failing test. A bug is filed, and then it's the developer's job to fix the bug. To identify where the end-to-end strategy breaks down, however, we need to think outside this box and approach the problem from first principles. If we "focus on the user (and all else will follow)," we have to ask ourselves how a failing test benefits the user. Here is the answer:

A failing test does not directly benefit the user. 

While this statement seems shocking at first, it is true. If a product works, it works, whether a test says it works or not. If a product is broken, it is broken, whether a test says it is broken or not. So, if failing tests do not benefit the user, then what does benefit the user?

A bug fix directly benefits the user.

The user will only be happy when that unintended behavior - the bug - goes away. Obviously, to fix a bug, you must know the bug exists. To know the bug exists, ideally you have a test that catches the bug (because the user will find the bug if the test does not). But in that entire process, from failing test to bug fix, value is only added at the very last step.

Stage       | Failing Test | Bug Opened | Bug Fixed
Value Added | No           | No         | Yes

Thus, to evaluate any testing strategy, you cannot just evaluate how it finds bugs. You also must evaluate how it enables developers to fix (and even prevent) bugs.

Building the Right Feedback Loop

Tests create a feedback loop that informs the developer whether the product is working or not. The ideal feedback loop has several properties:
  • It's fast. No developer wants to wait hours or days to find out if their change works. Sometimes the change does not work - nobody is perfect - and the feedback loop needs to run multiple times. A faster feedback loop leads to faster fixes. If the loop is fast enough, developers may even run tests before checking in a change. 
  • It's reliable. No developer wants to spend hours debugging a test, only to find out it was a flaky test. Flaky tests reduce the developer's trust in the test, and as a result flaky tests are often ignored, even when they find real product issues. 
  • It isolates failures. To fix a bug, developers need to find the specific lines of code causing the bug. When a product contains millions of lines of code, and the bug could be anywhere, it's like trying to find a needle in a haystack. 

Think Smaller, Not Larger

So how do we create that ideal feedback loop? By thinking smaller, not larger.

Unit Tests

Unit tests take a small piece of the product and test that piece in isolation. They tend to create that ideal feedback loop:

  • Unit tests are fast. We only need to build a small unit to test it, and the tests also tend to be rather small. In fact, one tenth of a second is considered slow for unit tests. 
  • Unit tests are reliable. Simple systems and small units in general tend to suffer much less from flakiness. Furthermore, best practices for unit testing - in particular practices related to hermetic tests - will remove flakiness entirely. 
  • Unit tests isolate failures. Even if a product contains millions of lines of code, if a unit test fails, you only need to search that small unit under test to find the bug. 

Writing effective unit tests requires skills in areas such as dependency management, mocking, and hermetic testing. I won't cover these skills here, but as a start, the typical example offered to new Googlers (or Nooglers) is how Google builds and tests a stopwatch.
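To make the idea concrete, here is a minimal sketch of such a hermetic stopwatch test in Java. The Stopwatch and Ticker types below are written purely for this illustration (they are not Google's actual implementation): the stopwatch reads time only through an injected clock, so the test never sleeps and can never be flaky due to real-clock timing.

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class StopwatchTest {
  /** Time source abstraction so tests can control the clock (illustrative only). */
  interface Ticker { long nanos(); }

  /** A tiny stopwatch that reads time only through the injected Ticker. */
  static class Stopwatch {
    private final Ticker ticker;
    private long startNanos;
    private long elapsedNanos;
    Stopwatch(Ticker ticker) { this.ticker = ticker; }
    void start() { startNanos = ticker.nanos(); }
    void stop() { elapsedNanos = ticker.nanos() - startNanos; }
    long elapsedNanos() { return elapsedNanos; }
  }

  @Test public void stop_recordsElapsedTimeFromInjectedTicker() {
    long[] fakeTime = {0};
    Stopwatch stopwatch = new Stopwatch(() -> fakeTime[0]);

    stopwatch.start();
    fakeTime[0] = 42;  // Advance the fake clock; no sleeping, no flakiness.
    stopwatch.stop();

    assertEquals(42, stopwatch.elapsedNanos());
  }
}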

Unit Tests vs. End-to-End Tests

With end-to-end tests, you have to wait: first for the entire product to be built, then for it to be deployed, and finally for all end-to-end tests to run. When the tests do run, flaky tests tend to be a fact of life. And even if a test finds a bug, that bug could be anywhere in the product.

Although end-to-end tests do a better job of simulating real user scenarios, this advantage quickly becomes outweighed by all the disadvantages of the end-to-end feedback loop:

                      | Unit | End-to-End
Fast                  | Yes  | No
Reliable              | Yes  | No
Isolates Failures     | Yes  | No
Simulates a Real User | No   | Yes

Integration Tests

Unit tests do have one major disadvantage: even if the units work well in isolation, you do not know if they work well together. But even then, you do not necessarily need end-to-end tests. For that, you can use an integration test. An integration test takes a small group of units, often two units, and tests their behavior as a whole, verifying that they coherently work together.

If two units do not integrate properly, why write an end-to-end test when you can write a much smaller, more focused integration test that will detect the same bug? While you do need to think larger, you only need to think a little larger to verify that units work together.
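For illustration, a focused integration test for the document-editing example might look like the following Java sketch. The DocumentStore and SaveHandler classes are hypothetical stand-ins for two real units: the test wires them together and verifies only that they cooperate, with no servers or UI involved.

import static org.junit.Assert.assertEquals;
import org.junit.Test;
import java.util.HashMap;
import java.util.Map;

public class SaveIntegrationTest {
  /** Unit 1: an in-memory store (hypothetical; a real one might wrap a database). */
  static class DocumentStore {
    private final Map<String, String> docs = new HashMap<>();
    void put(String id, String contents) { docs.put(id, contents); }
    String get(String id) { return docs.get(id); }
  }

  /** Unit 2: the handler invoked when the user saves a document. */
  static class SaveHandler {
    private final DocumentStore store;
    SaveHandler(DocumentStore store) { this.store = store; }
    void onSaveClicked(String docId, String contents) { store.put(docId, contents); }
  }

  // The integration test uses the two real units together -- no browser, no
  // deployed service -- and checks only that they work with each other.
  @Test public void savedDocumentCanBeReadBack() {
    DocumentStore store = new DocumentStore();
    SaveHandler handler = new SaveHandler(store);

    handler.onSaveClicked("doc-1", "Hello, world");

    assertEquals("Hello, world", store.get("doc-1"));
  }
}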

Testing Pyramid

Even with both unit tests and integration tests, you probably still will want a small number of end-to-end tests to verify the system as a whole. To find the right balance between all three test types, the best visual aid to use is the testing pyramid. Here is a simplified version of the testing pyramid from the opening keynote of the 2014 Google Test Automation Conference:



The bulk of your tests are unit tests at the bottom of the pyramid. As you move up the pyramid, your tests get larger, but at the same time the number of tests (the width of your pyramid) gets smaller.

As a good first guess, Google often suggests a 70/20/10 split: 70% unit tests, 20% integration tests, and 10% end-to-end tests. The exact mix will be different for each team, but in general, it should retain that pyramid shape. Try to avoid these anti-patterns:
  • Inverted pyramid/ice cream cone. The team relies primarily on end-to-end tests, using few integration tests and even fewer unit tests. 
  • Hourglass. The team starts with a lot of unit tests, then uses end-to-end tests where integration tests should be used. The hourglass has many unit tests at the bottom and many end-to-end tests at the top, but few integration tests in the middle. 
Just like a regular pyramid tends to be the most stable structure in real life, the testing pyramid also tends to be the most stable testing strategy.


Quantum Quality

Thursday, April 02, 2015 15:35 PM

UPDATE: Hey, this was an April fool's joke but in fact we wished we could
have realized this idea and we are looking forward to the day this has
been worked out and becomes a reality.

by Kevin Graney

Here at Google we have a long history of capitalizing on the latest research and technology to improve the quality of our software. Over our past 16+ years as a company, what started with some humble unit tests has grown into a massive operation. As our software complexity increased, ever larger and more complex tests were dreamed up by our Software Engineers in Test (SETs).

What we have come to realize is that our love of testing is a double-edged sword. On the one hand, large-scale testing keeps us honest and gives us confidence. It ensures our products remain reliable, our users' data is kept safe, and our engineers are able to work productively without fear of breaking things. On the other hand, it's expensive in both engineer and machine time. Our SETs have been working tirelessly to reduce the expense and latency of software tests at Google, while continuing to increase their quality.

Today, we're excited to reveal how Google is tackling this challenge. In collaboration with the Quantum AI Lab, SETs at Google have been busy revolutionizing how software is tested. The theory is relatively simple: bits in a traditional computer are either zero or one, but bits in a quantum computer can be both one and zero at the same time. This is known as superposition, and the classic example is Schrödinger's cat. Through some clever math and cutting edge electrical engineering, researchers at Google have figured out how to utilize superposition to vastly improve the quality of our software testing and the speed at which our tests run.


Figure 1 Some qubits inside a Google quantum device.

With superposition, tests at Google are now able to simultaneously model every possible state of the application under test. The state of the application can be thought of as an n-bit sequential memory buffer, consistent with the traditional von Neumann architecture of computing. Because each bit under superposition is simultaneously a 0 and a 1, these tests can simulate 2^n different application states at any given instant in time in O(n) space. Each of these application states can be mutated by application logic to another state in constant time using quantum algorithms developed by Google researchers. These two properties together allow us to build a state transition graph of the application under test that shows every possible application state and all possible transitions to other application states. Using traditional computing methods this problem has intractable time complexity, but after leveraging superposition and our quantum algorithms it becomes relatively fast and cheap.

Figure 2 The application state graph for a demonstrative 3-bit application. If the start state is 001 then 000, 110, 111, and 011 are all unreachable states. States 010 and 100 both result in deadlock.

Once we have the state transition graph for the application under test, testing it becomes almost trivial. Given the initial startup state of the application, i.e. the executable bits of the application stored on disk, we can find from the application's state transition graph all reachable states. Assertions that ensure proper behavior are then written against the reachable subset of the transition graph. This paradigm of test writing allows both Google's security engineers and software engineers to work more productively. A security engineer can write a test, for example, that asserts "no executable memory regions become mutated in any reachable state". This one test effectively eliminates the potential for security flaws that result from memory safety violations. A test engineer can write higher level assertions using graph traversal methods that ensure data integrity is maintained across a subset of application state transitions. Tests of this nature can detect data corruption bugs.

We're excited about the work our team has done so far to push the envelope in the field of quantum software quality. We're just getting started, but based on early dogfood results among Googlers we believe the potential of this work is huge. Stay tuned!


Android UI Automated Testing

Thursday, April 02, 2015 15:51 PM

by Mona El Mahdy

Overview

This post reviews four strategies for Android UI testing with the goal of creating UI tests that are fast, reliable, and easy to debug.

Before we begin, let’s not forget an important rule: whatever can be unit tested should be unit tested. Robolectric and Gradle unit test support are great examples of unit testing frameworks for Android. UI tests, on the other hand, are used to verify that your application returns the correct UI output in response to a sequence of user actions on a device. Espresso is a great framework for running UI actions and verifications in the same process. For more details on the Espresso and UI Automator tools, please see: test support libraries.

The Google+ team has performed many iterations of UI testing. Below we discuss the lessons learned during each strategy of UI testing. Stay tuned for more posts with more details and code samples.

Strategy 1: Using an End-To-End Test as a UI Test

Let’s start with some definitions. A UI test ensures that your application returns the correct UI output in response to a sequence of user actions on a device. An end-to-end (E2E) test brings up the full system of your app, including all backend servers and the client app. E2E tests will guarantee that data is sent to the client app and that the entire system functions correctly.

Usually, in order to make the application UI functional, you need data from backend servers, so UI tests need to simulate the data but not necessarily the backend servers. In many cases UI tests are confused with E2E tests because E2E is very similar to manual test scenarios. However, debugging and stabilizing E2E tests is very difficult due to many variables like network flakiness, authentication against real servers, size of your system, etc.


When you use UI tests as E2E tests, you face the following problems:
  • Very large and slow tests. 
  • High flakiness rate due to timeouts and memory issues. 
  • Hard to debug/investigate failures. 
  • Authentication issues (e.g., authenticating from automated tests is very tricky).

Let’s see how these problems can be fixed using the following strategies.

Strategy 2: Hermetic UI Testing using Fake Servers

In this strategy, you avoid network calls and external dependencies, but you need to provide your application with data that drives the UI. Update your application to communicate with a local server rather than an external one, and create a fake local server that provides data to your application. You then need a mechanism to generate the data needed by your application. This can be done using various approaches depending on your system design. One approach is to record server responses and replay them in your fake server.
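As one possible sketch (not necessarily how any particular team implements it), a fake local server can be an in-process HTTP server that replays canned responses recorded from the real backend; the app under test is then pointed at http://localhost:<port> instead of production. The Java class below is illustrative only.

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Minimal fake-server sketch: replays previously recorded responses by request path. */
public class FakeBackendServer {
  private final Map<String, String> cannedResponses = new HashMap<>();
  private HttpServer server;

  /** Register a recorded response for a request path. */
  public void record(String path, String responseJson) {
    cannedResponses.put(path, responseJson);
  }

  /** Starts the server on an ephemeral port; the app under test is pointed here. */
  public int start() throws Exception {
    server = HttpServer.create(new InetSocketAddress(0), 0);
    server.createContext("/", exchange -> {
      String body = cannedResponses.getOrDefault(
          exchange.getRequestURI().getPath(), "{}");
      byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
      exchange.sendResponseHeaders(200, bytes.length);
      try (OutputStream out = exchange.getResponseBody()) {
        out.write(bytes);
      }
    });
    server.start();
    return server.getAddress().getPort();
  }

  public void stop() {
    server.stop(0);
  }
}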

Once you have hermetic UI tests talking to a local fake server, you should also have server hermetic tests. This way you split your E2E test into a server side test, a client side test, and an integration test to verify that the server and client are in sync (for more details on integration tests, see the backend testing section of this blog).

Now, the client test flow looks like:


While this approach drastically reduces the test size and flakiness rate, you still need to maintain a separate fake server as well as your test. Debugging is still not easy as you have two moving parts: the test and the local server. While test stability will be largely improved by this approach, the local server will cause some flakes.

Let’s see how this could be improved...

Strategy 3: Dependency Injection Design for Apps

To remove the additional dependency of a fake server running on Android, you should use dependency injection in your application for swapping real module implementations with fake ones. One example is Dagger, or you can create your own dependency injection mechanism if needed.

This will improve the testability of your app for both unit testing and UI testing, providing your tests with the ability to mock dependencies. In instrumentation testing, the test apk and the app under test are loaded in the same process, so the test code has runtime access to the app code. Not only that, but you can also use classpath override (the fact that the test classpath takes priority over the app under test) to override a certain class and inject test fakes there. For example, to make your test hermetic, your app should support injection of the networking implementation. During testing, the test injects a fake networking implementation into your app, and this fake implementation will provide seeded data instead of communicating with backend servers.
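A minimal Java sketch of that injection seam is shown below; the interface and class names are made up for illustration, and a real app would typically wire them together with Dagger or a similar framework rather than by hand.

/** Interface the app depends on; the names here are illustrative, not from the post. */
interface NetworkClient {
  String fetchUserProfile(String userId);
}

/** Production implementation talks to real backend servers. */
class HttpNetworkClient implements NetworkClient {
  @Override public String fetchUserProfile(String userId) {
    // ... real HTTP call to the backend would go here ...
    throw new UnsupportedOperationException("network access omitted in this sketch");
  }
}

/** Fake injected by tests: returns seeded data with no network access. */
class FakeNetworkClient implements NetworkClient {
  @Override public String fetchUserProfile(String userId) {
    return "{\"id\": \"" + userId + "\", \"name\": \"Test User\"}";
  }
}

/** The presenter only sees the interface, so tests can swap in the fake. */
class ProfilePresenter {
  private final NetworkClient network;
  ProfilePresenter(NetworkClient network) { this.network = network; }
  String loadProfile(String userId) { return network.fetchUserProfile(userId); }
}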


Strategy 4: Building Apps into Smaller Libraries

If you want to scale your app into many modules and views, and plan to add more features while maintaining stable and fast builds/tests, then you should build your app into small components/libraries. Each library should have its own UI resources and its own dependency management. This strategy not only enables mocking dependencies of your libraries for hermetic testing, but also serves as an experimentation platform for various components of your application.

Once you have small components with dependency injection support, you can build a test app for each component.

The test apps bring up the actual UI of your libraries, provide the fake data needed, and mock dependencies. Espresso tests will run against these test apps. This enables testing of smaller libraries in isolation.

For example, let’s consider building smaller libraries for login and settings of your app.


The settings component test now looks like:


Conclusion

UI testing can be very challenging for rich apps on Android. Here are some UI testing lessons learned on the Google+ team:
  1. Don’t write E2E tests instead of UI tests. Instead write unit tests and integration tests beside the UI tests. 
  2. Hermetic tests are the way to go. 
  3. Use dependency injection while designing your app. 
  4. Build your application into small libraries/modules, and test each one in isolation. You can then have a few integration tests to verify integration between components is correct. 
  5. Componentized UI tests have proven to be much faster than E2E and 99%+ stable. Fast and stable tests have proven to drastically improve developer productivity.

The First Annual Testing on the Toilet Awards

Tuesday, February 03, 2015 16:51 PM

By Andrew Trenk

The Testing on the Toilet (TotT) series was created in 2006 as a way to spread unit-testing knowledge across Google by posting flyers in bathroom stalls. It quickly became a part of Google culture and is still going strong today, with new episodes published every week and read in hundreds of bathrooms by thousands of engineers in Google offices across the world. Initially focused on content related to testing, TotT now covers a variety of technical topics, such as tips on writing cleaner code and ways to prevent security bugs.

While TotT episodes often have a big impact on many engineers across Google, until now we never did anything to formally thank authors for their contributions. To fix that, we decided to honor the most popular TotT episodes of 2014 by establishing the Testing on the Toilet Awards. The winners were chosen through a vote that was open to all Google engineers. The Google Testing Blog is proud to present the winners that were posted on this blog (there were two additional winners that weren’t posted on this blog since we only post testing-related TotT episodes).

And the winners are ...

Erik Kuefler: Test Behaviors, Not Methods and Don't Put Logic in Tests 
Alex Eagle: Change-Detector Tests Considered Harmful

The authors of these episodes received their very own Flushy trophy, which they can proudly display on their desks.



(The logo on the trophy is the same one we put on the printed version of each TotT episode, which you can see by looking for the “printer-friendly version” link in the TotT blog posts).

Congratulations to the winners!

Testing on the Toilet: Change-Detector Tests Considered Harmful

Wednesday, January 28, 2015 00:43 AM

by Alex Eagle

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.


You have just finished refactoring some code without modifying its behavior. Then you run the tests before committing and… a bunch of unit tests are failing. While fixing the tests, you get a sense that you are wasting time by mechanically applying the same transformation to many tests. Maybe you introduced a parameter in a method, and now must update 100 callers of that method in tests to pass an empty string.

What does it look like to write tests mechanically? Here is an absurd but obvious way:

// Production code:
def abs(i: Int)
  return (i < 0) ? i * -1 : i

// Test code:
for (line: String in File(prod_source).read_lines())
  switch (line.number)
    1: assert line.content equals "def abs(i: Int)"
    2: assert line.content equals "  return (i < 0) ? i * -1 : i"

That test is clearly not useful: it contains an exact copy of the code under test and acts like a checksum. A correct or incorrect program is equally likely to pass a test that is a derivative of the code under test. No one is really writing tests like that, but how different is it from this next example?
// Production code:
def process(w: Work)
  firstPart.process(w)
  secondPart.process(w)

// Test code:
part1 = mock(FirstPart)
part2 = mock(SecondPart)
w = Work()
Processor(part1, part2).process(w)
verify_in_order
  was_called part1.process(w)
  was_called part2.process(w)

It is tempting to write a test like this because it requires little thought and will run quickly. This is a change-detector test—it is a transformation of the same information in the code under test—and it breaks in response to any change to the production code, without verifying correct behavior of either the original or modified production code.

Change detectors provide negative value, since the tests do not catch any defects, and the added maintenance cost slows down development. These tests should be re-written or deleted.
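As a hedged illustration (not part of the original episode), a behavior-focused rewrite of the second example asserts on the observable result of processing rather than on the sequence of internal calls, so it survives refactorings that preserve behavior. The tiny classes below are stand-ins for the pseudocode above, and the isCompleted() accessor is assumed purely for this example.

import static org.junit.Assert.assertTrue;
import org.junit.Test;

public class ProcessorTest {
  // Minimal stand-ins for the classes in the pseudocode above.
  static class Work {
    boolean firstDone, secondDone;
    boolean isCompleted() { return firstDone && secondDone; }
  }
  static class FirstPart  { void process(Work w) { w.firstDone = true; } }
  static class SecondPart { void process(Work w) { w.secondDone = true; } }
  static class Processor {
    private final FirstPart first;
    private final SecondPart second;
    Processor(FirstPart first, SecondPart second) { this.first = first; this.second = second; }
    void process(Work w) { first.process(w); second.process(w); }
  }

  @Test public void process_completesTheWork() {
    // Real collaborators instead of mocks: the assertion is about the
    // externally visible outcome, so an internal refactoring (e.g., merging
    // FirstPart and SecondPart) would not break this test.
    Processor processor = new Processor(new FirstPart(), new SecondPart());
    Work work = new Work();

    processor.process(work);

    assertTrue(work.isCompleted());
  }
}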

Testing on the Toilet: Prefer Testing Public APIs Over Implementation-Detail Classes

Wednesday, January 14, 2015 17:35 PM

by Andrew Trenk

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.


Does this class need to have tests?

class UserInfoValidator {
  public void validate(UserInfo info) {
    if (info.getDateOfBirth().isInFuture()) { throw new ValidationException(); }
  }
}
Its method has some logic, so it may be a good idea to test it. But what if its only user looks like this?
public class UserInfoService {
  private UserInfoValidator validator;

  public void save(UserInfo info) {
    validator.validate(info); // Throw an exception if the value is invalid.
    writeToDatabase(info);
  }
}
The answer is: it probably doesn’t need tests, since all paths can be tested through UserInfoService. The key distinction is that the class is an implementation detail, not a public API.

A public API can be called by any number of users, who can pass in any possible combination of inputs to its methods. You want to make sure these are well-tested, which ensures users won’t see issues when they use the API. Examples of public APIs include classes that are used in a different part of a codebase (e.g., a server-side class that’s used by the client-side) and common utility classes that are used throughout a codebase.

An implementation-detail class exists only to support public APIs and is called by a very limited number of users (often only one). These classes can sometimes be tested indirectly by testing the public APIs that use them.

Testing implementation-detail classes is still useful in many cases, such as if the class is complex or if the tests would be difficult to write for the public API class. When you do test them, they often don’t need to be tested in as much depth as a public API, since some inputs may never be passed into their methods (in the above code sample, if UserInfoService ensured that UserInfo were never null, then it wouldn’t be useful to test what happens when null is passed as an argument to UserInfoValidator.validate, since it would never happen).
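For example, here is a sketch of testing the validation behavior through the public API instead of through UserInfoValidator directly. The UserInfo.withDateOfBirth() and daysFromNow() helpers are hypothetical; any way of building an invalid UserInfo would do.

import static org.junit.Assert.fail;
import org.junit.Test;

public class UserInfoServiceTest {
  @Test public void save_rejectsUserWithDateOfBirthInTheFuture() {
    UserInfoService service = new UserInfoService();
    // Hypothetical test helpers for constructing an invalid UserInfo.
    UserInfo infoWithFutureBirthday = UserInfo.withDateOfBirth(daysFromNow(1));

    try {
      // The validator is exercised indirectly through the public API, so this
      // test keeps passing even if UserInfoValidator is renamed or inlined.
      service.save(infoWithFutureBirthday);
      fail("Expected ValidationException for a date of birth in the future");
    } catch (ValidationException expected) {
      // Expected: invalid data must never reach writeToDatabase().
    }
  }
}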

Implementation-detail classes can sometimes be thought of as private methods that happen to be in a separate class, since you typically don’t want to test private methods directly either. You should also try to restrict the visibility of implementation-detail classes, such as by making them package-private in Java.

Testing implementation-detail classes too often leads to a couple problems:

- Code is harder to maintain since you need to update tests more often, such as when changing a method signature of an implementation-detail class or even when doing a refactoring. If testing is done only through public APIs, these changes wouldn’t affect the tests at all.

- If you test a behavior only through an implementation-detail class, you may get false confidence in your code, since the same code path may not work properly when exercised through the public API. You also have to be more careful when refactoring, since it can be harder to ensure that all the behavior of the public API will be preserved if not all paths are tested through the public API.

Testing on the Toilet: Truth: a fluent assertion framework

Friday, December 19, 2014 18:28 PM

by Dori Reuveni and Kurt Alfred Kluever

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.


As engineers, we spend most of our time reading existing code, rather than writing new code. Therefore, we must make sure we always write clean, readable code. The same goes for our tests; we need a way to clearly express our test assertions.

Truth is an open source, fluent testing framework for Java designed to make your test assertions and failure messages more readable. The fluent API makes reading (and writing) test assertions much more natural, prose-like, and discoverable in your IDE via autocomplete. For example, compare how the following assertion reads with JUnit vs. Truth:

assertEquals("March", monthMap.get(3));          // JUnit
assertThat(monthMap).containsEntry(3, "March"); // Truth
Both statements are asserting the same thing, but the assertion written with Truth can be easily read from left to right, while the JUnit example requires "mental backtracking".

Another benefit of Truth over JUnit is the addition of useful default failure messages. For example:
ImmutableSet<String> colors = ImmutableSet.of("red", "green", "blue", "yellow");
assertTrue(colors.contains("orange")); // JUnit
assertThat(colors).contains("orange"); // Truth
In this example, both assertions will fail, but JUnit will not provide a useful failure message. However, Truth will provide a clear and concise failure message:

AssertionError: <[red, green, blue, yellow]> should have contained <orange>

Truth already supports specialized assertions for most of the common JDK types (Objects, primitives, arrays, Strings, Classes, Comparables, Iterables, Collections, Lists, Sets, Maps, etc.), as well as some Guava types (Optionals). Additional support for other popular types is planned as well (Throwables, Iterators, Multimaps, UnsignedIntegers, UnsignedLongs, etc.).

Truth is also user-extensible: you can easily write a Truth subject to make fluent assertions about your own custom types. By creating your own custom subject, both your assertion API and your failure messages can be domain-specific.

Truth's goal is not to replace JUnit assertions, but to improve the readability of complex assertions and their failure messages. JUnit assertions and Truth assertions can (and often do) live side by side in tests.

To get started with Truth, check out http://google.github.io/truth/

GTAC 2014 Wrap-up

Thursday, February 19, 2015 13:19 PM

by Anthony Vallone on behalf of the GTAC Committee

On October 28th and 29th, GTAC 2014, the eighth GTAC (Google Test Automation Conference), was held at the beautiful Google Kirkland office. The conference was completely packed with presenters and attendees from all over the world (Argentina, Australia, Canada, China, many European countries, India, Israel, Korea, New Zealand, Puerto Rico, Russia, Taiwan, and many US states), bringing with them a huge diversity of experiences.


Speakers from numerous companies and universities (Adobe, American Express, Comcast, Dropbox, Facebook, FINRA, Google, HP, Medidata Solutions, Mozilla, Netflix, Orange, and University of Waterloo) spoke on a variety of interesting and cutting edge test automation topics.

All of the slides and video recordings are now available on the GTAC site. Photos will be available soon as well.


This was our most popular GTAC to date, with over 1,500 applicants and almost 200 of those for speaking. About 250 people filled our venue to capacity, and the live stream had a peak of about 400 concurrent viewers with 4,700 playbacks during the event. And, there was plenty of interesting Twitter and Google+ activity during the event.


Our goal in hosting GTAC is to make the conference highly relevant and useful, not only for attendees, but for the larger test engineering community as a whole. Our post-conference survey shows that we are close to achieving that goal:



If you have any suggestions on how we can improve, please comment on this post.

Thank you to all the speakers, attendees, and online viewers who made this a special event once again. To receive announcements about the next GTAC, subscribe to the Google Testing Blog.

Protractor: Angular testing made easy

Sunday, November 30, 2014 17:50 PM

By Hank Duan, Julie Ralph, and Arif Sukoco in Seattle

Have you worked with WebDriver but been frustrated with all the waits needed for WebDriver to sync with the website, causing flakes and prolonged test times? If you are working with AngularJS apps, then Protractor is the right tool for you.

Protractor (protractortest.org) is an end-to-end test framework specifically for AngularJS apps. It was built by a team in Google and released to open source. Protractor is built on top of WebDriverJS and includes important improvements tailored for AngularJS apps. Here are some of Protractor’s key benefits:

  • You don’t need to add waits or sleeps to your test. Protractor can communicate with your AngularJS app automatically and execute the next step in your test the moment the webpage finishes pending tasks, so you don’t have to worry about waiting for your test and webpage to sync. 
  • It supports Angular-specific locator strategies (e.g., binding, model, repeater) as well as native WebDriver locator strategies (e.g., ID, CSS selector, XPath). This allows you to test Angular-specific elements without any setup effort on your part. 
  • It is easy to set up page objects. Protractor does not execute WebDriver commands until an action is needed (e.g., get, sendKeys, click). This way you can set up page objects so tests can manipulate page elements without touching the HTML. 
  • It uses Jasmine, the framework you use to write AngularJS unit tests, and JavaScript, the same language you use to write AngularJS apps.

Follow these simple steps, and in minutes, you will have your first Protractor test running:

1) Set up environment

Install the command line tools ‘protractor’ and ‘webdriver-manager’ using npm:
npm install -g protractor

Start up an instance of a selenium server:
webdriver-manager update & webdriver-manager start

This downloads the necessary binary, and starts a new webdriver session listening on http://localhost:4444.

2) Write your test
// It is a good idea to use page objects to modularize your testing logic
var angularHomepage = {
  nameInput : element(by.model('yourName')),
  greeting : element(by.binding('yourName')),
  get : function() {
    browser.get('index.html');
  },
  setName : function(name) {
    this.nameInput.sendKeys(name);
  }
};

// Here we are using the Jasmine test framework
// See http://jasmine.github.io/2.0/introduction.html for more details
describe('angularjs homepage', function() {
  it('should greet the named user', function() {
    angularHomepage.get();
    angularHomepage.setName('Julie');
    expect(angularHomepage.greeting.getText()).
        toEqual('Hello Julie!');
  });
});

3) Write a Protractor configuration file to specify the environment under which you want your test to run:
exports.config = {
  seleniumAddress: 'http://localhost:4444/wd/hub',

  specs: ['testFolder/*'],

  multiCapabilities: [{
    'browserName': 'chrome',
    // browser-specific tests
    specs: 'chromeTests/*'
  }, {
    'browserName': 'firefox',
    // run tests in parallel
    shardTestFiles: true
  }],

  baseUrl: 'http://www.angularjs.org',
};

4) Run the test:

Start the test with the command:
protractor conf.js

The test output should be:
1 test, 1 assertions, 0 failures


If you want to learn more, here’s a full tutorial that highlights all of Protractor’s features: http://angular.github.io/protractor/#/tutorial

GTAC 2014 is this Week!

Thursday, February 19, 2015 13:20 PM

by Anthony Vallone on behalf of the GTAC Committee

The eighth GTAC commences on Tuesday at the Google Kirkland office. You can find the latest details on the conference at our site, including speaker profiles.

If you are watching remotely, we'll soon be updating the live stream page with the stream link and a Google Moderator link for remote Q&A.

If you have been selected to attend or speak, be sure to note the updated parking information. Google visitors will use off-site parking and shuttles.

We look forward to connecting with the greater testing community and sharing new advances and ideas.

Testing on the Toilet: Writing Descriptive Test Names

Monday, October 20, 2014 18:22 PM

by Andrew Trenk

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

How long does it take you to figure out what behavior is being tested in the following code?

@Test public void isUserLockedOut_invalidLogin() {
  authenticator.authenticate(username, invalidPassword);
  assertFalse(authenticator.isUserLockedOut(username));

  authenticator.authenticate(username, invalidPassword);
  assertFalse(authenticator.isUserLockedOut(username));

  authenticator.authenticate(username, invalidPassword);
  assertTrue(authenticator.isUserLockedOut(username));
}

You probably had to read through every line of code (maybe more than once) and understand what each line is doing. But how long would it take you to figure out what behavior is being tested if the test had this name?

isUserLockedOut_lockOutUserAfterThreeInvalidLoginAttempts

You should now be able to understand what behavior is being tested by reading just the test name, and you don’t even need to read through the test body. The test name in the above code sample hints at the scenario being tested (“invalidLogin”), but it doesn’t actually say what the expected outcome is supposed to be, so you had to read through the code to figure it out.

Putting both the scenario and the expected outcome in the test name has several other benefits:

- If you want to know all the possible behaviors a class has, all you need to do is read through the test names in its test class, compared to spending minutes or hours digging through the test code or even the class itself trying to figure out its behavior. This can also be useful during code reviews since you can quickly tell if the tests cover all expected cases.

- Giving tests more explicit names forces you to split up testing different behaviors into separate tests. Otherwise you may be tempted to dump assertions for different behaviors into one test, which over time can lead to tests that keep growing and become difficult to understand and maintain.

- The exact behavior being tested might not always be clear from the test code. If the test name isn’t explicit about this, sometimes you might have to guess what the test is actually testing.

- You can easily tell if some functionality isn’t being tested. If you don’t see a test name that describes the behavior you’re looking for, then you know the test doesn’t exist.

- When a test fails, you can immediately see what functionality is broken without looking at the test’s source code.

There are several common patterns for structuring the name of a test (one example is to name tests like an English sentence with “should” in the name, e.g., shouldLockOutUserAfterThreeInvalidLoginAttempts). Whichever pattern you use, the same advice still applies: Make sure test names contain both the scenario being tested and the expected outcome.

Sometimes just specifying the name of the method under test may be enough, especially if the method is simple and has only a single behavior that is obvious from its name.
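As an illustrative sketch building on the authenticator example above, splitting the behaviors apart and naming each test after both the scenario and the expected outcome might look like this:

@Test public void isUserLockedOut_userIsNotLockedOutAfterTwoInvalidLoginAttempts() {
  authenticator.authenticate(username, invalidPassword);
  authenticator.authenticate(username, invalidPassword);

  assertFalse(authenticator.isUserLockedOut(username));
}

@Test public void isUserLockedOut_lockOutUserAfterThreeInvalidLoginAttempts() {
  authenticator.authenticate(username, invalidPassword);
  authenticator.authenticate(username, invalidPassword);
  authenticator.authenticate(username, invalidPassword);

  assertTrue(authenticator.isUserLockedOut(username));
}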

Announcing the GTAC 2014 Agenda

Thursday, February 19, 2015 13:20 PM

by Anthony Vallone on behalf of the GTAC Committee

We have completed selection and confirmation of all speakers and attendees for GTAC 2014. You can find the detailed agenda at:
  developers.google.com/gtac/2014/schedule

Thank you to all who submitted proposals! It was very hard to make selections from so many fantastic submissions.

There was a tremendous amount of interest in GTAC this year with over 1,500 applicants (up from 533 last year) and 194 of those for speaking (up from 88 last year). Unfortunately, our venue only seats 250. However, don’t despair if you did not receive an invitation. Just like last year, anyone can join us via YouTube live streaming. We’ll also be setting up Google Moderator, so remote attendees can get involved in Q&A after each talk. Information about live streaming, Moderator, and other details will be posted on the GTAC site soon and announced here.

Chrome - Firefox WebRTC Interop Test - Pt 2

Tuesday, September 09, 2014 20:09 PM

by Patrik Höglund

This is the second in a series of articles about Chrome’s WebRTC Interop Test. See the first.

In the previous blog post we managed to write an automated test which got a WebRTC call between Firefox and Chrome to run. But how do we verify that the call actually worked?

Verifying the Call

Now we can launch the two browsers, but how do we figure out whether the call actually worked? If you try opening two apprtc.appspot.com tabs in the same room, you will notice the video feeds flip over using a CSS transform, your local video is relegated to a small frame and a new big video feed with the remote video shows up. For the first version of the test, I just looked at the page in the Chrome debugger and looked for some reliable signal. As it turns out, the remoteVideo.style.opacity property will go from 0 to 1 when the call goes up and from 1 to 0 when it goes down. Since we can execute arbitrary JavaScript in the Chrome tab from the test, we can simply implement the check like this:

bool WaitForCallToComeUp(content::WebContents* tab_contents) {
  // Apprtc will set remoteVideo.style.opacity to 1 when the call comes up.
  std::string javascript =
      "window.domAutomationController.send(remoteVideo.style.opacity)";
  return test::PollingWaitUntil(javascript, "1", tab_contents);
}

Verifying Video is Playing

So getting a call up is good, but what if there is a bug where Firefox and Chrome cannot send correct video streams to each other? To check that, we needed to step up our game a bit. We decided to use our existing video detector, which looks at a video element and determines if the pixels are changing. This is a very basic check, but it’s better than nothing. To do this, we simply evaluate the .js file’s JavaScript in the context of the Chrome tab, making the functions in the file available to us. The implementation then becomes

bool DetectRemoteVideoPlaying(content::WebContents* tab_contents) {
  if (!EvalInJavascriptFile(tab_contents, GetSourceDir().Append(
          FILE_PATH_LITERAL(
              "chrome/test/data/webrtc/test_functions.js"))))
    return false;
  if (!EvalInJavascriptFile(tab_contents, GetSourceDir().Append(
          FILE_PATH_LITERAL(
              "chrome/test/data/webrtc/video_detector.js"))))
    return false;

  // The remote video tag is called remoteVideo in the AppRTC code.
  StartDetectingVideo(tab_contents, "remoteVideo");
  WaitForVideoToPlay(tab_contents);
  return true;
}

where StartDetectingVideo and WaitForVideoToPlay call the corresponding JavaScript methods in video_detector.js. If the video feed is frozen and unchanging, the test will time out and fail.

What to Send in the Call

Now we can get a call up between the browsers and detect if video is playing. But what video should we send? For Chrome, we have a convenient --use-fake-device-for-media-stream flag that will make Chrome pretend there’s a webcam and present a generated video feed (which is a spinning green ball with a timestamp). This turned out to be useful since Firefox and Chrome cannot acquire the same camera at the same time, so if we didn’t use the fake device we would need two webcams plugged into the bots executing the tests!

Bots running in Chrome’s regular test infrastructure do not have either software or hardware webcams plugged into them, so this test must run on bots with webcams for Firefox to be able to acquire a camera. Fortunately, we have that in the WebRTC waterfalls in order to test that we can actually acquire hardware webcams on all platforms. We also added a check to just succeed the test when there’s no real webcam on the system since we don’t want it to fail when a dev runs it on a machine without a webcam:

if (!HasWebcamOnSystem())
  return;

It would of course be better if Firefox had a similar fake device, but to my knowledge it doesn’t.

Downloading all Code and Components 

Now we have all we need to run the test and have it verify something useful. We just have the hard part left: how do we actually download all the resources we need to run this test? Recall that this is actually a three-way integration test between Chrome, Firefox and AppRTC, which requires the following:

  • The AppEngine SDK in order to bring up the local AppRTC instance, 
  • The AppRTC code itself, 
  • Chrome (already present in the checkout), and 
  • Firefox nightly.

While developing the test, I initially just downloaded and installed these by hand and hard-coded the paths. This is a very bad idea in the long run. Recall that the Chromium infrastructure comprises thousands and thousands of machines, and while this test will only run on perhaps 5 at a time due to its webcam requirements, we don’t want manual maintenance work whenever we replace a machine. And for that matter, we definitely don’t want to download a new Firefox by hand every night and put it in the right location on the bots! So how do we automate this?

Downloading the AppEngine SDK
First, let’s start with the easy part. We don’t really care if the AppEngine SDK is up-to-date, so a relatively stale version is fine. We could have the test download it from the authoritative source, but that’s a bad idea for a couple reasons. First, it updates outside our control. Second, there could be anti-robot measures on the page. Third, the download will likely be unreliable and fail the test occasionally.

The way we solved this was to upload a copy of the SDK to a Google storage bucket under our control and download it using the depot_tools script download_from_google_storage.py. This is a lot more reliable than an external website and will not download the SDK if we already have the right version on the bot.

Downloading the AppRTC Code
This code is on GitHub. Experience has shown that git clone commands run against GitHub will fail every now and then, and fail the test. We could write some retry mechanism, but we have found it’s better to simply mirror the git repository in Chromium’s internal mirrors, which are closer to our bots and thereby more reliable from our perspective. The pull is done by a Chromium DEPS file (which is Chromium’s dependency provisioning framework).

Downloading Firefox
It turns out that Firefox supplies handy libraries for this task. We’re using mozdownload in this script in order to download the Firefox nightly build. Unfortunately this fails every now and then so we would like to have some retry mechanism, or we could write some mechanism to “mirror” the Firefox nightly build in some location we control.

Putting it Together

With that, we have everything we need to deploy the test. You can see the final code here.

The provisioning code above was put into a separate “.gclient solution” so that regular Chrome devs and bots are not burdened with downloading hundreds of megs of SDKs and code that they will not use. When this test runs, you will first see a Chrome browser pop up, which will ensure the local apprtc instance is up. Then a Firefox browser will pop up. They will each acquire the fake device and real camera, respectively, and after a short delay the AppRTC call will come up, proving that video interop is working.

This is a complicated and expensive test, but we believe it is worth it to keep the main interop case under automation this way, especially as the spec evolves and the browsers are in varying states of implementation.

Future Work

  • Also run on Windows/Mac. 
  • Also test Opera. 
  • Interop between Chrome/Firefox mobile and desktop browsers. 
  • Also ensure audio is playing. 
  • Measure bandwidth stats, video quality, etc.


Chrome - Firefox WebRTC Interop Test - Pt 1

Tuesday, August 26, 2014 21:09 PM

by Patrik Höglund

WebRTC enables real time peer-to-peer video and voice transfer in the browser, making it possible to build, among other things, a working video chat with a small amount of Python and JavaScript. As a web standard, it has several unusual properties which make it hard to test. A regular web standard generally accepts HTML text and yields a bitmap as output (what you see in the browser). For WebRTC, we have real-time RTP media streams on one side being sent to another WebRTC-enabled endpoint. These RTP packets have been jumping across NAT, through firewalls and perhaps through TURN servers to deliver hopefully stutter-free and low latency media.

WebRTC is probably the only web standard in which we need to test direct communication between Chrome and other browsers. Remember, WebRTC builds on peer-to-peer technology, which means we talk directly between browsers rather than through a server. Chrome, Firefox and Opera have announced support for WebRTC so far. To test interoperability, we set out to build an automated test to ensure that Chrome and Firefox can get a call up. This article describes how we implemented such a test and the tradeoffs we made along the way.

Calling in WebRTC

Setting up a WebRTC call requires passing SDP blobs over a signaling connection. These blobs contain information on the capabilities of the endpoint, such as what media formats it supports and what preferences it has (for instance, perhaps the endpoint has VP8 decoding hardware, which means the endpoint will handle VP8 more efficiently than, say, H.264). By sending these blobs the endpoints can agree on what media format they will be sending between themselves and how to traverse the network between them. Once that is done, the browsers will talk directly to each other, and nothing gets sent over the signaling connection.

Figure 1. Signaling and media connections.

How these blobs are sent is up to the application. Usually the browsers connect to some server which mediates the connection between the browsers, for instance by using a contact list or a room number. The AppRTC reference application uses room numbers to pair up browsers and sends the SDP blobs from the browsers through the AppRTC server.

Test Design

Instead of designing a new signaling solution from scratch, we chose to use the AppRTC application we already had. This has the additional benefit of testing the AppRTC code, which we are also maintaining. We could also have used the small peerconnection_server binary and some JavaScript, which would give us additional flexibility in what to test. We chose to go with AppRTC since it effectively implements the signaling for us, leading to much less test code.

We assumed we would be able to get hold of the latest nightly Firefox and be able to launch that with a given URL. For the Chrome side, we assumed we would be running in a browser test, i.e. on a complete Chrome with some test scaffolding around it. For the first sketch of the test, we imagined just connecting the browsers to the live apprtc.appspot.com with some random room number. If the call got established, we would be able to look at the remote video feed on the Chrome side and verify that video was playing (for instance using the video+canvas grab trick). Furthermore, we could verify that audio was playing, for instance by using WebRTC getStats to measure the audio track energy level.

Figure 2. Basic test design.

However, since we like tests to be hermetic, this isn’t a good design. I can see several problems. For example, what if the network between us and AppRTC is unreliable? Also, what if someone has occupied myroomid? If that were the case, the test would fail and we would be none the wiser. So to make this thing work, we would have to find some way to bring up the AppRTC instance on localhost to make our test hermetic.

Bringing up AppRTC on localhost

AppRTC is a Google App Engine application. As this hello world example demonstrates, one can test applications locally with
google_appengine/dev_appserver.py apprtc_code/

So why not just call this from our test? It turns out we need to solve some complicated problems first, like how to ensure the AppEngine SDK and the AppRTC code is actually available on the executing machine, but we’ll get to that later. Let’s assume for now that stuff is just available. We can now write the browser test code to launch the local instance:
bool LaunchApprtcInstanceOnLocalhost() {
  // ... Figure out locations of SDK and apprtc code ...
  CommandLine command_line(CommandLine::NO_PROGRAM);
  EXPECT_TRUE(GetPythonCommand(&command_line));

  command_line.AppendArgPath(appengine_dev_appserver);
  command_line.AppendArgPath(apprtc_dir);
  command_line.AppendArg("--port=9999");
  command_line.AppendArg("--admin_port=9998");
  command_line.AppendArg("--skip_sdk_update_check");

  VLOG(1) << "Running " << command_line.GetCommandLineString();
  return base::LaunchProcess(command_line, base::LaunchOptions(),
                             &dev_appserver_);
}

That’s pretty straightforward [1].

Figuring out Whether the Local Server is Up 

Then we ran into a very typical test problem. So we have the code to get the server up, and launching the two browsers to connect to http://localhost:9999?r=some_room is easy. But how do we know when to connect? When I first ran the test, it would work sometimes and sometimes not, depending on whether the server had time to get up.

It’s tempting in these situations to just add a sleep to give the server time to get up. Don’t do that. That will result in a test that is flaky and/or slow. In these situations we need to identify what we’re really waiting for. We could probably monitor the stdout of the dev_appserver.py and look for some message that says “Server is up!” or equivalent. However, we’re really waiting for the server to be able to serve web pages, and since we have two browsers that are really good at connecting to servers, why not use them? Consider this code.
bool LocalApprtcInstanceIsUp() {
  // Load the admin page and see if we manage to load it right.
  ui_test_utils::NavigateToURL(browser(), GURL("localhost:9998"));
  content::WebContents* tab_contents =
      browser()->tab_strip_model()->GetActiveWebContents();
  std::string javascript =
      "window.domAutomationController.send(document.title)";
  std::string result;
  if (!content::ExecuteScriptAndExtractString(tab_contents,
                                              javascript,
                                              &result))
    return false;

  return result == kTitlePageOfAppEngineAdminPage;
}

Here we ask Chrome to load the AppEngine admin page for the local server (we set the admin port to 9998 earlier, remember?) and ask it what its title is. If that title is “Instances”, the admin page has been displayed, and the server must be up. If the server isn’t up, Chrome will fail to load the page and the title will be something like “localhost:9999 is not available”.

Then, we can just do this from the test:
while (!LocalApprtcInstanceIsUp())
  VLOG(1) << "Waiting for AppRTC to come up...";

If the server never comes up, for whatever reason, the test will just time out in that loop. If it comes up we can safely proceed with the rest of test.

Launching the Browsers 

A browser window launches itself as a part of every Chromium browser test. It’s also easy for the test to control the command line switches the browser will run under.

We have less control over the Firefox browser since it is the “foreign” browser in this test, but we can still pass command-line options to it when we invoke the Firefox process. To make this easier, Mozilla provides a Python library called mozrunner. Using that we can set up a launcher python script we can invoke from the test:
from mozprofile import profile
from mozrunner import runner

WEBRTC_PREFERENCES = {
    'media.navigator.permission.disabled': True,
}

def main():
  # Set up flags, handle SIGTERM, etc
  # ...
  firefox_profile = profile.FirefoxProfile(preferences=WEBRTC_PREFERENCES)
  firefox_runner = runner.FirefoxRunner(
      profile=firefox_profile, binary=options.binary,
      cmdargs=[options.webpage])

  firefox_runner.start()

Notice that we need to pass special preferences to make Firefox accept the getUserMedia prompt. Otherwise, the test would get stuck on the prompt and we would be unable to set up a call. Alternatively, we could employ some kind of clickbot to click “Allow” on the prompt when it pops up, but that is way harder to set up.

Without going into too much detail, the code for launching the browsers becomes:
GURL room_url =
    GURL(base::StringPrintf("http://localhost:9999?r=room_%d",
                            base::RandInt(0, 65536)));
content::WebContents* chrome_tab =
    OpenPageAndAcceptUserMedia(room_url);
ASSERT_TRUE(LaunchFirefoxWithUrl(room_url));

Where LaunchFirefoxWithUrl essentially runs this:
run_firefox_webrtc.py --binary /path/to/firefox --webpage http://localhost:9999?r=my_room

Now we can launch the two browsers. Next time we will look at how we verify that the call actually worked, and how we download all the resources needed by the test in a maintainable and automated manner. Stay tuned!


[1] The explicit ports are there because the default ports collided on the bots we were running on, and --skip_sdk_update_check is there because the SDK would otherwise stop and prompt us whenever an update was available.

Testing on the Toilet: Web Testing Made Easier: Debug IDs

Wednesday, August 13, 2014 17:58 PM

by Ruslan Khamitov 

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

Adding ID attributes to elements can make it much easier to write tests that interact with the DOM (e.g., WebDriver tests). Consider the following DOM with two buttons that differ only by inner text:

<div class="button">Save</div>
<div class="button">Edit</div>

How would you tell WebDriver to interact with the “Save” button in this case? You have several options. One option is to interact with the button using a CSS selector:
div.button

However, this approach is not sufficient to identify a particular button, and there is no mechanism to filter by text in CSS. Another option would be to write an XPath, which is generally fragile and discouraged:
//div[@class='button' and text()='Save']

Your best option is to add unique hierarchical IDs where each widget is passed a base ID that it prepends to the ID of each of its children. The IDs for each button will be:
contact-form.save-button
contact-form.edit-button
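
With IDs like these in place, looking up the “Save” button from a test becomes a one-liner. Here is a rough sketch in Java WebDriver; the page URL and test scaffolding are made up purely for illustration:
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class ContactFormTest {
  @org.junit.Test
  public void clicksSaveButtonByDebugId() {
    WebDriver driver = new ChromeDriver();
    try {
      driver.get("http://localhost:8080/contacts");  // hypothetical page hosting the form
      // The hierarchical ID identifies exactly one element; no CSS or XPath guesswork.
      driver.findElement(By.id("contact-form.save-button")).click();
    } finally {
      driver.quit();
    }
  }
}

The locator also stays stable if the button’s text or styling changes, which is what makes debug IDs cheaper to maintain than CSS or XPath selectors.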

In GWT you can generate these hierarchical IDs by overriding onEnsureDebugId() on your widgets. Doing so allows you to create custom logic for applying debug IDs to the sub-elements that make up a custom widget:
@Override protected void onEnsureDebugId(String baseId) {
  super.onEnsureDebugId(baseId);
  saveButton.ensureDebugId(baseId + ".save-button");
  editButton.ensureDebugId(baseId + ".edit-button");
}

Consider another example. Let’s set IDs for repeated UI elements in Angular using ng-repeat. Setting an index can help differentiate between repeated instances of each element:
<tr id="feedback-{{$index}}" class="feedback" ng-repeat="feedback in ctrl.feedbacks" >

In GWT you can do this with ensureDebugId(). Let’s set an ID for each of the table cells:
@UiField FlexTable table;
UIObject.ensureDebugId(table.getCellFormatter().getElement(rowIndex, columnIndex),
    baseId + columnIndex + "-" + rowIndex);

Take-away: Debug IDs are easy to set and make a huge difference for testing. Please add them early.

Testing on the Toilet: Don't Put Logic in Tests

Thursday, July 31, 2014 16:59 PM

by Erik Kuefler

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

Programming languages give us a lot of expressive power. Concepts like operators and conditionals are important tools that allow us to write programs that handle a wide range of inputs. But this flexibility comes at the cost of increased complexity, which makes our programs harder to understand.

In test code, unlike production code, simplicity is more important than flexibility. Most unit tests verify that a single, known input produces a single, known output. Tests can avoid complexity by stating their inputs and outputs directly rather than computing them. Otherwise it's easy for tests to develop their own bugs.

Let's take a look at a simple example. Does this test look correct to you?

@Test public void shouldNavigateToPhotosPage() {
  String baseUrl = "http://plus.google.com/";
  Navigator nav = new Navigator(baseUrl);
  nav.goToPhotosPage();
  assertEquals(baseUrl + "/u/0/photos", nav.getCurrentUrl());
}

The author is trying to avoid duplication by storing a shared prefix in a variable. Performing a single string concatenation doesn't seem too bad, but what happens if we simplify the test by inlining the variable?

@Test public void shouldNavigateToPhotosPage() {
  Navigator nav = new Navigator("http://plus.google.com/");
  nav.goToPhotosPage();
  assertEquals("http://plus.google.com//u/0/photos", nav.getCurrentUrl()); // Oops!
}

After eliminating the unnecessary computation from the test, the bug is obvious—we're expecting two slashes in the URL! This test will either fail or (even worse) incorrectly pass if the production code has the same bug. We never would have made this mistake if we had stated our inputs and outputs directly instead of trying to compute them. And this is a very simple example—when a test adds more operators or includes loops and conditionals, it becomes increasingly difficult to be confident that it is correct.

Another way of saying this is that, whereas production code describes a general strategy for computing outputs given inputs, tests are concrete examples of input/output pairs (where output might include side effects like verifying interactions with other classes). It's usually easy to tell whether an input/output pair is correct or not, even if the logic required to compute it is very complex. For instance, it's hard to picture the exact DOM that would be created by a JavaScript function for a given server response, so the ideal test for such a function would just compare against a string containing the expected output HTML.

When tests do need their own logic, such logic should often be moved out of the test bodies and into utilities and helper functions. Since such helpers can get quite complex, it's usually a good idea for any nontrivial test utility to have its own tests.
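
To make that concrete, here is a small sketch (hypothetical names, assuming JUnit) where the URL-joining logic from the example above lives in a tiny utility that gets its own test, instead of being repeated inside test bodies:
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class TestUrlsTest {
  // Hypothetical test utility: shared logic lives here instead of inside test bodies.
  static final class TestUrls {
    static String join(String base, String path) {
      return base.endsWith("/") ? base + path : base + "/" + path;
    }
  }

  // Nontrivial helpers get their own tests, so a bug like the double slash above
  // is caught in the helper rather than silently baked into many tests.
  @Test public void joinAvoidsDoubleSlash() {
    assertEquals("http://plus.google.com/u/0/photos",
        TestUrls.join("http://plus.google.com/", "u/0/photos"));
  }
}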

The Deadline to Sign up for GTAC 2014 is Jul 28

Thursday, February 19, 2015 13:21 PM

Posted by Anthony Vallone on behalf of the GTAC Committee

The deadline to sign up for GTAC 2014 is next Monday, July 28th, 2014. There is a great deal of interest in both attending and speaking, and we’ve received many outstanding proposals. However, it’s not too late to add yours for consideration. If you would like to speak or attend, be sure to complete the form by Monday.

We will be making regular updates to our site over the next several weeks, and you can find conference details there:
  developers.google.com/gtac

For those who have already signed up to attend or speak, we will contact you directly in mid-August.

Measuring Coverage at Google

Monday, July 14, 2014 19:45 PM

By Marko Ivanković, Google Zürich

Introduction


Code coverage is a very interesting metric, covered by a large body of research that reaches somewhat contradictory conclusions. Some people think it is an extremely useful metric and that a certain percentage of coverage should be enforced on all code. Some think it is a useful tool for identifying areas that need more testing but don’t necessarily trust that covered code is truly well tested. Still others think that measuring coverage is actively harmful because it provides a false sense of security.

Our team’s mission was to collect coverage-related data and then develop and champion code coverage practices across Google. We designed an opt-in system where engineers could enable two different types of coverage measurements for their projects: daily and per-commit. With daily coverage, we run all tests for the project, whereas with per-commit coverage we run only the tests affected by the commit. The two measurements are independent, and many projects opted into both.

While we did experiment with branch, function and statement coverage, we ended up focusing mostly on statement coverage (the fraction of executable statements exercised by at least one test) because of its relative simplicity and ease of visualization.

How we measured


Our job was made significantly easier by the wonderful Google build system, whose parallelism and flexibility allowed us to scale our measurements across all of Google. The build system had integrated various language-specific open source coverage measurement tools like Gcov (C++), Emma / JaCoCo (Java) and Coverage.py (Python), and we provided a central system where teams could sign up for coverage measurement.

For daily whole project coverage measurements, each team was provided with a simple cronjob that would run all tests across the project’s codebase. The results of these runs were available to the teams in a centralized dashboard that displays charts showing coverage over time and allows daily / weekly / quarterly / yearly aggregations and per-language slicing. On this dashboard teams can also compare their project (or projects) with any other project, or Google as a whole.

For per-commit measurement, we hook into the Google code review process (briefly explained in this article) and display the data visually to both the commit author and the reviewers. We display the data on two levels: color coded lines right next to the color coded diff and a total aggregate number for the entire commit.


Displayed above is a screenshot of the code review tool. The green line coloring is the standard diff coloring for added lines. The orange and lighter green coloring on the line numbers is the coverage information. We use light green for covered lines, orange for non-covered lines and white for non-instrumented lines.

It’s important to note that we surface the coverage information before the commit is submitted to the codebase, because this is the time when engineers are most likely to be interested in improving it.

Results


One of the main benefits of working at Google is the scale at which we operate. We have been running the coverage measurement system for some time now and we have collected data for more than 650 different projects, spanning 100,000+ commits. We have a significant amount of data for C++, Java, Python, Go and JavaScript code.

I am happy to say that we can share some preliminary results with you today:


The chart above is the histogram of average values of measured absolute coverage across Google. The median (50th percentile) code coverage is 78%, the 75th percentile is 85%, and the 90th percentile is 90%. We believe that these numbers represent a very healthy codebase.

We have also found it very interesting that there are significant differences between languages:

Language     Coverage
C++          56.6%
Java         61.2%
Go           63.0%
JavaScript   76.9%
Python       84.2%


The table above shows the total coverage of all analyzed code for each language, averaged over the past quarter. We believe that the large differences are due to differences in structure, paradigms and best practices between languages, as well as to the fact that coverage can be measured more precisely in some languages than in others.

Note that these numbers should not be interpreted as guidelines for a particular language; the aggregation method used is too simple for that. Instead, this finding is simply a data point for any future research that analyzes samples from a single programming language.

The feedback from our fellow engineers was overwhelmingly positive. The most loved feature was surfacing the coverage information during code review time. This early surfacing of coverage had a statistically significant impact: our initial analysis suggests that it increased coverage by 10% (averaged across all commits).

Future work


We are aware that there are a few problems with the dataset we collected. In particular, the individual tools we use to measure coverage are not perfect. Large integration tests, end-to-end tests and UI tests are difficult to instrument, so large parts of code exercised by such tests can be misreported as non-covered.

We are working on improving the tools, and also on analyzing the impact of unit tests, integration tests and other types of tests individually.

In addition to languages, we will also investigate other factors that might influence coverage, such as platforms and frameworks, to allow all future research to account for their effect.

We will be publishing more of our findings in the future, so stay tuned.

And if this sounds like something you would like to work on, why not apply on our job site?

ThreadSanitizer: Slaughtering Data Races

Monday, June 30, 2014 21:30 PM

by Dmitry Vyukov, Synchronization Lookout, Google, Moscow

Hello,

I work in the Dynamic Testing Tools team at Google. Our team develops tools like AddressSanitizer, MemorySanitizer and ThreadSanitizer which find various kinds of bugs. In this blog post I want to tell you about ThreadSanitizer, a fast data race detector for C++ and Go programs.

First of all, what is a data race? A data race occurs when two threads access the same variable concurrently, and at least one of the accesses is a write. Most programming languages provide very weak guarantees, or no guarantees at all, for programs with data races. For example, in C++ absolutely any data race renders the behavior of the whole program completely undefined (yes, it can suddenly format the hard drive). Data races are common in concurrent programs, and they are notoriously hard to debug and localize. A typical manifestation of a data race is a program that occasionally crashes with obscure symptoms that are different each time and do not point to any particular place in the source code. Such bugs can take several months of debugging without particular success, since typical debugging techniques do not work. Fortunately, ThreadSanitizer can catch most data races in the blink of an eye. See Chromium issue 15577 for an example of such a data race and issue 18488 for the resolution.
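
To make the definition concrete, here is the classic lost-update race sketched in Java. This is purely an illustration of the concept (ThreadSanitizer itself targets C++ and Go code):
public class RacyCounter {
  private static int counter = 0;  // shared between threads, no synchronization

  public static void main(String[] args) throws InterruptedException {
    Runnable work = () -> {
      for (int i = 0; i < 1_000_000; i++) {
        counter++;  // unsynchronized read-modify-write: a data race
      }
    };
    Thread t1 = new Thread(work);
    Thread t2 = new Thread(work);
    t1.start();
    t2.start();
    t1.join();
    t2.join();
    // Typically prints less than 2,000,000, and a different number on every run,
    // because increments from the two threads overwrite each other.
    System.out.println(counter);
  }
}

The symptom -- a wrong answer that differs from run to run -- is exactly the kind of nondeterministic behavior described above.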

Due to the complex nature of bugs caught by ThreadSanitizer, we don't suggest waiting until product release validation to use the tool. At Google, for example, we've made our tools easily accessible to programmers during development, so that anyone can use the tool for testing if they suspect that new code might introduce a race. For both Chromium and the Google internal server codebase, we continuously run unit tests that use the tool. This catches many regressions instantly. The Chromium project has recently started using ThreadSanitizer on ClusterFuzz, a large-scale fuzzing system. Finally, some teams also set up periodic end-to-end testing with ThreadSanitizer under a realistic workload, which proves to be extremely valuable. Our team has zero tolerance for races found by the tool and does not consider any race to be benign, as even the most benign-looking races can lead to memory corruption.

Our tools are dynamic (as opposed to static tools). This means that they do not merely "look" at the code and try to surmise where bugs can be; instead they instrument the binary at build time and then analyze the dynamic behavior of the program to catch it red-handed. This approach has its pros and cons. On one hand, the tool does not have any false positives, so it does not bother a developer with something that is not a bug. On the other hand, in order to catch a bug, the test must expose it -- the racing data accesses must actually be executed in different threads. This requires writing good multi-threaded tests and makes end-to-end testing especially effective.

As a bonus, ThreadSanitizer finds some other types of bugs: thread leaks, deadlocks, incorrect uses of mutexes, malloc calls in signal handlers, and more. It also natively understands atomic operations and thus can find bugs in lock-free algorithms (see e.g. this bug in the V8 concurrent garbage collector).

The tool is supported by both Clang and GCC compilers (only on Linux/Intel64). Using it is very simple: you just need to add a -fsanitize=thread flag during compilation and linking. For Go programs, you simply need to add a -race flag to the go tool (supported on Linux, Mac and Windows).

Interestingly, after integrating the tool into compilers, we've found some bugs in the compilers themselves. For example, LLVM was illegally widening stores, which can introduce very harmful data races into otherwise correct programs. And GCC was injecting unsafe code for initialization of function static variables. Among our other trophies are more than a thousand bugs in Chromium, Firefox, the Go standard library, WebRTC, OpenSSL, and of course in our internal projects.

So what are you waiting for? You know what to do!

GTAC 2014: Call for Proposals & Attendance

Thursday, February 19, 2015 13:21 PM

Posted by Anthony Vallone on behalf of the GTAC Committee

The application process is now open for presentation proposals and attendance for GTAC (Google Test Automation Conference) (see the initial announcement), to be held at the Google Kirkland office (near Seattle, WA) on October 28-29, 2014.

GTAC will be streamed live on YouTube again this year, so even if you can’t attend, you’ll be able to watch the conference from your computer.

Speakers
Presentations are targeted at students, academics, and experienced engineers working on test automation. Full presentations and lightning talks are 45 minutes and 15 minutes respectively. Speakers should be prepared for a question and answer session following their presentation.

Application
For presentation proposals and/or attendance, complete this form. We will be selecting about 300 applicants for the event.

Deadline
The due date for both presentation and attendance applications is July 28, 2014.

Fees
There are no registration fees, and we will send out detailed registration instructions to each invited applicant. Meals will be provided, but speakers and attendees must arrange and pay for their own travel and accommodations.

Update: Our contact email was bouncing; this is now fixed.



GTAC 2014 Coming to Seattle/Kirkland in October

Thursday, February 19, 2015 13:22 PM

Posted by Anthony Vallone on behalf of the GTAC Committee

If you're looking for a place to discuss the latest innovations in test automation, then charge your tablets and pack your gumboots - the eighth GTAC (Google Test Automation Conference) will be held on October 28-29, 2014 at Google Kirkland! The Kirkland office is part of the Seattle/Kirkland campus in beautiful Washington state. This campus forms our third largest engineering office in the USA.



GTAC is a periodic conference hosted by Google, bringing together engineers from industry and academia to discuss advances in test automation and the test engineering computer science field. It’s a great opportunity to present, learn, and challenge modern testing technologies and strategies.

You can browse the presentation abstracts, slides, and videos from last year on the GTAC 2013 page.

Stay tuned to this blog and the GTAC website for application information and opportunities to present at GTAC. Subscribing to this blog is the best way to get notified. We're looking forward to seeing you there!

Testing on the Toilet: Risk-Driven Testing

Friday, May 30, 2014 23:10 PM

by Peter Arrenbrecht

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

We are all conditioned to write tests as we code: unit, functional, UI—the whole shebang. We are professionals, after all. Many of us like how small tests let us work quickly, and how larger tests inspire safety and closure. Or we may just anticipate flak during review. We are so used to these tests that often we no longer question why we write them. This can be wasteful and dangerous.

Tests are a means to an end: to reduce the key risks of a project, and to get the biggest bang for the buck. This bang may not always come from the tests that standard practice has you write, or even from tests at all.

Two examples:

“We built a new debugging aid. We wrote unit, integration, and UI tests. We were ready to launch.”

Outstanding practice. Missing the mark.

Our key risks were that we'd corrupt our data or bring down our servers for the sake of a debugging aid. None of the tests addressed this, but they gave a false sense of safety and “being done”.
We stopped the launch.


“We wanted to turn down a feature, so we needed to alert affected users. Again we had unit and integration tests, and even one expensive end-to-end test.”

Standard practice. Wasted effort.

The alert was so critical it actually needed end-to-end coverage for all scenarios. But it would be live for only three releases. The cheapest effective test? Manual testing before each release.


A Better Approach: Risks First

For every project or feature, think about testing. Brainstorm your key risks and your best options to reduce them. Do this at the start so you don't waste effort and can adapt your design. Write them down as a QA design so you can point to it in reviews and discussions.

To be sure, standard practice remains a good idea in most cases (hence it’s standard). Small tests are cheap and speed up coding and maintenance, and larger tests safeguard core use-cases and integration.

Just remember: Your tests are a means. The bang is what counts. It’s your job to maximize it.

Testing on the Toilet: Effective Testing

Wednesday, May 07, 2014 17:10 PM

by Rich Martin, Zurich

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.


Whether we are writing an individual unit test or designing a product’s entire testing process, it is important to take a step back and think about how effective our tests are at detecting and reporting bugs in our code. To be effective, there are three important qualities that every test should try to maximize:

Fidelity

When the code under test is broken, the test fails. A high-fidelity test is one which is very sensitive to defects in the code under test, helping to prevent bugs from creeping into the code.

Maximize fidelity by ensuring that your tests cover all the paths through your code and include all relevant assertions on the expected state.

Resilience

A test shouldn’t fail if the code under test isn’t defective. A resilient test is one that only fails when a breaking change is made to the code under test. Refactorings and other non-breaking changes to the code under test can be made without needing to modify the test, reducing the cost of maintaining the tests.

Maximize resilience by only testing the exposed API of the code under test; avoid reaching into internals. Favor stubs and fakes over mocks; don't verify interactions with dependencies unless it is that interaction that you are explicitly validating. A flaky test obviously has very low resilience.
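
For instance, a state-based test against a small in-memory fake tends to survive refactorings that a mock-verifying test would not. A minimal sketch with entirely hypothetical types, assuming JUnit:
import static org.junit.Assert.assertEquals;

import java.util.HashMap;
import java.util.Map;
import org.junit.Test;

public class RegistrationServiceTest {
  // Hypothetical production code, inlined only to keep the sketch self-contained.
  interface UserStore {
    void save(String id);
    String load(String id);
  }

  static final class RegistrationService {
    private final UserStore store;
    RegistrationService(UserStore store) { this.store = store; }
    void register(String id) { store.save(id); }
  }

  // A fake with real (in-memory) behavior instead of a mock with scripted expectations.
  static final class FakeUserStore implements UserStore {
    private final Map<String, String> users = new HashMap<>();
    @Override public void save(String id) { users.put(id, id); }
    @Override public String load(String id) { return users.get(id); }
  }

  @Test public void registeringAUserMakesItLoadable() {
    FakeUserStore store = new FakeUserStore();
    new RegistrationService(store).register("alice");
    // Assert on observable state through the exposed API; the test doesn't care which
    // internal calls the service made, so a refactoring won't break it.
    assertEquals("alice", store.load("alice"));
  }
}

If the service later changes how it talks to its store internally, this test keeps passing as long as registered users remain loadable.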

Precision

When a test fails, a high-precision test tells you exactly where the defect lies. A well-written unit test can tell you exactly which line of code is at fault. Poorly written tests (especially large end-to-end tests) often exhibit very low precision, telling you that something is broken but not where.

Maximize precision by keeping your tests small and tightly focused. Choose descriptive method names that convey exactly what the test is validating. For system integration tests, validate state at every boundary.

These three qualities are often in tension with each other. It's easy to write a highly resilient test (the empty test, for example), but writing a test that is both highly resilient and high-fidelity is hard. As you design and write tests, use these qualities as a framework to guide your implementation.

Testing on the Toilet: Test Behaviors, Not Methods

Monday, April 14, 2014 21:25 PM

by Erik Kuefler

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

After writing a method, it's easy to write just one test that verifies everything the method does. But it can be harmful to think that tests and public methods should have a 1:1 relationship. What we really want to test are behaviors, where a single method can exhibit many behaviors, and a single behavior sometimes spans across multiple methods.

Let's take a look at a bad test that verifies an entire method:

@Test public void testProcessTransaction() {
  User user = newUserWithBalance(LOW_BALANCE_THRESHOLD.plus(dollars(2)));
  transactionProcessor.processTransaction(
      user,
      new Transaction("Pile of Beanie Babies", dollars(3)));
  assertContains("You bought a Pile of Beanie Babies", ui.getText());
  assertEquals(1, user.getEmails().size());
  assertEquals("Your balance is low", user.getEmails().get(0).getSubject());
}

Displaying the name of the purchased item and sending an email about the balance being low are two separate behaviors, but this test looks at both of those behaviors together just because they happen to be triggered by the same method. Tests like this very often become massive and difficult to maintain over time as additional behaviors keep getting added in—eventually it will be very hard to tell which parts of the input are responsible for which assertions. The fact that the test's name is a direct mirror of the method's name is a bad sign.

It's a much better idea to use separate tests to verify separate behaviors:

@Test public void testProcessTransaction_displaysNotification() {
  transactionProcessor.processTransaction(
      new User(), new Transaction("Pile of Beanie Babies"));
  assertContains("You bought a Pile of Beanie Babies", ui.getText());
}

@Test public void testProcessTransaction_sendsEmailWhenBalanceIsLow() {
  User user = newUserWithBalance(LOW_BALANCE_THRESHOLD.plus(dollars(2)));
  transactionProcessor.processTransaction(
      user,
      new Transaction(dollars(3)));
  assertEquals(1, user.getEmails().size());
  assertEquals("Your balance is low", user.getEmails().get(0).getSubject());
}

Now, when someone adds a new behavior, they will write a new test for that behavior. Each test will remain focused and easy to understand, no matter how many behaviors are added. This will make your tests more resilient since adding new behaviors is unlikely to break the existing tests, and clearer since each test contains code to exercise only one behavior.

The *Real* Test Driven Development

Thursday, April 02, 2015 15:35 PM


Update: APRIL FOOLS!


by Kaue Silveira

Here at Google, we invest heavily in development productivity research. In fact, our TDD research group now occupies nearly an entire building of the Googleplex. The group has been working hard to minimize the development cycle time, and we’d like to share some of the amazing progress they’ve made.

The Concept

In the ways of old, it used to be that people wrote tests for their existing code. This was changed by TDD (Test-driven Development), where one would write the test first and then write the code to satisfy it. The TDD research group didn’t think this was enough and wanted to elevate the humble test to the next level. We are pleased to announce the Real TDD, our latest innovation in the Program Synthesis field, where you write only the tests and have the computer write the code for you!

The following graph shows how the number of tests created by a small feature team grew since they started using this tool towards the end of 2013. Over the last 2 quarters, more than 89% of this team’s production code was written by the tool!

See it in action:

Test written by a Software Engineer:

class LinkGeneratorTest(googletest.TestCase):

  def setUp(self):
    self.generator = link_generator.LinkGenerator()

  def testGetLinkFromIDs(self):
    expected = ('https://frontend.google.com/advancedSearchResults?'
                's.op=ALL&s.r0.field=ID&s.r0.val=1288585+1310696+1346270+')
    actual = self.generator.GetLinkFromIDs(set((1346270, 1310696, 1288585)))
    self.assertEqual(expected, actual)

Code created by our tool:

import urllib

class LinkGenerator(object):

  _URL = (
      'https://frontend.google.com/advancedSearchResults?'
      's.op=ALL&s.r0.field=ID&s.r0.val=')

  def GetLinkFromIDs(self, ids):
    result = []
    for id in sorted(ids):
      result.append('%s ' % id)
    return self._URL + urllib.quote_plus(''.join(result))

Note that the tool is smart enough to not generate the obvious implementation of returning a constant string, but instead it correctly abstracts and generalizes the relation between inputs and outputs. It becomes smarter at every use and it’s behaving more and more like a human programmer every day. We once saw a comment in the generated code that said "I need some coffee".

How does it work?

We’ve trained the Google Brain with billions of lines of open-source software to learn about coding patterns and how product code correlates with test code. Its accuracy is further improved by using Type Inference to infer types from code and the Girard-Reynolds Isomorphism to infer code from types.

The tool runs every time your unit test is saved, and it uses the learned model to guide a backtracking search for a code snippet that satisfies all assertions in the test. It provides sub-second responses for 99.5% of the cases (as shown in the following graph), thanks to millions of pre-computed assertion-snippet pairs stored in Spanner for global low-latency access.



How can I use it?

We will offer a free (rate-limited) service that everyone can use, once we have sorted out the legal issues regarding the possibility of mixing code snippets originating from open-source projects with different licenses (e.g., GPL-licensed tests will simply refuse to pass BSD-licensed code snippets). If you would like to try our alpha release before the public launch, leave us a comment!