Software Testing - Bayesian Theorem in Testing Maps

Posted on May 3, 2011

This article was written for us by Mr. Ashish Harsh. Thanks, Ashish, for sharing your knowledge and insight with us.

Software testing is essentially an exercise in continuous exploration, learning and questioning. It becomes especially interesting and challenging when the application under test is as complex as a maps application. You have probably used applications such as Google Maps or Yahoo Maps. Their primary use is to help users find a route: the user enters a source and a destination, and the application returns directions from one to the other. From that description the application may sound simple, but testing it poses numerous challenges.

As a tester, you need to identify the relevant queries and assess the quality of the results the system produces for them.

During beta testing of the application, we collected thousands of queries and other input data entered by end users. To give an idea of the volume, there were more than 8000 queries for every city, for example Hotels in Mumbai, Escort Mumbai, Taj Mumbai and so on. Finding the relevant data among these queries is a difficult and time-consuming task.

This data can be analyzed for relevant queries in two ways: have people go through it by hand, or write a smart tool that uses artificial intelligence techniques. Since human resources are very expensive :) , we decided to develop a tool to classify the input data.

After looking at the various possible solutions, we decided to use a Bayesian classifier. For those who want to know more about the theorem behind it, this is what Wikipedia says:

Bayes' theorem (also known as Bayes' rule or Bayes' law) is a result in probability theory, which relates the conditional and marginal probability distributions of random variables. In some interpretations of probability, Bayes' theorem tells how to update or revise beliefs in light of new evidence a posteriori.

The probability of an event A conditional on another event B is generally different from the probability of B conditional on A. However, there is a definite relationship between the two, and Bayes' theorem is the statement of that relationship.
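In symbols, for two events A and B with P(B) > 0, the relationship quoted above is usually written as follows. For this article, one could read A as "the query is good" and B as "the query contains a given word"; that reading is only an illustration, not something the quoted definition states.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```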

Classifiers based on Bayes' theorem are well known from email spam filtering. A spam filter is typically trained on a large set of mails already labeled as good mail or spam. It works on the probability that certain words appear in spam rather than in normal email. A spam filter also learns from its users every time one of them hits the "report spam" or "not spam" button.

So we decided to write our own tool based on Bayes' theorem, capable of learning what is good data and what is bad data. The tool learns how to classify data from the way we train it. In simple terms, its input is a definition of what is good, a definition of what is bad, and the sample data to classify. Based on these, it classifies each item as good or bad. It is as simple as that.

To classify a set of text, we first have to teach the tool what is good and what is bad. During training, the classifier keeps track of how often each word shows up in the samples of each category.
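To make the spam example concrete, here is a small Ruby sketch of Bayes' theorem applied to a single word. The method name `spam_probability` and all the numbers are made up for illustration; a real filter combines many words and estimates these probabilities from its training mail.

```ruby
# Probability that a mail is spam given that it contains a certain word,
# using Bayes' theorem with two classes: spam and normal mail ("ham").
# All three inputs are assumed to be estimated from already-labeled mail.
def spam_probability(p_word_given_spam, p_word_given_ham, p_spam)
  p_ham = 1.0 - p_spam
  # Total probability of seeing the word in any mail
  evidence = p_word_given_spam * p_spam + p_word_given_ham * p_ham
  return p_spam if evidence.zero? # word never seen: fall back to the prior
  (p_word_given_spam * p_spam) / evidence
end

# Made-up numbers: the word occurs in 50% of spam and 1% of normal mail,
# and 40% of all mail is spam.
puts spam_probability(0.5, 0.01, 0.4).round(3) # prints 0.971
```

Even though only 40% of mail is spam in this example, seeing a word that is fifty times more common in spam pushes the estimate to about 97%.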
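The bookkeeping described here can be sketched in a few lines of plain Ruby. This is an illustrative toy, not the code of any library: the class name `TinyBayes`, the training phrases, and the choice of add-one smoothing are all assumptions made for the example.

```ruby
# A tiny naive Bayes text classifier (illustrative sketch only).
class TinyBayes
  def initialize(*categories)
    # @word_counts[category][word] => how often the word was seen in training
    @word_counts = {}
    categories.each { |category| @word_counts[category] = Hash.new(0) }
    @sample_counts = Hash.new(0) # training samples seen per category
  end

  # Record every word of the sample under the given category.
  def train(category, text)
    counts = @word_counts.fetch(category)
    tokenize(text).each { |word| counts[word] += 1 }
    @sample_counts[category] += 1
  end

  # Return the category with the highest (log) probability for the text.
  def classify(text)
    words = tokenize(text)
    vocabulary = @word_counts.values.flat_map(&:keys).uniq.size
    total_samples = @sample_counts.values.sum

    scores = @word_counts.map do |category, counts|
      total_words = counts.values.sum
      # Log prior, smoothed so an untrained category never yields log(0)
      score = Math.log((@sample_counts[category] + 1.0) /
                       (total_samples + @word_counts.size))
      # Add-one smoothing; the extra +1 leaves room for unseen words
      words.each do |word|
        score += Math.log((counts[word] + 1.0) / (total_words + vocabulary + 1.0))
      end
      [category, score]
    end
    scores.max_by { |_, score| score }.first
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z0-9]+/)
  end
end

bayes = TinyBayes.new(:good, :not_good)
bayes.train(:good, 'hotels in mumbai')
bayes.train(:good, 'restaurants near taj mumbai')
bayes.train(:not_good, 'free ringtones download')
bayes.train(:not_good, 'cheap pills online')

puts bayes.classify('hotels near taj') # prints good
```

Summing logarithms instead of multiplying raw probabilities avoids numeric underflow on long inputs, and the smoothing keeps a single unseen word from forcing a category's score to zero.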

Implementation

This tool was developed in Ruby, because Lucas Carlson's Classifier library is already available as the classifier gem. This library provides a naive Bayesian classifier. More information can be found in the library's documentation.

In our implementation, the code below reads three files:

  • good.yml
  • not_good.yml
  • input file

To run the script, we pass two command line arguments: the city name and the input file name. Based on the definitions of good and bad, it creates a directory named after the city and writes good.txt and nogood.txt into it, containing the input lines classified as good or not good.
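The article does not show what the two training files contain. Because the script passes each loaded entry straight to the classifier's training methods, one plausible layout is a plain YAML list of sample queries. The snippet here sketches that assumption; the helper name `load_samples` and the entries are hypothetical.

```ruby
require 'yaml'

# Parse YAML text that is assumed to hold a flat list of sample queries
# and return the entries as an array of strings.
def load_samples(yaml_text)
  Array(YAML.load(yaml_text)).map(&:to_s)
end

# Hypothetical contents of good.yml
sample_good_yml = <<~YAML
  - hotels in mumbai
  - taj mumbai
  - restaurants near gateway of india
YAML

load_samples(sample_good_yml).each do |query|
  puts "would be trained as good: #{query}"
end
```

A file in this shape can be extended by hand whenever a reviewer spots a misclassified query, which is how the training set grows over time.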

```ruby
require 'yaml'
require 'stemmer'
require 'classifier'

if ARGV.empty?
  puts '** You should supply CityName and Input File name to the script **'
elsif ARGV[1].nil?
  puts '** The second argument, the name of the input file, is required **'
else
  puts "I am searching for the city #{ARGV[0]}"
  puts "The input file is #{ARGV[1]}"

  # The file name is used as given; downcasing it breaks on
  # case-sensitive file systems.
  inputfile = ARGV[1].to_s
  city      = ARGV[0].to_s.downcase

  # Create the output directory for this city unless it already exists
  outdir = File.join(Dir.getwd, city)
  Dir.mkdir(outdir) unless Dir.exist?(outdir)

  # Load previous classifications
  good     = YAML.load_file('good.yml')
  not_good = YAML.load_file('not_good.yml')

  classifier = Classifier::Bayes.new('Good', 'No good')

  # Train the classifier
  not_good.each { |sample| classifier.train_no_good(sample) }
  good.each     { |sample| classifier.train_good(sample) }

  # Append each input line to good.txt or nogood.txt depending on its class
  File.open(File.join(outdir, 'good.txt'), 'a') do |goody|
    File.open(File.join(outdir, 'nogood.txt'), 'a') do |nogood|
      File.foreach(inputfile) do |line|
        if classifier.classify(line) == 'Good'
          goody.write(line)
        else
          nogood.write(line)
        end
      end
    end
  end
end
```

Quality Of Results

The quality of the results depends on how much training the classifier has been given. It is a learning system, so the quality of its output depends on its training. The major benefit of this approach is the reduction in human effort required to classify the data. There are many other applications where human intervention is needed to decide what is good and what is bad, and a properly trained classifier like this one can help in those situations too.

Hope you find this article interesting and can use this approach when you need to classify data for your own application.
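One simple way to see how training affects quality is to hand-label a small sample of queries and count how often the classifier agrees with the labels. The article does not describe such a check, so the sketch here is only a suggestion; the helper `accuracy`, the stand-in classifier and the labeled queries are all made up for illustration.

```ruby
# Fraction of hand-labeled samples for which the classifier's answer
# matches the label. The block receives a text and returns a category.
def accuracy(labeled_samples, &classify)
  return 0.0 if labeled_samples.empty?
  correct = labeled_samples.count { |text, label| classify.call(text) == label }
  correct.to_f / labeled_samples.size
end

# Stand-in classifier for illustration only:
# anything that mentions "mumbai" counts as "Good".
toy = ->(text) { text.downcase.include?('mumbai') ? 'Good' : 'No good' }

labeled = [
  ['Hotels in Mumbai', 'Good'],
  ['Taj Mumbai',       'Good'],
  ['free ringtones',   'No good'],
  ['pune bus timings', 'Good'] # the toy rule gets this one wrong
]

puts accuracy(labeled, &toy) # prints 0.75
```

With a trained classifier object, the same helper could be called as `accuracy(labeled) { |text| classifier.classify(text) }`, and rerun after each round of training to see whether the score improves.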
