Over-engineering and horsing around

There’s been a scandal. A scandal! Horse meat has made its way into a number of foods that aren’t supposed to contain any. This inspired me to spend a couple of spare hours on a website I named Percentage Horse.

The idea is simple:

Given the name of a thing, what percentage of it is horse?

Or, in terms of the technical task:

Given a word or phrase, generate a number between 0 and 100 which relates to its connection to horse meat.

Where’s a guy to start? I first looked for a way to scan a combination of online supermarkets and news sources and automatically research the ingredients list for food items. I realised pretty early on that whilst this would work for all the obvious things, the supermarkets don’t have APIs, and the scope of what you could give the site to search for would be limited.

The solution I went with consists of three parts (and lots of caching, which I won’t go into in detail beyond saying: cache everything):

  1. Ask Wikipedia about the search word or phrase
  2. Search for key words in the resulting text
  3. Add a small random number
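
Since caching gets no more than a passing mention, here is a minimal sketch of what “cache everything” could look like. It assumes a simple file cache in the system temp directory. The function name cached_fetch, the cache location and the one-day lifetime are all made up for illustration and are not taken from the site’s code.

```php
<?php
// Hypothetical file cache: the name, location and lifetime are illustrative assumptions.
function cached_fetch($key, callable $fetch, $cache_dir = null, $max_age = 86400)
{
	$cache_dir = $cache_dir ?: sys_get_temp_dir() . '/percentage_horse_cache';

	if (!is_dir($cache_dir))
	{
		mkdir($cache_dir, 0777, true);
	}

	// hash the key so any search phrase or URL maps to a safe file name
	$file = $cache_dir . '/' . md5($key) . '.cache';

	// serve from the cache while the file is still fresh
	if (is_file($file) && (time() - filemtime($file)) < $max_age)
	{
		return file_get_contents($file);
	}

	// otherwise do the expensive work once and remember the answer
	$value = (string) $fetch($key);
	file_put_contents($file, $value);

	return $value;
}
```

Any function that takes a key and returns text can be passed as $fetch, so a call along the lines of cached_fetch($url, 'return_page') would wrap the curl function from this post.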

As I only had a few hours to make the site, I cut to the chase and made a crude curl-based function which returns the Wikipedia entry for a given piece of text. Wikipedia is nice because they have a public API.

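function return_page($url)
{
	// create curl resource
	$ch = curl_init();

	// identify yourself to Wikipedia; pass only the value, not the "User-Agent:" header name
	curl_setopt($ch, CURLOPT_USERAGENT, 'YOUR-DETAILS-HERE');

	// set url
	curl_setopt($ch, CURLOPT_URL, $url);

	// return the transfer as a string
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

	// $output contains the output string, or false if the request failed
	$output = curl_exec($ch);

	// close curl resource to free up system resources
	curl_close($ch);

	// hand back an empty string on failure so callers can always treat the result as text
	return ($output === false) ? '' : $output;
}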

$string = "https://en.wikipedia.org/w/api.php?format=xml&action=query&titles=" . urlencode($search_value) . "&prop=revisions&rvprop=content";

$search_page_string = return_page($string);

The next step is to process the lump of text we get back.
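
The lump that comes back is XML with the page’s wikitext buried inside it. The code in this post works on the raw response, but if you wanted only the article text, something like the following could pull it out. This is a sketch of an alternative rather than what the site does: extract_wikitext is a made-up name, and the sample response is a hand-written, simplified stand-in for what the API returns.

```php
<?php
// Sketch: pull the page wikitext out of an XML API response with SimpleXML.
// Returns an empty string if the XML is invalid or the expected elements are missing.
function extract_wikitext($xml_string)
{
	// collect XML parse errors quietly instead of emitting warnings
	libxml_use_internal_errors(true);

	$xml = simplexml_load_string($xml_string);

	if ($xml === false || !isset($xml->query->pages->page->revisions->rev))
	{
		return '';
	}

	// casting the element to a string gives its text content
	return (string) $xml->query->pages->page->revisions->rev;
}

// hand-written, simplified stand-in for a real response
$sample = '<api><query><pages><page title="Sausage"><revisions>'
	. '<rev>A sausage is a type of [[meat]] product.</rev>'
	. '</revisions></page></pages></query></api>';

echo extract_wikitext($sample);
```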

Quick note here on redirects: if you search for some words, they redirect to others (“sausages” redirects to “sausage”, for example). I made an ugly-but-it-works detector which treats any short response containing the word “redirect” as a redirect page, and then looks up the page it points to instead:

function get_search_results_word($search_value, $allow_redirect = true)
{
	$string = "https://en.wikipedia.org/w/api.php?format=xml&action=query&titles=" . urlencode($search_value) . "&prop=revisions&rvprop=content";

	$search_page_string = return_page($string);

	// a short response containing "redirect" is assumed to be a redirect page
	if ($allow_redirect && stristr($search_page_string, "redirect") && (strlen($search_page_string) < 1000))
	{
		// the redirect target is the text between the first "[[" and the next "]]"
		$parts = explode("[[", $search_page_string);

		if (isset($parts[1]))
		{
			$parts = explode("]]", $parts[1]);
			$target = $parts[0];

			if ($target)
			{
				// follow one redirect only, so two pages can't bounce us back and forth
				$search_page_string = get_search_results_word($target, false);
			}
		}
	}

	return $search_page_string;
}

I could have used a sexy regular expression, but who’s got time for that?
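
For anyone who does have the time, the regular expression version is not much longer. A minimal sketch, assuming redirect pages contain wikitext of the form #REDIRECT [[Target]]; the function name extract_redirect_target is made up for illustration.

```php
<?php
// Sketch: pull the redirect target out of wikitext such as "#REDIRECT [[Sausage]]".
// Returns null when the text does not look like a redirect.
function extract_redirect_target($wikitext)
{
	// capture everything after "[[" up to the first "]", "|" or "#"
	if (preg_match('/#REDIRECT\s*:?\s*\[\[([^\]|#]+)/i', $wikitext, $matches))
	{
		return trim($matches[1]);
	}

	return null;
}
```

Stopping at “#” and “|” drops section anchors and link labels, so a redirect to a section of a page still gives a page title that can be looked up.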

Right now I can search for a word and get a whole tonne of related words back (in the form of the text on the Wikipedia page). If we break the user-submitted phrase into its constituent words then merge all our results into one big text string, we’re almost there.
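
The split-and-merge step might look something like this. It is a sketch under assumptions: merged_text_for_phrase is a made-up name, and the lookup is passed in as a callable so the example runs without touching the network. On the real site the lookup would be get_search_results_word() from above.

```php
<?php
// Sketch: look up each word of a phrase and join the results into one string.
// $lookup is any function that takes a word and returns text.
function merged_text_for_phrase($phrase, callable $lookup)
{
	// split on whitespace, dropping empty pieces
	$words = preg_split('/\s+/', trim($phrase), -1, PREG_SPLIT_NO_EMPTY);

	$texts = array();

	// look each distinct word up once
	foreach (array_unique($words) as $word)
	{
		$texts[] = $lookup($word);
	}

	// separate the results with a space so words at the joins don't run together
	return implode(' ', $texts);
}
```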

We don’t really care about sentences or structure, so I pushed the large text string through a word-frequency analyser, which gives me an array of words and their popularity in the results.
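
A word-frequency analyser doesn’t need to be anything fancy. Here is a minimal sketch using PHP’s built-in array_count_values(); the function name word_frequencies and the decision to keep only letters and apostrophes are assumptions for illustration. The result has the same word => count shape as the $wiki_text_array used in the loop further down.

```php
<?php
// Sketch of a word-frequency count; word_frequencies() is a hypothetical name.
function word_frequencies($text)
{
	// lower-case so "Horse" and "horse" count as the same word
	$text = strtolower($text);

	// split on anything that is not a letter or an apostrophe
	$words = preg_split("/[^a-z']+/", $text, -1, PREG_SPLIT_NO_EMPTY);

	// array of word => number of occurrences
	$frequencies = array_count_values($words);

	// most frequent words first
	arsort($frequencies);

	return $frequencies;
}
```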

Some words are more important than others, so I made a small array of key words, like “meat” and “smart” (any food that’s “smart price” is probably pretty low quality), and another array of words to ignore, like “at” and “the”. Assuming we don’t care about words that only appear once or twice, we can step through the words and calculate $is_horse and $not_horse values. With a little bit of weighting we can then divide $is_horse by the total of the two and get our percentage.

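// start both totals at zero
$is_horse = 0;
$not_horse = 0;

foreach ($wiki_text_array as $wiki_text_word => $frequency)
{
	// skip rare words (fewer than 3 occurrences) and anything on the ignore list
	if (($frequency > 2) && !in_array($wiki_text_word, $ignore_words))
	{
		if (in_array($wiki_text_word, $key_words))
		{
			// key words count at full weight
			$is_horse += ($frequency * 1);
		}
		else
		{
			// everything else counts for much less
			$not_horse += ($frequency * 0.05);
		}
	}
}

// calculate the score, and prevent it being more than 100
// (the total is zero when no word passed the filter, so avoid dividing by it)
$total = $is_horse + $not_horse;
$score = ($total > 0) ? round(($is_horse / $total) * 160, 2) : 0;

if ($score > 100)
{
	$score = 100;
}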
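
To see how the weighting plays out, here is the same calculation wrapped in a function and run on a small made-up input. The function name horse_score and the sample frequencies are invented for illustration; the weights (1 and 0.05) and the multiplier (160) are the ones from the code above.

```php
<?php
// Sketch of the scoring step as a function; horse_score() is a hypothetical name.
function horse_score(array $frequencies, array $key_words, array $ignore_words)
{
	$is_horse = 0;
	$not_horse = 0;

	foreach ($frequencies as $word => $frequency)
	{
		if (($frequency > 2) && !in_array($word, $ignore_words))
		{
			if (in_array($word, $key_words))
			{
				$is_horse += $frequency;
			}
			else
			{
				$not_horse += $frequency * 0.05;
			}
		}
	}

	$total = $is_horse + $not_horse;

	// nothing passed the filter, so there is nothing to score
	if ($total == 0)
	{
		return 0;
	}

	return min(100, round(($is_horse / $total) * 160, 2));
}

// made-up frequencies: "the" is ignored and "horse" is too rare to count
$sample = array('meat' => 3, 'burger' => 40, 'beef' => 30, 'cheese' => 30, 'the' => 50, 'horse' => 1);

$score = horse_score($sample, array('meat', 'smart'), array('at', 'the'));
```

Here $is_horse comes to 3 and $not_horse to 100 × 0.05 = 5, so the score is 3 / 8 × 160 = 60. One key word appearing three times is enough to outweigh a hundred ordinary words, which is the point of the weighting.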

The random number added on the end is purely theatrical. The whole website is a joke, and as long as the percentage is within 25% of what we want, the joke works. Adding a random number prevents similar phrases having identical percentages, which is a surprisingly frequent occurrence and makes the website look broken.
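
The random step could be as small as this. It is a sketch: the function name add_jitter and the default range of 3 percentage points are assumptions, since the actual range used on the site isn’t given here.

```php
<?php
// Sketch: nudge a score up by a small random amount and keep it between 0 and 100.
// add_jitter() and the default range of 3 points are assumptions for illustration.
function add_jitter($score, $max_jitter = 3)
{
	// mt_rand() returns whole numbers, so work in hundredths for two decimal places
	$jitter = mt_rand(0, (int) ($max_jitter * 100)) / 100;

	// clamp after adding so a high score can't creep past 100
	return min(100, max(0, round($score + $jitter, 2)));
}
```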

And that’s it! Season lightly with some basic CSS and jQuery, and you have yourself a novelty website in a couple of hours.

You can see the website for yourself at percentagehorse.com.
