My 8/24/08 Missoulian column
Internet search engines such as Google, Yahoo! and MSN help us find news, blogs, people, hiking boots, definitions of words and collectible widgets. Even hackers use search engines to find vulnerable Web sites. The term “search” – as both a noun and a verb – has become ubiquitous, and “Google” has entered the lexicon as a generic term like Kleenex and Xerox.
Search is invaluable to users like you and me, but it’s also big business. Search engines make money by selling clickable advertisements and by drawing eyeballs to their “portals,” Web pages that also provide links to other services. Some portals are so crammed with news and personals that search seems to be an afterthought.
But how do search engines work? It isn’t magic, but requires a lot of money, hardware and software, as well as very high-capacity “pipes” to the Internet.
The basic concept behind search is to create an index of what’s available on the Web. Search engines build and maintain extremely large and complex computer networks to catalog the Internet and respond to your queries any time of day or night. Indexes are based on words in Web sites and documents, file names, etc. – wherever a particular word exists online, regardless of context, it will end up in an index.
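The idea of an index built from words can be sketched in a few lines of code. This is a toy "inverted index" with made-up page names and text, just to show the shape of the data: every word points back to the pages that contain it.

```python
# A toy inverted index: each word maps to the set of pages containing it.
# Page names and text are invented for illustration.
pages = {
    "gadgetmontana.com": "quality gadgets made in montana",
    "gadgetnews.com": "news about gadgets and widgets",
}

index = {}
for url, text in pages.items():
    for word in text.split():
        index.setdefault(word, set()).add(url)

# Answering a query is just a lookup -- no crawling happens at search time.
print(sorted(index["gadgets"]))   # both pages mention "gadgets"
print(sorted(index["montana"]))   # only one does
```

Real indexes also record where on the page a word appears and how often, but the lookup-table principle is the same.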
Google reportedly has around three-quarters of a million to a million personal computers spread around the world in data centers, arranged for speedy access and redundancy in case undersea cables are cut or satellites go dark. Google uses fast, desktop-style PCs because they’re inexpensive to install and easy to change out when they die.

When you search for something, Google doesn’t send a query out right then to find what you’re looking for. It refers to its vast pre-built index of the Internet. The job of indexing the Internet is done by “bots,” automatic software critters that traverse the Web “reading” sites. The bots are called “spiders” – a play on the term World Wide Web – and they work 24/7, collecting information as fast as they can to compete for market share with other search engines, as the best results will bring a user back to the same site.
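What a spider does, stripped to its core, is visit a page, note its contents, and follow its links to find more pages. Here is a minimal sketch of that traversal over an invented three-site "web"; a real spider fetches HTML over HTTP instead of reading a dictionary, but the logic is the same.

```python
from collections import deque

# Toy web: each "page" lists the links it contains.
web = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com"],
}

def crawl(start):
    seen, queue = set(), deque([start])
    while queue:
        url = queue.popleft()
        if url in seen:          # don't index the same page twice
            continue
        seen.add(url)            # "read" (index) the page here
        queue.extend(web.get(url, []))  # follow its links
    return seen

print(crawl("a.com"))
```

Starting from any one page, the spider eventually reaches everything linked from it, which is why a new site with no inbound links can go undiscovered.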
Spiders can also be fairly slow. The Web is growing so quickly that even Google, with all its vast computing power, sometimes can only visit a given site once a month or less. According to Internet services company Netcraft, there were more than 175 million Web servers online in June, growing by about 3 million a month – and since a single server can host many sites, the number of Web sites is higher still.
Some early search engines – like the old Yahoo! – operated on indexes compiled by people rather than spiders. While there are still some human-generated indexes around, modern Internet users want access to everything – Web sites, images, documents, databases – so a search engine must constantly update by spider to keep up.
These days, Google is the search leader, accounting for nearly 70 percent of search traffic. Google has a leg up on Yahoo! and other search engines – even a newcomer called Cuil (pronounced “cool”), which was supposed to be a “Google killer” but failed miserably – because of its search algorithm, the complex method by which its index is developed.
The algorithm is a corporate secret, except for the way search results are ranked, called PageRank: Sites at the top of the list are there because other sites link to them, the idea being that a Web site with other sites pointing to it should be more important and relevant than a loner.
Let’s say there are two Web sites that have similar content, one called Gadget Montana and the other Made in Montana Gadgets. If you manage Gadget Montana and have other quality sites link to it – such as the National Gadget Association and International Gadgets – or have blogs or news stories mention it and your competitor doesn’t, there’s a good chance Google will rank your site higher after its spiders see the difference in links. Now, “a good chance” is the key phrase, as other factors come into play. The best way to rank high in Web search is to have high-quality, original content that is rich with links. There once were ways to game how high your site ranked, but most don’t work anymore.
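The link-voting idea behind PageRank can be demonstrated with a short sketch. The site names here are the invented ones from the example above, and the 0.85 "damping" number is the commonly published value from the original PageRank description, not Google's actual tuning; this is a simplified illustration, not Google's real algorithm.

```python
# Simplified PageRank: each site's score is fed by the scores of the
# sites linking to it. Site names are hypothetical.
links = {
    "gadgetmontana.com": [],
    "madeinmontanagadgets.com": [],
    "nationalgadgets.org": ["gadgetmontana.com"],
    "intlgadgets.org": ["gadgetmontana.com"],
}

def pagerank(links, damping=0.85, iters=50):
    n = len(links)
    rank = {site: 1.0 / n for site in links}   # start everyone equal
    for _ in range(iters):
        new = {site: (1 - damping) / n for site in links}
        for site, outgoing in links.items():
            if outgoing:
                # a site splits its "vote" among the sites it links to
                share = damping * rank[site] / len(outgoing)
                for target in outgoing:
                    new[target] += share
            else:
                # a site with no outbound links spreads its vote evenly
                for target in new:
                    new[target] += damping * rank[site] / n
        rank = new
    return rank

ranks = pagerank(links)
# Gadget Montana, with two inbound links, outscores its competitor.
print(ranks["gadgetmontana.com"] > ranks["madeinmontanagadgets.com"])
```

Run it and Gadget Montana comes out ahead purely because two other sites point at it, with no judgment of the pages' actual content.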
Unless you’re savvy enough with technology to know how to block them, nothing much escapes a spider. Most are “nice” in that they will respect the two primary ways you can control their crawl of your site, through “robots.txt” files or “.htaccess” files (that leading dot is important). Some spiders aren’t nice and will grab your content for what I call “bottom feeder” sites, which use excerpted text and images to promote clickable ads and make money.
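A "robots.txt" file is just a plain-text file at the top level of your site. A minimal example, with hypothetical paths and bot names, might look like this:

```
# robots.txt, placed at the root of your site (e.g. example.com/robots.txt)
User-agent: *
Disallow: /private/

# Ask one particular (hypothetical) spider to stay out entirely
User-agent: BadBot
Disallow: /
```

Note that these rules are requests, not locks: well-behaved spiders honor them, but the bottom feeders simply ignore the file.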
The next big step in search engine technology is called “semantic search.” For the user, that means getting more relevant results because the search engine understands the context of a word, not just its presence on a Web site.
With semantic search, a search for “Montana Gadgets” would bring back results for gadget manufacturers and dealers in Montana, and not results for tween singer Hannah Montana and the gadgets she sells, regardless of how popular she might be. For more reading on this subject and when it might become a reality, try Googling “semantic search.”