Matt's Guide to Search Engines

index

Introduction Registration robots.txt Meta Tags Content

Introduction

This page is a reference to help you confirm that you've done what you can to make search robots and engines properly index (or not index) your website and its pages.

Registration

The most important (and most obvious) thing you can do to make sure search engines find your site is to register it with them. That doesn't mean it's the first thing you should do, though (even though I listed it first). First configure your robots.txt file and set up any other precautions or guides you want in place, because some engines begin indexing with remarkable speed, and it may be weeks before they come back around for a second pass.

Registering your site with a search engine is, in most cases, as simple as finding a link at the bottom of the engine's main page that reads "add a URL," "suggest a site," or something of the sort. Most of the time, typing the website's URL into the form (and sometimes an email address or description) will do it.

For some directory engines, Yahoo being a good example, be prepared to tell it how to categorize your site; for example, is your site a Reference -> Quotations -> Randomized sort of site, or a Science -> Biology -> Zoology -> Animals Insects and Pets -> Mammals -> Cats page? Once you've figured that out, the rest of the registration is pretty straightforward.

Be warned that a few of the search engines out there are starting to charge money to have a site listed directly, so don't click too blindly.

robots.txt

One way you can influence how search engines index your whole site is the robots.txt file. robots.txt is a plain text file that is always placed in the document root directory of your website (the directory that http://www.domain.com/ would take you to). It is primarily used to keep robots out of certain directories, to keep certain search engines out altogether, or both. Examples follow.
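Because the file always lives at the document root, its address can be worked out from any page on the site. A minimal Python sketch of that rule follows; the function name robots_txt_url and the sample page URL are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt sits at the document root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The page path below is a hypothetical example.
print(robots_txt_url("http://www.domain.com/articles/cats.html"))
```

Whatever page you start from, the result points at the same file, which is why a robots.txt placed in a subdirectory has no effect.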

This example would prevent all search engines from indexing any of your site (each Disallow value is matched as a path prefix, so / covers everything from the document root on):

User-agent: *
Disallow: /

The following example allows indexing of anything on your site except for the /cgi-bin and /images directories, usually a good idea if you don't want people searching for and browsing to your image files directly. Note that this doesn't prevent people from pulling up your images if they know the URL -- it only prevents search engines from finding them.

User-agent: *
Disallow: /cgi-bin
Disallow: /images
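
One way to check rules like these before publishing them is Python's standard urllib.robotparser module. The sketch below is only an illustration: the helper is_allowed, the robot name ExampleBot, and the sample paths are all made up, and real search robots may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /images
"""


def is_allowed(robots_txt, agent, url):
    """Report whether `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# "ExampleBot" and the paths are hypothetical, used only for illustration.
for path in ("/index.html", "/images/logo.gif", "/cgi-bin/form.pl"):
    url = "http://www.domain.com" + path
    print(path, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

With these rules the front page is allowed, while anything whose path starts with /images or /cgi-bin is refused.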

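Other examples:

An empty Disallow line excludes nothing, so this allows all search engines to index your whole site:

User-agent: *
Disallow:

This allows every search engine except one (here a robot calling itself EvilSearchEngine), which is shut out entirely. Records for different robots are separated by a blank line:

User-agent: *
Disallow:

User-agent: EvilSearchEngine
Disallow: /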
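
To see how a per-robot record behaves, the sketch below feeds Python's standard urllib.robotparser a file that admits every robot except EvilSearchEngine. FriendlyBot is a made-up name standing in for any other robot, and the page URL is likewise just an example.

```python
from urllib.robotparser import RobotFileParser

# One record per robot, separated by a blank line.
ROBOTS_TXT = """\
User-agent: *
Disallow:

User-agent: EvilSearchEngine
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "http://www.domain.com/index.html"
# "FriendlyBot" is a hypothetical robot that has no record of its own,
# so it falls back to the "*" record.
friendly_ok = parser.can_fetch("FriendlyBot", page)
evil_ok = parser.can_fetch("EvilSearchEngine", page)
print("FriendlyBot allowed:", friendly_ok)
print("EvilSearchEngine allowed:", evil_ok)
```

Keep in mind that robots.txt is only a request: a robot that chooses to ignore the file can still fetch your pages.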

Meta Tags

Obviously, you want your site to be found by the people who are looking for the type of content you're offering, and the best way to take care of this (along with providing robot indexing data at a more granular level) is via HTML meta tags. Meta tags belong between the <head> and </head> tags at the top of your HTML document. Though you should generally have only one of each, you're welcome to use whatever combination of tags you feel is appropriate.

Robots

<meta name="robots" content="noindex">

The "noindex" robot tag tells search robots to skip this page.

<meta name="robots" content="nofollow">

The "nofollow" robot tag tells search robots not to follow any links on this page. This is occasionally useful for indexes of images or other content you don't want a search engine to index directly (where it would bypass too much of your natural navigation).
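
To check what a page is actually telling robots, you can read its robots meta tag back out of the HTML. The following is a rough sketch using Python's standard html.parser module; the class RobotsMetaParser and the function robots_policy are invented for this example, and it assumes directives may be combined in one tag separated by commas (for example "noindex, nofollow").

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # Assumption: several directives may share one tag, comma-separated.
            for item in attr.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())


def robots_policy(html_text):
    """Return (may_index, may_follow) for a page's HTML."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return ("noindex" not in parser.directives,
            "nofollow" not in parser.directives)


page = '<html><head><meta name="robots" content="nofollow"></head></html>'
print(robots_policy(page))
```

A page with no robots meta tag at all comes back as indexable and followable, which matches what robots do by default.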

Keywords

<meta name="Keywords" content="keywords description robots.txt meta tags search engine">

The Keywords meta tag helps search engines identify which search phrases should be associated with your site. While not all search engines pay attention to your keywords, it never hurts, particularly if your topic is not clear from other aspects of the page. It's best to omit commas and to list keywords in the natural pairings and order in which people will likely combine them in searches.

Description

<meta name="Description" content="how to list my website in a search engine">

The Description meta tag, which is also ignored by some search engines but important to others, holds a sentence or phrase describing the content of the page. Again, think in terms of what a person would enter into a search engine.
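
If you generate your pages from a script, the Keywords and Description tags can be built from plain strings. This is a small sketch, not a required approach: the helper meta_tag and the phrase list are made up, and html.escape is used so that a quote or ampersand in the text cannot break the attribute.

```python
from html import escape


def meta_tag(name, content):
    """Build one <meta> tag, escaping characters that would break the attribute."""
    return '<meta name="{}" content="{}">'.format(
        escape(name, quote=True), escape(content, quote=True))


# Hypothetical phrases, written the way a visitor might type them and
# joined with spaces rather than commas.
phrases = ["search engine", "meta tags", "robots.txt"]
print(meta_tag("Keywords", " ".join(phrases)))
print(meta_tag("Description", "how to list my website in a search engine"))
```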

Content

The last and most deceptively difficult aspect of getting your site indexed effectively is your page content. Search engines look particularly for content within certain tags, so make sure you use heading tags for genuine heading material and not just for looks; otherwise your site may be indexed under those words instead of the real indicators of what it is about. Though it's a veritable mystery how some robots derive keywords from the bulk of a page's text, the content of the following tags seems to take special precedence, roughly in order:
<title>
<h1> - <h6>
<th>
<strong>
<b>
<em>
<i>
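
To review what a page puts inside these tags, you can pull the text out and read it the way a robot might. The sketch below uses Python's standard html.parser; the names PRIORITY_TAGS, EmphasisCollector, and emphasized_text are invented for this example, the sample page is made up, and it assumes reasonably well-nested HTML. It says nothing about how any real search engine weighs the results.

```python
from html.parser import HTMLParser

# The tags listed above, in the same rough order of precedence.
PRIORITY_TAGS = ["title", "h1", "h2", "h3", "h4", "h5", "h6",
                 "th", "strong", "b", "em", "i"]


class EmphasisCollector(HTMLParser):
    """Record the text found inside each of the priority tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # priority tags currently open, innermost last
        self.found = {tag: [] for tag in PRIORITY_TAGS}

    def handle_starttag(self, tag, attrs):
        if tag in self.found:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Assumes tags are closed in the order they were opened.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            for tag in self.open_tags:
                self.found[tag].append(text)


def emphasized_text(html_text):
    """Return (tag, texts) pairs in precedence order, skipping unused tags."""
    collector = EmphasisCollector()
    collector.feed(html_text)
    collector.close()
    return [(tag, collector.found[tag])
            for tag in PRIORITY_TAGS if collector.found[tag]]


page = ("<html><head><title>Cat Care</title></head><body>"
        "<h1>Feeding</h1><p>Use <b>fresh</b> water.</p></body></html>")
for tag, texts in emphasized_text(page):
    print(tag, texts)
```

If the words that come out don't describe what the page is about, that is a hint the headings or bold text are being used for looks rather than meaning.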