Before you start to optimize your online presence, as a webmaster it’s important to first test the water and develop an idea of what exactly search engines are looking for in a domain and how exactly they go about it. Great content is one thing, but to get great rankings as well, content needs to be presented in as search engine friendly a way as possible so that Google’s crawlers can interpret and process it correctly. Technical factors and requirements are particularly important to ensure that quality content ranks highly.
Before any traffic can be generated via organic searches in search engines, a domain and its sub-pages must first be stored in the search engine’s index. This process might seem obvious, but it is extremely important and too often neglected by webmasters who fail to understand that Google’s crawlers have technical restrictions of their own. For this reason, you should provide the crawlers with as much information as possible to ease their work.
Crawling and indexing budget limitations
There are many reasons for webmasters to keep track of search engines’ progress as they crawl and index content. One of the most important is that Google, for instance, allocates only a limited crawling and indexing budget to each website or online shop. Resourceful webmasters therefore find technical ways to help crawlers locate and index their most important documents.
The following errors and problems have a negative effect on crawling and indexing:
Crawl problems – double the work and inaccessibility
Each internet presence is allocated a budget by search engines which defines how much effort and index volume may be expended on the domain. It is therefore important to be efficient in order to squeeze the greatest potential from your domain’s content and performance capacity. Search engine crawlers aim to waste as little effort and energy as possible finding and processing new/edited content. When crawlers can only find content with great effort, this means that fewer documents are processed since so much energy has already been used up locating the domain in the first place. Webmasters should therefore try to avoid the following crawl problems as much as possible:
Long load times
Search engines don’t want to overload web servers with extra requests, so long load times cause crawls to be interrupted. The process will be resumed at a later date, but interruptions should nevertheless be avoided.
The closer content sits to the homepage, the easier it is for the crawler to find. Content buried at deeper levels of the website is harder to detect and may not be found by the crawler at all. Since crawlers follow links to reach new content, your internal link structure should connect every important page to the homepage in as few clicks as possible.
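As a rough illustration of how deeply content is buried, the following minimal sketch computes the number of clicks needed to reach each page from the homepage with a breadth-first search. The link graph `site_links` and its URLs are hypothetical; in practice this data would come from a crawl of your own site.

```python
from collections import deque

def click_depths(links, start):
    """Return the minimum number of clicks from `start` to each reachable page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path in a BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical internal link graph: page -> pages it links to.
site_links = {
    "/": ["/shop/", "/blog/"],
    "/shop/": ["/shop/shoes/"],
    "/shop/shoes/": ["/shop/shoes/red-sneaker/"],
    "/blog/": [],
    "/orphan/": [],  # no page links here, so a crawler following links cannot reach it
}

depths = click_depths(site_links, "/")
unreachable = sorted(set(site_links) - set(depths))
```

Pages with a high depth value, or pages that end up in `unreachable`, are candidates for additional internal links.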
Difficult to read content
Index budget waste – duplicate content
Duplicate content (identical content found under different URLs) uses up crawling and indexing budgets, since search engines process and index it as well. When webmasters don’t take precautions to prevent the indexing of duplicate content, it not only wastes the crawler’s indexing budget but can also lead to significant ranking fluctuations. It can also mean that relevant, useful content isn’t indexed because the indexing budget has already been exhausted by duplicates. Avoid the following:
HTTPS and HTTP
Most online shops use SSL-encrypted data transfer. But when online shop content is accessible under both http and https, this is duplicate content. The http version of the website should be redirected to the https version using the status code 301, and a new sitemap should be submitted in Google Webmaster Tools so that the change is picked up as quickly as possible.
Capital letters in URLs
Many webmasters don’t know that it makes a difference whether URLs are written in caps or not. Differently written URLs can be deemed duplicate content so make a decision and stick with it.
Trailing slash at the end of a URL
If content is indexed both with a trailing slash at the end of the URL and without, this is classed as duplicate content. Content without a slash should be redirected to the version with the slash at the end, again using status code 301.
Versions with and without www.
When your website or online shop exists both with and without the prefix www., then search engine crawlers have to index the same content twice, wasting time, space and energy. Again, this can be fixed by redirecting the version without www. to the version with www.
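The four URL variants described so far (http vs. https, capital letters, trailing slash, www.) can be handled by one normalization rule. The following is a minimal sketch, not a definitive implementation: the function names are hypothetical, and the choice of lowercase paths is an assumption, since the text only asks you to pick one spelling and stick with it.

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url):
    """Return the single preferred spelling of a URL."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if not host.startswith("www."):
        host = "www." + host  # prefer the version with www.
    path = parts.path.lower() or "/"  # assumption: lowercase is the chosen spelling
    last_segment = path.rsplit("/", 1)[-1]
    # Add a trailing slash unless the last segment looks like a file name.
    if not path.endswith("/") and "." not in last_segment:
        path += "/"
    # Always use https; the fragment is dropped because it never reaches the server.
    return urlunsplit(("https", host, path, parts.query, ""))

def redirect_for(url):
    """Return (301, target) if the URL needs a redirect, otherwise None."""
    target = canonical_url(url)
    return None if target == url else (301, target)
```

For example, `redirect_for("http://Example.com/Shop/Shoes")` yields a 301 to `https://www.example.com/shop/shoes/`, while the canonical URL itself yields `None`. In production the same rule would normally live in the web server configuration rather than in application code.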
Near duplicate content in product descriptions
A problem commonly found in online shops. Product descriptions don’t really leave much room for creativity and so are often very similarly composed, differing only in tiny details. Sometimes they’re identical. Search engines place great value on unique content, so try and make your product descriptions as unique as possible.
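To find product descriptions that are too similar, you can compare them pairwise. The sketch below uses the `difflib` module from the Python standard library; the threshold of 0.9 and the sample SKUs are assumptions chosen for illustration, and search engines do not publish how they measure similarity.

```python
from difflib import SequenceMatcher
from itertools import combinations

def near_duplicates(descriptions, threshold=0.9):
    """Return (id_a, id_b, similarity) for every pair at or above the threshold."""
    pairs = []
    for (id_a, text_a), (id_b, text_b) in combinations(descriptions.items(), 2):
        ratio = SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((id_a, id_b, round(ratio, 2)))
    return pairs

# Hypothetical product descriptions keyed by SKU.
samples = {
    "sku-1": "Classic cotton t-shirt with round neck, available in red.",
    "sku-2": "Classic cotton t-shirt with round neck, available in blue.",
    "sku-3": "Waterproof hiking boots with a reinforced rubber sole.",
}

flagged = near_duplicates(samples)
```

Here only the two t-shirt descriptions are flagged, since they differ in a single word. Pairwise comparison grows quadratically with the number of products, so for large shops it is better suited to checking one category at a time.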
Tag pages and category menus
Whilst so-called tag clouds are useful from a usability point of view, they can be problematic when it comes to SEO. Tag pages tend to generate lists of tagged content identical to the content listed in category menus. As a webmaster, you should check whether your tag pages deliver the same results as your category lists and, if so, ensure that the tag pages aren’t indexed.
Parameter URLs and session IDs
Parameter URLs and session IDs generally reproduce content which has already been indexed under an existing URL. If these URLs are indexed as well, you have duplicate content. Try to avoid this.
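One way to spot such duplicates is to strip the parameters that do not change the page content and see which URLs collapse into the same address. This is a minimal sketch: the parameter names in `IGNORED_PARAMS` are hypothetical examples, and the right list depends on your shop system.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names that do not change what the page shows.
IGNORED_PARAMS = {"sessionid", "sid", "phpsessid",
                  "utm_source", "utm_medium", "utm_campaign"}

def strip_ignored_params(url):
    """Remove session and tracking parameters, keeping all others in order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
```

For example, `https://www.example.com/shoes/?color=red&sessionid=abc123` becomes `https://www.example.com/shoes/?color=red`. The cleaned URL is a natural candidate for the canonical address of the page.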
Changes in Link Structure
SEO on-page optimization can often involve changes in link structure. URLs are usually optimized to contain the most relevant keywords, but when the old URLs are not redirected to the new ones using status code 301, identical content can be found under different web addresses. This is annoying for users and, more damagingly, it is considered duplicate content by search engine crawlers.
Additional extras can create duplicate content
Some websites offer readers the possibility of printing content or downloading as a PDF. Both print and PDF versions can be classed as duplicate content.
How to avoid duplicate content
Technically speaking, there are several ways of telling search engine crawlers that a particular document should not be indexed.
- The noindex meta tag
- Redirect using .htaccess
- Canonical tags
- robots.txt (for instance for PDF files which should not be indexed)
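Three of the options above can be sketched in a few lines. The example below builds the noindex meta tag and the canonical tag as HTML strings and checks a robots.txt file with `urllib.robotparser` from the Python standard library. The directory names `/pdf/` and `/print/` and the helper names are assumptions; plain path prefixes are used because the standard-library parser does not interpret wildcard patterns.

```python
from html import escape
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps crawlers out of the PDF and print versions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /pdf/
Disallow: /print/
"""

def load_rules(robots_txt):
    """Parse robots.txt content given as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def noindex_meta():
    """Meta tag for the <head> of a page that should stay out of the index."""
    return '<meta name="robots" content="noindex">'

def canonical_link(preferred_url):
    """Canonical tag pointing duplicate versions at the preferred URL."""
    return f'<link rel="canonical" href="{escape(preferred_url, quote=True)}">'

rules = load_rules(ROBOTS_TXT)
pdf_allowed = rules.can_fetch("*", "https://www.example.com/pdf/catalog.pdf")
shop_allowed = rules.can_fetch("*", "https://www.example.com/shop/")
```

With these rules `pdf_allowed` is `False` and `shop_allowed` is `True`. Keep in mind that robots.txt controls crawling, so a page that carries a noindex meta tag must remain crawlable for the tag to be seen.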
Facilitate crawling using your sitemap
Sitemaps are a great way of informing the search engine which URLs are the most important for the webmaster and therefore which should be prioritized for indexing. To make sure that these URLs are found straight away at the start of the crawl, the sitemap should be referenced in the robots.txt file.
There are different types of sitemaps:
Sitemaps can also be added to Google Webmaster Tools, where they can also be tested under Crawling > Sitemaps. This is a good idea for webmasters, who can access valuable information about the current crawl state and the indexing of content. Errors and other crawl issues are also identifiable and can therefore be avoided.
Sitemaps are particularly useful when:
- Your website is still relatively new and the search engine doesn’t know its content yet
- Significant changes have been made to your website (for example the URL structure)
- Your website has lots of subpages
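A basic XML sitemap is simple enough to generate yourself. The following sketch builds one with the Python standard library and also produces the line that references it in robots.txt. The URLs and dates are hypothetical, and the function names are illustrative only.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Build sitemap XML from (loc, lastmod) pairs; lastmod may be None."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod  # e.g. "2024-01-15"
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

def robots_sitemap_line(sitemap_url):
    """Line to add to robots.txt so crawlers can find the sitemap."""
    return f"Sitemap: {sitemap_url}"

# Hypothetical pages of an online shop.
sitemap_xml = build_sitemap([
    ("https://www.example.com/", "2024-01-15"),
    ("https://www.example.com/shop/", None),
])
```

The resulting string can be saved as `sitemap.xml` in the root directory of the domain. Only list the preferred version of each URL, so that the sitemap does not itself point crawlers at duplicate content.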
Semantic labelling of content
You can also provide search engines with additional information regarding specific content by using schema.org to produce structured data. The mark-ups make it easier for search engines to categorize content thematically, and thus recognize context. There is already a diverse range of labels available covering all topics and contents. Common labels include:
- Places (relevant for local SEO)
Online shops in particular stand to benefit from the use of the “Reviews” label, since these are often visible as Rich Snippets in the search results. Star ratings in search results stand out and can significantly increase the click-through rate. In future, it will become ever more important to provide search engines with structured content, since queries will increasingly be answered based on the user’s intention.
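Structured data for a product with review ratings can be embedded in a page as JSON-LD. The sketch below generates such a snippet using the schema.org types `Product` and `AggregateRating`; the product name and rating figures are made up, and whether a search engine actually shows star ratings for a page is its own decision.

```python
import json

def product_jsonld(name, rating_value, review_count):
    """Return a JSON-LD script tag describing a product and its average rating."""
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "aggregateRating": {
            "@type": "AggregateRating",
            "ratingValue": rating_value,
            "reviewCount": review_count,
        },
    }
    # Escape "</" so that text inside the JSON cannot close the script tag early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"

# Hypothetical product data.
snippet = product_jsonld("Red Sneaker", 4.6, 128)
```

The returned string is placed in the HTML of the product page. The rating values should match the reviews that are visible to users on that page.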