Pull Your Site Out Of The Supplemental Index
By Richard Hearne
Article Date: 2007-04-10
I met Krishna De in Cork last month. She gave a fantastic presentation on marketing and leveraging the Internet to achieve your business goals.
In fact, without prejudice to any of the other speakers, I found that Krishna's topic area was of the most interest to me. Krishna also availed of my offer for a free site review. So without further ado…
KrishnaDe.com
I have to say I have always admired Krishna's website. It is well polished from the get-go, and the homepage simply speaks 'professionalism' to me.
If I were to find any fault it would be with the footer: I can't easily distinguish text from links. But that would just be nit-picking.
More than meets the eye
It was only when I sent in a spider that the true size of Krishna's site became apparent. I knew that her blog has been on-line for a number of years and so expected the blog to be quite extensive. But I hadn't expected this:
Crawler 1: 2,306 internal pages
Crawler 2: 2,604 pages (some external)
A look at Google's index shows that Krishna's site has a high number of pages in the supplemental index:
Pages Indexed: 1,330
Pages Supplemental: 964
That's a particularly high proportion of supplemental to indexed pages (roughly 72 supplemental for every 100 indexed), and to me this is the most pressing issue for Krishna.
A robot's-eye view
Here's Krishna's robots.txt file:
User-agent: *
Disallow: /_mm/
Disallow: /_notes/
Disallow: /_baks/
Disallow: /MMWIP/
Disallow: /audio-for/
Disallow: /private/
Disallow: /onlinebrand/
User-agent: googlebot
Disallow: *.csi
When I look at some of the files that have made their way into the supplemental index I can see immediately that many should not be indexed in the first place.
HOLD THE PRESS - I've just noticed that Krishna's site has been hacked:
Those links at the top of the page shouldn't be there. That view is taken from Google's cached version of the page; here's the original page. This type of hack is normally carried out by altering the .htaccess file to cloak your pages for Googlebot: normal users are shown the clean original, while Google sees the page with the injected links.
I've seen this hack a lot recently. The best medicine is to make sure that your software is up to date. There have been issues with WordPress, and that's why the WordPress guys are very much on the ball with updates. You also have to check your server carefully to see what else has been left around. The first file I would check is .htaccess, although in this case I have a feeling there may be a bit more going on.
I can't tell for sure if Krishna has fixed this. This hack might be a bit more elaborate than normal user-agent sniffing. When I access the page as Googlebot I get the clean version, so the hack has either been treated or is using IP delivery or a reverse lookup to cloak only for the real Googlebot. I sent Krishna a mail as soon as I found this, so hopefully she already knows about it and has had it patched.
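Because a spoofed user agent won't reveal IP-based cloaking, one rough way to check is to compare the links in Google's cached copy with the links in the page a normal visitor receives. The following is only a minimal sketch of that comparison: the two HTML strings are made-up stand-ins for the cached and live copies, and the function names are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


def extract_links(html):
    """Return the set of link targets found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def cloaked_links(html_seen_by_bot, html_seen_by_user):
    """Links present in the crawler's copy but missing from the visitor's copy."""
    return extract_links(html_seen_by_bot) - extract_links(html_seen_by_user)


# Hypothetical page bodies standing in for Google's cached copy and the live page.
cached_copy = '<a href="http://spam.example/pills">pills</a><a href="/blog/">Blog</a>'
live_copy = '<a href="/blog/">Blog</a>'
print(sorted(cloaked_links(cached_copy, live_copy)))
```

Any link that turns up only in the crawler's copy deserves a closer look. An empty result does not prove the site is clean, since the cached copy may simply predate or postdate the hack.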
Back to work…
There's not a lot I can do while I wait to hear back from Krishna. So I'm going to go ahead with what I think Krishna should do to fix the supplemental issues.
The crawler found 2,306 resources in Krishna's site. It also found about 100 cases of duplicate content covering about 250 pages (the homepage was accessible via 4 URLs). Most of the duplicate content came from the trailing slash problem, where the same page is served both with and without a trailing slash. Krishna can solve most of this by installing a small WordPress plugin called Permalink Redirect.
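To get a feel for how a crawl report can surface this kind of duplication, here is a small sketch that groups URLs by an assumed canonical form. The normalisation rules (lowercase host, trailing slash on directory-style paths, index files folded into their directory) and the example.com URLs are my assumptions for illustration, not the rules the crawler or the plugin actually uses.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit

# Assumption: these file names serve the same content as their directory.
INDEX_FILES = ("index.php", "index.html", "index.htm")


def canonical_url(url):
    """Reduce a URL to one assumed canonical form."""
    parts = urlsplit(url)
    path = parts.path or "/"
    for name in INDEX_FILES:
        if path.endswith("/" + name):
            path = path[: -len(name)]
            break
    # Add a trailing slash only to directory-style paths (no file extension).
    last_segment = path.rsplit("/", 1)[-1]
    if last_segment and "." not in last_segment:
        path += "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def group_duplicates(urls):
    """Map each canonical URL to the crawled variants, keeping only real duplicates."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonical_url(url)].append(url)
    return {canon: found for canon, found in groups.items() if len(found) > 1}


# Hypothetical crawl output.
crawled = [
    "http://www.example.com",
    "http://www.example.com/",
    "http://www.example.com/index.php",
    "http://www.example.com/blog/about",
    "http://www.example.com/blog/about/",
    "http://www.example.com/blog/contact/",
]
for canon, variants in group_duplicates(crawled).items():
    print(canon, "<-", variants)
```

Each group is a set of URLs that should collapse into one, ideally via a 301 redirect to the canonical version.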
Next step, Krishna needs to update that robots.txt file. I would add in the following to stop Google crawling certain areas of the site:
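User-agent: *
Disallow: /_mm/
Disallow: /_notes/
Disallow: /_baks/
Disallow: /MMWIP/
Disallow: /audio-for/
Disallow: /private/
Disallow: /onlinebrand/
Disallow: /learningzone/
Disallow: /blog/wp-content/plugins/
# Googlebot obeys only its own group, so the rules above are repeated here
User-agent: googlebot
Disallow: /_mm/
Disallow: /_notes/
Disallow: /_baks/
Disallow: /MMWIP/
Disallow: /audio-for/
Disallow: /private/
Disallow: /onlinebrand/
Disallow: /learningzone/
Disallow: /blog/wp-content/plugins/
Disallow: *.csi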
Somewhere in Krishna's blog she has linked to her plug-in directory. The result is that Google has indexed a tonne of files from her WordPress plug-in directory. This has two effects:
1. increases the site size, and therefore the PageRank needed to carry each page;
2. decreases the PageRank passed to each page as there are more internal links than needed.
So not only should Krishna remove the links to those pages, she should also make sure that the bots no longer crawl resources that shouldn't be in the index. The two most obvious offenders I could see for low-value filler content were Learning Zone (/learningzone/) and the plug-in directory (/blog/wp-content/plugins/). So I've disallowed the bots from those areas.
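Before uploading a changed robots.txt it is worth checking that it blocks what you expect. The sketch below uses Python's standard urllib.robotparser against a trimmed copy of the generic rules; the crawler name ExampleBot and the sample paths are hypothetical. The standard-library parser does plain prefix matching and does not understand wildcards, so the *.csi rule is left out here.

```python
from urllib.robotparser import RobotFileParser

# Trimmed copy of the generic (User-agent: *) rules discussed above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /learningzone/
Disallow: /blog/wp-content/plugins/
"""


def blocked_paths(robots_txt, paths, user_agent="ExampleBot"):
    """Return the paths that the given crawler may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]


# Hypothetical paths: two that should be blocked and one ordinary blog post.
sample_paths = [
    "/learningzone/calendar.php",
    "/blog/wp-content/plugins/some-plugin/readme.txt",
    "/blog/some-post/",
]
print(blocked_paths(ROBOTS_TXT, sample_paths))
```

Keep in mind that a crawler matched by a more specific group (such as googlebot) is checked against that group only, so any rule meant for it has to appear in that group as well.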
Calendars can drive bots batty
I've found that dynamic calendars are very often the worst culprits for driving search engine bots around the twist. And Krishna's site hasn't let me down. Within the Learning Zone there is a dynamic calendar. This is just one more reason to keep the bots out of there.
Permalinks
I notice that the crawler came back with a large chunk of default WordPress page URLs. These are the URLs that look like www.mysite.com/?p=1234. Krishna must have changed over to the more search-engine-friendly permalink structure, but not changed all her internal links.
Although there could be quite a bit of work involved, I think it would be useful to fix this issue. I saw some duplicate content issues due to the use of both default and permalink structures. If you are interested in the duplicate content URLs, here's the full report:
krishna-de-dupes.txt
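Finding the leftover default-style links by hand would be tedious, so here is a rough sketch of how a list of crawled URLs could be filtered for them. It assumes the default structure is a numeric p parameter in the query string, as in the ?p=1234 example; the function names and sample URLs are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def is_default_permalink(url):
    """True for default-style WordPress URLs such as /?p=1234."""
    query = parse_qs(urlsplit(url).query)
    post_ids = query.get("p", [])
    return len(post_ids) == 1 and post_ids[0].isdigit()


def default_permalinks(urls):
    """Return only the URLs that still use the default ?p=N structure."""
    return [u for u in urls if is_default_permalink(u)]


# Hypothetical crawl output mixing both URL structures.
crawled = [
    "http://www.example.com/?p=1234",
    "http://www.example.com/blog/my-first-post/",
    "http://www.example.com/blog/?p=42",
    "http://www.example.com/search/?q=marketing",
]
print(default_permalinks(crawled))
```

Every URL in the result is an internal link that should be updated to point at the permalink version of the same post.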
Other thoughts
My eyes are getting a bit weary now, but there are just a couple of other thoughts on Krishna's blog.
Internal linking can be a great way to help your pages rank well. For a start you can control the anchor text used, and anchors are what give relevancy to the linked material. Google loves anchors, so don't use 'click here' or 'look at this' where you could use great descriptive anchors for your links.
I looked through some of Krishna's posts and the thing that struck me was the lack of links. A great way to keep posts out of the supplemental index AND boost your internal traffic is to cross-link in your posts. If you discussed something previously which is related to your current post, then link to it. And use good descriptive anchor text in your links. It's amazing how just one or two good internal links can see pages jump out of the supplemental index.
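If you want to audit your own posts for weak anchors, something along these lines could flag them. It is only a sketch: the list of generic phrases is a small illustrative assumption, and the sample HTML is made up.

```python
from html.parser import HTMLParser

# Assumption: a small, illustrative list of non-descriptive anchor phrases.
GENERIC_ANCHORS = {"click here", "look at this", "here", "read more"}


class AnchorAuditor(HTMLParser):
    """Records (href, anchor text) for every link in a page."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse whitespace so "click   here" still matches.
            text = " ".join("".join(self._text).split())
            self.anchors.append((self._href, text))
            self._href = None


def generic_anchors(html):
    """Return the links whose anchor text is on the generic-phrase list."""
    auditor = AnchorAuditor()
    auditor.feed(html)
    auditor.close()
    return [(href, text) for href, text in auditor.anchors
            if text.lower() in GENERIC_ANCHORS]


# Hypothetical post body.
post = ('<p>To read my earlier post <a href="/blog/old-post/">click here</a>. '
        'See also <a href="/blog/anchor-text/">why descriptive anchor text matters</a>.</p>')
print(generic_anchors(post))
```

Each flagged link is a candidate for rewording so that the anchor describes the page it points to.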
I hope Krishna has fixed this up
It's such a pain in the rear when hackers get into your site. And it goes to show that you can never be too careful with the security of your website. Hopefully Krishna either has this sorted or soon will.
And if you want to see a great example of a blog that shows you what on-line marketing is, I would strongly advise that you head over to Krishna De's website.
About the Author:
Richard Hearne is the founder of Red Cardinal, a dedicated search marketing consultancy. A frequent contributor to Google's Webmaster Group, Richard regularly advises clients on Internet marketing strategy and Search Engine optimisation campaigns. Richard's thoughts and research can be found on his search marketing blog.