Thursday, June 2, 2011

Posts regarding the mechanics of indexing.

Googlebot noticed that your site uses an SSL certificate which may be considered invalid by web browsers
Without knowing your site, it's hard to judge the exact situation. In general, we send this message when we think that you might have content hosted on https and have found the SSL certificate to be invalid. In practice, this doesn't affect your site's crawling, indexing, or ranking. It may, however, confuse users when they click on one of these results and see a certificate warning in their browser, which is why we flag it for webmasters. 

If you really don't use https, then I'd double-check to make sure that none of your content is being indexed like that, and then you're welcome to ignore this message. If you do find content indexed as https, I'd recommend using the usual canonicalization methods to resolve that:
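One of those methods is a sitewide 301 redirect from the https URLs to their http equivalents. As a minimal sketch of the URL mapping involved (a hypothetical helper, not part of any framework, assuming the whole site is meant to be served over http):

```python
from urllib.parse import urlsplit, urlunsplit

def http_canonical(url):
    """Return the http:// equivalent of an https:// URL, or None if no
    redirect is needed. Sketch for a site whose canonical scheme is http
    and which wants any indexed https URLs 301-redirected."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return None  # already on the canonical scheme, serve normally
    # keep host, path, query and fragment; only swap the scheme
    return urlunsplit(("http",) + tuple(parts)[1:])
```

The server would then answer requests for any URL where this returns a target with a `301 Moved Permanently` pointing at that target.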

Crawling www and non-www
For example, many sites let both the www and non-www versions of their site get indexed, without using something to control canonicalization (such as a redirect or the rel=canonical). Our algorithms will try to concentrate on one of those versions for search, meaning that we'd tend to crawl & index that version much more frequently than the other one. In a case like that, it could happen that the less-favored version was last seen with an older version of the CMS, just because we haven't been crawling it as frequently.

... having a site indexed with www and non-www URLs at the same time is not a problem and generally wouldn't result in fluctuations in ranking in web-search. It helps us a little bit when we can focus on a single host name (www or non-www) since we don't have to worry about crawling both versions, but for the largest part, our algorithms also get along fine when there's a mixture of both.
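If you do want to concentrate on one hostname, the server-side mapping behind such a redirect can be sketched like this (a hypothetical helper; the function name and the default preference for www are assumptions):

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_host(url, prefer_www=True):
    """Return the 301 target that maps a URL onto the preferred
    www / non-www hostname, or None if the URL is already canonical."""
    parts = urlsplit(url)
    host = parts.netloc
    has_www = host.startswith("www.")
    if has_www == prefer_www:
        return None  # already on the preferred hostname
    host = host[4:] if has_www else "www." + host
    return urlunsplit((parts.scheme, host) + tuple(parts)[2:])
```

Whichever version you prefer, the important part is redirecting consistently, so that crawlers only have to fetch one copy of each page.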

Canonical tags: following, pagination, print and product pages.
When we see a canonical link element like that and follow it (which we mostly do), we'll treat it similarly to a redirect. So if you play around with the rel=canonical, you have to be very careful because you won't see the "redirect" that Googlebot will use for indexing.
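For reference, a canonical link element is just a single line in the page's `<head>`; the URL below is a placeholder:

```html
<!-- on the non-canonical (duplicate) page; example.com is a placeholder -->
<link rel="canonical" href="http://www.example.com/guide-to-italy">
```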

- Pagination: this is complicated, and I personally would be careful when using rel=canonical with paginated lists. The important part is that we should be able to find all products listed, so at the very least those lists should provide a default sort order where we can access (and index) all pages. Since this is somewhat difficult unless you really, really know what you are doing, I would personally avoid adding rel=canonical for these pages. One possible solution could be to use JavaScript for paginated lists with different sort orders; that way you would have a single URL which lists all products.

- Printer-friendly pages: Personally, I'd suggest just using a normal printer style sheet, which would let you keep the same URLs. Short of that, using a rel=canonical is also fine.

- Product pages: If you have separate URLs for the same products (e.g. books > non-fiction > guide-to-Italy vs. books > travel > guide-to-Italy) then picking one and pointing the rel=canonical on the other pages at it is fine. Setting a category page as canonical seems like a bad idea since we won't be able to index the product pages.

Crawling canonical links
The data shown there is based on our crawling activity, which is why you'd see those URLs there if you're using rel=canonical. We have to crawl and index these URLs first, before the rel=canonical is extracted, so it may even happen that they are temporarily visible in the search results. That's fine - and not something you'd need to prevent. As we process the content there, we'll focus on your preferred canonical for further indexing.

The sky is not falling: www and non-www
In general, just having a site accessible via www and non-www is not so much of a problem.

We're generally pretty good at figuring that out
While cleaning up issues like canonicalization with 301 redirects is good, it isn't the most important thing on a website. If it gets way too complicated to fix that with your current setup, I'd just leave them as is, perhaps using Webmaster Tools to select your preference if you can. We're generally pretty good at figuring that out, no need to worry too much about it :-).

Google auto-canonicalise?
Yes, we can and do sort this kind of issue out algorithmically all the time :-). Most sites don't specify a canonical in Webmaster Tools, yet we index them just the same. That said, if we notice that multiple URLs show the same page, we'll just pick one of those and show it to our users in the search results. By doing that, there's a chance that we might not pick the one which YOU prefer -- so with this setting and with the 301 redirect you have a way of telling us your preference.
There are a few other advantages of specifying one or the other. For example, in order for us to notice that the content on both URLs is the same, we have to actually crawl both versions. Depending on your website and on your server, this might not be a problem -- or it might be a big problem (if accessing those URLs uses a lot of your resources). By using a redirect or specifying a canonical version you can help reduce that overhead.
At any rate, no, you certainly don't have to do this; it's just something that you could do if you wanted to :).
Regarding the original question, if we have chosen to index your site as "" then you won't find it by searching for "" (because we don't have the "www" part in the URLs). However, if you turn it around and tell us to index "" we'll have both versions available. Regardless of that, when a user searches for your URL they generally already know how to reach you, so this is usually not something worth getting grey hair over.

Cleaning up the index?
From the search you mentioned, I searched for some of the product titles there. For the ones that I checked, your HTTPS pages did not show up in the search results, so I wouldn't really worry about it. Give it time and as we recrawl these pages, we'll update them in the index accordingly. At any rate, since the pages redirect to the preferred ones, you wouldn't have to specify the "noindex" x-robots-tag anyway and in addition, any users who happen to come through the HTTPS pages will make it to your site regardless. There's generally no need to clean up the indexed URLs this granularly :-).

Session URLs in Sitemap files
If you are not submitting clean URLs in your Sitemap file, you'd be better off not using a Sitemap file. With session-IDs in there, it'll cause more problems (with us crawling and indexing those URLs) than if you just let us crawl your website normally (especially if you really have a clean URL structure). So my advice would be to either delete the Sitemap file, or make sure that the submitted URLs are really exactly the same, clean ones that we find while crawling.
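A sketch of what "cleaning" the submitted URLs could look like before writing them to a Sitemap file — the parameter names here are assumptions about which query parameters carry session IDs on a given site:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: these query parameters carry session IDs on this site.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

def clean_url(url):
    """Drop session-ID query parameters so the Sitemap lists the same
    clean URLs that crawlers find by following links."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Running every candidate URL through a filter like this (or better, generating the Sitemap from the clean URL structure directly) keeps the submitted URLs identical to the ones found while crawling.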

304 Not Modified
As many servers are incorrectly configured, we do not always crawl using conditional requests, so what you are seeing -- as far as I understand it -- would be normal. Additionally, as Cristina mentioned, the "Fetch as Googlebot" feature will always use unconditional requests, so you should see the "200 OK" there as well. Finally, the type of request made will generally not have an influence on your site's ranking (assuming your server is returning the proper content for those requests).
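The server-side decision for a conditional request can be sketched like this (a simplified, hypothetical handler; real servers also honor `ETag` / `If-None-Match`):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def respond(last_modified, if_modified_since=None):
    """Decide between '200 OK' and '304 Not Modified' for a GET by
    comparing the resource's Last-Modified time with the client's
    If-Modified-Since header, if one was sent."""
    if if_modified_since is not None:
        since = parsedate_to_datetime(if_modified_since)
        if last_modified <= since:
            # unchanged since the client's copy: send headers only
            return "304 Not Modified"
    # unconditional request, or resource changed: send the full body
    return "200 OK"
```

An unconditional fetch (no `If-Modified-Since` header, as with "Fetch as Googlebot") always takes the `200 OK` path, which is consistent with what's described above.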

302 redirect away from root
For what it's worth, a 302 redirect is the correct redirect from a root URL to a detail page (such as from "/" to "/sites/bursa/"). This is one of the few situations where a 302 redirect is preferred over a 301 redirect. However, as Colin mentioned, if you were hosting this yourself, you might want to look into saving an additional jump by just serving the content directly (it's not necessary, but if you can do it, it's always nice to save the user a redirect).

Generally speaking, with a 302 redirect we'd try to take the content of the redirect target (in your case PAGE-B) and index it under the redirecting URL (in your case PAGE-A). If the target has a noindex meta tag, then it's likely that we'd apply that to the redirecting URL as well.
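That difference can be summarized in a toy function (purely an illustration of the description above, not Google's actual logic):

```python
def indexed_url(redirect_status, source, target):
    """Illustration: with a 302, the target's content tends to be indexed
    under the redirecting URL; with a 301, under the target URL."""
    return source if redirect_status == 302 else target
```

So for the example above, a 302 from "/" to "/sites/bursa/" keeps the content indexed under the root URL, which is usually what you want for a root-to-detail redirect.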

Change of Hosting
It seems you changed hosting infrastructure around May 11th. When our algorithms see a hosting change, they try to lower Googlebot's crawling rate as a safety mechanism to not overload the servers. In time, as we crawl more and learn more about the crawling load the hosting seems capable of handling, the algorithms will automatically try to increase the crawl rate. You're seeing this process in the roughly 30% growth in crawl rate you reported recently, and there is a good chance that it will continue to grow.

Making a great site
Looking through here, I think Cristina mentioned a really good point -- having great content, especially on your homepage, can do wonders for your site's visibility in search results. Not only will it provide something for our crawlers to pick up & to help us better understand your website, but it will also be something that can and will attract links from other websites.
In my opinion, next to having a technically "ok" website, the content itself is one of the biggest "SEO elements" that you can work on for your site. That's not something you need an SEO company for; that's something which you -- as the expert in that business -- need to work on yourself. Make something that you would recommend to others in the same business!
