I was discussing the issue of “hidden” or “protected” content with a client yesterday: as a website owner you want as much content as possible in the search engines’ index, so that your site will be found, but you don’t actually want humans to see it without registering or paying.

This is an issue that has plagued paid-for content sites for years (see Danny Sullivan’s history lesson here). The problem is that, although there are pretty simple technical ways of letting search engine spiders into your site while keeping out the casual human browser, pretty much any of them constitutes “cloaking” in the eyes of the search engines. If you have a look at Google’s Webmaster Guidelines on the subject, you can understand why this practice is frowned upon – they don’t want users to be taken somewhere they weren’t expecting, as that could severely affect the quality of the user experience and ultimately lead to people using another search engine.

While I was on holiday, Google published a blog post attempting to deal with this problem – they want users to be able to find “protected” content because it may be just what they’re looking for, but not at the expense of inviting spam into the index. The solution is simple – allow Googlebot to index your site and when a user finds that page via a Google search, let them see the full page. If they want to access another “protected” page, Google is quite happy for you to require registration/payment; but not for that first page/article they clicked through to from the search result. They call it “First Click Free” (FCF), something that has been accepted in Google News search for some time.
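
To make the mechanics concrete, here is a minimal sketch in Python of the sort of check a site might run when deciding whether a visitor gets the full page. It is my own illustration, not anything Google has published: the host list and both function names are made up for the example, and it assumes the site decides purely on the Referer header that the browser sends.

```python
from urllib.parse import urlparse

# Hypothetical list of hosts treated as "arrived from a Google search".
# A real site would need to decide for itself which hosts to include.
GOOGLE_HOSTS = ("google.com", "google.co.uk")


def came_from_google(referer):
    """Return True if the Referer header value points at a listed Google host."""
    if not referer:
        return False
    host = (urlparse(referer).hostname or "").lower()
    # Match the bare domain or any subdomain, but not look-alikes
    # such as "notgoogle.com".
    return any(host == g or host.endswith("." + g) for g in GOOGLE_HOSTS)


def should_show_full_page(referer, is_registered):
    """Registered users always get the page; others only on a click from Google."""
    return is_registered or came_from_google(referer)
```

Note that the Referer value comes from the visitor’s own browser, so the site has to take it on trust.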

Initially, that sounds like a sterling solution. But it doesn’t take long to realise the problems here – firstly, a simple site: command search on Google for the site in question will reveal every page on the site. According to Google’s rules, if you click on any of those pages in the search result, you should see the whole article for free. So, a simple run down the full list of pages provided by that site: search gives you access to every page of paid content on the site in question.

Secondly, there are some simple technologies freely available out there to make you appear to be Googlebot or to make it look like every page you view has been referred from a Google search (here’s just one). So, using these, it would be simple to browse a site conforming to Google’s FCF rules and get access to every page – you wouldn’t even need to keep going back to that site: search listing.

So, what should the webmasters of such sites do? Well, you could take the view that the vast majority of web users have no idea about the site: command, changing user agents or accessing Google’s cache (the “Cache” link that appears under each search result that shows Google’s copy of the page in its database, rather than the “live” page). In which case, the vast majority of your site’s visitors will experience the site just as Google suggests.

However, if this becomes a popular method of allowing Google access to hidden content, how long before tools are developed and widely publicised to make things like changing your user agent incredibly easy? Eventually, there will be enough users doing it to really affect your site. In that case, there are a couple of options:

  1. Create summary pages that contain “teaser” information to get the user’s attention and to work well enough in terms of SEO. In this case, your full protected pages won’t be accessible to Google or anyone else, but if the summary pages contain sufficient information and are optimised, they should still appear in searches and therefore do the job.
  2. Change your business model slightly. Allow everyone access to at least one page of protected content when they arrive, then request registration when they move to another page. This is like Google’s FCF model, except it is universal rather than applying only to Google users. If so desired, you could use the <meta name="robots" content="noarchive"> tag in the head of your pages to prevent search engines making copies in their cache. However, this may have a negative impact on pages’ performance in search results, as search engines like to compare copies of a page over time to assess its “trustworthiness” and topical relevancy. Remember also that this may restrict crawling of your pages, as Google will experience the site in the same way – it will be able to access one page, but then get the “registration required” message. I would be interested to know if anyone has tried this and whether an XML sitemap gets all the pages indexed anyway.
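
The second option could be sketched roughly as follows, again in Python. This is only an illustration under my own assumptions: `decide_access`, the `FREE_PAGES` allowance and the `free_pages_seen` key are invented for the example, and `session` stands for whatever per-visitor store your platform provides (typically something tied to a cookie).

```python
FREE_PAGES = 1  # assumption: one free protected page per visitor


def decide_access(session, page_id, is_registered):
    """Return "full" or "register" for a request to a protected page.

    `session` is a dict-like per-visitor store; the pages a visitor has
    been given for free are recorded in it.
    """
    if is_registered:
        return "full"
    seen = session.setdefault("free_pages_seen", [])
    if page_id in seen:
        return "full"  # re-reading the free page stays free
    if len(seen) < FREE_PAGES:
        seen.append(page_id)
        return "full"
    return "register"


# Usage: a new visitor reads one article, then asks for a second.
visitor = {}
first = decide_access(visitor, "article-1", False)   # "full"
second = decide_access(visitor, "article-2", False)  # "register"
```

How a crawler experiences this depends entirely on how the visitor is tracked. If the allowance lives in a cookie-backed session, a client that doesn’t keep cookies would presumably look like a new visitor on every request, which cuts both ways: it could help with crawling, but it also helps anyone who simply clears their cookies.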

If I come across any other ideas, I’ll add to this post.
