Read articles behind paywalls by masquerading as Googlebot

Martin Brinkmann
Feb 26, 2016
Updated • Feb 11, 2020

The Internet is at a tipping point. The continued rise of adblocking has put an end to the revenue model that relies solely on ad dollars to operate websites and businesses.

News sites in particular have started to experiment with ways to diversify income sources, and one prominent option that sites like The Wall Street Journal, Financial Times, The New York Times, the Times, or The Washington Post have implemented or tested is the paywall system.

There are different types of paywalls but they all have in common that they block access to content; this may happen directly when the first article is opened, after a certain number of articles have been read on site, or as an excerpt system that displays the first paragraph to the reader and below that sign-up information to read the rest.

Paywalls may not always require users to pay money for access. Some sites may require users to sign up to use the site but won't charge users once they have signed up.


A paywall may make sense from a business point of view, and may be more lucrative than battling it out with users who run adblockers, but it has downsides for both the paywalled site and the blocked user.

Sites lose a high percentage of visitors if they implement a paywall system. It is unclear how high the percentage really is, and it probably varies from site to site, but it is likely a lot higher than the percentage of visitors who subscribe to the site after being presented with the choice to subscribe to read the desired article.

For users, it can be really frustrating to follow a link to an interesting-sounding article only to be blocked from reading it once the resource has loaded; it is a waste of time for many, especially if no content is provided prior to signing up or subscribing.

Masquerade your browser

It is no secret that news sites allow access to news aggregators and search engines. If you check Google News or Search for instance, you will find articles from sites with paywalls listed there.

In the past, news sites allowed access to visitors coming from major news aggregators such as Reddit, Digg or Slashdot, but that practice seems to be as good as dead nowadays. Some may still allow it but it is trial and error, and the workaround may be shut down at any time.

Another trick, pasting the article title into a search engine to read the cached copy of the story directly, does not seem to work properly anymore either, as articles on sites with paywalls are usually no longer cached.

Tip: check out the following add-on that you may use to bypass paywalls:

User-Agent and Referrer

You are probably wondering how sites block or allow access to the site's content. The methods have improved over the years, and it is no longer enough to simply change the referrer of the browser to https://www.google.com/ to gain full access to a site's content.

Instead, sites use various checks that include user-agent, referrer and cookies, and sometimes even more than that, to determine the legitimacy of access.
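To make the idea more concrete, here is a minimal sketch of the kind of server-side check described above. It is not taken from any real site: the function name, the cookie name "subscriber" and the rules themselves are assumptions chosen for illustration, and real paywalls are likely to be stricter.

```python
from typing import Mapping
from urllib.parse import urlparse

# Hypothetical rules for illustration only; real sites use their own logic.
CRAWLER_TOKENS = ("Googlebot",)
TRUSTED_REFERRER_HOSTS = {"www.google.com"}


def allows_full_article(headers: Mapping[str, str]) -> bool:
    """Return True if this illustrative paywall would show the full article."""
    user_agent = headers.get("User-Agent", "")
    referrer = headers.get("Referer", "")  # HTTP spells the header "Referer"
    cookies = headers.get("Cookie", "")

    # 1. Subscribers are recognized by a (hypothetical) cookie.
    if "subscriber=1" in cookies:
        return True
    # 2. Requests that identify as a search engine crawler are let through.
    if any(token in user_agent for token in CRAWLER_TOKENS):
        return True
    # 3. Visitors arriving from a trusted referrer are let through.
    return urlparse(referrer).hostname in TRUSTED_REFERRER_HOSTS
```

A check like this trusts whatever the client sends, which is why changing the user agent or referrer can work at all. Sites that add further checks on top will not be fooled by header changes alone.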

General information

Probably the best way to masquerade the browser is to make it appear to be Googlebot.

  • Referrer:  https://www.google.com/
  • User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Note that the option does not work anymore on many sites. It may be better to try to masquerade as coming from Twitter or other social media sites.
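The referrer and user agent listed above can also be set from a script. The following is a minimal sketch using Python's standard library; the helper name build_request and the example URL in the usage note are assumptions, and there is no guarantee that any particular site responds to these headers.

```python
import urllib.request

# The two values from the list above.
GOOGLE_REFERRER = "https://www.google.com/"
GOOGLEBOT_USER_AGENT = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)


def build_request(url: str) -> urllib.request.Request:
    """Prepare (but do not send) a request that carries both headers."""
    return urllib.request.Request(
        url,
        headers={
            "Referer": GOOGLE_REFERRER,  # HTTP spells the header "Referer"
            "User-Agent": GOOGLEBOT_USER_AGENT,
        },
    )
```

To actually fetch a page, the prepared request can be passed to urllib.request.urlopen, for example urllib.request.urlopen(build_request("https://news.example.com/article")), where the address is a placeholder.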

Firefox


Firefox users need two browser add-ons for that: the first, RefControl, to change the referrer value when visiting news sites, the second, User Agent Switcher, to change the user agent of the browser.

Update: RefControl is no longer available. You may try this alternative instead. End

  1. Download and install both extensions in the Firefox web browser.
  2. Tap on the Alt-key, and select Tools > RefControl Options.
  3. Click on "add site", enter a domain name under site, select custom action, and enter https://www.google.com/ as the referrer.
  4. Repeat this for all news sites you want to access (some may not work even if you make the changes, so keep that in mind).
  5. When you are done, close the configuration window.
  6. Tap on the Alt-key again, and select Tools > Default User Agent > Edit User Agents from the menu.
  7. Select New > User Agent, and replace the string in the User Agent field with Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html). Name it Googlebot.
  8. Exit the menu.
  9. Before you access these sites, tap on Alt, and select Default User Agent > Googlebot.

This is all there is to it. It is a bit unfortunate that there is no extension for Firefox that changes the user agent automatically based on the sites you visit.

Google Chrome

Google Chrome users can install extensions like User Agent Switcher and Referer Control that are available for the browser to do the same.

There is however another possibility, and that is to create a custom extension which automates the process in the browser.

Instructions are provided on Elaineou. All it takes, basically, is to create a new directory on the local computer, create the two files background.js and manifest.json inside it, and copy and paste the code found on the site into the files.

You need to enable "developer mode" on chrome://extensions/, and can then select "load unpacked extension" to pick the folder you have created the two files in to load the extension in Chrome.

You may modify the list of sites it supports to add new ones.
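The core logic of such an extension can be sketched as follows. This is not the code from Elaineou, and it is written in Python rather than the JavaScript a Chrome extension uses; the site list, the function name rewrite_headers and the matching rule are assumptions made for illustration. The idea is simply to override the two headers for requests to listed sites and leave all other requests alone.

```python
from typing import Dict, Iterable
from urllib.parse import urlparse

GOOGLE_REFERRER = "https://www.google.com/"
GOOGLEBOT_USER_AGENT = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)

# Hypothetical site list; this is the part you would edit to add new sites.
SITES = ["example-news.com", "paywalled.example"]


def rewrite_headers(
    url: str, headers: Dict[str, str], sites: Iterable[str] = SITES
) -> Dict[str, str]:
    """Return a copy of headers, overridden only for URLs on listed sites."""
    host = urlparse(url).hostname or ""
    updated = dict(headers)  # never modify the caller's dictionary
    # Match the domain itself or any of its subdomains (e.g. "www.").
    if any(host == site or host.endswith("." + site) for site in sites):
        updated["User-Agent"] = GOOGLEBOT_USER_AGENT
        updated["Referer"] = GOOGLE_REFERRER
    return updated
```

Keeping the site list separate from the rewriting logic means that supporting another site only requires adding one entry to the list.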


Comments

  1. ilev said on August 4, 2012 at 7:53 pm
    Reply

    Doesn’t Windows 8 know that www. or http:// are passe ?

    1. Martin Brinkmann said on August 4, 2012 at 7:57 pm
      Reply

      Well it is a bit difficulty to distinguish between name.com domains and files for instance.

    2. Leonidas Burton said on September 4, 2023 at 4:51 am
      Reply

      I know a service made by google that is similar to Google bookmarks.
      http://www.google.com/saved

  2. VioletMoon said on August 16, 2023 at 5:26 pm
    Reply

    @Ashwin–Thankful you delighted my comment; who knows how many “gamers” would have disagreed!

  3. Karl said on August 17, 2023 at 10:36 pm
    Reply

    @Martin

    The comments section under this very article (3 comments) is identical to the comments section found under the following article:
    https://www.ghacks.net/2023/08/15/netflix-is-testing-game-streaming-on-tvs-and-computers/

    Not sure what the issue is, but have seen this issue under some other articles recently but did not report it back then.

  4. Anonymous said on August 25, 2023 at 11:44 am
    Reply

    Omg a badge!!!
    Some tangible reward lmao.

    It sucks that redditors are going to love the fuck out of it too.

  5. Scroogled said on August 25, 2023 at 10:57 pm
    Reply

    With the cloud, there is no such thing as unlimited storage or privacy. Stop relying on these tech scums. Purchase your own hardware and develop your own solutions.

    1. lollmaoeven said on August 27, 2023 at 6:24 am
      Reply

      This is a certified reddit cringe moment. Hilarious how the article’s author tries to dress it up like it’s anything more than a png for doing the reddit corporation’s moderation work for free (or for bribes from companies and political groups)

  6. El Duderino said on August 25, 2023 at 11:14 pm
    Reply

    Almost al unlmited services have a real limit.

    And this comment is written on the dropbox article from August 25, 2023.

  7. John G. said on August 26, 2023 at 1:29 am
    Reply

    First comment > @ilev said on August 4, 2012 at 7:53 pm

    For the God’s sake, fix the comments soon please! :[

  8. Kalmly said on August 26, 2023 at 4:42 pm
    Reply

    Yes. Please. Fix the comments.

  9. Kim Schmidt said on September 3, 2023 at 3:42 pm
    Reply

    With Google Chrome, it’s only been 1,500 for some time now.

    Anyone who wants to force me in such a way into buying something that I can get elsewhere for free will certainly never see a single dime from my side. I don’t even know how stupid their marketing department is to impose these limits on users instead of offering a valuable product to the paying faction. But they don’t. Even if you pay, you get something that is also available for free elsewhere.

    The algorithm has also become less and less savvy in terms of e.g. English/German translations. It used to be that the bot could sort of sense what you were trying to say and put it into different colloquialisms, which was even fun because it was like, “I know what you’re trying to say here, how about…” Now it’s in parts too stupid to translate the simplest sentences correctly, and the suggestions it makes are at times as moronic as those made by Google Translations.

    If this is a deep-learning AI that learns from users’ translations and the phrases they choose most often – which, by the way, is a valuable, moneys worthwhile contribution of every free user to this project: They invest their time and texts, thereby providing the necessary data for the AI to do the thing as nicely as they brag about it in the first place – alas, the more unprofessional users discovered the translator, the worse the language of this deep-learning bot has become, the greater the aggregate of linguistically illiterate users has become, and the worse the language of this deep-learning bot has become, as it now learns the drivel of every Tom, Dick and Harry out there, which is why I now get their Mickey Mouse language as suggestions: the inane language of people who can barely spell the alphabet, it seems.

    And as a thank you for our time and effort in helping them and their AI learn, they’ve lowered the limit from what was once 5,000 to now 1,500…? A big “fuck off” from here for that! Not a brass farthing from me for this attitude and behaviour, not in a hundred years.

  10. Anonymous said on September 28, 2023 at 8:19 am
    Reply

    When will you put an end to the mess in the comments?

  11. RIP said on September 28, 2023 at 9:36 am
    Reply

    Ghacks comments have been broken for too long. What article did you see this comment on? Reply below. If we get to 20 different articles we should all stop using the site in protest.

    I posted this on [https://www.ghacks.net/2023/09/28/reddit-enforces-user-activity-tracking-on-site-to-push-advertising-revenue/] so please reply if you see it on a different article.

    1. RIP said on September 28, 2023 at 11:01 am
      Reply

      Comment redirected me to [https://www.ghacks.net/2012/08/04/add-search-the-internet-to-the-windows-start-menu/] which seems to be the ‘real’ article it is attached to

  12. RIP said on September 28, 2023 at 10:48 am
    Reply

    Comment redirected me to [https://www.ghacks.net/2012/08/04/add-search-the-internet-to-the-windows-start-menu/] which seems to be the ‘real’ article it is attached to

  13. Mystique said on September 28, 2023 at 12:13 pm
    Reply

    Article Title: Reddit enforces user activity tracking on site to push advertising revenue
    Article URL: https://www.ghacks.net/2023/09/28/reddit-enforces-user-activity-tracking-on-site-to-push-advertising-revenue/

    No surprises here. This is just the beginning really. I cannot see a valid reason as to why anyone would continue to use the platform anymore when there are enough alternatives fill that void.

  14. justputthispostanywhere said on September 29, 2023 at 3:59 am
    Reply

    I’m not sure if there is a point in commenting given that comments seem to appear under random posts now, but I’ll try… this comment is for https://www.ghacks.net/2023/09/28/reddit-enforces-user-activity-tracking-on-site-to-push-advertising-revenue/

    My temporary “solution”, if you can call it that, is to use a VPN (Mullvad in my case) to sign up for and access Reddit via a European connection. I’m doing that with pretty much everything now, at least until the rest of the world catches up with GDPR. I don’t think GDPR is a magical privacy solution but it’s at least a first step.
