OpenAI reveals new web crawler named "GPTBot"

Onur Demirkol
Aug 8, 2023

OpenAI has announced GPTBot, a new web crawler that collects publicly available data for training AI models. The company says the collection will be done transparently and responsibly.

According to OpenAI's release documentation, the crawler filters out sources that require paywall access, contain personally identifiable information (PII), or include material that violates the company's policies. OpenAI says that letting the bot access a site will help make future AI systems more accurate and more capable.

The move promises to improve the precision, capabilities, and safety of AI models, but it also fuels debate about data ethics, ownership, and use in the digital age. Although OpenAI acknowledges that it scrapes the internet to train large language models such as GPT-4, GPTBot looks like a half-baked answer to the ethical issues around taking data from other people's websites.


GPTBot access can be limited

OpenAI lets webmasters choose the extent to which GPTBot interacts with their websites. By editing their robots.txt files, webmasters can block GPTBot entirely or specify which directories it may crawl.

The launch also gives webmasters and content providers visibility into how their sites are crawled. Using OpenAI's documentation, they can analyze GPTBot's interactions with their websites, and they can control access through the standard robots.txt protocol.
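One way to analyze those interactions is to look for the GPTBot token in a server's access log. The sketch below is only an illustration: it assumes the common Apache/Nginx "combined" log format, and the sample lines, IP addresses, and user-agent strings are made up for the example rather than taken from OpenAI's documentation.

```python
import re
from collections import Counter

# Assumption: "combined" log format, where the request line, the referrer,
# and the User-Agent appear as quoted fields in that order.
LINE_RE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def gptbot_hits(lines):
    """Count requested paths on lines whose User-Agent mentions GPTBot."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and "gptbot" in match.group("agent").lower():
            hits[match.group("path")] += 1
    return hits


# Made-up sample lines (documentation IP range, simplified user agents).
SAMPLE = [
    '192.0.2.10 - - [08/Aug/2023:10:00:00 +0000] "GET /directory-1/a.html HTTP/1.1" '
    '200 512 "-" "ExampleAgent/1.0 (compatible; GPTBot/1.0)"',
    '192.0.2.10 - - [08/Aug/2023:10:00:05 +0000] "GET /directory-1/a.html HTTP/1.1" '
    '200 512 "-" "ExampleAgent/1.0 (compatible; GPTBot/1.0)"',
    '192.0.2.77 - - [08/Aug/2023:10:01:00 +0000] "GET /index.html HTTP/1.1" '
    '200 1024 "-" "SomeBrowser/5.0"',
]

if __name__ == "__main__":
    for path, count in gptbot_hits(SAMPLE).most_common():
        print(path, count)
```

A user-agent string can be spoofed, so a count like this is a rough indicator of crawler activity rather than proof of it.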


To block GPTBot from an entire site, add the following directives to robots.txt:

  User-agent: GPTBot
  Disallow: /

For more selective access, allow some directories and disallow others:

  User-agent: GPTBot
  Allow: /directory-1/
  Disallow: /directory-2/
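
Before deploying rules like these, it can help to check how a parser reads them. The following is a minimal sketch using Python's standard urllib.robotparser module. The helper name allowed, the example.com URLs, and the "OtherBot" agent are placeholders introduced for illustration, and the module's rule matching is not guaranteed to be identical to GPTBot's own.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Rule set 1: block GPTBot from the whole site.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

# Rule set 2: allow one directory and disallow another.
SELECTIVE = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""

SITE = "https://example.com"  # placeholder domain

if __name__ == "__main__":
    for name, rules in (("block all", BLOCK_ALL), ("selective", SELECTIVE)):
        for path in ("/directory-1/page.html", "/directory-2/page.html"):
            print(name, path, allowed(rules, "GPTBot", SITE + path))
```

Python's parser applies the first rule that matches a path, while some crawlers prefer the most specific rule, so results for overlapping Allow and Disallow paths may differ between tools.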

Balancing Act: Legal, Ethical, and Ownership Considerations

OpenAI recently filed a trademark application for "GPT-5", which suggests the firm is working on a successor to GPT-4. According to various sources, that model will be close to artificial general intelligence (AGI), which has been the company's objective all along. GPTBot will likely help OpenAI gather additional data from around the internet to train it. On the other hand, the company has also discontinued its AI Classifier, a tool intended to recognize GPT-produced text.

