OpenAI has introduced a way for website administrators to block the company’s web crawler from collecting their content.
The ChatGPT maker added instructions for blocking its web crawler to its online documentation. Members of the AI community discovered the addition on Monday; there was no official announcement.
To prevent GPTBot from scanning a site, an administrator adds an entry for it to the site’s robots.txt file, specifying which areas are off limits. Crawlers such as Googlebot are blocked from all or part of a domain in the same way.
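Per OpenAI’s documentation, the `User-agent` token is `GPTBot`; a minimal robots.txt entry that blocks it from an entire site looks like this (the `/private/`-only variant below is an illustrative alternative, not from OpenAI’s docs):

```
# Block GPTBot from the whole site
User-agent: GPTBot
Disallow: /

# ...or allow it everywhere except one directory
# User-agent: GPTBot
# Disallow: /private/
```

Other crawlers listed in the same file are unaffected; each `User-agent` group is matched independently.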
OpenAI has also published the IP address ranges the crawler uses, so instead of relying on robots.txt, an administrator can block those addresses directly.
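IP-based blocking can be sketched with Python’s standard `ipaddress` module. The CIDR range below is a placeholder, not necessarily OpenAI’s current list; an administrator would substitute the ranges OpenAI publishes in its documentation:

```python
import ipaddress

# Placeholder range -- substitute the CIDR blocks OpenAI actually
# publishes for GPTBot in its online documentation.
GPTBOT_RANGES = [ipaddress.ip_network("40.83.2.64/28")]

def is_gptbot_ip(addr: str) -> bool:
    """Return True if addr falls inside any listed GPTBot range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in GPTBOT_RANGES)

print(is_gptbot_ip("40.83.2.70"))  # True: inside the /28 block
print(is_gptbot_ip("8.8.8.8"))     # False: outside it
```

A real deployment would do this check in the web server or firewall rather than application code, but the membership test is the same.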
GPTBot operates on an opt-out basis, so the burden falls on website administrators: unless GPTBot is explicitly added to a site’s robots.txt file to halt the crawler, the site’s data may be used in future models.
Some observers have suggested that OpenAI’s move could help the company lobby on anti-scraping legislation or defend itself against future legal action.
However, it is doubtful that data already collected would be exempt from legislative scrutiny. GPT-4, for instance, was introduced in March 2023 using data already included in its training sets.
OpenAI has also trained its models on other datasets, including Common Crawl. The crawler used to generate that dataset, CCBot, can likewise be blocked via robots.txt. GPTBot, however, is OpenAI’s own dedicated crawler.
Beyond blocking the crawler outright, detecting GPTBot has other conceivable uses; one suggestion is to serve the crawler different responses once it has been identified. OpenAI’s stated goal for the crawler is to make its AI models more accurate, more capable, and safer.
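The serve-a-different-response idea can be sketched as a simple User-Agent check. The `GPTBot` token is the one OpenAI documents; the function and its arguments are hypothetical, and since User-Agent strings are trivially spoofed, a robust setup would also verify the request IP against OpenAI’s published ranges:

```python
GPTBOT_UA_TOKEN = "GPTBot"  # token OpenAI's crawler sends in its User-Agent

def response_for(user_agent: str, full_page: str, summary: str) -> str:
    """Serve a reduced page to GPTBot while other visitors get the full page.

    Hypothetical sketch: User-Agent checks alone are spoofable, so
    production code should also validate the source IP.
    """
    if GPTBOT_UA_TOKEN in user_agent:
        return summary
    return full_page

ua = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
print(response_for(ua, "full article", "short summary"))  # short summary
```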
What is a crawler, and why does OpenAI need one?
A web crawler is a bot that methodically traverses the web, collecting data as it goes.
Search engines such as Google use this data to build the indexes that answer queries; other applications include archiving web pages. The robots.txt file tells crawlers which pages they may index, if any. A crawler that is not listed in the file will collect whatever publicly accessible information it can reach.
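How a well-behaved crawler interprets robots.txt can be demonstrated with Python’s standard `urllib.robotparser`. The robots.txt content and URL below are illustrative:

```python
from urllib import robotparser

# A minimal robots.txt that blocks GPTBot everywhere while leaving
# the site open to all other crawlers.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/article"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/article"))  # True
```

Note that robots.txt is purely advisory: it works only because compliant crawlers such as GPTBot and Googlebot choose to consult it before fetching pages.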
Large language models such as OpenAI’s require training datasets in order to respond accurately to user queries, and web crawlers are the most practical way to assemble them. The Common Crawl project, for instance, aims to provide a copy of the Internet for research and analysis.