Web scraping is a hotly debated topic that has produced a variety of court rulings regarding copyright and fair use of information. Some companies have gone as far as running software that blocks site scraping efforts.
The internet offers an incredible amount of information. The depth and breadth of what is available is so vast that it can hardly be measured. Much of it is protected by copyright laws, yet many people aggregate it and change its format for their own purposes.
Some sites, like search engines, organize and index information from websites as a service to people who are trying to locate information. Other sites, however, are interested in doing market research to gain an advantage over their competitors. Still others are out to steal whatever information they can get and use it to their own advantage.
Site Scraping – What is it?
When most people visit a website, they are interested in learning certain pieces of information. They start with a search engine which guides them to sites that have the information they need. Then they click through pages of information, gathering what they seek. Site scraping is when someone uses an automated system or computer program to do the same actions, only much faster and in order to gain substantial quantities of information.
Unlike the individual person who visits a site to learn about a product or its pricing, a site scraping “bot” can go through an entire site and copy every piece of product information and pricing that’s available on that site. The difficulty is trying to ascertain the purpose of the data collection. Quite often, the purpose is to gather pricing information or some sort of data which can be used for marketing or other purposes. There is a substantial amount of debate over the legality of site scraping.
Legal Concerns of Scraping
The U.S. courts have moved back and forth on the legal issues associated with web scraping. For many years, the idea of bots gathering information from websites was looked upon as more of a nuisance than anything else. In 2001, a travel agency sued a competitor for scraping. The competitor had used bots to gather all of the agency’s prices and travel data, then used that data to set up its own website with competitive pricing. The judge ruled that the competitor’s use of the agency’s information did not constitute hacking. Instead, the judge ruled that it was fair use of information that the travel agency had made public.
A more recent case, however, suggests that the attitude has begun to change. In a suit filed by AT&T, a judge ruled that the farming of information from the company’s website was considered a forced attack on AT&T. The individual who committed the invasion of the website’s databases was charged with a felony that carried a 15-year sentence.
Yet other cases show some inconsistencies in how the courts are dealing with the problem of site scraping. Another case involving site scraping ruled that it was the format of the information that was protected, not the actual data itself. As long as the information that was gathered was presented in a different organization, the scraping was permitted.
Search engines continue to raise concerns for some companies. A recent case involved a search engine that displayed thumbnail images of a website’s content. The company filed suit against the search engine, claiming that it had violated fair use laws. The court ruled in favor of the search engine, reasoning that the information gathered and posted by the search engine was for general public benefit.
Protection against Scraping
While the legal battles regarding site scraping go back and forth, efforts to block site scraping have become quite successful. A tool like Scrapesentry is helpful in preventing information theft. Although there are still many instances of scraping, security measures are now available that identify visits to a website that are attempts to scrape information.
Using information such as page consumption speed, this kind of software can label a visit to a website as a probable case of site scraping and terminate the bot’s access to the website. It can also block visitors who return to a site at an unusually high rate compared with most other visitors.
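To make the idea of page consumption speed concrete, here is a minimal sketch of one way such a check could work: count how many pages each visitor requests within a short time window and flag anyone who exceeds a limit. This is not how Scrapesentry or any particular product is implemented. The class name, the thresholds, and the use of a simple client identifier are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque


class PageSpeedDetector:
    """Hypothetical sliding-window check on how fast a visitor consumes pages."""

    def __init__(self, max_pages=20, window_seconds=10.0):
        # Illustrative thresholds only; a real site would tune these to its traffic.
        self.max_pages = max_pages
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> recent request timestamps

    def record(self, client_id, timestamp):
        """Record one page request and return True if the visitor looks like a scraper."""
        hits = self._hits[client_id]
        hits.append(timestamp)
        # Discard requests that have fallen out of the time window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_pages
```

A person clicking through a few pages a minute stays well under the limit, while a bot that requests dozens of pages in a few seconds is flagged. The same counting approach, applied over a longer window, could be used to spot visitors who return far more often than is typical.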
This same software, however, is also able to recognize other types of informational bots, such as web crawlers. These friendly bots, which are used to gather search engine details, are whitelisted and allowed repeated access to site information.
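As a rough illustration of whitelisting, the sketch below checks a visitor's User-Agent string against a short list of known crawler names before any blocking rule is applied. The function name and the list contents are assumptions for this example, and the check is deliberately simple: because a User-Agent string can be forged, a real system would need a stronger way to confirm that a crawler is who it claims to be.

```python
# Hypothetical allowlist of crawler name fragments, compared case-insensitively.
FRIENDLY_BOT_TOKENS = ("googlebot", "bingbot")


def is_whitelisted(user_agent):
    """Return True if the User-Agent string names a crawler on the allowlist."""
    ua = user_agent.lower()
    return any(token in ua for token in FRIENDLY_BOT_TOKENS)
```

A site would call a check like this first and only run its scraping detection on visitors that are not on the list.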
Site scraping continues to be an issue for many websites. Yet bot operators claim that the information is readily available and that gathering it constitutes fair use. The topic will continue to be debated in the courts until firm decisions establish a clear picture of what data scraping is authorized and exactly what counts as a violation of laws such as current copyright laws.