Millions of people reap the benefits of web data intelligence every day.
Many of us happily use travel fare aggregator websites, consult price comparison sites, expect Amazon to serve the best deals, and scour Google for any information possible without even realizing that all these services depend on web scraping technology.
And yet, the scraping industry is still surrounded by misconceptions about its ethicality and legal status.
This challenge isn’t unique — any evolving industry must go through different stages, forming common standards, agreeing on legal and ethical norms, and raising public awareness about the way it functions.
Gaining trust doesn’t happen overnight. For the scraping industry players, the only way to bring the market to a more mature stage is by recognizing the good, the bad, and the ugly of the data aggregation practices and by acting both legally and ethically.
The main elements of ethical data aggregation
The easiest way to define ethical data aggregation is to start from the broader term of data ethics.
The discipline of data ethics evaluates such attributes as fairness, privacy, accountability, and transparency in the entire data cycle, from collection and analysis to building individual data-driven systems, such as ML models.
The aim is to identify whether certain practices can have negative effects (both apparent and latent) on individuals, society, or the environment. As such, data ethics almost always has wider implications than the legal norms surrounding data practices.
Ethical data aggregation is a narrower term that covers only the first stage — data collection.
The ethicality of data aggregation practices (or web scraping, as it is often called) can be determined by whether they adhere to the following four principles:
Respect for privacy
Credible web scraping providers collect only publicly available data, meaning that any information behind log-ins or paywalls, in nearly all cases, shouldn’t be scraped.
Even when personal data is publicly available, it is still protected by privacy laws, and organizations should thoroughly evaluate whether there really is a necessity to collect such information.
As complicated as it might get sometimes, anyone scraping at scale must thoroughly determine the nature of the data they are fetching.
A recent class-action suit around children’s data protection shows that there’s still plenty of room for better ethical judgment when it comes to the massive aggregation of publicly available information.
Respect for the target website
Ethical scraping practices treat target websites in a way that doesn’t hinder their speed and functionality or, in other words, cause server overload. Before scraping, it is essential to study the target site’s Terms of Service and robots.txt file, ensuring your activities won’t breach them.
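Python’s standard library ships a robots.txt parser, which makes this check straightforward to build into a crawler. The sketch below uses a hypothetical robots.txt (the rules and the example.com URLs are assumptions, not from any real site) to show how a scraper might verify permissions and honor a requested crawl delay before fetching:

```python
from urllib import robotparser

# Hypothetical robots.txt for illustration: it disallows /private/
# and asks crawlers to wait 10 seconds between requests.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())
# Against a live site you would instead fetch the real file:
#   rp.set_url("https://example.com/robots.txt"); rp.read()

def may_fetch(url: str, agent: str = "*") -> bool:
    """Return True only if robots.txt permits fetching this URL."""
    return rp.can_fetch(agent, url)

print(may_fetch("https://example.com/products"))   # allowed
print(may_fetch("https://example.com/private/x"))  # disallowed
print(rp.crawl_delay("*"))                         # seconds to wait
```

A polite crawler would then sleep for the reported crawl delay between requests, and skip any URL for which `may_fetch` returns False.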
Utilizing ethical proxies
Web scraping relies on a robust proxy infrastructure to overcome server blocks and geo-restrictions. Unethically sourced proxies were one of the reasons why web scraping gained a questionable reputation at the very start.
Overcoming it required industry leaders to set the bar and define the main principles of ethical proxy acquisition. Today, these include a commitment to fairness, transparency, and compensation. Central to the Tier A+ procurement model is obtaining explicit consent from, and offering rewards to, users who voluntarily participate in the proxy network.
Robust KYC policies
Companies that commit themselves to ethical data practices must also ensure that these standards are well-known and respected among their clients, serving proxy and scraping infrastructure only for approved business use cases.
“Legal” might not equal ethical
Unfortunately, technology is just a means to an end, and there is no way of ensuring that it will never get into the wrong hands.
Unethical scraping activities don’t necessarily coincide with illegal practices or bad Internet actors. Unethical actions may be taken by otherwise legitimate companies simply trying to cut corners for faster business growth.
Recent controversies around generative AI developments are a perfect illustration of an ethically ambiguous situation.
Hypothetically, even if we assume that generative AI companies aggregated training data within legal boundaries (the data was publicly available, didn’t violate privacy regulations, and didn’t breach copyright law), there’s still the question of whether, ethically and morally, they had the right to use content generated by millions of people for commercial purposes without obtaining consent.
This example also shows why the ethical outlook is broader than the legal one.
An important part of discussions around ethical data aggregation is how we could bring digital peace of mind to the broader Internet community, shifting the benefits of public web data towards a win-win situation, not a zero-sum game.
Making the Internet a better place for everyone
Most arguments for ethical data aggregation laid down in this article might sound self-explanatory.
However, just as with people, not all companies are equally ethics-oriented or share the same understanding of ethical conduct. This is where industry initiatives, such as the Internet Infrastructure Coalition and the Ethical Web Data Collection Initiative (EWDCI), come into play.
In order to agree on what is or isn’t a proper scraping activity, it is essential to have the broadest possible ecosystem representation, shifting the focus from the ongoing battle among tech giants and giving a voice to the SMEs that often create the majority of value and innovation in the market.
EWDCI acts as such a framework, giving different web scraping industry players the possibility to raise concerns and discuss common standards and norms in a way that reflects their business situation and challenges.
Just recently, EWDCI launched a certification program, inviting web scraping companies to get accredited and signal their commitment to the highest ethical standards.
Showing that, as a segment of the Internet’s infrastructure, these companies can ethically provide value for consumers and businesses is the way to make the Internet a safer place and, in doing so, overcome the one-sided reputation that web scraping has earned.