Web Scraping: Introduction, Applications, and Best Practices

Encora | July 29, 2019

Web scraping typically extracts large amounts of data from websites for a variety of uses such as price monitoring, enriching machine learning models, financial data aggregation, monitoring consumer sentiment, and news tracking. Browsers display data from a website, but manually copying data from multiple sources into a central place can be very tedious and time-consuming. Web scraping tools essentially automate this manual process.
This article aims to give you a quick overview of the basics of web scraping. We'll cover basic processes, best practices, dos and don'ts, and identify use cases where web scraping may be illegal or have adverse effects.

Basics of Web Scraping

"Web scraping," also called crawling or spidering, is the automated gathering of data from an online source, usually a website. While scraping is a great way to get massive amounts of data in relatively short timeframes, it does add load to the server hosting the source.
This is the main reason many websites disallow or prohibit scraping some or all of their data. However, as long as it does not disrupt the primary function of the online source, it is fairly acceptable.
Despite the legal challenges, web scraping remains popular even in 2019. The importance of and demand for analytics have grown exponentially, which in turn means that various learning models and analytics engines need more raw data. Web scraping remains a popular way to collect that information. With the rise of programming languages such as Python, web scraping has made significant leaps.
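At its simplest, scraping means downloading a page and pulling structured fields out of its HTML. Below is a minimal sketch using only Python's standard library; the sample HTML, tag names, and class names are illustrative, and in practice you would fetch the page with `urllib.request` or a library such as Requests rather than hardcoding it:

```python
from html.parser import HTMLParser

# Sample content standing in for a fetched web page (in a real scraper
# this string would come from an HTTP request).
SAMPLE_HTML = """
<html><body>
  <h2 class="title">Widget A - $19.99</h2>
  <h2 class="title">Widget B - $24.50</h2>
</body></html>
"""

class TitleScraper(HTMLParser):
    """Collects the text of every <h2 class="title"> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "title") in attrs:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

scraper = TitleScraper()
scraper.feed(SAMPLE_HTML)
print(scraper.titles)
```

Real pages are messier than this, which is why dedicated parsing libraries and scraping tools exist, but the fetch-parse-extract loop is the same.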

Typical applications of web scraping

Social media sentiment analysis

Social media posts have a very short shelf life; taken in aggregate, however, they reveal valuable trends. While most social media platforms have APIs that let third-party tools access their data, this may not always be sufficient. In such cases, scraping these websites gives access to real-time information such as trending sentiments, phrases, and topics.

eCommerce pricing

Many eCommerce sellers often have their products listed on multiple marketplaces. With scraping, they can monitor the pricing on multiple platforms and make a sale on the marketplace where the profit is higher. 
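Once the prices are collected, the decision itself is simple arithmetic. A small sketch, assuming entirely hypothetical marketplaces, commission rates, and unit cost:

```python
# Hypothetical prices collected by a scraper for one product listed on
# several marketplaces, plus assumed per-marketplace commission rates.
listed_prices = {"marketplace_a": 24.99, "marketplace_b": 26.49, "marketplace_c": 25.75}
commission = {"marketplace_a": 0.15, "marketplace_b": 0.12, "marketplace_c": 0.18}
unit_cost = 18.00

def best_marketplace(prices, fees, cost):
    """Return the marketplace with the highest net profit per sale."""
    profit = {m: p * (1 - fees[m]) - cost for m, p in prices.items()}
    return max(profit, key=profit.get), profit

best, margins = best_marketplace(listed_prices, commission, unit_cost)
print(best, margins[best])
```

The value of scraping here is freshness: rerunning the collection daily keeps this comparison aligned with current marketplace prices.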

Investment opportunities

Real estate investors often want to know about promising neighborhoods they can invest in. While there are multiple ways to get this data, web scraping travel marketplaces and hospitality brokerage websites offers valuable information, including the highest-rated areas, amenities that typical buyers look for, locations that may be emerging as attractive rental options, and so on.

Machine learning

Machine learning models need raw data to evolve and improve. Web scraping tools can gather large numbers of data points, text, and images in a relatively short time. Machine learning is fueling today's technological marvels such as driverless cars, space flight, and image and speech recognition, but these models need data to improve their accuracy and reliability.

Best practices for web scraping

A good web scraping project follows these practices. They ensure that you get the data you are looking for while remaining non-disruptive to the data sources.

Identify the goal

Any web scraping project begins with a need. A goal detailing the expected outcomes is the most basic requirement for a scraping task. The following questions need to be asked while identifying the need for a web scraping project:

  • What information do we expect to find?
  • What will the outcome of this scraping activity be?
  • Where is this information typically published?
  • Who are the end users of this data?
  • Where will the extracted data be stored? E.g., in cloud or on-premise storage, an external database, etc.
  • How should this data be presented to the end users? E.g., as a CSV/Excel/JSON file or in a SQL database.
  • How often does the source website refresh its data? In other words, what is the typical shelf life of the data being collected, and how often does it have to be refreshed?
  • After the scraping activity, what types of reports will you want to generate?
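The answers to these questions can be recorded as a small, machine-readable project brief that the rest of the pipeline reads. A sketch with entirely hypothetical field names and values:

```python
import json

# A hypothetical project brief answering the questions above; every
# field name and value here is illustrative, not a standard.
scrape_brief = {
    "information_sought": "open job listings",
    "sources": ["https://example-jobs.test"],
    "end_users": ["analytics team"],
    "storage": "local CSV files",
    "output_format": "CSV",
    "refresh_interval_days": 7,
    "reports": ["weekly new-postings summary"],
}

print(json.dumps(scrape_brief, indent=2))
```

Keeping the brief in one place makes it easy to review against the tool-selection criteria in the next section.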

Tool analysis

Since web scraping is mostly automated, tool selection is very important. Keep the following points in mind when deciding on a tool:

  • Fitment with the needs of the project
  • Supported operating systems and platforms
  • Free/open-source or paid tool
  • Support for scripting languages
  • Support for built-in data storage
  • Available selectors
  • Availability of documentation

Designing the scraping schema

Let's assume that our scraping job collects data from job sites about open positions listed by various organizations. The data sources will also dictate the schema attributes. The schema for this job would look something like this:

  • Job ID
  • Title
  • Job description
  • URL used to apply for the position
  • Job location
  • Remuneration data if it is available
  • Job type
  • Experience level
  • Any special skills listed
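This schema could be expressed as a Python dataclass, giving each scraped record a fixed shape. A sketch in which the field names mirror the list above, with `remuneration` optional since it may not be available:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class JobListing:
    """One scraped job posting, matching the schema above."""
    job_id: str
    title: str
    description: str
    apply_url: str
    location: str
    job_type: str
    experience_level: str
    remuneration: Optional[str] = None          # only if available
    special_skills: List[str] = field(default_factory=list)

# Example record with hypothetical values.
listing = JobListing(
    job_id="J-1001",
    title="Data Engineer",
    description="Build data pipelines.",
    apply_url="https://example.test/apply/J-1001",
    location="Remote",
    job_type="Full-time",
    experience_level="Mid",
)
```

Defining the schema in code up front also makes it obvious when a source page is missing a required field.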

Test runs and larger jobs

This is a no-brainer: a test run will help you identify roadblocks or potential issues before running a larger job. While there is no guarantee that there will be no surprises later, results from the test run are a good indicator of what to expect going forward. A typical test run would:

  1. Parse the HTML
  2. Retrieve the required items according to the scraping schema
  3. Identify URLs pointing to subsequent pages
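These three steps can be sketched against a sample page using only the standard library; the HTML, attribute names, and URLs below are illustrative:

```python
import re

# Sample listings page standing in for a downloaded job-site page.
PAGE = """
<div class="job" data-id="J-1"><a href="/apply/J-1">Data Engineer</a></div>
<div class="job" data-id="J-2"><a href="/apply/J-2">QA Analyst</a></div>
<a class="next" href="/jobs?page=2">Next</a>
"""

# Steps 1 and 2: parse the HTML and pull out the fields the schema needs.
jobs = re.findall(r'data-id="([^"]+)"><a href="([^"]+)">([^<]+)</a>', PAGE)

# Step 3: identify URLs pointing to subsequent pages.
next_pages = re.findall(r'class="next" href="([^"]+)"', PAGE)

print(jobs)
print(next_pages)
```

A test run like this quickly surfaces markup quirks, such as a listing whose attributes appear in a different order, before they derail a large job.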

Once we are happy with the test run, we can generalize the scope and proceed to the larger scrape. Here we need to understand how a human would retrieve data from each page. Using regular expressions, we can accurately match and retrieve the correct data. We may also need to capture the correct XPaths and replace them with hardcoded values where necessary. Often, you may also need external inputs for the source: e.g., you may need to enter the country, state, and zip code to identify the correct values that you need.
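Feeding such external inputs into a source URL can be as simple as building a query string. A sketch with a hypothetical base URL and parameter names:

```python
from urllib.parse import urlencode

def build_search_url(country, state, zipcode):
    """Build a search URL from external inputs; the base URL and
    parameter names here are hypothetical."""
    params = urlencode({"country": country, "state": state, "zip": zipcode})
    return f"https://example-listings.test/search?{params}"

url = build_search_url("US", "TX", "75001")
print(url)
```

Iterating such a builder over a list of locations is how a single scrape template covers many regional result pages.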
Here are a few additional points to check:

  1. Command-line interface
  2. Scheduling for the created scrape 
  3. Third-party integration support (e.g. for Git, TFS, Bitbucket)
  4. Scrape templates for similar websites

Output formats

Depending on the tool, end users can access the data from web scraping in several formats:

  • CSV
  • JSON
  • XML
  • Excel
  • SQL Server database
  • MySQL Database
  • OleDB Database
  • Script (scripts can deliver data from almost any data source)
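Exporting the same scraped records to two of these formats, CSV and JSON, takes only the standard library. A sketch with illustrative records:

```python
import csv
import io
import json

# The same scraped records exported in two of the formats listed above.
records = [
    {"job_id": "J-1", "title": "Data Engineer", "location": "Remote"},
    {"job_id": "J-2", "title": "QA Analyst", "location": "Austin"},
]

# JSON export.
json_output = json.dumps(records, indent=2)

# CSV export (written to an in-memory buffer here; use open(...) for a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["job_id", "title", "location"])
writer.writeheader()
writer.writerows(records)
csv_output = buffer.getvalue()

print(json_output)
print(csv_output)
```

Database targets such as SQL Server or MySQL follow the same pattern, with the writer swapped for a database driver.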

Improving scraper performance and reliability

Tools and scripts often follow a few best practices when scraping large amounts of data.
In many cases, the scraping job may have to collect extremely large amounts of data, which can take too long and run into timeouts and infinite loops. Hence, identifying a tool and understanding its capabilities is very important. Here are a few best practices to help you tune your scraping models for performance and reliability.

  1. Avoid using images when scraping, if possible. If you absolutely need images, store them on a local drive and update the database with the appropriate path.
  2. Certain JavaScript features can cause instability; dynamic features may cause memory leaks, website hangs, or even crashes. It is important to remember that the normal activity of the information source must not be disrupted in any way. In such scenarios, a few tools use web crawler agents to facilitate the scrape. Very often, using a web crawler agent can be up to 100 times faster than using a web browser agent.
  3. Enable the following options in the scraping tool or script: 'Ignore cache', 'Ignore certificate errors', and 'Ignore to run ActiveX and Flash'.
  4. Invoke process termination after each scraping session completes
  5. Avoid using multiple browsers for each scrape
  6. Handle memory leaks
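The timeouts mentioned above can be handled with per-request retries and a pause between attempts. A sketch in which `fetch` stands for any download function; the flaky fetcher below only simulates transient failures for demonstration:

```python
import time

def fetch_with_retries(fetch, url, retries=3, delay=1.0):
    """Call fetch(url), retrying on failure with a linear backoff."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as exc:              # in practice, catch specific errors
            last_error = exc
            time.sleep(delay * (attempt + 1))  # pause before retrying
    raise last_error

# Demonstration with a fake fetcher that fails twice, then succeeds.
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("timed out")
    return "<html>ok</html>"

result = fetch_with_retries(flaky_fetch, "https://example.test", delay=0.01)
print(result)
```

Capping the retry count also guards against the infinite loops mentioned earlier: a permanently failing page is skipped rather than hammered forever.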

Things to stay away from

There are some don'ts when setting up and executing a web scraping project:

  1. Avoid sites with too many broken links
  2. Stay away from sites that have too many missing values in their data fields
  3. Avoid sites that require CAPTCHA verification to display data
  4. Watch out for sites with an infinite loop of pagination, where the scraping tool starts from the beginning once the pages are exhausted.
  5. Web scraping iframe-based websites
  6. Once a certain connection threshold is reached, some websites may block the user from scraping further data. While you can use proxies and different user-agent headers to complete the scrape, it is important to understand why these measures are in place. If a website has taken steps to prevent web scraping, these should be respected and left alone. Forcibly scraping such sites may be illegal.
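A good first step toward respecting these measures is checking the site's robots.txt before scraping at all. A sketch using Python's standard library, parsing a sample robots.txt inline (normally you would load the real file from the site with `set_url()` and `read()`):

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt standing in for the one served by the target site.
SAMPLE_ROBOTS = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS)

# Hypothetical paths on the target site.
can_scrape_listings = parser.can_fetch("*", "https://example.test/listings")
can_scrape_private = parser.can_fetch("*", "https://example.test/private/data")

print(can_scrape_listings, can_scrape_private)
```

Running this check once per site, before any large job, costs almost nothing and keeps the scraper on the right side of the source's stated rules.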

Web scraping has been around since the early days of the internet. While it can provide the data you need, certain care, caution, and restraint should be exercised. A properly planned and executed web scraping project can yield valuable data that will be useful for the end user.

