International Journal For Multidisciplinary Research

E-ISSN: 2582-2160     Impact Factor: 9.24

A Widely Indexed Open Access Peer Reviewed Multidisciplinary Bi-monthly Scholarly International Journal


Scrapy-Based Incremental Housing Rental Information Crawling System Design

Author(s) Qichen Shao, Dongxiao Ren
Country China
Abstract Training a housing rental system model requires massive data sets, yet crawlers built on the Scrapy framework typically rely on site-wide crawling and repeated database accesses. To address both problems, a web-controlled incremental crawling system is designed for property rental information on websites. To achieve incremental crawling, a download middleware is added to the Scrapy framework. When the crawler starts, the system loads the seed page, the visited URLs and their hash lists, and the control page list. It then obtains the URLs of the sub-level pages and enters them into the database, crawls the sub-level pages in bulk, and parses the property information they contain. The data is cleaned by verifying the data format, completing missing items, removing duplicate data, and detecting abnormal data, which yields the eligible property data.
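The incremental step described in the abstract can be illustrated with a minimal sketch of a Scrapy downloader middleware. This is not the authors' implementation: the class name, the `url_hash` helper, the choice of SHA-1, and the reading of the control page list as pages that are always re-fetched are all assumptions made for illustration. The sketch also marks a URL as visited when the request is issued rather than after a successful download, which is a simplification.

```python
import hashlib

try:
    from scrapy.exceptions import IgnoreRequest
except ImportError:  # fallback so the sketch also runs without Scrapy installed
    class IgnoreRequest(Exception):
        """Stand-in for scrapy.exceptions.IgnoreRequest."""


def url_hash(url: str) -> str:
    """Hypothetical helper: hash a URL into a fixed-size key (SHA-1 is an assumption)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


class IncrementalDownloaderMiddleware:
    """Hypothetical middleware that drops requests for already-visited URLs."""

    def __init__(self, visited_urls=(), control_urls=()):
        # Loaded once at crawler start, so no database lookup is needed per request.
        self.visited_hashes = {url_hash(u) for u in visited_urls}
        # Assumed meaning of "control pages": listing pages that are always
        # re-fetched so newly published sub-level pages can be discovered.
        self.control_urls = set(control_urls)

    def process_request(self, request, spider):
        if request.url in self.control_urls:
            return None  # returning None lets Scrapy download the page as usual
        key = url_hash(request.url)
        if key in self.visited_hashes:
            # Scrapy discards the request when IgnoreRequest is raised here.
            raise IgnoreRequest(f"already crawled: {request.url}")
        self.visited_hashes.add(key)
        return None
```

In a Scrapy project, a class like this would be enabled through the `DOWNLOADER_MIDDLEWARES` setting, with the visited and control lists loaded from the database at startup.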
Keywords Scrapy crawler, incremental crawling, download middleware
Field Computer Applications
Published In Volume 5, Issue 3, May-June 2023
Published On 2023-06-08
Cite This Scrapy-Based Incremental Housing Rental Information Crawling System Design - Qichen Shao, Dongxiao Ren - IJFMR Volume 5, Issue 3, May-June 2023. DOI 10.36948/ijfmr.2023.v05i03.3488
DOI https://doi.org/10.36948/ijfmr.2023.v05i03.3488
Short DOI https://doi.org/gscdqz
