International Journal For Multidisciplinary Research

E-ISSN: 2582-2160     Impact Factor: 9.24

A Widely Indexed Open Access Peer Reviewed Multidisciplinary Bi-monthly Scholarly International Journal


The Making of a Data Pipeline

Author(s) Harsh Kaushik, Avnish Rai, Gaurav Kapasiya, Jai Prakash Bhati
Country India
Abstract This paper details the development and implementation of a data engineering pipeline designed for the extraction, transformation, and loading (ETL) of data from a web-based directory. The project involves using asynchronous web scraping techniques to gather user details from a local business directory, transforming the data into a structured format, and loading it into a storage solution. The pipeline utilises Python, the HTTPX library for asynchronous HTTP requests, BeautifulSoup for HTML parsing, and Amazon S3 for data storage. By leveraging these technologies, the pipeline demonstrates an efficient approach to handling large-scale web data extraction and processing, significantly reducing the time required to gather and organise data from multiple web pages. This paper provides insights into the architecture, implementation, and performance of the ETL pipeline, highlighting the benefits and challenges of using asynchronous programming in data engineering.
Keywords ETL, Data Engineering, Python, Async, Web Scraping, local.ch
Field Engineering
Published In Volume 6, Issue 3, May-June 2024
Published On 2024-05-21
DOI https://doi.org/10.36948/ijfmr.2024.v06i03.20849
Short DOI https://doi.org/gtwmsm
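
The extract, transform, load flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. To stay runnable offline it swaps in a stub coroutine (`fake_fetch`) where the paper uses HTTPX, the standard library `html.parser` where the paper uses BeautifulSoup, and an in-memory JSON Lines string where the paper uploads to Amazon S3. The `class="name"` markup, the function names, and the concurrency limit are all hypothetical.

```python
import asyncio
import json
from html.parser import HTMLParser
from typing import Awaitable, Callable, Dict, List

# A fetcher takes a URL and returns the page HTML (assumed interface).
Fetch = Callable[[str], Awaitable[str]]


class EntryParser(HTMLParser):
    """Collects the text of elements marked class="name" (hypothetical markup)."""

    def __init__(self) -> None:
        super().__init__()
        self._in_name = False
        self.names: List[str] = []

    def handle_starttag(self, tag, attrs):
        if ("class", "name") in attrs:
            self._in_name = True

    def handle_endtag(self, tag):
        self._in_name = False

    def handle_data(self, data):
        if self._in_name and data.strip():
            self.names.append(data.strip())


async def extract(urls: List[str], fetch: Fetch, limit: int = 5) -> List[str]:
    """Fetch all pages concurrently, with at most `limit` requests in flight."""
    sem = asyncio.Semaphore(limit)

    async def fetch_one(url: str) -> str:
        async with sem:
            return await fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(fetch_one(u) for u in urls))


def transform(pages: List[str]) -> List[Dict[str, str]]:
    """Parse each page into structured records."""
    records: List[Dict[str, str]] = []
    for html in pages:
        parser = EntryParser()
        parser.feed(html)
        records.extend({"name": n} for n in parser.names)
    return records


def load(records: List[Dict[str, str]]) -> str:
    """Serialise records as JSON Lines; a real pipeline would upload this."""
    return "\n".join(json.dumps(r) for r in records)


async def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request so the sketch runs without a network."""
    await asyncio.sleep(0)
    return f'<ul><li class="name">Entry for {url}</li></ul>'


def run_pipeline(urls: List[str], fetch: Fetch = fake_fetch) -> str:
    pages = asyncio.run(extract(urls, fetch))
    return load(transform(pages))
```

Because the fetcher is passed in as a parameter, a coroutine that wraps `httpx.AsyncClient.get` and returns the response text could replace `fake_fetch`, and the string returned by `load` could be handed to an S3 upload call, without changing the rest of the sketch.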
