
GSoC 2018 - Week - 4

Ankit Jain Web Developer, Open Source Enthusiast and Drupal Contributor

Week 3 of the GSoC coding period has been completed successfully. GSoC (Google Summer of Code) is a global program focused on bringing more student developers into open source software development. Students work with an open source organization on a 3-month programming project during their break from school.

Project Abstract

I am working on "Developing a 'Product Advertising API' module for Drupal 8" - #7. The "Product Advertising API" module, which has been renamed to "Affiliates Connect", provides an interface for integrating with the affiliate or product advertising APIs of e-commerce platforms such as Flipkart, Amazon, and eBay. It fetches product data from those platforms and imports it into Drupal so that you can monetize your website by advertising their products. If an e-commerce platform doesn't provide an affiliate API, we will scrape the data from that platform instead.

Progress

Some of the tasks accomplished this week are -

- The configuration form for saving the affiliates_connect settings is completed. Link to the issue - #2976037

- The custom Affiliates Product entity for storing product data from various vendors is almost complete and has been reviewed by borisson_, dbjpanda, and other mentors. Link to the issue - #2975642

- Because every vendor has a different configuration, the configuration form for the plugins is still under development and discussion. Link to the issue - #2977044

- Functional tests that verify the routes defined in the project, as suggested by borisson_, are complete and under review. They also check whether product data is submitted correctly through the affiliates_product add form. Tests for deleting and editing products are also completed under this issue. Link to the issue - #2977377

Tests

Tests can increase development velocity without much extra work.

Week 4 - Goals

The basic module development is complete, so I will start working on the Scraper API. I have already made a start this week by implementing some basic scraping with various Node libraries/modules. Link to the repo - scraping-using-node. I have also studied the Flipkart Affiliate APIs in this repo, as I am working on the Flipkart plugin too. I will post my further work in the repo.

The flow for scraping the content of any e-commerce website is -

1. The first task is to find the sitemap URL, which contains the category URLs. Sitemap URLs for Flipkart and Amazon - Flipkart, Amazon.in

2. Once we have the product categories, we can scrape the products category by category, paginating each category to its last page so that every product is covered. This requires a way to paginate through a whole category.

3. To get the detailed product data, we need to visit each product link and scrape its content.

4. Save all product links to a file for further scraping or for updating existing products.
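
The four steps above can be sketched in plain Node as follows. This is only a minimal illustration, not the code from the repo: the function names, the `fetchProductLinks` callback, and the `maxPages` safety cap are all hypothetical, and the regex-based sitemap parsing assumes a simple sitemap where every URL sits inside a `<loc>` tag.

```javascript
const fs = require('fs');

// Step 1: pull every <loc> URL out of a sitemap XML document.
// A regex is enough for this sketch; a real XML parser would be more robust.
function extractSitemapUrls(xml) {
  const urls = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match;
  while ((match = locPattern.exec(xml)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

// Steps 2 and 3: walk a category page by page until a page yields no
// product links. `fetchProductLinks(categoryUrl, page)` is a hypothetical
// callback that resolves to the product URLs found on that page.
async function collectCategoryLinks(categoryUrl, fetchProductLinks, maxPages = 1000) {
  const links = new Set(); // a Set drops links that repeat across pages
  for (let page = 1; page <= maxPages; page++) {
    const pageLinks = await fetchProductLinks(categoryUrl, page);
    if (pageLinks.length === 0) break; // we are past the last page
    pageLinks.forEach((link) => links.add(link));
  }
  return [...links];
}

// Step 4: save the links, one per line, for later scraping or updating.
function saveLinks(links, filePath) {
  fs.writeFileSync(filePath, links.join('\n') + '\n');
}
```

Passing the page fetcher in as a callback keeps the pagination loop independent of whichever scraping library ends up doing the actual page loads.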

I initially decided to scrape these sites using x-ray in Node. x-ray looked perfect for this project because of features such as pagination, crawler support, and pluggable drivers like PhantomJS, which are required to scrape websites that use React/Angular to load their content. While working with x-ray, however, I found that the pagination feature did not work as expected and caused a lot of issues. Together with shibasisp and other mentors, I went through many libraries, including Casper, Osmosis, Webdriver.io, and Nightmare. After trying many of them, Nightmare turned out to be the one that suits this project. Nightmare uses Electron under the hood, which is similar to PhantomJS but roughly twice as fast and more modern.

Nightmare handles and manipulates the DOM with JavaScript through its evaluate function, which is complex to work with. So I use evaluate only to fetch the innerHTML and then pass that content to Cheerio, which is easy, fast, and flexible for handling the DOM content.
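
A rough sketch of that Nightmare-plus-Cheerio split might look like the following. It is an illustration under assumptions, not the project's actual code: the function names and the `.product-title` / `.product-price` selectors are hypothetical, and the libraries are passed in as arguments so the sketch itself has no install-time dependencies.

```javascript
// Fetch the rendered markup of a page. `browser` is expected to behave like
// a Nightmare instance: chainable goto/wait/evaluate, with end() resolving
// to the value returned by evaluate.
async function fetchRenderedHtml(browser, url) {
  return browser
    .goto(url)
    .wait('body')
    // This callback runs inside the page; only the markup string comes back.
    .evaluate(() => document.body.innerHTML)
    .end();
}

// Parse the markup outside the browser. `cheerio` is the Cheerio module
// (anything exposing load(html)); the selectors below are hypothetical.
function parseProduct(cheerio, html) {
  const $ = cheerio.load(html);
  return {
    title: $('.product-title').text().trim(),
    price: $('.product-price').text().trim(),
  };
}

// Wiring with the real libraries (needs `npm install nightmare cheerio`):
//
//   const Nightmare = require('nightmare');
//   const cheerio = require('cheerio');
//   fetchRenderedHtml(Nightmare({ show: false }), 'https://example.com/p/1')
//     .then((html) => console.log(parseProduct(cheerio, html)));
```

Keeping the evaluate callback down to a single innerHTML read means all selector logic lives in ordinary Node code, where it is easier to debug than code running inside the Electron page.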

Difficulties

- We need to scrape data for thousands of products, which can take a lot of time depending on the implementation, so I need to devise an algorithm that scrapes all of this data in minimal time.
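
One common way to cut total scraping time, shown here purely as an assumption about what such an algorithm could look like and not as the approach chosen for the project, is to scrape several product pages at once while capping how many run in parallel. The helper name `mapWithConcurrency` and its parameters are hypothetical.

```javascript
// Run `worker(item)` over every item with at most `limit` calls in flight.
// Results come back in the same order as `items`.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item nobody has claimed yet

  // Each runner keeps claiming the next unclaimed item until none are left.
  // Node runs this on a single thread, so `next++` cannot race.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

With a hypothetical `scrapeProduct(url)` function, `mapWithConcurrency(productLinks, 5, scrapeProduct)` would keep five pages loading at a time. The right limit depends on the machine and on how much load the target site tolerates.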
