Week 4 of the GSoC coding period is complete, and with its completion the first evaluation period has begun. GSoC (Google Summer of Code) is a global program focused on bringing more student developers into open source software development. Students work with an open source organization on a three-month programming project during their break from school.
I am working on "Developing a 'Product Advertising API' module for Drupal 8" - #7. The "Product Advertising API" module, since renamed "Affiliates Connect", provides an interface to integrate easily with the affiliate APIs (product advertising APIs) offered by e-commerce platforms such as Flipkart, Amazon, and eBay, fetch product data from their databases, and import it into Drupal so you can monetize your website by advertising their products. For platforms that don't provide affiliate APIs, we scrape the data instead.
This post summarizes the tasks accomplished from the first week of GSoC up to the first evaluation period:
- As part of the skeleton issue, the overview page, which shows the plugins the user has enabled and lets the user configure them, is also complete.
- The content entity for storing product data from various vendors is also complete and has been reviewed by borisson_, dbjpanda, and other mentors. Whether to use nodes or content entities to store the data is still under discussion. I am leaning toward a content entity, and my reasoning is explained in this blog post - Where to store product's data? Link to the issue - #2975642
- Functional tests verifying the routes defined in the project, as suggested by borisson_, are also complete and reviewed. They include tests checking whether product data is submitted correctly through the affiliates_product add form, as well as tests for editing and deleting products. Link to the issue - #2977377
- The native APIs provided by e-commerce sites expose only a fraction of the product data; since we can't fetch everything through them, we need to write scrapers that do this for us. Building on my earlier studies, I am also working on the Flipkart module in this repo, where I will push my further work. For scraping, I am using Node.js to create scraper APIs.
Link to the repo - Scraping using node/scrapers-benchmarks
- As discussed, we are using a heterogeneous scraper: the request/x-ray libraries for static websites and Nightmare for dynamic websites.
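The heterogeneous dispatch above can be sketched as a small selection function. The `siteConfig` shape and its `rendersClientSide` flag are assumptions for illustration; the actual per-site configuration may look different.

```javascript
// Sketch of the heterogeneous scraper dispatch: static pages go to the
// request/x-ray based scraper, JavaScript-heavy pages to Nightmare.
// The siteConfig shape (rendersClientSide flag) is an assumption.
function pickScraperBackend(siteConfig) {
  // Nightmare drives a real Electron browser, so it can execute the
  // page's JavaScript; x-ray only sees the server-rendered HTML.
  return siteConfig.rendersClientSide ? 'nightmare' : 'x-ray';
}
```

Keeping the choice in one function means adding a new backend later only touches this dispatch point.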
- Scraping Amazon websites using Nightmare is almost complete. Link to the repo - Scraping using node/Amazon
- I have broken the scraper API work down into smaller issues so it is easier to implement and review; this way we can progress faster. I have added one issue to the issue queue. Link to the issue - #2979094
- As discussed with the mentors, we want something that eases the effort on the client side, so I am thinking of using the Feeds module and its feed import functionality. Feeds has a config entity, "feeds type", that lets the user choose a fetcher, parser, and processor and map the fields of the selected processor. Once a feeds type is created, we can reuse it for multiple URLs/categories (another entity, defined as a content entity, for saving URLs linked with feed types).
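To make the fetcher/parser/processor split concrete, a feeds type of this kind might capture a mapping along these lines. This is a hypothetical sketch, not the module's actual configuration; the IDs and field names are assumptions.

```yaml
# Hypothetical sketch of a feeds type: one fetcher/parser/processor
# combination plus a field mapping, reusable across many URLs.
id: affiliates_products
fetcher: http
parser: json
processor: 'entity:affiliates_product'
mappings:
  - source: name
    target: product_name
  - source: price
    target: price
```

The point is that the mapping is defined once per feeds type, while the URLs/categories entity supplies many sources that all reuse it.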
- Feeds provides parsers for parsing and mapping the data collected by the fetcher, so I am designing a separate parser that will use the cheerio library for Node.js to parse the data.
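In practice cheerio would handle the DOM traversal (e.g. load the HTML and read values via selectors). As a dependency-free illustration of what the parsing step does, here is a stand-in that extracts the same kind of fields with regular expressions; the markup shape, class names, and field names are all assumed.

```javascript
// Illustration of the parsing step. The real parser would use cheerio;
// this dependency-free stand-in extracts fields from server-rendered
// HTML with regular expressions. The markup shape is an assumption.
function parseProductHtml(html) {
  // Return the first capture group of a match, trimmed, or null.
  const pick = (re) => {
    const m = html.match(re);
    return m ? m[1].trim() : null;
  };
  return {
    name: pick(/<h1[^>]*class="title"[^>]*>([^<]*)<\/h1>/i),
    price: pick(/<span[^>]*class="price"[^>]*>([^<]*)<\/span>/i),
  };
}
```

With cheerio the two `pick` calls would become selector lookups, which is far more robust against markup variations than regexes.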
- This week, I am going to create the fetcher part. Together, these pieces will give the user a generic scraper that can be configured to their requirements.