Skip to main content

GSoC 2018 - Week - 8

Ankit Jain Web Developer, Open Source Enthusiast and Drupal Contributor

You can know more about my last week work in this blog post - Week 7. This blog post is for Week - 8. GSoC (Google Summer of Code) is a global program focused on bringing more student developers into open source software development. Students work with an open source organization on a 3-month programming project during their break from school.


Some of the tasks accomplished in this week are-

- Scraper API is almost completed with both Static and dynamic fetchers. Static fetcher was completed last week and dynamic fetcher is also completed this week. Dynamic fetcher is using Nightmare for loading the Javascript and content that JS loads. Both the fetchers have the features of crawling, inner fetching and break fetching. Link to the repo - advance_crawler

- The project is dockerized and along with that, the project can be individually used by other developers for scraping. As Drupal is interacting with the Node through REST API architecture, it is not dependent on Drupal and its module in any manner Hence can be used in other projects. Link to the repo - advance_crawler

- I have added the multiple User-Agents features to avoid blocking, I was working on Proxy but I didn't find any open source/free proxy service and third party proxies can result in the performance issue until it is not a premium proxy service.

- I have also fixed the changes suggested by dbjpanda regarding the restructuring of routes. It is under review. As a part of this issue, it was suggested to add the predefined fields along with user-defined fields under the "Manage Fields" section. But it is not possible to show the predefined fields under the Manage fields Section. You can see that it is neither defined in Taxonomy nor in Content-Type but Content Type in Basic Article has 'body' field which is defined in the "Manage Fields" tab because it is not defined as a predefined fields i.e under BaseFieldDefinition function - Here you can see - Drupal 8.6.x Node.php It is actually defined as a file in the install dir under config folder. Link to the issue - #2982833

Week 8 - Goals

- Provide default configuration for Amazon Scraper. Scraper API is using Feeds module along with Feeds_ex and feeds_advance_crawler module and it will be difficult for anyone to set up the configuration for scraping so amazon plugin will come up with default configuration plugin which provides the one feed-type configured for scraping data from Amazon. Link to the issue - #2983176

- I will work on the Native API part for this module. Amazon Product Advertising API won't allow the data to be imported from the Amazon. It only allows users to search for any product through ASIN (product_unique_code) and sends the product information. Flipkart Product Advertising API on the other hands allows a user to fetch the 500 products from each category (Flipkart has around 58 categories).


Setting the default configuration is a bit tough as I have to go through the Amazon DOM structure in a proper way. As Amazon has multiple DOM structure for some pages like Special Page for the new Launch of any mobile. So it will be difficult to scrape the 100% products from the Amazon but 80% of the products can be scraped through it. These are some challenging task but it's not impossible.