Web Scraping Amazon Products with Python


This post is about basic web scraping with Python: scraping Amazon product names, prices and availability automatically every day and saving them in a CSV file. Now, you might ask why scrape Amazon product details? Well, you can do a lot of things with scraped Amazon product details –

  • Monitor an item for a change in price, stock, rating, etc.
  • Analyse how a particular brand sells on Amazon.
  • Email yourself whenever the price of an item drops.
  • Or anything else which you can think of.

Let’s begin with our first Python scraper.

First, you must know that every Amazon product is identified by an ASIN (Amazon Standard Identification Number), a unique 10-character code of letters and/or numbers that Amazon uses to keep track of products in its database. An ASIN looks like B01MG4G1N4.

An Amazon product page is identified by a link like this –> “http://www.amazon.in/dp/B01MG4G1N4/” (www.amazon.in/dp/<ASIN>).
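Because of this pattern, turning a list of ASINs into page URLs is a one-liner (the ASINs below are only examples):

```python
# Build product-page URLs from a list of ASINs
# (these ASINs are examples, not recommendations)
asins = ['B01MG4G1N4', 'B071XTFP66']
urls = ["http://www.amazon.in/dp/" + asin for asin in asins]
print(urls)
```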

After collecting the ASINs of the products you want to scrape, we will download the HTML of each product’s page and identify the XPaths for the data elements that you need – e.g. product title, price, description etc. Read more about XPaths here.
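If XPaths are new to you, here is a minimal sketch of how lxml evaluates them against a snippet of HTML. The markup below is made up for illustration, not Amazon’s actual page structure:

```python
from lxml import html

# A tiny made-up HTML snippet to demonstrate XPath extraction
snippet = '''
<div>
  <h1 id="title"> Example Product </h1>
  <span id="ourprice_row">Rs. 999</span>
</div>
'''
doc = html.fromstring(snippet)

# //h1[@id="title"]//text() selects every text node under that h1
raw_name = doc.xpath('//h1[@id="title"]//text()')

# Collapse whitespace, just like the scraper below does
name = ' '.join(''.join(raw_name).split())
print(name)  # Example Product
```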

PYTHON CODE PREREQUISITES:

  • Python 3
  • Python Requests module
  • Python lxml module

Here is the code, which scrapes each product’s Name, Sale Price, Original Price and Availability. You can directly use this code and make changes as per your needs. This is the best way to learn.

from lxml import html
import csv,os,json
import requests
from time import sleep
import pandas as pd

def Amazon_Parser(url):
    # Amazon blocks the default python-requests user agent, so send a browser-like one
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'}
    page = requests.get(url, headers=headers)
    doc = html.fromstring(page.content)
    XPATH_NAME = '//h1[@id="title"]//text()'
    XPATH_SALE_PRICE = '//span[contains(@id,"ourprice") or contains(@id,"saleprice")]/text()'
    XPATH_ORIGINAL_PRICE = '//td[contains(text(),"List Price") or contains(text(),"M.R.P") or contains(text(),"Price")]/following-sibling::td/text()'
    XPATH_AVAILABILITY = '//div[@id="availability"]//text()'

    RAW_NAME = doc.xpath(XPATH_NAME)
    RAW_SALE_PRICE = doc.xpath(XPATH_SALE_PRICE)
    RAW_ORIGINAL_PRICE = doc.xpath(XPATH_ORIGINAL_PRICE)
    RAW_AVAILABILITY = doc.xpath(XPATH_AVAILABILITY)

    NAME = ' '.join(''.join(RAW_NAME).split()) if RAW_NAME else None
    SALE_PRICE = ' '.join(''.join(RAW_SALE_PRICE).split()).strip() if RAW_SALE_PRICE else None
    ORIGINAL_PRICE = ''.join(RAW_ORIGINAL_PRICE).strip() if RAW_ORIGINAL_PRICE else None
    AVAILABILITY = ''.join(RAW_AVAILABILITY).strip() if RAW_AVAILABILITY else None
     
    if not ORIGINAL_PRICE:
        ORIGINAL_PRICE = SALE_PRICE
 
    # Amazon serves a captcha page (with a non-200 status) when it blocks a request
    if page.status_code != 200:
        raise ValueError('captcha')
    data = {
            'NAME':[NAME],
            'SALE_PRICE':[SALE_PRICE],
            'ORIGINAL_PRICE':[ORIGINAL_PRICE],
            'AVAILABILITY':[AVAILABILITY],
            'URL':[url],
           }
    print(data)
    df = pd.DataFrame.from_dict(data)
    return df

def Read_Data():
    AsinList = ['B01MG4G1N4','B071XTFP66','B00RJU3RVS','B01F6WX6LQ']
    df = pd.DataFrame()
    for i in AsinList:
        url = "http://www.amazon.in/dp/"+i
        print("processing: "+url)
        data = Amazon_Parser(url)
        df = pd.concat([df, data])  # DataFrame.append was removed in pandas 2.0
    print(df)
    # Append to the CSV, writing the header only when the file is first created
    df.to_csv('Gaming_Consoles_Data.csv', mode='a',
              header=not os.path.isfile('Gaming_Consoles_Data.csv'), index=False)

if __name__ == "__main__":
    Read_Data()

An output CSV file will be created in the same directory as the script, named “Gaming_Consoles_Data.csv”. It will contain five columns named NAME, SALE_PRICE, ORIGINAL_PRICE, AVAILABILITY and URL, as defined by this section of the above code –

data = {
         'NAME':[NAME],
         'SALE_PRICE':[SALE_PRICE],
         'ORIGINAL_PRICE':[ORIGINAL_PRICE],
         'AVAILABILITY':[AVAILABILITY],
         'URL':[url],
        }

Note that whenever you run the above code, the product data will be appended to the CSV file. This way, we can run it every day and the data will keep accumulating in the same file. Later, say after a month, we can analyse it for any price changes. But running the script manually every day is not a good idea, so we will automate it.
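As a sketch of that later analysis, here is how you could compute the price range seen for each product URL with pandas. The DataFrame below is hand-made stand-in data (in the real CSV the prices are strings like “Rs. 24,990” and would need cleaning into numbers first):

```python
import pandas as pd

# Stand-in for pd.read_csv('Gaming_Consoles_Data.csv') -- made-up rows,
# with SALE_PRICE already cleaned into integers
df = pd.DataFrame({
    'NAME': ['Console A', 'Console A', 'Console B'],
    'SALE_PRICE': [24990, 22990, 5999],
    'URL': ['http://www.amazon.in/dp/X1', 'http://www.amazon.in/dp/X1',
            'http://www.amazon.in/dp/X2'],
})

# Lowest and highest sale price observed per product URL
price_range = df.groupby('URL')['SALE_PRICE'].agg(['min', 'max'])
print(price_range)
```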

SCRIPT RUNNING AUTOMATION

Prerequisites:

  • Linux/Ubuntu installed
  • Basic knowledge of terminal commands
  • pip installed

I will be using the built-in cron feature of Ubuntu to automate running the script. Before that, let me tell you what cron is. Cron is a system daemon used to execute desired tasks (in the background) at designated times. A crontab file is a simple text file containing a list of commands meant to be run at specified times. It is edited using the crontab command discussed below:

To use cron for tasks meant to run only under your user profile, add entries to your own user’s crontab file. To edit the crontab file, open the terminal and enter:

crontab -e

We need to add an entry that runs our Python script. Add the following line at the bottom of the file:

00 22 * * * sh path/to/sh/file/filename.sh

The above entry will run the .sh file every day at 10:00 pm. To understand the complete format of this entry you can read here. Now the question is: what is a .sh file, what does it contain, and how does it run a Python file?
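For reference, the five schedule fields of a crontab entry are, left to right:

```shell
# ┌───────── minute (0-59)
# │  ┌────── hour (0-23)
# │  │  ┌─── day of month (1-31)
# │  │  │ ┌─ month (1-12)
# │  │  │ │ ┌ day of week (0-6, Sunday = 0)
# │  │  │ │ │
  00 22  * * *  sh path/to/sh/file/filename.sh
```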

A .sh file is a Linux (Ubuntu) shell script: an executable file used to run terminal commands. In our case, the .sh file will contain the following commands:

#!/bin/sh
source path/to/pip/bin/activate
path/to/installedpython/python path/to/python/script/test.py

Now, what do the above commands do? The first line simply specifies the shell that should interpret the script. The second line activates the virtual environment where the modules used by our Python script were installed with pip. The third line runs the Python script named ‘test.py’.

So, to conclude: the .sh file holds the terminal commands used to run the Python script, and cron runs this .sh file daily at 10 pm.

I hope you enjoyed this post and got a basic idea of scraping product information from Amazon. If you are stuck anywhere or need any kind of help, feel free to comment below.

Thanks

Raghav Chopra

If you think this post added value to your Python knowledge, click the link below to share it with your friends. It would mean a lot to me and help more people reach this post.
