Author: Udit Vashisht

Python
>
Tutorials

How I used Python and Web Scrapping to find cheap diaper deals?

9 minutes read
553 Views

So, here is how I used web scrapping to find cheap deals on diapers:-
Amazon has certain Warehouse deals, which at least in the case of diapers consists of the products, which are returned by the buyers and have defective original packaging. But, the product inside is mostly new and unused. So, finding such deals can help you save few dollars on certain stuff. So, let’s go down to the coding part :

We will be using requests and BeautifulSoup. So, let’s import them and since amazon.com doesn’t like python scrolling through its website, let’s add some headers.

import requests
from bs4 import BeautifulSoup
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
}

Now, we will have to find the target URL, you can easily find it by browsing the website, selecting the warehouse deals from the drop-down menu, entering the keywords and hitting the search button. Let me make it easy for you. Just enter the following codes:-

AMAZON = 'https:/www.amazon.com'
BASE_URL = 'https://www.amazon.com/s/search-alias%3Dwarehouse-deals&field-keywords='
SEARCH_WORDS = 'huggies+little+movers+Size 4'
url = BASE_URL + SEARCH_WORDS

If you will search on the website manually, you will get the following sort of screen:-

You need to focus on the line which says 8 results for Amazon Warehouse: “Huggies Diaper”. Now, we can encounter following four cases, when we search for an item in Warehouse deals:-

There is no deal present.
There are limited numbers of deals present and all of them are on one page. (e.g. 8 results for Amazon Warehouse: “Huggies Diaper”)
There are limited numbers of deals present but are spread across more than one page. (e.g. 1–24 of 70 results for Amazon Warehouse : “huggies”)
There are more than 1000 deals present ( e.g. 1–24 of over 4,000 results for Amazon Warehouse : “iphone”)
I will be dealing with above as under:-
In the case of no deals present, I will be exiting the function. (We can log such cases)
In the second case, we will be creating a dictionary of the data using the function scrap_data(). We will be checking it out in details soon.
In the third and four cases, we will have to scrape through multiple pages and to keep it simple we will be going through a maximum of 96 results i.e. 4 pages.

So let’s create a soup using BeautifulSoup and requests, since we will be creating soups for multiple urls in certain cases, it is better to create a different function for that:-

def create_soup(url):

    req = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(req.text, 'html.parser')

    return soup

If we inspect the element, we will find that the said line of text has span id = “s-result-count”. Now, we will grab the text using the following code:-

result = soup.find("span", id="s-result-count").text

We will use regex to match the third and fourth scenario and will just search the first 96 results ( or four pages) in case of the fourth scenario. The code for the same would be as under:-

import re
def parse_result(result):
    ''' This will take care of 4 cases of results
    Case 1 When no result is available.
    Case 2 When all the results available are shown on first page only i.e. 24 results.
    Case 3 When there are more than 24 results and there are more than one page but the number of result is certain.
    Case 4 When there are over 500, 1000 or more results.
    '''

    matchRegex = re.compile(r'''(
    (1-\d+)\s
    (of)\s
    (over\s)?
    (\d,\d+|\d+)\s
    )''', re.VERBOSE)

    matchGroup = matchRegex.findall(result)

    # Case 1 simply exits the function
    # TODO Log such cases
    if "Amazon Warehouse" not in result:
        exit()

    else:
        # Case2
        if not matchGroup:
            resultCount = int(result.split()[0])
        # Case4
        elif matchGroup[0][3] == "over ":
            resultCount = 96
        # Case3
        else:
            resultCount = min([(int(matchGroup[0][4])), 96])

        return resultCount

Let’s crunch few numbers and get the resultCount and the number of pages we need to navigate:-

def crunch_numbers():
    soup = create_soup(url)
    result = soup.find("span", id="s-result-count").text
    resultCount = parse_result(result)
    navPages = min([resultCount // 24 + 1, 4])

    return resultCount, navPages

So, finally we have a target number in the form of resultCount and we will be extracting the data for that number. On closely inspecting the element of the webpage, you will find that all the results are inside the li tag with an id= “result_0” onwards (Yes, they are zero-indexed).

The name of the item, link and price are in h2, a and span tag inside the li tag. However, though the results upto number 96 will be with id “result_96”, but they will be distributed across 4 pages. So, we need to get the url of the preceding pages also. So, the link to the second page of results is in a span with a class “pagenLink” and it has two references to the page number “sr_pg_2” and “page=2”. So, if we will grab this, we can easily get the next two urls by replacing 2 with 3 and 4 for next pages:-

Depending upon the number of navPages, we will be creating a dictionary to replace the digit “2” with the desired digit as under:-

dict_list = [{
"sr_pg_2": "sr_pg_" + str(i),
"page=2": "page=" + str(i)
} for i in range(2, navPages + 1)]

We, will be grabbing the second url using the following code:-

nextUrl = soup.find("span", class_="pagnLink").find("a")["href"]

And, replacing the digit using the following function:-

def get_url(text, dict):
    for key, value in dict.items():
    url = AMAZON + text.replace(key, value)

    return url

Finally, we will be extracting the Name, Url and the price of the desired product. In case of more than one result pages, we will be using if elif statements to create new soups for the next urls grabbed above. Lastly, we will append the data to a dictionary for further processing. The code will be as under:-

def scrap_data():

    resultCount, navPages = crunch_numbers()
    soup = create_soup(url)

    try:

        nextUrl = soup.find("span", class_="pagnLink").find("a")["href"]
        renameDicts = [{
        "sr_pg_2": "sr_pg_" + str(i),
        "page=2": "page=" + str(i)
        } for i in range(2, navPages + 1)]

        urlList = [get_url(nextUrl, dict) for dict in renameDicts]

    except AttributeError:

        pass

    productName = []
    productLink = []
    productPrice = []

    for i in range(resultCount):

        if i > 23 and i <= 47:
            soup = create_soup(urlList[0])

        elif i > 47 and i <= 71:
            soup = create_soup(urlList[1])

        elif i > 71:
            soup = create_soup(urlList[2])

        id = "result_{}".format(i)

        try:
            name = soup.find("li", id=id).find("h2").text
            link = soup.find("li", id="result_{}".format(i)).find("a")["href"]
            price = soup.find(
                "li", id="result_{}".format(i)).find("span", {
                    'class': 'a-size-base'
                }).text
        except AttributeError:
            name = "Not Available"
            link = "Not Available"
            price = "Not Available"

        productName.append(name)
        productLink.append(link)
        productPrice.append(price)

    finalDict = {
        name: [link, price]
        for name, link, price in zip(productName, productLink, productPrice)
    }

    return finalDict

In order to automate the process, we will want our programme to send us the list of the products available at the particular time. For this, we will be creating an empty “email_message.txt” file. We will further filter the finalDict generated by scrap_data.py and create a custom email message using the following code:

def create_email_message():

    finalDict = scrap_data()

    notifyDict = {}

    with open("email_message.txt", 'w') as f:

        f.write("Dear User,\n\nFollowing deals are available now:\n\n")

        for key, value in finalDict.items():
            # Here we will search for certain keywords to refine our results
            if "Size 4" in key:
                notifyDict[key] = value

        for key, value in notifyDict.items():
            f.write("Product Name: " + key + "\n")
            f.write("Link: " + value[0] + "\n")
            f.write("Price: " + value[1] + "\n\n")

        f.write("Yours Truly,\nPython Automation")
   return notifyDict
 ```

 So, now we will be notifying the user via email for that we will be using .env to save the user credentials and even the emails (though you can use the .txt file to save the emails also). You can read more about using dotenv from the link below:-

 https://github.com/uditvashisht/til/blob/master/python/save-login-credential-in-env-files.md


 Create an empty .env file and save the credentials:-

 ```
 #You can enter as many emails as you want, separated by a whitespace
emails = 'user1@domain.com user2@domain.com'
MY_EMAIL_ADDRESS = "youremail@domain.com"
MY_PASSWORD = "yourpassword"

Then you will have to do following imports in your programme and the load the env as under:

import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

Further we will be using smtplib to send email. I have copied most of the code for this portion from this post by Arjun Krishna Babu:-


def notify_user():

    load_dotenv(find_dotenv())
    notifyDict = create_email_message()
    emails = os.getenv("emails").split()

    if notifyDict != {}:

        s = smtplib.SMTP(host="smtp.gmail.com", port=587)
        s.starttls()
        s.login(os.getenv("MY_EMAIL_ADDRESS"), os.getenv("MY_PASSWORD"))

        for email in emails:

            msg = MIMEMultipart()

            message = open("email_message.txt", "r").read()

            msg['From'] = os.getenv("MY_EMAIL_ADDRESS")
            msg['To'] = email
            msg['Subject'] = "Hurry Up: Deals on Size-4 available."

            msg.attach(MIMEText(message, 'plain'))

            s.send_message(msg)

            del msg

        s.quit()

And finally:-

if __name__ == '__main__':
    notify_user()

Now, you can schedule this script to run on your own computer or some cloud server to notify you periodically.

The complete code is available here

Python Tutorial for Beginners

Have you heard a lot about Python Language? Are you looking for free and reliable resource to learn Python? If Yes, your search for the Best Python Tutorial is over.

We are excited to bring an exhaustive Python tutorial for a complete beginner. Even ...

Chapter 5- Indentation

What is indentation in Python?

Like many other languages, python is also a block-structured language.

Blocks of code in Python

Block is basically a group of statements in a code script. A block in itself can have another block or blocks, hence making it a nested block. Now, ...

Author: Udit Vashisht

Python
>
Tutorials

How I used Python and Web Scrapping to find cheap diaper deals?

Table of Contents

Related Posts

Matplotlib Stack Plots/Bars | Matplotlib Tutorial in Python | Chapter 4

Python Tutorial for Beginners

Python Tutorial for Beginners

Chapter 5- Indentation

What is indentation in Python?

Blocks of code in Python

Search

Categories

Tags

Author: Udit Vashisht

Python> Tutorials

How I used Python and Web Scrapping to find cheap diaper deals?

Table of Contents

Related Posts

Matplotlib Stack Plots/Bars | Matplotlib Tutorial in Python | Chapter 4

Python Tutorial for Beginners

Python Tutorial for Beginners

Chapter 5- Indentation

What is indentation in Python?

Blocks of code in Python

Search

Categories

Tags

Python
>
Tutorials