TDM 40100: Project 12 — 2022

Motivation: Scraping data from websites has always been a popular topic in The Data Mine, and it was one of the requested topics. For the remaining projects, we will be doing some scraping of housing data, and potentially some sqlite3, containerization, and analysis work as well.

Context: This is the third in a series of 4 projects with a focus on web scraping that incorporates a variety of skills we’ve touched on in previous data mine courses. For this third project, we continue to build our suite of tools designed to scrape public housing data.

Scope: playwright, Python, web scraping

Learning Objectives
  • Use playwright to interact with a web page prior to scraping.

  • Use playwright and xpath expressions to efficiently scrape targeted data.

Make sure to read about, and use the template found here, and the important information about project submissions here.

Questions

Question 1

This has been (maybe) a bit of an intense project series. This project is going to give you a little break and not give you anything new to do, other than changing the package we are using.

playwright is a modern web scraping tool backed by Microsoft that, like selenium, allows you to interact with a web page before scraping. playwright is not necessarily better (yet); however, it is different and actively maintained.

Implement the get_links and link_to_blob functions using playwright instead of selenium. You can find the documentation for playwright here.

Before you get started, you will need to run the following in a bash cell.

%%bash

python3 -m playwright install

Finally, we aren’t going to force you to fight with the playwright documentation to get started, so the following is an example of code that will run in a Jupyter notebook and perform many of the same basic operations you are accustomed to with selenium.

import asyncio
from playwright.async_api import async_playwright

# so we can run this from within Jupyter, which is already async
import nest_asyncio
nest_asyncio.apply()

async def main():
    async with async_playwright() as p:
        browser = await p.firefox.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://purdue.edu/directory")

        # print the page source
        print(await page.content())

        # get html element
        e = page.locator("xpath=//html")

        # print the inner html of the element
        print(await e.inner_html())

        # isolate the search bar "input" element
        inp = e.locator("xpath=.//input")

        # print the outer html, or the element and contents
        print(await inp.evaluate("el => el.outerHTML"))

        # fill the input with "mdw"
        await inp.fill("mdw")
        print(await inp.evaluate("el => el.outerHTML"))

        # find the search button and click it
        await page.locator("xpath=//a[@id='glass']").click()

        # delay the program to allow the page to load; await asyncio.sleep
        # instead of time.sleep so the event loop is not blocked
        await asyncio.sleep(5)

        # find the table in the page with dr. wards content
        table = page.locator("xpath=//table[@class='more']")

        # print the table and contents
        print(await table.evaluate("el => el.outerHTML"))

        # find the alias, if a selector starts with // or .. it is assumed to be xpath
        print(await page.locator("//th[@class='icon-key']").evaluate("el => el.outerHTML"))

        # you can print an attribute
        print(await page.locator("//th[@class='icon-key']").get_attribute("scope"))

        # similarly, you can print an element's content
        print(await page.locator("//th[@class='icon-key']").inner_text())

        # you could use the regular xpath stuff, no problem
        print(await page.locator("//th[@class='icon-key']/following-sibling::td").inner_text())

        await browser.close()

asyncio.run(main())
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Implement the get_links function using playwright. Test it out so the example below is the same (or close; listed houses may change).

Here is the selenium version.

import time
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.common.exceptions import StaleElementReferenceException

def get_links(search_term: str) -> list[str]:
    """
    Given a search term, return a list of web links for all of the resulting properties.
    """
    def _load_cards(driver):
        """
        Given the driver, scroll through the cards
        so that they all load.
        """
        cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]")
        while True:
            try:
                num_cards = len(cards)
                driver.execute_script('arguments[0].scrollIntoView();', cards[num_cards-1])
                time.sleep(2)
                cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]")
                if num_cards == len(cards):
                    break
            except StaleElementReferenceException:
                # every once in a while we will get a StaleElementReferenceException
                # because we are trying to access or scroll to an element that has changed.
                # refetch the cards so we don't retry the same stale element forever.
                cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]")
                continue

    links = []
    url = f"https://www.zillow.com/homes/for_sale/{'-'.join(search_term.split(' '))}_rb/"

    firefox_options = Options()
    # Headless mode means no GUI
    firefox_options.add_argument("--headless")
    firefox_options.add_argument("--disable-extensions")
    firefox_options.add_argument("--no-sandbox")
    firefox_options.add_argument("--disable-dev-shm-usage")
    driver = webdriver.Firefox(options=firefox_options)

    with driver as d:
        d.get(url)
        d.delete_all_cookies()
        while True:
            time.sleep(2)
            _load_cards(d)
            links.extend([e.get_attribute("href") for e in d.find_elements("xpath", "//a[@data-test='property-card-link' and @class='property-card-link']")])
            next_link = d.find_element("xpath", "//a[@rel='next']")
            if next_link.get_attribute("disabled") == "true":
                break
            url = next_link.get_attribute('href')
            d.delete_all_cookies()
            next_link.click()

    return links

Use the set_viewport_size function to change the browser’s width to 960 and height to 1080.
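
With playwright’s async API this is a single awaited call on the page object, for example:

# set the viewport to 960 wide by 1080 tall before loading the page
await page.set_viewport_size({"width": 960, "height": 1080})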

Don’t forget to await the async functions — this is going to be the most likely source of errors.
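
For example, forgetting the await leaves you with a coroutine object instead of a value. A tiny illustration, using an arbitrary locator:

links = page.locator("//a")
n = links.count()        # wrong: an un-awaited coroutine, not an int
n = await links.count()  # right: the number of matching elements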

Unlike in selenium, in playwright, you won’t be able to do something like this:

# wrong
cards = page.locator("xpath=//article[starts-with(@id, 'zpid')]")
len(cards) # get the number of cards found

Instead, you’ll have to use the useful count function to get the number of cards found.

cards = page.locator("xpath=//article[starts-with(@id, 'zpid')]")
num_cards = await cards.count()

Unlike in selenium, in playwright, you won’t be able to do something like this:

# wrong
cards = page.locator("xpath=//article[starts-with(@id, 'zpid')]")
await cards[num_cards-1].scroll_into_view_if_needed()

Instead, you’ll have to use the useful nth function to get the nth element in the list of cards.

cards = page.locator("xpath=//article[starts-with(@id, 'zpid')]")
await cards.nth(num_cards-1).scroll_into_view_if_needed()

To clear cookies, search for "cookie" in the playwright documentation. Hint: you can clear cookies using the context object.
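
For instance, assuming context is the browser context created in the skeleton below:

# remove all cookies from the browser context
await context.clear_cookies()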

The following provides a working skeleton to run the asynchronous code in Jupyter.

import asyncio
from playwright.async_api import async_playwright, expect

import nest_asyncio
nest_asyncio.apply()

async def get_links(search_term: str) -> list[str]:
    """
    Given a search term, return a list of web links for all of the resulting properties.
    """
    async def _load_cards(page):
        """
        Given the page, scroll through the cards
        so that they all load.
        """
        pass

    links = []
    url = f"https://www.zillow.com/homes/for_sale/{'-'.join(search_term.split(' '))}_rb/"
    async with async_playwright() as p:
        browser = await p.firefox.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        # code

        await asyncio.sleep(10)  # useful if using headful mode (not headless)
        await browser.close()

    return links

my_links = asyncio.run(get_links("47933"))
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Implement the link_to_blob function using playwright. Test it out so the example below functions.

The selenium version will be posted below on Monday, November 21.

import os
import time
import uuid

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.common.exceptions import NoSuchElementException

def link_to_blob(link: str) -> bytes:
    def _load_tables(driver):
        """
        Given the driver, scroll to the price and tax history
        table and expand it so that it all loads.
        """
        table = driver.find_element("xpath", "//div[@id='Price-and-tax-history']")
        driver.execute_script('arguments[0].scrollIntoView();', table)
        time.sleep(2)
        try:
            see_more = driver.find_element("xpath", "//span[contains(text(), 'See complete tax history')]")
            see_more.click()
        except NoSuchElementException:
            pass
        try:
            see_more = driver.find_element("xpath", "//span[contains(text(), 'See complete price history')]")
            see_more.click()
        except NoSuchElementException:
            pass

    filename = f"{uuid.uuid4()}.html"
    with open(filename, 'w') as f:
        firefox_options = Options()
        # Headless mode means no GUI
        firefox_options.add_argument("--headless")
        firefox_options.add_argument("--disable-extensions")
        firefox_options.add_argument("--no-sandbox")
        firefox_options.add_argument("--disable-dev-shm-usage")

        driver = webdriver.Firefox(options=firefox_options)
        driver.get(link)
        time.sleep(2)
        _load_tables(driver)
        f.write(driver.page_source)
        driver.quit()

    with open(filename, 'rb') as f:
        blob = f.read()

    os.remove(filename)

    return blob
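
The following is a rough mapping from the selenium calls above to their playwright async equivalents. This is a sketch under the assumption that page is your page object (as in the Question 2 skeleton), not a complete solution:

# selenium: driver.page_source
html = await page.content()

# selenium: try/except NoSuchElementException around find_element;
# with playwright, count the matches instead
see_more = page.locator("xpath=//span[contains(text(), 'See complete tax history')]")
if await see_more.count() > 0:
    await see_more.click()

# selenium: driver.execute_script('arguments[0].scrollIntoView();', table)
table = page.locator("xpath=//div[@id='Price-and-tax-history']")
await table.scroll_into_view_if_needed()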

The get_price_history and get_tax_history solutions will be posted below on Monday, November 21.

import lxml.html

def get_price_history(html: str):
    tree = lxml.html.fromstring(html)
    # each price history row is a tr with an id attribute
    trs = tree.xpath("//div[@id='Price-and-tax-history']/following-sibling::div//tr[@id]")
    values = []
    for tr in trs:
        # the third span in each row holds the price; strip the formatting characters
        price = tr.xpath(".//td/span")[2].text.replace("$", "").replace(",", "").strip()
        if price == '':
            price = None
        # each value is [date, event, price]
        values.append([tr.xpath(".//td/span")[0].text, tr.xpath(".//td/span")[1].text, price])

    return values


def get_tax_history(html: str):
    tree = lxml.html.fromstring(html)
    try:
        # the tax history lives in the second table body of the section
        tbody = tree.xpath("//div[@id='Price-and-tax-history']/following-sibling::div//table//tbody")[1]
    except IndexError:
        return None
    values = []
    for tr in tbody.xpath(".//tr"):
        # the second span in each row holds the property tax; strip the formatting characters
        prop_tax = tr.xpath(".//td/span")[1].text.replace("$", "").replace(",", "").replace("-", "").strip()
        if prop_tax == '':
            prop_tax = None
        # each value is [year, property tax, assessed value]
        values.append([int(tr.xpath(".//td/span")[0].text), prop_tax, int(tr.xpath(".//td/span")[2].text.replace("$", "").replace(",", ""))])

    return values

To test your code, run the following.

my_blob = asyncio.run(link_to_blob("https://www.zillow.com/homedetails/2180-N-Brentwood-Cir-Lecanto-FL-34461/43641432_zpid/"))

def blob_to_html(blob: bytes) -> str:
    return blob.decode("utf-8")

get_price_history(blob_to_html(my_blob))
output
[['11/9/2022', 'Price change', '275000'],
 ['11/2/2022', 'Listed for sale', '289900'],
 ['1/13/2000', 'Sold', '19000']]

get_tax_history(blob_to_html(my_blob))
output
[[2021, '1344', 124511],
 [2020, '1310', 122792],
 [2019, '1290', 120031],
 [2018, '1260', 117793],
 [2017, '1260', 115370],
 [2016, '1252', 112997],
 [2015, '1262', 112212],
 [2014, '1277', 113120],
 [2013, '1295', 112920],
 [2012, '1389', 124535],
 [2011, '1557', 134234],
 [2010, '1495', 132251],
 [2009, '1499', 128776],
 [2008, '1483', 128647],
 [2007, '1594', 124900],
 [2006, '1608', 121900],
 [2005, '1704', 118400],
 [2004, '1716', 115000],
 [2003, '1624', 112900],
 [2002, '1577', 110300],
 [2000, '288', 15700]]

Please note that exact numbers may change slightly; that is okay! Prices and listings change.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Test out a playwright feature from the documentation that is new to you. This could be anything; one suggestion that could be interesting is screenshots. As long as you demonstrate something new, you will receive credit for this question. Have fun, and happy Thanksgiving!
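
For example, if you choose screenshots, it is a single call on the page object. The following is a minimal sketch, assuming page is an open page as in the earlier examples:

# save a screenshot of the current viewport to disk
await page.screenshot(path="example.png")

# or capture the entire scrollable page
await page.screenshot(path="fullpage.png", full_page=True)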

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.