Is Raspberry Pi good for web scraping?

A web scraper is a tool used to automatically extract data from websites. It navigates through web pages, collects information, and stores it in a structured format. This process is often used for various purposes such as market research, data analysis, and content aggregation. Web scraping is an efficient way to quickly gather large amounts of data, avoiding the need for manual data entry. However, it is important to ensure that web scraping is done in accordance with the terms and conditions of the target websites to avoid any legal or ethical issues.


Idea

I want to build a Python web scraper that runs on a Raspberry Pi. Every hour, the Raspberry Pi will search Google for the phrase “why do cats sleep so much?”, pick 5 articles from the results, and send them to me through a Telegram bot. You can use any phrase, or several phrases, and change how often the scraping runs: the phrase is set in the scraper’s code and the schedule in cron, so both are easy to configure.

Telegram bot configuration

First, we need to create and set up the bot that will send us the Google search results. I covered that setup step by step in my article about Raspberry Pi OpenCV, so I won’t repeat it here; follow the link for the details. Be sure to complete those instructions and obtain the bot token and chat ID, because we need both to send messages.
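If you only need a reminder of where the chat ID comes from, here is a minimal sketch, assuming you have already sent at least one message to your bot. It calls the Telegram Bot API `getUpdates` method and collects the chat IDs found in the reply. The helper names `extract_chat_ids` and `fetch_updates` are my own for this illustration; the linked article may take a different route.

```python
import json
from urllib.request import urlopen


def extract_chat_ids(updates):
    """Return the unique chat IDs found in a decoded getUpdates reply."""
    chat_ids = []
    for update in updates.get("result", []):
        # Ordinary messages live under "message"; other update types are skipped
        chat = update.get("message", {}).get("chat", {})
        chat_id = chat.get("id")
        if chat_id is not None and chat_id not in chat_ids:
            chat_ids.append(chat_id)
    return chat_ids


def fetch_updates(token):
    """Ask Telegram for the bot's pending updates (performs a network request)."""
    url = f"https://api.telegram.org/bot{token}/getUpdates"
    with urlopen(url, timeout=30) as response:
        return json.load(response)
```

Calling `extract_chat_ids(fetch_updates("YOUR_TELEGRAM_BOT_TOKEN"))` should then list the chats that have written to the bot. The sketch uses only the standard library, so it runs before any extra packages are installed.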

Web scraping code

I won’t describe installing the operating system; I assume you already have a Raspberry Pi running Raspbian OS. The script depends on the requests and beautifulsoup4 Python packages, so install them first if they are missing. For this project, I used fragments of the web scraper code from this website. I added proper parsing of the URL links, plus sending a brief description of each search result together with its link via a Telegram bot.

So, let’s take a look at the code and describe its key points:

import requests
from bs4 import BeautifulSoup
from time import sleep

TOKEN = "YOUR_TELEGRAM_BOT_TOKEN"
CHAT_ID = "YOUR_TELEGRAM_CHAT_ID"
query = "why do cats sleep so much?"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

def get_search_results(query):
    # Pass the query via params so that requests URL-encodes it (spaces, "?", etc.)
    response = requests.get('https://www.google.com.ua/search', params={'gl': 'us', 'q': query}, headers=headers, timeout=30)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find every div that wraps a search result
    result_divs = soup.find_all('div', class_='Gx5Zad fP1Qef xpd EtOod pkphOe')
    results = []

    for div in result_divs:
        # Only get the organic_results
        if div.find('div', class_='egMi0 kCrYT') is None:
            continue
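        # Extract the title (linked text) from h3, the URL, and the brief description
        title_tag = div.find('h3')
        link_tag = div.find('a', href=True)
        description_tag = div.find('div', class_='BNeawe s3v9rd AP7Wnd')
        # Skip results that lack any of the three parts instead of crashing
        if title_tag is None or link_tag is None or description_tag is None:
            continue

        results.append({'title': title_tag.text, 'link': link_tag['href'], 'description': description_tag.text})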

    return results

def send_message(message):
    url = f"https://api.telegram.org/bot{TOKEN}/sendMessage"
    params = {
        "chat_id": CHAT_ID,
        "text": message
    }
    response = requests.post(url, json=params, timeout=30)
    # print(response.json())

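search_results = get_search_results(query)
# Send at most five results, pausing 10 seconds between messages
for result in search_results[:5]:
    print(result['title'] + "\n" + result['link'] + "\n" + result['description'])
    print("-------------")
    # Google wraps each target in a redirect link; keep the part between "url=" and "&ved="
    # (if "url=" is missing, the link is sent as it is instead of raising IndexError)
    target = result['link'].split("url=", 1)[-1].split("&ved=")[0]
    send_message(f"A match was found for the phrase '{query}':\n\n" + result['description'] + "\n\n" + target)
    sleep(10)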

As you can see above, you will need the Telegram bot token and chat ID, which you can obtain from the instructions provided in the previous step. We also have two main functions: get_search_results for web scraping, and send_message for sending the search results to Telegram. In my code, Raspberry Pi performs the web scraping, parses the results, and sends a short description and link for each of the five search results via the Telegram bot with a 10-second interval between messages.
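The link handling in the script relies on string splitting. As a sturdier alternative, here is a minimal sketch that uses the standard `urllib.parse` module to pull the target page out of a Google redirect link. It assumes the target sits in a `url` or `q` query parameter, which may not hold for every result layout; the function name `extract_target_url` is hypothetical and not part of the original script.

```python
from urllib.parse import parse_qs, urlparse


def extract_target_url(href):
    """Return the page URL hidden inside a Google redirect link.

    Assumption: the target sits in a "url" or "q" query parameter.
    Links without either parameter are returned unchanged.
    """
    # parse_qs drops empty parameters and percent-decodes the values
    params = parse_qs(urlparse(href).query)
    for key in ("url", "q"):
        if key in params:
            return params[key][0]
    return href
```

Unlike the split-based approach, this also percent-decodes the address, so the link that arrives in Telegram is the readable form of the URL.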

Now let’s use cron to run our Python script once an hour:

crontab -e


Then add a rule that runs our script (see more settings here):

0 * * * * /usr/bin/python3 /home/raspberrypi/web_scraper.py
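The rule above fires at minute zero of every hour. If you would rather keep the schedule inside a long-running Python script, the same timing can be approximated with a small helper. This is only a sketch of an alternative to cron, not what the project uses, and the name `seconds_until_next_hour` is made up for the illustration.

```python
from datetime import datetime, timedelta


def seconds_until_next_hour(now):
    """Return the number of seconds from `now` to the next full hour."""
    next_hour = now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    return (next_hour - now).total_seconds()
```

A loop could call `sleep(seconds_until_next_hour(datetime.now()))` before each scraping round. Cron remains the simpler choice, since it also restarts the schedule after a reboot.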

Let’s test the web scraper on Raspberry Pi

When we run our Python script, its output appears in the console:

[Screenshot: web scraping result]

At the same time, the Telegram bot sends the scraper’s results to Telegram. It looks like this:

[Screenshot: Telegram bot messages with the search results about cats]

Instead of a conclusion

This project can be extended in interesting ways. For example, instead of running on a cron schedule, the Raspberry Pi could wait for a message that you send to the Telegram bot, run a search based on that message, and return the results to you. There are many possibilities that you can implement for your own needs. Feel free to share your ideas and thoughts in the comments 😉
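As a starting point for the message-driven variant, here is a rough sketch built on the Telegram Bot API `getUpdates` method with long polling. The function names `collect_queries` and `poll_once` are hypothetical, and the sketch only gathers the incoming phrases; wiring them into `get_search_results` and `send_message` is left to you.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def collect_queries(updates):
    """Turn a decoded getUpdates reply into (queries, next_offset).

    queries is a list of (chat_id, text) pairs. next_offset is the value to
    pass as `offset` on the next call so that updates are not delivered
    twice, or None when the reply contained no updates.
    """
    queries = []
    next_offset = None
    for update in updates.get("result", []):
        candidate = update["update_id"] + 1
        if next_offset is None or candidate > next_offset:
            next_offset = candidate
        message = update.get("message", {})
        text = message.get("text")
        chat_id = message.get("chat", {}).get("id")
        # Ignore updates without text, such as photos or stickers
        if text and chat_id is not None:
            queries.append((chat_id, text.strip()))
    return queries, next_offset


def poll_once(token, offset=None, timeout=30):
    """Wait up to `timeout` seconds for new messages (performs a network request)."""
    params = {"timeout": timeout}
    if offset is not None:
        params["offset"] = offset
    url = f"https://api.telegram.org/bot{token}/getUpdates?" + urlencode(params)
    # The HTTP timeout must be longer than the long-polling timeout
    with urlopen(url, timeout=timeout + 10) as response:
        return json.load(response)
```

A main loop would call `poll_once`, pass the reply to `collect_queries`, search for each returned phrase, and then poll again with the new offset.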

