A web scraper is a tool used to automatically extract data from websites. It navigates through web pages, collects information, and stores it in a structured format. This process is often used for various purposes such as market research, data analysis, and content aggregation. Web scraping is an efficient way to quickly gather large amounts of data, avoiding the need for manual data entry. However, it is important to ensure that web scraping is done in accordance with the terms and conditions of the target websites to avoid any legal or ethical issues.
Idea
I want to create a Python web scraper on Raspberry Pi. Every hour, the Raspberry Pi will search Google for the phrase “why do cats sleep so much?”, take the first five results, and send them to me through a Telegram bot. You can use any phrase, or several, and change how often the scraper runs: the phrase is set in the scraper’s code and the schedule in cron, and both are very easy to configure.
Telegram bot configuration
First, we need to create and set up a bot that will send us the Google search results. The setup steps are described in detail in my article about Raspberry Pi OpenCV, so I won’t repeat them here. Follow the instructions described there to get the bot token and chat ID, as we will need both for sending messages.
Web scraping code
I won’t describe the process of installing the operating system on Raspberry Pi, assuming that you already have a ready Raspberry Pi with the Raspbian OS installed. During the work on this project, I used fragments of the web scraper code from this website. I had to add proper parsing of the URL links, as well as sending a brief description of the search results with the links to web pages using a Telegram bot.
So, let’s take a look at the code and describe its key points:
import requests
from bs4 import BeautifulSoup
from time import sleep

TOKEN = "YOUR_TELEGRAM_BOT_TOKEN"
CHAT_ID = "YOUR_TELEGRAM_CHAT_ID"
query = "why do cats sleep so much?"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

def get_search_results(query):
    url = 'https://www.google.com.ua/search'
    # Passing the query through params lets requests URL-encode it,
    # so phrases containing characters such as & or # stay intact
    response = requests.get(url, params={'gl': 'us', 'q': query}, headers=headers, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find every div that wraps a search result
    result_divs = soup.find_all('div', class_='Gx5Zad fP1Qef xpd EtOod pkphOe')
    results = []
    for div in result_divs:
        # Only keep the organic results
        if div.find('div', class_='egMi0 kCrYT') is None:
            continue
        # Extract the title (linked text) from h3, the URL, and the brief description
        title = div.find('h3')
        link = div.find('a', href=True)
        description = div.find('div', class_='BNeawe s3v9rd AP7Wnd')
        # Skip results that lack any of the three parts
        if title is None or link is None or description is None:
            continue
        results.append({'title': title.text, 'link': link['href'], 'description': description.text})
    return results

def send_message(message):
    url = f"https://api.telegram.org/bot{TOKEN}/sendMessage"
    params = {
        "chat_id": CHAT_ID,
        "text": message
    }
    response = requests.post(url, json=params, timeout=10)
    # print(response.json())

search_results = get_search_results(query)
# Only the first five results are printed and sent
for result in search_results[:5]:
    print(result['title'] + "\n" + result['link'] + "\n" + result['description'])
    print("-------------")
    # Keep the part of the redirect link between "url=" and "&ved="
    link = result['link']
    if "url=" in link:
        link = link.split("url=", 1)[1].split("&ved=")[0]
    send_message(f"A match was found for the phrase '{query}':\n\n" + result['description'] + "\n\n" + link)
    # Pause for 10 seconds between messages
    sleep(10)
As you can see above, you will need the Telegram bot token and chat ID, which you can obtain by following the instructions from the previous step. We also have two main functions: get_search_results for web scraping, and send_message for sending the search results to Telegram. In my code, Raspberry Pi performs the web scraping, parses the results, and sends a short description and link for each of the five search results via the Telegram bot, with a 10-second interval between messages.
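The link handling in the script relies on string splitting. As a rough sketch of an alternative, the standard library’s urllib.parse can pull the target address out of the redirect link. The helper name extract_target_url and the sample link are hypothetical, and the sketch assumes that Google stores the destination in a url or q query parameter, which may change at any time.

```python
from urllib.parse import urlparse, parse_qs


def extract_target_url(href):
    """Return the destination of a Google redirect link, or href unchanged.

    Assumption: the destination sits in a "url" or "q" query parameter.
    """
    params = parse_qs(urlparse(href).query)
    for key in ("url", "q"):
        values = params.get(key)
        if values and values[0].startswith("http"):
            return values[0]
    # Not a redirect link we recognise: fall back to the raw value
    return href


# Hypothetical sample link in the shape the script's split() expects
sample = "/url?esrc=s&q=&rct=j&sa=U&url=https://example.com/cats%3Fid%3D1&ved=abc"
print(extract_target_url(sample))  # → https://example.com/cats?id=1
```

Unlike the split-based version, parse_qs also decodes percent-escapes, so the link arrives in Telegram in its readable form.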
Now, let’s use cron to run our Python script once an hour:
crontab -e
And add our script and the rule to execute it (see more settings here):
00 * * * * /usr/bin/python /home/raspberrypi/web_scraper.py
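The rule above fires at minute 0 of every hour. If you would rather keep the schedule inside a long-running Python script instead of cron, the wait can be computed as in the sketch below. The function name seconds_until_next_hour is hypothetical; this is only one possible approach, not part of the original project.

```python
from datetime import datetime, timedelta


def seconds_until_next_hour(now):
    """Seconds from `now` until the next minute 0, mirroring the cron rule '00 * * * *'."""
    next_hour = now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    return (next_hour - now).total_seconds()


# At 12:45:00 there are 15 minutes left until 13:00
print(seconds_until_next_hour(datetime(2024, 1, 1, 12, 45, 0)))  # → 900.0
```

In such a script you would call sleep(seconds_until_next_hour(datetime.now())) before each scraping run. Cron remains the simpler choice, since it also survives reboots and script crashes.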
Let’s test the web scraper on Raspberry Pi
When we run our Python script, we see the output of its work in the console:
Additionally, the Telegram bot simultaneously sends the results of the web scraper’s work to Telegram. It looks like this:
Instead of a conclusion
This project can be extended in interesting ways. For example, you can teach Raspberry Pi to wait for a message that you write to the Telegram bot, run a search based on that message, and return the results to you, rather than doing it on a schedule with cron. There are many interesting possibilities that you can implement for your own needs. Feel free to share your ideas and thoughts in the comments 😉
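As a starting point for the on-demand idea, here is a minimal sketch built on the Telegram Bot API’s getUpdates method with long polling. The function names extract_queries and poll_once are hypothetical. The sketch uses only the standard library so it runs without extra packages, and it assumes the bot has no webhook configured, since getUpdates does not work alongside one.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def extract_queries(updates, chat_id):
    """Return (next_offset, queries) from a list of Telegram update objects.

    Only text messages from chat_id are kept, so other users cannot trigger searches.
    """
    next_offset = None
    queries = []
    for update in updates:
        # Asking for update_id + 1 next time marks this update as handled
        next_offset = update["update_id"] + 1
        message = update.get("message", {})
        if str(message.get("chat", {}).get("id")) == str(chat_id) and "text" in message:
            queries.append(message["text"])
    return next_offset, queries


def poll_once(token, chat_id, offset=None):
    """Long-poll getUpdates once and return (offset_for_next_call, queries)."""
    params = {"timeout": 30}
    if offset is not None:
        params["offset"] = offset
    url = f"https://api.telegram.org/bot{token}/getUpdates?{urlencode(params)}"
    # The HTTP timeout is longer than the long-poll timeout on purpose
    with urlopen(url, timeout=40) as response:
        payload = json.load(response)
    new_offset, queries = extract_queries(payload.get("result", []), chat_id)
    return (new_offset if new_offset is not None else offset), queries
```

A loop would call poll_once(TOKEN, CHAT_ID, offset) repeatedly, pass each returned phrase to get_search_results, and send the results back with send_message.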