🐍 Python Tutorial: Web Scraping
Web scraping is the process of extracting data from websites. With Python, libraries like requests and BeautifulSoup make it easy to download HTML content and parse it to extract useful information.
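As a quick offline illustration of the parse step, the sketch below runs BeautifulSoup on an inline HTML string instead of a downloaded page. The SAMPLE_HTML snippet and its contents are made up for this example; only the BeautifulSoup calls themselves (find, find_all, select_one, get_text) are taken from the library.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration; no network access is needed.
SAMPLE_HTML = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1 class="headline">Hello, scraping</h1>
    <p id="intro">First paragraph.</p>
    <p>Second paragraph.</p>
    <a href="https://example.com/docs">Docs</a>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Look up elements by tag, by class, by id, and by CSS selector.
title = soup.title.get_text(strip=True)
headline = soup.find("h1", class_="headline").get_text(strip=True)
intro = soup.find(id="intro").get_text(strip=True)
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
link = soup.select_one('a[href^="http"]')["href"]

print(title)       # Sample Page
print(headline)    # Hello, scraping
print(intro)       # First paragraph.
print(paragraphs)  # ['First paragraph.', 'Second paragraph.']
print(link)        # https://example.com/docs
```

Working against a fixed string like this is a convenient way to try out selectors before pointing the same code at a live site.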
1. Basic HTML Scraping with requests and BeautifulSoup
Let's fetch a page and parse its contents:
import requests
from bs4 import BeautifulSoup
url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.text)
2. Extracting Elements
You can extract elements by tag name, class, ID, or CSS selectors:
# Find all paragraphs
paragraphs = soup.find_all('p')
# Find element by class name
headline = soup.find('h1', class_='headline')
# Find element using CSS selector
link = soup.select_one('a[href^="http"]')
print(link['href'])
3. Handling Headers and robots.txt
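Check a site's robots.txt before scraping it and follow its rules. Some sites also reject requests that arrive without a browser-style User-Agent header, so set one explicitly: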
headers = {'User-Agent': 'Mozilla/5.0'}
url = "https://example.com"
response = requests.get(url, headers=headers)
4. Pagination and Saving Data
To scrape several pages and save the results, loop over the page URLs and write one CSV row per article:
import csv
with open('data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    # Header row
    writer.writerow(['Title', 'Link'])
    # Pages 1 through 3
    for page in range(1, 4):
        url = f"https://example.com/page/{page}"
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        # One CSV row per element with class "article"
        for article in soup.select('.article'):
            title = article.find('h2').text
            link = article.find('a')['href']
            writer.writerow([title, link])
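The loop above requests every page unconditionally. One way to apply the robots.txt advice from section 3 is to filter the page URLs through the standard library's urllib.robotparser before fetching them. This is a minimal sketch, not a complete crawler: the USER_AGENT name, the SAMPLE_ROBOTS rules, and the helper functions build_parser and allowed_pages are all hypothetical, and the rules are parsed from an inline string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "tutorial-scraper"  # hypothetical name, chosen for this example

# Made-up robots.txt content; a real site publishes its own at /robots.txt.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /page/2
Crawl-delay: 1
"""


def build_parser(robots_txt):
    """Return a RobotFileParser loaded from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_pages(parser, base_url, pages, user_agent=USER_AGENT):
    """Return the page URLs that the parsed rules allow user_agent to fetch."""
    candidates = (f"{base_url}/page/{n}" for n in pages)
    return [url for url in candidates if parser.can_fetch(user_agent, url)]


rp = build_parser(SAMPLE_ROBOTS)
urls = allowed_pages(rp, "https://example.com", range(1, 4))

# Fall back to a 1-second pause if the file declares no Crawl-delay.
delay = rp.crawl_delay(USER_AGENT) or 1

print(urls)   # ['https://example.com/page/1', 'https://example.com/page/3']
print(delay)  # 1
```

Against a live site, the parser would instead be loaded with rp.set_url("https://example.com/robots.txt") followed by rp.read(), and the scraping loop would iterate over urls with a time.sleep(delay) between requests.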