🐍 Python Tutorial: Web Scraping

Web scraping is the process of extracting data from websites. In Python, the requests library downloads a page's HTML, and BeautifulSoup parses that HTML so you can pull out the parts you need.

1. Basic HTML Scraping with requests and BeautifulSoup

Let's fetch a page and parse its contents:

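import requests
from bs4 import BeautifulSoup

url = "https://example.com"
# A timeout keeps the script from hanging on an unresponsive server
response = requests.get(url, timeout=10)
# Raise an exception on 4xx/5xx responses instead of parsing an error page
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# soup.title is None when the page has no <title> tag
if soup.title:
    print(soup.title.text)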

2. Extracting Elements

You can extract elements by tag name, class, ID, or CSS selectors:

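# Find all paragraphs (returns a list, empty if nothing matches)
paragraphs = soup.find_all('p')

# Find the first <h1> with class "headline" (None if absent)
headline = soup.find('h1', class_='headline')

# Find an element by ID ("main" is a hypothetical ID used for illustration)
main = soup.find(id='main')

# Find the first link whose href starts with "http", using a CSS selector
link = soup.select_one('a[href^="http"]')
if link:  # select_one returns None when nothing matches
    print(link['href'])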
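
To see what each lookup returns, it can help to run the same calls against a small inline document instead of a live site. The sketch below assumes a made-up HTML snippet; the page content, class names, and variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration
html = """
<html><head><title>Sample Page</title></head>
<body>
  <h1 class="headline">Breaking News</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <a href="/about">About</a>
  <a href="https://example.com/docs">Docs</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every match as a list
paragraph_texts = [p.get_text(strip=True) for p in soup.find_all("p")]

# find returns the first match, or None when nothing matches
headline = soup.find("h1", class_="headline")
headline_text = headline.get_text(strip=True) if headline else None
missing = soup.find("h1", class_="no-such-class")

# The selector matches only hrefs starting with "http", so "/about" is skipped
link = soup.select_one('a[href^="http"]')
link_href = link["href"] if link else None

print(paragraph_texts, headline_text, link_href, missing)
```

Because find and select_one return None when nothing matches, checking the result before indexing into it avoids an error on pages that lack the expected element.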

3. Handling Headers and robots.txt

Check a site's robots.txt before scraping and follow the rules it sets. Some servers also reject requests that lack a User-Agent header, so send one with each request:

headers = {'User-Agent': 'Mozilla/5.0'}
url = "https://example.com"
response = requests.get(url, headers=headers)
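
The snippet above sets a header but does not consult robots.txt. One way to do that is with the standard library's urllib.robotparser, sketched below. The robots.txt content and the "my-scraper" user agent are assumptions made for illustration; the rules are parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
# parse() takes the file's lines; for a live site, call
# parser.set_url("https://example.com/robots.txt") and parser.read() instead
parser.parse(robots_txt.splitlines())

user_agent = "my-scraper"  # hypothetical user agent name

allowed_public = parser.can_fetch(user_agent, "https://example.com/page/1")
allowed_private = parser.can_fetch(user_agent, "https://example.com/private/data")
delay = parser.crawl_delay(user_agent)  # None if no Crawl-delay applies

print(allowed_public, allowed_private, delay)
```

A scraper could call can_fetch before each request and skip any URL for which it returns False.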

4. Pagination and Saving Data

To scrape several pages and save the results, loop over the page URLs and write one CSV row per article:

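import csv
import time

import requests
from bs4 import BeautifulSoup

with open('data.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'Link'])

    for page in range(1, 4):  # pages 1 through 3
        url = f"https://example.com/page/{page}"
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')

        for article in soup.select('.article'):
            heading = article.find('h2')
            anchor = article.find('a', href=True)
            # Skip articles that lack a heading or a link
            if heading is None or anchor is None:
                continue
            writer.writerow([heading.get_text(strip=True), anchor['href']])

        time.sleep(1)  # pause between requests to avoid overloading the server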
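
The href values collected in the loop may be relative (for example "/articles/42"), which makes them hard to use outside the page they came from. A minimal sketch of resolving them against the page URL with urllib.parse.urljoin follows; the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin

# URL of the page the links were scraped from
page_url = "https://example.com/page/2"

# Hypothetical href values as they might appear in the HTML
hrefs = ["/articles/42", "post.html", "https://other.example.org/x"]

# urljoin resolves relative links and leaves absolute ones unchanged
absolute = [urljoin(page_url, href) for href in hrefs]

for original, resolved in zip(hrefs, absolute):
    print(f"{original} -> {resolved}")
```

In the scraping loop, the same call could wrap the link before it is written to the CSV, so every row holds a full URL.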

