Web Scraping with BeautifulSoup
When a website has data you want but no API, web scraping lets you download its HTML and extract exactly what you need. Combined with the requests library you already know, BeautifulSoup turns a tangled page into clean, structured data.
Part of the free Python course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.
You'll learn to parse HTML, find elements by tag and class, follow links, and scrape responsibly.
Every web page is HTML: nested tags that form a tree. Scraping means navigating that tree to find the branches you care about.
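To make the tree idea concrete, here is a minimal sketch that prints the nesting of a small page. The HTML snippet and the `outline` helper are made up for illustration and are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical page used only for illustration.
html = """
<html>
  <body>
    <div class="card">
      <h2>Blue Mug</h2>
      <p class="price">$12</p>
    </div>
  </body>
</html>
"""

def outline(tag, depth=0):
    """Return one indented line per nested tag, parents before children."""
    lines = ["  " * depth + tag.name]
    for child in tag.children:
        if isinstance(child, Tag):  # skip text nodes such as whitespace
            lines.extend(outline(child, depth + 1))
    return lines

soup = BeautifulSoup(html, "html.parser")
tree_lines = outline(soup.html)
print("\n".join(tree_lines))
```

Each extra level of indentation in the output is one step deeper in the tree, which is the same nesting the browser's "Inspect" panel shows.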
Tip: in your browser, right-click any element and choose "Inspect" to see its tag and class. That's how you find the selectors to scrape.
Scraping is always two steps: requests fetches the HTML, then BeautifulSoup parses it into a searchable tree.
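The two steps can be sketched as follows. The function names `parse_html` and `fetch_soup`, the 10-second timeout, and the example URL are assumptions chosen for illustration. Only the parsing step runs here, so no request is sent.

```python
import requests
from bs4 import BeautifulSoup

def parse_html(html_text):
    """Step 2: turn raw HTML text into a searchable tree."""
    return BeautifulSoup(html_text, "html.parser")

def fetch_soup(url, timeout=10):
    """Step 1 then step 2: download the page, then parse it."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # stop early on 4xx/5xx responses
    return parse_html(resp.text)

# Offline check of the parsing step; no request is sent here.
soup = parse_html("<html><head><title>Demo</title></head><body></body></html>")
print(soup.title.text)

# With network access you would call, for example:
# soup = fetch_soup("https://example.com")
```

Keeping the download and the parse in separate functions makes the parsing logic easy to test against saved HTML.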
"html.parser" is Python's built-in parser (no extra install). Some scrapers prefer "lxml" for speed — both work the same way once the soup is built.
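To search the tree, find returns the first matching element (or None) and find_all returns a list of every match. To filter by CSS class, pass class_, because class is a reserved Python keyword. The examples in this lesson embed the HTML directly, so they run anywhere BeautifulSoup is installed, with no network needed.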
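A minimal sketch of searching by tag and class, using an embedded HTML list invented for this example:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html = """
<ul>
  <li class="item">Apples</li>
  <li class="item sale">Pears</li>
  <li>Plums</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li")                   # first match only
all_li = soup.find_all("li")                 # list of every match
items = soup.find_all("li", class_="item")   # class_ because class is a keyword
missing = soup.find("span")                  # there is no <span>, so None

print(first_li.text)
print(len(all_li))
print([li.text for li in items])
print(missing)
```

Note that class_="item" also matches the element whose class attribute is "item sale", because the filter is checked against each class on the tag.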
You'll often want a tag's text and its attributes (like a link's href or an image's src).
Use tag["attr"] when you're sure the attribute exists, and tag.get("attr") when it might be missing: get returns None instead of raising a KeyError.
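Reading text and attributes as described above might look like this. The HTML fragment is made up, and the image deliberately has no src so the missing-attribute case is visible.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the <img> has an alt attribute but no src.
html = '<p><a href="/about">About us</a> <img alt="Logo"></p>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")
img = soup.find("img")

link_text = link.text          # visible text between the tags
link_href = link["href"]       # attribute known to exist
img_src = img.get("src")       # attribute is missing, so this is None
img_alt = img.get("alt", "")   # optional default, like dict.get

try:
    img["src"]                 # indexing a missing attribute raises KeyError
    raised = False
except KeyError:
    raised = True

print(link_text, link_href, img_src, img_alt, raised)
```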
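If you know CSS, select() lets you write the same selectors you'd use for styling. It's often more concise than chaining find calls. select() returns a list of matches (like find_all), and select_one() returns the first match (like find).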
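A short sketch of CSS selection, assuming a made-up page with two cards. The selectors `div.card h2` and `a[href]` are ordinary CSS chosen for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html = """
<div class="card"><h2>Blue Mug</h2><a href="/mug">View</a></div>
<div class="card"><h2>Red Cup</h2><a>No link yet</a></div>
<h2>Footer heading</h2>
"""
soup = BeautifulSoup(html, "html.parser")

# Every <h2> nested inside a <div class="card">.
card_titles = [h.text for h in soup.select("div.card h2")]

# Only <a> tags that actually carry an href attribute.
links = [a["href"] for a in soup.select("a[href]")]

first_card = soup.select_one("div.card")   # first match
nothing = soup.select_one("table")         # no match, so None

print(card_titles, links)
```

The footer heading is skipped because it is not inside a card, and the second link is skipped because it has no href.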
Example selectors and what they match:
- … inside a div.card
- a[href]: every link that has an href
Real-World Example: A Product Listing Scraper
A realistic scraper: download a page, loop over product cards, extract name, price, and link, and build clean records. This is the shape of nearly every real scraping job.
A defensive if not (...) check matters here: real pages have broken or missing cards. A scraper that assumes perfect HTML crashes on the first oddity.
Scrape Responsibly
Scraping carries real responsibilities. Be a good citizen of the web:
- Check robots.txt (e.g. example.com/robots.txt) and the site's Terms of Service first.
- Prefer an official API if one exists: it's faster, more stable, and explicitly allowed.
- Rate-limit yourself: add time.sleep(1) between requests so you don't hammer the server.
- Identify your bot with a clear User-Agent string.
- Never collect personal data or republish copyrighted content.
🧩 Reorder Challenge
These lines fetch a page and print its main heading, but they're scrambled. Find the order.
Correct order: B → C → E → A → D
Imports first, download with requests, build the soup from resp.text, then search the tree. You can't parse before you'…
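Returning to the product listing scraper, the following is a minimal sketch of that shape rather than the lesson's original code. The markup (div.product, h3, span.price), the `parse_products` name, and the example.com base URL are all assumptions. The HTML is embedded so the sketch runs without a network. In a real job it would come from resp.text.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://example.com"  # hypothetical site

# Hypothetical listing page; the second card is incomplete on purpose.
html = """
<div class="product"><h3>Blue Mug</h3><span class="price">$12.50</span><a href="/p/mug">View</a></div>
<div class="product"><h3>Broken Card</h3></div>
<div class="product"><h3>Red Cup</h3><span class="price">$8.00</span><a href="/p/cup">View</a></div>
"""

def parse_products(html_text, base_url=BASE_URL):
    """Turn product cards into a list of clean dict records."""
    soup = BeautifulSoup(html_text, "html.parser")
    records = []
    for card in soup.find_all("div", class_="product"):
        name = card.find("h3")
        price = card.find("span", class_="price")
        link = card.find("a", href=True)
        # Defensive check: skip cards that are missing any piece.
        if not (name and price and link):
            continue
        records.append({
            "name": name.text.strip(),
            "price": float(price.text.strip().lstrip("$")),
            "url": urljoin(base_url, link["href"]),  # make the link absolute
        })
    return records

products = parse_products(html)
for product in products:
    print(product)
```

The incomplete card is skipped instead of crashing the loop, so one bad card does not cost you the rest of the page.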
You can download HTML with requests, parse it with BeautifulSoup, select elements with find/find_all and CSS selectors, read text and attributes, and scrape ethically. This is one of the most practical skills in a Python toolkit.
🚀 Up next: itertools — elegant, memory-efficient tools for looping and combining data.
Practice quiz
What are the two main steps of scraping a web page?
- Compile then run
- Parse first, then download
- Download the HTML with requests, then parse it with BeautifulSoup
- Open a browser, then screenshot it
Answer: Download the HTML with requests, then parse it with BeautifulSoup. Scraping always follows this order: requests fetches the HTML, then BeautifulSoup parses that text into a searchable tree.
How do you construct a soup from downloaded HTML using Python's built-in parser?
- BeautifulSoup(resp.text, 'html.parser')
- BeautifulSoup.parse(resp)
- soup(resp.text)
- BeautifulSoup(resp.text, 'python')
Answer: BeautifulSoup(resp.text, 'html.parser'). 'html.parser' is Python's built-in parser (no extra install); you pass the HTML text and the parser name.
What is the difference between find and find_all?
- find returns a list; find_all returns one element
- They are identical
- find only works on links
- find returns the first match (or None); find_all returns a list of all matches
Answer: find returns the first match (or None); find_all returns a list of all matches. When nothing matches, find gives you None, while find_all gives you an empty list.
Why does BeautifulSoup use class_ (with an underscore) to filter by CSS class?
- Because class names are case-sensitive
- Because 'class' is a reserved Python keyword
- Because it searches faster
- Because the HTML attribute is spelled class_
Answer: Because 'class' is a reserved Python keyword. 'class' is a reserved Python keyword, so BeautifulSoup uses class_ as the keyword argument to match a CSS class.
What does soup.find('span') return when there is no span in the HTML?
- None
- An empty list
- An empty string
- It raises a KeyError
Answer: None. find returns None when nothing matches; calling .text on that None would then raise an AttributeError, which is why you guard with if checks.
For a tag named link, how do you safely read an href that might be missing?
- href
Answer: link.get('href'). link.get('href') returns None if the attribute is missing, while link['href'] raises a KeyError when it's absent.
What does link.text give you for <a href='/x'>Go</a>, versus link['href']?
- Both return '/x'
- Go
- href
Answer: Go. .text is the visible content between the tags ('Go'); indexing with an attribute name returns that attribute's value ('/x').
Which method lets you use CSS selectors and returns a LIST of matches?
- select_one()
- find()
- get_text()
- select()
Answer: select(). select() takes a CSS selector and returns a list (like find_all); select_one() returns the first match (like find).
A scraper returns None or an empty list even though you see the data in your browser. The most common cause is:
- Your internet is too fast
- The page renders content with JavaScript after the initial HTML, which requests + BeautifulSoup never see
- BeautifulSoup only supports XML
- find_all is deprecated
Answer: The page renders content with JavaScript after the initial HTML, which requests + BeautifulSoup never see. JavaScript-rendered content isn't in the initial HTML, so requests + BeautifulSoup miss it; such sites usually need a browser-automation tool such as Selenium or Playwright.
Which is a responsible scraping practice?
- Send requests as fast as possible to finish quickly
- Always scrape personal data while it's available
- Check robots.txt and Terms of Service, rate-limit yourself, and prefer an official API
- Hide your User-Agent entirely
Answer: Check robots.txt and Terms of Service, rate-limit yourself, and prefer an official API. Be a good web citizen: honor robots.txt/ToS, add delays like time.sleep(1), identify your bot, and use an official API when one exists.