Using requests and Beautiful Soup

The requests library is a high-level HTTP client built on top of urllib3, which makes placing HTTP requests easy and intuitive.

BeautifulSoup parses HTML content and builds a tree of all the tags and markup within the page that is easy to traverse. This tree can be searched for particular markup elements (a table, a hyperlink, or a block of text within a div with a particular ID) to scrape useful data.
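As a minimal sketch of this kind of traversal, the snippet below parses a small hand-written HTML fragment and pulls out a hyperlink, the text inside a div with a given ID, and the cells of a table. The markup, the `summary` ID, and the example.org URL are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, used only to illustrate tree traversal
html = """
<html><body>
  <div id="summary">A short <b>blob</b> of text.</div>
  <a href="https://example.org/page">A link</a>
  <table>
    <tr><td>alpha</td><td>1</td></tr>
    <tr><td>beta</td><td>2</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# A hyperlink: the first <a> tag and its href attribute
link = soup.find("a")
link_target = link.get("href")

# Text within a particular div ID (nested tags are flattened into one string)
summary_text = soup.find("div", id="summary").get_text()

# A table: one list of cell strings per row
rows = [[td.get_text() for td in tr.find_all("td")] for tr in soup.find_all("tr")]

print(link_target)
print(summary_text)
print(rows)
```

The same `find` and `find_all` calls apply unchanged to a tree built from a downloaded page.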

Importing libraries and packages

# Data gathering
import requests
from bs4 import BeautifulSoup

Using requests to get a response

url = "https://en.wikipedia.org/wiki/Main_Page"

# A timeout keeps the call from hanging indefinitely if the server does not answer
response = requests.get(url, timeout=10)
response

<Response [200]>

type(response)

requests.models.Response

Checking status of the request

def status_check(r):
    """Print the outcome and return 1 if the request succeeded (HTTP 200), else -1."""
    if r.status_code == 200:
        print("Success!")
        return 1
    else:
        print("Failed!")
        return -1


status_check(response)

Success!
1
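The check above treats only 200 as success, although any status in the 2xx range signals success in HTTP. A rough sketch of a broader classification, using only the standard library, might look like this; `classify_status` is a hypothetical helper and not part of requests.

```python
from http import HTTPStatus


def classify_status(code):
    """Return a coarse label for an HTTP status code (hypothetical helper)."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


for code in (200, 204, 404, 503):
    # HTTPStatus(code).phrase is the standard reason phrase, e.g. "Not Found"
    print(code, HTTPStatus(code).phrase, "->", classify_status(code))
```

On a requests response, `response.ok` and `response.raise_for_status()` offer related built-in checks.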

Decoding the contents of a response

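def encoding_check(r):
    """Return the encoding declared by the response, or a guess if none is declared."""
    # r.encoding is None when the server does not declare a charset
    return r.encoding or r.apparent_encoding


def decode_content(r, encoding):
    """Decode the raw response bytes into a str using the given encoding."""
    return r.content.decode(encoding)


contents = decode_content(response, encoding_check(response))

type(contents)

str

len(contents)

101482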
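Decoding can fail when no encoding is declared (`r.encoding` may be `None`) or when the declared encoding does not match the bytes. The sketch below shows one way to decode raw bytes with a fallback. It runs on a hand-made byte string rather than a live response; `safe_decode` and the UTF-8 default are assumptions made for illustration.

```python
def safe_decode(raw, encoding=None, fallback="utf-8"):
    """Decode bytes with the declared encoding, or a fallback if none is given."""
    try:
        return raw.decode(encoding or fallback)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or undecodable bytes: substitute bad bytes instead of failing
        return raw.decode(fallback, errors="replace")


raw = "Thessaloníki".encode("utf-8")  # made-up sample payload

print(safe_decode(raw, "utf-8"))
print(safe_decode(raw, None))
print(safe_decode(raw, "no-such-codec"))
```

requests also exposes `response.text`, which performs the decoding step itself and guesses an encoding when none is declared.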
contents[:10000]
'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Wikipedia, the free encyclopedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"00593be1-61af-48f5-8b19-fd84f0934090","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":1114291180,"wgRevisionId":1114291180,"wgArticleId":15580374,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"Main_Page","wgRelevantArticleId":15580374,"wgIsProbablyEditable":false,"wgRelevantPageIsProbablyEditable":false,"wgRestrictionEdit":["sysop"],"wgRestrictionMove":["sysop"],"wgIsMainPage":true,"wgFlaggedRevsParams":{\n"tags":{"status":{"levels":1}}},"wgVisualEditor":{"pageLanguageCode":"en","pageLanguageDir":"ltr","pageVariantFallbacks":"en"},"wgMFDisplayWikibaseDescriptions":{"search":true,"watchlist":true,"tagline":false,"nearby":true},"wgWMESchemaEditAttemptStepOversample":false,"wgWMEPageLength":3000,"wgNoticeProject":"wikipedia","wgVector2022PreviewPages":[],"wgMediaViewerOnClick":true,"wgMediaViewerEnabledByDefault":true,"wgPopupsFlags":10,"wgULSCurrentAutonym":"English","wgEditSubmitButtonLabelPublish":true,"wgCentralAuthMobileDomain":false,"wgULSPosition":"interlanguage","wgULSisCompactLinksEnabled":true,"wgULSisLanguageSelectorEmpty":false,"wgWikibaseItemId":"Q5296","GEHomepageSuggestedEditsEnableTopics":true,"wgGETopicsMatchModeEnabled":false,"wgGEStructuredTaskRejectionReasonTextInputEnabled":false};RLSTATE={"skins.vector.user.styles":"r
eady","ext.globalCssJs.user.styles":"ready","site.styles":"ready","user.styles":"ready","skins.vector.user":"ready","ext.globalCssJs.user":"ready","user":\n"ready","user.options":"loading","ext.tmh.player.styles":"ready","mediawiki.ui.button":"ready","skins.vector.styles":"ready","skins.vector.icons":"ready","mediawiki.ui.icon":"ready","ext.visualEditor.desktopArticleTarget.noscript":"ready","ext.wikimediaBadges":"ready","ext.uls.interlanguage":"ready"};RLPAGEMODULES=["ext.tmh.player","site","mediawiki.page.ready","skins.vector.js","skins.vector.es6","mmv.head","mmv.bootstrap.autostart","ext.visualEditor.desktopArticleTarget.init","ext.visualEditor.targetLoader","ext.eventLogging","ext.wikimediaEvents","ext.navigationTiming","ext.cx.eventlogging.campaigns","ext.centralNotice.geoIP","ext.centralNotice.startUp","ext.gadget.ReferenceTooltips","ext.gadget.charinsert","ext.gadget.extra-toolbar-buttons","ext.gadget.switcher","ext.centralauth.centralautologin","ext.popups","ext.echo.centralauth","ext.uls.interface","ext.growthExperiments.SuggestedEditSession"];</script>\n<script>(RLQ=window.RLQ||[]).push(function(){mw.loader.implement("user.options@12s5i",function($,jQuery,require,module){mw.user.tokens.set({"patrolToken":"+\\\\","watchToken":"+\\\\","csrfToken":"+\\\\"});});});</script>\n<link rel="stylesheet" href="/w/load.php?lang=en&amp;modules=ext.tmh.player.styles%7Cext.uls.interlanguage%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cmediawiki.ui.button%2Cicon%7Cskins.vector.icons%2Cstyles&amp;only=styles&amp;skin=vector-2022"/>\n<script async="" src="/w/load.php?lang=en&amp;modules=startup&amp;only=scripts&amp;raw=1&amp;skin=vector-2022"></script>\n<meta name="ResourceLoaderDynamicStyles" content=""/>\n<link rel="stylesheet" href="/w/load.php?lang=en&amp;modules=site.styles&amp;only=styles&amp;skin=vector-2022"/>\n<meta name="generator" content="MediaWiki 1.40.0-wmf.19"/>\n<meta name="referrer" content="origin"/>\n<meta name="referrer" 
content="origin-when-crossorigin"/>\n<meta name="referrer" content="origin-when-cross-origin"/>\n<meta name="robots" content="max-image-preview:standard"/>\n<meta name="format-detection" content="telephone=no"/>\n<meta property="og:image" content="https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Saloniki_City_Walls_2.jpg/1200px-Saloniki_City_Walls_2.jpg"/>\n<meta property="og:image:width" content="1200"/>\n<meta property="og:image:height" content="800"/>\n<meta property="og:image" content="https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Saloniki_City_Walls_2.jpg/800px-Saloniki_City_Walls_2.jpg"/>\n<meta property="og:image:width" content="800"/>\n<meta property="og:image:height" content="533"/>\n<meta property="og:image" content="https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Saloniki_City_Walls_2.jpg/640px-Saloniki_City_Walls_2.jpg"/>\n<meta property="og:image:width" content="640"/>\n<meta property="og:image:height" content="427"/>\n<meta name="viewport" content="width=1000"/>\n<meta property="og:title" content="Wikipedia, the free encyclopedia"/>\n<meta property="og:type" content="website"/>\n<link rel="preconnect" href="//upload.wikimedia.org"/>\n<link rel="alternate" media="only screen and (max-width: 720px)" href="//en.m.wikipedia.org/wiki/Main_Page"/>\n<link rel="alternate" type="application/atom+xml" title="Wikipedia picture of the day feed" href="/w/api.php?action=featuredfeed&amp;feed=potd&amp;feedformat=atom"/>\n<link rel="alternate" type="application/atom+xml" title="Wikipedia featured articles feed" href="/w/api.php?action=featuredfeed&amp;feed=featured&amp;feedformat=atom"/>\n<link rel="alternate" type="application/atom+xml" title="Wikipedia &quot;On this day...&quot; feed" href="/w/api.php?action=featuredfeed&amp;feed=onthisday&amp;feedformat=atom"/>\n<link rel="apple-touch-icon" href="/static/apple-touch/wikipedia.png"/>\n<link rel="icon" href="/static/favicon/wikipedia.ico"/>\n<link rel="search" 
type="application/opensearchdescription+xml" href="/w/opensearch_desc.php" title="Wikipedia (en)"/>\n<link rel="EditURI" type="application/rsd+xml" href="//en.wikipedia.org/w/api.php?action=rsd"/>\n<link rel="license" href="https://creativecommons.org/licenses/by-sa/3.0/"/>\n<link rel="canonical" href="https://en.wikipedia.org/wiki/Main_Page"/>\n<link rel="dns-prefetch" href="//meta.wikimedia.org" />\n<link rel="dns-prefetch" href="//login.wikimedia.org"/>\n</head>\n<body class="skin-vector skin-vector-search-vue vector-toc-pinned mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject page-Main_Page rootpage-Main_Page skin-vector-2022 action-view vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-sticky-header-edit-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled"><div class="mw-page-container">\n\t<a class="mw-jump-link" href="#bodyContent">Jump to content</a>\n\t<div class="mw-page-container-inner">\n\t\t<input\n\t\t\ttype="checkbox"\n\t\t\tid="mw-sidebar-checkbox"\n\t\t\tclass="mw-checkbox-hack-checkbox"\n\t\t\t>\n\t\t<header class="mw-header mw-ui-icon-flush-left mw-ui-icon-flush-right">\n\t\t\t<div class="vector-header-start">\n\t\t\t\t\t<label\n\t\t\t\tid="mw-sidebar-button"\n\t\t\t\tclass="mw-checkbox-hack-button mw-ui-icon mw-ui-button mw-ui-quiet mw-ui-icon-element mw-ui-icon-flush-right"\n\t\t\t\tfor="mw-sidebar-checkbox"\n\t\t\t\trole="button"\n\t\t\t\taria-controls="mw-panel"\n\t\t\t\tdata-event-name="ui.sidebar"\n\t\t\t\ttabindex="0"\n\t\t\t\ttitle="Main menu">\n\t\t\t\t<span>Toggle sidebar</span>\n\t\t\t</label>\n\t\t\n<a href="/wiki/Main_Page" class="mw-logo">\n\t<img class="mw-logo-icon" src="/static/images/icons/wikipedia.png" 
alt=""\n\t\taria-hidden="true" height="50" width="50">\n\t<span class="mw-logo-container">\n\t\t<img class="mw-logo-wordmark" alt="Wikipedia"\n\t\t\tsrc="/static/images/mobile/copyright/wikipedia-wordmark-en.svg" style="width: 7.5em; height: 1.125em;">\n\t\t<img class="mw-logo-tagline"\n\t\t\talt="The Free Encyclopedia"\n\t\t\tsrc="/static/images/mobile/copyright/wikipedia-tagline-en.svg" width="117" height="13" style="width: 7.3125em; height: 0.8125em;">\n\t</span>\n</a>\n\n\t\t\t</div>\n\t\t\t<div class="vector-header-end">\n\t\t\t\t\n<div id="p-search" role="search" class="vector-search-box-vue  vector-search-box-collapses  vector-search-box-show-thumbnail vector-search-box-auto-expand-width vector-search-box">\n\t<a href="/wiki/Special:Search"\n\t\n\t\t\n\t\t\n\t\t\n\t\ttitle="Search Wikipedia [f]"\n\t\taccesskey="f"\n\t\tclass="mw-ui-button mw-ui-quiet mw-ui-icon mw-ui-icon-element mw-ui-icon-wikimedia-search search-toggle">\n\t\t<span>Search</span>\n\t</a>\n\t\n\t<div>\n\t\t<form action="/w/index.php" id="searchform"\n\t\t\tclass="vector-search-box-form">\n\t\t\t<div id="simpleSearch"\n\t\t\t\tclass="vector-search-box-inner"\n\t\t\t\t data-search-loc="header-moved">\n\t\t\t\t<input class="vector-search-box-input"\n\t\t\t\t\t type="search" name="search" placeholder="Search Wikipedia" aria-label="Search Wikipedia" autocapitalize="sentences" title="Search Wikipedia [f]" accesskey="f" id="searchInput"\n\t\t\t\t>\n\t\t\t\t<input type="hidden" name="title" value="Special:Search">\n\t\t\t\t<input id="mw-searchButton"\n\t\t\t\t\t class="searchButton mw-fallbackSearchButton" type="submit" name="fulltext" title="Search Wikipedia for this text" value="Search">\n\t\t\t\t<input id="searchButton"\n\t\t\t\t\t class="searchButton" type="submit" name="go" title="Go to a page with this exact name if it exists" value="Go">\n\t\t\t</div>\n\t\t</form>\n\t</div>\n</div>\n\n\t\t\t\t<nav class="vector-user-links" aria-label="Personal tools" role="navigation" >\n\t\n<div 
id="p-vector-user-menu-overflow" class="vector-menu mw-portlet mw-portlet-vector-user-menu-overflow"  >\n\t<div class="vector-menu-heading">\n\t\t\n\t</div>\n\t<div class="vector-menu-content">\n\t    \n\t    <ul class="vector-menu-content-list"><li id="pt-createaccount-2" class="user-links-collapsible-item mw-list-item"><a href="/w/index.php?title=Special:CreateAccoun'

Extracting readable text

soup = BeautifulSoup(contents, "html.parser")
# .text returns the text content of the whole page with all tags stripped
txt_dump = soup.text
type(txt_dump)

str

len(txt_dump)

10385
print(txt_dump[7000:8725])
start of a missile-launch countdown. The scene is replaced by a nuclear explosion, with Johnson's voice-over stating: "We must either love each other, or we must die." Although the Johnson campaign was criticized for frightening voters by implying that Goldwater would wage a nuclear war, various other campaigns since have adopted and used the "Daisy" advertisement.

Advertisement credit: Lyndon B. Johnson 1964 presidential campaign

Recently featured: 
Along the River During the Qingming Festival
Rega
Persicaria maculosa


Archive
More featured pictures




Other areas of Wikipedia

Community portal – The central hub for editors, with resources, links, tasks, and announcements.
Village pump – Forum for discussions about Wikipedia itself, including policies and technical issues.
Site news – Sources of news about Wikipedia and the broader Wikimedia movement.
Teahouse – Ask basic questions about using or editing Wikipedia.
Help desk – Ask questions about using or editing Wikipedia.
Reference desk – Ask research questions about encyclopedic topics.
Content portals – A unique way to navigate the encyclopedia.

Wikipedia's sister projects

Wikipedia is written by volunteer editors and hosted by the Wikimedia Foundation, a non-profit organization that also hosts a range of other volunteer projects:





CommonsFree media repository



MediaWikiWiki software development



Meta-WikiWikimedia project coordination



WikibooksFree textbooks and manuals



WikidataFree knowledge base



WikinewsFree-content news



WikiquoteCollection of quotations



WikisourceFree-content library



WikispeciesDirectory of species



WikiversityFree learning tools



WikivoyageFree travel guide



WiktionaryDictionary and

# Extracting "From today's featured article"
idx1 = txt_dump.find("From today's featured article")
idx2 = txt_dump.find("Recently featured")
print(txt_dump[idx1 + len("From today's featured article") : idx2])
Eastern city wall of Thessalonica

The siege of Thessalonica (1422–1430) was a successful campaign to capture the city by the Ottoman Empire under Sultan Murad II. It remained in Ottoman hands until 1912, when it became part of the Kingdom of Greece. Thessalonica had already been under Ottoman control from 1387 to 1403 before returning to Byzantine rule in the aftermath of the Battle of Ankara. In 1422 Murad attacked the city. Its ruler, Andronikos Palaiologos, was unable to provide manpower or resources for the city's defense, and handed it over to the Republic of Venice in September 1423. The Ottomans blockaded the city and attacked it by land. The blockade reduced the inhabitants to near starvation, and many fled the city. In 1429 Venice declared war on the Ottomans, and on 29 March 1430 Murad's forces took the city. The siege and the subsequent sack reduced the city to a shadow of its former self, from perhaps as many as 40,000 inhabitants to around 2,000. (Full article...)
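Slicing between two markers this way relies on both markers being present: `str.find` returns -1 when a marker is missing, which silently yields the wrong slice. A small sketch of a helper that makes that case explicit follows; `extract_between` is hypothetical and the sample string is made up.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either marker is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)  # search only after the start marker
    if end == -1:
        return None
    return text[start:end].strip()


sample = "Header From today's featured article Some summary. Recently featured: A, B"

print(extract_between(sample, "From today's featured article", "Recently featured"))
print(extract_between(sample, "Missing marker", "Recently featured"))
```

The first call prints the text between the two markers; the second prints `None` because the start marker does not occur in the sample.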

# Extracting "On this day"
idx3 = txt_dump.find("On this day")
print(txt_dump[idx3 + len("On this day") : idx3 + len("On this day") + 1000])
January 23



Elizabeth Blackwell

1556 – One of the deadliest earthquakes in history struck Shaanxi, China, resulting in at least 100,000 direct deaths.
1849 – Elizabeth Blackwell (pictured) graduated from Geneva Medical College in New York, making her the first woman to receive a medical degree in the United States.
1909 – Two men committed an armed robbery in Tottenham, London, and led police on a two-hour chase, partially by tram, that ended in the perpetrators' suicides.
1942 – World War II: Japan began an invasion of the island of New Britain in the Australian Territory of New Guinea.
1993 – The first version of Mosaic, created by Marc Andreessen and Eric Bina, was released, becoming the first popular web browser.
Mary Ward  (b. 1585)Ernst Abbe  (b. 1840)Louisa Cadamuro  (b. 1987)

More anniversaries: 
January 22
January 23
January 24


Archive
By email
List of days of the year




From today's featured list

Muddy Waters

The recording career of Muddy Waters, an American blues
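
# Collect the text of each <ul> inside the "On this day" box (div with id="mp-otd")
text_list = []
otd_div = soup.find("div", id="mp-otd")
if otd_div is not None:
    for ul in otd_div.find_all("ul"):
        text_list.append(ul.text)

# Print each list, separated by a horizontal rule
for text in text_list:
    print(text)
    print("-" * 80)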
1556 – One of the deadliest earthquakes in history struck Shaanxi, China, resulting in at least 100,000 direct deaths.
1849 – Elizabeth Blackwell (pictured) graduated from Geneva Medical College in New York, making her the first woman to receive a medical degree in the United States.
1909 – Two men committed an armed robbery in Tottenham, London, and led police on a two-hour chase, partially by tram, that ended in the perpetrators' suicides.
1942 – World War II: Japan began an invasion of the island of New Britain in the Australian Territory of New Guinea.
1993 – The first version of Mosaic, created by Marc Andreessen and Eric Bina, was released, becoming the first popular web browser.
--------------------------------------------------------------------------------
Mary Ward  (b. 1585)Ernst Abbe  (b. 1840)Louisa Cadamuro  (b. 1987)
--------------------------------------------------------------------------------
January 22
January 23
January 24
--------------------------------------------------------------------------------
Archive
By email
List of days of the year
--------------------------------------------------------------------------------
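In the output above, some neighbouring entries run together (for example the births line) because `.text` joins the strings of an element with no separator. One possible way to keep entries apart is to iterate over the individual `li` elements and call `get_text` with a separator. The sketch below uses a made-up fragment loosely modeled on the "On this day" box; the real Wikipedia markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; adjacent <li> elements have no whitespace between them
html = (
    '<div id="mp-otd"><ul>'
    '<li><a href="/wiki/Mary_Ward">Mary Ward</a> (b. 1585)</li>'
    '<li><a href="/wiki/Ernst_Abbe">Ernst Abbe</a> (b. 1840)</li>'
    "</ul></div>"
)

soup = BeautifulSoup(html, "html.parser")
box = soup.find("div", id="mp-otd")

# .text on the whole list fuses the entries together
fused = box.find("ul").text

# One cleaned string per list item: strings are stripped, then joined with a space
items = [li.get_text(separator=" ", strip=True) for li in box.find_all("li")]

print(fused)
for item in items:
    print(item)
```

Working at the `li` level also makes it easy to store each entry as a separate record.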