Gutenberg

This notebook scrapes Project Gutenberg's Top 100 eBooks page to identify the links to the eBooks. BeautifulSoup parses the HTML, and regular expressions extract the file numbers of the Top 100 eBooks from those links.

Importing libraries and packages

# Data gathering
import re
import requests

from bs4 import BeautifulSoup as bs

Reading HTML

url = "https://www.gutenberg.org/browse/scores/top#books-last1"

response = requests.get(url)
response
<Response [200]>
type(response)
requests.models.Response
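
One detail about the URL: the "#books-last1" part is a fragment. HTTP clients resolve fragments locally and do not send them to the server, so the response holds the whole page, not only the "yesterday" list. A minimal sketch that separates the two parts with the standard library's urllib.parse.urldefrag (the names page_url and fragment are illustrative and do not appear in the original notebook):

```python
from urllib.parse import urldefrag

url = "https://www.gutenberg.org/browse/scores/top#books-last1"

# Split the URL into the part that is requested and the client-side fragment
page_url, fragment = urldefrag(url)

print(page_url)
print(fragment)
```

This is why the later steps still have to locate the right list inside the full page.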

Checking status of the request

def status_check(r):
    # Report whether the request returned HTTP 200 (OK)
    if r.status_code == 200:
        print("Success!")
        return 1
    else:
        print("Failed!")
        return -1


status_check(response)
Success!
1
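
The integer return codes work, but a boolean is often easier to use in an if statement. The following is a hedged variant that assumes only that the object passed in has a status_code attribute. The name is_success and the SimpleNamespace stand-ins are hypothetical and used for illustration; they are not part of requests.

```python
from types import SimpleNamespace


def is_success(r):
    """Return True for any 2xx status code, False otherwise."""
    return 200 <= r.status_code < 300


# Stand-in objects so the sketch runs without a network call
ok = SimpleNamespace(status_code=200)
missing = SimpleNamespace(status_code=404)

print(is_success(ok), is_success(missing))
```

With a real requests response, response.raise_for_status() is another option: it raises an exception on 4xx and 5xx status codes instead of returning a value.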

Decoding the contents

def encoding_check(r):
    # Encoding that requests inferred for the response
    return r.encoding


def decode_content(r, encoding):
    # Decode the raw response bytes into a str
    return r.content.decode(encoding)


contents = decode_content(response, encoding_check(response))
type(contents)
str
len(contents)
58980
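
decode_content assumes that the response reports an encoding. The encoding attribute can be None when the headers give no usable hint, and passing None to bytes.decode raises a TypeError. Below is a small sketch of a more defensive decoder. The function decode_bytes and its fallback parameter are hypothetical; UTF-8 is assumed as the fallback because the page itself declares charset UTF-8 in its meta tag.

```python
def decode_bytes(raw, encoding=None, fallback="utf-8"):
    """Decode raw bytes, using the fallback when no encoding is reported."""
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising
    return raw.decode(encoding or fallback, errors="replace")


sample = "José Rizal".encode("utf-8")
print(decode_bytes(sample, None))
print(decode_bytes(sample, "utf-8"))
```

In the notebook this would be called as decode_bytes(response.content, response.encoding). requests also exposes response.text, which returns the body already decoded to a str.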
contents[:10000]
'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n <meta charset="UTF-8"/>\n\n<title>Top 100 | Project Gutenberg</title>\n <link rel="stylesheet" href="/gutenberg/style.css?v=1.1">\n <link rel="stylesheet" href="/gutenberg/collapsible.css?1.1">\n <link rel="stylesheet" href="/gutenberg/new_nav.css?v=1.321231">\n<link rel="stylesheet" href="/gutenberg/pg-desktop-one.css">\n <meta name="viewport" content="width=device-width, initial-scale=1">\n <meta name="keywords" content="books, ebooks, free, kindle, android, iphone, ipad"/>\n <meta name="google-site-verification" content="wucOEvSnj5kP3Ts_36OfP64laakK-1mVTg-ptrGC9io"/>\n <meta name="alexaVerifyID" content="4WNaCljsE-A82vP_ih2H_UqXZvM"/>\n <link rel="copyright" href="https://www.gnu.org/copyleft/fdl.html"/>\n <link rel="shortcut icon" href="/gutenberg/favicon.ico?v=1.1"/>\n\n <meta property="og:title"        content="Project Gutenberg" />\n <meta property="og:type"         content="website" />\n <meta property="og:url"          content="https://www.gutenberg.org/" />\n <meta property="og:description"  content="Project Gutenberg is a library of free eBooks." 
/>\n <meta property="fb:admins"       content="615269807" />\n <meta property="fb:app_id"       content="115319388529183" />\n <meta property="og:site_name"    content="Project Gutenberg" />\n <meta property="og:image"        content="https://www.gutenberg.org/gutenberg/pg-logo-144x144.png" />\n </head>\n <body>\n  <div class="container"><!-- start body --><nav>\n  <!--<div id="main_logo"> -->\n  <a id="main_logo" href="/" class="no-hover">\n    <img src="/gutenberg/pg-logo-129x80.png" alt="Project Gutenberg" draggable="false" />\n  </a>\n  <!--\t</div>-->\n  <div id="menu">\n    <label for="tm" id="toggle-menu">Menu<span class="drop-icon">&#9662;</span></label>\n    <input type="checkbox" id="tm" />\n    <ul class="main-menu cf">\n      <li>\n\t<a href="/about/">About\n          <span class="drop-icon">&#9662;</span>\n      \t</a>\n        <label title="Toggle Drop-down" class="drop-icon" for="sm0">&#9662;</label>\n     \t<input type="checkbox" id="sm0" />\n     \t<ul class="sub-menu">\n\t  <li><a href="/about/">About Project Gutenberg</a></li>\n          <li><a href="/policy/collection_development.html">Collection Development</a></li>\n          <li><a href="/about/contact_information.html">Contact Us</a></li>\n          <li><a href="/about/background/">History &amp; Philosophy</a></li>\n          <li><a href="/policy/permission.html">Permissions &amp; License</a></li>\n          <li><a href="/policy/privacy_policy.html">Privacy Policy</a></li>\n          <li><a href="/policy/terms_of_use.html">Terms of Use</a></li>\n\t</ul>\n      </li>\n      <li>\n\t<a href="/ebooks/">Search and Browse\n      \t  <span class="drop-icon">&#9662;</span>\n      \t</a>\n\t<label title="Toggle Drop-down" class="drop-icon" for="sm8">&#9662;</label>\n        <input type="checkbox" id="sm8" />\n        <ul class="sub-menu">\n\t  <li><a href="/ebooks/">Book Search</a></li>\n\t  <li><a href="/ebooks/bookshelf/">Bookshelves</a></li>\n\t  <li><a href="/browse/scores/top">Frequently 
Downloaded</a></li>\n\t  <li><a href="/ebooks/offline_catalogs.html">Offline Catalogs</a></li>\n\t</ul>\n      </li>\n      <li>\n\t<a href="/help/">Help\n          <span class="drop-icon">&#9662;</span>\n   \t</a>\n         <label title="Toggle Drop-down" class="drop-icon" for="sm3">&#9662;</label>\n   \t<input type="checkbox" id="sm3" />\n    \t<ul class="sub-menu">\n          <li><a href="/help/">All help topics &rarr;</a></li>\n          <li><a href="/help/copyright.html">Copyright Procedures</a></li>\n          <li><a href="/help/errata.html">Errata, Fixes and Bug Reports</a></li>\n          <li><a href="/help/file_formats.html">File Formats</a></li>\n          <li><a href="/help/faq.html">Frequently Asked Questions</a></li>\n          <li><a href="/policy/">Policies &rarr;</a></li>\n          <li><a href="/help/public_domain_ebook_submission.html">Public Domain eBook Submission</a></li>\n          <li><a href="/help/submitting_your_own_work.html">Submitting Your Own Work</a></li>\n          <li><a href="/help/mobile.html">Tablets, Phones and eReaders</a></li>\n          <li><a href="/attic/">The Attic &rarr;</a></li>\n        </ul>\n      </li>\n      <li><a href="/donate/">Donate</a></li>\n    </ul>\n  </div>\n  <div class="donate">\n  <div class="searchbox">\n    <form method="get" action="/ebooks/search/" accept-charset="utf-8" enctype="multipart/form-data" class="searchbox">\n      <input type="text" value="" id="menu-book-search" name="query" class="searchInput" title="Quick search" tabindex="20" size="20" maxlength="80" placeholder="  Quick search" />\n      <input type="submit" name="submit_search" value="Go!" 
style="vertical-align:middle;" />\n    </form>\n  </div>\n    <form class="donatelink" action="https://www.paypal.com/cgi-bin/webscr" method="post" target="new">\n      <p><a href="/donate/">Donation</a></p>\n      <input type="hidden" name="cmd" value="_s-xclick" />\n      <input type="hidden" name="hosted_button_id" value="XKAL6BZL3YPSN" />\n      <input class="donbtn" type="image" src="/pics/en_US.gif" name="submit" alt="Donate via PayPal" />\n    </form>\n  </div>\n</nav>\n  <div class="page_content"><!-- start content -->\t\n<h1>Frequently Viewed or Downloaded</h1>\n<p>These listings are based on the number of times each eBook gets downloaded.\n      Multiple downloads from the same Internet address on the same day count as one download, and addresses that download more than 100 eBooks in a day are considered robots and are not counted.</p>\n\n<table>\n  <caption>Downloaded Books</caption>\n  <tr><th>2023-01-22</th><td class="right">247217</td></tr>\n  <tr><th>last 7 days</th><td class="right">1863916</td></tr>\n  <tr><th>last 30 days</th><td class="right">7043785</td></tr>\n</table>\n <div class="padded">\n  <ul>\n   <li><a href="#books-last1">Top 100 EBooks yesterday</a></li>\n   <li><a href="#authors-last1">Top 100 Authors yesterday</a></li>\n   <li><a href="#books-last7">Top 100 EBooks last 7 days</a></li>\n   <li><a href="#authors-last7">Top 100 Authors last 7 days</a></li>\n   <li><a href="#books-last30">Top 100 EBooks last 30 days</a></li>\n   <li><a href="#authors-last30">Top 100 Authors last 30 days</li>\n  </ul>\n </div>\n<h2 id="books-last1">Top 100 EBooks yesterday</h2>\n\n<ol>\n<li><a href="/ebooks/1513">Romeo and Juliet by William Shakespeare (7049)</a></li>\n<li><a href="/ebooks/2641">A Room with a View by E. M.  
Forster (6038)</a></li>\n<li><a href="/ebooks/145">Middlemarch by George Eliot (5777)</a></li>\n<li><a href="/ebooks/37106">Little Women; Or, Meg, Jo, Beth, and Amy by Louisa May Alcott (5326)</a></li>\n<li><a href="/ebooks/16389">The Enchanted April by Elizabeth Von Arnim (5241)</a></li>\n<li><a href="/ebooks/67979">The Blue Castle: a novel by L. M.  Montgomery (5092)</a></li>\n<li><a href="/ebooks/100">The Complete Works of William Shakespeare by William Shakespeare (5011)</a></li>\n<li><a href="/ebooks/394">Cranford by Elizabeth Cleghorn Gaskell (4947)</a></li>\n<li><a href="/ebooks/6761">The Adventures of Ferdinand Count Fathom — Complete by T.  Smollett (4927)</a></li>\n<li><a href="/ebooks/2701">Moby Dick; Or, The Whale by Herman Melville (4887)</a></li>\n<li><a href="/ebooks/2160">The Expedition of Humphry Clinker by T.  Smollett (4769)</a></li>\n<li><a href="/ebooks/4085">The Adventures of Roderick Random by T.  Smollett (4715)</a></li>\n<li><a href="/ebooks/6593">History of Tom Jones, a Foundling by Henry Fielding (4500)</a></li>\n<li><a href="/ebooks/5197">My Life — Volume 1 by Richard Wagner (4396)</a></li>\n<li><a href="/ebooks/1259">Twenty Years After by Alexandre Dumas (4385)</a></li>\n<li><a href="/ebooks/1342">Pride and Prejudice by Jane Austen (1575)</a></li>\n<li><a href="/ebooks/84">Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley (1498)</a></li>\n<li><a href="/ebooks/8581">The Art of Money Getting; Or, Golden Rules for Making Money by P. T.  Barnum (1374)</a></li>\n<li><a href="/ebooks/43696">The Story of Mary MacLane by Mary MacLane (879)</a></li>\n<li><a href="/ebooks/11">Alice\'s Adventures in Wonderland by Lewis Carroll (814)</a></li>\n<li><a href="/ebooks/1661">The Adventures of Sherlock Holmes by Arthur Conan Doyle (685)</a></li>\n<li><a href="/ebooks/64317">The Great Gatsby by F. 
Scott  Fitzgerald (600)</a></li>\n<li><a href="/ebooks/98">A Tale of Two Cities by Charles Dickens (581)</a></li>\n<li><a href="/ebooks/1952">The Yellow Wallpaper by Charlotte Perkins Gilman (577)</a></li>\n<li><a href="/ebooks/174">The Picture of Dorian Gray by Oscar Wilde (576)</a></li>\n<li><a href="/ebooks/345">Dracula by Bram Stoker (543)</a></li>\n<li><a href="/ebooks/28054">The Brothers Karamazov by Fyodor Dostoyevsky (538)</a></li>\n<li><a href="/ebooks/2542">A Doll\'s House : a play by Henrik Ibsen (531)</a></li>\n<li><a href="/ebooks/69856">Orpheus: or, The Music of the Future by W. J. Turner (504)</a></li>\n<li><a href="/ebooks/1080">A Modest Proposal by Jonathan Swift (502)</a></li>\n<li><a href="/ebooks/69857">Pride and Passion: Robert Burns, 1759-1796 by DeLancey Ferguson (486)</a></li>\n<li><a href="/ebooks/5200">Metamorphosis by Franz Kafka (477)</a></li>\n<li><a href="/ebooks/46">A Christmas Carol in Prose; Being a Ghost Story of Christmas by Charles Dickens (468)</a></li>\n<li><a href="/ebooks/2591">Grimms\' Fairy Tales by Jacob Grimm and Wilhelm Grimm (466)</a></li>\n<li><a href="/ebooks/2131">An Account of Egypt by Herodotus (458)</a></li>\n<li><a href="/ebooks/69854">The Christmas Makers\' Club by Edith A.  Sawyer (454)</a></li>\n<li><a href="/ebooks/69855">Letters, sentences and maxims by Earl of Philip Dormer Stanhope Chesterfield (441)</a></li>\n<li><a href="/ebooks/20228">Noli Me Tangere by José Rizal (438)</a></li>\n<li><a href="/ebooks/6130">The Iliad by Homer (424)</a></li>\n<li><a href="/ebooks/1400">Great Expectations by Charles Dickens (421)</a></li>\n<li><a href="/ebooks/4300">Ulysses by James Joyce (407)</a></li>\n<li><a href="/ebooks/25344">The Scarlet Letter by Nathaniel Hawthorne (398)</a></li>\n<li><a href="/eboo'

Extracting readable text

soup = bs(contents, "html.parser")

# Empty list to hold the href value of every link in the HTML page
list_links = []
# Find all the <a> tags and store their href attributes in the list
for link in soup.find_all("a"):
    list_links.append(link.get("href"))

list_links[:30]
['/',
 '/about/',
 '/about/',
 '/policy/collection_development.html',
 '/about/contact_information.html',
 '/about/background/',
 '/policy/permission.html',
 '/policy/privacy_policy.html',
 '/policy/terms_of_use.html',
 '/ebooks/',
 '/ebooks/',
 '/ebooks/bookshelf/',
 '/browse/scores/top',
 '/ebooks/offline_catalogs.html',
 '/help/',
 '/help/',
 '/help/copyright.html',
 '/help/errata.html',
 '/help/file_formats.html',
 '/help/faq.html',
 '/policy/',
 '/help/public_domain_ebook_submission.html',
 '/help/submitting_your_own_work.html',
 '/help/mobile.html',
 '/attic/',
 '/donate/',
 '/donate/',
 '#books-last1',
 '#authors-last1',
 '#books-last7']
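
The flat list above mixes navigation links, in-page anchors, and book links, so any later step has to know where the book links start. One way to avoid that is to collect only the links inside the ordered list that follows the heading with id "books-last1". The sketch below uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages. SectionLinkParser is a hypothetical class, and sample_html is a shortened, hand-made sample modelled on the page structure shown earlier.

```python
from html.parser import HTMLParser


class SectionLinkParser(HTMLParser):
    """Collect hrefs from the first <ol> after <h2 id=section_id>."""

    def __init__(self, section_id):
        super().__init__()
        self.section_id = section_id
        self.seen_heading = False
        self.in_list = False
        self.done = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        attrs = dict(attrs)
        if tag == "h2" and attrs.get("id") == self.section_id:
            self.seen_heading = True
        elif tag == "ol" and self.seen_heading:
            self.in_list = True
        elif tag == "a" and self.in_list and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        # Stop collecting once the target list closes
        if tag == "ol" and self.in_list:
            self.in_list = False
            self.done = True


# Shortened sample; the second list's contents are illustrative only
sample_html = """
<ul>
 <li><a href="#books-last1">Top 100 EBooks yesterday</a></li>
 <li><a href="#books-last7">Top 100 EBooks last 7 days</a></li>
</ul>
<h2 id="books-last1">Top 100 EBooks yesterday</h2>
<ol>
<li><a href="/ebooks/1513">Romeo and Juliet by William Shakespeare (7049)</a></li>
<li><a href="/ebooks/2641">A Room with a View by E. M.  Forster (6038)</a></li>
</ol>
<h2 id="books-last7">Top 100 EBooks last 7 days</h2>
<ol>
<li><a href="/ebooks/145">Middlemarch by George Eliot</a></li>
</ol>
"""

parser = SectionLinkParser("books-last1")
parser.feed(sample_html)
print(parser.hrefs)
```

Scoping by heading id rather than by position keeps the result stable if the navigation menu gains or loses entries.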
booknumbers = []

# Indices 19 to 118 of list_links; this range also covers the six
# "#books-last1"-style anchors, which contribute 1, 1, 7, 7, 30, 30 below
for i in range(19, 119):
    link = list_links[i]
    link = link.strip()
    # Regular expression to find the runs of digits in the link (href) string
    n = re.findall("[0-9]+", link)
    if len(n) == 1:
        # Append the file number cast as an integer
        booknumbers.append(int(n[0]))

print("\nThe file numbers for the top 100 ebooks on Gutenberg:\n" + "-" * 80)
print(booknumbers)
The file numbers for the top 100 ebooks on Gutenberg:
--------------------------------------------------------------------------------
[1, 1, 7, 7, 30, 30, 1513, 2641, 145, 37106, 16389, 67979, 100, 394, 6761, 2701, 2160, 4085, 6593, 5197, 1259, 1342, 84, 8581, 43696, 11, 1661, 64317, 98, 1952, 174, 345, 28054, 2542, 69856, 1080, 69857, 5200, 46, 2591, 2131, 69854, 69855, 20228, 6130, 1400, 4300, 25344, 69851, 2554, 42108, 1260, 76, 408, 219, 1497, 2814, 844, 1184, 1232, 43, 996, 2600, 4363, 30254, 58585, 69859, 205, 768, 67098, 5740, 7370, 27827, 1727, 5827, 26184, 16328, 45, 33283, 120, 209, 3207, 244, 2680, 35, 158, 135, 55, 15399, 16, 36, 3206]
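
The fixed slice range(19, 119) depends on how many navigation links precede the list. That is why the output above starts with 1, 1, 7, 7, 30, 30 (taken from anchors such as "#books-last1") and holds only 86 actual book numbers. A sketch of a pattern-based alternative follows, assuming that book links always have the form /ebooks/<digits> and that the "yesterday" list is the first list of such links on the page. The function extract_book_numbers and its limit parameter are hypothetical.

```python
import re

# Matches hrefs such as "/ebooks/1513" but not "/ebooks/" or "#books-last1"
BOOK_LINK = re.compile(r"^/ebooks/(\d+)$")


def extract_book_numbers(links, limit=100):
    """Return up to `limit` book numbers, in page order."""
    numbers = []
    for link in links:
        if link is None:  # <a> tags without an href yield None
            continue
        match = BOOK_LINK.match(link.strip())
        if match:
            numbers.append(int(match.group(1)))
        if len(numbers) == limit:
            break
    return numbers


sample_links = ["/", "/ebooks/", "#books-last1", "#books-last7",
                "/ebooks/1513", "/ebooks/2641", None]
print(extract_book_numbers(sample_links))
```

In the notebook this would be called as extract_book_numbers(list_links).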
print(soup.text[:2100])
Top 100 | Project Gutenberg



























Menu▾



About
          ▾

▾


About Project Gutenberg
Collection Development
Contact Us
History & Philosophy
Permissions & License
Privacy Policy
Terms of Use



Search and Browse
      	  ▾

▾


Book Search
Bookshelves
Frequently Downloaded
Offline Catalogs



Help
          ▾

▾


All help topics →
Copyright Procedures
Errata, Fixes and Bug Reports
File Formats
Frequently Asked Questions
Policies →
Public Domain eBook Submission
Submitting Your Own Work
Tablets, Phones and eReaders
The Attic →


Donate










Donation







Frequently Viewed or Downloaded
These listings are based on the number of times each eBook gets downloaded.
      Multiple downloads from the same Internet address on the same day count as one download, and addresses that download more than 100 eBooks in a day are considered robots and are not counted.

Downloaded Books
2023-01-22247217
last 7 days1863916
last 30 days7043785



Top 100 EBooks yesterday
Top 100 Authors yesterday
Top 100 EBooks last 7 days
Top 100 Authors last 7 days
Top 100 EBooks last 30 days
Top 100 Authors last 30 days


Top 100 EBooks yesterday

Romeo and Juliet by William Shakespeare (7049)
A Room with a View by E. M.  Forster (6038)
Middlemarch by George Eliot (5777)
Little Women; Or, Meg, Jo, Beth, and Amy by Louisa May Alcott (5326)
The Enchanted April by Elizabeth Von Arnim (5241)
The Blue Castle: a novel by L. M.  Montgomery (5092)
The Complete Works of William Shakespeare by William Shakespeare (5011)
Cranford by Elizabeth Cleghorn Gaskell (4947)
The Adventures of Ferdinand Count Fathom — Complete by T.  Smollett (4927)
Moby Dick; Or, The Whale by Herman Melville (4887)
The Expedition of Humphry Clinker by T.  Smollett (4769)
The Adventures of Roderick Random by T.  Smollett (4715)
History of Tom Jones, a Foundling by Henry Fielding (4500)
My Life — Volume 1 by Richard Wagner (4396)
Twenty Years After by Alexandre Dumas (4385)
Pride and Prejudice by Jane Austen (1575)
Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley (1498)
The Art
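list_titles_temp = []
list_titles = []
# "Top 100 EBooks yesterday" occurs twice: first in the link index, then
# 8 lines later as the section heading
skiprows = 8
lines = soup.text.splitlines()
start_idx = lines.index("Top 100 EBooks yesterday") + skiprows

# Skip the heading and the blank line after it, then collect 100 entries
for i in range(100):
    list_titles_temp.append(lines[start_idx + 2 + i])

# Keep only the leading run of ASCII letters and spaces. This drops the
# download count but also cuts an entry at the first period, semicolon,
# apostrophe, or accented letter
for i in range(100):
    id1, id2 = re.match("^[a-zA-Z ]*", list_titles_temp[i]).span()
    list_titles.append(list_titles_temp[i][id1:id2])

for title in list_titles:
    print(title)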
Romeo and Juliet by William Shakespeare 
A Room with a View by E
Middlemarch by George Eliot 
Little Women
The Enchanted April by Elizabeth Von Arnim 
The Blue Castle
The Complete Works of William Shakespeare by William Shakespeare 
Cranford by Elizabeth Cleghorn Gaskell 
The Adventures of Ferdinand Count Fathom 
Moby Dick
The Expedition of Humphry Clinker by T
The Adventures of Roderick Random by T
History of Tom Jones
My Life 
Twenty Years After by Alexandre Dumas 
Pride and Prejudice by Jane Austen 
Frankenstein
The Art of Money Getting
The Story of Mary MacLane by Mary MacLane 
Alice
The Adventures of Sherlock Holmes by Arthur Conan Doyle 
The Great Gatsby by F
A Tale of Two Cities by Charles Dickens 
The Yellow Wallpaper by Charlotte Perkins Gilman 
The Picture of Dorian Gray by Oscar Wilde 
Dracula by Bram Stoker 
The Brothers Karamazov by Fyodor Dostoyevsky 
A Doll
Orpheus
A Modest Proposal by Jonathan Swift 
Pride and Passion
Metamorphosis by Franz Kafka 
A Christmas Carol in Prose
Grimms
An Account of Egypt by Herodotus 
The Christmas Makers
Letters
Noli Me Tangere by Jos
The Iliad by Homer 
Great Expectations by Charles Dickens 
Ulysses by James Joyce 
The Scarlet Letter by Nathaniel Hawthorne 
The medieval Inquisition
Crime and Punishment by Fyodor Dostoyevsky 
The Slang Dictionary
Jane Eyre
Adventures of Huckleberry Finn by Mark Twain 
The Souls of Black Folk by W
Heart of Darkness by Joseph Conrad 
The Republic by Plato 
Dubliners by James Joyce 
The Importance of Being Earnest
The Count of Monte Cristo
The Prince by Niccol
The Strange Case of Dr
Don Quixote by Miguel de Cervantes Saavedra 
War and Peace by graf Leo Tolstoy 
Beyond Good and Evil by Friedrich Wilhelm Nietzsche 
The Romance of Lust
The Prophet by Kahlil Gibran 
Table Traits
Walden
Wuthering Heights by Emily Bront
Winnie
Tractatus Logico
Second Treatise of Government by John Locke 
The Kama Sutra of Vatsyayana by Vatsyayana 
The Odyssey by Homer 
The Problems of Philosophy by Bertrand Russell 
Simple Sabotage Field Manual by United States
Beowulf
Anne of Green Gables by L
Calculus Made Easy by Silvanus P
Treasure Island by Robert Louis Stevenson 
The Turn of the Screw by Henry James 
Leviathan by Thomas Hobbes 
A Study in Scarlet by Arthur Conan Doyle 
Meditations by Emperor of Rome Marcus Aurelius 
The Time Machine by H
Emma by Jane Austen 
Les Mis
The Wonderful Wizard of Oz by L
The Interesting Narrative of the Life of Olaudah Equiano
Peter Pan by J
The War of the Worlds by H
Moby Multiple Language Lists of Common Words by Grady Ward 
Narrative of the Captivity and Restoration of Mrs
Thus Spake Zarathustra
Westminster Abbey
Autobiography of Benjamin Franklin by Benjamin Franklin 
An index finger by Tulis Abrojal 
The Hound of the Baskervilles by Arthur Conan Doyle 
Little Women by Louisa May Alcott 
The Adventures of Tom Sawyer
The Works of Edgar Allan Poe 
Narrative of the Life of Frederick Douglass
Anna Karenina by graf Leo Tolstoy 
Essays of Michel de Montaigne 
The Legend of Sleepy Hollow by Washington Irving 
The murder of Roger Ackroyd by Agatha Christie
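
The truncated entries above (for example "A Room with a View by E" and "Alice") come from the pattern ^[a-zA-Z ]*, which stops at the first character that is not an ASCII letter or a space. One possible alternative is sketched below. It assumes that every entry ends with the download count in parentheses and that the last " by " separates the title from the author. The helper parse_entry is hypothetical.

```python
import re

# Everything up to the trailing "(digits)" is the title and author text
ENTRY = re.compile(r"^(?P<rest>.*?)\s*\((?P<count>\d+)\)\s*$")


def parse_entry(line):
    """Split 'Title by Author (123)' into (title, author, count)."""
    match = ENTRY.match(line)
    if match is None:
        return None
    rest = " ".join(match.group("rest").split())  # collapse repeated spaces
    title, sep, author = rest.rpartition(" by ")
    if not sep:  # no " by " present: treat the whole text as the title
        title, author = rest, None
    return title, author, int(match.group("count"))


print(parse_entry("A Room with a View by E. M.  Forster (6038)"))
print(parse_entry("Alice's Adventures in Wonderland by Lewis Carroll (814)"))
```

Splitting on the last " by " is a heuristic: it would misplace the boundary for an author name that itself contains " by ", so results are worth spot-checking.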