Questions tagged [web-scraping]
Web scraping is the process of extracting specific information from websites that do not readily provide an API or other methods of automated data retrieval. Questions about "How To Get Started With Scraping" (e.g. with Excel VBA) should be *thoroughly researched* as numerous functional code samples are available. Web scraping methods include 3rd-party applications, development of custom software, or even manual data collection in a standardized way.
                                	
            50,759 questions
        
        
            668 votes, 20 answers, 1.2m views
        
    How to find elements by class
                I'm having trouble parsing HTML elements with the "class" attribute using BeautifulSoup. The code looks like this:
soup = BeautifulSoup(sdata)
mydivs = soup.findAll('div')
for div in mydivs: 
    if (div["...
            
        
       
    
            378 votes, 3 answers, 83k views
        
    Headless Browser and scraping - solutions [closed]
                I'm trying to put together a list of possible solutions for automated browser test suites and headless browser platforms capable of scraping.
BROWSER TESTING / SCRAPING:
Selenium - polyglot flagship in browser ...
            
        
       
    
            343 votes, 26 answers, 147k views
        
    How do I prevent site scraping? [closed]
                I have a fairly large music website with a large artist database.  I've been noticing other music sites scraping our site's data (I enter dummy Artist names here and there and then do google searches ...
            
        
       
    
            300 votes, 27 answers, 480k views
        
    Scraping: SSL: CERTIFICATE_VERIFY_FAILED error for http://en.wikipedia.org [duplicate]
                I'm practicing the code from 'Web Scraping with Python', and I keep having this certificate problem:
from urllib.request import urlopen 
from bs4 import BeautifulSoup 
import re
pages = set()
def ...
            
        
       
    
            281 votes, 18 answers, 474k views
        
    How can I scrape a page with dynamic content (created by JavaScript) in Python?
                I'm trying to develop a simple web scraper. I want to extract plain text without HTML markup. My code works on plain (static) HTML, but not when content is generated by JavaScript embedded in the page....
            
        
       
    
            278 votes, 7 answers, 152k views
        
    How can I pass variable into an evaluate function?
                I'm trying to pass a variable into a page.evaluate() function in Puppeteer, but when I use the following very simplified example, the variable evalVar is undefined.
I can't find any examples to build ...
            
        
       
    
            264 votes, 6 answers, 679k views
        
    How can I get the Google cache age of any URL or web page? [closed]
                In my project I need the Google cache age to be added as important information. I tried to search sources for the Google cache age, that is, the number of days since Google last re-indexed the page ...
            
        
       
    
            211 votes, 3 answers, 214k views
        
    How can I efficiently parse HTML with Java?
                I do a lot of HTML parsing in my line of work. Up until now, I was using the HtmlUnit headless browser for parsing and browser automation.
Now, I want to separate both tasks.
I want to use a light ...
            
        
       
    
            205 votes, 18 answers, 350k views
        
    How to save an image locally using Python whose URL address I already know?
                I know the URL of an image on the Internet.
e.g. http://www.digimouth.com/news/media/2011/09/google-logo.jpg, which contains the logo of Google.
Now, how can I download this image using Python without ...
            
        
       
    
            197 votes, 10 answers, 220k views
        
    Web scraping with Python [closed]
                I'd like to grab daily sunrise/sunset times from a web site. Is it possible to scrape web content with Python? Which modules are used? Is there a tutorial available?
            
        
       
    
            191 votes, 9 answers, 375k views
        
    How can I use Python's Requests to fake a browser visit a.k.a and generate User Agent? [duplicate]
                I want to get the content from this website.
If I use a browser like Firefox or Chrome, I could get the real website page I want, but if I use the Python Requests package (or wget command) to get it, ...
            
        
       
    
            185 votes, 16 answers, 347k views
        
    retrieve links from web page using python and BeautifulSoup [closed]
                How can I retrieve the links of a webpage and copy the url address of the links using Python?
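A minimal sketch of the task described in this question, using only the standard library's `html.parser` instead of BeautifulSoup so that it runs without third-party packages. The names `LinkCollector` and `extract_links` are hypothetical, chosen for illustration. With BeautifulSoup installed, `soup.find_all('a', href=True)` is the usual way to select the same anchors.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    """Return all href values found in anchor tags, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

The function takes markup that has already been downloaded, so fetching the page (for example with `urllib.request.urlopen`) is a separate step. Relative links are returned unchanged and may need `urllib.parse.urljoin` to become absolute URLs.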
            
        
       
    
            184 votes, 14 answers, 320k views
        
    How do I avoid HTTP error 403 when web scraping with Python?
                When I try this code to scrape a web page:
#import requests
import urllib.request
from bs4 import BeautifulSoup
#from urllib import urlopen
import re
webpage = urllib.request.urlopen('http://www....
            
        
       
    
            163 votes, 4 answers, 130k views
        
    Scraping html tables into R data frames using the XML package
                How do I scrape html tables using the XML package?
Take, for example, this wikipedia page on the Brazilian soccer team. I would like to read it in R and get the "list of all matches Brazil have ...
            
        
       
    
            162 votes, 10 answers, 319k views
        
    can we use XPath with BeautifulSoup?
                I am using BeautifulSoup to scrape a URL, and I have the following code to find the td tag whose class is 'empformbody':
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
url =  &...
            
        
       
    
            156 votes, 11 answers, 182k views
        
    How to scrape only visible webpage text with BeautifulSoup?
                Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the body text (article) and maybe even ...
            
        
       
    
            114 votes, 2 answers, 170k views
        
    What's the best way of scraping data from a web site? [closed]
                I need to extract contents from a web site, but the application doesn’t provide any application programming interface or another mechanism to access that data programmatically.
I found a useful third-...
            
        
       
    
            109 votes, 3 answers, 352k views
        
    XPath:: Get following Sibling
                I have the following HTML structure. I am trying to build a robust method to extract the second color digest element, since there will be many of these tags within the DOM.
<table>
  <tbody>
    &...
            
        
       
    
            104 votes, 6 answers, 77k views
        
    What is the difference between web-crawling and web-scraping? [duplicate]
                Is there a difference between Crawling and Web-scraping?
If there's a difference, what's the best method to use in order to collect some web data to supply a database for later use in a customised ...
            
        
       
    
            102 votes, 2 answers, 112k views
        
    selenium with scrapy for dynamic page
                I'm trying to scrape product information from a webpage, using scrapy. My to-be-scraped webpage looks like this:
starts with a product_list page with 10 products
a click on "next"  button loads the ...
            
        
       
    
            99 votes, 5 answers, 172k views
        
    How to scrape a website which requires login using python and beautifulsoup?
                If I want to scrape a website that requires logging in with a password first, how can I start scraping it with Python using the beautifulsoup4 library? Below is what I do for websites that do not require login.
...
            
        
       
    
            87 votes, 7 answers, 206k views
        
    Using python Requests with javascript pages
                I am trying to use the Requests framework with python (http://docs.python-requests.org/en/latest/) but the page I am trying to get to uses javascript to fetch the info that I want. 
I have tried to ...
            
        
       
    
            86 votes, 8 answers, 87k views
        
    How to run Scrapy from within a Python script
                I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found 2 sources that explain this:
http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/
http://snipplr.com/...
            
        
       
    
            85 votes, 18 answers, 187k views
        
    Converting html to text with Python
                I am trying to convert an html block to text using Python.
Input:
<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum ...
            
        
       
    
            83 votes, 7 answers, 42k views
        
    crawler vs scraper
                Can somebody distinguish between a crawler and a scraper in terms of scope and functionality?
            
        
       
    
            83 votes, 7 answers, 48k views
        
    Extracting an information from web page by machine learning
                I would like to extract a specific type of information from web pages in Python, for example a postal address. It comes in thousands of forms, but it is still somehow recognizable. As there is a large number ...
            
        
       
    
            82 votes, 8 answers, 97k views
        
    Selenium-Debugging: Element is not clickable at point (X,Y)
                I am trying to scrape this site with Selenium.
I want to click the "Next Page" button. To do this, I use:
driver.find_element_by_class_name('pagination-r').click()
It works for many pages but not for ...
            
        
       
    
            82 votes, 3 answers, 145k views
        
    Is it ok to scrape data from Google results? [closed]
                I'd like to fetch results from Google using curl to detect potential duplicate content.
Is there a high risk of being banned by Google?
            
        
       
    
            82 votes, 4 answers, 57k views
        
    How to manage log in session through headless chrome?
                I want to create a scraper that:
opens a headless browser,
goes to a url,
logs in (there is Steam OAuth),
fills some inputs,
and clicks 2 buttons.
My problem is that every new instance of headless ...
            
        
       
    
            79 votes, 5 answers, 164k views
        
    Python - make a POST request using Python 3 urllib
                I am trying to make a POST request to the following page: http://search.cpsa.ca/PhysicianSearch
In order to simulate clicking the 'Search' button without filling out any of the form, which adds data ...
            
        
       
    
            77 votes, 7 answers, 27k views
        
    Web Scraping in a Google Chrome Extension (JavaScript + Chrome APIs)
                What are the best options for scraping a tab that is not currently open from within a Google Chrome extension, using JavaScript and whatever other technologies are available? Other JavaScript-...
            
        
       
    
            76 votes, 4 answers, 33k views
        
    Simple jQuery selector only selects first element in Chrome..?
                I'm a bit new to jQuery so forgive me for being dense. I want to select all <td> elements on a particular page via Chrome's JS console:
$('td')
Yet when I do this, I get the following output:
...
            
        
       
    
            76 votes, 10 answers, 143k views
        
    Web scraping with Java
                I'm not able to find any good Java-based web scraping API. The site I need to scrape does not provide an API either; I want to iterate over all web pages using some pageID and extract the HTML ...
            
        
       
    
            74 votes, 6 answers, 179k views
        
    What should I use to open a url instead of urlopen in urllib3
                I wanted to write a piece of code like the following:
from bs4 import BeautifulSoup
import urllib2
url = 'http://www.thefamouspeople.com/singers.php'
html = urllib2.urlopen(url)
soup = BeautifulSoup(...
            
        
       
    
            71 votes, 4 answers, 83k views
        
    Click a Button in Scrapy
                I'm using Scrapy to crawl a webpage. Some of the information I need only pops up when you click on a certain button (of course also appears in the HTML code after clicking).
I found out that Scrapy ...
            
        
       
    
            69 votes, 5 answers, 83k views
        
    Scrape An Entire Website [closed]
                I'm looking for recommendations for a program to scrape and download an entire corporate website.
The site is powered by a CMS that has stopped working and getting it fixed is expensive and we are ...
            
        
       
    
            69 votes, 9 answers, 158k views
        
    How to print an exception in Python 3?
                Right now, I catch the exception in the except Exception: clause, and do print(exception). The result provides no information since it always prints <class 'Exception'>. I knew this used to work ...
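One likely cause of the output described in this excerpt is printing the `Exception` class itself rather than the caught instance. A minimal sketch, assuming the goal is to show the error type and message; `risky` and `describe_failure` are hypothetical names used only for illustration:

```python
def risky():
    """Hypothetical function that always fails, for illustration only."""
    raise ValueError("something went wrong")


def describe_failure(func):
    """Call func and return a short description of any exception it raises."""
    try:
        func()
    except Exception as exc:  # "as exc" binds the caught instance to a name
        # str(exc) is the message; type(exc).__name__ is the class name
        return f"{type(exc).__name__}: {exc}"
    return "no error"


if __name__ == "__main__":
    print(describe_failure(risky))
```

When the full stack trace is wanted instead of a one-line summary, the standard `traceback` module (for example `traceback.print_exc()` inside the `except` block) is the usual tool.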
            
        
       
    
            69 votes, 5 answers, 101k views
        
    Get meta tag content property with BeautifulSoup and Python
                I am trying to use python and beautiful soup to extract the content part of the tags below:
<meta property="og:title" content="Super Fun Event 1" />
<meta property="og:url" content="http://...
            
        
       
    
            67 votes, 4 answers, 191k views
        
    Using BeautifulSoup to extract text without tags
                My webpage looks like this:
<p>
  <strong class="offender">YOB:</strong> 1987<br/>
  <strong class="offender">RACE:</strong> WHITE<br/>
  <strong class="...
            
        
       
    
            66 votes, 6 answers, 24k views
        
    How to manage a 'pool' of PhantomJS instances
                I'm planning a webservice for my own use internally that takes one argument, a URL, and returns html representing the resolved DOM from that URL. By resolved I mean that the webservice will firstly ...
            
        
       
    
            66 votes, 9 answers, 46k views
        
    Scrape web pages in real time with Node.js
                What's a good way to scrape website content using Node.js? I'd like to build something very, very fast that can execute searches in the style of kayak.com, where one query is dispatched to several ...
            
        
       
    
            65 votes, 4 answers, 72k views
        
    Python: Disable images in Selenium Google ChromeDriver
                I spent a lot of time searching for this.
At the end of the day I combined a number of answers and it works. I share my answer and I'll appreciate it if anyone edits it or provides us with an easier ...
            
        
       
    
            65 votes, 8 answers, 111k views
        
    Puppeteer - Protocol error (Page.navigate): Target closed
                As you can see in the sample code below, I'm using Puppeteer with a cluster of workers in Node to handle multiple requests for website screenshots from a given URL:
const cluster = require('cluster');
...
            
        
       
    
            65 votes, 10 answers, 28k views
        
    Web scraping - how to identify main content on a webpage
                Given a news article webpage (from any major news source such as times or bloomberg), I want to  identify the main article content on that page and throw out the other misc elements such as ads, menus,...
            
        
       
    
            63 votes, 4 answers, 74k views
        
    csv.writer writing each character of word in separate column/cell
                Objective: To extract the text from the anchor tag inside all lines in models and put it in a csv.
I'm trying this code: 
with open('Sprint_data.csv', 'ab') as csvfile:
  spamwriter = csv.writer(...
            
        
       
    
            63 votes, 5 answers, 135k views
        
    How can I download a file on a click event using selenium?
                I am working with Python and Selenium. I want to download a file from a click event using Selenium. I wrote the following code.
from selenium import webdriver
from selenium.common.exceptions import ...
            
        
       
    
            62 votes, 5 answers, 57k views
        
    Change IP address dynamically? [closed]
                Consider this case:
I want to crawl websites frequently, but my IP address gets blocked after some days or after reaching some limit.
So, how can I change my IP address dynamically? Are there any other ideas?
            
        
       
    
            62 votes, 6 answers, 63k views
        
    Save and render a webpage with PhantomJS and node.js
                I'm looking for an example of requesting a webpage, waiting for the JavaScript to render (JavaScript modifies the DOM), and then grabbing the HTML of the page.
This should be a simple example with an ...
            
        
       
    
            57 votes, 10 answers, 113k views
        
    How to "scan" a website (or page) for info, and bring it into my program?
                Well, I'm pretty much trying to figure out how to pull information from a webpage, and bring it into my program (in Java). 
For example, if I know the exact page I want info from, for the sake of ...
            
        
       
    
            57 votes, 3 answers, 56k views
        
    Scraping a JSON response with Scrapy
                How do you use Scrapy to scrape web requests that return JSON? For example, the JSON would look like this:
{
    "firstName": "John",
    "lastName": "Smith",
    "age": 25,
    "address": {
        "...
            
        
       
     
         
         
         
         
         
         
         
         
         
        