Selenium Python Web Scraping



Oct 03, 2018 Summary: We learnt how to scrape a website using Selenium in Python and get large amounts of data. You can carry out multiple unstructured data analyses and find interesting trends, sentiments, etc., using this data. If anyone is interested in looking at the complete code, here is the link to my GitHub. Let me know if this was helpful.


Beginner's guide to web scraping with Python's Selenium

In the first part of this series, we introduced ourselves to the concept of web scraping using two Python libraries, requests and BeautifulSoup, to achieve this task. The results were then stored in a JSON file. In this walkthrough, we'll tackle web scraping with a slightly different approach using the Selenium Python library. We'll then store the results in a CSV file using the pandas library.

The code used in this example is on GitHub.

Why use Selenium

Selenium is a framework designed to automate tests for web applications. You can write a Python script to control browser interactions automatically, such as link clicks and form submissions. In addition to all this, Selenium comes in handy when we want to scrape data from JavaScript-generated content on a webpage, that is, when the data shows up only after a number of AJAX requests. Nonetheless, both BeautifulSoup and Scrapy are perfectly capable of extracting data from a webpage. The choice of library boils down to how the data in that particular webpage is rendered.

Another problem one might encounter while web scraping is the possibility of your IP address being blacklisted. I partnered with Scraper API, a startup specializing in strategies that ease the worry of your IP address being blocked while web scraping. They utilize IP rotation so you can avoid detection, boasting over 20 million IP addresses and unlimited bandwidth.

In addition to this, they provide CAPTCHA handling as well as a headless browser, so that you'll appear to be a real user and not get detected as a web scraper. For more on its usage, check out my post on web scraping with Scrapy, although you can use it with both BeautifulSoup and Selenium.

If you want more info as well as an intro to the Scrapy library, check out my post on the topic.

Using this Scraper API link and the code lewis10, you'll get a 10% discount on your first purchase!

For additional resources on the Selenium library and best practices, check out this article by Towards Data Science and this one by AccordBox.

Setting up

We'll be using two Python libraries, selenium and pandas. To install them, simply run pip install selenium pandas.


In addition to this, you'll need a browser driver to simulate browser sessions. Since I am on Chrome, we'll be using that for the walkthrough.

Driver downloads

  1. Chrome.

Getting started

For this example, we'll be extracting data from Quotes to Scrape, a site specifically made to practise web scraping on. We'll then extract all the quotes and their authors and store them in a CSV file.
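A minimal sketch of the setup (the chromedriver path is a placeholder; point it at the driver executable you downloaded):

```python
from selenium.webdriver import Chrome
import pandas as pd

# placeholder: path to the chromedriver executable you downloaded
webdriver = "/path/to/chromedriver"
driver = Chrome(webdriver)
```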

The code above imports the Chrome driver and the pandas library. We then make an instance of Chrome by using driver = Chrome(webdriver). Note that the webdriver variable will point to the driver executable we downloaded previously for our browser of choice. If you happen to prefer Firefox, import like so:
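```python
from selenium.webdriver import Firefox
# Firefox needs its own driver (geckodriver), downloaded separately
```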

Main script

On close inspection of the site's URL, we'll notice that the pagination URL is http://quotes.toscrape.com/js/page/{{current_page_number}}/

where the last part is the current page number. Armed with this information, we can proceed to make a pages variable to store the exact number of web pages to scrape data from. In this instance, we'll be extracting data from just 10 web pages in an iterative manner.
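A sketch of that loop (the variable names are illustrative):

```python
pages = 10  # number of pages to scrape

for page in range(1, pages + 1):
    url = f"http://quotes.toscrape.com/js/page/{page}/"
    driver.get(url)
    # ...extract the quotes on this page (covered below)...
```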

The driver.get(url) command makes an HTTP GET request to our desired webpage. From here, it's important to know the exact number of items to extract from the webpage. From our previous walkthrough, we defined web scraping as:

This is the process of extracting information from a webpage by taking advantage of patterns in the web page's underlying code.

We can use web scraping to gather unstructured data from the internet, process it and store it in a structured format.

On inspecting each quote element, we observe that each quote is enclosed within a div with the class name quote. By running the directive driver.find_elements_by_class_name('quote') we get a list of all elements within the page exhibiting this pattern.
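Inside the page loop, that lookup is a single call:

```python
quotes = driver.find_elements_by_class_name("quote")  # every div.quote on the page
```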

Final step

To begin extracting the information from the webpages, we'll take advantage of the aforementioned patterns in the web pages' underlying code.

We'll start by iterating over the quote elements. This allows us to go over each quote and extract a specific record. On inspecting the page's markup, we notice that each quote is enclosed within a span with the class text, and the author within a small tag with a class name of author.

Finally, we store the quote_text and author variables in a tuple, which we proceed to append to a Python list named total.
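Putting those steps together, a minimal sketch of the extraction loop (variable names follow the walkthrough):

```python
total = []  # initialised once, before the page loop

for quote in quotes:
    quote_text = quote.find_element_by_class_name("text").text   # the quote itself
    author = quote.find_element_by_class_name("author").text     # the quote's author
    total.append((quote_text, author))
```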

Using the pandas library, we'll initialize a DataFrame to store all the records (the total list) and specify the column names as quote and author. Finally, we'll export the DataFrame to a CSV file, which we named quoted.csv in this case.
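A sketch of that final step:

```python
df = pd.DataFrame(total, columns=["quote", "author"])
df.to_csv("quoted.csv", index=False)  # index=False drops pandas' numeric row index
```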

Don't forget to close the Chrome driver using driver.close().

Additional resources

1. Finding elements

You'll notice that I used the find_elements_by_class_name method in this walkthrough. This is not the only way to find elements. This tutorial by Klaus explains in detail how to use other selectors.

2. Video

If you prefer to learn using videos, this series by Lucid Programming was very useful to me: https://www.youtube.com/watch?v=zjo9yFHoUl8

3. Best practices while using Selenium

4. Toptal's guide to modern web scraping with Selenium

And with that, hopefully, you too can make a simple web scraper using Selenium 😎.

If you enjoyed this post, subscribe to my newsletter to get notified whenever I write new posts.

Open to collaboration

I recently made a collaborations page on my website. Have an interesting project in mind, or want to fill a part-time role? You can now book a session with me directly from my site.

Thanks.

Selenium is a widely used tool for web automation. It comes in handy for automating website tests or helping with web scraping, especially for sites that require JavaScript to be executed. In this article, I will show you how to get up to speed with Selenium using Python.

What is Selenium?

Selenium’s mission is simple: its purpose is to automate web browsers. If you need to execute the same task on a website again and again, it can be automated with Selenium. This is especially the case when you carry out routine web administration tasks, but also when you need to test a website. You can automate it all with Selenium.

With this simple goal, Selenium can be used for many different purposes, for instance web scraping. Many websites run client-side scripts to present data in an asynchronous way. This can cause issues when you are trying to scrape sites in which the data you need is rendered through JavaScript. Selenium comes to the rescue here by automating the browser to visit the site and run the client-side scripts, giving you the required HTML. If you simply used the Python requests package to get HTML from a site that runs client-side code, the HTML you get back wouldn't be complete.

There are many other cases for using Selenium. In the meantime let’s get to using Selenium with Python.

Installing Selenium

Before you begin, you need to download the driver for your particular browser. This article is written using Chrome. You can download the Chrome driver to use with Selenium by clicking here.

The next step is to install the necessary Selenium Python package in your environment. It can be done using the following pip command:
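```
pip install selenium
```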

Selenium 101


To begin using Selenium, you need to instantiate a Selenium webdriver. This class will then control the web browser, and you can take various actions as if you were the one navigating the browser, such as navigating to a URL or clicking on a button. Let's see how to do that using Python.

First, import the necessary modules and instantiate a Selenium webdriver. You need to provide the path to the chromedriver.exe you downloaded earlier.
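A minimal sketch (the path below is a placeholder; adjust it to where you saved the driver):

```python
from selenium import webdriver

# placeholder path: point this at the chromedriver.exe you downloaded
driver = webdriver.Chrome(executable_path=r"C:\path\to\chromedriver.exe")
```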

After executing the command, a new browser window will open up specifying that it is being controlled by automated testing software.

In some cases, you may get an error when Chrome opens and need to disable extensions to remove the error message. To pass options to Chrome when starting it, use the following code.
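For example, a sketch that disables extensions (again, the driver path is a placeholder):

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--disable-extensions")

driver = webdriver.Chrome(
    executable_path=r"C:\path\to\chromedriver.exe",  # placeholder path
    options=options,
)
```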

Now, let’s navigate to a specific URL, in our case Google’s homepage, by executing the get function.
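```python
driver.get("https://www.google.com")
```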

Locate and Enter a Value into a Textbox

What do you do on Google? You search! Let's use Selenium to perform an automated search on Google. First, you need to learn how to locate items.

Selenium provides many options to do so. You can find web elements by ID, Name, Text and many others. Read on here to get the full list.

We will be locating the textbox by name. Google’s input textbox has a name of q. Let’s find this element with Selenium.
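```python
search_box = driver.find_element_by_name("q")
```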

Once this element is found, enter your search into it. We will search for this site by executing the following method.
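A sketch, with a placeholder query (substitute whatever you want to search for):

```python
search_box.send_keys("web scraping with selenium")  # placeholder query
```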

Lastly, send an “Enter” command as you would from your keyboard.
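```python
from selenium.webdriver.common.keys import Keys

search_box.send_keys(Keys.ENTER)
```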

Wait for an Element to Load

As mentioned earlier, many times the page you are browsing to doesn't completely load at first; rather, it executes client-side code that takes longer to finish, and you need to wait for it before continuing. Selenium provides functionality to achieve this using the WebDriverWait class. Let's see how to do this.

TipRanks.com is a site that lets you see the track record and measured performance of any analyst or blogger you come across. We will browse to Apple's analysis page, which upon access runs JavaScript to generate the charts. Our code will wait until these are generated before continuing.

First, we need to import additional modules for our sample, such as By, expected_conditions and the WebDriverWait class. Expected conditions provide functionality for common situations that frequently come up when automating web browsers, for example detecting the visibility of elements.

After accessing the page, we will wait for a maximum of 10 seconds until a specific CSS class becomes visible. We are looking for the span.fs-13 element that becomes visible once the charts complete loading.
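A sketch of the wait (the exact analysis-page URL is an assumption):

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get("https://www.tipranks.com/stocks/aapl/stock-analysis")  # assumed URL

# wait up to 10 seconds for span.fs-13 to become visible
element = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, "span.fs-13"))
)
```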

Get Page HTML

Once the driver has loaded a page and it's rendered completely, either by waiting for elements to load or just navigating to the page, you can extract the page's rendered HTML quite easily with Selenium. This can then be processed using BeautifulSoup or other packages to get information from it.


Run the following command to get the page HTML.
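```python
html = driver.page_source
```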

Conclusion

Selenium makes web automation very easy allowing you to perform advanced tasks by automating your web browser. We learned how to get Selenium ready to use with Python and its most important tasks such as navigating to a site, locating elements, entering information and waiting for items to load. Hope this article was helpful and stay tuned for more!




