Web UI manipulation for testing, data extraction, or web scraping is a complicated process. Websites that rely heavily on JavaScript and nested iframe structures are hard to scrape with simple HTTP requests. Capybara, paired with the Poltergeist driver, makes it straightforward to retrieve data from even complicated websites. This tutorial explains web scraping with Ruby and Capybara, and shows in detail how to reuse Capybara's cookies in RestClient gem requests for smooth data extraction.

What exactly is Capybara?

Capybara is a test automation library that simulates user scenarios and automates web application testing for behavior-driven development. In essence, it drives the browser the way a real user would.

Capybara is written in Ruby, which makes it easy to express how users interact with the application, and it can drive various kinds of browsers, letting you run tests through a simple, clean interface.

Apart from testing your web application, you can submit forms, fill in fields, execute JavaScript code, and so on. In other words, with Capybara you can act like a real user and exercise the browser in different scenarios.

Capybara serves many purposes. For this tutorial, our web harvesting experts have focused on one prominent use: data extraction. Search engines like Google crawl millions of web pages with similar techniques. Today, we will learn how to use Capybara cookies to obtain the authorization details required for quick, hassle-free web UI manipulation.

Using Capybara Cookies in RestClient Requests

In the following tutorial, we will go through the process of data extraction using Capybara cookies.

We will start by creating a new Capybara session and making a request to a sample page:

require 'capybara/poltergeist'
Capybara.register_driver(:poltergeist) { |app| Capybara::Poltergeist::Driver.new(app) }

page = Capybara::Session.new(:poltergeist)
page.visit("http://somesamplesite.com")

At this point, you will need to sign in to a user account. The sign-in form is submitted by JavaScript, which may also generate a security token that is not present in the page source.

page.fill_in 'user[email]', with: 'email'
page.fill_in 'user[password]', with: 'password'
page.click_button 'Sign in'

The above code generally works well; Capybara simulates the user's behavior in the browser and submits the sign-in form even when JavaScript is responsible for the form submission or for adding the security token.

Once signed in, you have access to the user account and can start extracting data from the website.

Let’s take an example. Suppose you want to scrape the list of a user's blog posts from a website. This is what you would do in Capybara:

require 'nokogiri'

page.visit("http://somesamplesite.com/account/posts")
body = page.body
html = Nokogiri::HTML(body)

posts = []
html.css("#blog_posts .post").each do |post_node|
  title = post_node.css("h2").first&.text
  content = post_node.css(".content p").text

  posts << {
    title: title,
    content: content
  }
end

posts

This code may not be as fast as you anticipate: Capybara waits for the page, including its JavaScript, to finish loading before the results are available. Note the use of Nokogiri, a handy library for parsing HTML documents.
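Nokogiri is the usual choice for this kind of parsing. If it is not available, Ruby's bundled REXML library can perform a similar extraction using XPath instead of CSS selectors; here is a minimal sketch over a tiny sample document (the markup is invented for illustration):

```ruby
require 'rexml/document'

# A tiny sample document standing in for the real page body.
html = <<~HTML
  <div id="blog_posts">
    <div class="post"><h2>First post</h2></div>
    <div class="post"><h2>Second post</h2></div>
  </div>
HTML

doc = REXML::Document.new(html)

# XPath equivalent of the CSS selector "#blog_posts h2".
titles = REXML::XPath.match(doc, "//div[@id='blog_posts']//h2").map(&:text)

titles # => ["First post", "Second post"]
```

REXML expects well-formed markup, so for real-world HTML (which is often not well-formed XML) Nokogiri remains the more robust choice.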

It is also important to remember that JavaScript can rewrite the content of the page, making the scraping process harder. In that scenario, find the endpoint the page loads its data from and fetch the data before JavaScript formats it. Your browser's Dev Tools (the Network tab) will help you locate that endpoint.

Let’s take an example to understand this better.

Let’s say you found an endpoint at the URL http://somesamplesite.com/users/posts.json. This endpoint is secured, which means you have to sign in to access the user data. The authorization credentials are stored in the browser cookies, and here is how you can use Capybara to get them:

cookies = {}

# Each Capybara cookie is a two-element array: the first element is the cookie name,
# the second is a Capybara::Poltergeist::Cookie object whose attributes are mapped
# to the cookie's attributes.
page.driver.cookies.each do |cookie|
  cookies[cookie.first] = cookie.last.value
end
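Since each Poltergeist cookie object responds to `value` (along with attributes such as `domain` and `path`), the loop above reduces every entry to a simple name-to-value pair. As a self-contained illustration, with a stand-in `Struct` playing the role of `Capybara::Poltergeist::Cookie` (the real objects come from the driver), the transformation looks like this:

```ruby
# Stand-in for Capybara::Poltergeist::Cookie; the real driver object
# also exposes attributes such as domain, path, and expires.
FakeCookie = Struct.new(:value)

# page.driver.cookies yields [name, cookie_object] pairs, like a Hash.
driver_cookies = {
  "_session_id"    => FakeCookie.new("abc123"),
  "remember_token" => FakeCookie.new("xyz789")
}

cookies = {}
driver_cookies.each do |cookie|
  cookies[cookie.first] = cookie.last.value
end

# cookies now maps each cookie name to its plain string value.
```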

Make sure you collect all the cookies, because it is impossible to tell from the outside which ones the website uses for authorization.
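RestClient accepts this cookie hash directly through its `cookies` option, as shown in the next snippet. If you ever need to send the same cookies with a bare `Net::HTTP` request instead, you can serialize the hash into a `Cookie` header yourself (a minimal sketch; real cookie values may need escaping):

```ruby
cookies = { "_session_id" => "abc123", "remember_token" => "xyz789" }

# Join the pairs into the "name=value; name=value" format of the Cookie header.
cookie_header = cookies.map { |name, value| "#{name}=#{value}" }.join("; ")

cookie_header # => "_session_id=abc123; remember_token=xyz789"
```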

The cookies let you authenticate with the system and retrieve the blog post data. Once we have them, we can request the blog posts like this:

require 'rest-client'
require 'json'

response = RestClient.get("http://somesamplesite.com/users/posts.json", {cookies: cookies})
json_posts = JSON.parse(response.body)
posts = []

json_posts.each do |post|
  posts << {
    title: post['title'],
    content: post['content']
  }
end

posts
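The JSON returned by the endpoint is just text until it is parsed. Using Ruby's standard `json` library, the post-collecting step above can be exercised against a sample payload (the field names `title` and `content` are assumptions about this hypothetical endpoint):

```ruby
require 'json'

# A sample payload in the shape the hypothetical endpoint might return.
body = '[{"title":"First post","content":"Hello"},{"title":"Second post","content":"World"}]'

json_posts = JSON.parse(body)

# Collect each post as a symbol-keyed hash, as in the loop above.
posts = json_posts.map do |post|
  { title: post['title'], content: post['content'] }
end

# posts now holds one {title:, content:} hash per blog post.
```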

And that is it!

The steps above let you retrieve the blog post data without any hassle. The endpoint-based code is fast, effective, and easy to test. Better still, it does not depend on the website's UI, so UI changes will not break it.

Final thoughts: Web UI manipulation with Capybara is simple and time-saving. It works with all kinds of websites, and because it lets you retrieve data through endpoints, it remains unaffected by complicated JavaScript code that manipulates the page content.

Ruby is a powerful programming language that plays a key role in myriad web development tasks, including web data extraction. Ruby uses ‘gems’, such as Capybara, that offer ready-made functionality to speed up development. We specialize in languages like Ruby and tools like Capybara to help your enterprise extract the large amounts of data that critical analysis depends on. If you need any assistance with web or software development, contact us right away. Our software experts would love to help with your projects.