How to perform Android web scraping using jsoup?

Suppose you have a website that is up and running. You can build a native Android application for it by parsing the HTML content of your web pages inside the app. This technique is generally called Android web scraping. On Android there is one cool library for web scraping: the jsoup library.

jsoup is an efficient HTML parser library. It provides a class called Elements that represents a list of elements. The Elements class implements Iterable, which lets us walk through it with a for-each loop. This is one reason why jsoup is a popular choice for Android web scraping.

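The first thing you need is the Gradle dependency for jsoup. Add it to the dependencies block of your app's build.gradle (with Android Gradle plugin 3.0 and later, write implementation in place of compile):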

compile 'org.jsoup:jsoup:1.10.1'

Some useful classes that jsoup provides for handling HTML responses are:
  • Document - holds the entire web page, which can then be queried using select()
  • Elements - holds the elements that match a particular tag (or CSS query)
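
To make these two classes concrete, here is a minimal sketch that parses a small, made-up HTML string instead of a live page. The class name SelectSketch, the method texts() and the sample markup are assumptions for illustration only; Jsoup.parse(), select() and text() are the jsoup calls being shown.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper class, not part of jsoup
class SelectSketch {

    // Returns the text of every element that matches cssQuery
    static List<String> texts(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);        // the whole page as a Document
        Elements matches = doc.select(cssQuery); // list of matching elements
        List<String> result = new ArrayList<>();
        for (Element match : matches) {          // Elements is iterable
            result.add(match.text());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<ul><li>one</li><li>two</li></ul>";
        System.out.println(texts(html, "li"));   // prints [one, two]
    }
}
```

Because Jsoup.parse() works on a plain string, a sketch like this can be tried without any network access.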

Before it gets boring, let us dive into the coding part. Android does not allow network calls on the main (UI) thread, so first create a new thread for the request, and make sure the app declares the INTERNET permission in its manifest. Then request the web page using Jsoup.connect() as shown below:

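// Needs: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
//        org.jsoup.select.Elements, java.io.IOException
new Thread(new Runnable() {
            @Override
            public void run() {
                final StringBuilder builder = new StringBuilder();

                try {
                    // Fetch and parse the page; this call blocks, so keep it off the UI thread
                    Document doc = Jsoup.connect("add_webpage_url_here").get();
                    String title = doc.title();
                    Elements links = doc.select("a[href]");
                    builder.append(title).append("\n");

                    // First table on the page and all of its rows
                    Element table = doc.select("table").get(0);
                    Elements rows = table.select("tr");

                } catch (IOException e) {
                    builder.append("Error : ").append(e.getMessage()).append("\n");
                }

                runOnUiThread(new Runnable() {
                    @Override
                    public void run() {
                        // Views may only be touched on the UI thread,
                        // so the progress view is hidden here, not in the catch block
                        bus_progress.setVisibility(View.GONE);
                        // Update your views here, e.g. with builder.toString()
                    }
                });
            }
        }).start();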

Here,

  •  doc.title() gives the title of the requested web page 
  •  doc.select("a[href]") gives the list of all the links in the web page
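
As a rough illustration of these two calls, the sketch below reads the title and collects the link targets from an HTML string. LinkSketch, its method names and the example.com base URL are hypothetical. It also uses absUrl("href") to turn relative links into absolute ones, which goes slightly beyond the snippet above; when a page is fetched with Jsoup.connect(), jsoup takes the base URL from the request.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup
class LinkSketch {

    // Text of the <title> element
    static String pageTitle(String html) {
        return Jsoup.parse(html).title();
    }

    // Every href on the page, resolved against baseUri
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) { // anchors without href are skipped
            urls.add(link.absUrl("href"));
        }
        return urls;
    }
}
```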

Consider a web page that has a table element like this:

[Image: the HTML table that we need to parse using jsoup]

and assume its HTML code is something like this:

<table id="bus-timing-chart" class="table table-responsive table-striped bus-timing-chart"> 
      <thead> 
       <tr> 
        <th>From</th> 
        <th>Via</th> 
        <th>To</th> 
        <th>Arrival</th> 
        <th>Departure</th> 
        <th>Bay</th> 
        <th>Bus Name</th> 
       </tr> 
      </thead> 
      <tbody> 
       <tr> 
        <td align="left" valign="top">test 1</td> 
        <td align="left" valign="top">test 2</td> 
        <td align="left" valign="top">test 3</td> 
        <td align="left" valign="top" style="width:9%;  ">test 4</td> 
        <td align="left" valign="top" style="width:9%;  ">test 5</td> 
        <td align="left" valign="top">test 6</td> 
        <td align="left" valign="top">test 7</td> 
       </tr> 
       <tr> 
        <td align="left" valign="top">test 1</td> 
        <td align="left" valign="top">test 2</td> 
        <td align="left" valign="top">test 3</td> 
        <td align="left" valign="top" style="width:9%;  ">test 4</td> 
        <td align="left" valign="top" style="width:9%;  ">test 5</td> 
        <td align="left" valign="top">test 6</td> 
        <td align="left" valign="top">test 7</td> 
       </tr> 
       <tr>
.
.
.

Now we can parse the table rows from the doc object like this:

Element table = doc.select("table").get(0); 
Elements rows = table.select("tr");

Now we have rows, which contains all the rows of the selected table element. You can loop through the rows and read the value of each column like this:

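// Row 0 is the header row, so start at 1
for (int i = 1; i < rows.size(); i++) {
    // getElementsByTag("td").get(0) gives the first cell of the row (i.e. "test 1");
    // text() returns its content without the surrounding <td> markup
    String from_place = rows.get(i).getElementsByTag("td").get(0).text();
}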
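
The loop above reads one column. A slightly fuller sketch, under the assumption that every data row consists of td cells and header rows use th cells only, collects all cell values of all rows. TableSketch and parseRows() are hypothetical names, and the HTML is passed in as a string so the sketch can be tried without a network call.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper class, not part of jsoup
class TableSketch {

    // One inner list per data row, one string per <td> cell
    static List<List<String>> parseRows(String html) {
        Document doc = Jsoup.parse(html);
        List<List<String>> result = new ArrayList<>();
        Element table = doc.select("table").first(); // null when the page has no table
        if (table == null) {
            return result;
        }
        for (Element row : table.select("tr")) {
            Elements cells = row.select("td");
            if (cells.isEmpty()) {
                continue; // header rows contain only <th> cells
            }
            List<String> values = new ArrayList<>();
            for (Element cell : cells) {
                values.add(cell.text());
            }
            result.add(values);
        }
        return result;
    }
}
```

Skipping rows without td cells avoids relying on the header being at index 0, and first() avoids the exception that get(0) throws when the page has no table.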


Congrats! You have now learned the basics of Android web scraping using the jsoup library.
