
Web Crawling with Node.js, it’s an interesting world

Time for some fun today! I wanted to scrape a website, something simple but unique, so I chose to scrape Google search results (oh, the irony!).

I am not at all a JavaScript expert, but picking up Node.js has been much more fun than my days of Python-based scraping (yes, I am old!). The obvious reason is that JS makes DOM parsing much more convenient, and with one of the gazillion JS-based scraping libraries you get going very fast.

Let’s dive straight into an example using Osmosis (https://github.com/rchipka/node-osmosis), the library I started with and a no-brainer starting point for anyone. We take a very simple example that fetches a Google search URL and then extracts some information about each result.

Examining Google Search DOM

This is what the HTML for a Google search result looks like. I have cleaned out a lot of what is actually there; you can always go and see it for yourself.

<div class="g">
  <h3 class="r"><a href="/URL to Hit/">Search Result</a></h3>
  <cite>/URL which is shown on Google search with .../</cite>
  <span class="st">Description of the website</span>
</div>

I have created this to show what the result above looks like on the page:

[Image: a result which would correspond to the HTML above]

Let’s scrape the results

Now let’s get started with the code. Do read my comments to understand what is going on:

var osmosis = require('osmosis');

var searchUrl = 'https://www.google.co.in/search?q=random+search';
osmosis
  .get(searchUrl)
  .find('.g')               // Find all outer div tags, one per search result
  .set({
    'title': '.r',          // Extract the properties we need from each result
    'url':   'cite',        // Similar to DOM extraction: .class/tag/#id/@attribute can be used to get values
    'link':  '.r @href',
    'text':  '.st'
  })
  .data(function(data) {
    console.log(data);      // Data here is one search result with the properties that we set above
  })
  .error(console.log)
  .debug(console.log);

Before showing you the output, take a moment to look at some really interesting properties of this code:

  • If you are a functional programmer, it might ring some bells for you:
    • .find() returns all the matching elements from the page :: Just like filter()
    • .set() lets you extract some meaningful data from each match you collected :: Just like map()
    • .data() lets you iterate over each of the results :: Just like forEach()
  • Asynchronous interface (why we all love JS)!
  • And last but not least, the way we access and chain classes/tags/attributes is extremely wonderful.
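
To make the functional analogy concrete, here is a rough sketch that uses plain arrays instead of a real page. It does not touch osmosis or the network; the `page` array and its `cls`, `title` and `href` fields are made up purely for illustration.

```javascript
// Made-up stand-in for a parsed page: a flat list of "nodes".
var page = [
  { cls: 'g',   title: 'Random search - Wikipedia', href: 'https://en.wikipedia.org/wiki/Random_search' },
  { cls: 'nav', title: 'Next',                      href: '/search?q=random+search&start=10' },
  { cls: 'g',   title: '15 Random Google Searches', href: 'http://www.makeuseof.com/tag/random-google-searches-learned/' }
];

var results = page
  .filter(function (node) { return node.cls === 'g'; })  // plays the role of .find('.g')
  .map(function (node) {                                 // plays the role of .set({...})
    return { title: node.title, link: node.href };
  });

results.forEach(function (result) {                      // plays the role of .data(...)
  console.log(result);
});
```

Only the two nodes with class `g` survive the filter, and each one is reduced to the fields we asked for before being printed.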

I might have left one question in your mind here: what if you want to collect another, unrelated set of information from the same page? Do you need to parse the DOM again? The short answer is yes. The long answer is that there is a hack to do it another way, which I will discuss shortly.

Here’s what the result looks like:

{ 
  title: 'Random search - Wikipedia',
  url: 'https://en.wikipedia.org/wiki/Random_search',
  link: 'https://en.wikipedia.org/wiki/Random_search',
  text: 'Random search (RS) is a family of numerical optimization methods that do not require the gradient of the problem to be optimized, and RS can hence be used ...' 
}
{
  title: '15 Random Google Searches & What We Learned From Them',
  url: 'www.makeuseof.com/tag/random-google-searches-learned/',
  link: 'http://www.makeuseof.com/tag/random-google-searches-learned/',
  text: 'Dec 3, 2014 - In particular, we take Google and the other search engines for granted. That needs to stop. Here. And now. By dissecting 15 completely random ...' 
}
{
  title: '[PDF]Random Search for Hyper-Parameter Optimization - Journal of ...',
  url: 'www.jmlr.org/papers/volume13/bergstra12a/bergstra12a',
  link: 'http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a',
  text: 'budget, random search finds better models by effectively searching a larger, less ... manual search and grid search, purely random search over the same ...' 
}
....
....

Go in deeper

Let’s take this a step further by opening each of the links and processing them:

var searchUrl = 'https://www.google.co.in/search?q=random+search';

/**** Same as above ****/
osmosis
  .get(searchUrl)
  .find('.g')               
  .set({
    'title': '.r', 
    'link':  '.r @href',         
    'url': 'cite',     
    'text': '.st'
  })
/**** Same as above ****/
  .follow('.r @href')         // Follow the link. Really that's it!!
  .set({
     'pageText': 'body'       // Set some property for the pageText by parsing body tag
  })
  .data(function(data) {
    console.log(data);        // Data here would be each search result with the properties that we set above
  })
  .error(console.log)
  .debug(console.log);

This got so interesting so fast! We are able to open all the URLs and get some data from each of them. The best thing about this? Yes, you guessed it right: we are taking full advantage of JS asynchronous IO, so you are hitting a series of pages in parallel and getting data out of them.

Note: An important thing here is that we keep collecting into the same object for each result. It makes a lot of sense here, because we opened the link that belongs to that result. It becomes much harder to follow if you collect data for 2 different tags that are not related. We will see that in the next section.

Result of the above:

{ 
  title: 'Random search - Wikipedia',
  url: 'https://en.wikipedia.org/wiki/Random_search',
  link: 'https://en.wikipedia.org/wiki/Random_search',
  text: 'Random search (RS) is a family of numerical optimization methods that do not require the gradient of the problem to be optimized, and RS can hence be used ...',
  pageText: 'Random search\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\tFrom Wikipedia, the free encyclopedia\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\tJump to:\t\t\t\t\tnavigation, \t\t\t\t\tsearch\n\t\t\t\t\n\t\t\t\tRandom search (RS) is a family of numerical optimization methods that do not require the gradient of the problem to be optimized, and RS can hence be used on functions that are not continuous or differentiable. Such optimization methods are also known as direct-search, derivative-free, or black....'
}
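
One way to picture the note above is as a plain object merge. This is only a mental model written in ordinary JavaScript, not a description of how osmosis works internally; the variable names are hypothetical and the values are shortened from the output above.

```javascript
// What the first .set() collected from the search page.
var fromSearchPage = {
  title: 'Random search - Wikipedia',
  link: 'https://en.wikipedia.org/wiki/Random_search'
};

// What the second .set() collected after following the link.
var fromFollowedPage = {
  pageText: 'Random search\nFrom Wikipedia, the free encyclopedia'
};

// The object handed to .data() looks like the two merged together.
var merged = Object.assign({}, fromSearchPage, fromFollowedPage);
console.log(Object.keys(merged)); // [ 'title', 'link', 'pageText' ]
```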

Goto Next!

If that wasn’t good enough for you, let’s do something even more challenging!
We will try navigating through Google’s search pages by going to the next page. I chose the next-page link for a reason: a lot of websites have one, and if you are trying to crawl a website, it is definitely something you need!

First, a quick look at Google’s lower nav bar (again, I cleaned out most of the stuff for you):

<table id="nav"><td class="navend"><a href="/**URL**/">Previous</a></td><td><a href="/**URL**/">1</a></td><td><a href="/**URL**/">2</a></td><td><a href="/**URL**/">3</a></td><td><a href="/**URL**/">4</a></td>
	.
	.
	.<td class="navend"><a href="/**URL**/">Next</a></td></table>

The aim now is to just get the URL from the last cell in the table. Here’s the code:

var searchUrl = 'https://www.google.co.in/search?q=random+search';
var nextUrl;

osmosis.get(searchUrl)
    .find('#nav td:last a')            // table with id nav -> last table cell -> a tag
    .set({
        'nextLink': '@href'            // href attribute
    })
/**** Same as above ****/
    .find('.g')
    .set({
        'title':    '.r',
        'url':      'cite',
        'link':     '.r @href',
        'text':     '.st'
    })
    .follow('.r @href')
    .set({
        'pageText': 'body'
    })
/**** Same as above ****/
    .data(function(data) {
        nextUrl = data.nextLink;       // Read the key that we set above
        console.log(data);
    })
    .error(console.log)
    .debug(console.log);

Can you guess the output here? Hard to guess, right?

{
  nextLink: '/search?q=random+search&ei=8Nq7WO7NHsuV0gSgwr64Bw&start=10&sa=N',
  title: 'Random search - Wikipedia',
  url: 'https://en.wikipedia.org/wiki/Random_search',
  link: 'https://en.wikipedia.org/wiki/Random_search',
  text: 'Random search (RS) is a family of numerical optimization methods that do not require the gradient of the problem to be optimized, and RS can hence be used ...',
  pageText: 'Random search...
}
{
 nextLink: '/search?q=random+search&ei=8Nq7WO7NHsuV0gSgwr64Bw&start=10&sa=N',
  title: 'Random Search - Clever Algorithms: Nature-Inspired Programming ...',
  url: 'www.cleveralgorithms.com › Table of Contents › Stochastic Algorithms',
  link: 'http://www.cleveralgorithms.com/nature-inspired/stochastic/random_search.html',
  text: 'Random search belon...
} 
...

People from the functional world are already taking up arms. Yes, nextLink is collected for every result, which means that the find operations on the DOM are completely disjoint: each find searches the page independently of the previous one, and whatever was set earlier is carried along into every later result. Why? The obvious benefit is that you don’t have to load the DOM again and again to parse multiple things.

Is this a disadvantage? Yes, it makes things harder to understand. If we had collected all the nav URLs (say n) instead of just the Next URL, we would have ended up with n × (number of results on the page) objects (a Cartesian product of the two sets, if you want to think mathematically). That’s why I call it a hacky way of doing things, but for extracting a single tag it seems okay to go ahead with.
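
The blow-up described above can be sketched with two plain arrays. This assumes the chain behaves the way I described; the `navLinks` and `resultTitles` values are made up for illustration and osmosis itself is not involved.

```javascript
// n = 3 hypothetical nav links and m = 2 hypothetical results.
var navLinks = ['/search?start=0', '/search?start=10', '/search?start=20'];
var resultTitles = ['Result A', 'Result B'];

// Every nav link gets paired with every result, just like nextLink
// showed up in every result object in the output above.
var combined = [];
navLinks.forEach(function (nav) {
  resultTitles.forEach(function (title) {
    combined.push({ nextLink: nav, title: title });
  });
});

console.log(combined.length); // 6, which is n * m
```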

We need to keep nextLink as part of each object that we finally get. The only other way would be to make a separate osmosis.get() call. Either way, I wouldn’t worry about this for normal crawling, which isn’t too memory intensive.

Now we have the URL for the next page. To keep things simple, I wait for the current page’s parsing to complete before going to the next page.

var osmosis = require('osmosis');

var searchUrl = 'https://www.google.co.in/search?q=random+search';

function open_page(url) {
    console.log("Opening " + url);
    var nextUrl;                          // Filled in from the results of this page

    osmosis.get(url)                      // Fetch the page we were asked to open
/**** Same as above ****/
        .find('#nav td:last a')
        .set({
            'nextLink': '@href'
        })
        .find('.g')
        .set({
            'title': '.r',
            'url':   'cite',
            'text':  '.st'
        })
        .data(function(data) {
            console.log(data);
            nextUrl = data.nextLink;      // Every result carries the same nextLink
        })
        .error(console.log)
        .debug(console.log)
/**** Same as above ****/
        .done(function() {
            // Open the next page when complete.
            // Using the event-driven model which JS is built upon
            if (nextUrl) {                // Stop when no Next link was found
                // nextLink already starts with '/', so no trailing slash here
                open_page('https://www.google.co.in' + nextUrl);
            }
        });
}

open_page(searchUrl);
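
The function above builds the next URL by string concatenation. As an alternative sketch, the URL class built into Node.js can resolve the relative link against the page it came from, which avoids worrying about slashes. The helper name `resolveNext` is hypothetical and the link value is shortened from the output shown earlier.

```javascript
// Resolve a relative "next" link against the URL of the current page.
// Returns null when there is no link, e.g. on the last page of results.
function resolveNext(currentUrl, nextLink) {
  if (!nextLink) {
    return null;
  }
  return new URL(nextLink, currentUrl).href;
}

var next = resolveNext(
  'https://www.google.co.in/search?q=random+search',
  '/search?q=random+search&start=10&sa=N'
);
console.log(next); // https://www.google.co.in/search?q=random+search&start=10&sa=N
```

The null return doubles as a natural stopping condition for a crawler that keeps following the next page.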

And that’s it! Given that it took me much less time to learn it than to write this post, I think this is going to be my choice of framework when doing web scraping in the future.
Here’s a link to the code: https://gist.github.com/kunalgrover05/75c31dc48fb44e63616409794b383b71
I haven’t run it long enough to see when my IP gets blacklisted, but worth a try 😉
