Crawl Book Information from Amazon with Erlang

This post demonstrates two different ways to fetch book ranking information from Amazon in Erlang. The input of the program is a file named isbn.txt, and the output is each book’s information (ISBN, title, and rank), displayed sorted by rank. We will do this job both sequentially and concurrently.

Let’s write our first function, readfile, which reads the input file and builds a list of ISBNs by tokenizing its contents.
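A minimal sketch of such a function is shown here. The module name isbn_reader and the choice of whitespace and newlines as separators are assumptions made for illustration; the original listing may differ.

```erlang
-module(isbn_reader).
-export([readfile/1]).

%% Read the whole file into memory and split it on spaces, tabs and
%% line endings. string:tokens/2 skips empty tokens, so blank lines
%% in the input do not produce empty ISBN entries.
readfile(FileName) ->
    {ok, Binary} = file:read_file(FileName),
    string:tokens(binary_to_list(Binary), " \t\r\n").
```

Calling isbn_reader:readfile("isbn.txt") returns a list of strings, one per ISBN found in the file.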

Now we have a list of ISBNs. For each ISBN we will call a function that will crawl the title and rank from Amazon. This function looks like this.

In this function, we first start the inets and ssl applications, which let us make HTTPS requests. Then we call amazon_url_for(ISBN), a one-line function that builds the Amazon URL for the book with a particular ISBN.

We have to define a string BASE_URL that contains the basic URL format of Amazon (<ISBN>).
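As a rough illustration, the URL helper could look like the sketch below. The module name amazon_url and the URL prefix in BASE_URL are assumptions; the value used in the original script is not reproduced here, so substitute whichever Amazon URL format you rely on.

```erlang
-module(amazon_url).
-export([amazon_url_for/1]).

%% Assumed URL prefix, for illustration only.
-define(BASE_URL, "https://www.amazon.com/dp/").

%% Build the book page URL by appending the ISBN (a string)
%% to the base URL.
amazon_url_for(ISBN) ->
    ?BASE_URL ++ ISBN.
```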

After generating the URL for a particular ISBN, we send a request to Amazon with a User-Agent header; the value is just an arbitrary string. We then parse the response HTML with regular expressions and extract the book’s sales rank and title. To get a better idea of how this works, look at the HTML source of any Amazon book page. Finally, we return the Title, ISBN, and Rank.
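The steps just described can be sketched roughly as follows. This is not the original listing: the module name book_fetch, the URL prefix, the User-Agent value, and above all the two regular expressions are assumptions. Amazon's markup changes over time, so the patterns will very likely need adjusting against the current page source, and depending on the OTP version httpc may need explicit TLS options.

```erlang
-module(book_fetch).
-export([fetch_title_and_rank/1, parse_page/2]).

-define(BASE_URL, "https://www.amazon.com/dp/").        % assumed prefix
-define(USER_AGENT, "Mozilla/5.0 (erlang-book-crawler)"). % arbitrary string

%% Download the book page for ISBN and extract {Title, ISBN, Rank}.
fetch_title_and_rank(ISBN) ->
    {ok, _} = application:ensure_all_started(inets),
    {ok, _} = application:ensure_all_started(ssl),
    Url = ?BASE_URL ++ ISBN,
    Headers = [{"User-Agent", ?USER_AGENT}],
    {ok, {{_Version, 200, _Reason}, _RespHeaders, Body}} =
        httpc:request(get, {Url, Headers}, [], []),
    parse_page(ISBN, Body).

%% Pure parsing step, kept separate so it can be tried on saved HTML.
%% Both patterns are guesses at the page markup.
parse_page(ISBN, Html) ->
    Title = capture(Html, "<title>([^<]*)</title>", "unknown"),
    RankStr = capture(Html, "#([0-9,]+) in Books", "0"),
    %% Drop thousands separators; a rank of 0 means "not found".
    Rank = list_to_integer([C || C <- RankStr, C =/= $,]),
    {Title, ISBN, Rank}.

%% Return the first capture group of Pattern in Subject, or Default.
capture(Subject, Pattern, Default) ->
    case re:run(Subject, Pattern, [{capture, all_but_first, list}]) of
        {match, [Value | _]} -> Value;
        nomatch -> Default
    end.
```

Keeping parse_page/2 free of network calls makes it easy to check the patterns against a page saved from the browser before running the crawler.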

Now we have our main building block: a function that fetches the title and sales rank for an ISBN. The complete script at the bottom calls this function both sequentially and concurrently and displays each book’s information. We use a timer to measure the time taken by each approach.
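To make the two approaches concrete, here is one plausible way to structure them, assuming one process per ISBN for the concurrent version and timer:tc/1 for the timing. The module name crawl_modes and its function names are hypothetical, and Fetch stands for any fun that takes an ISBN and returns {Title, ISBN, Rank}.

```erlang
-module(crawl_modes).
-export([sequential/2, concurrent/2, sort_by_rank/1, timed/1]).

%% Call Fetch for each ISBN, one after another.
sequential(Fetch, ISBNs) ->
    [Fetch(ISBN) || ISBN <- ISBNs].

%% Spawn one process per ISBN. Each worker sends its result back
%% tagged with a unique reference, and the results are collected
%% in the same order as the input list.
concurrent(Fetch, ISBNs) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, Fetch(ISBN)} end),
                Ref
            end || ISBN <- ISBNs],
    [receive {Ref, Result} -> Result end || Ref <- Refs].

%% Sort {Title, ISBN, Rank} tuples by their third element.
sort_by_rank(Books) ->
    lists:keysort(3, Books).

%% Run a zero-argument fun and return {ElapsedMilliseconds, Result}.
timed(Fun) ->
    {Micros, Result} = timer:tc(Fun),
    {Micros div 1000, Result}.
```

For example, crawl_modes:timed(fun() -> crawl_modes:concurrent(Fetch, ISBNs) end) returns the elapsed time together with the fetched books. Because each fetch spends most of its time waiting on the network, the concurrent version would be expected to finish in roughly the time of the slowest single request rather than the sum of all of them.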

To run this program:

  1. Save the script in a file called first.erl.
  2. Create a file called isbn.txt and put some ISBNs in it, one per line.
  3. cd to the directory that contains these two files.
  4. Run the following command in the terminal.

Complete Script

Sample output in terminal

Thanks to Md. Emtiaz Ahmed.

