Tuesday, January 21, 2014

How to crawl websites in multiple languages using C#

In this article we will create a console app that crawls a web page and reads data from it using C#. HttpWebRequest and HttpWebResponse can also be used to get the response from a web page, but the example given below uses HtmlAgilityPack, which makes it much easier to read the title, headings and other HTML tags.
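For comparison, here is a minimal sketch of the HttpWebRequest/HttpWebResponse route mentioned above. The class name RawFetch and the helpers ReadAll and Fetch are made up for illustration, and the sketch assumes the page is served as UTF-8. It only returns the raw HTML as a string; picking out the title or headings would still need separate parsing.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class RawFetch
{
    // Hypothetical helper: reads a whole stream as text in the given encoding.
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    // Hypothetical helper: downloads a page and decodes the body as UTF-8.
    public static string Fetch(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (Stream body = response.GetResponseStream())
        {
            return ReadAll(body, Encoding.UTF8);
        }
    }

    static void Main()
    {
        string html = Fetch("http://deebujacob.blogspot.com/2013/03/rendering-multiple-series-in-highcharts.html");
        Console.WriteLine("Downloaded " + html.Length + " characters of raw HTML.");
    }
}
```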

1)      Create a console application.

2)      Download HtmlAgilityPack from http://htmlagilitypack.codeplex.com/

3)      Copy the HtmlAgilityPack DLL to a folder inside the project and add a reference to HtmlAgilityPack.dll.

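4)      Next, create a WebClient and pass it the URL to read. To read content in all languages (such as Chinese, Japanese, Indonesian and Russian), make sure the response is decoded as UTF-8 by passing Encoding.UTF8 when loading the document. This assumes the page itself is served as UTF-8.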
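To see why step 4 insists on UTF-8, the following sketch decodes the same bytes twice, once as UTF-8 and once as ISO-8859-1. It uses only the standard library. The class EncodingDemo, the helper Decode and the sample markup are made up for illustration; the byte array stands in for what a server would send.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Hypothetical helper: decodes raw response bytes with the given encoding.
    public static string Decode(byte[] raw, Encoding encoding)
    {
        return encoding.GetString(raw);
    }

    static void Main()
    {
        // Made-up multilingual markup, encoded the way a UTF-8 page would arrive.
        string original = "<title>日本語 Русский</title>";
        byte[] raw = Encoding.UTF8.GetBytes(original);

        string good = Decode(raw, Encoding.UTF8);
        string bad = Decode(raw, Encoding.GetEncoding("iso-8859-1"));

        Console.WriteLine(good == original); // True: UTF-8 round-trips the text
        Console.WriteLine(bad == original);  // False: the wrong encoding garbles it
    }
}
```

When printing non-English text to the console, setting Console.OutputEncoding to Encoding.UTF8 may also be needed for the characters to display correctly, depending on the console's font and settings.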

The complete code for crawling and reading data from a web page is given below:

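using System;
using System.IO;
using System.Net;
using System.Text;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        HtmlDocument htmlDoc = new HtmlDocument();

        // Download the page and parse the response stream as UTF-8.
        using (WebClient webclient = new WebClient())
        using (Stream stream = webclient.OpenRead("http://deebujacob.blogspot.com/2013/03/rendering-multiple-series-in-highcharts.html"))
        {
            htmlDoc.Load(stream, Encoding.UTF8);
        }

        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode title = htmlDoc.DocumentNode.SelectSingleNode("//title");
        HtmlNode heading = htmlDoc.DocumentNode.SelectSingleNode("//body//h1");
        Console.WriteLine(title != null ? title.InnerText : "(no title found)");
        Console.WriteLine(heading != null ? heading.InnerText : "(no h1 found)");

        // Keep the console window open until Enter is pressed.
        Console.ReadLine();
    }
}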
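The same approach extends to other tags. As a rough sketch, the snippet below collects every link target from a page with SelectNodes. The class TagReader, the helper Hrefs and the inline HTML string are made up for illustration; in the crawler the HTML would come from the downloaded page instead.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class TagReader
{
    // Hypothetical helper: returns the href of every <a> element that has one.
    public static List<string> Hrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                result.Add(link.GetAttributeValue("href", ""));
            }
        }
        return result;
    }

    static void Main()
    {
        // Made-up HTML standing in for a downloaded page.
        string html = "<html><head><title>Demo</title></head><body>" +
                      "<h1>Заголовок</h1><a href=\"/a\">一</a><a href=\"/b\">二</a>" +
                      "</body></html>";

        foreach (string href in Hrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```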

