Tuesday, January 21, 2014

How to crawl websites in multiple languages using C#

In this article we will create a console app that crawls a webpage and reads data from it using C#. HttpWebRequest and HttpWebResponse can also be used to get the response from a webpage, but I used HtmlAgilityPack to read the data in the example given below. It is much easier to read the title, headers and other HTML tags using HtmlAgilityPack.
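For comparison, the HttpWebRequest/HttpWebResponse route mentioned above might look roughly like the sketch below. The class name, the helper names and the example.com URL are my own placeholders, not part of the original example. Note that this only gives you the raw HTML as a string; picking out the title or headers is still left to you, which is where HtmlAgilityPack helps. On newer .NET versions WebRequest.Create is marked obsolete in favour of HttpClient, so treat this as an illustration for the classic .NET Framework.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

class HttpWebRequestSketch
{
    // Reads a stream to the end as UTF-8 text. Kept separate from the
    // network call so it can be tried with any Stream.
    public static string ReadAsUtf8(Stream stream)
    {
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Downloads the raw HTML of a page with HttpWebRequest/HttpWebResponse.
    public static string DownloadHtml(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            return ReadAsUtf8(stream);
        }
    }

    static void Main(string[] args)
    {
        // Placeholder URL (an assumption); pass a real one as the first argument.
        string url = args.Length > 0 ? args[0] : "http://example.com/";
        string html = DownloadHtml(url);
        Console.WriteLine(html.Length + " characters downloaded");
    }
}
```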

1) Create a console application.

2) Download HtmlAgilityPack from http://htmlagilitypack.codeplex.com/

3) Copy the HtmlAgilityPack DLL to a folder inside the project and add a reference to HtmlAgilityPack.dll.

4) Next, create a WebClient and pass it the URL to read. To read content in all languages (such as Chinese, Japanese, Indonesian, Russian, etc.), make sure that the encoding passed when loading the document is UTF-8. This assumes the page itself is served as UTF-8.

The complete code for crawling and reading data from a webpage is given below.

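using System;
using System.Net;
using System.Text;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        HtmlDocument htmlDoc = new HtmlDocument();

        // Download the page and parse it as UTF-8 so non-Latin text survives.
        using (WebClient webclient = new WebClient())
        using (var stream = webclient.OpenRead("http://deebujacob.blogspot.com/2013/03/rendering-multiple-series-in-highcharts.html"))
        {
            htmlDoc.Load(stream, Encoding.UTF8);
        }

        // SelectSingleNode returns null when nothing matches, so check before reading InnerText.
        HtmlNode title = htmlDoc.DocumentNode.SelectSingleNode("//title");
        HtmlNode heading = htmlDoc.DocumentNode.SelectSingleNode("//body//h1");
        Console.WriteLine(title != null ? title.InnerText : "(no title found)");
        Console.WriteLine(heading != null ? heading.InnerText : "(no h1 found)");

        // Keep the console window open until Enter is pressed.
        Console.ReadLine();
    }
}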
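
To see why the UTF-8 encoding in step 4 matters, here is a small sketch that needs no network access. It encodes a short multilingual string to bytes, the way a server would send it, and then decodes those bytes once as UTF-8 and once as ASCII. The class and method names are made up for this illustration.

```csharp
using System;
using System.Text;

class EncodingSketch
{
    // Decodes raw bytes with the given encoding, as a page reader would.
    public static string Decode(byte[] bytes, Encoding encoding)
    {
        return encoding.GetString(bytes);
    }

    static void Main()
    {
        string original = "日本語 русский";
        byte[] bytes = Encoding.UTF8.GetBytes(original);

        string good = Decode(bytes, Encoding.UTF8);
        string bad = Decode(bytes, Encoding.ASCII);

        // UTF-8 round-trips the text; ASCII replaces every non-ASCII byte with '?'.
        Console.WriteLine(good == original);  // True
        Console.WriteLine(bad == original);   // False
    }
}
```

The same thing happens to a crawled page when it is read with the wrong encoding: the bytes arrive intact, but the text is lost at decoding time.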


How to store different languages in SQL Server

To store text in different languages (such as Chinese, Japanese, etc.) in SQL Server, the data type of the column should be NVARCHAR.

Also make sure that the stored procedure used to store the data takes an NVARCHAR parameter. If you are using a SQL statement to insert data directly into the table, put "N" before the string literal. Without it the literal is treated as VARCHAR, and characters outside the database's code page are typically stored as "?".

An example is given below.


CREATE TABLE #temp(data NVARCHAR(100))

INSERT INTO #temp(data) values (N'ハロウィンの飾りつけになっているJVCケンウッド丸の内ショールームに行ってきました。')

SELECT * FROM #temp
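
If the insert is done from C# instead of a SQL script, the same rule applies to the parameter type. The sketch below is one way it could look, assuming System.Data.SqlClient is available (built into the classic .NET Framework, a separate package on newer .NET). The class name, the BuildInsert helper and the connection string are hypothetical and only there for illustration. The parameter is declared as NVARCHAR(100) to match the #temp column above, so no "N" prefix is needed in the SQL text.

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

class NVarCharInsertSketch
{
    // Builds an INSERT command whose parameter is typed as NVARCHAR(100),
    // so the text is sent to SQL Server as Unicode.
    public static SqlCommand BuildInsert(string text)
    {
        var command = new SqlCommand("INSERT INTO #temp(data) VALUES (@data)");
        command.Parameters.Add("@data", SqlDbType.NVarChar, 100).Value = text;
        return command;
    }

    static void Main(string[] args)
    {
        // Hypothetical connection string (an assumption); point it at your own server.
        string connectionString = args.Length > 0
            ? args[0]
            : "Server=localhost;Database=tempdb;Integrated Security=true";

        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // Temp tables live for the session, so create and use #temp on one connection.
            using (var create = new SqlCommand("CREATE TABLE #temp(data NVARCHAR(100))", connection))
            {
                create.ExecuteNonQuery();
            }

            using (SqlCommand insert = BuildInsert("ハロウィンの飾りつけ"))
            {
                insert.Connection = connection;
                insert.ExecuteNonQuery();
            }

            // Read the value back to confirm it was stored as Unicode.
            using (var select = new SqlCommand("SELECT data FROM #temp", connection))
            {
                Console.WriteLine((string)select.ExecuteScalar());
            }
        }
    }
}
```

If the parameter were declared as SqlDbType.VarChar instead, the text would be converted to the database's code page on the way in, which is the same problem as leaving out the "N" prefix.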