Channel: htmlagilitypack Forum Rss Feed

New Post: How to scrape some data from a website using HTMLAgilityPack

I have the following Crawler.cs helper class:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Xml.Linq;

namespace CrawlerWeb
{
    public class Crawler
    {
        // Address of the page to download.
        public string Url
        {
            get;
            set;
        }
        public Crawler() { }
        public Crawler(string Url)
        {
            this.Url = Url;
        }
        // Downloads the page at Url and returns it as an XDocument.
        public XDocument GetXDocument()
        {
            HtmlAgilityPack.HtmlWeb doc1 = new HtmlAgilityPack.HtmlWeb();
            doc1.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)";
            HtmlAgilityPack.HtmlDocument doc2 = doc1.Load(Url);
            // Emit XML-style output so that XDocument.Parse can read OuterHtml.
            doc2.OptionOutputAsXml = true;
            doc2.OptionAutoCloseOnEnd = true;
            doc2.OptionDefaultStreamEncoding = System.Text.Encoding.UTF8;
            XDocument xdoc = XDocument.Parse(doc2.DocumentNode.SelectSingleNode("html").OuterHtml);
            return xdoc;
        }
    }
}

And the following partial main class:

private void btnCrawl_Click(object sender, EventArgs e)
{
    // Keep the raw URL separate from the status text shown in txtURL;
    // otherwise "Now Crawling: " ends up inside the URL passed to Crawler.
    string url = txtURL.Text;
    foreach (SHDocVw.InternetExplorer ie in shellWindows)
    {
        filename = Path.GetFileNameWithoutExtension(ie.FullName).ToLower();

        if (filename.Equals("iexplore"))
        {
            url = ie.LocationURL;
            txtURL.Text = "Now Crawling: " + url;
        }
    }
    MessageBox.Show(url);
    string xmlns = "{http://www.w3.org/1999/xhtml}";
    Crawler cl = new Crawler(url);
    XDocument xdoc = cl.GetXDocument();
    var res = from item in xdoc.Descendants(xmlns + "div")
              where item.Attribute("class") != null && item.Attribute("class").Value == "folder-news"
              && item.Element(xmlns + "a") != null
              //select item;
              select new
              {
                  Link = item.Element(xmlns + "a").Attribute("href").Value,
                  Image = item.Element(xmlns + "a").Element(xmlns + "img").Attribute("src").Value,
                  Title = item.Elements(xmlns + "p").ElementAt(0).Element(xmlns + "a").Value,
                  Desc = item.Elements(xmlns + "p").ElementAt(1).Value
              };
    foreach (var node in res)
    {
        // Append each result; plain assignment would keep only the last one.
        tb.Text += node + "\n";
    }
    //Console.ReadKey();
}
The HTML source looks like this:

<B>Name : </B> WILLIAMS AJAYA L <BR>
<B>Address : </B> NEW YORK NY <BR>
<B>Profession : </B> ATHLETIC TRAINER <BR>
<B>License No: </B> 001475 <BR>
<B>Date of Licensure : </B> 01/12/07 <BR> <B>Additional Qualification : </B>   Not applicable in this profession <BR>
<B> <A href="http://www.op.nysed.gov/help.htm#status"> Status :</A></B> REGISTERED <BR>
<B>Registered through last day of : </B> 08/15 <BR>

How can I modify the main class or the Crawler helper class so that the application can use HtmlAgilityPack to extract the following:

Name WILLIAMS AJAYA L

Address NEW YORK NY

Profession ATHLETIC TRAINER

License No 001475

Date of Licensure 01/12/07

Additional Qualification Not applicable in this profession

Status REGISTERED

Registered through last day of 08/15

I would like to extract the data so that I can later save it to a SQL database.
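One possible approach, sketched under the assumption that every field in the page follows the `<B>label</B> value <BR>` pattern shown above: select each `<B>` node with HtmlAgilityPack and read the text node that immediately follows it. The `LicenseParser` class and its `Parse` method are hypothetical names introduced only for this illustration, and the input is assumed to be the page HTML as a string.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of the original Crawler class.
public static class LicenseParser
{
    // Maps each bold label (e.g. "Name") to the text that follows it.
    public static Dictionary<string, string> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var fields = new Dictionary<string, string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection labels = doc.DocumentNode.SelectNodes("//b");
        if (labels == null)
        {
            return fields;
        }

        foreach (HtmlNode label in labels)
        {
            // "Name : " becomes "Name"; InnerText also covers a label wrapped in <A>.
            string key = HtmlEntity.DeEntitize(label.InnerText).Trim().TrimEnd(':').Trim();

            // The value is assumed to be the text node right after the closing </B>.
            HtmlNode next = label.NextSibling;
            if (key.Length == 0 || next == null || next.NodeType != HtmlNodeType.Text)
            {
                continue;
            }

            fields[key] = HtmlEntity.DeEntitize(next.InnerText).Trim();
        }

        return fields;
    }
}
```

The same loop could presumably run directly on the `HtmlDocument` returned by `HtmlWeb.Load(Url)`, without converting to an `XDocument` first, since the target page does not appear to use the `div`/`folder-news` structure that the LINQ query expects. A dictionary keeps label and value together, which should make a later insert into a SQL table straightforward.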
