Channel: htmlagilitypack Forum Rss Feed

New Post: HTML table contents are parsed wrongly

I'd like to read phone numbers out of an HTML table. However, the cell text is parsed wrongly:

HTML source:
<TD class="tnum">0176 
      49329688<BR>4989/6492673<BR>123<BR>456<BR>789<BR>123<BR>456<BR>789<BR>012</TD>
Output:
0176 \r\n      493296884989/6492673123456789123456789012
Desired Output:
0176 49329688
4989/6492673
123
456
789
123
456
789
012
Does anyone know what's wrong?

C# program code:
// File.ReadAllText returns the file's contents, not its path
string html = File.ReadAllText(@"D:\Temp\fonbook_list.htm");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();

doc.LoadHtml(html);

// get phone book table in the document
var table = doc.DocumentNode.SelectSingleNode("//div[@id='uiScroll']/table/tbody")
           .Descendants("tr")
           .Select(n => n.Elements("td").Select(e => e.InnerText).ToArray());

// print entries
string output = "" + Environment.NewLine;

foreach (var tr in table)
{
    if (tr.Length == 10)
    {
        string[] phoneNumbers = tr[2].Replace("\r\n", "").Replace(" ", "").Split(new string[] { "<br>" }, StringSplitOptions.None);
        string[] phoneTypes = tr[3].Split(new string[] { "<br>" }, StringSplitOptions.None);

        output += tr[1] + Environment.NewLine;

        for (int i = 0; i < phoneNumbers.Length; i++)
        {
            output += phoneTypes[i] + ": " + phoneNumbers[i] + Environment.NewLine;
        }
    }

    output += (Environment.NewLine);
}
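
A possible explanation, offered as a sketch rather than a confirmed answer: InnerText returns only the text content of a node, so the <BR> tags are already gone by the time Split looks for "<br>". That would match the output shown above, where all the numbers run together. One way around it, assuming the query is changed to select e.InnerHtml instead of e.InnerText, is to split the cell's markup on the line-break tag without regard to case. The CellText class and its SplitOnBr method below are hypothetical names made up for this illustration; they are not part of HtmlAgilityPack.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of HtmlAgilityPack.
static class CellText
{
    // Matches <br>, <BR>, <br/> and <br /> in any letter case.
    private static readonly Regex BrTag =
        new Regex(@"<br\s*/?>", RegexOptions.IgnoreCase);

    // Splits a cell's inner HTML into one entry per <br>-separated line.
    // Assumes the cell holds only text and <br> tags, as in the sample above.
    public static string[] SplitOnBr(string innerHtml)
    {
        return BrTag.Split(innerHtml)
            // Collapse whitespace runs (including the \r\n inside the cell)
            // into a single space and trim both ends.
            .Select(part => Regex.Replace(part, @"\s+", " ").Trim())
            .ToArray();
    }
}
```

With that in place, the loop could call CellText.SplitOnBr(tr[2]) and CellText.SplitOnBr(tr[3]) in place of the two Split calls. Empty entries are kept on purpose so that phoneNumbers and phoneTypes stay aligned by index; a cell that ends with a trailing <br> would therefore produce one empty entry at the end.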
