Friday, 15 May 2015

php - How can I parse this HTML with a regular expression?



I am trying to write a regular expression to extract the href and the anchor text from a list of URLs in an HTML source. The anchor text can contain any value.

The HTML part looks as follows:

<div class="links"><a rel="nofollow" target="_blank" href="http://url1.com" class="see-all">url1</a><a rel="nofollow" target="_blank" href="http://url2.com" class="see-all">this url-2</a><a rel="nofollow" target="_blank" href="http://url3.com" class="see-all">this url-3</a><a rel="nofollow" target="_blank" href="http://url4.com" class="see-all">sweet url 4</a></div>

I tried the following regular expression, but it's not working, since it grabs everything up to the last </a> tag and fails.

preg_match_all('/<a rel="nofollow" target="_blank" href="(.*)" class="see-all">(.*)<\/a>/', $source, $website_array);

What would be a working regular expression to extract the required data?

If you must know, your expression is greedy, so it matches from the start of the first anchor to the end of the last one; the /U (ungreedy) modifier fixes that:

preg_match_all('/<a rel="nofollow" target="_blank" href="(.*)" class="see-all">(.*)<\/a>/U', $source, $website_array);
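
To see the difference between greedy and ungreedy matching, here is a minimal sketch. The two-anchor `$source` string and the shortened pattern are made up for illustration (they are not the exact markup from the question), and the variable names `$greedy` and `$lazy` are my own.

```php
<?php
// Hypothetical two-anchor sample, shortened from the markup in the question.
$source = '<a href="http://url1.com">url1</a><a href="http://url2.com">this url-2</a>';

// Greedy: (.*) runs as far as it can, so a single match spans both anchors.
preg_match_all('/<a href="(.*)">(.*)<\/a>/', $source, $greedy, PREG_SET_ORDER);

// Ungreedy (/U): each quantifier stops at the earliest possible point,
// so every anchor produces its own match.
preg_match_all('/<a href="(.*)">(.*)<\/a>/U', $source, $lazy, PREG_SET_ORDER);

echo count($greedy), "\n"; // number of greedy matches
echo count($lazy), "\n";   // number of ungreedy matches
```

With the greedy pattern the first capture group swallows the end of the first anchor and the start of the second, which is the failure described in the question.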

Note that pcre.backtrack_limit still applies in ungreedy mode.

Using negated character classes might give better performance:

preg_match_all('/<a rel="nofollow" target="_blank" href="([^"]*)" class="see-all">([^<]*)<\/a>/', $source, $website_array);

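This would have a problem with tags nested inside the anchor itself: [^<]* stops at the first <, so an anchor such as <a ...><b>url1</b></a> would not match.

Given the aforementioned limitations, consider using an HTML parser: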
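
As a rough sketch of how the negated-class pattern could be used to collect every link, the snippet below runs it over sample markup modeled on the question. It assumes the anchors carry the class "see-all" used in the patterns; the third anchor with a nested `<b>` tag and the `$links` array are my own additions to show the limitation.

```php
<?php
// Sample markup modeled on the question; the class value is assumed to be
// "see-all", and the third anchor (with a nested tag) is hypothetical.
$source = '<div class="links">'
    . '<a rel="nofollow" target="_blank" href="http://url1.com" class="see-all">url1</a>'
    . '<a rel="nofollow" target="_blank" href="http://url2.com" class="see-all">this url-2</a>'
    . '<a rel="nofollow" target="_blank" href="http://url3.com" class="see-all"><b>bold</b></a>'
    . '</div>';

$pattern = '/<a rel="nofollow" target="_blank" href="([^"]*)" class="see-all">([^<]*)<\/a>/';

// PREG_SET_ORDER groups the captures per match: [full match, href, text].
preg_match_all($pattern, $source, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    $links[] = ['href' => $m[1], 'text' => $m[2]];
}

print_r($links);
```

Only the first two anchors end up in `$links`; the one containing `<b>` is skipped silently, which is easy to miss on real pages.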

$d = new DOMDocument;
$d->loadHTML($source);
$xp = new DOMXPath($d);
// Select the anchors by attribute, regardless of the order the attributes appear in.
foreach ($xp->query('//a[@class="see-all"][@rel="nofollow"][@target="_blank"]') as $anchor) {
    $href = $anchor->getAttribute('href');
    $text = $anchor->nodeValue;
}

This will happily handle attributes in a different order and gives you the ability to query further inside the document, etc.
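
A small self-contained sketch of that point, under a few assumptions: the markup is made up (the second anchor has its attributes reordered and contains a nested `<b>` tag), the class is assumed to be "see-all", and the `libxml_use_internal_errors()` call is just one way to keep parser warnings quiet.

```php
<?php
// Hypothetical markup: the second anchor has reordered attributes and a nested <b> tag.
$source = '<div class="links">'
    . '<a rel="nofollow" target="_blank" href="http://url1.com" class="see-all">url1</a>'
    . '<a class="see-all" href="http://url2.com" target="_blank" rel="nofollow">this <b>url-2</b></a>'
    . '</div>';

// Collect libxml warnings internally instead of emitting them as PHP warnings.
libxml_use_internal_errors(true);

$d = new DOMDocument;
$d->loadHTML($source);
libxml_clear_errors();

$xp = new DOMXPath($d);
$links = [];
foreach ($xp->query('//a[@class="see-all"][@rel="nofollow"][@target="_blank"]') as $anchor) {
    // nodeValue returns the concatenated text of the anchor, nested tags included.
    $links[] = [
        'href' => $anchor->getAttribute('href'),
        'text' => $anchor->nodeValue,
    ];
}

print_r($links);
```

Both anchors are found here, including the one a fixed-order regular expression would miss.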

php regex
