Best regex to remove html tags     24-Jun-08 04:18 pm    
What do you think is the best regex to remove HTML tags? So far I've used <.*?>, but compared to others I've seen it's so simple that I think it could have bugs, since I haven't tested it extensively.

Or is there a site or something with fully tested regex for common uses?

carlosz

Re: Best regex to remove html tags     26-Jun-08 11:55 am    
<.*?> won't work. The regex engine will not know what to do with the ? modifier, because the . selector already has a * modifier applied to it. <.*> would select from the first < to the last > and EVERYTHING in between, which means you'll end up with only the content that comes before or after the tags and nothing between them: the * modifier is greedy and . matches everything.

A simple expression that will work most of the time is <[^>]*>. This selects a <, followed by any characters that are not >, followed by a >. The case that breaks it is an attribute that contains a > within quotes (I'm not sure if that would be proper HTML, but I expect that it is possible).
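
As a rough illustration of the greedy <.*> behaviour and the <[^>]*> alternative described in this post, here is a minimal sketch in Python's re module. Python and the sample string are assumptions made for the example; the thread itself is about the Pipes regex module, whose engine may differ in details.

```python
import re

# Hypothetical sample input, not taken from the thread.
html = "before <p>Hello <b>world</b></p> after"

# Greedy: .* runs from the first < to the last >, so the text
# between the tags is removed along with the tags themselves.
greedy = re.sub(r"<.*>", "", html)

# Negated class: each match stops at the first > after its <,
# so only the tags are removed.
negated = re.sub(r"<[^>]*>", "", html)

print(repr(greedy))   # 'before  after'
print(repr(negated))  # 'before Hello world after'
```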

David Robarts

Re: Best regex to remove html tags     26-Jun-08 12:43 pm    
The ? after .* makes .* "lazy". The regex engine starts by grabbing the fewest possible characters and then moves on to the next part of the regex to see if the rest of the expression will match. If this fails, it returns to .*?, adds the next character, tries for a match again, and so on.
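
The lazy behaviour can be seen in a small Python sketch. The language and the sample string are assumptions for illustration only; the same greedy and lazy quantifiers exist in most Perl-style regex engines.

```python
import re

# Hypothetical sample input, not taken from the thread.
html = "<p>Hello <b>world</b></p>"

# Lazy: .*? expands one character at a time and stops at the
# first > that lets the whole pattern match, so each tag is
# found separately.
lazy_tags = re.findall(r"<.*?>", html)

# Greedy, for comparison: a single match spanning the whole string.
greedy_tags = re.findall(r"<.*>", html)

stripped = re.sub(r"<.*?>", "", html)

print(lazy_tags)    # ['<p>', '<b>', '</b>', '</p>']
print(greedy_tags)  # ['<p>Hello <b>world</b></p>']
print(stripped)     # Hello world
```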

<[^>]*> is more efficient than <.*?> because, given a successful match, the former expression does not carry out the backtracking required by the latter.
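
The sketch below does not measure speed. It only checks, in Python (an assumption, as are the sample strings), that the two expressions strip the same tags on single-line input, and shows one behavioural difference worth knowing: by default . does not match a newline in Python's re module, while [^>] does, so a tag that spans lines is only caught by the lazy form when the DOTALL flag is set.

```python
import re

# Hypothetical sample inputs, not taken from the thread.
single_line = '<a href="x">link</a>'
multi_line = '<a\n   href="x">link</a>'

lazy = re.compile(r"<.*?>")
negated = re.compile(r"<[^>]*>")

# On single-line input both patterns remove the same tags.
same = lazy.sub("", single_line) == negated.sub("", single_line)

# A tag broken across lines: . stops at the newline, [^>] does not.
lazy_multi = lazy.sub("", multi_line)
negated_multi = negated.sub("", multi_line)

# With DOTALL, . also matches newlines and the results agree again.
lazy_dotall = re.compile(r"<.*?>", re.DOTALL).sub("", multi_line)

print(same)                 # True
print(repr(lazy_multi))     # '<a\n   href="x">link'
print(repr(negated_multi))  # 'link'
print(repr(lazy_dotall))    # 'link'
```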

It is indeed possible to have an intervening > somewhere in quotes, and there are more complex expressions that cater for this, probably quite rare, circumstance. Personally I am happy to use the simpler expression.
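
For anyone who does need the quoted > case, one possible shape for such an expression is sketched below in Python. The pattern and the sample string are illustrative assumptions, not an expression quoted from this thread, and the pattern still ignores comments, script blocks and other HTML oddities.

```python
import re

# Hypothetical sample input with a > inside a quoted attribute value.
html = '<a title="a > b" href="#">link</a>'

simple = re.compile(r"<[^>]*>")

# Inside a tag, accept either a character that is not > or a quote,
# or a complete double-quoted or single-quoted string (which may
# itself contain >).
quote_aware = re.compile(r"""<(?:[^>"']|"[^"]*"|'[^']*')*>""")

simple_result = simple.sub("", html)
quote_aware_result = quote_aware.sub("", html)

print(repr(simple_result))       # ' b" href="#">link'
print(repr(quote_aware_result))  # 'link'
```

The simple expression ends the first tag at the > inside the title attribute and leaves the rest of the tag behind as text, while the quote-aware version skips over the quoted value.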

hapdaniel

Re: Best regex to remove html tags     26-Jun-08 01:45 pm    
Thanks for the clarification about the *? construct. Very useful to know.

David Robarts
