Messages in Topic
|
What do you think is the best regex to remove HTML? So far I've used <.*?>, but compared to others I've seen it's so simple that I think it could have bugs, since I've not tested it extensively.
Or is there a site or something with fully tested regexes for common uses?
24/Male |
|
<.*?> won't work. The regex engine will not know what to do with the ? modifier because it already has a * modifier to use on the . selector. <.*> would select from the first < to the last > and EVERYTHING in between, which means you'll end up with only the content that comes before the first tag or after the last one, and nothing between tags. The * modifier is greedy and . matches any character (except a newline, by default, in most engines). A simple expression that will work most of the time is <[^>]*>: this selects a <, followed by any characters that are not >, followed by a >. The case that breaks it is an attribute that contains a > within quotes (I'm not sure if that would be proper HTML, but I expect that it is possible).
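To make the greedy versus negated-class behaviour concrete, here is a minimal sketch in Python's `re` module. The sample strings and variable names are made up for illustration, and other regex engines may differ in details.

```python
import re

html = '<p>Hello <b>world</b></p>'

# Greedy: .* runs from the first < to the last >, so the whole
# string is consumed as a single match and nothing is left.
greedy = re.sub(r'<.*>', '', html)

# Negated class: each match stops at the first > after a <,
# so every tag is removed separately and the text survives.
negated = re.sub(r'<[^>]*>', '', html)

# Failure case: a > inside a quoted attribute value ends the
# match too early and leaves part of the tag behind.
tricky = '<a title="a > b">link</a>'
broken = re.sub(r'<[^>]*>', '', tricky)

print(repr(greedy))   # ''
print(repr(negated))  # 'Hello world'
print(repr(broken))   # ' b">link'
```

The last result shows the quoted-attribute problem: the first match ends at the > inside the title value, so the rest of the opening tag leaks into the output.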
Male |
|
The ? after .* makes .* "lazy". The regex engine will start by grabbing the fewest possible characters and then move to the next part of the regex to see if the rest of the expression will match. If this fails, it returns to .*?, adds the next character, tries for a match again, and so on.
<[^>]*> is more efficient than <.*?> because, given a successful match, the former expression does not carry out the backtracking required by the latter. It is indeed possible to have an intervening > somewhere in quotes, and there are more complex expressions that cater for this, probably quite rare, circumstance. Personally I am happy to use the simpler expression.
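As a rough illustration in Python's `re` module (sample strings and the name `QUOTE_AWARE` are hypothetical), the lazy and negated-class forms give the same result on simple input, differ when a tag spans a line break, and can be extended to skip over quoted attribute values. The quote-aware pattern is only one possible sketch of the "more complex expressions" mentioned above, and it does not try to handle comments, scripts, or malformed markup.

```python
import re

html = '<p>Hello <b>world</b></p>'

# Lazy and negated-class forms agree on simple, single-line tags.
lazy = re.sub(r'<.*?>', '', html)
negated = re.sub(r'<[^>]*>', '', html)

# They differ when a tag contains a newline: by default . does not
# match '\n', while [^>] does.
multi = '<p\nclass="x">text</p>'
lazy_multi = re.sub(r'<.*?>', '', multi)
negated_multi = re.sub(r'<[^>]*>', '', multi)

# Quote-aware sketch: inside a tag, consume whole quoted strings,
# or any single character that is not a quote or >.
QUOTE_AWARE = re.compile(r'''<(?:"[^"]*"|'[^']*'|[^'">])*>''')
tricky = '<a title="a > b">link</a>'
cleaned = QUOTE_AWARE.sub('', tricky)

print(repr(lazy), repr(negated))  # 'Hello world' 'Hello world'
print(repr(lazy_multi))           # '<p\nclass="x">text'
print(repr(negated_multi))        # 'text'
print(repr(cleaned))              # 'link'
```

Passing `flags=re.DOTALL` to the lazy version would make . match newlines as well, at which point the two simple forms agree on the multi-line tag too.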
58/Male |
|
Thanks for the clarification about the *? construct. Very useful to know.
Male |
|



