|
|||
|
I'm struggling with what I thought was a simple thing, and I'm hoping you guys can help.
I have a string that may contain a ", ', or neither. So, I wrote this in the regex:

    ["|']*

But this doesn't match anything.

Here's the complete code:

    # $text comes from a form, so this is just a sample
    $text = <<EOF;
    <img src="<a href='http://www.example.com/whatever.jpg'
    target='_new'>
    http://www.example.com/whatever.jpg</a>"
    width="300" height="300" border="0">
    EOF

    # Regex; line breaks added here for the sake of reading
    $text =~ s/<img(.*?)src=
    ["|']*\s*<a.*? href=
    ["|']*\s*(.*?)
    ["|']*.*?>(.*?)<\/a>
    ["|']*(.*?)>
    /<img src="$2"$1$4>/gsi;

If I change ["|']* to whatever I have hard coded, then it works fine, so I know the issue is with that pattern. So how do I correctly match them?
|
|
||||
|
||||
|
|
|
|||
|
Quoth Jason C <jwcarlton@gmail.com>:
> I'm struggling with what I thought was a simple thing, and I'm hoping
> you guys can help.
>
> I have a string that may contain a ", ', or neither. So, I wrote this in
> the regex:
>
> ["|']*

You don't use | like that inside a character class. None of the normal regex special characters have their special meanings, and the class matches any one of the characters listed, so that class will match any of ", ', or |.

You also probably don't want that *. AFAICS you want to match exactly one quote, of either type, so you just want ["']. (If you wanted to get fancy you could insist on matching quotes using \1 backreferences, but you may not think it's worth it.)

> But this doesn't match anything.
>
> Here's the complete code:
>
> # $text comes from a form, so this is just a sample
> $text = <<EOF;
> <img src="<a href='http://www.example.com/whatever.jpg'
> target='_new'>
> http://www.example.com/whatever.jpg</a>"
> width="300" height="300" border="0">
> EOF
>
> # Regex; line breaks added here for the sake of reading

If you use /x you can do this in your real source too, though you will need to remember to escape spaces when you do want them to match literally.

> $text =~ s/<img(.*?)src=
> ["|']*\s*<a.*? href=
> ["|']*\s*(.*?)
> ["|']*.*?>(.*?)<\/a>
> ["|']*(.*?)>
> /<img src="$2"$1$4>/gsi;
>
> If I change ["|']* to whatever I have hard coded, then it works fine, so
> I know the issue is with that pattern. So how do I correctly match them?

When I try this (after having removed the line breaks) it does match *something*, just not what you wanted it to match. $text ends up as

    <img src="" width="300" height="300" border="0">

which is happening because the second uncaptured .*? is picking up all the text you wanted to get in $2. Everything between the 'href=' and the '>' is *ed, so it can all match nothing if it wants to. The .*? in $2 wants to match as little as possible, and so does the one before the >, and when two sections of the pattern are 'fighting' over what to match the one earlier in the pattern wins.

In general, .*? is not a panacea in situations like this. You would probably be better off using negated character classes, something like

    $text =~ s{
        <img ([^>]*) src=["'] \s*
        <a [^>]* [ ] href=["'] \s* ([^'"]*) ["'] [^>]* >
        ([^<]*) </a>
        ['"] ([^>]*) >
    }{<img src="$2"$1$4>}gsix;

(I've used /x to format it decently, which means the literal space needs to be escaped somehow. I usually prefer putting it in a character class to backslashing it, though either would work.)

Here each negated character class stops the match running off past the next thing, so for instance $2 can't run past the end of the quotes. This isn't perfect: it will not match at all if there are other tags inside the <a>, and it's not terribly easy to modify it so it will. (While it is possible to correctly match arbitrary HTML with Perl regexes, it isn't entirely straightforward.)

Ben
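For readers following along, here is a small self-contained sketch of the two points in the reply above: a | inside a character class is a literal pipe, and the negated-character-class substitution rewrites the sample input. The variable names ($with_pipe, $without_pipe, $fixed) are made up for illustration, and the input is only the sample from the original post, so treat this as a check under those assumptions rather than a general HTML fixer.

```perl
use strict;
use warnings;

# Inside a character class, | is a literal pipe, not alternation.
my $with_pipe    = qr/["|']/;
my $without_pipe = qr/["']/;

print "a lone | is matched by the class containing |\n" if '|' =~ $with_pipe;
print "a lone | is not matched by the class without |\n" if '|' !~ $without_pipe;

# Sample input from the original post (single-quoted heredoc, no interpolation).
my $text = <<'EOF';
<img src="<a href='http://www.example.com/whatever.jpg'
target='_new'>
http://www.example.com/whatever.jpg</a>"
width="300" height="300" border="0">
EOF

# The substitution from the reply, applied to a copy of the sample.
# Under /x, whitespace in the pattern is ignored, so the one literal
# space that must match is written as [ ].
(my $fixed = $text) =~ s{
    <img ([^>]*) src=["'] \s*
    <a [^>]* [ ] href=["'] \s* ([^'"]*) ["'] [^>]* >
    ([^<]*) </a>
    ['"] ([^>]*) >
}{<img src="$2"$1$4>}gsix;

# The src attribute now holds the URL captured from href, and the <a> tag is gone.
print $fixed;
```

Because $1 and $4 are copied through unchanged, any whitespace or line breaks they captured survive in the result.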
|
|||
|
On Thursday, July 12, 2012 7:46:04 PM UTC-4, Ben Morrow wrote:
<snip>
> In general, .*? is not a panacea in situations like this. You would
> probably be better off using negated character classes, something like
>
>     $text =~ s{
>         <img ([^>]*) src=["'] \s*
>         <a [^>]* [ ] href=["'] \s* ([^'"]*) ["'] [^>]* >
>         ([^<]*) </a>
>         ['"] ([^>]*) >
>     }{<img src="$2"$1$4>}gsix;
>
> (I've used /x to format it decently, which means the literal space needs
> to be escaped somehow. I usually prefer putting it in a character class
> to backslashing it, though either would work.)
>
> Here each negated character class stops the match running off past the
> next thing, so for instance $2 can't run past the end of the quotes.
> This isn't perfect: it will not match at all if there are other tags
> inside the <a>, and it's not terribly easy to modify it so it will.
> (While it is possible to correctly match arbitrary HTML with Perl
> regexes, it isn't entirely straightforward.)
>
> Ben

Perfect! I actually did mean for the " or ' to be optional, though (it's possible to have references without a quote), so I had to add the * back in, but the idea of negated characters was exactly what I needed.

For the sake of my own knowledge, does the pattern:

    /img([^>])src/

translate to "img, not followed by a >, and followed by src", or "img, followed by anything except a >, and followed by src"?
|
|||
|
Quoth Jason C <jwcarlton@gmail.com>:
> On Thursday, July 12, 2012 7:46:04 PM UTC-4, Ben Morrow wrote:
> <snip>
> > In general, .*? is not a panacea in situations like this. You would
> > probably be better off using negated character classes, something like
> >
> >     $text =~ s{
> >     &lt;img ([^&gt;]*) src=[&quot;&#39;] \s*
        ^^^^       ^^^^         ^^^^^^^^^^^

If you're going to be posting to programming newsgroups you need to find a way to stop that from happening. Dropping Google in favour of a real newsreader might be a good start.

> Perfect! I actually did mean for the " or ' to be optional, though (it's
> possible to have references without a quote), so I had to add the * back
> in, but the idea of negated characters was exactly what I needed.

'Optional' is ?, not *. Presumably you don't want to allow

    <a href=""""""one two three"""""">

> For the sake of my own knowledge, does the pattern:
>
> /img([^>])src/
>
> translate to "img, not followed by a >, and followed by src", or "img,
> followed by anything except a >, and followed by src"?

The latter.

Ben
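A quick way to check both answers above is to run the patterns against a few strings. This is only a sketch: the test strings and the names %one_char, $optional and $any_run are invented here, and the href patterns are cut-down stand-ins for the full substitution.

```perl
use strict;
use warnings;

# [^>] consumes exactly one character, and that character must not be ">".
my %one_char;
for my $s ('img src', 'img>src', 'imgsrc', 'img  src') {
    $one_char{$s} = $s =~ /img([^>])src/ ? 'matches' : 'no match';
    printf "%-12s %s\n", "'$s'", $one_char{$s};
}

# ["']? allows at most one quote; ["']* allows any run of them.
my $odd = q{<a href=""""""one two three"""""">};
my $optional = $odd =~ /href=["']?[^"'>]/ ? 'matches' : 'no match';
my $any_run  = $odd =~ /href=["']*[^"'>]/ ? 'matches' : 'no match';

print "with ?: $optional\n";   # no match
print "with *: $any_run\n";    # matches
```

Only 'img src' matches the first pattern: 'imgsrc' has no character between the two words, and 'img  src' has two.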
|
|||
|
On Jul 12, 6:12 pm, Jason C <jwcarl...@gmail.com> wrote:
> I'm struggling with what I thought was a simple thing, and I'm hoping you guys can help.
>
> I have a string that may contain a ", ', or neither. So, I wrote this in the regex:

If you process CSV files, this can get real hairy. CSV files can contain one or more double quotes, one or more single quotes, pairs of double and/or single quotes, and commas embedded within quotation marks.

The best help, and one that I strongly recommend to you, is to examine the Perl source for one or more of the CSV modules. They contain regular expressions for disentangling CSV strings, and trying to understand how they work will strengthen your RE chops.

I normally follow two strategies when faced with this situation. The first is to replace all non-delimiting or non-qualifying quotation marks with some unusual character that's unlikely to appear in the string, such as

    s/["']/#/g

and then later, after I've processed the string, reverse the change like this

    s/#/'/g

which converts all the quotations to single quotes, which may or may not work for you (it normally works for me).

Or, I escape the quotations with either single or double backslashes, depending on whatever subsequent processing you plan to do, like this

    s/(["'])/\\$1/g

This has the advantage of preserving the kinds of quotes.

I'm posting from memory so the above might have errors, but you understand the idea.

In practice, I find that single quotes turn up in the oddest places, where you would never expect them. For this reason, when I process a string, out of pure defensiveness, I usually escape quotes (as well as some other potential troublemakers).

CC
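The two strategies described above can be tried on a short string. This is a minimal sketch with an invented sample sentence and variable names; note that the replacement side of s/// treats a backslash as an escape character, so a literal backslash in the output has to be written as \\ in front of $1.

```perl
use strict;
use warnings;

my $s = q{He said "it's fine"};

# Strategy 1: swap every quote for a placeholder, then restore them all
# as single quotes. Any # already present in the input would also be
# turned into a quote, so the placeholder must not occur in the data.
(my $masked   = $s)      =~ s/["']/#/g;
(my $restored = $masked) =~ s/#/'/g;

# Strategy 2: put a backslash in front of each quote, keeping its kind.
(my $escaped = $s) =~ s/(["'])/\\$1/g;

print "$masked\n";     # He said #it#s fine#
print "$restored\n";   # He said 'it's fine'
print "$escaped\n";    # He said \"it\'s fine\"
```

Strategy 1 loses the distinction between the two kinds of quote, which is the trade-off mentioned in the post; strategy 2 keeps it.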
|
|||
|
On Friday, July 13, 2012 5:49:03 AM UTC-4, Ben Morrow wrote:
> > >     $text =~ s{
> > >     &lt;img ([^&gt;]*) src=[&quot;&#39;] \s*
>         ^^^^       ^^^^         ^^^^^^^^^^^
> If you're going to be posting to programming newsgroups you need to find
> a way to stop that from happening. Dropping Google in favour of a real
> newsreader might be a good start.

Blech, why did Google start doing that?? I really don't use NG's that often, but those substitutions sure make it hard to talk about regex! I guess I'll have to grab a copy of Forte Agent or something...

> 'Optional' is ?, not *. Presumably you don't want to allow

Maybe I really am confused. Regex isn't really my strong point, though, so I appreciate the clarification.

I thought that ? made it not greedy; meaning, instead of catching the next reference, it would find the last reference. Example:

    $text = "Example >->->";
    $text =~ s/>?//;

would return:

    Example ->->

But this:

    $text = "Example >->->";
    $text =~ s/>//;

would return:

    Example --

Then, I thought that * meant "0 or more times", which would essentially make it optional?
|
|||
|
Quoth Jason C <jwcarlton@gmail.com>:
> On Friday, July 13, 2012 5:49:03 AM UTC-4, Ben Morrow wrote:
>
> > 'Optional' is ?, not *. Presumably you don't want to allow
>
> Maybe I really am confused. Regex isn't really my strong point, though,
> so I appreciate the clarification.
>
> I thought that ? made it not greedy; meaning, instead of catching the
> next reference, it would find the last reference.

? does sometimes make something not greedy, but that isn't what 'not greedy' means. Suppose I have a section of pattern, A. Then

    /A/     matches A once
    /A?/    matches A 0-or-1 times
    /A*/    matches A 0-or-more times
    /A+/    matches A 1-or-more times

Greediness controls what to do when there is a choice about how many times to match. The greedy quantifiers above will all match as much as possible whenever there's a choice. These non-greedy quantifiers:

    /A??/   matches A 0-or-1 times, not greedy
    /A*?/   matches A 0-or-more times, not greedy
    /A+?/   matches A 1-or-more times, not greedy

will all instead match as *little* as possible.

> Then, I thought that * meant "0 or more times", which would essentially
> make it optional?

Well, yes, in a sense. 'Optional' is ambiguous; in this case, as I said, I believe you want 0-or-1-times rather than 0-or-more-times.

Ben
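The quantifier table above can be checked by capturing what each one actually matches. The following is a rough sketch with an invented test string 'aaa' and invented variable names; it also runs the two substitutions from the earlier question to show what they do in practice.

```perl
use strict;
use warnings;

my $s = 'aaa';
my %got;

# Capture what each quantifier matches at the first possible position.
for my $p ('a?', 'a*', 'a+', 'a??', 'a*?', 'a+?') {
    ($got{$p}) = $s =~ /($p)/;
    printf "%-6s matched '%s'\n", "/$p/", $got{$p};
}
# Greedy:     a? takes 'a', a* and a+ take 'aaa'.
# Non-greedy: a?? and a*? take the empty string, a+? takes 'a'.

# >? may match zero characters, so it succeeds with an empty match at the
# very start of the string and nothing is removed.
(my $t1 = 'Example >->->') =~ s/>?//;

# Without /g, a plain > removes only the first occurrence.
(my $t2 = 'Example >->->') =~ s/>//;

print "$t1\n";   # Example >->->
print "$t2\n";   # Example ->->
```

In other words, a bare ? after an atom is the 0-or-1 quantifier; it only means "not greedy" when it follows another quantifier.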
|
|||
|
Ben Morrow <ben@morrow.me.uk> writes:
[...]

>> Then, I thought that * meant "0 or more times", which would essentially
>> make it optional?
>
> Well, yes, in a sense. 'Optional' is ambiguous; in this case, as I said,
> I believe you want 0-or-1-times rather than 0-or-more-times.

I don't think so. ? is equivalent to the quantifier {0,1}, * is equivalent to the quantifier {0,}. Both imply that the re they apply to is optional for success of the match (not matching it at all is fine). The difference is on the right side of the comma: The first one may match at most once, the second one represents an unbounded sequence.
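The equivalences stated above are easy to confirm side by side. This is a sketch only: the inputs (an x preceded by zero to three double quotes) and the names @patterns and %accepts are made up for the comparison.

```perl
use strict;
use warnings;

# Each pattern allows some number of leading double quotes before "x".
my @patterns = (
    [ '"?'     => qr/\A"?x\z/     ],
    [ '"{0,1}' => qr/\A"{0,1}x\z/ ],
    [ '"*'     => qr/\A"*x\z/     ],
    [ '"{0,}'  => qr/\A"{0,}x\z/  ],
);

my %accepts;
for my $in ('x', '"x', '""x', '"""x') {
    my @cells;
    for my $pat (@patterns) {
        my ($label, $re) = @$pat;
        $accepts{$label}{$in} = $in =~ $re ? 1 : 0;
        push @cells, $label . '=' . $accepts{$label}{$in};
    }
    printf "%-6s %s\n", $in, join('  ', @cells);
}
```

All four patterns accept zero or one quote; only * and {0,} accept two or more, which is the difference both posters are describing.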
|
|||
|
Quoth Rainer Weikusat <rweikusat@mssgmbh.com>:
> Ben Morrow <ben@morrow.me.uk> writes:
> [I recommended the OP use ? instead of *]
>
> >> Then, I thought that * meant "0 or more times", which would essentially
> >> make it optional?
> >
> > Well, yes, in a sense. 'Optional' is ambiguous; in this case, as I said,
> > I believe you want 0-or-1-times rather than 0-or-more-times.
>
> I don't think so. ? is equivalent to the quantifier {0,1}, * is
> equivalent to the quantifier {0,}. Both imply that the re they apply
> to is optional for success of the match (not matching it at all is
> fine). The difference is on the right side of the comma: The first one
> may match at most once, the second one represents an unbounded
> sequence.

I believe you are agreeing with me.

Ben
|
|||
|
In <877gu1ntli.fsf@sapphire.mobileactivedefense.com>, on 07/18/2012 at 01:26 PM, Rainer Weikusat <rweikusat@mssgmbh.com> said:

>I don't think so.

Because you didn't read Ben's text.

>? is equivalent to the quantifier {0,1}, * is
>equivalent to the quantifier {0,}.

Which doesn't conflict with what Ben wrote.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the right to publicly post or ridicule any abusive E-mail. Reply to domain Patriot dot net user shmuel+news to contact me. Do not reply to spamtrap@library.lspace.org