I know it’s possible to match a word and then reverse the matches using other tools (e.g. grep -v). However, is it possible to match lines that do not contain a specific word, e.g. hede, using a regular expression?
The notion that regex doesn’t support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:
The regex above will match any string, or line without a line break, not containing the (sub)string ‘hede’. As mentioned, this is not something regex is “good” at (or should do), but still, it is possible.
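A minimal sketch of this behaviour, using Python's `re` module purely for illustration. The pattern `^((?!hede).)*$` is the one this answer builds up in its explanation; the sample lines are made up.

```python
import re

# Matches a whole line only if "hede" does not occur anywhere in it.
NOT_HEDE = re.compile(r"^((?!hede).)*$")

lines = ["hoho", "hihi", "haha", "hede", "ABhedeCD"]
kept = [line for line in lines if NOT_HEDE.match(line)]
print(kept)
```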
And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):
or use it inline:
(where the /.../ are the regex delimiters, i.e., not part of the pattern)
If the DOT-ALL modifier is not available, you can mimic the same behavior with the character class [\s\S]:
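The three variants can be compared side by side. This is a sketch in Python, where the DOT-ALL modifier is spelled `re.DOTALL` (or `(?s)` inline); the test strings are invented for the example.

```python
import re

text = "first line\nsecond line"
bad = "first line\nhede here"

plain = re.compile(r"^((?!hede).)*$")                  # dot stops at line breaks
with_flag = re.compile(r"^((?!hede).)*$", re.DOTALL)   # DOT-ALL as a flag
inline = re.compile(r"(?s)^((?!hede).)*$")             # DOT-ALL inline
char_class = re.compile(r"^((?!hede)[\s\S])*$")        # no DOT-ALL needed

# Each entry: (matches multi-line text without "hede", matches text with "hede")
results = [(bool(p.match(text)), bool(p.match(bad)))
           for p in (with_flag, inline, char_class)]
print(results)
```

Without DOT-ALL, `plain` cannot get past the line break, so it does not match `text` at all.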
A string is just a list of n characters. Before, and after each character, there’s an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":
    ┌──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┐
S = │e1│ A │e2│ B │e3│ h │e4│ e │e5│ d │e6│ e │e7│ C │e8│ D │e9│
    └──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┘

index    0      1      2      3      4      5      6      7
where the e’s are the empty strings. The regex (?!hede). looks ahead to see if there’s no substring "hede" to be seen, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width-assertions because they don’t consume any characters. They only assert/validate something.
So, in my example, every empty string is first validated to see if there’s no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group, and repeated zero or more times: ((?!hede).)*. Finally, the start- and end-of-input are anchored to make sure the entire input is consumed: ^((?!hede).)*$
As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).
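To see the walk-through in action, the look-ahead can be tried at every empty string in turn. A small Python sketch; the `e1`..`e9` labels simply mirror the diagram.

```python
import re

s = "ABhedeCD"
lookahead = re.compile(r"(?!hede)")

# Try the zero-width assertion at every empty string e1..e9
# (string indexes 0..len(s)) and collect the positions where it fails.
failed = [f"e{i + 1}" for i in range(len(s) + 1)
          if lookahead.match(s, i) is None]
print(failed)
```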
Note that the solution to does not start with “hede”:
is generally much more efficient than the solution to does not contain “hede”:
The former checks for “hede” only at the input string’s first position, rather than at every position.
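A sketch of the difference in Python. The "does not contain" pattern is the one from the answer above; the "does not start with" pattern `^(?!hede).*$` is an assumed spelling, since it is not shown here.

```python
import re

starts_not = re.compile(r"^(?!hede).*$")      # look-ahead runs once, at position 0
contains_not = re.compile(r"^((?!hede).)*$")  # look-ahead runs before every character

samples = ["hede first", "ends with hede", "nothing here"]
table = {s: (bool(starts_not.match(s)), bool(contains_not.match(s)))
         for s in samples}
print(table)
```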
If you’re just using it for grep, you can use grep -v hede to get all lines which do not contain hede.
ETA: Oh, rereading the question, grep -v is probably what you meant by “tools options”.
How can an email address be validated in JavaScript?
Using regular expressions is probably the best way. You can see a bunch of tests here (taken from chromium)
function validateEmail(email) {
var re = /^(([^<>()\[\]\\.,;:\s@"]+(\.[^<>()\[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/;
return re.test(String(email).toLowerCase());
}
Here’s an example of a regular expression that accepts Unicode:
var re = /^(([^<>()\[\]\.,;:\s@\"]+(\.[^<>()\[\]\.,;:\s@\"]+)*)|(\".+\"))@(([^<>()[\]\.,;:\s@\"]+\.)+[^<>()[\]\.,;:\s@\"]{2,})$/i;
But keep in mind that one should not rely only upon JavaScript validation. JavaScript can easily be disabled. This should be validated on the server side as well.
Here’s an example of the above in action:
function validateEmail(email) {
var re = /^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/;
return re.test(email);
}
function validate() {
  var $result = $("#result");
  var email = $("#email").val();
  $result.text("");
  if (validateEmail(email)) {
    $result.text(email + " is valid :)");
    $result.css("color", "green");
  } else {
    $result.text(email + " is not valid :(");
    $result.css("color", "red");
  }
  return false;
}
$("#validate").on("click", validate);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<form>
<p>Enter an email address:</p>
<input id='email'>
<button type='submit' id='validate'>Validate!</button>
</form>
<h2 id='result'></h2>
Just for completeness, here you have another RFC 2822 compliant regex
The official standard is known as RFC 2822. It describes the syntax that valid email addresses must adhere to. You can (but you shouldn’t — read on) implement it with this regular expression:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
(…) We get a more practical implementation of RFC 2822 if we omit the syntax using double quotes and square brackets. It will still match 99.99% of all email addresses in actual use today.
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
A further change you could make is to allow any two-letter country code top level domain, and only specific generic top level domains. This regex filters dummy email addresses like asdf@adsf.adsf. You will need to update it as new top-level domains are added.
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b
So even when following official standards, there are still trade-offs to be made. Don’t blindly copy regular expressions from online libraries or discussion forums. Always test them on your own data and with your own applications.
Emphasis mine
I’ve slightly modified Jaymon’s answer for people who want really simple validation in the form of:
anystring@anystring.anystring
The regular expression:
Example JavaScript function:
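A minimal sketch of such a check, shown in Python rather than JavaScript. The pattern is an assumption in the spirit of anystring@anystring.anystring, not necessarily the exact one this answer used, and `looks_like_email` is a made-up name.

```python
import re

# Assumed pattern: non-space characters, "@", non-space characters,
# a literal dot, and more non-space characters.
SIMPLE_EMAIL = re.compile(r"\S+@\S+\.\S+")

def looks_like_email(value: str) -> bool:
    # fullmatch requires the whole string to fit the pattern
    return SIMPLE_EMAIL.fullmatch(value) is not None

print(looks_like_email("anystring@anystring.anystring"))
```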
I want a regular expression to check that
a password must be eight characters including one uppercase letter, one special character and alphanumeric characters.
And here is my validation expression which is for eight characters including one uppercase letter, one lowercase letter, and one number or special character.
How can I write it for a password that must be eight characters including one uppercase letter, one special character and alphanumeric characters?
The regular expression you are after will most likely be huge and a nightmare to maintain especially for people who are not that familiar with regular expressions.
I think it would be easier to break your regex down and do it one bit at a time. It might take a bit more to do, but I am pretty sure that maintaining it and debugging it would be easier. This would also allow you to provide more directed error messages to your users (other than just Invalid Password) which should improve user experience.
From what I am seeing you are pretty fluent in regex, so I would presume that giving you the regular expressions to do what you need would be futile.
Seeing your comment, this is how I would go about it:
Must be eight characters long: You do not need a regex for this. Using the .Length property should be enough.
Including one uppercase letter: You can use the [A-Z]+ regular expression. If the string contains at least one upper case letter, this regular expression will yield true.
One special character: You can use either \W, which will match any character that is not a letter, a number, or an underscore, or else something like [!@#] to specify a custom list of special characters. Note though that characters such as $, ^, ( and ) are special characters in the regular expression language, so they need to be escaped like so: \$. So in short, you might use the \W.
Alphanumeric characters: Using the \w+ should match any letter and number and underscore.
Take a look at this tutorial for more information.
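The one-check-at-a-time approach might look roughly like this. It is a sketch in Python rather than .NET; the function name and the messages are made up, and "exactly eight characters" is an assumed reading of the requirement.

```python
import re

def password_problems(password: str) -> list[str]:
    """Return human-readable problems; an empty list means acceptable."""
    problems = []
    if len(password) != 8:                       # length check needs no regex
        problems.append("must be exactly eight characters long")
    if not re.search(r"[A-Z]", password):
        problems.append("must contain an uppercase letter")
    if not re.search(r"\W", password):           # not a letter, digit or underscore
        problems.append("must contain a special character")
    if not re.search(r"[A-Za-z0-9]", password):
        problems.append("must contain an alphanumeric character")
    return problems

print(password_problems("abc"))
```

Each failed check yields its own message, which is what makes directed error messages possible.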
( # Start of group
(?=.*\d) # must contain at least one digit
(?=.*[A-Z]) # must contain at least one uppercase character
(?=.*\W) # must contain at least one special symbol
. # match anything with previous condition checking
{8,8} # length is exactly 8 characters
) # End of group
In one line:
You need to match the entire input string. So, you can enclose the regex between ^ and $ to prevent accidentally assuming partial matches as matching the entire input:
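For illustration, here is the commented pattern above collapsed onto one line and anchored, tried in Python. Writing `.{8}` instead of `.{8,8}` is an equivalent shorthand; the candidate passwords are invented.

```python
import re

# digit, uppercase letter and special symbol required; exactly 8 characters
PASSWORD = re.compile(r"^(?=.*\d)(?=.*[A-Z])(?=.*\W).{8}$")

candidates = ["Passw0rd", "Pa$$w0rd", "pa$$w0rd", "Pa$$w0rdX"]
accepted = [c for c in candidates if PASSWORD.match(c)]
print(accepted)
```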
So many answers…. all bad!
Regular expressions don’t have an AND operator, so it’s pretty hard to write a regex that matches valid passwords, when validity is defined by something AND something else AND something else…
But, regular expressions do have an OR operator, so just apply DeMorgan’s theorem, and write a regex that matches invalid passwords.
anything with less than 8 characters OR anything with no numbers OR anything with no uppercase OR anything with no special characters
So:
If anything matches that, then it’s an invalid password.
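A sketch of this inverted check in Python. The pattern is an assumed spelling of the four alternatives listed above, and `is_valid` is a hypothetical helper name.

```python
import re

# One alternative per way a password can be invalid:
#   .{0,7}         fewer than 8 characters
#   [^0-9]*        no digit anywhere
#   [^A-Z]*        no uppercase letter anywhere
#   [a-zA-Z0-9]*   no special character anywhere
INVALID = re.compile(r"^(.{0,7}|[^0-9]*|[^A-Z]*|[a-zA-Z0-9]*)$")

def is_valid(password: str) -> bool:
    # Valid means: none of the "invalid" alternatives covers the whole string.
    return INVALID.fullmatch(password) is None

print(is_valid("Pa$$w0rd"))
```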
What would be the best regular expression for this scenario?
Given this URL:
How should I go about selecting everything between (but not including) http://php.net and .php:
This is for an Nginx configuration file.
Like this:
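A minimal sketch, assuming the pattern is a lookbehind for "net", a greedy match, and a lookahead for ".php", which is what this answer's explanation describes. It is tried here in Python rather than in an Nginx configuration, against the preg-match manual URL used elsewhere in this thread.

```python
import re

url = "http://php.net/manual/en/function.preg-match.php"

# (?<=net)  the match must be preceded by "net"
# .*        as many characters as possible
# (?=\.php) the match must be followed by ".php"
between = re.search(r"(?<=net).*(?=\.php)", url)
print(between.group(0))
```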
Explanation:
"
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
net # Match the characters “net” literally
)
. # Match any single character that is not a line break character
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
\. # Match the character “.” literally
php # Match the characters “php” literally
)
"
A regular expression might not be the most effective tool for this job.
Try using parse_url(), combined with pathinfo():
$url = 'http://php.net/manual/en/function.preg-match.php';
$path = parse_url($url, PHP_URL_PATH);
$pathinfo = pathinfo($path);
echo $pathinfo['dirname'], '/', $pathinfo['filename'];
The above code outputs:
/manual/en/function.preg-match
There’s no need to use a regular expression to dissect a URL. PHP has built-in functions for this, pathinfo() and parse_url().
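The same idea carries over to other languages. As a rough Python analogue (not the PHP functions themselves), the standard library's `urllib.parse.urlparse` and `posixpath.splitext` can do the job:

```python
from posixpath import splitext
from urllib.parse import urlparse

url = "http://php.net/manual/en/function.preg-match.php"

path = urlparse(url).path           # "/manual/en/function.preg-match.php"
stem, extension = splitext(path)    # split off the final ".php"
print(stem)
```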
Over the years I have slowly developed a regular expression that validates MOST email addresses correctly, assuming they don’t use an IP address as the server part.
I use it in several PHP programs, and it works most of the time. However, from time to time I get contacted by someone that is having trouble with a site that uses it, and I end up having to make some adjustment (most recently I realized that I wasn’t allowing 4-character TLDs).
What is the best regular expression you have or have seen for validating emails?
I’ve seen several solutions that use functions that use several shorter expressions, but I’d rather have one long complex expression in a simple function instead of several short expression in a more complex function.
The fully RFC 822 compliant regex is inefficient and obscure because of its length. Fortunately, RFC 822 was superseded twice and the current specification for email addresses is RFC 5322. RFC 5322 leads to a regex that can be understood if studied for a few minutes and is efficient enough for actual use.
One RFC 5322 compliant regex can be found at the top of the page at http://emailregex.com/ but uses the IP address pattern that is floating around the internet with a bug that allows 00 for any of the unsigned byte decimal values in a dot-delimited address, which is illegal. The rest of it appears to be consistent with the RFC 5322 grammar and passes several tests using grep -Po, including cases with domain names, IP addresses, bad ones, and account names with and without quotes.
Correcting the 00 bug in the IP pattern, we obtain a working and fairly fast regex. (Scrape the rendered version, not the markdown, for actual code.)
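(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])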
Here is a diagram of the finite state machine for the above regexp, which is clearer than the regexp itself
The more sophisticated patterns in Perl and PCRE (regex library used e.g. in PHP) can correctly parse RFC 5322 without a hitch. Python and C# can do that too, but they use a different syntax from those first two. However, if you are forced to use one of the many less powerful pattern-matching languages, then it’s best to use a real parser.
It’s also important to understand that validating it per the RFC tells you absolutely nothing about whether that address actually exists at the supplied domain, or whether the person entering the address is its true owner. People sign others up to mailing lists this way all the time. Fixing that requires a fancier kind of validation that involves sending that address a message that includes a confirmation token meant to be entered on the same web page as was the address.
Confirmation tokens are the only way to know you got the address of the person entering it. This is why most mailing lists now use that mechanism to confirm sign-ups. After all, anybody can put down president@whitehouse.gov, and that will even parse as legal, but it isn’t likely to be the person at the other end.
For PHP, you should not use the pattern given in Validate an E-Mail Address with PHP, the Right Way from which I quote:
There is some danger that common usage and widespread sloppy coding will establish a de facto standard for e-mail addresses that is more restrictive than the recorded formal standard.
That is no better than all the other non-RFC patterns. It isn’t even smart enough to handle even RFC 822, let alone RFC 5322. This one, however, is.
If you want to get fancy and pedantic, implement a complete state engine. A regular expression can only act as a rudimentary filter. The problem with regular expressions is that telling someone that their perfectly valid e-mail address is invalid (a false positive) because your regular expression can’t handle it is just rude and impolite from the user’s perspective. A state engine for the purpose can both validate and even correct e-mail addresses that would otherwise be considered invalid as it disassembles the e-mail address according to each RFC. This allows for a potentially more pleasing experience, like
The specified e-mail address ‘myemail@address,com’ is invalid. Did you mean ‘myemail@address.com’?
See also Validating Email Addresses, including the comments. Or Comparing E-mail Address Validating Regular Expressions.
You should not use regular expressions to validate email addresses.
Instead, use the MailAddress class, like this:
The MailAddress class uses a BNF parser to validate the address in full accordance with RFC822.
If you really want to use a regex, here it is:
(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?: \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:( ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0 31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\ ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+ (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?: (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n) ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\ r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n.html)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n) ?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n.html)?[ \t] )*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n.html)?[ \t])* )(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n.html)?[ \t])*))*) *:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+ |\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?: \r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t ]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031 ]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\]( ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(? 
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(? :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(? :(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]| \\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<> @,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|" (?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t] )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n.html)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(? :[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[ \]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n.html)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000- \031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|( ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,; :\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([ ^\[\]\r\\]|\\.)*\](?:(?:\r\n.html)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\" .\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\ ]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\ [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\ r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0 00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\ .|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@, ;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(? 
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])* (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". \[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n.html)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[ ^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\] ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n.html)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*( ?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:( ?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[ \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t ])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n.html)?[ \t])*)(? :\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+| \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n.html)?[ \t])*))*|(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\ ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n) ?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[" ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n.html)?[ \t])*)(?:\.(?:(?:\r\n) ?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<> @,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n.html)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@, ;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n.html)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t] )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n.html)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)? (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". 
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?: \r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[ "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t]) *))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n.html)?[ \t])*)(?:\ .(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n.html)?[ \t])*))*\>(?:( ?:\r\n)?[ \t])*))*)?;\s*)
This question is asked a lot, but I think you should step back and ask yourself why you want to validate email addresses syntactically. What is the benefit really?
If you want to validate that an email is correct, you have no choice but to send a confirmation email and have the user reply to that. In many cases you will have to send a confirmation mail anyway for security reasons or for ethical reasons (so you cannot e.g. sign someone up to a service against their will).
I would like to have a regular expression that checks if a string contains only upper and lowercase letters, numbers, and underscores.
To match a string that contains only those characters (or an empty string), try
This works for .NET regular expressions, and probably a lot of other languages as well.
Breaking it down:
^ : start of string
[ : beginning of character group
a-z : any lowercase letter
A-Z : any uppercase letter
0-9 : any digit
_ : underscore
] : end of character group
* : zero or more of the given characters
$ : end of string
If you don’t want to allow empty strings, use + instead of *.
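The breakdown above spells out the character class `[a-zA-Z0-9_]` with a `*` quantifier between `^` and `$`. A quick sketch of it in Python (the answer itself targets .NET); `fullmatch` stands in for the two anchors, and the helper name is made up.

```python
import re

ONLY_WORD_CHARS = re.compile(r"[a-zA-Z0-9_]*")   # use + instead of * to reject ""

def is_identifier_like(text: str) -> bool:
    # fullmatch plays the role of the ^ and $ anchors
    return ONLY_WORD_CHARS.fullmatch(text) is not None

print(is_identifier_like("snake_case_42"))
```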
EDIT: As others have pointed out, some regex languages have a shorthand form for [a-zA-Z0-9_]. In the .NET regex language, you can turn on ECMAScript behavior and use \w as a shorthand (yielding ^\w*$ or ^\w+$). Note that in other languages, and by default in .NET, \w is somewhat broader, and will match other sorts of Unicode characters as well (thanks to Jan for pointing this out). So if you’re really intending to match only those characters, using the explicit (longer) form is probably best.
There’s a lot of verbosity in here, and I’m deeply against it, so, my conclusive answer would be:
\w is equivalent to [A-Za-z0-9_], which is pretty much what you want. (unless we introduce unicode to the mix)
Using the + quantifier you’ll match one or more characters. If you want to accept an empty string too, use * instead.
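The Unicode caveat is easy to demonstrate. In Python 3, for instance, `\w` is Unicode-aware for text patterns by default, and the `re.ASCII` flag narrows it back to `[a-zA-Z0-9_]`; the sample word is invented.

```python
import re

unicode_w = re.compile(r"^\w+$")            # default: \w also matches "ï"
ascii_w = re.compile(r"^\w+$", re.ASCII)    # \w limited to [a-zA-Z0-9_]

word = "naïve_2"
outcome = (bool(unicode_w.match(word)), bool(ascii_w.match(word)))
print(outcome)
```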
I’m trying to put together a comprehensive regex to validate phone numbers. Ideally it would handle international formats, but it must handle US formats, including the following:
I’ll answer with my current attempt, but I’m hoping somebody has something better and/or more elegant.
Better option… just strip all non-digit characters on input (except ‘x’ and leading ‘+’ signs), taking care because of the British tendency to write numbers in the non-standard form +44 (0) ... when asked to use the international prefix (in that specific case, you should discard the (0) entirely).
Then, you end up with values like:
12345678901
12345678901x1234
345678901x1234
12344678901
12345678901
12345678901
12345678901
+4112345678
+441234567890
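A rough sketch of that normalisation step in Python. The function name is made up and the handling of "(0)" is one possible reading of the advice above; real input will need more care (for example, any stray letter x is kept as if it were an extension marker).

```python
import re

def normalize_phone(raw: str) -> str:
    """Reduce input to digits, an optional leading '+', and an 'x' extension marker."""
    text = raw.strip().lower()
    # Drop the British-style "(0)" that sometimes follows an international prefix.
    if text.startswith("+"):
        text = re.sub(r"\(\s*0\s*\)", "", text)
    plus = "+" if text.startswith("+") else ""
    # Keep digits and 'x'; spaces, dashes, dots and brackets all go.
    return plus + re.sub(r"[^0-9x]", "", text)

print(normalize_phone("+44 (0) 1234 567890"))
```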
Then when you display, reformat to your heart’s content. e.g.
It turns out that there’s something of a spec for this, at least for North America, called the NANP.
You need to specify exactly what you want. What are legal delimiters? Spaces, dashes, and periods? No delimiter allowed? Can one mix delimiters (e.g., +0.111-222.3333)? How are extensions (e.g., 111-222-3333 x 44444) going to be handled? What about special numbers, like 911? Is the area code going to be optional or required?
Here’s a regex for a 7 or 10 digit number, with extensions allowed, delimiters are spaces, dashes, or periods:
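One way such a pattern could look, sketched in Python. This is an assumption built from the description (7 or 10 digits; spaces, dashes or periods as delimiters; an optional extension), not necessarily the regex this answer gave. Here the extension is assumed to be written as "x" followed by digits.

```python
import re

# optional 3-digit area code, then 3 digits, then 4 digits; at most one
# space, dash or period between groups; optional "x<digits>" extension
PHONE = re.compile(r"(?:\d{3}[ .-]?)?\d{3}[ .-]?\d{4}(?:\s*x\s*\d+)?")

def is_phone(text: str) -> bool:
    return PHONE.fullmatch(text) is not None

print(is_phone("111-222-3333 x 44444"))
```

Note that this sketch allows mixed delimiters; whether that is acceptable is exactly the kind of question raised above.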
If the user wants to give you his phone number, then trust him to get it right. If he does not want to give it to you then forcing him to enter a valid number will either send him to a competitor’s site or make him enter a random string that fits your regex. I might even be tempted to look up the number of a premium rate sex line and enter that instead.
I would also consider any of the following as valid entries on a web site:
"123 456 7890 until 6pm, then 098 765 4321"
"123 456 7890 or try my mobile on 098 765 4321"
"ex-directory - mind your own business"
How can I use regular expressions in Excel and take advantage of Excel’s powerful grid-like setup for data manipulation?
I understand Regex is not ideal for many situations (To use or not to use regular expressions?) since excel can use Left, Mid, Right, Instr type commands for similar manipulations.
Regular expressions are used for Pattern Matching.
To use in Excel follow these steps :
Step 1: Add VBA reference to “Microsoft VBScript Regular Expressions 5.5”
Step 2: Define your pattern
Basic definitions:
-  Range.
   E.g. a-z matches any lower case letter from a to z
   E.g. 0-5 matches any number from 0 to 5
[]  Match exactly one of the objects inside these brackets.
   E.g. [a] matches the letter a
   E.g. [abc] matches a single letter which can be a, b or c
   E.g. [a-z] matches any single lower case letter of the alphabet.
()  Groups different matches for return purposes. See examples below.
{}  Multiplier for repeated copies of pattern defined before it.
   E.g. [a]{2} matches two consecutive lower case letter a: aa
   E.g. [a]{1,3} matches at least one and up to three lower case letter a: a, aa, aaa
+  Match at least one, or more, of the pattern defined before it.
   E.g. a+ will match consecutive a’s: a, aa, aaa, and so on
?  Match zero or one of the pattern defined before it.
   E.g. [a-z]? matches empty string or any single lower case letter.
*  Match zero or more of the pattern defined before it.
   E.g. Wildcard for pattern that may or may not be present.
   E.g. [a-z]* matches empty string or string of lower case letters.
.  Matches any character except newline \n
   E.g. a. matches a two character string starting with a and ending with anything except \n
|  OR operator
   E.g. a|b means either a or b can be matched.
   E.g. red|white|orange matches exactly one of the colors.
^  NOT operator (when used inside square brackets)
   E.g. [^0-9] character can not contain a number
   E.g. [^aA] character can not be lower case a or upper case A
\  Escapes special character that follows (overrides above behavior)
   E.g. \., \\, \(, \?, \$, \^
Anchoring Patterns:
^  Match must occur at start of string
   E.g. ^a First character must be lower case letter a
   E.g. ^[0-9] First character must be a number.
$  Match must occur at end of string
   E.g. a$ Last character must be lower case letter a
Precedence table:
Order  Name                Representation
1      Parentheses         ( )
2      Multipliers         ? + * {m,n} {m, n}?
3      Sequence & Anchors  abc ^ $
4      Alternation         |
Predefined Character Abbreviations:
abr  same as         meaning
\d   [0-9]           Any single digit
\D   [^0-9]          Any single character that's not a digit
\w   [a-zA-Z0-9_]    Any word character
\W   [^a-zA-Z0-9_]   Any non-word character
\s   [ \r\t\n\f]     Any space character
\S   [^ \r\t\n\f]    Any non-space character
\n   [\n]            New line
Example 1: Run as macro
The following example macro looks at the value in cell A1 to see if the first 1 or 2 characters are digits. If so, they are removed and the rest of the string is displayed. If not, then a box appears telling you that no match is found. Cell A1 values of 12abc will return abc, value of 1abc will return abc, value of abc123 will return “Not Matched” because the digits were not at the start of the string.
Private Sub simpleRegex()
    Dim strPattern As String: strPattern = "^[0-9]{1,2}"
    Dim strReplace As String: strReplace = ""
    Dim regEx As New RegExp
    Dim strInput As String
    Dim Myrange As Range
    Set Myrange = ActiveSheet.Range("A1")
    If strPattern <> "" Then
        strInput = Myrange.Value
        With regEx
            .Global = True
            .MultiLine = True
            .IgnoreCase = False
            .Pattern = strPattern
        End With
        If regEx.Test(strInput) Then
            MsgBox (regEx.Replace(strInput, strReplace))
        Else
            MsgBox ("Not matched")
        End If
    End If
End Sub
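For readers who want to try the pattern outside Excel, the same logic can be sketched in Python. This is an illustration only: the function name is made up, and it skips the VBA MultiLine setting.

```python
import re

LEADING_DIGITS = re.compile(r"^[0-9]{1,2}")

def strip_leading_digits(value: str) -> str:
    """Remove one or two leading digits, or report that there were none."""
    if LEADING_DIGITS.search(value):
        return LEADING_DIGITS.sub("", value)
    return "Not matched"

print(strip_leading_digits("12abc"))
```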
Example 2: Run as an in-cell function
This example is the same as example 1 but is setup to run as an in-cell function. To use, change the code to this:
Function simpleCellRegex(Myrange As Range) As String
    Dim regEx As New RegExp
    Dim strPattern As String
    Dim strInput As String
    Dim strReplace As String
    Dim strOutput As String
    strPattern = "^[0-9]{1,3}"
    If strPattern <> "" Then
        strInput = Myrange.Value
        strReplace = ""
        With regEx
            .Global = True
            .MultiLine = True
            .IgnoreCase = False
            .Pattern = strPattern
        End With
        If regEx.test(strInput) Then
            simpleCellRegex = regEx.Replace(strInput, strReplace)
        Else
            simpleCellRegex = "Not matched"
        End If
    End If
End Function
Place your strings (“12abc”) in cell A1. Enter this formula =simpleCellRegex(A1) in cell B1 and the result will be “abc”.
Example 3: Loop Through Range
This example is the same as example 1 but loops through a range of cells.
Private Sub simpleRegex()
    Dim strPattern As String: strPattern = "^[0-9]{1,2}"
    Dim strReplace As String: strReplace = ""
    Dim regEx As New RegExp
    Dim strInput As String
    Dim Myrange As Range
    Dim cell As Range   ' loop variable, declared so the code also compiles under Option Explicit
    Set Myrange = ActiveSheet.Range("A1:A5")
    For Each cell In Myrange
        If strPattern <> "" Then
            strInput = cell.Value
            With regEx
                .Global = True
                .MultiLine = True
                .IgnoreCase = False
                .Pattern = strPattern
            End With
            If regEx.Test(strInput) Then
                MsgBox (regEx.Replace(strInput, strReplace))
            Else
                MsgBox ("Not matched")
            End If
        End If
    Next
End Sub
Example 4: Splitting apart different patterns
This example loops through a range (A1, A2 & A3) and looks for a string starting with three digits followed by a single alpha character and then 4 numeric digits. The output splits apart the pattern matches into adjacent cells by using the (). $1 represents the first pattern matched within the first set of ().
Private Sub splitUpRegexPattern()
    Dim regEx As New RegExp
    Dim strPattern As String
    Dim strInput As String
    Dim Myrange As Range
    Dim C As Range   ' loop variable, declared so the code also compiles under Option Explicit
    Set Myrange = ActiveSheet.Range("A1:A3")
    For Each C In Myrange
        strPattern = "(^[0-9]{3})([a-zA-Z])([0-9]{4})"
        If strPattern <> "" Then
            strInput = C.Value
            With regEx
                .Global = True
                .MultiLine = True
                .IgnoreCase = False
                .Pattern = strPattern
            End With
            If regEx.test(strInput) Then
                C.Offset(0, 1) = regEx.Replace(strInput, "$1")
                C.Offset(0, 2) = regEx.Replace(strInput, "$2")
                C.Offset(0, 3) = regEx.Replace(strInput, "$3")
            Else
                C.Offset(0, 1) = "(Not matched)"
            End If
        End If
    Next
End Sub
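The grouping idea can also be tried outside Excel. A small Python sketch with a made-up helper name and invented inputs; unlike the macro's Replace call, it returns only the captured groups, so any text after the match is dropped rather than carried along.

```python
import re

SPLIT = re.compile(r"(^[0-9]{3})([a-zA-Z])([0-9]{4})")

def split_cell(value: str) -> list[str]:
    """Return the three captured groups, or a placeholder when nothing matches."""
    found = SPLIT.search(value)
    return list(found.groups()) if found else ["(Not matched)"]

print(split_cell("123a4567"))
```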
Results:
Additional Pattern Examples
String  Regex Pattern                 Explanation
a1aaa   [a-zA-Z][0-9][a-zA-Z]{3}      Single alpha, single digit, three alpha characters
a1aaa   [a-zA-Z]?[0-9][a-zA-Z]{3}     May or may not have preceding alpha character
a1aaa   [a-zA-Z][0-9][a-zA-Z]{0,3}    Single alpha, single digit, 0 to 3 alpha characters
a1aaa   [a-zA-Z][0-9][a-zA-Z]*        Single alpha, single digit, followed by any number of alpha characters
</i8>   \<\/[a-zA-Z][0-9]\>           Exact non-word character except any single alpha followed by any single digit
To make use of regular expressions directly in Excel formulas the following UDF (user defined function) can be of help. It more or less directly exposes regular expression functionality as an excel function.
It takes 2-3 parameters.
$0, $1, $2, and so on. $0 is the entire match, $1 and up correspond to the respective match groups in the regular expression. Defaults to $0.
Extracting an email address:
=regex("Peter Gordon: some@email.com, 47", "\w+@\w+\.\w+")
=regex("Peter Gordon: some@email.com, 47", "\w+@\w+\.\w+", "$0")
Results in: some@email.com
Extracting several substrings:
Results in: E-Mail: some@email.com, Name: Peter Gordon
To take apart a combined string in a single cell into its components in multiple cells:
=regex("Peter Gordon: some@email.com, 47", "^(.+): (.+), (\d+)$", "$" & 1)
=regex("Peter Gordon: some@email.com, 47", "^(.+): (.+), (\d+)$", "$" & 2)
Results in: Peter Gordon
some@email.com
…
To use this UDF do the following (roughly based on this Microsoft page. They have some good additional info there!):
Press ALT+F11 to open the Microsoft Visual Basic for Applications Editor.
Add VBA reference to the Regular Expressions library (shamelessly copied from Portland Runners++ answer):
Click on Insert Module. If you give your module a different name, make sure the module does not have the same name as the UDF below (e.g. naming the module `Regex` and the function `regex` causes #NAME? errors).
In the big text window in the middle insert the following:
Function regex(strInput As String, matchPattern As String, Optional ByVal outputPattern As String = "$0") As Variant
Dim inputRegexObj As New VBScript_RegExp_55.RegExp, outputRegexObj As New VBScript_RegExp_55.RegExp, outReplaceRegexObj As New VBScript_RegExp_55.RegExp
Dim inputMatches As Object, replaceMatches As Object, replaceMatch As Object
Dim replaceNumber As Integer
With inputRegexObj
.Global = True
.MultiLine = True
.IgnoreCase = False
.Pattern = matchPattern
End With
With outputRegexObj
.Global = True
.MultiLine = True
.IgnoreCase = False
.Pattern = "\$(\d+)"
End With
With outReplaceRegexObj
.Global = True
.MultiLine = True
.IgnoreCase = False
End With
Set inputMatches = inputRegexObj.Execute(strInput)
If inputMatches.Count = 0 Then
regex = False
Else
Set replaceMatches = outputRegexObj.Execute(outputPattern)
For Each replaceMatch In replaceMatches
replaceNumber = replaceMatch.SubMatches(0)
outReplaceRegexObj.Pattern = "\$" & replaceNumber
If replaceNumber = 0 Then
outputPattern = outReplaceRegexObj.Replace(outputPattern, inputMatches(0).Value)
Else
If replaceNumber > inputMatches(0).SubMatches.Count Then
'regex = "A too high $ tag found. Largest allowed is $" & inputMatches(0).SubMatches.Count & "."
regex = CVErr(xlErrValue)
Exit Function
Else
outputPattern = outReplaceRegexObj.Replace(outputPattern, inputMatches(0).SubMatches(replaceNumber - 1))
End If
End If
Next
regex = outputPattern
End If
End Function
```</li>
<li><p>Save and close the <em>Microsoft Visual Basic for Applications</em> Editor window.</p></li>
</ol>
#### Answer 3 (score 54)
Expanding on <a href="https://stackoverflow.com/users/1975049/patszim">patszim</a>'s <a href="https://stackoverflow.com/a/28176749/1699071">answer</a> for those in a rush.
<ol>
<li>Open Excel workbook.</li>
<li><kbd>Alt</kbd>+<kbd>F11</kbd> to open VBA/Macros window.</li>
<li>Add reference to regex under <strong><em>Tools</em></strong> then <strong><em>References</em></strong><br>
<a href="https://i.stack.imgur.com/sKCdA.png" rel="noreferrer"><img src="https://i.stack.imgur.com/sKCdA.png" alt="![Excel VBA Form add references"></a></li>
<li>and selecting <strong>Microsoft VBScript Regular Expression 5.5</strong><br>
<a href="https://i.stack.imgur.com/nmSgP.png" rel="noreferrer"><img src="https://i.stack.imgur.com/nmSgP.png" alt="![Excel VBA add regex reference"></a></li>
<li>Insert a new module (code needs to reside in the module otherwise it doesn't work).<br>
<a href="https://i.stack.imgur.com/RaLQ0.png" rel="noreferrer"><img src="https://i.stack.imgur.com/RaLQ0.png" alt="![Excel VBA insert code module"></a></li>
<li>In the newly inserted module,<br>
<a href="https://i.stack.imgur.com/DFJ7F.png" rel="noreferrer"><img src="https://i.stack.imgur.com/DFJ7F.png" alt="![Excel VBA insert code into module"></a></li>
<li><p>add the following code: </p>
```vb
Function RegxFunc(strInput As String, regexPattern As String) As String
Dim regEx As New RegExp
Dim matches As Object ' declared so the function also compiles under Option Explicit
With regEx
.Global = True
.MultiLine = True
.IgnoreCase = False
.pattern = regexPattern
End With
If regEx.Test(strInput) Then
Set matches = regEx.Execute(strInput)
RegxFunc = matches(0).Value
Else
RegxFunc = "not matched"
End If
End Function
```</li>
<li><p>The regex pattern is placed in one of the cells and <strong><em>absolute referencing</em></strong> is used on it.
<a href="https://i.stack.imgur.com/XnS6t.png" rel="noreferrer"><img src="https://i.stack.imgur.com/XnS6t.png" alt="![Excel regex function in-cell usage"></a>
The function is tied to the workbook it is created in.<br>
If it needs to be available in other workbooks, store the function in <strong>Personal.XLSB</strong>.</p></li>
</ol>
### 10: How to extract numbers from a string in Python? (score [754314](https://stackoverflow.com/q/4289331.html) in 2019)
#### Question
I would like to extract all the numbers contained in a string. Which is better suited for the purpose, regular expressions or the `isdigit()` method?
Example:
```python
line = "hello 12 hi 89"
```
Result:
If you want to extract only positive integers, try the following:
>>> str = "h3110 23 cat 444.4 rabbit 11 2 dog"
>>> [int(s) for s in str.split() if s.isdigit()]
[23, 11, 2]
I would argue that this is better than the regex example for three reasons. First, you don’t need another module; secondly, it’s more readable because you don’t need to parse the regex mini-language; and third, it is faster (and thus likely more pythonic):
python -m timeit -s "str = 'h3110 23 cat 444.4 rabbit 11 2 dog' * 1000" "[s for s in str.split() if s.isdigit()]"
100 loops, best of 3: 2.84 msec per loop
python -m timeit -s "import re" "str = 'h3110 23 cat 444.4 rabbit 11 2 dog' * 1000" "re.findall('\\b\\d+\\b', str)"
100 loops, best of 3: 5.66 msec per loop
This will not recognize floats, negative integers, or integers in hexadecimal format. If you can’t accept these limitations, slim’s answer below will do the trick.
I’d use a regexp :
This would also match 42 from `bla42bla`. If you only want numbers delimited by word boundaries (space, period, comma), you can use `\b`:
To end up with a list of numbers instead of a list of strings:
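The snippets for this answer did not survive in this copy, so here is a minimal sketch of the three steps it describes, using Python's standard `re` module. The sample string and the variable names are assumptions picked for illustration, not the answerer's original code.

```python
import re

text = "he33llo 42 I'm a 32 string 30"  # hypothetical sample input

# Every run of digits, even inside a word ("33" comes from "he33llo").
all_runs = re.findall(r"\d+", text)

# Only runs delimited by word boundaries, so "33" is skipped.
bounded = re.findall(r"\b\d+\b", text)

# Convert the matched strings to integers.
numbers = [int(s) for s in bounded]

print(all_runs)  # ['33', '42', '32', '30']
print(bounded)   # ['42', '32', '30']
print(numbers)   # [42, 32, 30]
```

Note that `re.findall` always returns strings, which is why the last step is needed if you want to do arithmetic on the results.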
This is more than a bit late, but you can extend the regex expression to account for scientific notation too.
import re
# Format is [(<string>, <expected output>), ...]
ss = [("apple-12.34 ba33na fanc-14.23e-2yapple+45e5+67.56E+3",
['-12.34', '33', '-14.23e-2', '+45e5', '+67.56E+3']),
('hello X42 I\'m a Y-32.35 string Z30',
['42', '-32.35', '30']),
('he33llo 42 I\'m a 32 string -30',
['33', '42', '32', '-30']),
('h3110 23 cat 444.4 rabbit 11 2 dog',
['3110', '23', '444.4', '11', '2']),
('hello 12 hi 89',
['12', '89']),
('4',
['4']),
('I like 74,600 commas not,500',
['74,600', '500']),
('I like bad math 1+2=.001',
['1', '+2', '.001'])]
for s, r in ss:
    # Raw string so that \d is passed to the regex engine unchanged
    rr = re.findall(r"[-+]?[.]?[\d]+(?:,\d\d\d)*[\.]?\d*(?:[eE][-+]?\d+)?", s)
    if rr == r:
        print('GOOD')
    else:
        print('WRONG', rr, 'should be', r)
Gives all good!
Additionally, you can look at the AWS Glue built-in regex
Possible Duplicate:
Regular Expression Sanitize (PHP)
I am facing an issue with URLs, I want to be able to convert titles that could contain anything and have them stripped of all special characters so they only have letters and numbers and of course I would like to replace spaces with hyphens.
How would this be done? I’ve heard a lot about regular expressions (regex) being used…
Easy peasy:
function clean($string) {
$string = str_replace(' ', '-', $string); // Replaces all spaces with hyphens.
return preg_replace('/[^A-Za-z0-9\-]/', '', $string); // Removes special chars.
}
Usage:
Will output: abcdef-g
Edit:
Hey, just a quick question, how can I prevent multiple hyphens from being next to each other? and have them replaced with just 1?
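The follow-up snippet is missing from this copy. As a rough sketch of the idea (shown in Python rather than PHP, with `clean` as a hypothetical counterpart of the PHP function above), collapsing runs of hyphens only needs one extra substitution at the end:

```python
import re

def clean(title):
    """Hypothetical Python counterpart of the PHP clean() shown above."""
    hyphenated = title.replace(" ", "-")                  # spaces become hyphens
    stripped = re.sub(r"[^A-Za-z0-9-]", "", hyphenated)   # drop special characters
    return re.sub(r"-+", "-", stripped)                   # collapse runs of hyphens

print(clean("a  b -- c!"))  # a-b-c
```

The same final step should carry over to PHP as a `preg_replace` call with the pattern `/-+/`.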
The solution below has a “SEO friendlier” version:
function hyphenize($string) {
$dict = array(
"I'm" => "I am",
"thier" => "their",
// Add your own replacements here
);
return strtolower(
preg_replace(
array( '#[\\s-]+#', '#[^A-Za-z0-9. -]+#' ),
array( '-', '' ),
// the full cleanString() can be downloaded from http://www.unexpectedit.com/php/php-clean-string-of-utf8-chars-convert-to-similar-ascii-char
cleanString(
str_replace( // preg_replace can be used to support more complicated replacements
array_keys($dict),
array_values($dict),
urldecode($string)
)
)
)
);
}
function cleanString($text) {
$utf8 = array(
'/[áàâãªä]/u' => 'a',
'/[ÁÀÂÃÄ]/u' => 'A',
'/[ÍÌÎÏ]/u' => 'I',
'/[íìîï]/u' => 'i',
'/[éèêë]/u' => 'e',
'/[ÉÈÊË]/u' => 'E',
'/[óòôõºö]/u' => 'o',
'/[ÓÒÔÕÖ]/u' => 'O',
'/[úùûü]/u' => 'u',
'/[ÚÙÛÜ]/u' => 'U',
'/ç/' => 'c',
'/Ç/' => 'C',
'/ñ/' => 'n',
'/Ñ/' => 'N',
'/–/' => '-', // UTF-8 hyphen to "normal" hyphen
'/[’‘‹›‚]/u' => ' ', // Literally a single quote
'/[“”«»„]/u' => ' ', // Double quote
'/\x{00A0}/u' => ' ', // non-breaking space (U+00A0, decimal 160)
);
return preg_replace(array_keys($utf8), array_values($utf8), $text);
}
The rationale for the above functions (which I find way inefficient - the one below is better) is that a service that shall not be named apparently ran spelling checks and keyword recognition on the URLs.
After losing a long time on a customer’s paranoias, I found out they were not imagining things after all – their SEO experts [I am definitely not one] reported that, say, converting “Viaggi Economy Perù” to `viaggi-economy-peru` “behaved better” than `viaggi-economy-per` (the previous “cleaning” removed UTF8 characters; Bogotà became bogot, Medellìn became medelln and so on).
There were also some common misspellings that seemed to influence the results, and the only explanation that made sense to me is that our URLs were being unpacked, the words singled out, and used to drive God knows what ranking algorithms. And those algorithms apparently had been fed with UTF8-cleaned strings, so that “Perù” became “Peru” instead of “Per”. “Per” did not match and sort of took it in the neck.
In order to both keep UTF8 characters and replace some misspellings, the faster function below became the more accurate (?) function above. `$dict` needs to be hand tailored, of course.
A simple approach:
// Remove all characters except A-Z, a-z, 0-9, dots, hyphens and spaces
// Note that the hyphen must go last not to be confused with a range (A-Z)
// and the dot, NOT being special (I know. My life was a lie), is NOT escaped
$str = preg_replace('/[^A-Za-z0-9. -]/', '', $str);
// Replace sequences of spaces with hyphen
$str = preg_replace('/ +/', '-', $str);
// The above means "one or more spaces"
// (equivalent to "a space, followed by a space repeated zero or more times")
// You may also want to try this alternative:
$str = preg_replace('/\\s+/', '-', $str);
// where \s+ means "one or more whitespaces" (a space is not necessarily the
// same as a whitespace) just to be sure and include everything
Note that you might have to first `urldecode()` the URL, since %20 and + both are actually spaces - I mean, if you have “Never%20gonna%20give%20you%20up” you want it to become Never-gonna-give-you-up, not Never20gonna20give20you20up. You might not need it, but I thought I’d mention the possibility.
So the finished function along with test cases:
function hyphenize($string) {
return
## strtolower(
preg_replace(
array('#[\\s-]+#', '#[^A-Za-z0-9. -]+#'),
array('-', ''),
## cleanString(
urldecode($string)
## )
)
## )
;
}
print implode("\n", array_map(
function($s) {
return $s . ' becomes ' . hyphenize($s);
},
array(
'Never%20gonna%20give%20you%20up',
"I'm not the man I was",
"'Légeresse', dit sa majesté",
)));
Never%20gonna%20give%20you%20up becomes never-gonna-give-you-up
I'm not the man I was becomes im-not-the-man-I-was
'Légeresse', dit sa majesté becomes legeresse-dit-sa-majeste
To handle UTF-8 I used a `cleanString` implementation found online (link broken since, but a stripped down copy with all the not-too-esoteric UTF8 characters is at the beginning of the answer; it’s also easy to add more characters to it if you need) that converts UTF8 characters to normal characters, thus preserving the word “look” as much as possible. It could be simplified and wrapped inside the function here for performance.
The function above also implements converting to lowercase - but that’s a taste. The code to do so has been commented out.
Here, check out this function:
function seo_friendly_url($string){
$string = str_replace(array('[\', \']'), '', $string);
$string = preg_replace('/\[.*\]/U', '', $string);
$string = preg_replace('/&(amp;)?#?[a-z0-9]+;/i', '-', $string);
$string = htmlentities($string, ENT_COMPAT, 'utf-8');
$string = preg_replace('/&([a-z])(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig|quot|rsquo);/i', '\\1', $string );
$string = preg_replace(array('/[^a-z0-9]/i', '/[-]+/') , '-', $string);
return strtolower(trim($string, '-'));
}
I haven’t used regular expressions at all, so I’m having difficulty troubleshooting. I want the regex to match only when the contained string is all numbers; but with the two examples below it is matching a string that contains all numbers plus an equals sign like “1234=4321”. I’m sure there’s a way to change this behavior, but as I said, I’ve never really done much with regular expressions.
string compare = "1234=4321";
Regex regex = new Regex(@"[\d]");
if (regex.IsMatch(compare))
{
//true
}
regex = new Regex("[0-9]");
if (regex.IsMatch(compare))
{
//true
}
In case it matters, I’m using C# and .NET2.0.
Use the beginning and end anchors.
Use `"^\d+$"` if you need to match more than one digit.
Note that `"\d"` will match `[0-9]` and other digit characters like the Eastern Arabic numerals ٠١٢٣٤٥٦٧٨٩. Use `"^[0-9]+$"` to restrict matches to just the Arabic numerals 0 - 9.
If you need to include any numeric representations other than just digits (like decimal values for starters), then see @tchrist’s comprehensive guide to parsing numbers with regular expressions.
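The question is about C#, but the same two points (anchoring the whole string, and `\d` matching more than 0-9) can be tried out quickly in Python, whose `re` module behaves similarly here. This is only an illustrative sketch; the function names are made up for the example.

```python
import re

def is_all_digits(s):
    # fullmatch anchors the pattern at both ends of the string;
    # re.ASCII restricts \d to the characters 0-9.
    return re.fullmatch(r"\d+", s, flags=re.ASCII) is not None

def is_all_unicode_digits(s):
    # Without re.ASCII, \d also accepts other Unicode decimal digits.
    return re.fullmatch(r"\d+", s) is not None

print(is_all_digits("1234"))          # True
print(is_all_digits("1234=4321"))     # False
print(is_all_digits("٠١٢٣"))          # False
print(is_all_unicode_digits("٠١٢٣"))  # True
```

An unanchored search for `\d` would report a match for "1234=4321", which is exactly the behaviour the question ran into.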
Your regex will match anything that contains a number, you want to use anchors to match the whole string and then match one or more numbers:
The `^` will anchor the beginning of the string, the `$` will anchor the end of the string, and the `+` will match one or more of what precedes it (a number in this case).
If you need to tolerate decimal point and thousand marker
You will need a “-”, if the number can go negative.
The following should be matched:
can I do: `".*123"`?
Yes, you can. That should work.
- `.` = any char
- `\.` = the actual dot character
- `.?` = `.{0,1}` = match any char zero or one times
- `.*` = `.{0,}` = match any char zero or more times
- `.+` = `.{1,}` = match any char one or more times
Yes that will work, though note that `.` will not match newlines unless you pass the DOTALL flag when compiling the expression:
Use the pattern `.` to match any character once, `.*` to match any character zero or more times, `.+` to match any character one or more times.
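To see the newline caveat from the answers above in action, here is a small Python sketch (the sample text is an assumption). Python's flag is called `re.DOTALL`; other engines spell it differently.

```python
import re

text = "abc\n123"  # hypothetical input with a line break before the digits

# By default the dot stops at the newline, so only the second line matches.
without_dotall = re.search(r".*123", text).group()

# With DOTALL the dot also matches "\n", so the match spans both lines.
with_dotall = re.search(r".*123", text, re.DOTALL).group()

print(repr(without_dotall))  # '123'
print(repr(with_dotall))     # 'abc\n123'
```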
I want to match a portion of a string using a regular expression and then access that parenthesized substring:
var myString = "something format_abc"; // I want "abc"
var arr = /(?:^|\s)format_(.*?)(?:\s|$)/.exec(myString);
console.log(arr); // Prints: [" format_abc", "abc"] .. so far so good.
console.log(arr[1]); // Prints: undefined (???)
console.log(arr[0]); // Prints: format_undefined (!!!)
What am I doing wrong?
I’ve discovered that there was nothing wrong with the regular expression code above: the actual string which I was testing against was this:
Reporting that “%A” is undefined seems a very strange behaviour, but it is not directly related to this question, so I’ve opened a new one, Why is a matched substring returning “undefined” in JavaScript?.
The issue was that `console.log` takes its parameters like a `printf` statement, and since the string I was logging (`"%A"`) had a special value, it was trying to find the value of the next parameter.
You can access capturing groups like this:
And if there are multiple matches you can iterate over them:
As you can see the way to iterate over multiple matches was not very intuitive. This led to the proposal of the `String.prototype.matchAll` method. This new method is expected to ship in the ECMAScript 2020 specification. It gives us a clean API and solves multiple problems. It has started to land in major browsers and JS engines: Chrome 73+ / Node 12+ and Firefox 67+.
The method returns an iterator and is used as follows:
As it returns an iterator, we can say it’s lazy; this is useful when handling particularly large numbers of capturing groups, or very large strings. But if you need, the result can be easily transformed into an Array by using the spread syntax or the `Array.from` method:
function getFirstGroup(regexp, str) {
const array = [...str.matchAll(regexp)];
return array.map(m => m[1]);
}
// or:
function getFirstGroup(regexp, str) {
return Array.from(str.matchAll(regexp), m => m[1]);
}
In the meantime, while this proposal gets more wide support, you can use the official shim package.
Also, the internal workings of the method are simple. An equivalent implementation using a generator function would be as follows:
function* matchAll(str, regexp) {
const flags = regexp.global ? regexp.flags : regexp.flags + "g";
const re = new RegExp(regexp, flags);
let match;
while (match = re.exec(str)) {
yield match;
}
}
A copy of the original regexp is created; this is to avoid side-effects due to the mutation of the `lastIndex` property when going through the multiple matches.
Also, we need to ensure the regexp has the global flag to avoid an infinite loop.
I’m also happy to see that even this StackOverflow question was referenced in the discussions of the proposal.
Here’s a method you can use to get the nth capturing group for each match:
function getMatches(string, regex, index) {
index || (index = 1); // default to the first capturing group
var matches = [];
var match;
while (match = regex.exec(string)) {
matches.push(match[index]);
}
return matches;
}
// Example :
var myString = 'something format_abc something format_def something format_ghi';
var myRegEx = /(?:^|\s)format_(.*?)(?:\s|$)/g;
// Get an array containing the first capturing group for every match
var matches = getMatches(myString, myRegEx, 1);
// Log results
document.write(matches.length + ' matches found: ' + JSON.stringify(matches))
console.log(matches);
The `\b` isn’t exactly the same thing. (It works on `--format_foo/`, but doesn’t work on `format_a_b`) But I wanted to show an alternative to your expression, which is fine. Of course, the `match` call is the important thing.
I want to use JavaScript (can be with jQuery) to do some client-side validation to check whether a string matches the regex:
Ideally it would be an expression that returned true or false.
I’m a JavaScript newbie, does `match()` do what I need? It seems to check whether part of a string matches a regex, not the whole thing.
Use `regex.test()` if all you want is a boolean result:
…and you could remove the `()` from your regexp since you’ve no need for a capture.
Use the `test()` method:
You can use `match()` as well:
But `test()` seems to be faster as you can read here.
Important difference between `match()` and `test()`:
`match()` works only with strings, but `test()` works also with integers.
12345.match(/^([a-z0-9]{5,})$/); // ERROR
/^([a-z0-9]{5,})$/.test(12345); // true
/^([a-z0-9]{5,})$/.test(null); // false
// Better watch out for undefined values
/^([a-z0-9]{5,})$/.test(undefined); // true
I have a parameter file of the form:
Where the parameters may be in any order but there is only one parameter per line. I want to replace one parameter’s `parameter-value` with a new value.
I am using a line replace function posted previously to replace the line which uses Python’s `string.replace(pattern, sub)`. The regular expression that I’m using works for instance in vim but doesn’t appear to work in `string.replace()`.
Here is the regular expression that I’m using:
Where `"interfaceOpDataFile"` is the parameter name that I’m replacing (/i for case-insensitive) and the new parameter value is the contents of the `fileIn` variable.
Is there a way to get Python to recognize this regular expression or else is there another way to accomplish this task?
`str.replace()` does not recognize regular expressions.
To perform a substitution using a regular expression, use `re.sub()`.
For example:
import re
line = re.sub(
r"(?i)^.*interfaceOpDataFile.*$",
"interfaceOpDataFile %s" % fileIn,
line
)
In a loop, it would be better to compile the regular expression first:
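The compiled version is missing from this copy. A minimal sketch of what it could look like follows; `param_re` and `replace_param` are hypothetical names, and the replacement is passed as a function so that backslashes in a file path are not interpreted as escape sequences (an extra precaution, not something the answer states).

```python
import re

# Compile once, outside the loop, and reuse the pattern object.
param_re = re.compile(r"^.*interfaceOpDataFile.*$", re.IGNORECASE)

def replace_param(lines, file_in):
    """Return the lines with any interfaceOpDataFile line rewritten."""
    new_line = "interfaceOpDataFile %s" % file_in
    # A function as replacement inserts new_line literally.
    return [param_re.sub(lambda m: new_line, line) for line in lines]

lines = ["foo 1\n", "InterfaceOpDataFile old.dat\n"]
print(replace_param(lines, "new.dat"))
```

Because `.` does not match the newline, the trailing `"\n"` of each line is left in place.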
You are looking for the re.sub function.
will print axample atring
As a summary
import sys
import re
f = sys.argv[1]
find = sys.argv[2]
replace = sys.argv[3]
with open(f, "r") as myfile:
    s = myfile.read()
ret = re.sub(find, replace, s)  # <<< This is where the magic happens
print(ret)
I need a regex that will accept only digits from 0-9 and nothing else. No letters, no characters.
I thought this would work:
or even
but these are accepting the characters : ^,$,(,), etc
I thought that both the regexes above would do the trick and I’m not sure why its accepting those characters.
EDIT:
This is exactly what I am doing:
private void OnTextChanged(object sender, EventArgs e)
{
if (!System.Text.RegularExpressions.Regex.IsMatch("^[0-9]", textbox.Text))
{
textbox.Text = string.Empty;
}
}
This is allowing the characters I mentioned above.
Your regex `^[0-9]` matches anything beginning with a digit, including strings like “1A”. To avoid a partial match, append a `$` to the end:
This accepts any number of digits, including none. To accept one or more digits, change the `*` to `+`. To accept exactly one digit, just remove the `*`.
UPDATE: You mixed up the arguments to `IsMatch`. The pattern should be the second argument, not the first:
CAUTION: In JavaScript, `\d` is equivalent to `[0-9]`, but in .NET, `\d` by default matches any Unicode decimal digit, including exotic fare like ႒ (Myanmar 2) and ߉ (N’Ko 9). Unless your app is prepared to deal with these characters, stick with `[0-9]` (or supply the RegexOptions.ECMAScript flag).
Obviously, you can use the `|` (pipe?) to represent `OR`, but is there a way to represent `AND` as well?
Specifically, I’d like to match paragraphs of text that contain ALL of a certain phrase, but in no particular order.
Use a non-consuming regular expression.
The typical (i.e. Perl/Java) notation is:
`(?=expr)`
This means “match expr but after that continue matching at the original match-point.”
You can do as many of these as you want, and this will be an “and.” Example:
(?=match this expression)(?=match this too)(?=oh, and this)
You can even add capture groups inside the non-consuming expressions if you need to save some of the data therein.
You need to use lookahead as some of the other responders have said, but the lookahead has to account for other characters between its target word and the current match position. For example:
The `.*` in the first lookahead lets it match however many characters it needs to before it gets to “word1”. Then the match position is reset and the second lookahead seeks out “word2”. Reset again, and the final part matches “word3”; since it’s the last word you’re checking for, it isn’t necessary that it be in a lookahead, but it doesn’t hurt.
In order to match a whole paragraph, you need to anchor the regex at both ends and add a final `.*` to consume the remaining characters. Using Perl-style notation, that would be:
The ‘m’ modifier is for multiline mode; it lets the `^` and `$` match at paragraph boundaries (“line boundaries” in regex-speak). It’s essential in this case that you not use the ‘s’ modifier, which lets the dot metacharacter match newlines as well as all other characters.
Finally, you want to make sure you’re matching whole words and not just fragments of longer words, so you need to add word boundaries:
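The patterns themselves are missing from this copy, so here is a hedged reconstruction of the final form described above (anchors, two lookaheads, word boundaries), written for Python's `re` module. `word1`, `word2` and `word3` are the placeholder words from the answer; the sample text is an assumption.

```python
import re

# Two lookaheads assert word1 and word2 somewhere on the line;
# the main pattern then requires word3 and consumes the rest of the line.
all_three = re.compile(
    r"^(?=.*\bword1\b)(?=.*\bword2\b).*\bword3\b.*$",
    re.MULTILINE,  # ^ and $ match at line boundaries; DOTALL is deliberately off
)

text = "word3 then word1 then word2\nonly word1 and word3 here"
matches = all_three.findall(text)

print(matches)  # ['word3 then word1 then word2']
```

The order of the words on the line does not matter, because each lookahead starts scanning from the beginning of the line.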
I need something like:
You can also do it using the `-v` (for `--invert-match`) option of grep as:
`grep -v "unwanted_word" file` will filter the lines that have the `unwanted_word` and `grep XXXXXXXX` will list only lines with pattern `XXXXXXXX`.
EDIT:
From your comment it looks like you want to list all lines without the `unwanted_word`. In that case all you need is:
I understood the question as “How do I match a word but exclude another”, for which one solution is two greps in series: First grep finding the wanted “word1”, second grep excluding “word2”:
In my case: I need to differentiate between “plot” and “#plot” which grep’s “word” option won’t do (“#” not being an alphanumeric character).
Hope this helps.
If your `grep` supports Perl regular expressions with the `-P` option you can do (if bash; if tcsh you’ll need to escape the `!`):
Demo:
Let us now list all `foo` except `foo3`
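The shell commands are missing from this copy. The underlying technique, a negative lookahead at the start of the line, can be sketched in Python as follows; the function names and the sample text are assumptions made for the example, not the original grep invocation.

```python
import re

def lines_without(unwanted, text):
    """Lines that do not contain `unwanted` (roughly what grep -v does)."""
    pattern = re.compile(r"^(?!.*" + re.escape(unwanted) + r").*$", re.MULTILINE)
    return pattern.findall(text)

def lines_with_but_without(wanted, unwanted, text):
    """Lines that contain `wanted` but not `unwanted`."""
    pattern = re.compile(
        r"^(?!.*" + re.escape(unwanted) + r").*" + re.escape(wanted) + r".*$",
        re.MULTILINE,
    )
    return pattern.findall(text)

text = "foo1\nfoo2\nfoo3\nbar"
print(lines_without("foo3", text))                  # ['foo1', 'foo2', 'bar']
print(lines_with_but_without("foo", "foo3", text))  # ['foo1', 'foo2']
```

`re.escape` keeps characters such as `#` or `.` in the words from being treated as regex syntax.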
I need to match a space character in a PHP regular expression. Anyone got any ideas?
I mean like “gavin schulz”, the space in between the two words. I am using a regular expression to make sure that I only allow letters, number and a space. But I’m not sure how to find the space. This is what I have right now:
If you’re looking for a space, that would be `" "` (one space).
If you’re looking for one or more, it’s `"  *"` (that’s two spaces and an asterisk) or `" +"` (one space and a plus).
If you’re looking for common spacing, use `"[ X]"` or `"[ X][ X]*"` or `"[ X]+"` where `X` is the physical tab character (and each is preceded by a single space in all those examples).
These will work in every* regex engine I’ve ever seen (some of which don’t even have the one-or-more `"+"` character, ugh).
If you know you’ll be using one of the more modern regex engines, `"\s"` and its variations are the way to go. In addition, I believe word boundaries match start and end of lines as well, important when you’re looking for words that may appear without preceding or following spaces.
For PHP specifically, this page may help.
From your edit, it appears you want to remove all non-valid characters. The start of this is (note the space inside the regex):
If you also want trickery to ensure there’s only one space between each word and none at the start or end, that’s a little more complicated (and probably another question) but the basic idea would be:
`\040` matches exactly the space character.
New Link
Escape sequences for Regex PHP
Here is everything you need to know about whitespace in regular expressions:
- `[[:blank:]]`: space or tab only
- `[[:space:]]`: whitespace
- `\s`: any whitespace character
- `\v`: vertical whitespace
- `\h`: horizontal whitespace
- `x` (pattern modifier): ignore whitespace in the pattern
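The PHP snippets for this question are missing here, so the following is only a sketch of the two ideas discussed above, written in Python: allowing letters, digits and a literal space, and collapsing extra whitespace. Both function names are hypothetical.

```python
import re

def is_valid_name(s):
    """True if s consists only of ASCII letters, digits and plain spaces."""
    # The space inside the character class is a literal space character.
    return re.fullmatch(r"[A-Za-z0-9 ]+", s) is not None

def normalize_spaces(s):
    """Collapse every run of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", s).strip()

print(is_valid_name("gavin schulz"))         # True
print(is_valid_name("gavin\tschulz"))        # False (tab is not a plain space)
print(normalize_spaces("  gavin \t schulz "))  # gavin schulz
```

Using a literal space in the class rejects tabs and newlines, while `\s` in the second function deliberately accepts all of them.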
I would like to create a `String.replaceAll()` method in JavaScript and I’m thinking that using a regex would be most terse way to do it. However, I can’t figure out how to pass a variable in to a regex. I can do this already which will replace all the instances of `"B"` with `"A"`.
But I want to do something like this:
String.prototype.replaceAll = function(replaceThis, withThis) {
this.replace(/replaceThis/g, withThis);
};
But obviously this will only replace the text `"replaceThis"`…so how do I pass this variable in to my regex string?
Instead of using the `/regex/g` syntax, you can construct a new RegExp object:
You can dynamically create regex objects this way. Then you will do:
As Eric Wendelin mentioned, you can do something like this:
This yields `"regex matching ."`. However, it will fail if str1 is `"."`. You’d expect the result to be `"pattern matching regex"`, replacing the period with `"regex"`, but it’ll turn out to be…
This is because, although `"."` is a String, in the RegExp constructor it’s still interpreted as a regular expression, meaning any non-line-break character, meaning every character in the string. For this purpose, the following function may be useful:
Then you can do:
yielding `"pattern matching regex"`.
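The JavaScript snippets are missing from this copy. The escaping problem described above is not specific to JavaScript, so here is a short Python sketch of the same pitfall and its fix; `replace_all` is a hypothetical helper, and `re.escape` plays the role of the escaping function the answer refers to.

```python
import re

def replace_all(text, find, replacement):
    """Replace every literal occurrence of `find` in `text`."""
    # re.escape neutralises regex metacharacters in `find`;
    # the function replacement inserts `replacement` literally.
    return re.sub(re.escape(find), lambda m: replacement, text)

# Unescaped, "." matches every character:
print(re.sub(".", "regex", "ab"))                     # regexregex
# Escaped, only the literal period is replaced:
print(replace_all("pattern matching .", ".", "regex"))  # pattern matching regex
```

For plain literal replacement Python's `str.replace` does the same job without a regex, which mirrors the "don't use regex unless you have to" advice in this thread.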
"ABABAB".replace(/B/g, "A");
As always: don’t use regex unless you have to. For a simple string replace, the idiom is:
Then you don’t have to worry about the quoting issues mentioned in Gracenotes’s answer.
How are non-capturing groups, i.e. `(?:)`, used in regular expressions and what are they good for?
Let me try to explain this with an example.
Consider the following text:
Now, if I apply the regex below over it…
… I would get the following result:
Match "http://stackoverflow.com/"
Group 1: "http"
Group 2: "stackoverflow.com"
Group 3: "/"
Match "https://stackoverflow.com/questions/tagged/regex"
Group 1: "https"
Group 2: "stackoverflow.com"
Group 3: "/questions/tagged/regex"
But I don’t care about the protocol – I just want the host and path of the URL. So, I change the regex to include the non-capturing group `(?:)`.
Now, my result looks like this:
Match "http://stackoverflow.com/"
Group 1: "stackoverflow.com"
Group 2: "/"
Match "https://stackoverflow.com/questions/tagged/regex"
Group 1: "stackoverflow.com"
Group 2: "/questions/tagged/regex"
See? The first group has not been captured. The parser uses it to match the text, but ignores it later, in the final result.
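The text and the two regexes are missing from this copy. The sketch below uses Python and patterns that are my own assumption, chosen so that they reproduce the match listings shown above; the only difference between the two patterns is the `?:` in the first group.

```python
import re

text = "http://stackoverflow.com/\nhttps://stackoverflow.com/questions/tagged/regex"

# Assumed patterns: protocol, host, optional path.
capturing = re.compile(r"(https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?")
non_capturing = re.compile(r"(?:https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?")

# findall returns one tuple of captured groups per match.
with_protocol = capturing.findall(text)
without_protocol = non_capturing.findall(text)

print(with_protocol)
print(without_protocol)
```

Both patterns match exactly the same text; the non-capturing version simply leaves the protocol out of the group list, so the host becomes group 1.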
As requested, let me try to explain groups too.
Well, groups serve many purposes. They can help you to extract exact information from a bigger match (which can also be named), they let you rematch a previous matched group, and can be used for substitutions. Let’s try some examples, shall we?
Ok, imagine you have some kind of XML or HTML (be aware that regex may not be the best tool for the job, but it is nice as an example). You want to parse the tags, so you could do something like this (I have added spaces to make it easier to understand):
The first regex has a named group (TAG), while the second one uses a common group. Both regexes do the same thing: they use the value from the first group (the name of the tag) to match the closing tag. The difference is that the first one uses the name to match the value, and the second one uses the group index (which starts at 1).
Let’s try some substitutions now. Consider the following text:
Now, let’s use this dumb regex over it:
This regex matches words with at least 3 characters, and uses groups to separate the first three letters. The result is this:
Match "Lorem"
Group 1: "L"
Group 2: "o"
Group 3: "r"
Group 4: "em"
Match "ipsum"
Group 1: "i"
Group 2: "p"
Group 3: "s"
Group 4: "um"
...
Match "consectetuer"
Group 1: "c"
Group 2: "o"
Group 3: "n"
Group 4: "sectetuer"
...
So, if we apply the substitution string:
… over it, we are trying to use the first group, add an underscore, use the third group, then the second group, add another underscore, and then the fourth group. The resulting string would be like the one below.
L_ro_em i_sp_um d_lo_or s_ti_ a_em_t c_no_sectetuer f_ue_giat f_ma_es m_la_esuada p_er_tium e_eg_stas.
You can use named groups for substitutions too, using `${name}`.
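The regex and substitution string are missing from this copy. The following Python sketch uses a pattern I assume is close to the original (three single-character groups plus the rest of the word) and reproduces the start of the output shown above. Note that Python writes group references as `\1` and `\g<name>` rather than `$1` and `${name}`.

```python
import re

text = "Lorem ipsum dolor sit amet"

# Groups 1-3 hold the first three letters, group 4 the rest of the word.
pattern = re.compile(r"\b(\w)(\w)(\w)(\w*)\b")
swapped = pattern.sub(r"\1_\3\2_\4", text)
print(swapped)  # L_ro_em i_sp_um d_lo_or s_ti_ a_em_t

# Named groups work the same way in the replacement string.
named = re.compile(r"\b(?P<first>\w)(?P<rest>\w*)\b")
rotated = named.sub(r"\g<rest>\g<first>", "Lorem ipsum")
print(rotated)  # oremL psumi
```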
To play around with regexes, I recommend http://regex101.com/, which offers a good amount of details on how the regex works; it also offers a few regex engines to choose from.
You can use capturing groups to organize and parse an expression. A non-capturing group has the first benefit, but doesn’t have the overhead of the second. You can still say a non-capturing group is optional, for example.
Say you want to match numeric text, but some numbers could be written as 1st, 2nd, 3rd, 4th,… If you want to capture the numeric part, but not the (optional) suffix you can use a non-capturing group.
That will match numbers in the form 1, 2, 3… or in the form 1st, 2nd, 3rd,… but it will only capture the numeric part.
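The pattern for this answer is missing here. A minimal Python sketch of the idea, with an assumed pattern and sample string, could look like this:

```python
import re

# The suffix is grouped so it can be made optional, but it is not captured.
ordinal = re.compile(r"\b([0-9]+)(?:st|nd|rd|th)?\b")

numbers = ordinal.findall("1st, 22, 3rd and 104th")
print(numbers)  # ['1', '22', '3', '104']
```

Because the suffix group is non-capturing, `findall` returns only the numeric part of each match.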
`?:` is used when you want to group an expression, but you do not want to save it as a matched/captured portion of the string.
An example would be something to match an IP address:
Note that I don’t care about saving the first 3 octets, but the `(?:...)` grouping allows me to shorten the regex without incurring the overhead of capturing and storing a match.
I have got a price field to display which sometimes can be either 100 or 100.99 or 100.9, What I want is to display the price in 2 decimal places only if the decimals are entered for that price , for instance if its 100 so it should only show 100 not 100.00 and if the price is 100.2 it should display 100.20 similarly for 100.22 should be same . I googled and came across some examples but they didn’t match exactly what i wanted :
An inelegant way would be:
With DoFormat
being something like:
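public static string DoFormat( double myNumber )
{
    // Always format with two decimals first, e.g. "100.00" or "100.20".
    var s = string.Format("{0:0.00}", myNumber);
    if ( s.EndsWith("00") )
    {
        // Drop the separator and the two zeros from the formatted string.
        // (Casting myNumber to int would truncate: 99.999 would give "99", not "100".)
        return s.Substring(0, s.Length - 3);
    }
    else
    {
        return s;
    }
}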
Not elegant but working for me in similar situations in some projects.
Sorry for reactivating this question, but I didn’t find the right answer here.
In formatting numbers you can use 0
as a mandatory place and #
as an optional place.
So:
// just two decimal places
String.Format("{0:0.##}", 123.4567); // "123.46"
String.Format("{0:0.##}", 123.4); // "123.4"
String.Format("{0:0.##}", 123.0); // "123"
You can also combine 0
with #
.
String.Format("{0:0.0#}", 123.4567) // "123.46"
String.Format("{0:0.0#}", 123.4) // "123.4"
String.Format("{0:0.0#}", 123.0) // "123.0"
This formatting method always uses CurrentCulture. For some cultures, the . will be changed to ,.
The simplest solution comes from @Andrew (here). So I personally would use something like this:
This is a common use case when formatting floating-point numbers.
Unfortunately, all of the built-in one-letter format strings (eg. F, G, N) won’t achieve this directly.
For example, num.ToString("F2")
will always show 2 decimal places like 123.40
.
You’ll have to use the 0.## pattern, even if it looks a little verbose.
A complete code example:
double a = 123.4567;
double b = 123.40;
double c = 123.00;
string sa = a.ToString("0.##"); // 123.46
string sb = b.ToString("0.##"); // 123.4
string sc = c.ToString("0.##"); // 123
Example: “This is just\na simple sentence”.
I want to match every character between “This is” and “sentence”. Line breaks should be ignored. I can’t figure out the correct syntax.
For example
I used lookbehind (?<=) and look ahead (?=) so that “This is” and “sentence” are not included in the match, but this is up to your use case; you can also simply write This is(.*)sentence.
The important thing here is that you activate the “dotall” mode of your regex engine, so that the . matches the newline. How you do this depends on your regex engine.
The next question is whether you use .* or .*?. The first one is greedy and will match up to the last “sentence” in your string; the second one is lazy and will match up to the next “sentence” in your string.
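To make the greedy/lazy difference concrete, here is a minimal Python sketch. The sample strings and variable names are illustrative, and re.DOTALL is simply Python's way of switching on dotall mode; other engines spell it differently.

```python
import re

text = "This is my first sentence. This is my second sentence."

# Greedy: runs up to the LAST "sentence" in the string.
greedy = re.search(r"(?<=This is).*(?=sentence)", text, re.DOTALL).group(0)

# Lazy: stops at the NEXT "sentence".
lazy = re.search(r"(?<=This is).*?(?=sentence)", text, re.DOTALL).group(0)

# With DOTALL the dot also crosses line breaks.
multiline = "This is just\na simple sentence"
across_lines = re.search(r"(?<=This is).*?(?=sentence)", multiline, re.DOTALL).group(0)

print(repr(greedy), repr(lazy), repr(across_lines))
```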
Update
Where the (?s) turns on the dotall modifier, making the . match newline characters.
Update 2:
is matching your example “This is (a simple) sentence”. See here on Regexr
Resurrecting this question because the regex in the accepted answer doesn’t seem quite correct to me. Why? Because
will match my first sentence. This is my second
in This is my first sentence. This is my second sentence.
You need a lazy quantifier between the two lookarounds. Adding a ?
makes the star lazy.
This matches what you want:
See demo. I removed the capture group, which was not needed.
DOTALL Mode to Match Across Line Breaks
Note that in the demo the “dot matches line breaks” mode (a.k.a. dot-all) is set (see how to turn on DOTALL in various languages). In many regex flavors, you can set it with the inline modifier (?s), turning the expression into:
Reference
Try This is[\s\S]*sentence
, works in javascript
hash = window.location.hash.substr(1);
var reg = new RegExp('^[0-9]$');
console.log(reg.test(hash));
I get false on both "123"
and "123f"
. I would like to check if the hash only contains numbers. Did I miss something?
should do it. The original matches anything that consists of exactly one digit.
As you said, you want hash to contain only numbers.
or
\d
and [0-9]
both mean the same thing. The + means: search for one or more occurrences of [0-9].
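For comparison, a minimal Python sketch of the same check (the helper name is_all_digits is made up for illustration). Note that in Python 3 string patterns \d also matches non-ASCII decimal digits, so [0-9] is the stricter choice there.

```python
import re

def is_all_digits(s):
    """True only if s consists of one or more ASCII digits and nothing else."""
    # fullmatch anchors at both ends, so no explicit ^ and $ are needed.
    return re.fullmatch(r"[0-9]+", s) is not None

print(is_all_digits("123"), is_all_digits("123f"))
```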
This one also allows signed and floating-point numbers, as well as the empty string:
If you don’t want to allow the empty string, use this one:
I know that I can negate group of chars as in [^bar]
but I need a regular expression where negation applies to the specific word - so in my example how do I negate an actual "bar"
and not "any chars in bar"
?
A great way to do this is to use negative lookahead:
The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead [is any regex pattern].
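A short Python sketch of two common ways to apply a negative lookahead to the word "bar". Both patterns are illustrative assumptions, since the answer's own regex is not shown here.

```python
import re

# Whole-line test: succeeds only if "bar" appears nowhere in the line.
no_bar = re.compile(r"^(?!.*bar).*$")

# Word-level test: matches every word except the exact word "bar".
words_not_bar = re.compile(r"\b(?!bar\b)\w+")

print(bool(no_bar.match("foo baz")), bool(no_bar.match("foobar")))
print(words_not_bar.findall("foo bar barn baz"))
```

The second pattern still accepts "barn", because the lookahead only rejects "bar" when it is followed by a word boundary.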
Unless performance is of utmost concern, it’s often easier just to run your results through a second pass, skipping those that match the words you want to negate.
Regular expressions usually mean you’re doing scripting or some sort of low-performance task anyway, so find a solution that is easy to read, easy to understand and easy to maintain.
The following regex will do what you want (as long as negative lookbehinds and lookaheads are supported), matching things properly; the only problem is that it matches individual characters (i.e. each match is a single character rather than all characters between two consecutive “bar”s), possibly resulting in a potential for high overhead if you’re working with very long strings.
How can I write a regex that matches only letters?
Use a character set: [a-zA-Z]
matches one letter from A–Z in lowercase and uppercase. [a-zA-Z]+
matches one or more letters and ^[a-zA-Z]+$
matches only strings that consist of one or more letters only (^
and $
mark the begin and end of a string respectively).
If you want to match other letters than A–Z, you can either add them to the character set: [a-zA-ZäöüßÄÖÜ]
. Or you use predefined character classes like the Unicode character property class \p{L}
that describes the Unicode characters that are letters.
\p{L}
matches anything that is a Unicode letter if you’re interested in alphabets beyond the Latin one
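A hedged Python sketch of both variants. As far as I know, Python's built-in re module does not support \p{L}, so this example swaps in str.isalpha() for the Unicode case; the helper names are invented for illustration.

```python
import re

ascii_letters = re.compile(r"[a-zA-Z]+")

def is_ascii_letters(s):
    """True if s is one or more letters from A-Z or a-z only."""
    return ascii_letters.fullmatch(s) is not None

def is_unicode_letters(s):
    """True if s is non-empty and every character is a Unicode letter."""
    # str.isalpha() stands in for ^\p{L}+$ here (a swapped-in technique).
    return s.isalpha()

print(is_ascii_letters("Hello"), is_ascii_letters("Grüße"), is_unicode_letters("Grüße"))
```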
I’m looking for the Python equivalent of
The str.split()
method without an argument splits on whitespace:
Another method through re
module. It does the reverse operation: it matches all the words instead of splitting the whole sentence on spaces.
>>> import re
>>> s = "many fancy word \nhello \thi"
>>> re.findall(r'\S+', s)
['many', 'fancy', 'word', 'hello', 'hi']
Above regex would match one or more non-space characters.
What is the regular expression for a decimal with a precision of 2?
Valid examples:
Invalid examples:
The decimal point may be optional, and integers may also be included.
Valid regex tokens vary by implementation. The most generic form that I know of would be:
The most compact:
Both assume that you must have both at least one digit before and one after the decimal place.
To require that the whole string is a number of this form, wrap the expression in start and end tags such as (in Perl’s form):
ADDED: Wrapped the fractional portion in ()? to make it optional. Be aware that this excludes forms such as “12.” Including that would be more like ^\d+\.?\d{0,2}$
.
Added: Use this format ^\d{1,6}(\.\d{1,2})?$
to stop repetition and give a restriction to whole part of the decimal value.
And since regular expressions are horrible to read, much less understand, here is the verbose equivalent:
^ # Start of string
[0-9]+ # Require one or more numbers
( # Begin optional group
\. # Point must be escaped or it is treated as "any character"
[0-9]{1,2} # One or two numbers
)? # End group--signify that it's optional with "?"
$ # End of string
You can replace [0-9]
with \d
in most regular expression implementations (including PCRE, the most common). I’ve left it as [0-9]
as I think it’s easier to read.
Also, here is the simple Python script I used to check it:
import re
deci_num_checker = re.compile(r"""^[0-9]+(\.[0-9]{1,2})?$""")
valid = ["123.12", "2", "56754", "92929292929292.12", "0.21", "3.1"]
invalid = ["12.1232", "2.23332", "e666.76"]
# Every valid sample must match, and no invalid sample may match.
assert all(deci_num_checker.match(x) is not None for x in valid)
assert all(deci_num_checker.match(x) is None for x in invalid)
To include an optional minus sign and to disallow numbers like 015
(which can be mistaken for octal numbers) write:
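One pattern that fits this description, sketched in Python. The exact regex is an assumption rather than the answer's original: an optional minus sign, then either a lone 0 or a number that does not start with 0, then an optional fraction of one or two digits.

```python
import re

# ASSUMED pattern: -? sign, no leading zeros in the integer part,
# optional fraction with one or two digits.
signed_decimal = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]{1,2})?")

def is_valid(s):
    return signed_decimal.fullmatch(s) is not None

for sample in ["-12.5", "0.21", "015", "12.345"]:
    print(sample, is_valid(sample))
```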
I’m currently programming a vocabulary algorithm that checks if a user has typed in the word correctly. I have the following situation: The correct solution for the word would be “part1, part2”. The user should be able to enter either “part1” (answer 1), “part2” (answer 2) or “part1, part2” (answer 3). I now try to match the string given by the user with the following, automatically created, regex expression:
This only returns answer 1 and 2 as correct while answer 3 would be wrong. I’m now wondering whether there’s an operator similar to | that says and/or
instead of either...or
.
May anyone help me solve this problem?
I’m going to assume you want to build the regex dynamically to contain other words than part1 and part2, and that you want order not to matter. If so you can use something like this:
Positive matches:
Negative matches:
does it work?
Does this work without alternation?
or why not this?
The first works for all conditions, the second for all but part2
(using GNU sed 4.1.5)
Take this regular expression: /^[^abc]/
. This will match any single character at the beginning of a string, except a, b, or c.
If you add a *
after it – /^[^abc]*/
– the regular expression will continue to add each subsequent character to the result, until it meets either an a
, or b
, or c
.
For example, with the source string "qwerty qwerty whatever abc hello"
, the expression will match up to "qwerty qwerty wh"
.
But what if I wanted the matching string to be "qwerty qwerty whatever "
…In other words, how can I match everything up to (but not including) the exact sequence "abc"
?
You didn’t specify which flavor of regex you’re using, but this will work in any of the most popular ones that can be considered “complete”.
The .+?
part is the un-greedy version of .+
(one or more of anything). When we use .+
, the engine will basically match everything. Then, if there is something else in the regex it will go back in steps trying to match the following part. This is the greedy behavior, meaning as much as possible to satisfy.
When using .+?
, instead of matching all at once and going back for other conditions (if any), the engine will match the next characters by step until the subsequent part of the regex is matched (again if any). This is the un-greedy, meaning match the fewest possible to satisfy.
/.+X/  ~ "abcXabcXabcX"        /.+/   ~ "abcXabcXabcX"
          ^^^^^^^^^^^^                   ^^^^^^^^^^^^
/.+?X/ ~ "abcXabcXabcX"        /.+?/  ~ "abcXabcXabcX"
          ^^^^                           ^
Following that we have (?=
{contents}
)
, a zero width assertion, a look around. This grouped construction matches its contents, but does not count as characters matched (zero width). It only returns if it is a match or not (assertion).
Thus, in other terms the regex /.+?(?=abc)/
means:
Match any characters as few as possible until a “abc” is found, without counting the “abc”.
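Here is a minimal Python sketch using the question's own sample string. The variable names are illustrative, and the str.partition line is an extra, regex-free alternative that is not part of the answer.

```python
import re

source = "qwerty qwerty whatever abc hello"

# Lazy match that stops right before the first "abc" (lookahead, not consumed).
before = re.search(r".+?(?=abc)", source).group(0)

# Same result with a capturing group instead of a lookahead.
captured = re.match(r"^(.*?)abc", source).group(1)

# Without a regex: everything before the first "abc".
plain = source.partition("abc")[0]

print(repr(before))
```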
If you’re looking to capture everything up to “abc”:
Explanation:
- ( ) captures the expression inside the parentheses for access using $1, $2, etc.
- ^ matches the start of the line
- .* matches anything; the trailing ? makes it non-greedy (match the minimum number of characters required) - [1]
[1] The reason why this is needed is that otherwise, in the following string:
by default, regexes are greedy, meaning it will match as much as possible. Therefore /^.*abc/
would match “whatever whatever something abc something”. Adding the non-greedy quantifier ?
makes the regex only match “whatever whatever something”.
As @Jared Ng and @Issun pointed out, the key to solving this kind of regex problem, such as “matching everything up to a certain word or substring” or “matching everything after a certain word or substring”, is the zero-length “lookaround” assertion. Read more about them here.
In your particular case, it can be solved by a positive look ahead. A picture is worth a thousand words. See the detail explanation in the screenshot.
I have a string that has two single quotes in it, the '
character. In between the single quotes is the data I want.
How can I write a regex to extract “the data i want” from the following text?
Assuming you want the part between single quotes, use this regular expression with a Matcher
:
Example:
String mydata = "some string with 'the data i want' inside";
Pattern pattern = Pattern.compile("'(.*?)'");
Matcher matcher = pattern.matcher(mydata);
if (matcher.find())
{
System.out.println(matcher.group(1));
}
Result:
the data i want
You don’t need regex for this.
Add apache commons lang to your project (http://commons.apache.org/proper/commons-lang/), then use:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        // Group 1 captures everything between the last pair of single quotes.
        Pattern pattern = Pattern.compile(".*'([^']*)'.*");
        String mydata = "some string with 'the data i want' inside";
        Matcher matcher = pattern.matcher(mydata);
        // matches() requires the whole input to match the pattern.
        if (matcher.matches()) {
            System.out.println(matcher.group(1));
        }
    }
}
How can I make the following regex ignore case sensitivity? It should match all the correct characters but ignore whether they are lower or uppercase.
Assuming you want the whole regex to ignore case, you should look for the i
flag. Nearly all regex engines support it:
Check the documentation for your language/platform/tool to find how the matching modes are specified.
If you want only part of the regex to be case insensitive (as my original answer presumed), then you have two options:
1. Use the (?i) and [optionally] (?-i) mode modifiers:
(?i)G[a-b](?-i).*
2. Put all the variations (i.e. lowercase and uppercase) in the regex - useful if mode modifiers are not supported:
[gG][a-bA-B].*
One last note: if you're dealing with Unicode characters besides ASCII, check whether or not your regex engine properly supports them.
Depends on implementation, but I would use
(?i)G[a-b].
VARIATIONS:
Modern regex flavors allow you to apply modifiers to only part of the regular expression. If you insert the modifier (?im) in the middle of the regex then the modifier only applies to the part of the regex to the right of the modifier. With these flavors, you can turn off modes by preceding them with a minus sign (?-i).
Description is from the page: https://www.regular-expressions.info/modifiers.html
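For Python specifically, a hedged sketch of both approaches. Python's re module has no standalone (?-i) toggle in the middle of a pattern; instead it offers scoped groups such as (?i:...) (Python 3.6 and later). The sample patterns are illustrative.

```python
import re

# Whole pattern case-insensitive via a flag.
whole = re.compile(r"G[a-b].*", re.IGNORECASE)

# Only the first letter case-insensitive, using a scoped inline group.
partial = re.compile(r"(?i:g)[a-b].*")

print(bool(whole.fullmatch("gB rest")))   # flag applies everywhere
print(bool(partial.fullmatch("Ga1")))     # "G" accepted, "a" is lowercase
print(bool(partial.fullmatch("gB")))      # "B" is outside the scoped group
```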
A regular expression to validate ‘abc’, ignoring case:
How can I find/replace all CR/LF characters in Notepad++?
I am looking for something equivalent to the ^p special character in Microsoft Word.
[\r\n]+
should work too
Update March, 26th 2012, release date of Notepad++ 6.0:
OMG, it actually does work now!!!
Original answer 2008 (Notepad++ 4.x) - 2009-2010-2011 (Notepad++ 5.x)
Actually no, it does not seem to work with regexp…
But if you have Notepad++ 5.x, you can use the ‘extended’ search mode and look for \r\n
. That does find all your CRLF
.
(I realize this is the same answer as the others, but again, ‘extended mode’ is only available with Notepad++ 4.9, 5.x and more)
Since April 2009, you have a wiki article on the Notepad++ site on this topic:
“How To Replace Line Ends, thus changing the line layout”.
(mentioned by georgiecasey in his/her answer below)
Some relevant extracts include the following search processes:
Simple search (Ctrl+F), Search Mode = Normal
You can select an EOL in the editing window.
- Just move the cursor to the end of the line, and type Shift+Right Arrow.
- or, to select EOL with the mouse, start just at the line end and drag to the start of the next line; dragging to the right of the EOL won’t work. You can manually copy the EOL and paste it into the field for Unix files (LF-only).
Simple search (Ctrl+F), Search Mode = Extended
The “Extended” option shows \n and \r as characters that could be matched.
As with the Normal search mode, Notepad++ is looking for the exact character.
Searching for \r in a UNIX-format file will not find anything, but searching for \n will. Similarly, a Macintosh-format file will contain \r but not \n.
Simple search (Ctrl+F), Search Mode = Regular expression
Regular expressions use the characters ^ and $ to anchor the match string to the beginning or end of the line. For instance, searching for return;$ will find occurrences of “return;” that occur with no subsequent text on that same line. The anchor characters work identically in all file formats.
The ‘.’ dot metacharacter does not match line endings.
[Tested in Notepad++ 5.8.5]: a regular expression search with an explicit \r or \n does not work (contrary to the Scintilla documentation).
Neither does a search on an explicit (pasted) LF, or on the (invisible) EOL characters placed in the field when an EOL is selected.
Advanced search (Ctrl+R) without regexp
Ctrl+M will insert something that matches newlines. They will be replaced by the replace string.
I recommend this method as the most reliable, unless you really need to use regex.
As an example, to remove every second newline in a double spaced file, enter Ctrl+M twice in the search string box, and once in the replace string box.
Advanced search (Ctrl+R) with Regexp
Neither Ctrl+M, $ nor \r\n are matched.
The same wiki also mentions the Hex editor alternative:
- Type the new string at the beginning of the document.
- Then select to view the document in Hex mode.
- Select one of the new lines and hit Ctrl+H.
- While you have the Replace dialog box up, select on the background the new replacement string and Ctrl+C copy it to paste it in the Replace with text input.
- Then Replace or Replace All as you wish.
Note: the character selected for new line usually appears as 0a.
It may have a different value if the file is in Windows Format. In that case you can always go to Edit -> EOL Conversion -> Convert to Unix Format, and after the replacement switch it back with Edit -> EOL Conversion -> Convert to Windows Format.
It appears that this is a FAQ, and the resolution offered is:
Simple search (Ctrl+H) without regexp
You can turn on View/Show End of Line or view/Show All, and select the now visible newline characters. Then when you start the command some characters matching the newline character will be pasted into the search field. Matches will be replaced by the replace string, unlike in regex mode.
Note 1: If you select them with the mouse, start just before them and drag to the start of the next line. Dragging to the end of the line won’t work.
Note 2: You can’t copy and paste them into the field yourself.
Advanced search (Ctrl+R) without regexp
Ctrl+M will insert something that matches newlines. They will be replaced by the replace string.
On the Replace dialog, you want to set the search mode to “Extended”. Normal or Regular Expression modes won’t work.
Then just find “\r\n” (or just \n for Unix files or just \r for Mac format files), and set the replace to whatever you want.
Currently I have an input box which will detect the URL and parse the data.
So right now, I am using:
var urlR = /^(?:([A-Za-z]+):)?(\/{0,3})([0-9.\-A-Za-z]+)(?::(\d+))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?$/;
var url= content.match(urlR);
The problem is, when I enter a URL like www.google.com, it’s not working. When I enter http://www.google.com, it works.
I am not very fluent in regular expressions. Can anyone help me?
Regex if you want to ensure URL starts with HTTP/HTTPS:
https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)
If you do not require HTTP protocol:
To try this out see http://regexr.com?37i6s, or for a version which is less restrictive http://regexr.com/3e6m0.
Example JavaScript implementation:
(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})
Will match the following cases
http://www.foufos.gr
https://www.foufos.gr
http://foufos.gr
http://www.foufos.gr/kino
http://werer.gr
www.foufos.gr
www.mp3.com
www.t.co
http://t.co
http://www.t.co
https://www.t.co
www.aa.com
http://aa.com
http://www.aa.com
https://www.aa.com
Will NOT match the following
www.foufos
www.foufos-.gr
www.-foufos.gr
foufos.gr
http://www.foufos
http://foufos
www.mp3#.com
var expression = /(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})/gi;
var regex = new RegExp(expression);
var check = [
'http://www.foufos.gr',
'https://www.foufos.gr',
'http://foufos.gr',
'http://www.foufos.gr/kino',
'http://werer.gr',
'www.foufos.gr',
'www.mp3.com',
'www.t.co',
'http://t.co',
'http://www.t.co',
'https://www.t.co',
'www.aa.com',
'http://aa.com',
'http://www.aa.com',
'https://www.aa.com',
'www.foufos',
'www.foufos-.gr',
'www.-foufos.gr',
'foufos.gr',
'http://www.foufos',
'http://foufos',
'www.mp3#.com'
];
check.forEach(function(entry) {
if (entry.match(regex)) {
$("#output").append( "<div >Success: " + entry + "</div>" );
} else {
$("#output").append( "<div>Fail: " + entry + "</div>" );
}
});
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<div id="output"></div>
Check it in rubular - NEW version
Check it in rubular - old version
I need to validate a date string for the format dd/mm/yyyy
with a regular expression.
This regex validates dd/mm/yyyy
, but not the invalid dates like 31/02/4500
:
What is a valid regex to validate dd/mm/yyyy
format with leap year support?
The regex you pasted does not validate leap years correctly, but there is one that does in the same post. I modified it to take dd/mm/yyyy
, dd-mm-yyyy
or dd.mm.yyyy
.
^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]))\1|(?:(?:29|30)(\/|-|\.)(?:0?[13-9]|1[0-2])\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)0?2\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9])|(?:1[0-2]))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
I tested it a bit in the link Arun provided in his answer and also here and it seems to work.
Edit February 14th 2019: I’ve removed a comma that was in the regex which allowed dates like 29-0,-11
I have extended the regex given by @Ofir Luzon to the formats dd-mmm-YYYY, dd/mmm/YYYY, and dd.mmm.YYYY, as per my requirement. Anyone else with the same requirement can refer to this:
^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]|(?:Jan|Mar|May|Jul|Aug|Oct|Dec)))\1|(?:(?:29|30)(\/|-|\.)(?:0?[13-9]|1[0-2]|(?:Jan|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)(?:0?2|(?:Feb))\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9]|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep))|(?:1[0-2]|(?:Oct|Nov|Dec)))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
and tested for some test cases here http://regexr.com/39tr1.
For better understanding for this Regular expression refer this image:
Notice:
Your regexp does not work for years that “are multiples of 4 and 100, but not of 400”. Years that pass that test are not leap years. For example: 1900, 2100, 2200, 2300, 2500, etc. In other words, it puts all years with the format in the same class of leap years, which is incorrect. – MuchToLearn
So it works properly only for [1901 - 2099] (Whew) 😊
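If a regex is not a hard requirement, the century rule can be left to a date library. The following Python sketch names the swapped-in technique plainly: it uses a regex only for the dd/mm/yyyy shape and datetime.strptime for calendar validity. The function name is_valid_date is made up for illustration.

```python
import re
from datetime import datetime

def is_valid_date(s):
    """Check dd/mm/yyyy shape with a regex, then calendar validity with datetime."""
    # strptime alone would also accept "1/2/2000", so enforce the shape first.
    if re.fullmatch(r"\d{2}/\d{2}/\d{4}", s) is None:
        return False
    try:
        datetime.strptime(s, "%d/%m/%Y")
        return True
    except ValueError:
        # Raised for impossible dates such as 31/02 or 29/02 in a non-leap year.
        return False

for sample in ["29/02/2000", "29/02/1900", "31/02/4500"]:
    print(sample, is_valid_date(sample))
```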
dd/MM/yyyy:
Checks if leap year. Years from 1900 to 9999 are valid. Only dd/MM/yyyy
(^(((0[1-9]|1[0-9]|2[0-8])[\/](0[1-9]|1[012]))|((29|30|31)[\/](0[13578]|1[02]))|((29|30)[\/](0[469]|11)))[\/](19|[2-9][0-9])\d\d$)|(^29[\/]02[\/](19|[2-9][0-9])(00|04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96)$)
I would like to remove specific characters from strings within a vector, similar to the Find and Replace feature in Excel.
Here are the data I start with:
I start with just the first column; I want to produce the second column by removing the e
’s:
With a regular expression and the function gsub()
:
group <- c("12357e", "12575e", "197e18", "e18947")
group
[1] "12357e" "12575e" "197e18" "e18947"
gsub("e", "", group)
[1] "12357" "12575" "19718" "18947"
What gsub
does here is to replace each occurrence of "e"
with an empty string ""
.
See ?regexp
or gsub
for more help.
Regular expressions are your friends:
R> ## also adds missing ')' and sets column name
R> group<-data.frame(group=c("12357e", "12575e", "197e18", "e18947"))
R> group
group
1 12357e
2 12575e
3 197e18
4 e18947
Now use gsub()
with the simplest possible replacement pattern: empty string:
Summarizing 2 ways to replace strings:
- gsub
- the stringr package
Both will produce the desired output:
For example, this regex
will match:
But how do I get it to match across multiple lines?
It depends on the language, but there should be a modifier that you can add to the regex pattern. In PHP it is:
The s at the end causes the dot to match all characters including newlines.
Try this:
It basically says “any character or a newline” repeated zero or more times.
The question is, can .
pattern match any character? The answer varies from engine to engine. The main difference is whether the pattern is used by a POSIX or non-POSIX regex library.
Special note about lua-patterns: they are not considered regular expressions, but .
matches any char there, same as POSIX based engines.
Another note on matlab and octave: the .
matches any char by default (demo): str = "abcde\n fghij<Foobar>"; expression = '(.*)<Foobar>*'; [tokens,matches] = regexp(str,expression,'tokens','match');
(tokens
contain a abcde\n fghij
item).
Also, in all of boost’s regex grammars the dot matches line breaks by default. Boost’s ECMAScript grammar allows you to turn this off with regex_constants::no_mod_m
(source).
As for oracle (it is POSIX based), use n
option (demo): select regexp_substr('abcde' || chr(10) ||' fghij<Foobar>', '(.*)<Foobar>', 1, 1, 'n', 1) as results from dual
POSIX-based engines:
A mere .
already matches line breaks, no need to use any modifiers, see bash (demo).
The tcl (demo), postgresql (demo), r (TRE, base R default engine with no perl=TRUE
, for base R with perl=TRUE
or for stringr/stringi patterns, use the (?s)
inline modifier) (demo) also treat .
the same way.
However, most POSIX based tools process input line by line. Hence, .
does not match the line breaks just because they are not in scope. Here are some examples how to override this:
sed 'H;1h;$!d;x; s/\(.*\)><Foobar>/\1/'
(H;1h;$!d;x;
slurps the file into memory). If whole lines must be included, sed '/start_pattern/,/end_pattern/d' file
(removing from start will end with matched lines included) or sed '/start_pattern/,/end_pattern/{{//!d;};}' file
(with matching lines excluded) can be considered.
perl -0pe 's/(.*)<FooBar>/$1/gs' <<< "$str"
(-0
slurps the whole file into memory, -p
prints the file after applying the script given by -e
). Note that using -000pe
will slurp the file and activate ‘paragraph mode’ where Perl uses consecutive newlines (\n\n
) as the record separator.
grep -Poz '(?si)abc\K.*?(?=<Foobar>)' file
. Here, z
enables file slurping, (?s)
enables the DOTALL mode for the .
pattern, (?i)
enables case insensitive mode, \K
omits the text matched so far, *?
is a lazy quantifier, (?=<Foobar>)
matches the location before <Foobar>
.
pcregrep -Mi "(?si)abc\K.*?(?=<Foobar>)" file
(M
enables file slurping here). Note pcregrep
is a good solution for Mac OS grep
users.
Non-POSIX-based engines:
- php - Use the s modifier (PCRE_DOTALL): preg_match('~(.*)<Foobar>~s', $s, $m) (demo)
- c# - Use the RegexOptions.Singleline flag (demo): var result = Regex.Match(s, @"(.*)<Foobar>", RegexOptions.Singleline).Groups[1].Value; or inline: var result = Regex.Match(s, @"(?s)(.*)<Foobar>").Groups[1].Value;
- powershell - Use the (?s) inline option: $s = "abcde`nfghij<FooBar>"; $s -match "(?s)(.*)<Foobar>"; $matches[1]
- perl - Use the s modifier (or the (?s) inline version at the start) (demo: https://ideone.com/nsYpjE): /(.*)<FooBar>/s
- python - Use the re.DOTALL (or re.S) flag or the (?s) inline modifier (demo: https://ideone.com/A21CXy): m = re.search(r"(.*)<FooBar>", s, flags=re.S) (and then if m:, print(m.group(1)))
- java - Use the Pattern.DOTALL modifier (or the inline (?s) flag) (demo: https://ideone.com/Oq1j8Z): Pattern.compile("(.*)<FooBar>", Pattern.DOTALL)
- groovy - Use the (?s) in-pattern modifier (demo: https://ideone.com/2wmACW): regex = /(?s)(.*)<FooBar>/
- scala - Use the (?s) modifier (demo: https://ideone.com/faL4xJ): "(?s)(.*)<Foobar>".r.findAllIn("abcde\n fghij<Foobar>").matchData foreach { m => println(m.group(1)) }
- javascript - Use [^] or the workarounds [\d\D] / [\w\W] / [\s\S] (demo: https://jsfiddle.net/36c6rt7o/3/): s.match(/([\s\S]*)<FooBar>/)[1]
- c++ (std::regex) - Use [\s\S] or the JS workarounds (demo: https://ideone.com/2xC4ih): regex rex(R"(([\s\S]*)<FooBar>)");
- vba - Use the same approach as in JavaScript, ([\s\S]*)<Foobar>.
- ruby - Use the /m MULTILINE modifier (https://ruby-doc.org/core-2.4.0/Regexp.html#class-Regexp-label-Options) (demo: https://ideone.com/hSj5M2): s[/(.*)<Foobar>/m, 1]
- go - Use the inline modifier (?s) at the start (demo: https://play.golang.org/p/Xqproig3jZ): re := regexp.MustCompile(`(?s)(.*)<FooBar>`)
- swift - Use dotMatchesLineSeparators (https://developer.apple.com/documentation/foundation/nsregularexpression.options/1412529-dotmatcheslineseparators) or (easier) pass the (?s) inline modifier to the pattern: let rx = "(?s)(.*)<Foobar>"
- objective-c - Same as Swift, (?s) works the easiest, but here is how the option can be used (https://ideone.com/C6RP37): NSRegularExpression* regex = [NSRegularExpression regularExpressionWithPattern:pattern options:NSRegularExpressionDotMatchesLineSeparators error:&regexError];
- re2, google-apps-script - Use the (?s) modifier (demo: https://docs.google.com/spreadsheets/d/1kn6Bb4TTjXT27Yfqwi3Z9K6YQVQxqHIBYoAAa1B4NsA/edit#gid=0): "(?s)(.*)<Foobar>" (in Google Spreadsheets, =REGEXEXTRACT(A2,"(?s)(.*)<Foobar>"))
NOTES ON (?s)
:
In most non-POSIX engines, (?s)
inline modifier (or embedded flag option) can be used to enforce .
to match line breaks.
If placed at the start of the pattern, (?s) changes the behavior of all . in the pattern. If the (?s) is placed somewhere after the beginning, only those . located to the right of it are affected, unless the pattern is passed to Python re. In Python re, regardless of the (?s) location, every . in the whole pattern is affected. The (?s)
effect is stopped using (?-s)
. A modified group can be used to only affect a specified range of a regex pattern (e.g. Delim1(?s:.*?)\nDelim2.*
will make the first .*?
match across newlines and the second .*
will only match the rest of the line).
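The scoped modifier group can be tried out with a small Python sketch. The sample text and the helper name `scoped_match` are made up for illustration; `(?s:...)` is accepted by Python's `re` from version 3.6 on.

```python
import re

# Hypothetical sample: two delimiters separated by several lines.
TEXT = "Delim1 first\nsecond\nDelim2 tail\nnext line"

# (?s:...) enables DOTALL only inside the group, so the first .*? may cross
# newlines while the trailing .* still stops at the end of its line.
SCOPED = re.compile(r"Delim1(?s:.*?)\nDelim2.*")

# Same pattern without the scoped flag: the first .*? cannot cross a newline.
UNSCOPED = re.compile(r"Delim1.*?\nDelim2.*")


def scoped_match(text):
    """Return the matched text, or None when nothing matches."""
    m = SCOPED.search(text)
    return m.group(0) if m else None


print(repr(scoped_match(TEXT)))
print(UNSCOPED.search(TEXT))
```

Only the part between the delimiters spans lines; the text after Delim2 is cut at the line end.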
POSIX note:
In non-POSIX regex engines, to match any char, [\s\S]
/ [\d\D]
/ [\w\W]
constructs can be used.
In POSIX, [\s\S]
is not matching any char (as in JavaScript or any non-POSIX engine) because regex escape sequences are not supported inside bracket expressions. [\s\S]
is parsed as bracket expressions that match a single char, \
or s
or S
.
My regex pattern looks something like
I am only interested in the part in quotes assigned to location. Shouldn’t it be as easy as below without the greedy switch?
Does not seem to work.
You need to make your regular expression non-greedy, because by default, "(.*)"
will match all of "file path/level1/level2" xxx some="xxx"
.
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
Adding a ?
on a quantifier (?
, *
or +
) makes it non-greedy.
location="(.*)"
will match from the " after location=
until the " after some="xxx
unless you make it non-greedy. So you either need .*?
(i.e. make it non-greedy) or better replace .*
with [^"]*
.
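To see the difference between the three variants, here is a small Python sketch. The input line is reconstructed from the fragments quoted in the answers, so treat it as an assumed sample rather than the asker's real data.

```python
import re

# Assumed sample line, pieced together from the fragments quoted above.
LINE = 'location="file path/level1/level2" xxx some="xxx"'

# Greedy: .* runs to the LAST double quote on the line.
greedy = re.search(r'location="(.*)"', LINE).group(1)

# Lazy: .*? stops at the FIRST double quote that lets the match succeed.
lazy = re.search(r'location="(.*?)"', LINE).group(1)

# Negated class: [^"]* can never run past a double quote in the first place.
negated = re.search(r'location="([^"]*)"', LINE).group(1)

print(greedy)
print(lazy)
print(negated)
```

The negated class is usually the more robust choice because it does not rely on backtracking to find the closing quote.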
How about
This avoids the unlimited search with .* and will match exactly to the first quote.
I have a regular expression as follows:
This matches strings such as /hello
or /hello123
.
However, I would like it to exclude a couple of string values such as /ignoreme
and /ignoreme2
.
I’ve tried a few variants but can’t seem to get any to work!
My latest feeble attempt was
Any help would be gratefully appreciated :-)
Here’s yet another way (using a negative look-ahead):
Note: There’s only one capturing expression: ([a-z0-9]+)
.
This should do it:
You can add as many ignored words as you like; here is a simple PHP implementation:
As you want to exclude both words, you need a conjunction:
Now both conditions must be true (neither ignoreme nor ignoreme2 is allowed) to have a match.
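A minimal Python sketch of the negative look-ahead idea follows. The base pattern `^/([a-z0-9]+)$` is an assumption inferred from the examples in the question (the original expression is not shown), and `route_name` is a made-up helper name.

```python
import re

# Assumed base pattern: "/" followed by lowercase letters or digits.
# The look-ahead rejects exactly "ignoreme" and "ignoreme2" right after the slash.
PATTERN = re.compile(r"^/(?!ignoreme$|ignoreme2$)([a-z0-9]+)$")


def route_name(path):
    """Return the captured name, or None if the path is excluded or invalid."""
    m = PATTERN.match(path)
    return m.group(1) if m else None


for candidate in ["/hello", "/hello123", "/ignoreme", "/ignoreme2", "/ignoreme3"]:
    print(candidate, "->", route_name(candidate))
```

The `$` inside each alternative matters: without it, `/ignoreme3` would be rejected as well, because it merely starts with an ignored word.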
I want a regular expression to check that:
A password contains at least eight characters, including at least one number and includes both lower and uppercase letters and special characters, for example #
, ?
, !
.
It cannot be your old password or contain your username, "password"
, or "websitename"
And here is my validation expression which is for eight characters including one uppercase letter, one lowercase letter, and one number or special character.
How can I write it for a password must be eight characters including one uppercase letter, one special character and alphanumeric characters?
Minimum eight characters, at least one letter and one number:
Minimum eight characters, at least one letter, one number and one special character:
Minimum eight characters, at least one uppercase letter, one lowercase letter and one number:
Minimum eight characters, at least one uppercase letter, one lowercase letter, one number and one special character:
Minimum eight and maximum 10 characters, at least one uppercase letter, one lowercase letter, one number and one special character:
You may use this regex with multiple lookahead assertions (conditions):
This regex will enforce these rules:
(?=.*?[A-Z])
(?=.*?[a-z])
(?=.*?[0-9])
(?=.*?[#?!@$%^&*-])
.{8,}
(with the anchors)
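Putting the look-aheads listed above together gives something like the following Python sketch. The combined expression is assembled here from the listed pieces, so it is an assumption about how the author joined them; `is_strong` is a hypothetical helper name.

```python
import re

# Each (?=...) is an independent condition checked from the start of the string;
# .{8,} then enforces the minimum length, and ^...$ anchor the whole input.
STRONG = re.compile(
    r"^(?=.*?[A-Z])"        # at least one uppercase letter
    r"(?=.*?[a-z])"         # at least one lowercase letter
    r"(?=.*?[0-9])"         # at least one digit
    r"(?=.*?[#?!@$%^&*-])"  # at least one special character from this set
    r".{8,}$"               # eight characters or more
)


def is_strong(password):
    """True when the password satisfies all five conditions."""
    return STRONG.match(password) is not None


for pw in ["Passw0rd!", "password1!", "Sh0rt!a", "NoSpecial123"]:
    print(pw, is_strong(pw))
```

Rules such as "must not contain the username" are easier to check with ordinary string operations than inside the regex.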
I have a text in a textarea and I read it out using the .value attribute.
Now I would like to remove all linebreaks (the character that is produced when you press Enter) from my text now using .replace with a regular expression, but how do I indicate a linebreak in a regex?
If that is not possible, is there another way?
This is probably a FAQ. Anyhow, line breaks (better: newlines) can be one of Carriage Return (CR, \r
, on older Macs), Line Feed (LF, \n
, on Unices incl. Linux) or CR followed by LF (\r\n
, on WinDOS). (Contrary to another answer, this has nothing to do with character encoding.)
Therefore, the most efficient RegExp
literal to match all variants is
If you want to match all newlines in a string, use a global match,
respectively. Then proceed with the replace
method as suggested in several other answers. (Probably you do not want to remove the newlines, but replace them with other whitespace, for example the space character, so that words remain intact.)
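The same idea, sketched in Python instead of JavaScript to keep the example self-contained: one alternation covers CRLF, lone CR and lone LF, with CRLF listed first so the pair is treated as a single line break. The function name is made up for illustration.

```python
import re

# \r\n must come first; otherwise a Windows line break would count as two.
NEWLINE = re.compile(r"\r\n|\r|\n")


def newlines_to_spaces(text):
    """Replace every line break with a single space so words stay separated."""
    return NEWLINE.sub(" ", text)


print(repr(newlines_to_spaces("a\r\nb\rc\nd")))
```

Two consecutive line breaks produce two spaces here; match `(?:\r\n|\r|\n)+` instead if runs should collapse into one.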
How you’d find a line break varies between operating system encodings. Windows would be \r\n
, but Linux just uses \n
and Apple uses \r
.
I found this in JavaScript line breaks:
That should remove all kinds of line breaks.
String.trim()
removes whitespace from the beginning and end of strings… including newlines.
const myString = " \n \n\n Hey! \n I'm a string!!! \n\n";
const trimmedString = myString.trim();
console.log(trimmedString);
// outputs: "Hey! \n I'm a string!!!"
Here’s an example fiddle: http://jsfiddle.net/BLs8u/
NOTE! it only trims the beginning and end of the string, not line breaks or whitespace in the middle of the string.
I have a requirement to find and extract a number contained within a string.
For example, from these strings:
How can I do this?
\d+
is the regex for an integer number. So
returns a string containing the first occurrence of a number in subjectString
.
Int32.Parse(resultString)
will then give you the number.
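For comparison, the same two steps (find the first run of digits, then parse it) look like this in Python. This is only a sketch; `first_number` is a hypothetical name and the inputs are invented.

```python
import re


def first_number(subject):
    """Return the first run of digits in subject as an int, or None if absent."""
    m = re.search(r"\d+", subject)
    return int(m.group(0)) if m else None


print(first_number("order 123 shipped in 45 boxes"))
print(first_number("no digits here"))
```

Checking for a missing match before parsing avoids an exception on strings that contain no digits at all.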
Here’s how I cleanse phone numbers to get the digits only:
Simple regex question. I have a string on the following format:
What is the regular expression to extract the words within the square brackets, ie.
Note: In my use case, brackets cannot be nested.
You can use the following regex globally:
Explanation:
\[
: [
is a meta char and needs to be escaped if you want to match it literally.
(.*?)
: match everything in a non-greedy way and capture it.
\]
: ]
is a meta char and needs to be escaped if you want to match it literally.
Will capture content without brackets
(?<=\[)
- positive lookbehind for [
.*?
- non greedy match for the content
(?=\])
- positive lookahead for ]
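Both approaches can be compared side by side in Python. The sample sentence below is invented for the sketch, since the question's own string is not shown.

```python
import re

# Hypothetical input with three bracketed parts.
TEXT = "this is a [sample] string with [some] special words. [another one]"

# Capturing group: brackets are matched but only the inside is captured.
captured = re.findall(r"\[(.*?)\]", TEXT)

# Look-arounds: brackets are asserted but never become part of the match.
lookaround = re.findall(r"(?<=\[).*?(?=\])", TEXT)

print(captured)
print(lookaround)
```

Neither pattern handles nested brackets, which matches the constraint stated in the question.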
EDIT: for nested brackets the below regex should work:
This should work out ok:
I want to match two passwords with regular expression. For example I have two inputs “123456” and “1234567” then the result should be not match (false). And when I have entered “123456” and “123456” then the result should be match (true).
I couldn’t make the expression. How do I do it?
If you have the input password in a variable and you want to match exactly 123456, then anchors will help you:
in perl the test for matching the password would be something like
EDIT:
bart kiers is right tho, why don’t you use a strcmp() for this? every language has it in its own way
as a second thought, you may want to consider a safer authentication mechanism :)
In malfaux’s answer ‘^’ and ‘$’ has been used to detect the beginning and the end of the text.
These are usually used to detect the beginning and the end of a line.
However this may be the correct way in this case.
But if you wish to match an exact word, the more elegant way is to use \b. In this case the following pattern will match the exact phrase '123456':
/\b123456\b/
This makes sure that your match is not immediately preceded by a letter, digit, or underscore, and is not immediately followed by a letter, digit, or underscore,
so it will match “abc” in “abc”, “abc.”, “abc”, but not “4abc”, nor “abcde”
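A short Python sketch shows where a word boundary matches and where it does not. The helper names are made up; for the original task of comparing two passwords, plain equality is all that is needed, as the last function illustrates.

```python
import re

# \b asserts a position between a word character and a non-word character
# (or the start/end of the string).
WORD = re.compile(r"\babc\b")


def contains_word(text):
    """True when 'abc' appears as a whole word somewhere in text."""
    return WORD.search(text) is not None


def passwords_match(first, second):
    """Comparing two inputs needs no regex at all."""
    return first == second


for sample in ["abc", "abc.", "x abc y", "4abc", "abcde"]:
    print(repr(sample), contains_word(sample))
```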
I’m trying to split text in a JTextArea
using a regex to split the String by \n
However, this does not work and I also tried by \r\n|\r|n
and many other combination of regexes. Code:
public void insertUpdate(DocumentEvent e) {
String split[], docStr = null;
Document textAreaDoc = (Document)e.getDocument();
try {
docStr = textAreaDoc.getText(textAreaDoc.getStartPosition().getOffset(), textAreaDoc.getEndPosition().getOffset());
} catch (BadLocationException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
split = docStr.split("\\n");
}
This should cover you:
There’s only really two newlines (UNIX and Windows) that you need to worry about.
If you don’t want empty lines:
String#split(String regex)
method is using regex (regular expressions). Since Java 8 regex supports \R
which represents (from documentation of Pattern class):
Linebreak matcher
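Any Unicode linebreak sequence, is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
So we can use it to match:
\u000D\u000A -> \r\n pair
\u000A -> line feed (\n)
\u000B -> line tabulation (not to be confused with character tabulation \t, which is \u0009)
\u000C -> form feed (\f)
\u000D -> carriage return (\r)
\u0085 -> next line (NEL)
\u2028 -> line separator
\u2029 -> paragraph separator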
As you see \r\n
is placed at start of regex which ensures that regex will try to match this pair first, and only if that match fails it will try to match single character line separators.
So if you want to split on line separator use split("\\R")
.
If you want to keep trailing empty strings "" in the resulting array, use split(regex, limit) with a negative limit parameter, like split("\\R", -1).
If you want to treat one or more consecutive empty lines as a single delimiter, use split("\\R+").
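Python's `re` module has no `\R`, so a sketch of the same behaviour has to spell out the equivalent alternation quoted from the Pattern documentation. The function names below are made up for illustration.

```python
import re

# Spelled-out equivalent of \R: the \r\n pair first, then the single characters.
LINEBREAK = r"\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]"


def split_lines(text):
    """Split on each line break; empty strings (including trailing) are kept."""
    return re.split(LINEBREAK, text)


def split_on_runs(text):
    """Treat one or more consecutive line breaks as a single delimiter."""
    return re.split(f"(?:{LINEBREAK})+", text)


print(split_lines("a\r\nb\nc\rd"))
print(split_lines("a\n\nb\n"))
print(split_on_runs("a\n\r\n\nb"))
```

Unlike Java's one-argument `split`, Python's `re.split` never drops trailing empty strings, so it behaves like the negative-limit variant.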
How can I replace empty lines in Notepad++? I tried a find and replace with the empty lines in the find, and nothing in the replace, but it did not work; it probably needs regex.
You need something like a regular expression.
You have to be in Extended
mode
If you want all the lines to end up on a single line use \r\n
. If you want to simply remove empty lines, use \n\r
as @Link originally suggested.
Replace either expression with nothing.
There is now a built-in way to do this as of version 6.5.2
Edit -> Line Operations -> Remove Empty Lines
or Remove Empty Lines (Containing Blank characters)
There is a plugin that adds a menu entitled TextFX
. This menu, which houses a dizzying array of quick text editing options, gives a person the ability to make quick coding changes. In this menu, you can find selections such as Drop Quotes, Delete Blank Lines as well as Unwrap and Rewrap Text
Do the following:
How do I make an expression to match absolutely anything (including whitespaces)?
Example:
Regex: I bought _____ sheep.
Matches: I bought sheep. I bought a sheep. I bought five sheep.
I tried using (.*)
, but that doesn’t seem to be working.
Update: I got it to work, apparently the problem wasn’t with the regular expressions, it’s just that the (.) characters were being escaped.
Normally the dot matches any character except newlines.
So if .*
isn’t working, set the “dot matches newlines, too” option (or use (?s).*
).
If you’re using a JavaScript engine without a “dotall” option (the s flag was only added in ES2018), try [\s\S]*
. This means “match any number of characters that are either whitespace or non-whitespace” - effectively “match any string”.
Another option that only works for JavaScript (and is not recognized by any other regex flavor) is [^]*
which also matches any string. But [\s\S]*
seems to be more widely used, perhaps because it’s more portable.
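The effect of the dot-matches-newline option can be checked with a Python sketch built on the sheep example. The pattern and the helper `bought` are assumptions made for illustration, not the asker's actual expression.

```python
import re

# Assumed pattern: capture whatever sits between "I bought" and "sheep.".
PATTERN = r"I bought\s+(.*?)\s*sheep\."

# Same pattern with [\s\S] in place of the dot, for engines lacking DOTALL.
ANY_CHAR = r"I bought\s+([\s\S]*?)\s*sheep\."


def bought(text, flags=0):
    """Return the captured filler text, or None when the pattern fails."""
    m = re.search(PATTERN, text, flags)
    return m.group(1) if m else None


MULTILINE_TEXT = "I bought five\nwoolly sheep."

print(repr(bought("I bought a sheep.")))
print(repr(bought(MULTILINE_TEXT)))             # dot stops at the newline
print(repr(bought(MULTILINE_TEXT, re.DOTALL)))  # dot now crosses it
```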
(.*?)
matches anything - I’ve been using it for years.
I often need to kill a process during programming.
The way I do it now is:
[~]$ ps aux | grep 'python csp_build.py'
user 5124 1.0 0.3 214588 13852 pts/4 Sl+ 11:19 0:00 python csp_build.py
user 5373 0.0 0.0 8096 960 pts/6 S+ 11:20 0:00 grep python csp_build.py
[~]$ kill 5124
How can I extract the process id automatically and kill it in the same line?
Like this:
In bash, you should be able to do:
kill $(ps aux | grep '[p]ython csp_build.py' | awk '{print $2}')
Details on its workings are as follows:
ps
gives you the list of all the processes.
grep
filters that based on your search string, [p]
is a trick to stop you picking up the actual grep
process itself.
awk
just gives you the second field of each line, which is the PID.
$(x)
construct means to execute x
then take its output and put it on the command line. The output of that ps
pipeline inside that construct above is the list of process IDs so you end up with a command like kill 1234 1122 7654
.
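The filtering and field-extraction steps can be mimicked in Python on captured `ps` output, which makes the bracket trick easy to verify without killing anything. The sample text is adapted from the question's listing, and `matching_pids` is a hypothetical helper.

```python
import re

# Sample ps output; the second line is the grep process as it would appear
# when the bracket trick is used.
SAMPLE = """\
user 5124 1.0 0.3 214588 13852 pts/4 Sl+ 11:19 0:00 python csp_build.py
user 5373 0.0 0.0 8096 960 pts/6 S+ 11:20 0:00 grep [p]ython csp_build.py
"""


def matching_pids(ps_output, pattern):
    """Return field 2 (the PID) of every line that matches pattern."""
    regex = re.compile(pattern)
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) > 1 and regex.search(line):
            pids.append(int(fields[1]))
    return pids


print(matching_pids(SAMPLE, r"[p]ython csp_build.py"))  # grep line is skipped
print(matching_pids(SAMPLE, r"csp_build"))              # grep line is included
```

The pattern `[p]ython` matches the text `python`, but the grep command line contains the literal text `[p]ython`, which the pattern does not match.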
Here’s a transcript showing it in action:
pax> sleep 3600 &
[1] 2225
pax> sleep 3600 &
[2] 2226
pax> sleep 3600 &
[3] 2227
pax> sleep 3600 &
[4] 2228
pax> sleep 3600 &
[5] 2229
pax> kill $(ps aux | grep '[s]leep' | awk '{print $2}')
[5]+ Terminated sleep 3600
[1] Terminated sleep 3600
[2] Terminated sleep 3600
[3]- Terminated sleep 3600
[4]+ Terminated sleep 3600
and you can see it terminating all the sleepers.
Explaining the grep '[p]ython csp_build.py'
bit in a bit more detail:
When you do sleep 3600 &
followed by ps -ef | grep sleep
, you tend to get two processes with sleep
in it, the sleep 3600
and the grep sleep
(because they both have sleep
in them, that’s not rocket science).
However, ps -ef | grep '[s]leep'
won’t create a process with sleep
in it, it instead creates grep '[s]leep'
and here’s the tricky bit: the grep
doesn’t find it because it’s looking for the regular expression "any character from the character class [s] (which is s) followed by leep".
In other words, it’s looking for sleep
but the grep process is grep '[s]leep'
which doesn’t have sleep
in it.
When I was shown this (by someone here on SO), I immediately started using it because
| grep -v grep
; and
if you have pkill,
If you only want to grep against the process name (instead of the full argument list) then leave off -f
.
ps aux | grep -i csp_build | awk '{print $2}' | xargs sudo kill -9
awk '{print $2}'
sudo
is optional
kill -9 5124
, kill -9 5373
etc (kill -15 is more graceful but slightly slower)
I also have 2 shortcut functions defined in my .bash_profile (~/.bash_profile is for osx, you have to see what works for your *nix machine).
p keyword
p csp_build
, p python
etc
bash_profile code:
ka keyword
ka csp_build
, ka python
etc
ka csp_build 15
, ka python 9
bash_profile code:
# KILL ALL
function ka(){
cnt=$( p $1 | wc -l) # total count of processes found
klevel=${2:-15} # kill level, defaults to 15 if argument 2 is empty
echo -e "\nSearching for '$1' -- Found" $cnt "Running Processes .. "
p $1
echo -e '\nTerminating' $cnt 'processes .. '
ps aux | grep -i "$1" | grep -v grep | awk '{print $2}' | xargs sudo kill -"$klevel"
echo -e "Done!\n"
echo "Running search again:"
p "$1"
echo -e "\n"
}
I am looking for a pattern that matches everything until the first occurrence of a specific character, say a “;” - a semicolon.
I wrote this:
But it actually matches everything (including the semicolon) until the last occurrence of a semicolon.
You need
The [^;]
is a character class, it matches everything but a semicolon.
To cite the perlre
manpage:
You can specify a character class, by enclosing a list of characters in [] , which will match any character from the list. If the first character after the “[” is “^”, the class matches any character not in the list.
This should work in most regex dialects.
Would;
work?
The ?
is a lazy operator, so the regex grabs as little as possible before matching the ;
.
/^[^;]*/
The [^;] says match anything except a semicolon. The square brackets are a set matching operator: essentially, match any character in this set of characters. The ^ at the start makes it an inverse match, so it matches anything not in this set.
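The three behaviours discussed in this thread (greedy, negated class, lazy) can be compared in a few lines of Python. The sample string is made up for the sketch.

```python
import re

TEXT = "first;second;third"

# Greedy .* backtracks from the end, so it stops at the LAST semicolon.
greedy = re.match(r".*;", TEXT).group(0)

# [^;]* simply cannot consume a semicolon, so it stops before the first one.
up_to_first = re.match(r"[^;]*", TEXT).group(0)

# Lazy .*? expands one character at a time until a semicolon follows.
lazy = re.match(r"(.*?);", TEXT).group(1)

print(greedy)
print(up_to_first)
print(lazy)
```

One difference worth knowing: `[^;]*` still matches when the input contains no semicolon at all, whereas `(.*?);` fails in that case.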
How do I check if a string matches this pattern?
Uppercase letter, number(s), uppercase letter, number(s)…
Example, These would match:
These wouldn’t (‘^’ points to problem)
Edit: As noted in the comments match
checks only for matches at the beginning of the string while re.search()
will match a pattern anywhere in string. (See also: https://docs.python.org/library/re.html#search-vs-match)
One-liner: re.match(r"pattern", string) # No need to compile
You can evaluate it as bool
if needed
Please try the following:
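import re

name = ["A1B1", "djdd", "B2C4", "C2H2", "jdoi", "1A4V"]

# Print the captured group for every name that starts with the pattern.
for element in name:
    # Raw string, so that \d reaches the regex engine unchanged.
    m = re.match(r"(^[A-Z]\d[A-Z]\d)", element)
    if m:
        print(m.groups())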
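For the rule as stated in the question (uppercase letter, number(s), repeated any number of times), a sketch using `re.fullmatch` might look like this. The test strings are hypothetical, since the question's own examples are not shown, and `is_valid` is a made-up name.

```python
import re

# One or more repetitions of: a single uppercase letter followed by 1+ digits.
PATTERN = re.compile(r"(?:[A-Z][0-9]+)+")


def is_valid(candidate):
    """True only if the WHOLE string follows the letter/number(s) pattern."""
    return PATTERN.fullmatch(candidate) is not None


for sample in ["A1B2", "B10L1", "a1B2", "A1B", "AB12"]:
    print(sample, is_valid(sample))
```

`fullmatch` avoids the pitfall mentioned above, where `match` only anchors at the beginning and would accept trailing garbage.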
How do I split a string with multiple separators in JavaScript? I’m trying to split on both commas and spaces but, AFAIK, JS’s split function only supports one separator.
Pass in a regexp as the parameter:
Edited to add:
You can get the last element by selecting the length of the array minus 1:
>>> bits = "Hello awesome, world!".split(/[\s,]+/)
["Hello", "awesome", "world!"]
>>> bit = bits[bits.length - 1]
"world!"
… and if the pattern doesn’t match:
You can pass a regex into Javascript’s split operator. For example:
Or, if you want to allow multiple separators together to act as one only:
(You have to use the non-capturing (?:) parens because otherwise it gets spliced back into the result. Or you can be smart like Aaron and use a character class.)
(Examples tested in Safari + FF)
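The same splitting behaviour, including the capturing-group caveat, can be reproduced in Python, whose `re.split` treats captured separators the way JavaScript does. This is a sketch with invented sample strings.

```python
import re

# A character class with + treats any run of commas/whitespace as one separator.
bits = re.split(r"[\s,]+", "Hello awesome, world!")

# A capturing group makes the separators show up in the result...
with_capture = re.split(r"(,)", "a,b")

# ...while a non-capturing group keeps them out.
without_capture = re.split(r"(?:,)", "a,b")

print(bits)
print(with_capture)
print(without_capture)
```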
Another simple but effective method is to use split + join repeatedly.
Essentially doing a split followed by a join is like a global replace so this replaces each separator with a comma then once all are replaced it does a final split on comma
The result of the above expression is:
Expanding on this you could also place it in a function:
function splitMulti(str, tokens){
    var tempChar = tokens[0]; // We can use the first token as a temporary join character
    for(var i = 1; i < tokens.length; i++){
        str = str.split(tokens[i]).join(tempChar);
    }
    str = str.split(tempChar);
    return str;
}
Usage:
If you use this functionality a lot it might even be worth considering wrapping String.prototype.split
for convenience (I think my function is fairly safe - the only consideration is the additional overhead of the conditionals (minor) and the fact that it lacks an implementation of the limit argument if an array is passed).
Be sure to include the splitMulti
function if using this approach, since the code below simply wraps it :). Also worth noting that some people frown on extending built-ins (as many people do it wrong and conflicts can occur), so if in doubt speak to someone more senior before using this or ask on SO :)
var splitOrig = String.prototype.split; // Maintain a reference to inbuilt fn
String.prototype.split = function (){
    if(arguments[0].length > 0){
        if(Object.prototype.toString.call(arguments[0]) == "[object Array]" ) { // Check if our separator is an array
            return splitMulti(this, arguments[0]); // Call splitMulti
        }
    }
    return splitOrig.apply(this, arguments); // Call original split maintaining context
};
Usage:
var a = "a=b,c:d";
a.split(['=', ',', ':']); // ["a", "b", "c", "d"]
// Test to check that the built-in split still works (although our wrapper wouldn't work if it didn't as it depends on it :P)
a.split('='); // ["a", "b,c:d"]
Enjoy!
How can I count the number of times a particular string occurs in another string. For example, this is what I am trying to do in Javascript:
The g
in the regular expression (short for global) says to search the whole string rather than just find the first occurrence. This matches is
twice:
And, if there are no matches, it returns 0
:
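A Python sketch of the same counting idea is below. The sentence is an assumed example (the original snippet is not shown), and `count_occurrences` is a made-up helper that also covers overlapping matches via a look-ahead.

```python
import re


def count_occurrences(text, needle, overlapping=False):
    """Count how often needle occurs in text, optionally counting overlaps."""
    escaped = re.escape(needle)  # treat the needle as literal text
    # A look-ahead consumes nothing, so the search can restart one char later.
    pattern = f"(?={escaped})" if overlapping else escaped
    return len(re.findall(pattern, text))


print(count_occurrences("This is a string.", "is"))
print(count_occurrences("foofoofoo", "foofoo"))
print(count_occurrences("foofoofoo", "foofoo", overlapping=True))
```

`findall` returns an empty list when nothing matches, so the zero case needs no special handling here.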
/** Function that count occurrences of a substring in a string;
* @param {String} string The string
* @param {String} subString The sub string to search for
* @param {Boolean} [allowOverlapping] Optional. (Default:false)
*
* @author Vitim.us https://gist.github.com/victornpb/7736865
* @see Unit Test https://jsfiddle.net/Victornpb/5axuh96u/
* @see http://stackoverflow.com/questions/4009756/how-to-count-string-occurrence-in-string/7924240#7924240
*/
function occurrences(string, subString, allowOverlapping) {
    string += "";
    subString += "";
    if (subString.length <= 0) return (string.length + 1);
    var n = 0,
        pos = 0,
        step = allowOverlapping ? 1 : subString.length;
    while (true) {
        pos = string.indexOf(subString, pos);
        if (pos >= 0) {
            ++n;
            pos += step;
        } else break;
    }
    return n;
}
occurrences("foofoofoo", "bar"); //0
occurrences("foofoofoo", "foo"); //3
occurrences("foofoofoo", "foofoo"); //1
Matches:
I’ve made a benchmark test and my function is roughly 6 to 13 times faster than the regexp match function posted by gumbo, depending on the browser. In my test the string is 25 chars long, with 2 occurrences of the character ‘o’. I executed it 1 000 000 times.
Safari 5.1
Benchmark> Total time execution: 5617 ms (regexp)
Benchmark> Total time execution: 881 ms (my function 6.4x faster)
Firefox 4
Benchmark> Total time execution: 8547 ms (regexp)
Benchmark> Total time execution: 634 ms (my function 13.5x faster)
Edit: changes I’ve made
cached substring length
added type-casting to string.
added optional ‘allowOverlapping’ parameter
fixed correct output for "" empty substring case.
Gist
I want a regular expression that prevents symbols and only allows letters and numbers. The regex below works great, but it doesn’t allow for spaces between words.
For example, when using this regular expression “HelloWorld” is fine, but “Hello World” does not match.
How can I tweak it to allow spaces?
Just add a space in your character class.
The above isn’t exactly correct. Due to the fact that *
means zero or more, it would match all of the following cases that one would not usually mean to match:
Originally I didn’t think such details were worth going into, as OP was asking such a basic question that it seemed strictness wasn’t a concern. Now that the question’s gained some popularity however, I want to say…
Which, in my flavor (without using \w
) translates to:
(Please upvote @stema regardless.)
Some things to note about this (and @stema’s) answer:
If you want to allow multiple spaces between words (say, if you’d like to allow accidental double-spaces, or if you’re working with copy-pasted text from a PDF), then add a + after the space:
^\w+( +\w+)*$
If you want to allow tabs and newlines (whitespace characters), then replace the space with a \s+:
^\w+(\s+\w+)*$
The + is used here by default because, for example, Windows linebreaks consist of two whitespace characters in sequence, \r\n, so you’ll need the + to catch both.
Check what dialect of regular expressions you’re using.* In languages like Java you’ll have to escape your backslashes, i.e. \\w
and \\s
. In older or more basic languages and utilities, like sed
, \w
and \s
aren’t defined, so write them out with character classes, e.g. [a-zA-Z0-9_]
and [ \f\n\r\t\v]
, respectively.
* I know this question is tagged vb.net, but based on 25,000+ views, I’m guessing it’s not only those folks who are coming across this question. Currently it’s the first hit on google for the search phrase, regular expression space word.
One possibility would be to just add the space into your character class, as acheong87 suggested. How well this works depends on how strict you need the pattern to be, because it would also allow a string starting with 5 spaces, or strings consisting only of spaces.
The other possibility is to define a pattern:
I will use \w, which in most regex flavours is the same as [a-zA-Z0-9_] (in some it is Unicode based).
This will allow a series of at least one word and the words are divided by spaces.
^
Match the start of the string
\w+
Match a series of at least one word character
( \w+)*
is a group that is repeated 0 or more times. In the group it expects a space followed by a series of at least one word character
$
matches the end of the string
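The breakdown above can be checked with a small Python sketch. `fullmatch` stands in for the `^` and `$` anchors, and `is_words` is a made-up helper name.

```python
import re

# A word, then zero or more groups of "one space + another word".
WORDS = re.compile(r"\w+( \w+)*")


def is_words(text):
    """True when text is one or more words separated by single spaces."""
    return WORDS.fullmatch(text) is not None


for sample in ["HelloWorld", "Hello World", "Hello  World", " Hello", ""]:
    print(repr(sample), is_words(sample))
```

Leading spaces, double spaces and the empty string are rejected, which is exactly what the plain "add a space to the character class" approach does not guarantee.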
This one worked for me
I’m not too sure how to do this. I need to validate email addresses using regex with something like this:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)
Then I need to run this in a jQuery function like this:
$j("#fld_emailaddress").live('change',function() {
var emailaddress = $j("#fld_emailaddress").val();
// validation here?
if(emailaddress){}
// end validation
$j.ajax({
type: "POST",
url: "../ff-admin/ff-register/ff-user-check.php",
data: "fld_emailaddress="+ emailaddress,
success: function(msg)
{
if(msg == 'OK') {
$j("#fld_username").attr('disabled',false);
$j("#fld_password").attr('disabled',false);
$j("#cmd_register_submit").attr('disabled',false);
$j("#fld_emailaddress").removeClass('object_error'); // if necessary
$j("#fld_emailaddress").addClass("object_ok");
$j('#email_ac').html(' <img src="img/cool.png" align="absmiddle"> <font color="Green"> Your email <strong>'+ emailaddress+'</strong> is OK.</font> ');
} else {
$j("#fld_username").attr('disabled',true);
$j("#fld_password").attr('disabled',true);
$j("#cmd_register_submit").attr('disabled',true);
$j("#fld_emailaddress").removeClass('object_ok'); // if necessary
$j("#fld_emailaddress").addClass("object_error");
$j('#email_ac').html(msg);
}
}
});
});
Where does the validation go and what is the expression?
function isValidEmailAddress(emailAddress) {
var pattern = /^([a-z\d!#$%&'*+\-\/=?^_`{|}~\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]+(\.[a-z\d!#$%&'*+\-\/=?^_`{|}~\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]+)*|"((([ \t]*\r\n)?[ \t]+)?([\x01-\x08\x0b\x0c\x0e-\x1f\x7f\x21\x23-\x5b\x5d-\x7e\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]|\\[\x01-\x09\x0b\x0c\x0d-\x7f\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))*(([ \t]*\r\n)?[ \t]+)?")@(([a-z\d\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]|[a-z\d\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF][a-z\d\-._~\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]*[a-z\d\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])\.)+([a-z\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]|[a-z\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF][a-z\d\-._~\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]*[a-z\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])\.?$/i;
return pattern.test(emailAddress);
}
This is my solution:
function isValidEmailAddress(emailAddress) {
var pattern = new RegExp(/^[+a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}$/i);
// alert( pattern.test(emailAddress) );
return pattern.test(emailAddress);
};
Found that RegExp over here: http://mdskinner.com/code/email-regex-and-validation-jquery
$(document).ready(function() {
$('#emailid').focusout(function(){
$('#emailid').filter(function(){
var emil=$('#emailid').val();
var emailReg = /^([\w-\.]+@([\w-]+\.)+[\w-]{2,4})?$/;
if( !emailReg.test( emil ) ) {
alert('Please enter valid email');
} else {
alert('Thank you for your valid email');
}
})
});
});
Want to remove all new lines from string.
I’ve this regex, it can catch all of them, the problem is I don’t know with which function should I use it.
$string
should become:
You have to be cautious of double line breaks, which would cause double spaces. Use this really efficient regular expression:
Multiple spaces and newlines are replaced with a single space.
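The collapsing behaviour is easy to check outside PHP. Here is a Python sketch of the same idea (not the answer's original code); the function name and sample text are invented.

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    # \s+ already means "one or more", so the class does not need repeating.
    collapsed = re.sub(r"\s+", " ", text)
    # Trim what is left at the very start and end of the string.
    return collapsed.strip()


print(repr(collapse_whitespace("  line one\n\nline two\r\n  line three ")))
```

Because newlines are replaced by a space rather than deleted, words on adjacent lines do not get glued together.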
Edit: As others have pointed out, this solution has issues matching single newlines in between words. This is not present in the example, but one can easily see how that situation could occur. An alternative is to do the following:
A few comments on the accepted answer:
The +
means “1 or more”. I don’t think you need to repeat \s
. I think you can simply write '/\s+/'
.
Also, if you want to remove whitespace first and last in the string, add trim
.
With these modifications, the code would be:
Just use preg_replace()
You could get away with str_replace()
on this one, although the code doesn’t look as clean:
See it live on ideone
I have a table with ~500k rows; varchar(255) UTF8 column filename
contains a file name;
I’m trying to strip out various strange characters out of the filename - thought I’d use a character class: [^a-zA-Z0-9()_ .\-]
Now, is there a function in MySQL that lets you replace through a regular expression? I’m looking for a similar functionality to REPLACE() function - simplified example follows:
SELECT REPLACE('stackowerflow', 'ower', 'over');
Output: "stackoverflow"
/* does something like this exist? */
SELECT X_REG_REPLACE('Stackoverflow','/[A-Zf]/','-');
Output: "-tackover-low"
I know about REGEXP/RLIKE, but those only check if there is a match, not what the match is.
(I could do a “SELECT pkey_id,filename FROM foo WHERE filename RLIKE '[^a-zA-Z0-9()_ .\-]'
” from a PHP script, do a preg_replace
and then “UPDATE foo ... WHERE pkey_id=...
”, but that looks like a last-resort slow & ugly hack)
With MySQL 8.0+ you can use REGEXP_REPLACE natively.
REGEXP_REPLACE(expr, pat, repl[, pos[, occurrence[, match_type]]])
Replaces occurrences in the string expr that match the regular expression specified by the pattern pat with the replacement string repl, and returns the resulting string. If expr, pat, or repl is NULL, the return value is NULL.
and Regular expression support:
Previously, MySQL used the Henry Spencer regular expression library to support regular expression operators (REGEXP, RLIKE).
Regular expression support has been reimplemented using International Components for Unicode (ICU), which provides full Unicode support and is multibyte safe. The REGEXP_LIKE() function performs regular expression matching in the manner of the REGEXP and RLIKE operators, which now are synonyms for that function. In addition, the REGEXP_INSTR(), REGEXP_REPLACE(), and REGEXP_SUBSTR() functions are available to find match positions and perform substring substitution and extraction, respectively.
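Before running a replacement over 500k rows, it can help to preview what the pattern does. The Python sketch below only imitates the intended effect with `re.sub`; it is not MySQL, the flavours differ in details, and `clean_filename` plus the sample file name are made up.

```python
import re

# The example from the question: replace capitals and "f" with "-".
preview = re.sub(r"[A-Zf]", "-", "Stackoverflow")

# The character class from the question, used to strip unwanted characters.
UNWANTED = re.compile(r"[^a-zA-Z0-9()_ .\-]")


def clean_filename(name):
    """Remove every character that is not in the allowed set."""
    return UNWANTED.sub("", name)


print(preview)
print(clean_filename("repørt#1 (final).txt"))
```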
No.
But if you have access to your server, you could use a user defined function (UDF) like mysql-udf-regexp.
EDIT: With MySQL 8.0+ you can use REGEXP_REPLACE natively. More in the answer above.
Use MariaDB instead. It has a function
See MariaDB docs and PCRE Regular expression enhancements
Note that you can use regexp grouping as well (I found that very useful):
returns
I need to remove all special characters, punctuation and spaces from a string so that I only have letters and numbers.
This can be done without regex:
>>> string = "Special $#! characters spaces 888323"
>>> ''.join(e for e in string if e.isalnum())
'Specialcharactersspaces888323'
You can use str.isalnum
:
If you insist on using regex, other solutions will do fine. However note that if it can be done without using a regular expression, that’s the best way to go about it.
Here is a regex to match a string of characters that are not a letters or numbers:
Here is the Python command to do a regex substitution:
Shorter way :
If you want spaces between words and numbers substitute ’’ with ’ ’
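A sketch of the regex variant, using the sample string from the first answer. The pattern `[^A-Za-z0-9]+` is one reasonable choice and an assumption here, since the answers' own snippets are not shown.

```python
import re

string = "Special $#! characters   spaces 888323"

# Remove every run of characters that are not ASCII letters or digits.
stripped = re.sub(r"[^A-Za-z0-9]+", "", string)

# Replace each such run with a single space instead, keeping words apart.
spaced = re.sub(r"[^A-Za-z0-9]+", " ", string)

print(stripped)
print(spaced)
```

Unlike `str.isalnum`, this character class is ASCII-only, so accented letters would be removed as well.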
Using awk
, I need to find a word in a file that matches a regex pattern.
I only want to print the word matched with the pattern.
So if in the line, I have:
And pattern:
I want to only get:
EDIT: thanks to kurumi i managed to write something like this:
and this is what i needed :) thanks a lot!
This is the very basic
ask awk
to search for pattern
using //
, then print out the line, which by default is called a record, denoted by $0. At least read up the documentation.
If you only want to print out the matched word:
It sounds like you are trying to emulate GNU’s grep -o
behaviour. This will do that providing you only want the first match on each line:
Here’s an example:
% awk 'match($0, /a.t/) {
print substr($0, RSTART, RLENGTH)
}
' /usr/share/dict/words | head
act
act
act
act
aft
ant
apt
art
art
art
Read about match
, substr
, RSTART
and RLENGTH
in the awk
manual.
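For readers more at home in Python, the `match`/`substr` combination corresponds roughly to `re.search` plus `group(0)`. The sketch below uses invented input lines and a hypothetical function name.

```python
import re


def first_matches(lines, pattern):
    """Return the first matched substring of every line that matches."""
    regex = re.compile(pattern)
    found = []
    for line in lines:
        m = regex.search(line)
        if m:
            # group(0) plays the role of substr($0, RSTART, RLENGTH) in awk.
            found.append(m.group(0))
    return found


print(first_matches(["the cart", "no vowels here?", "an ant"], r"a.t"))
```

Like the awk version, this reports only the first match per line; `regex.findall(line)` would return all of them.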
After that you may wish to extend this to deal with multiple matches on the same line. I can’t do all your homework for you :-)
gawk can get the matching part of every line using this as action:
match(string, regexp [, array]) If array is present, it is cleared, and then the zeroth element of array is set to the entire portion of string matched by regexp. If regexp contains parentheses, the integer-indexed elements of array are set to contain the portion of string matching the corresponding parenthesized subexpression. http://www.gnu.org/software/gawk/manual/gawk.html#String-Functions
I want to remove special characters like:
from a String using Java.
That depends on what you define as special characters, but try replaceAll(...)
:
Note that the ^
character must not be the first one in the list, since you’d then either have to escape it or it would mean “any but these characters”.
Another note: the - character needs to be the first or last one on the list, otherwise you’d have to escape it or it would define a range (e.g. :-, would mean "all characters in the range : to ,").
So, in order to keep consistency and not depend on character positioning, you might want to escape all those characters that have a special meaning in regular expressions (the following list is not complete, so be aware of other characters like (
, {
, $
etc.):
If you want to get rid of all punctuation and symbols, try this regex: [\p{P}\p{S}] (keep in mind that in Java strings you’d have to escape backslashes: "[\\p{P}\\p{S}]").
A third way could be something like this, if you can exactly define what should be left in your string:
This means: replace everything that is not a word character (a-z in any case, 0-9 or _) or whitespace.
Edit: please note that there are a couple of other patterns that might prove helpful. However, I can’t explain them all, so have a look at the reference section of regular-expressions.info.
Here’s a less restrictive alternative to the “define allowed characters” approach, as suggested by Ray:
The regex matches everything that is not a letter in any language and not a separator (whitespace, linebreak etc.). Note that you can’t use [\P{L}\P{Z}] (upper case P means not having that property), since that would mean “everything that is not a letter or not whitespace”, which matches almost everything, since letters are not whitespace and vice versa.
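Python's built-in re module does not understand \p{L}-style properties, so the following is only a rough sketch of the "keep letters and separators" idea, written by checking each character's Unicode general category instead of using a regex. The function name keep_letters_and_separators is made up for this illustration; it approximates removing [^\p{L}\p{Z}] and is not the Java code from the answer.

```python
import unicodedata

def keep_letters_and_separators(text):
    """Drop every character that is neither a letter (L*) nor a separator (Z*)."""
    return "".join(
        ch for ch in text
        # category() returns codes such as "Lu", "Ll", "Zs", "Po", "Nd";
        # the first letter is the major class.
        if unicodedata.category(ch)[0] in ("L", "Z")
    )

sample = "Hello, w\u00f6rld! 100% (ok)"
cleaned = keep_letters_and_separators(sample)
print(cleaned)
```

Note that a letter written as a base character plus a combining mark would lose the mark here, because combining marks are in category M rather than L.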
Additional information on Unicode
Some Unicode characters seem to cause problems because they can be encoded in different ways (as a single code point or as a combination of code points). Please refer to regular-expressions.info for more information.
As described here http://developer.android.com/reference/java/util/regex/Pattern.html
Patterns are compiled regular expressions. In many cases, convenience methods such as String.matches, String.replaceAll and String.split will be preferable, but if you need to do a lot of work with the same regular expression, it may be more efficient to compile it once and reuse it. The Pattern class and its companion, Matcher, also offer more functionality than the small amount exposed by String.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpressionTest {
    public static void main(String[] args) {
        System.out.println("String is = " + getOnlyStrings("!&(*^*(^(+one(&(^()(*)(*&^%$#@!#$%^&*()("));
        System.out.println("Number is = " + getOnlyDigits("&(*^*(^(+91-&*9hi-639-0097(&(^("));
    }

    // Removes every character that is not a digit.
    public static String getOnlyDigits(String s) {
        Pattern pattern = Pattern.compile("[^0-9]");
        Matcher matcher = pattern.matcher(s);
        return matcher.replaceAll("");
    }

    // Removes every character that is not an ASCII letter or a space.
    public static String getOnlyStrings(String s) {
        Pattern pattern = Pattern.compile("[^a-z A-Z]");
        Matcher matcher = pattern.matcher(s);
        return matcher.replaceAll("");
    }
}
Result:
String is = one
Number is = 9196390097
I’m trying to set a regexp which will check the start of a string, and if it contains either http:// or https:// it should match it.
How can I do that? I’m trying the following which isn’t working:
Your use of [] is incorrect: note that [] denotes a character class and will therefore only ever match one character. The expression [(http)(https)] translates to “match a (, an h, a t, a t, a p, a ), or an s.” (Duplicate characters are ignored.)
Try this:
If you really want to use alternation, use this syntax instead:
Case insensitive:
You might have to escape the forward slashes though, depending on context.
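The snippets for this answer did not survive, so here is a minimal Python sketch of the idea described above: an optional "s" after "http", anchored at the start of the string, plus the equivalent alternation form. The names SCHEME_RE, ALT_RE and starts_with_http are illustrative only.

```python
import re

# ^ anchors at the start of the string; "s?" makes the "s" optional;
# re.IGNORECASE also accepts "HTTP://" and "Https://".
SCHEME_RE = re.compile(r"^https?://", re.IGNORECASE)

# The same check written with alternation instead of an optional character.
ALT_RE = re.compile(r"^(http|https)://", re.IGNORECASE)

def starts_with_http(url):
    """Return True if url begins with http:// or https://."""
    return SCHEME_RE.match(url) is not None

for candidate in ("http://example.com", "https://example.com", "ftp://example.com"):
    print(candidate, starts_with_http(candidate))
```

In a Python raw string the forward slashes need no escaping; that is only required where "/" is also the regex delimiter.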
I use this
regexp to validate the email
([\w\.\-]+)
- this is for the first-level domain (many letters and numbers, also point and hyphen)
([\w\-]+)
- this is for second-level domain
((\.(\w){2,3})+)
- and this is for other level domains(from 3 to infinity) which includes a point and 2 or 3 literals
what’s wrong with this regex?
EDIT:it doesn’t match the “something@someth.ing” email
TLDs like .museum aren’t matched this way, and there are a few other long TLDs. Also, you can validate email addresses using the MailAddress class, as Microsoft explains here in a note:
Instead of using a regular expression to validate an email address, you can use the System.Net.Mail.MailAddress class. To determine whether an email address is valid, pass the email address to the MailAddress.MailAddress(String) class constructor.
public bool IsValid(string emailaddress)
{
try
{
MailAddress m = new MailAddress(emailaddress);
return true;
}
catch (FormatException)
{
return false;
}
}
This saves you a lot of headaches because you don’t have to write (or try to understand someone else’s) regex.
I think @"^([\w\.\-]+)@([\w\-]+)((\.(\w){2,3})+)$" should work.
You need to write it like
string email = txtemail.Text;
Regex regex = new Regex(@"^([\w\.\-]+)@([\w\-]+)((\.(\w){2,3})+)$");
Match match = regex.Match(email);
if (match.Success)
Response.Write(email + " is correct");
else
Response.Write(email + " is incorrect");
Be warned that this will fail if:
There is a subdomain after the @ symbol.
You use a TLD with a length greater than 3, such as .info
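To see those two failure cases concretely, here is a small Python sketch that runs the same pattern against a few sample addresses. The addresses are made up for illustration, and Python's re module is used in place of the .NET Regex class, on the assumption that the two behave alike for this simple pattern.

```python
import re

# Same pattern as in the answer above, written as a Python raw string.
EMAIL_RE = re.compile(r"^([\w\.\-]+)@([\w\-]+)((\.(\w){2,3})+)$")

def looks_valid(address):
    """Return True if the whole address matches the pattern."""
    return EMAIL_RE.match(address) is not None

samples = (
    "something@someth.ing",   # accepted: one label, then a 3-character part
    "user@example.co.uk",     # accepted: every dotted part has 2-3 characters
    "user@example.info",      # rejected: the TLD has 4 characters
    "user@mail.example.com",  # rejected: the "example" label is longer than 3
)
for address in samples:
    print(address, looks_valid(address))
```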
I have an expression for checking email addresses that I use.
Since none of the above were as short or as accurate as mine, I thought I would post it here.
@"^[\w!#$%&'*+\-/=?\^_`{|}~]+(\.[\w!#$%&'*+\-/=?\^_`{|}~]+)*"
+ "@"
+ @"((([\-\w]+\.)+[a-zA-Z]{2,4})|(([0-9]{1,3}\.){3}[0-9]{1,3}))$";
For more info go read about it here: C# – Email Regular Expression
Also, this checks for RFC validity based on email syntax, not for whether the email address really exists. The only way to test that an email address really exists is to send an email and have the user verify they received it by clicking a link or entering a token.
Then there are throw-away domains, such as Mailinator.com, and such. This doesn’t do anything to verify whether an email is from a throwaway domain or not.
I am after documentation on using wildcard or regular expressions (not sure on the exact terminology) with a jQuery selector.
I have looked for this myself but have been unable to find information on the syntax and how to use it. Does anyone know where the documentation for the syntax is?
EDIT: The attribute filters allow you to select based on patterns of an attribute value.
James Padolsey created a wonderful filter that allows regex to be used for selection.
Say you have the following div:
Padolsey’s :regex filter can select it like so:
Also, check the official documentation on selectors.
You can use the filter function to apply more complicated regex matching.
Here’s an example which would just match the first three divs:
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<div id="abcd">Not matched</div>
<div id="abccd">Not matched</div>
<div id="abcccd">Not matched</div>
<div id="abd">Not matched</div>
These can be helpful.
If you’re finding by Contains then it’ll be like this
If you’re finding by Starts With then it’ll be like this
If you’re finding by Ends With then it’ll be like this
If you want to select elements which id is not a given string
If you want to select elements which name contains a given word, delimited by spaces
If you want to select elements which id is equal to a given string or starting with that string followed by a hyphen
I would like a RegExp that will remove all special characters from a string. I am trying something like this but it doesn’t work in IE7, though it works in Firefox.
var specialChars = "!@#$^&%*()+=-[]\/{}|:<>?,.";
for (var i = 0; i < specialChars.length; i++) {
stringToReplace = stringToReplace.replace(new RegExp("\\" + specialChars[i], "gi"), "");
}
A detailed description of the RegExp would be helpful as well.
As was mentioned in the comments it’s easier to do this as a whitelist - replace the characters which aren’t in your safelist.
The caret (^) character is the negation of the set [...], gi means global and case-insensitive (the latter is a bit redundant but I wanted to mention it), and the safelist in this example is word characters (\w, i.e. letters, digits and underscores) and whitespace (\s).
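As a rough Python counterpart of this whitelist approach (the JavaScript snippet itself is missing here), the sketch below removes everything that is not a word character or whitespace. The function names are hypothetical. One difference worth knowing: in Python 3, \w matches Unicode letters by default, while the re.ASCII flag restricts it to [a-zA-Z0-9_], which is closer to how \w behaves in JavaScript.

```python
import re

def strip_special(text):
    # [^\w\s] matches any character that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", "", text)

def strip_special_ascii(text):
    # With re.ASCII, accented letters no longer count as word characters.
    return re.sub(r"[^\w\s]", "", text, flags=re.ASCII)

print(strip_special("a!b@c #d"))
print(strip_special("r\u00e9sum\u00e9!"))
print(strip_special_ascii("r\u00e9sum\u00e9!"))
```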
Note that if you still want to exclude a set, including things like slashes and special characters you can do the following:
Take special note that in order to also include the “minus” character, you need to escape it with a backslash like the latter group. If you don’t, it will also select 0-9, which is probably undesired.
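Do not use [^\w\s]: this will remove letters with accents (like àèéìòù), not to mention Cyrillic or Chinese; letters from such languages will be completely removed.
You really don’t want to remove these letters together with all the special characters. You have two options:
Add to your regex all the special characters you want to keep, for example: [^èéòàùì\w\s].
Use a library that supports Unicode matching via the \p{...} syntax, such as XRegExp (loaded by the snippet below).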
var str = "Їжак::: résd,$%& adùf"
var search = XRegExp('([^?<first>\\pL ]+)');
var res = XRegExp.replace(str, search, '',"all");
console.log(res); // returns "Їжак::: resd,adf"
console.log(str.replace(/[^\w\s]/gi, '') ); // returns " rsd adf"
console.log(str.replace(/[^\wèéòàùì\s]/gi, '') ); // returns " résd adùf"
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.1.1/xregexp-all.js"></script>
I need to extract from a string a set of characters which are included between two delimiters, without returning the delimiters themselves.
A simple example should be helpful:
Target: extract the substring between square brackets, without returning the brackets themselves.
Base string: This is a test string [more or less]
If I use the following reg. ex.
\[.*?\]
The match is [more or less]. I need to get only more or less (without the brackets).
Is it possible to do it?
Easily done:
Technically that’s using lookaheads and lookbehinds. See Lookahead and Lookbehind Zero-Width Assertions. The pattern consists of:
Alternatively you can just capture what’s between the square brackets:
and return the first captured group instead of the entire match.
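Both approaches can be sketched in Python, using the base string from the question. The patterns below are the ones quoted elsewhere in this thread; the variable names are illustrative.

```python
import re

text = "This is a test string [more or less]"

# Lookbehind/lookahead: the brackets are asserted but are not part of the match.
lookaround = re.search(r"(?<=\[).*?(?=\])", text)

# Capturing group: the brackets are matched, but group 1 holds only the inside.
captured = re.search(r"\[(.*?)\]", text)

print(lookaround.group(0))
print(captured.group(1))
print(captured.group(0))  # the entire match still includes the brackets
```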
If you are using JavaScript, the first solution provided by cletus, (?<=\[)(.*?)(?=\]), will not work in older engines, because JavaScript only gained lookbehind support with ES2018.
However, the second solution works well, but you need to get the second matched element.
Example:
var regex = /\[(.*?)\]/;
var strToMatch = "This is a test string [more or less]";
var matched = regex.exec(strToMatch);
It will return:
So, what you need is the second value. Use:
To return:
You just need to ‘capture’ the bit between the brackets.
To capture you put it inside parentheses. You do not say which language this is using. In Perl for example, you would access this using the $1 variable.
Other languages will have different mechanisms. C#, for example, uses the Match collection class, I believe.
As part of a project for school, I need to replace a string from the form:
to something like:
I believe this can be done with regular expressions, but I don’t know how to do it yet.
Can you lend me a hand?
P.S. The actual assignment is to implement a Polynomial Processing Java application, and I’m using this to pass polynomial.toString() from the model to the view, and I want to display it using HTML tags in a pretty way.
private String removeScript(String content) {
Pattern p = Pattern.compile("<script[^>]*>(.*?)</script>",
Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
return p.matcher(content).replaceAll("");
}
Why does the second line of this code throw ArrayIndexOutOfBoundsException?
While this works:
I use Java 7.
You need to escape the dot if you want to split on a literal dot:
Otherwise you are splitting on the regex ., which means “any character”.
Note the double backslash needed to create a single backslash in the regex.
You’re getting an ArrayIndexOutOfBoundsException because your input string is just a dot, i.e. ".", which is an edge case that produces an empty array when split on dot; split(regex) removes all trailing blanks from the result, but since splitting a dot on a dot leaves only two blanks, after trailing blanks are removed you’re left with an empty array.
To avoid getting an ArrayIndexOutOfBoundsException for this edge case, use the overloaded version split(regex, limit), which has a second parameter that is the size limit for the resulting array. When limit is negative, the behaviour of removing trailing blanks from the resulting array is disabled:
i.e., when filename is just a dot ".", calling filename.split("\\.", -1)[0] will return a blank, but calling filename.split("\\.")[0] will throw an ArrayIndexOutOfBoundsException.
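The difference between the escaped and unescaped dot is not specific to Java. As an illustration in Python (not the Java code from the question), the sketch below splits with both patterns. Note one difference from Java: Python's re.split keeps trailing empty strings, so the single-dot edge case yields two empty strings rather than an empty list, much like Java's split with a negative limit.

```python
import re

# Unescaped dot: every character is treated as a separator.
unescaped = re.split(".", "a.b")

# Escaped dot: only the literal "." separates the pieces.
escaped = re.split(r"\.", "a.b")

# Edge case from the question: the input is just a dot.
edge_case = re.split(r"\.", ".")

print(unescaped)
print(escaped)
print(edge_case)
```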
“.” is a special character in the Java regex engine, so you have to escape it as “\.”, which is written "\\." inside a Java string literal:
I hope this helps
I am trying to write a regular expression that will only allow lowercase letters and up to 10 characters. What I have so far looks like this:
This does not work or compile. I had a working one that would just allow lowercase letters which was this:
But I need to limit the number of characters to 10.
You can use curly braces to control the number of occurrences. For example, this means 0 to 10:
The options are:
See the regular expression reference.
Your expression had a + after the closing curly brace, hence the error.
/^[a-z]{0,10}$/ should work. Use /^[a-z]{1,10}$/ if you want to match at least one character, like /^[a-z]+$/ does.
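A quick way to check the bounded quantifier is to run it against a few inputs. This is a minimal Python sketch; the names LOWER_MAX_10 and is_valid are illustrative, and fullmatch plays the role of the ^...$ anchors.

```python
import re

# {0,10} allows between zero and ten repetitions of a lowercase ASCII letter.
LOWER_MAX_10 = re.compile(r"[a-z]{0,10}")

def is_valid(value):
    """Return True if value is at most 10 lowercase letters and nothing else."""
    return LOWER_MAX_10.fullmatch(value) is not None

for value in ("", "abc", "abcdefghij", "abcdefghijk", "Abc", "abc1"):
    print(repr(value), is_valid(value))
```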
What is the regex for simply checking if a string contains a certain word (e.g. ‘Test’)? I’ve done some googling but can’t get a straight example of such a regex. This is for a build script but has no bearing to any particular programming language.
Assuming regular PCRE-style regex flavors:
If you want to check for it as a single, full word, it’s \bTest\b, with appropriate flags for case insensitivity if desired and delimiters for your programming language. \b represents a “word boundary”, that is, a point between characters where a word can be considered to start or end. For example, since spaces are used to separate words, there will be a word boundary on either side of a space.
If you want to check for it as part of the word, it’s just Test, again with appropriate flags for case insensitivity. Note that usually, dedicated “substring” methods tend to be faster in this case, because they avoid the overhead of parsing the regex.
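The question is language-agnostic, so purely as an illustration, here is how the two checks might look in Python. The helper names contains_word and contains_substring are made up for this sketch.

```python
import re

def contains_word(text, word="Test"):
    # \b on both sides requires the word to stand alone;
    # re.escape guards against regex metacharacters in `word`.
    return re.search(r"\b" + re.escape(word) + r"\b", text) is not None

def contains_substring(text, word="Test"):
    # Plain substring test, no regex involved.
    return word in text

print(contains_word("This is a Test."))
print(contains_word("Testing 123"))
print(contains_substring("Testing 123"))
```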
Just don’t anchor your pattern:
The above regex will check for the literal string “Test” being found somewhere within it.
I have hashed together different elements to get the validation we needed for student emails. I hope this is going to work I haven’t tested fully.
Given a string like:
"The dog has a long tail, and it is RED!"
What kind of jQuery or JavaScript magic can be used to keep spaces to only one space max?
Goal:
"The dog has a long tail, and it is RED!"
Given that you also want to cover tabs, newlines, etc, just replace \s\s+ with ' ':
If you really want to cover only spaces (and thus not tabs, newlines, etc), do so:
Since you seem to be interested in performance, I profiled these with firebug. Here are the results I got:
str.replace( /  +/g, ' ' ) -> 380ms
str.replace( /\s\s+/g, ' ' ) -> 390ms
str.replace( / {2,}/g, ' ' ) -> 470ms
str.replace( / +/g, ' ' ) -> 790ms
str.replace( / +(?= )/g, ' ') -> 3250ms
This is on Firefox, running 100k string replacements.
I encourage you to do your own profiling tests with firebug, if you think performance is an issue. Humans are notoriously bad at predicting where the bottlenecks in their programs lie.
(Also, note that IE 8’s developer toolbar also has a profiler built in – it might be worth checking what the performance is like in IE.)
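For readers outside JavaScript, the two replacements discussed in this answer might look as follows in Python. This is a sketch with hypothetical function names, and no claim is made about relative performance.

```python
import re

def collapse_whitespace(text):
    # Two or more whitespace characters of any kind become one space.
    return re.sub(r"\s\s+", " ", text)

def collapse_spaces(text):
    # Only runs of two or more literal spaces are collapsed.
    return re.sub(r" {2,}", " ", text)

sample = "The dog" + " " * 5 + "has a long" + " " * 3 + "tail"
print(collapse_spaces(sample))
print(collapse_whitespace("a \t\n b"))
```

A single tab or newline is left untouched by both versions, since each pattern requires at least two characters.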
EDIT: If you wish to replace all kind of whitespace characters the most efficient way would be like that:
I’m trying to determine if a string is a subset of another string. For example:
I want to return TRUE if “value” appears as part of the string “chars”. In the following scenario, I would want to return false:
Use the grepl function
Sigh, it took me 45 minutes to find the answer to this simple question. The answer is: grepl(needle, haystack, fixed=TRUE)
# Correct
> grepl("1+2", "1+2", fixed=TRUE)
[1] TRUE
> grepl("1+2", "123+456", fixed=TRUE)
[1] FALSE
# Incorrect
> grepl("1+2", "1+2")
[1] FALSE
> grepl("1+2", "123+456")
[1] TRUE
grep is named after the Linux executable, which is itself an acronym of “Global Regular Expression Print”; it would read lines of input and then print them if they matched the arguments you gave. “Global” meant the match could occur anywhere on the input line. I’ll explain “Regular Expression” below, but the idea is that it’s a smarter way to match the string (R calls this “character”, e.g. class("abc")), and “Print” because it’s a command-line program: emitting output means it prints to its output stream.
Now, the grep program is basically a filter, from lines of input to lines of output. And it seems that R’s grep function similarly will take an array of inputs. For reasons that are utterly unknown to me (I only started playing with R about an hour ago), it returns a vector of the indexes that match, rather than a list of matches.
But, back to your original question: what we really want is to know whether we found the needle in the haystack, a true/false value. They apparently decided to name this function grepl, as in “grep” but with a “Logical” return value (they call true and false logical values, e.g. class(TRUE)).
So, now we know where the name came from and what it’s supposed to do. Let’s get back to regular expressions. The arguments, even though they are strings, are used to build regular expressions (henceforth: regex). A regex is a way to match a string (if this definition irritates you, let it go). For example, the regex a matches the character "a", the regex a* matches the character "a" 0 or more times, and the regex a+ would match the character "a" 1 or more times. Hence in the example above, the needle we are searching for, 1+2, when treated as a regex, means “one or more 1 followed by a 2”… but ours is followed by a plus!
So, if you used grepl without setting fixed, your needles would accidentally be haystacks, and that would accidentally work quite often; we can see it even works for the OP’s example. But that’s a latent bug! We need to tell it the input is a string, not a regex, which is apparently what fixed is for. Why fixed? No clue; bookmark this answer because you’re probably going to have to look it up 5 more times before you get it memorized.
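The same pitfall exists in other languages. As an illustration in Python rather than R, the sketch below shows the needle "1+2" behaving differently as a regex and as a literal string; re.escape and the in operator play roughly the role that fixed=TRUE plays for grepl.

```python
import re

needle = "1+2"

# Treated as a regex, "1+2" means "one or more 1s followed by a 2".
regex_on_itself = re.search(needle, "1+2") is not None
regex_on_other = re.search(needle, "123+456") is not None

# Treated as a literal string, it only matches the characters 1, +, 2.
literal_on_itself = needle in "1+2"
literal_on_other = needle in "123+456"

# Escaping the needle makes the regex engine treat it literally too.
escaped_on_itself = re.search(re.escape(needle), "1+2") is not None

print(regex_on_itself, regex_on_other)
print(literal_on_itself, literal_on_other)
print(escaped_on_itself)
```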
The better your code is, the less history you have to know to make sense of it. Every argument can have at least two interesting values (otherwise it wouldn’t need to be an argument), the docs list 9 arguments here, which means there’s at least 2^9=512 ways to invoke it, that’s a lot of work to write, test, and remember… decouple such functions (split them up, remove dependencies on each other, string things are different than regex things are different than vector things). Some of the options are also mutually exclusive, don’t give users incorrect ways to use the code, i.e. the problematic invocation should be structurally nonsensical (such as passing an option that doesn’t exist), not logically nonsensical (where you have to emit a warning to explain it). Put metaphorically: replacing the front door in the side of the 10th floor with a wall is better than hanging a sign that warns against its use, but either is better than neither. In an interface, the function defines what the arguments should look like, not the caller (because the caller depends on the function, inferring everything that everyone might ever want to call it with makes the function depend on the callers, too, and this type of cyclical dependency will quickly clog a system up and never provide the benefits you expect). Be very wary of equivocating types, it’s a design flaw that things like TRUE and 0 and "abc" are all vectors.
You want grepl:
> chars <- "test"
> value <- "es"
> grepl(value, chars)
[1] TRUE
> chars <- "test"
> value <- "et"
> grepl(value, chars)
[1] FALSE
Python has string.find() and string.rfind() to get the index of a substring in a string.
I’m wondering whether there is something like string.find_all() which can return all found indexes (not only the first from the beginning or the first from the end).
For example:
There is no simple built-in string function that does what you’re looking for, but you could use the more powerful regular expressions:
If you want to find overlapping matches, lookahead will do that:
If you want a reverse find-all without overlaps, you can combine positive and negative lookahead into an expression like this:
import re

search = 'tt'
[m.start() for m in re.finditer('(?=%s)(?!.{1,%d}%s)' % (search, len(search)-1, search), 'ttt')]
#[1]
re.finditer returns an iterator, so you could change the [] in the above to () to get a generator expression instead of a list, which will be more efficient if you’re only iterating through the results once.
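The first two examples in this answer appear to have been lost, so here is a minimal sketch of what the plain and the lookahead-based searches could look like. The variable names are illustrative.

```python
import re

text = "test test test test"

# Plain pattern: each match consumes its characters, so matches cannot overlap.
non_overlapping = [m.start() for m in re.finditer("test", text)]

# Zero-width lookahead: nothing is consumed, so overlapping hits are reported.
overlapping = [m.start() for m in re.finditer("(?=tt)", "ttt")]

print(non_overlapping)
print(overlapping)
```

If the substring may contain regex metacharacters, wrapping it in re.escape before building the pattern avoids surprises.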
Thus, we can build it ourselves:
def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1: return
        yield start
        start += len(sub) # use start += 1 to find overlapping matches

list(find_all('spam spam spam spam', 'spam')) # [0, 5, 10, 15]
No temporary strings or regexes required.
Here’s a (very inefficient) way to get all (i.e. even overlapping) matches:
>>> string = "test test test test"
>>> [i for i in range(len(string)) if string.startswith('test', i)]
[0, 5, 10, 15]
How can I check if a given string is a valid URL address?
My knowledge of regular expressions is basic and doesn’t allow me to choose from the hundreds of regular expressions I’ve already seen on the web.
I wrote my URL (actually IRI, internationalized) pattern to comply with RFC 3987 (http://www.faqs.org/rfcs/rfc3987.html). These are in PCRE syntax.
For absolute IRIs (internationalized):
/^[a-z](?:[-a-z0-9\+\.].html)*:(?:\/\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:])*@)?(?:\[(?:(?:(?:[0-9a-f]{1,4}:){6}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|::(?:[0-9a-f]{1,4}:){5}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){4}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,1}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){3}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,2}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){2}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,3}[0-9a-f]{1,4})?::[0-9a-f]{1,4}:(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,4}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,5}[0-9a-f]{1,4})?::[0-9a-f]{1,4}|(?:(?:[0-9a-f]{1,4}:){0,6}[0-9a-f]{1,4})?::)|v[0-9a-f]+\.[-a-z0-9\._~!\$&'\(\)\*\+,;=:]+)\]|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-
4][0-9]|25[0-5])){3}|(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=])*)(?::[0-9]*)?(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))*)*|\/(?:(?:(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))+)(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))*)*)?|(?:(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))+)(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{
FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))*)*|(?!(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@])))(?:\?(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@])|[\x{E000}-\x{F8FF}\x{F0000}-\x{FFFFD}\x{100000}-\x{10FFFD}\/\?])*)?(?:\#(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@])|[\/\?])*)?$/i
To also allow relative IRIs:
/^(?:[a-z](?:[-a-z0-9\+\.].html)*:(?:\/\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:])*@)?(?:\[(?:(?:(?:[0-9a-f]{1,4}:){6}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|::(?:[0-9a-f]{1,4}:){5}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){4}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,1}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){3}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,2}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){2}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,3}[0-9a-f]{1,4})?::[0-9a-f]{1,4}:(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,4}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,5}[0-9a-f]{1,4})?::[0-9a-f]{1,4}|(?:(?:[0-9a-f]{1,4}:){0,6}[0-9a-f]{1,4})?::)|v[0-9a-f]+\.[-a-z0-9\._~!\$&'\(\)\*\+,;=:]+)\]|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2
[0-4][0-9]|25[0-5])){3}|(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=])*)(?::[0-9]*)?(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))*)*|\/(?:(?:(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))+)(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))*)*)?|(?:(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))+)(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}
\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))*)*|(?!(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@])))(?:\?(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@])|[\x{E000}-\x{F8FF}\x{F0000}-\x{FFFFD}\x{100000}-\x{10FFFD}\/\?])*)?(?:\#(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@])|[\/\?])*)?|(?:\/\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:])*@)?(?:\[(?:(?:(?:[0-9a-f]{1,4}:){6}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]
|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|::(?:[0-9a-f]{1,4}:){5}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){4}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,1}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){3}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,2}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){2}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,3}[0-9a-f]{1,4})?::[0-9a-f]{1,4}:(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,4}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,5}[0-9a-f]{1,4})?::[0-9a-f]{1,4}|(?:(?:[0-9a-f]{1,4}:){0,6}[0-9a-f]{1,4})?::)|v[0-9a-f]+\.[-a-z0-9\._~!\$&'\(\)\*\+,;=:]+)\]|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3}|(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=])*)(?::[0-9]*)?(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000
}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))*)*|\/(?:(?:(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))+)(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))*)*)?|(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=@])+)(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@]))*)*|(?!(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}
\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@])))(?:\?(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@])|[\x{E000}-\x{F8FF}\x{F0000}-\x{FFFFD}\x{100000}-\x{10FFFD}\/\?])*)?(?:\#(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&'\(\)\*\+,;=:@])|[\/\?])*)?)$/i
How they were compiled (in PHP):
<?php
/* Regex convenience functions (character class, non-capturing group) */
function cc($str, $suffix = '', $negate = false) {
return '[' . ($negate ? '^' : '') . $str . ']' . $suffix;
}
function ncg($str, $suffix = '') {
return '(?:' . $str . ')' . $suffix;
}
/* Preserved from RFC3986 */
$ALPHA = 'a-z';
$DIGIT = '0-9';
$HEXDIG = $DIGIT . 'a-f';
$sub_delims = '!\\$&\'\\(\\)\\*\\+,;=';
$gen_delims = ':\\/\\?\\#\\[\\]@';
$reserved = $gen_delims . $sub_delims;
$unreserved = '-' . $ALPHA . $DIGIT . '\\._~';
$pct_encoded = '%' . cc($HEXDIG) . cc($HEXDIG);
$dec_octet = ncg(implode('|', array(
cc($DIGIT),
cc('1-9') . cc($DIGIT),
'1' . cc($DIGIT) . cc($DIGIT),
'2' . cc('0-4') . cc($DIGIT),
'25' . cc('0-5')
)));
$IPv4address = $dec_octet . ncg('\\.' . $dec_octet, '{3}');
$h16 = cc($HEXDIG, '{1,4}');
$ls32 = ncg($h16 . ':' . $h16 . '|' . $IPv4address);
$IPv6address = ncg(implode('|', array(
ncg($h16 . ':', '{6}') . $ls32,
'::' . ncg($h16 . ':', '{5}') . $ls32,
ncg($h16, '?') . '::' . ncg($h16 . ':', '{4}') . $ls32,
ncg(ncg($h16 . ':', '{0,1}') . $h16, '?') . '::' . ncg($h16 . ':', '{3}') . $ls32,
ncg(ncg($h16 . ':', '{0,2}') . $h16, '?') . '::' . ncg($h16 . ':', '{2}') . $ls32,
ncg(ncg($h16 . ':', '{0,3}') . $h16, '?') . '::' . $h16 . ':' . $ls32,
ncg(ncg($h16 . ':', '{0,4}') . $h16, '?') . '::' . $ls32,
ncg(ncg($h16 . ':', '{0,5}') . $h16, '?') . '::' . $h16,
ncg(ncg($h16 . ':', '{0,6}') . $h16, '?') . '::',
)));
$IPvFuture = 'v' . cc($HEXDIG, '+') . '\\.' . cc($unreserved . $sub_delims . ':', '+');
$IP_literal = '\\[' . ncg(implode('|', array($IPv6address, $IPvFuture))) . '\\]';
$port = cc($DIGIT, '*');
$scheme = cc($ALPHA) . ncg(cc('-' . $ALPHA . $DIGIT . '\\+\\.'), '*');
/* New or changed in RFC3987 */
$iprivate = '\x{E000}-\x{F8FF}\x{F0000}-\x{FFFFD}\x{100000}-\x{10FFFD}';
$ucschar = '\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}' .
'\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}' .
'\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}' .
'\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}' .
'\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}' .
'\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}';
$iunreserved = '-' . $ALPHA . $DIGIT . '\\._~' . $ucschar;
$ipchar = ncg($pct_encoded . '|' . cc($iunreserved . $sub_delims . ':@'));
$ifragment = ncg($ipchar . '|' . cc('\\/\\?'), '*');
$iquery = ncg($ipchar . '|' . cc($iprivate . '\\/\\?'), '*');
$isegment_nz_nc = ncg($pct_encoded . '|' . cc($iunreserved . $sub_delims . '@'), '+');
$isegment_nz = ncg($ipchar, '+');
$isegment = ncg($ipchar, '*');
$ipath_empty = '(?!' . $ipchar . ')';
$ipath_rootless = ncg($isegment_nz) . ncg('\\/' . $isegment, '*');
$ipath_noscheme = ncg($isegment_nz_nc) . ncg('\\/' . $isegment, '*');
$ipath_absolute = '\\/' . ncg($ipath_rootless, '?'); // Spec says isegment-nz *( "/" isegment )
$ipath_abempty = ncg('\\/' . $isegment, '*');
$ipath = ncg(implode('|', array(
$ipath_abempty,
$ipath_absolute,
$ipath_noscheme,
$ipath_rootless,
$ipath_empty
)));
$ireg_name = ncg($pct_encoded . '|' . cc($iunreserved . $sub_delims), '*');
$ihost = ncg(implode('|', array($IP_literal, $IPv4address, $ireg_name)));
$iuserinfo = ncg($pct_encoded . '|' . cc($iunreserved . $sub_delims . ':'), '*');
$iauthority = ncg($iuserinfo . '@', '?') . $ihost . ncg(':' . $port, '?');
$irelative_part = ncg(implode('|', array(
'\\/\\/' . $iauthority . $ipath_abempty . '',
'' . $ipath_absolute . '',
'' . $ipath_noscheme . '',
'' . $ipath_empty . ''
)));
$irelative_ref = $irelative_part . ncg('\\?' . $iquery, '?') . ncg('\\#' . $ifragment, '?');
$ihier_part = ncg(implode('|', array(
'\\/\\/' . $iauthority . $ipath_abempty . '',
'' . $ipath_absolute . '',
'' . $ipath_rootless . '',
'' . $ipath_empty . ''
)));
$absolute_IRI = $scheme . ':' . $ihier_part . ncg('\\?' . $iquery, '?');
$IRI = $scheme . ':' . $ihier_part . ncg('\\?' . $iquery, '?') . ncg('\\#' . $ifragment, '?');
$IRI_reference = ncg($IRI . '|' . $irelative_ref);
Edit 7 March 2011: Because of the way PHP handles backslashes in quoted strings, these are unusable by default. You’ll need to double-escape backslashes except where the backslash has a special meaning in regex. You can do that this way:
$escape_backslash = '/(?<!\\)\\(?![\[\]\\\^\$\.\|\*\+\(\)QEnrtaefvdwsDWSbAZzB1-9GX]|x\{[0-9a-f]{1,4}\}|\c[A-Z]|)/';
$absolute_IRI = preg_replace($escape_backslash, '\\\\', $absolute_IRI);
$IRI = preg_replace($escape_backslash, '\\\\', $IRI);
$IRI_reference = preg_replace($escape_backslash, '\\\\', $IRI_reference);
I’ve just written up a blog post for a great solution for recognizing URLs in most used formats such as:
www.google.com
http://www.google.com
mailto:somebody@google.com
somebody@google.com
www.url-with-querystring.com/?url=has-querystring
The regular expression used is:
/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)/
Is there any way to tell sed to output only captured groups? For example, given the input:
and pattern:
Could I get only 123 and 987 output in the way formatted by back references?
The key to getting this to work is to tell sed to exclude what you don’t want to be output as well as specifying what you do want.
string='This is a sample 123 text and some 987 numbers'
echo "$string" | sed -rn 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'
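This says: don’t default to printing each line (-n); exclude zero or more non-digits; include one or more digits; exclude one or more non-digits; include one or more digits; exclude zero or more non-digits; print the substitution (p).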
In general, in sed you capture groups using parentheses and output what you capture using a back reference:
will output “bar”. If you use -r (-E for OS X) for extended regex, you don’t need to escape the parentheses:
There can be up to 9 capture groups and their back references. The back references are numbered in the order the groups appear, but they can be used in any order and can be repeated:
outputs “a bar a”.
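For comparison, the same capture-and-back-reference idea can be sketched with Python's re module instead of sed. This is only an illustration: the helper names extract_numbers and reorder are hypothetical, the first pattern mirrors the sed command from the earlier answer, and the second is a made-up example of reusing back references in a different order.

```python
import re

def extract_numbers(text):
    # Skip non-digits, capture two digit runs, and keep only the captures.
    pattern = r'[^0-9]*([0-9]+)[^0-9]+([0-9]+)[^0-9]*'
    return re.sub(pattern, r'\1 \2', text)

def reorder(text):
    # Back references can be repeated and used in any order.
    return re.sub(r'(foo)(bar)', r'\2 \1 \2', text)

if __name__ == "__main__":
    print(extract_numbers('This is a sample 123 text and some 987 numbers'))
    print(reorder('foobar'))
```

As with sed, everything the pattern matches is replaced, so the unwanted text has to be matched (and dropped) explicitly.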
If you have GNU grep (it may also work in BSD, including OS X):
or variations such as:
The -P option enables Perl Compatible Regular Expressions. See man 3 pcrepattern or man 3 pcresyntax.
Sed has up to nine remembered patterns but you need to use escaped parentheses to remember portions of the regular expression.
See here for examples and more detail
you can use grep
I’m looking how to replace/encode text using RegEx based on RegEx settings/params below:
I have seen some examples on RegEx, but confused as to how to apply it the same way in SQL Server. Any suggestions would be helpful. Thank you.
You do not need to interact with managed code, as you can use LIKE:
CREATE TABLE #Sample(Field varchar(50), Result varchar(50))
GO
INSERT INTO #Sample (Field, Result) VALUES ('ABC123 ', 'Do not match')
INSERT INTO #Sample (Field, Result) VALUES ('ABC123.', 'Do not match')
INSERT INTO #Sample (Field, Result) VALUES ('ABC123&', 'Match')
SELECT * FROM #Sample WHERE Field LIKE '%[^a-z0-9 .]%'
GO
DROP TABLE #Sample
As your expression ends with +, you can go with '%[^a-z0-9 .][^a-z0-9 .]%'
EDIT: to make this clear: SQL Server doesn’t support regular expressions without managed code. Depending on the situation, the LIKE operator can be an option, but it lacks the flexibility that regular expressions provide.
You will have to build a CLR procedure that provides regex functionality, as this article illustrates.
Their example function uses VB.NET:
Imports System
Imports System.Data.Sql
Imports Microsoft.SqlServer.Server
Imports System.Data.SqlTypes
Imports System.Runtime.InteropServices
Imports System.Text.RegularExpressions
Imports System.Collections 'the IEnumerable interface is here
Namespace SimpleTalk.Phil.Factor
Public Class RegularExpressionFunctions
'RegExIsMatch function
<SqlFunction(IsDeterministic:=True, IsPrecise:=True)> _
Public Shared Function RegExIsMatch( _
ByVal pattern As SqlString, _
ByVal input As SqlString, _
ByVal Options As SqlInt32) As SqlBoolean
If (input.IsNull OrElse pattern.IsNull) Then
Return SqlBoolean.False
End If
Dim RegExOption As New System.Text.RegularExpressions.RegExOptions
RegExOption = Options
Return RegEx.IsMatch(input.Value, pattern.Value, RegExOption)
End Function
End Class '
End Namespace
…and is installed in SQL Server using the following SQL (replacing ‘%’-delimited variables by their actual equivalents):
sp_configure 'clr enabled', 1
RECONFIGURE WITH OVERRIDE
IF EXISTS ( SELECT 1
FROM sys.objects
WHERE object_id = OBJECT_ID(N'dbo.RegExIsMatch') )
DROP FUNCTION dbo.RegExIsMatch
go
IF EXISTS ( SELECT 1
FROM sys.assemblies asms
WHERE asms.name = N'RegExFunction ' )
DROP ASSEMBLY [RegExFunction]
CREATE ASSEMBLY RegExFunction
FROM '%FILE%'
GO
CREATE FUNCTION RegExIsMatch
(
@Pattern NVARCHAR(4000),
@Input NVARCHAR(MAX),
@Options int
)
RETURNS BIT
AS EXTERNAL NAME
RegExFunction.[SimpleTalk.Phil.Factor.RegularExpressionFunctions].RegExIsMatch
GO
--a few tests
---Is this card a valid credit card?
SELECT dbo.RegExIsMatch ('^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|(?:2131|1800|35\d{3})\d{11})$','4241825283987487',1)
--is there a number in this string
SELECT dbo.RegExIsMatch( '\d','there is 1 thing I hate',1)
--Verifies number Returns 1
DECLARE @pattern VARCHAR(255)
SELECT @pattern ='[a-zA-Z0-9]\d{2}[a-zA-Z0-9](-\d{3}){2}[A-Za-z0-9]'
SELECT dbo.RegExIsMatch (@pattern, '1298-673-4192',1),
dbo.RegExIsMatch (@pattern,'A08Z-931-468A',1),
dbo.RegExIsMatch (@pattern,'[A90-123-129X',1),
dbo.RegExIsMatch (@pattern,'12345-KKA-1230',1),
dbo.RegExIsMatch (@pattern,'0919-2893-1256',1)
Slightly modified version of Julio’s answer.
-- MS SQL using VBScript Regex
-- select dbo.RegexReplace('aa bb cc','($1) ($2) ($3)','([^\s]*)\s*([^\s]*)\s*([^\s]*)')
-- $$ dollar sign, $1 - $9 back references, $& whole match
CREATE FUNCTION [dbo].[RegexReplace]
( -- these match exactly the parameters of RegExp
@searchstring varchar(4000),
@replacestring varchar(4000),
@pattern varchar(4000)
)
RETURNS varchar(4000)
AS
BEGIN
declare @objRegexExp int,
@objErrorObj int,
@strErrorMessage varchar(255),
@res int,
@result varchar(4000)
if( @searchstring is null or len(ltrim(rtrim(@searchstring))) = 0) return null
set @result=''
exec @res=sp_OACreate 'VBScript.RegExp', @objRegexExp out
if( @res <> 0) return '..VBScript did not initialize'
exec @res=sp_OASetProperty @objRegexExp, 'Pattern', @pattern
if( @res <> 0) return '..Pattern property set failed'
exec @res=sp_OASetProperty @objRegexExp, 'IgnoreCase', 0
if( @res <> 0) return '..IgnoreCase option failed'
exec @res=sp_OAMethod @objRegexExp, 'Replace', @result OUT,
@searchstring, @replacestring
if( @res <> 0) return '..Bad search string'
exec @res=sp_OADestroy @objRegexExp
return @result
END
You’ll need Ole Automation Procedures turned on in SQL:
exec sp_configure 'show advanced options',1;
go
reconfigure;
go
sp_configure 'Ole Automation Procedures', 1;
go
reconfigure;
go
sp_configure 'show advanced options',0;
go
reconfigure;
go
I want to match a regular expression on a whole word.
In the following example I am trying to match s or season but what I have matches s, e, a, o and n.
How do I make a regular expression to match a whole word?
Square brackets are meant for character classes, and you’re actually trying to match any one of: s, |, s (again), e, a, s (again), o and n.
Use parentheses instead for grouping:
or non-capturing group:
Note: A non-capturing group tells the engine that it doesn’t need to store the match, while a capturing group does store it. For small patterns either works; for heavy-duty work, first decide whether you need the captured text. If you don’t, prefer the non-capturing group so the engine does not spend memory storing something you will never use.
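To make the difference concrete, here is a small Python sketch (the function names are made up for illustration). It assumes the bracketed pattern from the question was something like [s|season] and compares it with a non-capturing group, using fullmatch so that the whole word has to match.

```python
import re

def matches_class(word):
    # A character class matches exactly ONE character from the set s | e a o n.
    return re.fullmatch(r'[s|season]', word) is not None

def matches_group(word):
    # A group with alternation matches one of the whole alternatives.
    return re.fullmatch(r'(?:s|season)', word) is not None

if __name__ == "__main__":
    for word in ['s', 'season', 'e', '|']:
        print(word, matches_class(word), matches_group(word))
```

The class accepts the single characters e and | but rejects season, while the group accepts only s and season.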
Above screenshot taken from this live example: https://regex101.com/r/cU5lC2/1
I’ll be using the phpsh interactive shell on Ubuntu 12.10 to demonstrate the PCRE regex engine through the method known as preg_match
Start phpsh, put some content into a variable, match on word.
el@apollo:~/foo$ phpsh
php> $content1 = 'badger'
php> $content2 = '1234'
php> $content3 = '$%^&'
php> echo preg_match('(\w+)', $content1);
1
php> echo preg_match('(\w+)', $content2);
1
php> echo preg_match('(\w+)', $content3);
0
The preg_match method used the PCRE engine within the PHP language to analyze the variables $content1, $content2 and $content3 with the (\w+) pattern.
$content1 and $content2 contain at least one word, $content3 does not.
el@apollo:~/foo$ phpsh
php> $gun1 = 'dart gun';
php> $gun2 = 'fart gun';
php> $gun3 = 'darty gun';
php> $gun4 = 'unicorn gun';
php> echo preg_match('(dart|fart)', $gun1);
1
php> echo preg_match('(dart|fart)', $gun2);
1
php> echo preg_match('(dart|fart)', $gun3);
1
php> echo preg_match('(dart|fart)', $gun4);
0
Variables gun1 and gun2 contain the string dart or fart, which is correct, but gun3 contains darty and still matches; that is the problem. So onto the next example.
Word boundaries can be force-matched with \b, see:
Regex Visual Image acquired from http://jex.im/regulex and https://github.com/JexCheng/regulex Example:
el@apollo:~/foo$ phpsh
php> $gun1 = 'dart gun';
php> $gun2 = 'fart gun';
php> $gun3 = 'darty gun';
php> $gun4 = 'unicorn gun';
php> echo preg_match('(\bdart\b|\bfart\b)', $gun1);
1
php> echo preg_match('(\bdart\b|\bfart\b)', $gun2);
1
php> echo preg_match('(\bdart\b|\bfart\b)', $gun3);
0
php> echo preg_match('(\bdart\b|\bfart\b)', $gun4);
0
The \b asserts that we have a word boundary, making sure " dart " is matched, but " darty " isn’t.
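The same check can be sketched in Python, whose re module also supports \b. The helper name has_whole_word is hypothetical, and the pattern factors the two \b anchors out of the alternation, which should be equivalent to the PHP pattern above.

```python
import re

def has_whole_word(text):
    # \b on both sides means dart/fart must not be glued to other word characters.
    return re.search(r'\b(?:dart|fart)\b', text) is not None

if __name__ == "__main__":
    for gun in ['dart gun', 'fart gun', 'darty gun', 'unicorn gun']:
        print(gun, has_whole_word(gun))
```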
I tested these examples in JavaScript. The simplest solution is to put the word you need between / /:
var reg = /cat/;
reg.test('some cat here');//1 test
true // result
reg.test('acatb');//2 test
true // result
Now, if you need this specific word with boundaries, not inside any other letters or signs, use the \b marker:
var reg = /\bcat\b/
reg.test('acatb');//1 test
false // result
reg.test('have cat here');//2 test
true // result
JavaScript also has the exec() method, which returns a result object. It helps, for example, to get the index of the word.
If we need to get all matched words in a string/sentence/text, we can use the g modifier (global match):
Now the last one: I need not one specific word, but any of several. We use the | sign, which means choice/or.
I am trying to use grep to match lines that contain two different strings. I have tried the following, but this matches lines that contain either string1 or string2, which is not what I want.
So how do I match with grep only the lines that contain both strings?
You can use grep 'string1' filename | grep 'string2'
Or, grep 'string1.*string2\|string2.*string1' filename
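For readers who want to try the logic outside the shell, here is a rough Python sketch of both ideas (the function names are hypothetical). The first mirrors the either-order alternation; the second uses two lookaheads, which I believe behaves like the piped greps because each string is checked independently.

```python
import re

def lines_with_both(lines, first, second):
    # Alternation: first ... second, or second ... first, on the same line.
    a, b = re.escape(first), re.escape(second)
    either_order = re.compile(f'{a}.*{b}|{b}.*{a}')
    return [line for line in lines if either_order.search(line)]

def lines_with_both_lookahead(lines, first, second):
    # Two independent lookaheads from the start of the line.
    pattern = re.compile(f'^(?=.*{re.escape(first)})(?=.*{re.escape(second)})')
    return [line for line in lines if pattern.search(line)]

if __name__ == "__main__":
    sample = ['string1 then string2', 'string2 then string1',
              'only string1', 'neither']
    print(lines_with_both(sample, 'string1', 'string2'))
    print(lines_with_both_lookahead(sample, 'string1', 'string2'))
```

One difference worth knowing: when the two strings overlap (say ab and bc in abc), the alternation does not match, but the lookahead version does.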
I think this is what you were looking for:
I think that answers like this:
only match the case where both are present, not one or the other or both.
To search for files containing all the words in any order anywhere:
The first grep kicks off a recursive search (r), ignoring case (i) and listing (printing out) the names of the files that match (l) for one term ('action' with the single quotes) occurring anywhere in the file.
The subsequent greps search for the other terms, retaining case insensitivity and listing out the matching files.
The final list of files that you get will be the ones that contain these terms, in any order, anywhere in the file.
I need a regex able to match everything but a string starting with a specific pattern (specifically index.php and what follows, like index.php?id=2342343)
Not a regexp expert, but I think you could use a negative lookahead from the start, e.g. ^(?!foo).*$ shouldn’t match anything starting with foo.
Regex: match everything but:
a string starting with a specific pattern (e.g. any string, including the empty one, not starting with foo):
Lookahead-based solution for NFAs:
Negated character class based solution for regex engines not supporting lookarounds:
a string ending with a specific pattern (say, no world. at the end):
Lookbehind-based solution:
POSIX workaround:
a string containing specific text (say, not matching a string that has foo in it) (no POSIX-compliant pattern, sorry):
a string containing a specific character (say, avoid matching a string that has a | symbol):
a string equal to some string (say, not equal to foo):
Lookaround-based:
POSIX:
a sequence of characters (say, any text but cat):
PCRE: /cat(*SKIP)(*FAIL)|[^c]*(?:c(?!at)[^c]*)*/i or /cat(*SKIP)(*FAIL)|(?:(?!cat).)+/is
Other engines with lookarounds: (cat)|[^c]*(?:c(?!at)[^c]*)* (or (?s)(cat)|(?:(?!cat).)*, or (cat)|[^c]+(?:c(?!at)[^c]*)*|(?:c(?!at)[^c]*)+[^c]*) and then check with language means: if Group 1 matched, it is not what we need, else, grab the match value if not empty
a certain single character or a set of characters:
[^a-z]+ (any char other than a lowercase ASCII letter)
any char(s) but |: [^|]+
Demo note: the newline \n is used inside negated character classes in demos to avoid match overflow to the neighboring line(s). They are not necessary when testing individual strings.
Anchor note: In many languages, use \A to define the unambiguous start of string, and \z (in Python, it is \Z, in JavaScript, $ is OK) to define the very end of the string.
Dot note: In many flavors (but not POSIX, TRE, TCL), . matches any char but a newline char. Make sure you use a corresponding DOTALL modifier (/s in PCRE/Boost/.NET/Python/Java and /m in Ruby) for the . to match any char including a newline.
Backslash note: In languages where you have to declare patterns with C strings allowing escape sequences (like \n for a newline), you need to double the backslashes escaping special characters so that the engine could treat them as literal characters (e.g. in Java, world\. will be declared as "world\\.", or use a character class: "world[.]"). Use raw string literals (Python r'\bworld\b'), C# verbatim string literals @"world\.", or slashy strings/regex literal notations like /world\./.
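The cases listed in this answer can be tried out with a short Python sketch. The patterns below are my own lookaround-based versions written for Python's re module (they are not copied from the answer), the names PATTERNS and accepts are hypothetical, and fullmatch stands in for the start and end anchors.

```python
import re

# Each pattern accepts every string EXCEPT the kind named in its key.
PATTERNS = {
    "starts_with_foo": re.compile(r'(?!foo).*', re.DOTALL),
    "ends_with_world_dot": re.compile(r'.*(?<!world\.)', re.DOTALL),
    "contains_foo": re.compile(r'(?:(?!foo).)*', re.DOTALL),
    "contains_pipe": re.compile(r'[^|]*'),
    "equals_foo": re.compile(r'(?!foo\Z).*', re.DOTALL),
}

def accepts(excluded, text):
    # True when text is NOT of the excluded kind.
    return PATTERNS[excluded].fullmatch(text) is not None

if __name__ == "__main__":
    print(accepts("starts_with_foo", "foobar"))
    print(accepts("ends_with_world_dot", "hello world"))
    print(accepts("contains_foo", "a foo b"))
    print(accepts("contains_pipe", "a b"))
    print(accepts("equals_foo", "foo"))
```

The raw strings (r'...') are what the backslash note recommends for Python, and re.DOTALL is the modifier from the dot note.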
You can put a ^ in the beginning of a character set to match anything but those characters.
will match everything but =
I need a regex to match if anywhere in a sentence there is NOT either < or >.
If either < or > are in the string then it must return false.
I had a partial success with this but only if my < > are at the beginning or end:
I am using .Net if that makes a difference.
Thanks for the help.
The caret in the character class ([^) means match anything but, so this means: beginning of string, then one or more of anything except < and >, then the end of the string.
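A quick way to check that description is a Python sketch like the following. The question is about .NET, so treat this as an illustration of the pattern rather than of the .NET API; the function name is made up, and I am assuming the pattern being described is ^[^<>]+$.

```python
import re

def has_no_angle_brackets(text):
    # fullmatch plays the role of ^ and $; [^<>]+ is one or more chars other than < and >.
    return re.fullmatch(r'[^<>]+', text) is not None

if __name__ == "__main__":
    for sample in ['hello world', 'a < b', '<tag>', '']:
        print(repr(sample), has_no_angle_brackets(sample))
```

Note that the empty string is rejected because of the +; use * instead if an empty input should count as valid.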
Here you go:
This will test for a string that has no < and no >.
If you want to test for a string that may have < and >, but must also have something else, you should use just
Where [<>] means any of < or > and [^<>] means any character that is neither < nor >.
And of course the mandatory link.
The Java API for regular expressions states that \s will match whitespace. So the regex \\s\\s should match two spaces.
Pattern whitespace = Pattern.compile("\\s\\s");
matcher = whitespace.matcher(modLine);
while (matcher.find()) matcher.replaceAll(" ");
The aim of this is to replace all instances of two consecutive whitespace with a single space. However this does not actually work.
Am I having a grave misunderstanding of regexes or the term “whitespace”?
Yeah, you need to grab the result of matcher.replaceAll():
You can’t use \s in Java to match white space on its own native character set, because Java doesn’t support the Unicode white space property, even though doing so is strictly required to meet UTS#18’s RL1.2! What it does have is not standards-conforming, alas.
Unicode defines 26 code points as \p{White_Space}: 20 of them are various sorts of \pZ (GeneralCategory=Separator), and the remaining 6 are \p{Cc} (GeneralCategory=Control).
White space is a pretty stable property, and those same ones have been around virtually forever. Even so, Java has no property that conforms to The Unicode Standard for these, so you instead have to use code like this:
String whitespace_chars = "" /* dummy empty string for homogeneity */
+ "\\u0009" // CHARACTER TABULATION
+ "\\u000A" // LINE FEED (LF)
+ "\\u000B" // LINE TABULATION
+ "\\u000C" // FORM FEED (FF)
+ "\\u000D" // CARRIAGE RETURN (CR)
+ "\\u0020" // SPACE
+ "\\u0085" // NEXT LINE (NEL)
+ "\\u00A0" // NO-BREAK SPACE
+ "\\u1680" // OGHAM SPACE MARK
+ "\\u180E" // MONGOLIAN VOWEL SEPARATOR
+ "\\u2000" // EN QUAD
+ "\\u2001" // EM QUAD
+ "\\u2002" // EN SPACE
+ "\\u2003" // EM SPACE
+ "\\u2004" // THREE-PER-EM SPACE
+ "\\u2005" // FOUR-PER-EM SPACE
+ "\\u2006" // SIX-PER-EM SPACE
+ "\\u2007" // FIGURE SPACE
+ "\\u2008" // PUNCTUATION SPACE
+ "\\u2009" // THIN SPACE
+ "\\u200A" // HAIR SPACE
+ "\\u2028" // LINE SEPARATOR
+ "\\u2029" // PARAGRAPH SEPARATOR
+ "\\u202F" // NARROW NO-BREAK SPACE
+ "\\u205F" // MEDIUM MATHEMATICAL SPACE
+ "\\u3000" // IDEOGRAPHIC SPACE
;
/* A \s that actually works for Java’s native character set: Unicode */
String whitespace_charclass = "[" + whitespace_chars + "]";
/* A \S that actually works for Java’s native character set: Unicode */
String not_whitespace_charclass = "[^" + whitespace_chars + "]";
Now you can use whitespace_charclass + "+" as the pattern in your replaceAll.
Sorry ’bout all that. Java’s regexes just don’t work very well on its own native character set, and so you really have to jump through exotic hoops to make them work.
And if you think white space is bad, you should see what you have to do to get \w and \b to finally behave properly!
Yes, it’s possible, and yes, it’s a mindnumbing mess. That’s being charitable, even. The easiest way to get a standards-conforming regex library for Java is to JNI over to ICU’s stuff. That’s what Google does for Android, because OraSun’s doesn’t measure up.
If you don’t want to do that but still want to stick with Java, I have a front-end regex rewriting library I wrote that “fixes” Java’s patterns, at least to get them to conform to the requirements of RL1.2a in UTS#18, Unicode Regular Expressions.
For Java (not php, not javascript, not any other):
I have some images named with generated uuid1 string. For example 81397018-b84a-11e0-9d2a-001b77dc0bed.jpg. I want to find out all these images using “find” command:
But it doesn’t work. Something wrong with the regex? Could someone help me with this?
Note that you need to specify .*/ in the beginning because find matches the whole path.
Example:
susam@nifty:~/so$ find . -name "*.jpg"
./foo-111.jpg
./test/81397018-b84a-11e0-9d2a-001b77dc0bed.jpg
./81397018-b84a-11e0-9d2a-001b77dc0bed.jpg
susam@nifty:~/so$
susam@nifty:~/so$ find . -regextype sed -regex ".*/[a-f0-9\-]\{36\}\.jpg"
./test/81397018-b84a-11e0-9d2a-001b77dc0bed.jpg
./81397018-b84a-11e0-9d2a-001b77dc0bed.jpg
My version of find:
$ find --version
find (GNU findutils) 4.4.2
Copyright (C) 2007 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Eric B. Decker, James Youngman, and Kevin Dalley.
Built using GNU gnulib version e5573b1bad88bfabcda181b9e0125fb0c52b7d3b
Features enabled: D_TYPE O_NOFOLLOW(enabled) LEAF_OPTIMISATION FTS() CBO(level=0)
susam@nifty:~/so$
susam@nifty:~/so$ find . -regextype foo -regex ".*/[a-f0-9\-]\{36\}\.jpg"
find: Unknown regular expression type `foo'; valid types are `findutils-default', `awk', `egrep', `ed', `emacs', `gnu-awk', `grep', `posix-awk', `posix-basic', `posix-egrep', `posix-extended', `posix-minimal-basic', `sed'.
The -regex find expression matches the whole name, including the relative path from the current directory. For find . this always starts with ./, then any directories.
Also, these are emacs regular expressions, which have other escaping rules than the usual egrep regular expressions.
If these are all directly in the current directory, then
should work. (I’m not really sure - I can’t get the counted repetition to work here.) You can switch to egrep expressions by -regextype posix-egrep:
(Note that everything said here is for GNU find, I don’t know anything about the BSD one which is also the default on Mac.)
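The whole-path behaviour is easy to reproduce outside find. The Python sketch below is only a model of it: regex_find is a hypothetical helper that applies re.fullmatch to each path, which is how I understand -regex to behave, and the paths are the ones from the example listing in the earlier answer.

```python
import re

PATHS = [
    './foo-111.jpg',
    './test/81397018-b84a-11e0-9d2a-001b77dc0bed.jpg',
    './81397018-b84a-11e0-9d2a-001b77dc0bed.jpg',
]

def regex_find(paths, pattern):
    # The pattern has to match the WHOLE path, not just the file name.
    return [p for p in paths if re.fullmatch(pattern, p)]

if __name__ == "__main__":
    # Without the leading .*/ nothing matches, because every path starts with ./
    print(regex_find(PATHS, r'[a-f0-9-]{36}\.jpg'))
    # With .*/ in front, both UUID-named files are found.
    print(regex_find(PATHS, r'.*/[a-f0-9-]{36}\.jpg'))
```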
Judging from other answers, it seems this might be find’s fault.
However you can do it this way instead:
find . | grep -P "[a-f0-9\-]{36}\.jpg"
You might have to tweak the grep a bit and use different options depending on what you want but it works.
Possible Duplicate:
How can I check if a word is contained in another string using PHP?
What would be the most efficient way to check whether a string contains a “.” or not?
I know you can do this in many different ways like with regular expressions or loop through the string to see if it contains a dot (“.”).
Note that you need to compare with the !== operator. If you use != or <> and the '.' is found at position 0, hey! 0 compares equal to FALSE and you lose. This will cause you to point a production website at a development database over the weekend, causing no end of joy when you return Monday.
You can use these string functions,
strstr — Find the first occurrence of a string
stristr — Case-insensitive strstr()
strrchr — Find the last occurrence of a character in a string
strpos — Find the position of the first occurrence of a substring in a string
strpbrk — Search a string for any of a set of characters
If that doesn’t help, then you should use a preg regular expression:
preg_match — Perform a regular expression match
You can use stristr() or strpos(). Both return false if nothing is found.
I’m looking for a regular expression to match every new line character (\n) inside an XML tag which is <content>, or inside any tag which is inside that <content> tag, for example:
Actually… you can’t use a simple regex here, at least not one. You probably need to worry about comments! Someone may write:
You can take two approaches here:
Be careful.
I am also not so sure you can match all new lines at once. @Quartz suggested this one:
This will match any content tags that have a newline character RIGHT BEFORE the closing tag… but I’m not sure what you mean by matching all newlines. Do you want to be able to access all the matched newline characters? If so, your best bet is to grab all content tags, and then search for all the newline chars that are nested in between. Something more like this:
BUT THERE IS ONE CAVEAT: regexes are greedy, so this regex will match the first opening tag to the last closing one. Instead, you HAVE to suppress the regex so it is not greedy. In languages like python, you can do this with the “?” regex symbol.
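Here is a small Python sketch of that caveat. The sample text and function names are invented for the illustration; the only difference between the two patterns is the ? after .* which makes the quantifier non-greedy.

```python
import re

SAMPLE = "<content>a\nb</content> junk <content>c</content>"

def greedy_contents(xml_text):
    # .* is greedy: it runs from the first opening tag to the LAST closing tag.
    return re.findall(r'<content>(.*)</content>', xml_text, re.DOTALL)

def lazy_contents(xml_text):
    # .*? is non-greedy: it stops at the first closing tag it can.
    return re.findall(r'<content>(.*?)</content>', xml_text, re.DOTALL)

if __name__ == "__main__":
    print(greedy_contents(SAMPLE))
    print(lazy_contents(SAMPLE))
```

The greedy version returns a single match that swallows the text between the two elements, while the non-greedy one returns each element's content separately.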
I hope with this you can see some of the pitfalls and figure out how you want to proceed. You are probably better off using an XML parsing library, then iterating over all the content tags.
I know I may not be offering the best solution, but at least I hope you will see the difficulty in this and why other answers may not be right…
UPDATE 1:
Let me summarize a bit more and add some more detail to my response. I am going to use python’s regex syntax because it is what I am more used to (forgive me ahead of time… you may need to escape some characters… comment on my post and I will correct it):
To strip out comments, use this regex:
Notice the “?” suppresses the .* to make it non-greedy. Similarly, to search for content tags, use: <content>.*?</content>
Also, You may be able to try this out, and access each newline character with the match objects groups():
I know my escaping is off, but it captures the idea. This last example probably won’t work, but I think it’s your best bet at expressing what you want. My suggestion remains: either grab all the content tags and do it yourself, or use a parsing library.
UPDATE 2:
So here is python code that ought to work. I am still unsure what you mean by “find” all newlines. Do you want the entire lines? Or just to count how many newlines. To get the actual lines, try:
#!/usr/bin/python
import re

def FindContentNewlines(xml_text):
    # May want to compile these regexes elsewhere, but I do it here for brevity
    comments = re.compile(r"<!--.*?-->", re.DOTALL)
    content = re.compile(r"<content>(.*?)</content>", re.DOTALL)
    newlines = re.compile(r"^(.*?)$", re.MULTILINE|re.DOTALL)
    # strip comments: this actually may not be reliable for "nested comments"
    # How does xml handle <!-- <!-- --> -->. I am not sure. But that COULD
    # be trouble.
    xml_text = re.sub(comments, "", xml_text)
    result = []
    all_contents = re.findall(content, xml_text)
    for c in all_contents:
        result.extend(re.findall(newlines, c))
    return result

if __name__ == "__main__":
    example = """
<!-- This stuff
ought to be omitted
<content>
omitted
</content>
-->
This stuff is good
<content>
<p>
haha!
</p>
</content>
This is not found
"""
    # Parenthesized print works on both Python 2 and Python 3
    print(FindContentNewlines(example))
This program prints the result:
The first and last empty strings come from the newline chars immediately preceding the first <p>
and the one coming right after the </p>
. All in all this (for the most part) does the trick. Experiment with this code and refine it for your needs. Print out stuff in the middle so you can see what the regexes are matching and not matching.
Hope this helps :-).
PS - I didn’t have much luck trying out my regex from my first update to capture all the newlines… let me know if you do.
I need to find a regex that only allows alphanumeric characters. So far, every one I try only works if the string is alphanumeric, meaning it contains both a letter and a number. I just want one that would allow either and not require both.
If you wanted to return a replaced result, then this would work:
This would return:
Note that the gi is necessary because it means global (not just on the first match), and case-insensitive, which is why I have a-z instead of a-zA-Z. And the ^ inside the brackets means “anything not in these brackets”.
WARNING: Alphanumeric is great if that’s exactly what you want. But if you’re using this in an international market on like a person’s name or geographical area, then you need to account for unicode characters, which this won’t do. For instance, if you have a name like “Âlvarö”, it would make it “lvar”.
Use the word character class. The following is equivalent to a ^[a-zA-Z0-9_]+$
:
Explanation:
Use /[^\w]|_/g
if you don’t want to match the underscore.
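As a small sketch of both ideas above (validating and stripping), here is how they might look in Python. The function names are made up for the example. The character class only covers ASCII, so the Unicode warning above still applies.

```python
import re

# Letters OR digits, in any mix; at least one character is required.
ALNUM = re.compile(r"[A-Za-z0-9]+")


def is_alphanumeric(s):
    """True if s consists only of ASCII letters and/or digits."""
    return ALNUM.fullmatch(s) is not None


def strip_non_alphanumeric(s):
    """Remove every character that is not an ASCII letter or digit."""
    return re.sub(r"[^A-Za-z0-9]", "", s)


for candidate in ("abc", "123", "abc123", "abc_123", ""):
    print(repr(candidate), is_alphanumeric(candidate))
```

fullmatch() is used instead of adding ^ and $ by hand, so the whole string has to match.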
I found this code in some website, and it works perfectly. It validates that the phone number is in one of these formats:
(123) 456-7890 or 123-456-7890
The problem is that my client (I don’t know why, maybe client stuffs) wants to add another format, the ten numbers consecutively, something like this: 1234567890.
I’m using this regular expression,
How can I add that it also validates the another format? I’m not good with regular expressions.
First off, your format validator is obviously only appropriate for NANP (country code +1) numbers. Will your application be used by someone with a phone number from outside North America? If so, you don’t want to prevent those people from entering a perfectly valid [international] number.
Secondly, your validation is incorrect. NANP numbers take the form NXX NXX XXXX
where N
is a digit 2-9 and X
is a digit 0-9. Additionally, area codes and exchanges may not take the form N11
(end with two ones) to avoid confusion with special services except numbers in a non-geographic area code (800, 888, 877, 866, 855, 900) may have a N11
exchange.
So, your regex will pass the number (123) 123 4566 even though that is not a valid phone number. You can fix that by replacing \d{3}
with [2-9]{1}\d{2}
.
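A rough sketch of that stricter check in Python, covering only the three formats from the question. The pattern and function name are assumptions for illustration. The N11 rule described above is not enforced, and a dedicated phone number library remains the safer choice.

```python
import re

# N = 2-9, X = 0-9. Each alternative is one of the formats from the question.
NANP = re.compile(
    r"\([2-9]\d{2}\) [2-9]\d{2}-\d{4}"  # (NXX) NXX-XXXX
    r"|[2-9]\d{2}-[2-9]\d{2}-\d{4}"     # NXX-NXX-XXXX
    r"|[2-9]\d{2}[2-9]\d{6}"            # NXXNXXXXXX
)


def looks_like_nanp(number):
    """True if number matches one of the three formats with valid leading digits."""
    return NANP.fullmatch(number) is not None


for n in ("(234) 567-8901", "234-567-8901", "2345678901", "(123) 123-4566"):
    print(n, looks_like_nanp(n))
```

Note that the sample numbers from the question (starting with 123) are rejected by this pattern, because an area code may not start with 1.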
Finally, I get the feeling you’re validating user input in a web browser. Remember that client-side validation is only a convenience you provide to the user; you still need to validate all input (again) on the server.
TL;DR don’t use a regular expression to validate complex real-world data like phone numbers or URLs. Use a specialized library.
My regex of choice is:
Valid formats:
(123) 456-7890
(123)456-7890
123-456-7890
123.456.7890
1234567890
+31636363634
075-63546725
If you are looking for 10 and only 10 digits, ignore everything but the digits-
This code is always alerting out "null"
, which means that the string does not match the expression.
If you define your regular expression as a string then all backslashes need to be escaped, so instead of '\w' you should have '\\w'.
Alternatively, define it as a regular expression:
BTW, please don’t validate email addresses on the client-side. Your regular expression is way too simple to pass for a solid implementation anyway.
See the real thing here: http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html
this is the one i am using on my page.
http://www.zparacha.com/validate-email-address-using-javascript-regular-expression/
/^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}$/
I’ve been using this function for a while. it returns a boolean value.
// Validates email address of course.
function validEmail(e) {
var filter = /^\s*[\w\-\+_]+(\.[\w\-\+_]+)*\@[\w\-\+_]+\.[\w\-\+_]+(\.[\w\-\+_]+)*\s*$/;
return String(e).search (filter) != -1;
}
How do I remove all non alphanumeric characters from a string except dash and space characters?
Replace [^a-zA-Z0-9 -]
with an empty string.
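For illustration, the same replacement written in Python (the function name is made up for the example):

```python
import re


def keep_alnum_dash_space(s):
    """Drop everything except ASCII letters, digits, spaces and dashes."""
    # The dash is placed last in the class so it is taken literally.
    return re.sub(r"[^a-zA-Z0-9 -]", "", s)


print(keep_alnum_dash_space("Hello, World! x-1_2"))
```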
I could have used RegEx; they can provide an elegant solution, but they can cause performance issues. Here is one solution
char[] arr = str.ToCharArray();
arr = Array.FindAll<char>(arr, (c => (char.IsLetterOrDigit(c)
|| char.IsWhiteSpace(c)
|| c == '-')));
str = new string(arr);
When using the compact framework (which doesn’t have FindAll), replace FindAll with:
char[] arr = str.Where(c => (char.IsLetterOrDigit(c) ||
char.IsWhiteSpace(c) ||
c == '-')).ToArray();
str = new string(arr);
You can try:
Where s
is your string.
How do I create a regular expression to match a word at the beginning of a string. We are looking to match stop
at the beginning of a string and anything can follow it.
For example the expression should match:
Thanks.
If you wish to match only lines beginning with stop use
If you wish to match lines beginning with the word stop followed by a space
Or, if you wish to match lines beginning with the word stop but followed by either a space or any other non word character you can use (your regex flavor permitting)
On the other hand, what follows matches a word at the beginning of a string on most regex flavors (in these flavors \w matches the opposite of \W)
If your flavor does not have the \w shortcut, you can use
Be wary that this second idiom will only match letters and numbers, no symbol whatsoever.
Check your regex flavor manual to know what shortcuts are allowed and what exactly do they match (and how do they deal with Unicode.)
Try this:
Explanation:
If you would like to enforce that stop be followed by a whitespace, you could modify the RegEx like so:
Note: Also keep in mind that the RegEx above requires that the stop word be followed by a space! So it wouldn’t match a line that only contains: stop
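A small Python sketch of the "word at the beginning of the string" case. It uses a word boundary (\b) after stop, which is an assumption on my part and not necessarily the pattern from the answers above. It accepts a line containing only stop but rejects stopped.

```python
import re

# re.match() only matches at the start of the string, so no ^ is needed.
STARTS_WITH_STOP = re.compile(r"stop\b")


def starts_with_stop(s):
    """True if s begins with the whole word 'stop'."""
    return STARTS_WITH_STOP.match(s) is not None


for line in ("stop", "stop and go", "stop, please", "stopped", "don't stop"):
    print(repr(line), starts_with_stop(line))
```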
I was wondering if I could get a regular expression which will match a string that only has alphabetic characters, and that alone
You may use any of these 2 variants:
to match an input string of ASCII alphabets.
[A-Za-z]
will match all the alphabets (both lowercase and uppercase).
^
and $
will make sure that nothing but these alphabets will be matched.
Code:
Output:
Test case is for OP’s comment that he wants to match only if there are 1 or more alphabets present in input. As you can see in test case that match failed because there was ^
in the input string abcAbc^Xyz
.
Note: Please note that above answer only matches ASCII alphabets and doesn’t match Unicode characters. If you want to match unicode letters then use:
If you need to include non-ASCII alphabetic characters, and if your regex flavor supports Unicode, then
would be the correct regex.
Some regex engines don’t support this Unicode syntax but allow the \w
alphanumeric shorthand to also match non-ASCII characters. In that case, you can get all alphabetics by subtracting digits and underscores from \w
like this:
\A
matches at the start of the string, \z
at the end of the string (^
and $
also match at the start/end of lines in some languages like Ruby, or if certain regex options are set).
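The "subtract digits and underscores from \w" idea can be sketched in Python 3, where \w is Unicode-aware for str patterns. The names below are made up for the example. Note that this is only approximately "letters": \w also covers a few numeric characters that \d does not.

```python
import re

# [^\W\d_] = a word character that is neither a digit nor an underscore.
LETTERS_ONLY = re.compile(r"[^\W\d_]+")


def is_alphabetic(s):
    """True if s is non-empty and made up of letter-like characters only."""
    # fullmatch() plays the role of \A ... \z here.
    return LETTERS_ONLY.fullmatch(s) is not None


for candidate in ("abcXyz", "\u00c2lvar\u00f6", "abc1", "abc_def", "abcAbc^Xyz"):
    print(repr(candidate), is_alphabetic(candidate))
```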
This will match one or more alphabetical characters:
You can make it case insensitive using:
or:
Does anyone have a regular expression handy that will match any legal DNS hostname or IP address?
It’s easy to write one that works 95% of the time, but I’m hoping to get something that’s well tested to exactly match the latest RFC specs for DNS hostnames.
You can use the following regular expressions separately or by combining them in a joint OR expression.
ValidIpAddressRegex = "^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$";
ValidHostnameRegex = "^(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])$";
ValidIpAddressRegex matches valid IP addresses and ValidHostnameRegex valid host names. Depending on the language you use, \ could have to be escaped with \\.
ValidHostnameRegex is valid as per RFC 1123. Originally, RFC 952 specified that hostname segments could not start with a digit.
http://en.wikipedia.org/wiki/Hostname
The original specification of hostnames in RFC 952, mandated that labels could not start with a digit or with a hyphen, and must not end with a hyphen. However, a subsequent specification (RFC 1123) permitted hostname labels to start with digits.
The hostname regex of smink does not observe the limitation on the length of individual labels within a hostname. Each label within a valid hostname may be no more than 63 octets long.
ValidHostnameRegex="^([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])\
(\.([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]{0,61}[a-zA-Z0-9]))*$"
Note that the backslash at the end of the first line (above) is Unix shell syntax for splitting the long line. It’s not a part of the regular expression itself.
Here’s just the regular expression alone on a single line:
^([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])(\.([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]{0,61}[a-zA-Z0-9]))*$
You should also check separately that the total length of the hostname must not exceed 255 characters. For more information, please consult RFC-952 and RFC-1123.
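Putting the label regex and the separate length check together, a minimal Python sketch might look like this. The function and variable names are assumptions. The 255 limit is taken from the text above.

```python
import re

# One label: 1 to 63 characters, no leading or trailing hyphen.
LABEL = r"(?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]{0,61}[a-zA-Z0-9])"
# One or more labels separated by dots.
HOSTNAME = re.compile(LABEL + r"(?:\." + LABEL + r")*")


def is_valid_hostname(name):
    """Check the per-label syntax and the total length limit."""
    return len(name) <= 255 and HOSTNAME.fullmatch(name) is not None


for host in ("example.com", "bad-.example", "-bad.example", "a" * 64 + ".com"):
    print(host[:20], is_valid_hostname(host))
```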
To match a valid IP address use the following regex:
instead of:
Many regex engines match the first possibility in the OR
sequence. For instance, try the following regex:
Test the difference between good vs bad
Is it possible to define a regex which will match every character except a certain defined character or set of characters?
Basically, I wanted to split a string by either comma (,) or semi-colon (;). So I was thinking of doing it with a regex which would match everything until it encountered a comma or a semi-colon.
You haven’t specified the regex implementation you are using. Most of them have a Split
method that takes delimiters and split by them. You might want to use that one with a “normal” (without ^
) character class:
Use character classes. A character class beginning with caret will match anything not in the class.
use a negative character class:
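Both suggestions (split on a normal character class, or match runs of a negated one) can be sketched in Python. The sample string is made up for the example.

```python
import re

text = "alpha,beta;gamma"

# Split on either delimiter using a "normal" character class.
parts_by_split = re.split(r"[,;]", text)

# Or collect every run of characters that are NOT a delimiter.
parts_by_match = re.findall(r"[^,;]+", text)

print(parts_by_split)
print(parts_by_match)
```

The two differ on empty fields: split keeps them as empty strings, while the negated class skips them.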
I need to escape the regular expression special characters using JavaScript. How can I achieve this? Any help would be appreciated.
Thanks for your quick reply. But I need to escape all the special characters of a regular expression. I have tried this code, but I can’t achieve the result.
RegExp.escape=function(str)
{
if (!arguments.callee.sRE) {
var specials = [
'/', '.', '*', '+', '?', '|',
'(', ')', '[', ']', '{', '}', '\\'
];
arguments.callee.sRE = new RegExp(
'(\\' + specials.join('|\\') + ')', 'gim'
);
}
return str.replace(arguments.callee.sRE, '\\$1');
}
function regExpFind() {
<%--var regex = new RegExp("\\[munees\\]","gim");--%>
var regex= new RegExp(RegExp.escape("[Munees]waran"));
<%--var regex=RegExp.escape`enter code here`("[Munees]waran");--%>
alert("Reg : "+regex);
}
What is wrong with this code? Please guide me.
Use the \
character to escape a character that has special meaning inside a regular expression.
To automate it, you could use this:
Update: There is now a proposal to standardize this method, possibly in ES2016: https://github.com/benjamingr/RegExp.escape
Update: The abovementioned proposal was rejected, so keep implementing this yourself if you need it.
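For comparison only: some languages ship this helper in the standard library. The question is about JavaScript, so the following Python sketch using re.escape is just an illustration of what escaping buys you, with the literal taken from the question.

```python
import re

literal = "[Munees]waran"

# Unescaped, "[Munees]" is a character class that matches ONE character.
unescaped = re.compile(literal)
# Escaped, every special character is matched literally.
escaped = re.compile(re.escape(literal))

print(bool(unescaped.search("Mwaran")))           # character class matches "M"
print(bool(escaped.search("Mwaran")))             # literal brackets required
print(bool(escaped.search("x [Munees]waran y")))  # literal text found
```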
Use the backslash to escape a character. For example:
This will match of a numeric character
I have found very similar posts, but I can’t quite get my regular expression right here.
I am trying to write a regular expression which returns a string which is between two other strings. For example: I want to get the string which resides between the strings “cow” and “milk”.
My cow always gives milk
would return
“always gives”
Here is the expression I have pieced together so far:
However, this returns the string “cow always gives”.
A lookahead (that (?=
part) does not consume any input. It is a zero-width assertion (as are boundary checks and lookbehinds).
You want a regular match here, to consume the cow
portion. To capture the portion in between, you use a capturing group (just put the portion of pattern you want to capture inside parenthesis):
No lookaheads are needed at all.
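The capturing-group approach, sketched in Python. The helper name and the choice to allow any whitespace around the delimiters are assumptions for the example.

```python
import re


def between(text, left, right):
    """Return the text between left and right, or None if there is no match."""
    pattern = re.escape(left) + r"\s+(.*?)\s+" + re.escape(right)
    # DOTALL lets the lazy dot cross line breaks as well.
    m = re.search(pattern, text, re.DOTALL)
    return m.group(1) if m else None


print(between("My cow always gives milk", "cow", "milk"))
```

The delimiters are consumed by the match, and only group 1 is returned.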
Regular expression to get a string between two strings in JavaScript
The most complete solution that will work in the vast majority of cases is using a capturing group with a lazy dot matching pattern. However, a dot .
in JavaScript regex does not match line break characters, so, what will work in 100% cases is a [^]
or [\s\S]
/[\d\D]
/[\w\W]
constructs.
In JavaScript environments supporting ECMAScript 2018, s
modifier allows .
to match any char including line break chars, and the regex engine supports lookbehinds of variable length. So, you may use a regex like
var result = s.match(/(?<=cow\s+).*?(?=\s+milk)/gs); // Returns multiple matches if any
// Or
var result = s.match(/(?<=cow\s*).*?(?=\s*milk)/gs); // Same but whitespaces are optional
In both cases, the current position is checked for cow
with any 1/0 or more whitespaces after cow
, then any 0+ chars as few as possible are matched and consumed (=added to the match value), and then milk
is checked for (with any 1/0 or more whitespaces before this substring).
This and all other scenarios below are supported by all JavaScript environments. See usage examples at the bottom of the answer.
cow
is found first, then a space, then any 0+ chars other than line break chars, as few as possible as *?
is a lazy quantifier, are captured into Group 1 and then a space with milk
must follow (and those are matched and consumed, too).
Here, cow
and a space are matched first, then any 0+ chars as few as possible are matched and captured into Group 1, and then a space with milk
are matched.
If you have a string like >>>15 text>>>67 text2>>>
and you need to get 2 matches in-between >>>
+number
+whitespace
and >>>
, you can’t use />>>\d+\s(.*?)>>>/g
as this will only find 1 match due to the fact the >>>
before 67
is already consumed upon finding the first match. You may use a positive lookahead to check for the text presence without actually “gobbling” it (i.e. appending to the match):
See the online regex demo yielding text1
and text2
as Group 1 contents found.
Also see How to get all possible overlapping matches for a string.
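A short Python sketch of the lookahead variant for the >>> example. The pattern is my reconstruction from the description above (the consuming pattern with its trailing delimiter moved into a lookahead), not a copy of the original snippet.

```python
import re

s = ">>>15 text>>>67 text2>>>"

# Trailing delimiter is consumed, so the second item cannot start matching.
consuming = re.findall(r">>>\d+\s(.*?)>>>", s)

# Trailing delimiter is only asserted, so it is still available
# as the leading delimiter of the next match.
with_lookahead = re.findall(r">>>\d+\s(.*?)(?=>>>)", s)

print(consuming)
print(with_lookahead)
```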
Lazy dot matching pattern (.*?
) inside regex patterns may slow down script execution if very long input is given. In many cases, unroll-the-loop technique helps to a greater extent. Trying to grab all between cow
and milk
from "Their\ncow\ngives\nmore\nmilk"
, we see that we just need to match all lines that do not start with milk
, thus, instead of cow\n([\s\S]*?)\nmilk
we can use:
See the regex demo (if there can be \r\n
, use /cow\r?\n(.*(?:\r?\n(?!milk$).*)*)\r?\nmilk/gm
). With this small test string, the performance gain is negligible, but with very large text, you will feel the difference (especially if the lines are long and line breaks are not very numerous).
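A Python sketch of the unrolled pattern, obtained by taking the \r?\n variant quoted above and dropping the optional \r. Treat it as an illustration of the technique, not as a benchmark.

```python
import re

text = "Their\ncow\ngives\nmore\nmilk"

# Lazy version: the engine expands .*? one character at a time.
lazy = re.compile(r"cow\n([\s\S]*?)\nmilk")

# Unrolled version: consume whole lines, and only continue onto the next
# line if that line is not exactly "milk".
unrolled = re.compile(r"cow\n(.*(?:\n(?!milk$).*)*)\nmilk", re.MULTILINE)

print(repr(lazy.search(text).group(1)))
print(repr(unrolled.search(text).group(1)))
```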
Sample regex usage in JavaScript:
// Single/first match expected: use no global modifier and access match[1]
console.log("My cow always gives milk".match(/cow (.*?) milk/)[1]);
// Multiple matches: get multiple matches with a global modifier and
// trim the results if length of leading/trailing delimiters is known
var s = "My cow always gives milk, thier cow also gives milk";
console.log(s.match(/cow (.*?) milk/g).map(function(x) {return x.substr(4,x.length-9);}));
// or use RegExp#exec inside a loop to collect all the Group 1 contents
var result = [], m, rx = /cow (.*?) milk/g;
while ((m=rx.exec(s)) !== null) {
  result.push(m[1]);
}
console.log(result);
Here’s a regex which will grab what’s between cow and milk (without leading/trailing space):
srctext = "My cow always gives milk.";
var re = /(.*cow\s+)(.*)(\s+milk.*)/;
var newtext = srctext.replace(re, "$2");
An example: http://jsfiddle.net/entropo/tkP74/
I need to split a string base on delimiter -
and .
. Below are my desired output.
AA.BB-CC-DD.zip
->
but my following code does not work.
I think you need to include the regex OR operator:
What you have will match:
[DASH followed by DOT together] -.
not
[DASH or DOT any of them] -
or .
Try this regex "[-.]+"
. The + after treats consecutive delimiter chars as one. Remove plus if you do not want this.
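The effect of the character class and of the + can be seen in a quick Python sketch (the question does not say which language it uses, so this is only an illustration):

```python
import re

name = "AA.BB-CC-DD.zip"

# Either delimiter splits the string.
parts = re.split(r"[-.]+", name)
print(parts)

# With +, a run of delimiters counts as one; without it, empty strings appear.
print(re.split(r"[-.]+", "a-.b"))
print(re.split(r"[-.]", "a-.b"))
```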
You can use the regex "\W". This matches any non-word character. The required line would be:
I need a regular expression to select all the text between two outer brackets.
Example: some text(text here(possible text)text(possible text(more text)))end text
Result: (text here(possible text)text(possible text(more text)))
Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.
But there is a simple algorithm to do this, which I described in this answer to a previous question.
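The linked answer is not reproduced here, so the following is only one possible version of such an algorithm: a depth counter that remembers where the first opening bracket was seen. The function name is made up for the example.

```python
def outermost_parenthesized(text):
    """Return the first balanced outermost (...) group, or None."""
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i  # remember where the outer group begins
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None  # no opening bracket, or brackets never balanced


example = "some text(text here(possible text)text(possible text(more text)))end text"
print(outermost_parenthesized(example))
```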
You can use regex recursion:
I want to add this answer for quick reference. Feel free to update.
.NET Regex using balancing groups.
Where c
is used as the depth counter.
PCRE using a recursive pattern.
Demo at regex101; Or without alternation:
Demo at regex101; Or unrolled for performance:
Demo at regex101; The pattern is pasted at (?R)
which represents (?0)
.
Perl, PHP, Notepad++, R: perl=TRUE, Python: Regex package with (?V1)
for Perl behaviour.
Ruby using subexpression calls.
With Ruby 2.0 \g<0>
can be used to call full pattern.
Demo at Rubular; Ruby 1.9 only supports capturing group recursion:
Demo at Rubular (atomic grouping since Ruby 1.9.3)
JavaScript API :: XRegExp.matchRecursive
JS, Java and other regex flavors without recursion up to 2 levels of nesting:
Demo at regex101. Deeper nesting needs to be added to pattern.
To fail faster on unbalanced parenthesis drop the +
quantifier.
Java: An interesting idea using forward references by @jaytea.
Reference - What does this regex mean?
I have a value like this:
What regex will return the values enclosed in the quotation marks (e.g. Foo Bar
and Another Value
)?
I’ve been using the following with great success:
It supports nested quotes as well.
For those who want a deeper explanation of how this works, here’s an explanation from user ephemient:
([""']) match a quote;
((?=(\\?))\2.) if backslash exists, gobble it, and whether or not that happens, match a character;
*? match many times (non-greedily, as to not eat the closing quote);
\1 match the same quote that was used for opening.
In general, the following regular expression fragment is what you are looking for:
This uses the non-greedy *? operator to capture everything up to but not including the next double quote. Then, you use a language-specific mechanism to extract the matched text.
In Python, you could do:
I would go for:
The [^"] is regex for any character except ‘"’
The reason I use this over the non greedy many operator is that I have to keep looking that up just to make sure I get it correct.
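A brief Python sketch comparing the two styles mentioned above: the lazy quantifier and the negated character class. The sample string is hypothetical, since the value from the question is not shown here.

```python
import re

# Hypothetical input with two double-quoted values.
value = 'MYSTRING = "Foo Bar" "Another Value" something else'

# Non-greedy: take as little as possible up to the next quote.
lazy = re.findall(r'"(.*?)"', value)

# Negated class: take everything that is not a quote.
negated = re.findall(r'"([^"]*)"', value)

print(lazy)
print(negated)
```

Neither variant handles escaped quotes inside the value; the backreference pattern explained earlier is meant for that case.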
I have a string User name (sales)
and I want to extract the text between the brackets, how would I do this?
I suspect sub-string but I can’t work out how to read until the closing bracket, the length of text will vary.
A very simple way to do it is by using regular expressions:
As a response to the (very funny) comment, here’s the same Regex with some explanation:
\( # Escaped parenthesis, means "starts with a '(' character"
( # Parentheses in a regex mean "put (capture) the stuff
# in between into the Groups array"
[^)] # Any character that is not a ')' character
* # Zero or more occurrences of the aforementioned "non ')' char"
) # Close the capturing group
\) # "Ends with a ')' character"
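The pattern explained above, together with a plain substring alternative, sketched in Python (the question is tagged for a different language; the function names are made up):

```python
import re


def between_parens_regex(s):
    """Text inside the first (...) pair, via the regex explained above."""
    m = re.search(r"\(([^)]*)\)", s)
    return m.group(1) if m else None


def between_parens_substring(s):
    """Same result using only index lookups; None if a bracket is missing."""
    start = s.find("(")
    end = s.find(")", start + 1)
    if start == -1 or end == -1:
        return None
    return s[start + 1:end]


print(between_parens_regex("User name (sales)"))
print(between_parens_substring("User name (sales)"))
```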
If you wish to stay away from regular expressions, the simplest way I can think of is:
I am using the jQuery validation plugin. Great stuff! I want to migrate my existing ASP.NET solution to use jQuery instead of the ASP.NET validators. I am missing a replacement for the regular expression validator. I want to be able to do something like this:
How do I add a custom rule to achieve this?
Thanks to the answer of redsquare I added a method like this:
$.validator.addMethod(
"regex",
function(value, element, regexp) {
var re = new RegExp(regexp);
return this.optional(element) || re.test(value);
},
"Please check your input."
);
now all you need to do to validate against any regex is this:
Additionally, it looks like there is a file called additional-methods.js that contains the method “pattern”, which can be a RegExp when created using the method without quotes.
http://bassistance.de/jquery-plugins/jquery-plugin-validation/
http://ajax.aspnetcdn.com/ajax/jquery.validate/1.9/additional-methods.js
You can use the addMethod()
e.g
$.validator.addMethod('postalCode', function (value) {
return /^((\d{5}-\d{4})|(\d{5})|([A-Z]\d[A-Z]\s\d[A-Z]\d))$/.test(value);
}, 'Please enter a valid US or Canadian postal code.');
good article here https://web.archive.org/web/20130609222116/http://www.randallmorey.com/blog/2008/mar/16/extending-jquery-form-validation-plugin/
I had some trouble putting together all the pieces for doing a jQuery regular expression validator, but I got it to work… Here is a complete working example. It uses the ‘Validation’ plugin which can be found in jQuery Validation Plugin
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<script src="http://YOURJQUERYPATH/js/jquery.js" type="text/javascript"></script>
<script src="http://YOURJQUERYPATH/js/jquery.validate.js" type="text/javascript"></script>
<script type="text/javascript">
$().ready(function() {
$.validator.addMethod("EMAIL", function(value, element) {
return this.optional(element) || /^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,5}$/i.test(value);
}, "Email Address is invalid: Please enter a valid email address.");
$.validator.addMethod("PASSWORD",function(value,element){
return this.optional(element) || /^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,16}$/.test(value);
},"Passwords are 8-16 characters with uppercase letters, lowercase letters and at least one number.");
$.validator.addMethod("SUBMIT",function(value,element){
return this.optional(element) || /[^ ]/i.test(value);
},"You did not click the submit button.");
// Validate signup form on keyup and submit
$("#LOGIN").validate({
rules: {
EMAIL: "required EMAIL",
PASSWORD: "required PASSWORD",
SUBMIT: "required SUBMIT",
},
});
});
</script>
</head>
<body>
<div id="LOGIN_FORM" class="form">
<form id="LOGIN" name="LOGIN" method="post" action="/index/secure/authentication?action=login">
<h1>Log In</h1>
<div id="LOGIN_EMAIL">
<label for="EMAIL">Email Address</label>
<input id="EMAIL" name="EMAIL" type="text" value="" tabindex="1" />
</div>
<div id="LOGIN_PASSWORD">
<label for="PASSWORD">Password</label>
<input id="PASSWORD" name="PASSWORD" type="password" value="" tabindex="2" />
</div>
<div id="LOGIN_SUBMIT">
<input id="SUBMIT" name="SUBMIT" type="submit" value="Submit" tabindex="3" />
</div>
</form>
</div>
</body>
</html>
I am trying to write a regular expression which returns a string which is between parentheses. For example: I want to get the string which resides between the strings “(” and “)”
would return
Found Regular Expression to get a string between two strings in Javascript
But I’m new with regex. I don’t know how to use ‘(’, ‘)’ in regexp
You need to create a set of escaped (with \
) parentheses (that match the parentheses) and a group of regular parentheses that create your capturing group:
Breakdown:
\(
: match an opening parentheses
(
: begin capturing group
[^)]+
: match one or more non )
characters
)
: end capturing group
\)
: match closing parentheses
Here is a visual explanation on RegExplained
Try string manipulation:
var txt = "I expect five hundred dollars ($500). and new brackets ($600)";
var newTxt = txt.split('(');
for (var i = 1; i < newTxt.length; i++) {
console.log(newTxt[i].split(')')[0]);
}
or regex (which is somewhat slow compared to the above)
Ported Mr_Green’s answer to a functional programming style to avoid use of temporary global variables.
var matches = string2.split('[')
.filter(function(v){ return v.indexOf(']') > -1})
.map( function(value) {
return value.split(']')[0]
})
I would like to create a page where all images which reside on my website are listed with title and alternative representation.
I already wrote me a little program to find and load all HTML files, but now I am stuck at how to extract src
, title
and alt
from this HTML:
<img <b>src</b>="/image/fluffybunny.jpg" <b>title</b>="Harvey the bunny" <b>alt</b>="a cute little fluffy bunny" />
I guess this should be done with some regex, but since the order of the tags may vary, and I need all of them, I don’t really know how to parse this in an elegant way (I could do it the hard char by char way, but that’s painful).
Using regexp to solve this kind of problem is a bad idea and will likely lead to unmaintainable and unreliable code. Better use an HTML parser.
In that case it’s better to split the process into two parts :
I will assume your doc is not xHTML strict so you can’t use an XML parser. E.G. with this web page source code :
/* preg_match_all match the regexp in all the $html string and output everything as
an array in $result. "i" option is used to make it case insensitive */
preg_match_all('/<img[^>]+>/i',$html, $result);
print_r($result);
Array
(
[0] => Array
(
[0] => <img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />
[1] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
[2] => <img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />
[3] => <img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />
[4] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
[...]
)
)
Then we get all the img tag attributes with a loop :
$img = array();
// $result[0] holds the full <img ...> matches from the previous step
foreach ($result[0] as $img_tag)
{
    preg_match_all('/(alt|title|src)=("[^"]*")/i', $img_tag, $img[$img_tag]);
}
print_r($img);
Array
(
[<img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />] => Array
(
[0] => Array
(
[0] => src="/Content/Img/stackoverflow-logo-250.png"
[1] => alt="logo link to homepage"
)
[1] => Array
(
[0] => src
[1] => alt
)
[2] => Array
(
[0] => "/Content/Img/stackoverflow-logo-250.png"
[1] => "logo link to homepage"
)
)
[<img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />] => Array
(
[0] => Array
(
[0] => src="/content/img/vote-arrow-up.png"
[1] => alt="vote up"
[2] => title="This was helpful (click again to undo)"
)
[1] => Array
(
[0] => src
[1] => alt
[2] => title
)
[2] => Array
(
[0] => "/content/img/vote-arrow-up.png"
[1] => "vote up"
[2] => "This was helpful (click again to undo)"
)
)
[<img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />] => Array
(
[0] => Array
(
[0] => src="/content/img/vote-arrow-down.png"
[1] => alt="vote down"
[2] => title="This was not helpful (click again to undo)"
)
[1] => Array
(
[0] => src
[1] => alt
[2] => title
)
[2] => Array
(
[0] => "/content/img/vote-arrow-down.png"
[1] => "vote down"
[2] => "This was not helpful (click again to undo)"
)
)
[<img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />] => Array
(
[0] => Array
(
[0] => src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
[1] => alt="gravatar image"
)
[1] => Array
(
[0] => src
[1] => alt
)
[2] => Array
(
[0] => "http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
[1] => "gravatar image"
)
)
[..]
)
)
Regexps are CPU intensive so you may want to cache this page. If you have no cache system, you can tweak your own by using ob_start and loading / saving from a text file.
First, we use preg_match_all, a function that gets every string matching the pattern and outputs it in its third parameter.
The regexps :
We apply it on all html web pages. It can be read as every string that starts with “<img
”, contains non “>” char and ends with a >.
We apply it successively on each img tag. It can be read as every string starting with “alt”, “title” or “src”, then a “=”, then a ’ " ‘, a bunch of stuff that are not’ " ’ and ends with a ’ " ’. Isolate the sub-strings between ().
Finally, every time you want to deal with regexps, it is handy to have good tools to quickly test them. Check this online regexp tester.
EDIT : answer to the first comment.
It’s true that I did not think about the (hopefully few) people using single quotes.
Well, if you use only ', just replace all the " by '.
If you mix both, first you should slap yourself :-), then try to use ("|') instead of " and [^ø] to replace [^"].
Just to give a small example of using PHP’s XML functionality for the task:
$doc=new DOMDocument();
$doc->loadHTML("<html><body>Test<br><img src=\"myimage.jpg\" title=\"title\" alt=\"alt\"></body></html>");
$xml=simplexml_import_dom($doc); // just to make xpath more simple
$images=$xml->xpath('//img');
foreach ($images as $img) {
echo $img['src'] . ' ' . $img['alt'] . ' ' . $img['title'];
}
I did use the DOMDocument::loadHTML()
method because this method can cope with HTML-syntax and does not force the input document to be XHTML. Strictly speaking the conversion to a SimpleXMLElement
is not necessary - it just makes using xpath and the xpath results more simple.
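The same parser-based idea can be sketched in Python with the standard html.parser module, which also tolerates non-XHTML input. The class and function names are made up for the example. Missing attributes come back as None.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect src, title and alt from every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; order does not matter.
        if tag == "img":
            found = dict(attrs)
            self.images.append({k: found.get(k) for k in ("src", "title", "alt")})


def extract_images(html):
    parser = ImgCollector()
    parser.feed(html)
    parser.close()
    return parser.images


page = (
    '<p><img src="/image/fluffybunny.jpg" title="Harvey the bunny" '
    'alt="a cute little fluffy bunny" />'
    "<IMG alt='x' src=a.png></p>"
)
for image in extract_images(page):
    print(image)
```

Quoting style and attribute order are handled by the parser, which is the main advantage over the regex approach.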