---
title: "Regular Expressions - If only they were not so practical"
date: 2019-09-26
draft: false
tags: ["programming","helper","regex"]
categories: ["Archive"]
---

For a long time, I avoided regular expressions. Every time I got into the subject, my instinctive reaction was flight. I tried again and again to approach the topic. But all the people who use regular expressions cannot be completely wrong, so I gathered all my courage and fought my way through. To be honest, I'm still not the biggest fan of regular expressions. Still, they save a lot of time.

Let's start with a very simple example. Imagine you have a long table of article numbers in Google Sheets. The article numbers have a uniform structure (e.g., article-color-12587). What do you do when you want to change the order of the segments? With a few simple formulas here and there and some extra columns you can do this 😁. With a regular expression you can solve this problem much more elegantly. Google Sheets has a `REGEXREPLACE()` function that solves this problem with one simple formula. But before we come to the solution, we need to know a few basics.

## The basics of regular expressions

Regular expressions help identify patterns within strings. The smallest element of a string is a single character. That's where we start. Regular expressions offer several so-called character classes:

### Character classes

|RegEx | Meaning |
|-------|-------------------------------------------------------------------|
|`.` | Any character |
|`\d` | Digits (0-9) |
|`\D` | All characters except digits |
|`\s` | Whitespace (space, tab, CR, LF) |
|`\S` | All characters that are not whitespace |
|`\w` | Alphanumeric characters including "_" |
|`\W` | Any character that is neither alphanumeric nor "_" |

This already helps us to look for a date pattern: `\d\d\.\d\d\.\d\d\d\d`

Two digits followed by a dot, followed by 2 digits, a dot and 4 digits. The period is prefixed with a `\`.
This makes it clear that we do not mean any character, but an actual period. The same mechanism applies to the digits, in reverse: if the `\` were not in front of the `d`, the letter d would be searched for, not the character class.

## Special characters

Besides the common characters there are special characters.

|Regex |Meaning|
|-------|----------------|
|`c` | A literal character, e.g. the "c" character |
|`^` | Beginning of a line or string / negation inside character classes `[^...]` |
|`$` | End of a line or string |
|`\` | Cancels the special meaning of the next character |
|`\n` | LF (line feed), line break |
|`\r` | CR (carriage return), moves the write position back to column 1 of the same line |
|`\r\n` | Line break on DOS / Windows |
|`\t` | Tab |
|`\f` | FF (form feed), page break: moves to the first line of the next page |
|`\a` | Bell (beep) |
|`\e` | Escape |
|`\b` | Empty string at the beginning or end of a word |
|`\B` | Empty string not at the beginning or end of a word |
|`\<` | Empty string at the beginning of a word |
|`\>` | Empty string at the end of a word |

Not every engine supports all of these; `\<` and `\>`, for example, come from GNU tools such as grep.

### Custom character classes

You can also define your own character classes:

|Regex |Meaning|
|-------|----------------|
|`[abc]` | a, b, or c: a so-called simple class |
|`[^abc]` | Any character that is not a, b, or c |
|`[a-h]` | Character range from a to h |
|`[a-h]\|[r-u]` | Characters in the range from a to h or from r to u |

A character class lists either single characters, as in `[aeiou]`, or ranges, as in `[a-h0-9]`. With the `^` you can negate the class, i.e. `[^abc]` means any character that is not an a, b, or c. In addition, you can connect several character classes with the `|` operator (or).

## Quantifiers

In regular expressions you can specify how often a character may occur.

|Regex |Meaning|
|-------|----------------|
|`a?` | Once or not at all |
|`a*` | Not at all up to any number of times |
|`a+` | Once up to any number of times |
|`a{3}` | Exactly three times |
|`a{3,5}` | At least three times, but not more than five times |

This can be used, e.g., to
find number blocks: `\d{3}-\d{4}-\d{5}` finds numbers formatted like this: 123-4567-89101.

## Greedy and lazy quantifiers

Quantifiers in regular expressions come in a greedy and a lazy variant. Greedy quantifiers try to consume as much as possible per match. Their lazy counterparts consume as little as possible per match.

### Lazy quantifiers:

|Regex |Meaning|
|-------|----------------|
|`a*?` | Not at all or more often, but as few times as possible |
|`a+?` | Once or more often, but as few times as possible |
|`a{3,}?` | Three times or more, but as few times as possible |

### Greedy quantifiers:

|Regex |Meaning|
|-------|----------------|
|`a*` | Not at all up to as often as possible |
|`a+` | Once up to as often as possible |
|`a{3,}` | Three times or more, and as often as possible |

Greedy is the default behavior. Quantifiers with an additional trailing `+` (`a*+`, `a++`, `a{3,}+`) are possessive quantifiers, a separate feature that only some engines offer.

Suppose we have the sentence `Hello -Bob-, how are -You-?` and search it with a greedy and a lazy quantifier:

(greedy quantifier): `-.*-` finds: `-Bob-, how are -You-`

(lazy quantifier): `-.*?-` finds: `-Bob-` and `-You-`

You can see that the greedy quantifier finds one long match in the sentence. The lazy quantifier finds two short matches instead.

## Groups

You can separate parts of a regular expression from each other. Such groups are defined by parentheses. Every group within a regular expression gets a number. The numbering starts with 0 for the complete match; the groups are then counted up from 1, from left to right.

---

Example SKU: artikel-rot-2545

Regular expression: (\w+)-(\w+)-(\d+)

---

The regular expression matches the complete article number. In addition, four numbers are assigned. To refer to one of them, you use $ followed by the respective number. In our example, it looks like this:

---

$0 = artikel-rot-2545

$1 = artikel

$2 = rot

$3 = 2545

---

Now we have everything we need for our original problem.
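
Before turning to the spreadsheet formula, the group numbering can be checked with a short Python sketch. It uses the SKU pattern from above; the helper names `describe_groups` and `reorder` are made up for this illustration. Note that Python's `re` module refers to groups in the replacement string as `\1`, `\2`, ... instead of `$1`, `$2`, ...

```python
import re

# Pattern from the text: three groups separated by hyphens.
SKU_PATTERN = re.compile(r"(\w+)-(\w+)-(\d+)")


def describe_groups(sku):
    """Return group 0 (the complete match) followed by groups 1 to 3."""
    match = SKU_PATTERN.fullmatch(sku)
    if match is None:
        raise ValueError(f"not a valid SKU: {sku!r}")
    return [match.group(i) for i in range(4)]


def reorder(sku):
    """Reverse the order of the three segments via group references."""
    return SKU_PATTERN.sub(r"\3-\2-\1", sku)


if __name__ == "__main__":
    print(describe_groups("artikel-rot-2545"))
    # ['artikel-rot-2545', 'artikel', 'rot', '2545']
    print(reorder("artikel-rot-2545"))
    # 2545-rot-artikel
```
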
The formula in Google Sheets looks like this:

---

=REGEXREPLACE(A1;"(\w+)-(\w+)-(\d+)";"$3-$2-$1")

---

Example: [Google Sheets](https://docs.google.com/spreadsheets/d/1jE5u8q2bKauiIQu6bGRX76Y-XE3CVtuumtwgRBis1Rw/edit?usp=sharing)

The same principle can be applied in various programming languages. There are slight differences depending on the language:

Python (needs `import re`; groups are referenced with `\1`, `\2` in the replacement):

`re.sub(r'(\w+) (\w+)', r'\2 \1', 'Word1 Word2')`

Go (needs the `regexp` and `fmt` packages):

`` regex := regexp.MustCompile(`(\w+) (\w+)`) ``

`fmt.Println(regex.ReplaceAllString("Word1 Word2", "$2 $1"))`

JavaScript:

`let regex = /(\w+) (\w+)/;`

`"Word1 Word2".replace(regex, "$2 $1");`

A different way in JavaScript, using the `RegExp` constructor. Inside a string literal the backslashes have to be doubled, otherwise `\w` turns into a plain w:

`let regex2 = new RegExp("(\\w+) (\\w+)");`

`"Word1 Word2".replace(regex2, "$2 $1");`

## Lookaround

Now it gets a bit more complicated 😇. You can also specify the context of a regular expression. This lets us search specifically for something that comes before or after a particular string.

### Look behind

Example SKU list:

artikel-rot-2538

artikel-gelb-2539

artikel-blau-2542

artikel-lila-2543

artikel-rot-2545

artikel-gelb-2546

Regular expression + Look behind:

---

(?<=artikel-)\w+

The first term artikel- must precede the second expression \w+ (?