Tuesday, March 23, 2010

sed: more intro

The slash as a delimiter

The character after the s is the delimiter. It is conventionally a slash, because this is what ed, more, and vi use. It can be anything you want, however. If you want to change a pathname that contains a slash - say /usr/local/bin to /common/bin - you could use the backslash to quote the slash:
sed 's/\/usr\/local\/bin/\/common\/bin/' <old >new
Gulp. Some call this a 'Picket Fence' and it's ugly. It is easier to read if you use an underscore instead of a slash as a delimiter:
sed 's_/usr/local/bin_/common/bin_' <old >new
Some people use colons:
sed 's:/usr/local/bin:/common/bin:' <old >new
Others use the "|" character:
sed 's|/usr/local/bin|/common/bin|' <old >new
Pick one you like. As long as it's not in the pattern or the replacement string, almost anything goes (sed does not accept a backslash or a newline as the delimiter). And remember that you need three delimiters. If you get an "Unterminated `s' command" error, it's because you are missing one of them.
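To convince yourself that the delimiter is only a matter of taste, here is a small sketch that runs the same substitution with four delimiters. The sample line and the variable names are made up for the illustration:

```shell
# Hypothetical sample input; any line containing the path would do.
line='cd /usr/local/bin'

# Same substitution, four different delimiters.
with_slash=$(echo "$line" | sed 's/\/usr\/local\/bin/\/common\/bin/')
with_underscore=$(echo "$line" | sed 's_/usr/local/bin_/common/bin_')
with_colon=$(echo "$line" | sed 's:/usr/local/bin:/common/bin:')
with_pipe=$(echo "$line" | sed 's|/usr/local/bin|/common/bin|')

# Each of these prints: cd /common/bin
echo "$with_slash"
echo "$with_underscore"
echo "$with_colon"
echo "$with_pipe"
```

Only the picket-fence version needs backslashes, because it is the only one whose delimiter also appears in the pathname.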

Using & as the matched string

Sometimes you want to search for a pattern and add some characters, like parentheses, around or near the pattern you found. It is easy to do this if you are looking for a particular string:
sed 's/abc/(abc)/' <old >new
This won't work if you don't know exactly what you will find. How can you put the string you found in the replacement string if you don't know what it is?
The solution requires the special character "&". It corresponds to the pattern found.
sed 's/[a-z]*/(&)/' <old >new
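Here is roughly what that looks like on real input. The two sample strings and the variable names are my own, chosen so that each line starts with lowercase letters:

```shell
# "&" stands for whatever the pattern matched.
wrapped=$(echo "abc 123" | sed 's/[a-z]*/(&)/')
echo "$wrapped"    # (abc) 123

# Without a "g" flag, only the first match on the line is changed.
first_only=$(echo "hello world" | sed 's/[a-z]*/(&)/')
echo "$first_only" # (hello) world
```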
You can have any number of "&" in the replacement string. You could also double a pattern, e.g. the first number of a line:
% echo "123 abc" | sed 's/[0-9]*/& &/'
123 123 abc
Let me slightly amend this example. Sed finds the first (leftmost) match and makes it as long as possible. The first match for '[0-9]*' is at the very start of the line, because this pattern matches zero or more digits, and zero digits is a valid match. So if the input was "abc 123", the number would not be duplicated; the only change would be a space inserted before the letters. A better way to duplicate the number is to make sure the pattern matches at least one digit:
% echo "123 abc" | sed 's/[0-9][0-9]*/& &/'
123 123 abc
The string "abc" is unchanged, because it was not matched by the regular expression. If you wanted to eliminate "abc" from the output, you must expand the regular expression to match the rest of the line and explicitly exclude part of the expression using "\(", "\)" and "\1", which is the next topic.
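Before moving on, the difference between the two patterns is easiest to see on input that starts with letters. A small sketch; the variable names are mine, and the square brackets in the output are only there to make the leading space visible:

```shell
# '[0-9]*' is happy with zero digits, so it matches the empty
# string at the start of the line and "& &" becomes a single space.
naive=$(echo "abc 123" | sed 's/[0-9]*/& &/')
echo "[$naive]"    # [ abc 123]

# '[0-9][0-9]*' needs at least one digit, so it skips ahead to 123.
fixed=$(echo "abc 123" | sed 's/[0-9][0-9]*/& &/')
echo "[$fixed]"    # [abc 123 123]
```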

Using \1 to keep part of the pattern

I have already described the use of "\(", "\)" and "\1" in my tutorial on regular expressions. To review, the escaped parentheses (that is, parentheses with backslashes before them) remember portions of the regular expression. You can use this to exclude part of the regular expression. The "\1" is the first remembered pattern, and the "\2" is the second remembered pattern. Sed has up to nine remembered patterns.
If you wanted to keep the first word of a line, and delete the rest of the line, mark the important part with the parentheses:
sed 's/\([a-z]*\).*/\1/'
I should elaborate on this. Regular expressions are greedy, and try to match as much as possible. "[a-z]*" matches zero or more lowercase letters, and tries to be as long as possible. The ".*" matches zero or more characters after the first match. Since the first one grabs all of the lowercase letters, the second matches anything else. Therefore if you type
echo abcd123 | sed 's/\([a-z]*\).*/\1/'
This will output "abcd" and delete the numbers.
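The same trick answers the earlier question of how to duplicate a number and drop the rest of the line. This is only a sketch of one way to do it; the variable names are made up, and the pattern assumes the line begins with the digits you care about:

```shell
# Keep the leading letters, throw away everything after them.
first_word=$(echo "abcd123" | sed 's/\([a-z]*\).*/\1/')
echo "$first_word"    # abcd

# Remember the leading number, let ".*" swallow the rest of the
# line, then write the remembered number twice.
digits_twice=$(echo "123 abc" | sed 's/\([0-9][0-9]*\).*/\1 \1/')
echo "$digits_twice"  # 123 123
```

Because ".*" is part of the match but not part of the replacement, " abc" disappears from the output.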
If you want to switch two words around, you can remember two patterns and change the order around:
sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/'
Note the space between the two remembered patterns. This is used to make sure two words are found.
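A quick test of the swap, with sample inputs I made up. Keep in mind that "[a-z]" only covers lowercase letters and that the pattern needs a space, so the third command leaves its input alone. The "[A-Za-z]" variant is my own addition for capitalized words, not part of the original command:

```shell
# Two lowercase words separated by one space get swapped.
swapped=$(echo "hello world" | sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/')
echo "$swapped"   # world hello

# Widening the bracket expression handles capitalized words too.
mixed=$(echo "Hello World" | sed 's/\([A-Za-z]*\) \([A-Za-z]*\)/\2 \1/')
echo "$mixed"     # World Hello

# No space on the line means no match, so nothing changes.
single=$(echo "hello" | sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/')
echo "$single"    # hello
```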
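The "\1" doesn't have to be in the replacement string (the right-hand side). It can also be in the pattern you are searching for (the left-hand side). If you want to eliminate duplicated words, you can try:
sed 's/\([a-z][a-z]*\) \1/\1/'
The pattern asks for at least one letter on purpose. With "\([a-z]*\) \1" the remembered word may be empty, so on a line with no duplicated word the command would match a lone space (an empty word, a space, and the empty word again) and delete it.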
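Here is a quick check of the duplicate-word idea. This sketch uses "[a-z][a-z]*" (one or more letters) so that the remembered pattern cannot be empty; the sample inputs and variable names are made up:

```shell
# "\1" on the left-hand side must repeat exactly what "\(...\)" matched.
deduped=$(echo "the the cat" | sed 's/\([a-z][a-z]*\) \1/\1/')
echo "$deduped"    # the cat

# No repeated word, so there is no match and the line is unchanged.
untouched=$(echo "hello world" | sed 's/\([a-z][a-z]*\) \1/\1/')
echo "$untouched"  # hello world
```

Note that this is still a rough tool: it compares runs of letters, not whole words, so "the theme" would also be treated as a duplicate.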
You can have up to nine values: "\1" through "\9".

1 comment:

  1. your
    sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/'
    example did not work for me.
    it just returned the original string.
