Text processing using tr, uniq, sort, sed and awk
Table of Contents
- Introduction
- Objective
- Commands intro
- Let us create the command
- Let us break down the commands
- Conclusion
- References
Introduction
- Shell commands make many text processing tasks quick and effective.
- In this blog, we will pick a sample sentence and use the commands tr, uniq, sort, sed and awk to process it.
- These commands are effective and have user-friendly options for processing and printing strings in the format the user needs.
- In real-world usage, this kind of text processing is needed when we have to process large log files and print only the required values.
Objective
For the given string, we need to count the number of occurrences of each alphanumeric word and print the output to the console, in descending order of count, in the format given below.
Example:
string1->4
string2->3
string3->3
string4->2
Commands intro
- tr - Translates (or deletes) characters in its input according to the specified character sets
- uniq - Omits repeated adjacent lines; with -c it also prefixes each line with its number of occurrences
- sed - Stream editor, a powerful command for filtering and transforming text
- sort - Sorts lines of text from standard input or files
- awk - Pattern matching and text processing language, another powerful command like sed
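To make the one-line descriptions above concrete, here is a minimal sketch that runs each command on its own. The tiny inputs are made up for illustration and are not part of the original example.

```shell
# tr: translate characters (here, lower case to upper case)
echo "hello" | tr '[:lower:]' '[:upper:]'       # HELLO

# uniq: collapse adjacent duplicate lines only (the last "a" survives)
printf 'a\na\nb\na\n' | uniq                     # a, b, a

# sort: order lines; -n compares them as numbers
printf '10\n9\n100\n' | sort -n                  # 9, 10, 100

# sed: delete lines that match a pattern (here, empty lines)
printf 'x\n\ny\n' | sed '/^$/d'                  # x, y

# awk: split each line into fields and print the chosen ones
echo "3 apples" | awk '{ print $2 "->" $1 }'     # apples->3
```

Note that uniq only compares neighbouring lines, which is why it is normally paired with sort.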
Let us create the command
- Store the given string in a variable
$ string="This is the Sample sentence, that contains repeated sample string exists more than once in the sample sentence. Repeat once more added in the sample string"
- Here is the command to achieve our objective (the variable is quoted so the shell passes the string through unchanged)
$ echo "$string" | tr -c '[:alnum:]' '\n' | tr '[:upper:]' '[:lower:]' | sed '/^$/d' | sort | uniq -c | sort -nr | awk '{ print $2"->"$1 }'
- Executing the command prints the output to the console in the expected format
$ echo "$string" | tr -c '[:alnum:]' '\n' | tr '[:upper:]' '[:lower:]' | sed '/^$/d' | sort | uniq -c | sort -nr | awk '{ print $2"->"$1 }'
sample->4
the->3
string->2
sentence->2
once->2
more->2
in->2
this->1
that->1
than->1
repeated->1
repeat->1
is->1
exists->1
contains->1
added->1
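If the same pipeline is needed more than once, it could be wrapped in a shell function. The sketch below is one possible way to do that; the function name word_freq is a hypothetical choice for illustration and does not come from any standard tool.

```shell
# word_freq: hypothetical helper name, chosen for this sketch.
# Reads text on stdin and prints "word->count" lines, most frequent first.
word_freq() {
  tr -c '[:alnum:]' '\n' |          # one word per line
    tr '[:upper:]' '[:lower:]' |    # make the count case-insensitive
    sed '/^$/d' |                   # drop empty lines
    sort |                          # group identical words together
    uniq -c |                       # count each group
    sort -nr |                      # highest count first
    awk '{ print $2 "->" $1 }'      # reformat as word->count
}

string="This is the Sample sentence, that contains repeated sample string exists more than once in the sample sentence. Repeat once more added in the sample string"

# Print only the three most frequent words
printf '%s\n' "$string" | word_freq | head -n 3
# sample->4
# the->3
# string->2
```

Because the function reads standard input, it should work the same way on a file, for example by redirecting a log file into it.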
Let us break down the commands
- tr -c '[:alnum:]' '\n' -> Replace every non-alphanumeric character with a newline, so that each word ends up on its own line
- tr '[:upper:]' '[:lower:]' -> Convert upper case letters to lower case letters
- sed '/^$/d' -> Remove the empty lines left behind by consecutive separators
- sort -> Sort the lines alphabetically, so that identical words become adjacent
- uniq -c -> Collapse adjacent identical lines and prefix each line with its number of occurrences
- sort -nr -> Sort by the leading numeric count, in descending order
- awk '{ print $2"->"$1 }' -> Print the word, then "->", then the count, which is the expected format
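The stages above can be traced on a much shorter input. The sentence below is made up for this sketch; it also shows why the first sort matters, since uniq -c only merges lines that are next to each other.

```shell
text="The cat, the dog. The end"

# Stages 1-3: one lower case word per line, empty lines removed
words=$(printf '%s\n' "$text" |
  tr -c '[:alnum:]' '\n' |
  tr '[:upper:]' '[:lower:]' |
  sed '/^$/d')
printf '%s\n' "$words"
# the, cat, the, dog, the, end (one word per line)

# Without sort, the three "the" lines are not adjacent,
# so uniq -c reports six separate lines, each with a count of 1
printf '%s\n' "$words" | uniq -c

# With sort first, identical words are grouped and counted together
printf '%s\n' "$words" | sort | uniq -c | sort -nr | awk '{ print $2 "->" $1 }'
# the->3
# end->1
# dog->1
# cat->1
```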
Conclusion
- In this blog, we discussed some of the text processing shell commands and printed a sample output as per the expected format
- As a reader, you can explore the options of each command in its man page, learn more about it, and apply it when needed
Thanks for reading!