Learn about extracting unique lines from a text or stdin, with examples of most of the use cases with sort and uniq commands.
- [Instructor] Welcome to the eighth video of section two: Sorting Unique and Duplicates. In the previous video, we learned about cryptography tools and hashes. This video illustrates most of the use cases of the sort and uniq commands. Sorting is a common task that we encounter with text files. The sort command helps us perform sort operations over text files and standard input (stdin). Most often, it is coupled with other commands to produce the required output.
uniq is another command that is often used along with the sort command. It helps to extract unique or duplicate lines from a text file or stdin. The sort command accepts input as file names as well as from stdin, and outputs the result by writing to standard output (stdout). The same applies to the uniq command. Let's see how to do it. We can easily sort a given set of files, for example file1.txt and file2.txt, as follows.
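A minimal sketch of that first step, assuming GNU sort. The names file1.txt and file2.txt come from the narration; the file contents, the scratch directory, and the sorted.txt output name are made up for illustration.

```shell
#!/bin/bash
export LC_ALL=C            # assumption: byte-order collation, for predictable results
workdir=$(mktemp -d)       # hypothetical scratch directory

printf 'pear\napple\n'    > "$workdir/file1.txt"
printf 'orange\nbanana\n' > "$workdir/file2.txt"

# Sort the lines of both files together and print them on stdout
sort "$workdir/file1.txt" "$workdir/file2.txt"

# Same sort, but written to a file with -o instead of stdout
sort -o "$workdir/sorted.txt" "$workdir/file1.txt" "$workdir/file2.txt"
sorted=$(cat "$workdir/sorted.txt")
```

Writing with -o rather than a shell redirect is the safer habit when the output file is also one of the inputs.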
For a numerical sort, we can use sort -n. To sort in reverse order, we can use sort -r. For sorting by months, in the order Jan, Feb, March, use sort -M. To merge two already sorted files, use sort -m. To find unique lines from a sorted file, use uniq. To check if a file has already been sorted, use sort -C.
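The options listed above can be tried together in one short sketch. It assumes GNU coreutils; every file name and all sample data here are hypothetical.

```shell
#!/bin/bash
export LC_ALL=C            # assumption: byte-order collation and English month names
workdir=$(mktemp -d)       # hypothetical scratch directory

printf '10\n9\n100\n'   > "$workdir/numbers.txt"
printf 'Mar\nJan\nFeb\n' > "$workdir/months.txt"
printf 'a\nc\n'          > "$workdir/sorted1.txt"
printf 'b\nd\n'          > "$workdir/sorted2.txt"
printf 'b\na\nb\n'       > "$workdir/dups.txt"

numeric=$(sort -n "$workdir/numbers.txt")                      # numeric order
reverse=$(sort -r "$workdir/dups.txt")                         # reverse order
by_month=$(sort -M "$workdir/months.txt")                      # month order
merged=$(sort -m "$workdir/sorted1.txt" "$workdir/sorted2.txt") # merge sorted inputs
unique=$(sort "$workdir/dups.txt" | uniq)                      # one copy of each line

# sort -C prints nothing; only its exit status says whether the file is sorted
if sort -C "$workdir/dups.txt"; then is_sorted=yes; else is_sorted=no; fi

printf '%s\n' "$numeric" "$reverse" "$by_month" "$merged" "$unique" "$is_sorted"
```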
Let's go to the script. Now replace filename with the file you want to check, and run the script. Let's see the result. As there is no such file or directory, sort fails and we get Unsorted as our output. As shown in the example, sort takes numerous parameters. It can be used to sort the data in files in different ways. Furthermore, it is useful when using the uniq command, which expects its input to be sorted. There are numerous scenarios where the sort and uniq commands can be used.
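The script itself is not captured in the narration, so the following is only a sketch of what such a check might look like. The function name check_sorted and the sample files are hypothetical; the one real ingredient is sort -C, which exits with zero when its input is already sorted.

```shell
#!/bin/bash
# check_sorted is a hypothetical helper: prints Sorted or Unsorted for one file
check_sorted() {
    # -C checks silently; any non-zero status (unsorted or missing file) means Unsorted
    if sort -C "$1" 2>/dev/null; then
        echo "Sorted"
    else
        echo "Unsorted"
    fi
}

workdir=$(mktemp -d)                           # hypothetical scratch directory
printf 'a\nb\nc\n' > "$workdir/ordered.txt"
printf 'b\na\n'    > "$workdir/jumbled.txt"

ordered_result=$(check_sorted "$workdir/ordered.txt")
jumbled_result=$(check_sorted "$workdir/jumbled.txt")
missing_result=$(check_sorted "$workdir/no-such-file.txt")   # file does not exist

echo "$ordered_result $jumbled_result $missing_result"
```

The missing-file case mirrors the result in the video: sort cannot open the file, exits with a non-zero status, and the script reports Unsorted.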
Let's go through the various options and usage techniques. For checking whether a file is already sorted, we exploit the fact that sort -C returns an exit code ($?) of zero if the file is sorted, and non-zero otherwise. These are some basic usages of the sort command. Let's see some ways of using it to accomplish complex tasks, starting with sorting according to keys or columns. We can use a column as the sort key if we need to sort text such as the following.
We can sort this in many ways. Currently it's sorted numerically by the serial number, the first column. We can also sort by the second column or the third column. -k specifies the key by which the sort is to be performed; the key is the column number to sort on. -r tells the sort command to sort in reverse order. For example.
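The on-screen data is not in the narration, so this sketch uses a made-up three-column file (serial number, name, value) to show -k with and without -n and -r. It assumes GNU sort.

```shell
#!/bin/bash
export LC_ALL=C            # assumption: byte-order collation, for predictable results
workdir=$(mktemp -d)       # hypothetical scratch directory

# Hypothetical data: serial number, name, value
printf '%s\n' '1 mac 2000' '2 winxp 4000' '3 bsd 1000' '4 linux 1000' \
    > "$workdir/data.txt"

# Column 1, numeric, reversed: serial numbers from highest to lowest
by_serial_desc=$(sort -nrk 1 "$workdir/data.txt")

# Column 2, plain text comparison: ordered by name
by_name=$(sort -k 2 "$workdir/data.txt")

# Column 3, numeric: ordered by value
by_value=$(sort -nk 3 "$workdir/data.txt")

printf '%s\n\n%s\n\n%s\n' "$by_serial_desc" "$by_name" "$by_value"
```

Note that -k 2 means "from column 2 to the end of the line"; writing -k 2,2 would restrict the key to that single column.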
Always be careful about the -n option for numeric sort. The sort command treats alphabetical sort and numeric sort differently. Hence, in order to specify a numeric sort, the -n option should be provided. Usually, by default, keys are columns in the text file. Columns are separated by space characters, but in certain circumstances you may need to specify keys as a group of characters in a given character range. For example, key1 = character 4 to character 8.
In such cases, where keys are to be specified explicitly as a range of characters, we can specify the key by giving the character positions at which the key starts and ends, as follows.
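A sketch of a character-range key, assuming GNU sort and made-up data with no spaces, so that the whole line counts as field 1. In sort's key notation a position is written field.character, so 1.2,1.3 means "characters 2 through 3 of field 1".

```shell
#!/bin/bash
export LC_ALL=C            # assumption: byte-order collation, for predictable results
workdir=$(mktemp -d)       # hypothetical scratch directory

# Hypothetical data: characters 2-3 of each line hold a two-digit number
printf '%s\n' '1900aaa' '2050bbb' '3420ccc' > "$workdir/data.txt"

# Key = characters 2..3 of field 1, compared numerically (90, 05, 42)
by_chars_2_3=$(sort -k 1.2,1.3n "$workdir/data.txt")

# Key = first character only
by_first_char=$(sort -k 1.1,1.1 "$workdir/data.txt")

printf '%s\n\n%s\n' "$by_chars_2_3" "$by_first_char"
```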
The highlighted characters are to be used as numeric keys. To extract them, use their positions in the lines as the key format. In the previous example, they're two and three. To use the first character as the key, give position one as both the start and the end of the key. To make the sort output compatible with xargs by using the \0 terminator, use sort -z. The zero terminator makes the output safe to use with xargs. Sometimes the text may contain unnecessary extraneous characters such as spaces.
To sort such text in dictionary order while ignoring leading blanks and punctuation, use sort -bd. The -b option ignores leading blanks (spaces and tabs) in each line, and the -d option specifies a sort in dictionary order. Now, uniq. uniq is a command used to find the unique lines in the given input, from stdin or from a file given as a command argument, by eliminating the duplicates. It can also be used to find the duplicate lines in the input. uniq works correctly only on sorted input, because it compares each line only with the line next to it.
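Two of the points above can be checked quickly. This sketch assumes GNU coreutils and uses made-up input: first sort -bd on lines with leading spaces and punctuation, then uniq on unsorted input to show why it needs sorting first.

```shell
#!/bin/bash
export LC_ALL=C            # assumption: byte-order collation, for predictable results

# -b skips leading blanks, -d compares only blanks, letters, and digits
dict_sorted=$(printf '%s\n' '  banana' '-apple' 'cherry' | sort -bd)
printf '%s\n' "$dict_sorted"

# uniq only collapses duplicates that sit on adjacent lines, so on unsorted
# input the two "hack" lines below both survive
unsorted_uniq=$(printf '%s\n' 'hack' 'bash' 'hack' | uniq)
printf '%s\n' "$unsorted_uniq"
```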
Hence, uniq should always be used along with the sort command, either through a pipe or with a sorted file as input. To produce the unique lines, that is, to print every line of the input with duplicate lines printed only once, proceed as follows.
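A minimal sketch of both approaches, with a hypothetical unsorted.txt. sort -u is included as an equivalent shortcut; it is not mentioned in the narration.

```shell
#!/bin/bash
export LC_ALL=C            # assumption: byte-order collation, for predictable results
workdir=$(mktemp -d)       # hypothetical scratch directory
printf '%s\n' 'hack' 'bash' 'foss' 'hack' > "$workdir/unsorted.txt"

# Pipe sorted output into uniq: each distinct line appears once
via_pipe=$(sort "$workdir/unsorted.txt" | uniq)

# Or give uniq an already sorted file as its argument
sort -o "$workdir/sorted.txt" "$workdir/unsorted.txt"
via_file=$(uniq "$workdir/sorted.txt")

# sort -u does the sort and the de-duplication in one step
via_sort_u=$(sort -u "$workdir/unsorted.txt")

printf '%s\n' "$via_pipe"
```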
To display only unique lines, the lines which are not repeated or duplicated in the input file, use uniq -u. To count how many times each of the lines appears in the file, use uniq -c. To find duplicate lines in the file, use uniq -d. To specify keys, we can use the combination of the -s and -w arguments. -s specifies the number of leading characters to be skipped, and -w specifies the maximum number of characters to be compared.
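The unique-only, count, and duplicate operations described above might be tried like this. The sample file is hypothetical and is already sorted so uniq can be applied directly.

```shell
#!/bin/bash
export LC_ALL=C            # assumption: byte-order collation, for predictable results
workdir=$(mktemp -d)       # hypothetical scratch directory
printf '%s\n' 'bash' 'foss' 'hack' 'hack' > "$workdir/sorted.txt"

# -u: only the lines that occur exactly once
only_unique=$(uniq -u "$workdir/sorted.txt")

# -c: prefix each distinct line with the number of times it occurs
counts=$(uniq -c "$workdir/sorted.txt")

# -d: only the lines that occur more than once, printed once each
duplicates=$(uniq -d "$workdir/sorted.txt")

printf '%s\n\n%s\n\n%s\n' "$only_unique" "$counts" "$duplicates"
```

The counts from -c are padded with leading spaces, so a follow-up such as sort -n or awk is the usual way to process them further.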
The comparison key defined by -s and -w is used as the index for the uniq operation, as follows. We need to use the highlighted characters as the uniqueness key. To do that, we ignore the first two characters with -s 2, and specify the maximum number of comparison characters with the -w option, -w 2.
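A sketch of -s 2 -w 2 on made-up data in which characters 3 and 4 of each line act as the key. It assumes GNU uniq, since -w is a GNU extension.

```shell
#!/bin/bash
export LC_ALL=C            # assumption: byte-order collation, for predictable results
workdir=$(mktemp -d)       # hypothetical scratch directory

# Hypothetical data: the two digits after the first colon are the key
printf '%s\n' 'u:01:gnu' 'd:04:linux' 'u:01:bash' 'u:01:hack' > "$workdir/data.txt"

# Skip the first 2 characters (-s 2), then compare at most 2 characters (-w 2);
# lines whose key matches the previous line's key are dropped
by_key=$(sort "$workdir/data.txt" | uniq -s 2 -w 2)

printf '%s\n' "$by_key"
```

After sorting, the three lines with key 01 are adjacent, so only the first of them is kept alongside the single 04 line.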
When we use the output of one command as input to the xargs command, it's always preferable to use a zero-byte terminator for each line of the output that acts as the source for xargs. When using the uniq command's output as the source for xargs, we should use zero-terminated output. If a zero-byte terminator is not used, by default space characters are used as the delimiter to split the arguments in the xargs command. For example, a line with the text "this is a line" from stdin will be taken as four separate arguments by the xargs command.
Actually, it's a single line. When a zero-byte terminator is used, \0 is the delimiter character, and hence a single line including spaces is interpreted as a single argument. Zero-byte-terminated output can be generated from the uniq command as follows. The following command removes all the files whose names are read from files.txt. If a file name appears on multiple lines in that file, the uniq command writes the file name only once to stdout.
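The command shown on screen is not in the narration, so this is a sketch under stated assumptions: GNU sort, uniq, and xargs, plus hypothetical file names created in a scratch directory. One detail to be aware of is that in GNU uniq, -z changes the record delimiter for input as well as output, so the newline-separated list is first converted to zero-terminated form with tr.

```shell
#!/bin/bash
workdir=$(mktemp -d)       # hypothetical scratch directory; nothing outside it is touched

# Hypothetical files, one of them with a space in its name
touch "$workdir/report one.txt" "$workdir/notes.txt" "$workdir/keep.txt"

# files.txt lists the files to delete, with one name repeated
printf '%s\n' "$workdir/report one.txt" "$workdir/notes.txt" \
    "$workdir/report one.txt" > "$workdir/files.txt"

# Convert newlines to \0, sort and de-duplicate zero-terminated records,
# then hand each record to rm as exactly one argument
tr '\n' '\0' < "$workdir/files.txt" | sort -z | uniq -z | xargs -0 rm --

ls "$workdir"
```

Without -0, xargs would split "report one.txt" at the space and rm would be asked to delete two files that do not exist.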
Great, we've successfully learned about sorting unique and duplicates. In the next video we'll learn about temporary file naming and random numbers.
Note: This course was created by Packt Publishing. We are pleased to host this training in our library.
- Printing in the terminal
- Performing math in the Linux shell
- Getting and setting dates
- Working with functions and arguments
- Reading output
- Making comparisons
- Concatenating text
- Finding, editing, generating, and deleting files
- Running parallel processes
- Using regular expressions
- Downloading webpages
- Parsing data from a website
- Finding broken links
- Backing up and archiving
- Transferring files and data through the network
- Monitoring your Linux system
- Gathering data for system administration