
Saturday, September 5, 2020

Removing Duplicates Using AWK and How it Works.


The usual way to remove duplicate lines from a file in Unix is to run sort and then uniq. The sort is required because uniq only collapses adjacent duplicate lines, so on unsorted input it does not return the correct unique values.
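To illustrate why the sort step matters, here is a small sketch. The variable names (input, uniq_only, sorted_uniq) and the sample values are made up for the demonstration.

```shell
# Sample input with duplicates that are not all adjacent.
input=$(printf '%s\n' 1 2 3 1 1 3)

# uniq on its own only collapses runs of identical neighbouring lines.
uniq_only=$(printf '%s\n' "$input" | uniq)

# Sorting first places all duplicates next to each other.
sorted_uniq=$(printf '%s\n' "$input" | sort | uniq)

printf 'uniq only:\n%s\n' "$uniq_only"       # 1 2 3 1 3, one per line
printf 'sort | uniq:\n%s\n' "$sorted_uniq"   # 1 2 3, one per line
```

Without sort, the later 1 and 3 survive because they are not adjacent to their first occurrences.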

Instead, we can remove the duplicates with a single awk command, which also keeps the remaining lines in their original order. This relies on awk's associative arrays.

Short notes about AWK associative array

Unlike regular arrays in many other languages, the indexes of an AWK associative array need not be a contiguous set of numbers; you can use either a string or a number as an array index. Also, there is no need to declare the size of an array in advance; arrays can grow and shrink at runtime.

Its syntax is: array_name[index] = value
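As a rough illustration of these properties, the sketch below uses a made-up array called fruit with one string index and one numeric index. The names fruit, has_pear, still_there and array_demo are hypothetical and chosen only for this example.

```shell
array_demo=$(awk 'BEGIN {
    fruit["apple"] = 3            # string index
    fruit[10] = "ten"             # numeric index, no need to start at 0 or be contiguous
    fruit["apple"]++              # update an existing element
    print fruit["apple"]          # 4
    print fruit[10]               # ten
    has_pear = ("pear" in fruit)  # membership test, does not create the element
    print has_pear                # 0
    delete fruit[10]              # arrays can shrink at runtime
    still_there = (10 in fruit)
    print still_there             # 0
}')
printf '%s\n' "$array_demo"
```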

Example

File contents:

cat tes

1

2

3

1

1

3

3a

3a

5

Command to Use: cat tes | awk '!seen[$0]++'  
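A minimal way to try this out is sketched below. It recreates the sample data in a temporary file (the variable names sample and deduped are assumptions for this sketch) and passes the file straight to awk, since awk can read a file directly and the extra cat is optional.

```shell
# Recreate the sample file contents in a temporary file.
sample=$(mktemp)
printf '%s\n' 1 2 3 1 1 3 3a 3a 5 > "$sample"

# Print each distinct line once, in order of first appearance.
deduped=$(awk '!seen[$0]++' "$sample")
printf '%s\n' "$deduped"   # 1 2 3 3a 5, one per line

rm -f "$sample"
```

Unlike sort followed by uniq, the output keeps the order in which each value first appears in the input.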

How it Works

seen[$0] - Uses the current line ($0) as the key into the array seen. Array keys are unique, so for our file the array ends up with the keys 1, 2, 3, 3a and 5, no matter how often each line occurs.

Note: seen is an arbitrary name; any valid variable name, such as a or b, works just as well.
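To see which keys end up in the array, one option is to dump the array in an END block. This is only a sketch: the loop variable key and the shell variable key_counts are made up, and the output is piped through sort because awk does not guarantee the order of a for-in loop.

```shell
key_counts=$(printf '%s\n' 1 2 3 1 1 3 3a 3a 5 |
    awk '{ seen[$0]++ } END { for (key in seen) print key, seen[key] }' |
    LC_ALL=C sort)
printf '%s\n' "$key_counts"
# 1 3
# 2 1
# 3 2
# 3a 2
# 5 1
```

Each key appears exactly once, followed by the number of times that line occurred in the input.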

seen[$0] uses the current line $0 as the key into the array seen and takes the value stored there. If this particular key was never referenced before (for example seen[1] when the first line is read), awk creates the element on the spot and seen[$0] evaluates to the empty string, which counts as zero, that is, false.

!seen[$0]

The ! negates the value from before. If it was empty or zero (false), we now have a true result. If it was non-zero (true), we have a false result. If the whole expression evaluates to true, meaning that seen[$0] was not set to begin with, the whole line is printed as the default action.

Also, regardless of the old value, the post-increment operator adds one to seen[$0], so the next time the same line is read, the stored value will be positive and the whole condition will fail.
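The steps described above can be spelled out in a longer form. This is a sketch of equivalent logic, not how awk evaluates the one-liner internally; the variable names count and long_form are assumptions.

```shell
long_form=$(printf '%s\n' 1 2 3 1 1 3 3a 3a 5 | awk '{
    count = seen[$0]   # empty (false) the first time this line appears
    seen[$0]++         # increment after the old value has been read
    if (!count)        # true only for the first occurrence of the line
        print $0
}')
printf '%s\n' "$long_form"   # 1 2 3 3a 5, one per line
```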

This is how awk treats an expression that has no action:

    awk 'expression' file is actually shorthand for: awk 'expression {print $0}' file

Whenever a test with no associated action is true, the default action is triggered. The default action is the equivalent of { print } or { print $0 }, which prints the current record, which for all intents and purposes in this example is the current unmodified line of input.
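A quick way to check this equivalence is to run the three forms on the same input and compare the results. The variable names below (data, short, with_print, with_print0, verdict) are made up for this sketch.

```shell
data=$(printf '%s\n' 1 2 3 1 1 3 3a 3a 5)

# Pattern only: awk falls back to the default action.
short=$(printf '%s\n' "$data" | awk '!seen[$0]++')
# Same pattern with the default action written out explicitly.
with_print=$(printf '%s\n' "$data" | awk '!seen[$0]++ { print }')
with_print0=$(printf '%s\n' "$data" | awk '!seen[$0]++ { print $0 }')

if [ "$short" = "$with_print" ] && [ "$short" = "$with_print0" ]; then
    verdict="all three forms agree"
else
    verdict="forms differ"
fi
echo "$verdict"   # all three forms agree
```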
