Thursday, December 27, 2018

AWK Two File Processing

When processing multiple files, awk reads them sequentially, one after another, in the order they are given on the command line. A common idiom for two-file processing has this general shape:

$ awk 'NR == FNR { # some actions; next} # other condition {# other actions}' file1.txt file2.txt


How it Works

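awk keeps two record counters: NR counts the records read so far across all input files, while FNR counts the records read from the current file and restarts at 1 for each new file. So, the condition NR == FNR is only true while awk is reading the first file (assuming that file is not empty). Thus, in the program above, the actions indicated by # some actions are executed while awk is reading the first file; the actions indicated by # other actions are executed while awk is reading the second file, if the condition in # other condition is met.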
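The behaviour of the two counters is easy to check. The following is a minimal sketch, assuming two small sample files whose names and contents are made up for illustration; it prints NR, FNR and the line itself for every record.

```shell
# Hypothetical sample data in a scratch directory.
tmpdir=$(mktemp -d)
printf 'a\nb\n'    > "$tmpdir/file1.txt"
printf 'c\nd\ne\n' > "$tmpdir/file2.txt"

# Show NR (count across all files), FNR (count within the current file)
# and the line itself for each record.
counters=$(awk '{ print NR, FNR, $0 }' "$tmpdir/file1.txt" "$tmpdir/file2.txt")
echo "$counters"
```

NR and FNR agree on the two lines of file1.txt (1 1 and 2 2) and diverge as soon as file2.txt starts (3 1, 4 2, 5 3).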

The next at the end of the first action block tells awk to stop processing the current line and move on to the next one. It is needed to prevent the condition in # other condition from being evaluated, and the actions in # other actions from being executed, while awk is reading the first file.
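To see what next changes, here is a small sketch that runs the same program with and without it. The sample files and the counter n are assumptions chosen for illustration: the first block counts the lines of file1.txt, and the second block prints that count in front of each line it sees.

```shell
# Hypothetical sample data in a scratch directory.
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/file1.txt"
printf 'c\nd\n' > "$tmpdir/file2.txt"

# With next: the second block never sees the lines of file1.txt.
with_next=$(awk 'NR == FNR { n++; next } { print n, $0 }' \
    "$tmpdir/file1.txt" "$tmpdir/file2.txt")

# Without next: the second block also runs for every line of file1.txt.
without_next=$(awk 'NR == FNR { n++ } { print n, $0 }' \
    "$tmpdir/file1.txt" "$tmpdir/file2.txt")

echo "with next:";    echo "$with_next"
echo "without next:"; echo "$without_next"
```

With next, only the two lines of file2.txt are printed. Without it, the lines of file1.txt leak into the output as well, each printed with the count reached so far.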

A few examples should make this clearer. Many problems that involve two files can be solved using this technique. Let's look at this one:

# prints lines that are both in file1.txt and file2.txt (intersection)

$ awk 'NR == FNR{a[$0];next} $0 in a' file1.txt file2.txt
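A quick way to try the command is with two small sample files. The sketch below assumes made-up file contents (fruit names) purely for illustration.

```shell
# Hypothetical sample data in a scratch directory.
tmpdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$tmpdir/file1.txt"
printf 'cherry\nbanana\ndate\n'  > "$tmpdir/file2.txt"

# Lines present in both files.
common=$(awk 'NR == FNR{a[$0];next} $0 in a' "$tmpdir/file1.txt" "$tmpdir/file2.txt")
echo "$common"
```

The output is cherry followed by banana: the matching lines are printed in the order they occur in file2.txt, since that is the file being read when printing happens.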

Here we see another typical idiom: a[$0] on its own serves only to create the array element indexed by $0; merely referencing the element creates it, even though no value is assigned. During the pass over the first file, every line seen is stored as an index of the array a. The pass over the second file then only needs to check whether each line exists as an index in a (that is what the condition $0 in a does). If the condition is true, the line from file2.txt is printed, because a pattern with no action defaults to printing the line.

In a very similar way, we can print the lines that appear in only one of the two files:

# prints lines that are only in file1.txt and not in file2.txt
$ awk 'NR == FNR{a[$0];next} !($0 in a)' file2.txt file1.txt
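
Note the order of the arguments: file2.txt is given first, so its lines are loaded into the array a, and each line of file1.txt is then printed only if it is not found there. To print lines that are only in file2.txt and not in file1.txt, just reverse the order of the arguments.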
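Both directions can be tried side by side. This is a minimal sketch with assumed sample files; the file contents and variable names are illustrative only.

```shell
# Hypothetical sample data in a scratch directory.
tmpdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$tmpdir/file1.txt"
printf 'banana\ncherry\ndate\n'  > "$tmpdir/file2.txt"

# Lines only in file1.txt: load file2.txt first, then filter file1.txt.
only_in_file1=$(awk 'NR == FNR{a[$0];next} !($0 in a)' \
    "$tmpdir/file2.txt" "$tmpdir/file1.txt")

# Lines only in file2.txt: same program, arguments reversed.
only_in_file2=$(awk 'NR == FNR{a[$0];next} !($0 in a)' \
    "$tmpdir/file1.txt" "$tmpdir/file2.txt")

echo "only in file1.txt: $only_in_file1"
echo "only in file2.txt: $only_in_file2"
```

With this data the first command prints apple and the second prints date.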
