Under-appreciated command line tools: comm

The comm command is surely one of the most under-appreciated commands in GNU coreutils (BSD and macOS ship their own version too). Its man page is barely a page long, and here's the most interesting part, taken from the BSD version:


NAME
     comm -- select or reject lines common to two files

SYNOPSIS
     comm [-123i] file1 file2

DESCRIPTION

     The following options are available:

     -1      Suppress printing of column 1.

     -2      Suppress printing of column 2.

     -3      Suppress printing of column 3.

But this doesn’t really tell you much about what the command can do.

To put it simply, comm allows you to do set arithmetic on the command line. Given two input files, it prints three columns: lines unique to the first file, lines unique to the second file, and lines common to both. Each of the -1, -2 and -3 options hides one of those columns.
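
Here's a quick sketch of what the default output looks like. The file names and their contents are made up for illustration, and both files are already sorted:

```shell
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$workdir/fileA"
printf '%s\n' banana cherry date  > "$workdir/fileB"

# No flags: all three columns are printed, separated by tabs.
#   column 1 (no indent) = only in fileA
#   column 2 (one tab)   = only in fileB
#   column 3 (two tabs)  = in both files
default_output=$(comm "$workdir/fileA" "$workdir/fileB")
printf '%s\n' "$default_output"
```

With this data, apple should land in column 1, date in column 2, and banana and cherry in column 3.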

So given sets A and B, you can find:

  • The relative complements (lines present in only one of the input files)
    comm -23 fileA fileB # A \ B, or A - B
    comm -13 fileA fileB # B \ A, or B - A
  • And set intersections (lines common to both files)
    comm -12 fileA fileB # A ∩ B
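
To make the three operations concrete, here is a small sketch on made-up data. The fruit names and the scratch directory are assumptions for the example, nothing more:

```shell
# Hypothetical sample sets; both files are already sorted.
workdir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$workdir/fileA"
printf '%s\n' banana cherry date  > "$workdir/fileB"

# Hide columns 2 and 3: keep lines only in fileA (A \ B).
only_a=$(comm -23 "$workdir/fileA" "$workdir/fileB")

# Hide columns 1 and 3: keep lines only in fileB (B \ A).
only_b=$(comm -13 "$workdir/fileA" "$workdir/fileB")

# Hide columns 1 and 2: keep lines present in both (A ∩ B).
both=$(comm -12 "$workdir/fileA" "$workdir/fileB")

printf 'only in A:\n%s\n' "$only_a"
printf 'only in B:\n%s\n' "$only_b"
printf 'in both:\n%s\n' "$both"
```

Here A \ B comes out as apple, B \ A as date, and the intersection as banana and cherry.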

All it requires is that both inputs be sorted, which is slightly annoying. I make it a point to run sort | uniq on my data before passing it to comm.
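
A minimal sketch of that preparation step, assuming two unsorted files with duplicates (rawA and rawB are hypothetical names). Pinning LC_ALL=C is my own precaution so that sort and comm agree on the ordering; it is not something comm requires by itself:

```shell
# Hypothetical unsorted input with a duplicate line.
workdir=$(mktemp -d)
printf '%s\n' cherry apple banana apple > "$workdir/rawA"
printf '%s\n' date banana cherry        > "$workdir/rawB"

# Sort and de-duplicate each input before handing it to comm.
LC_ALL=C sort "$workdir/rawA" | uniq > "$workdir/fileA"
LC_ALL=C sort "$workdir/rawB" | uniq > "$workdir/fileB"

# Lines common to both cleaned-up files.
common=$(LC_ALL=C comm -12 "$workdir/fileA" "$workdir/fileB")
printf '%s\n' "$common"
```

sort -u does the same job as sort | uniq in one step. In bash, process substitution avoids the intermediate files: comm -12 <(sort -u rawA) <(sort -u rawB).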

Why is this useful?

I use it a lot for data reconciliation and filtering. Instead of writing a short Python script or using a spreadsheet, if I'm working on the command line already, I just use comm. It’s great when you’re asking questions like “What happened in case A but not in case B” or “What was common in cases A & B” etc.

Many times, I don’t really care about the exact matches: I just pipe the output of comm to wc -l to get a count of the matching lines.
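
For instance, a rough sketch of counting instead of listing, again on made-up data:

```shell
# Hypothetical sample sets; both files are already sorted.
workdir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$workdir/fileA"
printf '%s\n' banana cherry date  > "$workdir/fileB"

# Count the lines only in fileA, and the lines common to both.
# Some wc implementations pad the number with spaces, so strip whitespace.
only_a_count=$(comm -23 "$workdir/fileA" "$workdir/fileB" | wc -l | tr -d '[:space:]')
common_count=$(comm -12 "$workdir/fileA" "$workdir/fileB" | wc -l | tr -d '[:space:]')

echo "only in A: $only_a_count"
echo "in both:   $common_count"
```

With this data the counts are 1 and 2.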