Duplicate lines in your code are sometimes a sign that it needs refactoring. If a line appears very often, for example, you might want to wrap it in a function and call the function instead, making maintenance easier. Here’s a quick Linux pipeline that lists duplicate lines to help you with this task. I’ll explain it step by step. (Thanks to Robert Strandh for the basic technique.)
I’ll assume the code is in Python and all the files are in a single directory, but you can adapt the technique for other languages and directory trees.
The basic technique
The first stage of this pipeline uses the cat command to combine all the Python files in sequence:
$ cat *.py
class Parser:
    """Parse a file"""
    def __init__(self):
...and so on...
Pipe the output to a command that can delete spaces and tabs, such as tr, sed, or awk, so our duplicate detector will treat whitespace as insignificant:
$ cat *.py | tr -d ' \t'
classParser:
"""Parseafile"""
def__init__(self):
...and so on...
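If you prefer sed or awk to tr, either of these should produce the same whitespace-stripped output. This is just a sketch; both commands rely on the POSIX [[:blank:]] character class, which matches spaces and tabs:
$ cat *.py | sed 's/[[:blank:]]//g'
$ cat *.py | awk '{ gsub(/[[:blank:]]/, ""); print }'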
Sort the results so that duplicate lines become adjacent lines. (If desired, add the option -f to make the sort case-insensitive. I did not.)
$ cat *.py | tr -d ' \t' | sort
Then count identical adjacent lines with uniq -c. Any output line that begins with a count of 2 or greater represents a duplicate.
$ cat *.py | tr -d ' \t' | sort | uniq -c
16 response=requests.get(url)
203 def__init__(self):
1 #Thisisatotallyuniquecomment
8 x=Vector([1,2,3])
...and so on...
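Incidentally, if you do want the case-insensitive sort mentioned earlier, you’ll probably also want uniq to ignore case when it counts. GNU and BSD versions of uniq support -i for this purpose; a quick sketch (I didn’t use this variant myself):
$ cat *.py | tr -d ' \t' | sort -f | uniq -c -i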
Finally, sort the output numerically from highest to lowest value with sort -nr, so the most-duplicated lines are at the top of the output.
$ cat *.py | tr -d ' \t' | sort | uniq -c | sort -nr
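As I mentioned earlier, the technique adapts easily to other languages: just change the file glob at the front of the pipeline. Here’s a sketch for a C project, assuming the sources live in *.c and *.h files:
$ cat *.c *.h | tr -d ' \t' | sort | uniq -c | sort -nr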
If you read my book, Efficient Linux at the Command Line, you may recognize the combination of sort, uniq -c, and sort -nr as a common pattern that I explain on page 13.
Checking a whole directory tree
If your files are stored in subdirectories, not in a single directory, don’t worry. Instead of using cat alone, run the find command to locate Python files throughout the tree and use xargs to “cat” their contents together:
$ find . -type f -name \*.py -print0 | xargs -0 cat | tr -d ' \t' \
  | sort | uniq -c | sort -nr
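If the tree contains directories you’d rather ignore, such as a virtual environment or vendored code, you can prune them with find. A sketch, assuming a virtual environment named .venv (adjust the path to suit your project):
$ find . -type f -name \*.py ! -path '*/.venv/*' -print0 | xargs -0 cat \
  | tr -d ' \t' | sort | uniq -c | sort -nr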
Cutting down the noise
Some of the duplicate lines will be trivial: blank lines, comments, a reserved word on a line by itself, and so on. You can skip past them by eye, but consider making your pipeline more robust by removing these trivial lines. One technique is to create a file of lines that you want to omit:
$ cat noisy-lines
"""
#
#TODO:
break
continue
def__init__(self):
pass
...and so on...
Then use fgrep -v to filter them out before calling uniq:
$ cat *.py | tr -d ' \t' | fgrep -v -f noisy-lines \
| sort | uniq -c | sort -nr
16 response=requests.get(url)
8 x=Vector([1,2,3])
...and so on...
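By the way, fgrep is simply shorthand for grep -F, so if your system complains that fgrep is deprecated, this equivalent command works the same way:
$ cat *.py | tr -d ' \t' | grep -F -v -f noisy-lines \
  | sort | uniq -c | sort -nr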
The commands I’ve presented are certainly not as robust as real static analysis, but they’re simple, handy tools for your toolbox.
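If you find yourself running the pipeline often, you might wrap it in a small script. Here’s a minimal sketch; the script name dupelines, the default glob, and the optional noisy-lines file in the current directory are my own conventions, not part of the technique itself, so adjust them to taste:
#!/bin/sh
# dupelines: list duplicate lines of code, most frequent first.
# Usage: dupelines [glob]     (default glob is '*.py')
# If a file named noisy-lines exists in the current directory,
# its lines are filtered out before counting.

pattern=${1:-'*.py'}

strip_noise() {
    if [ -f noisy-lines ]; then
        grep -F -v -f noisy-lines
    else
        cat
    fi
}

find . -type f -name "$pattern" -print0 \
    | xargs -0 cat \
    | tr -d ' \t' \
    | strip_noise \
    | sort | uniq -c | sort -nr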
For more examples of interesting pipelines, or to understand the general techniques in this post in more detail, check out Efficient Linux at the Command Line.