
Daniel J. Barrett


Commands, Linux, Shell · April 7, 2022

Check for duplicate code with a one-liner

Duplicate lines in your code are sometimes a sign that it needs refactoring. If the same line appears many times, for example, you might wrap it in a function and call the function instead, making maintenance easier. Here’s a quick Linux pipeline that lists duplicate lines to help with this task. I’ll explain it step by step. (Thanks to Robert Strandh for the basic technique.)

I’ll assume the code is in Python and all the files are in a single directory, but you can adapt the technique for other languages and directory trees.

The basic technique

The first stage of this pipeline uses the cat command to combine all the Python files in sequence:

$ cat *.py
class Parser:
  """Parse a file"""
  def __init__(self):
...and so on...

Pipe the output to a command that can delete spaces and tabs, such as tr, sed, or awk, so our duplicate detector will treat whitespace as insignificant:

$ cat *.py | tr -d ' \t'
classParser:
"""Parseafile"""
def__init__(self):
...and so on...

Sort the results so that duplicate lines become adjacent lines. (If desired, add the option -f to make the sort case-insensitive. I did not.)

$ cat *.py | tr -d ' \t' | sort

Then count identical adjacent lines with uniq -c. Any output line that begins with a count of 2 or greater represents a duplicate.

$ cat *.py | tr -d ' \t' | sort | uniq -c
    16 response=requests.get(url)
   203 def__init__(self):
     1 #Thisisatotallyuniquecomment
     8 x=Vector([1,2,3])
...and so on...

Finally, sort the output numerically from highest to lowest value with sort -nr, so the most-duplicated lines are at the top of the output.

$ cat *.py | tr -d ' \t' | sort | uniq -c | sort -nr
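Putting the stages together, here’s a self-contained sketch you can run in a scratch directory. The file names and their contents are made up for illustration:

```shell
# Create a scratch directory with two tiny sample "Python" files
tmpdir=$(mktemp -d)
cd "$tmpdir"

cat > a.py <<'EOF'
def __init__(self):
    pass
x = 1
EOF

cat > b.py <<'EOF'
def __init__(self):
  pass
y = 2
EOF

# Strip whitespace, sort, count duplicates, then sort counts high to low.
# The two duplicated lines appear first, each with a count of 2.
cat *.py | tr -d ' \t' | sort | uniq -c | sort -nr
```

Notice that the two `pass` lines have different indentation in the source files, yet they still count as duplicates because tr removed the whitespace first.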

If you read my book, Efficient Linux at the Command Line, you may recognize the combination of sort, uniq -c, and sort -nr as a common pattern that I explain on page 13.

Checking a whole directory tree

If your files are stored in subdirectories, not in a single directory, don’t worry. Instead of using cat alone, run the find command to locate Python files throughout the tree and use xargs to “cat” their contents together:

$ find . -type f -name \*.py -print0 | xargs -0 cat | tr -d ' \t' \
  | sort | uniq -c  | sort -nr
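As a quick sanity check, here’s a sketch of the find/xargs version run on a tiny, made-up directory tree. The -print0 and -0 options pass filenames separated by null characters, so names containing spaces or other unusual characters survive intact:

```shell
# Build a small tree: one file at the top level, one in a subdirectory
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/pkg"
printf 'import os\n' > "$tmpdir/app.py"
printf 'import os\n' > "$tmpdir/pkg/util.py"

# Locate all .py files in the tree and feed them through the same pipeline;
# the shared "import os" line is reported with a count of 2
find "$tmpdir" -type f -name '*.py' -print0 | xargs -0 cat \
  | tr -d ' \t' | sort | uniq -c | sort -nr
```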

Cutting down the noise

Some of the duplicate lines will be trivial: blank lines, comments, a reserved word on a line by itself, and so on. You can skip past them by eye, but consider making your pipeline more robust by removing these trivial lines. One technique is to create a file of lines that you want to omit:

$ cat noisy-lines
"""
#
#TODO:
break
continue
def__init__(self):
pass
...and so on...

Then use fgrep -v (equivalently, grep -F -v) to filter them out before sorting and counting:

$ cat *.py | tr -d ' \t' | fgrep -v -f noisy-lines \
  | sort | uniq -c  | sort -nr
   16 response=requests.get(url)
    8 x=Vector([1,2,3])
...and so on...
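Here’s a small, self-contained sketch of the filtering step, with illustrative file names and contents. I’ve written it with grep -F -v, which is equivalent to fgrep -v:

```shell
# A filter file of "noisy" lines (already whitespace-stripped) and a
# sample source file containing both noisy and interesting duplicates
tmpdir=$(mktemp -d)
printf 'pass\ndef__init__(self):\n' > "$tmpdir/noisy-lines"

cat > "$tmpdir/code.py" <<'EOF'
def __init__(self):
    pass
x = Vector([1,2,3])
x = Vector([1,2,3])
EOF

# grep -F -v -f drops every line that matches an entry in noisy-lines,
# leaving only the interesting duplicate with a count of 2
cat "$tmpdir"/*.py | tr -d ' \t' \
  | grep -F -v -f "$tmpdir/noisy-lines" \
  | sort | uniq -c | sort -nr
```

One caveat: -f matches substrings anywhere in the line, so a short entry like # would discard every line containing a # character. Add grep’s -x option if you want entries to match whole lines only.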

The commands I’ve presented are certainly not as robust as real static analysis, but they’re simple, handy tools for your toolbox.

For more examples of interesting pipelines, or to understand the general techniques in this post in more detail, check out Efficient Linux at the Command Line.


