Relearning Python #2: duplicate-files.py

Posted: 2023-03-07
Word Count: 688
Tags: programming python

In the last post, we wrote a rough cut of a Python script to list duplicated files in one or more directories. In this post, we add command-line options, output options, and some internal improvements.

Improvements

Last time I said I’d do the following:

[…]

[…]

Take a look at the latest duplicate-files.py. This script now works almost exactly like the Ruby version.

$ time duplicate-files.py -q Projects -o Projects/dupes-py.yaml

real	0m5.971s
user	0m4.272s
sys 	0m1.675s

$ time duplicate-files.rb -q Projects -o Projects/dupes-rb.yaml

real	0m7.724s
user	0m4.753s
sys 	0m2.896s

$ diff Projects/dupes-{rb,py}.yaml
154a155,156
> - - Projects/3pty/jsonp-api/tck/tck-tests/src/main/resources/jsonObjectUnknownEncoding.json
>   - Projects/3pty/jsonp-api/tck/tck-tests/target/classes/jsonObjectUnknownEncoding.json
4089a4092,4093
> - - Projects/vendor/java/antlr/4.7.2/antlr-python2-runtime-4.7.2/src/antlr4_python2_runtime.egg-info/dependency_links.txt
>   - Projects/vendor/java/antlr/4.7.2/antlr-python3-runtime-4.7.2/src/antlr4_python3_runtime.egg-info/dependency_links.txt

Important Changes In Detail

Please follow along in the Python code.

add_to_dupsets

(ll 66-73)

As I suggested last time, I’ve rewritten this function to use frozensets, which Python will accept as elements in a set because, unlike ordinary sets, they are hashable. Also, in the dupset mapping from file names to known duplicates, all keys that are duplicates of one another share the same frozenset.
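The shared-frozenset idea can be sketched as follows. This is an illustration only, not the code at lines 66-73 of the script: the signature and the body are assumptions, and plain strings stand in for file names.

```python
def add_to_dupsets(dupsets, file1, file2):
    """Record that file1 and file2 are duplicates (hypothetical signature).

    dupsets maps each file name to the frozenset of all its known
    duplicates, itself included. Every member of a group points at the
    same frozenset object.
    """
    merged = frozenset(
        dupsets.get(file1, frozenset([file1]))
        | dupsets.get(file2, frozenset([file2]))
    )
    # Re-point every member at the merged group so they all share it.
    for name in merged:
        dupsets[name] = merged


dupsets = {}
add_to_dupsets(dupsets, "a.txt", "b.txt")
add_to_dupsets(dupsets, "b.txt", "c.txt")

# Because frozensets are hashable, the distinct groups fall out of a set.
unique_groups = set(dupsets.values())
```

After the two calls, all three keys map to one frozenset, so `unique_groups` holds a single group.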

compare_files

(ll 75-86)

I’ve made this function more “Pythonic” by using itertools.combinations instead of nested loops. (The Ruby version now uses something similar.) I also inlined sort_uniq because using frozensets really did make the code simpler. I could have inlined the variable superset, but I wanted to inspect the sets I was creating along the way.
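For readers without the script open, here is a rough sketch of the pairwise comparison with itertools.combinations. It is not the script's actual function: the use of filecmp for the byte-by-byte check, the parameter name, and the return value are all assumptions made for illustration.

```python
import filecmp
import itertools


def compare_files(candidates):
    """Return a set of frozensets, one per group of identical files.

    candidates is an iterable of paths that might be duplicates
    (hypothetical input; the real script may select them differently).
    """
    dupsets = {}
    # combinations(..., 2) yields each unordered pair exactly once,
    # replacing a pair of nested loops over the same list.
    for first, second in itertools.combinations(candidates, 2):
        if filecmp.cmp(first, second, shallow=False):
            superset = frozenset(
                dupsets.get(first, frozenset([first]))
                | dupsets.get(second, frozenset([second]))
            )
            for name in superset:
                dupsets[name] = superset
    # Collecting the values into a set removes repeated groups, which is
    # the job a separate sort-and-uniq step would otherwise do.
    return set(dupsets.values())
```

Keeping `superset` as a named variable, as the post describes, makes it easy to print or inspect each merged group while debugging.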

convert_results

(ll 134-135)

This snippet of code to turn nested Paths and sets into strings and lists kept moving around. I made it its own function so I could deploy it as late in processing as possible, because sets and Paths are the best data structures for this script.
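A conversion like this can be very short. The version below is a guess at its shape rather than a copy of lines 134-135; the sorting in particular is an assumption, added so the output order is stable.

```python
from pathlib import Path


def convert_results(results):
    """Turn an iterable of sets of Paths into sorted lists of strings."""
    return sorted(sorted(str(path) for path in group) for group in results)


# Hypothetical input: one group of two duplicate files.
results = {frozenset({Path("b/notes.txt"), Path("a/notes.txt")})}
converted = convert_results(results)
```

The result is a plain list of lists of strings, which both the JSON and YAML writers can serialize without help.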

make_argparser

(ll 88-132)

This is just a big procedure to define command line arguments. Python’s argparse is much more verbose than Ruby’s OptionParser.
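To give a feel for the verbosity, here is a cut-down parser covering only the short options mentioned in this post. The long option names, destinations, and help strings are assumptions; the real function defines more options than this.

```python
import argparse


def make_argparser():
    parser = argparse.ArgumentParser(
        description="List duplicated files in one or more directories."
    )
    # Long names and help text below are illustrative guesses.
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="suppress informational output")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print extra output")
    parser.add_argument("-z", "--zero-length", action="store_true",
                        help="option concerning zero-length files")
    parser.add_argument("-o", "--output", metavar="FILE",
                        help="write results to FILE instead of stdout")
    parser.add_argument("directories", nargs="+", metavar="DIR",
                        help="directories to search")
    return parser


# Same arguments as the timing run shown earlier in the post.
args = make_argparser().parse_args(
    ["-q", "Projects", "-o", "Projects/dupes-py.yaml"]
)
```

Each option takes its own add_argument call with keyword arguments, which is where most of the line count comes from.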

run

The main function has a few big changes:

  1. Using the argument parser to parse options and the regular directory arguments.

  2. Implementing the -z option by post-processing the results of compare_files.

  3. Adding YAML as an output option … assuming the import doesn’t fail.

  4. Adding “pretty-printing” to both YAML and JSON. The “pretty” YAML isn’t quite the same as in the Ruby script, but honestly it wasn’t that pretty in that script either.

  5. Writing to an output file if the user specifies that option.
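Items 3 to 5 might look roughly like the sketch below. The function names and parameters are hypothetical, PyYAML is assumed as the YAML library, and the choice of flow style versus block style for "pretty" YAML is a guess rather than what the script does.

```python
import json
import sys

try:
    import yaml  # PyYAML is third-party, so treat it as optional
except ImportError:
    yaml = None


def render_results(results, fmt="json", pretty=False):
    """Serialize already-converted results (lists of strings) to text."""
    if fmt == "yaml":
        if yaml is None:
            raise RuntimeError("YAML output requested but PyYAML is missing")
        # Block style when pretty, compact flow style otherwise.
        return yaml.safe_dump(results, default_flow_style=not pretty)
    return json.dumps(results, indent=2 if pretty else None) + "\n"


def write_results(text, outfile=None):
    """Write to the named file if one was given, else to standard output."""
    if outfile is None:
        sys.stdout.write(text)
    else:
        with open(outfile, "w", encoding="utf-8") as handle:
            handle.write(text)
```

Guarding the import this way means JSON output keeps working on machines where the YAML library is not installed.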

Also worth noting: just as PathEncoder strips out the Path objects for JSON, convert_results strips out Path objects and sets for YAML.
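For reference, a JSON encoder of this kind usually overrides json.JSONEncoder.default. The class below is a sketch under that assumption, not the script's PathEncoder; its handling of sets is an addition for illustration.

```python
import json
from pathlib import Path


class PathEncoder(json.JSONEncoder):
    """Encode Path objects as strings (sketch; set handling is assumed)."""

    def default(self, o):
        if isinstance(o, Path):
            return str(o)
        if isinstance(o, (set, frozenset)):
            # JSON has no set type, so emit a sorted list instead.
            return sorted(o)
        return super().default(o)


encoded = json.dumps([frozenset({Path("b/x"), Path("a/x")})], cls=PathEncoder)
```

The YAML path has no equivalent hook in this sketch, which is why a separate conversion step is needed before dumping YAML.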

More Improvements

I also said I’d do three other things:

Right now -q suppresses no output and -v adds no output, because there is none. Maybe because it’s my second time doing this, I felt no need to print messages, debugging or otherwise.

I’m going to deprecate -d. You might as well add that directory to the directory list, since the script recurses down it either way, and I doubt that skipping the checks for duplicates that don’t involve that directory saves any time. The only other effect is to put the file from that directory first in the list of duplicates, since remove-files.rb removes all but the first in a set of duplicates. (That’s why -z includes a blank line at the start of the zero-length files.)

Instead, I removed the Progress class in duplicate-files.rb. I can figure out why I couldn’t predict the number of comparisons later, as a different project.

Vale, duplicate-files.*

Despite still having more to do, I’m going to put these scripts aside for now. Instead, next time I’ll write something else in Python: a processor for the screenplay equivalent of Markdown, Fountain.