Relearning Python #1: Python At Last

While waiting for a Python Meetup to start¹, I decided to start writing a Python version of the duplicate-files script I presented previously. These are my thoughts on the experience of cutting a whole 68-line script.

Follow along in another window as I comment on duplicate-files.py.

General Comments

How hard was it?

Honestly, the second hardest part was looking up how the builtin objects and libraries work. I kept flipping between the docs and my source code, multiple times, to figure out how to get the entries from a ~~Hash~~ Dictionary, how to iterate over an ~~Array~~ List, and so forth.

The hardest, however, came when I discovered Python’s strict rules about mutable objects and hash keys. Because dictionaries, lists, and sets are mutable, attempts to use them as keys always fail. (Mutable objects can change, changing the hash value.²) Furthermore, Ruby’s YAML and JSON modules accept any iterable object as a JSON Array or YAML List, while Python’s json accepted only Lists.³ It also balked at Path objects, which is why line 28 uses strings.³ (And Python has no YAML extension in its standard library. I could have installed pyyaml, but I wanted to see what Python could do out of the box.)

I ran into the same problems turning duplicate file sets into unique and sorted lists; the Ruby code took only a couple of method calls, while my Python code uses a bunch of temporary objects. There’s probably a more elegant solution with list comprehensions, but it was late and I wanted something working first.

Otherwise, the whole thing came together pretty easily.

Does it work?

Here’s a quick test run on them both, using my Projects directory – all my little coding projects – as sample data.

$ time duplicate-files.py Projects > Projects/dupes-py.json 

real	0m4.032s
user	0m2.434s
sys 	0m1.521s

For reference:

$ time duplicate-files.rb -jp -o Projects/dupes-rb.json Projects
Looking for files in ["Projects"]: - done!
Found 15150 files in 3123 size groups.
Comparing files ...
26593/6246 |########################################|... done!
Wrote performance data to /tmp/dfperf20230303-60194-gfcr6l/perf-2023-03-03-124130-960648980.json

real	0m5.493s
user	0m3.527s
sys 	0m1.938s

The Python version is about a second faster, but it’s doing way less.⁴

The Python version also found two more duplicate file sets than the Ruby version. One of the file sets contained only a null byte (00), and the other only a newline (0A). Maybe Ruby’s FileUtils.cmp tries to compare files by lines, while Python’s filecmp.cmp does a straight binary comparison.

Specific Functions

`find_files`

(lines 12-30)

Honestly I thought this might be the hardest part without Ruby’s Find module, but a simple recursive algorithm suits the problem much better.

As noted above, not keeping the Path objects annoyed me a bit. It worked out, but I wish Python’s json used duck-typing more when converting a structure to JSON.³

And yes, I couldn’t help putting in a comment marking the function’s end. Too much Ruby and Lua, I guess.

`duplicate_files`

(lines 32-33)

This has two fewer clauses than the Ruby equivalent. I don’t know whether checking that the files exist before comparing them, or comparing paths simply by string name rather than the paths they point to, will hurt me when I start trying to break the script.

Also, I see that my C-instilled habit of putting procedures above the places that use them remains intact …

`add_to_dupsets`

(lines 35-41)

In the Ruby version this required a few calls in compare_files. Because Python lacks the “initializer blocks” the Ruby version uses, though, I felt I had to make it its own function.

`sort_uniq`

(lines 43-46)

This is a mess. I won’t sugar-coat it. I needed unique sorted lists of all the file sets I created. Briefly I considered making them the same Set instance, but I was worried I’d end up eradicating one or more of the Sets if I got fancy.² So instead I ended up creating multiple sets, then make the file sets into sorted lists, then made the lists into tuples so I could check for equality, then a set of the tuples to make them unique, then turned the set of unique tuples back into a list so the JSON emitter wouldn’t complain.

`compare_files`

(lines 48-57)

This is where the magic happens, and by magic I mean brute-force combinatorics. Like the Ruby version, I compare every file of the same size against every other file in the same size set, unless the size is 0 (which would always succeed).

Both the Ruby and Python versions are unable to infer that if A is a duplicate of B, and A a duplicate of C, than B is a duplicate of C without doing another comparison. I don’t know if that would speed things up or just complicate the code.²

`run`

(lines 59-65, 67-68)

Essentially this is the Ruby main function without the option parsing … or user-friendly output … or performance logging … or vertical whitespace. In other words, what the first cut of the Ruby program looked like in the beginning. I think. It was so long ago.

The use of Path here rather than up in find_files … tasks me. It tasks me and I shall fix it. Maybe with a short procedure that kicks off the recursive calls.

Finally I call the main procedure if the user invoked this script as a program. Because they might mistake it for a module …

Conclusion and Next Steps

This little script is no Zope. It doesn’t even do what my crufty Ruby script can do. It’s just a quick project to get back into Python.

Still, improving it will be my next installment:

Parsing command-line options with argparse or something else.
Generating (or not generating) messages to let the user know what the script is doing. (Spinning wheels and progress bars optional.)
Write results to a file instead of STDOUT.
Installing pyyaml to generate (pretty-printed) YAML or JSON as desired.
An option to include zero-length files as “duplicates”.
(Maybe) compare all files to those in a “canonical directory”.
(Maybe) reworking sort_uniq to create unique, sorted lists inside a larger list more efficiently and elegantly.
(Maybe) figure out why I can’t accurately estimate how many comparisons a data set will require, thus removing the need for “performace” (sic) data.
(Maybe) deleting duplicates with user approval, thus removing the need for a remove-files.py.⁵

Originally in person, but because of bad weather yesterday we held it on Zoom. ↩︎
Having one frozenset instance for every set of identical files, built by combining sets from each file and setting the result back as the set for each contained file in the dupsets index, would solve the hashing, redundant comparisons, and uniquing problems. a) The sets would already be unique. b) Before comparing files the function could check if the sets for each file already contain the other. c) If all the sets with the same data are the same frozenset, I can just check if that instance is already in a set, then convert them all to lists and sort them at the same time. I’ll leave that for next version. ↩︎ ↩︎ ↩︎
After writing this, I discovered one can substitute an encoder object into json.dumps() that understands sets, Paths, or anything else. I’m letting this sentence stand, along with the horrible code I wrote as a result of not knowing this. I’ll fix it next version. ↩︎ ↩︎ ↩︎
Also note the Progress bar is still broken. “26593/6246”? Really? ↩︎
I wanted a step to examine the duplicate files and decide which versions or locations to keep. But maybe there’s another way … ↩︎