Python is a popular programming language, on a par with C++ and Java, especially in the numerical analysis community. I thought it would behoove me to pick it up again.
Unfortunately the last Python version I used for a job was 2.0, back in 2003. The most recently released version was 3.11. Python 3.5, installed by default on my laptop, is no longer supported1, and Python 3.7 is the earliest one that is supported2. Python 2.7, used by some software on my laptop, is apparently way old.
If I recall, the jump from 2.x to 3.x was supposed to be a big deal, so I have a lot of relearning to do.
My History With Python
In maybe 1996 or 1997 – I think I was still living in Chicago – I read the first editon of Mark Lutz’s 900 page tome Programming Python, which covered Python 1.3. After seeing the mess that was Perl I found Python’s syntax familiar, its object orientation slightly clunky but usable, and its indentation-based syntax clever.
Since then I discovered other scripting languages:
- Ruby, a more thoroughly object-oriented language that has been gradually shedding its earlier Perlisms.
- Lua, which hypothetical readers will have seen before, a very small and very embeddable and extensible scripting language.
- Tcl/Tk, one of the earlier scripting languages which was really big at Sun but which seems like a glorified command-line processor and not like a “real” language at all.
- Various versions of Scheme like Guile and tinyscheme, which honestly I never really got into because of Lots of Irritating Silly Parentheses.
At the startup I worked at in 2001-2003 I even presented a little “tech talk” about scripting languages and how they might automate tasks. At one point I started developing a systems testing framework in Python 2.0, but honestly I didn’t think through all the issues or provide enough support for writing tests, so it was eventually replaced with first our new software architect’s interpreted Java engine and then actual JUnit tests in Java.
To be honest, the main reasons I forsook Python for Ruby were the use of indentation to convey program structure3, which becomes onerous without editor support when moving code blocks around, and the clunky, badly integrated support for classes and objects. (Maybe 3.x fixed the latter problem?)
Still, if only to reawaken dormant parts of my brain, Python it is.
Potential Projects
While I’ll peruse the tutorial in the standard documentation, and maybe even buy a book, the best way for me to relearn Python – the syntax, the libraries, and the idioms – is to work on simple but useful projects. Here are some initial candidates:
- A script to recurse through nested directories and find duplicate files … which I’ve already written for Ruby.
- A pull parser4 for JSON, a lightweight data interchange format that’s become standard across the Web.
- A pull parser4 for ELTN, a lightweight data interchange format I just made up a few years ago.
- A script to convert Fountain documents – a simple text format for writing screenplays – into HTML.5
I’ll probably think of some better ones as I go – too many parsers and text processors – but for now I think I’ll start with #1 above, a duplicate files finder. For learning purposes6 I’ll rewrite it from first principles … but it’s nice to know there’s a reference implementation.
duplicate-files.rb
I’ve had this script for nearly(?) twenty years, as the reference to Ruby 1.97 attests.
It’s a command line script that takes a list of directories.
It traverses the directories, finds duplicate files8,
then lists all the sets of duplicates in YAML or JSON formats.
Calling duplicate-files.rb -h
yields the following message:
Usage: duplicate-files.rb [options] dir1 [dir2 ...]
-q, --[no-]quiet Run absolutely quietly
-v, --[no-]verbose Run verbosely
-d, --from [DIR] Compare arguments only to files from DIR
-o, --output [OUTFILE] Write standard output to OUTFILE
-j, --json Write output as JSON
-y, --yaml Write output as YAML (default)
-p, --[no-]pretty Pretty-print output
The script has gotten a little crufty, and it could be simplified.
- The Spinner class was a somewhat over-elaborate means to create an ASCII spinning wheel when traversing directories. On slower machines or larger directory structures the process seems to hang.
- The Printer objects were an attempt to print results incrementally that failed. They could be replaced with a simple Array to collect results and functions with flags.
- The
-d
option came about when I wanted to keep only the versions in a specific directory. A script I wrote to automate the deletions would always keep the first path in a list and delete the rest. By default paths in each duplicate list are sorted alphabetically. With this option the “canonical” directory always appears first. - “Verbose” mode is essentially for debugging, which is why one can engage both “quiet” and “verbose” mode.
append_duplicates
could be shorter if I had come up with better data structures.
#!/usr/bin/env ruby
require 'find'
require 'fileutils'
require 'optparse'
require 'yaml'
require 'json'
# Default list of files to prune in search
PRUNE = ['.svn', 'CVS', 'CVSROOT', '.DS_Store', '.git']
# Unfortunate artifact of transition between Ruby 1.9 and 2.0
if self.class.const_defined?(:Encoding) then
ENCODING_UTF8 = Encoding.find('UTF-8')
else
ENCODING_UTF8 = nil
end
# Simple class to manage an ASCII spinner
class Spinner
SPINNER_STATES = ['-', '\\', '|', "/"]
SPINNER_INTERVAL = 0.5 # s
def initialize(io=$stderr)
@io = io
@state = 0
@updated = nil
end
def start()
@updated = Time.now
@io.print(SPINNER_STATES[0])
end
def update()
if not @updated then
start()
end
now = Time.now
if now - @updated > SPINNER_INTERVAL then
@updated = now
@state = (@state + 1) % SPINNER_STATES.length
@io.print("\b", SPINNER_STATES[@state])
end
end
end
# Collects duplicate file sets in lieu of an Array,
# and writes them as Ruby objects
class DefaultPrinter
attr_accessor :io, :pretty
def initialize
@results = []
end
def pretty?
return @pretty
end
def << (obj)
@results << obj
end
def print_results
@results.sort!
print(@results)
end
protected
def print(obj)
@io.write(obj)
@io.write("\n");
end
end
# Printer class (see above) that serializes as JSON
class JsonPrinter < DefaultPrinter
def print(obj)
if @pretty then
text = JSON.pretty_generate(obj)
else
text = JSON.generate(obj)
end
@io.write(text)
@io.write("\n")
end
end
# Printer class (see above) that serializes as YAML.
class YamlPrinter < DefaultPrinter
# :line_width setting so file names w/spaces don't break across lines
DEFAULT_LINE_WIDTH = 4096
def print(obj)
opts = {:line_width => DEFAULT_LINE_WIDTH}
if @pretty then
# {:canonical => true} looks almost identical to pretty-printed JSON
# so the header keeps them distinct.
opts[:header] = true
opts[:canonical] = true
end
YAML.dump(@results, @io, opts)
end
end
# Whether `path_i` and `path_j` refer to duplicate but not identical files
def duplicate_files?(path_i, path_j)
return (File.exist?(path_i) \
and File.exist?(path_j) \
and not File.identical?(path_i, path_j) \
and FileUtils.cmp(path_i, path_j))
end
def prunable?(path, prune=[])
name = File.basename(path)
return (prune.include?(name) or File.fnmatch('._*', name))
end
# Recurse through array of `dirs` and produce Hash of all files by size.
def files_by_size(dirs, prune=[], verbose=false)
result = Hash.new {|h,k| h[k]=[]}
count = 0
spinner = nil
if verbose then
$stderr.print("Looking for files in ", dirs, ": ")
spinner = Spinner.new($stderr)
spinner.start
end
Find.find(*dirs) do |path|
if prunable?(path, prune) then
Find.prune()
elsif File.file?(path) then
size = File.size(path)
if size > 0 then
count = count + 1
path.encode!(ENCODING_UTF8) if ENCODING_UTF8
result[size] << path if size > 0
if verbose then
count = count + 1
spinner.update
end
end
end
end
if verbose then
$stderr.print(" done!\n")
$stderr.print("Found ", count, " non-empty files in ",
result.size, " size groups.\n")
end
return result
end
# Compare each file in `paths` and append lists of equal files to `result`
def append_duplicates(result, paths)
idsets = Hash.new {|h,k| h[k]=[k]}
0.upto(paths.length - 1) do |i|
(i+1).upto(paths.length - 1) do |j|
path_i = paths[i]
path_j = paths[j]
if duplicate_files?(path_i, path_j) then
set_i, set_j = idsets[path_i], idsets[path_j]
set_i << path_j
set_i.sort!
set_j << path_i
set_j.sort!
end
end
end
idsets.values.uniq.each do | s |
result << s
end
return result
end
# Find all files in `fmap` and append lists of equal files to `result`
# `fmap` is a map of files by size.
def append_all_dupes(result, fmap, verbose=false)
fmap.each_pair do |size, paths|
if paths.size > 1 then
$stderr.print("Comparing ", paths, " ...") if verbose
append_duplicates(result, paths)
$stderr.print("done.\n") if verbose
end
end
end
# Find all files in `srcdir`, compare to all files in `fmap`,
# and append lists of equal files to `result`.
# The path from `srcdir` will always be first in the list.
def append_dir_dupes(result, fmap, srcdir, prune=[], verbose=false)
$stderr.print("Looking for files in '", srcdir, "':\n") if verbose
Find.find(srcdir) do |path_i|
if prunable?(path_i, prune) then
Find.prune()
elsif File.file?(path_i) then
size = File.size(path_i)
paths = fmap[size]
if size > 0 and paths and not paths.empty? then
dupes = []
$stderr.print("Comparing '", path_i,
"' to ", paths, " ...") if verbose
paths.each do |path_j|
if duplicate_files?(path_i, path_j) then
dupes << path_j
end
end
if dupes.size > 0 then
dupes.sort!
result << [path_i, *dupes]
end
$stderr.print(" done.\n") if verbose
end
end
end
$stderr.print("Done!\n") if verbose
return result
end
def parse_options(config)
outfile = nil
pclass = YamlPrinter
pretty = false
OptionParser.new do |opts|
opts.banner = "Usage: duplicate-files.rb [options] dir1 [dir2 ...]"
opts.on("-q", "--[no-]quiet", "Run absolutely quietly") do |v|
config.info = (not v)
end
opts.on("-v", "--[no-]verbose", "Run verbosely") do |v|
config.verbose = v
end
opts.on("-d", "--from [DIR]",
"Compare arguments only to files from DIR") do |d|
config.canondir = d
end
opts.on("-o", "--output [OUTFILE]",
"Write standard output to OUTFILE") do |f|
outfile = File.new(f, 'w')
end
opts.on("-j", "--json", "Write output as JSON") do |v|
pclass = JsonPrinter
end
opts.on("-y", "--yaml", "Write output as YAML (default)") do |v|
pclass = YamlPrinter
end
opts.on("-p", "--[no-]pretty", "Pretty-print output") do |v|
pretty = v
end
end.parse!
printer = pclass.new
printer.io = (outfile or $stdout)
printer.pretty = pretty
config.printer = printer
config.dirs = ARGV
end
Options = Struct.new("Options",
:info, :verbose, :canondir, :dirs, :prune, :printer)
def run
opts = Options.new(true, false, nil, [], PRUNE, nil)
parse_options(opts)
fmap = files_by_size(opts.dirs, opts.prune, (opts.info or opts.verbose))
if opts.info then
$stderr.print("Comparing files ...")
end
if opts.canondir then
append_dir_dupes(opts.printer, fmap, opts.canondir,
opts.prune, opts.verbose)
else
append_all_dupes(opts.printer, fmap, opts.verbose)
end
if opts.info then
$stderr.print(" done!\n")
end
opts.printer.print_results
end
run()
-
Specifically 3.5.2, which is dated Jan 26 2021 and is several patch releases behind. ↩︎
-
For security fixes only, until 2023-06-27. ↩︎
-
Seriously,
begin
…end
and the like are much more readable, and you don’t have to worry so much about reindenting code you’ve factored into a new procedure or method. ↩︎ -
A pull parser reads only the next grammatical element in the input, then returns control to the caller. (E.g. StAX in Java.) The more common “push” parser reads all the input in one go, and invokes event handling callbacks and/or builds a parse tree in memory. (E.g. SAX parsers.) Pull parsers take up less memory than a parse tree and can make for more readable code than a collection of event handlers, but anything more complicated than a simple, unambiguous recursive grammar like XML, JSON, or ELTN’s subset of Lua is virtually impossible. ↩︎ ↩︎
-
I found one called
screenplain
, but I ran it over an old fanfic in Fountain format and the results were kind of ugly. The “bare” HTML looks OK – what is an<H6>
tag though? – but the syntax mandates that unlike Markdown parsers preserve line breaks. (So there go those habits. Maybe an extension where if a line ends with\
the line break is ignored?) Still, as a practice project the idea is simple enough … I hope. ↩︎ -
And maybe out of necessity. As far as I can tell, Python lacks an equivalent to the
find
module which emulates the Unix utility of the same name. ↩︎ -
Released 2007-12-25, according to Wikipedia. The first version I used was probably 1.8, released in 2003 and the subject of Thomas and Hunt’s Programming Ruby 1st edition, a.k.a. the Pickaxe Book. (FWIW the current version of Ruby is 3.1.3, released two days ago.) ↩︎
-
It first groups files according to file size, then compares all combinations of files with the same size to create lists of files that have exactly the same bytes. ↩︎