Relearning Python #0: Introduction

Posted: 2022-11-26
Word Count: 2000
Tags: programming python ruby

Python is a popular programming language, on a par with C++ and Java, especially in the numerical analysis community. I thought it would behoove me to pick it up again.

Unfortunately the last Python version I used for a job was 2.0, back in 2003. The most recently released version was 3.11. Python 3.5, installed by default on my laptop, is no longer supported1, and Python 3.7 is the earliest one that is supported2. Python 2.7, used by some software on my laptop, is apparently way old.

If I recall, the jump from 2.x to 3.x was supposed to be a big deal, so I have a lot of relearning to do.

My History With Python

In maybe 1996 or 1997 – I think I was still living in Chicago – I read the first editon of Mark Lutz’s 900 page tome Programming Python, which covered Python 1.3. After seeing the mess that was Perl I found Python’s syntax familiar, its object orientation slightly clunky but usable, and its indentation-based syntax clever.

Since then I discovered other scripting languages:

At the startup I worked at in 2001-2003 I even presented a little “tech talk” about scripting languages and how they might automate tasks. At one point I started developing a systems testing framework in Python 2.0, but honestly I didn’t think through all the issues or provide enough support for writing tests, so it was eventually replaced with first our new software architect’s interpreted Java engine and then actual JUnit tests in Java.

To be honest, the main reasons I forsook Python for Ruby were the use of indentation to convey program structure3, which becomes onerous without editor support when moving code blocks around, and the clunky, badly integrated support for classes and objects. (Maybe 3.x fixed the latter problem?)

Still, if only to reawaken dormant parts of my brain, Python it is.

Potential Projects

While I’ll peruse the tutorial in the standard documentation, and maybe even buy a book, the best way for me to relearn Python – the syntax, the libraries, and the idioms – is to work on simple but useful projects. Here are some initial candidates:

  1. A script to recurse through nested directories and find duplicate files … which I’ve already written for Ruby.
  2. A pull parser4 for JSON, a lightweight data interchange format that’s become standard across the Web.
  3. A pull parser4 for ELTN, a lightweight data interchange format I just made up a few years ago.
  4. A script to convert Fountain documents – a simple text format for writing screenplays – into HTML.5

I’ll probably think of some better ones as I go – too many parsers and text processors – but for now I think I’ll start with #1 above, a duplicate files finder. For learning purposes6 I’ll rewrite it from first principles … but it’s nice to know there’s a reference implementation.

duplicate-files.rb

I’ve had this script for nearly(?) twenty years, as the reference to Ruby 1.97 attests.

It’s a command line script that takes a list of directories. It traverses the directories, finds duplicate files8, then lists all the sets of duplicates in YAML or JSON formats. Calling duplicate-files.rb -h yields the following message:

Usage: duplicate-files.rb [options] dir1 [dir2 ...]
    -q, --[no-]quiet                 Run absolutely quietly
    -v, --[no-]verbose               Run verbosely
    -d, --from [DIR]                 Compare arguments only to files from DIR
    -o, --output [OUTFILE]           Write standard output to OUTFILE
    -j, --json                       Write output as JSON
    -y, --yaml                       Write output as YAML (default)
    -p, --[no-]pretty                Pretty-print output

The script has gotten a little crufty, and it could be simplified.

#!/usr/bin/env ruby 

require 'find'
require 'fileutils'
require 'optparse'
require 'yaml'
require 'json'


# Default list of files to prune in search
PRUNE = ['.svn', 'CVS', 'CVSROOT', '.DS_Store', '.git']

# Unfortunate artifact of transition between Ruby 1.9 and 2.0
if self.class.const_defined?(:Encoding) then
    ENCODING_UTF8 = Encoding.find('UTF-8')
else
    ENCODING_UTF8 = nil
end

# Simple class to manage an ASCII spinner
class Spinner
    SPINNER_STATES = ['-', '\\', '|', "/"]

    SPINNER_INTERVAL = 0.5 # s

    def initialize(io=$stderr)
        @io = io
        @state = 0
        @updated = nil
    end

    def start()
        @updated = Time.now
        @io.print(SPINNER_STATES[0])
    end

    def update()
        if not @updated then
            start()
        end

        now = Time.now
        if now - @updated > SPINNER_INTERVAL then
            @updated = now
            @state = (@state + 1) % SPINNER_STATES.length
            @io.print("\b", SPINNER_STATES[@state])
        end
    end
end

# Collects duplicate file sets in lieu of an Array, 
# and writes them as Ruby objects
class DefaultPrinter
    attr_accessor :io, :pretty

    def initialize
        @results = []
    end

    def pretty?
        return @pretty
    end

    def << (obj)
        @results << obj
    end

    def print_results
        @results.sort!
        print(@results)
    end

    protected

    def print(obj)
        @io.write(obj)
        @io.write("\n");
    end
end

# Printer class (see above) that serializes as JSON
class JsonPrinter < DefaultPrinter
    def print(obj)
        if @pretty then
            text = JSON.pretty_generate(obj)
        else
            text = JSON.generate(obj)
        end
        @io.write(text)
        @io.write("\n")
    end
end

# Printer class (see above) that serializes as YAML.
class YamlPrinter < DefaultPrinter

    # :line_width setting so file names w/spaces don't break across lines
    DEFAULT_LINE_WIDTH = 4096

    def print(obj)
        opts = {:line_width => DEFAULT_LINE_WIDTH} 
        if @pretty then
            # {:canonical => true} looks almost identical to pretty-printed JSON
            # so the header keeps them distinct.
            opts[:header] = true
            opts[:canonical] = true
        end
        YAML.dump(@results, @io, opts)
    end
end


# Whether `path_i` and `path_j` refer to duplicate but not identical files
def duplicate_files?(path_i, path_j)
    return (File.exist?(path_i) \
            and File.exist?(path_j) \
            and not File.identical?(path_i, path_j) \
            and FileUtils.cmp(path_i, path_j))
end

def prunable?(path, prune=[])
    name = File.basename(path)
    return (prune.include?(name) or File.fnmatch('._*', name))
end

# Recurse through array of `dirs` and produce Hash of all files by size.
def files_by_size(dirs, prune=[], verbose=false)
    result = Hash.new {|h,k| h[k]=[]}
    count = 0
    spinner = nil 
    if verbose then
        $stderr.print("Looking for files in ", dirs, ": ")
        spinner = Spinner.new($stderr)
        spinner.start
    end
    Find.find(*dirs) do |path|
        if prunable?(path, prune) then
            Find.prune()
        elsif File.file?(path) then
            size = File.size(path)
            if size > 0 then
                count = count + 1
                path.encode!(ENCODING_UTF8) if ENCODING_UTF8
                result[size] << path if size > 0
                if verbose then
                    count = count + 1
                    spinner.update
                end
            end
        end
    end
    if verbose then
        $stderr.print(" done!\n")

        $stderr.print("Found ", count, " non-empty files in ", 
                      result.size, " size groups.\n")
    end
    return result
end

# Compare each file in `paths` and append lists of equal files to `result`
def append_duplicates(result, paths)
    idsets = Hash.new {|h,k| h[k]=[k]}
    0.upto(paths.length - 1) do |i|
        (i+1).upto(paths.length - 1) do |j|
            path_i = paths[i]
            path_j = paths[j]
            if duplicate_files?(path_i, path_j) then
                set_i, set_j = idsets[path_i], idsets[path_j]
                set_i << path_j
                set_i.sort!
                set_j << path_i
                set_j.sort!
            end
        end
    end
    idsets.values.uniq.each do | s |
        result << s
    end
    return result
end

# Find all files in `fmap` and append lists of equal files to `result`
# `fmap` is a map of files by size.
def append_all_dupes(result, fmap, verbose=false)
    fmap.each_pair do |size, paths|
        if paths.size > 1 then
            $stderr.print("Comparing ", paths, " ...") if verbose
            append_duplicates(result, paths)
            $stderr.print("done.\n") if verbose
        end
    end
end

# Find all files in `srcdir`, compare to all files in `fmap`,
# and append lists of equal files to `result`.
# The path from `srcdir` will always be first in the list.
def append_dir_dupes(result, fmap, srcdir, prune=[], verbose=false)
    $stderr.print("Looking for files in '", srcdir, "':\n") if verbose
    Find.find(srcdir) do |path_i|
        if prunable?(path_i, prune) then
            Find.prune()
        elsif File.file?(path_i) then
            size = File.size(path_i)
            paths = fmap[size]
            if size > 0 and paths and not paths.empty? then
                dupes = []
                $stderr.print("Comparing '", path_i, 
                              "' to ", paths, " ...") if verbose
                paths.each do |path_j|
                    if duplicate_files?(path_i, path_j) then
                        dupes << path_j
                    end
                end
                if dupes.size > 0 then
                   dupes.sort!
                   result << [path_i, *dupes]
                end
                $stderr.print(" done.\n") if verbose
            end
        end
    end
    $stderr.print("Done!\n") if verbose
    return result
end

def parse_options(config)
    
    outfile = nil
    pclass = YamlPrinter
    pretty = false

    OptionParser.new do |opts|
        opts.banner = "Usage: duplicate-files.rb [options] dir1 [dir2 ...]"

        opts.on("-q", "--[no-]quiet", "Run absolutely quietly") do |v|
            config.info = (not v)
        end
        opts.on("-v", "--[no-]verbose", "Run verbosely") do |v|
            config.verbose = v
        end
        opts.on("-d", "--from [DIR]", 
                "Compare arguments only to files from DIR") do |d|
            config.canondir = d
        end
        opts.on("-o", "--output [OUTFILE]", 
                "Write standard output to OUTFILE") do |f|
            outfile = File.new(f, 'w')
        end
        opts.on("-j", "--json", "Write output as JSON") do |v|
            pclass = JsonPrinter
        end
        opts.on("-y", "--yaml", "Write output as YAML (default)") do |v|
            pclass = YamlPrinter 
        end
        opts.on("-p", "--[no-]pretty", "Pretty-print output") do |v|
            pretty = v
        end
    end.parse!

    printer = pclass.new

    printer.io = (outfile or $stdout)
    printer.pretty = pretty

    config.printer = printer
    config.dirs = ARGV
end

Options = Struct.new("Options", 
                     :info, :verbose, :canondir, :dirs, :prune, :printer)

def run
    opts = Options.new(true, false, nil, [], PRUNE, nil)

    parse_options(opts)

    fmap = files_by_size(opts.dirs, opts.prune, (opts.info or opts.verbose))

    if opts.info then
        $stderr.print("Comparing files ...")
    end

    if opts.canondir then
        append_dir_dupes(opts.printer, fmap, opts.canondir, 
                         opts.prune, opts.verbose)
    else
        append_all_dupes(opts.printer, fmap, opts.verbose)
    end

    if opts.info then
        $stderr.print(" done!\n")
    end

    opts.printer.print_results
end

run()

  1. Specifically 3.5.2, which is dated Jan 26 2021 and is several patch releases behind. ↩︎

  2. For security fixes only, until 2023-06-27. ↩︎

  3. Seriously, beginend and the like are much more readable, and you don’t have to worry so much about reindenting code you’ve factored into a new procedure or method. ↩︎

  4. A pull parser reads only the next grammatical element in the input, then returns control to the caller. (E.g. StAX in Java.) The more common “push” parser reads all the input in one go, and invokes event handling callbacks and/or builds a parse tree in memory. (E.g. SAX parsers.) Pull parsers take up less memory than a parse tree and can make for more readable code than a collection of event handlers, but anything more complicated than a simple, unambiguous recursive grammar like XML, JSON, or ELTN’s subset of Lua is virtually impossible. ↩︎ ↩︎

  5. I found one called screenplain, but I ran it over an old fanfic in Fountain format and the results were kind of ugly. The “bare” HTML looks OK – what is an <H6> tag though? – but the syntax mandates that unlike Markdown parsers preserve line breaks. (So there go those habits. Maybe an extension where if a line ends with \ the line break is ignored?) Still, as a practice project the idea is simple enough … I hope. ↩︎

  6. And maybe out of necessity. As far as I can tell, Python lacks an equivalent to the find module which emulates the Unix utility of the same name. ↩︎

  7. Released 2007-12-25, according to Wikipedia. The first version I used was probably 1.8, released in 2003 and the subject of Thomas and Hunt’s Programming Ruby 1st edition, a.k.a. the Pickaxe Book. (FWIW the current version of Ruby is 3.1.3, released two days ago.) ↩︎

  8. It first groups files according to file size, then compares all combinations of files with the same size to create lists of files that have exactly the same bytes. ↩︎