Relearning Python #0: Introduction | Frank Mitchell's Blog

Python is a popular programming language, on a par with C++ and Java, especially in the numerical analysis community. I thought it would behoove me to pick it up again.

Unfortunately the last Python version I used for a job was 2.0, back in 2003. The most recently released version was 3.11. Python 3.5, installed by default on my laptop, is no longer supported¹, and Python 3.7 is the earliest one that is supported². Python 2.7, used by some software on my laptop, is apparently way old.

If I recall, the jump from 2.x to 3.x was supposed to be a big deal, so I have a lot of relearning to do.

My History With Python

In maybe 1996 or 1997 – I think I was still living in Chicago – I read the first edition of Mark Lutz’s 900-page tome Programming Python, which covered Python 1.3. After seeing the mess that was Perl I found Python’s syntax familiar, its object orientation slightly clunky but usable, and its indentation-based syntax clever.

Since then I discovered other scripting languages:

Ruby, a more thoroughly object-oriented language that has been gradually shedding its earlier Perlisms.
Lua, which hypothetical readers will have seen before: a very small, very embeddable, and extensible scripting language.
Tcl/Tk, one of the earlier scripting languages which was really big at Sun but which seems like a glorified command-line processor and not like a “real” language at all.
Various versions of Scheme like Guile and tinyscheme, which honestly I never really got into because of Lots of Irritating Silly Parentheses.

At the startup I worked at in 2001-2003 I even presented a little “tech talk” about scripting languages and how they might automate tasks. At one point I started developing a systems testing framework in Python 2.0, but honestly I didn’t think through all the issues or provide enough support for writing tests, so it was eventually replaced with first our new software architect’s interpreted Java engine and then actual JUnit tests in Java.

To be honest, the main reasons I forsook Python for Ruby were the use of indentation to convey program structure³, which becomes onerous without editor support when moving code blocks around, and the clunky, badly integrated support for classes and objects. (Maybe 3.x fixed the latter problem?)

Still, if only to reawaken dormant parts of my brain, Python it is.

Potential Projects

While I’ll peruse the tutorial in the standard documentation, and maybe even buy a book, the best way for me to relearn Python – the syntax, the libraries, and the idioms – is to work on simple but useful projects. Here are some initial candidates:

A script to recurse through nested directories and find duplicate files … which I’ve already written for Ruby.
A pull parser⁴ for JSON, a lightweight data interchange format that’s become standard across the Web.
A pull parser⁴ for ELTN, a lightweight data interchange format I just made up a few years ago.
A script to convert Fountain documents – a simple text format for writing screenplays – into HTML.⁵

I’ll probably think of some better ones as I go – too many parsers and text processors – but for now I think I’ll start with #1 above, a duplicate files finder. For learning purposes⁶ I’ll rewrite it from first principles … but it’s nice to know there’s a reference implementation.
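
For the parser candidates, the "pull" style might be sketched in Python with a generator: the caller asks for one element at a time instead of registering callbacks. This is only a rough sketch of the tokenizing layer, not a complete JSON parser. The names `TOKEN` and `tokens` are made up for illustration, and the pattern does not validate string escapes.

```python
import re
from typing import Iterator, Tuple

# Hypothetical pattern for JSON's lexical elements (an assumption for this
# sketch; it is looser than the JSON specification).
TOKEN = re.compile(r"""
    \s*(?:
        (?P<punct>[{}\[\],:])
      | (?P<string>"(?:[^"\\]|\\.)*")
      | (?P<number>-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)
      | (?P<literal>true|false|null)
    )
""", re.VERBOSE)


def tokens(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (kind, lexeme) pairs lazily; nothing is read until the caller asks."""
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = match.end()
        # lastgroup names whichever alternative matched.
        yield match.lastgroup, match.group(match.lastgroup)


if __name__ == "__main__":
    stream = tokens('{"a": [1, true]}')
    print(next(stream))  # ('punct', '{')
    print(next(stream))  # ('string', '"a"')
```

Each call to `next()` reads exactly one token and hands control back to the caller, which is the property that distinguishes a pull parser from a callback-driven one.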

`duplicate-files.rb`

Modified 2023-02-09: I’ve moved the whole listing here, both to cut down the size of this article and because it’s easier to compare the comments below to the code on another page.

I’ve had this script for nearly(?) twenty years, as the reference to Ruby 1.9⁷ attests.

It’s a command line script that takes a list of directories. It traverses the directories, finds duplicate files⁸, then lists all the sets of duplicates in YAML or JSON formats. Calling duplicate-files.rb -h yields the following message:

The script has gotten a little crufty, and it could be simplified.

The Spinner class was a somewhat over-elaborate means to create an ASCII spinning wheel when traversing directories. Without it, on slower machines or larger directory structures the process seems to hang.
The Printer objects were a failed attempt to print results incrementally. They could be replaced with a simple Array to collect results and functions with flags.
The -d option came about when I wanted to keep only the versions in a specific directory. A script I wrote to automate the deletions would always keep the first path in a list and delete the rest. By default paths in each duplicate list are sorted alphabetically. With this option the “canonical” directory always appears first.
“Verbose” mode is essentially for debugging, which is why one can engage both “quiet” and “verbose” mode.
append_duplicates could be shorter if I had come up with better data structures.
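
The ordering behind the -d option could be expressed in Python as a sort key. This is a minimal sketch, not the Ruby script's actual logic: `order_duplicates` and `canonical_dir` are hypothetical names, and the prefix test assumes the paths and the directory are written in the same form (both relative or both absolute).

```python
import os


def order_duplicates(paths, canonical_dir=None):
    """Sort paths alphabetically, with any under canonical_dir listed first."""
    prefix = None
    if canonical_dir is not None:
        # Trailing separator so "keep" does not also match "keepsake/...".
        prefix = os.path.normpath(canonical_dir) + os.sep

    def sort_key(path):
        in_canonical = (
            prefix is not None
            and os.path.normpath(path).startswith(prefix)
        )
        # False sorts before True, so canonical paths come first.
        return (not in_canonical, path)

    return sorted(paths, key=sort_key)
```

A deletion script that keeps the first entry of each list would then keep the copy in the canonical directory.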

Full code listing here.
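
As a starting point for the Python rewrite, the approach described above (group files by size, then compare contents) might be sketched like this. It is a sketch under assumptions, not a port of the Ruby script: `find_duplicates` is a hypothetical name, symbolic links are skipped, and each file is compared against one representative per group rather than against every other file of the same size. Traversal uses `os.walk` and the byte comparison uses `filecmp.cmp`, both from the standard library.

```python
import filecmp
import os
from collections import defaultdict


def find_duplicates(roots):
    """Return sorted lists of paths whose contents are byte-for-byte equal."""
    # Pass 1: group regular files by size; files of different sizes
    # cannot be duplicates, so this avoids most content comparisons.
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    by_size[os.path.getsize(path)].append(path)

    # Pass 2: within each size group, compare contents.
    duplicate_sets = []
    for paths in by_size.values():
        groups = []
        for path in sorted(paths):
            for group in groups:
                # shallow=False forces a comparison of the file contents.
                if filecmp.cmp(group[0], path, shallow=False):
                    group.append(path)
                    break
            else:
                groups.append([path])
        duplicate_sets.extend(g for g in groups if len(g) > 1)
    return sorted(duplicate_sets)
```

Collecting everything into a plain list before printing sidesteps the incremental-output problem that the Printer objects tried to solve.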

Specifically 3.5.2, which is dated Jan 26 2021 and is several patch releases behind. ↩︎
For security fixes only, until 2023-06-27. ↩︎
Seriously, begin … end and the like are much more readable, and you don’t have to worry so much about reindenting code you’ve factored into a new procedure or method. ↩︎
A pull parser reads only the next grammatical element in the input, then returns control to the caller. (E.g. StAX in Java.) The more common “push” parser reads all the input in one go, and invokes event handling callbacks and/or builds a parse tree in memory. (E.g. SAX parsers.) Pull parsers take up less memory than a parse tree and can make for more readable code than a collection of event handlers, but using one to parse anything more complicated than a simple, unambiguous recursive grammar like XML, JSON, or ELTN’s subset of Lua is virtually impossible. ↩︎ ↩︎
I found one called screenplain, but I ran it over an old fanfic in Fountain format and the results were kind of ugly. The “bare” HTML looks OK – what is an <H6> tag though? – but the syntax mandates that parsers, unlike Markdown parsers, preserve line breaks. (So there go those habits. Maybe an extension where if a line ends with \ the line break is ignored?) Still, as a practice project the idea is simple enough … I hope. ↩︎
And maybe out of necessity. As far as I can tell, Python lacks an equivalent to Ruby’s find module, which emulates the Unix utility of the same name. ↩︎
Released 2007-12-25, according to Wikipedia. The first version I used was probably 1.8, released in 2003 and the subject of Thomas and Hunt’s Programming Ruby 1st edition, a.k.a. the Pickaxe Book. (FWIW the current version of Ruby is 3.1.3, released two days ago.) ↩︎
It first groups files according to file size, then compares all combinations of files with the same size to create lists of files that have exactly the same bytes. ↩︎