Relearning Python #0: Introduction

Posted: 2022-11-26
Last Modified: 2023-02-09
Word Count: 1133
Tags: programming python ruby

Python is a popular programming language, on a par with C++ and Java, especially in the numerical analysis community. I thought it would behoove me to pick it up again.

Unfortunately the last Python version I used for a job was 2.0, back in 2003. The most recently released version is 3.11. Python 3.5, installed by default on my laptop, is no longer supported[1], and Python 3.7 is the earliest one that is still supported[2]. Python 2.7, used by some software on my laptop, is apparently way old.

If I recall, the jump from 2.x to 3.x was supposed to be a big deal, so I have a lot of relearning to do.

My History With Python

In maybe 1996 or 1997 – I think I was still living in Chicago – I read the first edition of Mark Lutz’s 900-page tome Programming Python, which covered Python 1.3. After seeing the mess that was Perl I found Python’s syntax familiar, its object orientation slightly clunky but usable, and its indentation-based syntax clever.

Since then I discovered other scripting languages:

At the startup I worked at in 2001-2003 I even presented a little “tech talk” about scripting languages and how they might automate tasks. At one point I started developing a systems testing framework in Python 2.0, but honestly I didn’t think through all the issues or provide enough support for writing tests, so it was eventually replaced, first with our new software architect’s interpreted Java engine and then with actual JUnit tests in Java.

To be honest, the main reasons I forsook Python for Ruby were the use of indentation to convey program structure[3], which becomes onerous without editor support when moving code blocks around, and the clunky, badly integrated support for classes and objects. (Maybe 3.x fixed the latter problem?)

Still, if only to reawaken dormant parts of my brain, Python it is.

Potential Projects

While I’ll peruse the tutorial in the standard documentation, and maybe even buy a book, the best way for me to relearn Python – the syntax, the libraries, and the idioms – is to work on simple but useful projects. Here are some initial candidates:

  1. A script to recurse through nested directories and find duplicate files … which I’ve already written for Ruby.
  2. A pull parser[4] for JSON, a lightweight data interchange format that’s become standard across the Web.
  3. A pull parser[4] for ELTN, a lightweight data interchange format I just made up a few years ago.
  4. A script to convert Fountain documents – a simple text format for writing screenplays – into HTML.[5]

I’ll probably think of some better ones as I go – too many parsers and text processors – but for now I think I’ll start with #1 above, a duplicate files finder. For learning purposes[6] I’ll rewrite it from first principles … but it’s nice to know there’s a reference implementation.
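As an aside on projects 2 and 3, here is a rough idea of what the “pull” style could look like in Python, where generators hand back one element at a time and do no work until asked. This is only a sketch under loud assumptions: the grammar is a made-up subset (nested lists of integers, nothing like full JSON or ELTN), and the names pull_tokens and parse_value are hypothetical.

```python
import re

# Hypothetical toy grammar: brackets, commas, and integers only.
TOKEN = re.compile(r"\s*([\[\],]|-?\d+)")


def pull_tokens(text):
    """Yield one token at a time; input is scanned only when the caller asks."""
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = match.end()
        yield match.group(1)


def parse_value(tokens, token=None):
    """Build an int or a nested list, pulling tokens only as needed.

    Deliberately lenient: commas between items are optional here.
    """
    if token is None:
        token = next(tokens)
    if token != "[":
        return int(token)  # raises ValueError on a stray "]" or ","
    items = []
    token = next(tokens)
    while token != "]":
        items.append(parse_value(tokens, token))
        token = next(tokens)
        if token == ",":
            token = next(tokens)
    return items
```

Calling parse_value(pull_tokens("[1, [2, 3], []]")) builds the nested list, while calling next() on the generator directly lets the caller stop after any token, which is the part a push parser doesn’t offer.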

duplicate-files.rb

Modified 2023-02-09: I’ve moved the whole listing here, both to cut down the size of this article and because it’s easier to compare the comments below to the code on another page.

I’ve had this script for nearly(?) twenty years, as the reference to Ruby 1.9[7] attests.

It’s a command-line script that takes a list of directories. It traverses the directories, finds duplicate files[8], then lists all the sets of duplicates in YAML or JSON format. Calling duplicate-files.rb -h yields the following message:

The script has gotten a little crufty, and it could be simplified.
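For a sense of where the Python rewrite might end up, here is a minimal sketch of the same size-then-bytes idea. It is not a port of the Ruby script: the function name find_duplicates is hypothetical, symlinks are skipped by assumption, and instead of comparing all combinations of same-sized files it compares each file against one representative of every group found so far. It leans on os.walk and filecmp.cmp from the standard library.

```python
import filecmp
import os
from collections import defaultdict


def find_duplicates(roots):
    """Return lists of paths whose files hold exactly the same bytes."""
    # Step 1: group every regular file under the given roots by size.
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Assumption: symbolic links are ignored, not followed.
                if os.path.isfile(path) and not os.path.islink(path):
                    by_size[os.path.getsize(path)].append(path)

    # Step 2: within each size group, compare contents byte for byte.
    duplicates = []
    for paths in by_size.values():
        groups = []  # each group holds files with identical contents
        for path in sorted(paths):
            for group in groups:
                # shallow=False forces a real content comparison.
                if filecmp.cmp(group[0], path, shallow=False):
                    group.append(path)
                    break
            else:
                groups.append([path])
        duplicates.extend(g for g in groups if len(g) > 1)
    return duplicates
```

The result is a plain list of lists, so passing it to json.dumps would give output along the lines of the Ruby script’s JSON mode.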

Full code listing here.


  1. Specifically 3.5.2, which is dated Jan 26 2021 and is several patch releases behind. ↩︎

  2. For security fixes only, until 2023-06-27. ↩︎

  3. Seriously, begin … end and the like are much more readable, and you don’t have to worry so much about reindenting code you’ve factored into a new procedure or method. ↩︎

  4. A pull parser reads only the next grammatical element in the input, then returns control to the caller. (E.g. StAX in Java.) The more common “push” parser reads all the input in one go, and invokes event handling callbacks and/or builds a parse tree in memory. (E.g. SAX parsers.) Pull parsers take up less memory than a parse tree and can make for more readable code than a collection of event handlers, but parsing anything more complicated than a simple, unambiguous recursive grammar like XML, JSON, or ELTN’s subset of Lua is virtually impossible. ↩︎ ↩︎

  5. I found one called screenplain, but I ran it over an old fanfic in Fountain format and the results were kind of ugly. The “bare” HTML looks OK – what is an <H6> tag though? – but the syntax mandates that parsers, unlike Markdown ones, preserve line breaks. (So there go those habits. Maybe an extension where if a line ends with \ the line break is ignored?) Still, as a practice project the idea is simple enough … I hope. ↩︎

  6. And maybe out of necessity. As far as I can tell, Python lacks an equivalent to Ruby’s find module, which emulates the Unix utility of the same name. ↩︎

  7. Released 2007-12-25, according to Wikipedia. The first version I used was probably 1.8, released in 2003 and the subject of Thomas and Hunt’s Programming Ruby 1st edition, a.k.a. the Pickaxe Book. (FWIW the current version of Ruby is 3.1.3, released two days ago.) ↩︎

  8. It first groups files according to file size, then compares all combinations of files with the same size to create lists of files that have exactly the same bytes. ↩︎