
Tokenizing a path with Pest

I’m writing a small utility that organizes the files and directories in a folder based on a predefined pattern. The goal for the initial version is a simple config file that defines which folders to watch, where their contents should go, how they should be organized and named, and a few other details. It’s inspired by Hazel, an application I used a lot when macOS was my primary OS.

One of the biggest challenges was designing a way to utilize or inject details about the file into the filename without requiring the user to write a lot of information.

I started by creating a basic structure for a variable: {token:specifier:modifier}

token: A supported data point about the file or directory.
specifier: Picks which variant of the token to extract when it supports more than one (e.g. month:accessed is the month the file or directory was last accessed).
modifier: Changes the output of the token. Multiple modifiers are separated by a pipe (e.g. lowercase|uppercase).
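To make the three parts concrete, here is a minimal std-only Rust sketch of the data a variable carries. It is written by hand and is not the Pest parser from this post; `Variable`, `SPECIFIERS`, and `split_variable` are hypothetical names chosen for illustration, and the sketch assumes a field is a specifier only if it is one of the three names used in this post.

```rust
/// Hypothetical in-memory form of one `{token:specifier:modifier}` variable.
#[derive(Debug, PartialEq)]
struct Variable {
    token: String,
    specifier: Option<String>,
    modifiers: Vec<String>,
}

/// Specifier names that appear in this post (assumed to be the full set).
const SPECIFIERS: [&str; 3] = ["created", "modified", "accessed"];

/// Splits the text between the braces on ':' and sorts each field into place.
/// Names are lowercased so that matching ignores case.
fn split_variable(inner: &str) -> Option<Variable> {
    let mut fields = inner.split(':');
    // The first field is always the token; an empty token is rejected.
    let token = fields.next().filter(|t| !t.is_empty())?.to_lowercase();
    let mut variable = Variable { token, specifier: None, modifiers: Vec::new() };
    for field in fields {
        let field = field.to_lowercase();
        if SPECIFIERS.contains(&field.as_str()) {
            variable.specifier = Some(field);
        } else {
            // Anything else is treated as a pipe-separated modifier chain.
            variable
                .modifiers
                .extend(field.split('|').filter(|m| !m.is_empty()).map(str::to_string));
        }
    }
    Some(variable)
}
```

Calling `split_variable("month:accessed")` yields the token `month` with the specifier `accessed` and no modifiers. Anything else attached to the token, such as a bracketed list, is left in the token string untouched.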

I also added optional thresholds for some tokens, written in square brackets right after the token, which let you group items by age, for example.

A rename pattern might look like:

{year:created}/{month[6,12,24]:created}/{kind:uppercase|trim}

That creates a directory for the year the item was created, a subdirectory for the threshold it falls into (>6 months, >12 months, >24 months), and below that a directory for the kind, which is determined by the MIME type, converted to uppercase, and trimmed of surrounding whitespace.
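One way to read the threshold list is as a set of buckets. The sketch below is an assumption about how a bucket could be chosen, not the utility's actual logic: it takes an item's age in months and returns the largest threshold that age exceeds. The function name `threshold_bucket` is hypothetical.

```rust
/// Returns the largest threshold (in months) that `age_months` exceeds,
/// or `None` when the item is not older than any threshold.
/// Assumption: "exceeds" means strictly greater than, matching the ">6 months" wording.
fn threshold_bucket(age_months: u32, thresholds: &[u32]) -> Option<u32> {
    thresholds
        .iter()
        .copied()
        .filter(|&limit| age_months > limit)
        .max()
}
```

With the thresholds `[6, 12, 24]`, an item that is 7 months old lands in the 6-month bucket, one that is 30 months old lands in the 24-month bucket, and one that is 3 months old matches no bucket at all.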

Let’s take a look at how the parser handles this.

This is what a variable looks like in Pest. The $ marks the rule as compound-atomic, so no implicit whitespace is skipped between its parts:

variable   = ${ "{"{,1} ~ token ~ thresholds? ~ ":"? ~ specifier? ~ ":"? ~ modifiers? ~ "}"{,1} }

The token and specifier were relatively simple to write. Each is a list of names matched regardless of case (the ^ prefix makes a string literal case-insensitive in Pest).

token      =  { ^"year" | ^"month" | ^"day" | ^"mime" | ^"extension" | ^"kind" | ^"size" | ^"width" }
specifier  =  { ^"created" | ^"modified" | ^"accessed" }

The thresholds portion starts at an opening square bracket immediately after the token and can contain only comma-separated numbers.

threshold  =  { ASCII_DIGIT+ }
thresholds =  { "["? ~ (threshold+ ~ ","?)+ ~ "]"? }

Modifiers are defined the same way as tokens and specifiers, except that they can be chained with a pipe character. As written, the rule accepts at most two modifiers.

modifier   =  { ^"lowercase" | ^"uppercase" | ^"names" }
modifiers  =  { modifier ~ "|"? ~ modifier? }
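Once a modifier chain has been parsed, it still has to be applied to the token's value. The following std-only Rust sketch shows one way that could work, assuming modifiers run left to right. It covers only `lowercase` and `uppercase`, because the post does not describe what `names` does; `Modifier` and `apply_modifiers` are hypothetical names.

```rust
/// The two modifiers whose behaviour is clear from their names.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Modifier {
    Lowercase,
    Uppercase,
}

impl Modifier {
    /// Case-insensitive lookup, mirroring the `^"..."` matching in the grammar.
    fn from_name(name: &str) -> Option<Modifier> {
        match name.to_ascii_lowercase().as_str() {
            "lowercase" => Some(Modifier::Lowercase),
            "uppercase" => Some(Modifier::Uppercase),
            _ => None,
        }
    }

    /// Transforms a value according to this modifier.
    fn apply(self, value: &str) -> String {
        match self {
            Modifier::Lowercase => value.to_lowercase(),
            Modifier::Uppercase => value.to_uppercase(),
        }
    }
}

/// Splits a chain such as `lowercase|uppercase` and applies each step in order.
/// Names this sketch does not recognize are skipped.
fn apply_modifiers(value: &str, chain: &str) -> String {
    chain
        .split('|')
        .filter_map(Modifier::from_name)
        .fold(value.to_string(), |acc, modifier| modifier.apply(&acc))
}
```

For example, `apply_modifiers("Image", "uppercase")` returns `IMAGE`, and a chain of `uppercase|lowercase` ends up lowercase because the last modifier wins.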

The parser also needs to pass literal text through unchanged and preserve the path-like structure of the pattern.

text       =  { (CASED_LETTER | LETTER_NUMBER | CONNECTOR_PUNCTUATION | DASH_PUNCTUATION | INITIAL_PUNCTUATION | FINAL_PUNCTUATION | SPACING_MARK)+ }
item       =  { variable | text }
component  =  { "/"{,1}? ~ item+ }
path       =  { component+ }
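The nesting these rules describe (a path made of components, each made of text and variable items) can be pictured with a small hand-written Rust sketch. This is not how Pest processes the grammar and it is far less strict; it only illustrates the shape of the result. `Item`, `split_component`, and `split_path` are hypothetical names.

```rust
/// Hypothetical mirror of the grammar's `item` rule.
#[derive(Debug, PartialEq)]
enum Item {
    /// Literal text copied through unchanged.
    Text(String),
    /// The text found between `{` and `}`.
    Variable(String),
}

/// Splits one path component into literal text and `{...}` variables.
fn split_component(component: &str) -> Vec<Item> {
    let mut items = Vec::new();
    let mut rest = component;
    while !rest.is_empty() {
        match (rest.find('{'), rest.find('}')) {
            (Some(open), Some(close)) if open < close => {
                if open > 0 {
                    items.push(Item::Text(rest[..open].to_string()));
                }
                items.push(Item::Variable(rest[open + 1..close].to_string()));
                rest = &rest[close + 1..];
            }
            // No complete `{...}` pair left: keep the remainder as plain text.
            _ => {
                items.push(Item::Text(rest.to_string()));
                rest = "";
            }
        }
    }
    items
}

/// Splits a whole pattern on `/`, giving one list of items per component.
fn split_path(pattern: &str) -> Vec<Vec<Item>> {
    pattern
        .split('/')
        .filter(|component| !component.is_empty())
        .map(split_component)
        .collect()
}
```

A pattern such as `{year:created}/backup-{kind}` comes back as two components: the first holds a single variable, the second holds the text `backup-` followed by a variable.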

This is what the complete grammar looks like.

token      =  { ^"year" | ^"month" | ^"day" | ^"mime" | ^"extension" | ^"kind" | ^"size" | ^"width" }
threshold  =  { ASCII_DIGIT+ }
thresholds =  { "["? ~ (threshold+ ~ ","?)+ ~ "]"? }
modifier   =  { ^"lowercase" | ^"uppercase" | ^"names" }
modifiers  =  { modifier ~ "|"? ~ modifier? }
specifier  =  { ^"created" | ^"modified" | ^"accessed" }
variable   = ${ "{"{,1} ~ token ~ thresholds? ~ ":"? ~ specifier? ~ ":"? ~ modifiers? ~ "}"{,1} }
text       =  { (CASED_LETTER | LETTER_NUMBER | CONNECTOR_PUNCTUATION | DASH_PUNCTUATION | INITIAL_PUNCTUATION | FINAL_PUNCTUATION | SPACING_MARK)+ }
item       =  { variable | text }
component  =  { "/"{,1}? ~ item+ }
path       =  { component+ }

This turned out to be far more flexible and readable than a regex-based approach. Another popular Rust parsing library worth exploring is nom, which is built around parser combinators.

© 2023 David Benjamin