Lightweight linting with tree-sitter
Tree-sitter queries let you search for patterns in syntax trees, much as a regex searches for patterns in text. Combine them with some Rust glue and you can write simple, custom linters.
Tree-sitter syntax trees
Here is a quick crash course on syntax trees generated by tree-sitter. Syntax trees produced by tree-sitter are represented by S-expressions. The generated S-expression for the following Rust code:
would be:
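As a sketch, here is a small program and the tree it should yield, held as Rust string constants the way a linter would hold source text. The constant names are made up for illustration, and the tree assumes the node and field names of the tree-sitter-rust grammar.

```rust
/// Hypothetical sample program: one function holding one `let` binding.
const SAMPLE_SOURCE: &str = r#"fn main() {
    let x = 2;
}"#;

/// The S-expression tree-sitter-rust is expected to produce for
/// `SAMPLE_SOURCE` (indentation added for readability).
const SAMPLE_TREE: &str = r#"(source_file
  (function_item
    name: (identifier)
    parameters: (parameters)
    body: (block
      (let_declaration
        pattern: (identifier)
        value: (integer_literal)))))"#;
```

Each parenthesised name is a node, and each `name:` prefix is a field that labels a child of its parent node.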
Syntax trees generated by tree-sitter have another cool property: they are lossless syntax trees (also called concrete syntax trees, or CSTs), so the original source code can be regenerated from the tree in its entirety. Consider the following addition to our example:
The tree-sitter syntax tree preserves the comment, while the typical abstract syntax tree wouldn’t:
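To make the difference concrete, this sketch pairs a commented program with the tree that tree-sitter-rust is expected to produce for it. Both constant names are hypothetical, and the node names are assumed from the tree-sitter-rust grammar.

```rust
/// Hypothetical sample program with a line comment inside the function body.
const COMMENTED_SOURCE: &str = r#"fn main() {
    // a comment goes here
    let x = 2;
}"#;

/// Expected tree: the comment survives as a `line_comment` node that sits
/// next to the `let_declaration` inside the block.
const COMMENTED_TREE: &str = r#"(source_file
  (function_item
    name: (identifier)
    parameters: (parameters)
    body: (block
      (line_comment)
      (let_declaration
        pattern: (identifier)
        value: (integer_literal)))))"#;
```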
Tree-sitter queries
Tree-sitter provides a DSL to match over CSTs. These queries resemble our S-expression syntax trees. Here is a query to match all line comments in a Rust CST:
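A minimal version of that query, stored in a Rust string constant so it can later be handed to the bindings (the constant name is an assumption):

```rust
/// Matches every `line_comment` node, wherever it appears in the tree.
const LINE_COMMENT_QUERY: &str = "(line_comment)";
```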
Neat, eh? But don’t take my word for it; give it a go on the tree-sitter playground by typing in a query of your own.
Here’s another, which matches let declarations that bind an integer literal to an identifier:
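One way to write it, assuming the `let_declaration` node and the `pattern` and `value` field names used by tree-sitter-rust (the constant name is hypothetical):

```rust
/// Matches `let <identifier> = <integer literal>;` declarations.
const LET_INTEGER_QUERY: &str = r#"
(let_declaration
  pattern: (identifier)
  value: (integer_literal))
"#;
```

The query has the same shape as the subtree it is meant to find, minus the parts we do not care about.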
We can capture nodes into variables:
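A capture is written as `@` followed by a name, placed right after the node it should bind. In this sketch the capture name `@name` and the constant name are arbitrary choices:

```rust
/// Same shape as a plain `let` query, but the bound identifier is
/// captured as `@name` so later code (or a predicate) can refer to it.
const LET_CAPTURE_QUERY: &str = r#"
(let_declaration
  pattern: (identifier) @name
  value: (integer_literal))
"#;
```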
We can also apply predicates to captures. The #match? predicate, for example, checks whether a capture’s text matches a regex:
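As an illustration, the sketch below only matches bindings whose name starts with an underscore. The regex, the capture name and the constant name are all assumptions chosen for the example:

```rust
/// The pattern and its predicate are wrapped in one outer pair of
/// parentheses; `#match?` then tests the text captured as `@name`.
const UNDERSCORE_LET_QUERY: &str = r#"
((let_declaration
   pattern: (identifier) @name
   value: (integer_literal))
 (#match? @name "^_"))
"#;
```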
Exhibit indifference, as a stoic programmer would, with the wildcard pattern:
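The wildcard is spelled `(_)` and stands for any named node. A sketch, with a hypothetical constant name:

```rust
/// Matches a `let` binding to an identifier regardless of what kind of
/// expression appears on the right-hand side.
const LET_ANYTHING_QUERY: &str = r#"
(let_declaration
  pattern: (identifier)
  value: (_))
"#;
```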
The documentation does the tree-sitter query DSL more justice, but we now know enough to write our first lint.
Write you a tree-sitter lint
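String literals passed straight to std::env functions are error-prone: the key gets retyped at every call site, and a typo in any copy still compiles. I prefer binding the key to a constant and passing that instead: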
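The two styles side by side, as a small sketch. The key `RAW_STRING` and the function names are placeholders; `std::env::var` is used because it is one of the functions the lint targets.

```rust
use std::env::{self, VarError};

/// Error-prone: the key is a bare string literal at the call site.
fn read_with_literal() -> Result<String, VarError> {
    env::var("RAW_STRING")
}

/// Preferred: the key lives in one named constant.
const KEY: &str = "RAW_STRING";

fn read_with_const() -> Result<String, VarError> {
    env::var(KEY)
}
```

Both functions behave identically at runtime; the difference is that the second one gives the key a single definition to get right.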
Let’s write a lint to find std::env functions that use strings. Put aside the effectiveness of this lint for the moment, and take a stab at writing a tree-sitter query. For reference, a function call like so:
Produces the following S-expression:
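A sketch of such a call and its expected subtree, assuming the node names of the tree-sitter-rust grammar. The constant names and the `RAW_STRING` key are placeholders, and some grammar versions may nest further nodes inside `string_literal`.

```rust
/// A fully qualified call with a single string literal argument.
const CALL_SOURCE: &str = r#"std::env::remove_var("RAW_STRING")"#;

/// Expected subtree: the nested `scoped_identifier` nodes spell out the
/// path `std::env::remove_var`, and `arguments` holds the string literal.
const CALL_TREE: &str = r#"(call_expression
  function: (scoped_identifier
    path: (scoped_identifier
      path: (identifier)
      name: (identifier))
    name: (identifier))
  arguments: (arguments
    (string_literal)))"#;
```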
We are definitely looking for a call_expression:
Whose function name matches std::env::var or std::env::remove_var at the very least (I know, I know, this isn’t the most optimal regex):
Let’s make that std:: prefix optional:
And ensure that the argument is a string literal:
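Put together, the steps might look like the constants below. This is a reconstruction under assumptions: the capture name `@fn-name`, the anchored regex and the constant names are illustrative. The `is_env_fn` helper is a plain-string stand-in for the regex, useful for checking which callee names the predicate is meant to accept without pulling in any crate.

```rust
/// Step 1: any call expression at all.
const STEP_ANY_CALL: &str = "(call_expression)";

/// Step 2: capture the callee and constrain its text with `#match?`.
const STEP_NAMED_CALL: &str = r#"
((call_expression
   function: (_) @fn-name)
 (#match? @fn-name "^std::env::(var|remove_var)$"))
"#;

/// Steps 3 and 4: optional `std::` prefix, string literal argument.
const ENV_STRING_QUERY: &str = r#"
((call_expression
   function: (_) @fn-name
   arguments: (arguments (string_literal)))
 (#match? @fn-name "^(std::)?env::(var|remove_var)$"))
"#;

/// Mirrors the final regex using only string operations.
fn is_env_fn(name: &str) -> bool {
    let rest = name.strip_prefix("std::").unwrap_or(name);
    matches!(rest, "env::var" | "env::remove_var")
}
```

The wildcard in `function: (_)` matters here: the callee may be a `scoped_identifier` of any depth, and the predicate inspects its full text.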
Running the linter
We could always plug our query into the web playground, but let’s go a step further:
cargo new --bin toy-lint
Add tree-sitter and tree-sitter-rust to your dependencies:
Let’s load in some Rust code to work with. As an ode to Gödel (Godel?), why not load in our linter itself:
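Reading a file into a `String` needs nothing beyond the standard library. A minimal sketch, where `load_source` is a hypothetical helper name:

```rust
use std::{fs, io, path::Path};

/// Reads a source file into memory as UTF-8 text.
fn load_source(path: impl AsRef<Path>) -> io::Result<String> {
    fs::read_to_string(path)
}
```

Inside the linter this would be called as `load_source("src/main.rs")`, assuming `cargo run` is invoked from the crate root so the relative path resolves.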
Most tree-sitter APIs require a reference to a Language struct. We will be working with Rust, if you haven’t already guessed:
Enough scaffolding, let’s parse some Rust:
The second argument to Parser::parse may be of interest. Tree-sitter has a feature that allows for quick reparsing of existing parse trees if they contain edits. If you do happen to want to reparse a source file, you can pass in the old tree:
Anyhow (hah!), now that we have a parse tree, we can inspect it:
Or better yet, run a query on it:
A QueryCursor is tree-sitter’s way of maintaining state as we iterate through the matches or captures produced by running a query on the parse tree. Observe:
We begin by passing our query to the cursor, followed by the “root node”, which is another way of saying “start from the top”, and lastly, the source itself. If you have already taken a look at the C API, you will notice that it takes no such last argument. The Rust bindings seem to require the source (known as the TextProvider) in order to provide predicate functionality such as #match? and #eq?.
Let’s try doing something with the matches:
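What to do with each match is up to you; printing where it starts is a reasonable first step. Tree-sitter reports positions as zero-based rows and columns, while editors usually count from one. The sketch below shows that conversion using a stand-in `Position` struct and a hypothetical `render_lint` helper, so it runs without the tree-sitter crate. The message wording is an assumption too.

```rust
/// Stand-in for the zero-based row/column pair reported for a node.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Position {
    row: usize,
    column: usize,
}

/// Renders a zero-based position as a one-based `file:line:column`
/// message, followed by the offending callee name.
fn render_lint(file: &str, start: Position, callee: &str) -> String {
    format!(
        "{}:{}:{}: string literal passed to `{}`",
        file,
        start.row + 1,
        start.column + 1,
        callee
    )
}
```

In the real loop, the row and column would come from the start position of each captured node, and the callee text from slicing the source with the node's byte range.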
Lastly, add the following line to your source code, to get the linter to catch something:
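Any call that hands a string literal straight to one of the targeted functions will do. A sketch, with a hypothetical function name and `HOME` as an arbitrary key:

```rust
/// Bait for the lint: a string literal passed directly to `std::env::var`.
fn bait() -> Option<String> {
    std::env::var("HOME").ok()
}
```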
And cargo run:
Thank you, tree-sitter!
Bonus
Keen readers will notice that I avoided std::env::set_var. That is because set_var is called with two arguments, a “key” and a “value”, unlike env::var and env::remove_var. As a result, it requires more juggling:
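A reconstruction of what such a query could look like, based on the description that follows it: an optional string literal, then a required one, with anchors before, between and after them. The capture name, regex and constant name are assumptions.

```rust
/// `.` anchors pin the string literals to the start and end of the
/// argument list; `?` makes the first literal optional.
const ENV_KEY_VALUE_QUERY: &str = r#"
((call_expression
   function: (_) @fn-name
   arguments: (arguments . (string_literal)? . (string_literal) .))
 (#match? @fn-name "^(std::)?env::(var|remove_var|set_var)$"))
"#;
```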
The interesting part of this query is the humble ., the anchor operator. Anchors constrain how child nodes may be positioned. In this case, they ensure that we match either exactly two string_literals that are adjacent siblings or exactly one string_literal with no siblings. Unfortunately, this query also matches the following invalid Rust code:
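Two such inputs are sketched below as strings, since they would not compile as Rust. The `expected_args` helper is not part of the lint described here; it is one hypothetical follow-up check that compares the callee against the number of arguments it should receive.

```rust
/// Calls with the wrong number of arguments that a query accepting
/// "one or two string literals" for every function would still match.
const FALSE_POSITIVES: [&str; 2] = [
    r#"std::env::var("KEY", "VALUE")"#, // `var` takes one argument
    r#"std::env::set_var("KEY")"#,      // `set_var` takes two
];

/// Hypothetical post-filter: how many arguments each function expects.
fn expected_args(fn_name: &str) -> Option<usize> {
    match fn_name.strip_prefix("std::").unwrap_or(fn_name) {
        "env::var" | "env::remove_var" => Some(1),
        "env::set_var" => Some(2),
        _ => None,
    }
}
```

Since the compiler rejects these calls anyway, the false positives are mostly harmless for a toy lint.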
Notes
The knowledge gained from mastering the query DSL carries over to any other language that has a tree-sitter grammar. This query, for Ruby, detects to_json methods that do not accept additional arguments:
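A sketch of such a query, kept in a Rust string constant like the others. It assumes that the tree-sitter-ruby grammar exposes a `method` node with `name` and `parameters` fields, and it uses the DSL's negated-field syntax (`!parameters`) to require that no parameter list is present. The capture and constant names are made up.

```rust
/// Matches Ruby method definitions named `to_json` that declare no
/// parameters at all.
const RUBY_TO_JSON_QUERY: &str = r#"
((method
   name: (identifier) @fn
   !parameters)
 (#eq? @fn "to_json"))
"#;
```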
All in all, the query DSL does a great job of lowering the barrier to writing language tools.