Pattern Matching for Java

Gavin Bierman and Brian Goetz, April 2017

This document explores a possible direction for supporting pattern matching in the Java Language. This is an exploratory document only and does not constitute a plan for any specific feature in any specific version of the Java Language. This document also may reference other features under exploration; this is purely for illustrative purposes, and does not constitute any sort of plan or commitment to deliver any of these features.

Motivation

Nearly every program includes some sort of logic that combines testing if an expression has a certain type or structure, and then conditionally extracting components of its state for further processing. For example, all Java programmers are familiar with the instanceof-and-cast idiom:

if (obj instanceof Integer) {
    int intValue = ((Integer) obj).intValue();
    // use intValue
}

There are three things going on here: a test (is obj an Integer), a conversion (casting obj to Integer), and a destructuring (extracting the intValue component from the Integer). This pattern is straightforward and understood by all Java programmers, but is suboptimal for several reasons. It is tedious; doing both the type test and cast should be unnecessary (what else would you do after an instanceof test?). The accidental boilerplate of casting and destructuring obfuscates the more significant logic that follows. But most importantly, the repetition provides opportunities for errors to creep unnoticed into programs.

This problem gets worse when we want to test against multiple possible target types. We can extend the above example using a chain of if...else tests:

String formatted = "unknown";
if (obj instanceof Integer) {
    int i = (Integer) obj;
    formatted = String.format("int %d", i);
}
else if (obj instanceof Byte) {
    byte b = (Byte) obj;
    formatted = String.format("byte %d", b);
}
else if (obj instanceof Long) {
    long l = (Long) obj;
    formatted = String.format("long %d", l);
}
else if (obj instanceof Double) {
    double d = (Double) obj;
    formatted = String.format("double %f", d);
}
else if (obj instanceof String) {
    String s = (String) obj;
    formatted = String.format("String %s", s);
}
...

The above code is familiar, but has many undesirable properties. As already mentioned, repeating the cast in each arm is an irritation. The business logic can too easily get lost in the boilerplate. But most importantly, the approach allows coding errors to remain hidden -- because we've used an overly-general control construct. The intent of the above code is to assign something to formatted in each arm of the if...else chain. But, there is nothing here that enables the compiler to verify this actually happens. If some block -- perhaps one that is executed rarely in practice -- forgets to assign to formatted, we have a bug. (Leaving formatted as a blank local or blank final would at least enlist the "definite assignment" analysis in this effort, but this is not always done.) Finally, the above code is less optimizable; absent compiler heroics, it will have O(n) time complexity, even though the underlying problem is often O(1).
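
The parenthetical about definite assignment can be made concrete. The sketch below is ordinary Java; the class and method names are illustrative and not part of the original example. Declaring formatted as a blank final local means the compiler rejects the final read unless every path through the chain has assigned it.

```java
class DefiniteAssignmentDemo {
    static String describe(Object obj) {
        final String formatted;   // blank final: deliberately no initial value
        if (obj instanceof Integer) {
            int i = (Integer) obj;
            formatted = String.format("int %d", i);
        } else if (obj instanceof String) {
            String s = (String) obj;
            formatted = String.format("String %s", s);
        } else {
            // Removing this assignment turns the return below into a compile
            // error: "variable formatted might not have been initialized".
            formatted = "unknown";
        }
        return formatted;
    }

    public static void main(String[] args) {
        System.out.println(describe(42));    // int 42
        System.out.println(describe("hi"));  // String hi
        System.out.println(describe(3.5));   // unknown
    }
}
```

This helps only when the author remembers to omit the initializer; the original listing, which starts from "unknown", gets no such check.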

There have been plenty of ad-hoc suggestions for ameliorating these problems, such as flow typing (where the type of obj after an instanceof Integer test is refined in any control path dominated by the test, so that the cast is unneeded), or type switch (where the case labels of a switch statement can specify types as well as constants). But these are mostly band-aids; there's a better alternative that subsumes these (and other cases).

Patterns

Rather than reach for ad-hoc solutions, we believe it is time for Java to embrace pattern matching. Pattern matching is a technique that has been adapted to many different styles of programming languages going back to the 1960s, including text-oriented languages like SNOBOL4 and AWK, functional languages like Haskell and ML, and more recently extended to object-oriented languages like Scala (and most recently, C#).

A pattern is a combination of a predicate that can be applied to a target, along with a set of binding variables that are extracted from the target if the predicate applies to it. One form of binding pattern is a type test pattern, illustrated below (in conjunction with a theoretical matches operator):

if (x matches Integer i) {
    // can use i here
}

The phrase Integer i is a type test pattern; the i is a declaration of a new variable, not a use of an existing one. (This takes some getting used to.) The target is tested to see if it is an instance of Integer, and if so, it is cast to Integer and the result is bound to the binding variable i.
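
The matches operator is hypothetical. As a runnable point of comparison, Java 16 and later accept a type test pattern on the right-hand side of instanceof, and it behaves much as described here. The sketch below assumes such a compiler; the class and method names are illustrative.

```java
class TypeTestPatternDemo {
    static String describe(Object x) {
        // Assumption: Java 16+ syntax. "Integer i" declares i, and i is in
        // scope only where the test is known to have succeeded.
        if (x instanceof Integer i) {
            return "Integer " + (i + 1);   // i is typed as Integer, no cast needed
        }
        return "not an Integer";
    }

    public static void main(String[] args) {
        System.out.println(describe(41));    // Integer 42
        System.out.println(describe("41"));  // not an Integer
    }
}
```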

If we had such a construct, we could simplify our if...else chain above with type test patterns, eliminating the casting and binding boilerplate:

String formatted = "unknown";
if (obj matches Integer i) {
    formatted = String.format("int %d", i);
}
else if (obj matches Byte b) {
    formatted = String.format("byte %d", b);
}
else if (obj matches Long l) {
    formatted = String.format("long %d", l);
}
else if (obj matches Double d) {
    formatted = String.format("double %f", d);
}
else if (obj matches String s) {
    formatted = String.format("String %s", s);
}
...

This is already a big improvement -- the business logic pops out much more clearly -- but we can do better.

Patterns in multi-way conditionals

The chain of if...else still has some redundancy we'd like to squeeze out, both because it gives bugs a place to hide, and makes readers work harder to understand what the code does. Specifically, the if (obj matches...) part is repeated. We'd like to say "choose the block which best describes the target object", and be guaranteed that exactly one of them will execute.

We already have a mechanism for a multi-armed equality test in the language -- switch. But switch is (at present) very limited. You can only switch on a small set of types -- numbers, strings, and enums -- and you can only test for exact equality against constants. But these limitations are mostly accidents of history; the switch statement is a perfect "match" for pattern matching. If we allow a case label to specify a pattern, we can express the above with switch:

String formatted;
switch (obj) {
    case Integer i: formatted = String.format("int %d", i); break;
    case Byte b:    formatted = String.format("byte %d", b); break;
    case Long l:    formatted = String.format("long %d", l); break;
    case Double d:  formatted = String.format("double %f", d); break;
    case String s:  formatted = String.format("String %s", s); break;
    default:        formatted = "unknown";
}
...

Now, the intent of the code is far clearer, because we're using the right control construct -- we're saying "the expression obj matches at most one of the following conditions, figure it out and execute the corresponding arm". As a bonus, it is more optimizable too; in this case we are more likely to be able to do the dispatch in O(1) time.

Whether this new construct should be called switch or something else (like match) is a side detail; while it is obviously consistent with the original spirit of switch, the switch statement comes with some baggage (fall through, scoping) that might hinder our abilities to extend it. This question will be explored further in a separate document.

Expression switch?

You may already have noticed that there is still some redundancy that can be squeezed out of our example, induced by another limitation of switch: that it is a statement, and therefore the case arms must be statements too. Again, we're using a more general control flow mechanism than we need; we'd like an expression form that is a generalization of the ternary conditional operator, where we're guaranteed that exactly one of N expressions is evaluated. Perhaps something like:

String formatted = exprswitch (obj) {
    case Integer i -> String.format("int %d", i);
    case Byte b    -> String.format("byte %d", b);
    case Long l    -> String.format("long %d", l);
    case Double d  -> String.format("double %f", d);
    case String s  -> String.format("String %s", s);
    default        -> "unknown";
};
...

This is more concise, but more importantly it is also safer -- we've enlisted the language's aid in ensuring that formatted is guaranteed assigned exactly once in every case, and the compiler can verify that the supplied cases are exhaustive. (Again, whether we can rescue the switch keyword here, or need a new linguistic form, is a separate question.)
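
The exprswitch keyword is a placeholder. For readers who want something they can compile, Java 21 and later allow type patterns as case labels in a switch expression, which gives roughly the shape sketched in this section. The version below assumes such a compiler; the class and method names are illustrative.

```java
class PatternSwitchDemo {
    static String format(Object obj) {
        // Assumption: Java 21+ pattern switch. The switch is an expression,
        // so exactly one arm supplies the result.
        return switch (obj) {
            case Integer i -> String.format("int %d", i);
            case Byte b    -> String.format("byte %d", b);
            case Long l    -> String.format("long %d", l);
            case Double d  -> String.format("double %f", d);
            case String s  -> String.format("String %s", s);
            default        -> "unknown";   // required: Object is not a closed set of types
        };
    }

    public static void main(String[] args) {
        System.out.println(format(7));      // int 7
        System.out.println(format(7L));     // long 7
        System.out.println(format("x"));    // String x
        System.out.println(format('c'));    // unknown
    }
}
```

Omitting the default arm here is a compile-time error, which is the exhaustiveness check this section asks for.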

Constant patterns

Interpreting case labels as patterns can be seen as a generalization of the existing semantics of switch. Currently, case labels can only be numeric, string, or enum constants; going forward, we can redefine these to be constant patterns, where matching the pattern means the obvious thing -- testing for equality against the constant. So we can freely mix and match constant patterns with type test patterns:

String s = exprswitch (num) {
    case 0 -> "zero";
    case 1 -> "one";
    case int i -> "some other Integer";
    default -> "not an Integer";
};

Digression -- Visitors

The visitor design pattern is commonly used to separate a traversal of a data structure from the definition of the data structure itself. For example, if the data structure is a tree representing a design in a CAD application, nearly every operation requires traversing at least some part of the data structure -- saving, printing, searching for text in element labels, computing weight or cost, validating design rules, etc. While we might start out by representing each of these operations as a virtual method on the root type, this quickly becomes unwieldy, and the visitor pattern enables us to decouple the code for any given traversal (say, searching for text in element labels) from the code that defines the data structure itself.

Consider this hierarchy for describing an arithmetic expression:

interface Node { }
class IntNode implements Node {
    int value;
}
class NegNode implements Node {
    Node node;
}
class MulNode implements Node {
    Node left, right;
}
class AddNode implements Node {
    Node left, right;
}

We can define a visitor over nodes (and inject some visitor goop into the node types):

interface NodeVisitor<T> {
    T visit(IntNode node);
    T visit(NegNode node);
    T visit(MulNode node);
    T visit(AddNode node);
}

Now we can define a visitor to evaluate a node:

class EvalVisitor implements NodeVisitor<Integer> {
    public Integer visit(IntNode node) {
        return node.value;
    }
    public Integer visit(NegNode node) {
        // visit the operand, not the NegNode itself
        return -node.node.accept(this);
    }
    public Integer visit(MulNode node) {
        return node.left.accept(this) * node.right.accept(this);
    }

    public Integer visit(AddNode node) {
        return node.left.accept(this) + node.right.accept(this);
    }
}

For a simple hierarchy and a simple traversal, this isn't terrible. We suffer some constant code overhead for being visitor-ready (every node class needs an accept method, and a single visitor interface), and thereafter we write one visitor per traversing operation. (As an added penalty, we have to box primitives returned by visitors.) But, visitors rightly have a reputation for being verbose, and as visitors get more complicated, it is common to have multiple levels of visitors involved in a single traversal.
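
The listings above leave out the accept method that every node class needs. A self-contained sketch of the whole arrangement follows. The accept signature, the constructors, and the enclosing VisitorDemo class are assumptions added to make it runnable; they are not taken from the original listings.

```java
class VisitorDemo {
    interface NodeVisitor<T> {
        T visit(IntNode node);
        T visit(NegNode node);
        T visit(MulNode node);
        T visit(AddNode node);
    }

    interface Node {
        // The "visitor goop": each node dispatches to the overload for its own type.
        <T> T accept(NodeVisitor<T> v);
    }

    static class IntNode implements Node {
        final int value;
        IntNode(int value) { this.value = value; }
        public <T> T accept(NodeVisitor<T> v) { return v.visit(this); }
    }

    static class NegNode implements Node {
        final Node node;
        NegNode(Node node) { this.node = node; }
        public <T> T accept(NodeVisitor<T> v) { return v.visit(this); }
    }

    static class MulNode implements Node {
        final Node left, right;
        MulNode(Node left, Node right) { this.left = left; this.right = right; }
        public <T> T accept(NodeVisitor<T> v) { return v.visit(this); }
    }

    static class AddNode implements Node {
        final Node left, right;
        AddNode(Node left, Node right) { this.left = left; this.right = right; }
        public <T> T accept(NodeVisitor<T> v) { return v.visit(this); }
    }

    static class EvalVisitor implements NodeVisitor<Integer> {
        public Integer visit(IntNode node) { return node.value; }
        public Integer visit(NegNode node) { return -node.node.accept(this); }
        public Integer visit(MulNode node) {
            return node.left.accept(this) * node.right.accept(this);
        }
        public Integer visit(AddNode node) {
            return node.left.accept(this) + node.right.accept(this);
        }
    }

    static int eval(Node n) {
        return n.accept(new EvalVisitor());
    }

    public static void main(String[] args) {
        // (2 + 3) * -4
        Node expr = new MulNode(new AddNode(new IntNode(2), new IntNode(3)),
                                new NegNode(new IntNode(4)));
        System.out.println(eval(expr));   // -20
    }
}
```

Four identical accept bodies and a four-method interface are the fixed cost; every further operation adds another visitor class of about the size of EvalVisitor.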

Destructuring patterns

We could certainly apply type-test patterns to simplifying the above visitor code, identifying when a particular Node is, say, an AddNode, and casting it to AddNode if so. But most traversals will want to extract the components of (destructure) the node as well; it would be nice if this were part of the mechanism too.

Our Node classes are simply "dumb" data carrier classes. As such, their construction process should be easily reversible; if we created an AddNode via new AddNode(x, y), we can recover x and y by looking at the fields of AddNode. We can generalize the type test pattern to subsume both type testing and extracting the state components; this is called a destructuring pattern. In a destructuring pattern for AddNode, in addition to casting the Node to AddNode, we can extract the components of AddNode into binding variables in one shot:

if (node matches AddNode(Node x, Node y)) {  ...  }

Here, the pattern matches the target if it is an AddNode, and if so, it casts the target to AddNode, and extracts the left and right components and binds them to the binding variables x and y. (How we declare the destructuring pattern in the AddNode class, or declare AddNode so that it implicitly acquires one, will be covered in a separate document.)

Here's how we'd rewrite the evaluation visitor using destructuring patterns:

int eval(Node n) {
    return exprswitch(n) {
        case IntNode(int i) -> i;
        case NegNode(Node n) -> -eval(n);
        case AddNode(Node left, Node right) -> eval(left) + eval(right);
        case MulNode(Node left, Node right) -> eval(left) * eval(right);
    };
}

This is more compact -- but more importantly, more transparent -- than the visitor equivalent. We don't need a visitor interface or a subclass of the visitor interface -- we can operate directly on the node types. We just need the node types to support destructuring patterns.
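
A compilable approximation of this eval is possible in Java 21 and later, where records supply the destructuring and switch accepts record patterns. The sketch below assumes that toolchain. Using records for the node types, the enclosing DestructuringDemo class, and the default arm are our assumptions, not part of the listing above.

```java
class DestructuringDemo {
    interface Node { }

    // Assumption: records stand in for the "dumb" data carrier classes;
    // each record can be taken apart by a pattern of the same shape.
    record IntNode(int value) implements Node { }
    record NegNode(Node node) implements Node { }
    record AddNode(Node left, Node right) implements Node { }
    record MulNode(Node left, Node right) implements Node { }

    static int eval(Node n) {
        return switch (n) {
            case IntNode(int i) -> i;
            case NegNode(Node inner) -> -eval(inner);
            case AddNode(Node left, Node right) -> eval(left) + eval(right);
            case MulNode(Node left, Node right) -> eval(left) * eval(right);
            // Node is an open interface here, so a catch-all arm is required.
            default -> throw new IllegalArgumentException("unknown node: " + n);
        };
    }

    public static void main(String[] args) {
        // (2 + 3) * -4
        Node expr = new MulNode(new AddNode(new IntNode(2), new IntNode(3)),
                                new NegNode(new IntNode(4)));
        System.out.println(eval(expr));   // -20
    }
}
```

No visitor interface, no accept methods, and the result is a plain int rather than a boxed Integer.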

Nested patterns

The destructuring pattern shown here is deceptively powerful. When we matched against AddNode(Node x, Node y), it may look like Node x and Node y are simply declarations of binding variables, but in fact they are patterns themselves!

The pattern AddNode(p1, p2), where p1 and p2 are patterns, matches a target if:

  • the target is an AddNode;
  • the left component of that AddNode matches p1;
  • the right component of that AddNode matches p2.

Because p1 and p2 are patterns, they may have their own binding variables; if the whole pattern matches, any binding variables in the subpatterns are also bound. So in:

if (node matches AddNode(Node x, Node y)) {  ...  }

the nested patterns are type-test patterns (which happen to be guaranteed to match if the target is an AddNode, because the left and right components of an AddNode are always Node.) So the effect is that we check if the target is an AddNode, and if so, immediately bind x and y to the left and right subtrees. This may sound complicated, but the effect is simple: we can match against an AddNode and bind its components in one go.

The var pattern

While it might be useful to explicitly use type-test patterns in this example (and the compiler can optimize them away based on static type information), it might sometimes be desirable to omit the manifest type and use a nested var pattern in place of the nested type-test patterns. A var pattern matches anything, and binds its target to a binding variable. This may sound silly -- and it is silly in itself -- but is very useful as a nested pattern. We can transform our eval into:

int eval(Node n) {
    return exprswitch(n) {
        case IntNode(var i) -> i;
        case NegNode(var n) -> -eval(n);
        case AddNode(var left, var right) -> eval(left) + eval(right);
        case MulNode(var left, var right) -> eval(left) * eval(right);
    };
}

This version is completely equivalent to the previous version -- it just lets the compiler fill in the type information for you. The choice of whether to use a nested type-test pattern or a nested var pattern is solely one of whether the manifest type adds to or detracts from readability and maintainability.

Nesting constant patterns

Earlier, we encountered constant patterns, as a means of expressing the traditional switch behavior. But constant patterns are also useful as nested patterns. For example, the following lets us differentiate between special kinds of points:

String formatted = exprswitch (anObject) {
    case Point(0, 0) -> "at origin";
    case Point(0, var y) -> "on y axis";
    case Point(var x, 0) -> "on x axis";
    case Point(var x, var y) -> String.format("[%d,%d]", x, y);
    default -> "not a point";
};

The pattern Point(0,0) will test if the target is a Point, and then further test whether both its x and y components match the constant pattern 0. If the target is a Point, but its components don't match the subpatterns, the match fails and we continue trying to find a match at the next case.
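
Java as shipped (21 and later) supports record patterns but does not accept constants nested inside them, so a runnable approximation has to move the constant tests into when guards. The sketch below assumes that toolchain; the Point record and the enclosing class are illustrative. The fall-through-to-the-next-case behaviour is the same as described above.

```java
class PointDemo {
    record Point(int x, int y) { }

    static String describe(Object o) {
        return switch (o) {
            // Assumption: "when" guards stand in for nested constant patterns.
            case Point(int x, int y) when x == 0 && y == 0 -> "at origin";
            case Point(int x, int y) when x == 0 -> "on y axis";
            case Point(int x, int y) when y == 0 -> "on x axis";
            case Point(int x, int y) -> String.format("[%d,%d]", x, y);
            default -> "not a point";
        };
    }

    public static void main(String[] args) {
        System.out.println(describe(new Point(0, 0)));  // at origin
        System.out.println(describe(new Point(0, 5)));  // on y axis
        System.out.println(describe(new Point(3, 0)));  // on x axis
        System.out.println(describe(new Point(3, 4)));  // [3,4]
        System.out.println(describe("hello"));          // not a point
    }
}
```

Case order matters: the origin case must come first, since both axis cases would also accept (0, 0).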

Nesting nontrivial patterns

The previous section already illustrated how patterns nest cleanly, but this can be taken much further. Consider a traversal of our Node hierarchy, where we want to simplify according to the algebraic rule for additive identity: that for any expression e, 0+e==e.

If we write this with Visitor, it starts to get ugly:

class SimplifyVisitor implements NodeVisitor<Node> {
    public Node visit(IntNode node) {
        return node;
    }
    public Node visit(NegNode node) {
        return new NegNode(node.node.accept(this));
    }
    public Node visit(MulNode node) {
        return new MulNode(node.left.accept(this), node.right.accept(this));
    }

    public Node visit(AddNode node) {
        if (node.left instanceof IntNode && ((IntNode) node.left).value == 0)
            return node.right.accept(this);
        else
            return new AddNode(node.left.accept(this), node.right.accept(this));
    }
}

This is already ugly -- and we're only handling one simplification rule -- and we're still cheating. In the visit(AddNode) method, we're still doing instanceof and casting, which is what the Visitor pattern was supposed to obviate. Nested patterns allow us to eliminate the complex visitor behavior, and express multiple levels of constraint in one place:

Node simplify(Node n) {
    return exprswitch(n) {
        case IntNode -> n;
        case NegNode(var n) -> new NegNode(simplify(n));
        case AddNode(IntNode(0), var right) -> simplify(right);
        case AddNode(var left, var right)
            -> new AddNode(simplify(left), simplify(right));
        case MulNode(var left, var right)
            -> new MulNode(simplify(left), simplify(right));
    };
}

The key part is this line:

case AddNode(IntNode(0), var right) -> simplify(right);

This pattern is nested three deep, and it only matches if all the levels match: first we test if the matchee is an AddNode, then we test if the AddNode's left component is an IntNode; then we test whether that IntNode's integer component is zero. If our target matches this complex pattern, we know we can simplify the AddNode to the simplification of its right component. Otherwise, we proceed to the next case, which matches any AddNode, which recursively simplifies the left and right subnodes.

Now let's scale this example up to more algebraic identities:

  • - - e == e
  • 0 + e == e
  • e + 0 == e
  • 0 * e == 0
  • e * 0 == 0
  • 1 * e == e
  • e * 1 == e

Our visitor version would very quickly get very nasty! But not so with the pattern-matching version. The following implements all of these rules, and the code reads much like the problem statement.

Node simplify(Node n) {
    return exprswitch(n) {
        case IntNode -> n;
        case NegNode(NegNode(var n)) -> simplify(n);
        case NegNode(var n) -> new NegNode(simplify(n));
        case AddNode(IntNode(0), var right) -> simplify(right);
        case AddNode(var left, IntNode(0)) -> simplify(left);
        case AddNode(var left, var right)
            -> new AddNode(simplify(left), simplify(right));
        case MulNode(IntNode(1), var right) -> simplify(right);
        case MulNode(var left, IntNode(1)) -> simplify(left);
        case MulNode(IntNode(0), var right) -> new IntNode(0);
        case MulNode(var left, IntNode(0)) -> new IntNode(0);
        case MulNode(var left, var right)
            -> new MulNode(simplify(left), simplify(right));
    };
}
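
The same rule set can be approximated in Java 21 and later. In the sketch below, records play the role of the node types, nested record patterns express the structural tests, and when guards replace the nested constants, which shipped Java does not accept inside record patterns. The record declarations, the enclosing class, and the default arm are assumptions made for the sake of a runnable example.

```java
class SimplifyDemo {
    interface Node { }
    record IntNode(int value) implements Node { }
    record NegNode(Node node) implements Node { }
    record AddNode(Node left, Node right) implements Node { }
    record MulNode(Node left, Node right) implements Node { }

    static Node simplify(Node n) {
        return switch (n) {
            case IntNode i -> i;
            case NegNode(NegNode(var inner)) -> simplify(inner);                     // - - e == e
            case NegNode(var inner) -> new NegNode(simplify(inner));
            case AddNode(IntNode(var v), var right) when v == 0 -> simplify(right);  // 0 + e == e
            case AddNode(var left, IntNode(var v)) when v == 0 -> simplify(left);    // e + 0 == e
            case AddNode(var left, var right)
                -> new AddNode(simplify(left), simplify(right));
            case MulNode(IntNode(var v), var right) when v == 1 -> simplify(right);  // 1 * e == e
            case MulNode(var left, IntNode(var v)) when v == 1 -> simplify(left);    // e * 1 == e
            case MulNode(IntNode(var v), var right) when v == 0 -> new IntNode(0);   // 0 * e == 0
            case MulNode(var left, IntNode(var v)) when v == 0 -> new IntNode(0);    // e * 0 == 0
            case MulNode(var left, var right)
                -> new MulNode(simplify(left), simplify(right));
            // Node is an open interface in this sketch, so a catch-all is required.
            default -> throw new IllegalArgumentException("unknown node: " + n);
        };
    }

    public static void main(String[] args) {
        // (0 + 5) * 1 simplifies to 5
        Node expr = new MulNode(new AddNode(new IntNode(0), new IntNode(5)), new IntNode(1));
        System.out.println(simplify(expr));   // IntNode[value=5]
    }
}
```

Records compare by value, so simplify(expr).equals(new IntNode(5)) holds for the expression in main.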

The _ pattern

Just as the var pattern matches anything and binds its target to a variable, the _ pattern matches anything -- and binds nothing. Again, this is not terribly useful as a standalone pattern, but is useful as a way of saying "I don't care about this component." Just as with the var pattern, this is entirely a matter of readability; if a subcomponent is not relevant to the matching, we can make this explicit by using a _ pattern. For example, we can rewrite the "multiply by zero" case from the above example using a _ pattern:

case MulNode(IntNode(0), _) -> new IntNode(0);

This is a way of saying that the right component is irrelevant to the matching logic, and doesn't need to be given a name.

Summary of patterns

We've now seen several kinds of patterns:

  • Type-test patterns, which bind the cast target to a binding variable;
  • Destructuring patterns, which destructure the target and recursively match the components to subpatterns;
  • Constant patterns, which match on equality;
  • Var patterns, which match anything and bind their target;
  • The _ pattern, which matches anything.

We've also seen several contexts in which patterns can be used:

  • An enhanced switch statement;
  • A switch-like expression;
  • A matches predicate.

There may be other kinds of patterns too, and other linguistic constructs that could also benefit from pattern matching in the future.

Data polymorphism vs class polymorphism

For a hierarchy like our Node classes, we have two choices of how to implement an operation like eval() -- as an instance method on Node, or as an "outboard" method such as we defined here, using a visitor or pattern match. Both are forms of polymorphic behavior -- and neither is intrinsically more correct than the other. For operations that are truly intrinsic to the hierarchy -- such as evaluating the expression represented by the hierarchy -- an instance method might make a lot of sense. For operations that are more ad-hoc, such as "does this expression contain any intermediate nodes that evaluate to 42", putting it into the definition of the node types themselves would surely be silly. And, as codebases grow, being able to specify operations separately from the data structures on which they operate often has other codebase-management benefits.

The primary value of the Visitor pattern is to allow operations on a (stable) hierarchy to be specified separately from the hierarchy, but that comes at a pretty significant cost -- visitor-based code is bulky, easy to get wrong, annoying to write, and annoying to read. Pattern matching with a pattern-aware switch statement often allows you to achieve the same result without the machinery of Visitors interposing themselves, often resulting in cleaner, simpler, more transparent code.

Exhaustiveness

In the expression form of switch, we evaluate exactly one arm of the switch, which becomes the value of the switch expression itself. This means that there must be at least one arm that applies to any input -- otherwise the value of the switch expression might be undefined. If the switch has a default arm, there's no problem. But for many hierarchies to which we might apply pattern matching -- like our Node classes -- we would be annoyed to have to include a never-taken default arm. We would like to be able to express that the only subtypes of Node are IntNode, AddNode, MulNode, and NegNode, so that the compiler can use this information to verify that a switch over these types is exhaustive.

There's an age-old technique we can apply here: hierarchy sealing. Suppose we declare our Node type to be sealed; this means that only the subtypes that are co-compiled with it (generally from a single compilation unit) can extend it:

sealed interface Node { }

Sealing is a generalization of finality; where a final type has no subtypes, a sealed type can have no subtypes beyond a fixed set of co-declared subtypes. The details of sealing will be the subject of a separate document.

Pattern matching and data classes

Pattern matching connects quite nicely with another feature currently under exploration, data classes. A data class is one where the author commits to the class being a transparent carrier for its data; in return, data classes can implicitly acquire destructuring patterns (as well as other useful artifacts such as constructors, equals(), hashCode(), etc.) We can define our Node hierarchy with data classes quite compactly:

sealed interface Node { }

data class IntNode(int value) implements Node { }
data class NegNode(Node node) implements Node { }
data class AddNode(Node left, Node right) implements Node { }
data class MulNode(Node left, Node right) implements Node { }
data class ParenNode(Node node) implements Node { }

We now know that the only subtypes of Node are the ones here, so the switch expressions in the examples above will benefit from exhaustiveness analysis, and not require a default arm. (Astute readers will observe that we have arrived at a powerful and well-known construct, algebraic data types, or ADTs. Data classes offer us a compact expression for the product portion of ADTs; sealing offers us the other half, sum types.)
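
Both halves have close counterparts in later Java releases: records (Java 16) for the data carriers and sealed interfaces (Java 17) for the closed hierarchy, with exhaustiveness checking of pattern switches arriving in Java 21. The sketch below assumes Java 21 or later and uses four node types for brevity; the enclosing class name is illustrative.

```java
class SealedDemo {
    // The permits clause lists every allowed subtype (the sum type).
    sealed interface Node permits IntNode, NegNode, AddNode, MulNode { }

    // Each record is a transparent carrier for its components (a product type).
    record IntNode(int value) implements Node { }
    record NegNode(Node node) implements Node { }
    record AddNode(Node left, Node right) implements Node { }
    record MulNode(Node left, Node right) implements Node { }

    static int eval(Node n) {
        // No default arm: the compiler can see that these four cases cover Node.
        // Deleting any one of them is a compile-time error.
        return switch (n) {
            case IntNode(int i) -> i;
            case NegNode(Node inner) -> -eval(inner);
            case AddNode(Node left, Node right) -> eval(left) + eval(right);
            case MulNode(Node left, Node right) -> eval(left) * eval(right);
        };
    }

    public static void main(String[] args) {
        // 1 + 2 * 3
        Node expr = new AddNode(new IntNode(1), new MulNode(new IntNode(2), new IntNode(3)));
        System.out.println(eval(expr));   // 7
    }
}
```

Adding a fifth record to the permits list makes every such switch fail to compile until it handles the new case, which is the safety property a never-taken default arm would hide.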

Scoping

Pattern-aware language constructs like matches have a new property: they may introduce binding variables from the middle of an expression. An obvious question is: what is the scope of those binding variables? Let's look at some examples.

if (x matches String s) {
    System.out.println(s);
}

Here, the binding variable s is used in the body of the if statement, which makes sense; by the time we're executing the body, the pattern must have matched, so s is well-defined, and we should include s in the set of variables that are in scope in the body of the if. We can extend this further:

if (x matches String s && s.length() > 0) {
    System.out.println(s);
}

This makes sense too; since && is short-circuiting, whenever we execute the second condition, the match has already succeeded, so s is again well-defined for this use, and we should include s in the set of variables that are in scope for the second subexpression of the conditional. On the other hand, if we replace the AND with an OR:

if (x matches String s || s.length() > 0) {  // error
    ...
}

we should expect an error; s is not well-defined in this context, since the match may not have succeeded in the second subexpression of the conditional. Similarly, s is not well-defined in the else-clause here:

if (x matches String s) {
    ...
}
else {
    // error
    System.out.println(s + " is not a string");
}

But, suppose our condition inverts the match:

if (!(x matches String s)) {
    ...
}
else {
    System.out.println(s + " is a string");
}

Here, we want s to be in scope in the else-arm; if it were not, we would not be able to freely refactor if-then-else blocks by inverting their condition and swapping the arms, which creates an undesirable asymmetry.

These examples are far from a precise explanation of the new scoping rules; they are intended only to give a sense of what is needed. Some generalization of the current notion of scoping is required to make this work; key new concepts include:

  • Binding variables may be defined from the middle of expressions;
  • The existing demarcation of scopes does not accurately describe the boundaries in which binding variables are valid, but we can define a new concept, "includes var in expr", which describes the desired effects;
  • These rules can be extended to cover all the language forms (for loops, while loops, try-catch-finally, switch, etc).
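
Rules of this kind did make it into the language for instanceof patterns in Java 16 and later, so the behaviour of the accepted examples can be checked directly. The sketch below assumes such a compiler and substitutes instanceof for the hypothetical matches; the class and method names are illustrative.

```java
class ScopingDemo {
    static String nonEmpty(Object x) {
        // s is in scope to the right of && because that side is evaluated
        // only after the match has succeeded.
        if (x instanceof String s && s.length() > 0) {
            return "non-empty string " + s;
        }
        return "something else";
    }

    static String inverted(Object x) {
        if (!(x instanceof String s)) {
            return "not a string";        // s is not in scope in this arm
        } else {
            return s + " is a string";    // s is in scope: the match succeeded
        }
    }

    public static void main(String[] args) {
        System.out.println(nonEmpty("hi"));   // non-empty string hi
        System.out.println(nonEmpty(""));     // something else
        System.out.println(inverted(5));      // not a string
        System.out.println(inverted("hi"));   // hi is a string
    }
}
```

Replacing && with || in nonEmpty, or using s in the first arm of inverted, is rejected at compile time, matching the error cases shown above.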
