edu.stanford.nlp.trees.tregex
Class TregexPattern

java.lang.Object
  extended by edu.stanford.nlp.trees.tregex.TregexPattern
All Implemented Interfaces:
java.io.Serializable

public abstract class TregexPattern
extends java.lang.Object
implements java.io.Serializable

A TregexPattern is a tgrep-type pattern. Instances of it can be matched against instances of the Tree class.

Currently supported node/node relations and their symbols:

Symbol        Meaning
A << B        A dominates B
A >> B        A is dominated by B
A < B         A immediately dominates B
A > B         A is immediately dominated by B
A $ B         A is a sister of B (and not equal to B)
A .. B        A precedes B
A . B         A immediately precedes B
A ,, B        A follows B
A , B         A immediately follows B
A <<, B       B is a leftmost descendant of A
A <<- B       B is a rightmost descendant of A
A >>, B       A is a leftmost descendant of B
A >>- B       A is a rightmost descendant of B
A <, B        B is the first child of A
A >, B        A is the first child of B
A <- B        B is the last child of A
A >- B        A is the last child of B
A <i B        B is the ith child of A (i > 0)
A >i B        A is the ith child of B (i > 0)
A <-i B       B is the ith-to-last child of A (i > 0)
A >-i B       A is the ith-to-last child of B (i > 0)
A <: B        B is the only child of A
A $++ B       A is a left sister of B
A $-- B       A is a right sister of B
A $+ B        A is the immediate left sister of B
A $- B        A is the immediate right sister of B
A <+(C) B     A dominates B via an unbroken chain of (zero or more) nodes matching description C
A >+(C) B     A is dominated by B via an unbroken chain of (zero or more) nodes matching description C
A <<# B       B is a head of phrase A
A >># B       A is a head of phrase B
A <# B        B is the immediate head of phrase A
A ># B        A is the immediate head of phrase B
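
To make a few of the relations above concrete, the following sketch models them on a minimal tree type. It is not how Tregex implements matching: the Node class and the helper methods (immediatelyDominates, dominates, isLeftSisterOf, isImmediateLeftSisterOf) are hypothetical names introduced only for illustration, and the sketch uses only the JDK.

```java
import java.util.ArrayList;
import java.util.List;

class RelationSketch {
    // Hypothetical minimal tree node; NOT edu.stanford.nlp.trees.Tree.
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node parent;

        Node(String label, Node... kids) {
            this.label = label;
            for (Node kid : kids) {
                kid.parent = this;
                children.add(kid);
            }
        }
    }

    // A < B : a immediately dominates b (b is a child of a).
    static boolean immediatelyDominates(Node a, Node b) {
        return b.parent == a;
    }

    // A << B : a dominates b (b is a proper descendant of a).
    static boolean dominates(Node a, Node b) {
        for (Node n = b.parent; n != null; n = n.parent) {
            if (n == a) {
                return true;
            }
        }
        return false;
    }

    // A $++ B : a is a left sister of b (same parent, a comes earlier).
    static boolean isLeftSisterOf(Node a, Node b) {
        if (a == b || a.parent == null || a.parent != b.parent) {
            return false;
        }
        List<Node> sisters = a.parent.children;
        return sisters.indexOf(a) < sisters.indexOf(b);
    }

    // A $+ B : a is the immediate left sister of b.
    static boolean isImmediateLeftSisterOf(Node a, Node b) {
        if (a == b || a.parent == null || a.parent != b.parent) {
            return false;
        }
        List<Node> sisters = a.parent.children;
        return sisters.indexOf(a) + 1 == sisters.indexOf(b);
    }

    public static void main(String[] args) {
        // Builds the tree (S (NP NNP) (VP VBD)).
        Node nnp = new Node("NNP");
        Node np = new Node("NP", nnp);
        Node vbd = new Node("VBD");
        Node vp = new Node("VP", vbd);
        Node s = new Node("S", np, vp);

        System.out.println("S << NNP : " + dominates(s, nnp));
        System.out.println("S < NNP  : " + immediatelyDominates(s, nnp));
        System.out.println("NP $++ VP: " + isLeftSisterOf(np, vp));
        System.out.println("NP $+ VP : " + isImmediateLeftSisterOf(np, vp));
    }
}
```

In this tree S dominates NNP without immediately dominating it, which is the distinction between << and <.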

Label descriptions can be literal strings, which must match labels exactly, or regular expressions in regular expression bars: /regex/. To prevent ambiguity with other Tregex symbols, only standard "identifiers" are allowed as literals, i.e. strings matching [a-zA-Z]([a-zA-Z0-9_])*. If you want to use other symbols, you can do so by using a regular expression instead of a literal string. A disjunctive list of literal strings can be given separated by '|'. The special string '__' (two underscores) can be used to match any node. (WARNING: use of the '__' node description may seriously slow down search.) If a label description is preceded by '@', the label will match any node whose basicCategory matches the description. The basicCategory is defined according to a Function mapping Strings to Strings, as provided by AbstractTreebankLanguagePack.getBasicCategoryFunction().
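
As a rough illustration of the identifier rule, the sketch below checks which labels could be written as bare literals and which would need the /regex/ form. It assumes the character class is meant to be [a-zA-Z]; the class and method names are hypothetical and are not part of the Tregex API.

```java
import java.util.regex.Pattern;

class LiteralLabelCheck {
    // The identifier shape quoted in the text, assuming [a-zA-Z] is intended.
    private static final Pattern LITERAL =
            Pattern.compile("[a-zA-Z]([a-zA-Z0-9_])*");

    // True if the whole label has the identifier shape, i.e. it could be
    // written as a bare literal rather than as a /regex/ description.
    static boolean isLiteral(String label) {
        return LITERAL.matcher(label).matches();
    }

    public static void main(String[] args) {
        String[] labels = {"NP", "S1", "NP-SBJ", "-NONE-", "PRP$", ","};
        for (String label : labels) {
            System.out.println(label + " -> "
                    + (isLiteral(label) ? "literal" : "needs /regex/"));
        }
    }
}
```

Labels containing a hyphen, a dollar sign, or punctuation fall outside the identifier shape, so under this rule they would have to be described with a regular expression.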

Nodes can be grouped using parens '(' and ')' as in S < (NP $++ VP) to match an S over an NP, where the NP has a VP as a right sister.

Relations can be combined using the '&' and '|' operators. Thus (NP < NN | < NNS) will match an NP node dominating either an NN or an NNS. (NP > S & $++ VP) matches an NP that is both under an S and has a VP as a right sister.

Relations can be grouped using brackets '[' and ']'. So the expression NP [< NN | < NNS] & > S matches an NP that dominates either an NN or an NNS and is under an S. Without brackets, & takes precedence over |, and equivalent operators are left-associative. Also, & is the default combining operator if the operator is omitted in a chain of relations, so that (S < VP < NP) is equivalent to (S < VP & < NP) or (S < VP) < NP. If instead what you want is an S above a VP above an NP, you should write "S < (VP < NP)". As another example, (VP < VV | < NP % NP) can be written explicitly as (VP [< VV | [< NP & % NP] ] ).
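
The effect of the brackets can be sketched with ordinary Java predicates. The code below is only a model of the stated precedence rule, not Tregex code: Node, hasChild and hasParent are hypothetical helpers, and the two predicates stand for the relation part of NP [< NN | < NNS] & > S and of the unbracketed NP < NN | < NNS & > S.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class CombineSketch {
    // Hypothetical minimal tree node; NOT edu.stanford.nlp.trees.Tree.
    static final class Node {
        final String label;
        final List<Node> children;
        Node parent;

        Node(String label, Node... kids) {
            this.label = label;
            this.children = Arrays.asList(kids);
            for (Node kid : kids) {
                kid.parent = this;
            }
        }
    }

    // "< label": the node has a child with the given label.
    static Predicate<Node> hasChild(String label) {
        return n -> n.children.stream().anyMatch(c -> c.label.equals(label));
    }

    // "> label": the node's parent has the given label.
    static Predicate<Node> hasParent(String label) {
        return n -> n.parent != null && n.parent.label.equals(label);
    }

    // [< NN | < NNS] & > S : the brackets group the disjunction first.
    static final Predicate<Node> BRACKETED =
            hasChild("NN").or(hasChild("NNS")).and(hasParent("S"));

    // < NN | < NNS & > S : & binds tighter, i.e. < NN | [< NNS & > S].
    static final Predicate<Node> UNBRACKETED =
            hasChild("NN").or(hasChild("NNS").and(hasParent("S")));

    public static void main(String[] args) {
        // (PP (NP NN)): the NP has an NN child but is not under an S.
        Node np = new Node("NP", new Node("NN"));
        new Node("PP", np);

        System.out.println("bracketed:   " + BRACKETED.test(np));
        System.out.println("unbracketed: " + UNBRACKETED.test(np));
    }
}
```

For an NP that has an NN child but sits under a PP, the bracketed form fails because the > S condition applies to both alternatives, while the unbracketed form succeeds on the < NN alternative alone.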

Relations can be negated with the '!' operator, in which case the expression will match only if there is no node satisfying the relation. For example (NP !< NNP) matches only NPs not dominating an NNP. Label descriptions can also be negated with '!': (NP < !NNP|NNS) matches NPs dominating some node that is not an NNP or an NNS.
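
The two kinds of negation are easy to confuse, so here is a small model of the difference. It is a sketch of the semantics described above, not Tregex code; the Node class and both method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class NegationSketch {
    // Hypothetical minimal tree node; NOT edu.stanford.nlp.trees.Tree.
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... kids) {
            this.label = label;
            this.children = Arrays.asList(kids);
        }
    }

    // (NP !< NNP): negated relation, so no child may be an NNP.
    static boolean noChildIsNnp(Node np) {
        return np.children.stream().noneMatch(c -> c.label.equals("NNP"));
    }

    // (NP < !NNP|NNS): negated label, so some child must be
    // neither an NNP nor an NNS.
    static boolean someChildIsNotNnpOrNns(Node np) {
        return np.children.stream()
                .anyMatch(c -> !c.label.equals("NNP") && !c.label.equals("NNS"));
    }

    public static void main(String[] args) {
        // (NP DT NNP): contains an NNP, but also a DT.
        Node mixed = new Node("NP", new Node("DT"), new Node("NNP"));

        System.out.println("NP !< NNP      : " + noChildIsNnp(mixed));
        System.out.println("NP < !NNP|NNS  : " + someChildIsNotNnpOrNns(mixed));
    }
}
```

For an NP with children DT and NNP, the negated relation fails (an NNP child exists) while the negated label succeeds (the DT child is neither NNP nor NNS).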

To consider only the "basic category" of a tree label when matching a node, prefix that node's description with the @ symbol. For example: (@NP < @NN). This can only be used for individual nodes; if you want all nodes to use the basic category, it would be more efficient to use a TreeNormalizer to remove functional tags before passing the tree to the TregexPattern.

Nodes can be given names using '='. A named node will be stored in a map that maps names to nodes, so that if a match is found, the node corresponding to the named node can be extracted from the map. For example, (NP < NNP=name) will match an NP dominating an NNP, and after a match is found, the map can be queried with the name to retrieve the matched node using TregexMatcher.getNode(Object o) with the (String) argument "name" (not "=name"). Note that a ParseException will be thrown if a named node is used in the scope of a negated relation.

Named nodes that refer back to previous named nodes need not have a node description; this is known as "backreferencing". In this case, the expression will match only if the subsequently named node is equal to the previously named node (in the == sense). For example: (@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma) matches an NP dominating exactly the sequence NP comma NP comma. Multiple backreferences are allowed. If the node with no node description does not refer to a previously named node, there will be no error; the expression simply will not match anything.

Another way to refer to previously named nodes is with the "link" symbol: '~'. A link is like a backreference, except that instead of having to be *equal to* the referred-to node, the current node only has to match the label of the referred-to node. A link cannot have a node description, i.e. the '~' symbol must immediately follow a relation symbol.
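
The difference between a backreference and a link comes down to identity versus label agreement. The sketch below models that distinction only; it treats "match the label" as plain string equality, which is a simplifying assumption, and the Node class and method names are hypothetical rather than part of the Tregex API.

```java
class BackrefVsLink {
    // Hypothetical minimal node; NOT edu.stanford.nlp.trees.Tree.
    static final class Node {
        final String label;

        Node(String label) {
            this.label = label;
        }
    }

    // Backreference: the candidate must be the very same node (== sense).
    static boolean satisfiesBackreference(Node named, Node candidate) {
        return named == candidate;
    }

    // Link: only the labels have to agree (modelled here as equality).
    static boolean satisfiesLink(Node named, Node candidate) {
        return named.label.equals(candidate.label);
    }

    public static void main(String[] args) {
        Node firstComma = new Node(",");
        Node secondComma = new Node(",");

        System.out.println("backreference, same node:  "
                + satisfiesBackreference(firstComma, firstComma));
        System.out.println("backreference, other node: "
                + satisfiesBackreference(firstComma, secondComma));
        System.out.println("link, other node:          "
                + satisfiesLink(firstComma, secondComma));
    }
}
```

Two distinct comma nodes satisfy a link to each other but not a backreference, since they carry the same label without being the same node.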

Relations can be made optional with the '?' operator. The expression will then match even if the optional relation is not satisfied, but if it is satisfied, named nodes under it will still be put into the map.

Both the HeadFinder used to determine heads for the head relations and the Function mapping labels to basic category tags can be chosen by using a TregexPatternCompiler.

Current known bugs/shortcomings:

See Also:
Serialized Form

Field Summary
protected static Function currentBasicCatFunction
           
 
Method Summary
static TregexPattern compile(java.lang.String tregex)
          Creates a pattern from the given string using the default HeadFinder and BasicCategoryFunction.
static void main(java.lang.String[] args)
          Use to match a tree pattern to the trees in files.
 TregexMatcher matcher(Tree t)
          Get a TregexMatcher for this pattern on this tree.
 java.lang.String pattern()
           
 void prettyPrint()
          Print a multi-line representation of the pattern illustrating its syntax to System.out.
 void prettyPrint(java.io.PrintStream ps)
          Print a multi-line representation of the pattern illustrating its syntax.
 void prettyPrint(java.io.PrintWriter pw)
          Print a multi-line representation of the pattern illustrating its syntax.
 void setPatternString(java.lang.String patternString)
           
abstract  java.lang.String toString()
          A single-line string representation of the pattern
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

currentBasicCatFunction

protected static Function currentBasicCatFunction
Method Detail

matcher

public TregexMatcher matcher(Tree t)
Get a TregexMatcher for this pattern on this tree.

Parameters:
t - a tree to match on
Returns:
a TregexMatcher

compile

public static TregexPattern compile(java.lang.String tregex)
                             throws ParseException
Creates a pattern from the given string using the default HeadFinder and BasicCategoryFunction. If you want to use a different HeadFinder or BasicCategoryFunction, use a TregexPatternCompiler object.

Parameters:
tregex - the pattern string
Returns:
a TregexPattern for the string.
Throws:
ParseException - if the string does not parse

pattern

public java.lang.String pattern()

setPatternString

public void setPatternString(java.lang.String patternString)

toString

public abstract java.lang.String toString()
A single-line string representation of the pattern

Overrides:
toString in class java.lang.Object
Returns:

prettyPrint

public void prettyPrint(java.io.PrintWriter pw)
Print a multi-line representation of the pattern illustrating its syntax.


prettyPrint

public void prettyPrint(java.io.PrintStream ps)
Print a multi-line representation of the pattern illustrating its syntax.


prettyPrint

public void prettyPrint()
Print a multi-line representation of the pattern illustrating its syntax to System.out.


main

public static void main(java.lang.String[] args)
Use to match a tree pattern to the trees in files. Usage:

java edu.stanford.nlp.trees.tregex.TregexPattern [-T] [-C] [-w] [-f] pattern [handle] filepath

It prints out all the matches of the tree pattern to every tree.

Parameters:
args - Command line arguments: Argument 1 is the tree pattern, which should name a node with =name (for some arbitrary string "name"); argument 2 is an optional name =name; and argument 3 is a filepath to files with trees. The -T flag causes all trees to be printed as they are processed; otherwise just matches are printed. The -C flag suppresses printing of matches, so only the number of matches is printed. The -w flag causes the whole of a tree that matches to be printed. The -f flag causes the filename to be printed.