edu.stanford.nlp.parser.lexparser
Class BaseLexicon

java.lang.Object
  extended by edu.stanford.nlp.parser.lexparser.BaseLexicon
All Implemented Interfaces:
Lexicon, java.io.Serializable

public class BaseLexicon
extends java.lang.Object
implements Lexicon

This is the default concrete instantiation of the Lexicon interface. It was originally built for Penn Treebank English.

See Also:
Serialized Form

Field Summary
protected  int lastSentencePosition
           
protected  int lastSignatureIndex
          Caches the index of the last signature looked up, because the same signature is requested many times when an unknown word is encountered. (Note that under the current scheme, an unknown word that occurs both sentence-initially and non-initially will be parsed with two different signatures.)
protected  int lastWordToSignaturize
           
protected static short nullTag
           
protected static int nullWord
           
protected  java.util.Set rules
           
protected  java.util.List[] rulesWithWord
           
protected  Counter seenCounter
           
protected  java.util.Set sigs
           
protected  boolean smartMutation
           
protected  int smoothInUnknownsThreshold
           
protected  java.util.Set tags
           
protected  int unknownLevel
           
protected  Counter unSeenCounter
           
protected  java.util.Set words
           
 
Fields inherited from interface edu.stanford.nlp.parser.lexparser.Lexicon
BOUNDARY, BOUNDARY_TAG, UNKNOWN_WORD
 
Constructor Summary
BaseLexicon()
           
BaseLexicon(edu.stanford.nlp.parser.lexparser.Options.LexOptions op)
           
 
Method Summary
protected  void addTagging(boolean seen, IntTaggedWord itw, double count)
          Adds the tagging with count to the data structures in this Lexicon.
 double evaluateCoverage(java.util.Collection trees, java.util.Set missingWords, java.util.Set missingTags, java.util.Set missingTW)
          Evaluates how many words (= terminals) in a collection of trees are covered by the lexicon.
protected  java.lang.String getSignature(java.lang.String word, int loc)
          This routine returns a String that is the "signature" of the class of a word.
protected  int getSignatureIndex(int wordIndex, int sentencePosition)
          Returns the index of the signature of the word numbered wordIndex, where the signature is the String representation of unknown word features.
protected  void initRulesWithWord()
           
 boolean isKnown(int word)
          Checks whether a word is in the lexicon.
 boolean isKnown(java.lang.String word)
          Checks whether a word is in the lexicon.
 void printLexStats()
           
 void readData(java.io.BufferedReader in)
          Populates data in this Lexicon from the character stream given by the BufferedReader in.
 java.util.Iterator ruleIteratorByWord(int word, int loc)
          Gets an iterator over all rules with this word and loc.
 double score(IntTaggedWord iTW, int loc)
          Gets the score of this word with this tag (given together as an IntTaggedWord) at this loc. The score is presumably an estimate of P(word | tag).
 void train(java.util.Collection trees)
          Trains this lexicon on the Collection of trees.
protected  java.util.List treeToEvents(Tree tree)
           
 void tune(java.util.Collection trees)
           
 void writeData(java.io.Writer w)
          Writes out data from this Object to the Writer w.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

unknownLevel

protected int unknownLevel

smoothInUnknownsThreshold

protected int smoothInUnknownsThreshold

smartMutation

protected boolean smartMutation

rulesWithWord

protected transient java.util.List[] rulesWithWord

rules

protected transient java.util.Set rules

tags

protected transient java.util.Set tags

words

protected transient java.util.Set words

sigs

protected transient java.util.Set sigs

seenCounter

protected Counter seenCounter

unSeenCounter

protected Counter unSeenCounter

nullWord

protected static final int nullWord
See Also:
Constant Field Values

nullTag

protected static final short nullTag
See Also:
Constant Field Values

lastSignatureIndex

protected transient int lastSignatureIndex
Caches the index of the last signature looked up, because the same signature is requested many times when an unknown word is encountered. (Note that under the current scheme, an unknown word that occurs both sentence-initially and non-initially will be parsed with two different signatures.)


lastSentencePosition

protected transient int lastSentencePosition

lastWordToSignaturize

protected transient int lastWordToSignaturize

Constructor Detail

BaseLexicon

public BaseLexicon()

BaseLexicon

public BaseLexicon(edu.stanford.nlp.parser.lexparser.Options.LexOptions op)

Method Detail

isKnown

public boolean isKnown(int word)
Description copied from interface: Lexicon
Checks whether a word is in the lexicon.

Specified by:
isKnown in interface Lexicon
Parameters:
word - The word as an int
Returns:
Whether the word is in the lexicon

isKnown

public boolean isKnown(java.lang.String word)
Checks whether a word is in the lexicon. This version works even while the lexicon is still being compiled, because it uses the current counters rather than the compiled rulesWithWord array.

Specified by:
isKnown in interface Lexicon
Parameters:
word - The word as a String
Returns:
Whether the word is in the lexicon

ruleIteratorByWord

public java.util.Iterator ruleIteratorByWord(int word,
                                             int loc)
Description copied from interface: Lexicon
Gets an iterator over all rules with this word and loc.

Specified by:
ruleIteratorByWord in interface Lexicon
Returns:
an Iterator over rules

initRulesWithWord

protected void initRulesWithWord()

treeToEvents

protected java.util.List treeToEvents(Tree tree)

train

public void train(java.util.Collection trees)
Trains this lexicon on the Collection of trees.

Specified by:
train in interface Lexicon

addTagging

protected void addTagging(boolean seen,
                          IntTaggedWord itw,
                          double count)
Adds the tagging with count to the data structures in this Lexicon.


getSignatureIndex

protected int getSignatureIndex(int wordIndex,
                                int sentencePosition)
Returns the index of the signature of the word numbered wordIndex, where the signature is the String representation of unknown word features. Caches the last signature index returned.
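
The caching described above can be pictured as a one-entry cache keyed on the last (wordIndex, sentencePosition) pair. The following is a minimal sketch of that idea only, not the BaseLexicon source: the class LastLookupCache, its methods, and the miss counter are hypothetical names introduced for illustration, and keying on both arguments is an assumption suggested by the lastWordToSignaturize and lastSentencePosition fields.

```java
/** Hypothetical one-entry cache; illustrates the idea, not the actual BaseLexicon code. */
final class LastLookupCache {
    private final java.util.function.IntBinaryOperator compute; // the expensive lookup
    private boolean valid = false;   // false until the first lookup has been stored
    private int lastWord;
    private int lastPosition;
    private int lastResult;
    private int misses = 0;          // counts how often the expensive lookup actually ran

    LastLookupCache(java.util.function.IntBinaryOperator compute) {
        this.compute = compute;
    }

    int get(int wordIndex, int sentencePosition) {
        // Reuse the stored answer when the same word is asked about at the same position.
        if (valid && wordIndex == lastWord && sentencePosition == lastPosition) {
            return lastResult;
        }
        misses++;
        lastResult = compute.applyAsInt(wordIndex, sentencePosition);
        lastWord = wordIndex;
        lastPosition = sentencePosition;
        valid = true;
        return lastResult;
    }

    int misses() {
        return misses;
    }
}
```

Because the position is part of the key in this sketch, the same word looked up at two different positions is computed twice, which matches the note on lastSignatureIndex about sentence-initial and non-initial occurrences.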


getSignature

protected java.lang.String getSignature(java.lang.String word,
                                        int loc)
This routine returns a String that is the "signature" of the class of a word. For example, it might represent whether the word is a number or ends in -s. By convention, the returned strings match the pattern UNK-.*, which is assumed not to match any real word.

Parameters:
word - The word to make a signature for
loc - Its position in the sentence (mainly so sentence-initial capitalized words can be treated differently)
Returns:
A String that is its signature (equivalence class)
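
To make the notion of a signature concrete, here is a small sketch that maps a word to an UNK-... label from a few surface features. It is an assumption-laden illustration rather than the routine BaseLexicon uses: the class SignatureSketch, the method signatureOf, and the feature labels (-NUM, -INITC, -CAPS, -s, -OTHER) are all hypothetical, and the real feature set (which presumably depends on unknownLevel) is not documented on this page.

```java
/** Hypothetical signature function; the feature labels are invented for illustration. */
final class SignatureSketch {
    /** Returns an "UNK-..." equivalence-class label for word at sentence position loc. */
    static String signatureOf(String word, int loc) {
        StringBuilder sb = new StringBuilder("UNK");
        boolean hasDigit = false;
        for (int i = 0; i < word.length(); i++) {
            if (Character.isDigit(word.charAt(i))) {
                hasDigit = true;
                break;
            }
        }
        if (hasDigit) {
            sb.append("-NUM");
        }
        if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
            // A sentence-initial capital says little about the word, so it gets its own label.
            sb.append(loc == 0 ? "-INITC" : "-CAPS");
        }
        if (word.length() > 2 && word.endsWith("s")) {
            sb.append("-s");
        }
        if (sb.length() == 3) {
            // No feature fired; still return something matching UNK-.*
            sb.append("-OTHER");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(signatureOf("Walruses", 0)); // prints UNK-INITC-s
        System.out.println(signatureOf("Walruses", 3)); // prints UNK-CAPS-s
        System.out.println(signatureOf("1980s", 2));    // prints UNK-NUM-s
    }
}
```

The first two calls show how one word can receive two different signatures depending on whether it is sentence-initial.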

score

public double score(IntTaggedWord iTW,
                    int loc)
Description copied from interface: Lexicon
Gets the score of this word with this tag (given together as an IntTaggedWord) at this loc. The score is presumably an estimate of P(word | tag).

Specified by:
score in interface Lexicon
Returns:
a double valued score
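
For orientation, the simplest possible estimate of P(word | tag) is the relative frequency count(word, tag) / count(tag). The sketch below shows only that unsmoothed estimate. It is not what BaseLexicon computes: the class RelativeFrequencySketch and its methods are hypothetical, it uses plain Strings instead of IntTaggedWord, it ignores loc and unknown words entirely, and this page does not say whether the real score is a probability or a log probability.

```java
/** Hypothetical unsmoothed estimator of P(word | tag); not the BaseLexicon scoring code. */
final class RelativeFrequencySketch {
    private final java.util.Map<String, Double> pairCounts = new java.util.HashMap<>();
    private final java.util.Map<String, Double> tagCounts = new java.util.HashMap<>();

    /** Records that word was seen with tag, count times. */
    void add(String word, String tag, double count) {
        pairCounts.merge(word + "\t" + tag, count, Double::sum);
        tagCounts.merge(tag, count, Double::sum);
    }

    /** Returns count(word, tag) / count(tag), or 0.0 if the tag was never seen. */
    double probWordGivenTag(String word, String tag) {
        double tagTotal = tagCounts.getOrDefault(tag, 0.0);
        if (tagTotal == 0.0) {
            return 0.0;
        }
        return pairCounts.getOrDefault(word + "\t" + tag, 0.0) / tagTotal;
    }
}
```

An estimate like this assigns zero to every unseen word, which is the gap that signatures and unknown-word smoothing are meant to fill.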

tune

public void tune(java.util.Collection trees)

readData

public void readData(java.io.BufferedReader in)
              throws java.io.IOException
Populates data in this Lexicon from the character stream given by the BufferedReader in.

Specified by:
readData in interface Lexicon
Throws:
java.io.IOException

writeData

public void writeData(java.io.Writer w)
               throws java.io.IOException
Writes out data from this Object to the Writer w. Rules are separated by newlines, and rule elements are delimited by tabs (\t).

Specified by:
writeData in interface Lexicon
Parameters:
w - the writer to output to
Throws:
java.io.IOException
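
The layout described above (one rule per line, elements separated by tabs) can be sketched as follows. This shows the text layout only: the class TabSeparatedRulesSketch and its methods are hypothetical, and which elements a rule actually contains (the example uses tag, word, and count) is an assumption, since this page does not list them.

```java
/** Hypothetical helper showing a newline/tab layout; the rule elements are assumed. */
final class TabSeparatedRulesSketch {
    /** Formats each rule on its own line, with its elements joined by tabs. */
    static String formatRules(java.util.List<String[]> rules) {
        StringBuilder sb = new StringBuilder();
        for (String[] rule : rules) {
            sb.append(String.join("\t", rule)).append('\n');
        }
        return sb.toString();
    }

    /** Parses text produced by formatRules back into rules, skipping empty lines. */
    static java.util.List<String[]> parseRules(String text) {
        java.util.List<String[]> rules = new java.util.ArrayList<>();
        for (String line : text.split("\n")) {
            if (line.isEmpty()) {
                continue;
            }
            // Limit -1 keeps trailing empty elements instead of silently dropping them.
            rules.add(line.split("\t", -1));
        }
        return rules;
    }
}
```

The string returned by formatRules could be passed to any java.io.Writer, and parseRules mirrors the reading direction.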

printLexStats

public void printLexStats()

evaluateCoverage

public double evaluateCoverage(java.util.Collection trees,
                               java.util.Set missingWords,
                               java.util.Set missingTags,
                               java.util.Set missingTW)
Evaluates how many words (= terminals) in a collection of trees are covered by the lexicon. The first argument is the collection of trees; the second through fourth arguments are sets that receive the results.
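
The calling convention, where the caller passes in empty sets that the method fills, can be sketched as below. This is a simplified stand-in under several assumptions: the class CoverageSketch is hypothetical, it works on flat {word, tag} pairs rather than Tree objects, it represents the lexicon as a map from word to allowed tags, and it returns the covered fraction, which this page does not confirm for the real method.

```java
/** Hypothetical coverage check over {word, tag} pairs; not the BaseLexicon code. */
final class CoverageSketch {
    static double coverage(java.util.List<String[]> taggedWords,
                           java.util.Map<String, java.util.Set<String>> lexicon,
                           java.util.Set<String> missingWords,
                           java.util.Set<String> missingTags,
                           java.util.Set<String> missingTW) {
        // Every tag that appears anywhere in the lexicon counts as known.
        java.util.Set<String> knownTags = new java.util.HashSet<>();
        for (java.util.Set<String> tagsOfWord : lexicon.values()) {
            knownTags.addAll(tagsOfWord);
        }
        int covered = 0;
        for (String[] tw : taggedWords) {
            String word = tw[0];
            String tag = tw[1];
            java.util.Set<String> tagsForWord = lexicon.get(word);
            if (tagsForWord == null) {
                missingWords.add(word);
            }
            if (!knownTags.contains(tag)) {
                missingTags.add(tag);
            }
            if (tagsForWord != null && tagsForWord.contains(tag)) {
                covered++;
            } else {
                missingTW.add(word + "/" + tag);
            }
        }
        // Fraction of terminals whose (word, tag) pair is in the lexicon.
        return taggedWords.isEmpty() ? 0.0 : (double) covered / taggedWords.size();
    }
}
```

A caller would create three empty sets, pass them in, and inspect them after the call to see which words, tags, and word/tag pairs were not covered.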