extractTerms
Class ExtractTerms

java.lang.Object
  extended by extractTerms.ExtractTerms

public class ExtractTerms
extends java.lang.Object

The extract method extracts pairs of adjacent units and estimates their "affinity" with one of several measures, such as mutual information. Triples of units are processed as well, with the middle unit treated as carrying no information.

The static replace method replaces the separate units held in an Occurrence object with a single term that groups them. The inverse operation, ungrouping a term back into its units, is also supported.
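As an illustration of an affinity measure, the sketch below computes pointwise mutual information from raw counts. It is a minimal, hypothetical example and not the ExtractTerms implementation: the class name PairAffinitySketch, the gap parameter, and the exact formula are assumptions. A gap of 1 scores adjacent pairs; a gap of 2 scores the outer units of a triple while ignoring the middle one.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not the ExtractTerms implementation. */
class PairAffinitySketch {

    /** Counts pairs (units[i], units[i + gap]); gap = 1 gives adjacent pairs. */
    static Map<String, Integer> countPairs(List<String> units, int gap) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + gap < units.size(); i++) {
            counts.merge(units.get(i) + " " + units.get(i + gap), 1, Integer::sum);
        }
        return counts;
    }

    /** Pointwise mutual information log2(P(a,b) / (P(a) * P(b))) from raw counts. */
    static double pmi(List<String> units, String a, String b, int gap) {
        Map<String, Integer> unitCounts = new HashMap<>();
        for (String u : units) {
            unitCounts.merge(u, 1, Integer::sum);
        }
        int pairTotal = units.size() - gap; // number of pair positions in the sequence
        double pAB = countPairs(units, gap).getOrDefault(a + " " + b, 0) / (double) pairTotal;
        double pA = unitCounts.getOrDefault(a, 0) / (double) units.size();
        double pB = unitCounts.getOrDefault(b, 0) / (double) units.size();
        return Math.log(pAB / (pA * pB)) / Math.log(2);
    }

    public static void main(String[] args) {
        List<String> units = Arrays.asList("new", "york", "city", "new", "york");
        System.out.println(pmi(units, "new", "york", 1)); // adjacent pair
        System.out.println(pmi(units, "new", "city", 2)); // triple, middle unit ignored
    }
}
```

A pair that never occurs yields negative infinity in this sketch; a real measure would typically smooth or prune such cases.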

Since:
27/06/2003
Version:
0.8 22/09/2004
Author:
Thomas Heitz for LRI Paris XI University

Field Summary
static int GROUP_MODE
          Replacement mode for the replace method: group adjacent units into a single term.
static int UNGROUP_MODE
          Replacement mode for the replace method: split a grouped term back into its units.
 
Constructor Summary
ExtractTerms(java.lang.String _inputFileName, java.lang.String _regexUnit, java.lang.String _regexNameOfText, java.lang.String _regexSentPhraSeparator, java.lang.String _regex2Units1, java.lang.String _regex2Units2, int _tagKept2Units, java.lang.String _regex3Units1, java.lang.String _regex3Units2, java.lang.String _regex3Units3, int _tagKept3Units, int _formula, boolean _tfidf, int _pruning2Units, int _pruning3Units, int _maxResults2Units, int _maxResults3Units, java.lang.String _separator, int _relation2Units, int _relation3Units, boolean _occFirst)
          Regular expressions matching the units of each pair and each triple must be provided.
 
Method Summary
 boolean extract(java.util.TreeSet currentSkippedTerms, int iteration, javax.swing.JProgressBar extractionJProgressBar)
          Extracts terms from the input file and sorts them by relevancy using a relevancy measure formula.
 java.lang.String getInputFileName()
          Returns input file name.
 java.util.regex.Pattern getPattern2Units()
          Returns the regular expression pattern used to match unit pairs.
 java.util.regex.Pattern getPattern3Units()
          Returns the regular expression pattern used to match unit triples.
 java.util.regex.Pattern getPatternNameOfText()
          Returns the regular expression pattern used to extract the name of each text in the corpus.
 java.util.regex.Pattern getPatternSentPhraSeparator()
          Returns the regular expression pattern for the sentence or phrase separator.
 java.util.regex.Pattern getPatternUnit()
          Returns the regular expression pattern used to split the input text into units.
 int getPruning2Units()
          Returns the pruning threshold for unit pairs.
 int getPruning3Units()
          Returns the pruning threshold for unit triples.
 java.util.TreeSet getResults2Units()
          Returns the unit pair results, sorted by relevancy.
 java.util.TreeSet getResults3Units()
          Returns the unit triple results, sorted by relevancy.
 java.lang.String getSeparator()
          Returns the separator between word and tag.
 int getTotalDocuments()
          Returns the total number of documents in the input file.
 int getTotalResults2Units()
          Returns the number of unit pair results remaining after pruning.
 int getTotalResults3Units()
          Returns the number of unit triple results remaining after pruning.
 int getTotalUnitsPair()
          Returns the total number of unit pairs in the input file, counting duplicates.
 int getTotalUnitsPairWithoutDouble()
          Returns the number of distinct unit pairs in the input file (each pair counted once).
 int getTotalUnitsTriple()
          Returns the total number of unit triples in the input file, counting duplicates.
 int getTotalUnitsTripleWithoutDouble()
          Returns the number of distinct unit triples in the input file (each triple counted once).
 int getTotalUnitsWithoutDouble()
          Returns the number of distinct units in the input file (each unit counted once).
static void replace(java.util.LinkedHashSet units, java.lang.String wordSeparator, java.lang.String inputFileName, java.lang.String newFileName, int mode)
          Replaces, in the file inputFileName, the adjacent units listed in units with their grouped or ungrouped form and writes the result to a new file named newFileName.
 void setInputFileName(java.lang.String _inputFileName)
          Sets the input file name.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

GROUP_MODE

public static final int GROUP_MODE
Replacement mode for the replace method: group adjacent units into a single term.

See Also:
replace, Constant Field Values

UNGROUP_MODE

public static final int UNGROUP_MODE
Replacement mode for the replace method: split a grouped term back into its units.

See Also:
replace, Constant Field Values
Constructor Detail

ExtractTerms

public ExtractTerms(java.lang.String _inputFileName,
                    java.lang.String _regexUnit,
                    java.lang.String _regexNameOfText,
                    java.lang.String _regexSentPhraSeparator,
                    java.lang.String _regex2Units1,
                    java.lang.String _regex2Units2,
                    int _tagKept2Units,
                    java.lang.String _regex3Units1,
                    java.lang.String _regex3Units2,
                    java.lang.String _regex3Units3,
                    int _tagKept3Units,
                    int _formula,
                    boolean _tfidf,
                    int _pruning2Units,
                    int _pruning3Units,
                    int _maxResults2Units,
                    int _maxResults3Units,
                    java.lang.String _separator,
                    int _relation2Units,
                    int _relation3Units,
                    boolean _occFirst)
Regular expressions matching the units of each pair and each triple must be provided. Using a tagged file as input is strongly advised, so that the analysis can be restricted to specific syntactic relations such as Adjective-Noun.

Parameters:
_inputFileName - name of the input file containing the text to analyse
_regexUnit - regular expression used to split the input text into units
_regexSentPhraSeparator - regular expression for the sentence or phrase separator
_regexNameOfText - regular expression used to extract the name of each text in the corpus
_regex2Units1 - regular expression matching the first unit of an analysed pair
_regex2Units2 - regular expression matching the second unit of an analysed pair
_tagKept2Units - tag to keep, out of the tags of the 2 units, when they are grouped
_regex3Units1 - regular expression matching the first unit of an analysed triple
_regex3Units2 - regular expression matching the second unit of an analysed triple
_regex3Units3 - regular expression matching the third unit of an analysed triple
_tagKept3Units - tag to keep, out of the tags of the 3 units, when they are grouped
_formula - formula used to calculate relevancy
_tfidf - true if relevancy measures are based on tf*idf values rather than on the number of occurrences
_pruning2Units - unit pairs that occur strictly fewer than _pruning2Units times do not appear in the results
_pruning3Units - unit triples that occur strictly fewer than _pruning3Units times do not appear in the results
_maxResults2Units - maximum number of results for unit pairs
_maxResults3Units - maximum number of results for unit triples
_separator - separator between word and tag
_relation2Units - position, in the list of the 2-unit selector, of the element selected for this term; -1 to skip extraction of unit pairs
_relation3Units - position, in the list of the 3-unit selector, of the element selected for this term; -1 to skip extraction of unit triples
_occFirst - true if the measure takes the number of occurrences into account first
See Also:
ExtractTermsExample for an example
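
To make the role of the unit and pair regular expressions more concrete, here is a minimal sketch of how a tagged text could be split into units and filtered for Adjective-Noun pairs. It is an illustration only: the class TaggedPairSketch, the word/TAG format with "/" as separator, and the tag names JJ and NN are assumptions, and the real constructor may interpret its patterns differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of regex-based pair selection on tagged text. */
class TaggedPairSketch {

    /**
     * Splits the text into units with regexUnit, then keeps adjacent units
     * whose first element matches regexFirst and second matches regexSecond.
     */
    static List<String> matchPairs(String text, String regexUnit,
                                   String regexFirst, String regexSecond) {
        String[] units = Pattern.compile(regexUnit).split(text.trim());
        Pattern first = Pattern.compile(regexFirst);
        Pattern second = Pattern.compile(regexSecond);
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < units.length; i++) {
            if (first.matcher(units[i]).matches()
                    && second.matcher(units[i + 1]).matches()) {
                pairs.add(units[i] + " " + units[i + 1]);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Assumed input format: word/TAG, units separated by whitespace.
        String text = "the/DT big/JJ house/NN is/VBZ red/JJ";
        System.out.println(matchPairs(text, "\\s+", ".+/JJ", ".+/NN"));
    }
}
```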
Method Detail

getTotalUnitsWithoutDouble

public int getTotalUnitsWithoutDouble()
Returns the number of distinct units in the input file (each unit counted once).

Returns:
number of distinct units.

getTotalUnitsPairWithoutDouble

public int getTotalUnitsPairWithoutDouble()
Returns the number of distinct unit pairs in the input file (each pair counted once).

Returns:
number of distinct unit pairs.

getTotalUnitsTripleWithoutDouble

public int getTotalUnitsTripleWithoutDouble()
Returns the number of distinct unit triples in the input file (each triple counted once).

Returns:
number of distinct unit triples.

getResults2Units

public java.util.TreeSet getResults2Units()
Returns the unit pair results, sorted by relevancy.

Returns:
sorted set of Occurrence results.

getResults3Units

public java.util.TreeSet getResults3Units()
Returns the unit triple results, sorted by relevancy.

Returns:
sorted set of Occurrence results.
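
The result getters return a java.util.TreeSet, which keeps its elements ordered by a comparator or by their natural ordering. The sketch below shows one way such a relevancy-sorted set could be built. The ScoredTerm class and its fields are hypothetical stand-ins, since the Occurrence class is not described on this page. Note the tie-break on the term text: a TreeSet treats elements that compare as equal as duplicates, so ordering by score alone would silently drop terms with identical scores.

```java
import java.util.Comparator;
import java.util.TreeSet;

/** Hypothetical sketch of a relevancy-sorted result set. */
class SortedResultsSketch {

    /** Stand-in for a scored result; not the real Occurrence class. */
    static final class ScoredTerm {
        final String term;
        final double score;

        ScoredTerm(String term, double score) {
            this.term = term;
            this.score = score;
        }

        @Override
        public String toString() {
            return term + " (" + score + ")";
        }
    }

    /** Creates a set ordered by descending score, then by term text. */
    static TreeSet<ScoredTerm> newResultSet() {
        Comparator<ScoredTerm> byScoreDesc = (a, b) -> Double.compare(b.score, a.score);
        Comparator<ScoredTerm> order =
                byScoreDesc.thenComparing((a, b) -> a.term.compareTo(b.term));
        return new TreeSet<>(order);
    }

    public static void main(String[] args) {
        TreeSet<ScoredTerm> results = newResultSet();
        results.add(new ScoredTerm("data base", 1.0));
        results.add(new ScoredTerm("neural network", 3.0));
        results.add(new ScoredTerm("term extraction", 3.0));
        for (ScoredTerm t : results) {
            System.out.println(t); // most relevant first
        }
    }
}
```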

getTotalResults2Units

public int getTotalResults2Units()
Returns the number of unit pair results remaining after pruning.

Returns:
number of results for unit pairs after pruning.

getTotalResults3Units

public int getTotalResults3Units()
Returns the number of unit triple results remaining after pruning.

Returns:
number of results for unit triples after pruning.

getTotalUnitsPair

public int getTotalUnitsPair()
Returns the total number of unit pairs in the input file, counting duplicates.

Returns:
total number of unit pairs, duplicates included.

getTotalUnitsTriple

public int getTotalUnitsTriple()
Returns the total number of unit triples in the input file, counting duplicates.

Returns:
total number of unit triples, duplicates included.
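
The difference between the totals with and without duplicates can be illustrated with a short sketch. The class PairCountSketch is hypothetical and only mirrors the idea behind getTotalUnitsPair() and getTotalUnitsPairWithoutDouble(); how the real class tokenises the input or treats sentence boundaries is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

/** Hypothetical sketch: total versus distinct adjacent pairs. */
class PairCountSketch {

    /** Lists every adjacent pair, duplicates included. */
    static List<String> adjacentPairs(List<String> units) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < units.size(); i++) {
            pairs.add(units.get(i) + " " + units.get(i + 1));
        }
        return pairs;
    }

    /** Number of pairs, each occurrence counted. */
    static int totalPairs(List<String> units) {
        return adjacentPairs(units).size();
    }

    /** Number of pairs, each distinct pair counted once. */
    static int distinctPairs(List<String> units) {
        return new HashSet<>(adjacentPairs(units)).size();
    }

    public static void main(String[] args) {
        List<String> units = Arrays.asList("a", "b", "a", "b", "c");
        System.out.println(totalPairs(units));    // pairs with duplicates
        System.out.println(distinctPairs(units)); // pairs without duplicates
    }
}
```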

getTotalDocuments

public int getTotalDocuments()
Returns total number of documents in the input file.

Returns:
total number of documents.
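
The document count matters when relevancy is based on tf*idf values (the _tfidf constructor option). As a rough illustration, the sketch below uses one common textbook variant, tf * ln(N / df). This is an assumption: the page does not state which tf*idf formula ExtractTerms applies, and the class and parameter names are hypothetical.

```java
/** Hypothetical sketch of a textbook tf*idf weight. */
class TfIdfSketch {

    /**
     * @param termFrequency     occurrences of the term in one document
     * @param documentsWithTerm number of documents containing the term
     * @param totalDocuments    number of documents in the corpus
     */
    static double tfIdf(int termFrequency, int documentsWithTerm, int totalDocuments) {
        if (documentsWithTerm <= 0 || totalDocuments <= 0) {
            throw new IllegalArgumentException("document counts must be positive");
        }
        double idf = Math.log((double) totalDocuments / documentsWithTerm);
        return termFrequency * idf;
    }

    public static void main(String[] args) {
        System.out.println(tfIdf(2, 1, 10));  // rare term: high weight
        System.out.println(tfIdf(3, 10, 10)); // term present in every document
    }
}
```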

getPatternUnit

public java.util.regex.Pattern getPatternUnit()
Returns the regular expression pattern used to split the input text into units.

Returns:
pattern used to split the input text into units.

getPattern2Units

public java.util.regex.Pattern getPattern2Units()
Returns the regular expression pattern used to match unit pairs.

Returns:
pattern used to match unit pairs.

getPattern3Units

public java.util.regex.Pattern getPattern3Units()
Returns the regular expression pattern used to match unit triples.

Returns:
pattern used to match unit triples.

getPatternSentPhraSeparator

public java.util.regex.Pattern getPatternSentPhraSeparator()
Returns the regular expression pattern for the sentence or phrase separator.

Returns:
pattern for the sentence or phrase separator.

getPatternNameOfText

public java.util.regex.Pattern getPatternNameOfText()
Returns the regular expression pattern used to extract the name of each text in the corpus.

Returns:
pattern for the name of a text.

getPruning2Units

public int getPruning2Units()
Returns the pruning threshold for unit pairs.

Returns:
value of the pruning threshold.

getPruning3Units

public int getPruning3Units()
Returns the pruning threshold for unit triples.

Returns:
value of the pruning threshold.
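
According to the constructor documentation, pairs or triples that occur strictly fewer times than the pruning threshold are left out of the results. A minimal sketch of that rule, using a hypothetical PruningSketch class and a plain map of occurrence counts rather than the real internal structures:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of occurrence-based pruning. */
class PruningSketch {

    /** Keeps only entries whose count is at least the pruning threshold. */
    static Map<String, Integer> prune(Map<String, Integer> counts, int pruning) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= pruning) { // strictly fewer occurrences are dropped
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("data base", 1);
        counts.put("neural network", 2);
        counts.put("term extraction", 3);
        System.out.println(prune(counts, 2));
    }
}
```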

getSeparator

public java.lang.String getSeparator()
Returns separator between word and tag.

Returns:
separator.

getInputFileName

public java.lang.String getInputFileName()
Returns input file name.

Returns:
input file name

setInputFileName

public void setInputFileName(java.lang.String _inputFileName)
Sets the input file name.

Parameters:
_inputFileName - input file name

extract

public boolean extract(java.util.TreeSet currentSkippedTerms,
                       int iteration,
                       javax.swing.JProgressBar extractionJProgressBar)
Extracts terms from the input file and sorts them by relevancy using a relevancy measure formula.

Example:


 ExtractTerms et = new ExtractTerms(...);
 // arguments: terms to skip (none here), iteration number (1 for illustration), progress bar (null if not used)
 et.extract(new TreeSet(), 1, null);
 Iterator it = et.getResults2Units().iterator();
 System.out.println("Relevancy results for units pair");
 while (it.hasNext()) { System.out.println(it.next()); }

This method does not return the results themselves; they are stored in the ExtractTerms object. Retrieve them with getResults2Units() and getResults3Units().

Parameters:
currentSkippedTerms - contains the terms (String) to skip
iteration - number of the iteration when this term was extracted
extractionJProgressBar - progress bar of the extraction; null if not used
Returns:
false if an error has occurred, true otherwise.

replace

public static void replace(java.util.LinkedHashSet units,
                           java.lang.String wordSeparator,
                           java.lang.String inputFileName,
                           java.lang.String newFileName,
                           int mode)
Replaces, in the file inputFileName, the adjacent units listed in units with their grouped or ungrouped form and writes the result to a new file named newFileName.

unit1 unit2...unitn <-> unit1-unit2-...-unitn

Parameters:
units - set of Occurrence objects containing the ungrouped unit pairs or triples together with their lines.
wordSeparator - separator inserted between units when grouping them
inputFileName - input file name
newFileName - name of new file written
mode - GROUP_MODE or UNGROUP_MODE
See Also:
Occurrence
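
The grouping and ungrouping shown above (unit1 unit2 <-> unit1-unit2) can be sketched on a single line of text. This is a simplified, hypothetical illustration: the class ReplaceSketch works on a string rather than on files and Occurrence objects, and the constant values below are arbitrary, since the actual values are listed only under Constant Field Values.

```java
/** Hypothetical sketch of grouping and ungrouping units in one line of text. */
class ReplaceSketch {

    // Arbitrary illustrative values, not the real ExtractTerms constants.
    static final int GROUP_MODE = 0;
    static final int UNGROUP_MODE = 1;

    /**
     * In GROUP_MODE, replaces "unit1 unit2" with "unit1<sep>unit2";
     * in UNGROUP_MODE, performs the inverse replacement.
     */
    static String replaceInLine(String line, String[] units,
                                String wordSeparator, int mode) {
        String ungrouped = String.join(" ", units);
        String grouped = String.join(wordSeparator, units);
        if (mode == GROUP_MODE) {
            return line.replace(ungrouped, grouped);
        }
        return line.replace(grouped, ungrouped);
    }

    public static void main(String[] args) {
        String[] units = {"new", "york"};
        String grouped = replaceInLine("I love new york city", units, "-", GROUP_MODE);
        System.out.println(grouped);
        System.out.println(replaceInLine(grouped, units, "-", UNGROUP_MODE));
    }
}
```

The literal string replacement used here ignores unit boundaries, so it would also rewrite a match inside a longer word; restricting replacements to the lines recorded with each unit, as the units parameter suggests, is one way to limit that.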