java.lang.Object
  extractTerms.ExtractTerms
A first method, extract, extracts pairs of adjacent units and estimates their "affinity" with several functions, such as the mutual information function. Unit triples are processed too, but the middle unit is treated as carrying no information.
A second method, replace, which is static, replaces the separate units encapsulated in an Occurrence object with a single term that groups those units. The inverse operation is also possible.
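As an illustration of the kind of affinity estimate mentioned for extract, here is a minimal sketch of pointwise mutual information computed over adjacent unit pairs. It is not the ExtractTerms implementation: the class PmiSketch, its method names, and the exact formula log2(P(x,y) / (P(x) * P(y))) are assumptions made for this example, and units are assumed to contain no spaces.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: not the ExtractTerms implementation.
class PmiSketch {

    // Pointwise mutual information of a pair, from raw counts:
    // log2( P(x,y) / (P(x) * P(y)) ).
    static double pmi(int pairCount, int firstCount, int secondCount,
                      int totalUnits, int totalPairs) {
        double pXY = (double) pairCount / totalPairs;
        double pX = (double) firstCount / totalUnits;
        double pY = (double) secondCount / totalUnits;
        return Math.log(pXY / (pX * pY)) / Math.log(2);
    }

    // Counts units and adjacent pairs, then scores each distinct pair.
    // Assumes units contain no spaces, since a space joins the pair key.
    static Map<String, Double> scoreAdjacentPairs(String[] units) {
        Map<String, Integer> unitCounts = new HashMap<>();
        Map<String, Integer> pairCounts = new HashMap<>();
        for (int i = 0; i < units.length; i++) {
            unitCounts.merge(units[i], 1, Integer::sum);
            if (i + 1 < units.length) {
                pairCounts.merge(units[i] + " " + units[i + 1], 1, Integer::sum);
            }
        }
        int totalPairs = units.length - 1;
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : pairCounts.entrySet()) {
            String[] pair = e.getKey().split(" ");
            scores.put(e.getKey(), pmi(e.getValue(),
                    unitCounts.get(pair[0]), unitCounts.get(pair[1]),
                    units.length, totalPairs));
        }
        return scores;
    }

    public static void main(String[] args) {
        String[] units = "new york is far from new york state".split(" ");
        scoreAdjacentPairs(units).forEach(
                (pair, score) -> System.out.println(pair + " -> " + score));
    }
}
```

Raw pointwise mutual information tends to favour rare pairs, which is one reason a pruning threshold on the number of occurrences is useful alongside it.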
Field Summary | |
static int |
GROUP_MODE
Selects the grouping mode of the replace method: separate units are replaced with a single grouped term. |
static int |
UNGROUP_MODE
Selects the ungrouping mode of the replace method: a grouped term is split back into separate units. |
Constructor Summary | |
ExtractTerms(java.lang.String _inputFileName,
java.lang.String _regexUnit,
java.lang.String _regexNameOfText,
java.lang.String _regexSentPhraSeparator,
java.lang.String _regex2Units1,
java.lang.String _regex2Units2,
int _tagKept2Units,
java.lang.String _regex3Units1,
java.lang.String _regex3Units2,
java.lang.String _regex3Units3,
int _tagKept3Units,
int _formula,
boolean _tfidf,
int _pruning2Units,
int _pruning3Units,
int _maxResults2Units,
int _maxResults3Units,
java.lang.String _separator,
int _relation2Units,
int _relation3Units,
boolean _occFirst)
Regular expressions for matching both unit pairs and unit triples must be provided. |
Method Summary | |
boolean |
extract(java.util.TreeSet currentSkippedTerms,
int iteration,
javax.swing.JProgressBar extractionJProgressBar)
Extracts terms from the input file and sorts them by relevancy, using a relevancy measure formula. |
java.lang.String |
getInputFileName()
Returns the input file name. |
java.util.regex.Pattern |
getPattern2Units()
Returns the regular expression pattern used for the analysed unit pairs. |
java.util.regex.Pattern |
getPattern3Units()
Returns the regular expression pattern used for the analysed unit triples. |
java.util.regex.Pattern |
getPatternNameOfText()
Returns the regular expression pattern used to extract the name of a text in the corpus. |
java.util.regex.Pattern |
getPatternSentPhraSeparator()
Returns the regular expression pattern for the sentence or phrase separator. |
java.util.regex.Pattern |
getPatternUnit()
Returns the regular expression pattern used to split the input text into units. |
int |
getPruning2Units()
Returns the pruning threshold for unit pairs. |
int |
getPruning3Units()
Returns the pruning threshold for unit triples. |
java.util.TreeSet |
getResults2Units()
Returns the unit pair results, sorted by relevancy. |
java.util.TreeSet |
getResults3Units()
Returns the unit triple results, sorted by relevancy. |
java.lang.String |
getSeparator()
Returns the separator between word and tag. |
int |
getTotalDocuments()
Returns the total number of documents in the input file. |
int |
getTotalResults2Units()
Returns the total number of results for unit pairs after pruning. |
int |
getTotalResults3Units()
Returns the total number of results for unit triples after pruning. |
int |
getTotalUnitsPair()
Returns the total number of unit pairs in the input file, duplicates included. |
int |
getTotalUnitsPairWithoutDouble()
Returns the total number of unit pairs in the input file, duplicates excluded. |
int |
getTotalUnitsTriple()
Returns the total number of unit triples in the input file, duplicates included. |
int |
getTotalUnitsTripleWithoutDouble()
Returns the total number of unit triples in the input file, duplicates excluded. |
int |
getTotalUnitsWithoutDouble()
Returns the total number of units in the input file, duplicates excluded. |
static void |
replace(java.util.LinkedHashSet units,
java.lang.String wordSeparator,
java.lang.String inputFileName,
java.lang.String newFileName,
int mode)
Replaces, in the file inputFileName, the adjacent units contained in units with their grouped or ungrouped form, and writes the result to a new file named newFileName. |
void |
setInputFileName(java.lang.String _inputFileName)
Sets the input file name. |
Methods inherited from class java.lang.Object |
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
public static final int GROUP_MODE
See Also:
replace, Constant Field Values

public static final int UNGROUP_MODE
See Also:
replace, Constant Field Values

Constructor Detail |
public ExtractTerms(java.lang.String _inputFileName, java.lang.String _regexUnit, java.lang.String _regexNameOfText, java.lang.String _regexSentPhraSeparator, java.lang.String _regex2Units1, java.lang.String _regex2Units2, int _tagKept2Units, java.lang.String _regex3Units1, java.lang.String _regex3Units2, java.lang.String _regex3Units3, int _tagKept3Units, int _formula, boolean _tfidf, int _pruning2Units, int _pruning3Units, int _maxResults2Units, int _maxResults3Units, java.lang.String _separator, int _relation2Units, int _relation3Units, boolean _occFirst)
Parameters:
_inputFileName - name of the input file containing the text to analyse
_regexUnit - regular expression used to split the input text into units
_regexSentPhraSeparator - regular expression for the sentence or phrase separator
_regexNameOfText - regular expression used to extract the name of a text in the corpus
_regex2Units1 - regular expression for the first unit of the analysed unit pairs
_regex2Units2 - regular expression for the second unit of the analysed unit pairs
_tagKept2Units - tag to keep between the 2 units when grouping
_regex3Units1 - regular expression for the first unit of the analysed unit triples
_regex3Units2 - regular expression for the second unit of the analysed unit triples
_regex3Units3 - regular expression for the third unit of the analysed unit triples
_tagKept3Units - tag to keep between the 3 units when grouping
_formula - formula for calculating relevancy
_tfidf - true if relevancy measures are based on tf*idf values rather than on the number of occurrences
_pruning2Units - unit pairs that occur strictly fewer than pruning times do not appear in the results
_pruning3Units - unit triples that occur strictly fewer than pruning times do not appear in the results
_maxResults2Units - maximum number of results for unit pairs
_maxResults3Units - maximum number of results for unit triples
_separator - separator between word and tag
_relation2Units - position of the element that was selected in the list of the 2-unit selector for this term; -1 to skip the extraction of unit pairs
_relation3Units - position of the element that was selected in the list of the 3-unit selector for this term; -1 to skip the extraction of unit triples
_occFirst - whether the measure takes the number of occurrences into account first
See Also:
ExtractTermsExample for an example
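The effect of a pruning threshold combined with a maximum number of results, as described for _pruning2Units and _maxResults2Units, can be sketched as follows. This is a hedged illustration rather than the class's actual logic: PruningSketch and prune are hypothetical names, and candidates are ranked here by raw occurrence count, whereas ExtractTerms ranks by the relevancy formula selected with _formula.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: names and ranking criterion are assumptions.
class PruningSketch {

    // Keeps pairs seen at least `pruning` times, most frequent first,
    // and returns at most `maxResults` of them.
    static List<String> prune(Map<String, Integer> pairCounts,
                              int pruning, int maxResults) {
        List<Map.Entry<String, Integer>> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : pairCounts.entrySet()) {
            // Pairs occurring strictly fewer than `pruning` times are dropped.
            if (e.getValue() >= pruning) {
                kept.add(e);
            }
        }
        // Descending by count; List.sort is stable, so ties keep map order.
        kept.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < kept.size() && i < maxResults; i++) {
            result.add(kept.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("mutual information", 3);
        counts.put("of the", 1);
        counts.put("regular expression", 5);
        counts.put("input file", 2);
        System.out.println(prune(counts, 2, 2));
    }
}
```

With a threshold of 2, a pair seen exactly twice is kept, since only pairs seen strictly fewer than 2 times are removed.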
Method Detail |
public int getTotalUnitsWithoutDouble()
public int getTotalUnitsPairWithoutDouble()
public int getTotalUnitsTripleWithoutDouble()
public java.util.TreeSet getResults2Units()
public java.util.TreeSet getResults3Units()
public int getTotalResults2Units()
public int getTotalResults3Units()
public int getTotalUnitsPair()
public int getTotalUnitsTriple()
public int getTotalDocuments()
public java.util.regex.Pattern getPatternUnit()
public java.util.regex.Pattern getPattern2Units()
public java.util.regex.Pattern getPattern3Units()
public java.util.regex.Pattern getPatternSentPhraSeparator()
public java.util.regex.Pattern getPatternNameOfText()
public int getPruning2Units()
public int getPruning3Units()
public java.lang.String getSeparator()
public java.lang.String getInputFileName()
public void setInputFileName(java.lang.String _inputFileName)
Parameters:
_inputFileName - input file name

public boolean extract(java.util.TreeSet currentSkippedTerms, int iteration, javax.swing.JProgressBar extractionJProgressBar)
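Example:
import java.util.Iterator;
import java.util.TreeSet;

ExtractTerms et = new ExtractTerms(...);
// Illustrative arguments: empty skip set, iteration 1, no progress bar.
et.extract(new TreeSet(), 1, null);
Iterator it = et.getResults2Units().iterator();
System.out.println("Relevancy results for units pair");
while (it.hasNext()) { System.out.println(it.next()); }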
The extraction results are not returned by this method; they are stored in fields of the ExtractTerms object. You can retrieve them with getResults2Units() and getResults3Units().
Parameters:
currentSkippedTerms - contains the terms (String) to skip
iteration - number of the iteration when this term was extracted
extractionJProgressBar - progress bar of the extraction; null if not used
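To make the interplay between a set of skipped terms and relevancy-sorted results more concrete, the following sketch ranks scored terms in a TreeSet while ignoring those in a skip set. Everything here is hypothetical: the element type stored by getResults2Units() is not documented on this page, so ScoredTerm, rank, and the tie-breaking rule are assumptions chosen for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: not the ExtractTerms implementation.
class SortedResultsSketch {

    // Hypothetical result type: a term and its relevancy score.
    static final class ScoredTerm implements Comparable<ScoredTerm> {
        final String term;
        final double score;

        ScoredTerm(String term, double score) {
            this.term = term;
            this.score = score;
        }

        // Higher scores come first; ties are broken alphabetically so that
        // distinct terms with equal scores are both kept by the TreeSet.
        @Override
        public int compareTo(ScoredTerm other) {
            int byScore = Double.compare(other.score, score);
            return byScore != 0 ? byScore : term.compareTo(other.term);
        }

        @Override
        public String toString() {
            return term + " (" + score + ")";
        }
    }

    // Builds the sorted result set, leaving out every skipped term.
    static TreeSet<ScoredTerm> rank(String[] terms, double[] scores,
                                    Set<String> skipped) {
        TreeSet<ScoredTerm> results = new TreeSet<>();
        for (int i = 0; i < terms.length; i++) {
            if (!skipped.contains(terms[i])) {
                results.add(new ScoredTerm(terms[i], scores[i]));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Set<String> skipped = new HashSet<>(Arrays.asList("of the"));
        TreeSet<ScoredTerm> results = rank(
                new String[] {"input file", "of the", "mutual information"},
                new double[] {1.5, 0.2, 3.0},
                skipped);
        for (ScoredTerm t : results) {
            System.out.println(t);
        }
    }
}
```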
public static void replace(java.util.LinkedHashSet units, java.lang.String wordSeparator, java.lang.String inputFileName, java.lang.String newFileName, int mode)
unit1 unit2...unitn <-> unit1-unit2-...-unitn
Parameters:
units - list of Occurrence objects containing the ungrouped unit pairs or triples, with their lines
wordSeparator - separator used for grouping units
inputFileName - input file name
newFileName - name of the new file written
mode - GROUP_MODE or UNGROUP_MODE
See Also:
Occurrence
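The mapping unit1 unit2...unitn <-> unit1-unit2-...-unitn can be illustrated on a single line of text with the minimal sketch below. It works on plain strings only and does not read or write files or use Occurrence objects, so it should be read as an approximation of the idea, not of replace itself; GroupingSketch, group, and ungroup are hypothetical names, and "-" is just an example value for wordSeparator.

```java
// Illustrative sketch only: not the ExtractTerms.replace implementation.
class GroupingSketch {

    // Joins the space-separated units of a term with wordSeparator,
    // e.g. "mutual information" -> "mutual-information".
    static String toGrouped(String spacedUnits, String wordSeparator) {
        return spacedUnits.replace(" ", wordSeparator);
    }

    // Grouping: every occurrence of the separate units becomes one term.
    static String group(String line, String spacedUnits, String wordSeparator) {
        return line.replace(spacedUnits, toGrouped(spacedUnits, wordSeparator));
    }

    // Ungrouping: the inverse operation, splitting the term back into units.
    static String ungroup(String line, String spacedUnits, String wordSeparator) {
        return line.replace(toGrouped(spacedUnits, wordSeparator), spacedUnits);
    }

    public static void main(String[] args) {
        String line = "the mutual information function";
        String grouped = group(line, "mutual information", "-");
        System.out.println(grouped);
        System.out.println(ungroup(grouped, "mutual information", "-"));
    }
}
```

Because this sketch matches plain substrings, it would also rewrite text where the units are only part of longer words; a real implementation would need to respect unit boundaries.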