Wednesday, February 17, 2010

Iterate Through All Documents in Lucene's Index File [updated with MatchAllDocsQuery]

Update:
MatchAllDocsQuery can do this job too, and I think it is the better choice :)

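// indexDir is a java.io.File pointing at the index folder.
IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
IndexSearcher searcher = new IndexSearcher(reader);
Query query = new MatchAllDocsQuery();

// Ask for up to maxDoc() hits. Use at least 1, because an empty index would
// otherwise request 0 hits, which some Lucene versions reject.
TopDocs docs = searcher.search(query, Math.max(1, reader.maxDoc()));
for (ScoreDoc scoreDoc : docs.scoreDocs) {
    Document doc = searcher.doc(scoreDoc.doc);
    // do your job
}
reader.close();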


One of my projects uses Nutch to fetch forum posts and indexes them with Lucene. Each post is processed to strip HTML tags and BBS code. Later, we needed to extract some useful information from the posts. Obviously, iterating over the fetched raw HTML files is not a good idea.

Fortunately, the extracted post content is indexed as a field, postcontent. Reading Lucene's index and iterating through its documents is therefore much faster than reading the original files. Here is the solution:

public void iterateIndex(String indexFolderPath) {
    try {
        Directory index = new SimpleFSDirectory(new File(indexFolderPath));
        IndexReader reader = IndexReader.open(index);
        try {
            // maxDoc() also counts deleted documents, so check each slot.
            for (int i = 0; i < reader.maxDoc(); i++) {
                // Skip deleted documents; document(i) cannot load them.
                if (reader.isDeleted(i)) {
                    continue;
                }
                Document doc = reader.document(i);
                Field contentField = doc.getField("postcontent");
                if (contentField != null && contentField.stringValue() != null) {
                    String postContent = contentField.stringValue();
                    // do sth here
                }
            }
        } finally {
            // Release the index files even if reading fails.
            reader.close();
        }
        this.writeToFile();
    } catch (IOException e1) {
        e1.printStackTrace();
    }
}

Tuesday, February 16, 2010

Extract Email Address From Text Using Regular Expression in Java

I tried to extract an email address from arbitrary text using a regular expression, but the extraction is not 100% accurate :( So I also wrote a validation method to check the extracted address. It is not very smart either, being based only on the "@". I will try to improve it later.


public String extractEmail(String content) {
    // Candidate pattern: dotted word characters, "@", then a dotted domain.
    String regex = "(\\w+)(\\.\\w+)*@(\\w+\\.)(\\w+)(\\.\\w+)*";
    Matcher matcher = Pattern.compile(regex).matcher(content);
    // Only the first candidate is considered; the result is null
    // if there is no candidate or it fails validation.
    if (matcher.find()) {
        String email = matcher.group();
        if (isValidEmailAddress(email)) {
            return email;
        }
    }
    return null;
}

public boolean isValidEmailAddress(String emailAddress) {
    // Whole-string check: at least three characters before the "@",
    // one or more dot-terminated domain labels, then a 2-4 letter ending.
    String expression = "^[\\w\\-]([\\.\\w])+[\\w]+@([\\w\\-]+\\.)+[A-Z]{2,4}$";
    Pattern pattern = Pattern.compile(expression, Pattern.CASE_INSENSITIVE);
    return pattern.matcher(emailAddress).matches();
}
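
Since the extraction above stops at the first candidate, here is a minimal sketch of how the same two patterns could collect every address in a text instead. The class name EmailExtractor and the method extractEmails are made up for this illustration; only the two regular expressions come from the code above. It also shows one of the inaccuracies: a short but plausible address such as ab@example.com is found by the first pattern and then rejected by the validation pattern, which wants at least three characters before the "@".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original code.
final class EmailExtractor {
    // Same candidate pattern as extractEmail above.
    private static final Pattern CANDIDATE =
            Pattern.compile("(\\w+)(\\.\\w+)*@(\\w+\\.)(\\w+)(\\.\\w+)*");
    // Same validation pattern as isValidEmailAddress above.
    private static final Pattern VALID = Pattern.compile(
            "^[\\w\\-]([\\.\\w])+[\\w]+@([\\w\\-]+\\.)+[A-Z]{2,4}$",
            Pattern.CASE_INSENSITIVE);

    // Returns every candidate that also passes validation, in order of appearance.
    static List<String> extractEmails(String content) {
        List<String> emails = new ArrayList<String>();
        Matcher matcher = CANDIDATE.matcher(content);
        while (matcher.find()) {
            String candidate = matcher.group();
            if (VALID.matcher(candidate).matches()) {
                emails.add(candidate);
            }
        }
        return emails;
    }

    public static void main(String[] args) {
        String text = "Contact john.smith@example.com, ab@example.com "
                + "or mary_jones@mail.example.org.";
        // prints [john.smith@example.com, mary_jones@mail.example.org]
        System.out.println(extractEmails(text));
    }
}
```

Compiling the patterns once as constants avoids recompiling them for every post when this runs over a whole index.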