Wednesday, July 14, 2010

Extract ISBN Number From PDF Using JAVA



Part 1: What is ISBN?

ISBN's are described on the ISBN

The International Standard Book Number is known throughout the world as a short, clear and potentially machine-readable identification number which marks any book unmistakably

A valid ISBN has the following structure (as defined on Wikipedia):
The ISBN is 13 digits long if assigned after January 1, 2007, and 10 digits long if assigned before 2007. An International Standard Book Number consists of 4 or 5 parts:

1. for a 13 digit ISBN, a GS1 prefix: 978 or 979 (indicating the industry; in this case, 978 denotes book publishing)
2. the group identifier, (language-sharing country group)
3. the publisher code,
4. the item number, (title of the book) and
5. a checksum character or check digit.

The ISBN separates its parts (group, publisher, title and check digit) with either a hyphen or a space. Other than the check digit, no part of the ISBN will have a fixed number of digits.
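
The check digit mentioned above can be verified in a few lines of Java. This is a minimal sketch and not part of the original extraction code: the class name `IsbnChecksum` and its method names are made up for illustration. It applies the standard weighting rules: ISBN-10 uses weights 10 down to 1 and must sum to a multiple of 11 (with X standing for 10), while ISBN-13 uses alternating weights 1 and 3 and must sum to a multiple of 10.

```java
final class IsbnChecksum {

    /** Removes hyphens and spaces so only the digits (and a possible X) remain. */
    static String normalize(String isbn) {
        return isbn.replaceAll("[- ]", "");
    }

    /** ISBN-10: digits weighted 10 down to 1 must sum to a multiple of 11; X counts as 10. */
    static boolean isValidIsbn10(String isbn) {
        String s = normalize(isbn);
        // X is only allowed as the last (check) character
        if (!s.matches("[0-9]{9}[0-9Xx]")) {
            return false;
        }
        int sum = 0;
        for (int i = 0; i < 10; i++) {
            char c = s.charAt(i);
            int value = (c == 'X' || c == 'x') ? 10 : c - '0';
            sum += value * (10 - i);
        }
        return sum % 11 == 0;
    }

    /** ISBN-13: digits weighted alternately 1 and 3 must sum to a multiple of 10. */
    static boolean isValidIsbn13(String isbn) {
        String s = normalize(isbn);
        // 13 digits starting with the GS1 prefix 978 or 979
        if (!s.matches("97[89][0-9]{10}")) {
            return false;
        }
        int sum = 0;
        for (int i = 0; i < 13; i++) {
            int digit = s.charAt(i) - '0';
            sum += (i % 2 == 0) ? digit : digit * 3;
        }
        return sum % 10 == 0;
    }

    public static void main(String[] args) {
        System.out.println(isValidIsbn13("978-0-321-60378-4")); // true
        System.out.println(isValidIsbn10("0-321-60378-8"));     // true
        System.out.println(isValidIsbn13("978-0-321-60378-5")); // false
    }
}
```

A check like this is a cheap way to reject strings that merely look like an ISBN, such as a phone number on the copyright page.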


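Here are some examples (taken from the sample book used later in this post):

ISBN-13: 978-0-321-60378-4
ISBN-10: 0-321-60378-8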

Part 2: Extract Plain Text From PDF


There are several open source PDF libraries available in the Java world. PDFBox is the one chosen here for extracting text from a PDF document.

Rather than extracting the text of the whole document at once, we extract it page by page, since most ebooks list their ISBN within the first 10 pages.

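import org.apache.pdfbox.pdmodel.PDDocument;
// PDFBox 1.x package; in PDFBox 2.x the class is org.apache.pdfbox.text.PDFTextStripper
import org.apache.pdfbox.util.PDFTextStripper;

// we only try to extract text from the first 10 pages
final int MAX_PAGE = 10;
// PDFTextStripper page numbers are 1-based, so begin at the first page
int start = 1;

boolean found = false;
// create the PDDocument from a java.io.File (the variable "file")
PDDocument document = PDDocument.load(file);
try {
    PDFTextStripper stripper = new PDFTextStripper();
    // do not read past the last page of a short document
    int lastPage = Math.min(MAX_PAGE, document.getNumberOfPages());

    while (!found && start <= lastPage) {
        // start and end page are both inclusive, so this extracts exactly one page
        stripper.setStartPage(start);
        stripper.setEndPage(start);

        String pageText = stripper.getText(document);

        // placeholder check: stop at the first page that mentions "ISBN"
        if (pageText != null && pageText.contains("ISBN")) {
            // you can do something with the extracted text here (see Part 3);
            // once done, mark found = true to finish extracting
            found = true;
        } else {
            start++;
        }
    }
} finally {
    // always release the document, even if extraction fails
    document.close();
}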

Part 3: Extract ISBN from Plain Text using Regular Expression


Extracted Plain Text Example

Eclipse Rich Client Platform Second Edition

First printing, May 2010 (on page 5)
Copyright © 2010 Pearson Education, Inc.
All rights reserved. Printed in the United States of America. This publication is protected by copyright,
and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a
retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying,
recording, or likewise. For information regarding permissions, write to:
Pearson Education, Inc.
Rights and Contracts Department
501 Boylston Street, Suite 900
Boston, MA 02116
Fax: (617) 671-3447
ISBN-13: 978-0-321-60378-4
ISBN-10: 0-321-60378-8
Text printed in the United States on recycled paper at RR Donnelley in Crawfordsville, Indiana.
First printing, May 2010

Lucene in Action Second edition (on page 5)
ISBN 978-1-933988-17-7
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – MAL – 15 14 13 12 11 10

Step 1: Extract every line that contains the text "ISBN"
With the MULTILINE flag set, the regular expression "^.*ISBN.*$" matches every whole line that contains the string "ISBN".
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final String isbnLinePattern = "^.*ISBN.*$";

// a line with ISBN information has at least 4 + 10 characters ("ISBN" plus 10 digits)
public final int MIN_LENGTH = 14;

public List<String> extractISBNLines(String content) {
    // MULTILINE makes ^ and $ match at line boundaries;
    // CASE_INSENSITIVE also accepts "isbn" and "Isbn"
    Pattern pattern = Pattern.compile(isbnLinePattern,
            Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
    Matcher lineMatcher = pattern.matcher(content);
    List<String> list = new ArrayList<String>();
    // loop through every matching line and keep those long enough to hold an ISBN
    while (lineMatcher.find()) {
        String line = lineMatcher.group();
        if (line.length() >= MIN_LENGTH) {
            list.add(line);
        }
    }
    return list;
}
After this extraction, we get one list of lines for each of the two books in this example:

Eclipse Rich Client Platform Second Edition
ISBN-13: 978-0-321-60378-4
ISBN-10: 0-321-60378-8
Lucene in Action Second edition
ISBN 978-1-933988-17-7
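
To see why the MULTILINE flag matters in Step 1, here is a small self-contained sketch. The class `IsbnLineDemo` and its helper are made up for this illustration. It runs the same line pattern over a few lines of the sample text, once with and once without the flag. Without MULTILINE, `^` and `$` only anchor at the start and end of the whole input, so nothing matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IsbnLineDemo {

    /** Returns every match of the line pattern in the text, compiled with the given flags. */
    static List<String> isbnLines(String text, int flags) {
        Matcher m = Pattern.compile("^.*ISBN.*$", flags).matcher(text);
        List<String> lines = new ArrayList<>();
        while (m.find()) {
            lines.add(m.group());
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = "Fax: (617) 671-3447\n"
                + "ISBN-13: 978-0-321-60378-4\n"
                + "ISBN-10: 0-321-60378-8\n"
                + "First printing, May 2010";

        // prints [ISBN-13: 978-0-321-60378-4, ISBN-10: 0-321-60378-8]
        System.out.println(isbnLines(text, Pattern.MULTILINE | Pattern.CASE_INSENSITIVE));

        // prints [] because "." never crosses a line break and ^/$ only anchor the whole input
        System.out.println(isbnLines(text, Pattern.CASE_INSENSITIVE));
    }
}
```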

Step 2: Extract the ISBN-10 or ISBN-13 line by line
After step 1 we have clean ISBN lines, so the extraction is much easier now. That is why the regular expressions used here can stay simple: they only need to work on lines that are already known to contain an ISBN.

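// ISBN-10 printed with separators: 10 digits + 3 hyphens/spaces = 13 characters.
// It must start with a digit, may end with the check character X, and must not
// be part of a longer run of digits and hyphens (such as an ISBN-13).
public final String isbn10Pattern = "(?<![-0-9])[0-9][-0-9 ]{11}[0-9Xx](?![-0-9])";

// ISBN-13 printed with separators: 13 digits + 4 hyphens/spaces = 17 characters.
// It must start with the GS1 prefix 978 or 979.
public final String isbn13Pattern = "(?<![-0-9])97[89][-0-9 ]{13}[0-9](?![-0-9])";

public Book extractISBN(List<String> list, Book book) {
    for (String line : list) {
        // only look for a value that has not been found on an earlier line
        if (book.getIsbn13() == null) {
            String isbn13 = extractByPattern(this.isbn13Pattern, line);
            book.setIsbn13(isbn13);
        }
        if (book.getIsbn10() == null) {
            String isbn10 = extractByPattern(this.isbn10Pattern, line);
            book.setIsbn10(isbn10);
        }
    }
    return book;
}

/**
 * Extract information by regular expression
 *
 * @param patternStr regular expression to search for
 * @param content text to search in
 * @return the first matched string, or null if nothing matches
 */
public static String extractByPattern(String patternStr, String content) {
    Pattern pattern = Pattern.compile(patternStr);
    Matcher matcher = pattern.matcher(content);
    if (matcher.find()) {
        return matcher.group(0);
    }
    return null;
}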
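
Step 2 can be tried out on the sample lines without a PDF or a Book class. The sketch below is an illustration under stated assumptions, not a definitive implementation: the class `IsbnExtractDemo`, its helpers, and the two patterns are assumptions of this example. The patterns are anchored so that a match starts on a digit, ends on the check character, and is not part of a longer run of digits and hyphens. It also strips the separators, which is handy if the ISBN is used as a lookup key afterwards.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IsbnExtractDemo {

    // Assumed pattern: 13 digits plus 4 separators, starting with 978 or 979
    static final Pattern ISBN13 =
            Pattern.compile("(?<![-0-9])97[89][-0-9 ]{13}[0-9](?![-0-9])");

    // Assumed pattern: 10 characters plus 3 separators, possibly ending in X
    static final Pattern ISBN10 =
            Pattern.compile("(?<![-0-9])[0-9][-0-9 ]{11}[0-9Xx](?![-0-9])");

    /** Returns the first match of the pattern in the line, or null if there is none. */
    static String firstMatch(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.find() ? m.group() : null;
    }

    /** Removes hyphens and spaces from an extracted ISBN; passes null through. */
    static String stripSeparators(String isbn) {
        return isbn == null ? null : isbn.replaceAll("[- ]", "");
    }

    public static void main(String[] args) {
        String[] lines = {
            "ISBN-13: 978-0-321-60378-4",
            "ISBN-10: 0-321-60378-8",
            "ISBN 978-1-933988-17-7"
        };
        for (String line : lines) {
            String isbn13 = firstMatch(ISBN13, line);
            String isbn10 = firstMatch(ISBN10, line);
            System.out.println(line + " -> isbn13=" + stripSeparators(isbn13)
                    + ", isbn10=" + stripSeparators(isbn10));
        }
    }
}
```

Because of the lookarounds, the ISBN-10 pattern does not pick up a fragment of an ISBN-13 on the same line, so the first line yields only an ISBN-13 and the second only an ISBN-10.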

That's it, enjoy extracting!
