Extracting text from Word/PDF documents
Want to make a word cloud out of a whole bunch of documents, or do some simple statistics on them? Lucene is great for indexing documents, but I wanted something quick and dirty I could mess around with.
For example, here is a tag cloud from my research papers:
Here is some quick code I cooked up to extract text from a collection of Word/PDF documents.
You’ll need these using directives (plus project references to the Word interop and PDFBox assemblies):
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using Word = Microsoft.Office.Interop.Word;
using org.pdfbox.pdmodel;
using org.pdfbox.util;
Read up on the PDFBox PDF parser project here. This will get the raw text from a PDF:
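private static string ReadPDFText(string name)
{
    PDDocument doc = null;
    try
    {
        doc = PDDocument.load(name);
        var stripper = new PDFTextStripper();
        return stripper.getText(doc);
    }
    catch
    {
        // Quick and dirty: treat any PDF that fails to parse as empty text.
        return "";
    }
    finally
    {
        // Close the document so the file is not left open.
        if (doc != null) doc.close();
    }
}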
Getting text from Word documents involves using Word automation:
// Set up object
var app = new Word.Application();
app.Visible = false;
...
content = OpenWordDoc(app,file);
// Shut down
var missing = Type.Missing;
app.Quit(ref missing, ref missing, ref missing);
You’ll have to open the Word document and read its text like this:
private static string OpenWordDoc(Word.Application app, string name)
{
object readOnly = true;
object missing = Type.Missing;
object fileName = System.IO.Path.GetFullPath(name);
var doc = app.Documents.Open(ref fileName,
ref missing, ref readOnly, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing, ref missing);
string text = doc.Content.Text;
app.Documents.Close(ref missing, ref missing, ref missing);
return text;
}
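To run both extractors over a whole folder, one option is to pick the extractor by file extension. This is only a sketch: `TextCollector` and `ReadAll` are hypothetical names that are not part of the original project, and the extractors are passed in as delegates so the helper does not depend on Word or PDFBox itself.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class TextCollector
{
    // Hypothetical helper: concatenates the text of every file in a folder,
    // choosing an extractor by lower-cased file extension (e.g. ".pdf").
    // Files whose extension has no registered extractor are skipped.
    public static string ReadAll(string folder,
        IDictionary<string, Func<string, string>> extractors)
    {
        var sb = new StringBuilder();
        foreach (var file in Directory.GetFiles(folder))
        {
            string ext = Path.GetExtension(file).ToLowerInvariant();
            Func<string, string> extract;
            if (extractors.TryGetValue(ext, out extract))
            {
                sb.AppendLine(extract(file));
            }
        }
        return sb.ToString();
    }
}
```

With the functions from this post, you would register ReadPDFText under ".pdf" and a lambda such as f => OpenWordDoc(app, f) under ".doc" and ".docx".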
Now that you have the raw text, you just need to “tokenize” it and count how often each word occurs:
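private static Dictionary<string, int> GetWords(string content)
{
    var dict = new Dictionary<string, int>();
    // Split on runs of non-word characters.
    var list = Regex.Split(content, @"\W+");
    foreach (var word in list)
    {
        // Regex.Split can yield empty strings at the edges; skip them.
        if (word.Length == 0) continue;
        string low = word.ToLower();
        if (!dict.ContainsKey(low))
        {
            dict[low] = 0;
        }
        dict[low]++;
    }
    return dict;
}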
Now you can filter down and process the results any way you like:
var ranked = dict.OrderByDescending(pair => pair.Value).ToList();
var filtered = ranked.Where(pair => IsAlphabeticString(pair.Key)).ToList();
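The IsAlphabeticString helper is not shown in this post. A minimal sketch of what it might look like, assuming it should accept only non-empty strings made up entirely of letters (the `WordFilters` class name is hypothetical, and the version in the downloadable project may differ):

```csharp
using System.Linq;

static class WordFilters
{
    // Assumed behaviour: true only for non-empty strings in which
    // every character is a letter, so "abc123" and "" are rejected.
    public static bool IsAlphabeticString(string s)
    {
        return !string.IsNullOrEmpty(s) && s.All(c => char.IsLetter(c));
    }
}
```

Filtering this way drops tokens such as page numbers and years, which otherwise tend to rank high in papers.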
Now you can simply output this dictionary and use the output in a word cloud generator such as wordle.net.
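One way to write the counts out is one "word:count" pair per line. This is a sketch under an assumption: that your word cloud generator accepts weighted input in that form, so check what yours expects. `CloudOutput` and `WriteWeights` are hypothetical names, not part of the original project.

```csharp
using System.Collections.Generic;
using System.IO;

static class CloudOutput
{
    // Writes one "word:count" line per entry, in the order given.
    // The "word:count" format is an assumption about the target generator.
    public static void WriteWeights(
        IEnumerable<KeyValuePair<string, int>> ranked, TextWriter output)
    {
        foreach (var pair in ranked)
        {
            output.WriteLine("{0}:{1}", pair.Key, pair.Value);
        }
    }
}
```

Passing Console.Out prints the list to the console; passing a StreamWriter saves it to a file you can paste from.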
You can download the whole thing here: GenerateWordTagCloud.zip [C# VS2010 Project].