How to analyze text in Java

I have a site with a form.

Users submit requests through that form, so I naively thought I could automate the review of those requests by analyzing the text. Makes sense, no?


Well, apparently I opened a Pandora's box called NLP (natural language processing).

It seems that text analysis is a vast subject, with many different algorithms for each task.

To bring some order to the chaos, I split the subject into several sub-subjects:
  • Sentence isolation - breaking a paragraph into sentences [not everyone uses a period]
  • Naming (named-entity recognition) - identifying names, places, dates, currencies etc.
  • POS tagging - finding the part of speech of each word in the sentence (noun, verb etc.)
  • Parsing - identifying sentence parts such as the subject, direct object etc.
There are many more parts and sub-parts, but these are the ones I decided to focus on.
Please note that, to be accurate, these tools need a large "dictionary" (a trained model) for the language being parsed, so they can be very heavy on megabytes (some are several hundred MB).
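To get a feel for the first item, sentence isolation, here is a minimal sketch that uses only the JDK's java.text.BreakIterator. It is not one of the NLP libraries reviewed in this post, and the class name SentenceSplitter is made up for illustration. It relies on punctuation, so it will not help with users who never type a period.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class: splits text into sentences with the JDK's rule-based BreakIterator.
class SentenceSplitter {

    static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Walk over the sentence boundaries; each [start, end) range is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "I live in California. I want to buy a horse! Is that possible?";
        for (String sentence : split(text, Locale.ENGLISH)) {
            System.out.println(sentence);
        }
    }
}
```

A trained sentence detector from one of the libraries below should cope better with abbreviations and sloppy punctuation, at the cost of loading a model file.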



I will use the following sentence as an example:
"I live in California and I want to buy a horse"


How will I analyze it?
(more on the Java-specific implementation in the next section)

First I will use the sentence isolation tool to isolate this sentence.
I will then use the Naming tool to identify that the user lives in California.
I will use the POS-tagging tool to identify "horse" as a noun (NN).
Then I will make sure it is the sentence's "direct object" by using the Parsing tool.
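The flow above can be sketched with a toy stand-in for each tool. Everything in this sketch is an assumption made for illustration: the class ToyPipeline, the hand-written place list, the ten-word lexicon, and the "first noun after the verb" shortcut that stands in for a real parser. Real libraries replace these lookups with trained models.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy illustration only: dictionary lookups stand in for real NER, POS tagging and parsing.
class ToyPipeline {

    // Hypothetical gazetteer of known places.
    static final Set<String> PLACES = Set.of("California");

    // Hypothetical lexicon mapping each word of the example sentence to a Penn Treebank tag.
    static final Map<String, String> LEXICON = Map.of(
            "I", "PRP", "live", "VBP", "in", "IN", "California", "NNP", "and", "CC",
            "want", "VBP", "to", "TO", "buy", "VB", "a", "DT", "horse", "NN");

    // Whitespace tokenizer; good enough for the example sentence.
    static List<String> tokenize(String sentence) {
        return Arrays.asList(sentence.trim().split("\\s+"));
    }

    // "Naming" step: keep the tokens that appear in the gazetteer.
    static List<String> findPlaces(List<String> tokens) {
        List<String> found = new ArrayList<>();
        for (String token : tokens) {
            if (PLACES.contains(token)) {
                found.add(token);
            }
        }
        return found;
    }

    // POS-tagging step: look the word up, or return "UNK" when it is unknown.
    static String tag(String token) {
        return LEXICON.getOrDefault(token, "UNK");
    }

    // Parsing stand-in: the first common noun (NN) after the verb, or null if there is none.
    static String directObjectOf(List<String> tokens, String verb) {
        int verbIndex = tokens.indexOf(verb);
        if (verbIndex < 0) {
            return null;
        }
        for (int i = verbIndex + 1; i < tokens.size(); i++) {
            if (tag(tokens.get(i)).equals("NN")) {
                return tokens.get(i);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("I live in California and I want to buy a horse");
        System.out.println("Places: " + findPlaces(tokens));
        System.out.println("Tag of 'horse': " + tag("horse"));
        System.out.println("Object of 'buy': " + directObjectOf(tokens, "buy"));
    }
}
```

The shortcut in directObjectOf breaks on anything more complex than the example sentence, which is exactly why a real parser is needed for the last step.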



What are the Java tools of the trade?


Requirements

  1. SIMPLE (many of these tools are very complicated, but I have a small project, so I don't want to delve into tiny language details)
  2. Preferably one tool that has all of the functionality
  3. Up to date (released within the last 3 years) and preferably updated regularly
  4. Preferably supports several languages
  5. FAST (some of these tools can be very slow at doing their thing)
  6. Preferably open source

I didn't review all of the Java libraries out there, but here are my results:
Gate - Fails requirement #1: although it has tons of documentation, it is too complicated to begin with
OpenNLP - Has all components, updated, seems easy to use, open source
Stanford - Has it all, might be slow, not sure how simple it is. Looks very professional though (for parsing, consider its shift-reduce parser)
Berkeley Parser - Seems like a great and fast parser. Unfortunately I can't find any documentation, and I don't think it contains the rest of the needed tools, BUT it supports 6 languages (fails requirement #2)
Mate-Tools - POS tagger + parser for 4 languages, no documentation. It seems to have good algorithms, but I don't think it is very user friendly, and it lacks the full stack of tools I need (fails requirement #2)
Dk-Pro-asl - One tool that supposedly has it all, not sure about the number of languages supported



So what will I use?

For a big project I think it is worth investigating Gate.
For my regular small projects I will try OpenNLP, which is an Apache project, and keep Dk-Pro-asl as a fallback.



Side note: for identifying a site's language I will use Language-detector.
