Showing posts from July, 2014

How to analyze text in Java

I have a site with a form.
Users submit requests through that form, so I naively thought I could automate the review of those requests by analyzing their text. Makes sense, no?

Well, apparently I opened a Pandora's box called NLP.
It turns out that text analysis is a vast subject, with many different algorithms for doing it.
To bring some order to the chaos, I want to split the subject into several sub-subjects:

Sentence isolation - breaking a paragraph into sentences [not everyone uses a "period"]
Naming - identifying names, places, dates, currency etc.
POS tagging - finding the type of each word in the sentence (noun, verb etc.)
Parsing - identifying sentence parts such as subject, direct object etc.

There are many more parts and sub-parts, but the above are the ones I decided to focus on.
Please note that in order to be accurate the tools need a big "dictionary" of the parsed language, thus these tools might be very heavy on megabytes (some are several hundred M…
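To make the first item (sentence isolation) concrete, here is a minimal sketch that uses only the JDK's built-in java.text.BreakIterator, with no external NLP library or dictionary. The class name SentenceSplitter and the sample text are my own, for illustration only; a real NLP toolkit would likely handle abbreviations and missing punctuation better than this.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not taken from any library.
class SentenceSplitter {

    // Splits the text into sentences using the JDK's rule-based sentence iterator.
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Walk from one sentence boundary to the next until the iterator is exhausted.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Sample input is made up for this example.
        for (String s : split("Is this a request? Yes it is! Please review it.")) {
            System.out.println(s);
        }
    }
}
```

This only covers the first sub-subject; naming, POS tagging and parsing are where the heavy dictionaries mentioned above come in.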

Crawling a site with Java

I wanted to crawl a site using Java.
I didn't want to reinvent the wheel, so I looked for a good Java library that crawls sites.
Off the top of my head I knew of several libraries, so I thought of checking them first. I began with Solr & Lucene and continued with Nutch.
After doing some reading I understood that these libraries are out of my scope: they will do the job, but they are absolute overkill.
Why?
Well, I guess I need to begin with my actual requirements. Before the requirements, I suggest reading the following post, just to understand the basics (I won't use that actual code, but it helps to understand the concept and to sharpen the requirements):
How to build a crawler in Java

Requirements

Crawls a full site
Minimum amount of logic, so it won't go to the same URL more than once etc.
Multithreading
Good API so it will be easy to work with: I need a simple API to define the number of threads, the site root, a filter for which pages to pull according to their URL…
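To sharpen these requirements, here is a rough sketch of what such an API could look like. It is not the code from the linked post and not any existing library: the class SiteCrawler, its constructor parameters and the linkFetcher function (a stand-in for "download the page and extract its links") are all assumptions of mine. It only shows the core logic: a thread pool, a URL filter, and a shared "visited" set so that no URL is fetched twice.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical single-use crawler sketch; all names are illustrative.
class SiteCrawler {
    private final ExecutorService pool;
    private final Predicate<String> urlFilter;                // which URLs to pull
    private final Function<String, List<String>> linkFetcher; // assumed: returns the links found on a page
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final AtomicInteger pending = new AtomicInteger(1); // 1 = the caller of crawl()
    private final CountDownLatch done = new CountDownLatch(1);

    SiteCrawler(int threads, Predicate<String> urlFilter,
                Function<String, List<String>> linkFetcher) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.urlFilter = urlFilter;
        this.linkFetcher = linkFetcher;
    }

    // Crawls everything reachable from root and returns the set of visited URLs.
    Set<String> crawl(String root) {
        submit(root);
        release(); // the caller gives up its own "pending" slot
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown();
        return visited;
    }

    private void submit(String url) {
        // visited.add() is atomic: only the first thread to add a URL gets true.
        if (!urlFilter.test(url) || !visited.add(url)) {
            return;
        }
        pending.incrementAndGet();
        pool.execute(() -> {
            try {
                for (String link : linkFetcher.apply(url)) {
                    submit(link);
                }
            } finally {
                release();
            }
        });
    }

    // Signals completion once no page is queued or in progress.
    private void release() {
        if (pending.decrementAndGet() == 0) {
            done.countDown();
        }
    }
}
```

With something like this, usage would be roughly new SiteCrawler(4, url -> url.startsWith(siteRoot), fetcher).crawl(siteRoot), where siteRoot and fetcher are supplied by the caller. The real work (HTTP, HTML parsing, politeness delays, error handling) is hidden behind linkFetcher and is exactly the part I would rather get from a library.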

2014 July, Eclipse Development Survey Analyzed

The Eclipse community has just released its yearly survey (about 900 participants).
I see this survey as a good source of data for getting a feel for the current development vibe.

Here is the Eclipse Survey

So, I went over the survey results several times and decided to write down my own conclusions. I only want to write about the big changes I saw that are worthy of my attention (and probably yours too, if you are reading this post).
My Personal Survey Conclusions

Open source has lost some of its prestige (a pity) [Slide 9]
Open Hardware and Internet of Things are here to stay [Slide 13, 17]
Most Eclipse users upgrade their IDE soon after each yearly release [Slide 15]
JavaScript is here to stay [Slide 21]
Application Servers: Tomcat remains first, but JBoss is gaining [Slide 23]
Repositories: CVS is finally dead, SVN is dying fast and Git is gaining fast [Slide 24]
Build Tools: Ant is dying fast, Maven is leading big time, Gradle is gaining fast [Slide …