Showing posts from July, 2014

How to analyze text in Java

I have a site with a form. Users fill in that form to send me requests, so I naively thought I could automate the review of those requests by analyzing the text. Makes sense, no? Well, apparently I opened a Pandora's box called NLP. It seems that text analysis is a vast subject with many different algorithms. To make some order out of the chaos, I want to split the subject into several sub-subjects:

- Sentence isolation - breaking the paragraph into sentences [not everyone uses a "period"]
- Naming - identifying names, places, dates, currency, etc.
- POS tagging - finding the type of each word in the sentence (noun, verb, etc.)
- Parsing - identifying sentence parts such as subject, direct object, etc.

There are many more parts and sub-parts, but the above are those I decided to focus on. Please note that in order to be accurate, the tools need a big "dictionary" of the parsed language, thus these tools might be very heavy on
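To make the sentence isolation step a bit more concrete, here is a minimal sketch that uses only the JDK's built-in `java.text.BreakIterator`. It is not one of the heavy, dictionary-based NLP tools mentioned above, and it will likely be less accurate than they are. The class name `SentenceSplitter` and the sample request text are made up for this illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, named only for this example.
final class SentenceSplitter {

    // Splits text into sentences using the JDK's locale-aware boundary rules.
    static List<String> split(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getSentenceInstance(locale);
        boundaries.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = boundaries.first();
        // Walk from one sentence boundary to the next until the iterator is exhausted.
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Made-up form request, just to show the call.
        String request = "Hello there! Can you reset my password? I lost it yesterday.";
        split(request, Locale.ENGLISH).forEach(System.out::println);
    }
}
```

Note that the boundary rules are generic, so text with abbreviations or missing punctuation may still be split in the wrong places.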

Crawling a site with Java

I wanted to crawl a site using Java. I didn't want to reinvent the wheel, so I looked for a good Java library that crawls sites. Off the top of my head I knew of several libraries, so I thought of checking them first. I began with Solr & Lucene and continued with Nutch. After doing some reading I understood that these libraries are out of my scope: they will do the job, but they are absolute overkill. Why? Well, I guess I need to begin with my actual requirements.

Before the requirements, I suggest reading the following post, just to understand the basics (I won't use that actual code, but it does help to understand the concept and to sharpen the requirements): How to build a crawler in Java

Requirements

- Crawls a full site
- Minimum amount of logic, so it won't go to the same URL more than once, etc.
- Multithreading
- Good API so it will be easy to work with, I need a simple API to define the number of threads, site root, fil
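To sharpen what the first few requirements mean in code, here is a rough sketch of a crawler core that stays under a site root, visits each URL only once, and lets the caller pick the number of threads. It is my own illustration under stated assumptions, not the API of any of the libraries mentioned above: the class name `SimpleCrawler`, the `linkFetcher` parameter, and the in-memory "site" in `main` are all hypothetical. A real `linkFetcher` would download the page and extract its links.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.function.Function;

// Hypothetical crawler core, named only for this example.
final class SimpleCrawler {
    private final String root;
    private final int threads;
    // Assumption: given a page URL, returns the absolute links found on that page.
    private final Function<String, List<String>> linkFetcher;

    SimpleCrawler(String root, int threads, Function<String, List<String>> linkFetcher) {
        this.root = root;
        this.threads = threads;
        this.linkFetcher = linkFetcher;
    }

    // Crawls everything reachable from the root and returns the visited URLs.
    Set<String> crawl() {
        Set<String> visited = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Phaser pending = new Phaser(1); // the calling thread is the first party
        submit(root, visited, pool, pending);
        pending.arriveAndAwaitAdvance(); // blocks until every page task has finished
        pool.shutdown();
        return visited;
    }

    private void submit(String url, Set<String> visited, ExecutorService pool, Phaser pending) {
        // Skip URLs outside the site; add() returns false if the URL was already seen.
        if (!url.startsWith(root) || !visited.add(url)) {
            return;
        }
        pending.register(); // one party per page task still in flight
        pool.execute(() -> {
            try {
                for (String link : linkFetcher.apply(url)) {
                    submit(link, visited, pool, pending);
                }
            } finally {
                pending.arriveAndDeregister();
            }
        });
    }

    public static void main(String[] args) {
        // Stand-in for a real site: page URL -> links found on that page.
        Map<String, List<String>> fakeSite = Map.of(
            "http://example.com/", List.of("http://example.com/a", "http://example.com/b"),
            "http://example.com/a", List.of("http://example.com/", "http://other.org/"),
            "http://example.com/b", List.of("http://example.com/a"));
        SimpleCrawler crawler = new SimpleCrawler("http://example.com/", 4,
            url -> fakeSite.getOrDefault(url, List.of()));
        System.out.println(crawler.crawl());
    }
}
```

The concurrent set is what keeps the crawler from fetching the same URL twice, even when several threads discover the same link at once. The `Phaser` is just one way to know when the crawl has finished, and it caps the number of in-flight tasks at 65535, which is fine for a sketch but something to revisit for a large site.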

2014 July, Eclipse Development Survey Analyzed

The Eclipse community has just released its yearly survey (about 900 participants). I see this survey as a good source of data for getting a feel for the current development vibe. Here is the Eclipse Survey

I looked at the survey results several times and decided to write down my own conclusions. I only want to write about the big changes I saw that are worthy of my attention (and probably yours too, if you are reading this post).

My Personal Survey Conclusions

- Open source has lost some of its prestige (a pity) [Slide 9]
- Open Hardware and Internet of Things are here to stay [Slide 13, 17]
- Most Eclipse users upgrade their IDE soon after each yearly release [Slide 15]
- JavaScript is here to stay [Slide 21]
- Application Servers: Tomcat remains first, but JBoss is gaining [Slide 23]
- Repositories: CVS is finally dead, SVN is dying fast and Git is gaining fast [Slide 24]
- Build Tools: Ant is dying fast, Maven is leading