Melanie Tosik

27 January 2018

Text mining in Java

During my first semester at NYU, I took a class on predictive analytics. We covered everything from the data mining project life cycle, understanding and preprocessing (noisy) data sets, dimensionality reduction, feature selection, and data clustering and classification algorithms to mining association rules. We also got some hands-on experience with large-scale data analytics frameworks, such as MapReduce in Hadoop and Apache Spark.

Throughout the semester, we also designed and implemented a complete text mining pipeline in Java. We started by vectorizing a collection of news articles using term frequency-inverse document frequency (tf-idf). We then used k-means clustering, with the cosine similarity between the document vectors as the similarity measure, to cluster the articles based on their content and semantic similarity. Finally, we implemented the k-nearest neighbors algorithm (k-NN) to assign previously unseen documents to the clusters we had established in step two.
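To make the first two steps more concrete, here is a minimal sketch of tf-idf vectorization and cosine similarity in plain Java. It is not the code from the project: the class name `TfIdfSketch` and its methods are hypothetical, the documents are assumed to be tokenized already, and the weighting assumes the common variant tf = count / document length and idf = ln(N / df), which may differ from what the class used. The `nearest` helper only illustrates how cosine similarity could pick the closest cluster centroid for a document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration; not taken from the project.
final class TfIdfSketch {

    // Builds one sparse tf-idf vector (term -> weight) per tokenized document.
    // Assumed weighting: tf = count / document length, idf = ln(N / df).
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: in how many documents does each term occur?
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term counts within this document.
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
            Map<String, Double> vector = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = (double) e.getValue() / doc.size();
                double idf = Math.log((double) docs.size() / df.get(e.getKey()));
                vector.put(e.getKey(), tf * idf);
            }
            vectors.add(vector);
        }
        return vectors;
    }

    // Cosine similarity of two sparse vectors; 0 if either vector is all zeros.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double weight = b.get(e.getKey());
            if (weight != null) {
                dot += e.getValue() * weight;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double weight : v.values()) {
            sum += weight * weight;
        }
        return Math.sqrt(sum);
    }

    // Index of the centroid most similar to the document (first one wins ties).
    static int nearest(Map<String, Double> doc, List<Map<String, Double>> centroids) {
        int bestIndex = -1;
        double bestSimilarity = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < centroids.size(); i++) {
            double similarity = cosine(doc, centroids.get(i));
            if (similarity > bestSimilarity) {
                bestSimilarity = similarity;
                bestIndex = i;
            }
        }
        return bestIndex;
    }
}
```

Sparse maps keep the vectors small, since each article only contains a tiny fraction of the vocabulary. A k-means loop would call `nearest` for every document and then recompute each centroid as the mean of its assigned vectors until the assignments stop changing.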

The only existing libraries I used were JAMA for the singular-value decomposition (SVD) and Stanford CoreNLP to preprocess the text documents. Everything else was implemented from scratch.

View project on GitHub ☺︎

Til next time,