Text mining in Java
During my first semester at NYU, I took a class on predictive analytics. We covered everything from the data mining project life cycle, understanding and preprocessing (noisy) data sets, dimensionality reduction, feature selection, and data clustering and classification algorithms to mining association rules. We also got some hands-on experience with large-scale data analytics frameworks, such as MapReduce in Hadoop and Apache Spark.
Throughout the semester, we also designed and implemented a complete text mining pipeline in Java. We started by vectorizing a collection of news articles using term frequency-inverse document frequency (tf-idf). We then used k-means clustering, with cosine similarity between the document vectors as the similarity measure, to cluster the articles based on their content and semantic similarity. Finally, we implemented the k-nearest neighbors algorithm (k-NN) to assign previously unseen documents to the clusters we had established in step two.
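To give a rough idea of the first two building blocks, here is a minimal sketch of tf-idf vectorization and cosine similarity in Java. It is not the code from the class: the class name TfIdfSketch, the method names, and the toy documents are made up for illustration, and it assumes the documents are already tokenized. It also assumes one common weighting variant, tf = term count / document length and idf = ln(N / document frequency); other variants (smoothing, log-scaled tf) are equally valid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class, not part of any library.
class TfIdfSketch {

    // Builds one sparse tf-idf vector (term -> weight) per tokenized document.
    // Assumed variant: tf = count / docLength, idf = ln(N / df).
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: the number of documents each term occurs in.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }

        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term counts within this document.
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
            Map<String, Double> vector = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = (double) e.getValue() / doc.size();
                double idf = Math.log((double) docs.size() / df.get(e.getKey()));
                vector.put(e.getKey(), tf * idf);
            }
            vectors.add(vector);
        }
        return vectors;
    }

    // Cosine similarity between two sparse vectors; returns 0 if either is all zeros.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy "articles", already lowercased and tokenized.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("stocks", "market", "rally"),
            Arrays.asList("market", "stocks", "fall"),
            Arrays.asList("team", "wins", "match"));

        List<Map<String, Double>> vectors = tfIdf(docs);
        System.out.println("sim(0, 1) = " + cosine(vectors.get(0), vectors.get(1)));
        System.out.println("sim(0, 2) = " + cosine(vectors.get(0), vectors.get(2)));
    }
}
```

With vectors and a similarity function like these in place, k-means can assign each document to the centroid it is most similar to, and k-NN can label a new document by the clusters of its most similar neighbors.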
Til next time,