This course shows how to measure document similarity with an information retrieval method, the vector space model, using the cosine similarity technique.
In the first part of the course, students will learn key concepts in natural language and semantic information processing, such as Binary Text Representation, Bag of Words, Lemmatization, Term Frequency (TF), Inverse Document Frequency (IDF), TF-IDF, Cosine Similarity, CamelCase and Identifiers.
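As a rough illustration of how Bag of Words, TF-IDF and Cosine Similarity fit together, here is a minimal Java sketch. It is not the course's own code: the class name `TfIdfSketch`, its method names, the simple letter-only tokenizer and the smoothed IDF formula, log((1 + N) / (1 + df)) + 1, are all assumptions chosen for this example.

```java
import java.util.*;

// Hypothetical sketch, not the course's implementation.
class TfIdfSketch {

    // Bag of Words: lower-case the text, split on non-letters, count each term.
    static Map<String, Integer> bagOfWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // TF-IDF weights for one document relative to a corpus of bags of words.
    // Assumption: smoothed IDF, so a term found in no corpus document
    // does not cause a division by zero.
    static Map<String, Double> tfIdf(Map<String, Integer> doc,
                                     List<Map<String, Integer>> corpus) {
        int totalTerms = doc.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            String term = e.getKey();
            double tf = e.getValue() / (double) totalTerms;
            long df = corpus.stream().filter(d -> d.containsKey(term)).count();
            double idf = Math.log((1.0 + corpus.size()) / (1.0 + df)) + 1.0;
            weights.put(term, tf * idf);
        }
        return weights;
    }

    // Cosine similarity: dot product divided by the product of vector lengths.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty document is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        String[] texts = {
            "the cat sat on the mat",
            "the cat lay on the rug",
            "stock markets fell sharply today"
        };
        List<Map<String, Integer>> corpus = new ArrayList<>();
        for (String text : texts) {
            corpus.add(bagOfWords(text));
        }
        Map<String, Double> d0 = tfIdf(corpus.get(0), corpus);
        Map<String, Double> d1 = tfIdf(corpus.get(1), corpus);
        Map<String, Double> d2 = tfIdf(corpus.get(2), corpus);
        System.out.println("doc0 vs doc1: " + cosine(d0, d1));
        System.out.println("doc0 vs doc2: " + cosine(d0, d2));
    }
}
```

In this sketch the first two sample sentences share several terms, so their similarity is greater than zero, while the first and third share none, so their similarity is exactly zero. Lemmatization and identifier handling such as CamelCase splitting are left out to keep the example short.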
In the second part of the course, students will learn how to develop and implement natural language processing software that performs document similarity. The course covers the basics needed to understand both the theory and its practical application in Java. The code sample also shows students how to modularize and trace a program and how to implement the required algebra functions.
We conclude the course with some guidelines on how to run and debug the program. Students are also given links to external resources that will help them gain a better understanding of natural language software and machine learning.
At the end of the course, you will have a solid understanding of the fundamental concepts of NLP and of how to implement them in a programming language. The objective of the course is to familiarise you with these concepts at the beginner level, but an intermediate level of programming knowledge is required. The coding example in this course uses the Java programming language to illustrate document similarity.