iMath Project

Text mining

Text mining offers various methods for extracting valuable information from documents. Among them, the vector space model is an essential technique. This model represents documents as vectors in a multi-dimensional space, where each dimension corresponds to a unique term in the document collection. Several methods build on this representation, such as latent semantic indexing (LSI) based on the singular value decomposition (SVD), clustering-based methods, nonnegative matrix factorization, and Lanczos-Golub-Kahan (LGK) bidiagonalization. Before using these methods, we must prepare the text by removing common words (stop words) and reducing words to their base form (stemming). When comparing these methods, it is important to take the structure of the text documents into account, since which method performs best depends on the documents used.
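
The preparation steps and the vector representation described above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the stop-word list, the suffix-stripping rule in crude_stem (a rough stand-in for a real stemmer such as Porter's), the function names, and the three sample documents are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "are", "and", "to", "for"}


def crude_stem(word):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def term_document_matrix(docs):
    """Return (vocab, matrix), where matrix[i][j] counts term i in document j."""
    counts = [Counter(preprocess(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in vocab]
    return vocab, matrix


# Made-up sample documents.
docs = [
    "Solving linear equations",
    "The equation of a line",
    "Mining texts for patterns",
]
vocab, matrix = term_document_matrix(docs)
for term, row in zip(vocab, matrix):
    print(term, row)
```

Each column of the matrix is one document vector, with one dimension per term. Methods such as LSI would then factorize this matrix; that step is left out here because it needs a linear algebra library.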

Example:

In the field of e-learning, text mining can be useful. By applying it, we could extract key concepts from questions in text form, adapt the difficulty level based on sentiment analysis of written feedback provided by users, or even provide a search engine that lets teachers search a dataset of questions for a certain concept.
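
As a rough sketch of the search-engine idea, the following Python code ranks questions against a query using TF-IDF weighted vectors and cosine similarity, one common way to use the vector space model for retrieval. The sample questions, the function names, and the particular IDF formula (log(n / df) + 1) are assumptions for illustration. Stop-word removal and stemming are omitted for brevity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return sparse TF-IDF vectors (dicts) for docs, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = [
        {t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, docs, top_k=3):
    """Return up to top_k (document, score) pairs with a nonzero score."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the collection are ignored.
    q = {t: c * idf[t] for t, c in Counter(tokenize(query)).items() if t in idf}
    scored = sorted(
        ((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True
    )
    return [(docs[i], round(s, 3)) for s, i in scored[:top_k] if s > 0]


# Made-up question dataset.
questions = [
    "Compute the derivative of x squared",
    "Find the eigenvalues of the matrix",
    "What is the derivative of sin x",
    "Solve the linear system with a matrix inverse",
]
for question, score in search("derivative", questions):
    print(score, question)
```

A query for a concept returns only the questions that share at least one term with it. Matching related questions that use different wording is the kind of problem LSI is meant to address.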

References:

Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613-620.

Berry, M. W., & Browne, M. (2005). Understanding search engines: mathematical modeling and text retrieval. Society for Industrial and Applied Mathematics.

Eldén, L. (2019). Chapter 12. In Matrix Methods in Data Mining and Pattern Recognition, Second Edition (pp. 133–148).

Author of the tip:

Alejandro Fuster López

University of Malaga
