Funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Education and Culture Executive Agency (EACEA). Neither the European Union nor EACEA can be held responsible for them.

Tips

Using rank-k approximation for keyword and keysentence extraction

Automatic text summarization is an active research field for a long time. We focus here just on obtaining keywords and key sentences to be extracted automatically from a given text based on nonnegative matrix factorization as presented in (Zha, 2002 ). The basis is a term-sentence matrix with entrances a_ij being the occurences of terms in sentences. Interestingly enough, one first should de a so-called stemming, i.e. removing stop words and identifying words with the same stem, but different endings. The procedure finds the so-called saliency scores of terms and sentences based on eigenvalues of AA^T and A^TA respectively. By choosing the largest singular values, the components of the salicency vectors are positive. It is interesting that this method coincides with a rank-1 approximation of the term-sentence matrix. The drawback is that two very similar sentences may come out. Therefore the papers we refer to introduce a so-called rank-k approximation, where the dimension k is chosen greater than or equal to the number of key sentences that we want to extract. In fact, we can think of an iterative procedure which identifies the ""heaviest" columns of A in terms of Euclidean length for a given basis. The representation of A is given by a coordinate matrix D, where we are iteratively interested in the heaviest column. Practically, one can use a QR decomposition of D.

Example:
An example is provided in the book (Elden, 2019 ), where one of the chapters is used to run the methodology. There are 388 terms and 183 sentences. It is interesting, that one first has to remove all latex humbug from the source text. Using matlabs singular value decomposition for sparse matrices, one obtains a plot of the singular vectors. Choosing the 10 largest values, one finds the most important words, but also the dominating sentences. This looks like a very systematic approach.

Reference:
H. Zha. Generic summarization and keyphrase extraction using mutual reinforcement principle and sentence clustering. In Proc., 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, 2002, pp. 113–120.

L. Eldén. Matrix Methods in Data Mining and Pattern Recognition 2nd edition, Chapter 14. SIAM, Philadelphia, 2019

Author of the tip:
Eligius Hendrix
University of Malaga