GRAVITY SCORE: NEW METRIC TO MEASURE PLAGIARISM IN TEXT DOCUMENTS USING THE CONCEPT OF GRAVITATIONAL FORCE

Authors:

Srijit Panja

DOI NO:

https://doi.org/10.26782/jmcms.2022.12.00002

Keywords:

Natural Language Processing, Text Embedding, Text Token, Gravitation

Abstract

Present-day computational capabilities allow digital assets like images, videos, text, and audio to have features comparable to those of real-world entities. Location is one such feature. Just as real-world bodies are represented by vectors in Cartesian coordinates, a digital media entity (such as text, as discussed in this paper), when encoded so that each component of the encoding represents a feature, should conceptually have a vector representation in that encoding. This concept is put into practice by text encoding (embedding) techniques such as Bag-of-Words, TF-IDF, Word2Vec, GloVe, and Transformer models like BERT and ALBERT, which create vectors out of text. This paper identifies a combination of features in text analogous to mass and distance and proposes a new plagiarism index modeled on the formula for gravitational force. Parameters such as the length of documents (number of words), semantics, and the frequency of each word, one or more of which are often missed by prevalent text-similarity algorithms, are important for detecting and measuring plagiarism. The paper aims to consider all such parameters in the formulation of a new plagiarism metric, to be coined the Gravity Score.
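The abstract names the ingredients of the proposed metric (a mass-like quantity such as document length and a distance-like quantity derived from embeddings) without stating the exact formula. The following minimal sketch shows one way such a score could be assembled, patterned on Newton's F = G * m1 * m2 / r^2; the specific choices here (word count as mass, TF-IDF vectors as the embedding, Euclidean distance as r, and the constants G and eps) are illustrative assumptions, not the paper's definitions.

```python
# Illustrative sketch only: the paper's exact formulation is not given in
# the abstract, so "mass" (word count), the embedding (TF-IDF), and the
# distance (Euclidean) below are assumptions made for demonstration.
import math
from collections import Counter

def tfidf_vectors(doc_a: str, doc_b: str):
    """Build TF-IDF vectors for two documents over their shared vocabulary."""
    tokens_a, tokens_b = doc_a.lower().split(), doc_b.lower().split()
    vocab = sorted(set(tokens_a) | set(tokens_b))
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)

    def idf(term: str) -> float:
        # Smoothed inverse document frequency over this two-document corpus.
        df = (term in tf_a) + (term in tf_b)
        return math.log((2 + 1) / (df + 1)) + 1

    vec_a = [tf_a[t] / len(tokens_a) * idf(t) for t in vocab]
    vec_b = [tf_b[t] / len(tokens_b) * idf(t) for t in vocab]
    return vec_a, vec_b, len(tokens_a), len(tokens_b)

def gravity_score(doc_a: str, doc_b: str, G: float = 1.0, eps: float = 1e-9) -> float:
    """Hypothetical plagiarism index patterned on F = G * m1 * m2 / r**2.

    Masses are the document lengths (word counts); r is the Euclidean
    distance between the documents' TF-IDF vectors. eps avoids division
    by zero when the two documents are identical (r = 0).
    """
    vec_a, vec_b, m1, m2 = tfidf_vectors(doc_a, doc_b)
    r = math.sqrt(sum((x - y) ** 2 for x, y in zip(vec_a, vec_b)))
    return G * m1 * m2 / (r ** 2 + eps)

if __name__ == "__main__":
    print(gravity_score("the cat sat on the mat", "the cat sat on a mat"))
    print(gravity_score("the cat sat on the mat", "stars collapse under gravity"))
```

Under these assumptions the score grows as documents get longer and as their embeddings move closer together, so two long, nearly identical texts "attract" strongly, mirroring how gravitational force scales with mass and with inverse-square distance.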

References:

I. Abdi, H., and L. J. Williams. 2010. “Principal component analysis.” Wiley interdisciplinary reviews: computational statistics 2 (4): 433–459.
II. Alemi, A. A., and P. Ginsparg. 2015. “Text segmentation based on semantic word embeddings.” arXiv preprint arXiv:1503.05543.
III. Cattaneo, C. 1958. “General relativity: relative standard mass, momentum, energy and gravitational field in a general system of reference.” Il Nuovo Cimento (1955-1965) 10 (2): 318–337.
IV. Chang, A. X., and C. D. Manning. 2014. “TokensRegex: Defining cascaded regular expressions over tokens.” Stanford University Computer Science Technical Reports. CSTR 2:2014.
V. Chujo, K., and M. Utiyama. 2005. “Understanding the Role of Text Length, Sample Size and Vocabulary Size in Determining Text Coverage.” Reading in a foreign language 17 (1): 1–22.
VI. Danielsson, P.-E. 1980. “Euclidean distance mapping.” Computer Graphics and image processing 14 (3): 227–248.
VII. Edelbaum, T. N. 1962. “Theory of maxima and minima.” In Mathematics in Science and Engineering, 5:1–32. Elsevier.
VIII. Fock, V. 2015. The theory of space, time and gravitation. Elsevier.
IX. Gan, L., and J. Jiang. 1999. “A test for global maximum.” Journal of the American Statistical Association 94 (447): 847–854.
X. Grefenstette, G., and P. Tapanainen. 1994. “What is a word, what is a sentence?: problems of Tokenisation.”
XI. Kanada, Y. 1990. “A Vectorization Technique of Hashing and Its Application to Several Sorting Algorithms.” In PARBASE, 147–151.
XII. Lahitani, A. R., A. E. Permanasari, and N. A. Setiawan. 2016. “Cosine similarity to determine similarity measure: Study case in online essay assessment.” In 2016 4th International Conference on Cyber and IT Service Management, 1–6. IEEE.
XIII. Mikolov, T., K. Chen, G. Corrado, and J. Dean. 2013. “Efficient estimation of word representations in vector space.” arXiv preprint arXiv:1301.3781.
XIV. Morita, K., E.-S. Atlam, M. Fuketa, K. Tsuda, M. Oono, and J.-i. Aoe. 2004. “Word classification and hierarchy using co-occurrence word information.” Information processing & management 40 (6): 957–972.
XV. Nation, P., and R. Waring. 1997. “Vocabulary size, text coverage and word lists.” Vocabulary: Description, acquisition and pedagogy 14:6–19.
XVI. Niwattanakul, S., J. Singthongchai, E. Naenudorn, and S. Wanapu. 2013. “Using of Jaccard coefficient for keywords similarity.” In Proceedings of the international multiconference of engineers and computer scientists, 1:380–384. 6.
XVII. Pennington, J., R. Socher, and C. D. Manning. 2014. “Glove: Global vectors for word representation.” In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 1532–1543.
XVIII. Ramos, J., et al. 2003. “Using tf-idf to determine word relevance in document queries.” In Proceedings of the first instructional conference on machine learning, 242:29–48. 1. New Jersey, USA.
XIX. Van der Maaten, L., and G. Hinton. 2008. “Visualizing data using t-SNE.” Journal of machine learning research 9 (11).
XX. Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. 2017. “Attention is all you need.” Advances in neural information processing systems 30.
XXI. Verlinde, E. 2011. “On the origin of gravity and the laws of Newton.” Journal of High Energy Physics 2011 (4): 1–27.
XXII. Zhang, Y., R. Jin, and Z.-H. Zhou. 2010. “Understanding bag-of-words model: a statistical framework.” International journal of machine learning and cybernetics 1 (1): 43–52.
