Blog · 21 Jul 2016

The Detection is in the Details

The Inner Workings of Plagiarism Detection Technology

Turnitin.com Editors

There are a number of ways that technology can be used to identify potentially plagiarized content. This post examines those approaches and explains how Turnitin uses search technology and content comparison algorithms to help educators teach students to use source attribution appropriately.

Plagiarism has always been a problem; the origins of the word date back to the 1st century. It's only of late, however, that plagiarism has become a significant concern not just for educators and researchers, but also in the public sphere. New instances of plagiarism seem to hit the news on a daily basis. Whether it's song lyrics, speeches by political figures, plagiarism by school officials and government ministers, or the plagiarism that happens in the classroom, incidents of plagiarism appear to be on the rise everywhere.

We have the internet to thank for that. With the rise of the internet, we've seen exponential growth in the content that is created and made readily available almost everywhere. The growth is happening on such a large scale, and so fast, that we have no way to accurately determine the number of new pages created each day or the total amount of content that currently exists online. In 2013, factshunt.com pegged the amount of total internet content at 14.3 trillion pages (article). The best estimates suggest there are 47 billion indexed and searchable web pages (article). To put the scale into perspective, it would take approximately 300 trillion sheets of paper to print out the entire internet today.

With all of this information so immediately accessible, is it any surprise that we've seen a rise in plagiarism as well? Fortunately, the growth of the internet, and our need to find ways to search that content, has driven developments in web crawling and indexing technology (indexing being what identifies the content that has been crawled). Those developments have in turn led to technology that quickly identifies copying and the potential plagiarism of content.

First off, it is important to clarify that plagiarism detection software doesn't specifically identify plagiarism. No software will ever be able to accurately determine intent, and intent is one of the factors that educators consider when looking at incidents of plagiarism in student work. What plagiarism detection software does is identify content similarity matches. That is, the software breaks a piece of work into its text components and compares them to the components of other work held in a database of crawled content. Based on that comparison, the software generates a report that highlights the content matches. Plagiarism detection software crawls and indexes content in much the same way that search engines, like Google, crawl and index web content. The key difference is that plagiarism detection software crawls and indexes content not to make it keyword searchable, but to identify similar content stored in the database of crawled pages.

So how does the software scan for similarity?

There are, generally speaking, four different ways to go about doing this. The first way is through keyword analysis. What does that mean? As with a search engine, you enter a keyword and the software scans documents to find instances of that word. Another way to scan text for similarity is to look at groups, or strings, of words. Rather than looking just at individual words, the software looks for strings or sequences of words (say, three or four words or more, ordered so as to form a sentence or sentence fragment). As you may be able to see already, these two approaches can be pretty effective for identifying strict or exact copying of content from one document to another. The shortcoming of these approaches, however, is that they don't identify paraphrasing, where the ideas and meaning may have been copied but the text is different enough that it doesn't get identified as a match.
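To make the first two approaches concrete, here is a minimal sketch in Python. It is an illustration only, not Turnitin's implementation: the function names, the sample sentences, and the choice of three-word sequences compared with a Jaccard overlap are all assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_hits(keyword, text):
    """Approach 1: count how often a single keyword appears in the text."""
    return tokenize(text).count(keyword.lower())


def word_ngrams(text, n=3):
    """Return the set of n-word sequences that appear in the text."""
    words = tokenize(text)
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_overlap(doc_a, doc_b, n=3):
    """Approach 2: share of n-word sequences the two documents have in common."""
    a, b = word_ngrams(doc_a, n), word_ngrams(doc_b, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical sample texts, invented for this example.
source = "The quick brown fox jumps over the lazy dog"
copied = "Yesterday the quick brown fox jumps over the lazy dog again"
paraphrased = "A fast auburn fox leaps above a sleepy hound"

print(keyword_hits("fox", copied))
print(ngram_overlap(source, copied))
print(ngram_overlap(source, paraphrased))
```

The copied sentence shares most of its three-word sequences with the source, so its overlap is high. The paraphrase shares none, so its overlap is zero even though the idea is the same, which is exactly the shortcoming described above.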

A better way to get at this type of problem is a third approach: scanning for content matches by looking at the style of the writing and comparing that style to other documents. This is not a strict word-to-word analysis, but an approach that looks at the probability of certain word sequences ("phrases") appearing in one document and then compares those probabilities to other documents. The challenge here is that fine-grained, word-for-word matches can get lost. Better yet, why not identify a document's unique "fingerprint," and then compare that fingerprint to others? This last approach, which we discuss in the rest of this post, is largely what we do at Turnitin.
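The style-based approach mentioned above can be sketched roughly as follows. This is a deliberately simple stand-in, assuming that "style" is approximated by how often each pair of adjacent words occurs and that two profiles are compared with cosine similarity. Real stylometric tools use far richer features, and the function names and sample texts here are hypothetical.

```python
import math
import re
from collections import Counter


def bigram_profile(text):
    """Relative frequency of each adjacent word pair in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    pairs = Counter(zip(words, words[1:]))
    total = sum(pairs.values())
    if total == 0:
        return {}
    return {pair: count / total for pair, count in pairs.items()}


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse frequency profiles."""
    dot = sum(p[key] * q[key] for key in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


# Hypothetical sample texts, invented for this example.
essay_a = "In point of fact, it is the case that the results are, in point of fact, sound."
essay_b = "It is the case that, in point of fact, the method is sound."
essay_c = "Results good. Method works. Nothing else matters here."

print(cosine_similarity(bigram_profile(essay_a), bigram_profile(essay_b)))
print(cosine_similarity(bigram_profile(essay_a), bigram_profile(essay_c)))
```

The first two essays share habitual phrases and score as more similar than the first and third. Because the profile only records frequencies, it says nothing about where in a document a given passage was copied from, which is the loss of fine-grained matches noted above.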

With "fingerprinting," Turnitin's technology scans a document and identifies its unique word fragments and the order in which those fragments appear. With this level of analysis, we can not only uncover word string matches ("fragments") but also look at the unique sequences of those matches to create a fingerprint of the document.
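As a rough illustration of fragments plus ordering, the sketch below hashes every three-word fragment of a document, keeps the hashes in order of appearance, and then looks for the longest run of fragments that two documents share in the same order. Turnitin's actual algorithm is not described in this post, so everything here is an assumption: the fragment length, the truncated SHA-1 hash, and the use of Python's standard `difflib.SequenceMatcher` for the comparison.

```python
import hashlib
import re
from difflib import SequenceMatcher


def fragment_hashes(text, k=3):
    """Hash every k-word fragment, preserving the order of appearance."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    fragments = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    return [hashlib.sha1(f.encode("utf-8")).hexdigest()[:8] for f in fragments]


def longest_shared_run(doc_a, doc_b, k=3):
    """Length, in fragments, of the longest run both documents share in order."""
    a, b = fragment_hashes(doc_a, k), fragment_hashes(doc_b, k)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


# Hypothetical sample texts, invented for this example.
source = "The mitochondria is the powerhouse of the cell"
submitted = "Students often note that the mitochondria is the powerhouse of the cell in essays"
shuffled = "The cell of the powerhouse is the mitochondria"

print(longest_shared_run(source, submitted))
print(longest_shared_run(source, shuffled))
```

The submitted text contains the whole source sentence, so all of the source's fragments appear in it as one unbroken run. The shuffled text uses the same words in a different order and shares no fragments at all, which shows why the sequence of fragments carries information beyond the words themselves.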

Fingerprints are entirely unique and can be identified by the specific features displayed in a print. The same can be said for documents: each document has features, such as phrasing, tone, and style, that make a completely original document as unique as a fingerprint. If a document contains content that is unoriginal in its phrasing, the document will match other document fingerprints that also contain this feature.

One issue this approach faces is how to avoid picking up very common words, like articles ("the," "an," "a"), conjunctions ("and," "but," "or"), or prepositions ("of"), and home in on the strings of words that make a document unique. Fingerprinting gives us a way to exclude commonly used words, while providing us with the ability to identify when content is poorly paraphrased. Because Turnitin was developed for use in academic contexts, our approach with fingerprinting is to focus on features of the text that are clearly relevant to the content or subject matter of the document.
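The effect of excluding common words can be shown with a small sketch. It assumes a tiny hand-picked stop-word list and builds the fingerprint from hashed three-word fragments of the remaining content words. These are illustrative choices, not a description of how Turnitin selects its features, and the names and sample sentences are hypothetical.

```python
import hashlib
import re

# A tiny illustrative stop-word list; a real system would need a far larger one.
STOP_WORDS = {
    "the", "a", "an", "and", "but", "or", "of", "in", "on",
    "to", "is", "are", "was", "by", "that", "for", "with",
}


def content_words(text):
    """Lowercase, split into words, and drop the very common words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def fingerprint(text, k=3):
    """Set of hashes of k-word fragments built from content words only."""
    words = content_words(text)
    return {
        hashlib.sha1(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()[:8]
        for i in range(len(words) - k + 1)
    }


def containment(submission, source, k=3):
    """Share of the source's fingerprint that also appears in the submission."""
    src = fingerprint(source, k)
    if not src:
        return 0.0
    return len(src & fingerprint(submission, k)) / len(src)


# Hypothetical sample texts, invented for this example.
source = "The rapid growth of the internet has led to an explosion of freely available content"
poor_paraphrase = "A rapid growth in internet has led to the explosion of the freely available content"
unrelated = "Photosynthesis converts sunlight into chemical energy inside plant cells"

print(containment(poor_paraphrase, source))
print(containment(unrelated, source))
```

Swapping articles and prepositions changes many raw three-word sequences, but once the common words are dropped the two sentences reduce to the same content words and their fingerprints match completely. A paraphrase that replaces content words with synonyms would still break the fragments around each substitution, so this sketch only catches the shallowest kind of rewording.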

For example, if you're finalizing your dissertation, you would want to make sure that all of the ideas you discussed were properly referenced and cited, and that the sections where you paraphrased were paraphrased properly. What gets less emphasis is strict word-to-word matching. If an approach biased toward keyword search were used here, we'd be unable to identify poor paraphrasing or selective word substitution, which incidentally makes up the majority of what academics and educators see in student work.

If you're looking for strict word-to-word matches, you could use a search engine (which is what everyone did before the advent of plagiarism detection software). If you're looking at comparing one author's style to another, there's an approach for that. As for identifying content matches in academic-type writing, Turnitin has developed a fingerprint-based approach that excels at finding the content that matters, when it matters. In other words, Turnitin is designed to support students who are learning how to use the internet to do research, use source materials, and take ownership of their own writing and ideas.
