TurnitinBot General Information Page
Overview
Chances are that you are reading this because you found a reference to this web page in your web server logs. This reference was left by Turnitin.com's web crawling robot, also known as TurnitinBot. This robot collects content from the Internet for the sole purpose of helping educational institutions prevent plagiarism. In particular, we compare student papers against the content we find on the Internet to look for similarities. For more information on this service, please visit www.turnitin.com.
The questions below are grouped into categories to help you find the answers you need.
Frequently Asked Questions Grouped By Category
General Information About Web Crawlers
What is a web crawler?
How does a web crawler work?
What is considered good crawling etiquette?
General Information About Turnitin.com
What services does Turnitin.com offer?
Why does Turnitin.com need to crawl my site?
Controlling TurnitinBot
How can I prevent TurnitinBot from accessing certain web pages on my site?
How can I completely exclude TurnitinBot from my site?
What IP Address does TurnitinBot come from?
Problems With TurnitinBot
Why is TurnitinBot crawling pages that do not exist on my site (404 errors)?
How can I contact you to report a problem?
Q: What is a web crawler?
A web crawler (aka spider, robot or bot) is a computer program that scours the web gathering content. Some crawlers are specific in what they are looking for, but ours is just interested in gathering as much content as possible.
Q: How does a web crawler work?
At its most basic level, a crawler follows a simple cycle of downloading a web page, finding the links in the web page, downloading the pages referenced by these links, and so on, in a loop. A more thorough explanation can be found at https://www.robotstxt.org/robotstxt.html.
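The download, find links, follow links cycle described above can be sketched in a few lines of Python. This is only an illustration, not TurnitinBot's actual implementation: the names `crawl`, `extract_links`, and `FAKE_WEB` are hypothetical, and pages are served from an in-memory dictionary instead of the network so the example runs anywhere.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for all links found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML or None if the page is missing."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # e.g. a broken link that would produce a 404
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical three-page site used in place of real HTTP requests.
FAKE_WEB = {
    "http://www.somewhere.com/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://www.somewhere.com/a.html": '<a href="/">home</a> <a href="/missing.html">broken</a>',
    "http://www.somewhere.com/b.html": "no links here",
}

pages = crawl("http://www.somewhere.com/", FAKE_WEB.get)
```

Here `pages` ends up holding the three pages that exist. The link to `/missing.html` is followed but yields nothing, which is how a crawler ends up requesting pages that do not exist on a site.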
Q: What is considered good crawling etiquette?
Good crawling etiquette relies on the crawler obeying a few rules. It should read and obey the directives in the robots.txt file for a site. It should also obey META exclusion tags within pages. To avoid overloading servers with requests, it should limit the rate at which it asks for content from a particular IP address.
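The rate-limiting rule can be illustrated with a small per-server scheduler. This is a minimal sketch under stated assumptions, not a description of how TurnitinBot throttles itself: the class name `ServerRateLimiter`, the two-second interval, and the addresses are all hypothetical, and the current time is passed in explicitly so the behavior is easy to follow.

```python
class ServerRateLimiter:
    """Spaces out requests so each server sees at most one per min_interval seconds."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._next_allowed = {}  # server key (e.g. IP address) -> earliest next request time

    def wait_time(self, server, now):
        """Return how long to wait before contacting server, and reserve that slot."""
        ready_at = max(now, self._next_allowed.get(server, now))
        self._next_allowed[server] = ready_at + self.min_interval
        return ready_at - now


# Hypothetical usage: at most one request every 2 seconds per server address.
limiter = ServerRateLimiter(min_interval=2.0)
first = limiter.wait_time("192.0.2.10", now=0.0)    # no earlier request, so no wait
second = limiter.wait_time("192.0.2.10", now=0.5)   # too soon, must wait for the slot
other = limiter.wait_time("192.0.2.20", now=0.5)    # a different server is unaffected
```

A real crawler would sleep for the returned duration (or schedule the request for later) before contacting the server.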
Q: What services does Turnitin.com offer?
Turnitin.com offers various services to the educational community. Most prominently, we provide a widely used and effective plagiarism detection service. We also provide a Peer Review service and a series of class management tools. To learn more about our services, visit www.turnitin.com.
Q: Why does Turnitin.com need to crawl my site?
Part of the plagiarism prevention service relies on comparing student papers to content found on the Internet. Since we do not know ahead of time which pages on the Internet a student will use, we need to gather them all for comparison. However, we do have automated ways of discarding content and links that would be irrelevant to our service.
Q: How can I prevent TurnitinBot from accessing certain web pages on my site?
The Robots Exclusion Protocol lets web site maintainers tell a crawler which parts of their site it may not access. It also lets the administrator create access rules on a crawler-by-crawler basis.
It works something like this: TurnitinBot visits a web site, say http://www.somewhere.com. If it has not been there before, or not for a while, it first tries to download http://www.somewhere.com/robots.txt. It then examines the robots.txt file for any rules that apply to it. An example of a robots.txt file is:
#This is an example robots.txt file
User-agent: turnitinbot
Disallow: /secret/
Disallow: /hide/
The User-agent line names the crawler the rules apply to. With User-agent: turnitinbot the rules would only apply to the TurnitinBot crawler. Please note that both the token "user-agent" and "turnitinbot" are case insensitive. For example, TurnitinBot or TURNITINBOT are equally effective. The Disallow lines are used to exclude the crawler from particular pages on the site. In this case any page starting with /secret/ or /hide/ will be excluded. For instance, http://www.somewhere.com/secret/world.html would be excluded but http://www.somewhere.com/secret.html wouldn't be.
Note: you may see the Turnitin crawler use the user-agent "Turnitin" rather than "TurnitinBot"; these are equivalent, and the Turnitin crawler will respect robots.txt exclusions for both "turnitinbot" and "turnitin".
For a more thorough explanation please visit https://www.robotstxt.org/robotstxt.html.
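To check how a set of rules like this would be interpreted before publishing them, one option is Python's standard `urllib.robotparser` module. The sketch below is illustrative only: the rules are reconstructed from the description above (a `turnitinbot` section disallowing /secret/ and /hide/), the helper names are hypothetical, and Python's parser is not necessarily identical to the one TurnitinBot uses.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules, reconstructed from the /secret/ and /hide/ example in the text.
ROBOTS_TXT = """\
#This is an example robots.txt file
User-agent: turnitinbot
Disallow: /secret/
Disallow: /hide/
"""


def build_parser(robots_txt):
    """Parse robots.txt content supplied as a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, user_agent, url):
    """Return True if user_agent may fetch url under the parsed rules."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
blocked = is_allowed(parser, "TurnitinBot", "http://www.somewhere.com/secret/world.html")
permitted = is_allowed(parser, "TurnitinBot", "http://www.somewhere.com/secret.html")
```

With these rules `blocked` is False and `permitted` is True, matching the two URLs discussed above. Crawlers with a different user-agent are not affected by the `turnitinbot` section at all.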
Q: How can I completely exclude TurnitinBot from my site?
To exclude TurnitinBot from all or portions of your site, all you have to do is create a file called robots.txt and put it in the topmost directory of your web site. Below are examples of robots.txt files which exclude ONLY our robot from a portion or all of your site.
#This is an example robots.txt file that excludes TurnitinBot from a portion of your site
User-agent: turnitinbot
Disallow: /secret/

#This is an example robots.txt file that excludes TurnitinBot from all of your site
User-agent: turnitinbot
Disallow: /
Q: What IP Address does TurnitinBot come from?
Turnitin uses a number of different crawlers and content indexing systems, all of which share the Agent Name "TurnitinBot" and originate from one of a number of static IP addresses, which our system might randomly assign to the crawler/indexer at any given time. The two main use cases are listed below.
Content Partner organisations/Crossref Members: For Content Partner organisations that provide their metadata (including full-text URLs) to Turnitin over FTP so that Turnitin can crawl this content, these IP addresses and the Agent Names "Turnitin" and "TurnitinBot" should be given access to facilitate the indexing of this content.
General Webcrawl: Please note that Crossref Members/Content Partners may also see activity from Turnitin's general webcrawler on their websites; if this causes any inconvenience or slowdown in traffic, it can be handled via robots.txt restrictions or by blocking just the latter IP range (199.47.80.0 to 199.47.87.131 and 199.47.87.136 to 199.47.87.255).
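If you want to tell from your server logs whether a request came from the general webcrawl range quoted above, a check along the following lines may help. It is a sketch only: the two ranges are copied from this page and may change over time, and the function names are hypothetical. It relies on Python's standard `ipaddress` module.

```python
import ipaddress

# General webcrawl ranges as quoted on this page (may be out of date).
GENERAL_CRAWL_RANGES = [
    ("199.47.80.0", "199.47.87.131"),
    ("199.47.87.136", "199.47.87.255"),
]


def to_networks(ranges):
    """Convert (first, last) address pairs into a list of CIDR networks."""
    networks = []
    for first, last in ranges:
        networks.extend(
            ipaddress.summarize_address_range(
                ipaddress.ip_address(first), ipaddress.ip_address(last)
            )
        )
    return networks


GENERAL_CRAWL_NETWORKS = to_networks(GENERAL_CRAWL_RANGES)


def is_general_crawl_ip(ip):
    """Return True if ip falls inside one of the general webcrawl ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in GENERAL_CRAWL_NETWORKS)
```

For example, `is_general_crawl_ip("199.47.87.200")` is True, while `is_general_crawl_ip("199.47.87.133")` is False because that address sits in the gap between the two ranges. The CIDR blocks in `GENERAL_CRAWL_NETWORKS` can also be used directly in firewall or web server rules.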
Q: Why is TurnitinBot crawling pages that do not exist on my site (404 errors)?
There are two possible explanations for this. The first is that your site or another site contains an incorrect link to your site, i.e. a link to a page that doesn't exist. Not knowing better, we tried to follow this link, generating a 404 error on your server. The other possibility is that TurnitinBot improperly parsed a link from a page.
Q: How can I contact you to report a problem?
If you still have questions or want to speak with us about our crawler's behavior, you can contact us at crawler@turnitin.com. Providing the following information will help us determine what occurred when our crawler visited your site.
* A description of your question or problem.