TurnitinBot General Information Page

Overview

Chances are that you are reading this because you found a reference to this web page in your web server logs. This reference was left by Turnitin.com's web crawling robot, also known as TurnitinBot. This robot collects content from the Internet for the sole purpose of helping educational institutions prevent plagiarism. In particular, we compare student papers against the content we find on the Internet to see if we can find similarities. For more information on this service, please visit www.turnitin.com.
The questions below are grouped into categories to help you find the answers you need.

Frequently Asked Questions Grouped By Category

    General Information About Web Crawlers
What is a web crawler?
How does a web crawler work?
What is considered good crawling etiquette?
    General Information About Turnitin.com
What services does Turnitin.com offer?
Why does Turnitin.com need to crawl my site?
    Controlling TurnitinBot
How can I prevent TurnitinBot from accessing certain pages on my site?
How can I completely exclude TurnitinBot from my site?
What IP Address does TurnitinBot come from?
What is SlySearch?
    Problems With TurnitinBot
Why is TurnitinBot crawling pages that do not exist on my site (404 errors)?
I changed my robots.txt file to exclude TurnitinBot, but it keeps coming back. Why?
How can I contact you to report a problem?
Q: What is a web crawler?
A web crawler (also known as a spider, robot, or bot) is a computer program that scours the web gathering content. Some crawlers look for specific kinds of content; ours simply gathers as much content as possible.
Q: How does a web crawler work?
At its most basic level, a crawler follows a simple cycle of downloading a web page, finding the links in the web page, downloading the pages referenced by these links, and so on, in a loop. A more thorough explanation can be found at http://www.robotstxt.org/wc/robots.html.
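The cycle described above can be sketched in a few lines of Python. This is an illustrative sketch, not TurnitinBot's actual implementation: the `fetch` callable, the breadth-first queue, and the `max_pages` limit are assumptions added for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(start_url, fetch, max_pages=100):
    """Basic crawl cycle: download a page, find its links, repeat.

    `fetch` is any callable mapping a URL to its HTML (or None on failure);
    in a real crawler it would issue an HTTP request.
    """
    seen, queue, visited = {start_url}, deque([start_url]), []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:       # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited
```

A production crawler layers more on top of this loop, notably checking robots.txt before each fetch and throttling the request rate per host, as the etiquette answer below describes.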
Q: What is considered good crawling etiquette?
Good crawling etiquette relies on the crawler obeying a few rules. It should read and obey the directives in a site's robots.txt file. It should also obey META exclusion tags within pages. To avoid overloading servers with requests, it should limit the rate at which it asks for content from a particular IP address.
Q: What services does Turnitin.com offer?
Turnitin.com offers various services to the educational community. Most prominently, we provide a widely used and effective plagiarism detection service. We also provide a Peer Review service and a series of class management tools. To learn more about our service, visit www.turnitin.com.
Q: Why does Turnitin.com need to crawl my site?
Part of the plagiarism prevention service relies on comparing student papers to content found on the Internet. Since we do not know ahead of time which pages on the Internet a student will use, we need to gather them all for comparison. However, we do have automated ways of discarding content and links that would be irrelevant to our service.
Q: How can I prevent TurnitinBot from accessing certain web pages on my site?
The Robots Exclusion Protocol allows web site maintainers to tell a crawler which parts of their site the crawler may not access. Furthermore, it allows the administrator to create access rules on a crawler-by-crawler basis.

It works something like this: TurnitinBot visits a web site, say http://www.somewhere.com. If it has not visited the site before, or not recently, it tries to download http://www.somewhere.com/robots.txt. It then examines the robots.txt file for any rules that apply to it. An example of a robots.txt file is:

#This is an example robots.txt file
User-agent: *
Disallow: /secret/
Disallow: /hide/
Lines starting with # are comments and are ignored by the crawler. The User-agent line is used to indicate which crawler(s) should abide by the rules. In this case, a * means all crawlers. If it were

User-agent: turnitinbot

the rules would apply only to the TurnitinBot crawler. Please note that both the token "user-agent" and "turnitinbot" are case insensitive; for example, TurnitinBot and TURNITINBOT are equally effective. The Disallow lines are used to exclude the crawler from particular pages on the site. In this case any page starting with /secret/ or /hide/ will be excluded. For instance, http://www.somewhere.com/secret/world.html would be excluded but http://www.somewhere.com/secret.html wouldn't be.

For a more thorough explanation please visit http://www.robotstxt.org/wc/exclusion.html.
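You can check how a given robots.txt file applies to a particular URL using Python's standard urllib.robotparser module. The rules below mirror the example file above; this is a verification sketch, not part of any crawler.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt file from above, as a string.
rules = """\
#This is an example robots.txt file
User-agent: *
Disallow: /secret/
Disallow: /hide/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Any URL whose path starts with /secret/ or /hide/ is excluded
# for every crawler, TurnitinBot included.
print(parser.can_fetch("TurnitinBot", "http://www.somewhere.com/secret/world.html"))

# /secret.html does not start with /secret/, so it remains allowed.
print(parser.can_fetch("TurnitinBot", "http://www.somewhere.com/secret.html"))
```

Running this prints False for the /secret/world.html URL and True for /secret.html, matching the behavior described above.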

Q: How can I completely exclude TurnitinBot from my site?
To exclude TurnitinBot from all or portions of your site, all you have to do is create a file called robots.txt and put it in the topmost directory of your web site.

Below are two example robots.txt files which exclude ONLY our robot from a portion or all of your site.

#This is an example robots.txt file
User-agent: TurnitinBot
Disallow: /hide/     #Will disallow any URL starting with /hide/

#This is an example robots.txt file
User-agent: TurnitinBot
Disallow: /            #Will disallow all URLs on your site

Another alternative is to contact us with your domain name and all its aliases (for instance www.somewhere.com, www2.somewhere.com, somewhere.com, etc.) and we'll add them to our blacklist of sites to avoid.

Q: What IP Address does TurnitinBot come from?
IP Address: 38.111.147.69 to 38.111.147.94
IP Address: 199.47.82.133 to 199.47.82.254
Q: What is SlySearch?
SlySearch is the old name of our robot. We decided to change its name to better reflect the service it represents.
Q: Why is TurnitinBot crawling pages that do not exist on my site (404 errors)?
There are two likely explanations. One is that your site, or another site, contains a broken link pointing at a page on your site that does not exist; not knowing better, TurnitinBot followed this link, generating a 404 error on your server. The other possibility is that TurnitinBot improperly parsed a link from a page.
Q: I changed my robots.txt file to exclude TurnitinBot, but it keeps coming back. Why?
If we re-requested the robots.txt file before each page request, it would place a significantly larger load on servers and waste bandwidth. We avoid this by caching robots.txt files. For versions Turnitinbot/1.4 and below, we cached the robots.txt file for 48 hours before refreshing our copy. As of version Turnitinbot/1.5, we dropped this value to 12 hours to better suit the needs of webmasters.
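The caching behavior described above amounts to a simple time-to-live cache per host. Here is a sketch of the idea; the 12-hour figure matches the Turnitinbot/1.5 policy stated above, while the class name, the `fetch` callable, and the injectable clock are assumptions for illustration, not TurnitinBot's actual code.

```python
import time

TTL_SECONDS = 12 * 60 * 60  # Turnitinbot/1.5 refreshes robots.txt every 12 hours


class RobotsCache:
    """Cache robots.txt bodies per host, refetching only after the TTL expires."""

    def __init__(self, fetch, ttl=TTL_SECONDS, clock=time.monotonic):
        self.fetch = fetch    # callable: host -> robots.txt text
        self.ttl = ttl
        self.clock = clock    # injectable so the cache can be tested offline
        self._cache = {}      # host -> (fetched_at, body)

    def get(self, host):
        now = self.clock()
        entry = self._cache.get(host)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]   # still fresh: no request hits the server
        body = self.fetch(host)
        self._cache[host] = (now, body)
        return body
```

This is why a change to your robots.txt file may take up to the TTL to be noticed: requests made within the window are answered from the cached copy.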
Q: How can I contact you to report a problem?
If you still have questions, or want to speak with us about our crawler's behavior, you can contact us at crawler@turnitin.com. Providing the following information will help us determine what occurred when our crawler visited your site.

* A description of your question or problem.
* The IP address of the server which our crawler visited.
* The approximate time and date of the visit.
* A means to contact you if not by email.
* Entries from your server log(s) that pertain to our visit. In particular, the URLs we visited which triggered the problem.