How to stop search engines from crawling web pages of websites in Plesk for Linux

kb: how-to Plesk for Linux

Applicable to:

  • Plesk for Linux

Question

How to stop search engines from crawling web pages of websites in Plesk for Linux?

Answer

For all domains via additional nginx directives

 

Warning: This solution overwrites the existing additional nginx directives of every domain on the server. Before proceeding, make sure those directives can be safely overwritten.

  1. Connect to your Plesk server via SSH.
  2. Create a temporary file called directive_template:

    # touch directive_template 

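  3. Open directive_template in a text editor and add the configuration below. For any request whose User-Agent header matches one of the listed names, nginx closes the connection without sending a response (status 444):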

    if ($http_user_agent ~ Googlebot|SputnikBot|omgili|socialmediascanner|Jooblebot|SeznamBot|Scrapy|CCBot|linkfluence|veoozbot|Leikibot|Seopult|Faraday|hybrid|Go-http-client|SMUrlExpander|SNAPSHOT|getintent|ltx71|Nuzzel|SMTBot|Laserlikebot|facebookexternalhit|mfibot|OptimizationCrawler|crazy|Dispatch|ubermetrics|HTMLParser|musobot|filterdb|InfoSeek|omgilibot|DomainSigma|SafeSearch|CommentReader|meanpathbot|statdom|proximic|spredbot|StatOnlineRuBot|openstat|DeuSu|semantic|postano|masscan|Embedly|NewShareCounts|linkdexbot|GrapeshotCrawler|Digincore|NetSeer|help.jp|PaperLiBot|getprismatic|360Spider|Ahrefs|ApacheBench|Aport|Applebot|archive|BaiduBot|Baiduspider|Birubot|BLEXBot|bsalsa|Butterfly|Buzzbot|BuzzSumo|CamontSpider|curl|dataminr|discobot|DomainTools|DotBot|Exabot|Ezooms|FairShare|FeedFetcher|FlaxCrawler|FlightDeckReportsBot|FlipboardProxy|FyberSpider|Gigabot|HTTrack|ia_archiver|InternetSeer|Jakarta|Java|JS-Kit|km.ru|kmSearchBot|Kraken|larbin|libwww|Lightspeedsystems|Linguee|LinkBot|LinkExchanger|LinkpadBot|LivelapBot|LoadImpactPageAnalyzer|lwp-trivial|majestic|Mediatoolkitbot|MegaIndex|MetaURI|MJ12bot|MLBot|NerdByNature|NING|NjuiceBot|Nutch|OpenHoseBot|Panopta|pflab|pirst|PostRank|crawler|ptd-crawler|Purebot|PycURL|Python|QuerySeekerSpider|rogerbot|Ruby|SearchBot|SemrushBot|SISTRIX|SiteBot|Slurp|Sogou|solomono|Soup|spbot|suggybot|Superfeedr|SurveyBot|SWeb|trendictionbot|TSearcher|ttCrawler|TurnitinBot|TweetmemeBot|UnwindFetchor|urllib|uTorrent|Voyager|WBSearchBot|Wget|WordPress|woriobot|Yeti|YottosBot|Zeus|zitebot|ZmEu|Crowsnest|PaperLiBot|peerindex|ia_archiver|Slurp|Aport|NING|JS-Kit|rogerbot|BLEXBot|MJ12bot|Twiceler|Baiduspider|Java|CommentReader|Yeti|discobot|BTWebClient|Tagoobot|Ezooms|igdeSpyder|AhrefsBot|Teleport|Offline|DISCo|netvampire|Copier|HTTrack|WebCopier) {
        return 444;
    }

  4. Save the changes and close the file.
  5. Create a list with the names of all domains:

    # plesk bin site -l > domains_list 

  6. Apply the new nginx configuration to all domains and reload nginx:

    # while read -r domain; do install directive_template -o root -g nginx -m 600 "/var/www/vhosts/system/${domain}/conf/vhost_nginx.conf"; plesk sbin httpdmng --reconfigure-domain "${domain}" -no-restart; done < domains_list && service nginx reload
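
Because step 6 overwrites each domain's vhost_nginx.conf, it may be worth copying any existing directives aside first. The following is a minimal sketch, assuming the domains_list file from step 5 is in the current directory. The names VHOSTS_SYSTEM_DIR, BACKUP_DIR, and backup_nginx_directives are illustrative and are not part of Plesk:

```shell
#!/bin/sh
# Sketch: copy each domain's existing vhost_nginx.conf to a backup directory
# before step 6 overwrites it. All names below are illustrative assumptions.
VHOSTS_SYSTEM_DIR="${VHOSTS_SYSTEM_DIR:-/var/www/vhosts/system}"
BACKUP_DIR="${BACKUP_DIR:-./nginx_directives_backup}"

backup_nginx_directives() {
    # $1: file with one domain name per line (the domains_list from step 5)
    mkdir -p "$BACKUP_DIR" || return 1
    while read -r domain; do
        # Skip blank lines in the list.
        [ -n "$domain" ] || continue
        conf="${VHOSTS_SYSTEM_DIR}/${domain}/conf/vhost_nginx.conf"
        # Only non-empty files hold directives worth keeping.
        if [ -s "$conf" ]; then
            cp -p "$conf" "${BACKUP_DIR}/${domain}.vhost_nginx.conf" \
                && echo "backed up: ${domain}"
        fi
    done < "$1"
}

# Run only when the domain list from step 5 is present.
if [ -f domains_list ]; then
    backup_nginx_directives domains_list
fi
```

Run it as root after step 5 and before step 6. Domains without custom directives produce no output and no backup file.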

 

For a domain via additional nginx directives 

 

  1. Log in to Plesk.
  2. Go to Domains > example.com > Hosting & DNS > Apache & nginx Settings.
  3. Add the directives below to the Additional nginx directives field:

    if ($http_user_agent ~ Googlebot|SputnikBot|omgili|socialmediascanner|Jooblebot|SeznamBot|Scrapy|CCBot|linkfluence|veoozbot|Leikibot|Seopult|Faraday|hybrid|Go-http-client|SMUrlExpander|SNAPSHOT|getintent|ltx71|Nuzzel|SMTBot|Laserlikebot|facebookexternalhit|mfibot|OptimizationCrawler|crazy|Dispatch|ubermetrics|HTMLParser|musobot|filterdb|InfoSeek|omgilibot|DomainSigma|SafeSearch|CommentReader|meanpathbot|statdom|proximic|spredbot|StatOnlineRuBot|openstat|DeuSu|semantic|postano|masscan|Embedly|NewShareCounts|linkdexbot|GrapeshotCrawler|Digincore|NetSeer|help.jp|PaperLiBot|getprismatic|360Spider|Ahrefs|ApacheBench|Aport|Applebot|archive|BaiduBot|Baiduspider|Birubot|BLEXBot|bsalsa|Butterfly|Buzzbot|BuzzSumo|CamontSpider|curl|dataminr|discobot|DomainTools|DotBot|Exabot|Ezooms|FairShare|FeedFetcher|FlaxCrawler|FlightDeckReportsBot|FlipboardProxy|FyberSpider|Gigabot|HTTrack|ia_archiver|InternetSeer|Jakarta|Java|JS-Kit|km.ru|kmSearchBot|Kraken|larbin|libwww|Lightspeedsystems|Linguee|LinkBot|LinkExchanger|LinkpadBot|LivelapBot|LoadImpactPageAnalyzer|lwp-trivial|majestic|Mediatoolkitbot|MegaIndex|MetaURI|MJ12bot|MLBot|NerdByNature|NING|NjuiceBot|Nutch|OpenHoseBot|Panopta|pflab|pirst|PostRank|crawler|ptd-crawler|Purebot|PycURL|Python|QuerySeekerSpider|rogerbot|Ruby|SearchBot|SemrushBot|SISTRIX|SiteBot|Slurp|Sogou|solomono|Soup|spbot|suggybot|Superfeedr|SurveyBot|SWeb|trendictionbot|TSearcher|ttCrawler|TurnitinBot|TweetmemeBot|UnwindFetchor|urllib|uTorrent|Voyager|WBSearchBot|Wget|WordPress|woriobot|Yeti|YottosBot|Zeus|zitebot|ZmEu|Crowsnest|PaperLiBot|peerindex|ia_archiver|Slurp|Aport|NING|JS-Kit|rogerbot|BLEXBot|MJ12bot|Twiceler|Baiduspider|Java|CommentReader|Yeti|discobot|BTWebClient|Tagoobot|Ezooms|igdeSpyder|AhrefsBot|Teleport|Offline|DISCo|netvampire|Copier|HTTrack|WebCopier) {
        return 444;
    }

  4. Apply the changes.
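
Before applying a pattern like the one above, it can help to preview which User-Agent strings it would catch. The sketch below uses grep -E, whose unanchored, case-sensitive matching behaves like the nginx "~" operator for a plain alternation of names. BOT_PATTERN is a shortened, illustrative subset of the full list, and ua_is_blocked is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: check locally whether a User-Agent string matches a bot pattern.
# BOT_PATTERN is an illustrative subset, not the full list from this article.
BOT_PATTERN='Googlebot|AhrefsBot|SemrushBot|curl|Wget'

ua_is_blocked() {
    # Succeeds (exit 0) when the User-Agent in $1 contains any listed name.
    # Matching is case-sensitive, as with the nginx "~" operator.
    printf '%s\n' "$1" | grep -Eq "$BOT_PATTERN"
}

for ua in \
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
    "curl/8.5.0" \
    "Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0"
do
    if ua_is_blocked "$ua"; then
        echo "blocked: $ua"
    else
        echo "allowed: $ua"
    fi
done
```

Once the directives are applied, a request such as curl -A "Googlebot" http://example.com/ would be expected to fail with an empty reply, because nginx closes the connection on status 444.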

 

For a domain via robots.txt file


Create a robots.txt file in the domain's document root directory (by default, /var/www/vhosts/example.com/httpdocs/). The rules below ask all crawlers not to crawl any part of the website. Note that robots.txt is advisory: it is honored by well-behaved crawlers, but bots that ignore it are not blocked. Place the following in the robots.txt file:

    User-agent: *
    Disallow: /
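
The file can also be created from the command line. This is a minimal sketch, assuming the default document root mentioned above. DOCROOT and write_robots_txt are illustrative names, and the helper deliberately refuses to overwrite an existing robots.txt:

```shell
#!/bin/sh
# Sketch: create a robots.txt that disallows crawling of the whole site.
# DOCROOT defaults to the example path from this article; adjust as needed.
DOCROOT="${DOCROOT:-/var/www/vhosts/example.com/httpdocs}"

write_robots_txt() {
    # $1: document root directory of the domain
    target="$1/robots.txt"
    # Keep any existing file so custom rules are not lost.
    if [ -e "$target" ]; then
        echo "exists, not overwritten: $target" >&2
        return 1
    fi
    printf 'User-agent: *\nDisallow: /\n' > "$target"
}

# Run only when the document root actually exists on this server.
if [ -d "$DOCROOT" ]; then
    write_robots_txt "$DOCROOT"
fi
```

If the file is created as root, its ownership may need to be changed to the subscription's system user so that the site owner can edit it later.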
