Applicable to:
- Plesk for Linux
Question
How to stop search engines from crawling web pages of websites in Plesk for Linux?
Answer
Warning: This solution overwrites the additional nginx directives (the vhost_nginx.conf file) of every domain on the server. Before you proceed, make sure the existing nginx directives can be safely overwritten.
- Connect to your Plesk server via SSH.
- Create a temporary file called directive_template:
# touch directive_template
- Open directive_template in a text editor and add the configuration below:
CONFIG_TEXT: if ($http_user_agent ~ Googlebot|SputnikBot|omgili|socialmediascanner|Jooblebot|SeznamBot|Scrapy|CCBot|linkfluence|veoozbot|Leikibot|Seopult|Faraday|hybrid|Go-http-client|SMUrlExpander|SNAPSHOT|getintent|ltx71|Nuzzel|SMTBot|Laserlikebot|facebookexternalhit|mfibot|OptimizationCrawler|crazy|Dispatch|ubermetrics|HTMLParser|musobot|filterdb|InfoSeek|omgilibot|DomainSigma|SafeSearch|CommentReader|meanpathbot|statdom|proximic|spredbot|StatOnlineRuBot|openstat|DeuSu|semantic|postano|masscan|Embedly|NewShareCounts|linkdexbot|GrapeshotCrawler|Digincore|NetSeer|help.jp|PaperLiBot|getprismatic|360Spider|Ahrefs|ApacheBench|Aport|Applebot|archive|BaiduBot|Baiduspider|Birubot|BLEXBot|bsalsa|Butterfly|Buzzbot|BuzzSumo|CamontSpider|curl|dataminr|discobot|DomainTools|DotBot|Exabot|Ezooms|FairShare|FeedFetcher|FlaxCrawler|FlightDeckReportsBot|FlipboardProxy|FyberSpider|Gigabot|HTTrack|ia_archiver|InternetSeer|Jakarta|Java|JS-Kit|km.ru|kmSearchBot|Kraken|larbin|libwww|Lightspeedsystems|Linguee|LinkBot|LinkExchanger|LinkpadBot|LivelapBot|LoadImpactPageAnalyzer|lwp-trivial|majestic|Mediatoolkitbot|MegaIndex|MetaURI|MJ12bot|MLBot|NerdByNature|NING|NjuiceBot|Nutch|OpenHoseBot|Panopta|pflab|pirst|PostRank|crawler|ptd-crawler|Purebot|PycURL|Python|QuerySeekerSpider|rogerbot|Ruby|SearchBot|SemrushBot|SISTRIX|SiteBot|Slurp|Sogou|solomono|Soup|spbot|suggybot|Superfeedr|SurveyBot|SWeb|trendictionbot|TSearcher|ttCrawler|TurnitinBot|TweetmemeBot|UnwindFetchor|urllib|uTorrent|Voyager|WBSearchBot|Wget|WordPress|woriobot|Yeti|YottosBot|Zeus|zitebot|ZmEu|Crowsnest|PaperLiBot|peerindex|ia_archiver|Slurp|Aport|NING|JS-Kit|rogerbot|BLEXBot|MJ12bot|Twiceler|Baiduspider|Java|CommentReader|Yeti|discobot|BTWebClient|Tagoobot|Ezooms|igdeSpyder|AhrefsBot|Teleport|Offline|DISCo|netvampire|Copier|HTTrack|WebCopier) {
return 444;
}
- Save the changes and close the file.
- Create a list with the names of all domains:
# plesk bin site -l > domains_list
- Apply the new nginx configuration to all domains:
# while read -r domain; do install directive_template -o root -g nginx -m 600 "/var/www/vhosts/system/${domain}/conf/vhost_nginx.conf"; plesk sbin httpdmng --reconfigure-domain "${domain}" -no-restart; done < domains_list && service nginx reload
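Because the loop above replaces each domain's vhost_nginx.conf, it may be worth saving copies of any existing files first. The following is a minimal sketch, not a Plesk utility: backup_nginx_confs is a hypothetical helper name, and it assumes the same directory layout and domains_list file used in the steps above.

```shell
# Sketch: copy every non-empty vhost_nginx.conf to a timestamped .bak file.
# backup_nginx_confs is a hypothetical helper, not part of Plesk.
backup_nginx_confs() {
    vhosts_system_dir=$1   # for example /var/www/vhosts/system
    domains_file=$2        # for example domains_list
    stamp=$(date +%Y%m%d%H%M%S)
    while IFS= read -r domain; do
        # Skip blank lines in the list.
        [ -n "$domain" ] || continue
        conf="${vhosts_system_dir}/${domain}/conf/vhost_nginx.conf"
        # Only non-empty files contain directives worth keeping.
        if [ -s "$conf" ]; then
            cp -p "$conf" "${conf}.bak.${stamp}"
            echo "backed up: ${conf}"
        fi
    done < "$domains_file"
}
```

It would be run before the template is applied, for example as backup_nginx_confs /var/www/vhosts/system domains_list.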
To apply the directives to a single domain instead, use the Plesk interface:
- Log in to Plesk.
- Go to Domains > example.com > Hosting & DNS > Apache & nginx Settings.
- Add the directives below to the Additional nginx directives field:
CONFIG_TEXT: if ($http_user_agent ~ Googlebot|SputnikBot|omgili|socialmediascanner|Jooblebot|SeznamBot|Scrapy|CCBot|linkfluence|veoozbot|Leikibot|Seopult|Faraday|hybrid|Go-http-client|SMUrlExpander|SNAPSHOT|getintent|ltx71|Nuzzel|SMTBot|Laserlikebot|facebookexternalhit|mfibot|OptimizationCrawler|crazy|Dispatch|ubermetrics|HTMLParser|musobot|filterdb|InfoSeek|omgilibot|DomainSigma|SafeSearch|CommentReader|meanpathbot|statdom|proximic|spredbot|StatOnlineRuBot|openstat|DeuSu|semantic|postano|masscan|Embedly|NewShareCounts|linkdexbot|GrapeshotCrawler|Digincore|NetSeer|help.jp|PaperLiBot|getprismatic|360Spider|Ahrefs|ApacheBench|Aport|Applebot|archive|BaiduBot|Baiduspider|Birubot|BLEXBot|bsalsa|Butterfly|Buzzbot|BuzzSumo|CamontSpider|curl|dataminr|discobot|DomainTools|DotBot|Exabot|Ezooms|FairShare|FeedFetcher|FlaxCrawler|FlightDeckReportsBot|FlipboardProxy|FyberSpider|Gigabot|HTTrack|ia_archiver|InternetSeer|Jakarta|Java|JS-Kit|km.ru|kmSearchBot|Kraken|larbin|libwww|Lightspeedsystems|Linguee|LinkBot|LinkExchanger|LinkpadBot|LivelapBot|LoadImpactPageAnalyzer|lwp-trivial|majestic|Mediatoolkitbot|MegaIndex|MetaURI|MJ12bot|MLBot|NerdByNature|NING|NjuiceBot|Nutch|OpenHoseBot|Panopta|pflab|pirst|PostRank|crawler|ptd-crawler|Purebot|PycURL|Python|QuerySeekerSpider|rogerbot|Ruby|SearchBot|SemrushBot|SISTRIX|SiteBot|Slurp|Sogou|solomono|Soup|spbot|suggybot|Superfeedr|SurveyBot|SWeb|trendictionbot|TSearcher|ttCrawler|TurnitinBot|TweetmemeBot|UnwindFetchor|urllib|uTorrent|Voyager|WBSearchBot|Wget|WordPress|woriobot|Yeti|YottosBot|Zeus|zitebot|ZmEu|Crowsnest|PaperLiBot|peerindex|ia_archiver|Slurp|Aport|NING|JS-Kit|rogerbot|BLEXBot|MJ12bot|Twiceler|Baiduspider|Java|CommentReader|Yeti|discobot|BTWebClient|Tagoobot|Ezooms|igdeSpyder|AhrefsBot|Teleport|Offline|DISCo|netvampire|Copier|HTTrack|WebCopier) {
return 444;
}
- Apply the changes.
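The ~ operator in the directive above performs a case-sensitive, unanchored regular-expression match, so broad tokens such as curl, Python, or Java block any client whose User-Agent contains them. To preview which user agents a pattern would catch, the match can be approximated locally with grep -E. This is only a sketch: nginx uses PCRE rather than POSIX extended regular expressions (the two should agree for a plain alternation like this one), UA_PATTERN is an abbreviated subset of the list, and ua_is_blocked and check_ua are hypothetical helper names.

```shell
# Abbreviated subset of the alternation above; substitute the full list to test it.
UA_PATTERN='Googlebot|AhrefsBot|curl|Python|Wget'

# Hypothetical helper: exit status 0 if the user agent matches the pattern.
ua_is_blocked() {
    printf '%s\n' "$1" | grep -Eq -- "$UA_PATTERN"
}

# Hypothetical helper: print a human-readable verdict for one user agent.
check_ua() {
    if ua_is_blocked "$1"; then
        echo "blocked: $1"
    else
        echo "allowed: $1"
    fi
}

check_ua 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
check_ua 'curl/8.5.0'
# Lowercase "python" is not caught by the case-sensitive "Python" token.
check_ua 'python-requests/2.31.0'
check_ua 'Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0'
```

On a live server, a request such as curl -A 'AhrefsBot' -I http://example.com/ would typically fail with an empty reply, because return 444 makes nginx close the connection without sending a response.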
Create a robots.txt file in the domain's document root directory (by default, /var/www/vhosts/example.com/httpdocs/). The rules below ask all crawlers not to crawl any part of the website; only bots that honor robots.txt will comply. Place the following in the robots.txt file:
CONFIG_TEXT: User-agent: *
Disallow: /
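The file can also be created from the command line. A minimal sketch, assuming shell access to the server; write_robots_txt is a hypothetical helper name, and the document root is passed as an argument rather than hard-coded.

```shell
# Sketch: write a robots.txt that disallows all crawling into a document root.
# write_robots_txt is a hypothetical helper, not part of Plesk.
write_robots_txt() {
    docroot=$1
    # Refuse to run if the target directory does not exist.
    [ -d "$docroot" ] || { echo "no such directory: $docroot" >&2; return 1; }
    # The quoted heredoc delimiter keeps the content literal.
    cat > "${docroot}/robots.txt" <<'EOF'
User-agent: *
Disallow: /
EOF
}
```

For the default layout this would be called as write_robots_txt /var/www/vhosts/example.com/httpdocs. When it is run as root, the new file is owned by root, so you may want to change its ownership to match the other files in the document root.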