On Fri, Sep 10, 2010 at 9:44 AM, Combs, Richard
<richard -dot- combs -at- polycom -dot- com> wrote:
>
> Peter Neilson wrote:
>
> > You also have to keep mention of http://www.a.com/iwwwi and
> > http://www.a.com/ubb off your blogs, tweets, facebooks, youtubes and
> > other public places.
>
> Don't count on that. If a web crawler follows a link to any page on your site,
> it's likely to crawl the entire site.
While this is true, web crawlers don't usually guess at URLs outside
the standard naming structure. Unless they're directed to a location by
an explicit URL or IP address, they don't normally invent random folder
and file names. But you're right that crawlers will scan everything
attached to the links and embedded file locations they do find.
> You can explicitly exclude specific pages using either a robots.txt file
> or robots meta tags in the individual pages.
This does more harm than good. Malicious robots ignore those
instructions and instead think, "Hey, there's a /bleep directory? Let's
see what's in it!" and try to pull a directory listing. If a crawler is
malicious enough to request random file and folder names that don't
exist, it will certainly read your robots.txt to see what you've
explicitly forbidden it to index.
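For example, a robots.txt along these lines (using the example paths
quoted earlier; substitute your own) keeps well-behaved crawlers away
from those folders, but it also hands a hostile crawler a tidy list of
exactly the directories you'd rather keep hidden:

    User-agent: *
    Disallow: /iwwwi/
    Disallow: /ubb/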
If you have control over your server, the best defence is to disable
raw directory listings. If a web crawler finds an image on any page
with the source location www.sitename.com/foldera/file.jpg, it will
probably scan the folder level too. I've seen lots of raw directory
listings turn up in Google searches, one of them owned by a member of
this list, even. (I found it purely by accident! I swear! Impressive
resume, nonetheless.)
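On Apache, for instance, a single directive takes care of it. A
minimal sketch (the directory path is just a placeholder for your own
web root), placed in httpd.conf or in an .htaccess file if overrides
are allowed:

    <Directory "/var/www/html">
        Options -Indexes
    </Directory>

IIS has an equivalent directory-browsing setting you can switch off in
its management console.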
In the absence of full web server control, make sure all directories
have a default page (index.html, index.htm, or index.php on Apache
servers, or default.asp on IIS). I've scanned websites' images, styles,
and scripts folders to quickly access their content. Even an empty
page, or one that redirects the viewer back to the root, does wonders
for preventing access, at least from the web side.
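A bare-bones stub like this one (just a sketch; dress it up however
you like) is enough to bounce a snooping visitor back to your home
page:

    <html>
    <head>
    <meta http-equiv="refresh" content="0; url=/">
    <title>Redirecting</title>
    </head>
    <body></body>
    </html>

Drop a copy into every folder that lacks a real default page.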
More sophisticated attackers work at the console level, where these
rules no longer apply. Verify your web folders against a known-good
backup often to make sure a hacker hasn't stashed files on your system.
In the past, my FTP and web roots have been exploited for spam and
illegal file sharing. If I hadn't been monitoring my sites, I would
never have noticed the activity.
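If you'd rather not eyeball the folders by hand, a small script can do
the comparison. Here's a rough Python sketch (the two paths are
placeholders for wherever your backup and web root actually live):

    #!/usr/bin/env python
    # Compare the live web root against a known-good backup and report
    # files that differ or exist only on the live side: likely
    # candidates for planted content.
    import hashlib
    import os

    BACKUP_ROOT = "/srv/backups/webroot"   # known-good copy (placeholder)
    LIVE_ROOT = "/var/www/html"            # live web root (placeholder)

    def file_hash(path):
        """Return the SHA-256 digest of a file's contents."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def snapshot(root):
        """Map each file's relative path to its content hash."""
        hashes = {}
        for dirpath, _dirs, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                hashes[os.path.relpath(full, root)] = file_hash(full)
        return hashes

    backup = snapshot(BACKUP_ROOT)
    live = snapshot(LIVE_ROOT)

    for rel in sorted(live):
        if rel not in backup:
            print("EXTRA on live site: %s" % rel)
        elif live[rel] != backup[rel]:
            print("MODIFIED on live site: %s" % rel)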