How to Find Private Documents on a Public Server or Website - HCCI-215 Session 1

With a few quick tips, you can find out what companies keep on their websites, but don't want you to see.  Using that same knowledge, you can work to secure your own organization.

Enroll in the course now: Enter your e-mail for VIP Updates at the top of the page

The Elevator Fairy

When I was a BA candidate at Drew University many years ago, I was tapped for an internship in the Development Office and had an experience there that would change my perception of business and technology forever.  A few weeks in, my supervisor, knowing my skill with technology, approached me with a simple request: "Find me a list of 10 groups or individuals capable of donating an elevator and potentially interested in doing so."  In essence, we needed an elevator fairy, who would wave a wand of funding and deliver this integral project component to us.  A simple request to be sure, but how does one find such a list of individuals, especially back then?

This was back before Google was a force, back when Netscape still held market share, back before Google even surpassed North Korea in value.  In other words, I was largely on my own.  Long story short, I used a business directory to build a list of elevator company executives, then laboriously whittled down that list to the individuals who had graduated from small liberal arts colleges (particularly northeastern) or seemed especially inclined to philanthropy.  For perspective, Google News was still in Beta at the time, so even trying to find press releases about large charitable donations posed a challenge.

If posed with the same challenge today, there's no denying it: I'd search Google for philanthropic elevator execs (though today, universities tend to keep much better track of their alumni, so I might have better options networking through that database).  Google is not the be-all and end-all of research, however.  Some webpages are purposely excluded from Google by the companies that host them.  Other times, you may not know exactly what you are looking for.  Occasionally, you'll have a vague idea of what you're looking for, but need to narrow down the results dramatically.  Finally, you may have the data you need, but find it to be useless without other data that you are missing.

We'll go through several scenarios using real-life examples of Competitive Intelligence techniques that anyone with the skill to use email or look up movie showtimes will be able to carry out themselves.  As a quick guideline, the scenarios will not advise any action that is unethical or illegal and will not reveal any information that could harm anyone, economically or otherwise.  As always, refer any legal questions to your corporate counsel before attempting any of this.

Scenario #1: What the White House Doesn't Want Google to Know

The first Competitive Intelligence scenario is old, but still remarkably effective.  Want a handy list of webpages that The President of the United States doesn't want to show up in search results?  Head over to the White House website's robots.txt file (robots.txt sits at the root of nearly every website).  What will pop up on the screen is a bunch of text.  What you are actually seeing is a list of files that the site's webmaster does not want search engines to index.  That means that, though those links may or may not be publicly available, the webmaster has requested that search engines not include them (and most search engines comply).  This is what is known as "security through obscurity."  Good things to put into robots.txt: the webpage with forms that your caterer fills out to get paid for running an event.  Bad things to put into robots.txt: your business' security plan, which you put online as a convenience for your insurer or employees.

Robots.txt provides absolutely no security, as it is essentially a shopping list of URLs that a hacker (or in IT parlance, a "cracker") might want to attempt to break into.  Americans will be heartened to know that the URLs currently listed in the White House's robots.txt file are only default configuration pages.  That means that this file is just a part of the default server configuration from when the website was last updated and that no one at the White House currently thinks this method provides any real security.  That's good.  What's bad is that back under the Bush Administration, the robots.txt file had grown into a sizeable 2,400-line censorship behemoth.  To be clear, this has everything to do with the choices of technical advisers and little to do with that administration, its politics, or the price of tea in China.  The Obama Administration did use a real, custom-configured robots.txt file for a while, but it was nothing like the previous administration's.
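Pulling a site's robots.txt and listing its "Disallow:" entries takes only a few lines of code.  Here is a minimal sketch using Python's standard library; the sample body and the helper names (`fetch_robots_txt`, `parse_disallow`) are my own for illustration, and example.com is a placeholder, not a suggested target:

```python
import urllib.request


def fetch_robots_txt(site: str) -> str:
    """Download https://<site>/robots.txt; the file always lives at the site root."""
    with urllib.request.urlopen(f"https://{site}/robots.txt") as resp:
        return resp.read().decode("utf-8", "replace")


def parse_disallow(robots_txt: str) -> list[str]:
    """Extract the paths listed after 'Disallow:' in a robots.txt body."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:  # a bare "Disallow:" means "nothing is off-limits"
                paths.append(path)
    return paths


# A made-up body in the shape you'll see in the wild:
sample = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /secure-plans/   # exactly what robots.txt should NOT be used to hide
"""
print(parse_disallow(sample))  # → ['/cgi-bin/', '/secure-plans/']
# Against a live site you would call: parse_disallow(fetch_robots_txt("example.com"))
```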

Getting down to business, if you're a hospital administrator and you want to know, for example, what software solution your rival hospital 15 miles away is using to recruit and hire new talent, the robots.txt file might be a good place to look.  You also might want to look at the robots.txt file if you've heard rumors of a buyout or a RIF (reduction in force) and you aren't sure how to search more specifically.

In that case, if you haven't figured it out already, you'd paste the URLs that appear next to "Disallow:" into your browser's address bar and see what comes up.  For example, if we look at CNN's robots.txt, we find "NOKIA" as one of the entries.  Entering the resulting URL into the address bar reveals CNN's mobile news app install page, hardly a state secret, but not something they wanted indexed by Google.
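Since "Disallow:" entries are paths relative to the site root, turning them into pasteable URLs is a one-liner with the standard library.  A quick sketch; the paths here are hypothetical, in the spirit of the "NOKIA" entry above:

```python
from urllib.parse import urljoin


def disallow_to_urls(base: str, disallow_paths: list[str]) -> list[str]:
    """Combine a site's base URL with robots.txt Disallow paths into full URLs."""
    return [urljoin(base, path) for path in disallow_paths]


# Hypothetical entries for a hypothetical site:
print(disallow_to_urls("https://www.example.com", ["/NOKIA/", "/beta/upload/"]))
# → ['https://www.example.com/NOKIA/', 'https://www.example.com/beta/upload/']
```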

This isn't theoretical either: as shown in the example above, numerous news organizations, activists, and security researchers actively monitor the White House's robots.txt file, hoping to find leaked information that might bolster their conclusions, if not their careers and personal celebrity.  It's fair to presume that foreign nations are doing the same thing.

In another example, roughly a decade later, LinkedIn suffered a massive data breach that, according to Computer Weekly (2014), was likely orchestrated with help from the site's robots.txt file.  In that attack, the criminals may have used robots.txt to help determine how to target their automated scraping software.

On a hospital website researched for this article, one entry from robots.txt led to a publications page, which, while not strictly private, provided an index of publications that the organization would probably prefer to deliver to its various audiences directly through tailored marketing channels (for metrics and strategic purposes).

Two Related Tricks:

  • To see everything that Google has indexed from a website, search Google using the site: operator followed by the domain, without quotes (e.g. site:example.com)
  • To see if a website has a beta (experimental) site up for testing, try adding "beta." in front of the domain (e.g. beta.example.com, which on many sites is just a test of a new website layout).  Beta websites can reveal surprising things about an organization.  As of this writing, Google has over 1 million results for ".org" websites with beta websites.
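The second trick above can be checked without even opening a browser: if "beta." in front of the domain resolves in DNS, something is probably configured there.  A sketch, again with Python's standard library; `beta_site_exists` is my own helper name, and a DNS hit is only a hint that a test site exists, not proof:

```python
import socket


def beta_site_exists(domain: str) -> bool:
    """Return True if beta.<domain> resolves in DNS, a hint that a test site may be up."""
    try:
        socket.gethostbyname(f"beta.{domain}")
        return True
    except OSError:  # covers socket.gaierror: no such host (or no network)
        return False


print(beta_site_exists("example.com"))
```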

Legal Notes: When in doubt, don't do it.  In general, you should be able to expect that clicking on a link itself should not constitute an illegal act.  If at any point you feel that you are bypassing some element of security or impersonating a third party (even inadvertently), discuss the matter with your legal counsel before proceeding further.


The entirety of Christopher Lotito's Health Care CFO Competitive Intelligence Master Class can be found online.  The self-paced course runs through late March 2015, and the content will remain online after.

Contact the author on LinkedIn or via the comment form.
