Discovering Information Leakage In Files

[NOTE: All information was gathered from public websites]

Discovering information leakage in files, and why it’s important

During the build-up to our recent product launch, the Shearwater Ethical Hacking team (SEH) conducted a great deal of research into phishing attacks and how they are used to compromise countless individuals, corporations and governments every day. SEH has been conducting ‘client-side’ penetration testing for a while, and we are continually fascinated by what we discover.

As it turns out, 9 out of 10 Advanced Persistent Threats (APTs) start with a phishing attack, and we can almost guarantee that each hacker conducts an initial phase of information collection similar to the one we cover here. Hackers use this method of information collection in their phishing attacks because the information is plentiful, quick to gather, easy to sort and valuable, and because the phishing attack vector has such a high rate of successful compromise.

So we decided there is no better time to talk about phishing and how hackers collect your personal information. In this post we will demonstrate one simple method that hackers use to collect information, why it’s a problem, why it affects you, and what you can do about it. To keep this post simple to read we will keep it business focused, but the same principles apply to individuals and governments.

All too often there is a business requirement to host a document (pdf, doc, xls, jpeg, etc.) on the company website as a means of sharing it with the world and, more importantly, your potential customers. While this may kick goals for your organisation, it dramatically increases the field that hackers can play on. As a demonstration, in the figure below 92,500,000 results are returned when searching only for Word documents! So it’s already a pretty big field, and this makes you think… Everyone seems to do it, so it must be OK… Wrong.

While there is a ridiculous number of documents published on the web, they do not all belong to your organisation. So to give you a better indication, let’s target a few big names.

And so on. Remember, though, that this is only looking for “.doc” files. There is a plethora of other file types a hacker can use to gather similar information.
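To make the searches above concrete, here is a minimal sketch of how such a query could be assembled for several file types and one target site. The function name `build_file_search_query` and the `example.com` domain are our own illustrative assumptions; the `filetype:` and `site:` operators are standard Google search operators, though how a search engine treats the combined expression may vary.

```python
def build_file_search_query(file_types, site=None):
    """Build a search query limited to the given file types and, optionally, one site.

    Hypothetical helper for illustration; it only builds a string and
    performs no network requests.
    """
    if not file_types:
        raise ValueError("at least one file type is required")

    # Normalise ".DOC" and "doc" to the same operator value.
    clauses = [f"filetype:{ext.lstrip('.').lower()}" for ext in file_types]
    type_clause = " OR ".join(clauses)
    if len(clauses) > 1:
        type_clause = f"({type_clause})"

    if site:
        return f"site:{site} {type_clause}"
    return type_clause


if __name__ == "__main__":
    # Search the whole web for Word documents only.
    print(build_file_search_query(["doc"]))
    # Narrow the search to one (placeholder) organisation and more file types.
    print(build_file_search_query(["doc", "xls", "pdf"], site="example.com"))
```

Adding the `site:` clause is all it takes to turn a web-wide search into one aimed at a single organisation.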

Until now we have been setting up the canvas to explain the real risk, so here we go. Just about all of these document types contain lots of juicy information called “metadata”, which is stored in the file without the user knowing. It can include usernames, printers, software versions, operating systems, email addresses, GPS coordinates, passwords (if we’re lucky) and pretty much anything else. It’s a goldmine of information about the organisation being targeted. Hackers use this type of information daily to target your business, and you are unknowingly handing it to them.

As any good hacker will attest, there is no need to do things manually when you can automate them! In other words, we don’t click through Google downloading and examining each document individually. Instead we use tools like FOCA. FOCA is a tool written by Informatica64 that automates the whole process of searching for, downloading, analysing and sorting all the information about the targeted organisation. It can also be used to analyse documents that have not even been published. It has many other features too, but they will have to wait for another day.

So that we are not seen to be targeting any specific organisation, our real-world example uses the first 30 Word documents displayed in the Google search. Remember, though, that it’s trivial to target a specific organisation if we want to.

So here is a list of the first 30 documents. They were all Microsoft Word documents, and we were able to download all of them automatically using the links supplied to FOCA by the Google search engine.

Once we have downloaded these documents, it takes only a few mouse clicks to extract and analyse all the metadata stored within them. Out of the 30 documents we downloaded, FOCA was able to extract 44 different usernames:
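To give a feel for what a tool is doing during those few mouse clicks, here is a rough sketch that reads author and software fields out of a document. This is not how FOCA itself is implemented. It also assumes the newer `.docx` format, which is a zip archive with its properties in `docProps/core.xml` and `docProps/app.xml`; the legacy `.doc` files in our example use a different binary format that needs a dedicated parser. The function names and the sample values (`jsmith`, `a.jones`) are made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the Office Open XML property parts.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "ep": "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties",
}


def read_docx_metadata(source):
    """Return author and software metadata from a .docx path or file-like object."""
    found = {}
    with zipfile.ZipFile(source) as archive:
        names = set(archive.namelist())
        if "docProps/core.xml" in names:
            core = ET.fromstring(archive.read("docProps/core.xml"))
            found["creator"] = core.findtext("dc:creator", default="", namespaces=NS)
            found["last_modified_by"] = core.findtext(
                "cp:lastModifiedBy", default="", namespaces=NS
            )
        if "docProps/app.xml" in names:
            app = ET.fromstring(archive.read("docProps/app.xml"))
            found["application"] = app.findtext("ep:Application", default="", namespaces=NS)
            found["app_version"] = app.findtext("ep:AppVersion", default="", namespaces=NS)
    return found


def make_sample_docx():
    """Build a tiny in-memory archive with made-up metadata, for demonstration only."""
    core = (
        '<cp:coreProperties xmlns:cp="{cp}" xmlns:dc="{dc}">'
        "<dc:creator>jsmith</dc:creator>"
        "<cp:lastModifiedBy>a.jones</cp:lastModifiedBy>"
        "</cp:coreProperties>"
    ).format(**NS)
    app = (
        '<Properties xmlns="{ep}">'
        "<Application>Microsoft Office Word</Application>"
        "<AppVersion>12.0000</AppVersion>"
        "</Properties>"
    ).format(**NS)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("docProps/core.xml", core)
        archive.writestr("docProps/app.xml", app)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for field, value in read_docx_metadata(make_sample_docx()).items():
        print(f"{field}: {value}")
```

Running the same function over a folder of downloaded files and collecting the distinct `creator` and `last_modified_by` values would produce a username list of the kind described here.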

The figure above does not show all 44, but if we were only targeting a single organisation it would still give us plenty of names to start with. We can also collect the folders or directory paths in which the documents were saved. Sometimes this reveals details such as the names of network file servers.
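As a small illustration of why those paths matter, the sketch below pulls a username out of a Windows profile path and a server name out of a network (UNC) path. The function name `classify_path`, the two patterns and the sample paths are our own assumptions for illustration; real metadata contains many more path styles than these two.

```python
import re

# Local Windows profile paths, e.g. C:\Users\jsmith\... or
# C:\Documents and Settings\jsmith\...
PROFILE_RE = re.compile(
    r"^[A-Za-z]:\\(?:Users|Documents and Settings)\\([^\\]+)\\", re.IGNORECASE
)
# UNC network paths, e.g. \\fileserver01\finance\...
UNC_RE = re.compile(r"^\\\\([^\\]+)\\([^\\]+)")


def classify_path(path):
    """Report what a saved-file path reveals about a user or a file server."""
    profile = PROFILE_RE.match(path)
    if profile:
        return {"kind": "local_profile", "username": profile.group(1)}
    unc = UNC_RE.match(path)
    if unc:
        return {"kind": "network_share", "server": unc.group(1), "share": unc.group(2)}
    return {"kind": "unknown"}


if __name__ == "__main__":
    samples = [
        r"C:\Documents and Settings\jsmith\My Documents\report.doc",
        r"\\fileserver01\finance\budget.doc",
        "/tmp/notes.doc",
    ]
    for sample in samples:
        print(sample, "->", classify_path(sample))
```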

We can see what printer was used to print the document. This can tell us if they are using a network printer or one connected directly to their computer.

We can see the software that was used to create the document, as well as the operating system the author uses. This information is very important because it helps us choose which exploit we need to run to establish access.

Last but not least, we were able to collect one email address, although there are many other, more effective methods for retrieving this information.

While this may not be a massive amount of information, it can easily be built upon by using many other public sources. When all this information is combined, it definitely helps attackers produce extremely convincing phishing attacks.

So what can you do to help reduce the surface area of this threat? There are a few options available. They may not all be suitable for your working environment, but implementing even a few of them will drastically reduce the leakage of sensitive information through metadata.

  • Prior to publishing any Office document, it is important to prepare it for publishing. Microsoft Office 2007 onwards has this feature built in (the Document Inspector), and it is available through the file menu. It scans various areas of the document and presents its findings to you. It then gives you the option to remove them.
  • Where possible, publish documents as PDF files, as this process typically removes a lot of the sensitive metadata such as review comments and tracked changes.
  • By viewing the properties of a file, users are able to view and sometimes modify and delete the metadata. The link at the bottom of the figure below demonstrates how to remove this information. Keep in mind that you can do this for all types of files, not just documents. (For a bit of an eye-opening example, check out some of your own digital photos!)
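For teams that publish many files, part of this clean-up can be scripted. The following is a minimal sketch, assuming `.docx` files, that blanks the author fields in `docProps/core.xml` before publication. The function name `scrub_docx` and the choice of fields are our own. It does not touch comments, tracked changes, embedded objects or `docProps/app.xml`, so treat it as a complement to the built-in Office feature, not a replacement for it.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CP = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
DC = "http://purl.org/dc/elements/1.1/"

# Keep the conventional prefixes when the XML is written back out.
for prefix, uri in {
    "cp": CP,
    "dc": DC,
    "dcterms": "http://purl.org/dc/terms/",
    "dcmitype": "http://purl.org/dc/dcmitype/",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}.items():
    ET.register_namespace(prefix, uri)

# Fields that identify people; extend this set as needed (our assumption).
SENSITIVE_TAGS = {f"{{{DC}}}creator", f"{{{CP}}}lastModifiedBy"}


def scrub_docx(data):
    """Return a copy of the .docx bytes with author fields in core.xml blanked."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as source, zipfile.ZipFile(
        output, "w", zipfile.ZIP_DEFLATED
    ) as target:
        for item in source.infolist():
            payload = source.read(item.filename)
            if item.filename == "docProps/core.xml":
                root = ET.fromstring(payload)
                for element in root.iter():
                    if element.tag in SENSITIVE_TAGS:
                        element.text = ""
                payload = ET.tostring(root, encoding="utf-8")
            # Every other part of the archive is copied through unchanged.
            target.writestr(item.filename, payload)
    return output.getvalue()


if __name__ == "__main__":
    import sys

    # Usage: python scrub_docx.py input.docx output.docx
    with open(sys.argv[1], "rb") as handle:
        cleaned = scrub_docx(handle.read())
    with open(sys.argv[2], "wb") as handle:
        handle.write(cleaned)
```

Always open the cleaned file and check its properties before publishing it, since documents can carry identifying information in places this sketch does not look.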

To stop this blog post turning into a novel, we will end it here. However, if you have any questions regarding the post or any other related matters, feel free to contact us.