The Dark Web has a Serious Deduplication Problem

January 17, 2019  |  Tony DeGonia

In a post released on 1/8/19, I wrote about the record number of breaches in 2018. That brought to mind an episode of the 443 Podcast I listened to a few days back, hosted by Corey Nachreiner, CTO of WatchGuard Technologies, Inc. Corey discussed the potential data deduplication problem on the Dark Web. This article attempts to break down how that problem arises and how it can cause issues not only for users of the Dark Web, but also for those whose data has been stolen and placed there for purchase.

The breaches of 2018 were vast and widespread, affecting businesses from fast food to department stores to airlines, with record amounts of data lost. If you look at just the breaches I referenced in the previous article, the total PII record count is over one billion in the United States. In India, every citizen in the country had their data compromised in the breach of Aadhaar, the Indian biometric IT program owned and operated by the government of India. The Aadhaar breach alone accounted for 1.1 billion records lost to hackers.

Researching this, I discovered that for just the US-based hacks in the article, Americans and foreign travelers doing business with one of the breached companies had a total of 1.3 billion records stolen. There are approximately 330 million citizens of the United States, so if every person in the US were affected, each would have had their personally identifiable information exposed to the Dark Web approximately 4 times.

While that may not seem like a lot, consider that it would be nearly impossible for every US citizen to be breached. The US does not have a mandatory centralized identification system as the Indian government has. And, of course, not all 330 million Americans were affected, due to lack of exposure to the breached sites, age, and other factors. Let’s say that 150 million Americans were affected in some way, which would mean that a little under half of all US citizens were caught up in the breaches of 2018. Let’s also assume that another 150 million citizens of other countries were affected. That comes to 300 million people affected in total.

With a nice round number like 300 million people affected, one could assume there would be some duplicate records. In fact, there are probably a lot of them. Dividing the 1.3 billion stolen records by 300 million people, I calculate an average of about 4.333 records per affected person. This is admittedly a pretty arbitrary number, considering some people are more active than others on the web or at a particular retailer. Some people fly frequently, while others may not fly or stay in hotels at all. But it is an estimate to work with.
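For readers who want to check the arithmetic, here is a minimal Python sketch of the two per-person averages. The record count and population are the figures quoted in this article; the 300 million head count is my assumption from above, and the function and variable names are only illustrative.

```python
# Figures quoted in this article. The 300 million head count is the
# article's own assumption, not a measured value.
RECORDS_STOLEN = 1_300_000_000
US_POPULATION = 330_000_000
ASSUMED_PEOPLE_AFFECTED = 300_000_000


def records_per_person(records: int, people: int) -> float:
    """Average number of stolen records per person."""
    return records / people


# Case 1: every person in the US was affected.
per_us_resident = records_per_person(RECORDS_STOLEN, US_POPULATION)

# Case 2: the assumed 300 million people (US plus other countries).
per_affected = records_per_person(RECORDS_STOLEN, ASSUMED_PEOPLE_AFFECTED)

print(f"Per US resident:     {per_us_resident:.2f}")  # 3.94
print(f"Per affected person: {per_affected:.3f}")     # 4.333
```

These are averages only. They say nothing about how the duplicates are actually spread across individuals.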

From the results of the 2018 breaches, it is fairly safe to say that a very large number of people globally had their PII stolen, and many of them had it stolen several times. Each time, a little more, and slightly different, information was taken. Many people look at a cyber breach as a big, scary, and mysterious thing. What they should be more concerned with is that their data is stolen multiple times, from different sources.

Much of the stolen information is static, like Social Security numbers and driver’s license numbers; however, much of it is not. You can change your credit card numbers, passport numbers, addresses, and phone numbers. You can even improve your health or change it in some way that would make the stolen data inaccurate.
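To make the static-versus-changeable distinction concrete, here is a small Python sketch. Everything in it is hypothetical: the records are made up, and the field names (`static_id`, `phone`, `card_last4`) are assumptions chosen for illustration, not the layout of any real breach dump. It groups records by a static identifier and reports which changeable fields disagree between copies.

```python
from collections import defaultdict

# Made-up records for illustration only; field names are assumptions.
records = [
    {"static_id": "person-A", "source": "retailer", "phone": "555-0100", "card_last4": "1111"},
    {"static_id": "person-A", "source": "airline", "phone": "555-0100", "card_last4": "2222"},
    {"static_id": "person-A", "source": "hotel", "phone": "555-0199", "card_last4": "2222"},
    {"static_id": "person-B", "source": "retailer", "phone": "555-0142", "card_last4": "3333"},
]

CHANGEABLE_FIELDS = ("phone", "card_last4")


def summarize(rows):
    """Group rows by static identifier and list changeable fields that conflict."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["static_id"]].append(row)

    summary = {}
    for person, copies in grouped.items():
        conflicts = {}
        for field in CHANGEABLE_FIELDS:
            values = sorted({copy[field] for copy in copies})
            if len(values) > 1:  # the copies disagree on this field
                conflicts[field] = values
        summary[person] = {"copies": len(copies), "conflicts": conflicts}
    return summary


result = summarize(records)
for person, info in result.items():
    print(person, info)
```

In this toy data, person-A appears three times with two different phone numbers and two different card numbers, so at least some of those copies are stale. person-B appears once and has no conflicts.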

Now take the statistics from the 2018 breaches and repeat them for each of the last 10 years, and you will get a rough picture of the data deduplication issue. At 1.3 billion records a year, basic math on that scenario gives about 13 billion PII records across the roughly 300 million people affected. That works out to about 43 PII records per person over a 10-year timeframe, and growing.

That leaves us with a large collection of static and changeable information (the latter never updated in the stolen records). With this data, it is possible for someone with the proper data analysis skills to purchase or steal this information, feed it into a data engine, and see trends and habits, perhaps even to the degree of being able to create an identity based on the aggregated information. The information could also be fed to artificial intelligence or machine-learning engines for predictive analysis of a single person or whole populations, based on duplicated data that is considered old or expired. The aggregated information could be quite dangerous in the wrong hands. The Dark Web is indeed a scary place for this and other reasons.
