I think I’ve finally caught my breath after dealing with those 23 billion rows of stealer logs last week. That was a bit intense, as is usually the way after any large incident goes into HIBP. But the complicated nature of stealer logs, coupled with an overly long blog post explaining them and the conflation of which services needed a subscription versus which were easily accessible by anyone, made for a very intense last 6 days. And there were the issues around source data integrity on top of everything else, but I’ll come back to that.
When we launched the ability to search through stealer logs last month, that wasn’t the first corpus of data from an info stealer we’d loaded, it was just the first time we’d made the website domains they expose searchable. Now that we have an actual model around this, we’re going to start going back through those prior incidents and backfilling the new searchable attributes. We’ve just done that with the 26M unique email address corpus from August last year and added a bunch of previously unseen instances of an email address mapped against a website domain. We’ve also now flagged that incident as “IsStealerLog”, so if you’re using the API, you’ll see that attribute now set to true.
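If you're consuming breach data from the API, picking out the flagged incidents is a one-liner. The following is only a minimal sketch: it assumes you've already fetched a list of breach objects and decoded the JSON, the function name is hypothetical, and the sample breach names are made up purely for illustration.

```python
def stealer_log_breach_names(breaches):
    """Return the names of breaches whose IsStealerLog attribute is true.

    `breaches` is assumed to be a list of dicts decoded from the API's
    JSON response. Older cached payloads may lack the attribute entirely,
    so a missing key is treated as False.
    """
    return [b["Name"] for b in breaches if b.get("IsStealerLog", False)]


# Made-up sample payload, NOT real breach names.
sample = [
    {"Name": "ExampleStealerCorpus", "IsStealerLog": True},
    {"Name": "ExampleForumBreach", "IsStealerLog": False},
    {"Name": "ExampleLegacyRecord"},  # attribute absent
]

print(stealer_log_breach_names(sample))
```

Only the first made-up entry is returned, because it's the only one with the flag set.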
For the most part, that data is all handled just the same as the existing stealer log data: we map email addresses to the domains they’ve appeared against in the logs, then make all that searchable by full email address, email address domain or website domain (read last week’s really, really long blog post if you need an explainer on that). But there’s one crucial difference that we’re applying both to the backfilling and the existing data, and that’s related to a bit of cleaning up.
A theme that emerged last week was that there were email addresses that only appeared against one domain, and that was the domain the address itself was on. If john@gmail.com is in there and the only domain he appears against is gmail.com, what’s up with that? At face value, John’s details have been snared whilst logging on to Gmail, but it doesn’t make sense that someone infected with an info stealer only has one website they’ve logged into captured by the malware. It should be many. This seems to be due to a combination of the source data containing credential stuffing rows (just email and password pairs) amidst info stealer data, and our processing pipeline introducing integrity issues somewhere along the way due to the odd inputs. Garbage in, garbage out, as they say.
So, we’ve decided to apply some Occam’s razor to the situation and go with the simplest explanation: a single entry for an email address on the domain of that email address is unlikely to indicate an info stealer infection, so we’re removing those rows. And not adding any more that meet that criterion. But there’s no doubt the email address itself existed in the source; there is no level of integrity issues or parsing errors that causes john@gmail.com to appear out of thin air, so we’re not removing the email addresses in the breach, just their mapping to the domain in the stealer log. I’d already explained such a condition in Jan, where there might be an email address in the breach but no corresponding stealer log entry:
The gap is explained by a combination of email addresses that appeared against invalidly formed domains and, in some cases, addresses that only appeared with a password and not a domain. Criminals aren’t exactly renowned for dumping perfectly formed data sets we can seamlessly work with, and I hope folks that fall into that few percent gap understand this limitation.
FWIW, entries that matched this pattern accounted for 13.6% of all rows in the stealer log table, so this hasn’t made a great deal of difference in terms of outright volume.
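To make the purge rule concrete, here's a rough sketch of the logic in Python. It's not our actual pipeline: the function name and the (email, website domain) row shape are hypothetical, and it assumes "the domain the address itself was on" means an exact, case-insensitive match on everything after the @.

```python
from collections import defaultdict


def purge_self_domain_only(rows):
    """Drop rows where an email address maps to exactly one website
    domain and that domain is the email address's own domain.

    `rows` is an iterable of (email, website_domain) tuples. The email
    address itself isn't deleted from the breach anywhere else; this
    only filters the email-to-domain mappings.
    """
    rows = list(rows)  # we iterate twice, so materialise any generator

    # First pass: collect every distinct domain seen per email address.
    domains_by_email = defaultdict(set)
    for email, domain in rows:
        domains_by_email[email.lower()].add(domain.lower())

    # Second pass: keep a row unless its address only ever appears
    # against its own domain.
    kept = []
    for email, domain in rows:
        key = email.lower()
        own_domain = key.rsplit("@", 1)[-1]
        if domains_by_email[key] == {own_domain}:
            continue
        kept.append((email, domain))
    return kept


rows = [
    ("john@gmail.com", "gmail.com"),       # only his own domain: purged
    ("jane@example.com", "example.com"),   # kept, she has other domains too
    ("jane@example.com", "netflix.com"),
    ("sam@example.org", "spotify.com"),    # single entry, but not own domain
]

for row in purge_self_domain_only(rows):
    print(row)
```

John's single gmail.com row is dropped, whilst Jane keeps both of hers because her own domain isn't the only one she appears against.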
This takes away a great deal of confusion regarding the infection status of the address owner. As part of this revision, we’ve updated all the stealer log counts seen on domain search dashboards, so if you’re using that feature, you may see the number drop based on the purged data or increase based on the backfilled data. And we’re not sending out any additional notifications for backfilled data either; there’s a threshold at which comms becomes more noise than signal, and I’ve a strong suspicion that’s how it would be received if we started sending emails saying “hey, that stealer log breach from ages ago now has more data”.
And that’s it. We’ll keep backfilling data, and the entire corpus within HIBP is now cleaner and more succinct. And we’ll definitely clean up all the UX and website copy as part of our impending rebrand to make sure everything is a lot clearer in the future.
I’ll leave you with a bit of levity related to subscription costs and value. As I recently lamented, resellers can be a nightmare to deal with, and we’re seriously considering banning them altogether. But occasionally, they inadvertently share more than they should, and we get an insight into how the outside world views the service. Like a recent case where a reseller accidentally sent us the invoice they’d intended to send the customer who wanted to purchase from us, complete with a 131% price markup 😲 It was an annual Pwned 4 subscription that’s meant to be $1,370, and simply to buy this on that customer’s behalf and then hand them over to us, the reseller was charging $3,165. They can do that because we make the service dirt cheap. How do we know it’s dirt cheap? Because another reseller inadvertently sent us this internal communication today:

FWIW, we do have credit cards in Australia, and they work just the same as everywhere else. I still vehemently dislike resellers, but at least our customers are getting a good deal, especially when they buy direct 😊