Despite best intentions, lots of keys do get leaked (a median of 1,793 unique keys every day).
Using the search approach outlined in the paper, the median time to discovery for a key leaked to GitHub is 20 seconds, with times ranging from half a second to over 4 minutes. In other words, in the time it takes you to go “oh s*&#! Did I just…?” and take a look, it’s probably already too late.
Many leaked secrets remain in GitHub repos for a long time (81% remain after 16 days).
We also found AWS credentials for the website of a major government agency in a Western European country…
The regular expressions can then be used to scan the candidate files from the first phase, with any matches considered “candidate secrets”. These candidate secrets are then passed through a set of filters designed to reduce false positives.
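As a rough illustration of this scanning step, here is a minimal sketch in Python. The single pattern shown is the commonly documented shape of an AWS access key ID (`AKIA` followed by 16 uppercase letters or digits); it is used here as an assumption for illustration, and the names `CANDIDATE_PATTERNS` and `find_candidate_secrets` are hypothetical, not taken from the paper.

```python
import re

# Assumption: one illustrative pattern, the commonly documented shape of an
# AWS access key ID. A real scanner would hold one pattern per key type.
CANDIDATE_PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
}


def find_candidate_secrets(text):
    """Return (pattern_name, matched_string) pairs for every regex match in text."""
    candidates = []
    for name, pattern in CANDIDATE_PATTERNS.items():
        for match in pattern.finditer(text):
            candidates.append((name, match.group(0)))
    return candidates


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
    sample = 'aws_key = "AKIAIOSFODNN7EXAMPLE"'
    print(find_candidate_secrets(sample))
```

Everything this returns is only a candidate: a match says the string has the right shape, not that it is a live credential.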
Three validity filters were used to remove false positives:
An entropy filter, which catches secrets with very low entropy.
A words filter, which catches secrets containing common dictionary words of length at least 5.
A pattern filter looking for repeated characters (e.g. ‘AAAA’), ascending characters (‘ABCD’) and descending characters (‘DCBA’).
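The three filters can be sketched roughly as follows. This is not the paper's implementation: the entropy threshold, the tiny word list, and the run length of 4 are all hypothetical values chosen for illustration, since the text above does not specify them.

```python
import math
from collections import Counter

# Hypothetical settings: the text gives no threshold, word list, or run length.
ENTROPY_THRESHOLD = 3.0  # bits per character
COMMON_WORDS = {"example", "password", "secret", "sample"}
MIN_WORD_LEN = 5
RUN_LEN = 4


def shannon_entropy(s):
    """Shannon entropy of s in bits per character (0 for an empty string)."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def fails_entropy_filter(s):
    """True if the candidate's entropy is below the (assumed) threshold."""
    return shannon_entropy(s) < ENTROPY_THRESHOLD


def fails_words_filter(s):
    """True if the candidate contains a listed word of at least MIN_WORD_LEN letters."""
    lowered = s.lower()
    return any(len(w) >= MIN_WORD_LEN and w in lowered for w in COMMON_WORDS)


def fails_pattern_filter(s):
    """True if any RUN_LEN-character window is repeated, ascending, or descending."""
    for i in range(len(s) - RUN_LEN + 1):
        window = s[i:i + RUN_LEN]
        # Differences between consecutive character codes in the window.
        steps = {ord(b) - ord(a) for a, b in zip(window, window[1:])}
        if steps in ({0}, {1}, {-1}):
            return True
    return False


def is_probably_valid(s):
    """A candidate survives only if it passes all three filters."""
    return not (
        fails_entropy_filter(s) or fails_words_filter(s) or fails_pattern_filter(s)
    )
```

With these settings, a documentation placeholder such as `AKIAIOSFODNN7EXAMPLE` is rejected by the words filter because it contains "example", which is the kind of false positive the filters are meant to remove.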