---
title: "RockYou 2021 - Password leak"
output: html_notebook
---
# Background
To check whether your password has been leaked, see the [password leak check](https://cybernews.com/password-leak-check/) at cybernews.com.
For an analysis of the file, see [rockyou2021.txt - A Short Summary](https://chris.partridge.tech/2021/rockyou2021.txt-a-short-summary/).
This is what Mr. Partridge concludes in his analysis:
> kys234 explicitly removed any non-ASCII characters, and limited password length to 20 characters.
> This makes for a very clean list, in stark contrast to other password breaches or dictionaries which are often very messy or unformatted,
> and can take time to clean before being usable.
Note: Many other password leaks do contain passwords longer than 20 characters.
# Analysis
## Password lengths in the leaked file
Analyze the file to see how many passwords it contains and how long they are.
Run this one-liner in Perl:
```{bash}
perl -ne 'chomp; ++$cnt; ++$lens{length($_)}; END {printf("Count: %d\n", $cnt); printf("Len %d: %d\n", $_, $lens{$_}) for sort {$a <=> $b} keys %lens;}' rockyou2021.txt
```
On my Linux machine this took well over 40 minutes, but it eventually printed:
```{text}
Count: 8459060239
Len 6: 484236159
Len 7: 402518961
Len 8: 1107084124
Len 9: 1315444128
Len 10: 1314988168
Len 11: 1071452326
Len 12: 835365123
Len 13: 613654280
Len 14: 436652069
Len 15: 317146874
Len 16: 215720888
Len 17: 131328063
Len 18: 97950285
Len 19: 65235844
Len 20: 50282947
```
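The tally performed by the Perl command can also be sketched in R. This is only an illustration under assumptions: the helper `count_lengths` and its `chunk_size` argument are names made up here, the demo runs on a tiny temporary file rather than on `rockyou2021.txt`, and it has not been timed against the full file.

```{r}
# Hypothetical helper: tally line lengths chunk by chunk,
# so the whole file never has to fit in memory.
count_lengths <- function(path, chunk_size = 1e6)
{
  con <- file(path, open = "r")
  on.exit(close(con))
  counts <- numeric(0)   # counts[n] holds the number of lines of length n
  pad <- function(v, n) c(v, numeric(n - length(v)))
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0) break
    chunk <- tabulate(nchar(lines, type = "bytes"))
    n <- max(length(counts), length(chunk))
    counts <- pad(counts, n) + pad(chunk, n)
  }
  counts
}

# Demo on a small temporary file (made-up sample lines, not taken from the leak)
demo_file <- tempfile()
writeLines(c("123456", "password", "letmein", "qwerty", "iloveyou"), demo_file)
demo_counts <- count_lengths(demo_file, chunk_size = 2)
demo_counts
```

Doubles are used for the counters on purpose: R integers overflow above roughly 2.1 billion, and the total line count of the real file is larger than that.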
## Plotting the output with R
Plot the password counts for lengths 6 to 20 with R. The x-axis runs from 6 to 30 so that all plots in this notebook share the same scale:
```{r}
y <- c(484236159, 402518961, 1107084124, 1315444128, 1314988168, 1071452326, 835365123, 613654280, 436652069, 317146874,
215720888, 131328063, 97950285, 65235844, 50282947)
x <- (6:20)
tab <- data.frame(x, y)
plot(y~x, data=tab, xlim=c(6, 30), xlab="password length", ylab="count passwords")
```
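As a quick sanity check on the numbers being plotted, the per-length counts should add up to the total reported by the Perl command. The sketch below repeats the counts under names of its own (`len_counts`, `total`, `share`), which are chosen here so that it does not touch the notebook's `x`, `y` and `tab`.

```{r}
# Counts per password length, copied from the Perl output above
len_counts <- c("6" = 484236159, "7" = 402518961, "8" = 1107084124,
                "9" = 1315444128, "10" = 1314988168, "11" = 1071452326,
                "12" = 835365123, "13" = 613654280, "14" = 436652069,
                "15" = 317146874, "16" = 215720888, "17" = 131328063,
                "18" = 97950285, "19" = 65235844, "20" = 50282947)
total <- 8459060239   # the "Count:" line of the Perl output

# The lengths 6 to 20 should account for every line in the file
stopifnot(sum(len_counts) == total)

# Share of each length in percent
share <- round(100 * len_counts / total, 2)
share
```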
## Creating a model with R
The shape of the plot looks [Gaussian](https://en.wikipedia.org/wiki/Gaussian_function). Let's try to model it.
Define some helper functions: the Gaussian itself, an objective function for the optimizer that returns the sum of squared differences, and a wrapper for plotting:
```{r}
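# Gaussian bell curve: amp is the peak height, mu the center, sigma the width
gaussf <- function(x, amp, mu, sigma)
{
  amp * exp(-0.5 * (x - mu)^2 / sigma^2)
}
# Objective for optim(): sum of squared differences between data and model
optimf <- function(par)
{
  amp <- par[1]
  mu <- par[2]
  sigma <- par[3]
  rhat <- gaussf(tab$x, amp, mu, sigma)
  out <- sum((tab$y - rhat)^2)
  # Trace every evaluation to follow the optimizer's progress
  cat(sprintf("in: amp: %f, mu: %f, sigma: %f\n", amp, mu, sigma))
  cat(sprintf("out: %f\n", out))
  out
}
# Evaluate the Gaussian with the parameters stored in the global gauss_param
plot_f <- function(x)
{
  gaussf(x, gauss_param[1], gauss_param[2], gauss_param[3])
}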
```
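Written out, the model and the quantity being minimized look as follows. The symbols $a$, $b$ and $c$ are chosen here for illustration: they stand for the first, second and third parameter that `gaussf` receives after `x`, that is the peak height, the center and the width of the bell curve, and $(x_i, y_i)$ are the rows of `tab`.

$$
f(x) = a \exp\left(-\frac{(x - b)^2}{2c^2}\right)
\qquad
S(a, b, c) = \sum_i \bigl(y_i - f(x_i)\bigr)^2
$$

`optimf` returns $S$ for a given parameter vector, and the optimizer searches for the $a$, $b$ and $c$ that make it as small as possible.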
Guesstimate a starting point and a parameter scale. This is very tricky: getting it wrong results in completely useless plots. The values below were obtained by trial and error:
```{r}
p.init <- c(10000000, 5, 1)
p.scaling <- c(1e-1, 1e-9, 1e-9)
```
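A data-driven alternative to trial and error is to derive a starting point from the counts themselves. The sketch below is an assumption, not what this notebook uses: it takes the largest count as peak height and the count-weighted mean and standard deviation of the length as center and width. The names `len`, `cnt` and `p.moments` are made up here, and whether `optim` converges from this point with the scaling above has not been checked.

```{r}
len <- 6:20
cnt <- c(484236159, 402518961, 1107084124, 1315444128, 1314988168,
         1071452326, 835365123, 613654280, 436652069, 317146874,
         215720888, 131328063, 97950285, 65235844, 50282947)

amp0 <- max(cnt)                                          # peak height
center0 <- sum(len * cnt) / sum(cnt)                      # weighted mean length
width0 <- sqrt(sum(cnt * (len - center0)^2) / sum(cnt))   # weighted std. deviation

# Same order as p.init: height, center, width
p.moments <- c(amp0, center0, width0)
p.moments
```

Because the counts are skewed toward longer passwords and cut off at 20 characters, these moment estimates are only a rough guide to where the fitted curve will end up.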
Fit the bell curve by minimizing the squared error:
```{r}
(res <- optim(p.init, optimf, method="BFGS", control=list(reltol=1e-8, parscale = p.scaling)))
```
Now that we have parameters for the Gaussian function, plot it:
```{r}
plot(y~x, data=tab, xlim=c(6, 30), xlab="password length", ylab="count passwords")
gauss_param <- res$par
plot(plot_f, col=2, add=TRUE, xlim=range(tab$x))
```
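To put numbers on how well the curve matches, the relative error per length can be tabulated. This sketch is self-contained and uses the first-fit parameter values that this notebook hard-codes further down for the red curve; the names `obs_len`, `obs_cnt`, `fit1`, `pred` and `fit_check` are chosen here for illustration.

```{r}
obs_len <- 6:20
obs_cnt <- c(484236159, 402518961, 1107084124, 1315444128, 1314988168,
             1071452326, 835365123, 613654280, 436652069, 317146874,
             215720888, 131328063, 97950285, 65235844, 50282947)

# Peak height, center and width of the first fit (values quoted in this notebook)
fit1 <- c(1.260397e+09, 9.931107, 2.681267)
pred <- fit1[1] * exp(-0.5 * (obs_len - fit1[2])^2 / fit1[3]^2)

# Positive values mean the model overestimates the observed count
fit_check <- data.frame(length = obs_len,
                        observed = obs_cnt,
                        predicted = round(pred),
                        rel_err_pct = round(100 * (pred - obs_cnt) / obs_cnt, 1))
fit_check
```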
## Improving the R-model
The count for 7-character passwords does not fit this model well.
Skip lengths 6 and 7 and refit from length 8 onward to improve accuracy:
```{r}
tab <- data.frame(x=tail(x, -2), y=tail(y, -2))
(res <- optim(p.init, optimf, method="BFGS", control=list(reltol=1e-8, parscale = p.scaling)))
```
Plot the new model (green) together with the old one (red):
```{r}
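orig_tab <- data.frame(x, y)
plot(y~x, data=orig_tab, xlim=c(6, 30), xlab="password length", ylab="count passwords")
# New model: Gaussian with the parameters of the refit on lengths 8 to 20
plot_res_f <- function(x)
{
  gaussf(x, res$par[1], res$par[2], res$par[3])
}
plot(plot_res_f, col="green", add=TRUE, xlim=c(6, 30))
# Old model: parameters of the first fit on lengths 6 to 20
gauss_param <- c(1.260397e+09, 9.931107, 2.681267)
plot(plot_f, col=2, add=TRUE, xlim=range(orig_tab$x))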
```
From password length 9 to 17 the fit looks pretty solid. At 18 to 20 characters, however,
the model is still off: it is closer than with the previous parameters,
but it does not match the observed counts.
Now that there is a plausible model, print its predictions for lengths 8 to 30, which extrapolates the password lengths 21 to 30:
```{r}
cat(sprintf("Len %2d: %.0f\n", 8:30, gaussf(8:30, res$par[1], res$par[2], res$par[3])), sep="")
```
The table shows that password length has a drastic impact on frequency: past the peak, the predicted count drops sharply with every additional character.
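The steepness of that drop follows from the Gaussian form itself. As a sketch, write $f$ for the fitted curve, $b$ for its center and $c$ for its width (symbols chosen here, they do not appear in the code). The ratio of the predicted counts for two consecutive lengths is then

$$
\frac{f(x+1)}{f(x)}
= \exp\left(-\frac{(x + 1 - b)^2 - (x - b)^2}{2c^2}\right)
= \exp\left(-\frac{2(x - b) + 1}{2c^2}\right)
$$

The exponent grows linearly with $x$, so each further character shrinks the predicted count by a larger factor than the one before. This holds only as far as the model does: the file contains no passwords longer than 20 characters, so the values for 21 to 30 cannot be checked against it.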