In this article we will work hands on with empirical distributions and frequency analysis. First, we will compute univariate and bivariate distributions over a dataset. Then, we will see how to perform frequency analysis over text, implementing from scratch and then breaking a Caesar cipher.
All the work related to this homework can be found in the
repository, in the
homeworks/homework-02/ directory.
For the frequency analysis part, all the implementations can be tested directly at the end of this page with an interactive tool.
Distribution
For this part we will create a random dataset and compute univariate and bivariate distributions over its attributes.
Metodology and resources
All the related work can be found in the homeworks/homework-02/distribution
directory of the repository.
The dataset is generated randomly with the gen-dataset.py Python script. The
package faker must be installed, as it is a Python dependency of the script.
To install it, you will most likely want to use pip:
pip install faker
The faker package is used to generate fake names and surnames to make the
dataset a bit more realistic. Names and surnames are not statistically relevant
in this context.
The sqlite3 database is used to load the dataset and perform any operation on
it. Though the queries should work with any SQL database, the run.sh script
will use a sqlite3 file database. Using sqlite3 eliminates the need of a
DMBS server, making local testing a pleasant process.
About the dataset
The population units are represented by university students. Each student has the following attributes:
| Attribute | Type | Description |
|---|---|---|
name |
string | Name (not relevant) |
surname |
string | Surname (not relevant) |
fuori_sede |
boolean | Is the student coming from outside the town? |
avg_score |
float | The average score, approximated by 2 decimals |
academic_year |
integer | The academic year, between 1 and 5 |
A database table for the dataset can be created as follows:
CREATE TABLE Person (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name VARCHAR(100) NOT NULL,
surname VARCHAR(100) NOT NULL,
fuori_sede BOOLEAN NOT NULL,
avg_score DECIMAL(4,2) NOT NULL CHECK (avg_score >= 18 AND avg_score <= 30),
academic_year INTEGER NOT NULL CHECK (academic_year BETWEEN 1 AND 5)
);
Computing distributions
Univariate distribution on fuori_sede
The univariate distribution on fuori_sede has only 2 domain values and
basically tells us with which frequency a student is fuori sede or not.
Univariate distributions can be computed in a quite simple way with SQL. First,
we observe we can aggregate all the people in the dataset by an arbitrary
attribute, counting how many students have that attribute set to a specific
value. For the fuori_sede attribute:
SELECT fuori_sede, COUNT(*) AS students
FROM Person
GROUP BY fuori_sede;
To compute the univariate distribution, it is sufficient to divide each count for the total number of students:
SELECT fuori_sede,
CAST(COUNT(*) AS FLOAT) / SUM(COUNT(*)) OVER () AS frequency
FROM Person
GROUP BY fuori_sede;
Notice how extra care must be taken to perform float division instead of
integer division. SUM(COUNT(*)) OVER () computes the total number of rows in
the Person table.
The result of the query above should be something like this:
fuori_sede,frequency
0,0.42
1,0.58
Query results will be given in CSV, as they are readable and they can be immediately processed further by virtually any program (you can even open them directly with Excel or Calc).
Univariate distribution on avg_score
To compute the univariate distribution on avg_score, we define an interval of
width 1 to group students. In practice, the average scores will be
approximated to integer values:
SELECT CAST(avg_score AS INTEGER) AS avg_score,
CAST(COUNT(*) AS FLOAT) / SUM(COUNT(*)) OVER () AS frequency
FROM Person
GROUP BY CAST(avg_score AS INTEGER);
The output should be something similar to this:
avg_score,frequency
18,0.09
19,0.09
20,0.06
21,0.12
22,0.06
23,0.09
24,0.14
25,0.06
26,0.05
27,0.07
28,0.08
29,0.09
Univariate distribution on academic_year
As we already computed the univariate distribution on fuori_sede, doing it
for academic_year is straightforward:
SELECT academic_year,
CAST(COUNT(*) AS FLOAT) / SUM(COUNT(*)) OVER () AS frequency
FROM Person
GROUP BY academic_year;
The output should be something similar to this:
academic_year,frequency
1,0.19
2,0.26
3,0.2
4,0.21
5,0.14
Bivariate distribution on academic_year and avg_score
Computing bivariate distributions in SQL can be done in a clean way.
Using GROUP BY on two attributes performs a cartesian product on such
attributes, aggregating rows inwhich their values are the same.
So we can simply select the academic_year and avg_score and group the rows
by them:
SELECT academic_year, CAST(avg_score AS INTEGER) AS avg_score,
CAST(COUNT(*) AS FLOAT) / SUM(COUNT(*)) OVER () AS frequency
FROM Person
GROUP BY academic_year, CAST(avg_score AS INTEGER);
The output should be something similar to this:
academic_year,avg_score,frequency
1,18,0.01
1,19,0.01
1,20,0.01
1,21,0.05
1,22,0.02
1,23,0.01
1,24,0.03
1,25,0.01
1,27,0.02
1,28,0.01
1,29,0.01
2,18,0.06
2,19,0.03
[...]
Bonus: Bivariate distribution on academic_year and fuori_sede
Suppose we want a view on the point each student is in their academic path, taking into account if they are fuori sede or not for further analysis. We can compute the bivariate empirical distribution like in the previous case:
SELECT academic_year, fuori_sede,
CAST(COUNT(*) AS FLOAT) / SUM(COUNT(*)) OVER () AS frequency
FROM Person
GROUP BY academic_year, fuori_sede;
The output should be something similar to this:
academic_year,fuori_sede,frequency
1,0,0.07
1,1,0.12
2,0,0.13
2,1,0.13
3,0,0.09
3,1,0.11
4,0,0.09
4,1,0.12
5,0,0.04
5,1,0.1
Frequency analysis
In this part we will focus on frequency analysis. In particular, we will implements algorithms in JavaScript to:
- perform frequency analysis on the letters of an arbitrary string
- encrypt and decrypt a string with the Caesar cipher (using an arbitrary key)
- break the Caesar cipher using frequency analysis on an arbitrary ciphertext
For this part, an interactive tool can be found at the end of this page. It can be used to play with all the algorithms that are going to be implemented below.
Metodology and resources
All the related work can be found in the
homeworks/homework-02/frequency-analysis directory of the repository.
The training text snippets to compute default frequencies for various spoken languages will be generated with RandomTextGenerator, a neat online utility that outputs random text. This utility avoids using special characters even if they should be used in the chosen language, making the computed frequencies consistent when breaking the cipher and automatically guessing the language.
The sample texts present in the repository used for testing are chosen as follows:
- the English sample is taken from Night Train to Lisbon by Pascal Mercier, one of my favourite novels
- the Italian sample is taken from Il fu Mattia Pascal
- the German sample is taken from Critique of Pure Reason (Kritik der reinen Vernunft) from Immanuel Kant
The Cryptii Caesar cipher utility has been used to perform the first tests of the Caesar cipher implementation.
Generating the frequencies database and running the automated test suite in the
repository requires nodejs to be installed. No third-party libraries are
required.
Keep in mind that a pre-generated frequencies database is present in the repository, along with its training data. Also the interactive tool at the end of this page uses such database, taking it directly from the repository.
Simple frequency analysis
Frequency analysis for a text string consists in computing the frequency with which each letter appears. For example, we expect vowels to be more frequent than consonants in most of the languages. Different spoken languages shall have different letters distributions.
The process of programmatically performing frequency analysis for a string s
is quite simple. First, we eliminate every non-alphabetical character from s.
Then, we build a map (i.e. a dictionary) inwhich each key is represented by a
distinct character, and the corresponding value is the number of times that
character appears. Finally, all the values in the map are divided by the total
number of alphabetical character in s.
Below we give a working implementation in JavaScript:
export function computeFrequencies(s) {
// Initialize frequencies map
const frequencies = {};
for (const i of range(0, 26))
frequencies[String.fromCharCode(LOWER_A_CODE + i)] = 0;
// Eliminate non-alphabetical characters
s = [...s.toLowerCase()].filter(c => c.match(/[a-z]/))
// Count occurrencies for each letter
for (const c of s)
frequencies[c] += 1;
// Compute the actual frequencies by dividing for the total number of letters
for (const [c, count] of Object.entries(frequencies))
frequencies[c] = count / s.length;
return frequencies;
}
Analysing the result
Calling the given implementation on a rather long English text gives the following result:
{
"a": 0.07104967047098537,
"b": 0.014788619193055779,
"c": 0.028773509082141133,
"d": 0.05031345442854847,
"e": 0.13984889889085356,
"f": 0.022343674650377753,
"g": 0.017842790548143386,
"h": 0.033435139045169586,
"i": 0.07458607940845523,
"j": 0.002089696190323099,
"k": 0.004500884102234368,
"l": 0.0374537855650217,
"m": 0.02652306703102395,
"n": 0.0749075711300434,
"o": 0.06879922841986819,
"p": 0.028773509082141133,
"q": 0.0024111879119112683,
"r": 0.07024594116701495,
"s": 0.07265712907892621,
"t": 0.0781224883459251,
"u": 0.03279215560199325,
"v": 0.010287735090821412,
"w": 0.013663398167497186,
"x": 0.003214917215881691,
"y": 0.020253978460054653,
"z": 0.00032149172158816913
}
We can compare it with a set of frequencies taken from the Emory Oxford College Mathematical centre:
{
"a": 0.08167,
"b": 0.01492,
"c": 0.02782,
"d": 0.04253,
"e": 0.12702,
"f": 0.02228,
"g": 0.02015,
"h": 0.06094,
"i": 0.06966,
"j": 0.00153,
"k": 0.00772,
"l": 0.04025,
"m": 0.02406,
"n": 0.06749,
"o": 0.07507,
"p": 0.01929,
"q": 0.00095,
"r": 0.05987,
"s": 0.06327,
"t": 0.09056,
"u": 0.02758,
"v": 0.00978,
"w": 0.02360,
"x": 0.00150,
"y": 0.01974,
"z": 0.00074
}
We can observe that the frequencies are similar. In particular, it can be
observed that vowels have a much higher frequencies, and unusual letters such
as j and z have lower values.
Bonus: Automatic language detection
It is possible to use frequency analysis to automatically detect the language of an arbitrary text. This can be done roughly as follows:
- take the frequency analyses of all the different supported languages
- perform the frequency analysis on the chosen text
- compare the previous analysis with all the ones relative to the supported languages
The frequency analysis that is closest to the one of the chosen text is the one relative to the detected language.
An implementation of the described algorithm is given below:
export function detectLanguage(text, frequencies) {
const textFrequencies = computeFrequencies(text);
let guess = {
language: null,
distance: FREQ_MAX_GUESS_DISTANCE
};
// `frequencies` contains the frequency analyses for all the supported languages
for (const language in frequencies) {
const distance = computeDistance(
textFrequencies, frequencies[language]);
if (distance < guess.distance)
guess = { language, distance }; // Found a better guess
}
return guess;
}
We will discuss in depth how to measure the distance between two frequency analyses in later sections.
At the end of this article, you can try this algorithm directly with the Caesar cipher interactive tool.
Caesar cipher
The Caesar cipher is a simple, ancient, substitution cipher that was in use since the Romans. It basically consists in shifting each character, so the ciphertext becomes unreadable.
To give a more formal definition of the cipher, we quote the Encyclopedia Britannica:
Caesar cipher, simple substitution encryption technique in which each letter of the text to be encrypted is replaced by a letter a fixed number of positions away in the alphabet. For example, using a right letter shift of four, A would be replaced by E, and the word CIPHER would become GMTLIV.
The cipher can be implementated as follows:
- assign a numerical position to each letter (e.g.
ais0,cis2, …) - choose a numerical key
k(between0and25for the English alphabet) - to encrypt shift each letter forward by
kplaces (consider that letters may overflow, so technically letters must be rotated and not shifted) - to decrypt shift each letter backwards by
kplaces (the same overflow observation applies)
Below, a complete JavaScript implementation of the cipher is given:
const LOWER_A_CODE = 'a'.charCodeAt(0);
const LOWER_Z_CODE = 'z'.charCodeAt(0);
const UPPER_A_CODE = 'A'.charCodeAt(0);
const UPPER_Z_CODE = 'Z'.charCodeAt(0);
function encrypt(plaintext, key = 13) {
return [...plaintext].map(c => {
let code = c.charCodeAt(0);
if (code >= LOWER_A_CODE && code <= LOWER_Z_CODE)
code = LOWER_A_CODE + (code - LOWER_A_CODE + key) % 26;
else if (code >= UPPER_A_CODE && code <= UPPER_Z_CODE)
code = UPPER_A_CODE + (code - UPPER_A_CODE + key) % 26;
return String.fromCharCode(code);
}).join('');
}
function decrypt(ciphertext, key) {
return [...ciphertext].map(c => {
let code = c.charCodeAt(0);
if (code >= LOWER_A_CODE && code <= LOWER_Z_CODE)
code = LOWER_A_CODE + (code - LOWER_A_CODE + 26 - key) % 26;
else if (code >= UPPER_A_CODE && code <= UPPER_Z_CODE)
code = UPPER_A_CODE + (code - UPPER_A_CODE + 26 - key) % 26;
return String.fromCharCode(code);
}).join('');
}
We can use the automated test in the repository to give a try to the implementation above:
Testing encrypt() with key 13 (ROT13)
Plaintext is: The quick brown fox jumps over the lazy dog.
Computed ciphertext is: Gur dhvpx oebja sbk whzcf bire gur ynml qbt.
Expected ciphertext is: Gur dhvpx oebja sbk whzcf bire gur ynml qbt.
Ciphertext is correct
Testing decrypt() with key 13 (ROT13)
Ciphertext is: Gur dhvpx oebja sbk whzcf bire gur ynml qbt.
Computed plaintext is: The quick brown fox jumps over the lazy dog.
Expected plaintext is: The quick brown fox jumps over the lazy dog.
Plaintext is correct
At the end of this article, you can try these algorithms directly with the Caesar cipher interactive tool.
Breaking the Caesar cipher
The Caesar cipher can be broken using frequency analysis. The idea is that,
given a ciphertext c, we can perform a frequency analysis on c and compare
the result to the default frequencies of the spoken language for every possible
key k. Notice that this is not a brute-forcing technique, as it is sufficient
to shift the place of the frequencies instead of actually decrypting c for
every possible key.
There are a plethora of ways to compare frequency analyses. For this implementation, we will try to compute a cumulative distance between them. Iterating every letter, we will accumulate (i.e. sum every occurrence of) the absolute difference of the relative frequency. The final result will be considered as the distance.
A JavaScript implementation of an algorithm automatically breaking the cipher may look like the one given below:
function _crack(ciphertext, frequencies) {
const encFrequencies = computeFrequencies(ciphertext);
let guess = {
key: 0,
distance: FREQ_MAX_GUESS_DISTANCE
};
// Iterate over all possible keys
for (const key of range(0, 26)) {
const distance = computeDistance(frequencies, encFrequencies, key);
if (distance < guess.distance)
guess = { key, distance }; // Found a better guess
}
return guess;
}
// Compute the distance between two frequency analyses
// If key is specified, it will be used to shift f2 frequencies
function computeDistance(f1, f2, key = 0) {
let total = 0
for (const i of range(0, 26)) {
const c = String.fromCharCode('a'.charCodeAt(0) + i);
total += Math.abs(f1[c] - f2[encrypt(c, key)]);
}
return total;
}
The frequencies argument must be a pre-computed frequency analysis performed
on a (preferrably long) plaintext string of the spoken language of choice.
Automatic language guessing
The given implementation for the cracking algorithm can be extended to support multiple spoken languages and automatically guess the one used in a ciphertext.
The idea is to build an object containing the frequencies for multiple languages structured like so:
{
"english": {
"a": 0.07104967047098537,
"b": 0.014788619193055779,
"c": 0.028773509082141133,
"d": 0.05031345442854847,
"e": 0.13984889889085356,
"f": 0.022343674650377753,
"g": 0.017842790548143386,
"h": 0.033435139045169586,
"i": 0.07458607940845523,
"j": 0.002089696190323099,
"k": 0.004500884102234368,
"l": 0.0374537855650217,
"m": 0.02652306703102395,
"n": 0.0749075711300434,
"o": 0.06879922841986819,
"p": 0.028773509082141133,
"q": 0.0024111879119112683,
"r": 0.07024594116701495,
"s": 0.07265712907892621,
"t": 0.0781224883459251,
"u": 0.03279215560199325,
"v": 0.010287735090821412,
"w": 0.013663398167497186,
"x": 0.003214917215881691,
"y": 0.020253978460054653,
"z": 0.00032149172158816913
},
"italian": {
"a": 0.12013644600788828,
"b": 0.0118324272465622,
"c": 0.041679991472124506,
"d": 0.032832320648118536,
"e": 0.11310094872614859,
"f": 0.012365419464875812,
"g": 0.016309561880396548,
"h": 0.013751199232491206,
"i": 0.1107557829655687,
"j": 0,
"k": 0.00010659844366272253,
"l": 0.045304338556657074,
"m": 0.0312333439931777,
"n": 0.05926873467647372,
"o": 0.09433962264150944,
"p": 0.027715595352307856,
"q": 0.0027715595352307857,
"r": 0.0719539494723377,
"s": 0.059162136232811004,
"t": 0.06779661016949153,
"u": 0.030380556443875918,
"v": 0.028994776676260527,
"w": 0,
"x": 0.0011725828802899478,
"y": 0,
"z": 0.007035497281739686
}
}
Then we call the cracking algorithm multiple times, using a different language each iteration. The outcome having the minimum frequency distance will be our best guess.
With the cracking algorithm being already implemented, this can be done quite easily in JavaScript:
export function crack(ciphertext, frequencies, language) {
// Give the possibility to explicitly use a specific language
if (!!language)
return _crack(ciphertext, frequencies[language]);
let guess = { distance: CRACK_MAX_GUESS_DISTANCE };
for (const l in frequencies) {
const attempt = _crack(ciphertext, frequencies[l]);
if (attempt.distance < guess.distance)
guess = { language: l, ...attempt };
}
return guess;
}
Testing the implementation
Again, we use the automated test in the repository to try if the code above works:
Testing crack() with key 13 (ROT13)
Frequencies database './frequencies.json' exists
Ciphertext is: Tertbevhf ybbxrq nurnq ng gur cbvagrq gbjref bs gur Uvfgbevpny Zhfrhz bs gur
Pvgl bs Orea, hc gb gur Thegra naq qbja gb gur Nner jvgu vgf tynpvre terra
jngre. N thfgl jvaq qebir ybj-ylvat pybhqf bire uvz, ghearq uvf hzoeryyn vafvqr
bhg naq juvccrq gur enva va uvf snpr. Abj ur abgvprq gur jbzna va gur zvqqyr bs
gur oevqtr. Fur unq yrnarq ure ryobjf ba gur envyvat naq jnf ernqvat va gur
cbhevat enva jung ybbxrq yvxr n yrggre.
Result of the cracking process is: { language: 'english', key: 13, distance: 0.2366831509970971 }
Computed plaintext is: Gregorius looked ahead at the pointed towers of the Historical Museum of the
City of Bern, up to the Gurten and down to the Aare with its glacier green
water. A gusty wind drove low-lying clouds over him, turned his umbrella inside
out and whipped the rain in his face. Now he noticed the woman in the middle of
the bridge. She had leaned her elbows on the railing and was reading in the
pouring rain what looked like a letter.
Expected plaintext is: Gregorius looked ahead at the pointed towers of the Historical Museum of the
City of Bern, up to the Gurten and down to the Aare with its glacier green
water. A gusty wind drove low-lying clouds over him, turned his umbrella inside
out and whipped the rain in his face. Now he noticed the woman in the middle of
the bridge. She had leaned her elbows on the railing and was reading in the
pouring rain what looked like a letter.
Plaintext is correct
Also, at the end of this article, you can try this algorithm directly with the Caesar cipher interactive tool.
Bonus: Building a frequencies database
To make an efficient implementation of a multi-language Caesar cipher automatic cracker, we will compile a database containing all the pre-computed letters frequencies for various languages. Using a script (in JavaScript) we will save the database as a JSON file, so it will be available at any time as a local file or remote resource through the network.
Having all the algorithms already implemented, we can build and save the
database easily using the nodejs filesystem API:
const DEFAULT_OUTPUT = './frequencies.json';
const DEFAULT_TRAINING_DIR = './training';
export async function buildFrequenciesDb(output = DEFAULT_OUTPUT, trainingDir = DEFAULT_TRAINING_DIR) {
console.group("Building letters frequencies database");
const frequencies = {};
for (const file of await fs.readdir('./training/')) {
const language = file.replace(/\.txt$/, '');
console.log(`Processing training text for language '${language}'`);
const text = await fs.readFile(`${trainingDir}/${language}.txt`);
frequencies[language] = computeFrequencies(text.toString());
}
console.log(`Writing frequencies database to '${output}'`);
await fs.writeFile(output, JSON.stringify(frequencies));
console.groupEnd();
}
if (import.meta.main) {
buildFrequenciesDb();
}
The complete code can be found in the build-frequencies-db.js
file
of the repository.
The test suite actually given in the repository and the interactive tool at the end of this page use the pre-computed frequencies database in order to be efficient.
Bonus: Interactive Caesar cipher toolkit
Here you can try all the algorithms described above interactively. Just enter some plaintext or ciphertext and choose an operation to perform.
Of course, the specified key is ignored when cracking.
Conclusions
In this article we performed some basic distribution computations on a randomly generated dataset.
Then, we implemented the Caesar cipher from scratch and broke it using statistics.