This notebook illustrates the use of a vector feature representation of data samples in order to calculate a measure of similarity, which could be used for classification or clustering of text samples.
I define the letter frequency of a text string as a vector with 26 components, one per letter of the alphabet, where each component is the frequency of that letter given as a decimal fraction of the total number of letters in the text.
A measure of 'distance' between two vectors is defined using Pythagoras' theorem (that is, the Euclidean distance).
The method is tested by seeing how reliably one can determine which book a random sentence has been taken from, just by looking at the letter frequency of a few sentences taken from the book.
A method similar to this could potentially be used as a means of recognising the author or genre of a text. Using letter frequency alone is probably not very accurate. However, one could use a vector approach based on other features, such as the frequency of common words, the lengths of words or sentences, the punctuation symbols used, etc. Using such features could give more accurate results.
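The two definitions above can be written compactly in symbols. The notation is my own shorthand rather than anything fixed by the code: n_i(s) is the number of times the i-th letter of the alphabet occurs in the text s, f_i(s) is the i-th component of the letter frequency vector, and d(u, v) is the distance between two such vectors u and v.

```latex
f_i(s) = \frac{n_i(s)}{\sum_{j=1}^{26} n_j(s)}, \qquad i = 1, \dots, 26

d(u, v) = \sqrt{\sum_{i=1}^{26} (u_i - v_i)^2}
```

Since the components of each frequency vector are non-negative and sum to 1, the distance between two such vectors can never exceed the square root of 2, a value reached only when each text uses a single letter and the two letters differ.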
import math

alphabet = "abcdefghijklmnopqrstuvwxyz"

def get_string_letter_counts( string ):
    # Count occurrences of each letter a-z, ignoring case;
    # any character that is not one of these letters is skipped
    counts = {}
    for letter in alphabet:
        counts[letter] = 0
    for letter in string:
        letter = letter.lower()
        if letter in counts:
            counts[letter] += 1
    return counts

def get_file_letter_counts( file_name ):
    # Read the whole file and count its letters
    with open( file_name ) as f:
        contents = f.read()
    return get_string_letter_counts(contents)

#get_string_letter_counts( "This is a string")
get_file_letter_counts("../data/books/Gadsby.txt")
{'a': 23242, 'b': 4549, 'c': 5646, 'd': 8741, 'e': 0, 'f': 4568, 'g': 7652, 'h': 10414, 'i': 18730, 'j': 487, 'k': 2499, 'l': 11288, 'm': 4406, 'n': 18251, 'o': 22120, 'p': 4039, 'q': 109, 'r': 10101, 's': 14789, 't': 18034, 'u': 8825, 'v': 686, 'w': 5937, 'x': 168, 'y': 6741, 'z': 228}
def sort_keys_by_value( dictionary ):
    # Keys ordered from smallest to largest value
    return sorted( dictionary.keys(),
                   key = lambda x: dictionary[x])

def display_file_letter_counts( filename ):
    # Print a text bar chart: one star per 0.2% of all letters in the file
    percent_per_star = 0.2
    counts = get_file_letter_counts( filename)
    total = sum( counts.values() )
    for letter in sort_keys_by_value(counts):
        count = counts[letter]
        percent = 100*(count/total)
        num_stars = int( percent / percent_per_star)
        percent_str = "({}%)".format(round(percent,2))
        print( letter, "*" * num_stars, count, percent_str )

display_file_letter_counts("../data/books/Gadsby.txt")
e 0 (0.0%)
q 109 (0.05%)
x 168 (0.08%)
z 228 (0.11%)
j * 487 (0.23%)
v * 686 (0.32%)
k ***** 2499 (1.18%)
p ********* 4039 (1.9%)
m ********** 4406 (2.08%)
b ********** 4549 (2.14%)
f ********** 4568 (2.15%)
c ************* 5646 (2.66%)
w ************* 5937 (2.8%)
y *************** 6741 (3.18%)
g ****************** 7652 (3.61%)
d ******************** 8741 (4.12%)
u ******************** 8825 (4.16%)
r *********************** 10101 (4.76%)
h ************************ 10414 (4.91%)
l ************************** 11288 (5.32%)
s ********************************** 14789 (6.97%)
t ****************************************** 18034 (8.5%)
n ****************************************** 18251 (8.6%)
i ******************************************** 18730 (8.82%)
o **************************************************** 22120 (10.42%)
a ****************************************************** 23242 (10.95%)
def letter_frequency_vector( string ):
    # 26 components: each letter's share of all letters in the string
    counts = get_string_letter_counts( string )
    total = sum( counts.values() )
    return [ counts[letter]/total for letter in alphabet ]

def file_letter_frequency_vector( file_name ):
    # Same as above, but for the contents of a file
    counts = get_file_letter_counts( file_name )
    total = sum( counts.values() )
    return [ counts[letter]/total for letter in alphabet ]

def vector_distance( V1, V2 ):
    # Euclidean (Pythagorean) distance between two vectors of equal length
    diffs = [ x-y for (x,y) in zip(V1,V2)]
    sum_of_squares = sum( [ x**2 for x in diffs ] )
    return math.sqrt(sum_of_squares)
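To make the distance concrete, here is a small worked example that can be checked by hand. It is a self-contained sketch: `frequency_vector` and `euclidean_distance` are compact stand-ins written for this illustration (they are not the notebook's own functions, although they are intended to compute the same quantities), and the zero-vector fallback for texts without letters is an added assumption.

```python
import math
import string
from collections import Counter

ALPHABET = string.ascii_lowercase

def frequency_vector(text):
    """Share of each letter a..z among all letters in `text` (case-insensitive)."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values())
    if total == 0:  # assumption: a text with no letters maps to the zero vector
        return [0.0] * len(ALPHABET)
    return [counts[ch] / total for ch in ALPHABET]

def euclidean_distance(v1, v2):
    """Square root of the sum of squared component differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(v1, v2)))

v_aab = frequency_vector("aab")  # a = 2/3, b = 1/3, all other letters 0
v_abb = frequency_vector("abb")  # a = 1/3, b = 2/3, all other letters 0

# The vectors differ by 1/3 in 'a' and by 1/3 in 'b',
# so the distance is sqrt(1/9 + 1/9) = sqrt(2)/3.
d = euclidean_distance(v_aab, v_abb)
print(round(d, 4))  # 0.4714
```

The distance from a vector to itself is 0 and the measure is symmetric, which is why the distance tables further down have zeros on the diagonal and mirror-image entries on either side of it.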
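def matrix_of_binary_function( function, inputs ):
    # Square matrix whose (i,j) entry is function(inputs[i], inputs[j])
    return [ [ function(i1, i2) for i2 in inputs]
             for i1 in inputs
           ]

def table_of_binary_function( f_name, function, inputs ):
    # Same matrix as a list of lists, with a header row and a label
    # at the start of each row; f_name goes in the top-left corner
    matrix = matrix_of_binary_function( function, inputs )
    table = [ [f_name] + inputs ] # header row
    for i in range(len(inputs)):
        table.append( [inputs[i]] + matrix[i] )
    return table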
BOOK_LIST = [ "Romeo+Juliet",
              "Midsummer-Nights-Dream",
              "Wuthering-Heights",
              "Jane-Eyre",
              # "The-Professor",
              "Dracula",
              "Jewel-of-Seven-Stars",
              "Gadsby"
            ]
## Calculate the Letter Frequency Vectors of all the books
## and store in a dictionary (key is book name)
LFVs = {}
for b in BOOK_LIST:
    print(b)
    LFVs[b] = file_letter_frequency_vector( "../data/books/" + b + ".txt")

def book_distance(b1,b2):
    return vector_distance( LFVs[b1], LFVs[b2] )

def rounded_book_distance(b1,b2):
    return round( vector_distance( LFVs[b1], LFVs[b2] ), 4 )

DIST_TABLE = table_of_binary_function( "LFV distance",
                                       rounded_book_distance,
                                       BOOK_LIST )
Romeo+Juliet
Midsummer-Nights-Dream
Wuthering-Heights
Jane-Eyre
Dracula
Jewel-of-Seven-Stars
Gadsby
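The test mentioned in the introduction, guessing which book a sentence came from, can be sketched as a nearest-profile lookup: compute the letter frequency vector of the sample and pick the book whose vector is closest. The code below is only an illustration under that assumption, not the procedure used for any results in this notebook. The two short "book" texts and the labels `with_e` and `without_e` are hypothetical stand-ins (the second avoids the letter e, in the spirit of Gadsby), and all function names are my own.

```python
import math
import string
from collections import Counter

ALPHABET = string.ascii_lowercase

def frequency_vector(text):
    """Share of each letter a..z among all letters in `text` (case-insensitive)."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values())
    if total == 0:  # assumption: a text with no letters maps to the zero vector
        return [0.0] * len(ALPHABET)
    return [counts[ch] / total for ch in ALPHABET]

def euclidean_distance(v1, v2):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(v1, v2)))

def nearest_profile(sample, profiles):
    """Return the label whose profile vector is closest to the sample's vector."""
    sample_vector = frequency_vector(sample)
    return min(profiles,
               key=lambda label: euclidean_distance(sample_vector, profiles[label]))

# Hypothetical stand-ins for whole books; real profiles would be built from files.
profiles = {
    "with_e": frequency_vector("the evening breeze"),
    "without_e": frequency_vector("a bold tall boy"),
}

guess = nearest_profile("see the tree", profiles)
print(guess)  # with_e
```

With real books the profiles lie much closer together than in this toy case (most pairwise distances in the table below are around 0.01 to 0.02), so a single short sentence is likely to be a noisy sample; pooling several sentences before computing the vector should give a steadier estimate.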
Using pandas, we can easily convert the table created above, as a list of lists, into a DataFrame, which we can easily display.
Notice that in the resulting DataFrame the row and column names that were in the list of lists representation are not treated as index labels but as data within the DataFrame. We could reset the column names and index labels to get a nicer table. Also, the name of the function now appears in the (0,0) position of the DataFrame and there is no obvious way to attach this to the DataFrame object without it being considered as data. Hence, even going from the list of lists format to a DataFrame presents some small problems in preserving the content of the stored information.
from pandas import DataFrame
df = DataFrame(DIST_TABLE)
display(df)
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | LFV distance | Romeo+Juliet | Midsummer-Nights-Dream | Wuthering-Heights | Jane-Eyre | Dracula | Jewel-of-Seven-Stars | Gadsby |
| 1 | Romeo+Juliet | 0 | 0.0089 | 0.0234 | 0.0207 | 0.0195 | 0.0205 | 0.1346 |
| 2 | Midsummer-Nights-Dream | 0.0089 | 0 | 0.0197 | 0.0168 | 0.0193 | 0.0188 | 0.1397 |
| 3 | Wuthering-Heights | 0.0234 | 0.0197 | 0 | 0.0096 | 0.0145 | 0.0149 | 0.1416 |
| 4 | Jane-Eyre | 0.0207 | 0.0168 | 0.0096 | 0 | 0.0152 | 0.0155 | 0.1403 |
| 5 | Dracula | 0.0195 | 0.0193 | 0.0145 | 0.0152 | 0 | 0.0096 | 0.1378 |
| 6 | Jewel-of-Seven-Stars | 0.0205 | 0.0188 | 0.0149 | 0.0155 | 0.0096 | 0 | 0.1427 |
| 7 | Gadsby | 0.1346 | 0.1397 | 0.1416 | 0.1403 | 0.1378 | 0.1427 | 0 |
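One possible way to get the nicer table is to split the list of lists into its parts before building the DataFrame: the first row supplies the column labels, the first entry of each remaining row supplies the index labels, and the rest is data. The sketch below uses a cut-down stand-in for DIST_TABLE (two books only, with the values shown above) so that it runs on its own. Storing the function name as the name of the column index is just one option and may not suit every use.

```python
from pandas import DataFrame

# Cut-down stand-in for DIST_TABLE, in the same list-of-lists layout
table = [
    ["LFV distance", "Romeo+Juliet", "Gadsby"],
    ["Romeo+Juliet", 0.0, 0.1346],
    ["Gadsby", 0.1346, 0.0],
]

header, *rows = table

df = DataFrame(
    [row[1:] for row in rows],       # numeric data only
    index=[row[0] for row in rows],  # row labels taken from the first column
    columns=header[1:],              # column labels taken from the header row
)

# Keep the function name as metadata on the column index rather than as a cell
df.columns.name = header[0]

print(df)
```

With the labels in place, a single distance can be looked up by name, for example `df.loc["Romeo+Juliet", "Gadsby"]`.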
Another way to display a table nicely is by encoding it into HTML format.
This does have some advantages, in that we have more control over the format of the table. But it does require somewhat complex encoding. Hence, I have created a special function display_datalist_as_html_table in my own module myhtml, which I import in order to produce an HTML version of the table.
## Could use my function to create a nicer HTML table
import myhtml
myhtml.display_datalist_as_html_table( DIST_TABLE )
| LFV distance | Romeo+Juliet | Midsummer-Nights-Dream | Wuthering-Heights | Jane-Eyre | Dracula | Jewel-of-Seven-Stars | Gadsby |
|---|---|---|---|---|---|---|---|
| Romeo+Juliet | 0.0 | 0.0089 | 0.0234 | 0.0207 | 0.0195 | 0.0205 | 0.1346 |
| Midsummer-Nights-Dream | 0.0089 | 0.0 | 0.0197 | 0.0168 | 0.0193 | 0.0188 | 0.1397 |
| Wuthering-Heights | 0.0234 | 0.0197 | 0.0 | 0.0096 | 0.0145 | 0.0149 | 0.1416 |
| Jane-Eyre | 0.0207 | 0.0168 | 0.0096 | 0.0 | 0.0152 | 0.0155 | 0.1403 |
| Dracula | 0.0195 | 0.0193 | 0.0145 | 0.0152 | 0.0 | 0.0096 | 0.1378 |
| Jewel-of-Seven-Stars | 0.0205 | 0.0188 | 0.0149 | 0.0155 | 0.0096 | 0.0 | 0.1427 |
| Gadsby | 0.1346 | 0.1397 | 0.1416 | 0.1403 | 0.1378 | 0.1427 | 0.0 |
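For readers without the myhtml module, the kind of encoding involved might look roughly like the sketch below. It is not the implementation of display_datalist_as_html_table, whose code is not shown here; `datalist_to_html_table` is a hypothetical helper that uses only the standard library, treats the first row and the first column as headers, and escapes every cell.

```python
from html import escape

def datalist_to_html_table(rows):
    """Hypothetical helper: encode a list of lists as an HTML table string.

    The first row and the first cell of each later row become <th> cells.
    """
    header, *body = rows
    lines = ["<table>"]
    head_cells = "".join(f"<th>{escape(str(c))}</th>" for c in header)
    lines.append(f"  <tr>{head_cells}</tr>")
    for row in body:
        first, *rest = row
        cells = f"<th>{escape(str(first))}</th>"
        cells += "".join(f"<td>{escape(str(c))}</td>" for c in rest)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)

# Cut-down stand-in for DIST_TABLE
table = [
    ["LFV distance", "Romeo+Juliet", "Gadsby"],
    ["Romeo+Juliet", 0.0, 0.1346],
    ["Gadsby", 0.1346, 0.0],
]

html_text = datalist_to_html_table(table)
print(html_text)
```

In a notebook, the resulting string can be rendered by passing it to `IPython.display.HTML` and displaying the result.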