Analysing Text in terms of Letter Frequency

This notebook illustrates the use of a vector feature representation of data samples in order to calculate a measure of similarity, which could be used for classification or clustering of text samples.

I define the letter frequency of a text string as a vector with 26 components, where each of the components is the frequency of that letter given as a decimal fraction of the total number of letters in the text.
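As a minimal sketch of this definition, the function below builds the 26-component vector for a string. The name `letter_frequency` and the choice to ignore case and skip anything that is not one of the ASCII letters a to z are my assumptions here, not necessarily what the notebook's own code does.

```python
from collections import Counter
import string

def letter_frequency(text):
    """Return a 26-component list of relative letter frequencies (a..z)."""
    # Keep only the ASCII letters a-z, ignoring case; all other characters are skipped.
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    total = len(letters)
    if total == 0:
        # A text with no letters gets the zero vector rather than a division error.
        return [0.0] * 26
    counts = Counter(letters)
    return [counts[ch] / total for ch in string.ascii_lowercase]

vec = letter_frequency("Hello, World!")
```

For any text containing at least one letter, the components sum to 1, so texts of different lengths can be compared directly.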

A measure of 'distance' between two vectors is defined using Pythagoras' theorem, that is, as the Euclidean distance between them.
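The distance can be sketched in a few lines: take the difference in each component, square it, sum the squares, and take the square root. The function name `distance` is an assumption chosen for illustration.

```python
import math

def distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of components")
    # Pythagoras' theorem generalised to n dimensions.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Two texts with identical letter frequencies are at distance 0, and the distance grows as their letter usage diverges.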

The method is tested by seeing how reliably one can determine which book a random sentence has been taken from, just by looking at the letter frequency of a few sentences taken from the book.
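One way such a test could be set up is a nearest-profile lookup: compute a frequency vector for each book, then assign a sample to the book whose vector is closest. The sketch below assumes this approach. The helper `closest_book`, the profile names, and the toy strings are all hypothetical placeholders, not real book text or the notebook's actual code.

```python
import math
import string
from collections import Counter

def letter_frequency(text):
    """26-component vector of relative letter frequencies (a..z)."""
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters) or 1  # avoid division by zero for letterless text
    return [counts[ch] / total for ch in string.ascii_lowercase]

def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def closest_book(sample, profiles):
    """Return the name of the profile whose vector is nearest to the sample's."""
    sample_vec = letter_frequency(sample)
    return min(profiles, key=lambda name: distance(sample_vec, profiles[name]))

# Toy placeholder strings standing in for sentences drawn from two books.
profiles = {
    "book_a": letter_frequency("a banana and a papaya sat alongside a cabana"),
    "book_b": letter_frequency("the teeth of the eel were seen beneath the reef"),
}
guess = closest_book("a llama ran a marathon", profiles)
```

Repeating this for many random samples and counting how often the guess matches the true source gives an accuracy estimate for the method.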

A method similar to this could potentially be used as a means of recognising the author or genre of a text. Using letter frequency alone is probably not very accurate. However, one could use a vector approach based on other features, such as the frequency of common words, the lengths of words or sentences, the punctuation symbols used, etc. Using such features could give more accurate results.

Letter Frequency Vectors

Displaying a List of Lists using Pandas DataFrame

Using pandas, we can convert the table created above, stored as a list of lists, into a DataFrame, which is easy to display. Notice that in the resulting DataFrame the row and column names from the list of lists are not treated as index labels but as data within the DataFrame. We could reset the column names and index labels to get a nicer table. Also, the name of the function now appears in the (0,0) position of the DataFrame, and there is no obvious way to attach it to the DataFrame object without it being treated as data. Hence, even going from the list of lists format to a DataFrame presents some small problems in preserving the stored information.
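The behaviour described here can be reproduced with a small example. The list `datalist` below is a hypothetical stand-in for the notebook's table, with made-up values. The relabelled version, and storing the (0,0) name as the name of the column axis, is just one possible workaround rather than the notebook's own solution.

```python
import pandas as pd

# Hypothetical list of lists with made-up values: the first row holds column
# names, the first column holds row names, and cell (0, 0) holds a table name.
datalist = [
    ["distance", "text_a", "text_b"],
    ["text_a", 0.0, 0.12],
    ["text_b", 0.12, 0.0],
]

# Naive conversion: the labels end up as ordinary data in row 0 and column 0.
raw = pd.DataFrame(datalist)

# Relabelled version: pull the names out of the data and use them as labels.
header, *rows = datalist
table = pd.DataFrame(
    [row[1:] for row in rows],
    index=[row[0] for row in rows],
    columns=header[1:],
)
# One option for keeping the (0, 0) name: store it as the column axis name.
table.columns.name = header[0]
```

With this layout the numeric cells keep a numeric dtype, whereas in `raw` every column mixes strings and numbers.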

Displaying a Table using HTML

Another way to display a table nicely is to encode it as HTML. This has the advantage of giving us more control over the format of the table, but it requires somewhat complex encoding. Hence, I have created a special function, display_datalist_as_html_table, in my own module myhtml, which I import in order to produce an HTML version of the table.
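The source of myhtml is not shown here, so the following is only a rough sketch of how such an encoding could work, not the actual display_datalist_as_html_table. The function name `datalist_to_html_table` and the choice to treat the first row and first column as header cells are assumptions.

```python
from html import escape

def datalist_to_html_table(datalist):
    """Encode a list of lists as an HTML table string.

    Cells in the first row or first column are emitted as <th>, the rest as <td>.
    """
    lines = ["<table>"]
    for i, row in enumerate(datalist):
        cells = []
        for j, value in enumerate(row):
            tag = "th" if i == 0 or j == 0 else "td"
            # escape() keeps characters such as < and & from breaking the markup.
            cells.append(f"<{tag}>{escape(str(value))}</{tag}>")
        lines.append("  <tr>" + "".join(cells) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)

html_table = datalist_to_html_table([["distance", "text_a"], ["text_a", 0.0]])
```

In a notebook, passing the resulting string to `IPython.display.HTML` renders it as a table in the output cell.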