What's Your Grade? Is Text Compression A Valid Indicator of Grade Level?

by Akimitsu Hogge

Takoma Park MS

At the beginning of this year, I became interested in data compression and wanted to do a project on image compression. As I researched further, I discovered the field of theoretical data compression. To connect the concept of entropy to a real-life application, I theorized that it could be applied to text analysis, and I attempted to use entropy, a concept from data compression theory, to determine the maturity level of a document.

Entropy is the theoretical limit on how far a document can be shrunk by lossless compression, measured here in bits per character. Compression raises the information density of a document while keeping the information content constant. There are many ways to estimate entropy, but practical compression does not reach this limit unless the file is extremely repetitive. Repetitiveness and predictability mark a file with higher compression potential (lower entropy): its information density is low, so the information is spread out and inefficiently encoded. These are also attributes of less mature writing.

I hypothesized that the entropy of a document written by a student would correlate with the student's grade level, which is a good measure of maturity. I used the four orders of entropy researched by Claude E. Shannon, the father of modern information theory, to analyze writing samples from 1st to 7th grade. Each formula measures the weighted mean of the negative base-two logarithms of the probabilities of all possible sets of characters, with each set weighted by its own probability. For 1st order entropy, the sets of characters were 1 character long; for 2nd order, 2 characters long; and so on. I hypothesized that 4th order would correlate best because it performs the most thorough analysis.
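The calculation described above can be sketched in C++ as follows. This is a minimal illustration, not the original project code: the function name `blockEntropy` is hypothetical, and it assumes that "nth order" means the entropy of overlapping n-character blocks, H = -sum of p * log2(p) over every distinct block, with p estimated from the block's relative frequency in the text.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <unordered_map>

// Entropy, in bits per n-character block, of the overlapping n-character
// blocks found in `text`. Hypothetical helper; assumes probabilities are
// estimated from relative frequencies within this one text.
// Returns 0.0 when the text is shorter than n.
double blockEntropy(const std::string& text, std::size_t n) {
    assert(n > 0);  // a block must contain at least one character
    if (text.size() < n) return 0.0;

    // Count every overlapping window of n characters.
    std::unordered_map<std::string, std::size_t> counts;
    const std::size_t total = text.size() - n + 1;
    for (std::size_t i = 0; i < total; ++i) {
        ++counts[text.substr(i, n)];
    }

    // Probability-weighted mean of -log2(p) over all observed blocks.
    double entropy = 0.0;
    for (const auto& entry : counts) {
        const double p = static_cast<double>(entry.second) /
                         static_cast<double>(total);
        entropy -= p * std::log2(p);
    }
    return entropy;
}
```

Calling `blockEntropy(text, 1)` through `blockEntropy(text, 4)` would give one value per order. Dividing the result by `n` is one possible way to express it in bits per character; Shannon's own N-gram measure is a conditional entropy, so the exact formulae used in the project may differ from this sketch.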

For my procedure, I wrote several C++ programs that translate the entropy formulae into code to find the 1st to 4th order entropies of a document. I developed a Visual Basic program to "format" the documents so that all of the letters were capitalized and punctuation symbols were deleted. Several documents that I had saved over the years, from 1st grade to 7th, were typed into electronic files and analyzed. The entropy output was converted from bits to grade levels: the whole curve was shifted vertically so that the minimum entropy became 1 and then scaled vertically until the maximum entropy became 7.
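The formatting and scaling steps can be sketched as below. This is an illustration under stated assumptions rather than the original programs: the formatting step is shown in C++ instead of Visual Basic, the names `normalizeText` and `scaleToGrades` are hypothetical, and the shift-and-scale step is assumed to be a linear map that sends the smallest entropy to grade 1 and the largest to grade 7.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Capitalize every letter and delete punctuation symbols.
// Spaces, digits, and line breaks are kept unchanged (an assumption).
std::string normalizeText(const std::string& raw) {
    std::string out;
    out.reserve(raw.size());
    for (unsigned char c : raw) {
        if (std::ispunct(c)) continue;  // drop punctuation
        out.push_back(static_cast<char>(std::toupper(c)));
    }
    return out;
}

// Linearly map entropies so the minimum becomes lowGrade and the maximum
// becomes highGrade. If all entropies are equal, every value maps to lowGrade.
std::vector<double> scaleToGrades(const std::vector<double>& entropies,
                                  double lowGrade = 1.0,
                                  double highGrade = 7.0) {
    assert(lowGrade < highGrade);
    std::vector<double> grades;
    if (entropies.empty()) return grades;

    const auto bounds = std::minmax_element(entropies.begin(), entropies.end());
    const double minH = *bounds.first;
    const double maxH = *bounds.second;
    const double range = maxH - minH;

    grades.reserve(entropies.size());
    for (double h : entropies) {
        grades.push_back(range == 0.0
            ? lowGrade
            : lowGrade + (h - minH) * (highGrade - lowGrade) / range);
    }
    return grades;
}
```

With this mapping, each predicted grade depends on the smallest and largest entropies in the sample set, so adding or removing a document can change the predicted grade of every other document.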

My results showed a significant correlation between the predicted and actual grade levels of the writings, suggesting that entropy measurement could be developed into one tool for analyzing the maturity of a document. Surprisingly, my hypothesis that the 4th order would show the best correlation was not supported, as the 4th order statistics appeared random. Clearly, more data is required to develop a reliable indicator.