Readability Index in Python (NLP)


Natural language processing (NLP) is the study of the automated generation and understanding of natural human languages. It is becoming an increasingly interesting area to work in, as computer technology is now integrated into almost every industry. This article looks at one specific topic within natural language processing: readability, that is, determining how difficult a text is to read or understand.

A readability index is a numeric value that indicates how difficult (or easy) it is to read and understand a text. There are several different tests for determining readability, and they have different fields of use.

"Readability describes the ease with which a document can be read" [13]. There exist many different tests [9] to calculate readability. Readability tests are "considered to be predictions of reading ease but not the only method for determining readability"

Some of the tests are language neutral, while others are better suited to certain languages, so it is essential to know the different readability tests.

Each entry below gives the readability test, the language(s) it is intended for, and a short description with its formula.

Automated Readability Index (ARI)
English
Designed to gauge the understandability of a text. The output approximates the U.S. grade level needed to comprehend the text.
ARI = 4.71 * (characters/words) + 0.5 * (words/sentences) - 21.43

Flesch Reading Ease
English
Designed to indicate how difficult a reading passage is to understand. Higher scores indicate material that is easier to read; lower scores mark passages that are harder to read.
FRE = 206.835 - 1.015 * (total words/total sentences) - 84.6 * (total syllables/total words)

Flesch-Kincaid Grade Level
English
Designed to indicate how difficult a reading passage is to understand. The result is a number that corresponds to a U.S. grade level.
FKGL = 0.39 * (total words/total sentences) + 11.8 * (total syllables/total words) - 15.59

Coleman-Liau Index
English
Designed to gauge the understandability of a text. The output is the approximate U.S. grade level thought necessary to comprehend the text.
CLI = 5.89 * (characters/words) - 30 * (sentences/words) - 15.8

Gunning Fog Index
English
Designed to measure the readability of a sample of English writing. The resulting index indicates the number of years of formal education (U.S. grade level) that a person needs in order to understand the text easily on the first reading.
GFI = 0.4 * ((words/sentences) + 100 * (complex words/words))

Linsear Write
English
A readability metric for English text, developed for the Air Force to help them calculate the readability of their technical manuals. Formula from Wikipedia:

  • Find a 100-word sample from your writing.

  • Calculate the easy words (defined as two syllables or less) and place a number "1" over each word, even including a, an, the, and other simple words.

  • Calculate the hard words (defined as three syllables or more) and place a number "3" over each word as pronounced by the dictionary.

  • Multiply the number of easy words times "1."

  • Multiply the number of hard words times "3."

  • Add the two previous numbers together.

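  • Divide that total by the number of sentences.

  • If the result is greater than 20, divide it by 2. If it is 20 or less, subtract 2 and then divide by 2. The final number is the grade level.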
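
The steps above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name `linsear_write` and its parameters `easy_words`, `hard_words` and `sentences` are hypothetical, and the counts are assumed to have been taken from a 100-word sample already. The last step applies the customary final adjustment of the Linsear Write formula (halve a result above 20; otherwise subtract 2 and then halve).

```python
def linsear_write(easy_words: int, hard_words: int, sentences: int) -> float:
    """Linsear Write score from counts over a 100-word sample (hypothetical helper)."""
    if sentences <= 0:
        raise ValueError("sentences must be positive")
    # Easy words (two syllables or fewer) score 1 point, hard words score 3 points
    points = easy_words * 1 + hard_words * 3
    # Divide the total by the number of sentences
    provisional = points / sentences
    # Customary final adjustment of the formula (assumption, see the text above)
    if provisional > 20:
        return provisional / 2
    return (provisional - 2) / 2


# Example: a 100-word sample with 80 easy words, 20 hard words and 5 sentences
print(linsear_write(80, 20, 5))
```

Counting syllables to separate easy words from hard words is left to the caller here, since any automatic syllable counter is only an approximation.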

Rate Index (RIX)
Western European Languages
RIX is useful because it can be applied to documents in any Western European language [3]. The output is an index that indicates a grade level: an index below 0.1 corresponds to grade 1, while 7.2 and above corresponds to college level.
RIX = long words/sentences
(long words = words with more than 6 characters)

Lesbarhets Index (LIX)
Western European Languages
LIX is useful because it can be applied to documents in any Western European language [2][3]. The output is a score that ranges from 0 (very easy) to 55 and above (very difficult).
LIX = (total words/total sentences) + (long words/total words * 100)
(long words = words with more than 6 characters)
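
Since the formulas above are plain arithmetic on a handful of counts, they can be sketched as small Python functions. This is a minimal sketch under the assumption that the counts (characters, words, sentences, syllables, complex words, long words) have already been computed; all function and parameter names are hypothetical, and the coefficients are copied from the entries above.

```python
def ari(characters, words, sentences):
    # Automated Readability Index
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

def flesch_reading_ease(words, sentences, syllables):
    # Higher scores mean easier text
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    # Approximate U.S. grade level
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def coleman_liau(characters, words, sentences):
    # Coefficients as given in the entry above
    return 5.89 * (characters / words) - 30 * (sentences / words) - 15.8

def gunning_fog(words, sentences, complex_words):
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))

def rix(long_words, sentences):
    # long_words: words with more than 6 characters
    return long_words / sentences

def lix(words, sentences, long_words):
    return words / sentences + long_words / words * 100


# Example counts for an imaginary 100-word passage (made-up numbers)
counts = dict(characters=450, words=100, sentences=5)
print(f"ARI: {ari(**counts):.2f}")
print(f"CLI: {coleman_liau(**counts):.2f}")
print(f"LIX: {lix(words=100, sentences=5, long_words=20):.1f}")
```

Keeping the counting separate from the formulas makes it easy to swap in a better tokenizer or syllable counter later without touching the arithmetic.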

For example, the program below uses the Flesch index to determine the readability of a text file.

Assumption

Flesch Index     Reading grade of the text file
0-30             College
50-60            High School
90-100           Fourth Grade

The Flesch Reading Ease formula given earlier is used to compute the Flesch index of the text, which is then compared with the ranges in the table:

FRE = 206.835 - 1.015 * (total words/total sentences) - 84.6 * (total syllables/total words)

Code

import os
import string

dire = os.getcwd()
listOfdir = os.listdir(dire)
while True:
   UserFileName = input('Enter file name:')
   if (UserFileName in listOfdir) and (UserFileName.endswith(".txt")):
      # Read the whole file; the with-block closes it afterwards
      with open(UserFileName, 'r') as InputFile:
         text = InputFile.read()
      # Approximate the sentence count by counting terminating punctuation
      sentence = text.count('.') + text.count('!') + text.count(';') + text.count(':') + text.count('?')
      words = len(text.split())
      if words == 0 or sentence == 0:
         print ('This text file has no complete sentences to score')
         continue
      # Approximate the syllable count: count vowels, then correct for common endings
      syllable = 0
      for word in text.split():
         word = word.lower().strip(string.punctuation)
         for vowel in ['a','e','i','o','u']:
            syllable += word.count(vowel)
         for ending in ['es','ed','e']:
            if word.endswith(ending):
               syllable -= 1
         if word.endswith('le'):
            syllable += 1
      # Flesch Reading Ease score, rounded to the nearest integer
      score = round(206.835 - 1.015*(words/sentence) - 84.6*(syllable/words))
      if score >= 0 and score <= 30:
         print ('The Readability level is College')
      elif score >= 50 and score <= 60:
         print ('The Readability level is High School')
      elif score >= 90 and score <= 100:
         print ('The Readability level is fourth grade')
      else:
         print ('The Flesch index %d falls outside the ranges listed above' %(score))
      print ('This text has %d words' %(words))
      break
   elif UserFileName not in listOfdir:
      print ('This text file does not exist in current directory')
   elif not(UserFileName.endswith('.txt')):
      print ('This is not a text file.')

Output

Enter file name:dataVisualization.txt
The Readability level is College
This text has 64 words

Updated on: 30-Jul-2019
