Split string into sentences using regex in PHP

Example

<?php
function sentence_split($text) {
   // Patterns matched against the END of the text before a candidate boundary.
   // Each entry pairs with the entry at the same index in $after_regexes.
   $before_regexes = array(
      '/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
      '/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
      '/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
      '/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs)\.\s))\Z/su',
      '/(?:(?:\b[Ee]tc\.\s))\Z/su',
      '/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su',
      '/(?:(?:\b\p{L}\.))\Z/su',
      '/(?:(?:\b\p{L}\.\s))\Z/su',
      '/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
      '/(?:(?:[\"”\']\s*))\Z/su',
      '/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
      '/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
      '/(?:(?:\s\p{L}[\.!?…]\s))\Z/su');
   $after_regexes = array('/\A(?:)/su',
   '/\A(?:[\p{N}\p{Ll}])/su',
   '/\A(?:[^\p{Lu}])/su',
   '/\A(?:[^\p{Lu}]|I)/su',
   '/\A(?:[^\p{Lu}])/su',
   '/\A(?:\p{Ll})/su',
   '/\A(?:\p{L}\.)/su',
   '/\A(?:\p{L}\.\s)/su',
   '/\A(?:\p{N})/su',
   '/\A(?:\s*\p{Ll})/su',
   '/\A(?:)/su',
   '/\A(?:\p{Lu}[^\p{Lu}])/su',
   '/\A(?:\p{Lu}\p{Ll})/su');
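   // Pairs 0-9 mark positions that are NOT boundaries; pairs 10-12 mark real sentence boundaries.
   $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, true, true, true);
   $count = count($before_regexes);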
   $sentences = array();
   $sentence = '';
   $before = '';
   $after = substr($text, 0, 10);
   $text = substr($text, 10);
   while($text != '') {
      for($i = 0; $i < $count; $i++) {
         if(preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) {
            if($is_sentence_boundary[$i]) {
               array_push($sentences, $sentence);
               $sentence = '';
            }
            break;
         }
      }
      $first_from_text = $text[0];
      $text = substr($text, 1);
      $first_from_after = $after[0];
      $after = substr($after, 1);
      $before .= $first_from_after;
      $sentence .= $first_from_after;
      $after .= $first_from_text;
   }
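   // Flush whatever is left. This also covers inputs of 10 bytes or fewer,
   // for which the loop above never runs and $sentence stays empty.
   if($sentence . $after != '') {
      array_push($sentences, $sentence.$after);
   }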
   return $sentences;
}
$text = "Hello there, hello from Tokyo, Japan, Universe, Earth.";
print_r(sentence_split($text));

Output

This will produce the following output:

Array
(
    [0] => Hello there, hello from Tokyo, Japan, Universe, Earth.
)

The text is scanned one character at a time. At each position it is split into two parts around a candidate sentence boundary. The first part, $before, is the text that precedes the candidate boundary.

The other part, $after, is the text that follows the candidate boundary (a window of up to 10 characters). The first 10 regex pairs detect positions that look like boundaries but are not, such as abbreviations, initials, and ellipses. When one of them matches, no sentence is saved and the scan simply advances by one character.
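To make the pairing concrete, here is a minimal sketch of how a single before/after pair can be tested at one position. The helper name pair_matches and the sample strings are hypothetical and not part of the function above, and the pattern is a shortened version of the title-abbreviation rule.

```php
<?php
// Hypothetical helper: true when both halves of a rule match at this position.
function pair_matches($before_regex, $after_regex, $before, $after) {
    // \Z anchors the first pattern to the end of $before,
    // \A anchors the second pattern to the start of $after.
    return preg_match($before_regex, $before) === 1
        && preg_match($after_regex, $after) === 1;
}

// Simplified non-boundary rule: a title such as "Dr. " right before the split.
$before_regex = '/\b(?:Dr|Mr|Ms|Mrs|Prof)\.\s\Z/su';
$after_regex  = '/\A(?:)/su'; // empty pattern: anything may follow

var_dump(pair_matches($before_regex, $after_regex, 'I met Dr. ', 'Smith toda'));  // bool(true)
var_dump(pair_matches($before_regex, $after_regex, 'I met him. ', 'Then we le')); // bool(false)
```

Because every "before" pattern ends in \Z and every "after" pattern starts with \A, a pair only ever inspects the text immediately around the split point.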

If none of those pairs match, the last 3 pairs are tried. A match there marks a sentence boundary: the text accumulated so far is saved as a sentence and a new one is started.
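The ordering of the rules matters, which a reduced version may show more clearly. The sketch below is an illustration only, assuming ASCII input and using just two rules instead of thirteen; the function name simple_sentence_split, its $window parameter, and the rule table are hypothetical and do not reproduce the full function above.

```php
<?php
// Simplified sketch of the rule-table scan (assumes ASCII input).
function simple_sentence_split($text, $window = 10) {
    // Each rule: [before regex, after regex, is boundary?]. The first match wins.
    $rules = array(
        // Non-boundary: a title abbreviation directly before the split.
        array('/\b(?:Dr|Mr|Ms|Mrs|Prof)\.\s\Z/su', '/\A(?:)/su', false),
        // Boundary: end punctuation plus whitespace, followed by an uppercase letter.
        array('/[.!?]\s\Z/su', '/\A\p{Lu}/su', true),
    );
    $sentences = array();
    $sentence = '';
    $length = strlen($text);
    for ($pos = 0; $pos <= $length; $pos++) {
        $before = substr($text, 0, $pos);
        $after = (string) substr($text, $pos, $window);
        foreach ($rules as $rule) {
            if (preg_match($rule[0], $before) && preg_match($rule[1], $after)) {
                if ($rule[2] && $sentence !== '') {
                    $sentences[] = $sentence; // save the finished sentence
                    $sentence = '';
                }
                break; // later rules are not consulted
            }
        }
        if ($pos < $length) {
            $sentence .= $text[$pos];
        }
    }
    if ($sentence !== '') {
        $sentences[] = $sentence; // flush the remainder
    }
    return $sentences;
}

print_r(simple_sentence_split('Dr. Smith arrived. He was late.'));
```

For the sample input this yields two sentences, "Dr. Smith arrived. " (with its trailing space) and "He was late.". The position after "Dr. " would also satisfy the boundary rule, but the non-boundary rule is listed first and stops the search, which is the same role the first 10 pairs play in the full function.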

raja
Published on 06-Apr-2020 09:06:04