Split string into sentences using regex in PHP

Example

function sentence_split($text) {
   // Patterns matched against the END of $before (the text already consumed).
   // Each pattern must stay in one piece: a line break inside a quoted pattern would become part of the regex.
   $before_regexes = array(
      '/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)'
      . '|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)'
      . '|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)'
      . '|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)'
      . '|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)'
      . '|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)'
      . '|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
      '/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
      '/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
      '/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs)\.\s))\Z/su',
      '/(?:(?:\b[Ee]tc\.\s))\Z/su',
      '/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su',
      '/(?:(?:\b\p{L}\.))\Z/su',
      '/(?:(?:\b\p{L}\.\s))\Z/su',
      '/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
      '/(?:(?:[\"”\']\s*))\Z/su',
      '/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
      '/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
      '/(?:(?:\s\p{L}[\.!?…]\s))\Z/su');
   // Patterns matched against the START of $after; entry $i pairs with $before_regexes[$i].
   $after_regexes = array(
      '/\A(?:)/su',
      '/\A(?:[\p{N}\p{Ll}])/su',
      '/\A(?:[^\p{Lu}])/su',
      '/\A(?:[^\p{Lu}]|I)/su',
      '/\A(?:[^\p{Lu}])/su',
      '/\A(?:\p{Ll})/su',
      '/\A(?:\p{L}\.)/su',
      '/\A(?:\p{L}\.\s)/su',
      '/\A(?:\p{N})/su',
      '/\A(?:\s*\p{Ll})/su',
      '/\A(?:)/su',
      '/\A(?:\p{Lu}[^\p{Lu}])/su',
      '/\A(?:\p{Lu}\p{Ll})/su');
   // Pairs 0-9 describe positions that are not sentence boundaries; pairs 10-12 describe real boundaries.
   $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, true, true, true);
   $count = 13;
   $sentences = array();
   $sentence = '';
   $before = '';
   $after = substr($text, 0, 10);
   $text = substr($text, 10);
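   // Slide the window one character at a time until the look-ahead is used up,
   // so that short inputs and the final 10 characters are checked as well.
   while($after != '') {
      for($i = 0; $i < $count; $i++) {
         if(preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) {
            if($is_sentence_boundary[$i]) {
               array_push($sentences, $sentence);
               $sentence = '';
            }
            // The first matching pair decides; later pairs are not tried.
            break;
         }
      }
      // Move one character from the look-ahead into $before and the current sentence.
      $before .= $after[0];
      $sentence .= $after[0];
      $after = substr($after, 1);
      // Refill the look-ahead from the remaining text, if any is left.
      if($text != '') {
         $after .= $text[0];
         $text = substr($text, 1);
      }
   }
   // Whatever follows the last boundary is the final sentence.
   if($sentence != '') {
      array_push($sentences, $sentence);
   }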
   return $sentences;
}
$text = "Hello there, hello from Tokyo, Japan, Universe, Earth.";
print_r(sentence_split($text));

Output

This will produce the following output −

Array
(
    [0] => Hello there, hello from Tokyo, Japan, Universe, Earth.
)

The text is scanned one character at a time. At each position it is split into two parts: $before, the text that precedes the candidate sentence boundary, and $after, a 10-character look-ahead that follows it.

Each of the 13 regex pairs tests the end of $before and the start of $after. The first 10 pairs detect positions that look like sentence boundaries but are not, such as the period after an abbreviation. When one of them matches, the window advances by one character without a new sentence being saved.
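
To make the two-part window concrete, here is a minimal sketch that is separate from the function above. The helper name window_positions and its $lookahead parameter are assumptions made for illustration; the helper only lists the two parts at each position and does no boundary detection.

```php
<?php
// Hypothetical helper: lists the (before, after) parts at every position.
// $lookahead caps how many bytes of upcoming text are kept in the "after" part.
function window_positions(string $text, int $lookahead = 10): array {
    $positions = [];
    $length = strlen($text);
    for ($i = 0; $i <= $length; $i++) {
        $positions[] = [
            'before' => substr($text, 0, $i),          // everything already consumed
            'after'  => substr($text, $i, $lookahead), // the next few bytes
        ];
    }
    return $positions;
}

$positions = window_positions("Hi. Yo", 3);
// Position 4 sits right after "Hi. ", so a boundary test there would compare
// the end of "Hi. " with the start of "Yo".
print_r($positions[4]);
```

Like the function above, this sketch slices by bytes, so it is only meant for ASCII input.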

If none of the first 10 pairs match, the last 3 pairs are tried. A match on one of them marks a sentence boundary: the current sentence is saved and a new one is started.
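
The ordered-pair idea can be shown in a much smaller form. The sketch below is not the function above: it assumes a hypothetical three-rule table ($rules) and a hypothetical helper (is_boundary), with the exceptions listed ahead of the boundary rule so that the first matching pair decides the outcome.

```php
<?php
// Each rule: [regex for the end of $before, regex for the start of $after, is it a boundary?].
// Rules are tried in order and the first matching pair decides.
$rules = [
    // Exception: a title such as "Dr. " does not end a sentence.
    ['/\b(?:Dr|Mr|Mrs|Ms)\.\s\z/u', '/\A/u', false],
    // Exception: punctuation followed by a lowercase letter.
    ['/[.!?]\s\z/u', '/\A\p{Ll}/u', false],
    // Boundary: punctuation and whitespace followed by anything else.
    ['/[.!?]\s\z/u', '/\A/u', true],
];

function is_boundary(array $rules, string $before, string $after): bool {
    foreach ($rules as [$beforeRegex, $afterRegex, $isBoundary]) {
        if (preg_match($beforeRegex, $before) && preg_match($afterRegex, $after)) {
            return $isBoundary; // first matching pair wins
        }
    }
    return false; // no pair matched: keep scanning
}

var_dump(is_boundary($rules, "I met Dr. ", "Smith today."));   // bool(false)
var_dump(is_boundary($rules, "It rained. ", "We stayed in.")); // bool(true)
var_dump(is_boundary($rules, "It rained", " all day."));       // bool(false)
```

Putting the exceptions first keeps the boundary rule simple: it never needs to know about abbreviations, because those positions have already been claimed by an earlier pair.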

AmitDiwan

Updated on: 06-Apr-2020
