Split string into sentences using regex in PHP

In PHP, you can split text into sentences using regular expressions to handle complex sentence boundaries. This approach considers abbreviations, titles, and other edge cases that simple period-splitting would miss.

Using Complex Regex Patterns

A more robust approach pairs regex patterns to identify sentence boundaries while avoiding false positives on abbreviations and titles:

<?php
function sentence_split($text) {
    $before_regexes = array(
        // Single quotes inside these single-quoted PHP strings must be escaped as \'
        '/(?:(?:[\'"\?][\.!?\?][\'"]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
        '/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
        '/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
        '/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs)\.\s))\Z/su',
        '/(?:(?:\b[Ee]tc\.\s))\Z/su',
        '/(?:(?:[\.!?\?]+\p{Pe} )|(?:[\[\(]*\?[\]\)]* ))\Z/su',
        '/(?:(?:\b\p{L}\.))\Z/su',
        '/(?:(?:\b\p{L}\.\s))\Z/su',
        '/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
        '/(?:(?:["\']\s*))\Z/su',
        '/(?:(?:[\.!?\?][\x{00BB}\x{2019}\x{201D}\x{203A}"\'\p{Pe}\x{0002}]*\s)|(?:\r?\<br>))\Z/su',
        '/(?:(?:[\.!?\?][\'"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
        '/(?:(?:\s\p{L}[\.!?\?]\s))\Z/su'
    );
    
    $after_regexes = array(
        '/\A(?:)/su',
        '/\A(?:[\p{N}\p{Ll}])/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:[^\p{Lu}]|I)/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:\p{Ll})/su',
        '/\A(?:\p{L}\.)/su',
        '/\A(?:\p{L}\.\s)/su',
        '/\A(?:\p{N})/su',
        '/\A(?:\s*\p{Ll})/su',
        '/\A(?:)/su',
        '/\A(?:\p{Lu}[^\p{Lu}])/su',
        '/\A(?:\p{Lu}\p{Ll})/su'
    );
    
    $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, true, true, true);
    $count = 13;
    $sentences = array();
    $sentence = '';
    $before = '';
    $after = substr($text, 0, 10);
    $text = substr($text, 10);
    
    while($text != '') {
        for($i = 0; $i < $count; $i++) {
            if(preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) {
                if($is_sentence_boundary[$i]) {
                    array_push($sentences, $sentence);
                    $sentence = '';
                }
                break;
            }
        }
        $first_from_text = $text[0];
        $text = substr($text, 1);
        $first_from_after = $after[0];
        $after = substr($after, 1);
        $before .= $first_from_after;
        $sentence .= $first_from_after;
        $after .= $first_from_text;
    }
    
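    // Flush the remainder, including the look-ahead window that was never scanned.
    // Concatenating first also covers inputs of 10 characters or fewer.
    if($sentence . $after != '') {
        array_push($sentences, $sentence . $after);
    }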
    
    return $sentences;
}

$text = "Dr. Smith went to the U.S.A. He met Prof. Johnson. They discussed the project.";
print_r(sentence_split($text));
?>
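Array
(
    [0] => Dr. Smith went to the U.S.A. He met Prof. Johnson. 
    [1] => They discussed the project.
)

Note that no split occurs after "U.S.A.": the acronym rule in the first pattern treats two capital letters with periods followed by whitespace as a non-boundary, so the first two sentences stay together. Each sentence also keeps its trailing space.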

Simple Approach for Basic Cases

For simpler text without complex abbreviations, you can use a basic regex pattern:

<?php
function simple_sentence_split($text) {
    // Split on whitespace that follows ., ! or ? and precedes a capital letter.
    // PREG_SPLIT_NO_EMPTY drops empty pieces and keeps the keys sequential.
    return preg_split('/(?<=[.!?])\s+(?=[A-Z])/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

$text = "Hello world. How are you? I am fine!";
print_r(simple_sentence_split($text));
?>
Array
(
    [0] => Hello world.
    [1] => How are you?
    [2] => I am fine!
)
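
The simple pattern also splits after any abbreviation that is followed by a capitalised word, so "Dr. Smith arrived." becomes "Dr." and "Smith arrived.". One possible middle ground, sketched here on the assumption that a short, known list of abbreviations is enough for your text, is to put a negative lookbehind for each abbreviation in front of the same pattern. The function name split_with_abbreviations and the abbreviation list are illustrative only.

<?php
// Hypothetical helper: same split rule as simple_sentence_split(), but a
// split is refused when the period belongs to a listed abbreviation.
function split_with_abbreviations(string $text, array $abbreviations): array {
    $guards = '';
    foreach ($abbreviations as $abbr) {
        // One fixed-length negative lookbehind per abbreviation, e.g. (?<!\bDr\.)
        $guards .= '(?<!\b' . preg_quote($abbr, '/') . '\.)';
    }
    $pattern = '/' . $guards . '(?<=[.!?])\s+(?=[A-Z])/';
    return preg_split($pattern, trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

$text = "Dr. Smith arrived. He met Prof. Johnson. They talked.";
print_r(split_with_abbreviations($text, array('Dr', 'Mr', 'Mrs', 'Prof')));
?>
Array
(
    [0] => Dr. Smith arrived.
    [1] => He met Prof. Johnson.
    [2] => They talked.
)

This sketch does not handle acronyms such as U.S.A. or punctuation inside quotes; those cases are what the longer rule table is for.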

How It Works

The complex approach uses paired regex patterns to analyze text context before and after potential sentence boundaries. It considers:

  • Abbreviations − titles like Dr., Prof., Mrs.
  • Acronyms − U.S.A., Ph.D.
  • Numbers − decimal points and measurements
  • Quotations − punctuation inside quotes

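The algorithm advances through the text one character at a time, keeping everything consumed so far in $before and a 10-character look-ahead in $after. At each position it tries the pattern pairs in order. The first pair whose before pattern matches the end of $before and whose after pattern matches the start of $after decides the outcome: the matching entry in $is_sentence_boundary says whether that position ends a sentence. Because the exception rules come first, an abbreviation is recognised before the general "punctuation followed by whitespace" rule can fire.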
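
The paired before/after check can be illustrated in isolation with a much smaller rule table. The two rules and the classify_position name below are assumptions made for this sketch, not part of the function above; the point is only to show how the first matching pair decides the result.

<?php
// Hypothetical, cut-down rule table: [before pattern, after pattern, is boundary?]
$rules = array(
    array('/\b(?:Dr|Mr|Mrs|Prof)\.\s\z/', '/\A/', false),  // title: never a boundary
    array('/[.!?]\s\z/', '/\A\p{Lu}/u', true),             // punctuation, space, capital
);

// Returns true or false from the first rule whose two patterns both match,
// or null when no rule applies at this position.
function classify_position(string $before, string $after, array $rules): ?bool {
    foreach ($rules as $rule) {
        list($beforeRegex, $afterRegex, $isBoundary) = $rule;
        if (preg_match($beforeRegex, $before) && preg_match($afterRegex, $after)) {
            return $isBoundary; // first matching pair wins
        }
    }
    return null;
}

var_dump(classify_position('He met Dr. ', 'Smith today', $rules));
var_dump(classify_position('It works. ', 'Next one', $rules));
var_dump(classify_position('It works', '. Next one', $rules));
?>
bool(false)
bool(true)
NULL

Ordering matters here: if the general rule came first, "Dr. " followed by "Smith" would be classified as a boundary before the title rule was ever tried.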

Conclusion

Use the rule-based approach when the text contains abbreviations, titles, or acronyms that a plain punctuation split would break apart. For plain prose without such cases, splitting on punctuation followed by whitespace and a capital letter is sufficient.

Updated on: 2026-03-15T08:36:12+05:30
