Split string into sentences using regex in PHP
In PHP, you can split text into sentences using regular expressions to handle complex sentence boundaries. This approach considers abbreviations, titles, and other edge cases that simple period-splitting would miss.
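To make the problem concrete, here is a minimal sketch of naive period-splitting. The function name `naive_period_split` and the sample text are illustrative assumptions, not part of the original article.

```php
<?php
// Hypothetical baseline: split on whitespace that directly follows a period.
function naive_period_split(string $text): array {
    return preg_split('/(?<=\.)\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

print_r(naive_period_split("Dr. Smith arrived. He met Prof. Jones."));
// Four elements: "Dr.", "Smith arrived.", "He met Prof.", "Jones."
```

The titles "Dr." and "Prof." are cut off from the names they belong to, which is the kind of false boundary the patterns in the following sections try to avoid.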
Using Complex Regex Patterns
A more robust approach checks each candidate position against paired regex patterns, one for the text before it and one for the text after it, so that abbreviations and titles do not produce false sentence boundaries:
<?php
function sentence_split($text) {
    // Patterns matched against the END of the text consumed so far.
    // Single quotes inside the patterns are escaped for PHP's single-quoted strings.
    $before_regexes = array(
        '/(?:(?:[\'"\?][\.!?\?][\'"]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
        '/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
        '/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
        '/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs)\.\s))\Z/su',
        '/(?:(?:\b[Ee]tc\.\s))\Z/su',
        '/(?:(?:[\.!?\?]+\p{Pe} )|(?:[\[\(]*\?[\]\)]* ))\Z/su',
        '/(?:(?:\b\p{L}\.))\Z/su',
        '/(?:(?:\b\p{L}\.\s))\Z/su',
        '/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
        '/(?:(?:["\']\s*))\Z/su',
        '/(?:(?:[\.!?\?][\x{00BB}\x{2019}\x{201D}\x{203A}"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
        '/(?:(?:[\.!?\?][\'"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
        '/(?:(?:\s\p{L}[\.!?\?]\s))\Z/su'
    );
    // Patterns matched against the START of the look-ahead window, paired by index.
    $after_regexes = array(
        '/\A(?:)/su',
        '/\A(?:[\p{N}\p{Ll}])/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:[^\p{Lu}]|I)/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:\p{Ll})/su',
        '/\A(?:\p{L}\.)/su',
        '/\A(?:\p{L}\.\s)/su',
        '/\A(?:\p{N})/su',
        '/\A(?:\s*\p{Ll})/su',
        '/\A(?:)/su',
        '/\A(?:\p{Lu}[^\p{Lu}])/su',
        '/\A(?:\p{Lu}\p{Ll})/su'
    );
    // Pairs 0-9 are exceptions (no split); pairs 10-12 are real boundaries.
    $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, true, true, true);
    $count = count($before_regexes);
    $sentences = array();
    $sentence = '';
    $before = '';
    $after = substr($text, 0, 10); // 10-character look-ahead window
    $text = substr($text, 10);
    while ($text != '') {
        // The first pair that matches decides the current position.
        for ($i = 0; $i < $count; $i++) {
            if (preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) {
                if ($is_sentence_boundary[$i]) {
                    array_push($sentences, $sentence);
                    $sentence = '';
                }
                break;
            }
        }
        // Slide the window forward by one character.
        $first_from_text = $text[0];
        $text = substr($text, 1);
        $first_from_after = $after[0];
        $after = substr($after, 1);
        $before .= $first_from_after;
        $sentence .= $first_from_after;
        $after .= $first_from_text;
    }
    // Flush the remainder; this also covers input shorter than the window.
    if ($sentence . $after !== '') {
        array_push($sentences, $sentence . $after);
    }
    return $sentences;
}

$text = "Dr. Smith went to the U.S.A. He met Prof. Johnson. They discussed the project.";
print_r(sentence_split($text));
?>
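Array
(
[0] => Dr. Smith went to the U.S.A. He met Prof. Johnson.
[1] => They discussed the project.
)
Note that "U.S.A." matches the acronym exception, so the function does not split after it even though a new sentence follows. Each returned sentence also keeps its trailing whitespace.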
Simple Approach for Basic Cases
For simpler text without abbreviations, a single regex pattern is enough:
<?php
function simple_sentence_split($text) {
    // Split on whitespace that follows ., ! or ? and precedes an uppercase ASCII letter.
    // PREG_SPLIT_NO_EMPTY drops empty pieces and keeps the result keys sequential.
    return preg_split('/(?<=[.!?])\s+(?=[A-Z])/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

$text = "Hello world. How are you? I am fine!";
print_r(simple_sentence_split($text));
?>
Array
(
[0] => Hello world.
[1] => How are you?
[2] => I am fine!
)
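The basic pattern splits after any title such as "Dr." when a capitalized name follows. One possible middle ground, sketched here under the assumption that only a handful of titles matter, is to add one negative lookbehind per title. The function name `title_aware_split` and the title list are hypothetical and not part of the original article.

```php
<?php
// Hypothetical helper: the basic split plus one fixed-length negative
// lookbehind per title, so "Dr. Smith" stays in one piece.
function title_aware_split(string $text): array {
    $pattern = '/(?<!\bDr\.)(?<!\bMr\.)(?<!\bMrs\.)(?<!\bMs\.)(?<!\bProf\.)'
             . '(?<=[.!?])\s+(?=[A-Z])/';
    return preg_split($pattern, trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

print_r(title_aware_split("Dr. Smith arrived. He met Prof. Jones! Was it late?"));
// Three elements: "Dr. Smith arrived.", "He met Prof. Jones!", "Was it late?"
```

Each title needs its own lookbehind because PCRE requires every lookbehind to have a fixed length. The list grows quickly, which is why larger abbreviation sets are easier to manage with a table of patterns.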
How It Works
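The complex approach pairs each "before" pattern with the "after" pattern at the same index, so every position is judged by the text on both sides of it. The first ten pairs describe exceptions that must not split; the last three describe real sentence boundaries. The patterns take into account: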
- Abbreviations − titles like Dr., Prof., Mrs.
- Acronyms − U.S.A., Ph.D.
- Numbers − decimal points and measurements
- Quotations − punctuation inside quotes
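The algorithm slides a 10-character look-ahead window through the text one character at a time. At each position it tries the pattern pairs in order and stops at the first pair that matches; that pair decides whether the position ends a sentence.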
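The mechanism can be sketched with a much smaller rule table. The version below is a simplified illustration, not the function from this article: `tiny_sentence_split`, its two rules, and the title list are assumptions, and it assumes single-byte (ASCII) text because it slices the string with `substr`.

```php
<?php
// Hypothetical, reduced rule table. Each rule holds a "before" pattern
// (anchored at the end of the consumed text), an "after" pattern
// (anchored at the start of the remaining text), and a verdict.
function tiny_sentence_split(string $text): array {
    $rules = [
        // Title followed by whitespace: never a boundary.
        ['/\b(?:Dr|Mr|Mrs|Ms|Prof)\.\s\z/', '/\A/', false],
        // Terminal punctuation and whitespace, then an uppercase letter: boundary.
        ['/[.!?]\s\z/', '/\A\p{Lu}/u', true],
    ];
    $sentences = [];
    $start = 0;
    $length = strlen($text);
    for ($pos = 1; $pos < $length; $pos++) {
        $before = substr($text, 0, $pos);
        $after = substr($text, $pos);
        foreach ($rules as [$beforeRe, $afterRe, $isBoundary]) {
            if (preg_match($beforeRe, $before) && preg_match($afterRe, $after)) {
                if ($isBoundary) {
                    $sentences[] = rtrim(substr($text, $start, $pos - $start));
                    $start = $pos;
                }
                break; // the first matching rule decides this position
            }
        }
    }
    if ($start < $length) {
        $sentences[] = rtrim(substr($text, $start)); // flush the last sentence
    }
    return $sentences;
}

print_r(tiny_sentence_split("Dr. Smith left. He was late."));
// Two elements: "Dr. Smith left.", "He was late."
```

Rule order matters: the title exception comes first, so it wins over the generic boundary rule at the position right after "Dr. ". This sketch rescans the whole prefix at every position, which is fine for short strings but slow for long ones; the fixed-size window in the full function avoids building the "after" string repeatedly.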
Conclusion
Use the paired-pattern approach when the text contains titles, abbreviations, or acronyms that a plain punctuation split would break apart. For simple, well-formed text, splitting on punctuation followed by whitespace and a capital letter is enough.
