
- Trending Categories
Data Structure
Networking
RDBMS
Operating System
Java
iOS
HTML
CSS
Android
Python
C Programming
C++
C#
MongoDB
MySQL
Javascript
PHP
- Selected Reading
- UPSC IAS Exams Notes
- Developer's Best Practices
- Questions and Answers
- Effective Resume Writing
- HR Interview Questions
- Computer Glossary
- Who is Who
Split string into sentences using regex in PHP
Example
function sentence_split($text) { $before_regexes = array('/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd) \.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\. \s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(? :\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp \.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su', '/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su', '/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su', '/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s *f|vs)\.\s))\Z/su', '/(?:(?:\b[Ee]tc\.\s))\Z/su', '/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su', '/(?:(?:\b\p{L}\.))\Z/su', '/(?:(?:\b\p{L}\.\s))\Z/su', '/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su', '/(?:(?:[\"”\']\s*))\Z/su', '/(?:(?:[\.!?…] [\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su', '/(?:(?:[\.!?…] [\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su', '/(?:(?:\s\p{L}[\.!?…]\s))\Z/su'); $after_regexes = array('/\A(?:)/su', '/\A(?:[\p{N}\p{Ll}])/su', '/\A(?:[^\p{Lu}])/su', '/\A(?:[^\p{Lu}]|I)/su', '/\A(?:[^p{Lu}])/su', '/\A(?:\p{Ll})/su', '/\A(?:\p{L}\.)/su', '/\A(?:\p{L}\.\s)/su', '/\A(?:\p{N})/su', '/\A(?:\s*\p{Ll})/su', '/\A(?:)/su', '/\A(?:\p{Lu}[^\p{Lu}])/su', '/\A(?:\p{Lu}\p{Ll})/su'); $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, true, true, true); $count = 13; $sentences = array(); $sentence = ''; $before = ''; $after = substr($text, 0, 10); $text = substr($text, 10); while($text != '') { for($i = 0; $i < $count; $i++) { if(preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) { if($is_sentence_boundary[$i]) { array_push($sentences, $sentence); $sentence = ''; } break; } } $first_from_text = $text[0]; $text = substr($text, 1); $first_from_after = $after[0]; $after = substr($after, 1); $before .= $first_from_after; $sentence .= $first_from_after; $after .= $first_from_text; } if($sentence != '' && $after != '') { array_push($sentences, $sentence.$after); } return $sentences; } $text = "Hello there, hello from Tokyo, Japan, Universe, Earth."; print_r(sentence_split($text));
Output
This will produce the following output −
Array ( [0] => Hello there, hello from Tokyo, Japan, Universe, Earth. )
The text is gradually iterated over. At any point in time, the current chunk of text data would have 2 different parts. In this, one part would be the substring candidate that occurs before and sentence boundary.
The other part is the substring candidate that comes after the sentence boundary. The first 20 regex pairs detect the positions. When sentence boundaries are not identified, the before and after are incremented without saving that new sentence.
If no pairs match, match is attempted with the last 3 pairs, thereby detecting a sentence boundary.
- Related Questions & Answers
- Split string into groups - JavaScript
- Split string into equal parts JavaScript
- PHP program to split a given comma delimited string into an array of values
- Split the sentences by comma and remove surrounding spaces - JavaScript?
- Python – Split Numeric String into K digit integers
- Python program to split string into k distinct partitions
- How to split a String into an array in Kotlin?
- Java regex program to split a string at every space and punctuation.
- Java regex program to split a string with line endings as delimiter
- How to split a string into elements of a string array in C#?
- PHP preg_split to split a string with specific values?
- How to split a Java String into tokens with StringTokenizer?
- How can I use Python regex to split a string by multiple delimiters?
- How to split a string column into multiple columns in R?
- How to split string in MySQL using SUBSTRING_INDEX?
Advertisements