Extract string vector elements up to a fixed number of characters in R.


To extract the elements of a string vector up to a fixed number of characters in R, we can use the substring function from base R.

For example, if we have a vector of strings, say X, that contains 100 string values and we want the first five characters of each value, we can use the command given below:

substring(X,1,5)
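
A minimal sketch of what this call returns, using a small hypothetical vector X (the values are illustrative and do not come from the examples in this article). It assumes the usual base R behaviour: an element shorter than the requested length is returned whole, and a missing value stays missing.

```r
# Hypothetical sample vector, used only for illustration
X <- c("Alabama", "Guam", "Iowa", NA)

# Keep characters 1 through 5 of every element
first_five <- substring(X, 1, 5)
print(first_five)

# Elements shorter than five characters ("Guam", "Iowa") are returned whole,
# and the NA element remains NA.
```

Because substring is vectorised, no loop is needed: the same start and end positions are applied to every element of X.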

Example 1

The following snippet creates a sample character vector:

x1<-c("Alabama", "Alaska", "American Samoa", "Arizona", "Arkansas",
"California", "Colorado", "Connecticut", "Delaware", "District of Columbia",
"Florida", "Georgia", "Guam", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa",
"Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts",
"Michigan", "Minnesota", "Minor Outlying Islands", "Mississippi", "Missouri",
"Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico",
"New York", "North Carolina", "North Dakota", "Northern Mariana Islands",
"Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Puerto Rico", "Rhode Island",
"South Carolina", "South Dakota", "Tennessee", "Texas", "U.S. Virgin Islands",
"Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin",
"Wyoming")
x1

The following vector is created:

[1] "Alabama"                   "Alaska"
[3] "American Samoa"            "Arizona"
[5] "Arkansas"                  "California"
[7] "Colorado"                  "Connecticut"
[9] "Delaware"                  "District of Columbia"
[11] "Florida"                  "Georgia"
[13] "Guam"                     "Hawaii"
[15] "Idaho"                    "Illinois"
[17] "Indiana"                  "Iowa"
[19] "Kansas"                   "Kentucky"
[21] "Louisiana"                "Maine"
[23] "Maryland"                 "Massachusetts"
[25] "Michigan"                 "Minnesota"
[27] "Minor Outlying Islands"   "Mississippi"
[29] "Missouri"                 "Montana"
[31] "Nebraska"                 "Nevada"
[33] "New Hampshire"            "New Jersey"
[35] "New Mexico"               "New York"
[37] "North Carolina"           "North Dakota"
[39] "Northern Mariana Islands" "Ohio"
[41] "Oklahoma"                 "Oregon"
[43] "Pennsylvania"             "Puerto Rico"
[45] "Rhode Island"             "South Carolina"
[47] "South Dakota"             "Tennessee"
[49] "Texas"                    "U.S. Virgin Islands"
[51] "Utah"                     "Vermont"
[53] "Virginia"                 "Washington"
[55] "West Virginia"            "Wisconsin"
[57] "Wyoming"

To find the first two characters of each value in x1, add the following code to the above snippet:

x1<-c("Alabama", "Alaska", "American Samoa", "Arizona", "Arkansas",
"California", "Colorado", "Connecticut", "Delaware", "District of Columbia",
"Florida", "Georgia", "Guam", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa",
"Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts",
"Michigan", "Minnesota", "Minor Outlying Islands", "Mississippi", "Missouri",
"Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico",
"New York", "North Carolina", "North Dakota", "Northern Mariana Islands",
"Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Puerto Rico", "Rhode Island",
"South Carolina", "South Dakota", "Tennessee", "Texas", "U.S. Virgin Islands",
"Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin",
"Wyoming")
substring(x1,1,2)

Output

Running the snippets above as a single program produces the following output:

[1]  "Al" "Al" "Am" "Ar" "Ar" "Ca" "Co" "Co" "De" "Di" "Fl" "Ge" "Gu" "Ha" "Id"
[16] "Il" "In" "Io" "Ka" "Ke" "Lo" "Ma" "Ma" "Ma" "Mi" "Mi" "Mi" "Mi" "Mi" "Mo"
[31] "Ne" "Ne" "Ne" "Ne" "Ne" "Ne" "No" "No" "No" "Oh" "Ok" "Or" "Pe" "Pu" "Rh"
[46] "So" "So" "Te" "Te" "U." "Ut" "Ve" "Vi" "Wa" "We" "Wi" "Wy"
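
As a hedged aside, base R offers two other functions that give the same result for plain ASCII strings like these: substr, and strtrim, which trims to a display width rather than a character count. The sketch below uses a short hypothetical vector named states to compare the three. For text with wide or multi-byte characters, strtrim may behave differently, so the equivalence is an assumption that holds only for simple input.

```r
# Hypothetical three-element vector, used only for illustration
states <- c("Alabama", "Alaska", "U.S. Virgin Islands")

# Three base R ways to keep the first two characters
by_substring <- substring(states, 1, 2)
by_substr    <- substr(states, 1, 2)
by_strtrim   <- strtrim(states, 2)

print(by_substring)
print(by_substr)
print(by_strtrim)
```

For this kind of prefix extraction the choice between substring and substr is mostly a matter of taste; substring differs mainly in that its end position is optional.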

Example 2

The following snippet creates a sample character vector:

x2<-c("Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czechia",
"Denmark", "Estonia", "Finland", "France", "Germany", "Greece", "Hungary",
"Ireland", "Italy", "Latvia", "Lithuania", "Luxembourg", "Malta",
"Netherlands", "Poland", "Portugal", "Romania", "Slovakia", "Slovenia",
"Spain", "Sweden")
x2

The following vector is created:

[1]  "Austria" "Belgium"   "Bulgaria"   "Croatia"  "Cyprus"
[6]  "Czechia" "Denmark"   "Estonia"    "Finland"  "France"
[11] "Germany" "Greece"    "Hungary"    "Ireland"  "Italy"
[16] "Latvia"  "Lithuania" "Luxembourg" "Malta"    "Netherlands"
[21] "Poland"  "Portugal"  "Romania"    "Slovakia" "Slovenia"
[26] "Spain"   "Sweden"

To find the first two characters of each value in x2, add the following code to the above snippet:

x2<-c("Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czechia",
"Denmark", "Estonia", "Finland", "France", "Germany", "Greece", "Hungary",
"Ireland", "Italy", "Latvia", "Lithuania", "Luxembourg", "Malta",
"Netherlands", "Poland", "Portugal", "Romania", "Slovakia", "Slovenia",
"Spain", "Sweden")
substring(x2,1,2)

Output

Running the snippets above as a single program produces the following output:

[1]  "Au" "Be" "Bu" "Cr" "Cy" "Cz" "De" "Es" "Fi" "Fr" "Ge" "Gr" "Hu" "Ir" "It"
[16] "La" "Li" "Lu" "Ma" "Ne" "Po" "Po" "Ro" "Sl" "Sl" "Sp" "Sw"

Example 3

The following snippet creates a sample character vector:

x3<-c("Cuba", "Cyprus", "Czech Republic", "Djibouti", "Dominica",
"Dominican Republic", "East Timor", "Ecuador", "Egypt", "El Salvador",
"Equatorial Guinea", "Eritrea", "Estonia", "Ethiopia", "Fiji", "Finland",
"France", "Metropolitan", "French Guiana", "Gambia", "Georgia", "Germany",
"Ghana", "Greenland", "Grenada", "Guatemala", "Honduras", "Hong Kong",
"Hungary", "Iceland", "India", "Indonesia", "Iran", "Iraq", "Ireland",
"Israel", "Italy", "Jamaica", "Japan", "Jordan", "Kazakhstan", "Kenya",
"Mozambique", "Namibia", "Nepal", "Netherlands", "Nigeria", "Norway", "Oman",
"Paraguay", "Peru", "Philippines")
x3

The following vector is created:

[1]  "Cuba"          "Cyprus"            "Czech Republic"
[4]  "Djibouti"      "Dominica"          "Dominican Republic"
[7]  "East Timor"    "Ecuador"           "Egypt"
[10] "El Salvador"   "Equatorial Guinea" "Eritrea"
[13] "Estonia"       "Ethiopia"          "Fiji"
[16] "Finland"       "France"            "Metropolitan"
[19] "French Guiana" "Gambia"            "Georgia"
[22] "Germany"       "Ghana"             "Greenland"
[25] "Grenada"       "Guatemala"         "Honduras"
[28] "Hong Kong"     "Hungary"           "Iceland"
[31] "India"         "Indonesia"         "Iran"
[34] "Iraq"          "Ireland"           "Israel"
[37] "Italy"         "Jamaica"           "Japan"
[40] "Jordan"        "Kazakhstan"        "Kenya"
[43] "Mozambique"    "Namibia"           "Nepal"
[46] "Netherlands"   "Nigeria"           "Norway"
[49] "Oman"          "Paraguay"          "Peru"
[52] "Philippines"

To find the first two characters of each value in x3, add the following code to the above snippet:

x3<-c("Cuba", "Cyprus", "Czech Republic", "Djibouti", "Dominica",
"Dominican Republic", "East Timor", "Ecuador", "Egypt", "El Salvador",
"Equatorial Guinea", "Eritrea", "Estonia", "Ethiopia", "Fiji", "Finland",
"France", "Metropolitan", "French Guiana", "Gambia", "Georgia", "Germany",
"Ghana", "Greenland", "Grenada", "Guatemala", "Honduras", "Hong Kong",
"Hungary", "Iceland", "India", "Indonesia", "Iran", "Iraq", "Ireland",
"Israel", "Italy", "Jamaica", "Japan", "Jordan", "Kazakhstan", "Kenya",
"Mozambique", "Namibia", "Nepal", "Netherlands", "Nigeria", "Norway", "Oman",
"Paraguay", "Peru", "Philippines")
substring(x3,1,2)

Output

Running the snippets above as a single program produces the following output:

[1]  "Cu" "Cy" "Cz" "Dj" "Do" "Do" "Ea" "Ec" "Eg" "El" "Eq" "Er" "Es" "Et" "Fi"
[16] "Fi" "Fr" "Me" "Fr" "Ga" "Ge" "Ge" "Gh" "Gr" "Gr" "Gu" "Ho" "Ho" "Hu" "Ic"
[31] "In" "In" "Ir" "Ir" "Ir" "Is" "It" "Ja" "Ja" "Jo" "Ka" "Ke" "Mo" "Na" "Ne"
[46] "Ne" "Ni" "No" "Om" "Pa" "Pe" "Ph"
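
The examples here operate on plain character vectors, but the same call can be applied to a column of a data frame. The sketch below is one possible way to do that: the data frame countries, its columns name and prefix, and the helper first_n_chars are all hypothetical names introduced for illustration and are not part of base R.

```r
# Hypothetical data frame; the column names are illustrative
countries <- data.frame(
  name = c("Cuba", "Cyprus", "Czech Republic"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first n characters of each element of x
first_n_chars <- function(x, n) {
  # n must be a single number of at least 1
  stopifnot(is.numeric(n), length(n) == 1, n >= 1)
  # as.character() guards against factor columns
  substring(as.character(x), 1, n)
}

# Store the two-character prefix in a new column
countries$prefix <- first_n_chars(countries$name, 2)
print(countries)
```

Wrapping the call in a small function keeps the number of characters in one place, which helps when the same cut-off is reused across several columns.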

Updated on: 02-Nov-2021
