How to extract website names from their links in R?


If we have a list of website links and want the website name from each of them, copying the names one by one is time-consuming. It is better to extract them with a function in R. To do this, we can use the suffix_extract function of the urltools package. Because suffix_extract works on host names rather than full URLs, we first pass the links through the domain function of the same package. The result contains the host, subdomain, domain and suffix of each link, and the domain column holds the website names. Note that for links served through a cache, such as the two cdn.ampproject.org links in the example below, the domain is that of the cache (ampproject), not of the original site.

Loading urltools package −

library(urltools)

Website links stored in a vector −

Web_Links <- c(
  "https://www.grammarly.com/grammar-check",
  "https://sceptermarketing.com/comma-separated-lists-of-us-states-abbreviations-select-options-etc/",
  "https://www.tutorialspoint.com/machine_learning/index.htm",
  "https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/sort",
  "https://www-islaah-in.cdn.ampproject.org/v/s/www.islaah.in/masail/13977/?amp=&usqp=mq331AQFKAGwASA%3D&_js_v=0.1#aoh=16016175660203&referrer=https%3A%2F%2Fwww.google.com&_tf=From%20%251%24s&share=https%3A%2F%2Fwww.islaah.in%2Fmasail%2F13977%2F",
  "http://qoitrat.org/Qa/searchtopic.php?Main=76&MainTopc=245",
  "https://theislamicinformation-com.cdn.ampproject.org/v/s/theislamicinformation.com/aqeeqah-for-baby-boy-and-girl/amp/?usqp=mq331AQFKAGwASA%3D&_js_v=0.1#aoh=16015741096047&referrer=https%3A%2F%2Fwww.google.com&_tf=From%20%251%24s&share=https%3A%2F%2Ftheislamicinformation.com%2Faqeeqah-for-baby-boy-and-girl%2F",
  "https://parenting.firstcry.com/articles/50-popular-turkish-baby-names-for-girls/",
  "https://www.amazon.in/SELF-CHEF-Delhi-Aloo-Tikki/dp/B089GW5ZPL/ref=asc_df_B089GW5ZPL/?tag=googleshopmob-21&linkCode=df0&hvadid=397060787211&hvpos=&hvnetw=g&hvrand=3239398407570685332&hvpone=&hvptwo=&hvqmt=&hvdev=m&hvdvcmdl=&hvlocint=&hvlocphy=9040189&hvtargid=pla-923173707999&psc=1&ext_vrnc=hi",
  "http://ridenow.co.in/?From=Bareilly&To=Delhi&submit=",
  "https://www.savaari.com/delhi/delhi-to-bareilly-cabs",
  "https://www.olxgroup.com/search/operations/delhi-ncr/all-brands",
  "https://unbelievable-facts.com/work-with-us",
  "https://www.tataaiginsurance.in/taig/taig/tata_aig/CorporateCustomerPortal/login.jsp",
  "https://www.dummies.com/programming/r/how-to-change-plot-options-in-r/",
  "http://www.sthda.com/english/wiki/add-titles-to-a-plot-in-r-software"
)

Printing the vector of website links −

Web_Links

 [1] "https://www.grammarly.com/grammar-check"
 [2] "https://sceptermarketing.com/comma-separated-lists-of-us-states-abbreviations-select-options-etc/"
 [3] "https://www.tutorialspoint.com/machine_learning/index.htm"
 [4] "https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/sort"
 [5] "https://www-islaah-in.cdn.ampproject.org/v/s/www.islaah.in/masail/13977/?amp=&usqp=mq331AQFKAGwASA%3D&_js_v=0.1#aoh=16016175660203&referrer=https%3A%2F%2Fwww.google.com&_tf=From%20%251%24s&share=https%3A%2F%2Fwww.islaah.in%2Fmasail%2F13977%2F"
 [6] "http://qoitrat.org/Qa/searchtopic.php?Main=76&MainTopc=245"
 [7] "https://theislamicinformation-com.cdn.ampproject.org/v/s/theislamicinformation.com/aqeeqah-for-baby-boy-and-girl/amp/?usqp=mq331AQFKAGwASA%3D&_js_v=0.1#aoh=16015741096047&referrer=https%3A%2F%2Fwww.google.com&_tf=From%20%251%24s&share=https%3A%2F%2Ftheislamicinformation.com%2Faqeeqah-for-baby-boy-and-girl%2F"
 [8] "https://parenting.firstcry.com/articles/50-popular-turkish-baby-names-for-girls/"
 [9] "https://www.amazon.in/SELF-CHEF-Delhi-Aloo-Tikki/dp/B089GW5ZPL/ref=asc_df_B089GW5ZPL/?tag=googleshopmob-21&linkCode=df0&hvadid=397060787211&hvpos=&hvnetw=g&hvrand=3239398407570685332&hvpone=&hvptwo=&hvqmt=&hvdev=m&hvdvcmdl=&hvlocint=&hvlocphy=9040189&hvtargid=pla-923173707999&psc=1&ext_vrnc=hi"
[10] "http://ridenow.co.in/?From=Bareilly&To=Delhi&submit="
[11] "https://www.savaari.com/delhi/delhi-to-bareilly-cabs"
[12] "https://www.olxgroup.com/search/operations/delhi-ncr/all-brands"
[13] "https://unbelievable-facts.com/work-with-us"
[14] "https://www.tataaiginsurance.in/taig/taig/tata_aig/CorporateCustomerPortal/login.jsp"
[15] "https://www.dummies.com/programming/r/how-to-change-plot-options-in-r/"
[16] "http://www.sthda.com/english/wiki/add-titles-to-a-plot-in-r-software"

Extracting website names −

suffix_extract(domain(Web_Links))

   host                                          subdomain                      domain              suffix
 1 www.grammarly.com                             www                            grammarly           com
 2 sceptermarketing.com                          <NA>                           sceptermarketing    com
 3 www.tutorialspoint.com                        www                            tutorialspoint      com
 4 www.rdocumentation.org                        www                            rdocumentation      org
 5 www-islaah-in.cdn.ampproject.org              www-islaah-in.cdn              ampproject          org
 6 qoitrat.org                                   <NA>                           qoitrat             org
 7 theislamicinformation-com.cdn.ampproject.org  theislamicinformation-com.cdn  ampproject          org
 8 parenting.firstcry.com                        parenting                      firstcry            com
 9 www.amazon.in                                 www                            amazon              in
10 ridenow.co.in                                 <NA>                           ridenow             co.in
11 www.savaari.com                               www                            savaari             com
12 www.olxgroup.com                              www                            olxgroup            com
13 unbelievable-facts.com                        <NA>                           unbelievable-facts  com
14 www.tataaiginsurance.in                       www                            tataaiginsurance    in
15 www.dummies.com                               www                            dummies             com
16 www.sthda.com                                 www                            sthda               com
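If only the website names are needed, the domain column can be pulled out of the result as a plain character vector. The following is a minimal sketch, assuming the urltools package is installed; the object names links, parts and site_names are illustrative and do not appear in the example above, and only three of the links are reused to keep it short.

```r
library(urltools)

# Illustrative subset of the links used above (hypothetical object name)
links <- c(
  "https://www.grammarly.com/grammar-check",
  "http://ridenow.co.in/?From=Bareilly&To=Delhi&submit=",
  "https://parenting.firstcry.com/articles/50-popular-turkish-baby-names-for-girls/"
)

# domain() reduces each URL to its host name;
# suffix_extract() then splits each host into subdomain, domain and suffix
parts <- suffix_extract(domain(links))

# Keep only the domain column, i.e. the website names
site_names <- as.character(parts$domain)
site_names
```

The same vector can be attached to a data frame of links, for example with data.frame(link = links, site = site_names), if the names need to stay next to their source URLs.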

Updated on: 16-Oct-2020
