We can extract content in web pages from a variety of domains such as data mining, information retrieval etc. To extract information from the websites of newspapers and magazines we are going to use newspaper library.
The main purpose of this library is to extract and curates the articles from the newspapers and similar websites.
To Newspaper library installation, run in your terminal:
$ pip install newspaper3k
For lxml dependencies, run below command in your terminal
$pip install lxml
To install PIL, run
$pip install Pillow
The NLP corpora will be downloaded:
$ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python
The python newpaper library is used to collect information associated with articles. This includes author name, major images in the article, publication dates, video present in the article, key words describing the article and the summary of the article.
#Import required library from newspaper import Article # url link-which you want to extract url = "https://www.wsj.com/articles/lawmakers-to-resume-stalled-border-security-talks-11549901117" # Download the article >>> from newspaper import Article >>> url = "https://www.wsj.com/articles/lawmakers-to-resume-stalled-border-security-talks-11549901117" >>> article = Article(url) >>> article.download() # Parse the article and fetch authors name >>> article.parse() >>> print(article.authors)
['Kristina Peterson', 'Andrew Duehren', 'Natalie Andrews', 'Kristina.Peterson Wsj.Com', 'Andrew.Duehren Wsj.Com', 'Natalie.Andrews Wsj.Com'] # Extract Publication date >>> print("Article Publication Date:") >>> print(article.publish_date) # Extract URL of the major images >>> print(article.top_image)
https://images.wsj.net/im-51122/social # Extract keywords using NLP print ("Keywords in the article", article.keywords) # Extract summary of the article print("Article Summary", article.summary)
Below is the complete program:
from newspaper import Article url = "https://www.wsj.com/articles/lawmakers-to-resume-stalled-border-security-talks-11549901117" article = Article(url) article.download() article.parse() print(article.authors) print("Article Publication Date:") print(article.publish_date) print("Major Image in the article:") print(article.top_image) article.nlp() print ("Keywords in the article") print(article.keywords) print("Article Summary") print(article.summary)
['Kristina Peterson', 'Andrew Duehren', 'Natalie Andrews', 'Kristina.Peterson Wsj.Com', 'Andrew.Duehren Wsj.Com', 'Natalie.Andrews Wsj.Com'] Article Publication Date: None Major Image in the article: https://images.wsj.net/im-51122/social Keywords in the article ['state', 'spending', 'sweeping', 'southern', 'security', 'border', 'principle', 'lawmakers', 'avoid', 'shutdown', 'reach', 'weekendthe', 'fund', 'trump', 'union', 'agreement', 'wall'] Article Summary President Trump made the case in his State of the Union address for the construction of a wall along the southern U.S. border, calling it a “moral issue." Photo: GettyWASHINGTON—Senior lawmakers said Monday night they had reached an agreement in principle on a sweeping deal to end a monthslong fight over border security and avoid a partial government shutdown this weekend. The top four lawmakers on the House and Senate Appropriations Committees emerged after three closed-door meetings Monday and announced that they had agreed to a framework for all seven spending bills whose funding expires at 12:01 a.m. Saturday.