Python Newspaper module for article scraping and curation
Extracting content from web pages is useful in a variety of domains, such as data mining and information retrieval. To extract information from the websites of newspapers and magazines, we are going to use the newspaper library.
The main purpose of this library is to extract and curate articles from newspapers and similar websites.
To install the newspaper library, run the following in your terminal:
$ pip install newspaper3k
To install the lxml dependency, run the command below in your terminal:
$ pip install lxml
To install PIL (provided by the Pillow package), run:
$ pip install Pillow
To download the NLP corpora, run:
$ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python
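After running the commands above, it can help to confirm that the packages are importable before going further. The following is a minimal sketch, not part of the newspaper library: the helper name missing_modules and the list REQUIRED_MODULES are made up for illustration. It relies on newspaper3k being imported as newspaper and Pillow being imported as PIL.

```python
import importlib.util

# Import names for the packages installed above: newspaper3k provides the
# module "newspaper", and Pillow provides the module "PIL".
REQUIRED_MODULES = ["newspaper", "lxml", "PIL"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```

If any module is reported as missing, rerun the matching pip command from the steps above.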
The Python newspaper library is used to collect information associated with articles. This includes the author names, the major images in the article, the publication date, any videos present in the article, keywords describing the article, and a summary of the article.
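One way to work with these fields is to gather them into a single dictionary. The sketch below is illustrative only: article_to_dict and fetch_article are hypothetical helper names, and the sample object holds made-up values so the example runs without network access. It assumes the attribute names authors, publish_date, top_image, movies, keywords and summary on a parsed Article object.

```python
from types import SimpleNamespace


def article_to_dict(article):
    """Collect the fields described above from a parsed article object."""
    return {
        "authors": list(article.authors),
        "publish_date": article.publish_date,  # may be None if not detected
        "top_image": article.top_image,
        "movies": list(article.movies),
        "keywords": list(article.keywords),
        "summary": article.summary,
    }


def fetch_article(url):
    """Download, parse, and run NLP on the article at `url`."""
    # Imported here so article_to_dict can be tried without the library.
    from newspaper import Article

    article = Article(url)
    article.download()
    article.parse()
    article.nlp()  # fills in keywords and summary
    return article


# Stand-in object with made-up values, used instead of a real download.
sample = SimpleNamespace(
    authors=["Jane Doe"],
    publish_date=None,
    top_image="https://example.com/image.jpg",
    movies=[],
    keywords=["border", "security"],
    summary="A short summary.",
)
info = article_to_dict(sample)
print(info["authors"])
```

With the library installed, article_to_dict(fetch_article(url)) would produce the same structure for a real article.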
# Import the required library
>>> from newspaper import Article

# URL of the article you want to extract
>>> url = "https://www.wsj.com/articles/lawmakers-to-resume-stalled-border-security-talks-11549901117"

# Download the article
>>> article = Article(url)
>>> article.download()

# Parse the article and fetch the authors' names
>>> article.parse()
>>> print(article.authors)
['Kristina Peterson', 'Andrew Duehren', 'Natalie Andrews', 'Kristina.Peterson Wsj.Com', 'Andrew.Duehren Wsj.Com', 'Natalie.Andrews Wsj.Com']

# Extract the publication date
>>> print("Article Publication Date:")
>>> print(article.publish_date)

# Extract the URL of the major image
>>> print(article.top_image)
https://images.wsj.net/im-51122/social

# Run NLP first; keywords and summary are only filled in after this call
>>> article.nlp()

# Extract keywords using NLP
>>> print("Keywords in the article", article.keywords)

# Extract the summary of the article
>>> print("Article Summary", article.summary)
Below is the complete program:
from newspaper import Article

url = "https://www.wsj.com/articles/lawmakers-to-resume-stalled-border-security-talks-11549901117"
article = Article(url)
article.download()
article.parse()
print(article.authors)
print("Article Publication Date:")
print(article.publish_date)
print("Major Image in the article:")
print(article.top_image)
article.nlp()
print("Keywords in the article")
print(article.keywords)
print("Article Summary")
print(article.summary)
Output:
['Kristina Peterson', 'Andrew Duehren', 'Natalie Andrews', 'Kristina.Peterson Wsj.Com', 'Andrew.Duehren Wsj.Com', 'Natalie.Andrews Wsj.Com']
Article Publication Date:
None
Major Image in the article:
https://images.wsj.net/im-51122/social
Keywords in the article
['state', 'spending', 'sweeping', 'southern', 'security', 'border', 'principle', 'lawmakers', 'avoid', 'shutdown', 'reach', 'weekendthe', 'fund', 'trump', 'union', 'agreement', 'wall']
Article Summary
President Trump made the case in his State of the Union address for the construction of a wall along the southern U.S. border, calling it a “moral issue."
Photo: GettyWASHINGTON—Senior lawmakers said Monday night they had reached an agreement in principle on a sweeping deal to end a monthslong fight over border security and avoid a partial government shutdown this weekend.
The top four lawmakers on the House and Senate Appropriations Committees emerged after three closed-door meetings Monday and announced that they had agreed to a framework for all seven spending bills whose funding expires at 12:01 a.m. Saturday.
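The authors list in this output contains entries such as 'Kristina.Peterson Wsj.Com', which look like mangled email addresses rather than names. The following is a rough sketch of one way to filter them out. The helpers looks_like_name and clean_authors are hypothetical and not part of the newspaper library, and the heuristic (a dot in the middle of a word marks a non-name) is an assumption that may not hold for every site.

```python
def looks_like_name(entry):
    """Heuristic: reject entries where any word has a dot in the middle."""
    # "Wsj.Com" has a dot inside the word, so it is rejected; a trailing
    # dot such as the initial "F." is stripped first and therefore allowed.
    return all("." not in word.rstrip(".") for word in entry.split())


def clean_authors(authors):
    """Keep entries that look like names, dropping duplicates in order."""
    seen = set()
    cleaned = []
    for entry in authors:
        if looks_like_name(entry) and entry not in seen:
            seen.add(entry)
            cleaned.append(entry)
    return cleaned


# The authors list printed by the program above.
raw_authors = [
    "Kristina Peterson", "Andrew Duehren", "Natalie Andrews",
    "Kristina.Peterson Wsj.Com", "Andrew.Duehren Wsj.Com",
    "Natalie.Andrews Wsj.Com",
]
print(clean_authors(raw_authors))
# prints ['Kristina Peterson', 'Andrew Duehren', 'Natalie Andrews']
```

The publication date in the same output is None, so code that uses article.publish_date should check for None before formatting it.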