Scraper Package
scraper Module
- scraper.call_scrape_func(siteList, db_collection, pool_size, db_auth, db_user, db_pass)
Helper function to iterate over a list of RSS feeds and scrape each.
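Parameters: siteList : dictionary
Dictionary of sites, with a nickname as the key and the RSS URL as the value.
db_collection : collection
MongoDB collection in which to store the scraped stories.
pool_size : int
Number of processes across which to distribute the work.
db_auth : String.
MongoDB database that should be used for user authentication.
db_user : String.
Username for MongoDB authentication.
db_pass : String.
Password for MongoDB authentication.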
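To illustrate the call pattern described above, the following sketch fans a siteList-style dictionary out over a pool of workers. It is not the package's implementation: `fake_scrape` and `call_scrape_sketch` are hypothetical names, the feed URLs are placeholders, and a standard-library thread pool is used here in place of a process pool so that the example runs without any pickling concerns.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical feed list: nickname -> RSS URL (placeholder addresses).
site_list = {
    "example_a": "http://example.com/a/rss.xml",
    "example_b": "http://example.com/b/rss.xml",
}


def fake_scrape(website, address):
    """Hypothetical stand-in for the per-feed scraping step."""
    return "{}|{}".format(website, address)


def call_scrape_sketch(site_list, pool_size):
    """Run fake_scrape once per feed, using at most pool_size workers."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = [
            pool.submit(fake_scrape, website, address)
            for website, address in site_list.items()
        ]
        # Sort so the output order does not depend on scheduling.
        return sorted(f.result() for f in futures)


results = call_scrape_sketch(site_list, pool_size=2)
```

Each key/value pair becomes one unit of work, so `pool_size` only caps how many feeds are handled at the same time.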
- scraper.get_rss(address, website)
Function to parse an RSS feed and extract the relevant links.
Parameters: address : String.
Address for the RSS feed to scrape.
website : String.
Nickname for the RSS feed being scraped.
Returns: results : pattern.web.Results.
Object containing data on the parsed RSS feed. Each item represents a unique entry in the RSS feed and contains relevant information such as the URL and title of the story.
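The shape of the returned data (one item per feed entry, each carrying a URL and a title) can be sketched with the standard library alone. This is only an illustration: it parses a small inline RSS document with `xml.etree.ElementTree` rather than using pattern.web, and `parse_feed_sketch`, `SAMPLE_FEED`, and the dictionary keys are hypothetical names chosen for the example.

```python
import xml.etree.ElementTree as ET

# Tiny inline RSS 2.0 document used instead of a network request.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <link>http://example.com/story-1</link>
      <pubDate>Mon, 06 Jan 2014 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/story-2</link>
      <pubDate>Mon, 06 Jan 2014 11:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>
"""


def parse_feed_sketch(xml_text, website):
    """Return one dict per <item>, tagged with the feed's nickname."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        entries.append({
            "website": website,
            "title": item.findtext("title"),
            "url": item.findtext("link"),
            "date": item.findtext("pubDate"),
        })
    return entries


entries = parse_feed_sketch(SAMPLE_FEED, "example")
```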
- scraper.parse_config()
Function to parse the config file.
- scraper.parse_results(rss_results, website, db_collection)
Function to parse the links drawn from an RSS feed.
Parameters: rss_results : pattern.web.Results.
Object containing data on the parsed RSS feed. Each item represents a unique entry in the RSS feed and contains relevant information such as the URL and title of the story.
website : String.
Nickname for the RSS feed being scraped.
db_collection : pymongo Collection.
Collection within MongoDB in which results are stored.
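One plausible reading of this step, given the other signatures on this page, is a loop that takes each feed entry, fetches the page text, and hands the fields on for storage. The sketch below works under that assumption only: `Result`, `fake_page_text`, `parse_results_sketch`, and the `store` callback are all hypothetical stand-ins, and nothing here touches the network or a database.

```python
from collections import namedtuple

# Hypothetical stand-in for one entry of the parsed RSS results.
Result = namedtuple("Result", ["url", "title", "date"])


def fake_page_text(url):
    """Hypothetical stand-in for fetching and cleaning a page."""
    return "text of " + url


def parse_results_sketch(rss_results, website, store):
    """Fetch text for each feed entry and pass the fields to `store`."""
    for result in rss_results:
        text = fake_page_text(result.url)
        store(text, result.title, result.url, result.date, website)


stored = []
rss_results = [
    Result("http://example.com/story-1", "First story", "2014-01-06"),
    Result("http://example.com/story-2", "Second story", "2014-01-06"),
]
# The callback simply records what it was given.
parse_results_sketch(
    rss_results,
    "example",
    lambda *fields: stored.append(fields),
)
```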
- scraper.scrape_func(website, address, COLL, db_auth, db_user, db_pass)
Function to scrape various RSS feeds.
Parameters: website : String
Nickname for the RSS feed being scraped.
address : String
Address for the RSS feed to scrape.
COLL : String
Collection within MongoDB that holds the scraped data.
db_auth : String.
MongoDB database that should be used for user authentication.
db_user : String.
Username for MongoDB authentication.
db_pass : String.
Password for MongoDB authentication.
pages_scrape Module
- pages_scrape.scrape(url, extractor)
Function to request and parse a given URL. Returns only the “relevant” text.
Parameters: url : String.
URL to request and parse.
extractor : Goose class instance.
An instance of Goose that allows for parsing of content.
Returns: text : String.
Parsed text from the specified website.
meta : String.
Parsed meta description of an article. Usually equivalent to the lede.
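The two return values can be illustrated with a stub extractor, so that the example runs without Goose or network access. `FakeExtractor`, `FakeArticle`, and `scrape_sketch` are hypothetical; the `extract(url=...)` call and the `cleaned_text` and `meta_description` attribute names are assumed here for illustration and should be checked against the Goose version in use.

```python
class FakeArticle(object):
    """Hypothetical parsed-article object with two text attributes."""

    def __init__(self, cleaned_text, meta_description):
        self.cleaned_text = cleaned_text
        self.meta_description = meta_description


class FakeExtractor(object):
    """Hypothetical stand-in for the extractor; no network access."""

    def extract(self, url):
        return FakeArticle(
            cleaned_text="Body text for " + url,
            meta_description="Summary for " + url,
        )


def scrape_sketch(url, extractor):
    """Return (text, meta) for a URL using the supplied extractor."""
    article = extractor.extract(url=url)
    return article.cleaned_text, article.meta_description


text, meta = scrape_sketch("http://example.com/story-1", FakeExtractor())
```

Passing the extractor in as an argument, as the signature above does, lets a single instance be reused across many URLs.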
mongo_connection Module
- mongo_connection.add_entry(collection, text, title, url, date, website)
Function that creates the dictionary of content to add to a MongoDB instance, checks whether a given URL is already in the database, and inserts the new content into the database.
Parameters: collection : pymongo Collection.
Collection within MongoDB in which results are stored.
text : String.
Text from a given webpage.
title : String.
Title of the news story.
url : String.
URL of the webpage from which the content was pulled.
date : String.
Date pulled from the RSS feed.
website : String.
Nickname of the site from which the content was pulled.
Returns: object_id : String
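The build, check, and insert sequence described above can be sketched against an in-memory stand-in for a collection. Everything here is an assumption made for illustration: `FakeCollection` is hypothetical and only mimics `find_one` and `insert_one` by name (its `insert_one` returns the id directly, unlike pymongo's), the stored field names are invented, and returning `None` for a duplicate URL is a guess at the behaviour rather than something the documentation states.

```python
import itertools


class FakeCollection(object):
    """In-memory stand-in for a database collection (hypothetical)."""

    def __init__(self):
        self._docs = []
        self._ids = itertools.count(1)

    def find_one(self, query):
        """Return the first stored document matching every key in query."""
        for doc in self._docs:
            if all(doc.get(k) == v for k, v in query.items()):
                return doc
        return None

    def insert_one(self, doc):
        """Store the document and return its newly assigned id."""
        doc["_id"] = "id-{}".format(next(self._ids))
        self._docs.append(doc)
        return doc["_id"]


def add_entry_sketch(collection, text, title, url, date, website):
    """Insert a story unless its URL is already stored.

    Returns the new id, or None if the URL was already present
    (an assumption made for this sketch).
    """
    if collection.find_one({"url": url}) is not None:
        return None
    entry = {
        "url": url,
        "title": title,
        "content": text,     # hypothetical field name
        "date": date,
        "source": website,   # hypothetical field name
    }
    return collection.insert_one(entry)


coll = FakeCollection()
first = add_entry_sketch(coll, "Body", "First story",
                         "http://example.com/story-1", "2014-01-06", "example")
second = add_entry_sketch(coll, "Body", "First story",
                          "http://example.com/story-1", "2014-01-06", "example")
```

The second call finds the URL already stored and skips the insert, so the collection holds a single document.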