Scraper Package

scraper Module

scraper.call_scrape_func(siteList, db_collection, pool_size, db_auth, db_user, db_pass)

Helper function to iterate over a list of RSS feeds and scrape each.

Parameters:

siteList : Dictionary.

Dictionary of sites, with a nickname as the key and RSS URL as the value.

db_collection : pymongo Collection.

MongoDB collection in which scraped stories are stored.

pool_size : int

Number of worker processes across which to distribute the scraping.

db_auth : String.

MongoDB database that should be used for user authentication.

db_user : String.

Username for MongoDB authentication.

db_pass : String.

Password for MongoDB authentication.

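A minimal usage sketch (the feed URL, database name, and credentials below are placeholders, not part of the package):

    from pymongo import MongoClient
    import scraper

    # Nickname -> RSS URL mapping; the feed shown is only illustrative.
    sites = {'bbc': 'http://feeds.bbci.co.uk/news/world/rss.xml'}

    # MongoDB collection that will hold the scraped stories.
    collection = MongoClient('localhost', 27017)['event_scrape']['stories']

    # Scrape each feed, distributing the work across four worker processes.
    # Empty auth values are assumed here to mean an unsecured local database.
    scraper.call_scrape_func(sites, collection, pool_size=4,
                             db_auth='', db_user='', db_pass='')
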
scraper.get_rss(address, website)

Function to parse an RSS feed and extract the relevant links.

Parameters:

address : String.

Address for the RSS feed to scrape.

website : String.

Nickname for the RSS feed being scraped.

Returns:

results : pattern.web.Results.

Object containing data on the parsed RSS feed. Each item represents a unique entry in the RSS feed and contains relevant information such as the URL and title of the story.

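A short sketch of a direct call to get_rss (the feed URL and nickname are placeholders):

    import scraper

    # Parse the feed; 'bbc' is only the nickname used downstream.
    results = scraper.get_rss('http://feeds.bbci.co.uk/news/world/rss.xml', 'bbc')

    # Each item in the returned object corresponds to one entry in the feed.
    for item in results:
        print('{} - {}'.format(item.title, item.url))
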
scraper.parse_config()

Function to parse the config file.

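The config layout is not described here; the sketch below is a hypothetical reading of an INI-style file with ConfigParser, and every file, section, and option name in it is an assumption:

    try:
        from configparser import ConfigParser   # Python 3
    except ImportError:
        from ConfigParser import ConfigParser   # Python 2

    parser = ConfigParser()
    parser.read('default_config.ini')            # file name is assumed

    # Assumed layout: a [Database] section for Mongo credentials and a
    # [URLS] section mapping site nicknames to RSS feed addresses.
    db_auth = parser.get('Database', 'auth_db')
    db_user = parser.get('Database', 'auth_user')
    db_pass = parser.get('Database', 'auth_pass')
    sites = dict(parser.items('URLS'))
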
scraper.parse_results(rss_results, website, db_collection)

Function to parse the links drawn from an RSS feed.

Parameters:

rss_results : pattern.web.Results.

Object containing data on the parsed RSS feed. Each item represents a unique entry in the RSS feed and contains relevant information such as the URL and title of the story.

website : String.

Nickname for the RSS feed being scraped.

db_collection : pymongo Collection.

Collection within MongoDB in which results are stored.

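A sketch of how parse_results follows get_rss (the feed, nickname, and collection setup are placeholders):

    from pymongo import MongoClient
    import scraper

    collection = MongoClient('localhost', 27017)['event_scrape']['stories']

    # Pull the feed, then hand each entry to parse_results for scraping
    # and storage in the collection.
    rss_results = scraper.get_rss('http://feeds.bbci.co.uk/news/world/rss.xml', 'bbc')
    scraper.parse_results(rss_results, 'bbc', collection)
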
scraper.scrape_func(website, address, COLL, db_auth, db_user, db_pass)

Function to scrape various RSS feeds.

Parameters:

website : String.

Nickname for the RSS feed being scraped.

address : String.

Address for the RSS feed to scrape.

COLL : String.

Collection within MongoDB that holds the scraped data.

db_auth : String.

MongoDB database that should be used for user authentication.

db_user : String.

Username for MongoDB authentication.

db_pass : String.

Password for MongoDB authentication.

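A sketch of a direct call to scrape_func (the credentials and collection name are placeholders; in normal use call_scrape_func dispatches these calls across the worker pool):

    import scraper

    # Scrape one feed into the 'stories' collection, authenticating against
    # the 'admin' database.
    scraper.scrape_func(website='bbc',
                        address='http://feeds.bbci.co.uk/news/world/rss.xml',
                        COLL='stories',
                        db_auth='admin', db_user='scraper_user', db_pass='secret')
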
pages_scrape Module

pages_scrape.scrape(url, extractor)

Function to request and parse a given URL. Returns only the “relevant” text.

Parameters:

url : String.

URL to request and parse.

extractor : Goose class instance.

An instance of Goose that allows for parsing of content.

Returns:

text : String.

Parsed text from the specified website.

meta : String.

Parsed meta description of an article. Usually equivalent to the lede.

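A sketch of calling scrape with a Goose extractor (the URL is a placeholder, and the import path for pages_scrape is assumed from the module name above):

    from goose import Goose
    import pages_scrape

    # A single Goose instance can be reused across many pages.
    extractor = Goose()

    # Returns the article body text and the meta description (usually the lede).
    text, meta = pages_scrape.scrape('http://www.bbc.co.uk/news/world-12345678',
                                     extractor)
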
mongo_connection Module

mongo_connection.add_entry(collection, text, title, url, date, website)

Function that creates the dictionary of content to add to a MongoDB instance, checks whether a given URL is already in the database, and inserts the new content into the database.

Parameters:

collection : pymongo Collection.

Collection within MongoDB in which results are stored.

text : String.

Text from a given webpage.

title : String.

Title of the news story.

url : String.

URL of the webpage from which the content was pulled.

date : String.

Date pulled from the RSS feed.

website : String.

Nickname of the site from which the content was pulled.

Returns:

object_id : String.

ObjectId of the entry stored in the database, returned as a string.
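
A sketch of storing one scraped story (connection details and field values are placeholders, and the import path is assumed from the module name above):

    from pymongo import MongoClient
    import mongo_connection

    collection = MongoClient('localhost', 27017)['event_scrape']['stories']

    # add_entry checks whether the URL is already in the collection before
    # inserting, and returns the ObjectId of the stored document as a string.
    object_id = mongo_connection.add_entry(collection,
                                           text='Full article text...',
                                           title='Example headline',
                                           url='http://www.example.com/story',
                                           date='2014-01-01T00:00:00Z',
                                           website='example')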