Welcome to OEDA Scraper’s documentation!
This site hosts the documentation for the web scraper used by the Open Event Data Alliance. The scraper works from a whitelist of trusted RSS feed URLs and scrapes the articles linked from those feeds. It uses goose to extract content from arbitrary pages and stores the output in a MongoDB instance.
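The whitelist-then-scrape flow described above can be sketched with the standard library alone. This is only an illustration, not the scraper's actual code: the feed URLs, the sample feed, and the names `TRUSTED_FEEDS`, `is_trusted`, and `article_links` are all hypothetical, and the goose extraction and MongoDB storage steps are left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist of trusted feeds (illustrative URLs only).
TRUSTED_FEEDS = {
    "http://example.com/world/rss.xml",
    "http://example.org/news/feed",
}

# A tiny inline RSS 2.0 document standing in for a downloaded feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Wire</title>
    <item>
      <title>First story</title>
      <link>http://example.com/world/first-story</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/world/second-story</link>
    </item>
  </channel>
</rss>
"""


def is_trusted(feed_url, whitelist=TRUSTED_FEEDS):
    """Return True only for feed URLs that appear in the whitelist."""
    return feed_url in whitelist


def article_links(rss_xml):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    pairs = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if link:  # skip items that carry no article URL
            pairs.append((title, link))
    return pairs


if __name__ == "__main__":
    feed_url = "http://example.com/world/rss.xml"
    if is_trusted(feed_url):
        for title, link in article_links(SAMPLE_FEED):
            print(title, link)
```

In the real scraper, each article link would then be handed to goose for content extraction and the result written to MongoDB.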
You should probably create a virtual environment first, but in either case running pip install -r requirements.txt should install the dependencies. You will probably also have to pass something along the lines of --allow-all-external pattern --allow-unverified pattern, because the pattern library is downloaded from its homepage.
The scraper requires a running MongoDB instance to store the scraped stories. Make sure you have MongoDB installed, and type mongod at the terminal to start the instance if your install method didn’t set up the MongoDB process to run automatically. MongoDB doesn’t require you to prepare the database or collection ahead of time, so running the program should automatically create a database called event_scrape with a collection called stories. Once you’ve run python scraper.py, you can verify from a new terminal window that the stories are in the database.
To interface with Mongo, enter mongo at the command line. From inside Mongo, type show dbs to verify that there’s a database called event_scrape. Enter the database with use event_scrape and type show collections to make sure there’s a stories collection. db.stories.find() will show you the first 20 entries.
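The same check can be done from Python instead of the Mongo shell. The sketch below assumes the third-party pymongo package is installed and that mongod is listening on the default localhost:27017; the helper names `connect_stories` and `preview` are hypothetical and not part of the scraper.

```python
def connect_stories(host="localhost", port=27017):
    """Return the event_scrape.stories collection from a running mongod."""
    # pymongo is a third-party package (an assumption here); it is imported
    # lazily so that preview() below can be used without it.
    from pymongo import MongoClient

    client = MongoClient(host, port)
    return client["event_scrape"]["stories"]


def preview(collection, limit=20):
    """Return up to `limit` documents, similar to the shell's first page
    of db.stories.find()."""
    return list(collection.find().limit(limit))
```

With MongoDB running, `preview(connect_stories(), limit=3)` would return the first three stored stories, and `connect_stories().count_documents({})` would report how many stories have been stored.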
After everything is installed, running the scraper is as simple as python scraper.py. This assumes that you want to use the configuration in the default_config.ini file; if not, edit that file. In the source type section of the config, the three available source types are wire, international, and local. Any combination of these can be specified, with the source types separated by commas. For more information on the source types, see the Contributing page.
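To make the comma-separated source types concrete, here is a minimal sketch of reading such a setting with Python's configparser. The section name `Sources` and option name `source_types` are assumptions for illustration; check default_config.ini for the names the scraper actually uses.

```python
import configparser

# The three source types named in the documentation.
VALID_TYPES = {"wire", "international", "local"}

# Illustrative config text; the section and option names are assumptions.
SAMPLE_CONFIG = """
[Sources]
source_types = wire, international
"""


def parse_source_types(config_text, section="Sources", option="source_types"):
    """Return the comma-separated source types as a validated list."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    raw = parser.get(section, option)
    # Split on commas and tolerate stray whitespace or mixed case.
    types = [t.strip().lower() for t in raw.split(",") if t.strip()]
    unknown = [t for t in types if t not in VALID_TYPES]
    if unknown:
        raise ValueError("unknown source type(s): " + ", ".join(unknown))
    return types


if __name__ == "__main__":
    print(parse_source_types(SAMPLE_CONFIG))  # ['wire', 'international']
```

Validating against the known types up front means a typo in the config fails immediately rather than silently scraping nothing.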