MP Scrape
MP Scrape is aimed at obtaining up-to-date information about MPs all over Europe.
Why MP Scrape Matters
Born out of necessity within the FSFE, MP Scrape is a tool that automates the collection of up-to-date information about Members of Parliament (MPs) across Europe.
MP Scrape supports the FSFE in advancing digital freedom by providing transparent access to information about policymakers across Europe.
The fight for Free Software and digital freedom is ongoing. Support us to come closer to this goal.
Free Software
MP Scrape is Free Software, meaning anyone can use, share, study, and improve it.
Up-to-date information
MP Scrape allows you to collect up-to-date information for Members of Parliament across Europe.
Getting started
The best way to contribute to MP Scrape is by creating Sources for your local, regional, and national parliaments. Check the documentation to learn how to create your own Sources.
Otherwise, you can still support us: the fight for Free Software and digital freedom is ongoing, and your support brings us closer to this goal.
Using MP Scrape
If you are comfortable with the command line, the best option is the MP Scrape CLI. It allows you to define Workflows that scrape a set of Sources.
If you need collaborative access to the data, the MP Scrape UI is the better option. It makes modifying your Workflows less convenient, but it allows your team to access and retrieve the up-to-date information.
Using the MP Scrape CLI
Once all required dependencies are installed, install the MP Scrape CLI. The preferred method of distribution is PyPI:
python3 -m pip install mp_scrape_cli
Now that you have MP Scrape CLI installed, you will need to create your first Workflow. In a nutshell, Workflows describe: what content to scrape from where, what to do with the content, and how to export the content.
A simple Workflow looks like this:
# Scrape data from the European Parliament
[workflow]
sources = ["euparl"]
processes = []
consumers = ["csv"]

[sources.euparl]
module = "mp_scrape_source_euparl"
retrieve_emails = false
retrieve_committees = false

[consumers.csv]
module = "mp_scrape_export_csv"
dest = "result.csv"
This example Workflow scrapes basic information from all Members of the European Parliament and outputs a CSV file that you can use with any spreadsheet software of your choice.
The next step is to run the Workflow! Copy the example Workflow and paste it into a file of your choice, which we will call workflow.toml, then run the following command:
python3 -m mp_scrape_cli -w workflow.toml -l DEBUG
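Once the Workflow finishes, result.csv contains the scraped data. Any spreadsheet software can open it; as a quick sketch, you can also inspect it with pandas:

import pandas as pd

# Load the CSV produced by the Workflow and print the first rows
data = pd.read_csv("result.csv")
print(data.head())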
Glossary
Workflows
Workflows are the core of MP Scrape CLI and MP Scrape UI.
Workflows allow you to define: what data to scrape from where, how to transform that data, and how to export that data.
Workflows are made of three parts: Sources, Processes, and Consumers.
Sources
Sources are responsible for fetching raw data from various places. These could be APIs, databases, websites, or any other data provider. Sources can be configured using arguments to customize their behavior.
Information for developers
Sources (also known as Data Sources) are implementations of the DataSource abstract class. Each DataSource implementation must:
- Define metadata by implementing the metadata static method, which returns a ModuleDefinition describing the details of the source.
- Implement an async fetch_data method, which retrieves the data and returns a pandas.DataFrame. This dataframe represents the raw structured data fetched by the source.
from mp_scrape_core import DataSource, ModuleDescription, ModuleArgument, ModuleDefinition, ModuleMaintainer

import pandas as pd
import logging

class AcmeIncSource(DataSource):
    def __init__(self, retrieve_emails: bool = True):
        """Retrieve information from Acme Inc.

        :param bool retrieve_emails: (List emails) When enabled, e-mails will be retrieved
        """
        self.retrieve_emails = retrieve_emails

    @staticmethod
    def metadata() -> ModuleDefinition:
        return ModuleDefinition({
            "name": "Acme Inc.",
            "identifier": "acmeinc",
            # You can generate the description and arguments from the docstring in __init__
            "description": ModuleDescription.from_init(AcmeIncSource.__init__),
            "arguments": ModuleArgument.list_from_init(AcmeIncSource.__init__),
            "maintainers": [
                ModuleMaintainer({
                    "name": "Jane Doe",
                    "email": "jane@example.com",
                }),
            ],
        })

    async def fetch_data(self, logger: logging.Logger):
        logger.info("Fetching data from Acme Inc. API")

        # Data fetching magic happens here!
        return pd.DataFrame(...)
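MP Scrape normally instantiates Sources and calls fetch_data for you as part of a Workflow. For development, a hypothetical stand-alone test of the example class above could look like this (a sketch only, not part of the MP Scrape API; it assumes fetch_data returns a real DataFrame instead of the placeholder):

import asyncio
import logging

# Hypothetical manual run of the example source above; the MP Scrape
# engine normally drives fetch_data as part of a Workflow.
logging.basicConfig(level=logging.DEBUG)
source = AcmeIncSource(retrieve_emails=False)
data = asyncio.run(source.fetch_data(logging.getLogger("acmeinc")))
print(data.head())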
Processes
Processes transform the data obtained from the sources to refine or reshape it for your needs. They take data as input, apply transformations, and output the transformed data. Processes can be configured using arguments.
Information for developers
Processes (also known as Pipeline Processes) are implementations of the PipelineProcess abstract class. Each PipelineProcess implementation must:
- Define metadata by implementing the metadata static method, which returns a ModuleDefinition describing the details of the process.
- Implement an async pipeline method, which receives a pandas.DataFrame, an identifier, and a logger, and returns a pandas.DataFrame with the transformed data.
from mp_scrape_core import PipelineProcess, ModuleDescription, ModuleArgument, ModuleDefinition, ModuleMaintainer

import pandas as pd
import logging

class FilterEmailsProcess(PipelineProcess):
    def __init__(self, domain: str = "example.com"):
        """Filter emails by domain.

        :param str domain: (Domain) Only emails with this domain will be kept
        """
        self.domain = domain

    @staticmethod
    def metadata() -> ModuleDefinition:
        return ModuleDefinition({
            "name": "Filter Emails",
            "identifier": "filter_emails",
            # You can generate the description and arguments from the docstring in __init__
            "description": ModuleDescription.from_init(FilterEmailsProcess.__init__),
            "arguments": ModuleArgument.list_from_init(FilterEmailsProcess.__init__),
            "maintainers": [
                ModuleMaintainer({
                    "name": "Jane Doe",
                    "email": "jane@example.com",
                }),
            ],
        })

    async def pipeline(self, logger: logging.Logger, identifier: str, data: pd.DataFrame):
        logger.info(f"Filtering emails with domain '{self.domain}'")

        # Data filtering magic happens here!
        return data[data["email"].str.endswith(self.domain)]
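To see what the pipeline method does with concrete data, here is a hypothetical, self-contained check of the filter above. The email column name and the identifier string are illustrative assumptions:

import asyncio
import logging
import pandas as pd

# Hand-made input data standing in for what a Source would return
data = pd.DataFrame({
    "name": ["Jane Doe", "John Smith"],
    "email": ["jane@example.com", "john@other.org"],
})

process = FilterEmailsProcess(domain="example.com")
filtered = asyncio.run(process.pipeline(logging.getLogger("filter"), "demo", data))
print(filtered)  # only the example.com row remains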
Consumers
Consumers are responsible for using the transformed data, enabling you to store, display, or further process the results. They take data as input and perform actions with it. Consumers can be configured using arguments.
Information for developers
Consumers (also known as Data Consumers) are implementations of the DataConsumer abstract class. Each DataConsumer implementation must:
- Define metadata by implementing the metadata static method, which returns a ModuleDefinition describing the details of the consumer.
- Implement an async consume method, which receives a pandas.DataFrame and a logger, and performs an action with the data.
from mp_scrape_core import DataConsumer, ModuleDescription, ModuleArgument, ModuleDefinition, ModuleMaintainer

import pandas as pd
import logging

class CSVConsumer(DataConsumer):
    def __init__(self, path: str = "/tmp/emails.csv"):
        """Saves the data in a CSV file.

        :param str path: (Path) Path where the CSV will be saved
        """
        self.path = path

    @staticmethod
    def metadata() -> ModuleDefinition:
        return ModuleDefinition({
            "name": "CSV",
            "identifier": "csv",
            # You can generate the description and arguments from the docstring in __init__
            "description": ModuleDescription.from_init(CSVConsumer.__init__),
            "arguments": ModuleArgument.list_from_init(CSVConsumer.__init__),
            "maintainers": [
                ModuleMaintainer({
                    "name": "Jane Doe",
                    "email": "jane@example.com",
                }),
            ],
        })

    async def consume(self, logger: logging.Logger, data: pd.DataFrame):
        logger.info(f"Saving data to '{self.path}'")

        # Data saving magic happens here!
        data.to_csv(self.path)
Workflows
Workflows tie together sources, processes, and consumers to create a complete data pipeline. They define which sources to use, how to transform the obtained data with processes, and how to consume the final results.
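Before looking at the configuration, here is a rough sketch of how the three module types compose at run time. This is not the actual MP Scrape engine, only an illustration reusing the hypothetical classes from the examples above; the identifier passed to pipeline is assumed to be the source's identifier:

import asyncio
import logging

async def run_workflow():
    logger = logging.getLogger("workflow")

    # 1. Sources fetch the raw data
    data = await AcmeIncSource().fetch_data(logger)

    # 2. Processes transform it, one after another
    data = await FilterEmailsProcess(domain="example.com").pipeline(logger, "acmeinc", data)

    # 3. Consumers use the final result
    await CSVConsumer(path="result.csv").consume(logger, data)

asyncio.run(run_workflow())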
Let's consider a concrete example to illustrate how these components work together.
Imagine a workflow with the following configuration:
# Scrape data from the European Parliament
[workflow]
sources = ["euparl"]
processes = []
consumers = ["csv"]

[sources.euparl]
module = "mp_scrape_source_euparl"
retrieve_emails = false
retrieve_committees = false

[consumers.csv]
module = "mp_scrape_export_csv"
dest = "result.csv"
This workflow is designed to:
- Source: Fetch data using the mp_scrape_source_euparl source, which retrieves data from the European Parliament.
- Process: No processes are defined, so the data will not be transformed.
- Consumer: Export the fetched data to a CSV file named result.csv using the mp_scrape_export_csv consumer.
In essence, this workflow extracts data from the European Parliament (without emails or committee details) and saves it directly to a CSV file, providing a simple yet effective way to archive this information.
In order to run this workflow, you need to have the mp_scrape_source_euparl and mp_scrape_export_csv modules installed and accessible.
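If these modules are published on PyPI like the CLI itself (an assumption; check each module's documentation for its actual installation method), they can be installed the same way:

python3 -m pip install mp_scrape_source_euparl mp_scrape_export_csv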