MP Scrape
MP Scrape is aimed at obtaining up-to-date information about MPs all over Europe.
Why MP Scrape Matters
Born out of necessity within the FSFE, MP Scrape is a tool that automates the collection of up-to-date information about Members of Parliament (MPs) across Europe.
MP Scrape supports the FSFE in advancing digital freedom by providing transparent access to information about policymakers across Europe.
The fight for Free Software and digital freedom is ongoing. Support us to come closer to this goal.
Free Software
MP Scrape is Free Software, meaning anyone can use, share, study, and improve it.
Up-to-date information
MP Scrape allows you to collect up-to-date information for Members of Parliament across Europe.
Getting started
The best way to contribute to MP Scrape is by creating Sources for your local, regional, and national parliaments. Check the documentation to learn how to create your own Sources.
Otherwise, you can still support us: the fight for Free Software and digital freedom is ongoing, and your support brings us closer to this goal.
Using MP Scrape
If you are comfortable with the command line, the best option is the MP Scrape CLI. It allows you to define Workflows that scrape a set of Sources.
If you need collaborative access to the data, the MP Scrape UI is the better option. It makes modifying your Workflows less convenient, but it allows your team to access and retrieve the up-to-date information.
Using the MP Scrape CLI
Once all required dependencies are installed, install the MP Scrape CLI. The preferred method of distribution is PyPI:
python3 -m pip install mp_scrape_cli
Now that you have MP Scrape CLI installed, you will need to create your first Workflow. In a nutshell, Workflows describe: what content to scrape from where, what to do with the content, and how to export the content.
A simple Workflow looks like this:
# Scrape data from the European Parliament
[workflow]
sources = ["euparl"]
processes = []
consumers = ["csv"]

[sources.euparl]
module = "mp_scrape_source_euparl"
retrieve_emails = false
retrieve_committees = false

[consumers.csv]
module = "mp_scrape_export_csv"
dest = "result.csv"
This example Workflow scrapes basic information from all Members of the European Parliament and outputs a CSV file that you can use with any spreadsheet software of your choice.
The next step is to run the Workflow! Copy the example Workflow and paste it into a file of your choice, which we will call workflow.toml, then run the following command:
python3 -m mp_scrape_cli -w workflow.toml -l DEBUG
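Once the Workflow finishes, result.csv contains the scraped data. Any spreadsheet software can open it; as a quick sketch, you can also inspect it with pandas:

import pandas as pd

# Load the CSV produced by the Workflow and print the first rows
data = pd.read_csv("result.csv")
print(data.head())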
Glossary
Workflows
Workflows are the core of MP Scrape CLI and MP Scrape UI.
Workflows allow you to define: what data to scrape from where, how to transform that data, and how to export that data.
Workflows are made of three parts: Sources, Processes, and Consumers.
Sources
Sources are responsible for fetching raw data from various places. These could be APIs, databases, websites, or any other data provider. Sources can be configured using arguments to customize their behavior.
Information for developers
Sources (also known as Data Sources) are implementations of the DataSource abstract class. Each DataSource implementation must:
- Define metadata by implementing the metadata static method, which returns a ModuleDefinition describing the details of the source.
- Implement an async fetch_data method, which retrieves the data and returns a pandas.DataFrame. This dataframe represents the raw structured data fetched by the source.
from mp_scrape_core import DataSource, ModuleDescription, ModuleArgument, ModuleDefinition, ModuleMaintainer

import pandas as pd
import logging

class AcmeIncSource(DataSource):
    def __init__(self, retrieve_emails: bool = True):
        """Retrieve information from Acme Inc.

        :param bool retrieve_emails: (List emails) When enabled, e-mails will be retrieved
        """
        self.retrieve_emails = retrieve_emails

    @staticmethod
    def metadata() -> ModuleDefinition:
        return ModuleDefinition({
            "name": "Acme Inc.",
            "identifier": "acmeinc",
            # You can generate the description and arguments from the docstring in __init__
            "description": ModuleDescription.from_init(AcmeIncSource.__init__),
            "arguments": ModuleArgument.list_from_init(AcmeIncSource.__init__),
            "maintainers": [
                ModuleMaintainer({
                    "name": "Jane Doe",
                    "email": "jane@example.com",
                }),
            ],
        })

    async def fetch_data(self, logger: logging.Logger):
        logger.info("Fetching data from Acme Inc. API")

        # Data fetching magic happens here!
        return pd.DataFrame(...)
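MP Scrape normally instantiates Sources and calls fetch_data for you as part of a Workflow. For development, a hypothetical stand-alone test of the example class above could look like this (a sketch only, not part of the MP Scrape API; it assumes fetch_data returns a real DataFrame instead of the placeholder):

import asyncio
import logging

# Hypothetical manual run of the example source above; the MP Scrape
# engine normally drives fetch_data as part of a Workflow.
logging.basicConfig(level=logging.DEBUG)
source = AcmeIncSource(retrieve_emails=False)
data = asyncio.run(source.fetch_data(logging.getLogger("acmeinc")))
print(data.head())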
Processes
Processes transform the data obtained from the sources to refine or reshape it for your needs. They take data as input, apply transformations, and output the transformed data. Processes can be configured using arguments.
Information for developers
Processes (also known as Pipeline Processes) are implementations of the PipelineProcess abstract class. Each PipelineProcess implementation must:
- Define metadata by implementing the metadata static method, which returns a ModuleDefinition describing the details of the process.
- Implement an async pipeline method, which receives a pandas.DataFrame, an identifier, and a logger, and returns a pandas.DataFrame with the transformed data.
from mp_scrape_core import PipelineProcess, ModuleDescription, ModuleArgument, ModuleDefinition, ModuleMaintainer

import pandas as pd
import logging

class FilterEmailsProcess(PipelineProcess):
    def __init__(self, domain: str = "example.com"):
        """Filter emails by domain.

        :param str domain: (Domain) Only emails with this domain will be kept
        """
        self.domain = domain

    @staticmethod
    def metadata() -> ModuleDefinition:
        return ModuleDefinition({
            "name": "Filter Emails",
            "identifier": "filter_emails",
            # You can generate the description and arguments from the docstring in __init__
            "description": ModuleDescription.from_init(FilterEmailsProcess.__init__),
            "arguments": ModuleArgument.list_from_init(FilterEmailsProcess.__init__),
            "maintainers": [
                ModuleMaintainer({
                    "name": "Jane Doe",
                    "email": "jane@example.com",
                }),
            ],
        })

    async def pipeline(self, logger: logging.Logger, identifier: str, data: pd.DataFrame):
        logger.info(f"Filtering emails with domain '{self.domain}'")

        # Data filtering magic happens here!
        return data[data["email"].str.endswith(self.domain)]
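To see what the pipeline method does with concrete data, here is a hypothetical, self-contained check of the filter above. The email column name and the identifier string are illustrative assumptions:

import asyncio
import logging
import pandas as pd

# Hand-made input data standing in for what a Source would return
data = pd.DataFrame({
    "name": ["Jane Doe", "John Smith"],
    "email": ["jane@example.com", "john@other.org"],
})

process = FilterEmailsProcess(domain="example.com")
filtered = asyncio.run(process.pipeline(logging.getLogger("filter"), "demo", data))
print(filtered)  # only the example.com row remains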
Consumers
Consumers are responsible for using the transformed data, enabling you to store, display, or further process the results. They take data as input and perform actions with it. Consumers can be configured using arguments.
Information for developers
Consumers (also known as Data Consumers) are implementations of the DataConsumer abstract class. Each DataConsumer implementation must:
- Define metadata by implementing the metadata static method, which returns a ModuleDefinition describing the details of the consumer.
- Implement an async consume method, which receives a pandas.DataFrame and a logger, and performs an action with the data.
from mp_scrape_core import DataConsumer, ModuleDescription, ModuleArgument, ModuleDefinition, ModuleMaintainer

import pandas as pd
import logging

class CSVConsumer(DataConsumer):
    def __init__(self, path: str = "/tmp/emails.csv"):
        """Saves the data in a CSV file.

        :param str path: (Path) Path where the CSV will be saved
        """
        self.path = path

    @staticmethod
    def metadata() -> ModuleDefinition:
        return ModuleDefinition({
            "name": "CSV",
            "identifier": "csv",
            # You can generate the description and arguments from the docstring in __init__
            "description": ModuleDescription.from_init(CSVConsumer.__init__),
            "arguments": ModuleArgument.list_from_init(CSVConsumer.__init__),
            "maintainers": [
                ModuleMaintainer({
                    "name": "Jane Doe",
                    "email": "jane@example.com",
                }),
            ],
        })

    async def consume(self, logger: logging.Logger, data: pd.DataFrame):
        logger.info(f"Saving data to '{self.path}'")

        # Data saving magic happens here!
        data.to_csv(self.path)
Workflows
Workflows tie together sources, processes, and consumers to create a complete data pipeline. They define which sources to use, how to transform the obtained data with processes, and how to consume the final results.
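Before looking at the configuration, here is a rough sketch of how the three module types compose at run time. This is not the actual MP Scrape engine, only an illustration reusing the hypothetical classes from the examples above; the identifier passed to pipeline is assumed to be the source's identifier:

import asyncio
import logging

async def run_workflow():
    logger = logging.getLogger("workflow")

    # 1. Sources fetch the raw data
    data = await AcmeIncSource().fetch_data(logger)

    # 2. Processes transform it, one after another
    data = await FilterEmailsProcess(domain="example.com").pipeline(logger, "acmeinc", data)

    # 3. Consumers use the final result
    await CSVConsumer(path="result.csv").consume(logger, data)

asyncio.run(run_workflow())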
Let's consider a concrete example to illustrate how these components work together.
Imagine a workflow with the following configuration:
# Scrape data from the European Parliament
[workflow]
sources = ["euparl"]
processes = []
consumers = ["csv"]

[sources.euparl]
module = "mp_scrape_source_euparl"
retrieve_emails = false
retrieve_committees = false

[consumers.csv]
module = "mp_scrape_export_csv"
dest = "result.csv"
This workflow is designed to:
- Source: Fetch data using the mp_scrape_source_euparl source, which retrieves data from the European Parliament.
- Process: No processes are defined, so the data will not be transformed.
- Consumer: Export the fetched data to a CSV file named result.csv using the mp_scrape_export_csv consumer.
In essence, this workflow extracts data from the European Parliament (without emails or committee details) and saves it directly to a CSV file, providing a simple yet effective way to archive this information.
In order to run this workflow, you need to have the mp_scrape_source_euparl and mp_scrape_export_csv modules installed and accessible.
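If these modules are published on PyPI like the CLI itself (an assumption; check each module's documentation for its actual installation method), they can be installed the same way:

python3 -m pip install mp_scrape_source_euparl mp_scrape_export_csv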