Make GDPR-safe backups for WooCommerce

WordPress, GDPR, GitHub Actions, Continuous Integration

A client of mine runs WooCommerce on a WordPress site with a lot of orders going through it. One thing I kept bumping my head against for a long time was how to create GDPR-safe backups that I can use in my local development environment.

The challenge with WordPress

The hard part with WordPress is that many things resolve back to a post: a post is a post (duh), but pages and products are posts too, and so are WooCommerce orders. This means that the data for all of these types lives in just two tables: wp_posts and wp_postmeta. That makes for a pretty flexible programming experience: adding extra data to any of these types is easy. But getting this data safely out of the database is a different story.
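
To make this concrete: all of these content types sit side by side in wp_posts, each with its own post_type (WooCommerce orders show up as shop_order). A quick sketch of a query to see what is in there, assuming the default wp_ table prefix and local credentials:

# List the content types stored in wp_posts and how many rows each one has.
mysql -u wordpress -p wordpress -e "
  SELECT post_type, COUNT(*) AS total
  FROM wp_posts
  GROUP BY post_type
  ORDER BY total DESC;"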

Smile GdprDump

After investigating some tools, I started building my own tool to do this. But while inspecting how other tools handled the database dumping part, I realized I could use an existing tool for this purpose: Smile GdprDump.

This tool lets you configure how dumps are created. It supports several options: only dumping part of a table, anonymizing data with Faker, or dumping only the table structure without the data.
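
For the Faker/anonymization side, a column can be rewritten through a converter. I don't use converters in my setup below, but a minimal sketch (the converter names depend on the GdprDump version you install, so double-check them against its documentation) looks like this:

tables:
  wp_users:
    converters:
      user_email:
        converter: 'randomizeEmail'
      display_name:
        converter: 'anonymizeText'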

I removed the data from the dump as I don't need the orders on my local system.

Installation is simple: install it through Composer:

composer require --dev smile/gdpr-dump

Configuration

I ended up with this configuration:

---
database:
  host: '%env(DB_HOST)%'
  user: '%env(DB_USER)%'
  password: '%env(DB_PASSWORD)%'
  name: '%env(DB_NAME)%'

dump:
  output: 'wordpress-development-dump.sql.gz'
  compress: 'gzip'

tables:
  wp_comments:
    truncate: true

  wp_wc_customer_lookup:
    truncate: true

  wp_woocommerce_order_itemmeta:
    truncate: true

  wp_posts:
    filters:
      - ['post_type', 'in', ['nav_menu_item', 'page', 'post', 'product', 'product_variation']]

  wp_postmeta:
    filters:
      - ['post_id', 'in', 'expr: select ID from wp_posts where post_type IN ("nav_menu_item", "page", "post", "product", "product_variation")']

  wp_users:
    filters:
      - ['user_email', 'in', 'expr: select user_email from wp_users where user_email like "%@controlaltdelete.nl"']

  wp_usermeta:
    filters:
      - ['user_id', 'in', 'expr: select ID from wp_users where user_email like "%@controlaltdelete.nl"']

This does a few things:

  • It creates a gzipped file named wordpress-development-dump.sql.gz.

  • It truncates the data for wp_comments, wp_wc_customer_lookup, and wp_woocommerce_order_itemmeta. This means the table structure is added to the dump, but not the data.

  • For wp_posts and wp_postmeta it only dumps rows whose post_type is nav_menu_item, page, post, product, or product_variation. All other post types, including orders, are skipped.

  • For wp_users and wp_usermeta it only dumps data for users whose email address ends in @controlaltdelete.nl. All other users are skipped.

It is always a good idea to check the project you are working on; there could be more privacy-sensitive data in there. I like to search for @gmail.com in the dump to find extra tables that need handling.
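
A quick way to do that check, assuming the gzipped output file from the configuration above:

# Count real-looking email addresses that ended up in the development dump.
zgrep -c "@gmail.com" wordpress-development-dump.sql.gz

# Show the matching lines to figure out which tables they come from.
zgrep "@gmail.com" wordpress-development-dump.sql.gz | head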

Running the GDPR dumper

I have created a little bash script that runs this for me. This is the content of that file:

export DB_HOST=db
export DB_USER=wordpress
export DB_PASSWORD=wordpress
export DB_NAME=wordpress

/var/www/html/vendor/bin/gdpr-dump /var/www/html/gdpr-dump/wordpress.yml

I'm running all my projects in Docker, so the path is always the same. Running it looks something like this:

sh create-gdpr-safe-backup.sh
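
Since everything runs in Docker, that command is executed inside the container. With Docker Compose it could look something like this (the service name web is an assumption; use whatever your PHP container is called):

# Run the backup script inside the PHP container (service name is hypothetical).
docker compose exec web sh create-gdpr-safe-backup.sh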

Automating the GDPR dump

Now that we have this working, we can take it a step further. I've created a GitHub Action that takes these steps:

  • Download a backup from Amazon S3. This is an automated backup that is created daily.

  • Start a MySQL server (within GitHub Actions) and import this dump into it.

  • Run the GDPR dump program.

  • Upload the result back to S3 in a separate folder.

The file looks like this:

name: Create development dump
on:
  workflow_dispatch:
  schedule:
    - cron: '33 9 * * 1'

jobs:
  create-development-dump:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-central-1

      - name: Download latest backup
        run: |
          BUCKET="s3://supersecret.com-backup/"
          OBJECT="$(aws s3 ls $BUCKET --recursive | grep ".zip" | sort | tail -n 1 | awk '{print $4}')"
          aws s3 cp s3://supersecret.com-backup/$OBJECT gdpr-dump/ --no-progress
          unzip gdpr-dump/$OBJECT supersecret.com.sql -d gdpr-dump/backup/

      - name: Start mysql & import database
        run: |
          sudo /etc/init.d/mysql start
          mysql -e 'CREATE DATABASE wordpress;' -uroot -proot
          mysql -e "ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'root';" -uroot -proot
          cat gdpr-dump/backup/supersecret.com.sql | mysql --binary-mode -uroot -proot wordpress

      - name: Setup PHP
        uses: shivammathur/setup-php@v2
        with:
          php-version: '8.0'

      - name: Run Composer Install
        run: |
          composer install --no-interaction --no-progress

      - name: Run GDPR dump
        run: |
          DB_HOST=localhost DB_NAME=wordpress DB_USER=root DB_PASSWORD=root vendor/bin/gdpr-dump gdpr-dump/wordpress.yml

      - name: Upload to S3
        run: aws s3 cp ./wordpress-development-dump.sql.gz s3://supersecret.com-backup/development-dump/

For safety purposes, I have created an IAM user that has read-only access to the full backups and read/write access to the development-dump folder.
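
The exact policy depends on your bucket layout, but as a sketch it could look something like this (the bucket name matches the workflow above; the user and policy names are made up for the example):

# Inline policy for the CI user: read the whole bucket, write only to development-dump/.
aws iam put-user-policy \
  --user-name gdpr-dump-ci \
  --policy-name gdpr-dump-s3-access \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      { "Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": "arn:aws:s3:::supersecret.com-backup" },
      { "Effect": "Allow", "Action": ["s3:GetObject"], "Resource": "arn:aws:s3:::supersecret.com-backup/*" },
      { "Effect": "Allow", "Action": ["s3:PutObject"], "Resource": "arn:aws:s3:::supersecret.com-backup/development-dump/*" }
    ]
  }'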

This script will run once a week, as that's more than enough for me.

What's next?

I don't need to have this automated for development purposes: I don't have to reset my local database that often. But my end goal is to automate the WordPress (plugin) updates. For that, I need end-to-end tests that can run in the GitHub Actions pipeline. This backup is a big part of that, as it will be imported into the temporary environment before the tests run.
