# Data Preparation

The Government of Canada is sitting on a gold mine of data, some of it more suitable for process mining than the rest. The goal of this section is to provide insights on how to identify good data, negotiate access to it, clean it, and prepare it for your process mining analysis. Combining the insights in this section with the process qualification questionnaires (above) should allow you to identify process mining projects that deliver a promising return on investment.

# Where to Start

Source: Process Intelligence in Action by Lars Reinkemeyer (2024)
The most frequently mined process in the academic literature is the procure-to-pay (P2P) process. As Reinkemeyer (2024) reports, the core processes of accounts payable, procure-to-pay, order-to-cash and accounts receivable account for more than 50% of the total value realized by private companies. Starting with data from your ERP system can often be a quick way to show the value of process mining, as ERPs typically maintain good audit logs of activities.

Within Government, good processes to start with would be high-volume processes running on ERP or case management systems. Platforms such as SAP, Salesforce and Microsoft Dynamics typically collect logs that are ripe for process mining. These platforms are even starting to embed process mining functionality as part of their core platform.

# Negotiating Access to Data

Obtaining access to data is one of the hardest parts of process mining. People who are new to process mining may not fully understand it, or may fear that it will expose poorly perceived issues within their business. In a Government context where anything can be publicly scrutinized (e.g. through ATIP), this is especially true. To help teams obtain management consent to access process mining data, we've prepared a briefing note template to accelerate the process.

Sample Briefing Note (DOCX): data-access-process-mining-briefing-note.docx (35.6 KB)

# Minimum Data Requirements

Process mining uses digital footprints inside existing information systems to generate what are known as event logs. An event log is a collection of events belonging to different cases, each recording an activity and a timestamp. Three columns are needed at a minimum for process mining: case_id, event, and timestamp.

The key requirement to producing an event log is that actions are recorded in information systems. Systems that record event log data are known as process aware information systems (PAIS) (Dumas et al., 2005). Examples of PAIS include ERP systems (Oracle, SAP), CRM systems (Salesforce, Microsoft Dynamics) and intelligent business process management suites (Appian, IBM, Pegasystems).

An example event log looks like the table below:

| case_id | event                 | timestamp        | purchaser |
|---------|-----------------------|------------------|-----------|
| 121     | purchaseOrderCreated  | 31/12/2020 11:59 | Bob       |
| 121     | purchaseOrderApproved | 05/01/2021 13:53 | Bob       |
| 121     | purchasePaid          | 01/02/2021 14:55 | Bob       |
| 122     | purchaseOrderCreated  | 05/01/2021 12:55 | Jane      |
| 122     | purchaseOrderApproved | 10/01/2021 09:30 | Jane      |
| 122     | purchasePaid          | 28/01/2021 15:33 | Jane      |
| 123     | purchaseOrderCreated  | 10/04/2021 23:59 | John      |
| 123     | purchaseOrderApproved | 24/04/2021 23:59 | John      |
| 123     | purchasePaid          | 04/05/2021 23:59 | John      |
| 124     | purchaseOrderCreated  | 12/01/2021 15:22 | Jill      |
| 124     | purchaseOrderApproved | 15/12/2021 16:49 | Jill      |
| 124     | purchasePaid          | 05/01/2022 13:00 | Jill      |

Event logs are typically stored in CSV or XLSX format. More sophisticated formats, such as XES, enable advanced process mining techniques. Sample (open data) event logs for process mining can be found below:

IEEE Task Force on Process Mining - Event Logs
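
As a minimal sketch of what these requirements look like in practice, the snippet below loads an event log like the one above and validates the three required columns with Pandas (the file name purchases.csv is a hypothetical placeholder):

```python
import pandas as pd

# Hypothetical file name; any CSV export containing the three
# minimum columns (case_id, event, timestamp) will do.
log = pd.read_csv("purchases.csv")

# Check that the minimum columns for process mining are present.
required = {"case_id", "event", "timestamp"}
missing = required - set(log.columns)
if missing:
    raise ValueError(f"Event log is missing required columns: {missing}")

# Parse timestamps (dayfirst=True matches the DD/MM/YYYY format
# in the table above) and sort each case chronologically.
log["timestamp"] = pd.to_datetime(log["timestamp"], dayfirst=True)
log = log.sort_values(["case_id", "timestamp"])
print(log.head())
```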

# Data Quality Checklist

The team at Fluxicon (authors of the Disco process mining software) maintain a great data quality checklist, linked below:

Data Quality Checklist by Fluxicon
https://files.fluxicon.com/Templates/Data-Quality-Checklist.pdf

# Data Extraction & Cleaning

Data extraction and cleaning is by far the longest part of a process mining project. As mentioned above, data quality differs depending on the source system. There are two ways of extracting data from a source system:

  1. Using plug-ins (aka connectors) provided by process mining tools to connect directly to source systems.
  2. Manual data extraction via a database or API export.

If you’re lucky enough to be working with a source system and a process mining tool that share a ready-made connector, then option #1 is surely the best way forward and will accelerate time to value. However, in many cases, option #2 will prevail. Most PM projects we’ve undertaken in the GC (so far) used option #2, as it was rare for us to get permission to connect directly to source systems without an intermediary data analyst who would filter data exports to ensure no confidential or personal information was shared.

logprep4pm: Log Filtering and Preprocessing Library for Process Mining
https://github.com/ProcessMining-uOttawa/logprep4pm

When engaging in option #2, we’ve made heavy use of Python, the popular Pandas library, and the logprep4pm module developed in collaboration with the University of Ottawa. This approach has several benefits (a sketch of a typical cleaning step follows the list below):

  • The use of Jupyter notebooks and Pandas allows you to document the steps taken to clean your dataset and transform it into event log format. This enhances reproducibility.
  • The notebook can easily be turned into a Python script for continuously feeding new data into a process mining tool.
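
As an illustrative sketch of such a notebook step (using plain Pandas only; the export file and its column names are hypothetical, and logprep4pm offers filtering helpers beyond what is shown here), the code below maps source-system columns onto the minimum event log schema and cleans the result:

```python
import pandas as pd

# Hypothetical raw export from a source system; the file name
# and column names below are illustrative only.
raw = pd.read_csv("source_system_export.csv")

# Map source-system column names onto the minimum event log schema.
log = raw.rename(columns={
    "ticket_number": "case_id",
    "action_code": "event",
    "action_date": "timestamp",
})[["case_id", "event", "timestamp"]]

# Drop rows missing any required field and remove exact
# duplicate events, both common in raw exports.
log = log.dropna().drop_duplicates()

# Normalize timestamps and order events within each case.
log["timestamp"] = pd.to_datetime(log["timestamp"])
log = log.sort_values(["case_id", "timestamp"])

# Write the cleaned event log for import into a process mining tool.
log.to_csv("event_log.csv", index=False)
```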