# Data Preparation
The Government of Canada is sitting on a gold mine of data. Some of this data is more suitable for process mining than others. The goal of this section is to provide insight into how to find and identify good data, negotiate access, clean the data, and prepare it for your process mining analysis. Combining the insights in this section with the process qualification questionnaires (above) should allow you to identify process mining projects that deliver a promising return on investment.
# Where to Start
Within Government, good processes to start with are high-volume processes running on ERP or case management systems. Platforms such as SAP, Salesforce, and Microsoft Dynamics typically collect logs that are ripe for process mining, and some are even starting to embed process mining functionality into their core platforms.
# Negotiating Access to Data
Obtaining access to data is one of the hardest parts of process mining. People who are new to process mining may not fully understand it, or may fear that it will expose poorly perceived issues within their business. In a Government context where anything can be publicly scrutinized (e.g., through ATIP requests), this concern is especially acute. To help teams obtain management consent to access data for process mining, we've prepared a briefing note template to accelerate the process.
# Minimum Data Requirements
Process mining uses digital footprints inside existing information systems to generate what are known as event logs. An event log is a collection of events belonging to different cases, each with a timestamp and an activity. At a minimum, process mining requires three columns: case_id, event, and timestamp.
The key requirement to producing an event log is that actions are recorded in information systems. Systems that record event log data are known as process aware information systems (PAIS) (Dumas et al., 2005). Examples of PAIS include ERP systems (Oracle, SAP), CRM systems (Salesforce, Microsoft Dynamics) and intelligent business process management suites (Appian, IBM, Pegasystems).
An example event log looks similar to this table below:
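To make the structure concrete, the same kind of table can be sketched directly in pandas. The case IDs, activities, and timestamps below are invented for illustration and are not real GC data:

```python
import pandas as pd

# Minimal event log: each row is one event, linked to a case by case_id.
# All values are invented for illustration.
event_log = pd.DataFrame(
    {
        "case_id": [1, 1, 1, 2, 2],
        "event": ["Submit", "Review", "Approve", "Submit", "Review"],
        "timestamp": pd.to_datetime(
            [
                "2023-01-03 09:15",
                "2023-01-04 10:30",
                "2023-01-06 14:00",
                "2023-01-03 11:45",
                "2023-01-05 16:20",
            ]
        ),
    }
)

# Sorting by case and time yields each case's trace in chronological order.
event_log = event_log.sort_values(["case_id", "timestamp"])
print(event_log)
```

Each distinct `case_id` groups the events of one process instance; sorting by timestamp within a case recovers the order in which activities occurred.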
Event logs are typically stored in CSV or XLSX format; more sophisticated formats such as XES support advanced process mining techniques. Sample (open data) event logs for process mining can be found below:
# Data Quality Checklist
The team at Fluxicon (authors of the Disco PM software) have published a great data quality checklist, which we include below:
# Data Extraction & Cleaning
Data extraction and cleaning is by far the longest part of a process mining project. As mentioned above, data quality varies depending on the source system. There are two ways of extracting data from a source system:
- Using plug-ins (aka connectors) provided by process mining tools to connect directly to source systems.
- Manual data extraction via a database or API export.
If you’re lucky enough to be working with a source system and process mining tool that has a ready-made connector, then option #1 is surely the best way forward and will accelerate time to value. However, in many cases, option #2 will prevail. Most PM projects we’ve undertaken in the GC (so far) used option #2 as it was rare for us to get permission to connect directly to source systems without an intermediary data analyst who would filter data exports to ensure no confidential or personal information was shared.
When engaging in option #2, we've made heavy use of Python, the popular Pandas library, and the logprep4pm module developed in collaboration with the University of Ottawa. This approach has several benefits:
- The use of Jupyter notebooks and Pandas allows you to document the steps taken to clean your dataset and transform it into event log format. This enhances reproducibility.
- The notebook can be easily turned into a Python script for continuously feeding new data into a process mining tool.
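As a sketch of what such a cleaning pass looks like under option #2, the snippet below renames raw export columns onto the minimum event-log schema, parses timestamps, drops incomplete rows, and sorts events into traces. The raw column names (`TicketNo`, `Action`, `ActionDate`) are assumptions for illustration, and this does not show the logprep4pm API:

```python
import io

import pandas as pd

# Simulated raw database export; a real export would be a CSV file
# provided by a data analyst. Column names here are assumptions.
raw_csv = io.StringIO(
    "TicketNo,Action,ActionDate\n"
    "A-100,Created,2023-03-01T08:00:00\n"
    "A-100,Assigned,2023-03-01T09:30:00\n"
    "A-101,Created,2023-03-02T10:00:00\n"
    "A-100,Closed,2023-03-03T17:00:00\n"
    "A-101,Closed,\n"  # missing timestamp -> dropped below
)

df = pd.read_csv(raw_csv)

# Map the source columns onto the minimum event-log schema.
df = df.rename(
    columns={"TicketNo": "case_id", "Action": "event", "ActionDate": "timestamp"}
)

# Parse timestamps; missing or unparseable values become NaT and are removed.
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
df = df.dropna(subset=["timestamp"])

# Order events within each case so that traces read chronologically.
event_log = df.sort_values(["case_id", "timestamp"]).reset_index(drop=True)
print(event_log)
```

Documenting each of these steps in a Jupyter notebook, as noted above, keeps the transformation reproducible and easy to rerun when fresh exports arrive.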