Python & openpyxl – If the fraud team tells you to do something; do it quickly!

A member of our Fraud team contacted me unexpectedly at work.

Hey. I got this excel workbook of a customer’s statement with the data spread across multiple sheets. Do you know how to consolidate them into one sheet?

“OK,” I thought, “It can’t be that hard.”

She sent me the workbook. It had 447 sheets in it! The statement was initially in a PDF format and the Fraud team exported it to Excel. This resulted in a sheet for each page; looking like this:

The data that we were interested in was between A7 and E29 on each page. I visually spot checked half a dozen sheets across the workbook and they were in similar format

Ditching VBA for openpyxl

In yesteryear; I would have VBA inside Excel to consolidate the sheets. But I felt it was time to grow up from VBA and see what python had to offer.

After a quick Google; I came across Data Camp’s Python Excel Tutorial.

It was surprisingly quick and easy to program up. I honestly thought it would have taken me as long; if not longer in VBA.

Here is the scratch code for your perusal

import openpyxl 
import logging
from datetime import datetime
import csv

# Formats a row into a format for the delimited file
def format_row(in_record):

    try:
        in_record[0] = datetime.strftime(in_record[0], "%Y-%m-%d")  # Converts the python date to YYYY-MM-DD
    except TypeError as e:
        root_logger.warning("Cannot convert " + in_record[0] + " to date/time")

    in_record[1] = in_record[1].replace("\n", " \u21B5")    # Adds in a return arrow

    if in_record[2] == None:
        in_record[2] = ""

    if in_record[3] == None:
        in_record[3] = ""
    
    return in_record


if __name__ == '__main__':

    now = datetime.now()

    log_formatter = logging.Formatter("%(asctime)s [%(threadName)-12.12s] [%(levelname)-8.8s]  %(message)s")
    root_logger = logging.getLogger()
    root_logger.setLevel(logging.INFO)

    log_console_handler = logging.StreamHandler()
    log_console_handler.setFormatter(log_formatter)
    root_logger.addHandler(log_console_handler)

    final_array = []

    wb = openpyxl.load_workbook("CONSOLIDATEDExcel.xlsx") 

    for focus_sheet_name in wb.sheetnames:

        root_logger.info("Processing sheet: " + focus_sheet_name)
        
        focus_sheet = wb[focus_sheet_name]
        root_logger.debug("Total number of rows: " + str(focus_sheet.max_row) + '. And total number of columns: ' + str(focus_sheet.max_column))


        if focus_sheet["A6"].value == "Date":   # Checks for the key field "Date" in position A6

            my_list = []

            for sheet_row in focus_sheet.iter_rows(
                min_row=7, max_row=26, min_col=1, max_col=5,    # Gets specific section of the worksheet
                values_only=True):

                sheet_row_list = list(sheet_row)

                if sheet_row_list[0] != None:   # Ignore lines that are blank
                    final_array.append(format_row(sheet_row_list))
        else:
            root_logger.info("Cannot find key column \"DATE\".  Skipping sheet")


    # Output records to a tab delimited file
    with open("Consolidated_output.tab", 'w', newline='', encoding="utf-8") as myfile:
        wr = csv.writer(myfile, quoting=csv.QUOTE_ALL, delimiter="\t")
        wr.writerows(final_array)
        

Also; it ran blindingly quick – as quick as I feel in VBA with Application.ScreenUpdating = False

Anyway; Fraud team was happy I turned their request around quickly. I’m guessing if the fraud team is involved – someone is about to have a bad day.