A member of our Fraud team contacted me unexpectedly at work.
Hey. I got this excel workbook of a customer’s statement with the data spread across multiple sheets. Do you know how to consolidate them into one sheet?
“OK,” I thought, “It can’t be that hard.”
She sent me the workbook. It had 447 sheets in it! The statement was initially in a PDF format and the Fraud team exported it to Excel. This resulted in a sheet for each page; looking like this:
The data that we were interested in was between A7 and E29 on each page. I visually spot checked half a dozen sheets across the workbook and they were in similar format
Ditching VBA for openpyxl
In yesteryear; I would have VBA inside Excel to consolidate the sheets. But I felt it was time to grow up from VBA and see what python had to offer.
After a quick Google; I came across Data Camp’s Python Excel Tutorial.
It was surprisingly quick and easy to program up. I honestly thought it would have taken me as long; if not longer in VBA.
Here is the scratch code for your perusal
import openpyxl
import logging
from datetime import datetime
import csv
# Formats a row into a format for the delimited file
def format_row(in_record):
try:
in_record[0] = datetime.strftime(in_record[0], "%Y-%m-%d") # Converts the python date to YYYY-MM-DD
except TypeError as e:
root_logger.warning("Cannot convert " + in_record[0] + " to date/time")
in_record[1] = in_record[1].replace("\n", " \u21B5") # Adds in a return arrow
if in_record[2] == None:
in_record[2] = ""
if in_record[3] == None:
in_record[3] = ""
return in_record
if __name__ == '__main__':
now = datetime.now()
log_formatter = logging.Formatter("%(asctime)s [%(threadName)-12.12s] [%(levelname)-8.8s] %(message)s")
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
log_console_handler = logging.StreamHandler()
log_console_handler.setFormatter(log_formatter)
root_logger.addHandler(log_console_handler)
final_array = []
wb = openpyxl.load_workbook("CONSOLIDATEDExcel.xlsx")
for focus_sheet_name in wb.sheetnames:
root_logger.info("Processing sheet: " + focus_sheet_name)
focus_sheet = wb[focus_sheet_name]
root_logger.debug("Total number of rows: " + str(focus_sheet.max_row) + '. And total number of columns: ' + str(focus_sheet.max_column))
if focus_sheet["A6"].value == "Date": # Checks for the key field "Date" in position A6
my_list = []
for sheet_row in focus_sheet.iter_rows(
min_row=7, max_row=26, min_col=1, max_col=5, # Gets specific section of the worksheet
values_only=True):
sheet_row_list = list(sheet_row)
if sheet_row_list[0] != None: # Ignore lines that are blank
final_array.append(format_row(sheet_row_list))
else:
root_logger.info("Cannot find key column \"DATE\". Skipping sheet")
# Output records to a tab delimited file
with open("Consolidated_output.tab", 'w', newline='', encoding="utf-8") as myfile:
wr = csv.writer(myfile, quoting=csv.QUOTE_ALL, delimiter="\t")
wr.writerows(final_array)
Also; it ran blindingly quick – as quick as I feel in VBA with Application.ScreenUpdating = False
Anyway; Fraud team was happy I turned their request around quickly. I’m guessing if the fraud team is involved – someone is about to have a bad day.