Python & openpyxl – If the fraud team tells you to do something; do it quickly!
A member of our Fraud team contacted me unexpectedly at work.
Hey. I got this excel workbook of a customer’s statement with the data spread across multiple sheets. Do you know how to consolidate them into one sheet?
“OK,” I thought, “It can’t be that hard.”
She sent me the workbook. It had 447 sheets in it! The statement was initially in a PDF format and the Fraud team exported it to Excel. This resulted in a sheet for each page; looking like this:
The data that we were interested in was between A7 and E29 on each page. I visually spot checked half a dozen sheets across the workbook and they were in similar format
Ditching VBA for openpyxl
In yesteryear; I would have VBA inside Excel to consolidate the sheets. But I felt it was time to grow up from VBA and see what python had to offer.
It was surprisingly quick and easy to program up. I honestly thought it would have taken me as long; if not longer in VBA.
Here is the scratch code for your perusal
import openpyxl
import logging
from datetime import datetime
import csv
# Formats a row into a format for the delimited file
def format_row(in_record):
try:
in_record[0] = datetime.strftime(in_record[0], "%Y-%m-%d") # Converts the python date to YYYY-MM-DD
except TypeError as e:
root_logger.warning("Cannot convert " + in_record[0] + " to date/time")
in_record[1] = in_record[1].replace("\n", " \u21B5") # Adds in a return arrow
if in_record[2] == None:
in_record[2] = ""
if in_record[3] == None:
in_record[3] = ""
return in_record
if __name__ == '__main__':
now = datetime.now()
log_formatter = logging.Formatter("%(asctime)s [%(threadName)-12.12s] [%(levelname)-8.8s] %(message)s")
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
log_console_handler = logging.StreamHandler()
log_console_handler.setFormatter(log_formatter)
root_logger.addHandler(log_console_handler)
final_array = []
wb = openpyxl.load_workbook("CONSOLIDATEDExcel.xlsx")
for focus_sheet_name in wb.sheetnames:
root_logger.info("Processing sheet: " + focus_sheet_name)
focus_sheet = wb[focus_sheet_name]
root_logger.debug("Total number of rows: " + str(focus_sheet.max_row) + '. And total number of columns: ' + str(focus_sheet.max_column))
if focus_sheet["A6"].value == "Date": # Checks for the key field "Date" in position A6
my_list = []
for sheet_row in focus_sheet.iter_rows(
min_row=7, max_row=26, min_col=1, max_col=5, # Gets specific section of the worksheet
values_only=True):
sheet_row_list = list(sheet_row)
if sheet_row_list[0] != None: # Ignore lines that are blank
final_array.append(format_row(sheet_row_list))
else:
root_logger.info("Cannot find key column \"DATE\". Skipping sheet")
# Output records to a tab delimited file
with open("Consolidated_output.tab", 'w', newline='', encoding="utf-8") as myfile:
wr = csv.writer(myfile, quoting=csv.QUOTE_ALL, delimiter="\t")
wr.writerows(final_array)
Also; it ran blindingly quick – as quick as I feel in VBA with Application.ScreenUpdating = False
Anyway; Fraud team was happy I turned their request around quickly. I’m guessing if the fraud team is involved – someone is about to have a bad day.
Websites store cookies to enhance functionality and personalise your experience. You can manage your preferences, but blocking some cookies may impact site performance and services.
Essential cookies enable basic functions and are necessary for the proper function of the website.
Name
Description
Duration
Cookie Preferences
This cookie is used to store the user's cookie consent preferences.
30 days
These cookies are needed for adding comments on this website.
Name
Description
Duration
comment_author
Used to track the user across multiple sessions.
Session
comment_author_email
Used to track the user across multiple sessions.
Session
comment_author_url
Used to track the user across multiple sessions.
Session
Statistics cookies collect information anonymously. This information helps us understand how visitors use our website.
Google Analytics is a powerful tool that tracks and analyzes website traffic for informed marketing decisions.
Contains information related to marketing campaigns of the user. These are shared with Google AdWords / Google Ads when the Google Ads and Google Analytics accounts are linked together.
90 days
__utma
ID used to identify users and sessions
2 years after last activity
__utmt
Used to monitor number of Google Analytics server requests
10 minutes
__utmb
Used to distinguish new sessions and visits. This cookie is set when the GA.js javascript library is loaded and there is no existing __utmb cookie. The cookie is updated every time data is sent to the Google Analytics server.
30 minutes after last activity
__utmc
Used only with old Urchin versions of Google Analytics and not with GA.js. Was used to distinguish between new sessions and visits at the end of a session.
End of session (browser)
__utmz
Contains information about the traffic source or campaign that directed user to the website. The cookie is set when the GA.js javascript is loaded and updated when data is sent to the Google Anaytics server
6 months after last activity
__utmv
Contains custom information set by the web developer via the _setCustomVar method in Google Analytics. This cookie is updated every time new data is sent to the Google Analytics server.
2 years after last activity
__utmx
Used to determine whether a user is included in an A / B or Multivariate test.
18 months
_ga
ID used to identify users
2 years
_gali
Used by Google Analytics to determine which links on a page are being clicked
30 seconds
_ga_
ID used to identify users
2 years
_gid
ID used to identify users for 24 hours after last activity
24 hours
_gat
Used to monitor number of Google Analytics server requests when using Google Tag Manager