Behind the Scenes: How We Solved the Signature Images Problem When Processing Replies by Email
One of the most used features of TeamworkPM is the ability to reply to email notifications and have your comment added to your project. Attachments you send with the email response are also added to the comment and displayed in the Files tab of your project. One of my core areas on the development team is email processing. If email is being sent to Teamwork, I’m the guy that makes it all work. Because email is the method of communication, there are many “problems” that need to be addressed on our end. These include handling Text/HTML emails, parsing out the “reply” bit and discarding the quoted content from the original notification, validating the user sending the response, saving the attachments and so on. In this article I thought it would be interesting to focus on one of the challenges we faced when handling email attachments and the solution we implemented: Signature Images.
When we first launched Teamwork, email notifications were one way. You could receive notifications but we didn’t accept replies, so you had to log in to Teamwork to reply to a Message or Comment. People are used to email and expect to be able to reply to emails, so we added the ability to reply to the email notifications and have the reply added as a Comment or Message on your project. Back then you could add a Comment on Teamwork but you couldn’t attach Files to the Comment. Because of this limitation we used to skip any attachments when processing the emails.

As Teamwork progressed, we added the ability to attach Files to Comments in Teamwork. This meant that we could now also process any attachments on email replies. Fantastic! People loved it… But a new problem arose: people’s Signature Images were now being added to the Comment and, worse, they were being added to the Files section… on every reply. The feedback started to roll in.
When an email arrives at Teamwork we get a list of attached Files. Some are genuine attachments, some are Signature Images. From the email alone we had no reliable way of knowing which were legitimate attachments and which were part of a signature. We analyzed hundreds of emails and could not find a standard way of distinguishing the attachments. There were some File names that looked standard (such as image001.png) which we thought we could simply ignore, but then we found that images pasted in-line in an email could also be named image001.png. We couldn’t ignore these as they could be an important screenshot!
Initially, we thought there was nothing we could do about this. We reckoned people receive email every day with the same Signature Images in Gmail, Thunderbird, Outlook, etc, so they’d understand when the Signature Images were also received multiple times in Teamwork. But Teamwork is different. The Files tab was being cluttered with these signature images and File space was being used up by them. The feedback came in thick and fast. Something needed to be done. I was getting tired of answering the same feedback over and over again…

And then I had an epiphany. What if we could get a “fingerprint” of each image file, so that no matter what the File was named we’d know it was the same image? I could use a hash function to get the fingerprint! A hash function is an algorithm that takes an arbitrary block of data and returns a fixed-size bit string. For example, if you hashed an ebook by passing its contents into an MD5 hash function, you would get back a 32-character string. If you hashed the same ebook 10 times you’d always get back the same 32-character string. If you changed even a single letter in the ebook and passed the contents back through the function, you’d get a completely different 32-character string.

So I did a bit of testing… I created a unit test for a new function I wrote that reads the binary file contents, converts them to a string and then hashes the result using an MD5 hash. I took 3 copies of the same image, renamed them to file1.jpg, file2.jpg and file3.jpg, ran each one through my function, and the same “fingerprint” string was returned. I then expanded my unit test so it would connect to a POP mailbox, process the attachments and generate the “fingerprint” of each attachment. I sent 3 emails to this POP box, each with the same image attached, and it worked! All 3 attachment fingerprints matched. I spent a few hours refining my function and expanding the Teamwork code base, and then implemented the attachment hashing.
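To make the fingerprint idea concrete, here is a minimal sketch in Python. It is not the Teamwork code: the function names `fingerprint` and `fingerprint_file` are made up for illustration, and it hashes the raw bytes directly rather than converting them to a string first. It only shows that an MD5 digest depends on the file contents and not on the file name.

```python
import hashlib
from pathlib import Path


def fingerprint(data: bytes) -> str:
    """Return the MD5 digest of the raw bytes as a 32-character hex string."""
    return hashlib.md5(data).hexdigest()


def fingerprint_file(path) -> str:
    """Read a file in binary mode and fingerprint its contents.

    The file name plays no part in the result (hypothetical helper).
    """
    return fingerprint(Path(path).read_bytes())
```

Three copies of the same image saved as file1.jpg, file2.jpg and file3.jpg would all produce the same string from `fingerprint_file`, while changing a single byte of the contents would produce a different one.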
As Signature Images were the problem, I only had to work out the fingerprint for attachments that were actually image files (png, gif, bmp, jpeg, jpg, tiff etc). Signature Images are also predominantly small files, so I only really need to process image files under 100 KB. When we process an email and get the list of attachments, we pass any image attachments through my new function, which returns the ‘fingerprint’ string. We process all the attachments as normal and we store the ‘fingerprint’ in the database along with the File reference. The next time an email is processed, we do the same thing for each image attachment, but we look up the database to see if we already have an image with the same ‘fingerprint’. If we do get a match, we log the match for later analysis, skip the attachment and move on to the next one. We added an index on the ‘fingerprint’ column of the Files table in the database, which means that even if there are millions of records in the table the lookup stays fast.

Since we implemented this yesterday, our log files have shown a massive increase in the number of Signature Images detected and removed from the reply-by-email functionality. The very first time a reply-by-email is processed for a user, their signature images will be added as normal, but from then on they will be skipped. If you delete the Files from your Teamwork project, we’ll still continue to match the image ‘fingerprints’ on subsequent processing. Hopefully, I’ll never answer feedback related to Signature Images again!

Dan.
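As a rough illustration of the lookup flow described above (again, not the actual Teamwork code), the sketch below uses Python with an in-memory SQLite database. The `files` table, its columns, the index name, the exact extension list and all function names are assumptions made up for the example. The lookup is global here, whereas a real system might scope it per user or per project.

```python
import hashlib
import sqlite3
from pathlib import PurePath

# Assumed values, based on the extensions and size limit mentioned in the text.
IMAGE_EXTENSIONS = {".png", ".gif", ".bmp", ".jpeg", ".jpg", ".tiff"}
MAX_SIGNATURE_BYTES = 100 * 1024  # only fingerprint images under 100 KB


def open_store() -> sqlite3.Connection:
    """Create a hypothetical files table with an index on the fingerprint."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, fingerprint TEXT)"
    )
    conn.execute("CREATE INDEX idx_files_fingerprint ON files (fingerprint)")
    return conn


def is_signature_candidate(name: str, data: bytes) -> bool:
    """Small image files are the only attachments worth fingerprinting."""
    is_image = PurePath(name).suffix.lower() in IMAGE_EXTENSIONS
    return is_image and len(data) < MAX_SIGNATURE_BYTES


def process_attachments(conn, attachments):
    """Save new attachments; skip small images whose fingerprint is known.

    attachments is a list of (file name, raw bytes) pairs.
    Returns (saved names, skipped names).
    """
    saved, skipped = [], []
    for name, data in attachments:
        fp = None
        if is_signature_candidate(name, data):
            fp = hashlib.md5(data).hexdigest()
            match = conn.execute(
                "SELECT 1 FROM files WHERE fingerprint = ? LIMIT 1", (fp,)
            ).fetchone()
            if match is not None:
                skipped.append(name)  # seen before: treat as a signature image
                continue
        conn.execute(
            "INSERT INTO files (name, fingerprint) VALUES (?, ?)", (name, fp)
        )
        saved.append(name)
    conn.commit()
    return saved, skipped
```

Calling `process_attachments` twice with the same small image, even under a different file name, saves it the first time and skips it the second. Non-image attachments and larger images are never fingerprinted, so they are always saved. The sketch does not model File deletion. To keep matching after a File is deleted, as described above, the fingerprint record would have to outlive the File itself.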