DataSklr


Enron: JSON Files


Use Python to Analyze Email Corpus

What is Enron?

Enron was a major energy company based in Texas. It was involved in accounting fraud, resulting in a scandal that dominated the news in 2001 and eventually ended in the bankruptcy of the company. Enron’s senior management was dismissed, and several key individuals were convicted of various crimes and received jail time as well as significant fines.

During the litigation and bankruptcy proceedings, emails sent and received by agents of Enron were collected and preserved. Selected emails were made available for research. This database is called the Enron Corpus and contains the email communication of over 150 employees, including that of then-CEO Kenneth Lay.

What are we trying to do?

I did a bit of forensic analysis. I wanted to see if I could find email communication to and from the CEO, Ken Lay, among the 250,000 emails in the extract. As it turned out, the exercise was tricky because Ken did not usually communicate through his own email account. I wanted to count the messages sent and received by Kenneth Lay and see who was the lucky person who received the most messages from him. I also wanted to see who sent the most messages to him. I thought it would be interesting to see whether the volume of Ken’s emails increased or decreased after the bankruptcy. We also know that Arthur Andersen, the accounting firm, played a significant role in the scandal, so I decided to count the number of messages mentioning Arthur Andersen. And as a last item, I wanted to create a nice graphic from the emails. I created a word cloud to…well, why not?

Let’s go already!

First, we need to load all the necessary packages for this work. The data is available from lots of different sources. Just Google it! I downloaded it and created a pickle for myself. I used the pickle in the analysis.
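The loading step might look roughly like the sketch below. It assumes the pickle holds a pandas DataFrame; the helper name `load_emails` and the two-row stand-in frame are hypothetical and exist only so the example runs without the real corpus.

```python
import os
import tempfile

import pandas as pd


def load_emails(path):
    """Read a pickled DataFrame of emails from disk."""
    return pd.read_pickle(path)


# Hypothetical two-row stand-in for the real corpus.
sample = pd.DataFrame(
    {
        "From": ["kenneth.lay@enron.com", "rosalee.fleming@enron.com"],
        "To": ["all.worldwide@enron.com", "steven.kean@enron.com"],
        "body": ["Thank you all.", "Rosalee for Ken"],
    }
)

# Round-trip through a temporary pickle to demonstrate the load call.
with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "sample_emails.pkl")
    sample.to_pickle(pickle_path)
    emails = load_emails(pickle_path)

print(emails.shape)          # (rows, columns)
print(list(emails.columns))  # column names
```

With the real data, the same call would simply point at the pickle file created from the downloaded corpus.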

It appears that Ken communicated through many different email addresses, and he also asked others to send communication on his behalf. Some inspection of the data led to several email addresses. I found that his secretary, Rosalee, sent messages on the CEO’s behalf, but Ken communicated through the kenneth.lay@enron.com, ken.lay@enron.com and office.chairman@enron.com email addresses as well. Also, there were some emails from Ken using the enron.announcements@enron.com email address.

A screenshot of Python output: kens_mail3 represents all emails from Ken Lay sent by his secretary, Rosalee Fleming

A Bit More Specific Digging for Emails Sent by Kenneth Lay Under His Own Name:

I first searched for Kenneth Lay’s emails based on typical corporate email nomenclature such as kenneth.lay@enron.com, ken.lay@enron.com, kenneth_lay@enron.com, ken_lay@enron.com, and klay@enron.com.

Kenneth Lay appeared to have sent emails under kenneth.lay@enron.com (n=20) and ken.lay@enron.com (n=1). The other address variants I tried did not identify any emails sent by the disgraced CEO. A simple count for each candidate address is enough to check this.
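One way to run such a count is sketched here. It assumes the sender address sits in a `From` column of a pandas DataFrame; the function name `count_sent_by` and the demo rows are made up for illustration.

```python
import pandas as pd

# Address variants tried in the text above.
CANDIDATES = [
    "kenneth.lay@enron.com",
    "ken.lay@enron.com",
    "kenneth_lay@enron.com",
    "ken_lay@enron.com",
    "klay@enron.com",
]


def count_sent_by(df, addresses):
    """Count emails whose From field matches each candidate address."""
    senders = df["From"].str.lower().str.strip()
    counts = senders[senders.isin(addresses)].value_counts()
    # Keep every candidate in the result, with 0 for addresses never seen.
    return counts.reindex(addresses, fill_value=0)


# Hypothetical demo rows, not taken from the corpus.
demo = pd.DataFrame(
    {
        "From": [
            "kenneth.lay@enron.com",
            "Kenneth.Lay@enron.com ",
            "ken.lay@enron.com",
            "someone.else@enron.com",
        ]
    }
)

sent_counts = count_sent_by(demo, CANDIDATES)
print(sent_counts)
```

Lower-casing and stripping the addresses first avoids missing matches that differ only in capitalization or stray whitespace.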

Emails Sent by Kenneth Lay’s Secretary:

During inspection of the emails, it became evident that Kenneth Lay’s secretary, Rosalee Fleming (rosalee.fleming@enron.com), was involved in managing his email correspondence. In fact, several emails sent by the secretary contained the following text string in the body of the email: “Rosalee for Ken”. A total of 164,030 such emails were identified, although most of these were repeat emails as a result of email chains and forwards.
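A filter along these lines could isolate the secretary's messages and drop verbatim repeats. This is only a sketch: the `From` and `body` column names are assumed, `sent_on_behalf` is a hypothetical helper, and dropping exact duplicate bodies will not catch forwards that add quoted text.

```python
import pandas as pd


def sent_on_behalf(df, sender, phrase):
    """Emails from `sender` whose body contains `phrase`, without exact repeats."""
    from_sender = df["From"].str.lower() == sender
    has_phrase = df["body"].str.contains(phrase, regex=False, na=False)
    return df[from_sender & has_phrase].drop_duplicates(subset="body")


# Hypothetical demo rows, not taken from the corpus.
demo = pd.DataFrame(
    {
        "From": [
            "rosalee.fleming@enron.com",
            "rosalee.fleming@enron.com",
            "rosalee.fleming@enron.com",
            "someone.else@enron.com",
        ],
        "body": [
            "Please see attached. Rosalee for Ken",
            "Please see attached. Rosalee for Ken",
            "Lunch at noon?",
            "FYI: Rosalee for Ken",
        ],
    }
)

rosalee_for_ken = sent_on_behalf(
    demo, "rosalee.fleming@enron.com", "Rosalee for Ken"
)
print(len(rosalee_for_ken))
```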

Emails Sent Under a Corporate Title (e.g. Chairman):

Further inspection of emails resulted in the identification of several emails that contained the following strings in the body: “On Behalf Of Ken Lay” or “Office Of The Chairman”. Both queries identified a large number of emails: 193,484 (On Behalf Of Ken Lay) and 209,820 (Office Of The Chairman).

Select Some Emails for Analysis

Let’s go back to the pkl file for a few seconds. The file enron_email_df2.pkl was ingested and inspected. The file contained 250,758 emails with 13 variables: body, Date, From, Message, Subject, To, X-From, X-To, X-bcc, X-cc, mailbox, size, and subfolder.

We already talked about Ken’s emails. Kenneth Lay sent most of his messages from enron.announcements@enron.com (n=4,179) and from office.chairman@enron.com. Rosalee Fleming’s emails were further scrubbed because not all of them were communications from Ken Lay. Finally, the email addresses that received the most emails addressed to Kenneth Lay were kenneth.lay@enron.com (n=693) and klay@enron.com (n=799).

Next, Rosalee Fleming’s email correspondence was filtered for relevant emails. Rosalee sent nine emails with “Rosalee for Ken” in the body of the email. In other words, these emails were Kenneth Lay’s communication but were sent out by his secretary, Rosalee.

Further, several messages were sent out by other individuals, who concluded their email with “On Behalf Of Ken” in the body of the email. There were 45 such instances in the Enron Corpus. However, it was possible that these emails were not original communication but forwarded emails of Ken Lay. Further study was necessary for the 45 emails.

During the forensic phase, it became clear that Kenneth Lay sent some emails from a mailbox with “Ken Lay – Chairman of the Board@ENRON” in the X-From field. Inspecting the From field revealed that all of these emails were sent from no.address@enron.com. There were 107 such emails.
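The check described above can be sketched as follows, assuming the columns are named `X-From` and `From` as in the variable list; `senders_for_label` and the demo rows are hypothetical.

```python
import pandas as pd

# Display name quoted in the text above.
X_FROM_LABEL = "Ken Lay – Chairman of the Board@ENRON"


def senders_for_label(df, label):
    """Count the From addresses of emails whose X-From contains `label`."""
    matches = df[df["X-From"].str.contains(label, regex=False, na=False)]
    return matches["From"].value_counts()


# Hypothetical demo rows, not taken from the corpus.
demo = pd.DataFrame(
    {
        "X-From": [X_FROM_LABEL, X_FROM_LABEL, "Steven J Kean"],
        "From": [
            "no.address@enron.com",
            "no.address@enron.com",
            "steven.kean@enron.com",
        ],
    }
)

chairman_senders = senders_for_label(demo, X_FROM_LABEL)
print(chairman_senders)
```

If only one address appears in the result, every email carrying that display name came from the same account.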

Next, inspection of emails revealed that some emails contained “Office Of The Chairman” or “ENA Office of the Chairman” in the body. These emails were selected for further study. There were eight emails with “Office Of The Chairman” and six emails contained “ENA Office of the Chairman” in the body.

The filtered datasets (themselves dataframes) were then concatenated. Note that two different dataframes were created. The first (final_data) contained all emails sent by Kenneth Lay, while the second (final_data2) contained all emails sent to Kenneth Lay.
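Stacking the filtered pieces can be done with `pd.concat`. In this sketch, `final_data` follows the naming in the text, while the two input frames and their contents are hypothetical.

```python
import pandas as pd

# Hypothetical filtered subsets; the real ones come from the steps above.
from_own_address = pd.DataFrame(
    {
        "From": ["kenneth.lay@enron.com", "ken.lay@enron.com"],
        "body": ["Message one", "Message two"],
    }
)
via_secretary = pd.DataFrame(
    {
        "From": ["rosalee.fleming@enron.com"],
        "body": ["Message three. Rosalee for Ken"],
    }
)

# ignore_index=True renumbers the rows 0..n-1 in the combined frame.
final_data = pd.concat([from_own_address, via_secretary], ignore_index=True)
print(final_data)
```

The frame of emails sent to Ken (final_data2) would be built the same way from its own filtered subsets.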

A quick check of the data frame (final_data) revealed that some emails were erroneously included in the analysis. These emails were picked up by one of the routines that selected emails based on the presence of a specific string. All of these emails (marked in red in the original output) were removed from the final analytic data set (final_data).

Count No. of Emails Before and After Bankruptcy

The goal of the analysis was to understand whether more or fewer emails were sent by Ken Lay before and after the date when Enron declared bankruptcy. Note that the date of Enron’s bankruptcy was December 2, 2001.

The Date column was first transformed from object to datetime format. This was accomplished for both final_data and final_data2 dataframes, which represent the emails from and to Kenneth Lay, respectively.
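The conversion could be sketched like this. The `Date` column name comes from the text; the helper `parse_dates`, the sample strings, and the choice to turn unparseable values into `NaT` are assumptions.

```python
import pandas as pd


def parse_dates(df, column="Date"):
    """Return a copy of df with `column` converted to datetime.

    Values that cannot be parsed become NaT instead of raising an error.
    """
    out = df.copy()
    out[column] = pd.to_datetime(out[column], errors="coerce")
    return out


# Hypothetical demo rows; the real Date strings may use another format.
final_data = pd.DataFrame(
    {"Date": ["2001-05-14 16:39:00", "2001-12-10 09:00:00", "not a date"]}
)

final_data = parse_dates(final_data)
print(final_data["Date"].dtype)
```

The same helper would be applied to final_data2.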

Next, an object was created signifying Enron’s bankruptcy date of December 2, 2001. In this process, a string was first created (2001, 12, 2). The string signifying Enron’s bankruptcy date was then transformed to a timestamp using a datetime format.

The bankruptcy date was compared against each email’s timestamp in the Date column. If an email’s timestamp was before the bankruptcy date, a code of 1 was issued, while an email with a timestamp after the bankruptcy date was coded as 0 in a column called new. The procedure was done with both final_data (emails from Kenneth Lay) and final_data2 (emails to Kenneth Lay).
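A minimal sketch of the flagging step, assuming 1 marks emails dated before the bankruptcy and 0 marks everything else. The column name `new` comes from the text; `flag_pre_bankruptcy` and the demo dates are hypothetical.

```python
import datetime as dt

import pandas as pd

# Enron's bankruptcy date, as stated in the text.
BANKRUPTCY_DATE = pd.Timestamp(dt.datetime(2001, 12, 2))


def flag_pre_bankruptcy(df):
    """Add a `new` column: 1 if Date is before the bankruptcy date, else 0."""
    out = df.copy()
    out["new"] = (out["Date"] < BANKRUPTCY_DATE).astype(int)
    return out


# Hypothetical demo rows with already-parsed dates.
demo = pd.DataFrame(
    {"Date": pd.to_datetime(["2001-11-30", "2001-12-02", "2002-01-15"])}
)

flagged = flag_pre_bankruptcy(demo)
print(flagged)
```

An email dated exactly on December 2, 2001 is coded 0 here; that boundary choice is an assumption.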

Finally, the numbers of emails sent from and to Kenneth Lay before and after the bankruptcy date were counted. Kenneth Lay sent 4,737 emails before December 2, 2001, but sent only two after Enron applied for bankruptcy protection. He also received 836 emails pre-bankruptcy and 659 emails post-bankruptcy. Interestingly, Kenneth Lay resigned from the company in January 2002, about seven weeks after the bankruptcy date, yet the number of emails he received during this time period is significant. A graphical representation of the data was also prepared.
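Given a 0/1 flag column like the one described earlier, the tally reduces to a `value_counts` call. The helper name `count_by_period` and the demo flags are hypothetical, and the sketch assumes 1 means pre-bankruptcy.

```python
import pandas as pd


def count_by_period(flags):
    """Tally a 0/1 flag Series into pre- and post-bankruptcy counts."""
    # reindex guarantees both codes appear, even if one never occurs.
    counts = flags.value_counts().reindex([1, 0], fill_value=0)
    counts.index = ["pre-bankruptcy", "post-bankruptcy"]
    return counts


# Hypothetical flags: four emails before the bankruptcy, one after.
demo_flags = pd.Series([1, 1, 1, 0, 1])

period_counts = count_by_period(demo_flags)
print(period_counts)
```

With matplotlib installed, `period_counts.plot.bar()` is one way to draw the result as a bar chart.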

Count No. of Emails with Arthur Andersen in the Body

A total of 12 emails sent by Ken Lay explicitly mentioned Arthur Andersen. This finding is specific to the emails sent by Ken Lay, since only the final_data dataframe was used in this analysis.
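A count like this can be sketched with a case-insensitive pattern match on the body text. The pattern below also accepts the common misspelling "Anderson" and any amount of whitespace between the two words; that tolerance is an assumption, not something the original analysis is known to have done, and the demo rows are made up.

```python
import pandas as pd

# Matches "Arthur Andersen" or "Arthur Anderson", ignoring case.
PATTERN = r"arthur\s+anders[eo]n"


def count_mentions(df, pattern=PATTERN):
    """Number of emails whose body matches the pattern."""
    mentions = df["body"].str.contains(pattern, case=False, regex=True, na=False)
    return int(mentions.sum())


# Hypothetical demo rows, not taken from the corpus.
final_data = pd.DataFrame(
    {
        "body": [
            "Arthur Andersen has reviewed the books.",
            "Meeting with ARTHUR  ANDERSON tomorrow.",
            "Andersen Consulting proposal attached.",
            None,
        ]
    }
)

andersen_count = count_mentions(final_data)
print(andersen_count)
```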

Where did Ken Get Most of His Emails from?

When considering the emails sent by Ken Lay, the most frequently appearing recipient was identified. Separately, focusing only on emails received by Ken Lay, the entity sending the largest number of emails was also identified. It appears that Ken Lay rarely sent emails to individuals but rather used company-wide communication when using email. He sent 1,375 emails to all.worldwide@enron.com, which appears to be a company-wide distribution list. As for the most frequent senders to Ken Lay, Steven Kean (steven.kean@enron.com) sent 26 emails, the most of any sender in the dataset.
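Finding the most frequent address in a column is a short pandas chain. The sketch assumes the `To` field may hold several comma-separated addresses, which is worth verifying against the real data; `top_addresses` and the demo rows are hypothetical.

```python
import pandas as pd


def top_addresses(series, n=1):
    """Most frequent addresses in a column of comma-separated address lists."""
    addresses = (
        series.dropna()
        .str.split(",")
        .explode()       # one row per address
        .str.strip()
        .str.lower()
    )
    addresses = addresses[addresses != ""]
    return addresses.value_counts().head(n)


# Hypothetical demo rows, not taken from the corpus.
final_data = pd.DataFrame(
    {
        "To": [
            "all.worldwide@enron.com",
            "all.worldwide@enron.com, steven.kean@enron.com",
            None,
            "steven.kean@enron.com",
            "all.worldwide@enron.com",
        ]
    }
)

top_recipient = top_addresses(final_data["To"])
print(top_recipient)
```

Running the same function on the `From` column of final_data2 would give the most frequent sender to Ken.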


Final Code for Fun

The last thing I wanted to show with the Enron Corpus is a word cloud. One can loosely call it a type of sentiment analysis. First, I created two wordclouds using the words in Ken’s emails. Then I got a picture of Ken from the Internet and used its contour (mask) to create a wordcloud that resembles Ken. At least as much as possible!

First, the data was saved as a string and tokenized into a list of sentences as well as into a list of words. This was done separately for final_data and final_data2.
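The idea can be illustrated with the standard library alone. The regular expressions below are a simplified stand-in for a proper tokenizer such as the ones in NLTK, not the method used in the original analysis; the function names and sample bodies are hypothetical.

```python
import re


def to_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part.strip() for part in parts if part.strip()]


def to_words(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)


# Hypothetical email bodies joined into one string.
bodies = [
    "Thank you for your hard work.",
    "We will get through this. Stay focused!",
]
text = " ".join(bodies)

sentences = to_sentences(text)
words = to_words(text)
print(sentences)
print(words)
```

A rule this simple will mis-split abbreviations such as "Mr." or "U.S.", which is one reason a dedicated tokenizer is usually preferred.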

I then tokenized the data. The first output represents the result of the sentence tokenization, while the second is the result of the word tokenization. Again, this was done separately for final_data and final_data2. I am only showing a portion of the output to save some space.

Stop words and a list of punctuation marks were applied so that filler words and punctuation do not appear in the wordcloud.
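A filter of this kind might look like the sketch below. The stop-word set is a tiny hypothetical list for illustration; a real analysis would more likely use a full list from a library such as NLTK.

```python
import string

# Tiny illustrative stop-word list (an assumption, not the list used originally).
STOP_WORDS = {
    "the", "a", "an", "and", "to", "of", "for", "in",
    "is", "we", "you", "your", "will", "this",
}
PUNCTUATION = set(string.punctuation)


def remove_stop_words(tokens):
    """Lower-case tokens and drop stop words and pure-punctuation tokens."""
    kept = []
    for token in tokens:
        lowered = token.lower()
        is_punctuation = all(char in PUNCTUATION for char in token)
        if lowered not in STOP_WORDS and not is_punctuation:
            kept.append(lowered)
    return kept


tokens = ["Thank", "you", "for", "your", "hard", "work", ".", "Stay", "focused", "!"]
filtered_words = remove_stop_words(tokens)
print(filtered_words)
```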

The most frequently appearing words were counted among the list of tokenized words. The same was done with the filtered words, i.e. the list devoid of stopwords and punctuation. This was accomplished using a counter.
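Python's `collections.Counter` handles this directly. The word list here is a hypothetical stand-in for the filtered tokens.

```python
from collections import Counter

# Hypothetical filtered tokens, not taken from the corpus.
filtered_words = ["enron", "employees", "enron", "stock", "employees", "enron"]

word_counts = Counter(filtered_words)

# most_common(n) returns the n most frequent (word, count) pairs.
top_words = word_counts.most_common(2)
print(top_words)
```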

The top image shows the most frequently used words in the emails sent by Ken Lay. The bottom image represents frequent words in the emails sent to Ken Lay.

And now we are ready for the wordclouds:
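A rough sketch of the rendering step, assuming the third-party `wordcloud` package is installed. The helper names, the sample words, and the output file name are hypothetical; the render call is left commented out so the frequency step runs on its own.

```python
from collections import Counter


def build_frequencies(words):
    """Map each lower-cased word to the number of times it appears."""
    return dict(Counter(word.lower() for word in words))


def render_cloud(frequencies, out_path, mask=None):
    """Draw a word cloud from word frequencies and save it as an image.

    Requires the third-party `wordcloud` package. `mask` is an optional
    NumPy image array; pure white areas of the mask are left empty.
    """
    # Imported here so build_frequencies works without the package.
    from wordcloud import WordCloud

    cloud = WordCloud(background_color="white", mask=mask)
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(out_path)
    return cloud


# Hypothetical filtered words, not taken from the corpus.
frequencies = build_frequencies(["Enron", "employees", "enron", "stock"])
print(frequencies)

# Uncomment once wordcloud is installed:
# render_cloud(frequencies, "ken_wordcloud.png")
```

For the portrait-shaped cloud, the mask would typically be the picture loaded into a NumPy array, for example with Pillow and `numpy.array`.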
