An Analysis of WeRateDog Tweets

Project Details:

This project is from Udacity's Data Wrangling course, which is part of the Data Analyst Nanodegree program. The goal is practice data wrangling, i.e. to use Python and its libraries to gather data from various sources of different formats, assess its quality and tidiness, and then clean this data. Additionally, we will analyze and report on our findings. These efforts will be documented in this Jupyter Notebook.

The steps include:

Please note some of the descriptive text in this notebook has been taken directly from the Udacity course materials. In particular, the indented sections are specific instructions for the project.

Table of Contents

Introduction

We will download a twitter archive and supplement it with additional data. As per to the Udacity course materials:

The dataset that we will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

WeRateDogs downloaded their Twitter archive and sent it to Udacity via email exclusively for use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017.

Part I: Gathering Data

The twitter archive contains basic data for over 5000 WeRateDogs tweets. We will gather data from two additional sources in order to make our analysis more interesting. Specifically, we will need:

  1. The WeRateDogs Twitter archive (twitter_archive_enhanced.csv). This file has already been placed in the same directory as this notebook.

  2. Tweet image predictions that have been created with a neural network image classifier and are accessible as part of the Udacity course materials.

  3. Additional information gathered by using the Twitter API, such as like and retweet counts.

Let's start by importing some necessary Python libraries for our work.

Let's load the WeRateDogs twitter archive into a Pandas dataframe.

Next we will gather the tweet image predictions, i.e., what is the dog's breed, etc.

Every image in the WeRateDogs Twitter archive has been run through a neural network that can classify breeds of dogs. The results include a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images). This file (image_predictions.tsv) is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

Because the archive doesn't contain important information such as retweet and favorite (like) counts, we will need to get these from Twitter's API.

Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt. Each tweet's JSON data should be written to its own line.

Note: The student was unable to obtain API access from Twitter, therefore, as per the course instructions, the following code from the Udacity course materials has been cut and pasted into this notebook. It has been commented out, as running it would produce errors without the proper keys. If you wish to run it, you will need to insert your own consumer key and secret, as well as your own access token and secret and then uncomment all of the code. The tweet_json.txt file that would have been produced by running this code has been taken from the provided course materials and uploaded into the same directory as this notebook in order to simulate the final result of running this code.

Next, we need to read this the text file that was created above line by line into a pandas DataFrame, saving the tweet ID, retweet count, and favorite count.

Again, note that tweet_json.txt file that would have been produced by the above code has been taken from the provided course materials and uploaded into the same directory as this notebook

Part II: Assessing Data

Now that we have gathered our data, we must assess it visually and programmatically to identify any quality and tidiness issues. Our work must meet the following project standards:

Let's analyze each dataframe at a time starting with the archive. Then, all documented issues found will be summarized in a issues list section below.

(1) Twitter Archive Data

We'll start by printing our dataframe and doing a quick visual assessment.

There are a number of issues that we can see right away:

  1. The in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp columns all appear to contain many null (NaN) values. If a value is present in any of these columns, it indicates that the tweet is actually a reply (the first two) or a retweet (the last three). Because one of our requirements is that "we only want original ratings (no retweets)", we should remove any rows that have non-null values in any of these columns. Aftward, we can drop these columns, as they won't contain any values and they will only clutter our dataset.

  2. The source column appears to be more wordy than it needs to be, i.e. do we actually need the hyperlink since it appears that this column is chiefly about identifying how the tweet was sent, i.e. from an iphone, the website or via another app?

  3. The text column is actually a concatination of the tweet text, followed by the dog's rating, then a shortened hyperlink to the actual tweet itself on twitter. The rating has been split into the rating_numerator and rating_denominator columns. The full hyperlink has been expanded in the expanded_urls column. We could extract just the text part to isolate it and make it tidy, but unless we need this later in our analysis, this doesn't appear necessary at this time.

  4. Sometimes the expanded_urls column contains more than one URL and scanning through them visually, they always appear to be duplicates. After opening up these URLs in a browser, it seems like there are duplicates when there is more than 1 photo in the tweet. However, these are not the URLs for the images, they are links to the original tweet itself. Unless we are going to be doing analysis on this column, fixing this issue doesn't seem to be a priority.

  5. The rating_denominator column has at least one value that is not 10. Recall that ratings are supposed to be some number "out of 10."

  6. The name column doesn't always contain a name. For example, there are words like "a" and "such" indicating that the extraction of the dog's name from the tweet text was not perfect. Also, there are "None" values.

  7. The doggo, floofer, pupper, and puppo columns contain many "None" values. If the value is not "None", then it is simply a repeat of the column name. Also, it appears that either all of these columns within a given row are "None," or if not, then there is at most one of the four that is not "None." This is very untidy. These columns represent WeRateDog's unique "dog stages" (i.e. what stage of life a dog is in). Since these columns all ultimately represent one variable, they should be merged into one dog_stage column.

Next, let's explore some of the columns programatically to better understand the issues at hand.

Let's inspect source column's values to determine how we should handle this column.

The source column has only a few values and they appear to represent the name of the app that was used to create the tweet, i.e. Twitter for iPhone, Vine, the Twitter Web Client (i.e. twitter.com in a browser), and TweetDeck. We clean it up by replacing the link with the actual app name.

Next, let's see if we can better understand the problem in the name column and understand why names were extracted incorrectly.

So, it appears that in addition to "None," very common English words like "a", "the", "an", "very", etc. were extracted.

Let's look at this a little further by checking the values of all names that start with a lower case letter.

There's actually not many of these, so we could replace these with nulls. Did any of the more popular of these ("a", "the" or "an") appear at the beginning of a sentence, i.e. where they would presumably be capitalized.

These non-names appear to have been extracted mid-sentence. Let's do a quick check on capitalized names to see if we see any other instances of issues that stand out.

It looks like we can concentrate our efforts on fixing the list of uncapitalized words and "None" to clean this column.

Next, let's look at the rating_denominator column. As we saw above, dog ratings are generally on scale of some number "/10". Let's examine this more closely to see if we can determine how the ratings were extracted incorrectly from the text column.

It appears that we run into problems when certain strings that look like ratings (i.e. "960/00") appear before the actual rating: For example, when a date occurs before the rating (i.e. "4/20" for April 20) or when a percentage is given (i.e. 3 1/2). Sometimes, however, a rating is given for a group of dogs (i.e. 44/40) and in these cases it seems OK that we don't have the standard format of "/10."

Scanning through the list it appears there are less than 10 that are actually wrong and they can be fixed manually.

Next, let's examine the four columns that represent dog stages a little more closely (i.e. doggo, floofer, pupper, and puppo). From our initial visual assessment, it appeared that all four columns within a given row are "None" or at most one of them is has a different value. But is this actually true?

It does appear that we have some rows that contain more than one dog stage. Given that there are at most 14 of them (depending on whether or not they are a reply, retween etc.), it seems that for the purpose of this analysis we can simply pick one of them when we merge these columns into a single column. Let's just verify that the only values that each of the 4 columns contains is None and a repeat of the name of that column itself.

Now that we've examined the columns that we know have issues, let's look examine our dataframe structure as a whole so we can investigate issues such as null counts and data types a little more closely.

The timestamp is being stored as a string. This should be changed to datetime type.

Besides the columns for replies, retweets and stages mentioned above. It looks like the expanded_urls column also contains null values. Let's examine when this occurs.

All but four of these instances of null values are actually replies and one of those four is actually a retweet. Therefore, most of these rows will be removed from our set anyway. Let's look at the remaining three:

The issue appears to be that these three are missing the URL at the end of the text field, hence it can't be extracted into the expanded_urls column. If we plug the tweet id's into the twitter url format (for example, https://twitter.com/dog_rates/status/696518437233913856) and visit each of them in a browser, we can see that these tweets don't have photos. In order to follow our project's requirements, we will need to exclude these three rows.

Finally, let's verify that our dataset doesn't contain duplicates.

That wraps up our initial assement of the archive data!

(2) Image Predictions

As per the Udacity project description, the image predictions file is:

a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images). So for the last row in that table:

  • tweet_id is the last part of the tweet URL after "status/" → https://twitter.com/dog_rates/status/889531135344209921
  • p1 is the algorithm's #1 prediction for the image in the tweet → golden retriever
  • p1_conf is how confident the algorithm is in its #1 prediction → 95%
  • p1_dog is whether or not the #1 prediction is a breed of dog → TRUE
  • p2 is the algorithm's second most likely prediction → Labrador retriever
  • p2_conf is how confident the algorithm is in its #2 prediction → 1%
  • p2_dog is whether or not the #2 prediction is a breed of dog → TRUE
  • etc.

Again, we'll start our assesment by printing our dataframe and doing a quick visual assessment.

The first thing that jumps out is that the values p1, p2 and p3 columns are dirty. There is inconsistent capitalization and underscores between words. We will need to clean these columns.

At a first glance all values appear to be present. However, we'll need to verify programatically later if nulls are present.

Another thing that stands out is jpg_url column. This contains a link to each dog photo in Twitter's image hosting domain. Since tweets can contain up to four images, we should dertermine if tweet id's are repeated:

There are no duplicate tweet id's, which will make it easier to merge this dataset with the archive, if necessary.

Next, let's confirm that there are no nulls and verify that our data types are correct.

Our dataset contains no nulls and the data types for each column appear to be correct.

One thing that stands out, however, is that we have substantially fewer rows in this dataset than in the archive. Let's compare actual counts.

Recall that we will need to remove some rows from the archive because they are replies or retweets. There were also 3 regular tweets that do not contain images that we will need to remove. Let's get a total count of the archive rows that will remain and compare it to our image prediction row size.

So, we already know we will be exlcuding 262 records (3 + 78 + 181), which leaves us with a total of about 2094 df_twitter_archive rows, which is still higher than the 2075 image predictions.

Bottom line: There are about 19 tweets in the archive for which we will not have image predictions.

This brings us to our last point. From a tidiness standpoint, while we there is a case to be made to store this information in it's own separate dataframe/table, it will facilitate our analysis if it is moved into the same dataframe as our archive data.

(3) API data

There are no glaring issues that are visible when scrolling through the dataset. Let's check for nulls and mismatched datatypes programatically.

There are no nulls, however all of our columns are numbers (specifically, integers), yet they are being stored as text. Retweet_count and favorite_count should be changed to integer types, as we will want to be able to perform numeric and statistical funtions on them. Initially, it might not seem necessary to convert the tweet_id column, as we will not be performing mathematical functions on it, however if we wish to join our data with the archive data (where tweet_id is an integer), our column should be of the same type.

Next let's check to see if there are any duplicates.

Note, however, that there are only 2354 rows in the api data, but there are 2356 rows in our archive. While we will be removing replies, retweets and tweets without photos, it is possible that there could be some missing data.

Finally, from a tidiness perspective, there is no reason this data needs to be in its own separate dataframe. In fact, it should be merged into the larger twitter archive dataframe as retweet and favorite counts are essential to a tweet.

Assesment Summary: Issues List

Below is a list of our identified quality and tidiness issues for all dataframes that we will need to clean.

Quality

df_twitter_achive dataframe
df_image_predictions dataframe
df_twitter_api_data dataframe

Tidiness

df_twitter_achive dataframe
df_image_predictions dataframe
df_twitter_api_data dataframe

Part III: Cleaning Data

Clean each of the issues you documented while assessing. The result should be a high quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

The first thing that need to do copy our datasets to a new frames before we start cleaning them.

Next, we'll adress the issues above in an organized fashion. We will start by handling issues with missing data. Then, we will address our tidiness issues. Finally, we will fix the quality issues.

(1) Missing Data

Issue: Replies and retweets are included in our archive dataset.

Define

Keep only rows that contain null values in the in_reply_to_status_id and in retweeted_status_id columns. This will "delete" all of the replies and retweets respectively.

Code
Test

Issue: Tweets have no photo are included in our archive dataset.

Define

Remove rows that have nulls in the expanded_urls column.

Code
Test

Issue: Not all tweets in the archive have an image prediction.

There are 2075 image predictions and 2356 tweets in the archive. Even after we remove replies, retweets and tweets without photos there will still be 2094 tweets in the archive. There will be at most 19 tweets that don't have associated image predictions. It seems safe to be able to exclude these from our analysis. If we perform an inner join when merging the image predictions into the archive dataset, this issue will be addressed as a result. There's no action required at this time.

Issue: Not all tweets in the archive have associated api data.

There are only 2354 rows in the api data, but there are 2356 rows in our archive. There will be at most 2 tweets that don't have associated api data. It seems safe to be able to exclude these from our analysis. If we perform an inner join when merging the api data into the archive dataset, this issue will be addressed as a result. There's no action required at this time.

(2) Tidiness

Issue: The columns associated with replies and retweets should be removed from the twitter archive.

Define

Drop the in_reply_to_status_id, in_reply_to_user_id, in retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp columns from df_archive_clean.

Code
Test

Issue: The doggo, floofer, pupper, and puppo columns of the image predictions dataset should be converted into a single dog_stage column.

Define

Merging columns just screams for using the pandas "melt" function. We pick our set of columns to be used as the "id" for each row (this will be all of the columns that we're not merging or "melting" together), as well as the four that we are melting together (these are the "values"). This will reduce our four columns down to two: one for the original column name (the "variable") and one for it its "value." Note that in our case this will create four rows for each tweet id. We don't need the "variable" column, so we can drop that. Now, if only we could we could easily pick the one of these four rows that we need to keep. Fortunately, the Panda's sort_values function can help. Becuase it will sort capitalized words before uncapitalized ones, we will easily be able to identify a non-"None" value if one exists, as it will be the last one after sorting. Then all we need to do is to drop the preceding rows. As we saw in the Assess section above, there is the rare case where there is actually more than one dog stage. In this case we will chose the one that gets sorted to the last position.

To summarize, in order to "melt" the above four columns into a single dog_stages column, we will have to perform the following steps:

  1. Prepare the parameters we need in order to use the Panda's melt command.
  2. Perform the actual melt.
  3. Delete the unneeded "variable" column.
  4. Sort our resulting dataframe by dog_stage, causing rows with a "None" stage to float to the top.
  5. Drop the "duplicate" rows for each tweet_id, keeping only the one the last sorted dog_stage "value."
Code

Before we begin, lets get a count of how many of our tweets contain a value in one of the 4 columns named above. This number should match the number of records that have a value in the newly created dog_stage column when we are done.

Test

Issue: The image predictions dataset should be merged into the larger tweet archive, joined on tweet_id

Define
Code
Test

Issue: The api data should be merged into the larger tweet archive, joined on tweet_id (which needs to be converted to an integer type first)

Define
Code
Test

(3) Quality

Issue: timestamp has the wrong data type (it should be datetime and not text)

Define

Convert the timestampcolumn to datetime type

Code
Test

Issue: retweet_count and favorite_count should be converted to integer format

Define
Code
Test
Define
Code
Test

Issue: name contains erroneous values ("a", "such", "None")

Define
Code
Test

Issue: rating_numerator and rating_denominator were not always extracted correctly, especially if there was a date or percentage preceding the rating in the tweet.

Define

There were only a small amount of tweets in which the rating was incorrectly extracted. We can eyeball these and fix these manually:

Code

We observe the following:

  1. index: 100, rating_numerator: 9, rating_denominator: 10
  2. index: 314, rating_numerator: 10, rating_denominator: 10
  3. index: 455, rating_numerator: 13, rating_denominator: 10
  4. index: 476, rating_numerator: 11, rating_denominator: 10
  5. index: 1146, rating_numerator: 14, rating_denominator: 10

The others are ratings for multiple dogs so it's OK if the denominators are more than 10.

Test

Issue: The p1, p2 and p3 columns are dirty (i.e. capitalization is inconsistent and there are underscores between words.)

Define
Code
Test

We have finally resolved all of our identified issues and cleaned our data. Note that we may not have identified every single issue, however we have a much cleaner data set and we can always revisit this section to resolve any additional issues.

Let's save a final copy of our dataframe.

Part IV: Storing, Analyzing and Visualizing Data

Now that are data has been cleaned we should store it in a new file that can be used in future work. Next, we can start analyzing and visualizing our data to identify interesting trends.

(1) Store Data

Store the clean DataFrame(s) in a CSV file with the main one named twitter_archive_master.csv. If additional files exist because multiple tables are required for tidiness, name these files appropriately.

(2) Analyze and Visualize Data

Analyze and visualize your wrangled data in your wrangle_act.ipynb Jupyter Notebook. At least three (3) insights and one (1) visualization must be produced.

One thing that stuck out when we were cleaning the dog ratings, especially because the account is famous for its higher then 10/10 ratings, was that one of the ones that we cleaned was 9/10. We'll definately want to explore the ratings more. Also, given our image predictions and retweet and favorite data, it would be interesting to see if tweets about certain breeds got more user interaction or interest.

While there are a plethora of questions to investigate, we will explore the following questions in this analyis:

  1. What is the average rating?
  2. Have rating trends changed over time, i.e. are the ratings higher in later tweets vs. earlier tweets?
  3. Do certain dog breeds have higher ratings?
  4. Based on retweets and favorites, what dog breeds are the most popular?

Question (1): What is the average rating?

Given that we chose to keep rating denominators that were higher than 10 (i.e. multiple dogs were included in the same rating), there are a couple of ways we can procede to answer this question:

Converting this into a fraction, we have an average across all dogs of 11.67/10.

Let's look at just the one dog ratings next.

Interesting! When only a single dog is rated, the rating is actually slightly higher. (Note that the rating is printed as "11.7" when it is actually 11.70.)

Just out of curiosity, let's calculate the average per dog in tweets that contain a single rating with more than one dog.

This is actually quite a bit lower. Given that the difference between the average rating per dog and the average one dog rating was much smaller, it seems likely that there aren't many ratings in our set for multiple dogs at once. Let's get a count of them.

So, we can see that there aren't many multiple dog ratings (12 in total) in the set.

In general, we can safely say that our average rating is about 11.7/10

Finally, our analysis of ratings wouldn't be complete without looking for outliers and then plotting our distribution. First, let's check the range of our values.

The values of 1776 and 420 look like they are one-offs to celebrate special occasions such as American independence and Earth Day. They only occur once. On the other end of the spectrum, there is a single 0 rating.

Finally, let's plot our distribution. For now, we'll ignore our five rating outliers (26, 27, 75, 420, 1776) that occur only one time.

As we can see, looking at the average rating per dog, our plot is left skewed and we have a small number of low ratings with a much larger number of those in the 11-13/10 range.

Next, let's determine how ratings have evolved over time. Can we tell if our rater become more generous over time or are they more demanding?

Let's create a new dataframe that holds normalized ratings (i.e. all will be on a scale of 10, which means will take the average rating per dog for multiple dog ratings).

Let's take a preliminary look at our ratings trends over time.

Yikes! First, plotting our results by each timestamp gives us resuts that are way too fine/precise. We should take an average per month.

More importantly, we've got those outliers we saw above. Since we are just trying to get an idea of the general trend over time, we don't need these in our set. Let's remove the top two highest values and the single 0 value.

Now, let's look at average our results by month.

Here we can clearly see that our average ratings have increased over time. From the time our dataset begins in late 2015, the average rating was below 10. When we finish in July of 2017, it has increased to 13. Our rater may have become more generous over time, but we'd need more evidence to prove this definitively. For example, did tweet content change over time, justifying a higher rating, etc.

Question (3): Do certain dog breeds have higher ratings?

Which predicted dog breeds get higher ratings on average? Are there any conclusions might we be able to draw from this?

To simplify our analysis, let's look only at tweets where a dog was the first predicted image and where the confidence is atleast 0.5 (50%).

Now, let's get a list of our highest rated dogs by breed on average.

Let's plot our findings for the top 10 highest rated breeds.

Our top result is very interesting, as it has a much higher average rating compared to the other breeds. Let's check to see how many 'clumbers' we have in this dataset.

There's only 1. Let's check the counts in our set of the top ten breeds with highest average ratings listed above.

These counts aren't very high considering that there are almost a thousand dogs in our set. We've stumbled upon something interesting. Most of these breeds are not among the most well known, at least not in North America, and they are not predicted very frequently. Let's check our value counts for breed predictions to get a better sense of which breeds are predicted most frequently.

This shows us that there are actually more predictions for more common, well-known breeds such as "golden retriever." Let's look at the average ratings for the top 10 predicted breeds.

It appears that rarer, less predicted breeds are rated higher on average that more well-known and more frequently predicted breeds.

While we might be tempted to think that our rater has a preference for more exotic dog breeds, we'd need more evidence before coming to this conclusion definitively. Also, we must remember that we are dealing with image predictions, which may or may not be correct in reflecting the actual breed of the dog in the tweeted photos.

Question (4): Based on retweets and favorites, what dog breeds are the most popular?

Retweeting and favoriting a tweet are generally signs of approval on Twitter. Based on this, what dog breeds seem more "popular" in our set of tweets?

Let's create a new dataframe with the information that we'll need. Again, we're going to limit our set to those tweets for which the first image prediction is for a dog with confidence of greater than 50%. Also, to keep things simple, we're going to assume that popularity, as far as Twitter is concerned can calculated by summing the retweet and favorite count for a given tweet.

Now, let's examine which predicted breeds have the highest popularity based on the average of the sum of retweet and favorite counts.