Scraping Twitter

Here we present two ways to scrape data from Twitter. The first is through the Twitter API. The trouble with the API method is that Twitter only provides data from roughly the past 7-10 days through its API, rendering it useless for old seasons, which go back to 2009.

The second method, which we actually use, directly queries the Twitter website, dynamically scrolls through each Twitter search using an automated ("ghost") Chrome browser driven by Selenium, and then scrapes the HTML. We then process the HTML with Beautiful Soup.
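As a quick orientation before the full implementation in Step 1 below, here is a generic sketch of the infinite-scroll workaround with Selenium. It scrolls until the page height stops growing rather than a fixed number of times, and the ChromeDriver path is a placeholder, so treat it as an illustration rather than the exact function we define later.

import time
from selenium import webdriver

def fetch_scrolled_html(url, max_scrolls=10, pause=4):
    #Illustrative helper: the ChromeDriver path below is a placeholder
    driver = webdriver.Chrome('/path/to/chromedriver')
    driver.get(url)
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        #Scroll to the bottom and wait for Twitter's XHR requests to append more tweets
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break #No new content loaded, stop scrolling
        last_height = new_height
    html = driver.page_source
    driver.quit()
    return html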

In [1]:
%matplotlib inline

import oauth2
from twython import Twython
import simplejson
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import time
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
from pyquery import PyQuery as pq
from bs4 import BeautifulSoup
import requests
import datetime
import json


from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoAlertPresentException
import sys

import unittest, time, re

API Method

The API method works, but it only gives us very recent Twitter data. Below is an example of the type of code we would use to interact with the API.

In [2]:
#APP_KEY = "qtmevmQ18N1vyWTAXfxqmh4oN"
#APP_SECRET = "MdZibormo3teZPTfMyeLEcuzMURHYidArOml0GtOQyrl6dI13R"

#access_token = '2694571580-Y8DsMjB0iMTGmm3Pwpo6IL3enhhFdAZQSXDIxO8'
#access_secret = 'AYciwyU197r6adpNziDT8pB0tmT3bKIihMrx7SPfbofRO'

#twitter = Twython(APP_KEY, APP_SECRET, access_token, access_secret)
#search_results = oauth_req('https://api.twitter.com/1.1/statuses/home_timeline.json', \
#                          access_token, access_secret)

#for tweet in search_results["statuses"]:
#    print tweet["text"]

#Define Twitter GET function using OAUTH2
#Function from https://dev.twitter.com/oauth/overview/single-user
#def oauth_req(url, key, secret, http_method="GET", post_body="", http_headers=None):
#    consumer = oauth2.Consumer(key=APP_KEY, secret=APP_SECRET)
#    token = oauth2.Token(key=key, secret=secret)
#    client = oauth2.Client(consumer, token)
#    resp, content = client.request( url, method=http_method, body=post_body, headers=http_headers )
#    return content
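
For reference, a minimal sketch of an API search with Twython (imported above) would look something like the following. The credentials are placeholders, the query is illustrative, and twitter.search wraps Twitter's search/tweets endpoint, which only covers roughly the last week of tweets.

from twython import Twython

#Placeholder credentials - substitute your own app keys and tokens
twitter = Twython("APP_KEY", "APP_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

#Search recent tweets about a contestant (only ~the last week is available)
results = twitter.search(q="#thebachelor Whitney", count=100)
for tweet in results["statuses"]:
    print(tweet["text"])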

Manual Web Scrape of Twitter

Manual Scraping of Twitter presents two challenges:

1) Twitter uses JavaScript for interactive webpage scrolling. If a search produces many results, a reader who reaches the end of a page is not given a "Next Page" link; instead, Twitter automatically queries its JSON backend and dynamically loads more of the page.

To work around this issue, we use the Selenium package, which mimics "scrolling" the webpage for us. After scrolling through a set number of pages, we extract the HTML from the page, since by then Twitter has made sufficient XHR requests.

2) The page data is not in a nice JSON format, so we must parse the HTML to get at the data.

Since we are interested in the positive/negative sentiment of a tweet, Twitter's built-in sentiment operators can be appended to our search queries for a particular contestant (although, as noted in the code below, we ultimately scrape all tweets because the sentiment filter appeared to undercount them). Then all we need to do is count the number of tweet tags that we scraped.
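To make the query construction concrete, the sketch below assembles a search URL with the URL-encoded sentiment operator (":)" becomes %20%3A), ":(" becomes %20%3A() and counts tweet tags the way we do throughout this notebook. The contestant name, dates, and sample HTML are illustrative stand-ins for a real scraped page.

from bs4 import BeautifulSoup

#Assemble a sentiment-filtered search URL (contestant and dates are illustrative)
base_url = "https://twitter.com/search?f=tweets&vertical=default&q=%23thebachelor"
sentiment = "%20%3A)" #URL-encoded " :)" -- Twitter's positive-sentiment operator
query = (base_url + "%20Whitney" + sentiment +
         "%20since%3A2015-01-05" + "%20until%3A2015-01-07" + "&src=typd")

#Each scraped tweet lives in a <p class="TweetTextSize ..."> tag, so counting tweets
#is just counting matching tags (sample_html stands in for a real scraped page)
sample_html = '<p class="TweetTextSize">Rooting for Whitney! #thebachelor</p>'
soup = BeautifulSoup(sample_html, "html.parser")
n_tweets = len(soup.find_all("p", attrs={"class": "TweetTextSize"})) #1 here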

Step 1: Create Function to Scrape Twitter

In [3]:
#We borrow heavily from http://stackoverflow.com/questions/12519074/scrape-websites-with-infinite-scrolling
def scrape_page(since, until, contestant, \
                base_url="https://twitter.com/search?f=tweets&vertical=default&q=%23thebachelor", \
                pages_to_scroll=3, ):
    
    #### Initiate Chrome Browser #######
    #Must download ChromeDriver executable from https://sites.google.com/a/chromium.org/chromedriver/downloads
    driver = webdriver.Chrome('/Users/dcusworth/chrome_driver/chromedriver') #Specify location of driver
    driver.implicitly_wait(30)
    verificationErrors = []
    accept_next_alert = True

    #Create URL that will get the text
    ender = "&src=typd"
    
    #Use Twitter Sentiment Analysis - REMOVED as it may be underestimating tweets
    #if is_happy:
    #    sentiment = "%20%3A)"
    #else:
    #    sentiment = "%20%3A("
    
    since_time = "%20since%3A" + str(since)
    until_time = "%20until%3A" + str(until)
    contestant_name = "%20" + contestant    
        
    final_url = base_url + contestant_name  + since_time + until_time + ender
    #print final_url
    
    #Jump onto the webpage and scroll down
    driver.get(final_url)
    for i in range(pages_to_scroll):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(4) #Give Twitter's XHR requests time to append more tweets
    html_source = driver.page_source
    
    #After scrolling enough times, get the text of the page
    data = html_source.encode('utf-8')
    driver.quit()

    return data
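
As a usage illustration (the contestant name and date window are made up, and the ChromeDriver path inside scrape_page must point at a local install), a single call looks like:

#Example: scrape the #thebachelor tweets mentioning one contestant over a two-day window
html = scrape_page(since="2015-01-05", until="2015-01-07",
                   contestant="Whitney", pages_to_scroll=10)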

Step 2: Retrieve Data

We load in the scraped Wikipedia data that gives us each contestant's name and the dates they appeared on The Bachelor. For each season/contestant pair, we collect the tweets posted around each episode date (these are later split into positive and negative tweets).

In [2]:
#Load Contestant Name Data from wiki scrape
with open("tempdata/seasonsDict.json") as json_file:
    wiki_data = json.load(json_file)

#Fix known formatting problems:
wiki_data['19'][19]['eliminated'] = u'Eliminated in week 2'
wiki_data['19'][20]['eliminated'] = u'Eliminated in week 1'

#Keep only the first 29 contestant records for season 19
w19 = []
for ww in wiki_data['19'][0:29]:
    w19.append(ww)

wiki_data['19'] = w19
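For clarity, the shape of wiki_data inferred from the fields used below is a season number (as a string) mapping to a list of contestant records; an illustrative (not real) record would be:

#Illustrative record only -- the name and week are made up
example_record = {
    "name": u"Jane Doe",                   #Sometimes wrapped in leftover HTML markup
    "eliminated": u"Eliminated in week 2", #Or a string containing "Win"/"Run" for the winner/runner-up
}
#wiki_data["19"] is a list of such records, one per contestant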
In [3]:
#Scrape Web to find the airdates of each episode
#Use http://epguides.com/Bachelor/
sdat = requests.get("http://epguides.com/Bachelor/")

#Parse through Beautiful Soup
ssoup = BeautifulSoup(sdat.text, "html.parser")

#Get all episode text in rows
row_text = ssoup.find_all("pre")[0]

uurls = []
ep_nam = []
for r in row_text.find_all("a"):
    if "Week" in r.get_text():
        uurls.append(r.get("href"))
        ep_nam.append(r.get_text())
        
#Fix Season 19 episode problems
ep_nam[140:] = [ee + " (S19)" for ee in ep_nam[140:]]
In [4]:
good_dates = []

for uurl in uurls:
    time.sleep(1)
    #Open up subpages
    subpage = requests.get(uurl)
    soup2 = BeautifulSoup(subpage.text, "html.parser")
    
    #Find box with date in it
    pars = soup2.find_all("br")
    pp = pars[0].get_text().split()
    pind = ["Airdate" in d for d in pp]
    
    #Convert date from page into usable date
    pidx = int(np.where(pind)[0][0]) #Scalar index of the "Airdate" token
    date_string = "-".join(pp[pidx+1: pidx+4])
    date_string = re.sub(",", "", date_string)
    date_object = datetime.datetime.strptime(date_string, "%b-%d-%Y")
    good_dates.append(date_object.strftime("%Y-%m-%d"))
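As a worked example of the conversion above, three tokens pulled from an epguides airdate line parse as follows (the raw tokens are an assumed example of what the whitespace split produces):

import re, datetime

raw = "-".join(["Jan", "5,", "2009"]) #"Jan-5,-2009", as built in the loop above
clean = re.sub(",", "", raw)          #"Jan-5-2009"
iso = datetime.datetime.strptime(clean, "%b-%d-%Y").strftime("%Y-%m-%d")
#iso == "2009-01-05"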
In [5]:
#Extract the Season Number
season_num = []
for ee in ep_nam:
    start_string = ee.find("(")
    season_num.append(int(ee[(start_string+2):(len(ee)-1)]))

#Count up Episode Numbers
ep_num = []
start_val = 0
season_start = 1
for i in range(len(season_num)-1):
    if season_num[i] == season_start:
        start_val += 1
        ep_num.append(start_val)
    else:
        season_start += 1
        start_val = 1
        ep_num.append(start_val)

#The loop above skips the final episode; it belongs to the last season, so its number follows on
ep_num.append(ep_num[-1] + 1)

#Put Season / Episodes / Dates into a Pandas Dataframe
date_guide = pd.concat([pd.Series(season_num, name="Season"), pd.Series(ep_num, name="Episode"), \
                        pd.Series(good_dates,name="Date")], axis=1)

#Save as CSV for other scripts
date_guide.to_csv("date_guide.csv")
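
date_guide now has one row per aired episode with Season, Episode, and Date columns; a quick lookup like the one used in the next step (the season number is illustrative):

#Air dates for one season, as used by the tweet scraper below
season_dates = date_guide[date_guide.Season == 19]["Date"].tolist()
print(len(season_dates)) #Number of episodes found for that season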
In [8]:
#Use Date Guide + Wiki info to set up inputs to scrape_page
#For a given Season, get all contestant names
#For each contestant find how many episodes they were on (minus their elimination episode)
#For each episode, collect the tweets they received
#Output a dictionary keyed by contestant, with a list of each episode date's tweets as values

def scrape_season_tweets(season):
    
    season_dat = wiki_data[str(season)]
    all_eps = date_guide[date_guide.Season == season]
    result_dict = {}
    
    for sd in season_dat:
        #Get contestant's name
        cnam = sd["name"]          
        
        if len(cnam.split(">")) > 1:
            cnam2 = cnam.split(">")[1]
            contestant = cnam2.encode("utf-8").split(" ")[0]
        else:
            contestant = cnam.encode("utf-8").split(" ")[0]
        
        for ch in ["[", "]", "u\"","<",">"]:
            contestant = contestant.replace(ch, "")
        print contestant

        #Find the week they are eliminated, and then select weeks to run the scraper
        elim = sd['eliminated']
        if ("Win" in elim) | ("Run" in elim):
            elim_week = all_eps.shape[0] - 1
            eweek = all_eps.iloc[0:elim_week]
            use_dats = eweek["Date"]
        else:
            elim_week = int(elim[-1]) - 1 #Week number is the last character of the elimination string
            eweek = all_eps.iloc[0:elim_week]
            use_dats = eweek["Date"]

        dats = [datetime.datetime.strptime(idate, '%Y-%m-%d') for idate in use_dats]

        #For each date, run scraper, save in dictionary
        ep_dict = []
        #Skip contestants with no episode dates or with a malformed (href) name
        if len(dats) == 0 or "href" in contestant:
            result_dict[contestant] = None
        else:
            for run_date in dats:
                #Make time range
                start_time = run_date +  datetime.timedelta(days=-1)
                end_time = run_date +  datetime.timedelta(days=2)

                #Collect all tweets
                tweet_page = scrape_page(since=start_time.strftime('%Y-%m-%d'), until=end_time.strftime('%Y-%m-%d'), \
                                        contestant=contestant, pages_to_scroll=10)
                soup = BeautifulSoup(tweet_page, "html.parser")
                user_tweets = soup.find_all("p", attrs={"class": "TweetTextSize"})
                
                each_tweet = [uu.get_text() for uu in user_tweets]
                            
                #FOLLOWING CODE if doing Twitter-built-in sentiment analysis
                #Find all positive tweets
                #happy_time = scrape_page(since=start_time.strftime('%Y-%m-%d'), until=end_time.strftime('%Y-%m-%d'), \
                #                         is_happy=True, contestant=contestant)
                #soup = BeautifulSoup(happy_time, "html.parser")
                #happy_tweets = len(soup.find_all("p", attrs={"class": "TweetTextSize"}))

                #Find all sad tweets
                #sad_time = scrape_page(since=start_time.strftime('%Y-%m-%d'), until=end_time.strftime('%Y-%m-%d'), \
                #                         is_happy=False, contestant=contestant)
                #soup = BeautifulSoup(sad_time, "html.parser")
                #sad_tweets = len(soup.find_all("p", attrs={"class": "TweetTextSize"}))

                print run_date.strftime('%Y-%m-%d')

                #Save the results to a dictionary
                ep_dict.append({run_date.strftime('%Y-%m-%d'):each_tweet})
            result_dict[contestant] = ep_dict

    return result_dict
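
The returned dictionary maps each contestant to a list of {air date: [tweet texts]} entries (or None when the contestant was skipped). A small helper we add here purely for illustration turns that into per-episode tweet counts:

#Illustrative helper (not part of the scraping pipeline): tweet counts per episode
def tweet_counts(result_dict):
    counts = {}
    for contestant, episodes in result_dict.items():
        if episodes is None: #Contestant was skipped (no episodes or malformed name)
            continue
        counts[contestant] = {date: len(tweets)
                              for ep in episodes
                              for date, tweets in ep.items()}
    return counts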

Run the scraping code individually for each season

In [28]:
tweets13 = scrape_season_tweets(13)
with open('tweets13.json', 'w') as fp:
    json.dump(tweets13, fp)
Melissa
2009-01-05
2009-01-12
2009-01-19
2009-01-26
2009-02-02
2009-02-09
2009-02-16
Molly
2009-01-05
2009-01-12
2009-01-19
2009-01-26
2009-02-02
2009-02-09
2009-02-16
Jillian
2009-01-05
2009-01-12
2009-01-19
2009-01-26
2009-02-02
2009-02-09
Naomi
2009-01-05
2009-01-12
2009-01-19
2009-01-26
2009-02-02
Stephanie
2009-01-05
2009-01-12
2009-01-19
2009-01-26
Lauren
2009-01-05
2009-01-12
2009-01-19
Megan
2009-01-05
2009-01-12
2009-01-19
Shannon
2009-01-05
2009-01-12
2009-01-19
Nicole
2009-01-05
2009-01-12
2009-01-19
Erica
2009-01-05
2009-01-12
Kari
2009-01-05
2009-01-12
Natalie
2009-01-05
2009-01-12
Raquel
2009-01-05
Sharon
2009-01-05
Lisa
2009-01-05
Ann
Dominique
Emily
Jackie
Julie
Nicole
Renee
Shelby
Stacia
Treasure
In [29]:
tweets14 = scrape_season_tweets(14)
with open('tweets14.json', 'w') as fp:
    json.dump(tweets14, fp)
Vienna
2010-01-04
2010-01-11
2010-01-18
2010-01-25
2010-02-01
2010-02-08
2010-02-15
Tenley
2010-01-04
2010-01-11
2010-01-18
2010-01-25
2010-02-01
2010-02-08
2010-02-15
Gia
2010-01-04
2010-01-11
2010-01-18
2010-01-25
2010-02-01
2010-02-08
Ali
2010-01-04
2010-01-11
2010-01-18
2010-01-25
2010-02-01
Corrie
2010-01-04
2010-01-11
2010-01-18
2010-01-25
Ashleigh
2010-01-04
2010-01-11
2010-01-18
Jessie
2010-01-04
2010-01-11
2010-01-18
Kathryn
2010-01-04
2010-01-11
2010-01-18
Ella
2010-01-04
2010-01-11
2010-01-18
Elizabeth
2010-01-04
2010-01-11
Valishia
2010-01-04
2010-01-11
Michelle
2010-01-04
2010-01-11
Ashley
2010-01-04
Christina
2010-01-04
Rozlyn
2010-01-04
Alexa
Caitlyn
Channy
Elizabeth
Emily
Kimberly
Kirsten
Sheila
Stephanie
Tiana
In [30]:
tweets15 = scrape_season_tweets(15)
with open('tweets15.json', 'w') as fp:
    json.dump(tweets15, fp)
Emily
2011-01-03
2011-01-10
2011-01-17
2011-01-24
2011-01-31
2011-02-07
2011-02-14
2011-02-21
2011-02-28
Chantal
2011-01-03
2011-01-10
2011-01-17
2011-01-24
2011-01-31
2011-02-07
2011-02-14
2011-02-21
2011-02-28
Ashley
2011-01-03
2011-01-10
2011-01-17
2011-01-24
2011-01-31
2011-02-07
2011-02-14
2011-02-21
Shawntel
2011-01-03
2011-01-10
2011-01-17
2011-01-24
2011-01-31
2011-02-07
2011-02-14
Michelle
2011-01-03
2011-01-10
2011-01-17
2011-01-24
2011-01-31
2011-02-07
Britt
2011-01-03
2011-01-10
2011-01-17
2011-01-24
2011-01-31
2011-02-07
Jackie
2011-01-03
2011-01-10
2011-01-17
2011-01-24
2011-01-31
Alli
2011-01-03
2011-01-10
2011-01-17
2011-01-24
2011-01-31
Lisa
2011-01-03
2011-01-10
2011-01-17
2011-01-24
Marissa
2011-01-03
2011-01-10
2011-01-17
2011-01-24
Ashley
2011-01-03
2011-01-10
2011-01-17
2011-01-24
Lindsay
2011-01-03
2011-01-10
2011-01-17
Meghan
2011-01-03
2011-01-10
2011-01-17
Stacey
2011-01-03
2011-01-10
2011-01-17
Kimberly
2011-01-03
2011-01-10
Sarah
2011-01-03
2011-01-10
Madison
2011-01-03
2011-01-10
Keltie
2011-01-03
Melissa
2011-01-03
Raichel
2011-01-03
Britnee
Cristy
Jessica
Jill
Lacey
Lauren
Lisa
Rebecca
Renee
Sarah
In [31]:
tweets16 = scrape_season_tweets(16)
with open('tweets16.json', 'w') as fp:
    json.dump(tweets16, fp)
Courtney
2012-01-02
2012-01-09
2012-01-16
2012-01-23
2012-01-30
2012-02-06
2012-02-13
2012-02-20
2012-02-27
Lindzi
2012-01-02
2012-01-09
2012-01-16
2012-01-23
2012-01-30
2012-02-06
2012-02-13
2012-02-20
2012-02-27
Nicki
2012-01-02
2012-01-09
2012-01-16
2012-01-23
2012-01-30
2012-02-06
2012-02-13
2012-02-20
Kacie
2012-01-02
2012-01-09
2012-01-16
2012-01-23
2012-01-30
2012-02-06
2012-02-13
Emily
2012-01-02
2012-01-09
2012-01-16
2012-01-23
2012-01-30
2012-02-06
Rachel
2012-01-02
2012-01-09
2012-01-16
2012-01-23
2012-01-30
2012-02-06
Jamie
2012-01-02
2012-01-09
2012-01-16
2012-01-23
2012-01-30
Casey
2012-01-02
2012-01-09
2012-01-16
2012-01-23
2012-01-30
Blakeley
2012-01-02
2012-01-09
2012-01-16
2012-01-23
2012-01-30
Jennifer
2012-01-02
2012-01-09
2012-01-16
2012-01-23
Elyse
2012-01-02
2012-01-09
2012-01-16
2012-01-23
Monica
2012-01-02
2012-01-09
2012-01-16
Samantha
2012-01-02
2012-01-09
2012-01-16
Jaclyn
2012-01-02
2012-01-09
Erika
2012-01-02
2012-01-09
Brittney
2012-01-02
2012-01-09
Shawn
2012-01-02
Jenna
2012-01-02
Amber
Amber
Anna
Dianna
Holly
Lyndsie
Shira
In [32]:
tweets17 = scrape_season_tweets(17)
with open('tweets17.json', 'w') as fp:
    json.dump(tweets17, fp)
Catherine
2013-01-07
2013-01-14
2013-01-21
2013-01-28
2013-02-04
2013-02-05
2013-02-11
2013-02-18
Lindsay
2013-01-07
2013-01-14
2013-01-21
2013-01-28
2013-02-04
2013-02-05
2013-02-11
2013-02-18
AshLee
2013-01-07
2013-01-14
2013-01-21
2013-01-28
2013-02-04
2013-02-05
2013-02-11
2013-02-18
Desiree
2013-01-07
2013-01-14
2013-01-21
2013-01-28
2013-02-04
2013-02-05
2013-02-11
Lesley
2013-01-07
2013-01-14
2013-01-21
2013-01-28
2013-02-04
2013-02-05
Tierra
2013-01-07
2013-01-14
2013-01-21
2013-01-28
2013-02-04
2013-02-05
Daniella
2013-01-07
2013-01-14
2013-01-21
2013-01-28
2013-02-04
Selma
2013-01-07
2013-01-14
2013-01-21
2013-01-28
2013-02-04
Sarah
2013-01-07
2013-01-14
2013-01-21
2013-01-28
2013-02-04
Robyn
2013-01-07
2013-01-14
2013-01-21
2013-01-28
Jackie
2013-01-07
2013-01-14
2013-01-21
2013-01-28
Amanda
2013-01-07
2013-01-14
2013-01-21
Leslie
2013-01-07
2013-01-14
2013-01-21
Kristy
2013-01-07
2013-01-14
Taryn
2013-01-07
2013-01-14
Kacie
2013-01-07
2013-01-14
Brooke
2013-01-07
Diana
2013-01-07
Katie
2013-01-07
Ashley
Ashley
Kelly
Keriann
Lacey
Lauren
Paige
In [33]:
tweets18 = scrape_season_tweets(18)
with open('tweets18.json', 'w') as fp:
    json.dump(tweets18, fp)
Nikki
2014-01-06
2014-01-13
2014-01-20
2014-01-27
2014-02-03
2014-02-10
2014-02-17
2014-02-24
Clare
2014-01-06
2014-01-13
2014-01-20
2014-01-27
2014-02-03
2014-02-10
2014-02-17
2014-02-24
Andi
2014-01-06
2014-01-13
2014-01-20
2014-01-27
2014-02-03
2014-02-10
2014-02-17
2014-02-24
Renee
2014-01-06
2014-01-13
2014-01-20
2014-01-27
2014-02-03
2014-02-10
2014-02-17
Chelsie
2014-01-06
2014-01-13
2014-01-20
2014-01-27
2014-02-03
2014-02-10
Sharleen
2014-01-06
2014-01-13
2014-01-20
2014-01-27
2014-02-03
2014-02-10
Kat
2014-01-06
2014-01-13
2014-01-20
2014-01-27
2014-02-03
Cassandra
2014-01-06
2014-01-13
2014-01-20
2014-01-27
2014-02-03
Alli
2014-01-06
2014-01-13
2014-01-20
2014-01-27
Danielle
2014-01-06
2014-01-13
2014-01-20
2014-01-27
Kelly
2014-01-06
2014-01-13
2014-01-20
2014-01-27
Elise
2014-01-06
2014-01-13
2014-01-20
Lauren
2014-01-06
2014-01-13
2014-01-20
Christy
2014-01-06
2014-01-13
Lucy
2014-01-06
2014-01-13
Amy
2014-01-06
Chantel
2014-01-06
Victoria
2014-01-06
Alexis
Amy
Ashley
Christine
Kylie
Lacy
Lauren
Maggie
Valerie
In [9]:
tweets19 = scrape_season_tweets(19)
with open('tweets19.json', 'w') as fp:
    json.dump(tweets19, fp)
Whitney
2015-01-05
2015-01-12
2015-01-19
2015-01-26
2015-02-02
2015-02-09
2015-02-15
2015-02-16
Becca
2015-01-05
2015-01-12
2015-01-19
2015-01-26
2015-02-02
2015-02-09
2015-02-15
2015-02-16
Kaitlyn
2015-01-05
2015-01-12
2015-01-19
2015-01-26
2015-02-02
2015-02-09
2015-02-15
2015-02-16
Jade
2015-01-05
2015-01-12
2015-01-19
2015-01-26
2015-02-02
2015-02-09
2015-02-15
Carly
2015-01-05
2015-01-12
2015-01-19
2015-01-26
2015-02-02
2015-02-09
Britt
2015-01-05
2015-01-12
2015-01-19
2015-01-26
2015-02-02
2015-02-09
Megan
2015-01-05
2015-01-12
2015-01-19
2015-01-26
2015-02-02
Kelsey
2015-01-05
2015-01-12
2015-01-19
2015-01-26
2015-02-02
Ashley
2015-01-05
2015-01-12
2015-01-19
2015-01-26
2015-02-02
Mackenzie
2015-01-05
2015-01-12
2015-01-19
2015-01-26
Samantha
2015-01-05
2015-01-12
2015-01-19
2015-01-26
Ashley
2015-01-05
2015-01-12
2015-01-19
Juelia
2015-01-05
2015-01-12
2015-01-19
Nikki
2015-01-05
2015-01-12
2015-01-19
Jillian
2015-01-05
2015-01-12
2015-01-19
Amber
2015-01-05
2015-01-12
Tracy
2015-01-05
2015-01-12
Trina
2015-01-05
2015-01-12
Alissa
2015-01-05
Jordan
2015-01-05
Kimberly
Tandra
2015-01-05
Tara
2015-01-05
Amanda
Bo
Brittany
Kara
Michelle
Nicole
In [10]:
tweets12 = scrape_season_tweets(12)
with open('tweets12.json', 'w') as fp:
    json.dump(tweets12, fp)
Shayne
2008-03-17
2008-03-23
2008-03-31
2008-04-07
2008-04-14
2008-04-21
2008-04-28
Chelsea
2008-03-17
2008-03-23
2008-03-31
2008-04-07
2008-04-14
2008-04-21
2008-04-28
Amanda
2008-03-17
2008-03-23
2008-03-31
2008-04-07
2008-04-14
2008-04-21
Noelle
2008-03-17
2008-03-23
2008-03-31
2008-04-07
2008-04-14
Marshana
2008-03-17
2008-03-23
2008-03-31
2008-04-07
Robin
2008-03-17
2008-03-23
2008-03-31
2008-04-07
Ashlee
2008-03-17
2008-03-23
2008-03-31
Kelly
2008-03-17
2008-03-23
2008-03-31
Holly
2008-03-17
2008-03-23
2008-03-31
Erin
2008-03-17
2008-03-23
Amy
2008-03-17
2008-03-23
Kristine
2008-03-17
2008-03-23
Michelle
2008-03-17
Carri
2008-03-17
Erin
2008-03-17
Alyssa
Amanda
Denise
Devon
Lesley
Michele
Rebecca
Stacey
Tamara
Tiffany