Profession Clustering

Here we categorize each contestant's profession into one of nine classes according to the International Standard Classification of Occupations (http://www.ilo.org/public/english/bureau/stat/isco/isco08/index.htm).
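
As a rough illustration of what a class means here: the leading digit of an ISCO-08 code identifies its major group, so any code can be reduced to its class by taking its first digit. The sketch below is not part of the original analysis; the helper name `major_group` and the lookup table `ISCO_MAJOR_GROUPS` are illustrative. The table lists all ten ISCO-08 major groups (1 to 9, plus 0 for armed forces occupations) for completeness.

```python
# ISCO-08 major groups, keyed by the leading digit of the code.
ISCO_MAJOR_GROUPS = {
    1: "Managers",
    2: "Professionals",
    3: "Technicians and associate professionals",
    4: "Clerical support workers",
    5: "Service and sales workers",
    6: "Skilled agricultural, forestry and fishery workers",
    7: "Craft and related trades workers",
    8: "Plant and machine operators, and assemblers",
    9: "Elementary occupations",
    0: "Armed forces occupations",
}


def major_group(code):
    """Return (digit, title) of the major group for an ISCO-08 code.

    Hypothetical helper. Caveat: if codes are parsed as integers, a leading
    zero (armed forces codes) is lost before this function ever sees it.
    """
    digit = int(str(code)[0])
    return digit, ISCO_MAJOR_GROUPS[digit]


# 111 is "Legislators and senior officials", which falls under group 1.
example = major_group(111)
```

Passing codes as strings rather than integers would keep the leading zero of armed forces codes intact.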

In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import time
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
from pyquery import PyQuery as pq
from bs4 import BeautifulSoup
import requests
import json
from geopy import geocoders
import math
import gensim
import nltk
import difflib

Open contestant dictionaries with occupation data (from Wikipedia scrape).

In [2]:
#Load listAllDicts.json, which contains each contestant's occupation
with open("tempdata/listAllDicts.json") as json_file:
    seasons = json.load(json_file)
In [52]:
#Make a function that gets each contestant's profession for a given season
def get_profession(choose_season):
    town_dict = {}
    for idict in seasons:
        if idict["season"] == choose_season:
            if idict["elimination"] == "bachelor":
                bachtown = idict["occupation"]
            else:
                if idict["name"] == "Kacie Boguskie":
                    idict["occupation"] = "Administrative assistant"
                town_dict.update({idict['name']:idict['occupation']})
    return town_dict
In [3]:
#Download a csv of a list of classified professions from the ISCO
#https://en.wikipedia.org/wiki/International_Standard_Classification_of_Occupations
#http://www.ilo.org/public/english/bureau/stat/isco/isco08/index.htm
professions = pd.read_csv("professions.csv")
professions.head(3)
Out[3]:
   ISCO 08 Code                                           Title EN
0             1                                           Managers
1            11  Chief executives, senior officials and legisla...
2           111                   Legislators and senior officials

Categorize

We categorize by matching each contestant's profession against the list of coded occupation titles given by the ISCO. A word counts as a match if its similarity to a word in an ISCO title is at least 0.75 (the cutoff passed to difflib.get_close_matches). If titles from several categories match a contestant's profession, we assign the category that produced the most matches.

Since not all contestant professions appear in the ISCO list, we handle several edge cases manually: we rewrite those professions to terms that do produce a match.
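
The matching and voting steps described above can be sketched on a small scale as follows. This is a simplified illustration, not the notebook's full function: `toy_rows` is a hand-picked subset of ISCO-08 titles standing in for professions.csv, and `vote_major_group` is a hypothetical helper name. It mirrors the notebook's choices of a 0.75 cutoff, the lowest digit on ties, and 999 when nothing matches.

```python
import difflib
from collections import Counter

# Toy stand-in for professions.csv: (ISCO code, title) pairs.
# Only a handful of rows; a real run would use the full list.
toy_rows = [
    (1, "Managers"),
    (2, "Professionals"),
    (22, "Health professionals"),
    (5, "Service and sales workers"),
    (52, "Sales workers"),
]


def vote_major_group(words, isco_rows, cutoff=0.75):
    """Pick the major group whose titles fuzzily match the most words."""
    digits = []
    for word in words:
        for code, title in isco_rows:
            # A title matches if any of its words is close enough to `word`.
            if difflib.get_close_matches(word, title.split(" "), cutoff=cutoff):
                digits.append(int(str(code)[0]))
    if not digits:
        return 999  # sentinel for "no match", as in the notebook
    counts = Counter(digits)
    best = max(counts.values())
    # On ties, take the lowest digit.
    return min(d for d, c in counts.items() if c == best)


sales_group = vote_major_group(["sales"], toy_rows)
manager_group = vote_major_group(["manager"], toy_rows)
unmatched = vote_major_group(["astronaut"], toy_rows)
```

Note that `difflib` compares strings case-sensitively, so "sales" and "Sales" match only because their similarity ratio (0.8) clears the cutoff, not because they are treated as equal.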

In [68]:
prof_sent = map(lambda r: r.split(" "), professions["Title EN"].tolist())
isco_code = professions["ISCO 08 Code"].tolist()
prof_list = professions["Title EN"].tolist()

def get_occupation(season_num):
    #Get names and professions
    nprof = get_profession(season_num)
    conts = nprof.keys()
    all_jobs = nprof.values()

    #Make profession names cleaner for processing
    all_jobs = map(lambda r: r.replace("&",""), all_jobs)
    all_jobs = map(lambda r: r.replace("/"," "), all_jobs)
    all_jobs = map(lambda r: r.replace("  "," "), all_jobs)

    #Replace words we know a priori (or a posteriori) will be problematic to cluster
    all_jobs = [u"assistant" if "aralegal" in s else s for s in all_jobs]
    all_jobs = [u"lawyer" if "ttorney" in s else s for s in all_jobs]
    all_jobs = [u"Professionals" if s=="Wedding Coordinator" else s for s in all_jobs]
    all_jobs = [u"child care" if (s=="Nanny") | (s=="Homemaker") else s for s in all_jobs]
    all_jobs = [u"sales" if "merchant" in s else s for s in all_jobs]
    all_jobs = [u"dancer" if s=="Radio City Rockette" else s for s in all_jobs]
    all_jobs = [u"fashion model" if ("Model" in s) | ("model" in s) else s for s in all_jobs]
    all_jobs = [u"Hairdressers" if "tylist" in s else s for s in all_jobs]
    all_jobs = [u"sports" if s=="acrobat" else s for s in all_jobs]
    all_jobs = [u"executive recruiter" if s=="IT recruiter" else s for s in all_jobs]
    all_jobs = [u"Administrative Assistant" if s=="Assistant" else s for s in all_jobs]
    all_jobs = [u"beauticians" if ("sthetician" in s) | (s=="Salon Owner") | \
                (s=="Cosmetics Consultant") else s for s in all_jobs]
    all_jobs = [u"Songwriter singer" if s=="Singer-songwriter" else s for s in all_jobs]
    all_jobs = [u"Medical Assistant" if s=="Medical Technician" else s for s in all_jobs]
    all_jobs = [u"Advertising account manager" if s=="Advertising Executive" else s for s in all_jobs]
    all_jobs = [u"sports" if s=="WWE Diva-in-Training" else s for s in all_jobs]
    all_jobs = [u"doctor" if "hysician" in s else s for s in all_jobs]
    all_jobs = [u"author" if s=="Blogger" else s for s in all_jobs]
    all_jobs = [u"chief executive" if "ntrepreneur" in s else s for s in all_jobs]
    all_jobs = [u"food service" if "waitress" in s else s for s in all_jobs]
    all_jobs = [u"education" if ("student" in s) | ("Student" in s) else s for s in all_jobs]
    all_jobs = [u"Creative and performing artists" if "Theatre" in s else s for s in all_jobs]
    all_jobs = [u"Aircraft pilot" if s=='Commercial Pilot' else s for s in all_jobs]
    all_jobs = [u"education assistant" if s=='College Admissions' else s for s in all_jobs]
    all_jobs = [u"Management and organization analysts" if s=='Personal Organizer' else s for s in all_jobs]
    all_jobs = [u"Journalists" if s=='Local News Reporter' else s for s in all_jobs]
    all_jobs = [u"Medical assistants" if 'Nurse' in s else s for s in all_jobs]
    all_jobs = [u"Services managers not elsewhere classified" if s=='Nursing Home Owner' else s for s in all_jobs]
    all_jobs = [u"dancers" if "heerleader" in s else s for s in all_jobs]
    all_jobs = [u"Medical sales" if 'Cadaver Tissue Saleswoman' in s else s for s in all_jobs]
    all_jobs = [u"education" if s=='Guidance Counselor' else s for s in all_jobs]


    #Now we sift through the list of professions to just get the nouns
    #Our assumption is that the nouns of a profession provide the best classification
    all_nouns = []
    for sentence in all_jobs:
        stokens = nltk.word_tokenize(sentence)
        sent_noun = []
        for word, part_of_speech in nltk.pos_tag(stokens):
            if part_of_speech in ['NN', 'NNS', 'NNP', 'NNPS']:
                sent_noun.append(word)
        all_nouns.append(sent_noun)

    #Do a fuzzy string comparison with the titles in the professions.csv file
    #We match each noun against the words of every ISCO title (similarity cutoff 0.75)
    #If there is a match, we save the ISCO code and title of that entry
    prof_dict = {}
    all_codes = []
    all_names = []
    for iprof,sent in enumerate(all_nouns):
        for word in sent:
            for ii,profs in enumerate(prof_sent):
                wmatches = difflib.get_close_matches(word, profs, cutoff=.75)
                if wmatches:
                    all_names.append(prof_list[ii])
                    all_codes.append(isco_code[ii])
            
        #Take the leading digit of each matched code (its category)
        fnum = map(lambda r: int(str(r)[0]), all_codes)
        counts = [fnum.count(q) for q in range(9)] #Count matches per leading digit 0-8
        job_type = np.where(counts==np.max(counts))[0][0]
        if np.sum(counts) == 0:
            job_type=999
            
        #Append jobs type
        prof_dict.update({conts[iprof] : job_type})
        all_codes=[]
        all_names=[]
        
    return prof_dict
In [69]:
season_nums = range(13,20)

#Run over all seasons
profession_dict = {}
for season_num in season_nums:
    profession_dict.update({season_num: get_occupation(season_num)})
    print "season ", season_num, " done"
season  13  done
season  14  done
season  15  done
season  16  done
season  17  done
season  18  done
season  19  done
In [71]:
with open('profession_dict.json', 'w') as fp:
    json.dump(profession_dict, fp)

Visualization

Here we visualize how contestants are distributed across profession types, with the winners' counts overlaid.
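
Before plotting, the per-category tallies can be built in a single pass. The sketch below uses `collections.Counter` on made-up data that only imitates the shape of profession_dict.json (season, then contestant name, then category digit); the names `toy_professions`, `toy_winners`, `toy_all_counts` and `toy_win_counts` are hypothetical, and none of the values come from the real dataset.

```python
from collections import Counter

# Made-up data in the shape {season: {contestant name: category digit}}.
toy_professions = {
    "13": {"Contestant A": 2, "Contestant B": 5, "Contestant C": 2},
    "14": {"Contestant D": 3, "Contestant E": 2},
}
toy_winners = ["Contestant B", "Contestant E"]

toy_all_counts = Counter()
toy_win_counts = Counter()
for season in toy_professions.values():
    for name, category in season.items():
        toy_all_counts[category] += 1
        if name in toy_winners:
            toy_win_counts[category] += 1

# Counters return 0 for categories that never occur, so every category
# can be looked up without a KeyError when building the bar heights.
bar_heights = [toy_all_counts[c] for c in range(9)]
```

A `Counter` avoids rescanning the full list once per category, which is what a nested comprehension per unique value does.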

In [52]:
#Open Data
with open("profession_dict.json") as json_file:
    professions = json.load(json_file)
    
#Get winner data
cont = pd.read_csv("contestantDF.csv")
winners = cont[cont['elimination week']=="Winner"]["name"].tolist()
In [50]:
#Turn into array
prof_array=[]
win_array = []
for pkeys in professions.keys():
    for ckeys in professions[pkeys].keys():
        prof_array.append(professions[pkeys][ckeys])
        if ckeys in winners:
            win_array.append(professions[pkeys][ckeys])

#Count the number of profession types
un_vals = np.unique(prof_array)
un_vals = np.append(un_vals, 0)
prof_counts = []
win_counts = []
for uu in un_vals:
    prof_counts.append(np.sum([pp==uu for pp in prof_array]))
    win_counts.append(np.sum([pp==uu for pp in win_array]))

    
fig = plt.figure()
ax = fig.add_subplot(111)

#Plot all professions
ax.bar(un_vals, prof_counts, color="red", label="All contestants", align="center")

#Plot winners
ax.bar(un_vals, win_counts, color="blue", label="Winners", align="center")
xtickNames = ax.set_xticklabels(["","Managers", "Professionals", "Technicians", "Clerical", "Service", \
                                 "Agriculture", "Craft", "Operators"])
ax.set_xlim([0,9])
plt.setp(xtickNames, rotation=90, fontsize=15)
ax.legend()
plt.title("Distribution of Bachelor Professions")
plt.ylabel("Counts")
plt.show()

Since there are only a few winners, we don't see a clear trend in the professions of winning contestants. With the little data we have, the "Professionals" category accounts for most of the contestants but appears under-represented among winners.
