Analyzing #first7jobs tweets with MonkeyLearn and R

Contributed by Maëlle Salmon, creator and maintainer of the monkeylearn R package.

Have you tweeted about your #firstsevenjobs? I did!

The initial goal of #firstsevenjobs and #first7jobs tweets was to provide a short description of the first 7 activities people were paid for. It was quite fun to read them in my timeline! Of course the hashtag was also used by spammers, for making jokes, and for commenting on the hashtag itself, so not all the tweets contain 7 job descriptions.

However, I am confident quite a lot of #firstsevenjobs and #first7jobs tweets actually describe first jobs, so I decided to use them as an example of text analysis in R with MonkeyLearn: querying the Twitter API with the rtweet package, then cleaning the tweets a bit, and then using the monkeylearn package to classify each job into a field of work.

I stored the source code I used for this post in a GitHub repository; check it out here.

Getting the Tweets

I used the rtweet R package for getting tweets via the Twitter API, searching for both the #firstsevenjobs and #first7jobs hashtags and then keeping only unique non-retweeted tweets in English. I got 4,858 tweets, sent between 2016-08-10 and 2016-08-20. This does not mean that only so few tweets were posted with these hashtags: the Twitter API simply does not return all the tweets. You'd have to pay for that. But hey, that's a good number of tweets to start with, so I won't complain. Here is part of the table I got:

status_id | text
765226304073404416 | What Were Your #FirstSevenJobs?: #firstsevenjobsphotography salespot washerbartenderurban plann... https://t.co/wwGp8NCohG #Architectbiz
764629565431947264 | The unexpected joys of #FirstSevenJobs https://t.co/wV8XeFVlv8
764104419185229824 | My piece on #firstsevenjobs https://t.co/Il3a2Wrm0I
765643154964025344 | #first7jobs: milkshake maker, national anthem singer, babysitter, phone bank caller, tutor, camp counselor, bartender (<- badly)
765407468499374080 | @BenSPLATT Oh, I thought you were posting your #firstsevenjobs!
766334362447151104 | 13 Entrepreneurs and CEOs Share Their #First7Jobs #jobseekers #advice https://t.co/k1hhSFKznH
764296241698238468 | Babysitter, educational video actor, Little League umpire, sales clerk, archery instructor, security guard, audiobook narrator. #first7jobs
763868874211078144 | #firstsevenjobs 1. Landscaper 2. passenger train car attendant 3. Mail sorter at Post office 4. pt Evenings/weekends/cruiser/prod 1090 CHEC
763922164655415296 | #firstsevenjobs fashion intern, retail sales, rec coordinator,receptionist, teacher, school administrator
766299512205811712 | 1. Golf course kitchen 2. Whole foods 3. Dean and Deluca 4. Hyatt 5. Gotts 6. Hillstone 7. Cheesecake Factory #firstsevenjobs
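
The filtering step described above (unique, non-retweeted, English tweets) can be sketched as follows. This is only an illustration in Python, not the R code used for the post; the field names `is_retweet` and `lang` and the choice to deduplicate on `status_id` are assumptions, and the two toy tweets reuse IDs from the table.

```python
def keep_unique_english_originals(tweets):
    """Keep one copy of each non-retweeted English tweet, in input order."""
    seen_ids = set()
    kept = []
    for tweet in tweets:
        # Assumed field names: "is_retweet" (bool) and "lang" (language code).
        if tweet["is_retweet"] or tweet["lang"] != "en":
            continue
        # Assumption: "unique" means unique by status_id.
        if tweet["status_id"] in seen_ids:
            continue
        seen_ids.add(tweet["status_id"])
        kept.append(tweet)
    return kept


# Toy input: one duplicate and one retweet that should be dropped.
sample_tweets = [
    {"status_id": "764104419185229824", "lang": "en", "is_retweet": False},
    {"status_id": "764104419185229824", "lang": "en", "is_retweet": False},
    {"status_id": "764629565431947264", "lang": "en", "is_retweet": True},
]
filtered = keep_unique_english_originals(sample_tweets)
```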

Parsing the Tweets

So you see, some of them contain actual job descriptions, others don't. I mean, even I polluted the hashtag by advertising my own analysis! Among those that do describe jobs, some use commas or new lines between descriptions, or number them, or simply use spaces... Therefore, parsing tweets to get 7 job descriptions per tweet was a bit of a challenge.

I counted the occurrences of each possible separator to find which one I should probably use to cut the tweet into 7 parts. This yielded tweets cut in several parts -- sometimes fewer than 7, sometimes more. I could not parse tweets whose descriptions were separated only by spaces, because words inside a description are separated by spaces too, so I could not tell the two apart. Besides, some people tweeted about fewer or more than 7 jobs. For instance, one tweet says "I have not had seven jobs yet but so far...- Accounts Assistant- Executive PA- Social Media Lead,yoga instructor?#FirstSevenJobs". I did my best to remove tweet parts that were something like "Here are my #firstsevenjobs", in order to keep only the job descriptions. In the end I kept only the tweets that had exactly 7 parts. Out of 4,858 tweets I got 1,637, that is 11,459 job descriptions. That is a lot. Here is an excerpt of the table:

status_id | wordsgroup | rank
763505013675229193 | Shopping bag | 1
763505013675229193 | Shopping assistant | 2
763505013675229193 | Housekeeper | 3
763505013675229193 | Cashier at Empik | 4
763505013675229193 | Fast food worker | 5
763505013675229193 | Microsoft's consultant | 6
763505013675229193 | Cashier at Sport shop | 7
763511170196135936 | Dish Pig | 1
763511170196135936 | Toy Packer | 2
763511170196135936 | Asian Chef | 3
763511170196135936 | Bike Fitter | 4
763511170196135936 | Bike Shop Dude | 5
763511170196135936 | Beard Grower | 6
763511170196135936 | Sports Therapist | 7
763512991731945472 | babysitter | 1
763512991731945472 | busperson | 2
763512991731945472 | camp counselor | 3
763512991731945472 | secretary/clerk | 4
763512991731945472 | graduate assistant | 5
763512991731945472 | college prof | 6
763512991731945472 | full time writer | 7

Rank is the rank of the jobs in the tweet, which should be the chronological rank too. For instance, for the first tweet, the first job is shopping bag, the second shopping assistant, etc.
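
The separator-counting idea from this section can be sketched roughly as below. This is a Python illustration under stated assumptions, not the parsing code behind the post: the list of candidate separators, the hashtag pattern, and the function name `parse_seven_jobs` are all hypothetical.

```python
import re

# Assumed hashtag spellings, based on the two hashtags named in the post.
HASHTAG = re.compile(r"#first(?:seven|7)jobs?", re.IGNORECASE)

# Hypothetical candidates: commas, new lines, and "1." / "2)" style numbering.
SEPARATORS = {
    "comma": re.compile(r","),
    "newline": re.compile(r"\n"),
    "numbering": re.compile(r"\d+[.)]\s"),
}


def parse_seven_jobs(text):
    """Return [(rank, job), ...] if the tweet splits into exactly 7 parts, else None."""
    body = HASHTAG.sub(" ", text)
    # Count how often each candidate separator occurs and pick the most frequent one.
    counts = {name: len(pattern.findall(body)) for name, pattern in SEPARATORS.items()}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        return None  # no usable separator, e.g. descriptions split only by spaces
    parts = [part.strip(" \t\n.:;-") for part in SEPARATORS[best].split(body)]
    parts = [part for part in parts if part]
    if len(parts) != 7:
        return None  # keep only tweets with exactly 7 parts
    return list(enumerate(parts, start=1))


comma_tweet = ("Babysitter, educational video actor, Little League umpire, sales clerk, "
               "archery instructor, security guard, audiobook narrator. #first7jobs")
numbered_tweet = ("1. Golf course kitchen 2. Whole foods 3. Dean and Deluca 4. Hyatt "
                  "5. Gotts 6. Hillstone 7. Cheesecake Factory #firstsevenjobs")
comma_jobs = parse_seven_jobs(comma_tweet)
numbered_jobs = parse_seven_jobs(numbered_tweet)
```

A sketch this simple would still miss many real tweets (dashes as separators, several separators mixed in one tweet), which is consistent with only about a third of the tweets surviving the parsing step.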

MonkeyLearn magic: summarizing the information by assigning a field to each job

It would take a long time to read all the tweets, although I did end up reading a lot of them while preparing this post. I wanted to have a general idea of what people did in their lives, so I turned to machine learning to help me get some information out of the tweets. I'm the creator and maintainer of an R package called monkeylearn, part of the rOpenSci project, which lets you use existing MonkeyLearn classifiers and extractors, so I knew that MonkeyLearn had a cool job classifier. I sent all 11,459 job descriptions to the MonkeyLearn API.

MonkeyLearn's job classifier assigns each description one of 31 possible fields (called industries), along with the confidence of the prediction. The algorithm uses a support vector machine (SVM) model to predict the tag of a job. It was originally developed by a user of MonkeyLearn as a public model, and was then further developed by the MonkeyLearn team, still as a public model -- I really like this collaborative effort. As a MonkeyLearn user one could fork the classifier and play with tag definitions, add or improve data for training the model, etc. With my package one can only use existing models, so a possible workflow would be to develop models outside of R and then use them in R in production. If you wish to know more about classifiers, you can have a look at the MonkeyLearn knowledge base or even take a machine learning MOOC such as this one. But I digress: I've been using the jobs classifier as it is, and it was quite fun and above all promising.

I decided to keep only job descriptions for which the probability given by the classifier was higher than 50%. This corresponds to 6,801 of the initial 11,459 job descriptions, or about 59%.

Tweets coverage by the classifier

I then wondered how many jobs could be classified with a probability higher than 50% inside each tweet.

For each of the 1,637 tweets I sent to the jobs classifier, I got a field with a probability higher than 0.5 for on average 4-5 job descriptions. We might want even more, and as I'll point out later, we could get more if we put some effort into it and took full advantage of MonkeyLearn's possibilities!
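
As a rough illustration of these two steps (keeping predictions above the 0.5 threshold, then counting how many jobs remain per tweet), here is a small Python sketch. The record layout with `status_id`, `field` and `probability` keys is an assumption, and the probabilities in the toy data are invented purely for the example.

```python
from collections import Counter


def confident_only(predictions, threshold=0.5):
    """Keep predictions whose probability is strictly higher than the threshold."""
    return [p for p in predictions if p["probability"] > threshold]


def classified_jobs_per_tweet(predictions, threshold=0.5):
    """Count, for each tweet, how many of its jobs got a confident field."""
    return Counter(p["status_id"] for p in confident_only(predictions, threshold))


# Toy data with made-up probabilities (not real classifier output).
toy_predictions = [
    {"status_id": "A", "field": "Retail", "probability": 0.9},
    {"status_id": "A", "field": "Education", "probability": 0.4},
    {"status_id": "A", "field": "Hospitality", "probability": 0.7},
    {"status_id": "B", "field": "Legal", "probability": 0.5},
    {"status_id": "B", "field": "Retail", "probability": 0.8},
]
coverage = classified_jobs_per_tweet(toy_predictions)

# Share of job descriptions kept in the post: 6,801 out of 11,459.
share_kept = round(6801 / 11459 * 100, 1)
```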

What particular jobs are found within each field?

In this work I used the classifier as it was without modifying it, but I was curious to know which jobs ended up in each tag. I had a glance at descriptions by field but this can take a while given the number of jobs in some tags. Thankfully Federico Pascual reminded me I could use MonkeyLearn's keyword extractor on all job descriptions of each tag to find dominant patterns. Such a nice idea, and something my package supports. I chose to get 5 keywords per field. Here's the result:

label | keyword
Accounting / Finance | Accounting clerk, financial analyst, account manager, Bookkeeper, Accountant
Administrative | office manager, front desk, office assistant, receptionist, assistant
Architecture / Drafting | Land surveyor, surveyor, Job, applications, Landscaper
Art/Design / Entertainment | House painter, sandwich artist, web designer, Graphic Designer, designer
Banking / Loan / Insurance | Private tutor, University, insurance, bank teller, teller
Beauty / Wellness | hot dog vendor, Physical Therapy Aide, dog sitter, Dog walker, Dog
Business Development / Consulting | Business Owner, Mgmt consultant, strategist, analyst, consultant
Education | high school teacher, substitute teacher, library assistant, Math tutor, teacher
Engineering (Non-Software) | Audio Engineer, Engineer intern, network engineer, sales engineer, engineer
Facilities / General Labor | factory worker, Grocery Bagger, bagger, Janitor, Warehouse
Hospitality | Gas station attendant, Gas Station, Kitchen porter, Stock boy, Hostess
Human Resources | event coordinator, Recruitment Consultant, Manager, Recruiter, coordinator
Installation / Maintenance / Repair | golf course maintenance, ice cream shop, shop assistant, maintenance, shop
Legal | Law Office Runner, corporate filth monkey, Law clerk, Paralegal, Law firm
Management | retail assistant manager, assistant manager, staff, manager, Director
Manufacturing / Production / Construction / Logistics | park ride operator, Assembly line worker, construction laborer, construction worker, assembly
Marketing / Advertising / PR | Market researcher, Social Media, Marketing Intern, intern, Marketing
Medical / Healthcare | ice cream scooper, Paperboy, Waiter, Waitress, Babysitter
Non-profit / Volunteering | student assistant, Orientation Leader, social worker, camp counselor, Camp
Product Management / Project Management | Program manager, Programming Intern, Production Manager, project manager, manager
Real Estate | Real Estate Broker, mortgage broker, Actor Commercials, trainee, real estate
Restaurant / Food Services | fast food, Barista, Bartender, Dishwasher, clerk
Retail | grocery clerk, grocery store, retail sales, Grocery, cashier
Sales / Customer Care | Customer Service Rep, Sales assistant, customer service, Sales Associate, sales
Science / Research | tech support, lab tech, research assistant, Tech, research
Security / Law Enforcement | office temp, Office admin, Security guard, Security, office
Skilled Trade | Computer Repair Tech, Manufacturer, repair, Carpenter, summer
Software Development / IT | Software Engineer, Web Developer, data entry, Programmer, Developer
Sports / Fitness | Gymnastics coach, Soccer Referee, Swim instructor, Lifeguard, instructor
Travel / Transportation | delivery driver, Bus Boy, Paper route, Newspaper delivery, Pizza delivery
Writing / Editing / Publishing | Freelance Writer, Copywriter, reporter, writer, intern

For some tags the keywords seem natural; for others they are more surprising. For instance, the algorithm was trained with data which included 'Pet Stylist', 'Dog Trainer' and 'Pet Stylist (DOG GROOMER)' for the Beauty / Wellness tag, and no "Dog sitter", which is why dog sitting is a wellness job in our results. But wait, having a dog is good for your health, so people caring for your dog help your wellness, right?

Say hi to my sibling Mowgli. He's quite a beauty.

So, well, as with any statistical or machine learning prediction, the data you use for training your model is quite crucial. The jobs classifier could probably use even more data to improve classification. Like any MonkeyLearn public model, it can be built upon and improved, so who's in for forking it? In the meantime, it still offers an interesting output to play with. I nearly wanted to add "MonkeyLearn user" as an "Entertainment" job because our sample of classified job descriptions is a nice playground for looking at life trajectories :)

What sorts of jobs did people describe in their tweets?

The 6,801 jobs for which we predicted a tag with a probability higher than 0.5 are divided as follows among industries:

Figure: Job tags count.

The most frequent tags are Restaurant / Food Services and Retail. Usual first jobs...

Juniorness of the jobs in each field

Since we know for each job whether it was the first, third or seventh job of the tweeter, we can explore whether some tags tend to come early rather than late among the first seven jobs. For this, inside each field we can look at whether the field was mostly a label for first jobs or for seventh jobs. See for yourself:

Figure: Rank and counts by job tags.

I'd tend to say that some industries such as Business Development / Consulting are not first-entry jobs (more yellow/green, i.e. later jobs), while Non-profit / Volunteering has a higher proportion of brand-new workers (more blue). Not a real surprise I guess?
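
The comparison above boils down to the share of each rank inside each field. A minimal Python sketch of that tabulation follows; the record keys `field` and `rank` and the toy rows are assumptions for illustration, not data from the post.

```python
from collections import Counter, defaultdict


def rank_shares_by_field(jobs):
    """For each field, return the share of its jobs found at each rank (1 to 7)."""
    counts = defaultdict(Counter)
    for job in jobs:
        counts[job["field"]][job["rank"]] += 1
    shares = {}
    for field, by_rank in counts.items():
        total = sum(by_rank.values())
        shares[field] = {rank: by_rank[rank] / total for rank in sorted(by_rank)}
    return shares


# Toy rows, invented for the example.
toy_jobs = [
    {"field": "Retail", "rank": 1},
    {"field": "Retail", "rank": 1},
    {"field": "Retail", "rank": 2},
    {"field": "Retail", "rank": 7},
    {"field": "Legal", "rank": 6},
    {"field": "Legal", "rank": 7},
]
shares = rank_shares_by_field(toy_jobs)
```

A field dominated by low ranks would be read as an "early" field, and one dominated by ranks 6 and 7 as a "late" one.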

Transitions between industries

I've said I wanted to look at life trajectories. This dataset won't give me any information about the level of the job of course, e.g. whether you start as a clerk and end up leading your company, but I can look at how people move from one tag to another. My husband gave me a great idea of a circle graph he had seen in a newspaper. For this I used only job descriptions for which a field was predicted with a probability higher than 0.5. I kept only transitions that were present more than 10 times in the data, otherwise we would end up looking at a hairball.

Figure: Circle graph for job trajectories.

On this circle you see different industries, and the transition between them. The length of the circle occupied by each field depends on the number of jobs belonging to this tag, so again the Restaurant / Food Services tag is the biggest one.

One can see that people taking a position in the Hospitality industry, below the circle, often come from the Restaurant / Food Services or the Retail industries. When they leave the Hospitality industry, they often go on to work in the Restaurant / Food Services industry. David Robinson suggested that I find the most common transitions and show them in a directed graph, but I'll keep this idea for later, since this post is quite long already, ha!
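
Counting transitions between fields can be sketched as below, in Python and under stated assumptions. Here a transition is taken to be a pair of consecutive ranks within one tweet where both jobs have a confident field; how gaps left by unclassified jobs were handled in the actual analysis is not stated in the post, so that choice is an assumption, as are the record keys and the `more_than` parameter name.

```python
from collections import Counter, defaultdict


def count_transitions(jobs, more_than=10):
    """Count field-to-field moves between consecutive ranks, keeping frequent ones."""
    by_tweet = defaultdict(dict)
    for job in jobs:
        by_tweet[job["status_id"]][job["rank"]] = job["field"]

    transitions = Counter()
    for fields_by_rank in by_tweet.values():
        for rank, field in fields_by_rank.items():
            # Assumption: only directly consecutive ranks form a transition.
            next_field = fields_by_rank.get(rank + 1)
            if next_field is not None:
                transitions[(field, next_field)] += 1

    # Keep transitions seen strictly more than `more_than` times.
    return {pair: n for pair, n in transitions.items() if n > more_than}


# Toy rows, invented for the example; tweet "B" has a gap at rank 3.
toy_jobs = [
    {"status_id": "A", "rank": 1, "field": "Retail"},
    {"status_id": "A", "rank": 2, "field": "Restaurant / Food Services"},
    {"status_id": "A", "rank": 3, "field": "Hospitality"},
    {"status_id": "B", "rank": 1, "field": "Retail"},
    {"status_id": "B", "rank": 2, "field": "Restaurant / Food Services"},
    {"status_id": "B", "rank": 4, "field": "Hospitality"},
]
all_moves = count_transitions(toy_jobs, more_than=0)
frequent_moves = count_transitions(toy_jobs, more_than=1)
```

Sorting the resulting dictionary by count would give the most common transitions, which is the starting point for the directed graph mentioned above.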

Final words

I'm quite excited by the possibilities offered by MonkeyLearn for text mining. I might be a grumpy and skeptical statistician, so I tend to look at all the shortcomings of predictions, but really I think that if one takes the time to train a model well, one can then get pretty cool information from text written by humans.

Now, if you tweet about this article, I might go and look at MonkeyLearn's sentiment analysis for tweets model instead of reading them.

I invite you to check out the source code I've used for this post and to try to replicate this analysis. It would be great to hear about the insights you discover within the data.

Acknowledgements

All analyses were performed in R (https://www.R-project.org/). Note that I used the following R packages: rtweet, dplyr, tidyr, ggplot2, stringr, circlize, purrr, readr and of course monkeylearn. Thanks a lot to their authors, and obviously thanks to the people whose tweets I used... I might be a little bit more grateful to people who used separators and only posted 7 descriptions in their tweet :) If you want to read another #first7 analysis in R, I highly recommend David Robinson's post about the #7FavPackages hashtag.

Federico Pascual

September 1st, 2016
