9:00 am: Good morning
9:15 am: NLP with Python
11:20 am: Challenges & Fletcher work
12:00 pm: Lunch
1:30 pm: Work like you've never worked before
6:00 pm: Midcourse Destress Party. Fancy stuff.
NLTK.ipynb (39.4 KB)
Create a MongoDB collection of tweets about a topic of your choice (using the Twitter API, run a search and collect at least 500 tweets). Each MongoDB document should contain the text, username, favorite count, and retweet count of the tweet.
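A minimal sketch of the document shape the exercise asks for, assuming the Twitter REST API v1.1 status JSON layout; the tweepy and pymongo setup is elided into comments, and the function and collection names here are placeholders, not part of the assignment.

```python
def tweet_to_document(status):
    """Map a raw tweet dict (v1.1 "status" JSON) to the MongoDB document
    shape the exercise asks for: text, username, favorite and retweet counts."""
    return {
        "text": status["text"],
        "username": status["user"]["screen_name"],
        "favorite_count": status["favorite_count"],
        "retweet_count": status["retweet_count"],
    }

# Sketch of the surrounding plumbing (requires tweepy + pymongo + credentials):
#
#   import tweepy, pymongo
#   api = tweepy.API(auth)  # auth built from your Twitter API keys
#   collection = pymongo.MongoClient().twitter_db.tweets
#   for status in tweepy.Cursor(api.search, q="your topic").items(500):
#       collection.insert_one(tweet_to_document(status._json))
```

The helper is kept separate from the API calls so you can test the mapping on a plain dict before wiring in credentials.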
Calculate the tf-idf for each word in the tweets you uploaded to your MongoDB. For each tweet, print the term with the highest tf-idf. What are potential problems when trying to calculate tf-idf on really short documents like tweets?
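A from-scratch sketch of the computation, using the plain tf-idf definition (term frequency times log inverse document frequency); the toy tweets and function name are illustrative, not part of the assignment. In practice you would pull the texts out of your MongoDB collection first.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: tf-idf score} dict per doc."""
    n = len(docs)
    df = Counter()                 # document frequency of each term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # tf-idf = (count / doc length) * log(N / df)
        scores.append({w: (tf[w] / len(d)) * math.log(n / df[w]) for w in tf})
    return scores

tweets = [
    "the cat sat on the mat".split(),
    "the dog ate my homework".split(),
    "cat videos are the best".split(),
]
for doc_scores in tfidf(tweets):
    print(max(doc_scores, key=doc_scores.get))
```

Note that a term appearing in every document (like "the" above) gets a score of exactly zero, since log(N/N) = 0.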
Comprehensive yet still short NLTK tutorial
The official NLTK book
TextBlob full documentation (Awesome, easy to read, very short but to the point. Check this out.)
List of part-of-speech tags
What does VBZ mean? etc.
MIT slides on chunking
If you want to learn more on chunking and prepare your own chunking classifiers, these will help.
Demo for different tokenizers
Demo for different stemmers
These let you get a feel for what the different tokenizers and stemmers do.
tf-idf on Wikipedia
tf-idf tutorial with textblob
Stanford slides on text classification with naive Bayes
Naive Bayes on Wikipedia
Naive Bayes Spam Filtering on Wikipedia
Empirically, naive Bayes works really well on text classification.
One example is spam filtering. Two classes (spam/not spam).
A bag-of-words model (treating each word as an independent feature), using tf-idf values as feature weights in MultinomialNB, works wonders.
However, naive Bayes is not necessarily the ultimate text classifier. It works reasonably well if you don't have huge amounts of data. If you do have enough data, though, strong classifiers like SVMs can surpass it (but they may take longer to train). So don't be shy about trying other classifiers.