The purpose of this post is to test topic modeling techniques with Python on Arabic texts, in order to assess the efficiency of the approach I used in my previous work on a different language.
The same code may be applied as is to any “brand” by changing the keywords searched when querying the Twitter API.
My approach is the following:
- Use the Twitter API to extract up to 500 Arabic tweets using selected keywords related to a brand (in this example, I will choose Renault, “رينو”)
- Save the tweets into a Mongo database
- Filter out retweets, as well as Arabic and English stop words
- Tokenize (using words, bigrams and trigrams)
- Vectorize (using normalised tf-idf)
- Reduce dimensionality
- Apply Agglomerative Clustering or Latent Dirichlet Allocation techniques in order to identify relevant topics
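To make the filtering, tokenizing and vectorizing steps above more concrete, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not the code used for this post: the sample tweets, the tiny `STOP_WORDS` set, the `RT @` retweet heuristic and the helper names `is_retweet`, `tokenize` and `tfidf_vectors` are all hypothetical, and the smoothed idf formula is just one common variant.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word set (assumption: a real run would use fuller lists).
STOP_WORDS = {"في", "من", "على", "the", "a", "of", "is"}


def is_retweet(text):
    """Treat tweets starting with 'RT @' as retweets (a simple heuristic)."""
    return text.lstrip().startswith("RT @")


def tokenize(text, max_n=3):
    """Return word unigrams, bigrams and trigrams, with stop words removed."""
    words = [w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS]
    tokens = []
    for n in range(1, max_n + 1):
        tokens += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return tokens


def tfidf_vectors(docs):
    """Return one L2-normalised tf-idf dict (term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf, so a term present in every document keeps a non-zero weight.
        weights = {
            t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


# Made-up sample tweets, for illustration only.
tweets = [
    "RT @someone: رينو تطلق سيارة جديدة",
    "رينو تطلق سيارة جديدة في المغرب",
    "أسعار سيارات رينو من الوكيل",
]

kept = [t for t in tweets if not is_retweet(t)]   # drop retweets
docs = [tokenize(t) for t in kept]                # words, bigrams, trigrams
vectors = tfidf_vectors(docs)                     # normalised tf-idf weights
```

Each entry of `vectors` maps a token (a word, bigram or trigram) to its weight in one tweet. A term that appears in every tweet, such as the brand name itself, ends up with a lower weight than terms specific to a single tweet, which is what makes these vectors a reasonable input for the dimensionality reduction and clustering steps.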
Follow this link to learn more about this approach. You can also contact me for further explanation if you are interested in applying it to your own brand by analyzing a large volume of Arabic text.