A data scientist blog, by Philippe Dagher

Classifying Bees With Google TensorFlow

There is often confusion about the differences between bumblebees and honeybees, and even some of our top media channels publish pictures of bumblebees when they are writing about honeybees.

These bees have different behaviors and appearances, but given the variety of backgrounds, positions, and image resolutions it can be a challenge for machines to tell them apart.

Wild bees are important pollinators and the spread of colony collapse disorder has only made their role more critical.

In this post, we will build a basic TensorFlow model to determine the genus, Apis (honey bee) or Bombus (bumble bee), based on photographs of the insects. The purpose is to test Google TensorFlow, not to reach the 99.56% accuracy obtained during the Metis challenge.

How to Install Step by Step a Local Data Lake (2/3)

This post is the second of a series on How to Install Step by Step a Local Data Lake. Before reading, I suggest following this tutorial, which will allow you to get the tools up and running on a hosted or virtual machine.

You should now have the following architecture ready to receive data flows, crunch them and expose them to Machine Learning or Business Intelligence tools:


The purpose of this post is to process the data received according to its type of flow, schema and format, then save it in Parquet tables for later use by Machine Learning tools and in Hive tables for JDBC access by BI tools.


Step by Step Installation of a Local Data Lake (1/3)

This post will guide you through a step-by-step installation and configuration of a Local Data Lake on Ubuntu with packages such as Hadoop, Hive, Spark, Thriftserver, Maven, Scala, Python, Jupyter and Zeppelin.

It is the first of a series of 3 posts that will allow you to familiarize yourself with state-of-the-art tools for practicing Data Science on Big Data.

In the first post, we will set up the environment on Ubuntu using a cloud host or a virtual machine. In the second post, we will crunch incoming data and expose it to data mining and machine learning tools. In the third post, we will apply machine learning and data science techniques to conventional business cases.

Metis Discourse

You will find below snapshots from what I learned and practiced during 12 weeks of Metis Data Science Bootcamp.

I will soon be organizing after-work sessions in Paris for people who are curious about Data Science and eager to learn without being afraid of getting their hands dirty.

Arabic Reputation

The purpose of this post is to test topic modeling techniques with Python on Arabic texts in order to gauge the efficiency of the approach used in my previous work on a different language.

Behaviour Modeling

The objective of my final project at Metis, from weeks 9 to 12, is to categorize drivers based on their behaviour on the roads: their driving style and the type of roads that they follow.

The challenge associated with this objective is to uniquely identify a driver (and hence their own “driving behaviour”) based on the GPS log of a mobile phone located inside the car.

My idea to solve this issue is to experiment with Topic Modeling techniques, especially Latent Semantic Indexing/Analysis (LSI/LSA) and Latent Dirichlet Allocation (LDA), and to explain the observed trips by the unobserved behaviour of drivers.

Unsupervised Learning

Renault, a leading French car manufacturer, is currently launching the “Espace V”. Let’s discover with data science how the market is welcoming this new car model.

For that purpose, I scraped more than 3000 forum messages from 3 major automotive websites in France: Forum-Auto, Passion-Espace and PlanetRenault to analyze the sentiment of Renault lovers on the Espace V.

This post is related to the work done in weeks 7 and 8 at Metis New Economy Skills Training in New York, using the Natural Language Toolkit (NLTK), Unsupervised Learning techniques such as Clustering Algorithms (K-means, Hierarchical Clustering, DBSCAN, Mean Shift, etc.), Dimensionality Reduction, Topic Modeling with Latent Dirichlet Allocation, as well as nearest neighbor and approximate nearest neighbor algorithms (kd-trees, Locality Sensitive Hashing, etc.).

I was able to identify the main topics that internet users are discussing:

Supervised Learning

Can we predict heart disease? Yes!

Knowledge of the risk factors associated with heart disease helps health care professionals to identify patients at high risk of having heart disease. The main objective of this project, which I led in weeks 4, 5 and 6 at Metis New Economy Skills Training in New York, is to develop an Intelligent Heart Disease Prediction System that uses the patient’s diagnosis data to perform the prediction.

The dataset I looked at is publicly available from the University of California. It comprises 4 databases, coming from the Hungarian Institute of Cardiology in Budapest, the University Hospitals of Zurich and Basel in Switzerland, as well as the V.A. Medical Center in Long Beach and the Cleveland Clinic Foundation in the USA.

Risk factors associated with heart disease proved to be age, blood pressure, smoking habit, total cholesterol, diabetes, family history of heart disease, obesity, lack of physical activity, etc. The attributes from each patient that I considered are described in this file and will be detailed in the code section below.

To build my prediction model, I used a wide range of supervised machine learning classifiers, including Logistic Regression, K Nearest Neighbor, Decision Trees, Random Forests, various Naive Bayes implementations, Support Vector Machines and Generalized Linear Models (using Poisson and Ordinal regressions). I also tried deep learning techniques such as Neural Networks and the Restricted Boltzmann Machine. In addition, I applied feature selection and feature extraction techniques in order to improve my model.

The metrics that I wanted to optimize are Precision and Recall. Precision is the proportion of people who actually develop heart disease among those the model says will. A precision of 50% means that only half of those the model says will develop heart disease actually develop it. We need a high Precision in order to avoid predicting heart disease for healthy people!
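As a minimal sketch of the two metrics, the snippet below counts true positives, false positives and false negatives on a handful of made-up labels (1 = heart disease, 0 = healthy). The function name `precision_recall` and the eight example patients are illustrative assumptions, not taken from the project; Recall is computed with its standard definition, the share of actual cases that the model catches.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 = heart disease."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # sick, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # sick, missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels for eight hypothetical patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.50 recall=0.50
```

Here the model flags four patients and only two of them are actually sick, which is the 50% precision case described above.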

Linear Regression

This post is about Linear Regression, or Project Luther, which I led in weeks 2 and 3 at Metis New Economy Skills Training in New York. My client, PROMOCINEMA, an advertising agency located in France, is in charge of promoting movies when they are released on French screens.

Promocinéma works for major distributors that exclusively represent 14 US studios. They are paid a percentage of the revenues generated by these movies in the French market. Hence, they need to minimize their advertising budget in order to maximize profit.

My proposal is to predict the revenues generated by a movie, based on:
1- the rating issued by the French press when they preview the movie before its release
2- the revenues generated by the movie in the US market during the first weekend of its release
3- the delta in release dates between French and US markets
so that Promocinéma can adjust their budget accordingly…

The conclusion of this analysis is twofold:
1- each star given by a journalist is worth 21,948 admissions (tickets sold),
2- one week of delay in releasing a movie in the French market is worth 45,074 admissions.
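Per-star and per-week figures like these read naturally as the coefficients of a linear model with the three predictors listed above. The sketch below is not the code used in the project: it is a dependency-free illustration that fits ordinary least squares through the normal equations on synthetic, noise-free data. The function name `fit_linear`, the six made-up movies and the "true" coefficients (10, 2, 3, -1) are all assumptions chosen so the fit can be checked by eye.

```python
def fit_linear(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    Returns [intercept, coef_1, ..., coef_k]. Assumes X has full column rank.
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept column
    n = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


# Synthetic movies: (press stars, US first-weekend revenue, release delay in weeks).
movies = [(1, 10, 0), (2, 40, 1), (3, 20, 4), (4, 80, 2), (5, 30, 6), (3, 60, 0)]

# Hypothetical "true" effects used only to generate noise-free targets.
true_coefs = [10.0, 2.0, 3.0, -1.0]  # intercept, per star, per revenue unit, per week
outcome = [true_coefs[0] + sum(c * v for c, v in zip(true_coefs[1:], m))
           for m in movies]

coefs = fit_linear(movies, outcome)
print([round(c, 6) for c in coefs])
```

Because the targets contain no noise, the fitted values should match the generating coefficients up to floating-point error; on real box-office data the coefficients would come with uncertainty that a statistics library can report.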