Welcome to Yanting's Data Science Blog

How to Deal with High Cardinality Categorical Variables

Background and Methods

In machine learning problems, we often encounter categorical features such as gender, address, and zip code. For low-cardinality attributes, which take only a small number of possible values, one-hot encoding (OHE) is widely used. This encoding scheme represents each value of the original categorical …
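As a rough illustration of the one-hot encoding idea mentioned in this excerpt, here is a minimal pure-Python sketch. The helper name `one_hot_encode` and the sample data are hypothetical and not taken from the post; in practice a library encoder would normally be used instead.

```python
def one_hot_encode(values):
    """Map each distinct category to a binary indicator vector.

    Hypothetical helper for illustration only.
    """
    # Fix a stable column order by sorting the distinct categories.
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}

    encoded = []
    for v in values:
        row = [0] * len(categories)  # one column per category
        row[index[v]] = 1            # mark the column for this value
        encoded.append(row)
    return categories, encoded


# Made-up sample data: a feature with three distinct values.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)
```

Each row has one column per distinct category, so the width of the encoding grows with the number of distinct values. That growth is why this scheme suits low-cardinality features.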

more ...

Identifying Duplicate Quora Question Pairs (Kaggle Competition Bronze Medal Winner)

We explored current methods in NLP, including word2vec embeddings (the gensim package in Python), LSTMs (using the Keras neural networks API), tf-idf, the Python nltk package, etc.
We built machine learning models that identified duplicate Quora question pairs with high accuracy (log loss ~0.151).
We ranked in the top 8% in this Kaggle …

more ...

Using Geopy for Spatial Data Analysis

Geopy is a useful Python package for working with spatial data, such as finding the coordinates of addresses, cities, countries, and landmarks, or the reverse. In the Two Sigma Connect Competition on Kaggle, I found that some of the given latitude and longitude values did not match the addresses, so I used Geopy …

more ...

Creating Word Cloud in Python

In text analysis, creating word clouds is a useful technique for visualizing text data. Words that appear bigger and bolder occur more frequently in the corpus. In other words, key words stand out and catch our eyes. The colors of the text are generated randomly.
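The frequency-to-size idea behind a word cloud can be sketched with the standard library alone. This is only an illustration under assumptions: the function names, the linear size scaling, and the sample text are all hypothetical, and no word-cloud library is used here.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercase word occurs in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def font_sizes(freqs, min_size=10, max_size=60):
    """Scale counts linearly so the most frequent word gets max_size.

    The linear scaling is an assumption made for illustration.
    """
    if not freqs:
        return {}
    top = max(freqs.values())
    return {
        word: min_size + (max_size - min_size) * count / top
        for word, count in freqs.items()
    }


# Made-up sample text.
sample = "data science blog about data and more data"
freqs = word_frequencies(sample)
sizes = font_sizes(freqs)
```

Here the most frequent word receives the largest size, which is the effect that makes key words stand out in the rendered cloud.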

It is …

more ...

How I Build My First Pelican Blog

After completing several data science projects, I was eager to document them and share them with people. It took me several days to research, set up, and write my blog, but I feel it can be much easier and faster to build a Pelican blog, so I am sharing with …

more ...