What are Isolation Forests? How to use them for Anomaly Detection?

All of us know random forests, one of the most popular ML models. They are a supervised learning algorithm used in a wide variety of applications for classification and regression. Can we use random forests in an unsupervised setting, where we have no labeled data? Isolation forests are a variation of random forests that can…

What is One-Class SVM? How to use it for anomaly detection?

One-class SVM is a variation of the SVM that can be used in an unsupervised setting for anomaly detection. Let’s say we are analyzing credit card transactions to identify fraud. We are likely to have many normal transactions and very few fraudulent ones. Also, the next fraudulent transaction might be completely different from all previous…

What does the typical day of a data scientist look like?

Being a data scientist is much more than simply churning out models with a lot of math! This video breaks down and explains the tasks in the typical day of a data scientist: communicating with stakeholders; analyzing data; designing the end-to-end data pipeline; building models; tuning models; testing and debugging; evaluating models; measuring…

Can we use the AUC Metric for an SVM Classifier?

What is AUC? AUC is the area under the ROC curve. It is a popular classification metric. Classifiers such as logistic regression and naive Bayes predict class probabilities as the outcome instead of predicting the labels themselves. A new data point is classified as positive if the predicted probability of positive class…
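As a minimal sketch of the idea, AUC can be computed directly from scores as the fraction of (positive, negative) pairs in which the positive example receives the higher score, which equals the area under the ROC curve. The function name `auc_from_scores`, the sample labels and probabilities, and the 0.5 threshold are all illustrative assumptions, not taken from the post.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical true labels and predicted probabilities of the positive class.
labels = [1, 0, 1, 0, 1]
probs = [0.9, 0.4, 0.65, 0.7, 0.8]
auc = auc_from_scores(labels, probs)

# Turning probabilities into hard labels needs a threshold (0.5 is assumed here).
predicted = [1 if p >= 0.5 else 0 for p in probs]
```

This computation uses only the ordering of the scores, so it accepts any real-valued score and does not require a threshold.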

Finding the Right Data Science Job with Online Networking

When I was graduating from the University of Utah, not many companies turned up for campus placements, since we had a good but very small department with fewer than 20 MS and PhD students at the time. While I had a few companies that interviewed me, I felt…

What is the difference between a Bar Chart and a Histogram?

A histogram represents the distribution of a numerical variable. A bar chart is typically used to compare numeric values corresponding to categorical variables. To construct a histogram: X-axis: Usually the range of values is binned. In other words, the entire range is divided into a series of intervals and each interval occupies a slot on the…
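The binning step can be sketched in a few lines of plain Python. This is an illustrative version that assumes equal-width bins; the function name `histogram_counts` and the sample ages are hypothetical.

```python
def histogram_counts(values, num_bins):
    """Split the range of `values` into equal-width bins and count each bin."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("values must span a non-zero range")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value would fall just past the last bin, so clamp it.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


# Hypothetical numerical variable: ages of ten people.
ages = [21, 22, 25, 31, 34, 38, 41, 45, 59, 61]
edges, counts = histogram_counts(ages, num_bins=4)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {'#' * count}")
```

Each bin's count is what a histogram draws as bar height, whereas a bar chart would plot one bar per category with no binning step.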

Learn Data Science and Machine Learning from Scratch

Transitioning to a new field is challenging and not for the faint-hearted. It is not very different from climbing a mountain! To become a data scientist you need to learn: some math (stats, linear algebra, optimization); programming (preferably Python or R); and the art of working with and analyzing data. But…

What is the difference between a Histogram and a Pareto plot?

A histogram is a bar graph that uses the height of each bar to convey the frequency of an event occurring. Each bar in a histogram corresponds to the frequency of occurrence of a specific event. A Pareto chart orders its bars by height, tallest first, signifying the order of impact. It follows the Pareto philosophy (the 80/20 rule) through…
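The ordering behind a Pareto chart can be sketched as follows. The excerpt is cut off before it describes the rest of the chart, so the cumulative-percentage column here is an assumption based on how Pareto charts are commonly drawn; the function name `pareto_table` and the defect counts are hypothetical.

```python
def pareto_table(counts):
    """Sort categories by count (largest first) and add a cumulative percentage."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running = 0
    for category, count in ordered:
        running += count
        rows.append((category, count, round(100 * running / total, 1)))
    return rows


# Hypothetical defect counts from a quality inspection.
defects = {"scratch": 10, "dent": 50, "misprint": 25, "other": 15}
rows = pareto_table(defects)
for category, count, cumulative in rows:
    print(f"{category:10s} {count:3d} {cumulative:6.1f}%")
```

Reading down the cumulative column shows how few categories account for most of the total, which is the 80/20 idea the chart is meant to surface.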

What are the ACID properties in a database? For data analytics tasks, do you need to care about ACID properties?

ACID properties are important in an RDBMS setting where operations are transactional and database updates are involved as part of the task. For instance, a banking or an e-commerce application where real-time user data is updated typically needs an RDBMS. A data analyst typically handles structured data using query languages such as SQL. However,…

What are the different types of Joins while wrangling data?

Here are the different types of JOINs in SQL: (INNER) JOIN: Returns records that have matching values in both tables. LEFT (OUTER) JOIN: Returns ALL records from the left table, and the matched records from the right table. RIGHT (OUTER) JOIN: Returns ALL records from the right table, and the matched records from the…
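The three join types described so far can be mimicked on small in-memory tables. This is a rough sketch in plain Python rather than SQL, assuming each table is a list of dictionaries sharing one key column; the helper names and the `customers` and `orders` sample data are hypothetical.

```python
def inner_join(left, right, key):
    """Keep only row pairs whose key values match in both tables."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def left_join(left, right, key):
    """Keep every left row; fill right-hand columns with None when unmatched."""
    right_cols = {col for row in right for col in row if col != key}
    result = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            result.extend({**l, **r} for r in matches)
        else:
            result.append({**l, **{col: None for col in right_cols}})
    return result


def right_join(left, right, key):
    """A right join is a left join with the two tables swapped."""
    return left_join(right, left, key)


customers = [
    {"cust_id": 1, "name": "Ana"},
    {"cust_id": 2, "name": "Ben"},
    {"cust_id": 3, "name": "Cy"},
]
orders = [
    {"cust_id": 1, "item": "book"},
    {"cust_id": 1, "item": "pen"},
    {"cust_id": 4, "item": "lamp"},
]

inner_rows = inner_join(customers, orders, "cust_id")
left_rows = left_join(customers, orders, "cust_id")
right_rows = right_join(customers, orders, "cust_id")
```

Here `None` plays the role that NULL plays in SQL for columns with no matching row.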