# Lakshay Kumar  # Let numbers speak!

## Mathematics & Statistics for Data Analysis

Machine Learning and Data Science are all about playing with numbers. In this blog, I'll be covering some basic mathematical functions/terms that can help you in data analysis.

## Logarithm

Imagine you have to visualize the following table

 Company Revenue Tesla 31 Uber 11 Amazon 386 Jindal Steel 4.7 Axis Bank 5.6 Vedanta 11.3

Plotting this through a bar plot, this is what we'll get We can see that Amazon have such a high revenue that it is making it difficult to read the revenue for Jindal Steel and Axis Bank. So what we do now, here comes the concept of log. The Log is the inverse of an exponent. In the given equation - this means that b raised to the power c gives us a*,* where b is the base. The common base used is a base of 10. So in the above data, if I add a column for log base 10, this is what we'll get

 Company Revenue Log (base 10) Tesla 31 1.491361694 Uber 11 1.041392685 Amazon 386 2.586587305 Jindal Steel 4.7 0.6720978579 Axis Bank 5.6 0.748188027 Vedanta 11.3 1.053078443

We can see that in the log column, the numbers are very much comparable so as its plot, we can clearly see the difference in revenue of the companies in much better way than the above. In the log, every consecutive number is base times greater than the preceding one, for example, log5125 is 10 times greater than log525.

Practical examples include Earthquakes, where on a Richter Scale 5 is 10 times more powerful than 4.

## Mean and Deviations

Consider the following data on salaries for employees -

 Employee Name Salary (in USD) Jack 1678.3 Alan 2242.9 Kelvin 998.8 Jake 1938.6 Elly 1200.4

We can say that the average salary of a company is 1611.8 USD, but there is a huge difference between the salary of Kelvin and Alan when compared to the average. Imagine we have a new employee Addie with a salary of 25000 USD and then our new mean salary would be 5509.83 USD, but this isn't the case. We see except 1, everyone has their salary less than 2500 USD. So we can calculate the deviation from mean -

 Employee Name Salary (in USD) Deviation (Abs) Jack 1678.3 3831.533333 Alan 2242.9 3266.933333 Kelvin 998.8 4511.033333 Jake 1938.6 3571.233333 Elly 1200.4 4309.433333 Addie 25000 -19490.16667

We see there is a lot of deviation from the mean salary. Now how to sum this up in a single number?

Average won't work, because we have negative values here, so instead of normal averaging, we can square the numbers, add them, divide by the total number of samples, and then square root. This is called Standard Deviation.

 Employee Name Salary (in USD) Deviation (Abs) Deviation Sq. Jack 1678.3 3831.533333 14680647.68 Alan 2242.9 3266.933333 10672853.4 Kelvin 998.8 4511.033333 20349421.73 Jake 1938.6 3571.233333 12753707.52 Elly 1200.4 4309.433333 18571215.65 Addie 25000 -19490.16667 379866596.7 Mean 8726.343666

We can see here that there is so much deviation from the mean salary here..