Monthly Archives: March 2017
The average quantitative strategy may take from 10 weeks to seven months to develop, code, test and launch. It is important to note that alpha generation platforms differ from low latency algorithmic trading systems. Alpha generation platforms focus solely on quantitative investment research rather than the rapid trading of investments. While some of these platforms do allow analysts to take their strategies to market, others focus solely on the research and development of these highly complex mathematical and statistical models.
Cross Validation: Each sample is separated into random, equal-sized sub-samples. The model is trained on some sub-samples and validated on the rest, which helps to estimate and improve model performance.
Different forms of cross-validation:
- Train-Test Split – low variance but more bias
- LOOCV (Leave-One-Out Cross-Validation) – leave one data point out and fit the model on the rest of the data; low bias but high variance
Both of these methods are limited by the trade-off between bias and variance, so what to do? Let’s fire up k-fold ‘Cross-Validation’!
There are various other interesting cross-validation methods, such as TimeSeriesSplit, Leave-P-Out (LPO), random permutation splits (ShuffleSplit, i.e. shuffle and split) and StratifiedKFold:
Some classification problems can exhibit a large imbalance in the distribution of the target classes: for instance there could be several times more negative samples than positive samples. In such cases it is recommended to use stratified sampling as implemented in StratifiedShuffleSplit to ensure that relative class frequencies are approximately preserved in each train and validation fold.
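To make the idea of stratification concrete, here is a minimal sketch in plain Python. It is not scikit-learn's implementation; the function name `stratified_folds` and the 90/10 toy labels are assumptions made up for illustration. It deals the indices of each class round-robin across the folds so that every fold keeps roughly the same class ratio.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Assign each sample index to a fold so class ratios stay roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


# Hypothetical imbalanced target: 90 negative samples, 10 positive samples.
labels = [0] * 90 + [1] * 10
folds = stratified_folds(labels, n_folds=5)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
```

Each of the five folds ends up with 20 samples, 2 of them positive, so the 10% positive rate of the full data set is preserved in every fold.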
Quant trading is intellectually very satisfying because it draws heavily from fields as diverse as computer science, statistics, math, psychology, economics, business, operations research, and history.
Quant trading combines programming and trading. Programming is the fastest way to turn an idea into reality (compare it with writing and publishing a book, or with designing a building, getting zoning permits and doing the construction). Trading is the most direct approach to making money (compare it with starting the next Facebook, which means programming, hiring and advertising, or with creating a retail chain, which means hiring and waiting for money to flow in before you can reinvest). Combining programming and trading gives you the most direct path from an idea in your brain to cash, but never forget that your main job is to minimize risk.
Trading and Trading Strategies:
Many terms sound complicated when you first read or hear about trading, for example Statistical Arbitrage Trading. It looks complicated, but in practice it is another form of pairs trading. Pairs trading is usually understood as buying one and selling the other of two inter-related stocks, commodities, currencies or futures. A trading strategy is a set of calculations that a trader or quant believes will work out in the markets in the near or distant future. Strategies can be run fully automated or semi-automated. To write a trading strategy one does not need to know programming or advanced mathematics, but one does need to understand simple statistical concepts. On the other hand, an advanced degree in mathematics or a similar field does not guarantee that one can write “profitable” trading strategies. I believe past experience in trading matters as well.
How Easy Is It to Earn from Trading:
When we talk about programming languages, one question is often asked: ‘How fast is it?’, and the counter-question is always ‘Define fast’. I would like to apply the same idea here: ‘define easy’. We can say that earning from trading is EASY, but we must not confuse easy with ‘instant’. Most of the time, people and brokerage firms are more interested in intra-day trading because the volume is large, which means more buy and sell opportunities.
Borrow Money While Trading:
The good thing about trading is that you can start with a small amount of money and you can always get leverage. Leverage is money you borrow from a trading firm or broker in order to trade: the firm takes its share when you make a profit, but in the case of a loss it is all yours. A good trading strategy is one with good returns, a high Sharpe ratio and small draw-downs, and it uses as little leverage as possible. Suppose you get a very good tip about a really interesting investment that you are pretty sure will work: even then you should use as little leverage as possible. It is dangerous to over-leverage in pursuit of overnight riches.
How much time is required for a self-running Quantitative Business:
The time required really depends on the type of trading strategies you are using. Some trading strategies or machine-learning models need to be trained a few minutes before the market opens, while others need to be trained after the market closes. Sometimes one comes up with strategies that retrain their models once a week and generate buy/sell calls once a month.
How Hard it is to Create Trading Strategies:
Creating trading strategies can be as hard as calculus and derivatives or as easy as finding the mean or median of a given set of data. You can consider many things while creating a trading strategy, but the simplest advice is to think more and more in terms of quantitative techniques. Don’t aim for the most precise, perfect or accurate results; look instead for the valuable information, and quantify the valuable things you need to know about your trading model.
Importance of Programming in Quantitative Finance:
Developing a trading strategy is a matter of taste. One can develop trading strategies just by observing the market and writing a daily schedule, such as generating a buy call at one time and a sell call at another, and that could also work out. But the real goal is to create a trading bot that does this automatically, so that you can observe it from the outside and look into the various things that can be tuned. Converting a trading strategy into a program/code/algorithm is therefore one of the most important tasks we should pursue.
A Dollar-neutral Portfolio:
In a dollar-neutral portfolio, the market value of the long positions equals the market value of the short positions. In a market-neutral portfolio, the beta of the portfolio with respect to a market index is close to zero, where beta measures the ratio between the expected returns of the portfolio and the expected returns of the market index. Both kinds of portfolio require twice the capital or leverage of a long- or short-only portfolio.
Directional Strategy – a strategy that only does one thing, either buy or sell.
Bi-directional Strategy – a strategy that does both buy and sell.
Survivorship Bias Free Data and its importance:
For intra-day trading one can back-test on a database that is not survivorship-bias free, because in most cases a good company does not collapse in a single day, and the same holds for the rise of a company.
How does the holding period affect your Trading Strategies:
I was building trading strategies based on various factors. Using machine learning, I predicted whether tomorrow’s close price would be higher than today’s: if so, buy the stock, otherwise sell it. It is a great exercise, but in practice we have to understand one thing carefully: I am holding the stock for a whole day, and anything could go wrong during that day. So what I had to do was shorten the holding period. The less time I hold a particular stock, the less chance there is for anything to go wrong, and the lower the risk. 🙂
Sharpe Ratio (most important thing):
The Sharpe ratio measures how consistent your returns are! Think of the Sharpe ratio as independent of the benchmark: if you want to create a strategy with good returns and a high Sharpe ratio, your understanding of the market has to yield rules that generalize. Your Sharpe ratio should be more than 1 (one); strategy returns could be less or more than benchmark returns. The Sharpe ratio is a special case of the information ratio in which the benchmark is the risk-free rate, and the formula is as follows:
Information Ratio (Sharpe Ratio) = Average of Excess Returns / Standard Deviation of Excess Returns
Excess Returns = Portfolio Returns - Benchmark Returns
As a rule of thumb, any strategy that has a Sharpe ratio of less than 1 is not suitable as a stand-alone strategy. For a strategy that achieves profitability almost every month, its (annualized) Sharpe ratio is typically greater than 2. For a strategy that is profitable almost every day, its Sharpe ratio is usually greater than 3.
One important thing you must know is that benchmarks vary according to the market/exchange you are working with.
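The formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: returns are daily, a year has 252 trading days (so the ratio is annualized by the square root of 252), and the function name `annualized_information_ratio` and the sample numbers are made up.

```python
import math
import statistics

TRADING_DAYS_PER_YEAR = 252  # assumption: daily returns, 252 trading days a year


def annualized_information_ratio(portfolio_returns, benchmark_returns):
    """Average excess return divided by its standard deviation, annualized."""
    excess = [p - b for p, b in zip(portfolio_returns, benchmark_returns)]
    mean_excess = statistics.mean(excess)
    std_excess = statistics.stdev(excess)  # sample standard deviation
    return math.sqrt(TRADING_DAYS_PER_YEAR) * mean_excess / std_excess


# Hypothetical two-day example; use the risk-free rate as the benchmark
# to get the Sharpe ratio instead of the information ratio.
portfolio = [0.02, 0.04]
benchmark = [0.01, 0.01]
ratio = annualized_information_ratio(portfolio, benchmark)
```

Two data points are of course far too few for a meaningful ratio; the tiny sample is only there to keep the arithmetic easy to follow.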
Sharpe-Ratio Vs Returns:
If the Sharpe ratio is such a nice performance measure across different strategies, you may wonder why it is not quoted more often instead of returns. A higher Sharpe ratio will actually allow you to make more profits in the end, since it allows you to trade at a higher leverage. It is the leveraged return that matters in the end, not the nominal return of a trading strategy.
Why and How Draw-Down is bad:
A strategy suffers a draw-down whenever it has lost money recently.
A draw-down at a given time t is defined as the difference between the current equity value (assuming no redemption or cash infusion) of the portfolio and the global maximum of the equity curve occurring on or before time t.
You must know draw-down of strategy before using it!
Length and depth of your Draw-down:
The length of your draw-down tells you how long it takes to recover the loss, and its depth tells you how much you can lose. Those results are based on your back-test, though; in real trading, draw-downs can turn out to be much smaller or larger than the back-tested results.
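Depth and length can both be read off an equity curve with a single pass that tracks the running peak. The sketch below is one possible way to do it, not a reference implementation: the name `drawdown_stats` and the equity values are hypothetical, and depth is expressed as a fraction of the peak rather than as the absolute difference used in the definition above.

```python
def drawdown_stats(equity_curve):
    """Return (max_depth, max_length) of the draw-downs in an equity curve.

    Depth is the largest fractional drop from the running peak;
    length is the longest run of consecutive periods spent below that peak.
    """
    peak = equity_curve[0]
    max_depth = 0.0
    max_length = 0
    current_length = 0
    for value in equity_curve:
        if value >= peak:
            # New high: any draw-down in progress has been recovered.
            peak = value
            current_length = 0
        else:
            current_length += 1
            max_depth = max(max_depth, (peak - value) / peak)
            max_length = max(max_length, current_length)
    return max_depth, max_length


# Hypothetical equity curve (account value at the end of each period).
equity = [100, 110, 105, 99, 104, 112, 108, 115]
depth, length = drawdown_stats(equity)
```

Here the worst draw-down runs from the peak of 110 down to 99, a depth of 10%, and it lasts three periods before the curve makes a new high at 112.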
What is Slippage?
Orders are not executed at the instant they are triggered. This delay can cause “slippage”: the difference between the price that triggers the order and the execution price.
How Will Transaction Costs Affect the Strategy?
A transaction cost is the brokerage fee or any other cost you pay when you buy or sell a stock. As we have seen, the shorter the holding period, the more profit you can make, or at least the less risk you carry. One thing you must never forget is that minimizing risk is what strategies and algorithmic trading are all about. Every time a strategy buys and sells a security, it incurs a transaction cost. The more frequently it trades, the larger the impact of transaction costs will be on the profitability of the strategy. These transaction costs are not just due to commission fees charged by the broker. There will also be the cost of liquidity: when you buy and sell securities at their market prices, you are paying the bid-ask spread. If you buy and sell securities using limit orders, however, you avoid the liquidity costs but incur opportunity costs. This is because your limit orders may not be executed, and therefore you may miss out on the potential profits of your trade. Also, when you buy or sell a large chunk of securities, you will not be able to complete the transaction without impacting the prices at which this transaction is done. This effect on the market prices due to your own order is called market impact, and it can contribute to a large part of the total transaction cost when the security is not very liquid.
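A quick back-of-the-envelope calculation shows how trading frequency eats into returns. Everything in this sketch is an assumed number chosen for illustration: a 12% gross annual return for both strategies and a flat cost of 0.05% of capital per round trip (commission plus spread); market impact and opportunity cost are ignored.

```python
def net_return(gross_return, round_trips, cost_per_round_trip):
    """Subtract total transaction costs (as a fraction of capital) from a gross return."""
    return gross_return - round_trips * cost_per_round_trip


# Hypothetical numbers: both strategies earn 12% gross per year,
# and each round trip (one buy plus one sell) costs 0.05% of capital.
COST = 0.0005
low_turnover = net_return(0.12, round_trips=20, cost_per_round_trip=COST)
high_turnover = net_return(0.12, round_trips=250, cost_per_round_trip=COST)
```

With 20 round trips a year the strategy keeps about 11%, while with 250 round trips the same gross return turns into a small loss.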
Why should survivorship bias not be there?
Back-testing on a database with survivorship bias inflates the historical performance of a strategy. This is especially true if the strategy has a “value” bent; that is, it tends to buy stocks that are cheap. Some stocks were cheap because the companies were going bankrupt shortly. So if your strategy includes only those cases when the stocks were very cheap but eventually survived (and maybe prospered) and neglects those cases where the stocks finally did get de-listed, the backtest performance will, of course, be much better than what a trader would actually have suffered at that time.
Data-Snooping Bias (Model Over-fitting):
If you build a trading strategy that has 100 parameters, it is very likely that you can optimize those parameters in such a way that the historical performance will look fantastic. It is also very likely that the future performance of this strategy will look nothing like its historical performance and will turn out to be very poor. By having so many parameters, you are probably fitting the model to historical accidents in the past that will not repeat themselves in the future. Actually, this so-called data-snooping bias is very hard to avoid even if you have just one or two parameters (such as entry and exit thresholds).
Important Questions to ask yourself:
1. How much time do you have for baby-sitting your trading programs?
2. How good a programmer are you?
3. How much capital do you have?
4. Is your goal to earn steady monthly income or to strive for a large, long-term capital gain?
Important questions you must ask yourself before using a Trading Strategy:
1. Does it outperform a benchmark?
2. Does it have a high enough Sharpe ratio?
3. Does it have a small enough drawdown and short enough draw-down duration?
4. Does the backtest suffer from survivorship bias?
5. Does the strategy lose steam in recent years compared to its earlier years?
6. Does the strategy have its own “niche” that protects it from intense competition from large institutional money managers?
Sharpe Ratio and draw-downs (length and duration):
Quantitative traders use a good variety of performance measures. Which set of numbers to use is sometimes a matter of personal preference, but with ease of comparisons across different strategies and traders in mind, I would argue that the Sharpe ratio and draw-downs are the two most important. Notice that I did not include average annualized returns, the measure most commonly quoted by investors, because if you use this measure, you have to tell people a number of things about what denominator you use to calculate returns. For example, in a long-short strategy, did you use just one side of capital or both sides in the denominator? Is the return a leveraged one (the denominator is based on account equity), or is it unleveraged (the denominator is based on market value of the portfolio)? If the equity or market value changes daily, do you use a moving average as the denominator, or just the value at the end of each day or each month? Most (but not all) of these problems associated with comparing returns can be avoided by quoting Sharpe ratio and draw-down instead as the standard performance measures.
Interesting things about Back-Testing:
Back-testing is the process of creating the historical trades given the historical information available at that time, and then finding out what the subsequent performance of those trades is. This process seems easy given that the trades were made using a computer algorithm in our case, but there are numerous ways in which it can go wrong. Usually, an erroneous back-test would produce a historical performance that is better than what we would have obtained in actual trading. We have already seen how survivorship bias in the data used for back-testing can result in inflated performance.
Importance of Sample Size (How much historical data you need?):
The most basic safeguard against data-snooping bias is to ensure that you have a sufficient amount of back-test data relative to the number of free parameters you want to optimize. As a rule of thumb, let’s assume that the number of data points needed for optimizing your parameters is equal to 252 times the number of free parameters your model has. So, for example, let’s assume you have a daily trading model with three parameters. Then you should have at least three years’ worth of back-test data with daily prices. However, if you have a three-parameter trading model that updates positions every minute, then you should have at least 252/390 year, or about seven months, of one-minute back-test data. (Note that if you have a daily trading model, then even if you have seven months of minute-by-minute data points, effectively you only have about 7 × 21 = 147 data points, far from sufficient for testing a three parameter model.)
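For the daily model, the rule of thumb is simple enough to turn into a sanity check. The helper names below are hypothetical, and the check counts only data points at the frequency at which the model actually makes decisions, which is the point of the note in parentheses above.

```python
DATA_POINTS_PER_PARAMETER = 252  # rule of thumb quoted above


def required_data_points(n_parameters):
    """Minimum number of back-test data points for a model with n free parameters."""
    return DATA_POINTS_PER_PARAMETER * n_parameters


def has_enough_data(n_data_points, n_parameters):
    return n_data_points >= required_data_points(n_parameters)


needed = required_data_points(3)   # 756 daily bars, i.e. three years of daily prices
seven_months_daily = 7 * 21        # about 147 daily bars in seven months
enough = has_enough_data(seven_months_daily, 3)
```

Seven months of data gives a daily three-parameter model only 147 of the 756 points it needs, so `enough` is `False`.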
Training-Set and Test-Set:
It is a very simple concept. You split the data into two portions: one is the training set, from which your model learns, and the other is the test set, on which you evaluate it.
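For price data, one common way to do the split (an assumption here, since the text above does not say how to split) is chronological: train on the earlier part of the history and test on the later part, so the model is never evaluated on data that precedes what it was trained on. The function name and prices below are made up.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split time-ordered rows so the test set lies strictly after the training set."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Hypothetical daily close prices, oldest first.
prices = [101.0, 102.5, 101.8, 103.2, 104.0, 103.5, 105.1, 106.0, 105.4, 107.2]
train, test = chronological_split(prices, train_fraction=0.8)
```

The first eight prices form the training set and the last two form the test set.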
What is a Profit cap?:
A profit cap is the level of profit at which you want your strategy to exit. In practice, reaching the profit cap is the ultimate goal of the strategy. A greedy strategy without a profit cap and a stop-loss could destroy all of your liquidity.
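A profit cap and a stop-loss can be expressed as one small exit rule. This is a minimal sketch for a long position; the function name `exit_signal` and the 5% and 2% thresholds are arbitrary assumptions, not recommended values.

```python
def exit_signal(entry_price, current_price, profit_cap=0.05, stop_loss=0.02):
    """Return 'take_profit', 'stop_loss' or 'hold' for a long position.

    profit_cap and stop_loss are fractions of the entry price
    (hypothetical defaults: exit at +5% or at -2%).
    """
    change = (current_price - entry_price) / entry_price
    if change >= profit_cap:
        return "take_profit"
    if change <= -stop_loss:
        return "stop_loss"
    return "hold"


signals = [exit_signal(100.0, price) for price in (101.0, 106.0, 97.0)]
```

With an entry at 100, a price of 101 keeps the position open, 106 hits the profit cap and 97 hits the stop-loss.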
A parameterless trading model is like a self-sustaining model that does everything by itself. That means you need to make sure that your model is safe and secure and that all the parameters, such as the profit cap, are calculated by the model itself.
The advantage of a parameterless trading model is that it minimizes the danger of over-fitting the model to multiple input parameters (the so-called “data-snooping bias”). So the back-test performance should be much closer to the actual forward performance. (Note that parameter optimization does not necessarily mean picking one best set of parameters that give the best back-test performance. Often, it is better to make a trading decision based on some kind of averages over different sets of parameters.)
A Simple understanding of Back-testing:
Backtesting is about conducting a realistic historical simulation of the performance of a strategy. The hope is that the future performance of the strategy will resemble its past performance, though as your investment manager will never tire of telling you, this is by no means guaranteed!
There are many nuts and bolts involved in creating a realistic historical back-test and in reducing the divergence of the future performance of the strategy from its back-test performance.
Things to take care of in Back-Testing:
Data: Split/dividend adjustments, noise in daily high/low, and survivorship bias.
Performance measurement: Annualized Sharpe ratio and maximum draw-down.
Look-ahead bias: Using unobtainable future information for past trading decisions.
Data-snooping bias: Using too many parameters to fit historical data, and avoiding it using large enough sample, out-of-sample testing, and sensitivity analysis.
Transaction cost: Impact of transaction costs on performance.
Strategy refinement: Common ways to make small variations on the strategy to optimize performance.
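Of the items in this list, look-ahead bias is the easiest to introduce by accident in code. One way to guard against it, sketched below under simple assumptions (daily closes, a signal decided at each close, positions held until the next close), is to always apply the signal from day t to the return from day t to day t+1. The momentum rule and the prices are hypothetical.

```python
def backtest_returns(closes, signals):
    """Apply each day's signal to the NEXT day's return to avoid look-ahead bias.

    signals[t] is +1 (long), -1 (short) or 0 (flat), decided with data up to closes[t].
    """
    daily_returns = [closes[t + 1] / closes[t] - 1 for t in range(len(closes) - 1)]
    # signals[t] earns daily_returns[t], the move from close t to close t+1.
    return [signals[t] * daily_returns[t] for t in range(len(daily_returns))]


closes = [100.0, 102.0, 101.0, 104.0]
# Hypothetical momentum rule: go long after an up day, short after a down day.
signals = [0] + [1 if closes[t] > closes[t - 1] else -1 for t in range(1, len(closes))]
strategy_returns = backtest_returns(closes, signals)
```

On this toy series the honest back-test loses on both trades. Pairing each signal with the return of the same day instead would show a profit on every trade, which is exactly the kind of inflated result look-ahead bias produces.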
Importance of Speed in Algorithmic Trading:
Many things come into play when you talk about HFT and speed. It matters which part of your trading algorithm takes the most time to execute. Think of this as the 90/10 rule in software development: optimize the 10% of your code that takes 90% of the time. If your trading strategy is written in Python, some portions of it may be slow, so it can be better to write those in C or C++. Alternatively, you can use Cython, which is fast both to develop in and to execute. Your internet connection should also be fast, so that you can receive data quickly and make decisions based on it.
Let’s again talk about importance of Programming:
You can use one of the various existing trading platforms, or you can create a custom one that combines various back-tests, different exchanges (through their APIs), different machine-learning models, and different platforms and programming languages.
How to Decrease Brokerage Cost using Programming?:
In order to minimize market impact cost, you should limit the size (number of shares) of your orders based on the liquidity of the stock. One common measure of liquidity is the average daily volume (it is your choice what lookback period you want to average over).
As a rule of thumb, each order should not exceed 1 percent of the average daily volume. As an independent trader, you may think that it is not easy to reach this 1 percent threshold, and you would be right when the stock in question is a large-cap stock belonging to the S&P 500. However, you may be surprised by the low liquidity of some small-cap stocks out there.
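The 1 percent rule is easy to enforce in code. The sketch below assumes a 20-day lookback for the average daily volume (the lookback is your choice, as noted above) and splits a large order into smaller child orders; the function names and volumes are hypothetical.

```python
def max_order_size(daily_volumes, lookback=20, max_fraction=0.01):
    """Largest order (in shares) allowed under the 1%-of-average-daily-volume rule."""
    recent = daily_volumes[-lookback:]
    average_daily_volume = sum(recent) / len(recent)
    return int(average_daily_volume * max_fraction)


def split_order(total_shares, limit):
    """Break a large order into child orders no bigger than `limit` shares."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    orders = []
    remaining = total_shares
    while remaining > 0:
        size = min(limit, remaining)
        orders.append(size)
        remaining -= size
    return orders


# Hypothetical small-cap stock trading 50,000 shares a day.
volumes = [50_000] * 20
limit = max_order_size(volumes)
children = split_order(1_200, limit)
```

For this stock the limit is 500 shares, so an order for 1,200 shares is sent as three child orders of 500, 500 and 200 shares.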
Paper Trading and Re-creating Strategies:(Testing in Real Market)
The moment you start paper trading you may realize that there is a glaring look-ahead bias in your strategy: there may just be no way you could have obtained some crucial piece of data before you enter an order! If this happens, it is “back to the drawing board.” You should be able to run your ATS, execute paper trades, and then compare the paper trades and profit and loss (P&L) with the theoretical ones generated by your backtest program using the latest data. If the difference is not due to transaction costs (including an expected delay in execution for the paper trades), then your software likely has bugs.
Another benefit of paper trading is that it gives you better intuitive understanding of your strategy, including the volatility of its P&L, the typical amount of capital utilized, the number of trades per day, and the various operational difficulties including data issues. Even though you can theoretically check out most of these features of your strategy in a back-test, one will usually gain intuition only if one faces them on a daily, ongoing basis. Back-testing also won’t reveal the operational difficulties, such as how fast you can download all the needed data before the market opens each day and how you can optimize your operational procedures in actual execution.
Psychology and Trading:
This is one of the most important concepts you must know. Trading involves real money, and real money can drive you mad in lots of ways. I am pointing this out again: algorithmic trading is all about minimizing your risk, not about getting really rich instantly. Yes, you can create strategies with a high Sharpe ratio and sell them to firms like JP Morgan or other big banks, and then you will be rich very quickly. 🙂
In the end it is not just the trading strategy you create: your mindset and experience also help you, and your capital, to grow.
How we can implement RISK-Management Using Programming?:
Calculating the Kelly criterion is relatively simple and relies on two basic components: your trading strategy’s win percentage probability and its win to loss ratio.
The win percentage probability is the probability that a trade will have a positive return. The win to loss ratio is equal to the average profit of your winning trades divided by the average loss of your losing trades.
These will help you arrive at a number called the Kelly percentage. This gives you a guide to what percentage of your trading account is the maximum amount you should risk on any given trade.
The formula for the Kelly percentage looks like this:
Kelly % = W – [(1 – W) / R]
- Kelly % = Kelly percentage
- W = Win percentage probability (total number of winning trades/total number of trades)
- R = Win to loss ratio (average gain of winning trades/average loss of losing trades)
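The formula can be applied directly to a list of past trade results. In this sketch, R is computed as the average win divided by the average loss, which is the per-trade payoff ratio the Kelly formula assumes (it equals the ratio of total gains to total losses only when there are as many winning as losing trades). The function name and the profit-and-loss figures are hypothetical, and break-even trades are counted as neither wins nor losses.

```python
def kelly_percentage(trade_pnls):
    """Kelly % = W - (1 - W) / R for a list of per-trade profits and losses."""
    wins = [p for p in trade_pnls if p > 0]
    losses = [-p for p in trade_pnls if p < 0]
    if not wins or not losses:
        raise ValueError("need at least one winning and one losing trade")
    w = len(wins) / len(trade_pnls)                           # win probability W
    r = (sum(wins) / len(wins)) / (sum(losses) / len(losses)) # win to loss ratio R
    return w - (1 - w) / r


# Hypothetical trade history: 5 winners averaging 150, 3 losers averaging 100.
pnls = [200, -100, 150, -50, 250, 100, -150, 50]
kelly = kelly_percentage(pnls)
```

Here W = 5/8 = 0.625 and R = 150/100 = 1.5, so the Kelly percentage is 0.625 - 0.375/1.5 = 0.375, meaning at most 37.5% of the account at risk on any given trade. Many traders risk only a fraction of that figure.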
Quantitative Trading by Ernie P Chan: http://www.amazon.in/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889
Sometimes we believe that computer programs are magic: software works like magic inside the machine, you click a few buttons on your PC and boom! The real work is far away from all that. What you actually have to do is handle large databases, parse them as required, and do lots of UI tricks and exception handling, and that is what leads to a great product, or at least one a real human can handle! Sometimes it feels as if the same is true of machine learning and data science. Selecting some predictors and just applying the most popular model might not be enough. We are developing a model and learning with it, so it makes sense to grow with it and let your models grow as well.
Let’s have a discussion with ourselves and learn how good models can lead to better trading strategies.
Why do we use ML models?:
ML models are used because they save us from writing hundreds or maybe thousands of lines of code. That is the power of mathematics: one equation can hold knowledge worth hundreds of pages. (Remember binomial expressions for calculating probability?)
Machine-learning solutions are cost-effective, and implementing such models takes far less time than designing an expert system, which takes a large amount of time and may still produce error-prone results. A simple ML model (a bit complicated and a little bit smart) can be built in a few hours, and the following years can be spent on making it more reliable for decision making.
How can Machine learning models learn more Effectively?:
When we are deep in the model-building process, one thing can confuse us: which model to use. That is the real question one should be asking. There are not just hundreds but thousands of algorithms to choose from, and each year more are developed with different sets of features, but the pressing question remains: which model to use?
The model you want to use really depends on your requirements and on the type of features you believe could help to predict the outcomes. When choosing any algorithm, one should take care of three things: Representation, Evaluation and Optimization. Representation describes the features one needs to pick for predictions, and your algorithm must be able to handle that feature space. The set of models the algorithm can represent over that space is called the hypothesis space.
Evaluation plays an important role as well. When you are figuring out which algorithm you should use or which one gives better results, you need some kind of evaluation function for your algorithm/model/classifier. Such an evaluation function can be internal to the algorithm or external to it; it all depends on your choice of algorithm/model.
Optimization is another thing that plays a very important role in every aspect of building a model. There are various factors to consider, such as the ‘predictive score of the model’, the ‘learning rate’, ‘your model’s memory consumption’ and so on. The optimization step also determines how the search through the hypothesis space is carried out.
How do Machine-Learning Algorithms magically make predictions?
One thing we have to understand carefully is approximation/generalization. My favorite line about mathematics is: ‘Mathematics for engineers is all about finding an approximate solution, and that approximate solution is just enough to put a rover on Mars’. An approximate solution gets us close enough to make predictions. Splitting your data into a training set and a test set is best practice, and one should never underestimate it, because the results on the training and test sets are what drive the optimization process. Optimization here means things like choosing the type of search, for example ‘Beam Search’ or ‘Greedy Search’, and watching how the test score improves. Notice that generalization being the goal has an interesting consequence for machine learning. Unlike in most other optimization problems, we don’t have access to the function we want to optimize! So all your machine-learning model does is find common relations between various predictors by experimenting thousands, millions or billions of times on your training data. The patterns it finds in this way may or may not hold in the future; they are only as good as the experimentation, hypotheses and generalization behind them.
Why are Machine learning and Predictions called BIG-Data?:
Let’s take an example that explains how much data is good data. Suppose you are looking for a job: your chance of getting one is higher if you apply for many positions. In the same way, more data beats a cleverer algorithm, because the machine sees more cases from which to learn the classifications. On the other hand, big data also demands heavy computing power, and tools like Apache Spark or Hadoop are industry-standard examples that can be used to process big data and present it in an understandable form as fast as possible.
How many models/Algorithms I must learn?
The simple answer is: learn as many as you can and apply as many as you can. Ensemble learning is a very popular technique in which multiple algorithms are combined to achieve better results. For example, a Random Forest uses a set of decision trees, each trained on a random variation of the data, and combines their outputs to predict the outcome. To understand ensemble learning one has to look into the three most important approaches: 1. Bagging 2. Boosting and 3. Stacking
Bagging: we simply generate random variations of the training set by re-sampling, learn a classifier on each, and combine the results by voting. This works because it greatly reduces variance while only slightly increasing bias.
Boosting: In boosting, training examples have weights, and these are varied so that each new classifier focuses on the examples the previous ones tended to get wrong.
Stacking: In stacking, the outputs of individual classifiers become the inputs of a “higher-level” learner that figures out how best to combine them.
Putting it all together: split the training set into random variations, train a model on each variation and combine their outputs by voting [Bagging]. Give more weight to the examples the earlier models got wrong, so that later models focus on them [Boosting]. Finally, stack them all: in stacking, the outputs of the individual lower-level models act as the training set for a higher-level model [Stacking]. Because the higher-level model also takes the decisions of the weaker individual models into account, it reduces the bias toward the sub-models with the best scores.
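Bagging is the simplest of the three to sketch from scratch. The toy example below is an illustration only: the base learner is a made-up one-feature threshold rule (it assumes class 1 has the larger feature values), the data set is invented, and real projects would normally rely on a library implementation. It resamples the training set with replacement, fits one rule per resample and predicts by majority vote.

```python
import random
from collections import Counter


def fit_stump(samples):
    """Learn a one-feature threshold: the midpoint between the two class means.

    Assumes class 1 tends to have larger feature values than class 0.
    """
    zeros = [x for x, y in samples if y == 0]
    ones = [x for x, y in samples if y == 1]
    if not zeros or not ones:
        # Degenerate resample with a single class: always predict that class.
        only = 0 if zeros else 1
        return lambda x: only
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > threshold else 0


def bagging_fit(samples, n_models=25, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw as many samples as the original set, with replacement.
        resample = [rng.choice(samples) for _ in samples]
        models.append(fit_stump(resample))
    return models


def bagging_predict(models, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical (feature, label) pairs.
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0), (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]
ensemble = bagging_fit(data)
low, high = bagging_predict(ensemble, 1.2), bagging_predict(ensemble, 7.1)
```

Each individual rule sees a slightly different resample and therefore learns a slightly different threshold; the vote smooths out those differences, which is where the variance reduction comes from.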
Which are better Models? – Simple or Complicated?:
When it comes to developing a model, it is often assumed that simple models perform better and generate fewer errors. That is true in many cases, but one must not be biased when profiling models. Simple models are often less error-prone and much faster than complicated ones, but that does not imply that a simple model should be developed for every problem and that complicated models should never be attempted. Complicated models exist because they also work and provide good results, and ensemble learning is a good example that validates this.
How to know which model/Algorithm to apply?:
A few weeks ago I read an article about finding the solutions/models/algorithms that can solve all the problems of the world, and for sure the answer given was AI (Artificial Intelligence). But are we really there yet? That would be a very long debate, but the truth, or at least what I believe it to be, is that one must first present or visualize the relationships between two or more variables in the data, to understand whether a relationship is linear or non-linear, and how such relationships affect the outcome being predicted. For example, there could be a perfectly linear relationship between the close prices of two stocks, in which case Linear Regression would be more applicable than Logistic Regression or any other model. This simply shows that no single model solves all problems, so the difficult part is finding out which model solves which problem best. For that, a high-level overview of many models is required.
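Besides plotting, a quick numeric check for a linear relationship is the Pearson correlation coefficient. The sketch below implements it by hand on invented price series; the series names are hypothetical, and `stock_c` is constructed to depend on `stock_a` in a purely non-linear way.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical close prices.
stock_a = [10.0, 11.0, 12.0, 13.0, 14.0]
stock_b = [20.5, 22.5, 24.5, 26.5, 28.5]  # exactly 2 * stock_a + 0.5 (linear)
stock_c = [4.0, 1.0, 0.0, 1.0, 4.0]       # (stock_a - 12) ** 2 (non-linear)

r_ab = pearson_correlation(stock_a, stock_b)
r_ac = pearson_correlation(stock_a, stock_c)
```

`r_ab` comes out as 1, a perfect linear relationship that a linear regression would capture. `r_ac` comes out as 0 even though `stock_c` is completely determined by `stock_a`, which is a reminder that a low correlation rules out only a linear relationship, not a relationship as such.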
If it is correlated that means it will always be true/predictable:
That would be a really nice assumption, but what would be the need for a Data Scientist if it were true all the time? Correlation does not imply causation, which means it must not be the only reason to construct a model. When building a model, a Data Scientist needs to propose various hypotheses and keep the ones that turn out to be real relationships or give better predictions when applied in real life. In stocks, we can come up with the correlation that news directly affects stock prices, but after constructing a strategy that uses NLP (Natural Language Processing) to generate buy/sell calls we might end up with negative returns, and investors might lose their money. There might be a correlation between news and stock prices, but that does not mean it is the only factor to consider when building a model that runs a trading strategy based on NLP. In the end we have various things to consider, like the trading volume of the stock, the ranking of the stock in the S&P 500, how much money is on the company's books, or how that particular stock has performed over the past decades.
Some Final thoughts:
When you think about ML models and how they can create happiness in your life (here happiness means coming up with models that consistently generate great predictions), there is only one thing to remember: the 'Rule of Evolution'. You can sit on one technology/design for the rest of your life, OR you can grow! Thinking about Machine Learning, or building models related to Machine Learning, is learning. The machine is desperate for learning, and learning is only possible by doing lots of experimentation: getting good results, keeping those models for further improvement, and setting aside the ones that are not good enough to use, while still giving them another try sooner or later.
Preprocessing plays a very important role in Data Science. When you are going to feed data to an algorithm, there are various things to take care of, like normalization, label encoding, scaling and so on.
A really manual method to remove outliers from your data is as follows:
**Your data should be in between median-2*STD and median+2*STD
STD = Standard Deviation
In the above formula we use the median as the measure of central tendency, on the assumption that it is more robust than the mean, but in real life this may not be an accurate formula to work with.
In this post we will discuss various data preprocessing techniques used in Data Science and Machine Learning that may help to increase prediction accuracy. Sometimes you can live without preprocessing, but sometimes it is good to preprocess your data. This 'sometimes' depends on your understanding of the work you do.
The real health of your data determines the real wealth of your model. That means your data is the most important part, which is the main reason preparing it always takes so much of the time, something like 70-80%. Well, not really 70-80% if you are a Python Ninja 😀 🙂
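The median ± 2*STD rule above can be sketched with the standard library alone. The function name remove_outliers, the use of the population standard deviation, and the sample numbers are assumptions made for illustration; the rule itself does not say which flavour of STD to use.

```python
from statistics import median, pstdev

def remove_outliers(values, k=2.0):
    """Keep only values inside [median - k*STD, median + k*STD]."""
    center = median(values)
    spread = pstdev(values)  # population standard deviation (an assumption)
    low, high = center - k * spread, center + k * spread
    return [v for v in values if low <= v <= high]

sample = [10, 12, 11, 13, 12, 11, 95]  # made-up numbers; 95 is the obvious outlier
cleaned = remove_outliers(sample)      # 95 falls outside the band and is dropped
```

Note that a single extreme value also inflates the STD itself, which is one reason this manual rule can miss outliers in real data.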
Let’s Understand Data-Preprocessing in Richard man’s Style.(What is Richard Man’s Style?)
Hack the source!!
Data Scaling or Standardization:
Standardization uses z = (x - Mu) / Sigma, where Sigma represents the Standard Deviation and Mu represents the Mean.
It is always great to have less biased data with low variance(Spread-out) but Why?
Think of the activation function in a neural network during forward propagation. An activation function such as the sigmoid squashes each input into the range zero to one (0 to 1), which scales down the range of the data, but other algorithms for regression or classification do not have that automatic scaling facility, so we apply manual scaling methods. (Too bad I have to write one more function before training on my data :D) One more thing to remember: 'Your neural network will be much faster if you feed it normalized data.'
So by decreasing the spread/variance we can achieve better predictions, because it is easier for the system/algorithm to find patterns in a smaller area. Here is a small portion of the Wikipedia article about feature scaling that you might find interesting:
For example, the majority of classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
from sklearn import preprocessing
import numpy as np
frag_mented_array = np.array([2,78,12,-25,22,89])
defrag_array = preprocessing.scale(frag_mented_array)
/home/metal-machine/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.py:429: DataConversionWarning: Data with input dtype int64 was converted to float64 by the scale function.
array([-0.67832556, 1.18502658, -0.43314765, -1.34030593, -0.18796973,
**Please never underestimate that warning in real life. Look into the advanced section of the NumPy array documentation, which will teach you how to assign 'types' (dtypes) to NumPy arrays!
Now the question arises: should I scale only my training data, or my testing data as well? The answer is both, using the statistics learned from the training data. Look at the StandardScaler class in scikit-learn, which fits on one dataset and can then apply the same transformation to another.
There are some other useful scalers as well:
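To make the train/test point concrete, here is a rough pure-Python sketch of the fit-then-transform pattern. SimpleStandardScaler is a hypothetical stand-in written for this post, not the scikit-learn class, and the test values are made up.

```python
from statistics import mean, pstdev

class SimpleStandardScaler:
    """Hypothetical stand-in that follows the fit/transform pattern."""

    def fit(self, values):
        # Learn the mean and (population) standard deviation from the data.
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # avoid dividing by zero on constant data
        return self

    def transform(self, values):
        # Apply the stored statistics; nothing is re-estimated here.
        return [(v - self.mean_) / self.std_ for v in values]

train = [2, 78, 12, -25, 22, 89]  # the array used earlier in this post
test = [10, 40, 95]               # made-up unseen values, for illustration only

scaler = SimpleStandardScaler().fit(train)  # statistics come from training data only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)        # reuses the training mean and std
```

Fitting on the training data and reusing those numbers for the test data keeps information about the test set from leaking into the model.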
MinMaxScaler – Scale data between range (0,1)
MaxAbsScaler – Scale data between range (-1,1)
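The arithmetic behind these two scalers is short enough to write out by hand. The functions below are illustrative sketches with made-up names, not the scikit-learn implementations, and they work on a single feature.

```python
def min_max_scale(values):
    """Map values linearly so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]

def max_abs_scale(values):
    """Divide by the largest absolute value, so results fall in [-1, 1]."""
    biggest = max(abs(v) for v in values)
    if biggest == 0:
        return [0.0 for _ in values]
    return [v / biggest for v in values]

data = [2, 78, 12, -25, 22, 89]
mm = min_max_scale(data)  # -25 maps to 0.0 and 89 maps to 1.0
ma = max_abs_scale(data)  # 89 maps to 1.0; signs are preserved
```

Max-abs scaling does not shift the data, so zeros stay zeros, which is handy for sparse inputs.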
One question arises here: when should I use normal standardization, MinMaxScaler or MaxAbsScaler?
The answer can be really tricky, or 'not in plain form'. It just depends on the requirements of the model/ML algorithm that you are going to apply to your data.
We can also look at it this way: ask whether decreasing the 'Euclidean distance' affects the performance of your model or not.
Normalization is the process of scaling individual samples so that each has unit norm.
Remember one thing carefully: standardized data will always have zero mean and unit variance; achieving that is the real purpose behind standardization. Scaling is a kind of personal choice about how and to what limits you want to scale your data, but when it comes to normalization you have to figure out an external standard.
Normalizing can mean applying a transformation so that your transformed data is roughly normally distributed. In scikit-learn, however, normalizing refers to rescaling each observation (row) to have a length of 1 (called a unit norm in linear algebra).
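As a small sketch of the unit-norm idea, here is the L2 case written by hand. The function name l2_normalize is made up for this example; it handles one row at a time.

```python
import math

def l2_normalize(row):
    """Rescale one observation (row) so that its Euclidean length is 1."""
    norm = math.sqrt(sum(v * v for v in row))
    if norm == 0:
        return list(row)  # an all-zero row has no direction to preserve
    return [v / norm for v in row]

unit = l2_normalize([3.0, 4.0])  # the original length is 5
length = math.sqrt(sum(v * v for v in unit))
```

Unlike standardization, this works across the features of a single row rather than down a column.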
Binarization is a really simple and effective process. It converts your data into binary form, that is 0 and 1, which means boolean values. The algorithms used in stock trading rely heavily on binarization, where we use 0 if the stock's predicted price will go down and 1 if the price will go up, or vice versa. The important thing to note is how useful binarization is for making final decisions, as well as for an algorithm that predicts values in the form of 0 and 1.
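A thresholding step like this is about as small as preprocessing gets. The sketch below assumes a threshold of zero on predicted price changes; the function name and the numbers are made up for illustration.

```python
def binarize(values, threshold=0.0):
    """Return 1 for values strictly above the threshold, otherwise 0."""
    return [1 if v > threshold else 0 for v in values]

predicted_changes = [0.8, -1.2, 0.0, 2.5]  # made-up predicted price moves
signals = binarize(predicted_changes)      # 1 = expected to go up, 0 = otherwise
```

Where exactly the threshold sits (and whether "no change" counts as 0 or 1) is a modelling choice, so treat the default here as a placeholder.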
Removing Biasness from Your Data:
Biased data leads your model towards one form of the real universe: your model will only understand the bias in your data and will learn/make predictions based only on that bias. (That is the reason we must use Hypothesis Testing and Cross-Validation before running our model in production.) One thing that comes to my mind as an example of bias is as follows:
Suppose you picked Apple's stock price for the period 01-01-14 to 01-01-15, and let's assume that in that period Apple's stock was going up every day/every month/every quarter. After training on that particular period, your model will predict Apple's future price as higher than the present price, because that became the nature of the model after learning from biased data.
This is just a silly example to show readers how biased data affects your model.
Survivorship Biased data:
A historical database of stock prices that does not include stocks that have disappeared due to bankruptcies, de-listings, mergers, or acquisitions suffers from the so-called survivorship bias, because only “survivors” of those often unpleasant events remain in the database. (The same term can be applied to mutual fund or hedge fund databases that do not include funds that went out of business.)
Backtesting a strategy using data with survivorship bias can be dangerous because it may inflate the historical performance of the strategy.
Disclaimer: This is not a complete post about data preprocessing. There are many other things to know, like PCA (Principal Component Analysis) or Gradient Descent, for a better understanding of Machine Learning operations when applying them to your data.
1. Julia IPython
Julia is able to run very well in your IPython notebook environment. After all, all you have to do is Data Science and Machine Learning. 🙂
1.1 Open the Julia prompt (on Ubuntu, type the 'julia' command in your terminal)
1.2 Run the command > Pkg.add("IJulia") # it will do almost all the work.
2. DataFrames: Whenever you have to read a lot of files in Excel style, the Julia DataFrames package is good to go.
3. A Julia package for interacting with Arduino.
4. Neural Network Implementation of Julia
5. Visualizing and Plotting in Julia:
6. Reading and writing CSV files in Julia
7. Data Clustering in Julia:
For a larger list of packages, please refer to the following link:
Note*: You can also run most of the Shell commands in Julia environment as well. 🙂
Things that need to be understood in many ways.
- Various important parts of Statistics and implementation
- Hypothesis Testing
- Probability Distributions and Importance
- AIC and BIC
- Bayesian models
- Some black Magics of OOPS
Ok have some fun first. 😀
Whenever you read any post or paper related to Machine Learning or Data Science, you will come across the word 'Correlation' many times, along with how important its value is in your model building.
A simple definition of Correlation: A mutual relationship or connection between two or more things. (that’s layman’s definition and It should be enough most of the times 😉 )
The coefficient of correlation is just a number between -1 and +1, from which we understand how two or more things are related to each other. Its sign can be +ve or -ve, and its value tells how strongly two data-sets move together.
The following two images tell a lot about correlation and its value.
A coefficient of correlation in the range -0.5 to +0.5 is usually not that valuable. But why and how do we calculate correlation?
What is Covariance ?
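Covariance measures how two series move together around their own means, and dividing it by both standard deviations gives the coefficient of correlation. Here is a minimal sketch of that calculation using the population formulas; the function names and the three small series are made up for illustration.

```python
import math
from statistics import mean

def covariance(xs, ys):
    """Population covariance: the average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Pearson coefficient: covariance divided by both standard deviations.

    Raises ZeroDivisionError if either series is constant.
    """
    sx = math.sqrt(covariance(xs, xs))  # covariance of a series with itself is its variance
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

a = [1, 2, 3, 4, 5]
up = [2, 4, 6, 8, 10]    # rises with a
down = [10, 8, 6, 4, 2]  # falls as a rises

r_up = correlation(a, up)      # close to +1
r_down = correlation(a, down)  # close to -1
```

Covariance depends on the units of the data, while the coefficient of correlation always lands between -1 and +1, which is why the latter is easier to compare across data-sets.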
Now if you still feel that something is really missing we should talk about Variance:
Let’s Remove Co from Covariance.
Variance is a measure of how spread out the data is. So how would you calculate the variance of data?
Give me data:
Data = [4,5,6,7,12,20]
I will find the mean and subtract it from each individual value. Isn't that the mean deviation? 😀 OMG! Almost: variance squares those deviations before averaging them.
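Worked through on the data above, the calculation looks like this. Both the population form (divide by n) and the sample form (divide by n - 1) are shown, since the post does not say which one is meant.

```python
from statistics import mean, pvariance, variance

data = [4, 5, 6, 7, 12, 20]

mu = mean(data)                               # 54 / 6 = 9
squared_devs = [(x - mu) ** 2 for x in data]  # 25, 16, 9, 4, 9, 121

pop_var = sum(squared_devs) / len(data)           # 184 / 6
sample_var = sum(squared_devs) / (len(data) - 1)  # 184 / 5

# The standard library agrees with the hand calculation.
matches = (abs(pop_var - pvariance(data)) < 1e-9
           and abs(sample_var - variance(data)) < 1e-9)
```

The standard deviation (the STD used in the outlier rule earlier) is simply the square root of this number.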
Have a look at the following Picture:
Let’s wait for stuff like:
Coefficient of determination, Probable Error and interpretation.