219 Frequently Asked Machine Learning Interview Questions and Answers
1. What is machine learning?
Answer: Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.
2. What are the different types of learning in machine learning?
Answer: Following are the different types of learning in machine learning:
1.Supervised learning
2.Unsupervised learning
3.Reinforcement learning
4.Semi-supervised learning
3. What are different applications of machine learning?
Answer: Following are the applications of machine learning:
5.Credit-card fraud detection
6.Financial market analysis
10.Internet fraud detection
19.Time series forecasting
20.User behavior analytics
4. What is supervised learning?
Answer: Supervised learning is the process of training a model on data that is labelled.
5. What is unsupervised learning?
Answer: Unsupervised learning is the process of training a model on data that is not labelled.
6. What is Reinforcement learning?
Answer: Reinforcement learning is a type of machine learning in which an agent learns how to behave in an environment by performing actions and observing the results.
7. What is Semi supervised learning?
Answer: Semi-supervised learning is a class of machine learning tasks and techniques which also make use of unlabeled data for training. A small amount of labeled data with a large amount of unlabeled data is used.
8. What is Anomaly detection?
Answer: Anomaly detection is the identification of data points, items, observations or events that do not conform to the expected pattern of a given group. These anomalies occur very infrequently but may signal a large and significant threat such as cyber intrusions or fraud.
9. What are bias, variance and the bias-variance trade-off?
Answer: Bias is the set of simplifying assumptions made by a model to make the target function easier to approximate.
Variance is the amount by which the estimate of the target function will change given different training data.
The trade-off is the tension between the error introduced by the bias and the error introduced by the variance.
10. What are best algorithms for supervised learning?
Answer: Following are the best algorithms for supervised learning:
11. What are best algorithms for unsupervised learning?
Answer: Following are best algorithms for unsupervised learning
5.Principal Component Analysis (PCA)
12. What are best algorithms for Reinforcement learning?
Answer: Following are best algorithms for Reinforcement learning
DQN – Deep Q Network
DDPG – Deep Deterministic Policy Gradient
13. What is regression? When will we use this method?
Answer: Regression is a technique for determining the statistical relationship between two or more variables, in which a change in a dependent variable is associated with, and depends on, a change in one or more independent variables. We use it when the value to predict is continuous.
14. What is clustering? When will we use this method?
Answer: Cluster analysis, or clustering, is the task of grouping a set of objects so that objects in the same group are more similar to each other than to those in other groups. We use it when the data has no labels and we want to discover natural groupings.
15. What is regularization?
Answer: Regularization is a technique that constrains, or shrinks, the coefficient estimates of a model towards zero. It discourages learning a more complex or flexible model in order to avoid the risk of overfitting.
16. What is the difference between L1 regularization and L2 regularization?
Answer: The difference between L1 and L2 is the penalty term: L2 uses the sum of the squares of the weights, while L1 uses the sum of the absolute values of the weights.
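As a minimal sketch, the two penalty terms can be computed as below. The function names and the strength parameter lam are illustrative assumptions, not part of any particular library.

```python
def l1_penalty(weights, lam=1.0):
    """L1 (lasso-style) penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam=1.0):
    """L2 (ridge-style) penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


weights = [0.5, -2.0, 0.0, 3.0]
print(l1_penalty(weights))  # 5.5
print(l2_penalty(weights))  # 13.25
```

Note how the large weight 3.0 dominates the L2 penalty much more than the L1 penalty, because it is squared.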
17. What is ensemble learning?
Answer: Ensemble learning is a process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem.
18. What are ensemble techniques?
Answer: Following are ensemble techniques:
Basic: max voting, averaging, weighted averages
Advanced: Stacking, Blending, Bagging, Boosting
19. What is root mean square error?
Answer: The root-mean-square error (RMSE), or root-mean-square deviation, is a measure of the differences between the values predicted by a model or an estimator and the values observed.
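A small sketch of the calculation, assuming two equal-length lists of observed and predicted values (the function name rmse and the sample numbers are made up for illustration):

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean of the squared prediction errors."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]
# Squared errors are 1, 0, 1 and 4, so the mean is 1.5.
print(rmse(observed, predicted))
```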
20. What is R-Square?
Answer: R-squared, also called the coefficient of determination (or the coefficient of multiple determination for multiple regression), is a statistical measure of how close the data are to the fitted regression line. It is the percentage of the response variable variation that is explained by a linear model.
R-squared = Explained variation / Total variation
R-squared is always between 0 and 100%:
0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean. In general, the higher the R-squared, the better the model fits the data.
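The ratio can be sketched in a few lines. This version uses the equivalent form 1 - (residual variation / total variation), which matches "explained / total" for a least-squares linear fit with an intercept; the function name and sample values are assumptions for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    ss_total = sum((y - mean_y) ** 2 for y in y_true)
    ss_residual = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_residual / ss_total


observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
print(round(r_squared(observed, predicted), 2))  # 0.99

# Always predicting the mean explains none of the variation.
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0
```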
21. What is Confusion Matrix?
Answer: A confusion matrix is a performance measurement for a machine learning classification problem where the output can be two or more classes. For a binary problem it is a table with the four different combinations of actual and predicted values: true positives, false positives, false negatives and true negatives.
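For the binary case, the four cells can be counted directly from paired labels. This is only a sketch; the function name, the dictionary keys and the sample labels are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count the four actual/predicted combinations for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # predicted positive, actually positive
        elif predicted == positive:
            fp += 1  # predicted positive, actually negative
        elif actual == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
print(confusion_counts(actual, predicted))
# {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```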
22. What is Type I Error and Type II Error?
Answer: A type I error is the rejection of a true null hypothesis, also called a false positive finding in statistical hypothesis testing.
A type II error is the failure to reject a false null hypothesis, also called a false negative finding.
In other words, a type I error falsely infers the existence of something that is not there, while a type II error falsely infers the absence of something that is present.
23. What is Precision and Recall?
Answer: In pattern recognition, information retrieval and binary classification, precision is the fraction of relevant instances among the retrieved instances.
Recall is the fraction of relevant instances that have been retrieved over the total number of relevant instances.
24. What is ROC curve? When will you use it?
Answer: An ROC (Receiver Operating Characteristic) curve plots the true positive rate against the false positive rate of a test at different decision thresholds. The area under the ROC curve is a measure of the usefulness of the test: usually a greater area means a more useful test, so the areas under ROC curves are used to compare the usefulness of tests.
25. What is F1 Score?
Answer: The F1 score is a performance metric for classification models. It is defined as the harmonic mean of precision and recall, so it is high only when both are high.
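As a rough sketch, precision, recall and the F1 score can all be derived from the counts of true positives, false positives and false negatives. The function name and the example counts are assumptions chosen for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 10 items retrieved, 8 of them relevant; 16 relevant items exist in total.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(precision, recall, round(f1, 3))  # 0.8 0.5 0.615
```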
26. What is the best programming language for machine learning: Python, R, Spark or SAS?
Answer: Python is the best option for machine learning.
27. What are dummy variables?
Answer: Dummy variables, also called indicator variables, are used in regression analysis and latent class analysis. Each one takes the value 0 or 1 to indicate the absence or presence of a category, so they are used to represent variables with two or more categories.
28. What is one-hot encoding?
Answer: In digital circuits and machine learning, a one-hot is a group of bits among which the only legal combinations of values are those with a single high (1) bit and all the others low (0).
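A minimal sketch of one-hot encoding a categorical column follows. The function name is hypothetical, and ordering the categories alphabetically is an arbitrary choice made here so the output is deterministic.

```python
def one_hot_encode(values):
    """Map each value to a list with a single 1 at its category's index."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one "high" bit per row
        encoded.append(row)
    return categories, encoded


categories, rows = one_hot_encode(["red", "green", "blue", "green"])
print(categories)  # ['blue', 'green', 'red']
print(rows)        # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```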
29. How can you handle missing data in your data set?
Answer: If the amount of missing data in the data set is less than 5% of the sample, the researcher can drop it.
30. How can you handle duplicate values in your data set?
Answer: We can remove duplicate values from the data set with the pandas DataFrame method df.drop_duplicates(); the related method df.duplicated() flags the duplicate rows.
31. How can you handle outlier values in your data set?
Answer: Outlier values in the data set can be handled by the following methods:
32. What are best libraries of machine learning?
Answer: Best libraries of machine learning are:
33. What are best libraries of data visualization?
Answer: Best libraries of data visualization are:
34. What are different types of function in machine learning for Feature scaling?
Answer: Following are the types of function in machine learning for feature scaling:
35. How does PCA work?
Answer: PCA (principal component analysis) is a technique used to emphasise variation and bring out strong patterns in a dataset.
It is often used to make data easy to explore and visualise.
36. What is bagging?
Answer: Bagging, or bootstrap aggregating, is a meta-algorithm that takes M subsamples with replacement from the initial dataset and trains a predictive model on each of those subsamples. The final model is obtained by averaging the bootstrapped models, which usually yields better results.
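The resample-train-average loop can be sketched as follows. To keep the example tiny, the "model" trained on each bootstrap sample is simply the sample mean, which is a deliberately trivial stand-in for a real learner; all names and numbers are illustrative assumptions.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def bagged_mean(data, n_models=200, seed=0):
    """Fit one toy model (the mean) per bootstrap sample and average them."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_models):
        sample = bootstrap_sample(data, rng)
        estimates.append(sum(sample) / len(sample))  # "train" on the subsample
    return sum(estimates) / len(estimates)           # aggregate by averaging


data = [2, 4, 4, 5, 7, 9, 10, 12]
print(bagged_mean(data))  # close to the plain mean of the data, 6.625
```

With a real learner such as a decision tree, the averaging step is what smooths out the variance of the individual models.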
37. What is boosting?
Answer: Boosting is a machine learning ensemble meta-algorithm that primarily reduces bias, and also variance, in supervised learning. It is a family of machine learning algorithms that convert weak learners into strong ones.
38. What are the best bagging algorithms?
Answer: Best bagging algorithms are:
39. What are the best boosting algorithms?
Answer: Best boosting algorithms are:
2.Gradient boosting algorithm (GBM)
3.Extreme gradient boosting (XGBoost)
40. What is dimensionality reduction?
Answer: In statistics, machine learning and information theory, dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It is divided into feature selection and feature extraction.
Dimensionality can be reduced by combining features with feature engineering, removing collinear features, or using algorithmic dimensionality reduction.
41. What are best dimensionality reduction algorithms?
Answer: Best dimensionality reduction algorithms are:
1.Missing Value Ratio
2.Low Variance Filter
3.High Correlation Filter
5.Backward Feature Elimination
6.Forward Feature Selection
8.Principal Component Analysis
9.Independent Component Analysis
10.t-Distributed Stochastic Neighbor Embedding (t-SNE)
42. What is recommendation system?
Answer: A recommendation system is a subclass of information filtering system that seeks to predict the rating or preference a user would give to an item.
43. What are best techniques for recommendation system?
Answer: Following are best techniques for recommendation system:
1.Content based filtering
44. What is Content based filtering?
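Answer: Content-based filtering recommends items that are similar to those a user has liked before, based on the attributes of the items. It is the technology behind Netflix and Pandora’s recommendation engines.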
45. What is collaborative filtering?
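Answer: Collaborative filtering is a technique used by recommender systems. It has two senses, a narrow one and a more general one. In the narrow sense, it makes automatic predictions about the interests of a user by collecting preferences from many users.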
46. What is overfitting? How can you overcome it?
Answer: Overfitting is a modeling error that occurs when a function is fit too closely to a limited set of data points. Overfitting generally takes the form of making an overly complex model to explain idiosyncrasies in the data under study.
Overfitting has occurred when the model works well on the training data but fails on the test data.
Early stopping rules provide guidance on how many iterations can be run before the learner begins to overfit.
Pruning is used extensively when building CART models. It simply removes the nodes that add little predictive power for the problem at hand.
47. What is cross validation?
Answer: In its simplest form, validation is a single round: one sample is held out for validation and the rest is used for training the model. In k-fold cross validation the data is split into k folds and each fold is used once as the validation set. To keep the variance of the estimate low, a higher-fold cross validation is preferred.
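One way to picture k-fold cross validation is as a generator of (training, validation) index splits. This is a sketch with a hypothetical function name; it does not shuffle the data, which a real workflow usually would.

```python
def k_fold_indices(n_samples, k):
    """Yield (training_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


for training, validation in k_fold_indices(n_samples=6, k=3):
    print(training, validation)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

Each sample lands in the validation set exactly once, so every data point contributes to the performance estimate.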
48. What is underfitting?
Answer: Underfitting occurs when a statistical model or machine learning algorithm does not capture the underlying trend of the data. Intuitively, underfitting occurs when the model or the algorithm does not fit the data well enough. Specifically, underfitting occurs when the model or algorithm shows low variance but high bias.
49. How can you improve model performance?
Answer: We can improve the model performance by following methods:
1.Add more data
2.Treat missing and Outlier values
50. How can you deploy your model?
Answer: A trained model can be saved with a Python serialization library such as pickle or joblib. The saved model can then be deployed easily, for example behind a Flask API.
51. What is Time series modeling?
Answer: A time series is a series of data points indexed, listed or graphed in time order. Most commonly, it is a sequence taken at successive equally spaced points in time. Time series modeling fits a model to such data in order to understand it or to forecast future values.
52. What are stationary data and non-stationary data?
Answer: Stationary data has statistical properties, such as the mean, variance and autocorrelation, that are all constant over time. Non-stationary data has statistical properties that change over time.
53. What are the different types of time series algorithms?
Answer: Following are types of time series algorithms:
Single exponential smoothing
Holt’s linear trend method
Holt-Winters seasonal method
ARIMA (Autoregressive Integrated Moving Average)
54. What is the difference between a statistical model and a machine learning model?
Answer: Machine learning is an approach in which algorithms learn from data without depending on rules-based programming.
Statistical modelling is the formalisation of relationships between variables in the form of mathematical equations.
55. What is mean square error?
Answer: In statistics, the mean squared error (MSE) of an estimator measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values.
56. What is artificial intelligence?
Answer: Artificial Intelligence is a branch of Computer Science that studies and researches how to develop machines that have the intelligence of a human being, learn from experience and deal with new situations smartly.
57. What is the difference between artificial intelligence and machine learning?
Answer: Machine learning is a branch of Artificial Intelligence (AI). AI deals with the broader context of developing machines that can act smartly, like humans. In machine learning, we provide data to machines and they learn for themselves from that data.
58. What do you know about logistic regression?
Answer: Logistic regression is a predictive analysis used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
59. What is the difference between linear regression and correlation?
Answer: Correlation gives an index that describes the linear relationship between two variables.
Linear regression can model the relationship between two or more variables and identifies which variables x can predict the outcome variable y. The term regression originally referred to going back towards the average.
60. When to use decision tree vs logistic regression?
Answer: A logistic regression model searches for a single linear decision boundary in feature space. A decision tree essentially partitions feature space into half-spaces using axis-aligned linear decision boundaries.
Logistic regression works well when the data points can be separated reasonably well by a single hyperplane, while decision trees are flexible enough to handle data that cannot, so the choice depends on the specific problem and the data. Both decision trees and logistic regression can handle continuous and categorical data. Logistic regression tends to be less susceptible to overfitting, whereas decision trees tend to overfit.
Decision trees automatically take into account interactions between variables. In logistic regression, we have to add those interaction terms manually.
61. How is KNN different from k-means clustering?
Answer: K-nearest neighbors is a classification algorithm that is a subset of supervised learning. K-means is a clustering algorithm that is a subset of unsupervised learning. Basically they are two different algorithms with two very different end results.
62. What is Ordinary Least Squares Regression?
Answer: In statistics, ordinary least squares (OLS) is a method for estimating the unknown parameters in a linear regression model. It is also called linear least squares. The goal is to minimise the sum of the squares of the differences between the observed responses in the given dataset and those predicted by a linear function of a set of explanatory variables.
63. Briefly describe Naïve Bayes Classification.
Answer: Naive Bayes is a collection of classification algorithms based on Bayes’ Theorem. It is a family of algorithms that share a common principle: every feature being classified is independent of the value of any other feature.
For example, a fruit may be considered to be an orange if it is orange in color, round, and about 4″ in diameter. A Naive Bayes classifier considers each of these features (orange, round, 4″ in diameter) to contribute independently to the probability that the fruit is an orange, without considering any correlations between features. In practice features are not always independent, which is a shortcoming of the Naive Bayes algorithm and the reason it is labeled naive.
64. Do you know the meaning of SVM?
Answer: SVM means Support Vector Machine, a supervised machine learning algorithm that can be used for either classification or regression challenges, although it is mostly used in classification problems. We plot each data item as a point in n-dimensional space, where n is the number of features and the value of each feature is the value of a particular coordinate. We then perform classification by finding the hyperplane that best differentiates the two classes.
65. What is the difference between supervised and unsupervised machine learning?
Answer: Supervised learning requires labeled training data.
For example, to do classification, which is a supervised learning task, we first need to label the data that is used to train the model to classify data into the labeled groups.
Unsupervised learning does not require explicitly labelled data.
66. What is Bayes’ Theorem?
Answer: Bayes’ Theorem gives the posterior probability of an event given what is known as prior knowledge.
Mathematically, it is expressed as the true positive rate of a condition sample divided by the sum of the false positive rate of the population and the true positive rate of a condition. Suppose a flu test is positive for 60% of the people who have the flu, is falsely positive for 50% of the people who do not have it, and the overall population has a 5% chance of having the flu. Would we actually have a 60% chance of having the flu after a positive test? Bayes’ Theorem says no. It says that we have a (0.6 * 0.05) (True Positive Rate of a Condition Sample) / ((0.6 * 0.05) (True Positive Rate of a Condition Sample) + (0.5 * 0.95) (False Positive Rate of a Population)) = 0.0594, or a 5.94% chance of having the flu.
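The arithmetic of the flu example can be checked with a few lines of code. Reading 0.6 as the chance of a positive test given flu and 0.5 as the chance of a positive test given no flu is an assumption made here to match the numbers in the formula; the function and parameter names are hypothetical.

```python
def posterior(prior, p_pos_given_condition, p_pos_given_no_condition):
    """P(condition | positive test) via Bayes' theorem."""
    true_positive = p_pos_given_condition * prior              # 0.6 * 0.05
    false_positive = p_pos_given_no_condition * (1.0 - prior)  # 0.5 * 0.95
    return true_positive / (true_positive + false_positive)


p_flu = posterior(prior=0.05,
                  p_pos_given_condition=0.6,
                  p_pos_given_no_condition=0.5)
print(round(p_flu, 4))  # 0.0594
```

The low 5% prior dominates the result: most positive tests come from the much larger group of people who do not have the flu.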
67. What is the trade-off between bias and variance?
Answer: The bias-variance decomposition breaks down the learning error of any algorithm into the sum of the bias, the variance and some irreducible error due to noise in the underlying dataset. If the model is made more complex and more variables are added, it loses bias but gains some variance. To get the optimally reduced amount of error, a trade-off between bias and variance is required. Neither high bias nor high variance is desired in a model.
68. What’s a Fourier transform?
Answer: A Fourier transform is a generic method for decomposing generic functions into a superposition of symmetric functions. Given any time signal, the Fourier transform finds the set of cycle speeds, amplitudes and phases that match it. A Fourier transform converts a signal from the time domain to the frequency domain. It is a very common way of extracting features from audio signals or other time series such as sensor data.
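To make the time-to-frequency idea concrete, here is a naive discrete Fourier transform in pure Python. It is a sketch for illustration only (quadratic time, no windowing); real code would normally use an FFT library. The test signal, a sine wave completing 3 cycles over 32 samples, is made up.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


n = 32
signal = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
magnitudes = [abs(c) for c in dft(signal)]

# Only the first half of the spectrum is unique for a real-valued signal.
dominant = max(range(n // 2), key=lambda k: magnitudes[k])
print(dominant)  # 3, the number of cycles in the input signal
```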
69. What is deep learning, and how does it contrast with other machine learning algorithms?
Answer: Deep learning is a subset of machine learning that is concerned with neural networks. It uses backpropagation and certain principles from neuroscience to more accurately model large sets of unlabelled or semi-structured data. In that sense, deep learning can serve as an unsupervised learning algorithm that learns representations of data through the use of neural nets.
70. What’s the difference between a generative and discriminative model?
Answer: A generative model learns the categories of data, while a discriminative model simply learns the distinction between different categories of data. Discriminative models generally outperform generative models on classification tasks.
71. How is a decision tree pruned?
Answer: A decision tree is pruned by removing branches that have weak predictive power in order to reduce the complexity of the model and increase its predictive accuracy. Pruning can happen bottom-up or top-down, with approaches such as reduced error pruning and cost complexity pruning.
Reduced error pruning is perhaps the simplest version: replace each node with its most popular class, and keep the change if it does not decrease predictive accuracy. Although simple, this heuristic comes pretty close to an approach that would optimize for maximum accuracy.
72. Which is more important to you– model accuracy, or model performance?
Answer: Model accuracy is only a subset of model performance, and at that, a sometimes misleading one.
For example, when detecting fraud in a massive dataset with millions of samples, a more accurate model would most likely predict no fraud at all if only a tiny minority of cases were fraud. This would be useless as a predictive model: a model designed to find fraud that asserts there is no fraud at all!
73. How would you handle an imbalanced dataset?
Answer: Following are ways to handle an imbalanced dataset:
Collect more data to even out the imbalances in the dataset.
Resample the dataset to correct for imbalances.
Try a different algorithm altogether on the dataset.
74. When should you use classification over regression?
Answer: Classification produces discrete values and maps the dataset to strict categories, while regression gives continuous results that allow us to better distinguish between individual points. Use classification over regression when the results should reflect the membership of data points in certain explicit categories.
75. Name an example where ensemble techniques might be useful.
Answer: Ensemble techniques use a combination of learning algorithms to achieve better predictive performance. They typically reduce overfitting and make the model more robust, so that it is less likely to be influenced by small changes in the training data.
Examples of ensemble methods range from bagging to boosting to a bucket of models method, and each can demonstrate how ensembles increase predictive power.
76. What’s the “kernel trick” and how is it useful?
Answer: The kernel trick involves kernel functions that can operate in higher-dimensional spaces without explicitly calculating the coordinates of points within that space. Instead, kernel functions compute the inner products between the images of all pairs of data in a feature space, which is computationally cheaper than explicitly calculating the coordinates. This is useful because many algorithms can be expressed in terms of inner products.
77. How do you handle missing or corrupted data in a dataset?
Answer: We can handle missing or corrupted data in a dataset in the following manner:
1. Find missing or corrupted data in a dataset.
2. Drop those rows or columns.
3. Decide to replace them with another value.
In Pandas, the isnull() and dropna() methods help to find columns with missing or corrupted data and to drop those values. To fill the invalid values with a placeholder value, for example 0, the fillna() method can be used.
78. Do you have experience with Spark or big data tools for machine learning?
Answer: Spark is a big data tool that can handle immense datasets with speed.
79. What are some differences between a linked list and an array?
Answer: An array is an ordered collection of objects. A linked list is a series of objects with pointers that direct how to process them sequentially. An array assumes that every element has the same size, unlike a linked list. A linked list can grow organically more easily: an array has to be pre-defined or re-defined for organic growth. Shuffling a linked list involves changing which pointers point where, whereas shuffling an array is more complex and takes more memory.
80. Describe a hash table.
Answer: A hash table is a data structure that implements an associative array. Keys are mapped to values through the use of a hash function. Hash tables are often used for tasks like database indexing.
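A toy hash table with separate chaining shows how the hash function maps a key to a bucket. This is a teaching sketch with hypothetical names and a fixed bucket count; Python's built-in dict is the practical choice.

```python
class HashTable:
    """Minimal associative array using separate chaining for collisions."""

    def __init__(self, n_buckets=8):
        self._buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # The hash function decides which bucket the key belongs to.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default


table = HashTable()
table.put("user_42", "alice")
table.put("user_7", "bob")
print(table.get("user_42"))   # alice
print(table.get("missing"))   # None
```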
81. Which data visualization libraries do you use?
Answer: Popular tools are R’s ggplot, Python’s seaborn and matplotlib, and tools like Plot.ly and Tableau.
82. How would you implement a recommendation system for our company’s users?
Answer: We have to research the company and its industry in depth: the revenue drivers of the company, the types of users it has, and the context of the industry.
83. What are parametric models? Give an example.
Answer: Parametric models are those with a finite number of parameters. To predict new data, we only need to know the parameters of the model. Examples are linear regression, logistic regression, and linear SVMs.
Non-parametric models are those with an unbounded number of parameters, which allows for more flexibility. To predict new data, we need to know the parameters of the model and the state of the data that has been observed. Examples are decision trees, k-nearest neighbors, and topic models using latent Dirichlet allocation.
84. What is the “Curse of Dimensionality?”
Answer: The difficulty of searching through a solution space becomes much harder as more features (dimensions) are added.
Consider the analogy of searching for a penny in a line vs. a field vs. a building. The more dimensions there are, the higher the volume of data that is needed. This is the curse of dimensionality.
85. What is the difference between stochastic gradient descent (SGD) and gradient descent (GD)?
Answer: Both algorithms are methods for finding a set of parameters that minimize a loss function by evaluating parameters against data and then making adjustments.
In standard gradient descent, we evaluate all training samples for each set of parameters. This is akin to taking big, slow steps toward the solution.
In stochastic gradient descent, we evaluate only one training sample for the set of parameters before updating them. This is akin to taking small, quick steps toward the solution.
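The difference between the two update rules can be sketched on a one-parameter model y = w * x with squared-error loss. The data, learning rate and function names are assumptions chosen so that both variants converge to w = 2.

```python
import random


def batch_gradient_descent(xs, ys, lr=0.01, epochs=200):
    """One update per epoch, using the gradient averaged over ALL samples."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w


def stochastic_gradient_descent(xs, ys, lr=0.01, epochs=200, seed=0):
    """One update per sample, visiting the samples in a shuffled order."""
    rng = random.Random(seed)
    w = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            grad = 2 * (w * xs[i] - ys[i]) * xs[i]  # gradient from ONE sample
            w -= lr * grad
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2 * x, with no noise
print(round(batch_gradient_descent(xs, ys), 4))       # 2.0
print(round(stochastic_gradient_descent(xs, ys), 4))  # 2.0
```

Per epoch, the batch version makes one update while the stochastic version makes one update per sample, which is why the latter tends to make faster progress on large datasets.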
86. When would you use GD over SGD, and vice-versa?
Answer: GD theoretically minimizes the error function better than SGD. However, SGD converges much faster once the dataset becomes large. This means GD is preferable for small datasets, while SGD is preferable for larger ones.
In practice, SGD is used for many applications because it minimises the error function well enough while being much faster and more memory efficient for large datasets.
87. What is the Box-Cox transformation used for?
Answer: The Box-Cox transformation is a generalized power transformation that transforms data to make the distribution more normal.
For example, when its lambda parameter is 0, it is equivalent to the log-transformation. It is used to stabilize the variance (eliminate heteroskedasticity) and to normalize the distribution.
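The one-parameter Box-Cox formula for a single positive value can be written down directly. This is a sketch with a hypothetical function name; it takes lambda as given, whereas in practice lambda is usually estimated from the data.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform: (x**lam - 1) / lam, or log(x) when lam is 0."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


print(box_cox(4.0, 0.5))  # 2.0  (a square-root-like transform)
print(box_cox(3.0, 1.0))  # 2.0  (lambda = 1 only shifts the data)
print(box_cox(1.0, 0))    # 0.0  (lambda = 0 is the log transform)
```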
88. What are 3 data preprocessing techniques to handle outliers?
Answer: Following are the three data preprocessing techniques to handle outliers:
1.Winsorize (cap at threshold).
2.Transform for reducing skew (using Box-Cox or similar).
3.Remove outliers if they are anomalies or measurement errors.
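The first technique, winsorizing, can be sketched as a simple clamp. The thresholds are passed in explicitly here as an assumption; they are often derived from percentiles of the data instead.

```python
def winsorize(values, lower, upper):
    """Cap every value to the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


data = [1, 2, 3, 4, 100]
print(winsorize(data, lower=1, upper=10))  # [1, 2, 3, 4, 10]
```

Unlike removal, winsorizing keeps the row in the dataset and only limits how far the extreme value can pull the model.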
89. What are 3 ways of reducing dimensionality?
Answer: Following are three ways of reducing dimensionality:
1.Removing collinear features.
2.Performing PCA, ICA, or other forms of algorithmic dimensionality reduction.
3.Combining features with feature engineering.
90. How much data should you allocate for your training, validation, and test sets?
Answer: If the test set is too small, we will have an unreliable estimate of model performance, because the performance statistic will have high variance. If the training set is too small, the actual model parameters will have high variance.
A good rule of thumb is to use an 80/20 train/test split. The train set can then be further split into train/validation sets or into partitions for cross-validation.
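An 80/20 split can be sketched with the standard library alone. The helper name split_train_test is hypothetical, and shuffling with a fixed seed is an assumption made so the split is reproducible.

```python
import random


def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_train_test(range(100))
print(len(train), len(test))  # 80 20
```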
91. If you split your data into train/test splits, is it still possible to overfit your model?
Answer: Yes. A common beginner mistake is re-tuning a model or training new models with different parameters after seeing its performance on the test set.
In this case, it is the model selection process that causes the overfitting. The test set should not be touched until we are ready to make the final selection.
92. What are the advantages and disadvantages of decision trees?
Answer: Advantages: Decision trees are easy to interpret, nonparametric (which means they are robust to outliers), and have relatively few parameters to tune.
Disadvantages: Decision trees are prone to overfitting. This can be addressed by ensemble methods such as random forests or boosted trees.
93. What are the advantages and disadvantages of neural networks?
Answer: Advantages: Neural networks, specifically deep NNs, have led to performance breakthroughs for unstructured datasets such as images, audio, and video. Their incredible flexibility allows them to learn patterns that no other ML algorithm can learn.
Disadvantages: They require a large amount of training data to converge. It is also difficult to pick the right architecture, and the internal hidden layers are incomprehensible.
94. How can you choose a classifier based on training set size?
Answer: If the training set is small, high bias / low variance models such as Naive Bayes tend to perform better because they are less likely to overfit.
If the training set is large, low bias / high variance models such as Logistic Regression tend to perform better because they can reflect more complex relationships.
95. Explain Latent Dirichlet Allocation (LDA).
Answer: Latent Dirichlet Allocation (LDA) is a common method of topic modeling, or classifying documents by subject matter.
It is a generative model. LDA represents documents as mixtures of topics, and each topic has its own probability distribution over possible words.
The Dirichlet distribution is a distribution of distributions. In LDA, documents are distributions of topics, which are in turn distributions of words.
96. Explain Principal Component Analysis (PCA).
Answer: PCA is a method for transforming the features in a dataset by combining them into uncorrelated linear combinations.
These new features, or principal components, sequentially maximize the variance represented: the first principal component has the most variance, the second principal component has the second most, and so on.
PCA is very useful for dimensionality reduction because an arbitrary variance cutoff can be set.
97. What is AUC (a.k.a. AUROC)?
Answer: AUC is the area under the ROC curve, and it is a common performance metric for evaluating binary classification models.
It is equivalent to the expected probability that a uniformly drawn random positive is ranked before a uniformly drawn random negative.
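That ranking interpretation can be turned into code directly by comparing every positive with every negative. The sketch below uses hypothetical names and made-up scores, counts ties as half a win, and runs in quadratic time, so it is for illustration rather than for large datasets.

```python
def auc_by_ranking(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
# 8 of the 9 positive/negative pairs are ranked correctly.
print(round(auc_by_ranking(labels, scores), 3))  # 0.889
```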
98. Why is Area under ROC Curve (AUROC) better than raw accuracy as an out-of-sample evaluation metric?
Answer: AUROC is robust to class imbalance, but raw accuracy is not.
For example, if we want to detect a type of cancer that is prevalent in only 1% of the population, we can build a model that achieves 99% accuracy by simply classifying everyone as cancer-free.
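The cancer example is easy to reproduce with synthetic labels. The population size of 1,000 and the variable names are assumptions for illustration.

```python
labels = [1] * 10 + [0] * 990  # 1% of the population has the condition
predictions = [0] * 1000       # the "model" calls everyone cancer-free

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(accuracy)  # 0.99
print(recall)    # 0.0, not a single real case is detected
```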
99. Why are ensemble methods superior to individual models?
Answer: Ensemble methods average out biases, reduce variance, and are less likely to overfit.
A common line in machine learning is: “ensemble and get 2%.”
This implies that we can build the models as usual and typically expect a small performance boost from ensembling.
100. What is the difference between inductive and deductive learning?
Answer: Inductive learning is the process of using observations to draw conclusions. Inductive machine learning begins with examples from which to conclude.
Deductive learning is the process of using conclusions to form observations. A deductive learner starts from a conclusion and learns by deducing what is right or wrong about that conclusion.
101. What are some key business metrics for (SaaS startup | Retail bank | e-Commerce site)?
Answer: Key business metrics for (SaaS startup | Retail bank | e-Commerce site) can be:
SaaS startup: Customer lifetime value, new accounts, account lifetime, churn rate, usage rate, social share rate
Retail bank: Offline leads, online leads, new accounts (segmented by account type), risk factors, product affinities
e-Commerce: Conversion rate, product sales, average cart value, cart abandonment rate, email leads
102. How can you help our marketing team be more efficient?
Answer: This depends on the type of company. The following are some examples:
Clustering algorithms to build custom customer segments for each type of marketing campaign.
Natural language processing on headlines to predict their performance before spending on ads.
Predicting conversion probability from a user’s website behavior in order to create better re-targeting campaigns.
103. How would you explain Machine learning to a school-going kid?
Answer: We can explain machine learning to a school-going kid in the following way:
Suppose your friend invites you to her party, where you meet total strangers. Since you know nothing about them, you mentally classify them on the basis of gender, age group, dress, etc.
In this scenario, the strangers represent unlabeled data, and the process of classifying unlabeled data points is nothing but unsupervised learning.
Since you did not use any prior knowledge about the people and classified them on the go, this is an unsupervised learning problem.
104. How does Deep Learning differ from Machine Learning?
Answer: Deep Learning is a form of machine learning which is inspired by the structure of the human brain and is particularly effective at feature detection.
Machine Learning is all about algorithms which parse data, learn from that data, and then apply what they’ve learned to make informed decisions.
105. What do you understand by selection bias?
Answer: 1. It is a statistical error which causes a bias in the sampling portion of an experiment.
2. The error causes one sampling group to be selected more often than the other groups included in the experiment.
3. Selection bias can produce an inaccurate conclusion if it is not identified.
106. Give an example of Precision and Recall.
Answer: Recall is the ratio of the number of events we can correctly recall to the total number of events.
If we can recall all 10 events correctly, the recall ratio is 1.0 (100%), and if we can recall 7 events correctly, the recall ratio is 0.7 (70%).
Precision is the ratio of the number of events we can correctly recall to the total number of events we recall (a mix of correct and wrong recalls).
For example, with 10 real events and 15 answers (10 correct, 5 wrong), we get 100% recall but precision is only 66.67% (10 / 15).
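The numbers in this example can be reproduced with a few lines of Python. The variable names are chosen for illustration only.

```python
real_events = 10      # events that actually happened
answers_given = 15    # events we claimed to recall
correct_answers = 10  # answers that match a real event

recall = correct_answers / real_events       # correct answers / all real events
precision = correct_answers / answers_given  # correct answers / all answers given
```

Here recall is 1.0 and precision is 10/15, about 0.6667, matching the figures above.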
107. What is a Confusion Matrix?
Answer: A confusion matrix or an error matrix is a table which can be used for summarizing the performance of a classification algorithm.
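For a binary classifier, the table has four cells. A minimal sketch in Python, assuming labels are encoded as 1 (positive) and 0 (negative) and using made-up predictions:

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs for a binary classifier."""
    counts = Counter(zip(actual, predicted))
    return {
        "TP": counts[(1, 1)],  # positives predicted as positive
        "FN": counts[(1, 0)],  # positives predicted as negative
        "FP": counts[(0, 1)],  # negatives predicted as positive
        "TN": counts[(0, 0)],  # negatives predicted as negative
    }

# Hypothetical ground truth and predictions.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
cm = confusion_matrix(actual, predicted)
```

Metrics such as accuracy, precision and recall can all be derived from these four counts.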
108. What is the difference between Gini Impurity and Entropy in a Decision Tree?
Answer: Gini Impurity and Entropy are the metrics used to decide how to split a Decision Tree.
Gini impurity is the probability that a random sample is classified incorrectly if we randomly pick a label according to the distribution in the branch.
Entropy is a measure of the lack of information. Information Gain is the difference in entropy before and after a split. This measure helps to reduce the uncertainty about the output label.
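Both measures can be sketched in a few lines of Python. This assumes the standard definitions (Gini impurity = 1 minus the sum of squared class probabilities, entropy in bits); the label lists are invented for illustration.

```python
from collections import Counter
from math import log2

def class_probabilities(labels):
    """Relative frequency of each class in a node."""
    counts = Counter(labels)
    return [c / len(labels) for c in counts.values()]

def gini_impurity(labels):
    """Chance of mislabelling a random sample when guessing by class frequency."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    return -sum(p * log2(p) for p in class_probabilities(labels))

mixed = ["yes", "yes", "no", "no"]   # hypothetical 50/50 node
pure = ["yes", "yes", "yes", "yes"]  # hypothetical homogeneous node
```

For the 50/50 node, Gini impurity is 0.5 and entropy is 1.0; for the homogeneous node both are 0.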
109. What is the difference between Entropy and Information Gain?
Answer: Entropy is an indicator of how messy the data is. It decreases as we get closer to the leaf nodes.
Information Gain is the decrease in entropy after a dataset is split on an attribute. The total information gained keeps increasing as we get closer to the leaf nodes.
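The relationship between the two can be sketched in Python as parent entropy minus the size-weighted entropy of the child nodes. The helper names and the example splits are illustrative assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node with 4 positives and 4 negatives, split two ways.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
perfect_split = [[1, 1, 1, 1], [0, 0, 0, 0]]
useless_split = [[1, 1, 0, 0], [1, 1, 0, 0]]
```

The perfect split removes all uncertainty (gain of 1.0 bit), while the useless split leaves the class mix unchanged (gain of 0.0).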
110. How do you ensure you’re not overfitting with a model?
Answer: Overfitting occurs when a model learns the training data to such an extent that it negatively affects the model’s performance on new data.
This means that the noise in the training data is recorded and learned as concepts by the model.
Main methods for avoiding overfitting:
Collect more data so that the model is trained on varied samples.
Use ensembling methods, like Random Forest. It is based on the idea of bagging, which reduces the variance in the predictions by combining the results of multiple decision trees trained on different samples of the data set.
111. What is the difference between Gini Impurity and Entropy in a Decision Tree?
Answer: Gini Impurity and Entropy are the metrics used to decide how to split a Decision Tree.
Gini impurity is the probability that a random sample is classified incorrectly if we randomly pick a label according to the distribution in the branch.
Entropy is a measure of the lack of information. Information Gain is the difference in entropy before and after a split. This measure helps to reduce the uncertainty about the output label.
112. What is the difference between Entropy and Information Gain?
Answer: Entropy is an indicator of how messy the data is. It decreases as we get closer to the leaf nodes.
Information Gain is the decrease in entropy after a dataset is split on an attribute. The total information gained keeps increasing as we get closer to the leaf nodes.
113. How would you screen for outliers and what should you do if you find one?
Answer: The following methods are used for screening outliers:
Boxplot: A box plot represents the distribution of the data and its variability. The box spans the upper and lower quartiles, i.e. the Inter-Quartile Range (IQR), and the whiskers conventionally extend up to 1.5 × IQR beyond the quartiles. Data points which lie beyond the whiskers are flagged as outliers.
Probabilistic and statistical models: Statistical models like the normal distribution and the exponential distribution can be used to detect deviations in the distribution of data points. If a data point falls outside the range expected under the distribution, it is treated as an outlier.
Linear models: Linear models like logistic regression can be trained to flag outliers. In this way, the model will pick up the next outlier it sees.
Proximity-based models: The K-means clustering model is an example, wherein data points form ‘k’ clusters based on features like similarity or distance. Since similar data points form clusters, the outliers end up far from the cluster centers or in a small cluster of their own. Proximity-based models therefore make it easy to detect outliers.
The following are ways to handle outliers once they are found:
1. If the data set is huge and rich, the risk of dropping the outliers can be taken.
2. If the data set is small, cap the outliers by setting a threshold percentile. For example, data points above the 95th percentile can be capped at the 95th-percentile value.
3. Based on the data exploration stage, narrow down some rules and impute the outliers based on those business rules.
114. What are collinearity and multicollinearity?
Answer: Collinearity occurs when two predictor variables (e.g., x1 and x2) in a multiple regression are correlated.
Multicollinearity occurs when more than two predictor variables (e.g., x1, x2, and x3) are inter-correlated.
115. What do you understand by Eigenvectors and Eigenvalues?
Answer: Eigenvectors: Eigenvectors are vectors whose direction remains unchanged when a linear transformation is applied to them.
Eigenvalues: An eigenvalue is the scalar by which the corresponding eigenvector is scaled under that transformation.
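A small Python check makes the definition concrete: for an eigenvector v with eigenvalue λ, multiplying by the matrix gives the same result as scaling v by λ. The 2×2 matrix and helper names below are chosen purely for illustration.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def is_eigenpair(matrix, vector, eigenvalue):
    """True if matrix @ vector equals eigenvalue * vector."""
    return mat_vec(matrix, vector) == [eigenvalue * x for x in vector]

# Hypothetical symmetric matrix with eigenvalues 3 and 1.
A = [[2, 1],
     [1, 2]]

stretched = mat_vec(A, [1, 1])   # same direction as [1, 1], scaled by 3
unchanged = mat_vec(A, [1, -1])  # same direction as [1, -1], scaled by 1
rotated = mat_vec(A, [1, 0])     # direction changes, so [1, 0] is not an eigenvector
```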
116. What is A/B Testing?
Answer: A/B testing is statistical hypothesis testing for a randomized experiment with two variants, A and B. Here it is used to compare two models which use different predictor variables in order to check which one fits best for a given sample of data.
Consider a scenario where we have created two models using different predictor variables which can be used for recommending products on an e-commerce platform.
A/B testing is used to compare these two models and check which one best recommends products to a customer.
117. How do classification and regression differ?
Answer: Classification predicts group or class membership. Regression predicts a continuous response. Classification is the better technique when a discrete, definite answer is required.
118. There’s a game where you are asked to roll two fair six-sided dice. If the sum of the values on the dice equals five, then you win $21. However, you have to pay $5 to play each time you roll both dice. Do you play this game? And in the follow-up: if you play 6 times, what is the probability of making money from this game?
Answer: According to the first condition, if the sum of the values on the 2 dice is equal to 5, then we win $21. In all the other cases we must pay $5.
First, let’s calculate the number of possible cases. Since we have two 6-sided dice, the total number of cases is 6*6 = 36.
Out of these 36 cases, we must count the cases in which the sum of the values on the 2 dice is equal to 5.
The possible combinations which produce a sum of 5 are (1,4), (2,3), (3,2), (4,1). All 4 of these combinations generate a sum of 5.
This means that out of 36 outcomes, only 4 produce a sum of 5. Taking the ratio, we get 4/36 = 1/9, which suggests that we can expect to win $21 once in 9 games.
Therefore, to answer the question: if a person plays 9 times, he can expect to win one game for $21, whereas for the other 8 games he will have to pay $5 each, which is $40 for all eight games. He will face a loss, because he wins $21 but ends up paying $40, so we should not play.
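The expected value and the follow-up about six plays can be worked out exactly with a short Python sketch. It follows the reading used in the answer above (win $21 on a five, pay $5 otherwise). If the $5 were instead charged on every roll, the break-even point would still be two wins out of six, so the probability of profit would be the same. All variable names are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import comb

# Enumerate all 36 outcomes of two dice and count those summing to five.
rolls = list(product(range(1, 7), repeat=2))
p_win = Fraction(sum(a + b == 5 for a, b in rolls), len(rolls))

# Reading used above: +$21 on a win, -$5 on a loss.
expected_per_game = p_win * 21 - (1 - p_win) * 5

def net_after(wins, plays=6):
    """Net winnings after a number of plays with the given number of wins."""
    return 21 * wins - 5 * (plays - wins)

# Probability of ending six plays with a profit (binomial distribution).
p_profit = sum(
    comb(6, k) * p_win**k * (1 - p_win) ** (6 - k)
    for k in range(7)
    if net_after(k) > 0
)
```

The expected value is -19/9 dollars per game, and the chance of making money over six plays is 72689/531441, roughly 13.7%, since at least two wins are needed.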
119. You are given a cancer detection data set. Suppose that when building a classification model you achieve an accuracy of 96%. If you are not happy with your model’s performance, what can you do about it?
Answer: We can do the following:
1.Add more data
2.Treat missing and outlier values
120. What is kernel SVM?
Answer: Kernel SVM is the abbreviated name for kernel support vector machine. Kernel methods are a class of algorithms for pattern analysis, and the kernel SVM is the most common of them.
121. What is decision tree classification?
Answer: A decision tree can build classification or regression models as a tree structure, with datasets broken up into ever-smaller subsets. A decision tree handles both categorical and numerical data.
122. What is a recommendation system?
Answer: A recommendation system is an information filtering system which predicts what a user may want to hear or see based on the choice patterns provided by the user.
123. What is Cluster Sampling?
Answer: Cluster sampling is the process of randomly selecting intact groups with similar characteristics within a defined population. It is a probability sample in which each sampling unit is a collection, or cluster, of elements.
For example, if we are sampling managers across a set of companies, the managers represent the elements and the companies represent the clusters.
124. Running a binary classification tree algorithm is quite easy. Do you know how the tree will decide on which variable to split at the root node and its succeeding child nodes?
Answer: Measures like the Gini Index and Entropy are used to decide which variable is best suited for splitting the Decision Tree at the root node and at each child node.
We calculate Gini as follows:
To calculate Gini for the sub-nodes, use the formula: sum of the squares of the probabilities of success and failure (p^2+q^2).
To calculate Gini for a split, use the weighted Gini score of each node of that split.
Entropy is the measure of impurity or randomness in the data. For a binary class:
Entropy = -p*log2(p) - q*log2(q)
where p and q are the probabilities of success and failure respectively in the node.
Entropy is zero when a node is homogeneous and is at its maximum when both classes are present in a node at 50% – 50%. The variable whose split gives the lowest entropy is the most suitable for the root node.
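The Gini procedure described above can be sketched in Python. Note that the p^2+q^2 score used here is higher for purer nodes (it is 1 minus the Gini impurity), so the split with the higher weighted score is preferred. The candidate splits and their counts are hypothetical.

```python
def gini_score(successes, total):
    """p^2 + q^2 for one node; 1.0 means the node is perfectly pure."""
    p = successes / total
    q = 1 - p
    return p * p + q * q

def weighted_gini(nodes):
    """Size-weighted Gini score of a split.

    nodes: list of (successes, total) pairs, one per child node.
    """
    grand_total = sum(total for _, total in nodes)
    return sum(total / grand_total * gini_score(s, total) for s, total in nodes)

# Hypothetical candidate splits of the same 30 samples.
split_a = [(2, 10), (13, 20)]
split_b = [(6, 14), (9, 16)]

best = max([split_a, split_b], key=weighted_gini)  # split with the purer children
```

With these counts, the first split scores about 0.59 and the second about 0.51, so the first would be chosen for the node.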
125. Name a few libraries in Python used for Data Analysis and Scientific Computations.
Answer: Following Libraries in Python used for Data Analysis and Scientific Computations.
126. Which library would you prefer for plotting in Python language: Seaborn or Matplotlib or Bokeh?
Answer: It will depend on the visualization we are trying to achieve. Each of these libraries has a specific purpose:
Matplotlib: Used for basic plotting such as bars, pies, lines, scatter plots, etc.
Seaborn: It is built on top of Matplotlib and works with Pandas to make data plotting easier. It is used for statistical visualizations such as heatmaps or plots showing the distribution of data.
Bokeh: It is used for interactive visualization. If the data is too complex and you have not yet found any “message” in it, use Bokeh to create interactive visualizations which allow viewers to explore the data themselves.
127. How are NumPy and SciPy related?
Answer: NumPy defines arrays along with basic numerical functions such as indexing, sorting, and reshaping. It is a core part of the SciPy ecosystem, and the SciPy library is built on top of it.
SciPy implements computations such as numerical integration, optimization, and statistics using NumPy’s functionality.
128. You are given a data set in which some variables have more than 30% missing values. Suppose that out of 50 variables, 8 have more than 30% missing values. How will you deal with them?
Answer: We can deal with them in the following ways:
Assign a unique category to the missing values; who knows, the missing values might uncover some trend.
Remove them outright.
Sensibly check their distribution against the target variable; if we find any pattern, we keep those missing values and assign them a new category, and remove the others.
129. How do you map nicknames (kat, Andy, Nick, Joan, etc) to real names?
Answer: This problem can be solved in any number of ways. Assume that we are given a data set containing thousands of Twitter interactions. We can study the relationship between two people by carefully analyzing the words used in the tweets.
This kind of problem can be solved with Text Mining using Natural Language Processing (NLP) techniques, wherein each sentence is broken down into words and correlations between the various words are found.
NLP is actively used to understand customer feedback and to perform sentiment analysis on Twitter and Facebook. Therefore, one of the ways to solve this problem is through Text Mining and Natural Language Processing techniques.
130. You are working on a time series data set. You are asked to build a high accuracy model. You have started with the decision tree algorithm because you know that it works fairly well on all kinds of data. After that, you have tried a time series regression model and got higher accuracy than the decision tree model. Can this happen? Why?
Answer: Yes, this can happen. Time series data is often based on linearity, whereas a decision tree algorithm works best at detecting non-linear interactions.
The decision tree failed to provide robust predictions because it cannot map the linear relationship as well as a regression model does.
A linear regression model, in turn, provides a robust prediction only if the data set satisfies its linearity assumptions.
131. How would you predict who will renew their subscription next month? What data would you need to solve this? What analysis would you do? Would you build predictive models? If so, which algorithms?
Answer: Let’s assume that we are trying to predict the renewal rate for Netflix subscriptions. The problem statement is to predict which users will renew their subscription plan for the next month.
First, we should understand the data which is needed for solving this problem. We need the number of hours the service is active for each household, the number of adults in the household, the number of kids, which channels are streamed most, the time spent on each channel, how much the watch rate has varied from last month, etc. Such data is needed to predict whether or not a person will continue the subscription for the upcoming month.
After collecting this data, it is important to find patterns and correlations. For instance, if we know that a household has kids, it may be more likely to renew. Similarly, by studying the watch rate of the previous month, we can predict whether a person is still interested in a subscription. Such trends should be studied.
The next step is analysis. For this kind of problem statement, we should use a classification algorithm which classifies customers into 2 groups:
Customers who will renew next month
Customers who will not renew next month
Would we build predictive models? Yes, in order to achieve this we must build a predictive model which classifies the customers into the 2 classes mentioned above.
Which algorithms to choose? We can choose classification algorithms such as Logistic Regression, Random Forest, Support Vector Machine, etc.
Once we have chosen the right algorithm, we must perform model evaluation to measure how well it performs. This is followed by deployment.
132. A jar has 1000 coins, of which 999 are fair and 1 is double headed. You have to pick a coin at random, and toss it 10 times. Provided you see 10 heads, what is the probability that the next toss of that coin is also a head?
Answer: There are two ways to choose a coin. One is to pick a fair coin and the other is to pick the double-headed one.
Probability of selecting a fair coin = 999/1000 = 0.999
Probability of selecting the unfair coin = 1/1000 = 0.001
Probability of 10 heads in a row = P(fair coin) * P(10 heads with a fair coin) + P(unfair coin) * 1
Let A be the event “fair coin and 10 heads” and B the event “unfair coin and 10 heads”:
P(A) = 0.999 * (1/2)^10 = 0.999 * (1/1024) = 0.000976
P(B) = 0.001 * 1 = 0.001
P(fair | 10 heads) = P(A) / (P(A) + P(B)) = 0.000976 / (0.000976 + 0.001) = 0.4939
P(unfair | 10 heads) = P(B) / (P(A) + P(B)) = 0.001 / 0.001976 = 0.5061
Probability that the next toss is also a head = P(fair | 10 heads) * 0.5 + P(unfair | 10 heads) * 1 = 0.4939 * 0.5 + 0.5061 = 0.7531
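The same Bayes calculation can be done with exact fractions in Python, which avoids the rounding in the intermediate steps. The variable names are illustrative.

```python
from fractions import Fraction

prior_fair = Fraction(999, 1000)
prior_double = Fraction(1, 1000)

# Likelihood of seeing 10 heads in a row with each kind of coin.
likelihood_fair = Fraction(1, 2) ** 10
likelihood_double = Fraction(1)

# Bayes' rule: posterior = prior * likelihood / evidence.
evidence = prior_fair * likelihood_fair + prior_double * likelihood_double
post_fair = prior_fair * likelihood_fair / evidence
post_double = prior_double * likelihood_double / evidence

# Chance that the 11th toss is also a head.
p_next_head = post_fair * Fraction(1, 2) + post_double * 1
```

The exact result is 3047/4046, which rounds to 0.7531.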
133. Suppose you are given a data set which has missing values spread along 1 standard deviation from the median. What percentage of data would remain unaffected and Why?
Answer: Since the data is spread around the median, let’s assume it follows a normal distribution.
In a normal distribution, ~68% of the data lies within 1 standard deviation of the mean (which coincides with the mode and median), which leaves ~32% of the data outside that range. Therefore, ~32% of the data will remain unaffected by missing values.
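The ~68% figure can be verified with Python’s standard library, assuming a standard normal distribution:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Probability mass within one standard deviation of the mean.
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
outside_one_sd = 1 - within_one_sd
```

This gives about 0.6827 inside and 0.3173 outside one standard deviation.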
134. What makes segmentation CNNs have an encoder-decoder structure?
Answer: The encoder CNN can be thought of as a feature extraction network. The decoder CNN uses that information to predict the image segments by decoding the essential features and upscaling them to the original size of the image.
135. Why do classification CNNs use max-pooling?
Answer: Max-pooling plays a significant role in computer vision. It reduces computation, because the feature maps become smaller after pooling. Not much meaningful information is lost, because the maximum activation is the one that is kept. Max-pooling is also credited with giving CNNs a degree of translation invariance.
136. What is algorithm independent machine learning?
Answer: Algorithm-independent machine learning refers to the mathematical foundations of machine learning that are independent of any particular classifier or learning algorithm.
137. What is the difference between artificial intelligence and machine learning?
Answer: Machine learning is about designing and developing algorithms that learn behaviours from empirical data. Artificial intelligence, in addition to machine learning, covers other aspects like knowledge representation, natural language processing, planning, robotics, etc.
138. Suppose you found that your model is suffering from low bias and high variance. Which algorithm do you think could handle the situation, and why?
Answer: Low bias occurs when the model’s predicted values are close to the actual values. The high variance can be tackled in two ways.
Option 1: Bagging
We can use a bagging algorithm such as Random Forest to tackle the high variance problem.
A bagging algorithm divides the data set into subsets by repeated randomized sampling.
Once divided, these samples are used to generate a set of models using a single learning algorithm. The model predictions are then combined using voting (classification) or averaging (regression).
Option 2: Reduce model complexity
Lower the model complexity by using a regularization technique, where higher model coefficients are penalized.
Use the top n features from the variable importance chart. With every variable in the data set included, the algorithm may have difficulty finding the meaningful signal.
139. You are given a data set. The data set contains many variables, some of which you know to be highly correlated. Your manager has asked you to run PCA. Would you remove correlated variables first? Why?
Answer: We may be tempted to say no, but that would be incorrect.
Discarding correlated variables has a substantial effect on PCA because, in the presence of correlated variables, the variance explained by a particular component gets inflated.
140. You are asked to build a multiple regression model but your model R² isn’t as good as you wanted. For improvement, if you remove the intercept term then your model R² becomes 0.8 from 0.3. Is it possible? How?
Answer: Yes, it is possible.
The intercept term refers to the model’s prediction without any independent variable, i.e. the mean prediction.
R² = 1 – ∑(Y – Y´)²/∑(Y – Ymean)²; Y´ is the predicted value.
In the presence of the intercept term, the R² value evaluates the model with respect to the mean model.
In the absence of the intercept term (Ymean), the model makes no such evaluation, and the denominator becomes ∑(Y)².
With this larger denominator, the value of ∑(Y – Y´)²/∑(Y)² becomes smaller than it would otherwise be, thereby resulting in a higher value of R².
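The effect of the denominator can be seen in a small Python sketch that computes R² both ways for the same predictions. The target values and predictions are invented, and the `centered` flag is a name introduced here for illustration.

```python
def r_squared(y, y_hat, centered=True):
    """R^2 with the total sum of squares taken around the mean (centered)
    or around zero (uncentered, as used when there is no intercept)."""
    baseline = sum(y) / len(y) if centered else 0.0
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - baseline) ** 2 for a in y)
    return 1 - ss_res / ss_tot

# Hypothetical targets far from zero, with identical residuals in both cases.
y = [10, 11, 12, 13]
y_hat = [10.5, 10.5, 12.5, 12.5]

r2_centered = r_squared(y, y_hat, centered=True)     # denominator is 5
r2_uncentered = r_squared(y, y_hat, centered=False)  # denominator is 534
```

The residuals are identical in both cases, yet the uncentered R² is much closer to 1 simply because ∑(Y)² is far larger than ∑(Y – Ymean)².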
141. You’re asked to build a random forest model with 10000 trees. During training, you received training error as 0.00. But, on testing the validation error was 34.23. What is going on? Haven’t you trained your model perfectly?
Answer: The model is overfitting the data.
A training error of 0.00 means that the classifier has mimicked the training data patterns to such an extent that they are not found in the unseen data.
When this classifier runs on an unseen sample, it cannot find those patterns and returns predictions with a higher error.
In Random Forest, this can happen when the model is more complex than necessary, for example when more trees are used than needed. Hence, to avoid such situations, we should tune the number of trees using cross-validation.
142. The ‘People who bought this also bought…’ recommendations seen on Amazon are based on which algorithm?
Answer: E-commerce websites such as Amazon use machine learning to recommend products to their customers. The basic idea behind this kind of recommendation comes from collaborative filtering. Collaborative filtering is the process of comparing users with similar shopping behaviour in order to recommend products to a new user with the same shopping behaviour.
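A heavily simplified sketch of the idea in Python counts how often items appear in the same basket. This is plain item co-occurrence, not Amazon’s actual algorithm, and the baskets and function name are made up for illustration.

```python
from collections import Counter
from itertools import permutations

def also_bought(baskets):
    """For each item, count how often every other item was bought with it."""
    co_counts = {}
    for basket in baskets:
        for item, other in permutations(set(basket), 2):
            co_counts.setdefault(item, Counter())[other] += 1
    return co_counts

# Hypothetical purchase history, one list per customer order.
baskets = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["phone", "charger"],
    ["laptop", "mouse"],
    ["phone", "case", "screen protector"],
]

co_counts = also_bought(baskets)
top_for_phone = co_counts["phone"].most_common(1)  # most frequent companion item
```

A production recommender would typically normalize these counts and compare users or items with a similarity measure rather than raw co-occurrence.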
143. Mention the difference between Data Mining and Machine Learning.
Answer: Machine learning relates to the study, design and development of algorithms which give computers the capability to learn without being explicitly programmed. Data mining is defined as the process of extracting knowledge or unknown interesting patterns from unstructured data. During this process, machine learning algorithms are used.
144. What are the five popular algorithms of Machine Learning?
Answer: Following are the five popular algorithms of Machine Learning:
2.Neural Networks (back propagation)
5.Support vector machines
145. What are the three stages to build the hypotheses or model in machine learning?
Answer: Following are three stages to build the hypotheses or model in machine learning
1. Model building
2. Model testing
3. Applying the model
146. What is the standard approach to supervised learning?
Answer: The standard approach to supervised learning is to split the set of examples into a training set and a test set.
147. What is ‘Training set’ and ‘Test set’?
Answer: In various areas of information science, such as machine learning, the set of data used to discover a potentially predictive relationship is called the ‘Training Set’. The training set is the set of examples given to the learner, while the test set is used to test the accuracy of the hypotheses generated by the learner. The test set is the set of examples held back from the learner, so the training set and test set are distinct.
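A minimal sketch of such a split in Python is shown below. The 25% test fraction, the fixed seed, and the function name are assumptions made for illustration.

```python
import random

def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the examples and hold back a fraction as the test set."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data set of 20 examples, identified by index.
examples = list(range(20))
train, test = train_test_split(examples)
```

The learner only ever sees `train`; `test` is used afterwards to estimate how well the learned hypothesis generalizes.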
148. List down various approaches for machine learning?
Answer: The following are different approaches in Machine Learning
1) Concept Vs Classification Learning
2) Symbolic Vs Statistical Learning
3) Inductive Vs Analytical Learning
149. What is not Machine Learning?
Answer: The following are not machine learning:
1) Artificial Intelligence
2) Rule based inference
150. Explain what is the function of ‘Unsupervised Learning’?
Answer: The following are the functions of unsupervised learning:
1) Find clusters of the data
2) Find low-dimensional representations of the data
3) Find interesting directions in data
4) Interesting coordinates and correlations
5) Find novel observations/ database cleaning
151. Explain what is the function of ‘Supervised Learning’?
Answer: The following are the functions of supervised learning:
2) Speech recognition
4) Predict time series
5) Annotate strings
152. What is batch statistical learning?
Answer: Statistical learning techniques allow learning a function or predictor from a set of observed data which can make predictions about unseen or future data. Batch statistical learning techniques provide guarantees on the performance of the learned predictor on future unseen data, based on a statistical assumption about the data-generating process.
153. What is PAC Learning?
Answer: PAC Learning stands for Probably Approximately Correct learning and is a learning framework. It has been introduced to analyze learning algorithms and their statistical efficiency.
154. A rise in the temperature of the globe has led to a decrease in the number of pirates. Does that mean that a decrease in the number of pirates has caused climate change?
Answer: This is a case of confusing correlation with causation. No, we cannot assume that a decrease in the number of pirates has led to a change in the global temperature. There are several other factors influencing this outcome.
There may be a correlation between the total number of pirates and the global average temperature, but it should not be concluded that the pirates died out because of the increase in global temperature.
155. In what areas is Pattern Recognition used?
Answer: Pattern Recognition is used in
1) Computer Vision
2) Speech Recognition
3) Data Mining
5) Information Retrieval
156. What is Genetic Programming?
Answer: Genetic programming is one of the two techniques used in machine learning. It is based on testing and selecting the best choice among a set of results.
157. What is Inductive Logic Programming in Machine Learning?
Answer: Inductive Logic Programming (ILP) is a subfield of machine learning which uses logic programming to represent background knowledge and examples.
158. What is Model Selection in Machine Learning?
Answer: The process of selecting among different mathematical models that are used to describe the same data set is called Model Selection. It is applied in the fields of statistics, machine learning and data mining.
159. What are the two methods used for the calibration in Supervised Learning?
Answer: The two methods which are used for predicting good probabilities in Supervised Learning are
1) Platt Calibration
2) Isotonic Regression
These methods are designed for binary classification, and extending them to multiclass problems is not trivial.
160. Which method is frequently used for preventing overfitting?
Answer: If there is sufficient data, ‘Isotonic Regression’ is used to prevent an overfitting issue.
161. What is Perceptron in Machine Learning?
Answer: A perceptron is an algorithm for supervised binary classification in machine learning: it assigns an input to one of two classes using a linear decision boundary.
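A minimal sketch of the classic perceptron learning rule in Python, trained on the logical AND function, is shown below. The learning rate, epoch count, zero initialization and function names are assumptions chosen for illustration.

```python
def train_perceptron(samples, epochs=10, lr=1.0):
    """Learn weights and a bias for two binary inputs with the perceptron rule."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            activation = weights[0] * x1 + weights[1] * x2 + bias
            predicted = 1 if activation > 0 else 0
            error = target - predicted  # -1, 0 or +1
            # Move the decision boundary only when the prediction is wrong.
            weights[0] += lr * error * x1
            weights[1] += lr * error * x2
            bias += lr * error
    return weights, bias

def predict(weights, bias, x):
    """Classify an input as 1 or 0 using the learned linear boundary."""
    return 1 if weights[0] * x[0] + weights[1] * x[1] + bias > 0 else 0

# Logical AND is linearly separable, so the perceptron can learn it.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
```

A single perceptron cannot learn functions that are not linearly separable, such as XOR.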
162. Explain the two components of Bayesian logic program?
Answer: A Bayesian logic program consists of two components. The first component is a logical one which consists of a set of Bayesian Clauses that capture the qualitative structure of the domain. The second component is a quantitative one which encodes the quantitative information about the domain.
163. What are Bayesian Networks (BN)?
Answer: A Bayesian Network is a graphical model that represents the probabilistic relationships among a set of variables.
164. Why is an instance-based learning algorithm sometimes referred to as a lazy learning algorithm?
Answer: An instance-based learning algorithm delays the induction or generalization process until classification is performed. This is why it is sometimes referred to as a lazy learning algorithm.
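A 1-nearest-neighbour classifier is a simple instance-based learner and shows the "lazy" behaviour: there is no training step, and all the work happens at prediction time. The sketch below is in Python with made-up points and labels.

```python
from math import dist

def nearest_neighbor_predict(training_data, query):
    """Return the label of the stored example closest to the query point."""
    _, label = min(training_data, key=lambda pair: dist(pair[0], query))
    return label

# "Training" is just storing the labelled examples (hypothetical 2-D points).
training_data = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((8.0, 8.0), "B"),
    ((9.0, 7.5), "B"),
]

near_a = nearest_neighbor_predict(training_data, (2.0, 2.0))
near_b = nearest_neighbor_predict(training_data, (7.0, 9.0))
```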
165. What are the two classification methods that SVM (Support Vector Machine) can handle?
Answer: SVM (Support Vector Machine) can handle the following two classification methods:
1) Combining binary classifiers
2) Modifying binary to incorporate multiclass learning
166. In what areas is Pattern Recognition used?
Answer: Pattern Recognition is used in
1) Computer Vision
2) Speech Recognition
3) Data Mining
5) Information Retrieval
167. What is Model Selection in Machine Learning? What is an Incremental Learning algorithm in ensemble?
Answer: An incremental learning method in an ensemble is the ability of an algorithm to learn from new data which may become available after the classifier has already been generated from the available dataset.
The process of selecting among different mathematical models that are used to describe the same data set is called Model Selection. Model selection is applied in the fields of statistics, machine learning and data mining.
168. What is PCA, KPCA and ICA used for?
Answer: Principal Components Analysis (PCA), Kernel-based Principal Component Analysis (KPCA) and Independent Component Analysis (ICA) are important feature extraction techniques which are used for dimensionality reduction.
169. What are the two methods used for the calibration in Supervised Learning?
Answer: The following two methods are used for predicting good probabilities in Supervised Learning:
1) Platt Calibration
2) Isotonic Regression
These methods are designed for binary classification, and extending them to multiclass problems is not trivial.
170. What is the difference between heuristics for rule learning and heuristics for decision trees?
Answer: The heuristics for decision trees evaluate the average quality of a number of disjoint sets, while rule learners evaluate only the quality of the set of instances covered by the candidate rule.
171. What are the components of relational evaluation techniques?
Answer: Following are important components of relational evaluation techniques:
1) Data Acquisition
2) Ground Truth Acquisition
3) Cross Validation Technique
4) Query Type
5) Scoring Metric
6) Significance Test
172. What are the different methods for Sequential Supervised Learning?
Answer: Following are different methods to solve Sequential Supervised Learning problems:
1) Sliding-window methods
2) Recurrent sliding windows
3) Hidden Markov models
4) Maximum entropy Markov models
5) Conditional random fields
6) Graph transformer networks
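The first method in this list, the sliding window, can be sketched in Python: each position in the sequence becomes an ordinary supervised example whose features are the item itself plus its neighbours. The window width, padding value and the tagging example are illustrative assumptions.

```python
def sliding_windows(sequence, labels, half_width=1, pad=None):
    """Pair each label with a window of inputs centred on the same position."""
    padded = [pad] * half_width + list(sequence) + [pad] * half_width
    return [
        (tuple(padded[i:i + 2 * half_width + 1]), labels[i])
        for i in range(len(sequence))
    ]

# Hypothetical part-of-speech tagging example.
tokens = ["the", "cat", "sat"]
tags = ["DET", "NOUN", "VERB"]
examples = sliding_windows(tokens, tags)
```

Each resulting (window, label) pair can then be fed to any standard classifier, which is what makes this method simple compared with the sequence models listed after it.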
173. List the areas in robotics and information processing where the sequential prediction problem arises.
Answer: The sequential prediction problem arises in the following areas of robotics and information processing:
1) Imitation Learning
2) Structured prediction
3) Model based reinforcement learning
174. Into which categories can the sequence learning process be divided?
Answer: The sequence learning process can be divided into the following categories:
1) Sequence prediction
2) Sequence generation
3) Sequence recognition
4) Sequential decision
175. What is sequence learning?
Answer: Sequence learning is a method of teaching and learning in a logical, ordered manner, in which the order of the items being learned matters.
176. What are two techniques of Machine Learning?
Answer: Following are the two techniques of Machine Learning:
1) Genetic Programming
2) Inductive Learning
177. Give a popular application of machine learning that you see on a day-to-day basis.
Answer: The recommendation engines implemented by major e-commerce websites use machine learning.
178. How will you know which machine learning algorithm to choose for your classification problem?
Answer: If accuracy is the main concern, try different parameters within each candidate algorithm and select the best one by cross-validation. A general rule of thumb is to choose the algorithm based on the size of the training set. If the training set is small, low variance/high bias classifiers such as Naïve Bayes have an advantage over high variance/low bias classifiers such as k-nearest neighbours, which are likely to overfit. As the training set grows, high variance/low bias classifiers tend to win.
179. How will you explain machine learning to a layperson?
Answer: Machine learning is about making decisions based on previous experience with a task, with the intent of improving performance on that task. Two examples can be used to explain it to a layperson:
We have observed that obese people often tend to get heart disease, so we decide to try to remain thin to lower our own risk. Here we have observed a lot of data and established a general rule of classification.
We are playing blackjack and, based on the sequence of cards we see, we decide whether to hit or to stay. Here we make a quick decision based on the information we already have and on what we see happening.
180. List out some important methods of reducing dimensionality.
Answer: Following are important methods of reducing dimensionality:
1.Combine features with feature engineering.
2.Use some form of algorithmic dimensionality reduction such as ICA or PCA.
3.Remove collinear features to reduce dimensionality.
181. You are given a dataset where the number of variables (p) is greater than the number of observations (n) (p>n). Choose the best technique to use and why?
Answer: When the number of variables exceeds the number of observations, the data set is high dimensional. In that case we cannot calculate a unique least squares coefficient estimate. Penalized regression methods such as LARS, Lasso or Ridge tend to work well in these circumstances because they shrink the coefficients to reduce variance. Ridge regression in particular may work best when the least squares estimates have high variance.
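A tiny worked case may make the point concrete. The sketch below takes one observation with two features (so p > n) and solves the ridge normal equations (XᵀX + λI)w = Xᵀy by Cramer's rule. The function name `ridge_two_features` and the numbers are invented for the illustration, and the intercept is left out for brevity.

```python
def ridge_two_features(x1, x2, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for a two-feature model.

    x1 and x2 are the feature columns, y the targets, lam the ridge penalty.
    Illustrative only: real code would use a linear algebra library.
    """
    a = sum(v * v for v in x1) + lam
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2) + lam
    r1 = sum(u * t for u, t in zip(x1, y))
    r2 = sum(u * t for u, t in zip(x2, y))
    det = a * d - b * b
    if det == 0:
        # X^T X is singular: infinitely many least squares solutions.
        raise ValueError("no unique solution; use lam > 0")
    return ((r1 * d - b * r2) / det, (a * r2 - b * r1) / det)


# One observation, two features: ordinary least squares (lam = 0) is not unique,
# while lam = 1 gives a single, shrunken solution.
w_ridge = ridge_two_features([1.0], [1.0], [2.0], lam=1.0)
```

With `lam=0.0` the same call raises `ValueError`, because any pair of weights that sums to 2 fits the single observation exactly.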
182. “People who bought this, also bought….” recommendations on Amazon are a result of which machine learning algorithm?
Answer: Such recommendations are typically produced by collaborative filtering, a machine learning approach that uses user behaviour to recommend products. Collaborative filtering algorithms exploit the behaviour of users and products through ratings, reviews, transaction history, browsing history, selection and purchase information.
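The sketch below shows the simplest form of this idea: counting which items appear in the same purchase as a given item. It is a simplified stand-in for item-based collaborative filtering, not a description of how any particular retailer implements it; the function name `also_bought` and the baskets are made up.

```python
from collections import Counter
from itertools import combinations


def also_bought(baskets, item, top_n=2):
    """Return the items most often purchased together with `item`."""
    co_counts = Counter()
    for basket in baskets:
        # Look at every unordered pair of distinct items in the basket.
        for a, b in combinations(sorted(set(basket)), 2):
            if a == item:
                co_counts[b] += 1
            elif b == item:
                co_counts[a] += 1
    return [other for other, _ in co_counts.most_common(top_n)]


# Hypothetical transaction history.
baskets = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["phone", "charger", "cable"],
    ["laptop", "mouse"],
    ["phone", "case"],
]
suggestions = also_bought(baskets, "phone")
```

Production systems usually normalise these counts (for example by item popularity) and combine them with ratings or browsing signals.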
183. Name some feature extraction techniques used for dimensionality reduction.
Answer: Following are some feature extraction techniques used for dimensionality reduction:
1.Independent Component Analysis
2.Principal Component Analysis
3.Kernel Based Principal Component Analysis
184. List some use cases where classification machine learning algorithms can be used.
Answer: Following are use cases where classification machine learning algorithms can be used:
1.Natural language processing (the best example is Spoken Language Understanding)
3.Text Categorization (Spam Filtering)
4.Bioinformatics (Classifying proteins according to their function)
185. What kind of problems does regularization solve?
Answer: Regularization is used to address overfitting. It penalizes the loss function by adding a multiple of the L1 (Lasso) or L2 (Ridge) norm of the weight vector w.
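The penalty can be written out directly. In this sketch the base loss is assumed to be mean squared error, and the L2 case uses the squared norm, as Ridge regression conventionally does; the function name `penalized_loss` and the numbers are illustrative only.

```python
def penalized_loss(errors, weights, lam, norm="l2"):
    """Mean squared error plus lam times an L1 or squared-L2 weight penalty."""
    mse = sum(e * e for e in errors) / len(errors)
    if norm == "l1":
        penalty = sum(abs(w) for w in weights)   # Lasso-style penalty
    elif norm == "l2":
        penalty = sum(w * w for w in weights)    # Ridge-style penalty
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    return mse + lam * penalty


errors = [1.0, -1.0]     # residuals of some model (made up)
weights = [3.0, -4.0]    # that model's weights (made up)
lasso_loss = penalized_loss(errors, weights, lam=0.1, norm="l1")
ridge_loss = penalized_loss(errors, weights, lam=0.1, norm="l2")
```

Larger weights raise the loss even when the residuals stay the same, which is what pushes the optimizer towards simpler models.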
186. How much data will you allocate for your training, validation and test sets?
Answer: A balance is required when allocating data to the training, validation and test sets.
If the training set is too small, the model parameters may have high variance. If the test set is too small, the estimate of model performance may be unreliable. A general rule of thumb is to use an 80:20 train/test split, and then to split the training set further to obtain a validation set.
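A minimal sketch of that rule of thumb, using only the standard library. The 20% validation fraction taken from the remaining training data, the fixed seed and the function name `train_val_test_split` are assumptions for the example.

```python
import random


def train_val_test_split(rows, test_frac=0.2, val_frac=0.2, seed=0):
    """Shuffle rows, hold out a test set, then carve a validation set
    out of what remains."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # seeded for reproducibility
    n_test = int(len(rows) * test_frac)
    test, rest = rows[:n_test], rows[n_test:]
    n_val = int(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
```

For 100 rows this yields 64 training, 16 validation and 20 test rows, with no row shared between the sets.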
187. Which one would you prefer to choose – model accuracy or model performance?
Answer: Model accuracy is a subset of model performance but it is not the be-all and end-all of model performance.
188. What is the most frequent metric to assess model accuracy for classification problems?
Answer: Percent Correct Classification (PCC) is the most frequently used metric. It measures overall accuracy irrespective of the kind of errors made; all errors are given the same weight.
189. Why is Manhattan distance not used in kNN machine learning algorithm to calculate the distance between nearest neighbours?
Answer: Manhattan distance is restricted by the choice of dimensions: it measures distance only along the axes, that is, horizontally or vertically. Euclidean distance is usually the better option in kNN because it measures the straight-line distance between data points in any direction, without that restriction.
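The difference between the two measures is easy to see on a single pair of points. The function names below are chosen for this illustration.

```python
import math


def manhattan(p, q):
    """Sum of absolute differences along each axis."""
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


origin, point = (0, 0), (3, 4)
d_manhattan = manhattan(origin, point)   # 3 across + 4 up
d_euclidean = euclidean(origin, point)   # length of the diagonal
```

For the points (0, 0) and (3, 4), the Manhattan distance is 7 while the Euclidean distance is 5.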
190. Comparison between Machine Learning and Big Data.
|Machine Learning vs Big Data|
| Feature | Machine Learning | Big Data |
| Data Use | Technology that helps reduce human intervention. | Data research, especially when working with huge data sets. |
| Operations | Existing data helps teach the machine what to do next. | Designing patterns with analytics on existing data. |
| Pattern Recognition | As with Big Data, existing data helps in pattern recognition. | Sequence and classification analysis help in pattern recognition. |
| Data Volume | Performs best when working with small data sets. | Helps in understanding and solving problems associated with large data volumes. |
191. How is F1 score used?
Answer: The F1 score is the harmonic mean of the precision and recall of a model. An F1 score of 1 is the best and 0 is the worst.
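A short sketch of the calculation, with made-up precision and recall values. Returning 0 when both inputs are 0 is a convention assumed here to avoid division by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.5, 1.0)
```

For precision 0.5 and recall 1.0 the F1 score is about 0.67, lower than the arithmetic mean of 0.75, because the harmonic mean is pulled towards the weaker of the two values.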
192. What is the difference between an array and Linked list?
Answer: An array is an ordered collection of objects in which any element can be accessed directly by its index. A linked list is a series of objects in which each one points to the next, so the elements are processed in sequential order.
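The contrast can be sketched as follows. A Python list stands in for the array, and the `Node` class and `nth` helper are hypothetical names used only for this example.

```python
class Node:
    """One element of a singly linked list."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next_node = next_node


def nth(head, index):
    """Walk the list from the head; cost grows with index."""
    node = head
    for _ in range(index):
        node = node.next_node
    return node.value


array = [10, 20, 30]                      # direct access by index
head = Node(10, Node(20, Node(30)))       # each node points to the next

from_array = array[2]
from_list = nth(head, 2)
```

Both lookups return the same value, but the array reaches it in one step while the linked list has to follow two links to get there.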
193. Define a hash table?
Answer: A hash table is a data structure that implements an associative array: a hash function maps each key to a slot where its value is stored. It is commonly used for database indexing.
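A minimal sketch of the idea, using separate chaining to handle collisions. The class name `HashTable`, its method names and the bucket count are assumptions for the illustration; in practice Python's built-in `dict` already provides this.

```python
class HashTable:
    """Toy hash table with separate chaining."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # The hash of the key decides which bucket holds it.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default


table = HashTable()
table.put("user_42", "row 1017")
table.put("user_42", "row 2048")   # same key, value replaced
table.put("user_7", "row 3")
```

Looking up `"user_42"` only scans the one bucket its hash points to, which is what makes hash-based indexes fast.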
194. Mention any one of the data visualization tools that you are familiar with.
Answer: This is a question where you have to be honest, and sharing your personal experience with these tools is important. Examples of data visualization tools are Tableau, Plot.ly and matplotlib.
195. What is your opinion on our current data process?
Answer: Listen carefully to the use case and reply in a constructive and insightful manner.
196. What was the last book or paper on Machine Learning that you read?
Answer: This type of question is asked to see whether the candidate has a keen interest in learning and is up to date with current market standards. Every candidate should be prepared for it, so it is vital to keep reading the latest publications.
197. What is your favourite use case for machine learning models?
Answer: The decision tree is a common favourite among machine learning models.
198. Is rotation necessary in PCA?
Answer: Yes. Rotation is necessary because it maximises the differences between the variances captured by the components.
199. What happens if the components are not rotated in PCA?
Answer: If the components are not rotated, the effect of PCA diminishes, and many more components are needed to explain the variance in the data set.
200. How are Recall and True Positive Rate related?
Answer: Recall and True Positive Rate are the same quantity:
True Positive Rate = Recall = TP / (TP + FN)
201. Assume that you are working on a data set. Explain how you would select important variables.
Answer: The following methods can be used to select important variables:
1. Use the Lasso regression method.
2. Use a Random Forest.
3. Plot a variable importance chart.
4. Use linear regression.
202. Explain how we can capture the correlation between a continuous and a categorical variable.
Answer: We can capture the correlation between a continuous and a categorical variable by using the ANCOVA technique, which stands for Analysis of Covariance.
It is used for calculating the association between continuous and categorical variables.
203. Explain the concept of machine learning to a 5-year-old.
Answer: Machine learning works the way babies learn their day-to-day activities, such as walking. Babies cannot walk straight away: they fall, get up and try again. Similarly, in machine learning the algorithm tries, checks the result and adjusts itself each time so that the end result is as good as possible.
204. What is the difference between Machine learning and Data Mining?
Answer: Data mining is about working on unstructured data and extracting from it interesting and previously unknown patterns.
Machine learning is concerned with the design and development of algorithms that give machines the ability to learn.
205. What is inductive machine learning?
Answer: Inductive machine learning is the process of learning general rules from observed examples.
206. Please state few popular Machine Learning algorithms?
Answer: Following are popular Machine learning algorithms:
3.Decision Trees etc
4.Support vector machines
207. What are the three stages to build the model in machine learning?
Answer: Following are three stages to build the model in Machine learning:
3.Applying the model
208. What are the advantages of Naive Bayes?
Answer: The advantages of Naive Bayes are:
• The classifier can converge more quickly than discriminative models
• It does not learn the interactions between features, which keeps the model simple but is also a limitation
209. What are the disadvantages of Naive Bayes?
Answer: Following are disadvantages of Naive Bayes:
• Problems arise with continuous features.
• It makes a very strong assumption about the shape of the data distribution.
• It suffers when data is scarce.
210. What are the conditions when Overfitting happens?
Answer: Overfitting is possible because the criterion used to train the model is not the same as the criterion used to judge the efficacy of the model.
211. What are the different use cases where machine learning algorithms can be used?
Answer: The following are different use cases where machine learning algorithms can be used:
2. Face detection
3. Natural language processing
4. Market Segmentation
5. Text Categorization
212. What are parametric models and Non-Parametric models?
Answer: Parametric models have a finite number of parameters. To predict new data, we only need to know the parameters of the model.
Non-parametric models have an unbounded number of parameters, which allows for more flexibility. To predict new data, we need to know the parameters of the model and the state of the data that has been observed.
213. What are the three stages to build the hypotheses or models in machine learning?
Answer: There are three stages for building the hypotheses or model in machine learning:
1. Model building
2. Model testing
3. Applying the model
214. What are the advantages of neural networks?
Answer: Neural networks have led to performance breakthroughs on unstructured data sets such as images, audio and video. Their flexibility allows them to learn patterns that other machine learning algorithms cannot.
215. What are the disadvantages of neural networks?
Answer: Neural networks require a large amount of training data to converge. It is also very difficult to select the correct architecture, and the internal “hidden” layers are hard to interpret.
216. What are the steps involved in a Machine Learning project?
Answer: There are several important steps to follow to achieve a good working model: data collection, data preparation, choosing a machine learning model, training the model, model evaluation, parameter tuning and, lastly, prediction.
217. What makes CNNs translation invariant?
Answer: Each convolution kernel acts as its own feature detector. In object detection, for example, it does not matter where the object is located in the image, because the convolution is applied in a sliding-window manner across the entire image.
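A one-dimensional sketch of the sliding-window idea: the same kernel is slid over two signals that contain the same pattern at different positions. The helper name `correlate1d`, the kernel and the signals are made up, and the operation shown is cross-correlation, which is what deep learning libraries conventionally call convolution.

```python
def correlate1d(signal, kernel):
    """Slide the kernel across the signal and record each dot product."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


kernel = [1, 2, 1]                       # detector for the pattern 1, 2, 1
pattern_left = [1, 2, 1, 0, 0, 0, 0]
pattern_right = [0, 0, 0, 0, 1, 2, 1]

resp_left = correlate1d(pattern_left, kernel)
resp_right = correlate1d(pattern_right, kernel)

# The peak moves with the pattern, but its height does not change.
peak_left = resp_left.index(max(resp_left))
peak_right = resp_right.index(max(resp_right))
```

The peak response is identical for both signals and only its position changes; taking the maximum over positions, as a pooling layer does, then gives the same output wherever the pattern sits.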
218. What is the significance of Residual Networks?
Answer: The main significance of Residual Networks is that they allow direct access to features from previous layers, which lets information propagate quickly through the entire network. Local skip connections also give the network a multi-path structure, so features can propagate along different paths through the complete network.
219. Explain false negative, false positive, true negative and true positive with an example.
Answer: True Positive: The alarm goes off and there is a fire.
The system predicted fire (positive), and the prediction is true.
False Positive: The alarm goes off but there is no fire.
The system predicted fire (positive), and the prediction is false.
False Negative: The alarm does not ring but there is a fire.
The system predicted no fire (negative), and the prediction is false because there was a fire.
True Negative: The alarm does not ring and there is no fire.
The system predicted no fire (negative), and the prediction is true.
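The four outcomes can be counted mechanically. The sketch below uses the fire alarm example with five made-up observations; the function name `confusion_counts` is an assumption for the illustration.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN and TN for paired boolean observations."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for fire, alarm in zip(actual, predicted):
        if fire and alarm:
            counts["TP"] += 1      # alarm rang, there was a fire
        elif alarm:
            counts["FP"] += 1      # alarm rang, no fire
        elif fire:
            counts["FN"] += 1      # alarm silent, there was a fire
        else:
            counts["TN"] += 1      # alarm silent, no fire
    return counts


fire = [True, False, True, False, False]     # what actually happened
alarm = [True, True, False, False, False]    # what the alarm predicted
counts = confusion_counts(fire, alarm)

# Recall (true positive rate): share of real fires the alarm caught.
recall = counts["TP"] / (counts["TP"] + counts["FN"])
```

Here the alarm caught one of the two real fires, so recall is 0.5.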
Machine Learning needs no introduction to the tech community these days. It is one of the most in-demand technologies and is generating a great deal of excitement. Machine learning is an application of Artificial Intelligence that gives systems the ability to learn automatically and improve from the experience gained through earlier occurrences.
From programmers to business owners, everyone associated with technology is excited about AI and Machine Learning because of the results they can produce. There is no limit to the imagination and expectations surrounding these technologies, and many believe that the future will be ruled by AI. Hence, the need for Machine Learning expertise is expected to grow exponentially in the coming years.
If you are aspiring to a career in Machine Learning, we suggest that you go through the 219 frequently asked Machine Learning interview questions above to gain an advantage in your next job interview.