## Top 166 Data Science Questions and Answers for Job Interview

**1. What does the term ‘Statistics’ mean?**

**Answer:** ‘Statistics’ is a branch of mathematics connected to the collection, analysis, interpretation, and presentation of a huge amount of numerical data.

**2. What are the different types of ‘Statistics’?**

**Answer:** There are two types of ‘Statistics’:

• Descriptive ‘Statistics’

• Inferential ‘Statistics’

**3. What do you mean by ‘Descriptive Statistics’?**

**Answer:** ‘Descriptive Statistics’ organizes data and summarizes its main characteristics. It describes a data set numerically and graphically, using measures such as the mean, mode, standard deviation, and correlation.

**4. What do you mean by ‘Inferential Statistics’?**

**Answer:** ‘Inferential Statistics’ uses data from a sample, together with probability theory, to draw conclusions about the larger population from which the sample was taken.

**5. Tell us about the Mean value in ‘Statistics’.**

**Answer:** The Mean is the average value of the data set.

**6. Tell us about the Mode value in ‘Statistics’.**

**Answer:** The Mode value is the value that has been repeated the most in the data set.

**7. Tell us about the Median value in ‘Statistics’.**

**Answer:** The Median is the middle value of the data set once the values are sorted; for an even number of values, it is the average of the two middle values. It is less affected by extreme values than the mean.

**8. Tell us about Variance in ‘Statistics’.**

**Answer:** The Variance measures how far the values in a data set are spread out from the mean. It is the average of the squared differences between each value and the mean.

**9. Tell us about Standard Deviation in ‘Statistics’.**

**Answer:** Standard Deviation is the square root of the variance. It measures spread in the same units as the data.
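As a rough illustration of questions 5 to 9, the sketch below computes these measures with Python's standard `statistics` module. The `scores` list is made-up data, and using the population forms (`pvariance`, `pstdev`) is an assumption made here; `variance` and `stdev` would give the sample estimates instead.

```python
import statistics

# Hypothetical data, chosen only for illustration.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mean_value = statistics.mean(scores)            # arithmetic average
median_value = statistics.median(scores)        # middle value of the sorted data
mode_value = statistics.mode(scores)            # most frequent value
variance_value = statistics.pvariance(scores)   # mean squared deviation from the mean
std_value = statistics.pstdev(scores)           # square root of the variance

print(mean_value, median_value, mode_value, variance_value, std_value)
```

For this data the mean is 5, the median is 4.5, the mode is 4, the variance is 4 and the standard deviation is 2.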

**10. Tell us about the types of variables in ‘Statistics’.**

**Answer:** The 14 types of variables most commonly referred to in ‘Statistics’ are:

i. Categorical variable

ii. Confounding variable

iii. Continuous variable

iv. Control variable

v. Dependent variable

vi. Discrete variable

vii. Independent variable

viii. Nominal variable

ix. Ordinal variable

x. Qualitative variable

xi. Quantitative variable

xii. Random variables

xiii. Ratio variables

xiv. Ranked variables

**11. Tell us about the types of distributions in ‘Statistics’.**

**Answer:** Six distributions commonly used in ‘Statistics’ are:

i. Bernoulli Distribution

ii. Uniform Distribution

iii. Binomial Distribution

iv. Normal Distribution

v. Poisson Distribution

vi. Exponential Distribution

**12. What do you mean by ‘Normal distribution’?**

**Answer:** A ‘Normal distribution’ is a symmetric, bell-shaped distribution in which the values of mean, mode and median are equal. Many statistical methods assume that the data is approximately normally distributed.

**13. What do you mean by ‘Standard Normal distribution’?**

**Answer:** In a ‘Standard Normal distribution’, the mean value is 0 and the standard deviation is 1.

**14. What do you mean by ‘Binomial Distribution’?**

**Answer:** A ‘Binomial Distribution’ describes the number of successes in a fixed number of independent trials, each of which has only two possible outcomes: success or failure. The probability of success is the same for every trial.
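A minimal sketch of the binomial probability formula, P(k successes) = C(n, k) · p^k · (1 − p)^(n − k), is shown here. The function name `binomial_pmf` and the coin-flip numbers are illustrative assumptions.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical example: 10 flips of a fair coin.
probs = [binomial_pmf(k, 10, 0.5) for k in range(11)]

print(probs[5])    # probability of exactly 5 heads
print(sum(probs))  # the probabilities over all k add up to 1
```

With `n = 1` the same function gives the probabilities of a single success/failure trial.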

**15. What is ‘Bernoulli Distribution’?**

**Answer:** A ‘Bernoulli Distribution’ describes a single trial that has only two possible outcomes, i.e., success and failure. It is the special case of the binomial distribution with one trial.

**16. What is ‘Poisson Distribution’?**

**Answer:** The number of events in an interval follows a ‘Poisson Distribution’ when the following assumptions are true:

• Any successful event should not impact the outcome of another successful event.

• The probability of success in an interval should be proportional to the length of the interval, i.e., the average rate of events is constant.

• The probability of success in any interval should approach zero as the interval becomes smaller.

**17. What do you mean by ‘Central Limit Theorem’ in ‘Statistics’?**

**Answer:** The ‘Central Limit Theorem’ in ‘Statistics’ describes the distribution of sample means:

• The mean of the sample means is close to the mean of the population.

• The standard deviation of the sampling distribution equals the population standard deviation divided by the square root of the sample size N. This standard deviation is also known as the ‘Standard Error of the Mean’.

• Even if the population is not normally distributed, the sampling distribution of the sample mean can be treated as approximately normal when the sample size is large enough; a common rule of thumb is a sample size greater than 30.
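These rules can be checked with a small simulation. The sketch below assumes an exponential population with mean 1 and standard deviation 1, which is clearly not normal; the sample size, the number of samples and the seed are arbitrary choices made for illustration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 50     # N, above the usual rule of thumb of 30
NUM_SAMPLES = 2000   # number of samples drawn from the population

def sample_mean(n: int) -> float:
    """Mean of n draws from an exponential population (mean 1, std 1)."""
    return statistics.fmean([random.expovariate(1.0) for _ in range(n)])

sample_means = [sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.fmean(sample_means)  # should be near the population mean
observed_se = statistics.stdev(sample_means)    # spread of the sample means
expected_se = 1.0 / math.sqrt(SAMPLE_SIZE)      # sigma / sqrt(N)

print(mean_of_means, observed_se, expected_se)
```

The observed spread of the sample means should land close to `expected_se`, and a histogram of `sample_means` would look roughly bell-shaped.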

**18. What do you understand by the P-value? How is it useful in ‘Statistics’?**

**Answer:** The P-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed. It is compared with a chosen significance level, commonly 0.05.

• If p<=0.05, it indicates strong evidence against the null hypothesis. This means the null hypothesis can be rejected at the 5% significance level.

• If p>0.05, it indicates weak evidence against the null hypothesis. Hence, we cannot reject the null hypothesis.
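As a hedged example, the sketch below turns a test statistic into a two-sided P-value and applies the 0.05 rule. It assumes the statistic follows a standard normal distribution under the null hypothesis (a z-test); the function name and the sample z values are illustrative.

```python
import math

def two_sided_p_value(z: float) -> float:
    """P(|Z| >= |z|) for a standard normal test statistic Z."""
    return math.erfc(abs(z) / math.sqrt(2))

ALPHA = 0.05  # conventional significance level

for z in (1.0, 1.96, 3.0):
    p = two_sided_p_value(z)
    decision = "reject H0" if p <= ALPHA else "fail to reject H0"
    print(f"z = {z:.2f}  p = {p:.4f}  {decision}")
```

Other tests (t-test, chi-square, and so on) use a different reference distribution, but the decision rule stays the same.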

**19. What do you mean by the Z-value or the Z-score? How is it useful in ‘Statistics’?**

**Answer:** The Z-score, also called the standard score, indicates how many standard deviations a value lies from the mean.

Formula for Z-score is: z = (X – μ) / σ

Some characteristics of the Z-score are:

• Very useful in statistical testing.

• Usually between -3 and 3; for normally distributed data, about 99.7% of values fall in this range.

• Useful to find outliers in large volumes of data.
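The formula can be applied directly to a list of numbers. In this sketch the data is invented, the population standard deviation is used for σ, and the cut-off of 3 for flagging outliers is a common convention rather than a fixed rule.

```python
import statistics

def z_scores(values):
    """Return (x - mean) / std for every value, using the population std."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

# Hypothetical measurements with one suspicious value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 12, 50]

scores = z_scores(data)
outliers = [x for x, z in zip(data, scores) if abs(z) > 3]

print(outliers)
```

Here only the value 50 lies more than three standard deviations from the mean.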

**20. What do you mean by T-Score? How is it useful in ‘Statistics’?**

**Answer:** The T-Score is the ratio of the difference between two groups to the difference within the groups. The larger the T-score, the greater the difference between the groups; the smaller the T-score, the greater the similarity between them. The T-score is typically used when the sample size is less than 30. It is also used in statistical testing.

**21. What do you know about the IQR and its use in ‘Statistics’?**

**Answer:** IQR is the abbreviation for Interquartile Range. It is the difference between the 75th and the 25th percentiles, i.e., between the upper and the lower quartiles. The IQR is also known as the Midspread or Middle 50%.

Formula: IQR = Q3-Q1

**22. What do you mean by ‘Hypothesis Testing’?**

**Answer:** ‘Hypothesis Testing’ is a statistical method used to make decisions from experimental data. It starts from an assumption about a population parameter and uses sample data to decide whether that assumption should be rejected.

**23. What are the different types of ‘Hypothesis Testing’?**

**Answer:** ‘Hypothesis Testing’ in ‘Statistics’ involves two competing hypotheses:

• Null Hypothesis

• Alternative Hypothesis

**24. What do you mean by Type 1 error?**

**Answer:** The Type 1 Error is known as the FP – False Positive. In ‘Statistics’, the type 1 error is used to indicate the rejection of a true null hypothesis.

**25. What do you mean by Type 2 error?**

**Answer:** The Type 2 Error is known as the FN – False Negative. In ‘Statistics’, the type 2 error indicates that a false null hypothesis has not been rejected, i.e., the test failed to detect an effect that is actually present.

**26. What do you mean by the term ‘population’ in ‘Statistics’?**

**Answer:** A ‘Population’ is a distinct group of people or things that can easily be identified by a minimum of one common characteristic for the purposes of data collection and analysis.

**27. What do you mean by the term ‘Sampling’?**

**Answer:** ‘Sampling’ is the process used in statistical analysis in which a predetermined number of observations is taken from a larger population.

**28. What are the different types of sampling techniques?**

**Answer:** There are two types of sampling:

1. PROBABILITY SAMPLING

• Simple Random Sampling

• Stratified Random Sampling

• Systematic Sampling

• Cluster Sampling

• Multi-stage Sampling

2. NON-PROBABILITY SAMPLING

• Purposive Sampling

• Convenience Sampling

• Snow-ball Sampling

• Quota Sampling

**29. What do you understand by ‘Sample Bias’?**

**Answer:** The ‘Sample bias’ is a type of bias that is caused due to the selection of non-random data for statistical analysis.

**30. What do you understand by ‘Selection Bias’?**

**Answer:** The ‘Selection bias’ is a type of error with the sampling. It arises when we have a selection for analysis that has not been properly randomized.

**31. Define the terms Univariate, Bivariate, and Multivariate Analysis.**

**Answer:** Univariate analysis is used for data with a single variable. Bivariate analysis is used for data comprising two variables. Multivariate analysis is used when working with more than two variables.

**32. Define ‘Data Science’.**

**Answer:** ‘Data Science’ is the study of data: where it comes from, what it represents, and how it can be used efficiently to obtain something meaningful. Data science also involves the collection of both structured and unstructured data in order to identify patterns. These patterns help to reduce cost, increase the efficiency of the system, and identify new market opportunities.

**33. Define ‘Machine Learning’.**

**Answer:** Machine Learning can be defined as the scientific study of algorithms and statistical models that computer systems use to gradually improve their performance on a certain task.

**34. Define ‘Deep Learning’.**

**Answer:** Deep Learning can be defined as a subfield of Machine Learning that is specifically concerned with algorithms based on artificial neural networks with multiple layers.

**35. Define ‘Supervised Learning’.**

**Answer:** In supervised learning, the data is labeled and the algorithms are designed in such a way that the computer learns from the labelled data to predict the output. It is a branch of ‘Machine Learning’.

**36. Define ‘Unsupervised Learning’.**

**Answer:** Unsupervised learning is also a branch of ‘Machine Learning’. In this system, the algorithms learn from data that has not been labeled, classified or categorized.

**37. Define ‘Reinforcement Learning’.**

**Answer:** Reinforcement learning is another area of ‘Machine Learning’. It is concerned with how software agents should take actions in an environment in order to maximize a cumulative reward.

**38. Define ‘Transfer learning’.**

**Answer:** Transfer learning uses knowledge gained while solving one problem and applies the same to a different but closely related problem.

**39. Define ‘Regression’.**

**Answer:** In ‘Statistics’, regression is a measure of the relation between the mean value of one variable and the corresponding values of other variables.

**40. Define ‘Classification’.**

**Answer:** In both ‘Machine Learning’ and ‘Statistics’, classification is defined as the act of identification of a new observation and assigning it to a set of categories. Classification is done based on a training set of data containing predetermined observations with known categories.

**41. Define ‘Clustering’.**

**Answer:** ‘Clustering’ is the common term for ‘Cluster Analysis’. It is the process of grouping a set of objects so that similar objects end up in the same group (cluster).

**42. Define ‘Bias’.**

**Answer:** Bias is defined as the difference between the average prediction of any model and the correct value that has to be predicted.

**43. Define ‘Variance’.**

**Answer:** Variance is defined as the variability of the model prediction for a given data point, i.e., how much the prediction would change if the model were trained on a different data set.

**44. Define EDA.**

**Answer:** EDA stands for exploratory data analysis. In ‘Statistics’, EDA is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. EDA is primarily used for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

**45. Define the given terms: Overfitting, Underfitting and Trade-off.**

**Answer:**

• Overfitting – The model performs well on training data but does not perform well on test data.

• Underfitting – The model is unable to capture the patterns in the data, so it performs poorly even on training data.

• Trade-off – The balance maintained between bias and variance so that the model neither overfits nor underfits.

**46. Mention the steps in building a Machine learning model.**

**Answer:** The following steps are followed while building a Machine Learning model:

1. Problem Statement

2. Gathering Data

3. Data Preparation

4. EDA

5. Model Training

6. Validation

7. Performance Tuning

8. Model Deployment

**47. Define ‘Data Pre-processing’.**

**Answer:** ‘Data preprocessing’ is a pivotal step in the data mining process. Data-gathering systems are often loosely controlled, which results in out-of-range values, impossible data combinations, missing values, and so on. Preprocessing detects and corrects such problems before analysis.

**48. How can a user deal with missing data values?**

**Answer:** The treatment of missing data values depends upon the type of data. Mean or Median values are used as replacements depending on the type of data. However, if the number of missing points is negligible, the affected records can be removed entirely.
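A minimal sketch of mean and median replacement follows. Representing missing values as `None`, the helper name `impute_missing` and the `ages` data are all assumptions made for illustration.

```python
import statistics

def impute_missing(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Hypothetical column with two missing entries and one extreme value.
ages = [25, None, 31, 40, None, 100]

print(impute_missing(ages, "mean"))    # gaps filled with 49
print(impute_missing(ages, "median"))  # gaps filled with 35.5
```

The extreme value 100 pulls the mean up to 49, while the median stays at 35.5, which is why the median is often preferred for skewed data.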

**49. How can one find outliers in a data distribution?**

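**Answer:** Outliers in data can be identified by using box plot graphs. For large, roughly normal data, values whose z-score falls outside the range -3 to 3 are usually treated as outliers. With the IQR method, values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are treated as outliers.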
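The box plot rule can be sketched in a few lines: compute the quartiles, take IQR = Q3 - Q1, and flag anything more than 1.5 × IQR beyond the quartiles. The data is invented, and the "inclusive" quartile method of `statistics.quantiles` is one of several conventions, chosen here as an assumption.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (outliers, (lower_fence, upper_fence)) using the IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [x for x in values if x < lower or x > upper]
    return flagged, (lower, upper)

# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

flagged, fences = iqr_outliers(data)
print(flagged, fences)
```

For this data Q1 is 3.25 and Q3 is 7.75, so the fences are -3.5 and 14.5 and only 100 is flagged.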

**50. Define types of Regression algorithms.**

**Answer:** There are 7 types of regression algorithms in Machine Learning:

• Linear Regression

• Logistic Regression

• Polynomial Regression

• Stepwise Regression

• Ridge Regression

• Lasso Regression

• ElasticNet Regression

**51. Tell us something about KNN.**

**Answer:** KNN stands for K Nearest Neighbors. It is a supervised learning algorithm that predicts the label of a new data point from the labels of the K training points closest to it.

**52. How does one choose the correct K value in KNN?**

**Answer:** There is no exact formula for the K-value in the KNN algorithm. A common rule of thumb is to start with sqrt(n), where n is the number of data samples on which the algorithm will operate, and then tune K using validation data.
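The sketch below shows a bare-bones KNN classifier together with the sqrt(n) starting point for K. The points, the labels and the function name `knn_predict` are made up for illustration; a real project would normally rely on a library implementation and tune K on held-out data.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D training data with two classes.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9), (5, 5)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b", "a"]

k = round(math.sqrt(len(points)))  # sqrt(9) = 3 as a starting value

print(knn_predict(points, labels, (2, 3), k))
print(knn_predict(points, labels, (8, 7), k))
```

An odd K is often preferred for two-class problems because it avoids tied votes.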

**53. Define the different types of boosting algorithms.**

**Answer:** The different types of boosting algorithms are:

• AdaBoost

• Gradient Boosting

• XGBoost

• LogitBoost

• LPBoost

• TotalBoost

• BrownBoost

• MadaBoost

**54. Define K-Means.**

**Answer:** K-means is a type of Clustering and a form of unsupervised algorithm that is used to determine the best possible clusters from the data. This algorithm identifies groups within a data sample.

**55. How does one choose the value of k in the K-Means algorithm?**

**Answer:** A common way to choose the value of k in the K-Means algorithm is the elbow method: the within-cluster sum of squares is plotted against k, and the point where the curve bends (the elbow) indicates a suitable number of clusters.
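To make the elbow method concrete, here is a small pure-Python K-Means sketch that reports the within-cluster sum of squares (often called inertia) for several values of k. The data points and the deterministic initialisation are simplifying assumptions; library implementations use smarter, randomised initialisation.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns (centroids, inertia)."""
    # Deterministic start: evenly spaced points taken from the input order.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    inertia = sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)
    return centroids, inertia

# Hypothetical data forming three well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (1, 9), (2, 8)]

inertias = {k: kmeans(points, k)[1] for k in range(1, 6)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

The inertia falls sharply up to k = 3 and only slightly afterwards, so the elbow of the curve suggests three clusters for this data.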

**56. Mention the different types of Clustering Techniques.**

**Answer:** The different types of clustering techniques are:

• Partitioning methods

• Hierarchical clustering

• Fuzzy clustering

• Density-based clustering

• Model-based clustering

**57. Define PCA.**

**Answer:** PCA is the acronym for Principal component analysis. It is a statistical procedure in which an orthogonal transformation is used to alter a set of observations of correlated variables into a set of linearly uncorrelated variables. These altered variables are called principal components. This method reduces the dimensionality of data.

**58. Mention the types of metrics in Regression.**

**Answer:** The different types of metrics that are used in Regression are:

• RMSE – Root Mean Square Error

• MSE – Mean Square Error

• MAE – Mean Absolute Error

• R2 score
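The four metrics can be computed by hand as a sanity check. In this sketch the function name `regression_metrics` and the numbers are illustrative assumptions.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "R2": 1 - ss_res / ss_tot}

# Hypothetical true values and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

For these values the MSE is 1.5, the MAE is 1.0 and the R2 score is 0.7.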

**59. How can the user successfully improve the accuracy of any model?**

**Answer:** Any user can successfully improve the accuracy of any model by making use of the following:

• Feature selection

• Dimensionality reduction

• Ensemble methods (bagging and boosting algorithms)

• Hyperparameter tuning

**60. Mention the types of loss/cost function in machine learning.**

**Answer:** The types of loss/cost function in Classification are:

• log loss

• focal loss

• KL Divergence/Relative entropy

• Exponential loss

• Hinge Loss

The types of loss/cost function in Regression are:

• mean square error

• mean absolute error

• huber loss/ smooth mean absolute error

• log cosh loss

• quantile loss

**61. While building a model, which should be preferred: model performance or model accuracy?**

**Answer:** One should use model performance as means for building a model since model accuracy is a subset of the model performance.

**62. Mention the type of metrics in Classification.**

**Answer:**

• Confusion Matrix – a table of the TP, FP, TN and FN counts

• Accuracy score = (TP+TN)/(TP+TN+FP+FN)

• Recall (True positive rate) = TP/(TP+FN)

• Precision = TP/(TP+FP)

• F1 score = 2(precision*recall)/(precision+recall)

**63. Elaborate on the various data visualization methods with the help of different charts in Python.**

**Answer:** The following are the various methods of data visualization used in Python:

• Histogram

• Bar plots

• Line graph

• Pie Chart

• Scatter Plot

• Box plots

**64. Mention the best programming libraries of machine learning.**

**Answer:** The following are the best programming libraries for machine learning in Python:

• Pandas

• Scikit Learn

• Tensorflow

• Keras

• Pytorch

• Numpy

• Matplotlib

• Seaborn

**65. Mention the Machine Learning libraries in Python.**

**Answer:** The following Machine Learning libraries are available within Python:

• Numpy

• Pandas

• Scipy

• Scikit Learn

• Tensorflow

• Keras

• Pytorch

• Matplotlib

• Seaborn

**66. What is a Data Analyst’s role in an organization?**

**Answer:** The major responsibilities of any standard data analyst comprise the following:

• To understand the structure of data and other such sources concerning business

• To extract data from concerned sources efficiently within a proper time limit

• The identification, evaluation and implementation of services and tools from external sources to support data validation and cleaning

• To perform a thorough checking of data and solve any data issues for the sake of business

• Ensuring database security by developing an access system at the various user levels

• To analyze, identify as well as interpret different process trends or patterns based on complex data sets and similar trigger alerts

• To evaluate historical data and make relevant forecasts that might help to develop the business

• To develop and validate the different predictive models in order to improve business processes and locate all key growth strategies.

**67. Mention the necessary skill set that a data scientist should possess.**

**Answer:** Any standard data scientist is supposed to have the following skills in order to excel in any organization:

• Knowledge of Mathematics, majorly ‘Statistics’: A data scientist should be well versed in statistical concepts. Only a well-trained data scientist will be able to perform complex tasks with data.

• Programming skills: A good data scientist must be good at programming, especially in a scripting language such as Matlab or Python, a spreadsheet tool such as Excel, a statistical language such as SAS, R, or SPSS, and a querying language such as SQL, Hive, or Pig. Other useful computer skills include big data tools such as Spark and Hive HQL, and technologies such as JavaScript and XML.

• Logical Deduction: A good data scientist must be sharp enough to identify any anomalies quickly and design strategies from trends visible in the data.

• Domain knowledge: A good data scientist should also possess sound knowledge of the domain.

**68. What are the steps that must be followed in an analytics project?**

**Answer:** The following steps must be followed in an analytics project:

• Defining the objective function perfectly

• Identifying the key sources of data required for analysis

• Preparation and cleaning of data

• Modelling of data

• Validation of the model created

• Implementation and tracking through deployment and constant monitoring of the results

**69. What do you mean by Data Cleansing/Cleaning?**

**Answer:** Data Cleansing/Cleaning is the process of detection and correction of corrupt or wrong records from a record set, table, or database. It also refers to the identification of incomplete, irrelevant, wrong, and inconsistent parts of the record set, table, or database data. All this data is then replaced, modified, or deleted. Data cleaning also means the identification of data anomalies that cannot be represented consistently by one model in model development.

**70. Mention some of the best practices followed during data cleaning.**

**Answer:** These are some of the best practices for data cleaning:

• Treatment of missing value

• To understand the range, mean, median and plot a normal curve

• Identification of outliers in data and treatment of these outliers

**71. Define ‘Logistic Regression’.**

**Answer:** Logistic regression can be defined as a statistical method for the examination of a dataset comprising one or more independent variables that are used to define an outcome.

**72. What are the best tools useful for analysis of data?**

**Answer:** The following are the best tools useful for analysis of data:

• NodeXL

• KNIME

• Solver

• R Programming

• SAS

• Weka

• Apache Spark

• Orange

• Import.io

• Talend

• RapidMiner

• OpenRefine

• Tableau

• Google Search Operators

• Google Fusion Tables

• Wolfram Alpha

• Pentaho

**73. Differentiate between data profiling and data mining.**

**Answer:**

• Data profiling is the process of analyzing data from an existing information source like a database and collecting informative summaries about the same data. It may be information pertaining to various attributes like discrete value, value range etc.

• Data mining is the process of discovering patterns in large data sets using methods at the intersection of machine learning, statistics, and database systems. It focuses on cluster analysis, dependencies, sequence discovery, detection of unusual records, and so on.

**74. Mention some of the problems commonly faced by data analysts.**

**Answer:** These are some of the problems that are faced by data analysts:

• Data storage and quality

• Identification of overlapping data

• Misspelling of data

• Duplication of data entries

• Representation of values in a varying manner

• Missing data values

• Presence of Illegal values

• Security and privacy of data

**75. Define Hadoop MapReduce.**

**Answer:** Hadoop MapReduce is the name of the programming framework developed by Apache that is used to process large data sets, for an application in any distributed computing environment.

**76. Mention some of the missing patterns that are observed generally.**

**Answer:** Some of the missing patterns that are observed generally are:

• Missing at random (MAR)

• Missing completely at random (MCAR)

• Not missing at random (NMAR)

• Missing depending on unobserved input variable

• Missing depending on the missing value itself

**77. Describe the KNN imputation method.**

**Answer:** In the KNN imputation method, missing attribute values are imputed using the values of that attribute from the records most similar to the record with the missing value. The similarity of two records is calculated using a distance function.

**78. Mention the data validation methods generally used by a data analyst.**

**Answer:** The following are the data validation methods generally used by a data analyst:

• Data screening

• Data verification

Some of these validation methods include:

• Allowed character checks

• Batch totals

• Cardinality check

• Consistency checks

• Control totals

• Cross-system consistency checks

• Data type checks

• File existence check

• Format or picture check

• Logic check

• Limit check

• Presence check

• Range check

• Referential integrity

• Spelling and grammar check

**79. Mention the steps that should be used by a data analyst when he/she confronts suspected or missing data.**

**Answer:** The steps that should be used by a data analyst when he/she confronts suspected or missing data:

1) To prepare a detailed validation report which is used to provide information for all missing data values.

2) The suspected data should be then analyzed to validate credibility.

3) To replace and assign a validation code to any invalid data value.

A data analyst should use the best analysis techniques like deletion method, model-based methods, single imputation methods, and so on, in order to replace the missing values.

**80. What are the steps to be followed by a data analyst while dealing with a multi-source problem?**

**Answer:** The steps to be followed by a data analyst while dealing with a multi-source problem are:

• To perform a schema integration through restructuring of schemas.

• To identify and merge similar records into a single record which will contain all relevant attributes without redundancy.

**81. Define an Outlier.**

**Answer:** An outlier is a value/observation that lies far from the other values in the sample and diverges from the overall pattern.

**82. Mention the various types of outliers.**

**Answer:** There are three different types of outliers:

• Collective outliers

• Contextual/conditional outliers

• Global outliers or point anomalies

**83. Define the Hierarchical Clustering Algorithm.**

**Answer:** The hierarchical clustering algorithm groups similar objects into groups known as clusters. It works by repeatedly merging or dividing the existing groups, creating a hierarchical structure that represents the order in which the groups are merged or divided.

**84. What do you mean by the time series analysis?**

**Answer:** Time Series Analysis is a statistical technique for working with time series data and trend analysis. It is used for forecasting the output of a process by analyzing previous data with statistical methods such as exponential smoothing, the log-linear regression method, etc.

**85. What are the different statistical methods that are very useful for a data analyst?**

**Answer:** The different statistical methods that are very useful for a data analyst are:

• Markov process

• Spatial and cluster processes

• Imputation techniques, etc.

• Mathematical optimization

• Bayesian method

• Simplex algorithm

• Rank statistics, percentiles, outlier detection

**86. Tell us about the K-means algorithm.**

**Answer:** The K-means algorithm is a data partitioning (clustering) method. It partitions a given data set into k clusters, where k is chosen in advance. The algorithm assumes that the clusters are roughly spherical, so that the data points in a cluster are centered on that cluster, and that the variance or spread of the clusters is similar. Each data point belongs to the closest cluster.

**87. What do you mean by collaborative filtering?**

**Answer:** Collaborative filtering is an algorithm used to build recommendation systems based on user behavioral data. It recommends items to a user based on the preferences of other users with similar behavior, and is commonly used by large websites. Recommendation pop-ups based on a user’s browsing history are a typical example.

**88. Define MapReduce.**

**Answer:** MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. MapReduce splits big data sets into subsets, processes each subset on a different server, and then combines the results obtained on each.
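The idea can be imitated in a single Python process with the classic word-count example. This is only a sketch of the programming model: the function names and the text chunks are invented, and a real MapReduce framework would run the map and reduce phases on different machines.

```python
from collections import defaultdict
from functools import reduce

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Combine the values of each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}

# Hypothetical input, already split into subsets (one per imaginary server).
chunks = ["big data big insight", "data beats opinion", "big data"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Each chunk is mapped independently, which is what allows the work to be spread across servers before the partial results are combined.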

**89. Tell us about the Correlogram analysis.**

**Answer:** A Correlogram is the visual inspection of correlation statistics. It is a graph that is generally used to interpret a set of autocorrelation coefficients. It is a commonly used tool for checking the randomness in a given data set.

**90. What do you know about n-gram?**

**Answer:** An N-gram is a contiguous sequence of n tokens (usually words, characters or subsets of characters). An N-gram model is a probabilistic language model that predicts the next item in a sequence from the previous (n-1) items.
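A short sketch of both ideas, extracting N-grams and using bigram counts to guess the next word, is given here. The toy corpus and the helper names are assumptions; a practical language model would need far more data and some form of smoothing.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def train_bigram_model(tokens):
    """Count which word follows which: the n = 2 case, using 1 word of context."""
    following = defaultdict(Counter)
    for previous, nxt in ngrams(tokens, 2):
        following[previous][nxt] += 1
    return following

bigrams = ngrams("the quick brown fox".split(), 2)
print(bigrams)

# Hypothetical toy corpus.
corpus = "the cat sat on the mat the cat ran".split()
model = train_bigram_model(corpus)
most_likely = model["the"].most_common(1)[0][0]
print(most_likely)
```

In the toy corpus "the" is followed by "cat" twice and by "mat" once, so the model predicts "cat".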

**91. Tell us about the imputation process. What are the different types of imputation techniques?**

**Answer:** The imputation process is a method that is used to replace missing data elements with substituted values. There are two types of imputation:

• Multiple imputations

• Single imputation

The Sub-types of single imputation comprise the following:

• Hot-deck imputation

• Cold deck imputation

• Mean imputation

• Regression imputation

• Stochastic regression

**92. Tell us about the Logistic Regression.**

**Answer:** Logistic regression is one of the statistical methods used by data analysts for examining a dataset in which one or more independent variables are used to determine an outcome. It is used to estimate the probability of a binary outcome based on one or more predictor (independent) variables. This method is also used to quantify how the presence of a risk factor increases the odds of a given outcome by a specific factor.

**93. What do you know about a hash table collision? How can we prevent it?**

**Answer:** A hash table collision occurs when two or more different keys are hashed/mapped to the same slot of the table.

Collisions cannot be eliminated entirely, but they can be reduced by selecting the hash function very carefully. One approach is to create a set of hash functions and choose one at random for execution. Techniques such as Open Addressing are used to resolve the collisions that still occur.
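As an illustration of Open Addressing, the sketch below uses linear probing: when the home slot of a key is taken, the next free slot is used instead. The class name, the capacity and the keys are made up, and the table deliberately omits resizing and deletion. It also relies on CPython hashing small integers to themselves.

```python
class LinearProbingTable:
    """Tiny open-addressing hash table (no resizing, no deletion)."""

    def __init__(self, capacity=8):
        self.slots = [None] * capacity

    def _probe(self, key):
        """Yield slot indices starting at the key's home slot, wrapping around."""
        start = hash(key) % len(self.slots)
        for offset in range(len(self.slots)):
            yield (start + offset) % len(self.slots)

    def put(self, key, value):
        """Store the pair in the first free (or matching) slot; return its index."""
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return i
        raise RuntimeError("table is full")

    def get(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:
                break
            if self.slots[i][0] == key:
                return self.slots[i][1]
        raise KeyError(key)

table = LinearProbingTable(capacity=8)
slot_a = table.put(3, "three")    # home slot is 3
slot_b = table.put(11, "eleven")  # 11 % 8 is also 3, so the next slot is used

print(slot_a, slot_b, table.get(11))
```

Both keys remain retrievable even though they share the same home slot.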

**94. Differentiate between supervised and unsupervised machine learning.**

**Answer:** Supervised machine learning requires labeled training data, while unsupervised machine learning doesn’t require labeled data.

**95. Define bias and variance trade-off.**

**Answer:**

• Bias: Bias is an error introduced in any ML model due to over-simplification of the machine learning algorithm. Bias can lead to underfitting. When a model is trained, simplifying assumptions are made to make the target function easier to learn, and these assumptions are the source of bias. Low-bias machine learning algorithms include Decision Trees, k-NN and SVM, while high-bias machine learning algorithms include Linear Regression and Logistic Regression.

• Variance: Variance is an error introduced in an ML model due to an overly complex machine learning algorithm. The model learns noise from the training dataset and performs unsatisfactorily on the test dataset. Variance can lead to high sensitivity and overfitting. As the complexity of the model increases, the error initially falls because the bias becomes lower. This, however, only happens up to a particular point. As the model becomes more complex still, it overfits and starts suffering from high variance.

Bias-Variance trade off: The final goal of the supervised machine learning algorithm is to have low bias and low variance in order to achieve good prediction performance:

1. The k-nearest neighbor algorithm has low bias and high variance. This trade-off can be changed by increasing the value of k, which increases the number of neighbors that contribute to the prediction and hence increases the bias of the model.

2. The support vector machine algorithm has low bias and high variance. This trade-off can be changed by allowing more violations of the margin in the training data, which increases the bias but decreases the variance. In formulations where C is the budget for margin violations, this means increasing C; in libraries where C is a penalty on violations (for example scikit-learn), it means decreasing C.

In Machine Learning algorithms, increasing the bias decreases the variance and vice versa.

**96. What do you know about the exploding gradients?**

**Answer:** Exploding gradients are a problem where large error gradients accumulate and result in very large updates to the neural network model weights during training. The weight values can become so large that they overflow and result in NaN values. This makes the model unstable and unable to learn from the training dataset. Here, the gradient is the direction and magnitude calculated during the training of a neural network, which is used to update the network weights in the right direction and by the right amount.

**97. What do you know about a confusion matrix?**

**Answer:** The confusion matrix is a 2×2 table that comprises the 4 outcomes produced by a binary classifier. Error rate, accuracy, specificity, sensitivity, precision and recall are some of the measures derived from it. The dataset used for performance evaluation is known as the test dataset, and it should contain both the correct (observed) labels and the predicted labels.

If the performance of a binary classifier were perfect, the predicted labels would be exactly the same as the observed labels. In practice, the predicted labels match only part of the observed labels. A binary classifier predicts every data instance of a test dataset as either positive or negative. The following outcomes are hence possible:

1. True positive(TP) – Correct positive prediction

2. False positive(FP) – Incorrect positive prediction

3. True negative(TN) – Correct negative prediction

4. False negative(FN) – Incorrect negative prediction

The basic measures derived from the confusion matrix are listed below, where P = TP+FN is the number of actual positives and N = TN+FP is the number of actual negatives:

1. Error Rate = (FP+FN)/(P+N)

2. Accuracy = (TP+TN)/(P+N)

3. Sensitivity(Recall or True positive rate) = TP/P

4. Specificity(True negative rate) = TN/N

5. Precision(Positive predicted value) = TP/(TP+FP)

6. F-Score (weighted harmonic mean of precision and recall) = (1+b^2)(PREC·REC)/(b^2·PREC+REC), where b is commonly 0.5, 1 or 2.
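The measures above can be computed directly from the four counts. The following is a minimal sketch in plain Python; the function names and the toy label lists are illustrative assumptions, labels are assumed to be encoded as 1 (positive) and 0 (negative), and the F-score uses the usual (1 + b²) weighting.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(actual, predicted, beta=1.0):
    """Derive the basic measures from the confusion matrix counts."""
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    p, n = tp + fn, tn + fp  # actual positives and actual negatives
    precision = tp / (tp + fp)
    recall = tp / p
    return {
        "error_rate": (fp + fn) / (p + n),
        "accuracy": (tp + tn) / (p + n),
        "sensitivity": recall,
        "specificity": tn / n,
        "precision": precision,
        "f_score": (1 + beta**2) * precision * recall
        / (beta**2 * precision + recall),
    }


# Toy data (made up for illustration): 4 actual positives, 6 actual negatives.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(actual, predicted)
print(metrics)
```

The sketch does not guard against empty classes (for example, no predicted positives), which would cause a division by zero.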

**98. How does a ROC curve work?**

**Answer:** The ROC curve is a graphical illustration of the contrast between the true positive rate and the false positive rate at a number of thresholds. It is often used as a proxy for the trade-off between sensitivity (the true positive rate) and the false positive rate.
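As a rough illustration, the points of a ROC curve can be obtained by sweeping a threshold over the classifier scores and recording the false positive rate and true positive rate at each step. The function name `roc_points` and the scores below are made up for this sketch.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical labels (1 = positive) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold can only add predicted positives, so both rates move from (0, 0) towards (1, 1).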

**99. Tell us about the SVM machine learning algorithm in detail.**

**Answer:** SVM is an acronym for Support Vector Machine. An SVM is a supervised machine learning algorithm that can be used for both Regression and Classification. If there are n features in the training dataset, SVM plots each data point in an n-dimensional space, with the value of each feature being the value of a particular coordinate. Hyperplanes are then used to separate the different classes based on the provided kernel function.

**100. Tell us about the different kernel functions in SVM.**

**Answer:** The different kernel functions in SVM:

• Linear Kernel

• Polynomial kernel

• Radial basis kernel

• Sigmoid kernel

**101. How does the Decision Tree algorithm work?**

**Answer:** The Decision Tree Algorithm is a supervised machine learning algorithm that is used for both Regression and Classification. The algorithm breaks a dataset down into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. Such trees can handle categorical as well as numerical data.

**102. Define Entropy and Information gain in the Decision tree algorithm.**

**Answer:** The core algorithm for building a decision tree is known as ID3, which uses Entropy and Information Gain to construct the tree:

• Information Gain: This is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding, at each step, the attribute that returns the highest information gain.

• Entropy: A decision tree is built top-down from a root node and involves the partitioning of data into homogeneous subsets. ID3 uses Entropy to check the homogeneity of a sample. If the sample is completely homogeneous, then the entropy is zero. If the sample is equally divided between two classes, it has an entropy of one.
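The two quantities can be sketched in a few lines of Python. This is a minimal illustration using Shannon entropy with base-2 logarithms, not a full ID3 implementation; the helper names and the tiny dataset are assumptions made for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Decrease in entropy after splitting the rows on one attribute."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# Made-up data: "outlook" separates the classes perfectly, "windy" not at all.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["no", "no", "yes", "yes"]

print(entropy(labels))                            # equally divided sample
print(information_gain(rows, labels, "outlook"))  # perfect split
print(information_gain(rows, labels, "windy"))    # uninformative split
```

In this toy case an ID3-style learner would split on `outlook`, since it yields the larger gain.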

**103. Tell us about pruning in a Decision Tree.**

**Answer:** The process of removing sub-nodes of a decision node is called pruning. It is the opposite of splitting.

**104. What do you know about Ensemble Learning?**

**Answer:** Ensemble learning is the process of combining a diverse set of learners (individual models) to improve the stability and predictive power of the model. There are different types of ensemble learning; the two most popular techniques are:

• Bagging: Bagging trains similar learners on small sample populations and then takes the mean of all the predictions. In generalized bagging, one can use different learners on different populations, which helps to reduce the variance error.

• Boosting: Boosting is an iterative technique that adjusts the weight of an observation based on the last classification. If an observation is classified incorrectly, its weight is increased, and vice versa. Boosting generally decreases the bias error and builds strong predictive models. However, it might also overfit the training data.

**105. Tell us about Random Forest. Explain its working.**

**Answer:** Random Forest is a versatile machine learning method that can perform both regression and classification tasks. It is also used for dimensionality reduction and for treating missing values and outlier values. It is a type of ensemble learning method, where a group of weak models is combined to create a powerful model. In a Random Forest, multiple trees are grown instead of a single tree. To classify a new object based on its attributes, each tree provides a classification, and the forest chooses the classification with the most votes from all the trees. In the case of regression, it takes the average of the outputs of the different trees.

**106. Mention the cross-validation technique that should be used on a time series dataset by a data analyst.**

**Answer:** A time series is not randomly distributed data; it is inherently ordered chronologically. For time series data, the user can use a technique like forward chaining, where the model is trained on past data and then tested on forward-facing data. The folds are built as follows:

• fold 1: training[1], test[2]

• fold 2: training[1 2], test[3]

• fold 3: training[1 2 3], test[4]

• fold 4: training[1 2 3 4], test[5]
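The forward-chaining scheme can be generated programmatically. The sketch below assumes the series has already been cut into consecutive, chronologically ordered blocks numbered from 1; the function name is hypothetical.

```python
def forward_chaining_splits(n_blocks):
    """Yield (training_blocks, test_block) pairs using 1-based block numbers.

    Each fold trains on every block before the test block, so the model
    never sees data from the future.
    """
    for test_block in range(2, n_blocks + 1):
        yield list(range(1, test_block)), test_block


splits = list(forward_chaining_splits(5))
for fold, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold}: training{train}, test[{test}]")
```

With five blocks this produces four folds, each one extending the training window by one block.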

**107. Tell us about logistic regression. Also, provide an example when logistic regression needs to be used.**

**Answer:** Logistic Regression is also known as the logit model. It is a technique generally used to predict a binary outcome from a linear combination of predictor variables.

Consider a case where we need to predict whether a particular political leader will win an election or not. In this case, the outcome of the prediction is binary: 0 (Lose) or 1 (Win). The predictor variables here would be the amount of money spent on the candidate’s election campaign, the amount of time spent campaigning, and so on.
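To make the election example concrete, the sketch below passes a linear combination of two predictors through the logistic (sigmoid) function to obtain a win probability. The weights and bias are invented for illustration only; in practice they would be estimated from historical data.

```python
from math import exp


def predict_win_probability(spend_millions, campaign_days, weights, bias):
    """Logistic model: sigmoid of a linear combination of the predictors."""
    z = bias + weights[0] * spend_millions + weights[1] * campaign_days
    return 1.0 / (1.0 + exp(-z))


# Hypothetical coefficients, not fitted to any real data.
weights = (0.8, 0.02)
bias = -3.0

probability = predict_win_probability(2.5, 60, weights, bias)
outcome = 1 if probability >= 0.5 else 0  # 1 = Win, 0 = Lose
print(round(probability, 3), outcome)
```

The 0.5 cut-off used to turn the probability into a Win/Lose label is a common default, not a requirement of the model.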

**108. Define Normal Distribution.**

**Answer:** Data can be distributed in various ways, with a bias to the left or to the right. When data is distributed around a central value without any bias to the left or right, it takes the form of a normal distribution: the random variable is distributed as a symmetrical bell-shaped curve.

**109. Tell us about the Box-Cox Transformation.**

**Answer:** The dependent variable used in a regression analysis might not satisfy one or more assumptions of an ordinary least squares regression. The residuals could either curve as the prediction increases or follow a skewed distribution. In such scenarios, it is necessary to transform the response variable so that the data meets the required assumptions.

A Box-Cox transformation is a statistical technique used to transform non-normal dependent variables into a normal shape. Normality is an important assumption for a number of statistical techniques, so if the given data is not normal, applying a Box-Cox transformation allows the user to run a broader number of tests.

The Box-Cox transformation is named after the statisticians George Box and Sir David Roxbee Cox, who developed the technique in a 1964 paper.
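For reference, the one-parameter Box-Cox transformation maps a positive value y to (y^λ − 1)/λ when λ ≠ 0 and to log(y) when λ = 0. The sketch below applies it for a given λ; choosing λ (usually by maximum likelihood) is outside the scope of this example, and the function name is an assumption.

```python
from math import log


def box_cox(values, lam):
    """Apply the one-parameter Box-Cox transformation for a fixed lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [log(v) for v in values]
    return [(v**lam - 1) / lam for v in values]


skewed = [1.0, 4.0, 9.0]
print(box_cox(skewed, 0.5))  # square-root-like transform
print(box_cox(skewed, 0))    # log transform
```

With λ = 1 the data is only shifted by one, so the shape of the distribution is unchanged.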

**110. How does a data analyst define the number of clusters in a clustering algorithm?**

**Answer:** In the K-Means clustering algorithm the K defines the number of clusters.

The graph used to choose K is generally known as the Elbow Curve, which plots the within-cluster sum of squares against the number of clusters. The bending point of the curve is taken as K in K-Means. This is a widely used approach, but a few data scientists also use Hierarchical clustering first to create dendrograms and identify the distinct groups from there.
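The elbow idea can be sketched on one-dimensional data: run K-Means for several values of K and look at how the within-cluster sum of squares falls. The simplified K-Means below (deterministic initialisation, fixed number of iterations) and the toy data are assumptions made to keep the example short and reproducible.

```python
def kmeans_1d_inertia(values, k, iterations=20):
    """Run a simplified 1-D K-Means and return the within-cluster sum of squares."""
    data = sorted(values)
    n = len(data)
    # Deterministic initialisation: k evenly spaced points from the sorted data.
    centroids = [data[round(i * (n - 1) / max(k - 1, 1))] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
    return sum(min((x - c) ** 2 for c in centroids) for x in data)


# Toy data with three visually obvious groups.
data = [1, 2, 3, 10, 11, 12, 20, 21, 22]
inertias = [kmeans_1d_inertia(data, k) for k in range(1, 5)]
for k, inertia in enumerate(inertias, start=1):
    print(f"K={k}: within-cluster sum of squares = {inertia:.1f}")
```

On this data the sum of squares drops steeply up to K=3 and barely improves afterwards, which is the bend that the elbow method looks for.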

**111. Define ‘Deep Learning’.**

**Answer:** Deep learning is a subfield of machine learning inspired by the structure and function of the brain, in the form of Artificial Neural Networks. There are many algorithms under machine learning, such as linear regression, SVM, neural networks, etc., and Deep Learning is an extension of Neural Networks. Neural Networks already use hidden layers; deep learning algorithms use a greater number of hidden layers to better capture the relationship between input and output.

**112. Define Recurrent Neural Networks (RNNs).**

**Answer:** Recurrent Networks are a type of Artificial Neural Network designed to recognize patterns in sequences of data, such as time series, stock market data and data from government agencies.

To understand the working of Recurrent Networks, one must understand the basics of Feedforward Networks. Both of these networks are named after the way they channel information through a series of mathematical operations performed at the nodes of the network.

A Feedforward network feeds information straight through, never touching the same node twice. An RNN cycles information through a loop and is hence called recurrent. Recurrent networks take as input not only the current input but also what they have perceived from previous inputs. Hence, recurrent networks have two sources of input, the present and the recent past, which combine to determine how they respond to new data.

The error is returned through backpropagation and used to adjust the weights until the error cannot go any lower. The purpose of RNNs is to accurately classify sequential input.

Recurrent networks depend on an extension of backpropagation known as ‘Backpropagation through Time (BPTT)’.

**113. Differentiate between machine learning and deep learning.**

**Answer:** Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed. It can be divided into the following three categories:

1. Supervised machine learning,

2. Unsupervised machine learning,

3. Reinforcement learning

Deep Learning is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain, known as Artificial Neural Networks.

**114. What do you know about regularization? Why is regularization useful?**

**Answer:** Regularization is the process of adding a penalty term to a model in order to encourage smoothness and prevent overfitting. This is most often done by adding a constant multiple of a norm of the existing weight vector to the loss function. The norm is typically L1 (Lasso) or L2 (Ridge). The model is then trained to minimize the regularized loss function calculated on the training set.
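As a minimal sketch, the regularized loss for a linear model can be written as the mean squared error plus a constant `lam` times either the L1 or the L2 penalty on the weights. The function name, the squared-L2 form of the Ridge penalty and the toy numbers are assumptions for illustration.

```python
def regularized_loss(weights, features, targets, lam, penalty="l2"):
    """Mean squared error of a linear model plus an L1 or L2 weight penalty."""
    predictions = [sum(w * x for w, x in zip(weights, row)) for row in features]
    mse = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
    if penalty == "l1":    # Lasso: sum of absolute weights
        reg = sum(abs(w) for w in weights)
    elif penalty == "l2":  # Ridge: sum of squared weights
        reg = sum(w * w for w in weights)
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return mse + lam * reg


# Toy example in which the weights fit the data exactly, so only the penalty remains.
weights = [1.0, -2.0]
features = [[1.0, 0.0], [0.0, 1.0]]
targets = [1.0, -2.0]

print(regularized_loss(weights, features, targets, lam=0.1, penalty="l2"))
print(regularized_loss(weights, features, targets, lam=0.1, penalty="l1"))
```

A larger `lam` pushes the optimizer towards smaller weights, trading a slightly worse fit on the training data for a simpler model.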

**115. What do you know about TF/IDF vectorization?**

**Answer:** TF/IDF is an acronym for Term Frequency/Inverse Document Frequency. It is a numerical statistic that shows how important a word is to a document in a collection or corpus. TF/IDF is often used as a weighting factor in information retrieval and text mining. The TF/IDF value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
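A bare-bones version of the calculation is sketched below. It uses one common textbook variant (relative term frequency multiplied by the natural log of N divided by the document frequency); libraries typically apply smoothed or normalized variants, so their numbers will differ. The whitespace tokenizer and the sample sentences are assumptions.

```python
from math import log


def tf_idf(documents):
    """Return one {term: score} dict per document using plain TF * IDF."""
    n_docs = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents does each term appear?
    doc_freq = {}
    for tokens in tokenized:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    scores = []
    for tokens in tokenized:
        doc_scores = {}
        for term in set(tokens):
            tf = tokens.count(term) / len(tokens)
            idf = log(n_docs / doc_freq[term])
            doc_scores[term] = tf * idf
        scores.append(doc_scores)
    return scores


documents = ["the cat sat", "the dog sat", "the cat ran"]
scores = tf_idf(documents)
for doc, doc_scores in zip(documents, scores):
    print(doc, {t: round(s, 3) for t, s in sorted(doc_scores.items())})
```

Here "the" appears in every document and scores zero, while "dog", which appears in only one document, receives the highest weight in its document.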

**116. Tell us what you know about the Recommender Systems.**

**Answer:** Recommender Systems are a subclass of information filtering systems that predict the preference or rating a user would give to a product. They are widely used for movies, news, research articles, products, social tags, music, etc.

**117. Differentiate between Regression and Classification ML techniques.**

**Answer:** Regression and Classification are machine learning techniques which fall under supervised machine learning algorithms.

In Supervised machine learning algorithm, the user needs to train the model using labeled dataset. When the ML system is training, the user needs to explicitly provide the correct labels and the algorithm tries to learn the pattern from the input to provide the output.

If the labels are discrete values, then it will become a classification problem but if the labels are continuous values, then it will become a regression problem.

**118. Tell us about the p-value.**

**Answer:** The p-value helps the user determine the strength of the results of a hypothesis test. It is a number between 0 and 1. The claim which is on trial is known as the Null Hypothesis.

A low p-value (p ≤ 0.05) indicates strong evidence against the Null Hypothesis, which means that the Null Hypothesis can be rejected.

A high p-value (p > 0.05) indicates weak evidence against the Null Hypothesis, which means that the Null Hypothesis cannot be rejected.
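As a small worked example, consider the Null Hypothesis that a coin is fair. The sketch below computes a one-sided exact binomial p-value, that is, the probability of seeing at least the observed number of heads if the Null Hypothesis were true. The function name and the coin-flip scenario are illustrative assumptions.

```python
from math import comb


def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: P(X >= successes) under the Null Hypothesis."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


p_strong = binomial_p_value(9, 10)  # 9 heads in 10 flips
p_weak = binomial_p_value(6, 10)    # 6 heads in 10 flips
print(round(p_strong, 4), p_strong <= 0.05)
print(round(p_weak, 4), p_weak <= 0.05)
```

Nine heads out of ten gives a p-value of about 0.011, so the fair-coin hypothesis would be rejected at the 0.05 level, whereas six heads out of ten (about 0.377) would not.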

**119. What does the ‘Naive’ mean in a Naive Bayes?**

**Answer:** The Naive Bayes Algorithm is based on the Bayes Theorem.

Bayes’ theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event.

**120. What does the term ‘Naïve’ mean?**

**Answer:** The algorithm is termed ‘naive’ because it assumes that the features are independent of one another given the class, an assumption that might or might not be correct.

**121. Mention some of the skills one must possess in Python for proper data analysis.**

**Answer:** The following are some of the important skills a data analyst must possess in Python with respect to Machine Learning and Deep Learning:

• A good understanding of the built-in data types such as lists, dictionaries, tuples, and sets.

• A good knowledge of N-dimensional NumPy Arrays.

• A good knowledge of the Pandas data frames.

• One should be able to perform the element-wise vector and matrix operations on NumPy arrays.

• One should be comfortable with the Anaconda distribution and the conda package manager.

• One must also be familiar with Scikit-learn.

• One must be able to write efficient list comprehensions instead of traditional for loops.

• One must be able to write small and clean functions.

• One must know how to profile the performance of a Python script and also how to optimize bottlenecks.

**122. Differentiate between the “long” and the “wide” format data.**

**Answer:** In the ‘wide’ format, a subject’s repeated responses are displayed in a single row, and each response is displayed in a separate column. One can recognize data in wide format by the fact that columns generally represent groups.

In the ‘long’ format, each row is one time point per subject.
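The difference is easiest to see by converting one format into the other. The sketch below uses plain lists of dictionaries; the column names (`subject`, `week1`, `week2`, `time`, `value`) and the numbers are made up for the example.

```python
# Wide format: one row per subject, one column per repeated measurement.
wide = [
    {"subject": "A", "week1": 5, "week2": 7},
    {"subject": "B", "week1": 6, "week2": 4},
]


def wide_to_long(rows, id_key, value_keys):
    """Turn each measurement column into its own row (one time point per row)."""
    return [
        {id_key: row[id_key], "time": key, "value": row[key]}
        for row in rows
        for key in value_keys
    ]


long_rows = wide_to_long(wide, "subject", ["week1", "week2"])
for row in long_rows:
    print(row)
```

Two subjects with two measurements each become four rows in the long format.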

**123. Why do we require A/B Testing?**

**Answer:** A/B Testing is statistical hypothesis testing for a randomized experiment with two variants, A and B. Its goal is to identify changes to a web page that maximize an outcome of interest. It helps to work out the best online promotional and marketing strategies for a business, and it can be used to test everything from website copy to sales emails to search ads.
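One common way to analyze such an experiment is a two-proportion z-test on the conversion rates of the two variants. The sketch below is one possible analysis, not the only valid one; it relies on the normal approximation, and the visitor and conversion counts are invented.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)


# Hypothetical experiment: variant A converts 100 of 1000 visitors, variant B 130 of 1000.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(round(z, 2), round(p_value, 3))
```

For these made-up numbers the p-value falls below 0.05, so the difference between the variants would usually be treated as statistically significant.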

**124. Explain the concept of statistical power of sensitivity. How can we calculate the statistical power of sensitivity?**

**Answer:** Sensitivity is used to validate the accuracy of a classifier such as Logistic Regression, SVM, Random Forest, etc.

True events are the events that are actually true and that the model also predicts as true. The calculation of sensitivity is pretty straightforward:

Sensitivity = ( True Positives ) / ( Positives in Actual Dependent Variable )

**125. Differentiate between overfitting and underfitting.**

**Answer:** One of the most common tasks in any ML system is to fit a model to a set of training data so that it can make reliable predictions on general, unseen data.

In overfitting, a statistical model describes random error or noise instead of the underlying relationship. It occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.

Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. It occurs, for example, when fitting a linear model to non-linear data. Such a model also has poor predictive performance.

**126. Which programming language is generally used for text analytics?**

**Answer:** Python is preferred for text analytics because of the following reasons:

• Python is generally considered to perform well for most types of text analytics.

• R is better suited to machine learning than to text analysis.

• Python has the Pandas library, which provides easy-to-use data structures and high-performance data analysis tools.

**127. Explain the role of Data Cleansing in Data Analysis.**

**Answer:** Data cleaning helps in analysis in the following ways:

• Cleaning data from multiple sources helps to convert it into a format that data analysts or data scientists can work with.

• Data Cleaning also helps to enhance the accuracy of the model in machine learning.

• Data Cleaning is a cumbersome process: as the number of data sources increases, the time taken to clean the data grows sharply because of the number of sources and the volume of data they generate.

**128. Differentiate between a Validation Set and a Test Set.**

**Answer:** A Validation Set can be considered a part of the training set, as it is used for parameter selection and to avoid overfitting of the model being built.

The Test Set is used for testing the performance of a trained machine learning model.

In short, the training set is used to fit the parameters, and the test set is used to measure the performance of the model by evaluating its predictive power and generalization.

**129. Define the term cross-validation.**

**Answer:** Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent dataset. It is mainly used in settings where the goal is to forecast and the user wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to set aside a data set to test the model during the training phase (the validation data set), in order to limit problems like overfitting and to gain insight into how the model will generalize to an independent data set.
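A common concrete form is k-fold cross-validation, in which every observation is used for validation exactly once. The sketch below only generates the index splits (without shuffling); fitting and scoring a model on each split is left out, and the function name is an assumption.

```python
def k_fold_indices(n_samples, k):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


folds = list(k_fold_indices(10, 3))
for training, validation in folds:
    print("train:", training, "validate:", validation)
```

The model would be trained k times, once per fold, and the k validation scores averaged to estimate how it generalizes.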

**130. Define the term ‘Linear Regression’.**

**Answer:** Linear regression is a popular statistical technique where the score of a variable Y is predicted from the score of a second variable X. Here, X is referred to as the predictor variable and Y as the criterion variable.
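For a single predictor, the least-squares line can be computed in closed form: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the means. The sketch below assumes ordinary least squares; the function name and data are illustrative.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data lying exactly on the line Y = 2X + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_simple_linear_regression(xs, ys)
predicted = slope * 5 + intercept  # predict Y for a new X = 5
print(slope, intercept, predicted)
```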

**131. Define the term ‘Collaborative filtering’.**

**Answer:** Collaborative Filtering is the filtering process used by recommender systems to find patterns or information by combining viewpoints, various data sources and multiple agents.

**132. How is the user supposed to treat outlier values?**

**Answer:** Outlier values can be identified by using univariate or other graphical analysis methods. If there are only a few outlier values, they can be assessed individually, but for a large number of outliers, the values can be substituted with the 99th or the 1st percentile values. However, not all extreme values are outlier values. The most common ways to treat outlier values are:

• To change the value and bring it within a range.

• To remove the value.

**133. Mention the steps involved in building an analytics project.**

**Answer:** The following are the various steps involved in an analytics project:

1) To understand the Business problem.

2) To explore the data and become familiar with it.

3) To prepare data for modeling by detecting outliers, proper treatment of missing values, transforming variables, etc.

4) After data preparation, to run the model, analyze the result and modify the approach accordingly. This step is iterated until the outcome is favorable.

5) To validate the model by using a new data set.

6) Finally, start the implementation of the model and track the result to analyze the performance of the model over a period of time.

**134. Define Artificial Neural Networks.**

**Answer:** Artificial Neural Networks are a set of algorithms that have transformed the field of machine learning. They are largely inspired by biological neural networks. Neural Networks can adapt to changing input, so the network generates the best possible result without the output criteria having to be redesigned.

**135. Tell us about the structure of Artificial Neural Networks.**

**Answer:** Artificial Neural Networks work on the same principle as that of a biological Neural Network. An ANN comprises inputs which are processed with weighted sums and biases along with Activation Functions.

**136. What do you know about Gradient Descent?**

**Answer:** A gradient measures how much the output of a function changes if the input is changed by a small amount. In a neural network, the gradient measures the change in the error with respect to a change in the weights. The gradient can be thought of as the slope of a function, and Gradient Descent is the descent down this slope. It is a minimization algorithm used to minimize a given function, typically the cost (loss) function.
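The idea can be shown on a one-variable function. The sketch below minimizes f(w) = (w − 3)², whose gradient is 2(w − 3); the function, learning rate and step count are arbitrary choices for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    w = start
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w


# f(w) = (w - 3)^2 has its minimum at w = 3 and gradient 2 * (w - 3).
minimum = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
print(minimum)
```

Each step moves `w` a fraction of the way down the slope; a learning rate that is too large would overshoot the minimum instead of approaching it.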

**137. Do you know anything about Back Propagation? Explain the working of Back Propagation.**

**Answer:** Backpropagation is a training algorithm used for multilayer neural networks. It moves the error from the output end of the network back to all the weights inside the network, thus allowing efficient computation of the gradient.

The Back Propagation has the following steps:

• Forward Propagation of Training Data

• Derivatives are computed using both output and target

• Back Propagation is used for computing derivative of error with respect to the output activation

• Using the previously calculated derivatives for output

• Update the Weights
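The steps above can be traced on the smallest possible network: one input, one sigmoid hidden unit and one linear output, with a squared-error loss. The architecture, initial weights and learning rate are assumptions chosen to keep the sketch short.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def train_step(w1, w2, x, target, learning_rate=0.5):
    """One forward and backward pass for the network y = w2 * sigmoid(w1 * x)."""
    # Forward propagation of the training example.
    h = sigmoid(w1 * x)
    y = w2 * h
    loss = 0.5 * (y - target) ** 2
    # Derivative of the error with respect to the output.
    d_y = y - target
    # Propagate the error backwards with the chain rule.
    d_w2 = d_y * h
    d_h = d_y * w2
    d_w1 = d_h * h * (1.0 - h) * x
    # Update the weights.
    return w1 - learning_rate * d_w1, w2 - learning_rate * d_w2, loss


w1, w2 = 0.5, 0.5
losses = []
for _ in range(200):
    w1, w2, loss = train_step(w1, w2, x=1.0, target=1.0)
    losses.append(loss)
print(losses[0], losses[-1])
```

The loss recorded at the last step is far smaller than at the first, which shows the weight updates moving the output towards the target.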

**138. Mention the different variants of Back Propagation.**

**Answer:** The different variants of Back Propagation are:

• Mini-batch Gradient Descent: This is one of the most popular optimization algorithms and is a variant of Stochastic Gradient Descent. Instead of a single training example, a mini-batch of samples is used for each update.

• Stochastic Gradient Descent: Only a single training example is used to calculate the gradient and update the parameters.

• Batch Gradient Descent: The gradient is calculated for the whole dataset, and one update is performed at each iteration.
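The three variants differ only in how many examples contribute to each update, which the sketch below exposes as a `batch_size` argument. The one-weight model y = w·x, the noiseless toy data and the hyperparameters are assumptions for illustration.

```python
import random


def gradient_descent_fit(xs, ys, batch_size, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w * x by gradient descent; batch_size selects the variant."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average gradient of the squared error over the current batch.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * grad
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated from y = 2x

w_stochastic = gradient_descent_fit(xs, ys, batch_size=1)      # one example per update
w_mini_batch = gradient_descent_fit(xs, ys, batch_size=2)      # small batches
w_batch = gradient_descent_fit(xs, ys, batch_size=len(xs))     # whole dataset per update
print(w_stochastic, w_mini_batch, w_batch)
```

On this noiseless data all three variants arrive at essentially the same weight; on real data they differ mainly in the cost and noisiness of each update.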

**139. List out the different Deep Learning frameworks.**

**Answer:** The different Deep Learning frameworks are:

• Pytorch

• Microsoft Cognitive Toolkit

• TensorFlow

• Keras

• Chainer

• Caffe

**140. Do you know about the Activation Function? Explain its function.**

**Answer:** The Activation function is used by data analysts to introduce non-linearity into the neural network so as to help the Neural Network to learn more complex functions. The neural network can only learn linear function without the Activation Function. An activation function is a function in an artificial neuron that helps to deliver an output based on corresponding inputs.

**141. Tell us something about the Auto-Encoder.**

**Answer:** Auto-Encoders are simple learning networks that transform inputs into outputs with the minimum possible error, so that the output is as close to the input as possible. A number of layers are placed between the input and the output, and these layers are smaller than the input layer. The Auto-Encoder receives unlabeled input, which is encoded and then used to reconstruct the input.

**142. Do you know anything about a Boltzmann Machine?**

**Answer:** Boltzmann machines have a simple learning algorithm that can be used to discover interesting features that represent complex regularities in the training dataset. It is used to optimize the weights and the quantity for the given ML problem. The learning algorithm is very slow in networks with many layers of feature detectors. The Restricted Boltzmann Machine has a single layer of feature detectors, which makes it faster than the rest.

**143. Tell us something about the feature vectors.**

**Answer:** A feature vector is an n-dimensional vector of numerical features that is used to represent some object. In machine learning, feature vectors are used for representing numeric or symbolic characteristics (features) of an object in a mathematical way so that it can be analyzed easily.

**144. Mention the steps required while making a decision tree.**

**Answer:** The following are the steps that are required while building a decision tree:

1. The entire data set is considered as the input.

2. The next step is to look for a split that can be used to maximize the segregation of the classes.

3. Next, the split is applied to the input data.

4. Steps 1 to 3 are then re-applied to each subset of the divided data.

5. The process stops when the stopping criteria are met. Finally, the tree is cleaned up if the splits went too far; this step is known as pruning.

**145. Do you know anything about the process of root cause analysis?**

**Answer:** Root cause analysis was originally developed to analyze industrial accidents. It is a problem-solving technique used for isolating the root causes of faults. A factor is called a root cause if its removal from the problem-fault sequence prevents the final undesirable event from recurring.

**146. ‘Gradient descent methods at all times converge to a similar point’. What is your view regarding this statement?**

**Answer:** No, this statement is not true. Gradient Descent methods do not always converge to the same point; in some cases they reach a local minimum or local optimum rather than the global optimum. The point reached is governed by the data and the starting conditions.
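The dependence on starting conditions is easy to reproduce on a function with two valleys. The sketch below runs the same descent on f(x) = x⁴ − 2x² + 0.5x from two different starting points; the function, learning rate and step count are arbitrary choices for illustration.

```python
def f(x):
    return x**4 - 2 * x**2 + 0.5 * x


def descend(start, learning_rate=0.01, steps=2000):
    """Plain gradient descent on f, whose derivative is 4x^3 - 4x + 0.5."""
    x = start
    for _ in range(steps):
        x -= learning_rate * (4 * x**3 - 4 * x + 0.5)
    return x


left = descend(-2.0)   # ends in the valley on the negative side
right = descend(2.0)   # ends in the valley on the positive side
print(left, f(left))
print(right, f(right))
```

Both runs stop at a point where the gradient is zero, but only the run started on the left reaches the lower of the two valleys.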

**147. Mention the drawbacks of the linear model.**

**Answer:** The Linear Model has the following drawbacks:

• The linear model can’t be used for count outcomes or binary outcomes.

• It assumes the linearity of the errors.

• There are a number of overfitting problems that can’t be solved using the linear model.

**148. Tell us about the Law of Large Numbers.**

**Answer:** The Law of Large Numbers is a theorem that describes the result of performing the same experiment a large number of times. According to this theorem, the sample mean, the sample variance and the sample standard deviation converge to the quantities they are estimating.
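A quick simulation illustrates the convergence: the average of many fair die rolls approaches the expected value of 3.5 as the number of rolls grows. The function name and the fixed seed (used only to make the run repeatable) are assumptions.

```python
import random


def mean_of_die_rolls(n_rolls, seed=0):
    """Average of n_rolls simulated rolls of a fair six-sided die."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_rolls):
        total += rng.randint(1, 6)
    return total / n_rolls


for n in (10, 1_000, 100_000):
    print(n, mean_of_die_rolls(n))
```

Small samples can land noticeably away from 3.5, while the largest sample should sit very close to it.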

**149. Tell us something about confounding variables.**

**Answer:** Confounding variables are extraneous variables in a statistical model that correlate, directly or inversely, with both the dependent and the independent variable.

**150. Tell us something about the star schema.**

**Answer:** The star schema is a traditional database schema built around a central fact table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields. Such tables are called lookup tables and are very useful in real-time applications, as they save a lot of memory. Star schemas sometimes also involve several layers of summarization in order to retrieve information faster.

**151. How frequently should a user update an algorithm?**

**Answer:** An algorithm should be updated in the following situations:

• When the underlying data source is changing

• When the model needs to evolve as data streams through the infrastructure

• When there is a case of non-stationarity

**152. Mention the reasons for which resampling is performed.**

**Answer:** Resampling is performed for a number of reasons:

• To estimate the accuracy of the sample by making use of subsets of accessible data or by drawing randomly with replacement from a set of data points.

• To validate different models by using a number of random subsets

• To substitute labels on various data points while performing tests based on significance
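The first use case, drawing randomly with replacement to judge the accuracy of a sample statistic, is the bootstrap. The sketch below bootstraps the mean of a small made-up sample and reads off a rough 95% percentile interval; the sample values, the number of resamples and the seed are assumptions.

```python
import random
from statistics import mean


def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Means of resamples drawn with replacement from the original sample."""
    rng = random.Random(seed)
    return [mean(rng.choices(sample, k=len(sample))) for _ in range(n_resamples)]


sample = [2, 4, 4, 5, 7, 9, 10, 12]
means = sorted(bootstrap_means(sample))
# Rough 95% percentile interval: cut off about 2.5% of resampled means at each end.
interval = (means[25], means[974])
print(mean(sample), interval)
```

The width of the interval gives a feel for how much the sample mean would vary if the data were collected again.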

**153. Mention the types of biases that usually occur during sampling.**

**Answer:** The following are the types of biases that usually occur during sampling:

• Survivorship bias

• Selection bias

• Under coverage bias

**154. Tell us something about survivorship bias.**

**Answer:** Survivorship Bias is the logical error of focusing on the aspects that supported the survival of some process while overlooking those that did not survive because of their lack of prominence. This bias can lead to wrong conclusions in various ways.

**155. Tell us something about the working of a random forest.**

**Answer:** For the working of a random forest, a number of weak learners are combined to provide a strong learner. The following steps are used while working with the concept of a random forest:

• A number of decision trees are built on the basis of a number of training samples of data.

• For each tree, whenever a split is considered, a random sample of m predictors is selected as the split candidates from the full set of p predictors.

• The rule of thumb is that at each split m ≈ √p.

• The final prediction is made by majority rule.
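Two of the ingredients above, sampling about √p candidate predictors at each split and combining tree outputs by majority vote, can be sketched on their own. Tree construction itself is omitted; the helper names, the predictor names and the rounding down of √p are assumptions.

```python
import random
from collections import Counter
from math import isqrt


def candidate_features(all_features, rng):
    """Randomly pick about sqrt(p) of the p predictors as split candidates."""
    m = max(1, isqrt(len(all_features)))
    return rng.sample(all_features, m)


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
features = [f"x{i}" for i in range(9)]  # p = 9 hypothetical predictors
chosen = candidate_features(features, rng)
print(chosen)  # m = 3 predictors considered at this split

print(majority_vote(["spam", "ham", "spam"]))  # forest output for one observation
```

Because each split only sees a random subset of predictors, the individual trees are less correlated, which is what makes averaging or voting over them effective.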

**156. Differentiate between Big Data, Data Science and Data Analytics.**

**Answer:**

• Big Data: It deals with large volumes of structured, unstructured and semi-structured data and requires a lot of knowledge pertaining to ‘Statistics’ and mathematics.

• Data Science: It deals with manipulation of data and also requires knowledge of ‘Statistics’ and mathematics.

• Data Analytics: It is used to provide operational insights for business scenarios and does not require a very deep knowledge of ‘Statistics’ and mathematics.

**157. Tell us something about SAS, R and Python programming. How are they different or similar?**

**Answer:**

• SAS: SAS is a popular analytics tool used by a number of companies. It has a graphical user interface and a wide range of statistical functions. However, its cost can put it out of reach of smaller enterprises.

• R: R is an Open Source tool which is largely used by academia and the research community. R is a robust tool for graphical representation, statistical computation, and reporting. It is constantly updated and all updates are available for all users.

• Python: Python is a great open source programming language that is easy to learn and works well with a number of tools and technologies. It has countless libraries and community-created modules which makes it a robust programming language.

The programming languages generally used for Machine Learning Algorithms are R and Python.

**158. Tell us something about the language R.**

**Answer:** The programming language R is basically used for data manipulation, statistical computing, graphical representation, and calculation.

These are some of the features of R:

• Operators for performing matrix and array calculations.

• Extensive collection of data analysis tools.

• Simple and effective.

• Graphical facilities for analyzing and displaying data.

• R acts as an intermediary between various software, datasets and tools.

• Supports machine learning applications extensively.

• Helps to create high-quality, flexible, powerful and reproducible analyses.

• R also provides a robust package ecosystem.

**159. Tell us about the components of the Hadoop Framework.**

**Answer:** The Hadoop framework has two major components: HDFS and YARN.

• HDFS: HDFS is the acronym for Hadoop Distributed File System. HDFS is the distributed file system that serves as Hadoop’s storage layer and is capable of storing and retrieving large datasets quickly.

• YARN: YARN stands for Yet Another Resource Negotiator and is used to allocate resources dynamically and to handle the workloads.

**160. How is ‘Statistics’ useful for data scientists?**

**Answer:** ‘Statistics’ is used by data scientists to search data for patterns and to convert Big Data into big insights. ‘Statistics’ helps businesses to offer better customer service, and it helps data scientists to learn about consumer behavior, engagement, and retention. A number of powerful data models are validated on the basis of inferences and predictions made with the help of ‘Statistics’. Hence, ‘Statistics’ helps businesses to flourish by drawing out the key points from the data retrieved.

**161. Compare the importance of data analysis and data cleansing.**

**Answer:** Since data is retrieved from a number of sources, one must make sure that it is good enough for analysis. Data cleansing helps to detect and correct inaccurate records, to ensure that the data is complete and that irrelevant components of the data are deleted or modified. Data cleansing is used along with batch processing or data wrangling.

Cleansing makes the data meet the criteria of correctness. It is an important process in Data Science because data can be incorrect or might have been lost in transmission. Hence, it is the first and most crucial step of data analysis, which is itself a longer and more complex process.
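
As a rough illustration of what a cleansing pass might look like, the sketch below normalizes names, coerces ages to integers, and drops incomplete, malformed, or duplicate records. The field names (`name`, `age`) and the cleaning rules are assumptions made for this example, not a standard recipe.

```python
def clean_records(records):
    """Return de-duplicated records with a normalized name and an integer age."""
    cleaned, seen = [], set()
    for rec in records:
        name = (rec.get("name") or "").strip().title()  # trim and fix casing
        raw_age = str(rec.get("age", "")).strip()
        if not name or not raw_age.isdigit():
            continue                                    # incomplete or malformed
        key = (name, int(raw_age))
        if key in seen:
            continue                                    # duplicate record
        seen.add(key)
        cleaned.append({"name": name, "age": int(raw_age)})
    return cleaned


# Hypothetical raw input gathered from several sources.
raw = [
    {"name": " alice ", "age": "34"},
    {"name": "Alice", "age": 34},       # duplicate of the first record
    {"name": "bob", "age": None},       # missing age
    {"name": "", "age": "51"},          # missing name
    {"name": "carol", "age": "forty"},  # malformed age
    {"name": "dave", "age": " 29"},
]
cleaned = clean_records(raw)
```

Only the cleaned records would then be passed on to the analysis step.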

**162. Mention the areas in which Machine Learning is applied in the real world.**

**Answer:** These are some of the real-world areas where the use of Machine Learning has proved to be beneficial:

• Ecommerce: To understand the mix of customers, arrange customer-specific advertising, and remarket old products.

• Search engine: To rank pages on the basis of the personal preferences of the user.

• Finance: To evaluate investment opportunities and risks, and to detect fraudulent transactions.

• Healthcare: To design drugs on the basis of a patient’s medical history and present requirements.

• Robotics: To handle unknown situations.

• Social media: To understand relationships and recommend connections based on the user’s activity.

• Extraction of information: To frame questions that retrieve answers from databases all over the web.

**163. Tell us about the different parts of a Machine Learning process.**

**Answer:**

• Domain knowledge: This step is used to understand the problem domain and how various features can be extracted from the data. It determines the kind of domain expertise that the work requires.

• Feature Selection: In this step, the features to be used are selected from the given set of features. There is usually more than one candidate feature, and an intelligent decision needs to be made about which features would help the machine learning system.

• Algorithm: Choosing the correct machine learning algorithm is extremely important. A choice needs to be made between linear and nonlinear algorithms. Support Vector Machines, Decision Trees, Naïve Bayes, and K-Means Clustering are some of the available machine learning algorithms.

• Training: Training is an important machine learning step. The system is trained on the given data; it improves with each training step and makes better decisions.

• Evaluation: In this step, the system’s decisions are checked for correctness against the input. Evaluation itself involves a number of basic steps.

• Optimization: Using various optimization methods, the machine learning system is made better and the performance of the ML algorithm is enhanced. This step can also lead to the development of new optimization techniques.

• Testing: This is one of the crucial ML steps, as the system is tested with a set of unseen data. Two sets of data are used in the process: training data and test data.
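
The training and testing steps can be sketched with a toy example. The code below is only an illustration under stated assumptions: the one-feature dataset is invented, and a nearest-centroid classifier stands in for whichever algorithm would actually be chosen. The data is split into training and test sets, the model is fitted on the training set, and its accuracy is measured on the unseen test set.

```python
import random


def train_test_split(rows, test_fraction, rng):
    """Shuffle a copy of the rows and hold out a fraction as test data."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train_centroids(rows):
    """Training: compute the mean feature value for each label."""
    sums, counts = {}, {}
    for x, label in rows:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def accuracy(centroids, rows):
    """Testing: fraction of rows whose predicted label is correct."""
    return sum(predict(centroids, x) == label for x, label in rows) / len(rows)


# Hypothetical data: one numeric feature and two well-separated classes.
data = ([(float(x), "small") for x in range(1, 11)]
        + [(float(x), "large") for x in range(21, 31)])

train_rows, test_rows = train_test_split(data, 0.25, random.Random(0))
model = train_centroids(train_rows)
test_accuracy = accuracy(model, test_rows)
```

Keeping the test rows out of training is what makes the accuracy figure an honest estimate of how the system behaves on data it has not seen.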

**164. Define the terms ‘Interpolation’ and ‘Extrapolation’.**

**Answer:** Interpolation is the estimation of a value that lies between known values in a set or sequence. It is mainly used when the values at the two ends of a region are known but there are not enough data points at the specific point of interest.

Extrapolation is the estimation of a value outside the known range, by extending a set of already known values or facts into an unknown area or region. It is the practice of inferring something beyond the available data.
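
A minimal sketch of the difference, assuming a straight-line relationship between two known points (the temperature readings are invented for illustration): the same formula interpolates when the query lies between the known points and extrapolates when it lies outside them.

```python
def linear_estimate(x, x0, y0, x1, y1):
    """Estimate y at x from the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)


def is_interpolation(x, x0, x1):
    """True when x lies inside the range covered by the known points."""
    return min(x0, x1) <= x <= max(x0, x1)


# Hypothetical readings: 10 degrees at hour 2 and 14 degrees at hour 4.
inside = linear_estimate(3, 2, 10, 4, 14)    # interpolation -> 12.0
outside = linear_estimate(6, 2, 10, 4, 14)   # extrapolation -> 18.0
```

The extrapolated value is less trustworthy, since nothing in the data confirms that the linear trend continues beyond hour 4.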

**165. Tell us about Power Analysis.**

**Answer:** Power Analysis is the process of determining the sample size required to detect an effect of a given size with a certain degree of assurance. Conversely, it allows the user to determine the probability of detecting an effect of a given size under a sample size constraint. Statistical power analysis and sample size estimation use a number of techniques to make statistical judgments that are accurate and to gauge the sample size necessary to detect experimental effects.

Through this process, the sample size estimate is neither too high nor too low. Too small a sample cannot provide reliable answers, while too large a sample wastes resources.
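
As an illustration, the sketch below estimates the per-group sample size for comparing two group means. It assumes a two-sided test, equal group sizes, and the normal approximation, with the effect expressed as a standardized difference (Cohen's d); the function name and default values are choices made for this example.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group: 2 * ((z_(1 - alpha/2) + z_power) / d) ** 2."""
    z = NormalDist()                    # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


n_medium = sample_size_per_group(0.5)   # medium effect -> 63 per group
n_small = sample_size_per_group(0.2)    # smaller effects need far more data
```

An exact calculation based on the t distribution gives slightly larger numbers, so the result is best read as a lower-end estimate.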

**166. Differentiate between Data modeling and Database design.**

**Answer:**

• Data Modeling: Data modeling is used to create a conceptual model on the basis of the relationships between the various data objects. The process moves from the conceptual stage to the logical model and then to the physical schema, and involves the systematic application of data modeling techniques.

• Database Design: Database design produces a detailed data model of the concerned database as its output. It includes the detailed logical model of the database along with the physical design choices and storage parameters.
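
To make the distinction concrete, the sketch below starts from a hypothetical logical model ("a customer places many orders") and turns it into a physical schema in SQLite, including one physical design choice: an index on the foreign key. The table and column names are invented for illustration.

```python
import sqlite3

# Logical model (hypothetical): one customer places many orders.
SCHEMA = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    amount      REAL NOT NULL
);
-- Physical design choice: index the foreign key to speed up joins.
CREATE INDEX idx_order_customer ON customer_order(customer_id);
"""


def build_database():
    """Create the schema in an in-memory database and load sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the relationship
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO customer VALUES (1, 'Asha')")
    conn.executemany(
        "INSERT INTO customer_order VALUES (?, ?, ?)",
        [(10, 1, 25.0), (11, 1, 40.0)],
    )
    conn.commit()
    return conn


def total_per_customer(conn):
    """Follow the modeled relationship to total the orders of each customer."""
    return conn.execute(
        "SELECT c.name, SUM(o.amount) FROM customer c "
        "JOIN customer_order o ON o.customer_id = c.customer_id "
        "GROUP BY c.customer_id ORDER BY c.name"
    ).fetchall()


totals = total_per_customer(build_database())
```

The entities and their relationship belong to data modeling; the column types, constraints, and index belong to database design.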

**Data Science** is the buzz in the market and tops the list of the most in-demand skills. Almost every industry is looking at data analysis. With so many tools available, data science is becoming the hot skill in the market. Most organizations want to make data-driven decisions in every department. With vast amounts of data available in structured and unstructured formats, there is enormous demand for the skills and tools needed to analyze it and draw inferences. Data science is going to be the hottest skill in the market for a few more decades. The 166 top Data Science interview questions given above will help you to crack the interview successfully.