Congrats, you’ve made it to the onsite interview! There is only one more (fairly large) step between you and your dream job. How can you make sure that you nail the onsite and turn that interview into an offer? We created a list of the 113 most-asked data science interview questions so you can prepare for your onsite interview and walk in with confidence.

## Statistics

- **Accenture question** – What is linear regression?
- What is your familiarity with statistical methods and past projects?
- **Airbnb question** – How can you report statistical results to non-statistician staff?
- What are different metrics to classify a dataset?
- **Apple question** – What is the bias-variance tradeoff? How does XGBoost handle the bias-variance tradeoff?
- Explain a probability distribution that is not normal.
- **Google question** – If two predictors are highly correlated, what is the effect on the coefficients in the logistic regression? What are the confidence intervals of the coefficients?
- When using a Gaussian mixture model, how do you know it is applicable?
- **IBM question** – What is the relationship between the coefficients in a logistic regression and the odds ratio?
- How do you find an anomaly in a distribution? How do you investigate whether a certain trend in a distribution is due to an anomaly?
- **Netflix question** – When you split a population for A/B testing, what are some reasons you could see a significant difference in the control and variant groups?
- Explain the concept of multicollinearity.
- **Google question** – Find the width of the confidence interval.

## Probability

- **Facebook question** – If you draw 2 cards from a shuffled 52-card deck, what is the probability that you’ll have a pair?
- **Microsoft question** – Generate 7 integers with equal probability from a function which returns 1/0 with probability p and (1-p).
- Given an unfair coin with the probability of heads not equal to .5, what algorithm could you use to create a list of random 1s and 0s?
- **Groupon question** – You are on a number line and you can jump to one of the neighboring points with equal probability, with the exception of n=0 where you can’t go to negative numbers but have to come back to n=1. If you start at n=44, what is the expected number of steps to reach n=4444?
- **CA Technologies question** – How do you get an estimate of the answer using Taylor expansion?
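
As an illustration of how one of these might be approached, here is a minimal sketch for the unfair coin question. It uses the von Neumann trick, which is one well-known option rather than the only acceptable answer: flip the coin twice, keep the first flip if the two flips differ, and discard the pair otherwise. The names `biased_coin`, `fair_bit`, and `fair_bits`, and the bias of 0.7, are assumptions made up for this example.

```python
import random


def biased_coin(p_heads=0.7):
    """Hypothetical unfair coin: returns 1 with probability p_heads, else 0."""
    return 1 if random.random() < p_heads else 0


def fair_bit(coin=biased_coin):
    """Return an unbiased bit built from a biased coin (von Neumann trick)."""
    while True:
        first, second = coin(), coin()
        # (1, 0) and (0, 1) each occur with probability p * (1 - p),
        # so keeping the first flip of an unequal pair gives a fair bit.
        if first != second:
            return first


def fair_bits(n, coin=biased_coin):
    """Return a list of n unbiased 1s and 0s."""
    return [fair_bit(coin) for _ in range(n)]


if __name__ == "__main__":
    print(fair_bits(10))
```

The tradeoff worth mentioning in an interview is that equal pairs are thrown away, so the closer the coin's bias is to 0 or 1, the more flips each fair bit costs on average.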

## Case study questions

- How would you measure the impact of introducing a new tool for partners?
- **CA Technologies question** – How do you design an algorithm for fraud detection?
- **LinkedIn question** – Come up with some of the factors that could be used to produce certain algorithms (‘people you may know,’ and an algorithm to discover when a person is starting to search for a new job).
- **Twitter question** – What features would you use to build a recommendation algorithm for users?
- **Accenture question** – Map an organization’s problem to data science: how will you solve it using data science and machine learning?
- **Airbnb question** – Brainstorm potential causes of an anomaly in web traffic data.
- An important metric goes down. How would you dig into the causes?
- **Amazon question** – Estimate the cumulative sum of the top 10 most profitable products of the last 6 months for customers in Seattle.
- How do you deal with unbalanced data where the ratio of positive to negative examples is huge?
- **Booking.com question** – How can we automatically propose ‘good value deals’ to customers, including hotels that don’t have a rating yet?
- If you have a customer and want to decide whether they will “buy today” or “not buy today” and you know (1) where they live, (2) their income, (3) their gender, and (4) their profession, how would you define a machine learning algorithm to figure this out?
- **Uber question** – How much would it cost (initial and sustaining costs) to have a fleet of vehicles take Google Street View photos of every major city in the US every day?
- **Booking.com question** – Each hotel submits a short description. How do we figure out if it’s worth translating into some language?
- How can you optimize/increase the number of languages a customer service department is able to serve? The constraint is to maintain the same quality as before with the same budget and same number of customer representatives as before.
- **eBay question** – eBay has to distinguish cameras from similar items like tripods, cables, and batteries. What is the approach? (Data is title, description of the product, price, image, etc.)
- **Expedia question** – Develop a solution for the revenue optimization team using a structured dataset that describes the historical bookings of their hotels, which had the following attributes: number of people, booking times, arrival times, departure times, hotel features, prices, whether booked or not, etc.
- Imagine we see a lot of users filling out a form but not submitting it. Why would this be the case, and how would you use data to figure it out?
- **Intel question** – Given measurements of acceleration taken from a wristband, with second-by-second acceleration in the X, Y, and Z axes, how would you predict whether the person wearing the band is sitting, walking, or just standing?
- **LinkedIn question** – How would you design an A/B test for the homepage?
- How many lines do you think a user’s daily login table has?
- **Netflix question** – Given a month’s worth of login data from Netflix such as account_id, device_id, and metadata concerning payments, how would you detect fraud (identity theft, payment fraud, etc.)?
- **Salesforce question** – How would you build a classifier to predict the outcome of NFL games in real time?
- **SAP question** – How would you design a recommendation system for customers, considering that a single customer may use many devices to log on to a single account?
- **Slack question** – How would you prioritize which country to expand Slack to for furthering the international effort?
- **Stripe question** – How would you choose between the subscription and the marketplace-based options, i.e., evaluate which would be better for the business in the long run?
- **Uber question** – If you were rolling out Uber ride passes for the first time, how would you set the prices?
- **Booking.com question** – How would you tag a listing as value for money? How would you measure the “value”? What features would you select to explain the “value”?
- **Apple question** – How do you take millions of users with 100s of transactions each, amongst 10ks of products, and group the users together into meaningful segments?
- **Facebook question** – How many high schools that people have listed on their profiles are real? How do we find out, and deploy at scale, a way of finding invalid schools?
- We have a product that is getting used differently by two different groups. What is your hypothesis about why, and how would you go about testing it?
- **Intuit question** – How would you design a ranking system?
- **Uber question** – Explain how network effects might influence your choice of how to assign experimental/control units and measure your main outcome metrics.
- **LinkedIn question** – What product metrics do you construct? How do you tell if your experiment is successful?
- What trends in the data indicate that a given market is healthy? What does price tell you?

## SQL and databases

- **Dell question** – What is indexing in a database?
- **Facebook question** – Given a series of tables, write the SQL code you would need to count subpopulations through joins.
- **Pinterest question** – Write a SQL query to count the number of unique users per day who logged in from both an iPhone and the web, where iPhone logs and web logs are in distinct relations.
- **Spotify question** – Given a sample set of tables, write a SQL query to get a summary metric from those tables.
- **Twitter question** – How can you illustrate a tree-based system with a SQL query?
- If you have a table with a billion rows, how would you add a column inserting data from the original source without affecting the user experience?
- **Facebook question** – There is a table that tracks every time a user turns a feature on or off, with columns for user_id, action (“on” or “off”), date, and time. How many users turned the feature on today? How many users have ever turned the feature on? In a table that tracks the status of every user every day, how would you add today’s data to it?
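
To make the feature on/off question concrete, here is a minimal sketch of its first two parts, run against an in-memory SQLite database from Python. The table name `feature_log`, the sample rows, and the fixed value of `today` are all assumptions invented for this example; the question only specifies the columns user_id, action, date, and time.

```python
import sqlite3

# In-memory database with a hypothetical table matching the question's columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feature_log (user_id INTEGER, action TEXT, date TEXT, time TEXT)"
)
conn.executemany(
    "INSERT INTO feature_log VALUES (?, ?, ?, ?)",
    [
        (1, "on", "2024-01-01", "09:00"),
        (1, "off", "2024-01-02", "10:00"),
        (2, "on", "2024-01-02", "11:00"),
        (2, "on", "2024-01-02", "12:00"),  # same user turning it on twice
        (3, "off", "2024-01-02", "13:00"),
    ],
)

today = "2024-01-02"  # fixed date so the example is reproducible

# Users who turned the feature on today; DISTINCT avoids double counting.
on_today = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM feature_log "
    "WHERE action = 'on' AND date = ?",
    (today,),
).fetchone()[0]

# Users who have ever turned the feature on.
ever_on = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM feature_log WHERE action = 'on'"
).fetchone()[0]

print(on_today, ever_on)
```

Counting distinct users rather than rows is the kind of detail worth saying out loud, since a user can toggle the feature several times in one day.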

## Programming

- Check if an integer is a palindrome (do not convert the integer to string)
- **Adobe question** – What kind of coding language do you use when handling a large-scale dataset?
- Given 2 sorted arrays of integers, code to find a number from each array such that their sum is closest to some integer K.
- How would you impute missing information?
- **Amazon question** – Write a Python function that displays the first n Fibonacci numbers.
- Write Python code to return the count of words in a string.
- **Cisco question** – Merge 2 sorted linked lists.
- Clone a graph.
- **eBay question** – Given a function roll() that uniformly returns a double between 0 and 1 and an array/list of numbers of length N (no duplicates), create a function shuffle() that returns a permutation of equal probability.
- How would you create/design/implement a certain algorithm from start to end?
- **LinkedIn question** – Given a random generator that produces a number 1 to 5 uniformly, write a function that produces a number from 1 to 7 uniformly.
- Generate a sorted vector from two sorted vectors.
- **Rakuten question** – Write a function that finds the MST of a directed graph.
- **Uber question** – Given a random Bernoulli trial generator, write a function to return a value sampled from a normal distribution.
- Given 2 sorted arrays, merge them into 1 array. If the first array has enough space for 2, how do you merge the 2 without using extra space?
- How would you improve the complexity of a list merging algorithm from quadratic to linear?
- **Apple question** – Find the index at which the sum of the left half of an array is equal to the sum of the right half.
- **Groupon question** – How do you write a sqrt function without using sqrt()?
- **HP question** – What are polymorphism and encapsulation in OOP?
- **Salesforce question** – What is the computational complexity of finding the most frequent word in a document?
- **IBM question** – Given a subset of daily sales and sellers, find the subset that identifies those with the highest daily sales average.
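
For a sense of the expected level of detail, here is a minimal sketch of the integer palindrome question from the list above. It reverses the digits arithmetically instead of converting to a string. The function name `is_palindrome` and the choice to treat negative numbers as non-palindromes are assumptions for this example, and the second is worth confirming with your interviewer.

```python
def is_palindrome(n: int) -> bool:
    """Check whether an integer reads the same in both directions,
    without converting it to a string."""
    if n < 0:
        return False  # assumption: negative numbers are not palindromes
    original, reversed_n = n, 0
    while n > 0:
        # Peel off the last digit and append it to the reversed number.
        n, digit = divmod(n, 10)
        reversed_n = reversed_n * 10 + digit
    return original == reversed_n


if __name__ == "__main__":
    for value in (121, 123, 0, 10, -121):
        print(value, is_palindrome(value))
```

The loop runs once per digit, so the running time is proportional to the number of digits in the input.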

## Modeling

- **Airbnb question** – Does the practice of removing missing values cause bias? If so, what would you do?
- **Adobe question** – What is the difference between logit and probit models?
- What is the degree of freedom for lasso?
- What is cross validation?
- **Amazon question** – What types of regularization exist? Which one is simpler to use?
- What is a propensity model and how are beta estimates calculated by MLE?
- What is a time series model and how do you do the calculation of ACF and PACF?
- **Booking.com question** – How would you create an attribution model?
- What would you do if the relation between outcome and features is not linear? How do you validate the model you built? Design and describe an experiment to confirm that the method you developed is a good one.
- **Dell question** – What is dimensionality reduction?
- **Dropbox question** – How would you set up a propensity model for the SMB team looking at companies between 5-200 employees?
- **eBay question** – Suggest a modeling process for a binary classification task with skewed and unbalanced data.
- Build a model to identify customers interested in receiving ad emails.
- **FICO question** – What is a distribution you may use to model data whose range of input values is [0, N]?
- **Google question** – If the labels are known in the clustering project, how do you evaluate the performance of the model?
- **IBM question** – How do you validate a machine learning model?
- How do you evaluate the performance of a regression prediction model as opposed to a classification prediction model?
- **Microsoft question** – How would you explain a deep learning model to customers?
- How do you measure and compare models? For example, the pros and cons of Random Forest vs. Logistic Regression?
- **Netflix question** – How should we approach attribution modeling to measure marketing effectiveness?
- **Rakuten question** – How could you contribute to the team with quantitative modeling? Present the answer with details.
- **TripAdvisor question** – How do you evaluate a classifier and how do you select features?
- **ServiceNow question** – What are the ways to transform a numeric predictor to a categorical one and vice versa?
- What’s the difference between supervised and unsupervised machine learning?
- **Intuit question** – How does boosting work?
- **Netflix question** – How would you build and test a metric to compare two users’ ranked lists of movie/TV show preferences?
- **Amazon question** – What are hyperparameters, how do you tune them, how do you test them, and how do you know if they worked for the particular problem?
- What is overfitting? How do you avoid it?
- What is bagging?
- **Expedia question** – What is the difference between LSTM and RNN?
- How do you choose kernels in the SVM method?
- **Oracle question** – Describe random forest to your grandmother.
- **Microsoft question** – What is the ROC curve and the meaning of sensitivity, specificity, confusion matrix?

Our advisors at Pathrise work with smart and accomplished data scientists all of the time. Sometimes, in the interviews, these candidates let nerves get the best of them and struggle with the questions, even if they know how to solve the problems. So, here are some additional tips to help you once you get into the room.

**Always start with clarifying questions**

Sometimes, interviewers make a question intentionally vague. Especially for case study questions, it’s important to clearly define the business use case and metric. For example, if a company asked you to investigate “why sign up rates have declined,” you can ask questions such as:

1) “Over what time period did the decline happen and during which months?”

2) “How are we defining sign up rate? What is the numerator and denominator?”

**Proactively show positive signal**

While you’re working, proactively offer 30-second “tidbits” of knowledge. This is a strong tactic because it not only reduces the opportunities for negative signal but also gives the interviewer a sense of your knowledge. Just make sure you are confident in what you are mentioning.

**Make context statements**

Context statements are the difference between doing something and providing the reasoning before/as you are doing something. How you are interpreted can very much change based on the context that you give. So, try to provide the rationale behind your actions so that your interviewer knows why you are making the choices you are making, especially for actions where the interpretation is opinionated.

**Know how to get help**

In other words, know how to get a hint. Some interviewers really dislike the word “hint,” so a better approach is to say something like, “My assumptions are X and Y, and I’m thinking of doing Z, but I’m struggling with [specific problem].” You can also ask collaborative questions like:

- I was wondering if you had any thoughts.
- Do you think I’m going in the right direction?
- Do you think my assumptions are incorrect?

**Understand when to ask permission questions**

Every interviewer will have different preferences. At key decision points where interviewers’ preferences may differ, you should ask for permission before assuming an action is appropriate. These can be questions like, “Can I Google the syntax online?” or “Is it okay if I write some thoughts down on paper?” It’s also better to lean towards closed-ended questions such as, “Should I use this solution or think of something more optimal?” rather than “What should I do next?”

With these questions & tips in your back pocket, you should be more than prepared for your next data science technical onsite interview.

**Pathrise** is a career accelerator that works with students and young professionals 1-on-1 so they can land their dream job in tech. With these tips and guidance, we’ve seen an increase of up to 80% in interview success from our fellows in the program.

If you want to work with any of our advisors 1-on-1 to get help with your data science interviews or with any other aspect of the job search, become a Pathrise fellow.