SQL Window Functions in Data Science Interviews Asked by Airbnb, Netflix, Twitter, and Uber

Window functions are a group of functions that perform calculations across a set of rows related to the current row. They are considered advanced SQL and are often asked about during data science interviews. They are also used a lot at work to solve many different types of problems. Let’s summarize the 4 different types of window functions and cover why and when you would use them.

4 types of window functions

1. Regular aggregate functions

o These are aggregates such as AVG, MIN/MAX, COUNT, SUM

o You will want to use them to aggregate your data and group it by another column like month or year

2. Ranking functions

o ROW_NUMBER, RANK, DENSE_RANK

o These are functions that help you rank your data. You can rank your entire dataset or rank it within groups, such as by month or country

o Extremely useful for generating ranking indexes within groups

3. Generating statistics

o These are great if you need to generate simple statistics; NTILE, for example, gives you percentiles, quartiles, and medians

o You can use this for your entire dataset or per group

4. Time series data handling

o A very common use of window functions, especially if you need to calculate trends such as a month-over-month moving average or a growth metric

o LAG and LEAD are the two functions that allow you to do this.

1. Regular aggregate functions

Regular aggregate functions are functions like AVG, COUNT, SUM, and MIN/MAX that are applied to columns. Use them as window functions when you want to apply an aggregation to different groups in the dataset, such as each month.

This is similar to the type of calculation you can do with an aggregate function in a SELECT with GROUP BY. But unlike regular aggregate functions, window functions do not collapse multiple rows into a single row of output; each row retains its own identity, and the aggregate is attached to it.
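To make the difference from GROUP BY concrete, here is a minimal sketch. It runs the SQL through Python's built-in sqlite3 module (SQLite 3.25 or newer is needed for window functions) rather than the PostgreSQL editor used in this article, and the `trips` table and its values are made up for illustration.

```python
import sqlite3

# Hypothetical toy data: three trips across two months.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (request_mnth TEXT, dist_to_cost REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("2020-01", 2.0), ("2020-01", 4.0), ("2020-02", 10.0)],
)

# GROUP BY collapses each month into one output row.
grouped = conn.execute("""
    SELECT request_mnth, AVG(dist_to_cost)
    FROM trips
    GROUP BY request_mnth
    ORDER BY request_mnth
""").fetchall()

# The window version keeps all three rows and attaches the monthly average.
windowed = conn.execute("""
    SELECT request_mnth,
           dist_to_cost,
           AVG(dist_to_cost) OVER (PARTITION BY request_mnth) AS avg_dist_to_cost
    FROM trips
    ORDER BY request_mnth, dist_to_cost
""").fetchall()

print(len(grouped), "rows from GROUP BY:", grouped)
print(len(windowed), "rows from the window function:", windowed)
```

The GROUP BY query returns two rows (one per month), while the window query returns all three original rows, each carrying its month's average.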

AVG() example:

Let’s take a look at an example of an AVG() window function implemented to answer a data analysis question. You can see the question and write the code at the link below:

platform.stratascratch.com/coding-question?id=10302&python=

This is a perfect example of how to use a window function to apply AVG() per month. Here we are trying to calculate the average distance per dollar for each month, which is difficult to do in SQL without a window function. The AVG() window function produces the third column, which holds the average for each month-year in the data set, repeated on every row that belongs to that month. We can then use this metric to calculate the difference between the monthly average and the individual value for each request date in the table.

The code to implement the window function would look like this:

SELECT a.request_date,
       a.dist_to_cost,
       -- monthly average, repeated on every row of that month
       AVG(a.dist_to_cost) OVER (PARTITION BY a.request_mnth) AS avg_dist_to_cost
FROM
  (SELECT *,
          -- month-year key used to partition the window
          to_char(request_date::date, 'YYYY-MM') AS request_mnth,
          (distance_to_work / monetary_cost) AS dist_to_cost
   FROM uber_request_logs) a
ORDER BY request_date
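The follow-up step mentioned above, comparing each request with its monthly average, could look roughly like this. It is a sketch only: it runs on SQLite through Python's sqlite3 module (3.25+), the `requests` table and its values are hypothetical, and `substr(request_date, 1, 7)` stands in for PostgreSQL's `to_char(..., 'YYYY-MM')`, which SQLite does not have.

```python
import sqlite3

# Hypothetical toy data: two requests in January, two in February.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (request_date TEXT, dist_to_cost REAL)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?)",
    [
        ("2020-01-05", 2.0),
        ("2020-01-20", 4.0),
        ("2020-02-03", 10.0),
        ("2020-02-10", 6.0),
    ],
)

# Subtract the per-month average from each row's own value.
rows = conn.execute("""
    SELECT request_date,
           dist_to_cost,
           dist_to_cost - AVG(dist_to_cost) OVER (
               PARTITION BY substr(request_date, 1, 7)  -- 'YYYY-MM' key
           ) AS diff_from_monthly_avg
    FROM requests
    ORDER BY request_date
""").fetchall()

for row in rows:
    print(row)
```

Because the window function keeps every row, the subtraction can happen in the same SELECT without joining back to a grouped result.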

2. Ranking functions

Ranking functions are an important utility for a data scientist. You are always ranking and indexing your data to better understand which rows are the best in your data set. SQL window functions provide you with 3 ranking utilities: RANK(), DENSE_RANK(), and ROW_NUMBER(). Which one you pick depends on your exact use case. These functions help you rank your data overall or within groups, based on what you want.
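The three functions only differ in how they treat ties. Here is a small sketch of that difference, run on SQLite through Python's sqlite3 module (3.25+) instead of the article's PostgreSQL editor; the `employee` table and its rows are invented for the example.

```python
import sqlite3

# Hypothetical toy data: bob and cy are tied on salary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("ann", 100), ("bob", 90), ("cy", 90), ("dee", 80)],
)

rows = conn.execute("""
    SELECT salary,
           -- always unique; name is added only to make tie order deterministic
           ROW_NUMBER() OVER (ORDER BY salary DESC, name) AS row_num,
           -- ties share a rank and the next rank is skipped
           RANK()       OVER (ORDER BY salary DESC) AS rnk,
           -- ties share a rank and no rank is skipped
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk
    FROM employee
    ORDER BY row_num
""").fetchall()

for row in rows:
    print(row)
```

For the salary of 80, ROW_NUMBER gives 4, RANK gives 4 (rank 3 is skipped because of the tie), and DENSE_RANK gives 3.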

RANK() example:

Let’s take a look at a ranking window function example to see how we can rank data within groups using SQL window functions. Follow along interactively at this link: platform.stratascratch.com/coding-question?id=9898&python=

Here we want to find the top salaries by department. We can’t just find the top 3 salaries without a window function, because that would only give us the top 3 salaries across all departments, so we need to rank salaries within each department individually. This is done with RANK(), partitioned by department. From there, it is very easy to filter for the top 3 in every department.

Here is the code to generate this table. You can copy and paste it into the SQL editor at the link above and see the same result.

SELECT department,
       salary,
       -- rank salaries within each department, highest first
       RANK() OVER (PARTITION BY a.department
                    ORDER BY a.salary DESC) AS rank_id
FROM
  -- one row per distinct (department, salary) pair
  (SELECT department, salary
   FROM twitter_employee
   GROUP BY department, salary
   ORDER BY department, salary) a
ORDER BY department,
         salary DESC
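The last step described above, keeping only the top 3 per department, needs one more level of nesting, because a window function's result cannot be referenced in the WHERE clause of the same SELECT. A minimal sketch, assuming a made-up `employee` table and running on SQLite via Python's sqlite3 module (3.25+) rather than the article's editor:

```python
import sqlite3

# Hypothetical toy data: department A has a duplicate top salary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("A", 50), ("A", 50), ("A", 40), ("A", 30), ("A", 20), ("B", 70), ("B", 60)],
)

rows = conn.execute("""
    SELECT department, salary, rank_id
    FROM (
        SELECT department,
               salary,
               RANK() OVER (PARTITION BY department
                            ORDER BY salary DESC) AS rank_id
        -- de-duplicate salaries first so ties do not use up ranks
        FROM (SELECT DISTINCT department, salary FROM employee) d
    ) ranked
    WHERE rank_id <= 3          -- filter on the rank in the outer query
    ORDER BY department, salary DESC
""").fetchall()

for row in rows:
    print(row)
```

Department A returns its three highest distinct salaries, and department B returns both of its rows because it has fewer than three.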

3. NTILE

NTILE is a very useful function for those in data analysis, business analysis, and data science. Often when you’re dealing with statistical data, you need to produce statistics like quartiles, quintiles, medians, and deciles in your daily work, and NTILE makes it easy to generate these results.

NTILE takes one argument, the number of buckets you want to divide your data into, and then assigns each row to one of those buckets, splitting the rows as evenly as possible. You set how the data is ordered, and how it is partitioned if you want additional groupings.
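Here is a small sketch of how the bucketing works when the rows do not divide evenly. It uses SQLite through Python's sqlite3 module (3.25+) and a made-up `scores` table with ten rows split into four buckets.

```python
import sqlite3

# Hypothetical toy data: ten scores, 10 through 100.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?)",
    [(s,) for s in range(10, 110, 10)],
)

# NTILE(4) assigns each row a bucket number from 1 to 4 (quartiles).
rows = conn.execute("""
    SELECT score,
           NTILE(4) OVER (ORDER BY score) AS bucket
    FROM scores
    ORDER BY score
""").fetchall()

buckets = [bucket for _, bucket in rows]
print(buckets)
```

Ten rows cannot be split into four equal buckets, so the first two buckets get three rows each and the last two get two rows each. Keep this in mind when you read a bucket as an exact percentile on a small dataset.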

NTILE(100) example:

In this example, we will learn how to use NTILE to categorize our data into percentiles. You can follow it interactively at the link here: platform.stratascratch.com/coding-question?id=10303&python=

What you’re trying to do here is identify the top 5 percent of claims based on a score that an algorithm generates. But you can’t just sort the whole table and take the top 5%, because you want the top 5% within each state. So one way to do this is to use the NTILE() function and PARTITION BY state. You can then apply a filter in the WHERE clause to get the top 5%.

Here is the code to generate the complete output table. You can copy and paste it into the editor at the link above.

SELECT policy_number,
       state,
       claim_cost,
       fraud_score,
       percentile
FROM
  (SELECT *,
          -- percentile 1 holds the highest fraud scores in each state
          NTILE(100) OVER (PARTITION BY state
                           ORDER BY fraud_score DESC) AS percentile
   FROM fraud_score) a
WHERE percentile <= 5

4. Time series data handling

LAG and LEAD are two window functions that are useful for dealing with time series data. The only difference between them is direction: LAG pulls a value from a previous row, while LEAD pulls a value from a following row, so you are looking either back at past data or ahead at future data.
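A quick sketch of the two directions side by side, using SQLite through Python's sqlite3 module (3.25+). The `monthly` table and its numbers are hypothetical.

```python
import sqlite3

# Hypothetical toy data: revenue for three consecutive months.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly (month INTEGER, revenue INTEGER)")
conn.executemany(
    "INSERT INTO monthly VALUES (?, ?)",
    [(1, 100), (2, 120), (3, 90)],
)

rows = conn.execute("""
    SELECT month,
           revenue,
           LAG(revenue)  OVER (ORDER BY month) AS prev_revenue,  -- one row back
           LEAD(revenue) OVER (ORDER BY month) AS next_revenue   -- one row ahead
    FROM monthly
    ORDER BY month
""").fetchall()

for row in rows:
    print(row)
```

The first row has no previous row and the last row has no next row, so those values come back as NULL (None in Python).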

You can use LAG and LEAD to calculate month-over-month growth or moving averages. As a data scientist or business analyst, you are constantly dealing with time series data and building these time-based metrics.
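Both metrics can be sketched with LAG alone. The example below is only an illustration on SQLite via Python's sqlite3 module (3.25+): the `monthly` table is made up, and the moving average is a simple 2-month one built from the current and previous row.

```python
import sqlite3

# Hypothetical toy data: revenue for three consecutive months.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly (month INTEGER, revenue INTEGER)")
conn.executemany(
    "INSERT INTO monthly VALUES (?, ?)",
    [(1, 100), (2, 120), (3, 90)],
)

rows = conn.execute("""
    SELECT month,
           -- month-over-month growth in percent; 100.0 forces float division
           (revenue - LAG(revenue) OVER (ORDER BY month)) * 100.0
               / LAG(revenue) OVER (ORDER BY month) AS growth_pct,
           -- 2-month moving average of the current and previous month
           (revenue + LAG(revenue) OVER (ORDER BY month)) / 2.0 AS moving_avg
    FROM monthly
    ORDER BY month
""").fetchall()

for row in rows:
    print(row)
```

The first month has nothing to compare against, so both metrics are NULL there. For longer moving averages, a window frame such as ROWS BETWEEN ... PRECEDING AND CURRENT ROW with AVG() is usually more convenient than chaining several LAG calls.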

LAG() example:

In this example, we want to find the percentage growth year over year, which is a very common question that data scientists and business analysts answer on a daily basis. The problem statement, data, and SQL editor are at the following link if you want to try to code the solution on your own: platform.stratascratch.com/coding-question?id=9637&python=

The tricky thing about this problem is how the data is laid out: the metric needs the value from the previous row. Plain SQL expressions are not designed for that; they calculate whatever you want as long as the values are in the same row. So we use the LAG() or LEAD() window function, which takes a value from a previous or following row and puts it in the current row, and that is exactly what this question requires.

Here is the code to generate the complete table above. You can copy and paste the code in the SQL editor at the link above:

SELECT year,
       current_year_host,
       prev_year_host,
       -- year-over-year growth in percent
       round(((current_year_host - prev_year_host) / (cast(prev_year_host AS numeric))) * 100) estimated_growth
FROM
  (SELECT year,
          current_year_host,
          -- previous year's host count, pulled into the current row
          LAG(current_year_host, 1) OVER (ORDER BY year) AS prev_year_host
   FROM
     -- number of hosts that joined in each year
     (SELECT extract(year
                     FROM host_since::date) AS year,
             count(id) current_year_host
      FROM airbnb_search_details
      WHERE host_since IS NOT NULL
      GROUP BY extract(year
                       FROM host_since::date)
      ORDER BY year) t1) t2
