Master Your Data Engineer Interview: Top 12 SQL Problems & Solutions

Practice these 12 SQL problems covering joins, aggregations, window functions, and subqueries to excel in data engineer interviews. Prepgenix AI offers tailored practice for Indian tech aspirants.

Landing a data engineering role often hinges on your SQL prowess, especially in the competitive Indian tech landscape. Companies like TCS, Infosys, and Wipro frequently test candidates on their ability to manipulate and query data efficiently. This article delves into the top 12 SQL interview problems that are recurrently asked, designed to assess your understanding of core database concepts, query optimization, and problem-solving skills. Whether you're a recent graduate from IIT or NIT, or a fresher from any engineering college preparing for placements, mastering these SQL challenges will significantly boost your confidence and readiness for your upcoming tech interview. We'll provide detailed explanations and solutions, helping you understand the logic behind each query, and how you can leverage platforms like Prepgenix AI to practice extensively.

Understanding Joins: The Foundation of Relational Databases

Joins are fundamental to relational databases, allowing you to combine rows from two or more tables based on a related column. Data engineers frequently encounter scenarios where they need to retrieve information scattered across different tables. The most common types are INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN. An INNER JOIN returns only the rows where there is a match in both tables. For instance, imagine you have a 'Customers' table and an 'Orders' table. An inner join on 'CustomerID' would show you only those customers who have placed orders. A LEFT JOIN (or LEFT OUTER JOIN) returns all rows from the left table, and the matched rows from the right table. If there's no match, the result is NULL on the right side. This is crucial when you want to see all customers, even those who haven't ordered anything yet. A RIGHT JOIN is the inverse of a LEFT JOIN. A FULL OUTER JOIN returns all rows when there is a match in either the left or the right table. If there's no match, NULL values are returned for columns of the table without a match. Understanding the nuances of each join type is vital. Interviewers often pose questions like: 'Find all employees and their departments, including employees who are not assigned to any department.' This requires a LEFT JOIN from the 'Employees' table to the 'Departments' table. Another common problem is finding records that exist in one table but not another, which can be solved using a LEFT JOIN combined with a WHERE clause checking for NULLs in the right table, or using a NOT EXISTS subquery. Practicing these join scenarios on sample datasets, perhaps similar to those used in TCS NQT or Infosys mock tests, will solidify your understanding. Prepgenix AI provides interactive exercises that mimic these real-world join challenges, allowing you to experiment and learn.

Aggregate Functions and GROUP BY: Summarizing Your Data

Data engineers often need to summarize large datasets to extract meaningful insights. Aggregate functions like COUNT, SUM, AVG, MIN, and MAX are essential tools for this. When combined with the GROUP BY clause, they allow you to perform calculations on subsets of rows. For example, you might want to find the total number of orders placed by each customer, or the average salary of employees in each department. A typical interview question could be: 'Calculate the total sales amount for each product category.' This would involve joining a 'Products' table with a 'Sales' table, grouping by 'Category', and then summing the 'SalesAmount'. Another variation might ask for the number of customers in each city. The GROUP BY clause is powerful, but it also comes with potential pitfalls. You must ensure that all non-aggregated columns in the SELECT list are included in the GROUP BY clause. Interviewers might also ask about HAVING clause, which is used to filter groups based on a specified condition, unlike WHERE which filters individual rows before grouping. For instance, 'Find all product categories with total sales exceeding ₹1,00,000.' This requires using HAVING COUNT(Category) > 100000. Understanding how to use these functions effectively and efficiently is a key skill. Practicing with datasets that mimic common Indian e-commerce scenarios or banking transactions can make these concepts more relatable and easier to grasp during your interview preparation.

Window Functions: Advanced Analytics and Ranking

Window functions take aggregation to the next level by performing calculations across a set of table rows that are somehow related to the current row. Unlike aggregate functions with GROUP BY, window functions do not collapse rows; they return a value for each row based on a 'window' of related rows. This is incredibly useful for tasks like ranking, calculating running totals, and finding moving averages. Common window functions include ROW_NUMBER(), RANK(), DENSE_RANK(), LEAD(), LAG(), SUM() OVER(), AVG() OVER(), etc. A classic interview problem is: 'Rank employees within each department based on their salary.' This can be solved using RANK() OVER (PARTITION BY department_id ORDER BY salary DESC). The PARTITION BY clause divides the rows into partitions (e.g., departments), and ORDER BY within the OVER clause specifies the order within each partition. Another common question involves calculating the difference in sales between consecutive months, which can be solved using the LAG() function. For example, 'Calculate the month-over-month sales growth.' This would involve calculating the sum of sales per month and then using LAG() to get the previous month's sales for comparison. Understanding the OVER() clause, including PARTITION BY and ORDER BY, is critical. These functions are often used in more complex data engineering tasks, such as time-series analysis and performance reporting, making them a frequent topic in interviews for roles requiring advanced analytical capabilities. Familiarizing yourself with these functions through practice, perhaps on datasets simulating stock market data or customer activity logs, is highly recommended.

Subqueries: Solving Complex Problems Step-by-Step

Subqueries, also known as inner queries or nested queries, are queries embedded within another SQL query. They are powerful tools for breaking down complex problems into smaller, manageable parts. A subquery can be used in the WHERE clause, FROM clause, or SELECT clause. For instance, a common problem is: 'Find all customers who have placed more orders than the average number of orders placed by any customer.' This requires a subquery to first calculate the average number of orders, and then the outer query to select customers whose order count exceeds this average. Subqueries can also be used to find records that don't exist in another table. For example, 'Find all products that have never been ordered.' This can be achieved with a subquery like: SELECT product_name FROM Products WHERE product_id NOT IN (SELECT DISTINCT product_id FROM Order_Items). While powerful, subqueries can sometimes lead to performance issues if not written efficiently. Interviewers might ask you to optimize a query that uses a subquery, perhaps by rewriting it using JOINs or Common Table Expressions (CTEs). Understanding when and how to use subqueries effectively, and recognizing their potential performance implications, is a key aspect of data engineering interviews. Practicing various subquery scenarios, including correlated subqueries (where the inner query depends on the outer query), will enhance your problem-solving abilities for intricate data retrieval tasks.

Common Table Expressions (CTEs): Enhancing Readability and Recursion

Common Table Expressions (CTEs), introduced with SQL:1999, offer a way to write more readable and maintainable SQL code, especially for complex queries involving multiple steps or recursive operations. A CTE is a temporary, named result set that you can reference within a single SQL statement (SELECT, INSERT, UPDATE, or DELETE). They are defined using the WITH keyword. For example, a complex query involving multiple joins and aggregations can be broken down into logical steps using CTEs, making the overall query easier to understand. Consider the problem of finding the top N employees in each department based on salary. While window functions can solve this efficiently, using CTEs can also structure the logic clearly. You might have one CTE to calculate departmental salaries, another to rank employees within departments, and a final SELECT statement to fetch the top N. CTEs are particularly useful for recursive queries, which are used to traverse hierarchical data structures like organizational charts or bill of materials. For instance, an interviewer might ask you to find all subordinates of a specific manager in an employee hierarchy. A recursive CTE can elegantly solve this by iteratively querying the employee table based on the manager-employee relationship until the entire hierarchy is traversed. Mastering CTEs, both non-recursive and recursive, demonstrates a strong grasp of modern SQL practices and is often a plus in data engineering interviews, showcasing your ability to write clean, efficient, and sophisticated queries.

Handling NULL Values: The Unseen Challenge

NULL values represent missing or unknown data in SQL databases. They are not the same as zero or an empty string, and treating them incorrectly can lead to unexpected results. Data engineers must be adept at handling NULLs in queries. Common operations involving NULLs include checking for their presence or absence, and how they affect aggregate functions and comparisons. For instance, when counting rows with COUNT(), NULLs are included. However, COUNT(column_name) only counts non-NULL values in that column. Similarly, AVG(column_name) ignores NULL values. Interview questions often revolve around finding records with missing information. 'Find all customers who have not provided their email address.' This requires using WHERE email IS NULL. Conversely, 'Find all customers who have provided an email address' uses WHERE email IS NOT NULL. Another common scenario is using functions like COALESCE or ISNULL to replace NULL values with a default value. For example, 'Calculate the total revenue, treating missing discounts as 0.' This could be written as SUM(price (1 - COALESCE(discount, 0))). Understanding these nuances is critical because inconsistent handling of NULLs can lead to incorrect data analysis and flawed decision-making. In real-world scenarios, especially when integrating data from various sources, dealing with NULLs is a daily task for data engineers.

Advanced SQL Concepts: Pivoting, Unpivoting, and Self-Joins

Beyond the fundamentals, data engineers often need to transform data structures. Pivoting and unpivoting are common data transformation tasks. Pivoting transforms rows into columns, typically used to aggregate data by category. For example, converting monthly sales figures from rows into columns for each month. Unpivoting does the reverse, transforming columns into rows. While not standard SQL in all dialects, many RDBMS offer specific functions (like PIVOT/UNPIVOT in SQL Server, or conditional aggregation using CASE statements in others) to achieve this. A self-join is a regular join, but the table is joined with itself. This is useful when comparing rows within the same table. For instance, finding employees who earn more than their managers. Assuming an 'Employees' table with 'employee_id', 'name', and 'manager_id', a self-join would link 'Employees' to itself on 'manager_id' = 'employee_id' to find pairs where the employee's salary is greater than the manager's salary. Another application is finding duplicate records based on certain criteria, or identifying sequential events for the same user. Mastering these advanced techniques demonstrates a deeper understanding of SQL's flexibility and its application in complex data manipulation scenarios, which is highly valued in data engineering interviews.

Frequently Asked Questions

What are the most common SQL interview questions for freshers?

Common questions include defining different types of JOINs, explaining aggregate functions with GROUP BY, handling NULL values, and writing basic subqueries. Understanding window functions like RANK() and ROW_NUMBER() is also increasingly important for data engineer roles.

How important is SQL for a Data Engineer interview?

SQL is critically important. Data Engineers spend a significant amount of their time writing, optimizing, and understanding SQL queries for data extraction, transformation, and loading (ETL/ELT). Proficiency is essential.

Can you explain the difference between RANK() and DENSE_RANK()?

Both assign ranks based on an ordering. RANK() can create gaps in the ranking sequence if there are ties (e.g., 1, 1, 3), while DENSE_RANK() assigns consecutive ranks without gaps (e.g., 1, 1, 2).

What is a correlated subquery?

A correlated subquery is an inner query that references columns from the outer query. It executes once for each row processed by the outer query, making it potentially less efficient than non-correlated subqueries.

How can I practice SQL for interviews?

Practice on platforms like LeetCode, HackerRank, and specifically Prepgenix AI, which offers tailored courses and mock interviews for Indian tech aspirants. Work with sample datasets and try to solve real-world problems.

What is the purpose of the HAVING clause?

The HAVING clause is used to filter groups created by the GROUP BY clause. It allows you to apply conditions to the results of aggregate functions, similar to how WHERE filters individual rows before grouping.

How do you handle NULL values in SQL?

You handle NULLs using functions like IS NULL, IS NOT NULL, COALESCE, or ISNULL to check for their presence, replace them with default values, or exclude them from calculations as needed.

What are window functions and why are they important?

Window functions perform calculations across a set of table rows related to the current row without collapsing them. They are crucial for tasks like ranking, calculating running totals, and advanced analytics, making them vital for data engineers.