The 365 Data Science team is proud to invite you to our own community forum. A very well built system to support your queries, questions and give the chance to show your knowledge and help others in their path of becoming Data Science specialists.
Anybody can ask a question
Anybody can answer
The best answers are voted up and moderated by our team

SQL Tableau Problem 3 – SQL Code

SQL Tableau Problem 3 – SQL Code


I don’t think that the solution given to problem 3 is accurate.
Firstly, from a pedantic point of view, the task is to calculate average salaries which can be filtered by department. The solution provided calculates the average salaries per department in SQL, then averages the selected departments in Tableau. I don’t think that this is ideal since the average of averages is not necessarily the same as calculating the overall average.
The main issue I have with the solution is that the SQL code calculates the average salaries by using the employee number to join the employees, salaries and dept_emp tables as follows:

     e.gender, d.dept_name, ROUND(AVG(s.salary), 2) AS salary, YEAR(s.from_date) AS calendar_year 
  FROM t_salaries s 
  JOIN t_employees e ON s.emp_no = e.emp_no 
  JOIN t_dept_emp de ON de.emp_no = e.emp_no 
  JOIN t_departments d ON d.dept_no = de.dept_no 
  GROUP BY d.dept_no , e.gender , calendar_year 
  HAVING calendar_year <= 2002 
  ORDER BY d.dept_no;

Since a single employee can have multiple entries in the dept_emp table (they can work in multiple departments in the course of their employment) and in the salaries table, there is a many-to-many relationship between these two tables. For example employee 10124 worked in department 4 from 1991 to 1999 and in department 6 from 1999 to 2002. The SQL code above will include their salary every year from 1991 to 2002 as being in department 4 and department 6 when calculating the average.
I think that it is necessary to follow an approach more like in problem 2: select the employee numbers, salaries, department numbers, start dates and end dates from t_salaries s JOIN t_dept_emp de ON s.emp_no=de.emp_no, then calculate for each row whether the department date range and salary date range overlap (valid rows), then join the valid rows to the employees and department tables and calculate the average.
The final thing that I noticed is that the properties of the sample dataset cause a lot of data to be excluded from the calculation anyway. There are about 300,000 distinct employee numbers in the dept_emp table, 100,000 in the salaries table and 135,000 in the employees table. However, there are only about 46,000 employees in the intersection between the employees and salaries tables, so at least half of the data in the salaries table is not being included in the averages. This does not particularly make a difference to the methodology, but you would hope that a real company would not be failing to pay so many of its staff while paying so many people who are not employees, so it may be a good idea to ensure that the same employees feature in the employees, salaries and dept_emp tables!

1 Answer

365 Team

Hi Philip!
Thanks for reaching out.
Regarding your last paragraph, may I please note that this dataset uses data that is considered to be fabricated and has been modified to fit the purposes of this course to a satisfactory level. Therefore, the observations you are making are valid; it is just that we’ve accepted/assumed they are not going to be a significant issue while solving the given task.
Please allow me to proceed with answering the rest of your questions in a couple of days. Thank you.
Hope this helps.

Complete Data Science Education
Get 50% OFF