Data Analysis/Query
리트코드 : 262. Trips and Users
베짱이28호
2024. 4. 3. 23:59
리트코드 : 262. Trips and Users
문제
Table: Trips
+-------------+----------+
| Column Name | Type |
+-------------+----------+
| id | int |
| client_id | int |
| driver_id | int |
| city_id | int |
| status | enum |
| request_at | date |
+-------------+----------+
id is the primary key (column with unique values) for this table.
The table holds all taxi trips. Each trip has a unique id, while client_id and driver_id are foreign keys to the users_id at the Users table.
Status is an ENUM (category) type of ('completed', 'cancelled_by_driver', 'cancelled_by_client').
Table: Users
+-------------+----------+
| Column Name | Type |
+-------------+----------+
| users_id | int |
| banned | enum |
| role | enum |
+-------------+----------+
users_id is the primary key (column with unique values) for this table.
The table holds all users. Each user has a unique users_id, and role is an ENUM type of ('client', 'driver', 'partner').
banned is an ENUM (category) type of ('Yes', 'No').
The cancellation rate is computed by dividing the number of canceled (by client or driver) requests with unbanned users by the total number of requests with unbanned users on that day.
Write a solution to find the cancellation rate of requests with unbanned users (both client and driver must not be banned) each day between "2013-10-01" and "2013-10-03". Round Cancellation Rate to two decimal points.
Return the result table in any order.
The result format is in the following example.
Example 1:
Input:
Trips table:
+----+-----------+-----------+---------+---------------------+------------+
| id | client_id | driver_id | city_id | status | request_at |
+----+-----------+-----------+---------+---------------------+------------+
| 1 | 1 | 10 | 1 | completed | 2013-10-01 |
| 2 | 2 | 11 | 1 | cancelled_by_driver | 2013-10-01 |
| 3 | 3 | 12 | 6 | completed | 2013-10-01 |
| 4 | 4 | 13 | 6 | cancelled_by_client | 2013-10-01 |
| 5 | 1 | 10 | 1 | completed | 2013-10-02 |
| 6 | 2 | 11 | 6 | completed | 2013-10-02 |
| 7 | 3 | 12 | 6 | completed | 2013-10-02 |
| 8 | 2 | 12 | 12 | completed | 2013-10-03 |
| 9 | 3 | 10 | 12 | completed | 2013-10-03 |
| 10 | 4 | 13 | 12 | cancelled_by_driver | 2013-10-03 |
+----+-----------+-----------+---------+---------------------+------------+
Users table:
+----------+--------+--------+
| users_id | banned | role |
+----------+--------+--------+
| 1 | No | client |
| 2 | Yes | client |
| 3 | No | client |
| 4 | No | client |
| 10 | No | driver |
| 11 | No | driver |
| 12 | No | driver |
| 13 | No | driver |
+----------+--------+--------+
Output:
+------------+-------------------+
| Day | Cancellation Rate |
+------------+-------------------+
| 2013-10-01 | 0.33 |
| 2013-10-02 | 0.00 |
| 2013-10-03 | 0.50 |
+------------+-------------------+
Explanation:
On 2013-10-01:
- There were 4 requests in total, 2 of which were canceled.
- However, the request with Id=2 was made by a banned client (User_Id=2), so it is ignored in the calculation.
- Hence there are 3 unbanned requests in total, 1 of which was canceled.
- The Cancellation Rate is (1 / 3) = 0.33
On 2013-10-02:
- There were 3 requests in total, 0 of which were canceled.
- The request with Id=6 was made by a banned client, so it is ignored.
- Hence there are 2 unbanned requests in total, 0 of which were canceled.
- The Cancellation Rate is (0 / 2) = 0.00
On 2013-10-03:
- There were 3 requests in total, 1 of which was canceled.
- The request with Id=8 was made by a banned client, so it is ignored.
- Hence there are 2 unbanned request in total, 1 of which were canceled.
- The Cancellation Rate is (1 / 2) = 0.50
문제 풀이
MySQL
# SOLUTION1
WITH UNBANNED_WINDOW AS (
SELECT *
FROM TRIPS
WHERE (
REQUEST_AT BETWEEN '2013-10-01' AND '2013-10-03' AND
CLIENT_ID IN (SELECT USERS_ID FROM USERS WHERE BANNED = 'No') AND
DRIVER_ID IN (SELECT USERS_ID FROM USERS WHERE BANNED = 'No'))
),
COUNTING AS (
SELECT
REQUEST_AT,
COUNT(REQUEST_AT) AS TOTAL,
SUM(CASE WHEN STATUS = 'completed' THEN 1 ELSE 0 END) AS COMPLETED
FROM UNBANNED_WINDOW
GROUP BY REQUEST_AT
)
SELECT COUNTING.REQUEST_AT AS Day,
round(1-(COUNTING.COMPLETED/COUNTING.TOTAL),2) AS "Cancellation Rate"
FROM COUNTING
- SOLUTION1
- CTE 써서 필터 조건 하나 걸어주고, 그걸 바탕으로 개수를 센 테이블 만들어주기.
- 개수를 센 테이블 바탕으로 정답 테이블 만들어주기.
# SOLUTION2
SELECT
trips.request_at AS Day,
ROUND(SUM(IF(trips.status = 'completed',0,1))/COUNT(trips.status),2) AS 'Cancellation Rate'
FROM trips
WHERE trips.client_id NOT IN (SELECT users_id FROM users WHERE users.banned = 'Yes')
AND trips.driver_id NOT IN (SELECT users_id FROM users WHERE users.banned = 'Yes')
AND trips.request_at IN ('2013-10-01', '2013-10-02', '2013-10-03')
GROUP BY Day;
- SOLUTION2
- 제 풀이는 아닌데 깔끔하고 속도 빨라서 리트코드 정답에서 가져왔음
- 기존에 CTE를 WHERE문 조건에 다 넣고, 바로 조건문에서 카운팅이랑 한 번에 해버리기.
- CASE WHEN 대신 IF를 사용해서 완료면 0, 아니면 1로 값을 채워서 한 번에 계산.
Pandas
import pandas as pd
from datetime import datetime, timedelta
def trips_and_users(trips: pd.DataFrame, users: pd.DataFrame) -> pd.DataFrame:
window = trips[trips['request_at'].between('2013-10-01','2013-10-03')]
unbanned = users[users['banned']=='No']
join = pd.merge(window, unbanned, left_on='client_id', right_on='users_id')
join = pd.merge(join, unbanned, left_on='driver_id', right_on='users_id')
grouped = join.groupby(['request_at', 'status']).size().unstack(fill_value=0)
all_statuses = ['completed', 'cancelled_by_driver', 'cancelled_by_client']
for status in all_statuses:
if status not in grouped.columns:
grouped[status] = 0
grouped['Cancellation Rate'] = ((grouped['cancelled_by_client'] + grouped['cancelled_by_driver']) / grouped.sum(axis=1)).round(2)
grouped.reset_index(inplace=True)
grouped.rename(columns={'request_at':'Day'}, inplace=True)
return grouped[['Day', 'Cancellation Rate']]
- 조인을 두 번 해줘야 한다. 해시를 통해서 client, driver 모두 체크해도 상관 없긴 한데 join 두 번으로 풀었다.
- client가 unbanned join table을 만들어서 한 번 필터링 한 후, driver가 unbanned인 경우를 필터링 해준다.
- 이후 pivot table / gruopby unstack을 통해서 표를 만들어야 한다. (이게 다루기 편함)
- unstack을 통해서 피벗 테이블 형태로 만들어주고, 기존에 비어있던 컬럼은 None 대신 fill_value를 통해서 0으로 채워준다. (연산 오류 방지)
- pivot table을 만드는 과정에서 기존 테이블에 있었던 columns 속성들도 모두 가져와야 한다.
- 이렇게 안하면 columns 원소가 존재하지 않는 경우에 키 접근이 안돼서 에러 발생 가능.
코멘트
- SQL에서는 SELECT에서 조건문에 IF나 CASE WHEN써서 줄줄이 쓰는게 익숙하지 않다...
- pandas 에서는 조인 두 번 해야된다는 거를 놓쳐서 조금 걸렸음...