Background: combining datasets
2025-03-10
The problem * You have two datasets, and for analysis want to combine * Example: WVS and World Bank
Key issues: * Identify UIDs in each datasets * How to deal with imperfect matches * Which join?
Reading * BK Chapter 02
| Join Type | SQL | R (dplyr) |
Python (pandas) |
Stata (merge) |
|---|---|---|---|---|
| Inner Join | SELECT * FROM A INNER JOIN B ON A.ID = B.ID; |
inner_join(A, B, by = "ID") |
A.merge(B, on="ID", how="inner") |
merge 1:1 ID using B |
| Left Join | SELECT * FROM A LEFT JOIN B ON A.ID = B.ID; |
left_join(A, B, by = "ID") |
A.merge(B, on="ID", how="left") |
merge 1:1 ID using B, keep(1 3) |
| Right Join | SELECT * FROM A RIGHT JOIN B ON A.ID = B.ID; |
right_join(A, B, by = "ID") |
A.merge(B, on="ID", how="right") |
merge 1:1 ID using B, keep(2 3) |
| Full Join | SELECT * FROM A FULL OUTER JOIN B ON A.ID = B.ID; |
full_join(A, B, by = "ID") |
A.merge(B, on="ID", how="outer") |
merge 1:1 ID using B, keep(1 2 3) |
| Cross Join | SELECT * FROM A CROSS JOIN B; |
cross_join(A, B) (needs {dplyr} extension) |
A.merge(B, how="cross") |
No built-in command; use gen to create all combinations |
Common Join Types - inner_join(): Only matching records - left_join(): All from left + matches from right - right_join(): All from right + matches from left - full_join(): All records from both tables - cross_join(): Every combination of both tables
Example: - Table A: Customers - Table B: Orders
Table A (Customers)
+----+--------+
| ID | Name |
+----+--------+
| 1 | Alice |
| 2 | Bob |
| 3 | Charlie|
+----+--------+
Table B (Orders)
+----+---------+
| ID | Order |
+----+---------+
| 1 | Laptop |
| 2 | Phone |
| 4 | Tablet |
+----+---------+
| Join Type | Result |
|---|---|
inner_join() |
Alice - Laptop, Bob - Phone (ID 1,2) |
left_join() |
Alice - Laptop, Bob - Phone, Charlie - NA (ID 3 remains) |
right_join() |
Alice - Laptop, Bob - Phone, NA - Tablet (ID 4 remains) |
full_join() |
Alice - Laptop, Bob - Phone, Charlie - NA, NA - Tablet |
Common Problems: - Typos & Formatting: “Alice” vs “alice” - Case Sensitivity: “New York” ≠ “NEW YORK” - Different Formats: “2024-03-05” vs “05/03/2024” - Extra Spaces: “Bob” vs “Bob”
Example Issue:
Table A (Users)
+----+--------+
| ID | Name |
+----+--------+
| 1 | Alice |
| 2 | Bob |
| 3 | Charlie|
+----+--------+
Table B (Purchases)
+----+--------+
| ID | Item |
+----+--------+
| 1 | Laptop |
| 2 | Phone |
| 3 |Tablet | <-- Extra space typo!
+----+--------+
Solution: - Use str_trim(), tolower() for text matching. - Convert date formats before joining (lubridate::ymd()).
Example:
Table A (Students)
+----+--------+
| ID | Name |
+----+--------+
| 1 | Alice |
| 2 | Bob |
+----+--------+
Table B (Courses)
+----+----------+
| ID | Course |
+----+----------+
| 1 | Math |
| 1 | Physics |
| 2 | History |
+----+----------+
Problem: - Joining creates duplicate rows if not handled properly.
| Student | Course |
|---|---|
| Alice | Math |
| Alice | Physics |
| Bob | History |
Fix:
- Use aggregations (group_by() %>% summarise()). - Avoid unwanted row expansion with careful join conditions.
Example:
Table A (Employees)
+----+--------+
| ID | Name |
+----+--------+
| 1 | Alice |
| 2 | Bob |
| 3 | Charlie|
+----+--------+
Table B (Salaries)
+----+------+
| ID | Pay |
+----+------+
| 1 | 5000 |
| 3 | 4500 |
+----+------+
| Employee | Salary |
|---|---|
| Alice | 5000 |
| Bob | NA |
| Charlie | 4500 |
Issues: - NA appears in left_join() when there is no match. - If used in calculations (sum(), mean()), NA may cause issues.
Solution: - Use replace_na(Salary, 0) to handle missing values.
Why Joins Get Slow? - Large tables: Millions of rows - Unindexed keys: Searching without an index is costly - Multiple joins: Query complexity grows
Example in R (dplyr):
customer_id is not indexed, operation is slow.Optimizations: - Use Indexing (data.table or database indexing). - Filter Early: Reduce dataset before joining. - Check Join Order: Place smaller tables first in join operations.
Data Analysis with AI - 2025