Difficulties of Joining Data Tables

Background: combining datasets

Gábor Békés (CEU)

2025-03-10

Background to Data Analyis with AI: combining datasets

Combination

The problem * You have two datasets, and for analysis want to combine * Example: WVS and World Bank

Key issues: * Identify UIDs in each datasets * How to deal with imperfect matches * Which join?

Reading * BK Chapter 02

Unique identifiers (UID)

Tidy data: Each observation in a dataset A has a UID
- country code
- hotel ID
- survey respondent ID
Ideally it is not a name but a code (numeric, hex, alphanumeric)
- ensures it’s unique
  - names are not unique, may be misspelled

Unique identifiers (UID)

Sometimes it is in the data
Sometimes we have to create it
- names – > UID
- entity resolution (disambigouation )
Tidy data: UID may be based on multiple variables
- country*year

Join types

Join Type	SQL	R (`dplyr`)	Python (`pandas`)	Stata (`merge`)
Inner Join	`SELECT * FROM A INNER JOIN B ON A.ID = B.ID;`	`inner_join(A, B, by = "ID")`	`A.merge(B, on="ID", how="inner")`	`merge 1:1 ID using B`
Left Join	`SELECT * FROM A LEFT JOIN B ON A.ID = B.ID;`	`left_join(A, B, by = "ID")`	`A.merge(B, on="ID", how="left")`	`merge 1:1 ID using B, keep(1 3)`
Right Join	`SELECT * FROM A RIGHT JOIN B ON A.ID = B.ID;`	`right_join(A, B, by = "ID")`	`A.merge(B, on="ID", how="right")`	`merge 1:1 ID using B, keep(2 3)`
Full Join	`SELECT * FROM A FULL OUTER JOIN B ON A.ID = B.ID;`	`full_join(A, B, by = "ID")`	`A.merge(B, on="ID", how="outer")`	`merge 1:1 ID using B, keep(1 2 3)`
Cross Join	`SELECT * FROM A CROSS JOIN B;`	`cross_join(A, B)` (needs `{dplyr}` extension)	`A.merge(B, how="cross")`	No built-in command; use `gen` to create all combinations

Types of Joins & Their Challenges

Common Join Types - inner_join(): Only matching records - left_join(): All from left + matches from right - right_join(): All from right + matches from left - full_join(): All records from both tables - cross_join(): Every combination of both tables

Example: - Table A: Customers - Table B: Orders

Table A (Customers)
+----+--------+
| ID | Name   |
+----+--------+
| 1  | Alice  |
| 2  | Bob    |
| 3  | Charlie|
+----+--------+

Table B (Orders)
+----+---------+
| ID | Order  |
+----+---------+
| 1  | Laptop |
| 2  | Phone  |
| 4  | Tablet |
+----+---------+

Join Type	Result
`inner_join()`	Alice - Laptop, Bob - Phone (ID 1,2)
`left_join()`	Alice - Laptop, Bob - Phone, Charlie - NA (ID 3 remains)
`right_join()`	Alice - Laptop, Bob - Phone, NA - Tablet (ID 4 remains)
`full_join()`	Alice - Laptop, Bob - Phone, Charlie - NA, NA - Tablet

Key Matching Issues

Common Problems: - Typos & Formatting: “Alice” vs “alice” - Case Sensitivity: “New York” ≠ “NEW YORK” - Different Formats: “2024-03-05” vs “05/03/2024” - Extra Spaces: “Bob” vs “Bob”

Example Issue:

Table A (Users)
+----+--------+
| ID | Name   |
+----+--------+
| 1  | Alice  |
| 2  | Bob    |
| 3  | Charlie|
+----+--------+

Table B (Purchases)
+----+--------+
| ID | Item   |
+----+--------+
| 1  | Laptop |
| 2  | Phone  |
| 3  |Tablet  |  <-- Extra space typo!
+----+--------+

Solution: - Use str_trim(), tolower() for text matching. - Convert date formats before joining (lubridate::ymd()).

Many-to-Many and One-to-Many Issues

Example:

Table A (Students)
+----+--------+
| ID | Name   |
+----+--------+
| 1  | Alice  |
| 2  | Bob    |
+----+--------+

Table B (Courses)
+----+----------+
| ID | Course  |
+----+----------+
| 1  | Math    |
| 1  | Physics |
| 2  | History |
+----+----------+

Problem: - Joining creates duplicate rows if not handled properly.

Student	Course
Alice	Math
Alice	Physics
Bob	History

Fix:
- Use aggregations (group_by() %>% summarise()). - Avoid unwanted row expansion with careful join conditions.

Handling Missing Data from Joins

Example:

Table A (Employees)
+----+--------+
| ID | Name   |
+----+--------+
| 1  | Alice  |
| 2  | Bob    |
| 3  | Charlie|
+----+--------+

Table B (Salaries)
+----+------+
| ID | Pay  |
+----+------+
| 1  | 5000 |
| 3  | 4500 |
+----+------+

Employee	Salary
Alice	5000
Bob	NA
Charlie	4500

Issues: - NA appears in left_join() when there is no match. - If used in calculations (sum(), mean()), NA may cause issues.

Solution: - Use replace_na(Salary, 0) to handle missing values.

Performance Issues with Large Joins

Why Joins Get Slow? - Large tables: Millions of rows - Unindexed keys: Searching without an index is costly - Multiple joins: Query complexity grows

Example in R (dplyr):

library(dplyr)
orders %>% left_join(customers, by = "customer_id")

If customer_id is not indexed, operation is slow.

Optimizations: - Use Indexing (data.table or database indexing). - Filter Early: Reduce dataset before joining. - Check Join Order: Place smaller tables first in join operations.

Summary

Key Matching Issues: Data formatting problems.
Many-to-Many Pitfalls: Avoid unwanted row duplication.
Handling Missing Data: NA can break calculations.
Performance Concerns: Optimize joins with indexing.