Difficulties of Joining Data Tables

Background: combining datasets

Gábor Békés (CEU)

2025-03-10

Background to Data Analyis with AI: combining datasets

Combination

The problem * You have two datasets, and for analysis want to combine * Example: WVS and World Bank

Key issues: * Identify UIDs in each datasets * How to deal with imperfect matches * Which join?

Reading * BK Chapter 02

Unique identifiers (UID)

  • Tidy data: Each observation in a dataset A has a UID
    • country code
    • hotel ID
    • survey respondent ID
  • Ideally it is not a name but a code (numeric, hex, alphanumeric)
    • ensures it’s unique
      • names are not unique, may be misspelled

Unique identifiers (UID)

  • Sometimes it is in the data
  • Sometimes we have to create it
    • names – > UID
    • entity resolution (disambigouation )
  • Tidy data: UID may be based on multiple variables
    • country*year

Join types

Join Type SQL R (dplyr) Python (pandas) Stata (merge)
Inner Join SELECT * FROM A INNER JOIN B ON A.ID = B.ID; inner_join(A, B, by = "ID") A.merge(B, on="ID", how="inner") merge 1:1 ID using B
Left Join SELECT * FROM A LEFT JOIN B ON A.ID = B.ID; left_join(A, B, by = "ID") A.merge(B, on="ID", how="left") merge 1:1 ID using B, keep(1 3)
Right Join SELECT * FROM A RIGHT JOIN B ON A.ID = B.ID; right_join(A, B, by = "ID") A.merge(B, on="ID", how="right") merge 1:1 ID using B, keep(2 3)
Full Join SELECT * FROM A FULL OUTER JOIN B ON A.ID = B.ID; full_join(A, B, by = "ID") A.merge(B, on="ID", how="outer") merge 1:1 ID using B, keep(1 2 3)
Cross Join SELECT * FROM A CROSS JOIN B; cross_join(A, B) (needs {dplyr} extension) A.merge(B, how="cross") No built-in command; use gen to create all combinations

Types of Joins & Their Challenges

Common Join Types - inner_join(): Only matching records - left_join(): All from left + matches from right - right_join(): All from right + matches from left - full_join(): All records from both tables - cross_join(): Every combination of both tables

Example: - Table A: Customers - Table B: Orders

Table A (Customers)
+----+--------+
| ID | Name   |
+----+--------+
| 1  | Alice  |
| 2  | Bob    |
| 3  | Charlie|
+----+--------+

Table B (Orders)
+----+---------+
| ID | Order  |
+----+---------+
| 1  | Laptop |
| 2  | Phone  |
| 4  | Tablet |
+----+---------+
Join Type Result
inner_join() Alice - Laptop, Bob - Phone (ID 1,2)
left_join() Alice - Laptop, Bob - Phone, Charlie - NA (ID 3 remains)
right_join() Alice - Laptop, Bob - Phone, NA - Tablet (ID 4 remains)
full_join() Alice - Laptop, Bob - Phone, Charlie - NA, NA - Tablet

Key Matching Issues

Common Problems: - Typos & Formatting: “Alice” vs “alice” - Case Sensitivity: “New York” ≠ “NEW YORK” - Different Formats: “2024-03-05” vs “05/03/2024” - Extra Spaces: “Bob” vs “Bob”

Example Issue:

Table A (Users)
+----+--------+
| ID | Name   |
+----+--------+
| 1  | Alice  |
| 2  | Bob    |
| 3  | Charlie|
+----+--------+

Table B (Purchases)
+----+--------+
| ID | Item   |
+----+--------+
| 1  | Laptop |
| 2  | Phone  |
| 3  |Tablet  |  <-- Extra space typo!
+----+--------+

Solution: - Use str_trim(), tolower() for text matching. - Convert date formats before joining (lubridate::ymd()).

Many-to-Many and One-to-Many Issues

Example:

Table A (Students)
+----+--------+
| ID | Name   |
+----+--------+
| 1  | Alice  |
| 2  | Bob    |
+----+--------+

Table B (Courses)
+----+----------+
| ID | Course  |
+----+----------+
| 1  | Math    |
| 1  | Physics |
| 2  | History |
+----+----------+

Problem: - Joining creates duplicate rows if not handled properly.

Student Course
Alice Math
Alice Physics
Bob History

Fix:
- Use aggregations (group_by() %>% summarise()). - Avoid unwanted row expansion with careful join conditions.

Handling Missing Data from Joins

Example:

Table A (Employees)
+----+--------+
| ID | Name   |
+----+--------+
| 1  | Alice  |
| 2  | Bob    |
| 3  | Charlie|
+----+--------+

Table B (Salaries)
+----+------+
| ID | Pay  |
+----+------+
| 1  | 5000 |
| 3  | 4500 |
+----+------+
Employee Salary
Alice 5000
Bob NA
Charlie 4500

Issues: - NA appears in left_join() when there is no match. - If used in calculations (sum(), mean()), NA may cause issues.

Solution: - Use replace_na(Salary, 0) to handle missing values.

Performance Issues with Large Joins

Why Joins Get Slow? - Large tables: Millions of rows - Unindexed keys: Searching without an index is costly - Multiple joins: Query complexity grows

Example in R (dplyr):

library(dplyr)
orders %>% left_join(customers, by = "customer_id")
  • If customer_id is not indexed, operation is slow.

Optimizations: - Use Indexing (data.table or database indexing). - Filter Early: Reduce dataset before joining. - Check Join Order: Place smaller tables first in join operations.

Summary

  • Key Matching Issues: Data formatting problems.
  • Many-to-Many Pitfalls: Avoid unwanted row duplication.
  • Handling Missing Data: NA can break calculations.
  • Performance Concerns: Optimize joins with indexing.