Joining Data Tables: A Visual Guide with Austrian Hotels

Slide 1: Key Vocabulary

Data table joins combine information from multiple tables based on matching values.

Key Terms:

  • Primary Key: Unique identifier for each row (e.g., hotel_id)
  • Foreign Key: Column that references a primary key in another table
  • Join Key: Column(s) used to match rows between tables

Common Join Types:

  • Inner Join: Returns only matching rows
  • Left Join: Returns all rows from left table, matching rows from right
  • Right Join: Returns all rows from right table, matching rows from left
  • Full Join: Returns all rows from both tables

Slide 2: More Vocabulary

Relationship Types:

  • One-to-One: Each row in Table A matches exactly one row in Table B
    • Example: Hotel details and hotel star ratings
  • One-to-Many: Each row in Table A matches multiple rows in Table B
    • Example: Hotel to guest reviews
  • Many-to-Many: Multiple rows in Table A match multiple rows in Table B
    • Example: Hotels and amenities (each hotel has many amenities, each amenity exists in many hotels)

Tidy Data Principles: - Each variable forms a column - Each observation forms a row - Each type of observational unit forms a table


Slide 3: One-to-One Join (Perfect Match)

Scenario: Two tables with exactly the same hotels

Table A: Austrian Hotels | hotel_id | hotel_name | city | |———-|—————————-|———–| | 1 | Hotel Sacher | Vienna | | 2 | Hotel Imperial | Vienna | | 3 | Schloss Fuschl Resort | Salzburg | | 4 | Grand Hotel Wien | Vienna | | 5 | Hotel Goldener Hirsch | Salzburg |

Table B: Hotel Ratings | hotel_id | stars | avg_price_eur | |———-|——-|—————| | 1 | 5 | 450 | | 2 | 5 | 420 | | 3 | 5 | 380 | | 4 | 5 | 350 | | 5 | 5 | 320 |

Inner Join Result (same as Left, Right, and Full Join in this case):

hotels_with_ratings <- inner_join(hotels, ratings, by = "hotel_id")

Slide 4: One-to-One Join (Partial Match)

Scenario: Some hotels appear in one table but not the other

Table A: Austrian Hotels | hotel_id | hotel_name | city | |———-|—————————-|———–| | 1 | Hotel Sacher | Vienna | | 2 | Hotel Imperial | Vienna | | 3 | Schloss Fuschl Resort | Salzburg | | 4 | Grand Hotel Wien | Vienna | | 5 | Hotel Goldener Hirsch | Salzburg |

Table B: Boutique Hotels | hotel_id | is_boutique | room_count | |———-|————|————| | 1 | TRUE | 150 | | 2 | TRUE | 138 | | 5 | TRUE | 70 | | 6 | TRUE | 45 | | 7 | TRUE | 62 |

Different Join Types:

Inner Join (only matching hotel_ids):

boutique_overlap <- inner_join(hotels, boutique, by = "hotel_id")
# Returns hotels 1, 2, and 5

Left Join (all hotels from Table A):

all_hotels_boutique_info <- left_join(hotels, boutique, by = "hotel_id")
# Returns hotels 1-5, with NULL for boutique info for 3 and 4

Right Join (all boutique hotels):

all_boutique_hotel_info <- right_join(hotels, boutique, by = "hotel_id")
# Returns hotels 1, 2, 5, 6, 7 with NULL for hotel info for 6 and 7

Full Join (all hotels from both tables):

all_hotels_combined <- full_join(hotels, boutique, by = "hotel_id")
# Returns hotels 1-7, with NULLs where information is missing

Slide 5: One-to-One Join summary

Inner Join: Only keeps rows that exist in both tables (intersection) Left Join: Keeps all rows from the left table, adds matching data from right Right Join: Keeps all rows from the right table, adds matching data from left Full Join: Keeps all rows from both tables (union)

Economic Insight: Join type selection impacts analysis conclusions. For example, calculating average room prices would differ based on which hotels are included in the final dataset.


Slide 6: One-to-Many Join

Scenario: Each hotel has multiple guest reviews

Table A: Austrian Hotels | hotel_id | hotel_name | city | |———-|—————————-|———–| | 1 | Hotel Sacher | Vienna | | 2 | Hotel Imperial | Vienna | | 3 | Schloss Fuschl Resort | Salzburg |

Table B: Guest Reviews | review_id | hotel_id | rating | review_date | |———–|———-|——–|————-| | 101 | 1 | 4.8 | 2023-06-15 | | 102 | 1 | 4.7 | 2023-07-22 | | 103 | 1 | 4.9 | 2023-08-05 | | 104 | 2 | 4.6 | 2023-06-10 | | 105 | 2 | 4.8 | 2023-07-15 | | 106 | 3 | 4.5 | 2023-08-20 |

Join Result:

hotels_with_reviews <- left_join(hotels, reviews, by = "hotel_id")

This creates a table with 6 rows (one for each review) where hotel information is duplicated for hotels with multiple reviews.


Slide 7: One-to-Many Join Visualization

Key Points: - The resulting table has more rows than the “one” table - Information from the “one” table gets duplicated for each matching row in the “many” table - Common use: parent-child relationships in data (e.g., cities to buildings, companies to employees)

Data Analysis Impact: - Be careful with aggregations after a one-to-many join


Slide 8: Join with Composite Keys

Scenario: Hotels with seasonal pricing

Table A: Austrian Hotels | hotel_id | hotel_name | city | |———-|—————————-|———–| | 1 | Hotel Sacher | Vienna | | 2 | Hotel Imperial | Vienna | | 3 | Schloss Fuschl Resort | Salzburg |

Table B: Seasonal Hotel Pricing | hotel_id | season | avg_price_eur | occupancy_rate | |———-|———–|—————|—————-| | 1 | Summer | 520 | 0.92 | | 1 | Winter | 480 | 0.85 | | 1 | Christmas | 650 | 0.98 | | 2 | Summer | 490 | 0.88 | | 2 | Winter | 450 | 0.82 | | 3 | Summer | 420 | 0.95 | | 3 | Winter | 550 | 0.90 |

Composite Key Join:

seasonal_hotel_data <- left_join(hotels, seasonal_prices, by = c("hotel_id"))

This creates an incorrect result with duplicated rows. Instead, we need both the hotel_id and season to uniquely identify a record in the pricing table.

Proper Use Case: When joining hotel occupancy data that varies by both hotel and date/season.


Slide 9: Key Takeaways

  1. Choose the right join type based on your analytical needs:

    • Inner join for strict matching
    • Left/right join when you need to preserve all records from one table
    • Full join when you need all possible data
  2. Understand your data relationships:

    • One-to-one: Simple matching
    • One-to-many: Creates duplication of the “one” side
    • Many-to-many: Requires careful handling to avoid combinatorial explosion
  3. Composite keys are essential for:

    • economic data at geography* time level
  4. Data integrity is crucial:

    • Check for unexpected NULLs after joining
    • Validate row counts before and after joins
    • Consider foreign key constraints in database design

AI:

This is done with Claude Sonnet 3.7

“I am working on a presentation for economics students on ways of joining tidy data tables. Have 1 or 2 slides on vocabulary. Then 3 slides on 1:1. First, two tables exact same rows. Then second is fewer but full overlap. Third no full overlap, ie some rows only in 1 some in 2. Show what alternative types of join do. Then do 1 to many. Then join on composite. Create a nice example to carry though. One idea is that rows are hotels in Austria. Nice graphs are useful. No limit in slides, just focus on clarity and make it pretty. Can use tikz.”