Cocktail Recipe Analysis (Part 1)
Initiatives
Drinking alcohol is a great way to break the ice and network with people. As a person with a fairly low tolerance for alcohol, I prefer cocktails over other types of drinks. They look fancy, come in many different flavors, and contain enough alcohol to let me set my shy personality aside for a while.
However, cocktails also give me a big challenge when it comes to choosing the right one…
There are lots of cocktail recipes in this world, and every time I try a new bar only a few names are recognizable: mojito, old-fashioned, sex on the beach, cosmopolitan. Also, when it comes to making a decision, I am like a mild version of Chidi from “The Good Place”. Hence, it is very difficult to choose which cocktail I should drink! On top of that, some bars give their cocktails their own names, which makes it even easier to end up regretting my decision.
So I initiated this project: I will perform a thorough analysis of cocktail recipes to help my poor decision-making skills, and to help people who can relate to my situation.
Possible Solutions
There are several possible approaches to this problem:
- Recommendation System: Based on the information on the menu, I can develop a recommendation system to make an optimal decision.
- Ingredient Network Analysis: By analyzing the ingredients in recipes, I can learn which cocktail ingredients I like and dislike.
- Don’t drink cocktails: This is not an option for me.
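To make the ingredient-analysis idea a bit more concrete, here is a minimal sketch of how ingredient pairs could be counted across recipes. The recipes below are invented for illustration only, and the function name `ingredient_pair_counts` is my own; the real analysis may end up looking quite different.

```python
from collections import Counter
from itertools import combinations

# Toy recipes, invented for illustration only (not taken from any dataset)
toy_recipes = {
    "Drink A": {"rum": "2 oz", "lime juice": "1 oz", "mint": "6 leaves"},
    "Drink B": {"rum": "1 oz", "lime juice": "1 oz", "cola": "4 oz"},
    "Drink C": {"vodka": "2 oz", "lime juice": "0.5 oz"},
}

def ingredient_pair_counts(recipes):
    """Count how often each unordered pair of ingredients shares a recipe."""
    pair_counts = Counter()
    for ingredients in recipes.values():
        # sorted() gives every pair a canonical order, so (a, b) and (b, a) are counted together
        for pair in combinations(sorted(ingredients), 2):
            pair_counts[pair] += 1
    return pair_counts

pair_counts = ingredient_pair_counts(toy_recipes)
print(pair_counts.most_common(1))  # [(('lime juice', 'rum'), 2)]
```

Pairs that show up together often would become the heavy edges of an ingredient network.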
Before choosing a strategy, I should look at the data first. So, let’s collect some data!
Data Collection
Many websites publish cocktail recipes, but I believe drinksmixer.com has the most extensive collection. Let’s get our data!
Collecting links to individual cocktail recipes:
# import packages
import requests
from bs4 import BeautifulSoup
import time

base_url = "http://www.drinksmixer.com/cat/1/"
pages = range(1, 125)  # total 124 pages exist (2019.10.29)
cocktail_links = []
cocktail_name_list = []

for i in pages:
    # Set URL
    url = base_url + str(i)
    req = requests.get(url)
    html = req.text
    # Parse HTML with bs4
    soup = BeautifulSoup(html, 'html.parser')
    # Find all recipe links
    drinks_box = soup.find("div", {"class": "m1"}).find("div", {"class": "min"}).find("div", {"class": "clr"}).find("tr")
    urls_in_page = drinks_box.find_all("a")
    # Collect the link and the name of every cocktail listed on this page
    for link in urls_in_page:
        cocktail_links.append("http://www.drinksmixer.com" + link["href"])
        cocktail_name_list.append(link.text)

# Check when you collected your data
print("Links collected in {}".format(time.ctime()))
Links collected in Fri Oct 25 15:28:23 2019
print(len(cocktail_links),len(cocktail_name_list))
12334 12334
So, we have a total of 12334 recipes. Wow! I was overwhelmed by the number of results. I would say that justifies the whole point of this project. Now let’s get the ingredients!
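One side note on the link collection first. The loop above builds absolute URLs by string concatenation, which works as long as every `href` starts with a slash. A slightly more defensive sketch uses `urllib.parse.urljoin` from the standard library. The helper names and the example `href` below are hypothetical, since I have not checked every link format on the site.

```python
from urllib.parse import urljoin

SITE_ROOT = "http://www.drinksmixer.com"
CATEGORY_URL = "http://www.drinksmixer.com/cat/1/"

def build_page_urls(category_url, n_pages):
    """Return the URL of every listing page, numbered 1..n_pages."""
    return [category_url + str(page) for page in range(1, n_pages + 1)]

def to_absolute(href, root=SITE_ROOT):
    """Resolve a link found on a listing page against the site root."""
    return urljoin(root, href)

page_urls = build_page_urls(CATEGORY_URL, 124)
print(len(page_urls), page_urls[0], page_urls[-1])

# Hypothetical href, used only to show the behaviour
print(to_absolute("/drink1234.html"))
```

`urljoin` leaves an already absolute link untouched, so the same helper covers both cases.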
Collecting cocktail recipes
from selenium import webdriver

# Use your own path to locate your chromedriver
path_to_chromedriver = "/Users/nowgeun/Desktop/chromedriver"
# Note: this is the Selenium 3 API; Selenium 4 uses find_element(By.CLASS_NAME, ...) instead of find_element_by_class_name
driver = webdriver.Chrome(path_to_chromedriver)

# List to track progress
done = []
# Dictionaries to save our results
cocktail_recipes = {}
cocktail_instructions = {}

# Loop through the 12334 links we have collected
for one_url in cocktail_links:
    driver.get(one_url)
    # Cocktail name
    cocktail_name = driver.find_element_by_class_name("recipe_title").text
    # Cocktail recipe (ingredients)
    cocktail_recipe = driver.find_element_by_class_name("recipe_data").find_elements_by_class_name("ingredient")
    recipe_dict = {}  # Recipe of one cocktail: ingredient name -> amount
    for ingrdnt in cocktail_recipe:
        amount = ingrdnt.find_element_by_class_name("amount").text
        ing_name = ingrdnt.find_element_by_class_name("name").text
        recipe_dict[ing_name] = amount
    cocktail_inst = driver.find_element_by_xpath("//*[@class='RecipeDirections instructions']")
    # Save the ingredients and instructions of this cocktail, keyed by its name
    cocktail_recipes[cocktail_name] = recipe_dict
    cocktail_instructions[cocktail_name] = cocktail_inst.text.strip()
    # Record the finished URL to check consistency later
    done.append(one_url)

# Close the browser when finished
driver.quit()
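Visiting more than twelve thousand pages takes a while, and a single page with an unexpected layout would stop the loop above with an exception. The sketch below shows one way the `done` list could be used to resume an interrupted run and to set failed pages aside. It is not the code I ran: `scrape_all` and `fake_fetch` are hypothetical names, and `fake_fetch` merely stands in for the Selenium calls.

```python
def scrape_all(links, fetch_one, results=None, done=None, failed=None):
    """Visit each link once, skipping links already in `done`, and record failures.

    fetch_one(url) must return a (name, recipe_dict) pair or raise an exception.
    """
    results = {} if results is None else results
    done = [] if done is None else done
    failed = [] if failed is None else failed
    already_done = set(done)
    for url in links:
        if url in already_done:
            continue  # finished in an earlier run
        try:
            name, recipe = fetch_one(url)
        except Exception as exc:
            # Keep going, but remember which page failed and why
            failed.append((url, repr(exc)))
            continue
        results[name] = recipe
        done.append(url)
        already_done.add(url)
    return results, done, failed

def fake_fetch(url):
    """Stand-in for the browser code: fails on one made-up URL."""
    if url.endswith("broken"):
        raise ValueError("page layout not recognised")
    return url.rsplit("/", 1)[-1], {"gin": "2 oz"}

links = ["http://example.com/a", "http://example.com/broken", "http://example.com/b"]
results, done, failed = scrape_all(links, fake_fetch)
print(sorted(results), len(done), len(failed))  # ['a', 'b'] 2 1
```

Calling `scrape_all` again with the same `results` and `done` only retries the pages that have not succeeded yet.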
Let’s check our collected data.
# There should be an equal number of cocktail names and recipes
len(cocktail_name_list) == len(cocktail_recipes.keys())
False
Wait, why is False returned here? I assumed that some of the recipes were redundant, so their names might have been repeated in cocktail_name_list.
# Duplicate names were in the list
assert len(set(cocktail_name_list)) == len(cocktail_recipes.keys())
print(len(cocktail_recipes.keys()))
12242
Booyah! That’s what I thought. So we have 12242 unique recipes in our dataset.
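The difference, 12334 - 12242 = 92, is the number of list entries whose name had already appeared. To see which names repeat, `collections.Counter` is handy. Below is a minimal sketch on a made-up list of names; `duplicated_names` is a hypothetical helper, not something I used in the scraping code.

```python
from collections import Counter

def duplicated_names(names):
    """Return {name: count} for every name that occurs more than once."""
    return {name: n for name, n in Counter(names).items() if n > 1}

# Made-up names, for illustration only
toy_names = ["Mojito", "Cosmopolitan", "Mojito", "Old Fashioned", "Mojito"]

dupes = duplicated_names(toy_names)
print(dupes)                                  # {'Mojito': 3}
print(len(toy_names) - len(set(toy_names)))   # 2 repeated entries
```

One caveat: a dictionary keyed by cocktail name keeps only the last recipe stored under that name, so if two different recipes share a name, the earlier one is overwritten.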
Before continuing, let’s save the data first.
Saving Data using pickle
import pickle
with open("./pickle_data/list_of_cocktail_recipe_links.pickle", "wb") as a:
    pickle.dump(cocktail_links, a)
with open("./pickle_data/list_of_cocktail_names.pickle", "wb") as b:
    pickle.dump(cocktail_name_list, b)
with open("./pickle_data/cocktail_recipe_dict.pickle", "wb") as c:
    pickle.dump(cocktail_recipes, c)
with open("./pickle_data/cocktail_recipe_instructions.pickle", "wb") as d:
    pickle.dump(cocktail_instructions, d)
We saved our data. Let’s double check whether the loaded data is consistent with what we have.
Loading Data using pickle
# Load each object from the file it was saved to
with open("./pickle_data/list_of_cocktail_recipe_links.pickle", "rb") as j:
    cocktail_links2 = pickle.load(j)
with open("./pickle_data/list_of_cocktail_names.pickle", "rb") as k:
    cocktail_name_list2 = pickle.load(k)
with open("./pickle_data/cocktail_recipe_dict.pickle", "rb") as l:
    cocktail_recipes2 = pickle.load(l)
with open("./pickle_data/cocktail_recipe_instructions.pickle", "rb") as m:
    cocktail_instructions2 = pickle.load(m)

# Compare every loaded object with the one still in memory
assert cocktail_links == cocktail_links2
assert cocktail_name_list == cocktail_name_list2
assert cocktail_recipes == cocktail_recipes2
assert cocktail_instructions == cocktail_instructions2
print("Everything is Fine")
Everything is Fine
Great! Our loaded data is consistent with what we have collected.
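The save and load steps above repeat the same few lines for every file, so they could be wrapped in two small helpers. This is only a sketch with hypothetical names (`save_pickle`, `load_pickle`) and a toy dictionary; it writes to a temporary directory so that it does not touch the real data. It also creates the target folder if it is missing, because `open` fails when `./pickle_data/` does not exist yet.

```python
import pickle
import tempfile
from pathlib import Path

def save_pickle(obj, path):
    """Pickle obj to path, creating the parent folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Load and return the object pickled at path."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy data, for illustration only
toy_recipes = {"Drink A": {"rum": "2 oz", "lime juice": "1 oz"}}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "pickle_data" / "toy_recipes.pickle"
    save_pickle(toy_recipes, target)
    reloaded = load_pickle(target)

print(reloaded == toy_recipes)  # True
```

As usual with pickle, only load files you created yourself, since unpickling untrusted data can execute arbitrary code.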
To be continued in Part 2…
Let’s call it a day and resume the work later! In the next part, I will explore the data, look for problems, and clean it up.