Since October 2018, I’ve watched David Robinson’s Tidy Tuesday screencasts and learned so much about data analysis in R. As a result, I’m writing a series of posts called Independent Learning from Tidy Tuesday Screencast. These are mostly written so that I can refer to them in the future, but by sharing these I hope they serve as useful cheatsheets for data analysis in the tidyverse.
Why should you read these posts instead of just reading through David’s code from the screencasts? Well, these posts include interpretations of graphs, tricks for better data visualization and manipulation, and advice about data analysis that David talks about in his screencasts, but doesn’t write down. Hope you enjoy.
```r
medium_processed %>%
  ggplot(aes(claps)) +
  geom_histogram() +
  labs(x = "Claps", y = "Count") +
  scale_x_log10(labels = scales::comma_format())
```
- The graph roughly follows a log-normal distribution (claps look approximately normal on a log scale)
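For log-normal-ish data like this, a geometric mean summarizes the "typical" value better than an arithmetic mean, which is why the analysis below computes `exp(mean(log(claps + 1))) - 1`. A minimal sketch with made-up clap counts (not the Medium dataset):

```r
# Made-up clap counts with one big outlier, for illustration only
claps <- c(0, 1, 2, 5, 10, 1000)

mean(claps)                      # arithmetic mean: dominated by the outlier
exp(mean(log(claps + 1))) - 1    # geometric mean: a more "typical" value
                                 # (+1 / -1 handles the zeros before taking logs)
```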
```r
medium_words <- medium_processed %>%
  filter(!is.na(title)) %>%
  transmute(post_id, title, subtitle, year, reading_time, claps) %>%
  unnest_tokens(word, title) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!(word %in% c("de", "en", "la", "para")),
         str_detect(word, "[a-z]"))

medium_words %>%
  count(word, sort = TRUE) %>%
  mutate(word = fct_reorder(word, n)) %>%
  head(20) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = NULL) +
  ggtitle("Common Words in Medium Post Titles")
```
- Text Mining Tip: remove all the stop words with `anti_join(stop_words, by = "word")`
- Common Filtering Step: filter out numbers with `filter(str_detect(word, "[a-z]"))` (keeps only tokens that contain letters)
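The tokenize-then-filter steps above can be seen on a tiny made-up example (assumed titles, not the Medium dataset):

```r
library(dplyr)
library(stringr)
library(tidytext)

# Two made-up post titles, for illustration only
titles <- tibble(
  post_id = 1:2,
  title = c("10 Tips for Deep Learning", "La guia de machine learning")
)

title_words <- titles %>%
  unnest_tokens(word, title) %>%                      # one row per word, lowercased
  anti_join(stop_words, by = "word") %>%              # drops English stop words like "for"
  filter(!(word %in% c("de", "en", "la", "para")),    # drops common Spanish stop words
         str_detect(word, "[a-z]"))                   # drops pure numbers like "10"

title_words
```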
```r
medium_words_filtered <- medium_words %>%
  add_count(word) %>%
  filter(n >= 250)

tag_claps <- medium_words_filtered %>%
  group_by(word) %>%
  summarize(median_claps = median(claps),
            geometric_mean_claps = exp(mean(log(claps + 1))) - 1,
            occurences = n()) %>%
  arrange(desc(median_claps))

library(widyr)

top_word_cors <- medium_words_filtered %>%
  select(post_id, word) %>%
  pairwise_cor(word, post_id, sort = TRUE) %>%
  head(100)

library(ggraph)
library(igraph)

vertices <- tag_claps %>%
  filter(word %in% top_word_cors$item1 |
           word %in% top_word_cors$item2)

set.seed(2018)

top_word_cors %>%
  graph_from_data_frame(vertices = vertices) %>%
  ggraph() +
  geom_edge_link() +
  geom_node_point(aes(size = occurences * 1.1)) +
  geom_node_point(aes(size = occurences,
                      color = geometric_mean_claps)) +
  geom_node_text(aes(label = name), repel = TRUE) +
  scale_color_gradient2(low = "blue", high = "red", midpoint = 10) +
  labs(color = "Claps (mean)",
       title = "What's hot and what's not in Medium data articles?",
       subtitle = "Color shows the geometric mean of # of claps on articles with this word in the title") +
  theme_void()
```
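The `pairwise_cor()` step is the heart of the network: it computes the phi coefficient between every pair of words based on whether they appear in the same post. A toy sketch with made-up post/word pairs:

```r
library(dplyr)
library(widyr)

# Made-up post/word pairs, for illustration only
words <- tribble(
  ~post_id, ~word,
  1, "machine",
  1, "learning",
  2, "machine",
  2, "learning",
  3, "deep",
  3, "learning",
  4, "networks"
)

word_cors <- words %>%
  pairwise_cor(word, post_id, sort = TRUE)

word_cors
# "machine" and "learning" correlate positively (they share posts);
# "networks" correlates negatively with the rest (it never co-occurs)
```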
Network Graphs with the Grammar of Graphics
- Every node is an observation in the data, and every edge represents a correlation between two words
- Add theme_void() because you don’t need axes
- We see “clusters” of words: (artificial, intelligence), (machine, learning, deep, reinforcement), (neural, networks, network)
- `graph_from_data_frame(vertices = vertices)` attaches the vertex attributes (occurrences and claps) to the network
- `size = occurences` sizes the nodes according to the number of occurrences (how common is each word?)
- `color = geometric_mean_claps` shows which clusters tend to get more claps than others, so we can spot the clusters that are particularly popular: what’s hot and what’s not on Medium?
- Deep Learning, Keras, Tensorflow are hot titles on Medium
- Artificial Intelligence and Business are not as hot
- Simple Linear Regression is also a hot title
- Pro Tip: You can give each point an outline by adding a slightly larger point underneath it: `geom_node_point(aes(size = occurences * 1.1))`
- The graph becomes a little more interpretable and easier to read.
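The same outline trick works for ordinary scatterplots, not just `geom_node_point()`. A minimal sketch with made-up data: the first layer draws a slightly larger (default black) point, and the colored point is drawn on top of it.

```r
library(ggplot2)

# Made-up data, for illustration only
df <- data.frame(x = 1:5,
                 y = c(2, 4, 3, 5, 1),
                 n = c(10, 40, 25, 60, 15))

p <- ggplot(df, aes(x, y)) +
  geom_point(aes(size = n * 1.1)) +     # outline layer underneath
  geom_point(aes(size = n, color = n))  # actual points drawn on top

p
```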