Quasilinear Musings
https://www.timlrx.com/index.xml
Recent content on Quasilinear Musings
Hugo -- gohugo.io
en-us
timothy.lin@alumni.ubc.ca (Timothy Lin)
Sun, 14 Oct 2018 00:00:00 +0000
Visualising Networks in ASOIAF - Part II
https://www.timlrx.com/2018/10/14/visualising-networks-in-asoiaf-part-ii/
Sun, 14 Oct 2018 00:00:00 +0000
timothy.lin@alumni.ubc.ca (Timothy Lin)
<p>This is the second post of a character network analysis of George R. R. Martin’s A Song of Ice and Fire (ASOIAF) series, as well as my first submission to the <a href="https://www.r-bloggers.com/">R Bloggers community</a>. A warm welcome to all readers out there! In my <a href="./2018/09/09/visualising-networks-in-asoiaf/">first post</a>, I covered the <a href="https://github.com/thomasp85/tidygraph">Tidygraph</a> package for manipulating dataframes and <a href="https://github.com/thomasp85/ggraph">ggraph</a> for network visualisation, as well as some tricks to fix the position of nodes when plotting multiple graphs containing the same node set and to place labels based on polar coordinates. In this post, we combine the plots together and use <a href="https://github.com/thomasp85/gganimate">gganimate</a> to visualise all 5 books.</p>
<div id="the-asoiaf-network" class="section level3">
<h3>The ASOIAF Network</h3>
<p>I carry on from the end of the previous post and will skip through the pre-processing and cleaning steps.</p>
<p>Previously, we created a <code>process_graph</code> function which pre-processes our input data into the tidygraph format and calculates the page-rank scores for each character at the book level. Subsequently, we joined these graphs and used the full graph to identify relevant communities within the ASOIAF universe as well as the key characters.</p>
<p>The full graph is used to define the coordinates of each node and their positions in the final plot. A <code>plot_graph</code> function takes in a tidygraph table and produces the network visualisation. Here is the code used to generate the network graph.</p>
<pre class="r"><code>full_layout <- create_layout(graph = full_graph, layout = "linear", circular = T)
xmin <- min(full_layout$x)
xmax <- max(full_layout$x)
ymin <- min(full_layout$y)
ymax <- max(full_layout$y)
plot_graph <- function(graph) {
  graph <- graph %>%
    left_join(full_layout[full_layout$Id %in% V(graph)$Id, c('x', 'y', 'Id', 'community', 'pagerank')],
              by = 'Id')
  graph %>%
    ggraph(layout = "manual", x = x, y = y, circular = T) +
    geom_edge_arc(aes(alpha = weight)) +
    geom_node_point(aes(color = community, size = pagerank)) +
    # data = filter(graph %>% as_tibble(), x>0),
    geom_node_text(aes(label = Label_short, x = x * 1.04, y = y * 1.04,
                       angle = ifelse(atan(-(x/y))*(180/pi) < 0,
                                      90 + atan(-(x/y))*(180/pi),
                                      270 + atan(-x/y)*(180/pi)),
                       hjust = ifelse(x > 0, 0, 1)), size = 3.5) +
    theme_graph() +
    expand_limits(x = c(xmin-0.2, xmax+0.2), y = c(ymin-0.2, ymax+0.2))
}</code></pre>
<p>Let’s test it out on the entire ASOIAF universe and see how it looks:</p>
<pre class="r"><code>plot_graph(full_graph %>% select(-community, -pagerank)) +
  scale_color_manual(values = colorRampPalette(c("blue", "yellow", "red"))(11)[c(1,11,2,10,6,9,4,8,5,7,3)],
                     labels = c("King's Landing", "The Wall",
                                "Arya and the Brotherhood", "Daenerys's Khalasar",
                                "Bran's companions", "House Bolton",
                                "House Martell", "Young Griff", "Dragonstone",
                                "Brienne's party", "Gregor and Oberyn"))</code></pre>
<p><img src="./post/2018-10-14-visualising-networks-in-asoiaf-part-ii_files/figure-html/fullplot-1.png" width="1152" /></p>
<p>I made two main changes to the above plot compared to the previous post. First, I used a cut-off threshold of 0.8 instead of 0.75 to filter out the less relevant characters. Second, to make the communities easier to infer, I mapped each community to a custom colour palette. The palette is generated using the <code>colorRampPalette</code> function, but I rearranged it to maximise the perceptual difference between communities and make them easily distinguishable.</p>
<p>The characters fall into 11 relatively distinct communities. The biggest group is the King’s Landing community, coloured in blue. Visually, this is also the densest area of the plot, and much of the cross-community interaction is fostered by characters within the King’s Landing group. This also explains the high page-rank scores of the characters within this group.</p>
<p>Next, we have the Night’s Watch and the wildlings, which I termed ‘The Wall’. Unsurprisingly, Jon Snow is the key character within this community, having ties with almost every one of its members as well as links to the broader ASOIAF universe.</p>
<p>The names of the other communities can be seen in the above plot. We can see the role of the Stark family in the story, with each character being a key member of a different community. The network plot also illustrates the narrative technique employed by George R. R. Martin within the series. Many of the characters interact within their own small communities but are woven into the broader narrative by their connections to certain key players. This allows him to build detailed character profiles while still creating the illusion of an extremely large universe. The clearest examples of this are Daenerys’s Khalasar and Arya’s network, which exist as very isolated communities but are connected to the broader story mainly through Daenerys and Arya alone.</p>
<p>Another interesting question one might ask is how the network structure differs between a cut-off threshold of 0.8 and the 0.75 used in the previous post. The plot based on a threshold of 0.75 is shown below:</p>
<p><img src="./post/2018-10-14-visualising-networks-in-asoiaf-part-ii_files/figure-html/preprocess2-1.png" width="1152" /></p>
<p>There are 4 other meaningful segments compared to the previous graph, and they roughly correspond to the Riverlands group, House Greyjoy, the Harrenhal guards (Gregor and Oberyn are in that group) and the Citadel. Here is an interesting observation: Arya’s kill list creates numerous co-occurrences of Gregor, Sandor, Polliver and the Tickler, which are picked up by the network plot as a distinct community.</p>
</div>
<div id="a-network-animation" class="section level3">
<h3>A network animation</h3>
<p>Separate plots for each book are nice, but they make it quite difficult to compare the network structure across books. One possible solution is to combine them all into an animation. <a href="https://github.com/thomasp85/gganimate">gganimate</a> is an extension to ggplot that allows animations to be created relatively easily.<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a></p>
<p>With <code>gganimate</code>, a simple transition can be created by appending <code>transition_manual</code> to the existing ggplot object and calling <code>animate</code> on it. Basically, anything that works with <code>facet_wrap</code> can be converted to a simple animation. In the example below, I included <code>transition_manual(book)</code> to generate an animation that shows each book for 2 seconds. By default, <code>animate</code> renders an animated gif which we can simply embed on the website.</p>
<pre class="r"><code>library(gganimate)
plot_graph <- function(graph) {
  graph <- graph %>%
    left_join(full_layout[full_layout$Id %in% V(graph)$Id, c('x', 'y', 'Id', 'community', 'pagerank')],
              by = 'Id')
  graph %>%
    ggraph(layout = "manual", x = x, y = y, circular = T) +
    # Need to use link0 for gganimate
    # geom_edge_arc(aes(alpha = weight)) +
    geom_edge_link0(aes(alpha = weight)) +
    geom_node_point(aes(color = community, size = pagerank)) +
    # data = filter(graph %>% as_tibble(), x>0),
    geom_node_text(aes(label = Label_short, x = x * 1.04, y = y * 1.04,
                       angle = ifelse(atan(-(x/y))*(180/pi) < 0,
                                      90 + atan(-(x/y))*(180/pi),
                                      270 + atan(-x/y)*(180/pi)),
                       hjust = ifelse(x > 0, 0, 1))) +
    theme_graph() +
    expand_limits(x = c(xmin-0.2, xmax+0.2), y = c(ymin-0.2, ymax+0.2))
}
p <- plot_graph(
  full_graph %>%
    select(-community, -pagerank)
)
p <- p +
  scale_color_manual(values = colorRampPalette(c("blue", "yellow", "red"))(11)[c(1,11,2,10,6,9,4,8,5,7,3)],
                     labels = c("King's Landing", "The Wall",
                                "Arya and the Brotherhood", "Daenerys's Khalasar",
                                "Bran's companions", "House Bolton",
                                "House Martell", "Young Griff", "Dragonstone",
                                "Brienne's party", "Gregor and Oberyn")) +
  ggtitle('ASOIAF Character Network', subtitle = 'Book {current_frame}') +
  transition_manual(book)
animate(p, 100, 10, width = 1050, height = 700)</code></pre>
<p><img src="./post/2018-10-14-visualising-networks-in-asoiaf-part-ii_files/figure-html/animate-1.gif" /><!-- --></p>
<p>The above animation looks quite nice, but the transition between plots seems a little sudden. To smooth the transition, we can transform the discrete events into continuous ones and use <code>transition_events</code> to control the rate of transition. The solution below is adapted from this <a href="https://gist.github.com/thomasp85/e6280e554c08f00c9e46f8efca2a5929">gist</a>. I experimented with various methods of converting the discrete book intervals to continuous events. Eventually, creating a time event out of the edge weights seemed to work quite well. The weights are scaled to range from 0 to 60 and are used as minutes, while the book number takes the place of the hour of the day. <code>enter_length</code> and <code>exit_length</code> are set to 60 minutes to create some overlap across books, so that characters who remain connected across books do not appear to fade in and out unnecessarily. Here is the final code and output.</p>
<pre class="r"><code>p <- plot_graph(
  full_graph %>%
    select(-community, -pagerank) %>%
    activate(edges) %>%
    mutate(tweight = ifelse(weight > 100, 100, weight)) %>%
    group_by(book) %>%
    # the book number becomes the hour of the day; the scaled weight the minutes
    mutate(scaled_weight = (tweight - min(tweight)) / (max(tweight) - min(tweight)) * 60,
           book_start_time = as.POSIXct(paste0('2018-10-10 ', book, ":00"), format = "%Y-%m-%d %H:%M"),
           book_end_time = as.POSIXct(paste0('2018-10-10 ', book, ":", scaled_weight), format = "%Y-%m-%d %H:%M")) %>%
    ungroup() %>%
    mutate(book_end_time = if_else(is.na(book_end_time), as.POSIXct(paste0('2018-10-10 ', book + 1, ":00"), format = "%Y-%m-%d %H:%M"), book_end_time)) %>%
    activate(nodes)
)
fade_edge <- function(x) {
  x$edge_alpha = 0
  x$edge_width = 0
  x
}
p2 <- p +
  scale_color_manual(values = colorRampPalette(c("blue", "yellow", "red"))(11)[c(1,11,2,10,6,9,4,8,5,7,3)],
                     labels = c("King's Landing", "The Wall",
                                "Arya and the Brotherhood", "Daenerys's Khalasar",
                                "Bran's companions", "House Bolton",
                                "House Martell", "Young Griff", "Dragonstone",
                                "Brienne's party", "Gregor and Oberyn")) +
  ggtitle('ASOIAF Character Network', subtitle = 'Book {as.numeric(format(frame_time, "%H"))}') +
  transition_events(start = book_start_time,
                    end = book_end_time,
                    enter_length = hms::hms(minutes = 60), exit_length = hms::hms(minutes = 60)) +
  enter_manual(fade_edge) +
  exit_manual(fade_edge)
animate(p2, 100, 10, width = 900, height = 600)</code></pre>
<p><img src="./post/2018-10-14-visualising-networks-in-asoiaf-part-ii_files/figure-html/animate2-1.gif" /><!-- --></p>
<p>The animated plot shows quite clearly how the narrative of ASOIAF has evolved through the books. In A Game of Thrones, the first book, the plot mainly revolves around the characters in King’s Landing, but as the series progresses, we see many more cross-community relationships being formed. It should also be noted that the mortality rate of characters in that community is especially high, and there are far fewer of them at the end of book 5 than at the start of the series.</p>
</div>
<div id="conclusion" class="section level3">
<h3>Conclusion</h3>
<p>I hope you enjoyed this network view of the ASOIAF series. I can’t wait to update it when The Winds of Winter is released and see how the network further changes over time. My guess is that characters in the North will gain more prominence and we should see new communities budding off from there. Maybe a new community of white walkers will also form, though I wonder what their names would be. The arrival of Daenerys Targaryen in Westeros and the probable meeting of all the ‘kings’ would also help create more links between the different communities.</p>
<p>On the data science side, I hope this post gives you a glimpse of how network analysis can be applied to less well-known areas such as text, and how reasonably straightforward it is with the help of so many R packages. In short, words are not really wind, and the context in which they are situated tells us a lot about how they are all connected to each other.</p>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Note: There has been a major re-write of the API since Thomas Lin Pedersen took over from David Robinson, so many of the existing tutorials out there are no longer compatible with the new version. I use the new API in the code below. There also seem to be some issues with the integration of <code>ggraph</code> and <code>gganimate</code>. At the time of this post, I can only get <code>geom_edge_link0</code> to work, while <code>geom_edge_link</code> and <code>geom_edge_arc</code> give an error message. I filed a <a href="https://github.com/thomasp85/gganimate/issues/196">github issue</a>, so hopefully these teething issues get resolved soon.<a href="#fnref1">↩</a></p></li>
</ol>
</div>
Visualising Networks in ASOIAF
https://www.timlrx.com/2018/09/09/visualising-networks-in-asoiaf/
Sun, 09 Sep 2018 00:00:00 +0000
timothy.lin@alumni.ubc.ca (Timothy Lin)
<p>While waiting for The Winds of Winter to arrive, there is plenty of time to revisit the 5 books. One of my favourite aspects of the series is the character and world building. As the A Song of Ice and Fire universe is so big, many characters are mentioned in passing while the major characters meet each other only occasionally. I thought it would be interesting to see how various characters are connected and how that progresses through the series. This evolved into a mini project featuring network analysis of the series. As it turns out, many people think likewise, and there are numerous network analyses flying around. This post is different in at least three ways: it focuses on the books rather than the Game of Thrones HBO series, it features an analysis of the entire series, not just a single book, and finally it features some very cool network plots (and shows you how to make them as well).</p>
<p>This analysis is done in R and uses two wonderful packages by Thomas Lin Pedersen. <a href="https://github.com/thomasp85/tidygraph">Tidygraph</a> provides a tidy API to the popular <code>igraph</code> package. This makes it easy to experiment with network algorithms, especially if one is already familiar with manipulating tibbles using the <code>dplyr</code> syntax. <a href="https://github.com/thomasp85/ggraph">Ggraph</a> extends ggplot to network graphs.</p>
<div id="dataset" class="section level2">
<h2>Dataset</h2>
<p>Data is obtained from <a href="https://github.com/mathbeveridge/asoiaf">Andrew Beveridge</a>. The dataset is unique in that it is an encoding of the actual text itself: it comes as an edge list, i.e. a list of character pairs along with the number of times their names appear within an interval of 15 words of each other in a particular book. Thankfully, the hard work of cleaning the dataset has been done, so we can just focus on analysing the data.</p>
</div>
<div id="analysis" class="section level2">
<h2>Analysis</h2>
<p>Let us first import the packages that will be used for this analysis. Since certain algorithms involve random initialisation, we set a seed to make the analysis replicable.</p>
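<p>The setup chunk itself does not appear here, so the sketch below is inferred from the functions used later in the post; the exact seed value is an arbitrary assumption on my part.</p>
<pre class="r"><code># Packages inferred from the code used later in the post
library(readr)      # read_csv
library(dplyr)      # data manipulation verbs
library(stringr)    # word()
library(ggplot2)    # plotting
library(igraph)     # V() and other graph utilities
library(tidygraph)  # tbl_graph, activate, centrality measures
library(ggraph)     # grammar-of-graphics network plots

# Community detection involves random initialisation,
# so fix a seed for replicability
set.seed(2018)</code></pre>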
<p>Next, we read the ASOIAF files into two dataframes: one containing all the nodes/vertices of the graph and the other the edges/connections.</p>
<pre class="r"><code>node_files <- list.files(path = "data/", pattern = "nodes")
edge_files <- list.files(path = "data/", pattern = "edges")
node_df = do.call(rbind, lapply(paste0("data/",node_files), function(x) read_csv(x)))
edge_df = do.call(rbind, lapply(paste0("data/",edge_files), function(x) read_csv(x)))</code></pre>
<p>Let’s take a look at the datasets:</p>
<pre class="r"><code>node_df</code></pre>
<pre><code>## # A tibble: 1,340 x 2
## Id Label
## <chr> <chr>
## 1 Addam-Marbrand Addam Marbrand
## 2 Aegon-I-Targaryen Aegon I Targaryen
## 3 Aemon-Targaryen-(Maester-Aemon) Aemon Targaryen (Maester Aemon)
## 4 Aerys-II-Targaryen Aerys II Targaryen
## 5 Aggo Aggo
## 6 Albett Albett
## 7 Alliser-Thorne Alliser Thorne
## 8 Alyn Alyn
## 9 Arthur-Dayne Arthur Dayne
## 10 Arya-Stark Arya Stark
## # ... with 1,330 more rows</code></pre>
<pre class="r"><code>edge_df</code></pre>
<pre><code>## # A tibble: 3,909 x 5
## Source Target Type weight book
## <chr> <chr> <chr> <int> <int>
## 1 Addam-Marbrand Jaime-Lannister Undirect~ 3 1
## 2 Addam-Marbrand Tywin-Lannister Undirect~ 6 1
## 3 Aegon-I-Targaryen Daenerys-Targary~ Undirect~ 5 1
## 4 Aegon-I-Targaryen Eddard-Stark Undirect~ 4 1
## 5 Aemon-Targaryen-(Maester-Aemo~ Alliser-Thorne Undirect~ 4 1
## 6 Aemon-Targaryen-(Maester-Aemo~ Bowen-Marsh Undirect~ 4 1
## 7 Aemon-Targaryen-(Maester-Aemo~ Chett Undirect~ 9 1
## 8 Aemon-Targaryen-(Maester-Aemo~ Clydas Undirect~ 5 1
## 9 Aemon-Targaryen-(Maester-Aemo~ Jeor-Mormont Undirect~ 13 1
## 10 Aemon-Targaryen-(Maester-Aemo~ Jon-Snow Undirect~ 34 1
## # ... with 3,899 more rows</code></pre>
<p>On closer examination, it turns out that some data cleaning is still required. The labels are not unique to each Id and are too long for visualisation purposes. I remove the duplicates and do a simple relabeling operation to obtain a nicer looking label for the final product. Many characters in the series share similar first names. In most circumstances it should be clear which major character is being referred to (e.g. Brandon Stark vs Brandon the Builder). In other cases where it might be confusing, I manually corrected the labels.</p>
<p>We also prepare the <code>edge_df</code> dataset. Tidygraph likes one column of the edge dataset to be named ‘from’ and the other to be named ‘to’, so we shall follow that convention. Note: the naming does not necessarily mean that the edges are directed. We can pass in an option to specify that when building the graph dataset.</p>
<pre class="r"><code>node_df <- node_df %>%
  group_by(Id) %>%
  filter(row_number() == 1) %>%
  mutate(Label_short = word(Label, 1)) %>%
  mutate(Label_short = case_when(
    Label == 'Aemon Targaryen (Maester Aemon)' ~ 'Maester Aemon',
    Label == 'Aegon I Targaryen' ~ 'Aegon I Targaryen',
    Label == 'Aegon V' ~ 'Aegon V',
    Label == 'Brynden Rivers' ~ 'Bloodraven',
    Label == 'High Sparrow' ~ 'High Sparrow',
    Label == 'Roose Bolton' ~ 'Roose Bolton',
    Label == 'Walder Frey' ~ 'Walder Frey',
    Label == 'Jon Arryn' ~ 'Jon Arryn',
    Label == 'Jon Snow' ~ 'Jon Snow',
    Label == 'Robert Arryn' ~ 'Robert Arryn',
    Label == 'Robert Baratheon' ~ 'Robert Baratheon',
    TRUE ~ Label_short
  ))
edge_df <- edge_df %>%
  filter(!is.na(book)) %>%
  rename(from = Source, to = Target) %>%
  select(from, to, weight, book)</code></pre>
<p>For a start, let us examine the network connections in the first book. Here, I plot the distribution of joint occurrences between characters.</p>
<pre class="r"><code>book1 <- edge_df %>%
  filter(book == 1)
### Plot distribution of weights
book1 %>%
  ggplot(aes(x = weight)) +
  geom_bar() +
  xlim(0, 100) +
  theme_bw()</code></pre>
<p><img src="./post/2018-09-09-visualising-networks-in-asoiaf_files/figure-html/weightPlot-1.png" width="672" /></p>
<p>Typical of network graphs, the plot shows a power-law-like distribution, with most character pairs connected by a few joint mentions and a few central characters being mentioned repeatedly in the same context. I truncated the x-axis to remove certain outliers, but if you are keen on finding out which characters were most commonly mentioned next to each other…</p>
<pre class="r"><code>book1 %>% arrange(desc(weight))</code></pre>
<pre><code>## # A tibble: 684 x 4
## from to weight book
## <chr> <chr> <int> <int>
## 1 Eddard-Stark Robert-Baratheon 291 1
## 2 Bran-Stark Robb-Stark 112 1
## 3 Arya-Stark Sansa-Stark 104 1
## 4 Daenerys-Targaryen Drogo 101 1
## 5 Joffrey-Baratheon Sansa-Stark 87 1
## 6 Eddard-Stark Petyr-Baelish 81 1
## 7 Jeor-Mormont Jon-Snow 81 1
## 8 Jon-Snow Samwell-Tarly 81 1
## 9 Daenerys-Targaryen Jorah-Mormont 75 1
## 10 Cersei-Lannister Robert-Baratheon 72 1
## # ... with 674 more rows</code></pre>
<p>We can construct our network using the <code>tbl_graph</code> function. As a from-to character pair captures the joint occurrence, it makes sense to model the relations as an undirected graph. I subset <code>node_df</code> to include only the relevant characters mentioned in the first book.</p>
<pre class="r"><code>df <- tbl_graph(nodes = node_df[node_df$Id %in% union(book1$from, book1$to),],
                edges = book1,
                directed = FALSE)</code></pre>
<p>Now, we can play around with the network using the <code>tidygraph</code> package. It has a nice API where one can perform operations on either the nodes or the edges by using the <code>activate</code> function. Most of the <code>dplyr</code> functions are also supported. The major one missing is <code>summarise</code>, but I will show a short workaround for that when we need it later on.</p>
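<p>As a brief illustration of the <code>activate</code> pattern (a hypothetical snippet using the <code>df</code> graph built above, not part of the original analysis; the threshold of 10 is arbitrary):</p>
<pre class="r"><code># activate() switches which table the subsequent dplyr verbs apply to
df %>%
  activate(edges) %>%
  filter(weight > 10) %>%                # operates on the edge table
  activate(nodes) %>%
  mutate(degree = centrality_degree())   # operates on the node table</code></pre>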
<p>We are keen on mapping the relationships between the important characters within the book. How do we calculate the importance of a character? One way of doing so is through the <a href="https://en.wikipedia.org/wiki/PageRank">pagerank</a> measure, also known as the Google search engine ranking algorithm. The code below keeps characters who are in the top quartile of pagerank scores. In addition, we keep only the main connections between characters: those in the top quartile of edge weights.</p>
<pre class="r"><code>df2 <- df %>%
  activate(nodes) %>%
  mutate(pagerank = centrality_pagerank(weights = weight, directed = FALSE),
         degree = centrality_degree(weights = weight),
         pagerank_75pc = quantile(pagerank, 0.75)) %>%
  activate(edges) %>%
  filter(weight >= quantile(weight, 0.75)) %>%
  activate(nodes) %>%
  filter(pagerank > pagerank_75pc) %>%
  filter(!node_is_isolated())</code></pre>
<p>To visualise the network, we can use the <code>ggraph</code> package. The two main additional geoms are the <code>geom_edge_*</code> and the <code>geom_node_*</code> functions which control the visualisation of the edges and nodes respectively. In addition, there is a layout argument which allows a variety of popular configurations to be displayed. Here, I use the Fruchterman-Reingold algorithm which is one of the more popular force-directed algorithms out there.</p>
<pre class="r"><code>ggraph(df2, layout = "fr") +
  geom_edge_link(color = 'red') +
  geom_node_point() +
  geom_node_text(aes(label = Label_short), repel = TRUE) +
  theme_graph()</code></pre>
<p><img src="./post/2018-09-09-visualising-networks-in-asoiaf_files/figure-html/ggraphVis-1.png" width="672" /></p>
<p>Let us try to encode more information by finding the communities within the graph.</p>
<pre class="r"><code>df3 <- df2 %>%
  mutate(component = group_components(),
         community = as.factor(group_infomap(weights = weight))) %>%
  group_by(component) %>%
  mutate(component_size = n()) %>%
  filter(component_size > 5) %>%
  ungroup() %>%
  arrange(community)</code></pre>
<p>The spider-web in the middle makes it difficult to make much sense of the connections. An alternative way of visualising the connections is a circular layout. We can do this easily by specifying a linear layout and passing <code>circular = TRUE</code> as an argument. We encode the pagerank information as the size of the nodes and colour them according to the communities they belong to.</p>
<pre class="r"><code>ggraph(df3, layout = "linear", circular = T) +
  geom_edge_arc(alpha = 0.2) +
  geom_node_point(aes(color = community, size = pagerank)) +
  geom_node_text(aes(label = Label_short), repel = TRUE, size = 3.5) +
  theme_graph()</code></pre>
<p><img src="./post/2018-09-09-visualising-networks-in-asoiaf_files/figure-html/circular-1.png" width="1152" /></p>
<p>The communities discovered by the algorithm match fairly well to the locations where the characters spend most of their time. We have a King’s Landing community with Eddard Stark, Robert Baratheon and Varys, as well as a Winterfell community with Catelyn, Bran and Robb. We can also easily pick out the influential players in the graph. Daenerys is the only link between Essos and the rest of Westeros. Tyrion and the Stark family (except Rickon) also rank highly in pagerank score and facilitate the connections between the other major characters in the first book.</p>
<div id="asoiaf-network" class="section level3">
<h3>ASOIAF Network</h3>
<p>We want to replicate the above analysis for the rest of the books, so let us write a function to generate the clean network dataframe for each selected book. We will call the function <code>process_graph</code>. I included the quantile cut-off as a variable to make it easy to adjust the final plot. 0.75 seems like a reasonable cut-off and includes most of the key characters in the books. In the code below, I create an <code>all_graphs</code> list which contains all 5 books processed as <code>tidygraph</code> dataframes.</p>
<pre class="r"><code>process_graph <- function(node_df, edge_df, book_num, q = 0.75) {
  book <- edge_df %>%
    filter(book == book_num)
  df <- tbl_graph(nodes = node_df[node_df$Id %in% union(book$from, book$to),],
                  edges = book,
                  directed = FALSE)
  df2 <- df %>%
    activate(nodes) %>%
    mutate(pagerank = centrality_pagerank(weights = weight, directed = FALSE),
           degree = centrality_degree(weights = weight),
           pagerank_qpc = quantile(pagerank, q)) %>%
    activate(edges) %>%
    filter(weight >= quantile(weight, q)) %>%
    activate(nodes) %>%
    filter(pagerank > pagerank_qpc) %>%
    filter(!node_is_isolated()) %>%
    select('Id', 'Label', 'Label_short', 'pagerank') %>%
    rename(!! paste0('pagerank', book_num) := pagerank)
  df3 <- df2 %>%
    mutate(component = group_components()) %>%
    group_by(component) %>%
    mutate(component_size = n()) %>%
    filter(component_size > 5) %>%
    ungroup() %>%
    select(-component, -component_size)
  return(df3)
}
all_graphs <- lapply(1:5, function(x) process_graph(node_df, edge_df, book_num = x, q = 0.75))</code></pre>
<p>There are two additional complications in extending this analysis to all five books:</p>
<ol style="list-style-type: decimal">
<li><p>How do we find out the important nodes across all 5 books and determine the relevant communities?</p></li>
<li><p>How do we build a consistent layout to visualise across datasets?</p></li>
</ol>
<p>Let us address the first complication. To do so, we need to aggregate information across the 5 books. We can use the <code>graph_join</code> function, which works in a similar manner to the <code>dplyr</code> <code>*_join</code> commands.</p>
<pre class="r"><code>full_graph <- Reduce(function(...) graph_join(..., by = c('Id', 'Label', 'Label_short')), all_graphs) %>%
  convert(to_undirected)</code></pre>
<p>This produces a dataset with the information across all 5 books. Some of the edges are duplicated, and we need to sum up the weights across these edges. Normally one would use the <code>summarise</code> function in <code>dplyr</code> to do this aggregation. Unfortunately, this is not supported in <code>tidygraph</code>. I opted to convert the edges to a dataframe using the <code>as_tibble</code> command before merging it back to the edge dataframe. The trick is to remove duplicated edges in the tidygraph object before doing the merge.</p>
<pre class="r"><code>full_graph_edges <- full_graph %>%
  activate(edges) %>%
  as_tibble() %>%
  group_by(from, to) %>%
  summarise(weight = sum(weight), book = first(book))
full_graph <- full_graph %>%
  activate(edges) %>%
  filter(!edge_is_multiple()) %>%
  select(from, to) %>%
  left_join(full_graph_edges, by = c('from', 'to'))</code></pre>
<p>We can then run the community algorithm and sum up the pagerank score across all the books to get a measure of the importance of a particular character.</p>
<pre class="r"><code>full_graph <- full_graph %>%
  activate(nodes) %>%
  mutate(community = as.factor(group_infomap(weights = weight))) %>%
  arrange(community)
full_graph <- full_graph %>%
  mutate_at(vars(contains('pagerank')), funs(if_else(is.na(.), 0, .))) %>%
  mutate(pagerank = pagerank1 + pagerank2 + pagerank3 + pagerank4 + pagerank5)</code></pre>
<p>Returning to the issue of plotting, we want to fix the locations of the nodes so that they are the same across books. This makes comparisons easy. My solution is adapted from the following two github issues (<a href="https://github.com/thomasp85/ggraph/issues/1">1</a>, <a href="https://github.com/thomasp85/ggraph/issues/130">2</a>). We create a manual layout using the full graph and write a <code>plot_graph</code> function that merges essential information about the nodes back into the network data frame.</p>
<p>There is one final issue to settle. The <code>geom_node_text</code> function has a repel argument that uses the <code>ggrepel</code> package to decide on the placement of the text. Normally the default options work very well for a scatterplot, but in this case we have a circular plot and we do not want the text in the interior of the plot. Additionally, since we have many nodes, there is a tendency for ggrepel to push the labels far away from the node positions. One possible solution is to fix the position of the labels to the outside of the circle and angle the text such that it points towards the center of the circle. I adapted the following <a href="https://gist.github.com/ajhmohr/5337a5c99b504e4a243fad96203fa74f">solution</a> and used the <code>hjust</code> alignment argument to coerce the labels to form an outer circle.</p>
<pre class="r"><code>full_layout <- create_layout(graph = full_graph, layout = "linear", circular = T)
max(as.numeric(full_layout$community))</code></pre>
<pre><code>## [1] 14</code></pre>
<pre class="r"><code>xmin <- min(full_layout$x)
xmax <- max(full_layout$x)
ymin <- min(full_layout$y)
ymax <- max(full_layout$y)
plot_graph <- function(graph) {
  graph <- graph %>%
    left_join(full_layout[full_layout$Id %in% V(graph)$Id, c('x', 'y', 'Id', 'community', 'pagerank')],
              by = 'Id')
  graph %>%
    ggraph(layout = "manual", x = x, y = y, circular = T) +
    geom_edge_arc(aes(alpha = weight)) +
    geom_node_point(aes(color = community, size = pagerank)) +
    # data = filter(graph %>% as_tibble(), x>0),
    geom_node_text(aes(label = Label_short, x = x * 1.04, y = y * 1.04,
                       angle = ifelse(atan(-(x/y))*(180/pi) < 0,
                                      90 + atan(-(x/y))*(180/pi),
                                      270 + atan(-x/y)*(180/pi)),
                       hjust = ifelse(x > 0, 0, 1)), size = 3.5) +
    theme_graph() +
    expand_limits(x = c(xmin-0.2, xmax+0.2), y = c(ymin-0.2, ymax+0.2))
}</code></pre>
<p>And here is how each book looks:</p>
<pre class="r"><code>plot_graph(all_graphs[[1]])</code></pre>
<p><img src="./post/2018-09-09-visualising-networks-in-asoiaf_files/figure-html/book1-1.png" width="1152" /></p>
<pre class="r"><code>plot_graph(all_graphs[[2]])</code></pre>
<p><img src="./post/2018-09-09-visualising-networks-in-asoiaf_files/figure-html/book2-1.png" width="1152" /></p>
<pre class="r"><code>plot_graph(all_graphs[[3]])</code></pre>
<p><img src="./post/2018-09-09-visualising-networks-in-asoiaf_files/figure-html/book3-1.png" width="1152" /></p>
<pre class="r"><code>plot_graph(all_graphs[[4]])</code></pre>
<p><img src="./post/2018-09-09-visualising-networks-in-asoiaf_files/figure-html/book4-1.png" width="1152" /></p>
<pre class="r"><code>plot_graph(all_graphs[[5]])</code></pre>
<p><img src="./post/2018-09-09-visualising-networks-in-asoiaf_files/figure-html/book5-1.png" width="1152" /></p>
</div>
</div>
Applications of DAGs in Causal Inference
https://www.timlrx.com/2018/08/09/applications-of-dags-in-causal-inference/
Thu, 09 Aug 2018 00:00:00 +0000
timothy.lin@alumni.ubc.ca (Timothy Lin)
<div id="introduction" class="section level2">
<h2>Introduction</h2>
<p>Two years ago I came across Pearl’s work on using directed acyclic graphs (DAGs) to model the problem of causal inference and read the debate between academics on Pearl’s framework vs Rubin’s potential outcomes framework. I found it quite intriguing, from a scientific methods and history perspective, how two different formal frameworks could be developed to solve a common goal. I read a few papers on the DAG approach but, without fully understanding how it could be useful to my work, filed it away in the back of my mind (and computer folder).</p>
<p>Recently, I had the pleasure of trying to explain to a colleague of mine what a confounding variable is. I find it much easier to use an analogy when explaining these kinds of statistical concepts. One of my favourite examples to give is the problem of determining the effect of hospitalisation on mortality. Here, health plays the role of the confounding variable. Explaining confounders in terms of the potential outcomes framework would go something like this:</p>
<blockquote>
<p>To determine the effect of hospitalisation on mortality we would ideally like to randomly assign individuals to hospitals. In reality, this is not the case and individuals admit themselves to hospitals based on their health status, which affects both hospital admission and mortality. An estimate of the effect of hospitalisation on mortality will be problematic since we will also be capturing the effect of health on both variables.</p>
</blockquote>
<p>Normally this does the trick and people eventually get the idea. However, if someone asks for a more technical explanation it gets a little tricky. One has to write down the conditional independence / ignorability assumption (<span class="math inline">\(Y_{i1}, Y_{i0} \perp T_{i}\)</span>) and start explaining the idea of potential outcomes and different worlds.</p>
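<p>To make the confounding concrete, here is a small simulated version of the hospital example. The linear model and its coefficients are my own invention, purely for illustration: the true effect of hospitalisation on mortality is set to zero, yet a naive comparison of means finds a large positive “effect”, which disappears once health is conditioned on.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical model: poor health (Z) raises both the probability of
# hospitalisation (T) and mortality risk (Y); the true effect of T on Y is 0.
z = rng.normal(size=n)                          # health (higher = sicker)
t = (z + rng.normal(size=n) > 0).astype(float)  # hospitalisation decision
y = 2.0 * z + rng.normal(size=n)                # mortality risk

# Naive comparison of hospitalised vs non-hospitalised individuals:
naive = y[t == 1].mean() - y[t == 0].mean()

# Adjusting for health via a regression of Y on T and Z:
X = np.column_stack([np.ones(n), t, z])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

print(naive)    # clearly positive, despite the zero true effect
print(beta[1])  # close to zero once health is conditioned on
```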
<p>This time round, I tried something different. I drew a directed acyclic graph (DAG), with health, <span class="math inline">\(Z\)</span>, as a common cause affecting both hospitalisation, <span class="math inline">\(X\)</span>, and mortality, <span class="math inline">\(Y\)</span>.</p>
<p><img src="./post/2018-08-09-applications-of-dags-in-causal-inference_files/figure-html/confoudner-1.png" width="60%" height="50%" /></p>
<p>I think the idea got through more clearly and the resulting concept of controlling for health (as a modified DAG without the edge from health to hospitalisation) was also simple to convey. Modelling causal effects as DAGs also seems quite natural, and this is also the way instrumental variables are usually taught.</p>
<p>As it turns out, Pearl has a full framework for modelling causality using a graphical framework. This rekindled my interest in the subject and I spent the last month going through Pearl, Glymour and Jewell’s Causal Inference in Statistics: A Primer (2016) and came away with a deeper appreciation of the benefits of causal analysis using DAGs. In this post I detail my three biggest takeaways from the book and how DAGs can contribute to better causal inference.</p>
</div>
<div id="pearls-causal-model-and-the-potential-outcomes-framework" class="section level2">
<h2>Pearl’s Causal Model and the Potential Outcomes Framework</h2>
<p>Despite the arguments between both schools of thought, I actually see them as complementary approaches but with a different starting point. Let’s start with a brief recap of the potential outcomes framework and a summary of Pearl’s approach.</p>
<div id="potential-outcomes" class="section level4">
<h4>Potential Outcomes</h4>
<p>In the potential outcomes framework, the problem of estimating a causal effect is framed as a missing data problem. One only gets to observe a particular state of the world - either an individual receives a treatment or he does not. If we could observe both states of the world, the causal effect would then be the difference between the treated and the untreated state (<span class="math inline">\(Y_{i1} - Y_{i0}\)</span>). Of course this is not equal to the difference between those observed taking the treatment and those not on the treatment, <span class="math inline">\(E[Y_{i} \vert T_{i}=1] - E[Y_{i} \vert T_{i}=0]\)</span>, since there are other factors that affect the taking of treatment and the outcome. However, if one could assign the treatment randomly, the causal effect can then be derived. Hence, the saying goes “no causation without manipulation”.</p>
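<p>The missing-data framing can be sketched in a few lines of code (an entirely made-up toy example of my own): we generate <em>both</em> potential outcomes for every unit, which is exactly what reality never lets us do, and compare a self-selected treatment with a randomised one.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Both potential outcomes exist for every unit, but only one is ever observed.
y0 = rng.normal(size=n)
y1 = y0 + 1.0              # true individual effect is 1 for everyone
ate = (y1 - y0).mean()     # knowable only because this is a simulation

# Self-selection: units with high Y0 are more likely to take the treatment.
t_self = y0 + rng.normal(size=n) > 0
naive = y1[t_self].mean() - y0[~t_self].mean()

# Random assignment makes T independent of (Y0, Y1).
t_rand = rng.random(n) < 0.5
randomised = y1[t_rand].mean() - y0[~t_rand].mean()

print(ate, naive, randomised)  # naive is biased upward; randomised is close to 1
```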
</div>
<div id="pearls-causal-model" class="section level4">
<h4>Pearl’s Causal Model</h4>
<p>Pearl describes the idea of a structural causal model as a set of variables and relations that describe how nature assigns particular values to certain variables of interest i.e. structural equations. Formally, this could be written as a set of two kinds of variables <span class="math inline">\((U,V)\)</span>, where <span class="math inline">\(U\)</span> denotes a vector of exogenous variables determined outside the model and <span class="math inline">\(V\)</span> denotes a vector of endogenous variables. The model is completed by a set of functions, <span class="math inline">\(F\)</span>, that assigns each endogenous variable a value based on the other variables in the model with a constraint that the mapping is acyclic.</p>
<p>The structural causal model can then be translated to a DAG where the nodes represent the variables <span class="math inline">\(U\)</span> and <span class="math inline">\(V\)</span> and the edges between the nodes represent the set of functions, <span class="math inline">\(F\)</span>.</p>
<p>For example, the causal model defined by: <span class="math display">\[
\begin{aligned}
U = \{W, Z\}&, ~~V=\{X,Y\}, ~~F=\{f_{X}, f_{Y}\} \\
f_{X} &: X = 2W - Z \\
f_{Y} &: Y = -2X - 3Z
\end{aligned}
\]</span> can be translated to the following DAG:</p>
<p><img src="./post/2018-08-09-applications-of-dags-in-causal-inference_files/figure-html/connectedGraph-1.png" width="60%" height="50%" /></p>
<p>Intervening (or randomising) on a particular variable (e.g. <span class="math inline">\(X\)</span>) is translated to removing the arrows that go into that variable. Doing this graph surgery on the above example we get the modified graph:</p>
<p><img src="./post/2018-08-09-applications-of-dags-in-causal-inference_files/figure-html/modGraph-1.png" width="60%" height="50%" /></p>
<p>In Pearl’s notation, we are interested in finding out the effect of <span class="math inline">\(P(Y=y \vert do(X=x))\)</span> where we intervene to fix <span class="math inline">\(X\)</span> at a particular value, e.g. <span class="math inline">\(x\)</span>. Establishing whether a causal effect of a certain variable of interest can be determined from a particular causal model can then be translated to examining the structure of the graph for certain properties which allow for such an effect to be determined. If so, the causal effect could theoretically be calculated from the conditional probabilities produced by the relationships given in the graph. Pearl introduces a set of methods (do-calculus) to translate the do operator into conditional probabilities that could be calculated with observational data. He also establishes the potential outcomes model as a special case of the graphical approach.</p>
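<p>For a discrete toy model (probabilities chosen arbitrarily by me), the adjustment formula behind the do-operator can be checked directly: with <span class="math inline">\(Z\)</span> a common cause of <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span>, the observational <span class="math inline">\(P(Y=1 \vert X=1)\)</span> overstates <span class="math inline">\(P(Y=1 \vert do(X=1))\)</span>, while averaging the <span class="math inline">\(Z\)</span>-specific conditionals over the marginal of <span class="math inline">\(Z\)</span> recovers it.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Toy model: binary confounder Z -> X, Z -> Y, plus X -> Y.
z = rng.random(n) < 0.5
x = rng.random(n) < np.where(z, 0.8, 0.2)
y = rng.random(n) < 0.3 + 0.2 * x + 0.4 * z  # P(Y=1 | do(X=1)) = 0.7 by design

# Observational conditional: biased because the backdoor path X <- Z -> Y is open.
p_obs = y[x].mean()

# Adjustment formula: P(Y=1 | do(X=1)) = sum_z P(Y=1 | X=1, Z=z) P(Z=z)
p_do = sum(y[x & (z == v)].mean() * (z == v).mean() for v in (True, False))

print(p_obs, p_do)  # roughly 0.82 vs 0.70
```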
<p>Converting a structural equation to a graphical model has its pros and cons. One downside is that it does not encode information about the functional relationships between variables beyond the direction of causation. This makes it hard to model identification strategies which depend on certain functional relationships between variables in this framework. For example, it does not capture the nuances of a regression discontinuity design (RDD) where a discontinuity in a certain variable of interest is used to infer its causal effect on an outcome variable, which one expects to otherwise vary linearly. The difference-in-differences approach is also hard to represent using DAGs since it involves certain assumptions about how variables change in a linear fashion over time.</p>
<p>Despite these limitations, I think there are areas in which adopting a DAG approach to modelling brings substantial benefits and clarity. In the following section, I discuss my three biggest takeaways on how an understanding of DAGs can contribute to better modelling of causal effects. First, DAGs make explicit what should be controlled for and what should not. Second, DAGs are useful as a heuristic tool to make assumptions and relationships between variables clear. Third, I introduce a novel identification strategy which Pearl calls the front-door criterion.</p>
</div>
</div>
<div id="benefits-of-the-dag-approach" class="section level2">
<h2>Benefits of the DAG approach</h2>
<div id="to-control-or-not-to-control" class="section level3">
<h3>1) To control or not to control</h3>
<p>The central problem of every selection-on-observables design can be reduced to the question: “To control or not to control - that is the question”. The DAG approach offers a practical tool-set to guide anyone facing such difficult questions in life. This is an advantage over the potential outcomes framework, which offers the conditional ignorability assumption, <span class="math inline">\(Y_{i1}, Y_{i0} \perp T_{i} \vert X_{i}\)</span>, as its guiding principle. However, the statement does not make explicit which variables should be controlled for.</p>
<p>One class of variables that should be controlled or adjusted for are common causes (factors which affect both the variable of interest and the outcome). I believe this is what most researchers have in mind when they select their control variables. Yet, the important question that needs to be asked is the following - is this set of variables sufficient for inferring the causal effect?</p>
<p>The answer, as it turns out, is no. Before we examine what needs to be controlled for, I will give a quick overview of the three main components of every graphical model and how controlling for certain variables creates statistical dependence or independence in the structures in question.</p>
<p>All causal graphs can be decomposed into the following three main structures:<br />
1) Chains<br />
2) Forks<br />
3) Colliders</p>
<p><strong>Chains</strong><br />
Chains are defined by three nodes and two edges, with one edge directed into the middle variable and one edge directed out of it:</p>
<p><img src="./post/2018-08-09-applications-of-dags-in-causal-inference_files/figure-html/chains-1.png" width="60%" height="50%" /></p>
<p>Without controlling for anything, any two variables in a chain are correlated. However, two variables at the edge of the chain <span class="math inline">\((Z\)</span> and <span class="math inline">\(X)\)</span> are independent given any variable in the middle of the chain <span class="math inline">\((Y)\)</span>.</p>
<p><strong>Forks</strong></p>
<p>Two edges originating from a central variable makes up a fork:</p>
<p><img src="./post/2018-08-09-applications-of-dags-in-causal-inference_files/figure-html/fork-1.png" width="60%" height="50%" /></p>
<p>Since the variables share a common cause, they are correlated with each other. However, conditional on the common cause <span class="math inline">\((X)\)</span> the other two variables <span class="math inline">\((Z\)</span> and <span class="math inline">\(Y)\)</span> are independent.</p>
<p><strong>Colliders</strong></p>
<p>As the name suggests, a collider is a structure formed when two nodes have edges directed into one common node:</p>
<p><img src="./post/2018-08-09-applications-of-dags-in-causal-inference_files/figure-html/collider-1.png" width="60%" height="50%" /></p>
<p>In this case, the two variables, <span class="math inline">\(Z\)</span> and <span class="math inline">\(X\)</span>, are independent. However, conditioning on the collision node, <span class="math inline">\(Y\)</span>, creates a dependency between the two.</p>
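<p>Collider bias is easy to demonstrate with a quick simulation (my own toy numbers): <span class="math inline">\(Z\)</span> and <span class="math inline">\(X\)</span> are generated independently, yet selecting on their common effect <span class="math inline">\(Y\)</span> makes them negatively correlated.</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Z and X are independent causes of the collider Y.
z = rng.normal(size=n)
x = rng.normal(size=n)
y = z + x + rng.normal(size=n)

r_marginal = np.corrcoef(z, x)[0, 1]  # close to 0: marginally independent

# Conditioning on the collider, here by selecting high values of Y:
sel = y > 1
r_selected = np.corrcoef(z[sel], x[sel])[0, 1]  # clearly negative

print(r_marginal, r_selected)
```

<p>Intuitively, within the selected group a high <span class="math inline">\(Y\)</span> with a low <span class="math inline">\(Z\)</span> must be “explained” by a high <span class="math inline">\(X\)</span>, which induces the negative dependence.</p>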
<p><strong>Graphs, paths and d-separation</strong></p>
<p>Let’s move from these small components to a larger graph. In a larger graph, two variables / nodes are independent if every path between them is blocked; otherwise, they are likely dependent - or, in Pearl’s terminology, d-separated vs d-connected. Without conditioning on any of the variables, only colliders can block a path. However, if a set of variables, <span class="math inline">\(Z\)</span>, is conditioned on, the following kinds of formations can block a path:</p>
<ul>
<li>A chain or fork whose middle node is in <span class="math inline">\(Z\)</span>.<br />
</li>
<li>A collider that is <em>not</em> in <span class="math inline">\(Z\)</span> and whose descendants are also not in <span class="math inline">\(Z\)</span>.</li>
</ul>
<p>The discussion on forks shows the importance of controlling for common causes in order to infer the causal effect, but is this sufficient? Consider the following scenario where <span class="math inline">\(X\)</span> is the variable of interest and <span class="math inline">\(Y\)</span> is the outcome variable:</p>
<p><img src="./post/2018-08-09-applications-of-dags-in-causal-inference_files/figure-html/mgraph-1.png" width="60%" height="50%" /></p>
<p>Controlling for the common cause, <span class="math inline">\(Z\)</span>, eliminates <span class="math inline">\(Z\)</span> as a source of bias but induces a dependency between <span class="math inline">\(E\)</span> and <span class="math inline">\(A\)</span>, which means that the causal effect cannot be estimated. It turns out that one has to condition on one of the following variable sets: <span class="math inline">\(\{E, Z\}\)</span>, <span class="math inline">\(\{A, Z\}\)</span> or <span class="math inline">\(\{E, A, Z\}\)</span> for the true effect to be estimated. Hence, controlling for common causes alone is not sufficient to estimate the causal effect.</p>
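<p>Assuming a linear structural model consistent with the graph above (the coefficients are invented for illustration, with the true effect of <span class="math inline">\(X\)</span> on <span class="math inline">\(Y\)</span> set to 1), a short simulation shows that controlling for <span class="math inline">\(Z\)</span> alone is not enough, while the set <span class="math inline">\(\{E, Z\}\)</span> recovers the effect:</p>

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

def coefs(y, *cols):
    """OLS coefficients on the given columns (after an intercept)."""
    X = np.column_stack([np.ones(len(y))] + list(cols))
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

# E -> Z <- A, Z -> X, Z -> Y, E -> X, A -> Y, and X -> Y with effect 1.
e, a = rng.normal(size=n), rng.normal(size=n)
z = e + a + rng.normal(size=n)
x = z + e + rng.normal(size=n)
y = x + z + a + rng.normal(size=n)

b_x = coefs(y, x)[0]          # confounded: backdoor paths through Z are open
b_xz = coefs(y, x, z)[0]      # still biased: Z opens the E -> Z <- A collider
b_xze = coefs(y, x, z, e)[0]  # {E, Z} blocks every backdoor path

print(b_x, b_xz, b_xze)  # roughly 1.71, 0.80, 1.00
```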
<p><strong>Backdoor criterion</strong></p>
<p>The above set of variables fulfill what Pearl calls the backdoor criterion. These are sets of variables that satisfy two important rules:</p>
<ul>
<li>they do not contain a descendant of <span class="math inline">\(X\)</span></li>
<li>and they block every path between <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span> that contains an arrow into <span class="math inline">\(X\)</span></li>
</ul>
<p>The backdoor criterion can be useful when particular parent variables are not observed but we can condition on other child nodes to block the path between <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span>.</p>
<p>The DAG approach also shows which variables should not be adjusted for. Consider the following modified graph:<br />
<img src="./post/2018-08-09-applications-of-dags-in-causal-inference_files/figure-html/mgraph2-1.png" width="60%" height="50%" /></p>
<p>Here, controlling for <span class="math inline">\(Z\)</span>, a collider node, introduces a bias when we could simply get away with comparing the bivariate relationship between <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span>. In this case, controlling for more variables is not necessarily better, and the DAG provides a systematic method for deciding which variables to control for and which not to.</p>
</div>
<div id="dag-as-a-heuristic-tool" class="section level3">
<h3>2) DAG as a heuristic tool</h3>
<p>Next, I will discuss the use of DAGs as a heuristic tool. This is related to the first part on what to control for but extends beyond it by discussing the use of DAGs as a way of evaluating identification strategies. I see two main benefits of the approach: first, as a way of explaining the thought process behind specification testing procedures and, second, as a method of assessing instrumental variables.</p>
<div id="specification-testing" class="section level4">
<h4>Specification testing</h4>
<p>Specification testing refers to the procedure where variables are incrementally added to a model and the coefficient of interest is interpreted across these different specifications and used to draw a conclusion about the variable of interest. Let us examine one such example which Gelbach (2014) discusses. It is adapted from Levitt and Syverson (2008), in which the authors try to determine the causal effect of agent home ownership (whether the property agent owns the home or not) on sale prices in order to show an empirically interesting example of the principal-agent problem. The authors argue that agents have an incentive to sell a house quickly and at a lower price since they only receive a small commission from the sale but bear most of the expenses related to it. However, when agents are selling their own houses, they would have the incentive to maximise the sales price. Hence, it would be expected that agent ownership of a house would lead to a higher final sales price. Here are the results of the specification tests which Levitt and Syverson conducted:</p>
<div class="figure">
<img src="./img/tbl2_levitt_syverson.png" />
</div>
<p>The main difference appears when the basic house characteristics are added to the model. Levitt and Syverson conclude that the other controls have a small impact on the agent ownership coefficient. Gelbach points out that this conclusion is mistaken:</p>
<blockquote>
<p>However, it is possible that the scale and basic-amenity variables are correlated with detailed indicators of house quality, description keywords, and block dummies. Thus, it is possible that the coefficient on the agent dummy would move just as much when any of these other sets of covariates is added before adding the basic scale and amenity characteristics.</p>
</blockquote>
<p>Denote agent ownership with <span class="math inline">\(X\)</span>, sales price with <span class="math inline">\(Y\)</span>, house characteristics with <span class="math inline">\(C\)</span>, quality with <span class="math inline">\(Q\)</span>, and keywords with <span class="math inline">\(K\)</span> (I omit block effects for the sake of brevity). The only situation in which one could come to the conclusion of Levitt and Syverson is if house quality, keywords and block effects all act as independent confounders:</p>
<p><img src="./post/2018-08-09-applications-of-dags-in-causal-inference_files/figure-html/levittsyverson1-1.png" width="60%" height="50%" /></p>
<p>Yet, Gelbach is still not being precise enough when he hypothesises that house characteristics could be ‘correlated’ with the other control variables. There are many possible causal models that could create a correlation between these variables, but the main takeaway and interpretation would differ depending on which causal model is true. Hence, it is necessary to come up with a causal story even before one can properly interpret the coefficients.</p>
<p>From the variable names and descriptions, it could be possible that Gelbach has the following causal model in mind: An unobserved factor of housing quality, <span class="math inline">\(U\)</span>, is a direct cause of the other variables (<span class="math inline">\(C\)</span>, <span class="math inline">\(Q\)</span>, <span class="math inline">\(K\)</span>) and each of these variables serve as a proxy for the same underlying cause:</p>
<p><img src="./post/2018-08-09-applications-of-dags-in-causal-inference_files/figure-html/levittsyverson2-1.png" width="60%" height="50%" /></p>
<p>Under such a hypothetical pathway, controlling for <span class="math inline">\(C\)</span> would serve two purposes: it would block the path from <span class="math inline">\(C\)</span> to <span class="math inline">\(Y\)</span> and also act as a proxy for <span class="math inline">\(U\)</span>. If the proxy effect dominates then one could come to the conclusion of Gelbach that the coefficient on the agent dummy would move just as much if any of the other control variables were added first. However, if one were to believe such a model, the only meaningful specification would be to include all the controls (i.e. block all confounding backdoor paths between <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span>) and an additive interpretation, as well as all other earlier specifications, would make no sense.</p>
<p>Of course, one could also propose another causal model in which an agent’s experience or network affects the selling price as well as housing quality, but not home ownership. In this case, conditioning on housing quality induces a bias (as a result of conditioning on a collider node) and it would make sense to present both specifications to show that the agent effect is positive in both of the scenarios. In short, the different specifications presented should come from theories or hypothesised relationships between variables rather than a page-filling statistical exercise.</p>
</div>
<div id="instrumental-variables-iv" class="section level4">
<h4>Instrumental variables (IV)</h4>
<p>The IV approach could be summarised by the following DAG:<br />
<img src="./post/2018-08-09-applications-of-dags-in-causal-inference_files/figure-html/iv-1.png" width="60%" height="50%" /></p>
<p><span class="math inline">\(Z\)</span> is the instrumental variable and it affects <span class="math inline">\(Y\)</span> only through <span class="math inline">\(X\)</span>. The economics literature emphasises the importance of an instrument fulfilling two conditions:</p>
<ul>
<li>Instrument relevance, where the instrument is strongly correlated with the endogenous variable of interest<br />
</li>
<li>Instrument validity, where the instrument is exogenous conditional on the control variables and satisfies the exclusion restriction (i.e. affects the outcome only through the variable of interest)</li>
</ul>
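<p>The logic of the DAG above can be sketched with a simulation (all numbers are my own invention): OLS of <span class="math inline">\(Y\)</span> on <span class="math inline">\(X\)</span> is biased by an unobserved confounder, while the Wald/IV ratio <span class="math inline">\(cov(Z,Y)/cov(Z,X)\)</span> recovers the true effect of 1, provided <span class="math inline">\(Z\)</span> really does enter <span class="math inline">\(Y\)</span> only through <span class="math inline">\(X\)</span>.</p>

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

# Z -> X -> Y, with an unobserved U confounding X and Y; true effect of X is 1.
u = rng.normal(size=n)
z = rng.normal(size=n)                # the instrument
x = z + u + rng.normal(size=n)        # first stage: Z is relevant
y = x + 2.0 * u + rng.normal(size=n)  # Z enters Y only through X (exclusion)

ols = np.cov(x, y)[0, 1] / np.cov(x, y)[0, 0]  # biased by U
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]   # Wald / IV estimator

print(ols, iv)  # roughly 1.67 vs 1.00
```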
<p>DAGs are a useful tool to help examine whether a proposed instrument satisfies the exclusion restriction. When doing so, two things should be considered:</p>
<ul>
<li>Possible pathways between the instrument and the outcome variable<br />
</li>
<li>Possible pathways between the instrument and the variable of interest</li>
</ul>
<p>Let us consider one of the most popular instruments in the literature - rainfall and use Miguel, Satyanath and Sergenti (2004) as an example. In this paper the authors use rainfall as an instrument to find the causal effect of economic growth on civil conflict. Rainfall is shown to be a good predictor of economic growth in sub-Saharan Africa since these countries are very reliant on the agriculture industry and do not have extensive irrigation systems or stable water sources. After controlling for a variety of variables including ethnolinguistic fractionalization, religious fractionalization, democracy, per capita income, population, percentage of mountainous terrain, oil exports and fixed effects they found that a negative growth shock leads to conflict in the following year.</p>
<p>Side-note: I remember first being introduced to the concept of IVs by a developmental economist and marveled at the brilliance of rainfall as an instrument. It cannot be affected by any variable (i.e. exogenous) and can be used to determine the effect of growth on civil conflict, democratic institutions and remittances in Africa, riots on housing prices in the U.S. and the effect of poverty (rye price) on violent crime in 19th-century Germany. Maybe too much of a good thing should have raised some red flags?</p>
<p>Examining the pathway between rainfall and growth, it is clear that growth cannot affect rainfall (at least in the short and medium term). The authors also describe a plausible channel through which rainfall affects growth and back this up with the first-stage regression estimates.</p>
<p>Next, we have to consider the pathways between rainfall and conflict. Again, I think it is fair to rule out any channels that cause rainfall (i.e. any arrows pointing towards rainfall), but this still leaves all sorts of possible influences of rainfall on conflict, i.e., it is hard to justify the exclusion restriction. The restriction only holds if the set of controls, <span class="math inline">\(C\)</span>, blocks all possible causal channels between rainfall, <span class="math inline">\(Z\)</span>, and civil conflict, <span class="math inline">\(Y\)</span>:</p>
<p><img src="./post/2018-08-09-applications-of-dags-in-causal-inference_files/figure-html/rainfall-1.png" width="60%" height="50%" /></p>
<p>The authors try to address potential violations of the exclusion restriction by examining other channels such as tax revenue, road networks and heatwaves, arguing that these factors do not affect their estimates and ‘other effects are likely to be minor’. However, I believe there are other obvious intermediate channels through which rainfall could affect civil conflict directly. One example, pointed out in the appendix of the paper, is mass migration due to droughts. Another possibility is rainfall (either too little or too much) directly preventing riots from forming (it was used as an instrument for riots in a U.S.-based study).</p>
<p>The main benefit of DAGs here is to keep track of the assumptions made and the validity of the arguments. It also keeps the reader from getting sidetracked by the controls that the authors included and promotes better thinking about potential pathways through which an identification strategy could be compromised.</p>
</div>
</div>
<div id="front-door-criterion" class="section level3">
<h3>3) Front-door criterion</h3>
<p>My third insight from the book was a possibly new method of teasing out causal effects which Pearl et al. call the front-door criterion. In the first section of the post, I touched on the backdoor criterion, where the variables which satisfy the criterion block all backdoor paths from the variable of interest to the outcome. The front-door criterion instead considers variables that are descendants of the variable of interest and mediate between the variable of interest and the outcome. More concretely, a set of variables, <span class="math inline">\(Z\)</span>, satisfies the front-door criterion if</p>
<ul>
<li><span class="math inline">\(Z\)</span> intercepts all paths between <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span><br />
</li>
<li>All backdoor paths from <span class="math inline">\(Z\)</span> to <span class="math inline">\(Y\)</span> are blocked by <span class="math inline">\(X\)</span><br />
</li>
<li>There are no unblocked backdoor paths between <span class="math inline">\(X\)</span> and <span class="math inline">\(Z\)</span></li>
</ul>
<p>In a way the front-door criterion is quite similar to the instrumental variable method but instead of affecting the outcome only through the variable of interest, it is the only channel through which the variable of interest can affect the outcome. As Alex Chinco notes in his <a href="http://www.alexchinco.com/example-front-door-criterion/">blog post</a>, ‘rather than focusing on exogenous variation in treatment selection, this approach exploits exogenous variation in the strength of the treatment.’</p>
<p>Pearl gives an example of trying to ascertain the effect of smoking, <span class="math inline">\(X\)</span>, on lung cancer, <span class="math inline">\(Y\)</span>. Genotype, <span class="math inline">\(U\)</span>, is an unobserved confounding variable but we know that smoking only causes lung cancer through the amount of tar deposits, <span class="math inline">\(Z\)</span>. This can be modeled by the following DAG:</p>
<p><img src="./post/2018-08-09-applications-of-dags-in-causal-inference_files/figure-html/frontdoor-1.png" width="60%" height="50%" /></p>
<p>In this example, we cannot block the back-door path since genotype is unobserved and hence, cannot directly estimate the effect of <span class="math inline">\(X\)</span> on <span class="math inline">\(Y\)</span>. Instead, identification comes from our knowledge of the effect of <span class="math inline">\(X\)</span> on <span class="math inline">\(Z\)</span>, which we can directly estimate, and the effect of <span class="math inline">\(Z\)</span> on <span class="math inline">\(Y\)</span> which we can estimate by controlling for <span class="math inline">\(X\)</span> as this blocks the backdoor path between <span class="math inline">\(Z\)</span> and <span class="math inline">\(Y\)</span>. The overall effect of <span class="math inline">\(X\)</span> on <span class="math inline">\(Y\)</span> can then be derived by chaining the effect of <span class="math inline">\(X\)</span> on <span class="math inline">\(Z\)</span> and <span class="math inline">\(Z\)</span> on <span class="math inline">\(Y\)</span>. Interested readers can read the book for a more complete mathematical treatment.</p>
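<p>The chaining argument can be illustrated with a linear simulation (coefficients are mine: an effect of 0.5 from <span class="math inline">\(X\)</span> to <span class="math inline">\(Z\)</span> and 0.8 from <span class="math inline">\(Z\)</span> to <span class="math inline">\(Y\)</span>, so the true total effect of <span class="math inline">\(X\)</span> on <span class="math inline">\(Y\)</span> is 0.4): the naive regression is confounded by the unobserved <span class="math inline">\(U\)</span>, while multiplying the two identified pieces recovers 0.4.</p>

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000

def slope(y, *cols):
    """OLS coefficient on the first column (after an intercept)."""
    X = np.column_stack([np.ones(len(y))] + list(cols))
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# U confounds X and Y; X affects Y only through the mediator Z.
u = rng.normal(size=n)
x = u + rng.normal(size=n)
z = 0.5 * x + rng.normal(size=n)
y = 0.8 * z + u + rng.normal(size=n)

naive = slope(y, x)   # biased: the backdoor path through U is open
xz = slope(z, x)      # X -> Z: no backdoor path, direct regression works
zy = slope(y, z, x)   # Z -> Y: controlling for X blocks Z <- X <- U -> Y
frontdoor = xz * zy   # chain the two pieces together

print(naive, frontdoor)  # roughly 0.9 vs 0.4
```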
<p>The blog linked above gives some examples where it is used in the economics literature. One of the more interesting and convincing applications is that of Cohen and Malloy (2010), who examined the link between social ties and congressional voting. Social ties are inferred from the colleges which the congressmen attended, and there could be many factors driving both college and voting decisions. Instead of looking for an instrument, the authors exploit the channel through which social ties affect votes (namely, the seating arrangement in the senate chamber). Since seating for rookie senators is randomised, they were able to estimate the effect of social ties on voting outcomes based on exogenous variation in the mediating channel.</p>
<p>The number of assumptions needed to make the front-door criterion believable is quite high but at least it has not been used (and misused) to death like the instrumental variable method. The main difficulty of this method is that it requires one to make a causal claim on the pathway between the variable of interest, the mediating variable and the outcome. This is a large claim to make but in a way it is similar to arguing over the exogeneity of an instrument or whether it fulfills the exclusion restriction.</p>
</div>
</div>
<div id="conclusion" class="section level2">
<h2>Conclusion</h2>
<p>In this post, I highlighted three benefits of modelling causal effects in a DAG framework. The book discusses other interesting points, such as the correspondence between the graphical framework and the potential outcomes framework as well as direct vs indirect effects, which are worth a read. I hope you found this article useful and if you are interested in causal analysis please go check out Pearl, Glymour and Jewell (2016).</p>
</div>
Feature Selection Using Feature Importance Score - Creating a PySpark Estimator
https://www.timlrx.com/2018/06/19/feature-selection-using-feature-importance-score-creating-a-pyspark-estimator/
Tue, 19 Jun 2018 00:00:00 +0000timothy.lin@alumni.ubc.ca (Timothy Lin)https://www.timlrx.com/2018/06/19/feature-selection-using-feature-importance-score-creating-a-pyspark-estimator/
<p>In this post I discuss how to create a new pyspark estimator to integrate into an existing machine learning pipeline. This is an extension of my <a href="https://www.timlrx.com/2018/04/08/creating-a-custom-cross-validation-function-in-pyspark/">previous post</a> where I discussed how to create a custom cross-validation function. Recently, I have been looking at integrating existing code into the pyspark ML pipeline framework. A pipeline is a fantastic abstraction since it allows the analyst to focus on the main tasks that need to be carried out and makes the entire piece of work reusable.</p>
<p>As a fun and useful example, I will show how feature selection using feature importance score can be coded into a pipeline. I find Pyspark’s MLlib native feature selection functions relatively limited so this is also part of an effort to extend the feature selection methods. Here, I use the feature importance score as estimated from a model (decision tree / random forest / gradient boosted trees) to extract the variables that are plausibly the most important.</p>
<p>First, let’s setup the jupyter notebook and import the relevant functions. I use a local version of spark to illustrate how this works but one can easily use a yarn cluster instead.</p>
<pre><code class="language-python">from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import numpy as np
import pandas as pd
pd.options.display.max_columns = None
import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.sql import SQLContext
</code></pre>
<pre><code class="language-python">sc = SparkContext()
spark = SQLContext(sc)
</code></pre>
<pre><code class="language-python">from pyspark.sql.functions import *
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator, VectorAssembler, VectorSlicer
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
</code></pre>
<h3 id="bank-marketing-data-set">Bank Marketing Data Set</h3>
<p>To show the usefulness of feature selection and to sort of validate the script, I use the <a href="https://archive.ics.uci.edu/ml/datasets/bank+marketing">Bank Marketing Data Set from the UCI Machine Learning Repository</a> as an example throughout this post. It comes from the Moro et al. (2014) paper, A Data-Driven Approach to Predict the Success of Bank Telemarketing. As the name of the paper suggests, the goal of this dataset is to predict which bank customers would subscribe to a term deposit product as a result of a phone marketing campaign.</p>
<p>Let us read in the file and take a look at the variables of the dataset.</p>
<pre><code class="language-python">df = spark.read.option("delimiter", ";").csv("../data/bank-additional/bank-additional-full.csv", header=True, inferSchema = True)
</code></pre>
<pre><code class="language-python">df.dtypes
</code></pre>
<pre><code>[('age', 'int'),
('job', 'string'),
('marital', 'string'),
('education', 'string'),
('default', 'string'),
('housing', 'string'),
('loan', 'string'),
('contact', 'string'),
('month', 'string'),
('day_of_week', 'string'),
('duration', 'int'),
('campaign', 'int'),
('pdays', 'int'),
('previous', 'int'),
('poutcome', 'string'),
('emp.var.rate', 'double'),
('cons.price.idx', 'double'),
('cons.conf.idx', 'double'),
('euribor3m', 'double'),
('nr.employed', 'double'),
('y', 'string')]
</code></pre>
<p>There are some problematic variable names, so we should replace the dot separator with an underscore.</p>
<pre><code class="language-python">df = df.toDF(*(c.replace('.', '_') for c in df.columns))
</code></pre>
<pre><code class="language-python">df.limit(5).toPandas()
</code></pre>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>age</th>
<th>job</th>
<th>marital</th>
<th>education</th>
<th>default</th>
<th>housing</th>
<th>loan</th>
<th>contact</th>
<th>month</th>
<th>day_of_week</th>
<th>duration</th>
<th>campaign</th>
<th>pdays</th>
<th>previous</th>
<th>poutcome</th>
<th>emp_var_rate</th>
<th>cons_price_idx</th>
<th>cons_conf_idx</th>
<th>euribor3m</th>
<th>nr_employed</th>
<th>y</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>56</td>
<td>housemaid</td>
<td>married</td>
<td>basic.4y</td>
<td>no</td>
<td>no</td>
<td>no</td>
<td>telephone</td>
<td>may</td>
<td>mon</td>
<td>261</td>
<td>1</td>
<td>999</td>
<td>0</td>
<td>nonexistent</td>
<td>1.1</td>
<td>93.994</td>
<td>-36.4</td>
<td>4.857</td>
<td>5191.0</td>
<td>no</td>
</tr>
<tr>
<th>1</th>
<td>57</td>
<td>services</td>
<td>married</td>
<td>high.school</td>
<td>unknown</td>
<td>no</td>
<td>no</td>
<td>telephone</td>
<td>may</td>
<td>mon</td>
<td>149</td>
<td>1</td>
<td>999</td>
<td>0</td>
<td>nonexistent</td>
<td>1.1</td>
<td>93.994</td>
<td>-36.4</td>
<td>4.857</td>
<td>5191.0</td>
<td>no</td>
</tr>
<tr>
<th>2</th>
<td>37</td>
<td>services</td>
<td>married</td>
<td>high.school</td>
<td>no</td>
<td>yes</td>
<td>no</td>
<td>telephone</td>
<td>may</td>
<td>mon</td>
<td>226</td>
<td>1</td>
<td>999</td>
<td>0</td>
<td>nonexistent</td>
<td>1.1</td>
<td>93.994</td>
<td>-36.4</td>
<td>4.857</td>
<td>5191.0</td>
<td>no</td>
</tr>
<tr>
<th>3</th>
<td>40</td>
<td>admin.</td>
<td>married</td>
<td>basic.6y</td>
<td>no</td>
<td>no</td>
<td>no</td>
<td>telephone</td>
<td>may</td>
<td>mon</td>
<td>151</td>
<td>1</td>
<td>999</td>
<td>0</td>
<td>nonexistent</td>
<td>1.1</td>
<td>93.994</td>
<td>-36.4</td>
<td>4.857</td>
<td>5191.0</td>
<td>no</td>
</tr>
<tr>
<th>4</th>
<td>56</td>
<td>services</td>
<td>married</td>
<td>high.school</td>
<td>no</td>
<td>no</td>
<td>yes</td>
<td>telephone</td>
<td>may</td>
<td>mon</td>
<td>307</td>
<td>1</td>
<td>999</td>
<td>0</td>
<td>nonexistent</td>
<td>1.1</td>
<td>93.994</td>
<td>-36.4</td>
<td>4.857</td>
<td>5191.0</td>
<td>no</td>
</tr>
</tbody>
</table>
</div>
<p>It’s always nice to take a look at the distribution of the variables.</p>
<pre><code class="language-python">df.describe().toPandas()
</code></pre>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>summary</th>
<th>age</th>
<th>job</th>
<th>marital</th>
<th>education</th>
<th>default</th>
<th>housing</th>
<th>loan</th>
<th>contact</th>
<th>month</th>
<th>day_of_week</th>
<th>duration</th>
<th>campaign</th>
<th>pdays</th>
<th>previous</th>
<th>poutcome</th>
<th>emp_var_rate</th>
<th>cons_price_idx</th>
<th>cons_conf_idx</th>
<th>euribor3m</th>
<th>nr_employed</th>
<th>y</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>count</td>
<td>41188</td>
<td>41188</td>
<td>41188</td>
<td>41188</td>
<td>41188</td>
<td>41188</td>
<td>41188</td>
<td>41188</td>
<td>41188</td>
<td>41188</td>
<td>41188</td>
<td>41188</td>
<td>41188</td>
<td>41188</td>
<td>41188</td>
<td>41188</td>
<td>41188</td>
<td>41188</td>
<td>41188</td>
<td>41188</td>
<td>41188</td>
</tr>
<tr>
<th>1</th>
<td>mean</td>
<td>40.02406040594348</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>258.2850101971448</td>
<td>2.567592502670681</td>
<td>962.4754540157328</td>
<td>0.17296299893172767</td>
<td>None</td>
<td>0.08188550063178392</td>
<td>93.57566436828918</td>
<td>-40.50260027191787</td>
<td>3.6212908128585366</td>
<td>5167.035910944004</td>
<td>None</td>
</tr>
<tr>
<th>2</th>
<td>stddev</td>
<td>10.421249980934057</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>None</td>
<td>259.27924883646455</td>
<td>2.770013542902331</td>
<td>186.9109073447414</td>
<td>0.49490107983928927</td>
<td>None</td>
<td>1.57095974051703</td>
<td>0.5788400489541355</td>
<td>4.628197856174595</td>
<td>1.7344474048512557</td>
<td>72.25152766825924</td>
<td>None</td>
</tr>
<tr>
<th>3</th>
<td>min</td>
<td>17</td>
<td>admin.</td>
<td>divorced</td>
<td>basic.4y</td>
<td>no</td>
<td>no</td>
<td>no</td>
<td>cellular</td>
<td>apr</td>
<td>fri</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>failure</td>
<td>-3.4</td>
<td>92.201</td>
<td>-50.8</td>
<td>0.634</td>
<td>4963.6</td>
<td>no</td>
</tr>
<tr>
<th>4</th>
<td>max</td>
<td>98</td>
<td>unknown</td>
<td>unknown</td>
<td>unknown</td>
<td>yes</td>
<td>yes</td>
<td>yes</td>
<td>telephone</td>
<td>sep</td>
<td>wed</td>
<td>4918</td>
<td>56</td>
<td>999</td>
<td>7</td>
<td>success</td>
<td>1.4</td>
<td>94.767</td>
<td>-26.9</td>
<td>5.045</td>
<td>5228.1</td>
<td>yes</td>
</tr>
</tbody>
</table>
</div>
<p>There are quite a few variables that are encoded as strings in this dataset. Converting each string to binary indicator variables / dummy variables takes up quite a few degrees of freedom; in machine learning speak it might also lead to the model being overfitted. Let us take a look at what is represented by each variable that is of string type.</p>
<pre><code class="language-python">for i in df.dtypes:
    if i[1] == 'string':
        df.groupBy(i[0]).count().orderBy('count', ascending=False).toPandas()
</code></pre>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>job</th>
<th>count</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>admin.</td>
<td>10422</td>
</tr>
<tr>
<th>1</th>
<td>blue-collar</td>
<td>9254</td>
</tr>
<tr>
<th>2</th>
<td>technician</td>
<td>6743</td>
</tr>
<tr>
<th>3</th>
<td>services</td>
<td>3969</td>
</tr>
<tr>
<th>4</th>
<td>management</td>
<td>2924</td>
</tr>
<tr>
<th>5</th>
<td>retired</td>
<td>1720</td>
</tr>
<tr>
<th>6</th>
<td>entrepreneur</td>
<td>1456</td>
</tr>
<tr>
<th>7</th>
<td>self-employed</td>
<td>1421</td>
</tr>
<tr>
<th>8</th>
<td>housemaid</td>
<td>1060</td>
</tr>
<tr>
<th>9</th>
<td>unemployed</td>
<td>1014</td>
</tr>
<tr>
<th>10</th>
<td>student</td>
<td>875</td>
</tr>
<tr>
<th>11</th>
<td>unknown</td>
<td>330</td>
</tr>
</tbody>
</table>
</div>
<br>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>marital</th>
<th>count</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>married</td>
<td>24928</td>
</tr>
<tr>
<th>1</th>
<td>single</td>
<td>11568</td>
</tr>
<tr>
<th>2</th>
<td>divorced</td>
<td>4612</td>
</tr>
<tr>
<th>3</th>
<td>unknown</td>
<td>80</td>
</tr>
</tbody>
</table>
</div>
<br>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>education</th>
<th>count</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>university.degree</td>
<td>12168</td>
</tr>
<tr>
<th>1</th>
<td>high.school</td>
<td>9515</td>
</tr>
<tr>
<th>2</th>
<td>basic.9y</td>
<td>6045</td>
</tr>
<tr>
<th>3</th>
<td>professional.course</td>
<td>5243</td>
</tr>
<tr>
<th>4</th>
<td>basic.4y</td>
<td>4176</td>
</tr>
<tr>
<th>5</th>
<td>basic.6y</td>
<td>2292</td>
</tr>
<tr>
<th>6</th>
<td>unknown</td>
<td>1731</td>
</tr>
<tr>
<th>7</th>
<td>illiterate</td>
<td>18</td>
</tr>
</tbody>
</table>
</div>
<br>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>default</th>
<th>count</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>no</td>
<td>32588</td>
</tr>
<tr>
<th>1</th>
<td>unknown</td>
<td>8597</td>
</tr>
<tr>
<th>2</th>
<td>yes</td>
<td>3</td>
</tr>
</tbody>
</table>
</div>
<br>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>housing</th>
<th>count</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>yes</td>
<td>21576</td>
</tr>
<tr>
<th>1</th>
<td>no</td>
<td>18622</td>
</tr>
<tr>
<th>2</th>
<td>unknown</td>
<td>990</td>
</tr>
</tbody>
</table>
</div>
<br>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>loan</th>
<th>count</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>no</td>
<td>33950</td>
</tr>
<tr>
<th>1</th>
<td>yes</td>
<td>6248</td>
</tr>
<tr>
<th>2</th>
<td>unknown</td>
<td>990</td>
</tr>
</tbody>
</table>
</div>
<br>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>contact</th>
<th>count</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>cellular</td>
<td>26144</td>
</tr>
<tr>
<th>1</th>
<td>telephone</td>
<td>15044</td>
</tr>
</tbody>
</table>
</div>
<br>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>month</th>
<th>count</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>may</td>
<td>13769</td>
</tr>
<tr>
<th>1</th>
<td>jul</td>
<td>7174</td>
</tr>
<tr>
<th>2</th>
<td>aug</td>
<td>6178</td>
</tr>
<tr>
<th>3</th>
<td>jun</td>
<td>5318</td>
</tr>
<tr>
<th>4</th>
<td>nov</td>
<td>4101</td>
</tr>
<tr>
<th>5</th>
<td>apr</td>
<td>2632</td>
</tr>
<tr>
<th>6</th>
<td>oct</td>
<td>718</td>
</tr>
<tr>
<th>7</th>
<td>sep</td>
<td>570</td>
</tr>
<tr>
<th>8</th>
<td>mar</td>
<td>546</td>
</tr>
<tr>
<th>9</th>
<td>dec</td>
<td>182</td>
</tr>
</tbody>
</table>
</div>
<br>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>day_of_week</th>
<th>count</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>thu</td>
<td>8623</td>
</tr>
<tr>
<th>1</th>
<td>mon</td>
<td>8514</td>
</tr>
<tr>
<th>2</th>
<td>wed</td>
<td>8134</td>
</tr>
<tr>
<th>3</th>
<td>tue</td>
<td>8090</td>
</tr>
<tr>
<th>4</th>
<td>fri</td>
<td>7827</td>
</tr>
</tbody>
</table>
</div>
<br>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>poutcome</th>
<th>count</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>nonexistent</td>
<td>35563</td>
</tr>
<tr>
<th>1</th>
<td>failure</td>
<td>4252</td>
</tr>
<tr>
<th>2</th>
<td>success</td>
<td>1373</td>
</tr>
</tbody>
</table>
</div>
<br>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>y</th>
<th>count</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>no</td>
<td>36548</td>
</tr>
<tr>
<th>1</th>
<td>yes</td>
<td>4640</td>
</tr>
</tbody>
</table>
</div>
<p>The number of categories for each string type is relatively small which makes creating binary indicator variables / one-hot encoding a suitable pre-processing step. Let us take a look at how to do feature selection using the feature importance score the manual way before coding it as an estimator to fit into a Pyspark pipeline.</p>
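As a quick sanity check on the dimensionality cost, we can tally the expected width of the assembled feature vector from the category counts printed above. With the settings used below (<code>handleInvalid = 'keep'</code> on the StringIndexer and the encoder's default <code>dropLast=True</code>), each string column ends up contributing one one-hot column per observed category, since the extra "unseen" index and the dropped last column cancel out. A stdlib-only back-of-the-envelope tally:

```python
# Category counts taken from the tables above; numeric columns counted from df.dtypes.
cat_counts = {'job': 12, 'marital': 4, 'education': 8, 'default': 3,
              'housing': 3, 'loan': 3, 'contact': 2, 'month': 10,
              'day_of_week': 5, 'poutcome': 3}
num_numeric = 10  # age, duration, campaign, pdays, previous + 5 macro indicators

# One one-hot column per observed category, plus the numeric columns as-is.
total_features = sum(cat_counts.values()) + num_numeric
print(total_features)  # 63
```

This matches the length of the feature importance vector (<code>SparseVector(63, ...)</code>) extracted from the fitted model later in the post.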
<h3 id="data-preprocessing">Data Preprocessing</h3>
<p>Before we run the model on the most relevant features, we would first need to encode the string variables as binary vectors and run a random forest model on the whole feature set to get the feature importance score. Here I just run most of these tasks as part of a pipeline.</p>
<pre><code class="language-python"># one hot encoding and assembling
encoding_var = [i[0] for i in df.dtypes if (i[1]=='string') & (i[0]!='y')]
num_var = [i[0] for i in df.dtypes if ((i[1]=='int') | (i[1]=='double')) & (i[0]!='y')]
string_indexes = [StringIndexer(inputCol = c, outputCol = 'IDX_' + c, handleInvalid = 'keep') for c in encoding_var]
onehot_indexes = [OneHotEncoderEstimator(inputCols = ['IDX_' + c], outputCols = ['OHE_' + c]) for c in encoding_var]
label_indexes = StringIndexer(inputCol = 'y', outputCol = 'label', handleInvalid = 'keep')
assembler = VectorAssembler(inputCols = num_var + ['OHE_' + c for c in encoding_var], outputCol = "features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", seed = 8464,
numTrees=10, cacheNodeIds = True, subsamplingRate = 0.7)
pipe = Pipeline(stages = string_indexes + onehot_indexes + [assembler, label_indexes, rf])
</code></pre>
<pre><code class="language-python">mod = pipe.fit(df)
</code></pre>
<pre><code class="language-python">df2 = mod.transform(df)
</code></pre>
<p>The feature importance score that is returned comes in the form of a sparse vector. This is not very human readable and we would need to map it to the actual variable names for some insights. I wrote a little function to return the variable names sorted by importance score as a pandas data frame. This was inspired by the following post on <a href="https://stackoverflow.com/questions/42935914/how-to-map-features-from-the-output-of-a-vectorassembler-back-to-the-column-name">stackoverflow</a>.</p>
<pre><code class="language-python">mod.stages[-1].featureImportances
</code></pre>
<pre><code>SparseVector(63, {0: 0.0257, 1: 0.1596, 2: 0.0037, 3: 0.2212, 4: 0.0305, 5: 0.0389, 6: 0.0762, 7: 0.0423, 8: 0.1869, 9: 0.063, 10: 0.0002, 12: 0.0003, 13: 0.0002, 14: 0.0003, 15: 0.0005, 16: 0.0002, 18: 0.0006, 19: 0.0003, 20: 0.0002, 21: 0.0, 22: 0.001, 23: 0.0003, 24: 0.0005, 26: 0.0005, 27: 0.0007, 28: 0.0008, 29: 0.0003, 30: 0.0, 31: 0.0001, 34: 0.0002, 35: 0.0021, 37: 0.0001, 38: 0.0003, 39: 0.0003, 40: 0.0003, 41: 0.0001, 42: 0.0002, 43: 0.0284, 44: 0.0167, 45: 0.0038, 46: 0.0007, 47: 0.0008, 48: 0.0132, 49: 0.0003, 50: 0.0014, 51: 0.0159, 52: 0.0114, 53: 0.0103, 54: 0.0036, 55: 0.0002, 56: 0.0021, 57: 0.0002, 58: 0.0006, 59: 0.0005, 60: 0.0158, 61: 0.0038, 62: 0.0121})
</code></pre>
<pre><code class="language-python">def ExtractFeatureImp(featureImp, dataset, featuresCol):
    list_extract = []
    for i in dataset.schema[featuresCol].metadata["ml_attr"]["attrs"]:
        list_extract = list_extract + dataset.schema[featuresCol].metadata["ml_attr"]["attrs"][i]
    varlist = pd.DataFrame(list_extract)
    varlist['score'] = varlist['idx'].apply(lambda x: featureImp[x])
    return(varlist.sort_values('score', ascending = False))
</code></pre>
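To see the mechanics without a Spark session, here is a framework-free sketch of what the function does: the VectorAssembler's output column carries <code>ml_attr</code> metadata mapping vector indices to column names, and we join that mapping to the scores and sort. The <code>attrs</code> dictionary below is a hand-built stand-in for <code>dataset.schema[featuresCol].metadata["ml_attr"]["attrs"]</code>, with illustrative entries only:

```python
# Hypothetical metadata: each group ('numeric', 'binary', ...) lists
# {'idx': vector position, 'name': column name} entries.
attrs = {
    'numeric': [{'idx': 0, 'name': 'age'}, {'idx': 1, 'name': 'duration'}],
    'binary': [{'idx': 2, 'name': 'OHE_contact_cellular'}],
}
importances = {0: 0.0257, 1: 0.1596, 2: 0.0284}  # sparse vector as a dict

# Flatten the groups, attach each index's score, then rank by score.
rows = [entry for group in attrs.values() for entry in group]
for row in rows:
    row['score'] = importances.get(row['idx'], 0.0)
ranked = sorted(rows, key=lambda r: r['score'], reverse=True)
print([r['name'] for r in ranked])  # ['duration', 'OHE_contact_cellular', 'age']
```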
<pre><code class="language-python">ExtractFeatureImp(mod.stages[-1].featureImportances, df2, "features").head(10)
</code></pre>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>idx</th>
<th>name</th>
<th>score</th>
</tr>
</thead>
<tbody>
<tr>
<th>3</th>
<td>3</td>
<td>pdays</td>
<td>0.221203</td>
</tr>
<tr>
<th>8</th>
<td>8</td>
<td>euribor3m</td>
<td>0.186892</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>duration</td>
<td>0.159579</td>
</tr>
<tr>
<th>6</th>
<td>6</td>
<td>cons_price_idx</td>
<td>0.076177</td>
</tr>
<tr>
<th>9</th>
<td>9</td>
<td>nr_employed</td>
<td>0.063016</td>
</tr>
<tr>
<th>7</th>
<td>7</td>
<td>cons_conf_idx</td>
<td>0.042298</td>
</tr>
<tr>
<th>5</th>
<td>5</td>
<td>emp_var_rate</td>
<td>0.038875</td>
</tr>
<tr>
<th>4</th>
<td>4</td>
<td>previous</td>
<td>0.030470</td>
</tr>
<tr>
<th>43</th>
<td>43</td>
<td>OHE_contact_cellular</td>
<td>0.028401</td>
</tr>
<tr>
<th>0</th>
<td>0</td>
<td>age</td>
<td>0.025732</td>
</tr>
</tbody>
</table>
</div>
<p>Now that we have the most important features in a nicely formatted list, we can extract the top 10 features and create a new input vector column with only these variables. Pyspark has a VectorSlicer function that does exactly that. A new model can then be trained just on these 10 variables.</p>
<pre><code class="language-python">varlist = ExtractFeatureImp(mod.stages[-1].featureImportances, df2, "features")
</code></pre>
<pre><code class="language-python">varidx = [x for x in varlist['idx'][0:10]]
</code></pre>
<pre><code class="language-python">varidx
</code></pre>
<pre><code>[3, 8, 1, 6, 9, 7, 5, 4, 43, 0]
</code></pre>
<pre><code class="language-python">slicer = VectorSlicer(inputCol="features", outputCol="features2", indices=varidx)
df3 = slicer.transform(df2)
</code></pre>
<pre><code class="language-python">df3 = df3.drop('rawPrediction', 'probability', 'prediction')
rf2 = RandomForestClassifier(labelCol="label", featuresCol="features2", seed = 8464,
numTrees=10, cacheNodeIds = True, subsamplingRate = 0.7)
mod2 = rf2.fit(df3)
df4 = mod2.transform(df3)
</code></pre>
<h3 id="building-the-estimator-function">Building the estimator function</h3>
<p>Now let us learn to build a new pipeline object that makes the above task easy!</p>
<p>First a bit of theory as taken from the <a href="https://spark.apache.org/docs/2.3.0/ml-pipeline.html">ML pipeline documentation</a>:</p>
<blockquote>
<p><strong>DataFrame</strong>: This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.<br />
<strong>Transformer</strong>: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.<br />
<strong>Estimator</strong>: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.<br />
<strong>Pipeline</strong>: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.</p>
</blockquote>
<p>The important thing to remember is that the pipeline object has two components. The first is the estimator which returns a model and the second is the model/transformer which returns a dataframe.</p>
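The estimator/transformer split can be made concrete with a framework-free analogy in plain Python (this is not the actual pyspark API, just the shape of it): the "estimator" learns something from the data in <code>fit</code> and returns a "transformer" whose <code>transform</code> maps a dataset to a new dataset.

```python
class MeanCenterer:
    """Toy 'estimator': fit() learns the mean and returns a model."""
    def fit(self, xs):
        mean = sum(xs) / len(xs)
        return MeanCenterModel(mean)

class MeanCenterModel:
    """Toy 'transformer' produced by fit(): transform() applies the learned mean."""
    def __init__(self, mean):
        self.mean = mean
    def transform(self, xs):
        return [x - self.mean for x in xs]

model = MeanCenterer().fit([1.0, 2.0, 3.0])  # estimator -> model
print(model.transform([4.0]))  # [2.0]       # model -> transformed data
```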
<p>We begin by coding up the estimator object. The cross-validation function in the <a href="https://www.timlrx.com/2018/04/08/creating-a-custom-cross-validation-function-in-pyspark/">previous post</a> provides a thorough walk-through on creating the estimator object and the params needed. In this case, I wanted the function to select either the top n features or those above a certain cut-off, so these parameters are included as arguments to the function. An estimator (either decision tree / random forest / gradient boosted trees) is also required as an input.</p>
<pre><code class="language-python">def __init__(self, estimator = None, selectorType = "numTopFeatures",
             numTopFeatures = 20, threshold = 0.01, outputCol = "features"):
</code></pre>
<p>Given a dataset we can write a fit function that extracts the feature importance scores</p>
<pre><code class="language-python">mod = est.fit(dataset)
dataset2 = mod.transform(dataset)
varlist = ExtractFeatureImp(mod.featureImportances, dataset2, est.getFeaturesCol())
</code></pre>
<p>Next, some conditional statements select the indexes that correspond to the features we want to extract. This gives us the output of the model: a list of feature indices to keep.</p>
<pre><code class="language-python">if (selectorType == "numTopFeatures"):
    varidx = [x for x in varlist['idx'][0:nfeatures]]
elif (selectorType == "threshold"):
    varidx = [x for x in varlist[varlist['score'] > threshold]['idx']]
</code></pre>
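The selection logic itself can be exercised on plain data, independent of Spark. A stdlib-only sketch, using (index, score) pairs taken from the importance vector printed earlier (<code>varlist</code> is already sorted by score, which the explicit sort below mirrors):

```python
scores = [(3, 0.2212), (8, 0.1869), (1, 0.1596), (6, 0.0762), (0, 0.0257), (2, 0.0037)]

def select_indices(scores, selectorType='numTopFeatures', n=3, threshold=0.01):
    # Rank by importance score, then keep either the top n or those above the cut-off.
    ranked = sorted(scores, key=lambda t: t[1], reverse=True)
    if selectorType == 'numTopFeatures':
        return [idx for idx, _ in ranked[:n]]
    elif selectorType == 'threshold':
        return [idx for idx, s in ranked if s > threshold]

print(select_indices(scores))               # [3, 8, 1]
print(select_indices(scores, 'threshold'))  # [3, 8, 1, 6, 0]
```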
<p>Now for the second part of the problem: we want to take this list of features and create a transform function that returns the dataset with a new column containing only our most relevant features. Sound familiar? This is exactly what the VectorSlicer transformer does, so there is no need to re-invent the wheel and we can just return a VectorSlicer with the correct indices to slice.</p>
<pre><code class="language-python">return VectorSlicer(inputCol = est.getFeaturesCol(),
outputCol = outputCol,
indices = varidx)
</code></pre>
<p>That concludes our new feature selection estimator! The full code can be obtained <a href="https://gist.github.com/timlrx/1d5fdb0a43adbbe32a9336ba5c85b1b2">here</a>.</p>
<h3 id="putting-the-new-function-to-the-test">Putting the new function to the test</h3>
<p>Let’s try out the new function. I saved it as a file called FeatureImportanceSelector.py. Notice there is a new pipeline stage called fis (FeatureImpSelector). This takes in the first random forest model and uses its feature importance score to extract the top 10 variables.</p>
<pre><code class="language-python">from FeatureImportanceSelector import ExtractFeatureImp, FeatureImpSelector
</code></pre>
<pre><code class="language-python"># one hot encoding and assembling
encoding_var = [i[0] for i in df.dtypes if (i[1]=='string') & (i[0]!='y')]
num_var = [i[0] for i in df.dtypes if ((i[1]=='int') | (i[1]=='double')) & (i[0]!='y')]
string_indexes = [StringIndexer(inputCol = c, outputCol = 'IDX_' + c, handleInvalid = 'keep') for c in encoding_var]
onehot_indexes = [OneHotEncoderEstimator(inputCols = ['IDX_' + c], outputCols = ['OHE_' + c]) for c in encoding_var]
label_indexes = StringIndexer(inputCol = 'y', outputCol = 'label', handleInvalid = 'keep')
assembler = VectorAssembler(inputCols = num_var + ['OHE_' + c for c in encoding_var], outputCol = "features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", seed = 8464,
numTrees=10, cacheNodeIds = True, subsamplingRate = 0.7)
fis = FeatureImpSelector(estimator = rf, selectorType = "numTopFeatures",
numTopFeatures = 10, outputCol = "features_subset")
rf2 = RandomForestClassifier(labelCol="label", featuresCol="features_subset", seed = 8464,
numTrees=10, cacheNodeIds = True, subsamplingRate = 0.7)
pipe = Pipeline(stages = string_indexes + onehot_indexes + [assembler, label_indexes, fis, rf2])
</code></pre>
<pre><code class="language-python">pipeline_mod = pipe.fit(df)
</code></pre>
<pre><code class="language-python">df2 = pipeline_mod.transform(df)
</code></pre>
<pre><code class="language-python">ExtractFeatureImp(mod.stages[-1].featureImportances, df2, "features_subset")
</code></pre>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>idx</th>
<th>name</th>
<th>score</th>
</tr>
</thead>
<tbody>
<tr>
<th>3</th>
<td>3</td>
<td>cons_price_idx</td>
<td>0.221203</td>
</tr>
<tr>
<th>9</th>
<td>8</td>
<td>OHE_contact_cellular</td>
<td>0.186892</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>euribor3m</td>
<td>0.159579</td>
</tr>
<tr>
<th>6</th>
<td>6</td>
<td>emp_var_rate</td>
<td>0.076177</td>
</tr>
<tr>
<th>8</th>
<td>9</td>
<td>age</td>
<td>0.063016</td>
</tr>
<tr>
<th>7</th>
<td>7</td>
<td>previous</td>
<td>0.042298</td>
</tr>
<tr>
<th>5</th>
<td>5</td>
<td>cons_conf_idx</td>
<td>0.038875</td>
</tr>
<tr>
<th>4</th>
<td>4</td>
<td>nr_employed</td>
<td>0.030470</td>
</tr>
<tr>
<th>0</th>
<td>0</td>
<td>pdays</td>
<td>0.025732</td>
</tr>
<tr>
<th>2</th>
<td>2</td>
<td>duration</td>
<td>0.003744</td>
</tr>
</tbody>
</table>
</div>
<p>10 features as intended and, not surprisingly, they match the top 10 features generated by our previous non-pipeline method.</p>
<p>I hope you found the tutorial useful, and maybe it will inspire you to create more useful extensions for pyspark.</p>
Statistical Musings
https://www.timlrx.com/2018/04/28/statistical-musings/
Sat, 28 Apr 2018 00:00:00 +0000timothy.lin@alumni.ubc.ca (Timothy Lin)https://www.timlrx.com/2018/04/28/statistical-musings/<p>No technical details in this post. Just a few scattered thoughts and some stories that have kept me semi-entertained over the last month. Some are inspired by work and others are just my take on the world, exaggerated to some degree.</p>
<div id="rademacher-coins" class="section level3">
<h3>Rademacher Coins</h3>
<p>As we move towards a cashless society, maybe it would make sense to do away with coins and decimal values. Decimal points make billed amounts and account balances unnecessarily messy and untidy. Removing the decimals is a simple issue. After all, the current accounting system rounds to 2 decimal places and it would not be hard to just round to the nearest integer. I guess the main concern is fairness. For billed amounts where the value is low, rounding up or down will translate to a significant percentage change in the price of a good.</p>
<p>Here’s my suggestion to the problem: randomly round up or down all transactions to the nearest dollar. On average, it would translate to paying / receiving 50 cents and assuming that no one would game the system (e.g. by trying to make sure that the total bill ends up greater than 50 cents per transaction), the system would be fair and everyone gets the benefit of round numbers.</p>
<p>All transactions would be validated by a blockchain system and early adopters would be rewarded with Rademacher coins. As given away by its name, the coins will have a value of either -1 or 1 with 50% probability.</p>
</div>
<div id="data-science" class="section level3">
<h3>Data Science</h3>
<p>Of all job descriptions with the title of data science, 80% involve data while only 20% have any relation to science. By science, I do not mean fields such as Chemistry, Biology or Physics, but rather the scientific method of generating hypotheses, experimenting and falsifying claims.</p>
<p>Maybe that figure is too low. A more accurate survey would show that 50% of data scientists are actually conducting experiments and analysis to confirm the prior expectations of their bosses. After all, data is truth.</p>
</div>
<div id="algorithmic-world" class="section level3">
<h3>Algorithmic World</h3>
<ol start="2018" style="list-style-type: decimal">
<li>Financial transactions can be conducted with a wave of a phone. Cars ply the road with no drivers in sight. Lampposts act as a weather vane, traffic monitoring device and surveillance tool.</li>
</ol>
<p>Corporations still exist, though they now claim to provide work-life balance and free pantry snacks, unlike those a decade ago. The capitalist points to this as the workings of competitive markets and the benefit of labour mobility and international trade. The neo-neo-marxist claims that men are still tied in chains (or rather stimulated by syntactic sugar) while the capitalist (to be precise, they are now called venture capitalists) reaps the bitcoin.</p>
<p>In one of these container blocks a data scientist presents his new predictive model, capable of profiling an individual to a level of detail never seen before.</p>
<blockquote>
<p>Customer i has a 70% probability of purchasing an iPhone X, a 36% probability of contracting cancer by the age of 60 and a 0.01% chance of getting eaten by a shark.</p>
</blockquote>
<p>The audience was intrigued. A flurry of questions followed:</p>
<ul>
<li>What model did you use?</li>
<li>What is the underlying prevalence rate?</li>
<li>What is the precision score and AUC?</li>
<li>How do we automate the process and scale it up?</li>
<li>How do we ensure secure authentication and information storage?</li>
<li>What are the expected profits?</li>
<li>When can we use the model?</li>
</ul>
<p>But no one asks <em>why?</em></p>
</div>
Creating a Custom Cross-Validation Function in PySpark
https://www.timlrx.com/2018/04/08/creating-a-custom-cross-validation-function-in-pyspark/
Sun, 08 Apr 2018 00:00:00 +0000timothy.lin@alumni.ubc.ca (Timothy Lin)https://www.timlrx.com/2018/04/08/creating-a-custom-cross-validation-function-in-pyspark/
<h3 id="introduction">Introduction</h3>
<p>Lately, I have been using PySpark in my data processing and modeling pipeline. While Spark is great for most data processing needs, the machine learning component is slightly lacking. Coming from R and Python’s scikit-learn where there are so many machine learning packages available, this limitation is frustrating. Having said that, there are ongoing efforts to improve the machine learning library so hopefully there would be more functionalities in the future.</p>
<p>One of the problems that I am solving involves a time series component to the prediction task. As such, the k-fold cross-validation techniques available in PySpark would not give an accurate representation of the model’s performance. For such problems a rolling window approach to cross-validation is much better, i.e. repeatedly training the model on a lagged time period and testing its performance on a more recent period.</p>
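<p>As a rough sketch of what such a rolling window split looks like (the period labels and window sizes below are made up purely for illustration):</p>
<pre><code class="language-python"># Sketch: generate rolling-window (train, test) period pairs for time series CV.
# Window sizes and year labels are illustrative assumptions, not from any real pipeline.
def rolling_windows(periods, train_size, test_size):
    """Yield (train_periods, test_periods) pairs that roll forward in time."""
    windows = []
    start = 0
    while start + train_size + test_size <= len(periods):
        train = periods[start:start + train_size]
        test = periods[start + train_size:start + train_size + test_size]
        windows.append((train, test))
        start += test_size
    return windows

# three folds: train on two lagged years, test on the following year
print(rolling_windows([2013, 2014, 2015, 2016, 2017], train_size=2, test_size=1))
</code></pre>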
<p>However, other variants of cross-validation are not supported by PySpark. As of PySpark 2.3 it supports only a k-fold version and a simple random split into train / test datasets. Normally, it would be difficult to create a customised algorithm on PySpark as most of the functions call their Scala equivalents, Scala being the native language of Spark. Thankfully, the <a href="https://github.com/apache/spark/blob/master/python/pyspark/ml/tuning.py">cross-validation function</a> is largely written using base PySpark functions before being parallelised as tasks and distributed for computation. The rest of this post discusses my implementation of a custom cross-validation class.</p>
<h3 id="implementation">Implementation</h3>
<p>First, we will use the <code>CrossValidator</code> class as a template to base our new class on. The two main portions that need to be changed are the <code>__init__</code> and <code>_fit</code> functions. Let’s take a look at the <code>__init__</code> function first.</p>
<pre><code class="language-python">@keyword_only
def __init__(self, estimator=None, estimatorParamMaps=None, evaluator=None, numFolds=3,
             seed=None, parallelism=1):
    super(CrossValidator, self).__init__()
    self._setDefault(numFolds=3, parallelism=1)
    kwargs = self._input_kwargs
    self._set(**kwargs)
</code></pre>
<p>Rather than the typical <code>self.input = input</code> kind of statements, PySpark uses a decorator (<code>@keyword_only</code>) to assign the inputs as params. So this means that we would have to define additional params before assigning them as inputs when initialising the class.</p>
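<p>To see the pattern without a Spark installation, here is a plain-Python sketch of what <code>@keyword_only</code> does. Note that this is a stand-in mimicking PySpark’s behaviour for illustration, not PySpark’s actual implementation:</p>
<pre><code class="language-python">import functools

def keyword_only(func):
    """Mimic PySpark's @keyword_only: stash the keyword args on the instance."""
    @functools.wraps(func)
    def wrapper(self, **kwargs):
        self._input_kwargs = kwargs
        return func(self, **kwargs)
    return wrapper

class TinyValidator:
    """Hypothetical minimal class following the Params assignment pattern."""
    @keyword_only
    def __init__(self, numFolds=3, parallelism=1):
        self._params = {}
        self._set(**self._input_kwargs)

    def _set(self, **kwargs):
        self._params.update(kwargs)

    def getOrDefault(self, name):
        return self._params[name]

tv = TinyValidator(numFolds=5, parallelism=2)
print(tv.getOrDefault('numFolds'))  # 5
</code></pre>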
<p>Now let us examine the <code>_fit</code> function:</p>
<pre><code class="language-python">def _fit(self, dataset):
    est = self.getOrDefault(self.estimator)
    epm = self.getOrDefault(self.estimatorParamMaps)
    numModels = len(epm)
    eva = self.getOrDefault(self.evaluator)
    nFolds = self.getOrDefault(self.numFolds)
    seed = self.getOrDefault(self.seed)
    h = 1.0 / nFolds
    randCol = self.uid + "_rand"
    df = dataset.select("*", rand(seed).alias(randCol))
    metrics = [0.0] * numModels
    pool = ThreadPool(processes=min(self.getParallelism(), numModels))
    for i in range(nFolds):
        validateLB = i * h
        validateUB = (i + 1) * h
        condition = (df[randCol] >= validateLB) & (df[randCol] < validateUB)
        validation = df.filter(condition).cache()
        train = df.filter(~condition).cache()
        tasks = _parallelFitTasks(est, train, eva, validation, epm)
        for j, metric in pool.imap_unordered(lambda f: f(), tasks):
            metrics[j] += (metric / nFolds)
        validation.unpersist()
        train.unpersist()
</code></pre>
<p>The main thing to note here is the way to retrieve the value of a parameter using the <code>getOrDefault</code> function. We also see how PySpark implements the k-fold cross-validation by using a column of random numbers and using the <code>filter</code> function to select the relevant fold to train and test on. That would be the main portion which we will change when implementing our custom cross-validation function. In addition, I would also like to print some information on the progress status of the task as well as the results of the cross-validation.</p>
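<p>The fold-selection logic can be sketched with pandas standing in for the Spark DataFrame. This is an illustrative analogue of the filtering step, not Spark code:</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

# Sketch of PySpark's fold selection with pandas standing in for a Spark DataFrame:
# attach a uniform random column, then carve [i/k, (i+1)/k) out as the validation fold.
rng = np.random.default_rng(0)
df = pd.DataFrame({'y': range(100), '_rand': rng.random(100)})
nFolds = 3
h = 1.0 / nFolds
for i in range(nFolds):
    lb, ub = i * h, (i + 1) * h
    mask = (df['_rand'] >= lb) & (df['_rand'] < ub)
    validation, train = df[mask], df[~mask]
    # each row lands in exactly one validation fold
    assert len(validation) + len(train) == len(df)
</code></pre>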
<p>Here’s the full custom cross-validation class. It loops through a dictionary of datasets and identifies which rows to train and test on via the <code>cvCol</code> and <code>splitWord</code> inputs. This is actually the second version of my cross-validation class. The first one ran on a merged dataset, but in some cases the union operation messes up the metadata, so I edited it to take in a dictionary as an input instead.</p>
<pre><code class="language-python">import numpy as np

from multiprocessing.pool import ThreadPool
from pyspark import keyword_only
from pyspark.ml import Estimator
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml.param.shared import HasParallelism
from pyspark.ml.tuning import CrossValidatorModel, ValidatorParams, _parallelFitTasks
from pyspark.ml.util import MLReadable, MLWritable
from pyspark.sql.functions import col


class CustomCrossValidator(Estimator, ValidatorParams, HasParallelism, MLReadable, MLWritable):
    """
    Modifies CrossValidator to allow custom train and test datasets to be passed in.
    Bypasses the generation of train/test sets via numFolds;
    instead the train and test sets are user defined.
    """

    splitWord = Param(Params._dummy(), "splitWord",
                      "Tuple to split train and test set e.g. ('train', 'test')",
                      typeConverter=TypeConverters.toListString)
    cvCol = Param(Params._dummy(), "cvCol",
                  "Column name to filter train and test list",
                  typeConverter=TypeConverters.toString)

    @keyword_only
    def __init__(self, estimator=None, estimatorParamMaps=None, evaluator=None,
                 splitWord=('train', 'test'), cvCol='cv', seed=None, parallelism=1):
        super(CustomCrossValidator, self).__init__()
        self._setDefault(parallelism=1)
        kwargs = self._input_kwargs
        self._set(**kwargs)

    def _fit(self, dataset):
        est = self.getOrDefault(self.estimator)
        epm = self.getOrDefault(self.estimatorParamMaps)
        numModels = len(epm)
        eva = self.getOrDefault(self.evaluator)
        nFolds = len(dataset)
        cvCol = self.getOrDefault(self.cvCol)
        splitWord = self.getOrDefault(self.splitWord)
        metrics = [0.0] * numModels
        matrix_metrics = [[0 for x in range(nFolds)] for y in range(len(epm))]
        pool = ThreadPool(processes=min(self.getParallelism(), numModels))
        for i, key in enumerate(dataset):
            # train on the rows tagged splitWord[0], validate on splitWord[1]
            train = dataset[key].filter(col(cvCol) == splitWord[0]).cache()
            validation = dataset[key].filter(col(cvCol) == splitWord[1]).cache()
            print('fold {}'.format(i))
            tasks = _parallelFitTasks(est, train, eva, validation, epm)
            for j, metric in pool.imap_unordered(lambda f: f(), tasks):
                matrix_metrics[j][i] = metric
                metrics[j] += (metric / nFolds)
            validation.unpersist()
            train.unpersist()
        if eva.isLargerBetter():
            bestIndex = np.argmax(metrics)
        else:
            bestIndex = np.argmin(metrics)
        for i in range(len(metrics)):
            print(epm[i], 'Detailed Score {}'.format(matrix_metrics[i]),
                  'Avg Score {}'.format(metrics[i]))
        print('Best Model: ', epm[bestIndex], 'Detailed Score {}'.format(matrix_metrics[bestIndex]),
              'Avg Score {}'.format(metrics[bestIndex]))
        # Do not bother to train on the full dataset, just the latest train set supplied
        # bestModel = est.fit(dataset, epm[bestIndex])
        bestModel = est.fit(train, epm[bestIndex])
        return self._copyValues(CrossValidatorModel(bestModel, metrics))
</code></pre>
<p>Let’s test it out on a similar example as the one in the source code:</p>
<pre><code class="language-python">import findspark
findspark.init()
from pyspark import SparkContext
from pyspark import SQLContext
</code></pre>
<pre><code class="language-python">sc = SparkContext()
spark = SQLContext(sc)
</code></pre>
<pre><code class="language-python">from CustomCrossValidatorDict import CustomCrossValidator
</code></pre>
<pre><code class="language-python">from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder
</code></pre>
<pre><code class="language-python">d = {}
d['df1'] = spark.createDataFrame(
[(Vectors.dense([0.0]), 0.0, 'train'),
(Vectors.dense([0.4]), 1.0, 'train'),
(Vectors.dense([0.5]), 0.0, 'train'),
(Vectors.dense([0.6]), 1.0, 'train'),
(Vectors.dense([1.0]), 1.0, 'train'),
(Vectors.dense([0.0]), 0.0, 'test'),
(Vectors.dense([0.4]), 1.0, 'test'),
(Vectors.dense([0.5]), 0.0, 'test'),
(Vectors.dense([0.6]), 1.0, 'test'),
(Vectors.dense([1.0]), 1.0, 'test')] * 10,
["features", "label", 'cv'])
d['df2'] = spark.createDataFrame(
[(Vectors.dense([0.0]), 0.0, 'train'),
(Vectors.dense([0.4]), 1.0, 'train'),
(Vectors.dense([0.5]), 0.0, 'train'),
(Vectors.dense([0.6]), 1.0, 'train'),
(Vectors.dense([1.0]), 1.0, 'train'),
(Vectors.dense([0.0]), 0.0, 'test'),
(Vectors.dense([0.4]), 1.0, 'test'),
(Vectors.dense([0.5]), 0.0, 'test'),
(Vectors.dense([0.6]), 1.0, 'test'),
(Vectors.dense([1.0]), 1.0, 'test')] * 10,
["features", "label", 'cv'])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1, 5]).build()
evaluator = BinaryClassificationEvaluator()
</code></pre>
<pre><code class="language-python">cv = CustomCrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator,
splitWord = ('train', 'test'), cvCol = 'cv', parallelism=4)
</code></pre>
<pre><code class="language-python">cv.extractParamMap()
</code></pre>
<pre><code>{Param(parent='CustomCrossValidator_4acca941d35632cf8f28', name='parallelism', doc='the number of threads to use when running parallel algorithms (>= 1).'): 4,
Param(parent='CustomCrossValidator_4acca941d35632cf8f28', name='seed', doc='random seed.'): 7665653429569288359,
Param(parent='CustomCrossValidator_4acca941d35632cf8f28', name='estimator', doc='estimator to be cross-validated'): LogisticRegression_487fb6aaeb91e051211c,
Param(parent='CustomCrossValidator_4acca941d35632cf8f28', name='estimatorParamMaps', doc='estimator param maps'): [{Param(parent='LogisticRegression_487fb6aaeb91e051211c', name='maxIter', doc='max number of iterations (>= 0).'): 0},
{Param(parent='LogisticRegression_487fb6aaeb91e051211c', name='maxIter', doc='max number of iterations (>= 0).'): 1},
{Param(parent='LogisticRegression_487fb6aaeb91e051211c', name='maxIter', doc='max number of iterations (>= 0).'): 5}],
Param(parent='CustomCrossValidator_4acca941d35632cf8f28', name='evaluator', doc='evaluator used to select hyper-parameters that maximize the validator metric'): BinaryClassificationEvaluator_44cc9ebbba7a7a85e22e,
Param(parent='CustomCrossValidator_4acca941d35632cf8f28', name='splitWord', doc="Tuple to split train and test set e.g. ('train', 'test')"): ['train',
'test'],
Param(parent='CustomCrossValidator_4acca941d35632cf8f28', name='cvCol', doc='Column name to filter train and test list'): 'cv'}
</code></pre>
<pre><code class="language-python">cvModel = cv.fit(d)
</code></pre>
<pre><code>fold 0
fold 1
{Param(parent='LogisticRegression_487fb6aaeb91e051211c', name='maxIter', doc='max number of iterations (>= 0).'): 0} Detailed Score [0.5, 0.5] Avg Score 0.5
{Param(parent='LogisticRegression_487fb6aaeb91e051211c', name='maxIter', doc='max number of iterations (>= 0).'): 1} Detailed Score [0.8333333333333333, 0.8333333333333333] Avg Score 0.8333333333333333
{Param(parent='LogisticRegression_487fb6aaeb91e051211c', name='maxIter', doc='max number of iterations (>= 0).'): 5} Detailed Score [0.8333333333333333, 0.8333333333333333] Avg Score 0.8333333333333333
Best Model: {Param(parent='LogisticRegression_487fb6aaeb91e051211c', name='maxIter', doc='max number of iterations (>= 0).'): 1} Detailed Score [0.8333333333333333, 0.8333333333333333] Avg Score 0.8333333333333333
</code></pre>
<h3 id="concluding-thoughts">Concluding Thoughts</h3>
<p>Hope this post has been useful! The custom cross-validation class is really quite handy. It can be used for time series problems as well as for testing a model’s performance over different geographical areas or customer segments. It took some time to work through the PySpark source code, but my understanding of it has definitely improved after this episode.</p>
Uploading Jupyter Notebook Files to Blogdown
https://www.timlrx.com/2018/03/25/uploading-jupyter-notebook-files-to-blogdown/
Sun, 25 Mar 2018 00:00:00 +0000timothy.lin@alumni.ubc.ca (Timothy Lin)https://www.timlrx.com/2018/03/25/uploading-jupyter-notebook-files-to-blogdown/
<p>I have been working quite a bit with Python recently, using the popular Jupyter Notebook interface. I have been thinking about uploading some of my machine learning experiments and notes to the blog, but integrating Python with a blog built on <code>blogdown</code> seemed problematic as I could not google a solution. It turns out it is actually quite simple (maybe that’s why nobody posted a tutorial on it, or maybe people who blog in R do not really use Python).</p>
<p>Anyway, here’s the solution:</p>
<p>From the command line, navigate to the folder which contains the notebook file and run the following line, substituting the name of your ipynb file for Panda_Plotting.</p>
<pre><code>jupyter nbconvert --to markdown Panda_Plotting.ipynb
</code></pre>
<p>This generates a <code>.md</code> file as well as a folder containing all the images from that file. In my case the folder was named Panda_Plotting_files.</p>
<p>Copy the image files to the static folder where the website is located. I placed my files in the <code>static/img/python_img</code> folder.</p>
<p>Now all that is left to be done is to create a new post in markdown format,</p>
<pre><code class="language-r">new_post('Hello Python World', ext='.md')
</code></pre>
<p>copy the markdown file over, and replace all the image directory paths to point to the newly created one in the static folder.</p>
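<p>If you prefer to script the path replacement, a sed one-liner will do it. This is a sketch on a one-line stand-in for the converted file, with folder names following the Panda_Plotting example above (assumes GNU sed for the <code>-i</code> flag):</p>
<pre><code class="language-bash"># Demo on a one-line stand-in for the converted markdown file:
echo '![png](Panda_Plotting_files/Panda_Plotting_1_1.png)' > Panda_Plotting.md
# rewrite the image folder to point at the blog's static path
sed -i 's|Panda_Plotting_files|/img/python_img|g' Panda_Plotting.md
cat Panda_Plotting.md
</code></pre>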
<p>Here’s my Panda_Plotting.ipynb file converted to markdown and hosted on my blog. It displays markdown and all the output quite nicely!</p>
<h3 id="converted-jupyter-notebook-file">Converted Jupyter Notebook File</h3>
<pre><code class="language-python">%matplotlib inline
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
</code></pre>
<pre><code class="language-python">plt.plot(np.random.normal(size=100), np.random.normal(size=100), 'ro')
</code></pre>
<pre><code>[<matplotlib.lines.Line2D at 0x22850dc7780>]
</code></pre>
<p><img src="./img/python_img/Panda_Plotting_1_1.png" alt="png" /></p>
<h2 id="using-pandas-plotting-functions">Using pandas plotting functions</h2>
<h3 id="line-graphs">Line Graphs</h3>
<pre><code class="language-python">normals = pd.Series(np.random.normal(size=10))
normals.plot()
</code></pre>
<pre><code><matplotlib.axes._subplots.AxesSubplot at 0x22850eef940>
</code></pre>
<p><img src="./img/python_img/Panda_Plotting_3_1.png" alt="png" /></p>
<pre><code class="language-python">normals.cumsum().plot(grid=True)
</code></pre>
<pre><code><matplotlib.axes._subplots.AxesSubplot at 0x22850ca6c88>
</code></pre>
<p><img src="./img/python_img/Panda_Plotting_4_1.png" alt="png" /></p>
<pre><code class="language-python">variables = pd.DataFrame({'normal': np.random.normal(size=100),
'gamma': np.random.gamma(1, size=100),
'poisson': np.random.poisson(size=100)})
variables.cumsum().plot()
</code></pre>
<pre><code><matplotlib.axes._subplots.AxesSubplot at 0x228511b5940>
</code></pre>
<p><img src="./img/python_img/Panda_Plotting_5_1.png" alt="png" /></p>
<pre><code class="language-python">variables.cumsum().plot(subplots=True)
</code></pre>
<pre><code>array([<matplotlib.axes._subplots.AxesSubplot object at 0x00000228511F13C8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000228512DD710>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000228511E9208>], dtype=object)
</code></pre>
<p><img src="./img/python_img/Panda_Plotting_6_1.png" alt="png" /></p>
<pre><code class="language-python">### More control over subplots
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12, 4))
for i,var in enumerate(['normal','gamma','poisson']):
variables[var].cumsum(0).plot(ax=axes[i], title=var)
axes[0].set_ylabel('cumulative sum')
</code></pre>
<pre><code><matplotlib.text.Text at 0x22851427160>
</code></pre>
<p><img src="./img/python_img/Panda_Plotting_7_1.png" alt="png" /></p>
<h3 id="bar-graphs">Bar Graphs</h3>
<pre><code class="language-python">segments = pd.read_csv("./data/transit_segments.csv")
segments.st_time = segments.st_time.apply(lambda d: datetime.strptime(d, '%m/%d/%y %H:%M'))
segments['year'] = segments.st_time.apply(lambda d: d.year)
segments['long_seg'] = (segments.seg_length > segments.seg_length.mean())
segments['long_sog'] = (segments.avg_sog > segments.avg_sog.mean())
</code></pre>
<pre><code class="language-python">segments.groupby('year').seg_length.mean().plot(kind='bar')
</code></pre>
<pre><code><matplotlib.axes._subplots.AxesSubplot at 0x2285271bd68>
</code></pre>
<p><img src="./img/python_img/Panda_Plotting_10_1.png" alt="png" /></p>
<pre><code class="language-python">segments.groupby(['year','long_seg']).seg_length.count().plot(kind='bar')
</code></pre>
<pre><code><matplotlib.axes._subplots.AxesSubplot at 0x21c6332e6a0>
</code></pre>
<p><img src="./img/python_img/Panda_Plotting_11_1.png" alt="png" /></p>
<pre><code class="language-python">### Stacked Bars
temp = pd.crosstab([segments.year, segments.long_seg], segments.long_sog)
temp.plot(kind='bar', stacked=True)
</code></pre>
<pre><code><matplotlib.axes._subplots.AxesSubplot at 0x21c6bda5908>
</code></pre>
<p><img src="./img/python_img/Panda_Plotting_12_1.png" alt="png" /></p>
<h3 id="histograms">Histograms</h3>
<pre><code class="language-python">variables = pd.DataFrame({'normal': np.random.normal(size=100),
'gamma': np.random.gamma(1, size=100),
'poisson': np.random.poisson(size=100)})
variables.normal.hist(bins=30, grid=False)
</code></pre>
<pre><code><matplotlib.axes._subplots.AxesSubplot at 0x21c4ae920f0>
</code></pre>
<p><img src="./img/python_img/Panda_Plotting_14_1.png" alt="png" /></p>
<pre><code class="language-python">variables.poisson.plot(kind='kde', xlim=(-4,6))
</code></pre>
<pre><code><matplotlib.axes._subplots.AxesSubplot at 0x21c0a84fb00>
</code></pre>
<p><img src="./img/python_img/Panda_Plotting_15_1.png" alt="png" /></p>
<pre><code class="language-python">variables.gamma.hist(bins=20, normed=True)
variables.gamma.plot(kind='kde', style='r--')
</code></pre>
<pre><code><matplotlib.axes._subplots.AxesSubplot at 0x21c0dacae80>
</code></pre>
<p><img src="./img/python_img/Panda_Plotting_16_1.png" alt="png" /></p>
<h3 id="scatterplot">Scatterplot</h3>
<pre><code class="language-python">segments.plot(kind='scatter', x='seg_length', y='avg_sog')
</code></pre>
<pre><code><matplotlib.axes._subplots.AxesSubplot at 0x241ae3c65f8>
</code></pre>
<p><img src="./img/python_img/Panda_Plotting_18_1.png" alt="png" /></p>
<pre><code class="language-python">segments_subset = segments.loc[1:10000,['seg_length', 'avg_sog', 'min_sog', 'max_sog']]
pd.plotting.scatter_matrix (segments_subset, figsize=(12,8), diagonal='kde')
</code></pre>
<pre><code>array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000241AE461BA8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000241B63B4A20>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000241B405A898>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000241AE5B85C0>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x00000241AE75A748>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000241AE75A780>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000241AF813D30>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000241AF857160>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x00000241AF8E36A0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000241AF8F3BA8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000241AF9AD4E0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000241B06B1BA8>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x00000241B072D320>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000241B078F5C0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000241B080A278>,
<matplotlib.axes._subplots.AxesSubplot object at 0x00000241B0CA8BA8>]], dtype=object)
</code></pre>
<p><img src="./img/python_img/Panda_Plotting_19_1.png" alt="png" /></p>
Notes on Regression - Approximation of the Conditional Expectation Function
https://www.timlrx.com/2018/02/26/notes-on-regression-approximation-of-the-conditional-expectation-function/
Mon, 26 Feb 2018 00:00:00 +0000timothy.lin@alumni.ubc.ca (Timothy Lin)https://www.timlrx.com/2018/02/26/notes-on-regression-approximation-of-the-conditional-expectation-function/<p>The final installment in my ‘Notes on Regression’ series! For a review of ways to derive the Ordinary Least Squares formula as well as various algebraic and geometric interpretations, check out the previous 5 posts:</p>
<ul>
<li><p><a href="./2017/08/16/notes-on-regression-ols/">Part 1 - OLS by way of minimising the sum of square errors</a></p></li>
<li><p><a href="./2017/08/23/notes-on-regression-projection/">Part 2 - Projection and Orthogonality</a></p></li>
<li><p><a href="./2017/08/31/notes-on-regression-method-of-moments/">Part 3 - Method of Moments</a></p></li>
<li><p><a href="./2017/09/21/notes-on-regression-maximum-likelihood/">Part 4 - Maximum Likelihood</a></p></li>
<li><p><a href="./2017/10/21/notes-on-regression-singular-vector-decomposition/">Part 5 - Singular Vector Decomposition</a></p></li>
</ul>
<p>A common argument against the regression approach is that it is too simple. Real world phenomena follow non-normal distributions, power laws are everywhere and multivariate relationships are often more complex. The assumption of linearity in the OLS regression seems far removed from reality. However, if we take into consideration that the main aim of a statistical model is not to replicate the real world but to yield useful insights, the simplicity of regression may well turn out to be its biggest strength.</p>
<p>In this set of notes I shall discuss the OLS regression as a way of approximating the conditional expectation function (CEF). To be more precise, regression yields the best linear approximation of the CEF. This mathematical property makes regression a favourite tool among social scientists as it places the emphasis on interpreting an approximation of reality rather than on complicated curve fitting. I came across this approach in Angrist and Pischke’s <a href="https://press.princeton.edu/titles/8769.html">Mostly Harmless Econometrics</a>.</p>
<div id="what-is-a-conditional-expectation-function" class="section level2">
<h2>What is a Conditional Expectation Function?</h2>
<p>Expectation, in statistics terminology, normally refers to the population average of a particular random variable. The conditional expectation, as its name suggests, is the population average holding certain variables fixed. In the context of regression, the CEF is simply <span class="math inline">\(E[Y_{i}\vert X_{i}]\)</span>. Since <span class="math inline">\(X_{i}\)</span> is random, the CEF is random.<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a></p>
<div class="figure">
<img src="./img/CEF.png" alt="CEF" />
<p class="caption">CEF</p>
</div>
<p>The picture above is an illustrated example of the CEF plotted on a given dataset. Looking at the relationship between the number of stars obtained by a recipe and the log number of reviews, one can calculate the average star rating for a given number of reviews (indicated by the red dots). The CEF joins all these red dots together (indicated by the blue line).</p>
<div id="nice-properties-of-the-cef" class="section level3">
<h3>Nice Properties of the CEF</h3>
<p>What can we infer about the relationship between the dependent variable, <span class="math inline">\(Y_{i}\)</span>, and the CEF? Let’s split the dependent variable into two components: <span class="math display">\[
Y_{i} = E[Y_{i} \vert X_{i}] + \epsilon_{i}
\]</span> Using the law of iterated expectations, we can show that <span class="math inline">\(E[\epsilon_{i} \vert X_{i}]=0\)</span>, i.e. mean independence, and that <span class="math inline">\(\epsilon_{i}\)</span> is uncorrelated with any function of <span class="math inline">\(X_{i}\)</span>. In other words, we can break the dependent variable into a component that is explained by <span class="math inline">\(X_{i}\)</span> and another component that is orthogonal to it. Sounds familiar?</p>
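<p>Spelling out the iterated expectations step: subtracting the CEF from <span class="math inline">\(Y_{i}\)</span> leaves a residual whose conditional mean vanishes,</p>
<p><span class="math display">\[
E[\epsilon_{i} \vert X_{i}] = E[Y_{i} - E[Y_{i} \vert X_{i}] \;\vert\; X_{i}] = E[Y_{i} \vert X_{i}] - E[Y_{i} \vert X_{i}] = 0,
\]</span> and for any function <span class="math inline">\(h(X_{i})\)</span>, <span class="math inline">\(E[h(X_{i})\epsilon_{i}] = E[h(X_{i})E[\epsilon_{i} \vert X_{i}]] = 0\)</span>, which is the uncorrelatedness claim.</p>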
<p>Also, if we were to look for a function of <span class="math inline">\(X\)</span>, <span class="math inline">\(m(X)\)</span>, that minimises the mean squared error, i.e. <span class="math inline">\(min~ E[(Y_{i} - m(X_{i}))^{2}]\)</span>, we would find that the optimal choice of <span class="math inline">\(m(X)\)</span> is exactly the CEF! To see this, expand the squared error term: <span class="math display">\[
\begin{aligned}
(Y_{i} - m(X_{i}))^{2} &= ((Y_{i} - E[Y_{i} \vert X_{i}]) + (E[Y_{i} \vert X_{i}] - m(X_{i})))^{2} \\
&= (Y_{i} - E[Y_{i} \vert X_{i}])^{2} + 2(Y_{i} - E[Y_{i} \vert X_{i}])(E[Y_{i} \vert X_{i}] - m(X_{i}))
+ (E[Y_{i} \vert X_{i}] - m(X_{i}))^{2}
\end{aligned}
\]</span></p>
<p>The first term on the right does not involve <span class="math inline">\(m(X_{i})\)</span> and so does not factor into the arg min problem. <span class="math inline">\((Y_{i} - E[Y_{i} \vert X_{i}])\)</span> in the second term is simply <span class="math inline">\(\epsilon_{i}\)</span>, and a function of <span class="math inline">\(X\)</span> multiplied by <span class="math inline">\(\epsilon_{i}\)</span> still has an expectation of zero. Hence, the problem simplifies to minimising the last term, which is minimised exactly when <span class="math inline">\(m(X_{i})\)</span> equals the CEF.</p>
</div>
</div>
<div id="regression-and-the-cef" class="section level2">
<h2>Regression and the CEF</h2>
<p>Now let’s link the regression back to the discussion on the CEF. Recall the example of the number of stars a recipe has and the number of reviews submitted. Log reviews is a continuous variable and there are lots of points to take into consideration. Regression offers a way of approximating the CEF linearly i.e. <span class="math display">\[
\beta = \arg \min_{b} E\left[ \left( E[Y_{i}\vert X_{i}] - X_{i}'b \right)^{2} \right]
\]</span></p>
<p>To get this result, one can show that minimising <span class="math inline">\(E[(Y_{i} -X'_{i}b)^{2}]\)</span> is equivalent to minimising the above expression.<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a> Thus, even if the CEF is non-linear, as in the recipe and star rating example, the regression line provides the best linear approximation to it (drawn in green below).</p>
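<p>This equivalence can be checked numerically on simulated data: with a discrete regressor, OLS on the raw data gives exactly the same coefficients as OLS on the conditional means weighted by cell frequencies, i.e. regression only ever "sees" the CEF. The data below is simulated purely for illustration.</p>
<pre><code class="language-python">import numpy as np

# Check that regressing Y on X gives the same coefficients as regressing
# the conditional means E[Y|X] on X, weighting each X-cell by its frequency.
rng = np.random.default_rng(42)
x = rng.integers(0, 3, size=1000)          # discrete regressor: 0, 1, 2
y = 1.0 + 2.0 * x + rng.normal(size=1000)  # linear CEF plus noise

X = np.column_stack([np.ones_like(x, dtype=float), x])
beta_raw, *_ = np.linalg.lstsq(X, y, rcond=None)

# regression on the CEF: one row per x-cell, weighted by cell counts
cells = np.unique(x)
cef = np.array([y[x == c].mean() for c in cells])
n = np.array([(x == c).sum() for c in cells], dtype=float)
Xc = np.column_stack([np.ones_like(cells, dtype=float), cells])
W = np.diag(n)
beta_cef = np.linalg.solve(Xc.T @ W @ Xc, Xc.T @ W @ cef)

assert np.allclose(beta_raw, beta_cef)
print(beta_raw)  # close to [1, 2]
</code></pre>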
<div class="figure">
<img src="./img/CEF_regression.png" alt="Regression as the best linear approximation to the CEF" />
<p class="caption">Regression as the best linear approximation to the CEF</p>
</div>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>In practice, one obtains a sample of the population data and uses the sample to make an approximation of the population CEF.<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p>just add and subtract <span class="math inline">\(E[Y_{i}\vert X_{i}]\)</span> and manipulate the terms in a similar way to the previous proof using <span class="math inline">\(m(X)\)</span>.<a href="#fnref2">↩</a></p></li>
</ol>
</div>
February Thoughts
https://www.timlrx.com/2018/02/11/february-thoughts/
Sun, 11 Feb 2018 00:00:00 +0000timothy.lin@alumni.ubc.ca (Timothy Lin)https://www.timlrx.com/2018/02/11/february-thoughts/<p>Sorry about the lack of posts over the past few months. I hope to regain some work-life balance and update the blog more regularly. To start off the first blog post of 2018, I thought it would be nice to share some interesting things that I have been reading over the past few weeks and create a to-do list to function as my commitment device.</p>
<div id="fun-facts" class="section level3">
<h3>Fun Facts</h3>
<ul>
<li><p>Did you know that the skin color of a cat is heavily determined by a gene located on the X chromosome? Another interesting aspect of this gene is that only one copy is activated per cell. This creates the spotted and patchwork patterns in female cats, as they carry two alleles of the gene, but not in male cats. I came across this trivia while reading <strong>The Gene: An Intimate History</strong>, which provides a detailed yet comprehensible history of the study of genetics. A very enjoyable read on the evolution of scientific ideas, the brilliance of human creativity and the socio-political elements of the gene.</p></li>
<li><p>Actually, cat skin color is not totally random. Randomness does not generate patches; if it were truly random the cat would look like a scrambled chessboard. There are probably some factors that locally control which particular allele is switched off…</p></li>
<li><p>It is hard and computationally intensive to prove that two graphs are isomorphic (whether the graph isomorphism problem can be solved in polynomial time is supposedly an unsolved problem) but easy to show that they are different. I remember reading the <a href="https://www.quantamagazine.org/algorithm-solves-graph-isomorphism-in-record-time-20151214/">quanta magazine article</a> on the problem being solved in 2015 but supposedly there is a <a href="https://www.quantamagazine.org/graph-isomorphism-vanquished-again-20170114/">flaw in the proof</a> so the problem is still unresolved.</p></li>
<li><p>Recently I have been reading up on using graphlets (small connected sub-graphs of the larger network) to measure the similarity between two larger networks.</p></li>
<li><p>Fascinating edible stuff: <a href="http://live.iop-pp01.agh.sleek.net/2017/09/22/the-physics-of-bread/">The physics of bread</a></p></li>
</ul>
</div>
<div id="to-do-list" class="section level3">
<h3>To-Do List</h3>
<ul>
<li><p>Maintain my blog a bit more regularly. I still have a few sets of notes on algebraic graph theory to post as well as one more piece on regression which I plan to write to complete the series.</p></li>
<li><p>Read up on data pipelines and workflow integration. Currently I find that a lot of my time is being wasted moving data around various systems and trying to bring a model from development into production. I am sure there are some nice ways to combine data explorations, model building, implementation and evaluation in a unified framework so this is probably my top priority over the next few months.</p></li>
<li><p>Learn more about text analysis. Machine learning advances over the past few years have made tremendous improvements in image recognition capabilities, but text analytics has been lagging behind. Not to mention that plenty of man-hours are wasted on mundane text analytics activities (drafting, summarisation, fact checking etc.). Advancements in this field would truly be the next billion-dollar technological change.</p></li>
<li><p>Start a deep learning project. Heard a lot of hype, have a decent knowledge of the theory, now it’s time to start playing around with it.</p></li>
<li><p>Read up more about randomised controlled trials (A/B testing if you are from the marketing / engineering world). There’s the bayesian side to explore and issues on early stopping worth knowing about.</p></li>
<li><p>Play around with more datasets! Let me know what you would like to see. Maybe I should do something that is related to the labour market again. Always interesting to see how things change over time.</p></li>
</ul>
</div>
Notes on Graphs and Spectral Properties
https://www.timlrx.com/2017/12/25/notes-on-graphs-and-spectral-properties/
Mon, 25 Dec 2017 00:00:00 +0000timothy.lin@alumni.ubc.ca (Timothy Lin)https://www.timlrx.com/2017/12/25/notes-on-graphs-and-spectral-properties/<p>Here is the first in a series of notes which I jotted down over the past 2 months as I tried to make sense of algebraic graph theory. This one focuses on the basic definitions and some properties of matrices related to graphs. Having all the symbols and main properties on a single page is a useful reference as I delve deeper into the applications of the theory. It also saves me time googling and checking the relationships between these objects.</p>
<div id="adjacency-matrix" class="section level2">
<h2>Adjacency Matrix</h2>
<p>Let <span class="math inline">\(n\)</span> be the number of vertices and <span class="math inline">\(m\)</span> the number of edges. Then the adjacency matrix <span class="math inline">\(A\)</span> of dimension <span class="math inline">\(n \times n\)</span> is a matrix where <span class="math inline">\(a_{ij}=1\)</span> if there is an edge from vertex i to vertex j and zero otherwise. For a weighted adjacency matrix, <span class="math inline">\(W\)</span>, we replace 1 with the weights, <span class="math inline">\(w_{ij}\)</span>.</p>
<p>Here we consider the case of undirected graphs. This means that the adjacency matrix is symmetric, which implies it has a complete set of real eigenvalues (not necessarily positive) and an orthogonal eigenvector basis. The set of eigenvalues (<span class="math inline">\(\alpha_{1} \geq \alpha_{2} \geq ... \geq \alpha_{n}\)</span>) is known as the spectrum of a graph.</p>
<div id="properties" class="section level3">
<h3>Properties</h3>
<ul>
<li>The greatest eigenvalue, <span class="math inline">\(\alpha_{1}\)</span>, is bounded above by the maximum degree.<br />
</li>
<li>Given two graphs with adjacency matrices <span class="math inline">\(A_{1}\)</span> and <span class="math inline">\(A_{2}\)</span>, the graphs are isomorphic iff there exists a permutation matrix <span class="math inline">\(P\)</span> such that <span class="math inline">\(PA_{1}P^{-1}=A_{2}\)</span>. This implies they share the same eigenvalues, eigenvectors (up to permutation), determinant, trace, etc. Note: two graphs may be isospectral (same set of eigenvalues) but NOT isomorphic.</li>
</ul>
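<p>These properties are easy to check numerically. Here is a minimal sketch in base R (the path graph below is an arbitrary toy example of my own choosing):</p>
<pre class="r"><code>## Path graph on 4 vertices: edges 1-2, 2-3, 3-4
A <- matrix(0, 4, 4)
A[cbind(c(1, 2, 3), c(2, 3, 4))] <- 1
A <- A + t(A)                                # symmetrise: undirected graph
alpha <- eigen(A, symmetric = TRUE)$values   # the spectrum, in decreasing order
max(alpha) <= max(rowSums(A))                # greatest eigenvalue bounded by max degree</code></pre>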
</div>
</div>
<div id="incidence-matrix" class="section level2">
<h2>Incidence Matrix</h2>
<p>An incidence matrix <span class="math inline">\(\tilde{D}\)</span> is of dimension <span class="math inline">\(n \times m\)</span> with <span class="math inline">\(\tilde{D}_{ij}=1\)</span> if <span class="math inline">\(e_{j} = (v_{i},v_{k})\)</span>, <span class="math inline">\(-1\)</span> if <span class="math inline">\(e_{j} = (v_{k},v_{i})\)</span>, and zero otherwise. In other words, each column represents an edge, marking the vertex it emanates from (1) and the vertex it points to (-1).</p>
<p>For an undirected graph, there are two kinds of incidence matrix: oriented and unoriented. In the unoriented matrix, we simply put 1 for every vertex incident to an edge. The oriented incidence matrix looks like that of a directed graph (1 and -1, under an arbitrary choice of orientation) and is unique up to negation of the columns.</p>
</div>
<div id="laplacian-matrix" class="section level2">
<h2>Laplacian Matrix</h2>
<p>The Laplacian matrix is defined as <span class="math inline">\(L = D - A = \tilde{D}\tilde{D}'\)</span>, or the degree matrix <span class="math inline">\(D\)</span> minus the adjacency matrix <span class="math inline">\(A\)</span>. Hence, the diagonal entries are the degrees, while <span class="math inline">\(L_{ij}=-1\)</span> if <span class="math inline">\(v_{i}\)</span> and <span class="math inline">\(v_{j}\)</span> are connected, else 0.</p>
<p><strong>Note:</strong></p>
<ul>
<li><span class="math inline">\(\tilde{D}\)</span> is the oriented incidence matrix (the unoriented version gives <span class="math inline">\(\tilde{D}\tilde{D}' = D + A\)</span> instead).<br />
</li>
<li>The degree matrix is defined as <span class="math inline">\(D = diag(W \cdot \mathbf{1})\)</span>.<br />
</li>
<li>For a weighted degree matrix, the diagonal element <span class="math inline">\(d(i,i) = \sum_{j:(i,j)\in E} w_{ij}\)</span>.<br />
</li>
<li>The conventional ordering of eigenvalue is opposite to the adjacency matrix! (<span class="math inline">\(0=\lambda_{1} \leq \lambda_{2} \leq ... \leq \lambda_{n}\)</span>.)</li>
</ul>
<div id="walks" class="section level3">
<h3>Walks</h3>
<p>A walk on a graph is an alternating sequence of vertices and edges from one vertex to another. A walk between two vertices <span class="math inline">\(u\)</span> and <span class="math inline">\(v\)</span> is called a <span class="math inline">\(u-v\)</span> walk. Its length is the number of edges.</p>
<p><strong>Cool fact:</strong> Take the adjacency matrix and multiply it by itself <span class="math inline">\(n\)</span> times; then <span class="math inline">\(a^{n}_{ij}\)</span>, an entry of the <span class="math inline">\(A^{n}\)</span> matrix, gives the number of <span class="math inline">\(i-j\)</span> walks of length <span class="math inline">\(n\)</span>. Relatedly, form the transition matrix <span class="math inline">\(P = D^{-1}A\)</span> by dividing each row of <span class="math inline">\(A\)</span> by the degree of its vertex. Then the <span class="math inline">\(i,j\)</span> entry of <span class="math inline">\(P^{n}\)</span> gives the probability that, starting from <span class="math inline">\(i\)</span>, you will end up at <span class="math inline">\(j\)</span> after <span class="math inline">\(n\)</span> steps.</p>
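<p>The walk-counting fact can be verified directly; a small sketch on a triangle graph (my own toy example):</p>
<pre class="r"><code>A <- matrix(c(0, 1, 1,
              1, 0, 1,
              1, 1, 0), 3, 3)  # triangle graph
(A %*% A)[1, 1]        # 1-1 walks of length 2: equals deg(1) = 2
(A %*% A %*% A)[1, 2]  # 1-2 walks of length 3: 1-2-1-2, 1-3-1-2, 1-2-3-2
P <- diag(1 / rowSums(A)) %*% A  # divide each row by the vertex degree
rowSums(P %*% P)       # each row of a power of P sums to 1: a probability distribution</code></pre>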
</div>
<div id="matrices-as-operators-on-the-vertices" class="section level3">
<h3>Matrices as operators on the vertices</h3>
<p>The adjacency and laplacian matrix can be interpreted as operators on functions of a graph. That is, given <span class="math inline">\(Ax\)</span>, <span class="math inline">\(x\)</span> can be interpreted as a function on the vertices, while <span class="math inline">\(A\)</span> is a linear mapping of the function <span class="math inline">\(x\)</span>. <span class="math display">\[
Ax(i) = \sum_{j:(i,j)\in E} x_{j}
\]</span> Or in other words, it is the sum of the elements of x that are connected to vertex <span class="math inline">\(i\)</span>. It can also be viewed as a quadratic form: <span class="math display">\[
x'Ax = \sum_{e_{ij}} x_{i}x_{j}
\]</span> Similarly, expressing the weighted laplacian matrix as an operator: <span class="math display">\[
\begin{aligned}
Lx(i) &= Dx(i) - Wx(i) \\
&= \sum_{j:(i,j)\in E} w_{ij} x_{i} - \sum_{j:(i,j)\in E} w_{ij}x_{j} \\
&= \sum_{j:(i,j)\in E} w_{ij}(x_{i}-x_{j})
\end{aligned}
\]</span></p>
<p>As a quadratic form <span class="math display">\[
\begin{aligned}
x'Lx &= x'Dx - x'Wx \\
&= \sum w_{ij}x_{i}^{2} - \sum_{e_{ij}} x_{i}w_{ij}x_{j} \\
&= \frac{1}{2}(\sum w_{ij}x_{i}^{2} - 2\sum_{e_{ij}} x_{i}w_{ij}x_{j} + \sum w_{ij}x_{j}^{2}) \\
&= \frac{1}{2}\sum_{e_{ij}} w_{ij}(x_{i}-x_{j})^{2}
\end{aligned}
\]</span></p>
<p>The symmetric normalised Laplacian matrix is defined as <span class="math inline">\(L^{sym} = D^{-1/2}LD^{-1/2} = I - D^{-1/2}AD^{-1/2}\)</span>.</p>
<p>Since the degree matrix is a diagonal matrix, <span class="math inline">\(D^{-1/2}\)</span> is just the <span class="math inline">\(D\)</span> matrix with each diagonal entry replaced by the reciprocal of its square root.</p>
</div>
<div id="properties-of-l" class="section level3">
<h3>Properties of <span class="math inline">\(L\)</span></h3>
<ul>
<li><span class="math inline">\(L\)</span> is symmetric because <span class="math inline">\(W\)</span> is symmetric.<br />
</li>
<li><span class="math inline">\(\mathbf{1}\)</span> is an eigenvector of the matrix (each row sums to 0), and <span class="math inline">\(L\mathbf{1} = 0\mathbf{1}\)</span>; hence 0 is the smallest eigenvalue.<br />
</li>
<li>The eigenvalues <span class="math inline">\(0=\lambda_{1} \leq \lambda_{2} \leq ... \leq \lambda_{n}\)</span> are real and non-negative.</li>
</ul>
</div>
</div>
<div id="laplacian-matrix-and-connectedness" class="section level2">
<h2>Laplacian Matrix and Connectedness</h2>
<p>Define a path as a walk without any repeated vertices. A graph is connected if any two of its vertices are contained in a path.</p>
<p>For a connected graph, <span class="math inline">\(\lambda_{2}>0\)</span>. Proof that the only eigenvector associated with eigenvalue 0 is <span class="math inline">\(\mathbf{1}\)</span> (up to scaling): Let <span class="math inline">\(x\)</span> be an eigenvector associated with the eigenvalue 0. From the quadratic form: <span class="math display">\[
x'Lx = x'0 = 0 = \sum_{e_{ij}} w_{ij}(x_{i}-x_{j})^{2}
\]</span> This implies that for any <span class="math inline">\((i,j) \in E, x_{i} = x_{j}\)</span>. Since there exists a path between any two vertices, <span class="math inline">\(x_{i} = x_{j}\)</span> for all <span class="math inline">\(i,j \in V\)</span>: <span class="math display">\[
x = \alpha
\left[\begin{array}{c}
1 \\
1 \\
. \\
. \\
1
\end{array}\right]
\]</span> Hence, the multiplicity (the number of linearly independent eigenvectors) of eigenvalue 0 is 1, and <span class="math inline">\(\lambda_{2} > 0\)</span>.</p>
<p>In fact, the multiplicity of the eigenvalue 0 tells us the number of connected components in the graph. For example, for a graph with two connected components (where the adjacency and Laplacian matrices have a block diagonal structure), you will get two eigenvectors associated with the eigenvalue 0, something like <span class="math inline">\([1~1~1~0~0~0]'\)</span> and <span class="math inline">\([0~0~0~1~1~1]'\)</span>.</p>
<p>To summarise, the number of connected components is equal to the multiplicity of eigenvalue 0 which is equal to the dimension of the null space of <span class="math inline">\(L\)</span>.</p>
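<p>A quick numerical check of the component-counting result (again a toy example of my own):</p>
<pre class="r"><code>## Two disjoint triangles: A and L are block diagonal
blk <- matrix(1, 3, 3) - diag(3)
A <- rbind(cbind(blk, matrix(0, 3, 3)),
           cbind(matrix(0, 3, 3), blk))
L <- diag(rowSums(A)) - A
ev <- eigen(L, symmetric = TRUE)$values
sum(abs(ev) < 1e-10)   # multiplicity of eigenvalue 0 = number of components = 2</code></pre>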
<div id="normalised-symmetric-laplacian-and-random-walk-matrix" class="section level3">
<h3>Normalised Symmetric Laplacian and Random walk matrix</h3>
<p>The normalised symmetric laplacian is defined as: <span class="math display">\[
L_{sym} = I - D^{-1/2}WD^{-1/2} = D^{-1/2}LD^{-1/2}
\]</span> In other words, it has 1 on the diagonals and <span class="math inline">\(-\frac{1}{\sqrt{deg(v_{i})deg(v_{j})}}\)</span> if <span class="math inline">\(v_{i}\)</span> is adjacent to <span class="math inline">\(v_{j}\)</span> and 0 otherwise.</p>
<p>The random walk matrix is defined as: <span class="math display">\[
L_{rw} = D^{-1}L = I - D^{-1}W = D^{-1/2}L_{sym}D^{1/2}
\]</span> <span class="math inline">\(L_{sym}\)</span> and <span class="math inline">\(L_{rw}\)</span> are similar matrices.</p>
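<p>The similarity claim can also be checked numerically. A sketch on an arbitrary small graph of my own choosing:</p>
<pre class="r"><code>A <- matrix(c(0, 1, 1, 0,
              1, 0, 1, 0,
              1, 1, 0, 1,
              0, 0, 1, 0), 4, 4)   # symmetric adjacency matrix
D <- diag(rowSums(A))
L <- D - A
Dm <- diag(1 / sqrt(diag(D)))      # D^(-1/2)
L_sym <- Dm %*% L %*% Dm
L_rw  <- solve(D) %*% L
## Similar matrices share the same spectrum
all.equal(sort(eigen(L_sym, symmetric = TRUE)$values),
          sort(Re(eigen(L_rw)$values)))   # TRUE up to numerical tolerance</code></pre>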
</div>
<div id="properties-of-ll_sym-and-l_rw" class="section level3">
<h3>Properties of <span class="math inline">\(L\)</span>,<span class="math inline">\(L_{sym}\)</span> and <span class="math inline">\(L_{rw}\)</span></h3>
<ul>
<li>All three matrices are positive semidefinite; <span class="math inline">\(L\)</span> and <span class="math inline">\(L_{sym}\)</span> are symmetric, while <span class="math inline">\(L_{rw}\)</span> in general is not.<br />
</li>
<li><span class="math inline">\(L_{sym}\)</span> and <span class="math inline">\(L_{rw}\)</span> share the same eigenvalues. <span class="math inline">\(u\)</span> is an eigenvector of <span class="math inline">\(L_{rw}\)</span> iff <span class="math inline">\(D^{1/2}u\)</span> is an eigenvector of <span class="math inline">\(L_{sym}\)</span>.<br />
</li>
<li><span class="math inline">\(u\)</span> is a solution of the eigenvalue problem <span class="math inline">\(Lu = \lambda Du\)</span> iff <span class="math inline">\(D^{1/2}u\)</span> is an eigenvector of <span class="math inline">\(L_{sym}\)</span> for the eigenvalue <span class="math inline">\(\lambda\)</span> iff <span class="math inline">\(u\)</span> is an eigenvector of <span class="math inline">\(L_{rw}\)</span> for the eigenvalue <span class="math inline">\(\lambda\)</span>.</li>
<li>A similar connection between the connected components and <span class="math inline">\(L\)</span> can be made with <span class="math inline">\(L_{sym}\)</span> and <span class="math inline">\(L_{rw}\)</span>.</li>
</ul>
</div>
</div>
Dashboard 2.0
https://www.timlrx.com/2017/11/23/dashboard-2-0/
Thu, 23 Nov 2017 00:00:00 +0000timothy.lin@alumni.ubc.ca (Timothy Lin)https://www.timlrx.com/2017/11/23/dashboard-2-0/<p><a href="./dashboard/sg-dashboard/">SG Dashboard 2.0</a> is now released and updated with Q3’s economic results. Built on R’s <code>flexdashboard</code> with interactive graphs on <code>Plotly</code>. My take on bringing statistical releases to the digital age.</p>
Choosing a Control Group in a RCT with Multiple Treatment Periods
https://www.timlrx.com/2017/11/18/choosing-a-control-group-in-a-rct-with-multiple-treatment-periods/
Sat, 18 Nov 2017 00:00:00 +0000timothy.lin@alumni.ubc.ca (Timothy Lin)https://www.timlrx.com/2017/11/18/choosing-a-control-group-in-a-rct-with-multiple-treatment-periods/<p>Came across a fun little problem over the past few weeks that is related to the topic of policy impact evaluation - a long time interest of mine! Here’s the setting: we have a large population of individuals and a number of treatments that we want to gauge the effectiveness of. The treatments are not necessarily the same but are targeted towards certain sub-segments in the population. Examples of such situations include online ad targeting or marketing campaigns. This gives rise to the following 3 methods of selecting the treatment and control groups:</p>
<ol style="list-style-type: decimal">
<li><p>Apply the targeting rule to get a population subset. Split this group into treatment and control, run the treatment and collect the results. In the next time period, keep those which remain in the control as the control and top up the group with a random sample to maintain a similar proportion of treated and control individuals.</p></li>
<li><p>Randomly split the population into treatment and control. For each period, do not vary the control group. Just administer the treatment on the treatment group. Evaluate the effectiveness in each period against the control group, applying the targeting rule to subset the relevant control population.</p></li>
<li><p>For each period and campaign, apply the targeting rule and randomise the group into treatment and control.</p></li>
</ol>
<div id="framework" class="section level3">
<h3>Framework</h3>
<p>Would these methods give equivalent results? I will use the Neyman-Rubin causal framework to formalise the intended goal and outcomes. Let <span class="math inline">\(Y_{i}\)</span> denote the outcome of an individual (e.g. total spending). The fundamental problem of causal inference is that one can never observe both the spending of an individual had he been administered the treatment, <span class="math inline">\(Y_{1i}\)</span>, and had he not, <span class="math inline">\(Y_{0i}\)</span>. Here, <span class="math inline">\(Y_{1i}\)</span> and <span class="math inline">\(Y_{0i}\)</span> are referred to as potential outcomes, since only one of them can ever be observed.</p>
<p>The average effect of a treatment on an individual is given by: <span class="math display">\[
E[Y_{1i} - Y_{0i}]
\]</span></p>
<p>Let <span class="math inline">\(D_{i}=1\)</span> denote being treated and <span class="math inline">\(D_{i}=0\)</span> being not treated. We can look at the difference in average outcomes based on treatment status: <span class="math display">\[
E[Y_{i} \vert D_{i}=1] - E[Y_{i} \vert D_{i}=0] = E[Y_{1i} \vert D_{i}=1] - E[Y_{0i} \vert D_{i}=0]
\]</span></p>
<p>If the treatment is not randomly assigned (e.g. people can choose to take-up the treatment), the above expression can be written as: <span class="math display">\[
\begin{aligned}
E[Y_{i} \vert D_{i}=1] - E[Y_{i} \vert D_{i}=0] &= E[Y_{1i} \vert D_{i}=1] - E[Y_{0i} \vert D_{i}=1] \\
&+ E[Y_{0i} \vert D_{i}=1] - E[Y_{0i} \vert D_{i}=0]
\end{aligned}
\]</span> The first term on the right is the average treatment effect on the treated, while the second is the selection bias. For example, if people who spend more are more likely to take up the treatment, we would expect the second term to be positive, leading to an upward bias in the estimated effect.</p>
<p>And that’s precisely why to evaluate the effectiveness of a treatment, we have to randomise people into treatment and control groups. Under randomisation, the potential outcomes are independent of the treatment<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>, <span class="math display">\[
\{Y_{1i}, Y_{0i}\} \perp D_{i}
\]</span> and</p>
<p><span class="math display">\[
E[Y_{1i} \vert D_{i}=1] - E[Y_{0i} \vert D_{i}=0] = E[Y_{1i} - Y_{0i}]
\]</span> This implies that taking the difference between the averages of the treated and control groups will give us the <em>Average Treatment Effect (ATE)</em>. In many situations, we relax the assumption by only requiring the mean outcome of non-treated individuals to be independent of treatment status:</p>
<p><span class="math display">\[
E[Y_{1i} \vert D_{i}=1] - E[Y_{0i} \vert D_{i}=0] = E[Y_{1i} - Y_{0i} \vert D_{i}=1]
\]</span> This gives the <em>Average Treatment on Treated (ATT)</em>.</p>
</div>
<div id="thought-experiment" class="section level3">
<h3>Thought Experiment</h3>
<p>To consider the various scenarios outlined above, let me set up a little thought experiment. In my world, there are two types of customers, high type or low type, which I denote by <span class="math inline">\(X_{i}\)</span>. Low type customers, <span class="math inline">\(X_{i} = L\)</span>, spend <span class="math inline">\(\alpha + \epsilon_{it}\)</span> dollars while high type customers, <span class="math inline">\(X_{i} = H\)</span>, spend <span class="math inline">\(\alpha + \beta + \epsilon_{it}\)</span> dollars, where <span class="math inline">\(\epsilon_{it}\)</span> is drawn from a normal distribution. The treatment of interest is a marketing promotion which is targeted at high spending individuals. Assume low type customers are not affected by the marketing promotion while high type customers have a probability <span class="math inline">\(p\)</span> of spending an additional <span class="math inline">\(\delta\)</span> dollars, which persists for the rest of the periods. Having taken up the treatment, the high type individual will no longer subscribe to future promotions. I ignore any changes in spending across time periods, though in practice one way to account for such changes is to consider the first difference.</p>
</div>
<div id="simulation-setup" class="section level3">
<h3>Simulation Setup</h3>
<p>To check on the effectiveness of the 3 methods of selecting a control group, let’s do a little simulation with the following parameters: <span class="math display">\[
\begin{aligned}
\alpha &= 3, \\
\beta &=2, \\
\delta &=1, \\
p &=0.3, \\
\epsilon_{it} &\sim N(0,1) ~\forall i
\end{aligned}
\]</span></p>
<p>To start, let’s build a 3 period model with 100,000 people in the population (half high type and half low type). I consider observations in 3 periods, <span class="math inline">\(t=1,2,3\)</span>, and split the population into 80% treatment and 20% control. The treatment is targeted towards higher spending individuals. However, one cannot observe the underlying type distribution and has to segment the population by the amount which they spend. In the simulation, I use a spending rule (<span class="math inline">\(Y_{i} > 4\)</span>), which covers approximately 50% of the initial population.</p>
<pre class="r"><code>n= 1e5
p = 0.3
d = 1
df = data.frame(ind = seq(1, n),
type = rep(c(0,1), n/2),
epsilon = rnorm(n, 0, 1),
unif = runif(n, 0, 1),
unif2 = runif(n, 0, 1))
df$spend = ifelse(df$type==0, 3, 5) + df$epsilon
### Select treatment and control using unif
df$target = ifelse(df$spend>4 , 1, 0)
df$treat = ifelse(df$target==1 & df$unif<0.8, 1, 0)
df$control = ifelse(df$target==1 & df$unif>=0.8, 1, 0)</code></pre>
</div>
<div id="att" class="section level3">
<h3>ATT</h3>
<p>Despite covering 50% of the population, randomness in spending patterns implies that the target group would still consist of both low and high types. This means that the outcome of our experiment would only yield an ATT effect, or the effect on the sub-population who spend more than 4. Let us calculate this effect before using the simulation to verify the results. We are interested in finding the fraction of the population who are high type conditional on spending more than 4. First, let us calculate the probability that a high and a low type individual spend more than 4 using R’s <code>pnorm</code> function before calculating the conditional probability:</p>
<pre class="r"><code>1-pnorm(4, 3, 1)</code></pre>
<pre><code>## [1] 0.1586553</code></pre>
<pre class="r"><code>1-pnorm(4, 5, 1)</code></pre>
<pre><code>## [1] 0.8413447</code></pre>
<p><span class="math display">\[
\begin{aligned}
P(X_{i}=H \vert Y_{i}>4) &= \frac{P(X_{i}=H, Y_{i}>4)}{P(Y_{i} >4)} \\
&= \frac{0.841*0.5}{0.159*0.5 + 0.841 *0.5} \\
&= 0.841
\end{aligned}
\]</span> Since only 0.841 of the sub-population would be affected by the treatment, we would expect the average treatment effect on the treated to be <span class="math inline">\(0.841 \times 0.3 \times 1 \approx 0.25\)</span> dollars.</p>
<pre class="r"><code>### Add in treatment effect to treated
df$delta = ifelse(df$treat==1 & df$type==1 & runif(n, 0, 1)<=p, d, 0)
df$spend2 = df$delta + df$spend
### Average treatment effect on treated
df_subset = df[df$target==1,]
mean(df[df$treat==1,]$spend2) - mean(df[df$control==1,]$spend2)</code></pre>
<pre><code>## [1] 0.2534045</code></pre>
<pre class="r"><code>lm(spend2 ~ treat, data=df_subset)</code></pre>
<pre><code>##
## Call:
## lm(formula = spend2 ~ treat, data = df_subset)
##
## Coefficients:
## (Intercept) treat
## 5.1672 0.2534</code></pre>
<p>More generally, a better approach to check our result would be to loop over many random samples and find the central tendency of the parameter estimate:</p>
<pre class="r"><code>att <- function(n=1e5, p=0.3, d=1){
df = data.frame(ind = seq(1, n),
type = rep(c(0,1), n/2),
epsilon = rnorm(n, 0, 1),
unif = runif(n, 0, 1),
unif2 = runif(n, 0, 1))
df$spend = ifelse(df$type==0, 3, 5) + df$epsilon
### Select treatment and control using unif
df$target = ifelse(df$spend>4 , 1, 0)
df$treat = ifelse(df$target==1 & df$unif<0.8, 1, 0)
df$control = ifelse(df$target==1 & df$unif>=0.8, 1, 0)
### Add in treatment effect to treated
df$delta = ifelse(df$treat==1 & df$type==1 & runif(n, 0, 1)<=p, d, 0)
df$spend2 = df$delta + df$spend
### Average treatment effect on treated
df_subset = df[df$target==1,]
mean(df[df$treat==1,]$spend2) - mean(df[df$control==1,]$spend2)
mod <- lm(spend2 ~ treat, data=df_subset)
return(coef(mod)["treat"])
}
B = 500
coef_list = list()
for(b in 1:B){
coef_list[[b]] <- att()
}
hist(unlist(coef_list))</code></pre>
<p><img src="./post/2017-11-18-choosing-a-control-group-in-a-rct-with-multiple-treatment-periods_files/figure-html/attrepeat-1.png" width="672" /></p>
<pre class="r"><code>mean(unlist(coef_list))</code></pre>
<pre><code>## [1] 0.2526948</code></pre>
<p>Unsurprisingly, the empirical results tally with our mathematical derivation.</p>
</div>
<div id="nd-period-att" class="section level3">
<h3>2nd period ATT</h3>
<p>Now, we are ready to evaluate the various proposed control groups. To keep things simple, the 2nd marketing promotion will be the same as the first and target individuals who spend above 4. However, this time to evaluate the results we need to consider 3 groups - low type, high type takers and high type non-takers - where takers and non-takers refer to whether they responded positively to the treatment in the first period. Repeating the above calculations and focusing on the share of non-takers in the sub-population: <span class="math display">\[
\begin{aligned}
P(X_{i,t=2}=H_{nt} \vert Y_{i,t=2}>4) &= P(X_{i,t=2}=H_{nt} \vert Y_{i,t=2}>4, i \in treated_{t=1})*0.8 \\
&\quad + P(X_{i,t=2}=H_{nt} \vert Y_{i,t=2}>4, i \in control_{t=1})*0.2\\
&=\frac{P(X_{i,t=2}=H_{nt}, Y_{i,t=2}>4)}{P(Y_{i,t=2} >4)}*0.8 + 0.841*0.2 \\
&= \frac{0.841*0.5*0.7}{0.159*0.5 + 0.841*0.5*0.7 + 0.841*0.5*0.3}*0.8 + 0.841*0.2\\
&= 0.639 \\
\\
ATT_{t=2} &= 0.639*0.3 \\
&= 0.192
\end{aligned}
\]</span> The calculations make intuitive sense. With a smaller pool of customers who would respond positively to the treatment, the ATT in the second period is lower than the first.</p>
</div>
<div id="targeting-rule-with-top-up" class="section level3">
<h3>Targeting rule with top-up</h3>
<p>Here’s a few lines of code to implement the idea of trying to keep the members of the control group relatively similar and do a random top-up where necessary.</p>
<pre class="r"><code>### 2nd time period
df$target2 = ifelse(df$spend2>4, 1, 0)
n_control2 = round(sum(df$target2) * 0.2)
n_control_remain = sum(df$target2 & df$control==1)
unif_threshold = (n_control2 - n_control_remain) / (sum(df$target2) - n_control_remain)
df$control2 = ifelse(df$target2==1 & (df$control==1 | df$unif2<=unif_threshold), 1, 0)
### Approximately fill up
df$treat2 = ifelse(df$target2==1 & df$control2==0, 1, 0)
df$delta2 = ifelse(df$treat2==1 & df$type==1 & df$delta==0 & runif(n, 0, 1)<=p, d, 0)
df$spend3 = df$delta2 + df$spend2 + rnorm(n,0,1)
### Average treatment effect 2 (Less than predicted!)
df_subset2 = df[df$target2==1,]
lm(spend3 ~ treat2, data=df_subset2)</code></pre>
<p>I show the results from 500 runs of the above code extracting the coefficient of the supposed treatment effect as well as the proportion of high non-treated individuals from the treatment and control group.</p>
<pre class="r"><code>mean(unlist(coef_list))</code></pre>
<pre><code>## [1] 0.4276794</code></pre>
<pre class="r"><code>mean(unlist(prop_control_list))</code></pre>
<pre><code>## [1] 0.8404415</code></pre>
<pre class="r"><code>mean(unlist(prop_treat_list))</code></pre>
<pre><code>## [1] 0.5890693</code></pre>
<p>Notice that the proportion of high non-treated individuals are no longer the same across the groups and the estimated effect is much larger than the calculated value. Almost no one has been treated in the control group. This leads to an upwards bias in the estimated treatment effect since the coefficient estimate is combining the effect of both the first and second treatment together.</p>
<p>More generally, the extent and direction of bias cannot be so easily quantified. If one allows the spending amounts to have a component that evolves randomly across time, it is possible for the estimate to be smaller than its actual value.<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a></p>
</div>
<div id="additional-thoughts" class="section level3">
<h3>Additional Thoughts</h3>
<p>Method 2 of having a universal control group is actually a special case of the above problem, where the control group does not vary at all. Under the assumption that each treatment would have a positive effect, the estimated effect for each subsequent treatment would always be overstated.</p>
<p>Only method 3 would give us a sensible result across both periods of the treatment. Here’s a fun little exercise - try to implement a random sample on the second period after subsetting the population using the targeting rule. Do you get a result similar to the calculated ATT above?</p>
</div>
<div id="tldr" class="section level3">
<h3>TL;DR</h3>
<p>In short, when it comes to choosing a random control group in a policy evaluation setting with multiple treatments and periods, the best option is the simplest one. Random assignment always works; no need to overcomplicate things.</p>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>More accurately only mean independence is required.<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p>Keeping only members that were present in the control group of the previous time period introduces a selection bias.<a href="#fnref2">↩</a></p></li>
</ol>
</div>
November Reflections
https://www.timlrx.com/2017/11/05/november-reflections/
Sun, 05 Nov 2017 00:00:00 +0000timothy.lin@alumni.ubc.ca (Timothy Lin)https://www.timlrx.com/2017/11/05/november-reflections/<p>A collection of thoughts to start the month off.</p>
<p><strong>On the blog</strong> - Had a look at the google analytics data. There are about 750 views in total since the blog’s inception with a few users clocking in 5-10 minutes per post - so thank you for bumping up the stats if you are a regular reader!</p>
<p>The most popular post…is the <a href="./dashboard/sg-dashboard/">SG dashboard</a>. This was a little surprising. I thought my thesis or any of the mathy stuff would be more interesting, but who knows? Maybe I will do a little spring cleaning to update the dashboard over the next few months. The post on <a href="./2017/08/29/mapping-the-distribution-of-religious-beliefs-in-singapore/">mapping Singapore’s religious distribution using census data</a> came in second. That’s a nice piece that I enjoyed writing too so maybe we will see more of that in the future.</p>
<p><strong>On life</strong> - Time flies. It wasn’t too long ago where I was stressing over some arcane mathematical spaces and now I am back to dealing with ‘real life’ problems. New job with the glamorous title of data scientist i.e. <a href="https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century">the sexiest job of the 21st century</a>. How much of that is marketing fluff vs ‘true value’…we shall see over the next few months. Though, to be fair there is tremendous heterogeneity in job requirements even for the same job title. Not to mention title inflation seems to be quite rampant in the job market today.</p>
<p><strong>On cycles</strong> - Funny how things change yet remains the same. The void deck cat, the flock of pigeons scurrying for bread crumbs and the dogs walking their owners. Though this time there is a new enemy - the bikes have invaded the park and void deck. Yellow and orange, two wheeled creatures, road hazard and eyesore.</p>
<p><strong>On math</strong> - 4 years on and I am slowly learning to appreciate the wonders and usefulness of my first year math modules. I am probably reading more about eigenvectors and eigenvalues now compared to my undergraduate years. It’s fun when you can speak the language, otherwise it just remains a cryptic code.</p>
<p><strong>On learning</strong> - Singapore malls now seem to be the go to place for…tuition centers. I guess it’s one of the few existing businesses that are willing to pay for such physical spaces and have the money to do so. It is a little ironic that the popular bookstore in my area has made way for another tuition center. How can I argue with the logic of the market, one dollar one vote, and Singapore parents have decided. What happened to learning as a journey rather than a result? A discovery rather than an information download? A joy rather than a societal norm?</p>
Notes on Regression - Singular Vector Decomposition
https://www.timlrx.com/2017/10/21/notes-on-regression-singular-vector-decomposition/
Sat, 21 Oct 2017 00:00:00 +0000timothy.lin@alumni.ubc.ca (Timothy Lin)https://www.timlrx.com/2017/10/21/notes-on-regression-singular-vector-decomposition/<p>Here’s a fun take on the OLS that I picked up from <a href="https://web.stanford.edu/~hastie/ElemStatLearn/">The Elements of Statistical Learning</a>. It applies the Singular Value Decomposition, also known as the method used in principal component analysis, to the regression framework.</p>
<div id="singular-vector-decomposition-svd" class="section level3">
<h3>Singular Vector Decomposition (SVD)</h3>
<p>First, a little background on the SVD. The SVD could be thought of as a generalisation of the eigendecomposition. An eigenvector v of matrix <span class="math inline">\(\mathbf{A}\)</span> is a vector that is mapped to a scaled version of itself: <span class="math display">\[
\mathbf{A}v = \lambda v
\]</span> where <span class="math inline">\(\lambda\)</span> is known as the eigenvalue. For a symmetric matrix (which guarantees a full set of orthogonal eigenvectors), we can stack up the eigenvalues and the normalised eigenvectors to obtain the following equation: <span class="math display">\[
\begin{aligned}
\mathbf{A}\mathbf{Q} &= \mathbf{Q}\Lambda \\
\mathbf{A} &= \mathbf{Q}\Lambda\mathbf{Q}^{-1}
\end{aligned}
\]</span> where <span class="math inline">\(\mathbf{Q}\)</span> is an orthonormal matrix.</p>
<p>For the SVD decomposition, <span class="math inline">\(\mathbf{A}\)</span> can be any matrix (not necessarily square). The trick is to consider the square matrices <span class="math inline">\(\mathbf{A}'\mathbf{A}\)</span> and <span class="math inline">\(\mathbf{A}\mathbf{A}'\)</span>. The SVD of the <span class="math inline">\(n \times k\)</span> matrix <span class="math inline">\(\mathbf{A}\)</span> is <span class="math inline">\(\mathbf{U}\mathbf{D}\mathbf{V}'\)</span>, where <span class="math inline">\(\mathbf{U}\)</span> is a square matrix of dimension <span class="math inline">\(n\)</span> and <span class="math inline">\(\mathbf{V}\)</span> is a square matrix of dimension <span class="math inline">\(k\)</span>. This implies that <span class="math inline">\(\mathbf{A}'\mathbf{A} = \mathbf{V}\mathbf{D}^{2}\mathbf{V}'\)</span>, so <span class="math inline">\(\mathbf{V}\)</span> is the eigenvector matrix of that square matrix. Similarly, the eigenvectors of <span class="math inline">\(\mathbf{A}\mathbf{A}'\)</span> form the columns of <span class="math inline">\(\mathbf{U}\)</span>, while <span class="math inline">\(\mathbf{D}\)</span> holds the square roots of the eigenvalues of either matrix on its diagonal.</p>
<p>In practice, there is no need to calculate the full set of eigenvectors for both matrices. Assuming that the rank of <span class="math inline">\(\mathbf{A}\)</span> is k, i.e. it is a tall matrix (<span class="math inline">\(n \geq k\)</span>), there is no need to find all n eigenvectors of <span class="math inline">\(\mathbf{A}\mathbf{A}'\)</span>, since only the first k eigenvectors are multiplied by non-zero singular values. Hence, we can restrict <span class="math inline">\(\mathbf{U}\)</span> to be an <span class="math inline">\(n \times k\)</span> matrix and let <span class="math inline">\(\mathbf{D}\)</span> be a <span class="math inline">\(k \times k\)</span> matrix.</p>
<p>Here’s a Paint illustration of the dimensions of the matrices produced by the SVD. The <span class="math inline">\(n \times k\)</span> matrix produced by multiplying <span class="math inline">\(\mathbf{U}\)</span> and <span class="math inline">\(\mathbf{D}\)</span> is identical for both the blue (keeping all n eigenvectors) and red (keeping only the relevant k eigenvectors) boxes. <img src="./img/SVD_dimension.png" alt="svd" /></p>
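<p>The same point can be verified numerically (again a NumPy sketch with arbitrary dimensions, not code from the post): the full SVD carries an <code>n x n</code> U, but the trailing n − k columns are multiplied by zeros, so the thin <code>n x k</code> version reconstructs the matrix just as well.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 2
A = rng.normal(size=(n, k))  # a tall matrix of rank k

# Full SVD: U is n x n, but only its first k columns matter
U_full, d, Vt = np.linalg.svd(A, full_matrices=True)

# Thin SVD: U is restricted to n x k and D to k x k
U_thin, d_thin, Vt_thin = np.linalg.svd(A, full_matrices=False)

assert U_full.shape == (n, n) and U_thin.shape == (n, k)

# Both reconstruct A: the trailing n-k columns of the full U hit
# zero rows of D, so dropping them changes nothing
D_full = np.zeros((n, k))
np.fill_diagonal(D_full, d)
assert np.allclose(A, U_full @ D_full @ Vt)
assert np.allclose(A, U_thin @ np.diag(d_thin) @ Vt_thin)
```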
</div>
<div id="applying-the-svd-to-ols" class="section level3">
<h3>Applying the SVD to OLS</h3>
<p>To apply the SVD to the OLS formula, we re-write the fitted values, substituting the data input matrix <span class="math inline">\(X\)</span> with its equivalent decomposed matrices:</p>
<p><span class="math display">\[
\begin{aligned}
\mathbf{X}\hat{\beta} &= \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \\
&= \mathbf{U}\mathbf{D}\mathbf{V}'(\mathbf{V}\mathbf{D}'\mathbf{U}'\mathbf{U}\mathbf{D}\mathbf{V}')^{-1}\mathbf{V}\mathbf{D}'\mathbf{U}'\mathbf{y} \\
&= \mathbf{U}\mathbf{D}(\mathbf{D}'\mathbf{D})^{-1}\mathbf{D}\mathbf{U}'\mathbf{y} \\
&= \mathbf{U}\mathbf{U}'\mathbf{y}
\end{aligned}
\]</span> where the third to fourth line comes from the fact that <span class="math inline">\((\mathbf{D}'\mathbf{D})^{-1}\)</span> is a <span class="math inline">\(k \times k\)</span> diagonal matrix with the reciprocals of the squared singular values on the diagonal, so, since <span class="math inline">\(\mathbf{D}\)</span> is a square diagonal matrix, <span class="math inline">\(\mathbf{D}(\mathbf{D}'\mathbf{D})^{-1}\mathbf{D} = \mathbf{I}_{k}\)</span>. Here we see that the fitted values are computed with respect to the orthonormal basis <span class="math inline">\(\mathbf{U}\)</span>.<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a></p>
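<p>To convince ourselves of the identity, here is a small NumPy sketch (my own illustration with simulated data, not code from the post) comparing the usual normal-equations fit against the <span class="math inline">\(\mathbf{U}\mathbf{U}'\mathbf{y}\)</span> form:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 3
X = rng.normal(size=(n, k))
y = rng.normal(size=n)

# OLS fitted values via the normal equations: X (X'X)^{-1} X' y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
fitted_ols = X @ beta_hat

# Fitted values via the thin SVD: X beta_hat = U U' y
U, d, Vt = np.linalg.svd(X, full_matrices=False)
fitted_svd = U @ (U.T @ y)

assert np.allclose(fitted_ols, fitted_svd)
```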
</div>
<div id="link-to-the-ridge-regression" class="section level3">
<h3>Link to the ridge regression</h3>
<p>The ridge regression is an OLS regression with an additional penalty term on the size of the coefficients and is a popular model in the machine learning literature. In other words, the parameters are chosen to minimise the penalised sum of squares: <span class="math display">\[
\sum_{i=1}^{n}(y_{i} - \sum_{j=1}^{k} x_{ij}\beta_{j})^{2} + \lambda \sum_{j=1}^{k} \beta_{j}^{2}
\]</span> The solution to the problem is given by: <span class="math inline">\(\hat{\beta}^{ridge} = (\mathbf{X}'\mathbf{X} + \lambda \mathbf{I}_{k})^{-1}\mathbf{X}'\mathbf{y}\)</span>. Substituting the SVD formula into the fitted values of the ridge regression:</p>
<p><span class="math display">\[
\begin{aligned}
\mathbf{X}\hat{\beta}^{ridge} &= \mathbf{X}(\mathbf{X}'\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}'\mathbf{y} \\
&= \mathbf{U}\mathbf{D}(\mathbf{D}'\mathbf{D} + \lambda\mathbf{I})^{-1}\mathbf{D}\mathbf{U}'\mathbf{y} \\
&= \sum_{j=1}^{k} \mathbf{u}_{j} \frac{d^{2}_{j}}{d^{2}_{j} + \lambda} \mathbf{u}_{j}'\mathbf{y}
\end{aligned}
\]</span> where <span class="math inline">\(\mathbf{u}_{j}\)</span> is the <span class="math inline">\(j\)</span>-th column of <span class="math inline">\(\mathbf{U}\)</span>, an <span class="math inline">\(n\)</span>-length vector. This formula makes the idea of regularisation really clear. It shrinks the predicted values by the factor <span class="math inline">\(d^{2}_{j}/(d^{2}_{j} + \lambda)\)</span>. Moreover, a greater shrinkage factor is applied to the directions which explain a lower fraction of the variance of the data, i.e. lower <span class="math inline">\(d_{j}\)</span>. This comes from the fact that the eigenvectors associated with a higher eigenvalue explain a greater fraction of the variance of the data (see Principal Component Analysis).</p>
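<p>The shrinkage formula is easy to check numerically. The following NumPy sketch (my own illustration with simulated data and an arbitrary penalty, not code from the post) compares the closed-form ridge fit against the sum over shrunken basis directions:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 50, 3
X = rng.normal(size=(n, k))
y = rng.normal(size=n)
lam = 2.0  # arbitrary penalty for illustration

# Closed-form ridge fit: X (X'X + lambda I)^{-1} X' y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)
fitted_ridge = X @ beta_ridge

# SVD form: each direction u_j is shrunk by d_j^2 / (d_j^2 + lambda)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)
fitted_svd = U @ (shrink * (U.T @ y))

assert np.allclose(fitted_ridge, fitted_svd)
# All shrinkage factors lie strictly between 0 and 1, and smaller
# singular values are shrunk more heavily
assert np.all((shrink > 0) & (shrink < 1))
```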
<p>The difference between how regularisation works under the Principal Component Analysis (PCA) method vs the ridge regression also becomes clear with the above formulation. The PCA approach discards the components whose singular values fall below a certain threshold (an all-or-nothing weighting), while the ridge regression applies a smooth, weighted shrinkage across all components.</p>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Doing a QR decomposition will also give a similar set of results, though the orthogonal bases will be different.<a href="#fnref1">↩</a></p></li>
</ol>
</div>
Mapping SG - Shiny App
https://www.timlrx.com/2017/10/11/mapping-sg-shiny-app/
Wed, 11 Oct 2017 00:00:00 +0000timothy.lin@alumni.ubc.ca (Timothy Lin)https://www.timlrx.com/2017/10/11/mapping-sg-shiny-app/<p>While my previous posts on the Singapore census data focused mainly on the distribution of religious beliefs, there are many interesting trends that could be observed in other characteristics. I decided to pool the data which I have cleaned and processed into a Shiny app. It took a little longer than I expected, but it is done. Have fun with it and I hope you learn a little bit more about Singapore!</p>
<iframe src="https://timlrx.shinyapps.io/sg-census-shinyapp/" onload="this.width=screen.width;this.height=screen.height;">
</iframe>