Automated Hypothesis Generation Based on Mining Scientific Literature

The title of this blog post is that of a paper that recently appeared at KDD. Here is a link that probably won’t work. This post is a running commentary on the paper as I read it.


“Current search technologies typically find many relevant documents, but they do not extract and organize the information content of these documents or suggest new scientific hypotheses based on this organized content.”

Okay, so there are a few big ideas here wrapped up in one. Information retrieval and information extraction are two sides of the same coin. This is what John does. Organization of the content is also related to what his group does and is certainly on the cutting edge of IR/IE. The step of suggesting hypotheses is more traditional AI, and I wonder how we extend from the more empirical, metrics-based approaches of the aforementioned fields to the more theoretical, rules-based approaches of hypothesis generation.

“KnIT…further reasons upon these data to generate novel and experimentally testable hypotheses.”


“KnIT combines entity detection with neighbor-text feature analysis and with graph-based diffusion of information to identify potential new properties of entities that are strongly implied by existing relationships.”

Wow, sounds good. Entity detection is certainly hot right now, and folks from CIIR are very into it. I don’t know anything about the graph-based diffusion of information bit, but it sounds like it uses information-theoretic metrics to tell how far an effect/feature/something reaches?


“Even recognizing new questions that should be asked can be a challenge. Instead, only a sliver of the relevant knowledge guides hypotheses: an approach that is deeply wasteful. This fundamental bottleneck is pervasive in biology and representative of every area of human endeavor in which there is a mushrooming mismatch between raw information and our analytic abilities.”

“Our goal is to accelerate scientific progress by combining mining, visualization, and analytics, with the hope to integrate all available content, identify the facts that are relevant to a given query, and from these facts suggest hypotheses that are new, interesting, testable and likely to be true.”

Lofty! Also, similar to some goals we have.

Their running case study will be the protein p53, which is a tumor suppressor.

KnIT stands for “Knowledge Integration Toolkit.” It has three phases: exploration, interpretation, and analysis.

Exploration involves some kind of extraction to generate features. The features index entities. Presumably the search space is the space of entities.

Interpretation involves connecting entities into a graph. There are some higher dimensional visualization techniques that can be used to highlight “interesting” areas, e.g., coloring. “Interesting” parts of the graph are subgraphs that denote relationships between entities that have not previously been discovered.

Analysis involves “globally diffus[ing] annotation information among entities” in order to rank them and select the best entities for experimentation. I’m not sure what this means; experimentation is only partially over entities. Presumably the experiment is to test a causal relationship between entities? If that’s the case, then the graph representation needs a richer link structure than just “these are related.”
NOTE: the relationship in this case is whether a kinase “phosphorylates” p53, so it is a specific causal relationship.

They cite past work on hypothesis generation and automatic inference of known formulae. However, “hypothesis generation from unstructured text has been a hit-or-miss manual process.”

Impact on society

Lofty. Our brains are limited. We aren’t Google (or cyborgs yet). Invoke Big Data.

The Problem of p53 Kinases

High-level discussion of p53. The takeaway: “knowing which proteins are modified by each kinase, and therefore which kinases would make good drug targets, is a difficult and unsolved problem.”

Representing Kinases

They mine papers (especially abstracts) for keywords, essentially. These define the feature space (bag of words). They use a knowledge base that contains the canonical name and all known synonyms. The system then queries a paper database with a giant OR for all synonyms and extracts features from those papers’ abstracts. They removed data that explicitly mentioned p53, since they were interested in discovering new relationships (rather than discovering those that had already been discovered and published). They represent each document (abstract) as a weighted normalized vector and remove stop words. They use tf-idf for their weights.
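To make the representation concrete, here is a minimal sketch of that step: each abstract becomes a stop-word-filtered, tf-idf weighted, L2-normalized bag-of-words vector. The corpus, tokenizer, and stop-word list below are toy stand-ins, not KnIT's actual pipeline.

```python
# Minimal tf-idf sketch: one L2-normalized term-weight dict per abstract.
# The tiny stop-word list is illustrative only.
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "is", "and", "by"}

def tfidf_vectors(abstracts):
    docs = [[w for w in a.lower().split() if w not in STOP_WORDS]
            for a in abstracts]
    n_docs = len(docs)
    # document frequency for each term
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        # raw tf times idf
        vec = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # L2-normalize (guard against the all-zero vector)
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({t: v / norm for t, v in vec.items()})
    return vectors
```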

After this first pass, they calculate kinase centroids in the feature space. Then they calculate the Euclidean distance between every centroid. This gives some numeric sense of how “close” two kinases are. However, it is not immediately clear how even a domain expert would interpret these numbers.
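A sketch of that centroid-and-distance step, reusing a dict-of-term-weights representation; the inputs are illustrative, not the paper's data:

```python
# Per-kinase centroids in the shared term space, then pairwise
# Euclidean distances between centroids. Vectors are dicts term -> weight.
import math

def centroid(doc_vectors):
    total = {}
    for vec in doc_vectors:
        for term, w in vec.items():
            total[term] = total.get(term, 0.0) + w
    n = len(doc_vectors)
    return {t: w / n for t, w in total.items()}

def distance(u, v):
    # Euclidean distance over the union of terms; missing terms count as 0
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))

def distance_matrix(centroids):
    names = sorted(centroids)
    return {(a, b): distance(centroids[a], centroids[b])
            for a in names for b in names if a < b}
```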

They consider displaying the information as a network graph, but this model isn’t quite right, since network graphs rely on nodes being connected or not connected, and here every node is “connected.” They note that they could threshold the distance to denote connectivity, but this would confer an interval interpretation to the distance, when ratio is really all that’s warranted.

They discuss visualization and HCI issues: how does one best communicate the data? They want to pare down the information sufficiently so that the user is not overloaded. They decide that the features they want are (a) minimal connectivity (minimize information presented), (b) tree-like (for the ease of navigation and the relative relationships between levels), and (c) low-arity (not sure what their argument is about — they just don’t want to have a linked list or something). They choose as the root the feature vector that is closest to the average of all the centroids and call this property “typicality.”
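The "typicality" root choice can be sketched as an argmin over centroids; this is my reading of their description, not their code, and the vectors below are illustrative:

```python
# Pick the centroid nearest (Euclidean) to the mean of all centroids.
import math

def euclidean(u, v):
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))

def most_typical(centroids):
    """Return the name of the centroid closest to the average of all centroids."""
    mean = {}
    for vec in centroids.values():
        for t, w in vec.items():
            mean[t] = mean.get(t, 0.0) + w
    mean = {t: w / len(centroids) for t, w in mean.items()}
    return min(centroids, key=lambda name: euclidean(centroids[name], mean))
```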

Selecting Candidate p53 Kinases

They need a ranking for the kinases and use graph diffusion to accomplish this:

“Graph diffusion is a semi-supervised learning approach for classification based on labeled and unlabeled data. It takes known information (initial labels) and then constrains the new labels to be smooth in respect to a defined structure (e.g. a network).”

Their labeled data are the kinases already known to phosphorylate p53; the unlabeled data are the remaining kinases.

They define a kinase network as the top 10 most closely related kinases. This number was chosen via cross validation. They are minimizing the error in predicted values less a smoothing term. They end up with a convex optimization problem. They show some ROC curves for their cross validation.
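My rough mental model of the diffusion step, as a label-propagation sketch: known labels are clamped (1 for known p53 kinases, 0 for known negatives), and each unlabeled kinase repeatedly takes the average of its neighbors' labels over the nearest-neighbor graph (k = 10 in the paper; the toy chain graph below is just for illustration, and this is a generic label-propagation scheme, not KnIT's exact formulation):

```python
# Generic label propagation: clamp seed labels, smooth the rest over
# the graph until values stabilize. neighbors: node -> list of neighbors;
# seeds: node -> fixed label in [0, 1].
def diffuse(neighbors, seeds, iters=50):
    labels = {n: seeds.get(n, 0.0) for n in neighbors}
    for _ in range(iters):
        for n, nbrs in neighbors.items():
            if n in seeds or not nbrs:
                continue  # labeled nodes stay clamped
            labels[n] = sum(labels[m] for m in nbrs) / len(nbrs)
    return labels
```

Ranking the unlabeled kinases by their diffused score then picks out the candidates "strongly implied by existing relationships."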


They did three studies:

  1. They did a retrospective analysis that looked at whether the model could predict discoveries after a particular date, given publications that occurred strictly before that date. This is a nice baseline, but might be a bit misleading — the interesting parts of science are the big leaps, not the incremental discoveries. The predictions in the retrospective study should be a minimum requirement, and not even a proof of concept.
  2. They compared the approach with human-driven state of the art discovery approaches and looked at areas for improvement.
  3. Their third study involved searching for discoveries at scale (the real promise of this work).

I skimmed the rest of the section, which was filled with domain details and reports on how things went. Without a comparative study, it’s hard to believe any “conclusion,” and besides they state up front that this is a proof of concept anyway.

Conclusion and Future Work

Repetition of the intro. ENTITIES.

Does this approach generalize? There was still a lot of domain knowledge baked in. Only time will tell…

Speeding up SurveyMan analyses

A major bottleneck in some of our analyses is that we need to resample survey responses. Let $$n$$ denote the number of responses we’ve seen. Let $$m$$ denote the number of questions in the survey. $$b_i$$ is the number of bootstrap iterations. $$b_s$$ is a list of bootstrap samples. $$scores$$ is a list of scores. Our resampling approach is as follows:

1. For each response $$r$$ in the list of all responses ($$O(n)$$):
    a. $$srs \gets$$ All other survey responses that have answered at least all the questions in $$r$$ ($$O(n)$$).
    b. For $$sr$$ in $$srs$$ ($$O(n)$$):
       i. Truncate $$sr$$ so we are only looking at the set that overlaps with $$r$$ ($$O(m)$$).
    c. For $$i \gets 1$$ to $$b_i$$ ($$O(b_i)$$):
       i. Randomly select $$|srs|$$ samples from $$srs$$ and add to the list $$b_s$$ ($$O(n)$$).
    d. For $$b$$ in $$b_s$$ ($$O(b_i)$$):
       i. Compute the scores for $$b$$ and add to the list $$scores$$ ($$O(n)$$).
    e. Sort $$scores$$ in ascending order ($$O(b_i\log b_i)$$).
    f. Return true or false if $$r$$’s score falls outside the appropriate density ($$O(1)$$).
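The steps above can be sketched as follows, with responses as dicts from question id to numeric answer. The `score` function (here, the mean answer) and the percentile cutoff are stand-ins for whatever the real analysis uses:

```python
# Sketch of steps (a)-(f) for a single response r.
import random

def score(resp):
    # placeholder scoring function: mean answer value
    return sum(resp.values()) / len(resp)

def is_outlier(r, responses, b_i=2000, alpha=0.05, rng=random):
    # (a) all other responses that answered at least all of r's questions
    srs = [sr for sr in responses if sr is not r and set(r) <= set(sr)]
    if not srs:
        return False
    # (b) truncate each sr to the questions r answered
    srs = [{q: sr[q] for q in r} for sr in srs]
    scores = []
    # (c)+(d) draw |srs| responses with replacement, b_i times, scoring each sample
    for _ in range(b_i):
        sample = [rng.choice(srs) for _ in srs]
        scores.append(sum(score(sr) for sr in sample) / len(sample))
    # (e) sort, (f) check whether r's score falls outside the central mass
    scores.sort()
    lo = scores[int((alpha / 2) * b_i)]
    hi = scores[int((1 - alpha / 2) * b_i) - 1]
    return not (lo <= score(r) <= hi)
```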

Reminder: randomization and equivalence are non-trivial

The randomization of SurveyMan introduces some non-trivial complications to the above process. I’ve written before about how step (1) relies on defining the notion of question equivalence carefully. When we have variants, if we attempt to match on the literal question, we may not have a statistically significant number of samples to compare. Consider the prototypicality survey: we had 16 blocks, each with 4 variants, which means up to $$4^{16}$$ distinct surveys! Resampling won’t help us at all in that case.

Can we resample fewer times?

Although resampling becomes possible when we collapse variants into a single question, it’s still time-consuming. Calculating scores for the phonology survey — which had almost 100 questions, and for which we gathered about 300 responses — takes upwards of 25 minutes. It would be nice if we could do this faster.

The current resampling code is very naive. As written above, we truncate the response lists first and then resample. In order to truncate, we first compare our response $$r$$ with our $$n$$ survey responses. Resampling involves drawing from the pool of $$n$$ survey responses $$b_i$$ times. When I am impatient, I set $$b_i$$ to 500. When I want to do things the right way, I set $$b_i$$ to 2000. Clearly $$b_i$$ dominates the computation. We end up with $$O(n(n + nm + 2b_i n + b_i \log b_i))$$ = $$O(n^2m + b_i n^2 + nb_i\log b_i)$$ = $$O(b_i n^2)$$ running time, assuming $$b_i$$ dominates $$m$$ and $$n$$ dominates $$\log b_i$$.


Would things be any better if we only computed the bootstrap samples once, and performed the truncation later? Let’s consider the following alternative algorithm:

1. For $$i \gets 1$$ to $$b_i$$ ($$O(b_i)$$):
    a. Randomly select $$n$$ samples from the responses and add to the list $$b_s$$ ($$O(n)$$).
2. For each response $$r$$ in the list of all responses ($$O(n)$$):
    a. For $$b$$ in $$b_s$$ ($$O(b_i)$$):
       i. $$srs \gets$$ All other survey responses that have answered at least all the questions in $$r$$ ($$O(n)$$).
       ii. For $$sr$$ in $$srs$$ ($$O(n)$$):
          A. Truncate $$sr$$ so we are only looking at the set that overlaps with $$r$$ ($$O(m)$$).
          B. Compute the scores for $$b$$ and add to the list $$scores$$ ($$O(n)$$).
    b. Sort $$scores$$ in ascending order ($$O(b_i\log b_i)$$).
    c. Return true or false if $$r$$’s score falls outside the appropriate density ($$O(1)$$).

The above gives us a running time of $$O(b_in + n(b_i(n + n(n + m)) + b_i\log b_i))$$ = $$O(b_in + b_in^2 + b_in^3 + b_in^2m + nb_i\log b_i)$$ = $$O(b_in^3)$$, assuming $$n$$ dominates $$m$$. Yikes! Even though we only compute the bootstrap sample once, we still iterate over it, and that iteration sits inside the outer loop over responses, causing a blowup in the time to compute.

There are clearly more subtle analyses we can do. The first approach only computes the bootstrap sample over the truncated responses, which are often fewer than the total number of responses. We might be concerned about the garbage collector when we recompute new samples.

Another concern I have with caching is that it introduces a bias. We are essentially reusing the same data (the bootstrap samples) and may run into multiple comparison issues. Of course, the point of the bootstrap simulation is that it approximates the true distribution and therefore this should be less of an issue the more bootstrap iterations we have.


Another approach to speed things up would be to parallelize the score computation. When all of the analysis was written in Clojure, this would have been fine, since nothing was mutable. However, the Java implementation mutates scores in the original SurveyResponse object. We could get around this by completely decoupling the truncated view from the original SurveyResponse. I might do this to see if it makes a difference.
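A sketch of that decoupling, in Python for brevity: freeze each truncated response into an immutable tuple, so scoring has no side effects and can be farmed out to a pool. The names here are hypothetical, not the actual SurveyMan API, and for CPU-bound scoring on the JVM real threads would pay off more than Python threads do:

```python
# Immutable truncated views make the scoring step trivially parallel.
from concurrent.futures import ThreadPoolExecutor

def freeze(response, questions):
    """Immutable truncated view: a sorted tuple of (question, answer) pairs."""
    return tuple(sorted((q, response[q]) for q in questions))

def score(frozen):
    # placeholder scoring function: mean answer over the truncated view
    return sum(answer for _, answer in frozen) / len(frozen)

def parallel_scores(responses, questions):
    views = [freeze(r, questions) for r in responses]
    # side-effect-free scoring, so the pool needs no locking
    with ThreadPoolExecutor() as pool:
        return list(pool.map(score, views))
```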

Using 3x

A few months ago, Emery pointed out a new project being advertised on Stanford’s CS web page. It’s called 3X, and the project page describes the system as follows:

3X is an open-source software tool to ease the burden of conducting computational experiments and managing data analytics. 3X provides a standard yet configurable structure to execute a wide variety of experiments in a systematic way, avoiding repeated creation of ad-hoc scripts and directory hierarchies. 3X organizes the code, inputs, and outputs for an experiment. The tool submits arbitrary numbers of computational runs to a variety of different compute platforms, and supervises their execution. It records the returning results, and lets the experimenter immediately visualize the data in a variety of ways. Aggregated result data shown by the tool can be drilled down to individual runs, and further runs of the experiment can be driven interactively. Our ultimate goal is to make 3X a “smart assistant” that runs experiments and analyzes results semi-automatically, so experimenters and analysts can focus their time on deeper analysis. Two features toward this end are under development: visualization recommendations and automatic selection of promising runs.

The GitHub repository describes a system for planning and re-running experiments. The tool manages inputs and outputs and produces a factorial design on the basis of the inputs. It isn’t clear to me whether a more sophisticated design is allowed.

I attempted to run a 3x experiment over the bocado work. The idea was to run with and without tracing, at varying numbers of functions sampled. Installing the tool seemed fine, but getting it to work on my machine was another problem. I tried both the executable provided and building from source. Initially the problem seemed to be with the custom file-watcher the author built. An error was thrown in the GUI script when I tried to start it up after specifying the experiment. When I tried running the experiment with `3x run`, nothing appeared to happen. After some fiddling and starting again from scratch, I was able to get something running, although it didn’t look like my program was running, nor did the GUI appear to work. There was limited debug information, so after a while I just gave up. The documentation for this tool includes screenshots of the GUI, which makes me think it works somewhere.

I would definitely be interested in using this tool in conjunction with some of our tools. I noticed in one of the issue comments that there appears to be another, similar tool called Sumatra, which I will check out in the future.

Of interest to us is the last statement on the Stanford page, stating that the goal of 3x is for it to operate “semi-automatically.” It promises development of “visualization recommendations and automatic selection of promising runs”. I am interested in seeing if/how that pans out.