Cartographer visualises trends within the UCB interest sphere by applying text mining, text processing and machine learning. This includes topic modelling and cluster analysis on different data sources.
One of these sources is PubMed, the leading library containing abstracts and links to medical, nursing, healthcare and preclinical sciences journal articles. In addition, we gathered NIH data on grants awarded to promising research topics. Furthermore, we used data from several patent databases to ensure an overview of both patent applications and published patents within the interest domains. Combining these data enabled us to create a quick yet powerful visualisation of upcoming trends and trend evolutions within predefined life science interest fields.
To achieve this, we took the following steps. We started with the analysis of the abstract and the metadata (keywords, author, …) of 5.5 million PubMed articles as our main data source. The data from PubMed were downloaded, and the XML was transformed into a format suitable for Apache Spark and uploaded to S3. Apache Spark is an engine for large-scale data processing that was used later in the analysis. Subsequently, the relevant metadata were retrieved from those transformed data and stored as JSON in S3.
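As an illustration of the XML-to-JSON step, here is a minimal sketch that turns a PubMed-style XML fragment into one JSON record per article. The sample XML is a simplified, made-up fragment, and the names article_to_record and xml_to_json_lines as well as the chosen output fields are assumptions for this example. The actual pipeline ran on Apache Spark and S3, both of which are left out here.

```python
import json
import xml.etree.ElementTree as ET

# Simplified, made-up fragment in the style of a PubMed XML export.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>0000001</PMID>
      <Article>
        <ArticleTitle>An example title</ArticleTitle>
        <Abstract><AbstractText>An example abstract.</AbstractText></Abstract>
      </Article>
      <KeywordList><Keyword>example</Keyword><Keyword>sample</Keyword></KeywordList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def article_to_record(article):
    """Extract the fields used downstream from one PubmedArticle element."""
    citation = article.find("MedlineCitation")
    return {
        "pmid": citation.findtext("PMID"),
        "title": citation.findtext("Article/ArticleTitle", default=""),
        # An abstract may be split over several AbstractText elements.
        "abstract": " ".join(
            (node.text or "")
            for node in citation.iterfind("Article/Abstract/AbstractText")
        ),
        "keywords": [
            kw.text for kw in citation.iterfind("KeywordList/Keyword") if kw.text
        ],
    }


def xml_to_json_lines(xml_text):
    """Return one JSON string per article (line-delimited JSON)."""
    root = ET.fromstring(xml_text)
    return [json.dumps(article_to_record(a)) for a in root.iter("PubmedArticle")]


json_lines = xml_to_json_lines(SAMPLE_XML)
```

Writing one JSON object per line is a convenient choice here because line-delimited JSON can be read in parallel by engines such as Spark.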
The next step comprised the generation of the corpus and was the actual start of the text analysis. We tokenized and tagged (Part-of-Speech tagging) the abstract and title using the NLTK Python library. Additionally, we removed stop words, numbers and punctuation during this phase. Subsequently, we collocated terms that were frequently found next to each other (e.g. ‘New’ and ‘York’ become ‘New_York’). Our collocation algorithm was an in-house implementation based on Phrases from Gensim. To retain only relevant nouns, we also removed verbs, adverbs, determiners, cardinal numbers, etc. from the corpus. Finally, we singularized all the relevant terms.
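The cleaning and collocation steps can be sketched in plain Python. This is not the in-house implementation: the regular-expression tokenizer stands in for NLTK, the stop word list is a tiny illustrative one, and the bigram score follows the count-based formula commonly associated with Gensim's Phrases model. The function names and the min_count and threshold values are assumptions chosen for this toy corpus. Part-of-Speech filtering and singularization are omitted.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption, not the list used in the project).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "was", "for"}


def clean_tokens(text):
    """Lowercase and keep alphabetic tokens only, which drops numbers and punctuation."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def find_collocations(docs, min_count=2, threshold=0.5):
    """Select adjacent word pairs that co-occur more often than their counts suggest."""
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:
        unigrams.update(doc)
        bigrams.update(zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    selected = set()
    for (a, b), pair_count in bigrams.items():
        score = (pair_count - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            selected.add((a, b))
    return selected


def merge_collocations(doc, collocations):
    """Join each selected pair into a single token such as 'new_york'."""
    merged, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in collocations:
            merged.append(doc[i] + "_" + doc[i + 1])
            i += 2
        else:
            merged.append(doc[i])
            i += 1
    return merged


texts = [
    "The hospital in New York treated 120 patients.",
    "New York is a large city.",
    "A clinic in New York opened.",
    "The city opened a new clinic.",
]
docs = [clean_tokens(t) for t in texts]
collocations = find_collocations(docs, min_count=2, threshold=0.5)
corpus = [merge_collocations(doc, collocations) for doc in docs]
```

In this toy corpus only the pair ('new', 'york') passes the threshold, so the first document becomes hospital, new_york, treated, patients, while the standalone 'new' in the last document is left untouched.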
After corpus generation, topics were assigned to the available PubMed articles. For this, two different algorithms were used. Firstly, topic modelling was done via TF-IDF (the Apache Spark implementation). The resulting topics were combined with the provided article keywords. Secondly, Word2Vec was run on the full-text abstracts. Based on the assigned topics, the articles within the interest sphere of UCB were selected using a set of predefined domains. The topics themselves were clustered within each domain via the Word2Vec results.
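To make the two algorithms more concrete, here is a small pure-Python sketch. The TF-IDF part uses the smoothed inverse document frequency log((N + 1) / (df + 1)), which is the formula documented for Spark's IDF, and picks the top-scoring terms per document as its topics. The clustering part is an assumption: the text does not say which clustering method was applied to the Word2Vec results, so a simple greedy grouping by cosine similarity is used, and the two-dimensional vectors are hand-made stand-ins for trained Word2Vec embeddings.

```python
import math
from collections import Counter


def tfidf_top_terms(docs, k=2):
    """Return the k highest-scoring TF-IDF terms for each tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    top_terms = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        top_terms.append(ranked[:k])
    return top_terms


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_terms(vectors, min_similarity=0.8):
    """Greedy grouping: a term joins the first cluster whose first term is similar enough."""
    clusters = []
    for term in sorted(vectors):
        for cluster in clusters:
            if cosine(vectors[term], vectors[cluster[0]]) >= min_similarity:
                cluster.append(term)
                break
        else:
            clusters.append([term])
    return clusters


docs = [
    ["epilepsy", "seizure", "seizure", "patient"],
    ["psoriasis", "skin", "skin", "patient"],
    ["epilepsy", "drug", "drug", "patient"],
]
topics = tfidf_top_terms(docs, k=2)

# Hand-made toy vectors standing in for Word2Vec embeddings.
toy_vectors = {
    "epilepsy": [0.9, 0.1],
    "seizure": [0.8, 0.2],
    "psoriasis": [0.1, 0.9],
    "skin": [0.2, 0.8],
}
clusters = cluster_terms(toy_vectors)
```

Because 'patient' occurs in every document, its IDF is zero and it never appears among the top terms, which is the behaviour one wants from a term that does not discriminate between articles.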
The final step was to map the clustered terms to other data sources (Google Trends, NIH, PatBase, …) to visualise the trends over time.
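One simple way to picture this mapping is to count, per year, the records of a data source that mention any term of a cluster. The sketch below assumes this counting approach; the record layout, the function name yearly_counts and the toy records are all made up for illustration and do not reflect the real schemas of Google Trends, NIH or PatBase.

```python
from collections import Counter


def yearly_counts(records, cluster_terms):
    """Count, per year, the records that mention at least one term of the cluster."""
    terms = set(cluster_terms)
    counts = Counter()
    for record in records:
        if terms & set(record["terms"]):
            counts[record["year"]] += 1
    return dict(sorted(counts.items()))


# Made-up toy records; each one could stand for a grant, patent or article.
records = [
    {"year": 2014, "terms": ["epilepsy", "drug"]},
    {"year": 2015, "terms": ["seizure"]},
    {"year": 2015, "terms": ["epilepsy"]},
    {"year": 2015, "terms": ["psoriasis"]},
]
trend = yearly_counts(records, ["epilepsy", "seizure"])
```

The resulting year-to-count mapping is the kind of series that can be plotted as a trend line for a topic cluster.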
The resulting application is an appealing visualisation tool that gives a quick yet powerful overview of upcoming trends and trend evolutions within predefined life science interest fields. In the Cartographer application, you can study the relevance and evolution of a specific topic over time. Additionally, for every topic, related interest fields and terms are shown, and cross-domain trends can be followed. Furthermore, because NIH grant data are included, recent investments in a scientific field can be monitored. A folding menu also lists top articles, top organisations and top projects within the selected field or subfield of your interest. This section can be used to direct you to high-impact publications or to help you identify the most influential organisations in your interest sphere to partner with.