Skip to content

ometaxas/ometaxas.github.io

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

752 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

A simple topic-model browser for SciXplore (focusing on Covid-19 Open Research Corpus Analysis )

Our goal in this demo is to analyze COVID-19 Open Research Dataset in order to:

  • Demonstrate related domain specific analysis capability of SciXplore
  • Identify intepretable topics (described by WIKIPedia concepts, MeSH, words & texts)
  • Identify and annotate publication with WikiPedia or MeSH related concepts
  • Analyze related topic trends
  • Analyze similarity among topics, among publications or other entities
  • Demonstrate topic & concept based exploration & information retrieval capabilities

The analysis has been based on SciXplore multi-view topic modelling, representation learning and semantic annotation framework. SciXplore provides a lot of advantages on top of text only topic modeling alternatives (e.g., LDA) producing more interpetable, multi-view, topics as well as vector representations (embeddings) for all involving entities (including topics, words & concepts). Therefore, by combining multi-view topic modelling with skip-gram based topic2vec representation learning infers topics of better quality and facilitate similarity analysis & recommendation. SciXplore also provides a semantic annotation engine for DBPedia (Wiki) based annotation which extends the well-established DBPedia Spotlight framework

The visualization has been based on agoldst.github.io/dfr-browser, use d3 and it relies entirely on static html and javascript files. We have created a python scipt which analyzes the results of SciXplore and exports required data files in a format which is compatible with dfrbrowser. To better support the requirements of SciXplore & Covid-19 dataset we have implemented several changes on top of the dfrbrowser such as:

  • Change the way we are calculating trends and more specifically the normalization: topic activity per year is based on the ratio of the docs of a topic in a given year / total docs for a given topic (and total docs of all topics for a given year)
  • Change html / CSS in several places to handle longer 'word' descriptions (which correspond to concepts)
  • Support autocompletion & multi-selection in word & concept search
  • Show a topic relations view (through a Force-Directed Layout)

Our analysis has been based on the following steps:

  • Consume and pre-process publications from COVID-19 Open Research Dataset
  • Enrich them (e.g., getting MeSH terms) utilizing MEDLINE/PubMed dumps
  • Annotate all docs with Wikipedia terms using SciXplore semantic annotation engine
  • Analyze text, metadata, MeSH & Wikipedia concepts with SciXplore inferring:
    • Intepretable multi-view topics
    • Topics per document or other entities
    • Vector representations per topics or other entities
    • Analyze trends & similarities
    • Create a WEB GUI for data exploration (adapting agoldst.github.io/dfr-browser)
    • Give titles to the topics

There are several things that we can do on top (or that they aren't properly addressed through this visualization) especially in relation to search & recommendation, such as:

  • besides silimiarities among topics, identify and analyze similarities among concepts or documents
  • content-based recommendation: given a publication, topic or concept recommend similar entities
  • improvements on concept-based search (e.g., search using multiple topics or articles which are related to some concepts)
  • merge corresponding WIKI & MeSH concepts (and provide links)
  • journal or publisher specific analysis (e.g., search by journal or publisher)

For more information or feedback please contact Omiros Metaxas: ometaxas@atypon.com

About

ScTopic modelling visualization & analysis of results

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors