add knowledge_graph notebook

dcarpintero · dcarpintero · commit 6e37144c8add · 2024-08-22T17:54:25.000+02:00
diff --git a/05_knowledge_graphs.ipynb b/05_knowledge_graphs.ipynb
@@ -11,9 +11,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Knowledge Graphs, a form of graph-based knowledge representation, provide a method for modeling and storing interlinked information in a human - and machine - understandable format. In practice, such a graph data structure consists of *nodes* and *edges*, representing entities and their relationships. Unlike traditional databases, the inherent expressiveness of graphs allows for richer semantic understanding, while providing the flexibility to accommodate new entity types and relationships without being constrained by a fixed schema.\n",
+    "Knowledge Graphs, a form of graph-based knowledge representation, provide a method for modeling and storing interlinked information in a format that is both human- and machine-understandable. These graphs consist of *nodes* and *edges*, representing entities and their relationships. Unlike traditional databases, the inherent expressiveness of graphs allows for richer semantic understanding, while providing the flexibility to accommodate new entity types and relationships without being constrained by a fixed schema.\n",
     "\n",
-    "By combining knowledge graphs with embeddings (vector search), we can leverage *multi-hop connectivity* and *contextual understanding of information* to enhance querying, reasoning, and explainability in LLMs. This notebook explores the practical implementation of this approach, demonstrating how to (i) build a knowledge graph from academic literature, and (ii) extract actionable insights from it."
+    "By combining knowledge graphs with embeddings (vector search), we can leverage *multi-hop connectivity* and *contextual understanding of information* to enhance querying, reasoning, and explainability in LLMs. This notebook explores the practical implementation of this approach, demonstrating how to (i) build a knowledge graph of academic publications, and (ii) extract actionable insights from it."
    ]
   },
   {
@@ -24,6 +24,379 @@
     "  <img src=\"./static/knowledge-graphs.png\">\n",
     "</p>"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 1. Knowledge Graph Initialization\n",
+    "\n",
+    "We will create our Knowledge Graph using [Neo4j](https://neo4j.com/), an open-source database management system that specializes in graph database technology."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "%pip install neo4j langchain langchain_openai langchain-community python-dotenv --quiet | tail -n 1"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### 1.1 Setting Up a Neo4j Instance\n",
+    "\n",
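+    "For a quick and easy setup, you can start a free instance on [Neo4j Aura](https://neo4j.com/product/auradb/). The cells below read the connection details (`NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD`) from a local `.env` file."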
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import dotenv\n",
+    "dotenv.load_dotenv('.env', override=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from langchain_community.graphs import Neo4jGraph\n",
+    "\n",
+    "graph = Neo4jGraph(\n",
+    "    url=os.environ['NEO4J_URI'], \n",
+    "    username=os.environ['NEO4J_USERNAME'],\n",
+    "    password=os.environ['NEO4J_PASSWORD'],\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### 1.2 Loading Dataset into a Graph\n",
+    "\n",
+    "The example below connects to our Neo4j database and populates it with synthetic data about research articles and their authors.\n",
+    "\n",
+    "The entities are:\n",
+    "- *Researcher*\n",
+    "- *Article*\n",
+    "- *Topic*\n",
+    "\n",
+    "The relationships are:\n",
+    "- *Researcher* --[PUBLISHED]--> *Article*\n",
+    "- *Article* --[IN_TOPIC]--> *Topic*\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from langchain_community.graphs import Neo4jGraph\n",
+    "\n",
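+    "# With no arguments, Neo4jGraph reads NEO4J_URI, NEO4J_USERNAME, and NEO4J_PASSWORD from the environment.\n",
+    "graph = Neo4jGraph()\n",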
+    "\n",
+    "q_load_articles = \"\"\"\n",
+    "LOAD CSV WITH HEADERS\n",
+    "FROM 'https://raw.githubusercontent.com/dcarpintero/generative-ai-101/main/dataset/synthetic_articles.csv' \n",
+    "AS row \n",
+    "FIELDTERMINATOR ';'\n",
+    "MERGE (a:Article {title:row.Title})\n",
+    "SET a.abstract = row.Abstract,\n",
+    "    a.publication_date = date(row.Publication_Date)\n",
+    "FOREACH (researcher in split(row.Authors, ',') | \n",
+    "    MERGE (p:Researcher {name:trim(researcher)})\n",
+    "    MERGE (p)-[:PUBLISHED]->(a))\n",
+    "FOREACH (topic in [row.Topic] | \n",
+    "    MERGE (t:Topic {name:trim(topic)})\n",
+    "    MERGE (a)-[:IN_TOPIC]->(t))\n",
+    "\"\"\"\n",
+    "\n",
+    "graph.query(q_load_articles)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "print(graph.get_schema)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### 1.3 Build Vector Index\n",
+    "\n",
+    "We build a vector index to efficiently search for relevant articles based on their *topic, title, and abstract*. This involves computing an embedding for each article from these fields. At query time, the system embeds the user's input and retrieves the most similar articles according to a similarity metric, such as cosine similarity.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_community.vectorstores import Neo4jVector\n",
+    "from langchain_openai import OpenAIEmbeddings\n",
+    "\n",
+    "vector_index = Neo4jVector.from_existing_graph(\n",
+    "    OpenAIEmbeddings(),\n",
+    "    url=os.environ['NEO4J_URI'],\n",
+    "    username=os.environ['NEO4J_USERNAME'],\n",
+    "    password=os.environ['NEO4J_PASSWORD'],\n",
+    "    index_name='articles',\n",
+    "    node_label=\"Article\",\n",
+    "    text_node_properties=['topic', 'title', 'abstract'],\n",
+    "    embedding_node_property='embedding',\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 2. Graph Cypher Chain\n",
+    "\n",
+    "LangChain provides `GraphCypherQAChain`, a wrapper around the Neo4j graph database that uses an LLM to translate the user's question into a Cypher statement, runs that statement against the database, and uses the retrieved records to compose the answer."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from langchain.chains import GraphCypherQAChain\n",
+    "from langchain_openai import ChatOpenAI\n",
+    "\n",
+    "graph.refresh_schema()\n",
+    "\n",
+    "cypher_chain = GraphCypherQAChain.from_llm(\n",
+    "    cypher_llm=ChatOpenAI(temperature=0, model_name='gpt-4o'),\n",
+    "    qa_llm=ChatOpenAI(temperature=0, model_name='gpt-4o'),\n",
+    "    graph=graph,\n",
+    "    verbose=True,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 3. Inference by Traversing the Knowledge Graph"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Knowledge graphs excel in their ability to query and navigate the connections between entities, allowing for the retrieval of pertinent information and the discovery of new insights."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### 3.1 Sample 1"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this example, our question 'How many articles has Emma Wilson published?' will be translated into the Cypher query:\n",
+    "\n",
+    "```\n",
+    "MATCH (r:Researcher {name: \"Emma Wilson\"})-[:PUBLISHED]->(a:Article)\n",
+    "RETURN COUNT(a) AS numberOfArticles\n",
+    "```\n",
+    "\n",
+    "which matches nodes labeled `Researcher` with the name 'Emma Wilson' and traverses the `PUBLISHED` relationships to `Article` nodes.\n",
+    "It then counts the number of `Article` nodes connected to 'Emma Wilson':"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# the answer should be '5'\n",
+    "cypher_chain.invoke(\n",
+    "    {\"query\": \"How many articles has Emma Wilson published?\"}\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### 3.2 Sample 2"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this example, the question 'Are there any pairs of researchers who have published more than one article together?' results in the Cypher query:\n",
+    "\n",
+    "```\n",
+    "MATCH (r1:Researcher)-[:PUBLISHED]->(a:Article)<-[:PUBLISHED]-(r2:Researcher)\n",
+    "WHERE r1 <> r2\n",
+    "WITH r1, r2, COUNT(a) AS sharedArticles\n",
+    "WHERE sharedArticles > 1\n",
+    "RETURN r1.name, r2.name, sharedArticles\n",
+    "```\n",
+    "\n",
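+    "which traverses the `PUBLISHED` relationships from each `Researcher` to the connected `Article` nodes, and then traverses back to find other `Researcher` nodes that published the same article. Because the `WHERE r1 <> r2` filter does not impose an ordering, each pair is matched twice, once in each order."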
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# the answer should be Alice Johnson and David Miller, Alexander Lee and David Miller, Olivia Taylor and Alexander Lee, and David Miller and Alice Johnson\n",
+    "cypher_chain.invoke(\n",
+    "    {\"query\": \"Are there any pairs of researchers who have published more than one article together?\"}\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### 3.3 Sample 3\n",
+    "\n",
+    "It appears David Miller has collaborated with many peers. Let's find out whether he is the researcher with the most peer collaborations.\n",
+    "Our question 'Which researcher has collaborated with the most peers?' now results in the Cypher query:\n",
+    "\n",
+    "```\n",
+    "MATCH (r:Researcher)-[:PUBLISHED]->(:Article)<-[:PUBLISHED]-(peer:Researcher)\n",
+    "WITH r, COUNT(DISTINCT peer) AS peerCount\n",
+    "RETURN r.name AS researcher, peerCount\n",
+    "ORDER BY peerCount DESC\n",
+    "LIMIT 1\n",
+    "```\n",
+    "\n",
+    "Here, we start from all `Researcher` nodes and traverse their `PUBLISHED` relationships to find connected `Article` nodes. For each `Article` node, Neo4j then traverses back to find other `Researcher` nodes (peers) who have also published the same article. The query counts the distinct peers of each researcher and returns the researcher with the highest count."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# the answer should be 'David Miller' with 5\n",
+    "cypher_chain.invoke(\n",
+    "    {\"query\": \"Which researcher has collaborated with the most peers?\"}\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### 3.4 More Samples"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# the answer should be 'David Miller and Alice Johnson'\n",
+    "cypher_chain.invoke(\n",
+    "    {\"query\": \"Who wrote the article 'Language Model Compression for Mobile Devices'?\"}\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# the answer should be '2024'\n",
+    "cypher_chain.invoke(\n",
+    "    {\"query\": \"In which year were the most articles published?\"}\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# the answer should be Bob Smith, David Miller, Sophia Martinez, and John Robinson\n",
+    "cypher_chain.invoke(\n",
+    "    {\"query\": \"Which researchers have worked with Emma Wilson?\"}\n",
+    ")"
+   ]
   }
  ],
  "metadata": {