Create a Question Answering System in Python using NLP

Introduction

Imagine a machine answering all your questions: you throw different kinds of questions at it, and it answers every one of them without getting tired. In this case, the machine is nothing but a computer. This is not unrealistic at all; today’s advanced technology has made it possible, and Artificial Intelligence continues to play one of the key roles in this field.

Here, we will create a simple Question Answering System in Python using Natural Language Processing (NLP) which will be able to answer our questions using its own intelligence within a certain range.

The reason I said “a certain range” is that we will train our model on data about some specific topics collected from the web. The program will then take a query from us, interpret it, and return the most relevant result based on that pre-collected data.

That was the introductory part; let’s move on to the project details and start talking about our main topic: a Question Answering System in Python using NLP.

Visit Also: Create a Sentiment Analysis Project using Python

The Project Details

Since NLP (Natural Language Processing) is the main topic of this project, let’s discuss this topic briefly.

What is Natural Language Processing?

Natural Language Processing (NLP) is a subfield of computer science and artificial intelligence (AI) that focuses on enabling computers to understand, interpret, and generate human language. It involves the development of algorithms and models that allow computers to analyze, understand, and produce natural language text or speech.

NLP combines linguistics, computer science, and mathematics to create computational models that can analyze human language. It includes a wide range of tasks, such as sentiment analysis, text classification, machine translation, speech recognition, named entity recognition, and language generation. These tasks enable computers to extract meaning from unstructured text data and interact with humans in a more natural and intuitive way.

NLP is used in various applications, including virtual assistants, chatbots, search engines, recommendation systems, and sentiment analysis tools. The ultimate goal of NLP is to create machines that can communicate and interact with humans in a way that is indistinguishable from human-to-human communication.

How to build this project?

Hearing words like Artificial Intelligence and NLP (Natural Language Processing), you might think this is going to be a complicated topic, but trust me: after covering the complete tutorial, you won’t think so. First, we need to train our model with some data. But how, right?

To keep the overall discussion simple, I’ve chosen some popular topics, collected data related to them from Wikipedia pages, and stored it in separate text files, one per topic. This is our corpus data.

For example, in the main project file, you will find these text files as the corpus of documents: ‘kevin_mitnick.txt’, ‘linux.txt’, ‘microsoft.txt’, ‘python.txt’, and ‘trump.txt’. You can add more data or modify the existing files, but remember one thing: each document must be a plain text file. These files need to be placed in a folder, and the name of that folder must be specified when running the program (you will get more information on this later).

Next, we’ll create a Python program. First, we will perform the Document Retrieval operation. Here, we will read the data from those text documents and determine which document(s) are most relevant to the user’s query (the query must be in English). We will use tf-idf to rank the documents.
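
To make the scoring concrete, here is a minimal, self-contained sketch of tf-idf ranking on a toy corpus. The documents and words below are illustrative only; they are not part of the project code.

import math

# A toy corpus: document name -> list of words.
docs = {
    "a.txt": ["python", "is", "a", "programming", "language"],
    "b.txt": ["linux", "is", "an", "operating", "system"],
}
query = {"python", "language"}

# idf = ln(total documents / documents containing the word).
def idf(word):
    containing = sum(1 for words in docs.values() if word in words)
    return math.log(len(docs) / containing)

# A document's tf-idf score is the sum, over the query words it
# contains, of (term frequency in the document) * idf(word).
for name, words in docs.items():
    score = sum(words.count(w) * idf(w) for w in query if w in words)
    print(name, score)  # "a.txt" scores higher for this query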

Our next operation will be Passage (sentence) Retrieval, where the most relevant document(s) will be split into sentences to determine which ones best match the user’s query. To do this, we will use a combination of idf (inverse document frequency) and query term density.
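
Query term density is simply the proportion of a sentence’s words that also appear in the query; it is used to break ties between sentences with equal idf scores. A minimal illustration (the sentence and query here are made up):

sentence_words = ["python", "was", "created", "by", "guido", "van", "rossum"]
query = {"who", "created", "python"}

# Two of the sentence's seven words appear in the query.
matches = sum(sentence_words.count(word) for word in query)
density = matches / len(sentence_words)
print(density)  # 2/7, roughly 0.29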

The entire program is transparent and well organized (comment lines will help you understand the objective of each important segment of the code), but before moving there, it’s important to briefly understand the driver, or main, function of our program.

The Main Function

We will go through the following steps in the main function.

  • Step 1: We will load the text files from the directory where they are located (in our case, the directory name is ‘corpus’) using the load_files function.
  • Step 2: Each of the files will be tokenized into a list of words using the tokenize function.
  • Step 3: We will compute the inverse document frequency (idf) of each word using the compute_idfs function.
  • Step 4: Next, we will take the input query from the user.
  • Step 5: The top_files function will find the most relevant files matching the user’s query.
  • Step 6: At last, the top sentences that best match the query will be extracted from those top-matching files.

We have to define these functions separately: load_files, tokenize, compute_idfs, top_files, and top_sentences. Don’t worry: as I said earlier, I have tried to keep the overall code as simple as possible, and the functions are easy to understand. You just need a little patience and attention.

Requirements

We’re so close to our source code, but before we get there let’s see what you need to install beforehand.

NLTK

NLTK, or the Natural Language Toolkit, is a very powerful Python library for analyzing unstructured data (the amount of data can be small or huge). It offers several classes and methods for performing various operations on such data, which we can use to build a model that can read and understand human language.

Install NLTK: pip install nltk

Install NLTK Data

Since we are going to perform tokenization, we have to install some additional resources before we proceed (the wordnet resource is included here in case you want to add lemmatization later). Let’s do all these tasks first so that no errors occur when running the program later.

Open your Python interpreter and run the following lines of code (here we will download two resources: ‘punkt’ and ‘wordnet’).

import nltk
nltk.download('punkt')
nltk.download('wordnet')
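
Note: depending on your NLTK version, the tokenizer may ask for one more resource. If you see a LookupError mentioning ‘punkt_tab’ when running the program later, download it the same way:

nltk.download('punkt_tab')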

Source Code

Don’t start copying the source code yet; pause here and do the following:

  1. Download the project file directly through the Download Button below.
  2. Unzip the zip file.
  3. Now create a Python file in the project folder with this name: ‘question.py’.

Now you are ready to copy the entire code. So, do that and paste it into the ‘question.py’ program file. 

To run the program, type the command python question.py corpus, where ‘question.py’ is the program file and ‘corpus’ is the directory where the text document files are stored (if you are using Linux, use python3 instead of python).

You will also find a directory called ‘small’ in there, which contains two text files with a small amount of text data. It may help you see how the code handles the data.

import nltk
import sys
import os
import string
import math
from nltk.corpus import stopwords


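# How many top files and top sentences to return
# for each query.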
FILE_MATCHES = 1
SENTENCE_MATCHES = 1

# Getting the stopwords and storing them into 
# a set variable.
stop_words = set(stopwords.words("english"))


def main():
    # Check command-line arguments
    if len(sys.argv) != 2:
        sys.exit("Usage: python question.py corpus")

    # Calculate IDF values across files
    files = load_files(sys.argv[1])
    file_words = {
        filename: tokenize(files[filename])
        for filename in files
    }
    file_idfs = compute_idfs(file_words)


    # Prompt user for query
    query = set(tokenize(input("Query: ")))

    # Determine top file matches according to TF-IDF
    filenames = top_files(query, file_words, file_idfs,
                          n=FILE_MATCHES)


    # Extract sentences from top files
    sentences = dict()
    for filename in filenames:
        for passage in files[filename].split("\n"):
            for sentence in nltk.sent_tokenize(passage):
                tokens = tokenize(sentence)
                if tokens:
                    sentences[sentence] = tokens

    # Compute IDF values across sentences
    idfs = compute_idfs(sentences)

    # Determine top sentence matches
    matches = top_sentences(query, sentences, idfs,
                            n=SENTENCE_MATCHES)
    print()
    for match in matches:
        print(match)


def load_files(directory):
    """
    Given a directory name, return a dictionary mapping 
    the filename of each `.txt` file inside that 
    directory to the file's contents as a string.
    """
    file_dict = {}

    for file in os.listdir(directory):
        with open(os.path.join(directory, file),
                  encoding="utf-8") as f:
            file_dict[file] = f.read()

    return file_dict


def tokenize(document):
    """
    Given a document (represented as a string), return 
    a list of all of the words in that document, in order.

    Process the document by converting all words in 
    lowercase, and removing any punctuation or English stopwords.
    """
    tokenized_data = nltk.tokenize.word_tokenize(document.lower())

    final_data = list()
    
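    # Keep only words that are neither stopwords nor punctuation.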
    for item in tokenized_data:
        if item not in stop_words and item not in string.punctuation:
            final_data.append(item)

    return final_data


def compute_idfs(documents):
    """
    Given a dictionary of `documents` that maps names of
    documents to a list of words, return a dictionary that
    maps words to their IDF values.

    Any word that appears in at least one of the documents 
    should be in the resulting dictionary.
    """
    idf = dict()

    document_len = len(documents)

    all_words = set(sum(documents.values(), []))

    for word in all_words:
        count = 0
        for doc_values in documents.values():
            if word in doc_values:
                count += 1

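        # idf = ln(total documents / documents containing the word)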
        idf[word] = math.log(document_len/count)

    return idf


def top_files(query, files, idfs, n):
    """
    Given a `query` (a set of words), `files` 
    (a dictionary mapping names of files to a 
    list of their words), and `idfs` (a dictionary 
    mapping words to their IDF values), return a 
    list of the filenames of the `n` top files that 
    match the query, ranked according to tf-idf.
    """
    scores_dict = dict()
    for filename, filedata in files.items():
        file_score = 0
        for word in query:
            if word in filedata:
                file_score += filedata.count(word) * idfs[word]
        if file_score != 0:
            scores_dict[filename] = file_score

    sorted_list = list()

    for key, value in sorted(scores_dict.items(),
                             key=lambda v: v[1], reverse=True):
        sorted_list.append(key)
    
    return sorted_list[:n]


def top_sentences(query, sentences, idfs, n):
    """
    Given a `query` (a set of words), `sentences` 
    (a dictionary mapping sentences to a list of their words), 
    and `idfs` (a dictionary mapping words to their IDF values), 
    return a list of the `n` top sentences that match
    the query, ranked according to idf. If there are ties, 
    preference should be given to sentences that have a higher 
    query term density.
    """
    scores = dict()
    for sentence, words in sentences.items():
        sent_score = 0
        for word in query:
            if word in words:
                sent_score += idfs[word]

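        # Break ties by query term density: the fraction of the
        # sentence's words that appear in the query.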
        count = 0
        if sent_score != 0:
            for word in query:
                count += words.count(word)
            density = count / len(words)
            scores[sentence] = (sent_score, density)
    
    sorted_list = list()

    for key in sorted(scores.keys(),
                      key=lambda v: scores[v], reverse=True):
        sorted_list.append(key)
    
    return sorted_list[:n]


if __name__ == "__main__":
    main()

Output

Here, I asked the AI a series of questions. Let’s see what results it gave us.

Output 1

Query: Who is Kevin Mitnick?

Kevin David Mitnick (born August 6, 1963) is an American computer security consultant, author, and convicted hacker.

Output 2

Query: Who founded Microsoft?

Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975, to develop and sell BASIC interpreters for the Altair 8800.

Output 3

Query: What are popular Linux distributions?

Popular Linux distributions include Debian, Fedora Linux, and Ubuntu, which in itself has many different distributions and modifications, including Lubuntu and Xubuntu.

Output 4

Query: Who created Python Programming Language?

Created by Guido van Rossum and first released in 1991, Python’s design philosophy emphasizes code readability with its notable use of significant whitespace.

Output 5

Query: In which year python 2.0 released?

Python 2.0, released in 2000, introduced features like list comprehensions and a garbage collection system capable of collecting reference cycles.

Output 6

Query: Who is Donald Trump?

Donald Trump (full name Donald John Trump) is an American politician, media personality, and businessman who served as the 45th president of the United States from 2017 to 2021.

Output 7

Query: Trump’s children.

He had three children: Donald Jr. (born 1977), Ivanka (born 1981), and Eric (born 1984).

Output Video

Watch the entire video to understand how the project actually works.

Summary

In this lesson, we built a Question Answering System in Python with the help of Natural Language Processing, an important part of Artificial Intelligence. We trained our model on data collected from Wikipedia articles on a few chosen topics.

You are welcome to add more topics to this Question Answering System or modify the existing ones. Just keep one thing in mind: each document must be a text file in the corpus directory. For any query related to this topic, please don’t hesitate to leave a comment below. You will get a reply soon.
