<template>

  <div id="body" class="font-roboto mt-5 w-3/4 m-auto text-justify">
     
    <h1 class="text-kGreen font-bold text-xl">How Search works</h1>
    <p>Written by: nazmi.tarmizi@khazanah.com.my <br>
    Date: 7th July 2022 </p>
    <div class="mt-3">



      <div class="hidden md:block">
        <h2 class="font-semibold text-left text-gray-700">Overview</h2>
        <div class="flex flex-row justify-center gap-5 font-medium ">
          <div class="flex flex-row rounded-lg p-5 bg-white my-5 drop-shadow-md">
            <svg xmlns="http://www.w3.org/2000/svg" class="h-6 w-6" fill="none" viewBox="0 0 24 24"
              stroke="currentColor" stroke-width="2">
              <path stroke-linecap="round" stroke-linejoin="round"
                d="M12 10v6m0 0l-3-3m3 3l3-3m2 8H7a2 2 0 01-2-2V5a2 2 0 012-2h5.586a1 1 0 01.707.293l5.414 5.414a1 1 0 01.293.707V19a2 2 0 01-2 2z" />
            </svg>
            <p class="text-left">1. Document Ingestion</p>
          </div>
          <div class="flex flex-row rounded-lg p-5 bg-white my-5 drop-shadow-md">
            <svg xmlns="http://www.w3.org/2000/svg" class="h-6 w-6" fill="none" viewBox="0 0 24 24"
              stroke="currentColor" stroke-width="2">
              <path stroke-linecap="round" stroke-linejoin="round"
                d="M9 12h6m-6 4h6m2 5H7a2 2 0 01-2-2V5a2 2 0 012-2h5.586a1 1 0 01.707.293l5.414 5.414a1 1 0 01.293.707V19a2 2 0 01-2 2z" />
            </svg>
            <p class="text-left">2. Document Pre-Processing</p>
          </div>
          <div class="flex flex-row rounded-lg p-5 bg-white my-5 drop-shadow-md">
            <svg xmlns="http://www.w3.org/2000/svg" class="h-6 w-6" fill="none" viewBox="0 0 24 24"
              stroke="currentColor" stroke-width="2">
              <path stroke-linecap="round" stroke-linejoin="round"
                d="M7 8h10M7 12h4m1 8l-4-4H5a2 2 0 01-2-2V6a2 2 0 012-2h14a2 2 0 012 2v8a2 2 0 01-2 2h-3l-4 4z" />
            </svg>
            <p class="text-left">3. Sentence Embeddings</p>
          </div>
          <div class="flex flex-row rounded-lg p-5 bg-white my-5 drop-shadow-md">
            <svg xmlns="http://www.w3.org/2000/svg" class="h-6 w-6" fill="none" viewBox="0 0 24 24"
              stroke="currentColor" stroke-width="2">
              <path stroke-linecap="round" stroke-linejoin="round"
                d="M19 11H5m14 0a2 2 0 012 2v6a2 2 0 01-2 2H5a2 2 0 01-2-2v-6a2 2 0 012-2m14 0V9a2 2 0 00-2-2M5 11V9a2 2 0 012-2m0 0V5a2 2 0 012-2h6a2 2 0 012 2v2M7 7h10" />
            </svg>
            <p class="text-left">4. Indexing</p>
          </div>
          <div class="flex flex-row rounded-lg p-5 bg-white my-5 drop-shadow-md">
            <svg xmlns="http://www.w3.org/2000/svg" class="h-6 w-6" fill="none" viewBox="0 0 24 24"
              stroke="currentColor" stroke-width="2">
              <path stroke-linecap="round" stroke-linejoin="round"
                d="M10 21h7a2 2 0 002-2V9.414a1 1 0 00-.293-.707l-5.414-5.414A1 1 0 0012.586 3H7a2 2 0 00-2 2v11m0 5l4.879-4.879m0 0a3 3 0 104.243-4.242 3 3 0 00-4.243 4.242z" />
            </svg>
            <p class="text-left">5. Matching in Real Time</p>
          </div>
        </div>
      </div>

      <div>
        <h2 class="font-semibold text-left text-gray-700">How <abbr title="Natural Language Processing">NLP</abbr> works
        </h2>
        <p class="my-3">Text is extracted from each PDF page by page and split into smaller chunks. This step is very
          important because the
          NLP model can process at most 512 <abbr
            title="A token is a string of contiguous characters between two spaces, or between a space and punctuation marks. A token can also be an integer, real, or a number with a colon">tokens</abbr>
          at one time. If a chunk has more than 512 tokens, it is truncated and the later part of the text
          is not searchable. <br>
          <br> By default, the extracted data is still very noisy and needs to be cleaned by removing extra spaces and
          special characters. Next, we need to identify unwanted text (e.g. disclaimer text) and remove it.
          Each type of document may need different methods for extraction and cleaning. So far, most of the documents
          that we have processed are landscape-format PowerPoint slides that contain multiple bullet points (instead of
          paragraphs with complete sentences). <br>
          <br>Cleaning the data is very important. Without clean data, the output of the NLP model will not be usable.
        </p>
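        <p class="my-3">The cleaning and chunking steps described above can be sketched roughly as follows. This is
          a minimal illustration under stated assumptions, not the code used in the pipeline: the function names
          <code>clean_text</code> and <code>chunk_text</code> are hypothetical, the set of characters that is kept is
          an arbitrary choice, and whitespace-separated words are used as a stand-in for the model's own tokens. A
          real tokenizer usually produces more tokens than there are words, so a lower limit may be safer in practice.</p>
        <pre v-pre>
```python
import re

MAX_TOKENS = 512  # limit stated in the text; words only approximate model tokens


def clean_text(raw):
    """Remove special characters and collapse extra whitespace."""
    # Assumption: keep ASCII letters, digits, whitespace and basic punctuation,
    # and replace everything else (bullets, symbols) with a space.
    kept = re.sub(r"[^A-Za-z0-9\s.,;:!?%()'-]", " ", raw)
    # Collapse runs of spaces, tabs and newlines into a single space.
    return re.sub(r"\s+", " ", kept).strip()


def chunk_text(text, max_tokens=MAX_TOKENS):
    """Split text into chunks of at most max_tokens whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]


# Invented sample page text with extra spaces and a bullet character.
page = "  Revenue   grew 12% \u2022 year on year.\n\n  Costs were flat.  "
cleaned = clean_text(page)
chunks = chunk_text(cleaned, max_tokens=5)  # tiny limit so the split is visible
print(cleaned)
print(chunks)
```
        </pre>
        <p class="my-3">Splitting on a fixed word count can cut a sentence in half. Splitting on sentence or bullet
          boundaries first and then grouping them up to the limit is a possible refinement.</p>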


        <p class="my-3">
          For words to be processed by machine learning models, they need some form of numeric representation that
          the models can use in their calculations.
          We are using the <a class="underline text-blue-500"
            href="https://huggingface.co/sentence-transformers/msmarco-distilbert-cos-v5">Sentence Transformers
            DistilBERT</a> NLP model to convert each sentence into a vector in a high-dimensional space (768 dimensions).
          The NLP model was trained by other researchers using the <a class="underline text-blue-500"
            href="https://microsoft.github.io/msmarco/">MS MARCO dataset</a> from Microsoft, a collection that includes
          a question answering dataset with 1 million questions, a natural language generation dataset, a passage
          ranking dataset, a keyphrase extraction dataset, a crawling dataset, and a conversational search dataset.

          <br>
          <br> Using a loss function such as
          softmax loss,
          multiple negatives ranking loss, or MSE margin loss, these models are optimized to produce similar embeddings
          for similar sentences, and dissimilar embeddings otherwise. <br>
          <br> The picture below shows the high-level architecture of how sentence transformers are trained.
          <br>
        <div class="rounded-lg p-5 bg-white my-5 drop-shadow-md">
          <img class="m-auto" src="../assets/SentenceTransformerDiagram.svg" alt="">
          <p>Refer to the research paper: <a class="underline text-blue-500"
              href="https://arxiv.org/pdf/1908.10084.pdf">Sentence-BERT: Sentence Embeddings using Siamese
              BERT-Networks</a></p>
        </div>

        The output of the <a class="underline text-blue-500"
          href="https://huggingface.co/sentence-transformers/msmarco-distilbert-cos-v5">NLP model</a> that we are using
        for query matching is a numerical representation of each sentence, which is then used to calculate the
        similarity between sentences with cosine similarity. Because similar sentences are
        close together in the vector space, the higher the cosine similarity, the more similar the pair.

        <br>The animation below visualizes <a class="underline text-blue-500"
          href="https://en.wikipedia.org/wiki/Word2vec">Word2Vec</a> word embeddings after dimension reduction using
        UMAP.

        <div class="rounded-lg p-5 bg-white my-5 drop-shadow-md">
          <img class="m-auto" src="../assets/word2vecumap.gif" alt="">
          <p>Source: <a class="underline text-blue-500"
              href="https://projector.tensorflow.org/">https://projector.tensorflow.org/</a></p>
        </div>

        <br>
        Cosine Similarity Formula
        <div class="rounded-lg p-5 bg-white my-5 drop-shadow-md">
          <img class="m-auto" src="../assets/cosformula.png" alt="">

        </div>
        u = sentence embedding of the search query, v = sentence embedding of a text from the database.
        <br>Cosine similarity can also be used for recommendation systems.
        </p>
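        <p class="my-3">To make the cosine similarity calculation concrete, here is a small sketch in plain Python.
          It is an illustration only: the helper names <code>cosine_similarity</code> and <code>rank_texts</code>
          are hypothetical, and the short toy vectors stand in for the 768-dimensional embeddings that the model
          would produce for the search query (u) and for each text in the database (v).</p>
        <pre v-pre>
```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def rank_texts(query_vec, text_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(text_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real sentence embeddings.
u = [1.0, 0.0, 1.0]          # search query
database = [
    [0.0, 1.0, 0.0],         # orthogonal to u, so unrelated
    [2.0, 0.0, 2.0],         # same direction as u, so a very close match
    [1.0, 1.0, 0.0],         # partly similar
]
ranking = rank_texts(u, database)
print(ranking)
```
        </pre>
        <p class="my-3">Because cosine similarity depends only on the direction of the vectors, the second database
          entry ranks first even though it is twice as long as the query vector.</p>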
      </div>
      <div>
        <h2 class="font-semibold text-left text-gray-700 mt-5">Architecture</h2>
        <div class="rounded-lg p-5 bg-white my-5 drop-shadow-md">
          <img class="m-auto" src="../assets/LightHouseArchitecture.svg" alt="">
        </div>

      </div>

      <h2 class="font-semibold text-left text-gray-700">Future Work</h2>
      <p>Fine-tuning the model</p>
      <p class="ml-2 "> 
        <br> Fine-tuning matters because it pushes the score down for bad matches and up for good
        matches.
        <br> The data that we need:
        <br>1. Search Text
        <br>2. Matched Sentences
        <br>3. Label (Good Match, Bad Match)
        <br>
      </p>
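      <p class="ml-2 my-3">One possible way to hold the three fields listed above is sketched below. Everything in
        it is an assumption made for illustration: the class name <code>LabeledPair</code>, the label values
        <code>"good"</code> and <code>"bad"</code>, the mapping of labels to target scores of 1.0 and 0.0, and the
        two invented sample records.</p>
      <pre v-pre>
```python
from dataclasses import dataclass

VALID_LABELS = ("good", "bad")


@dataclass
class LabeledPair:
    """One fine-tuning record: a query, a matched sentence, and a label."""
    search_text: str
    matched_sentence: str
    label: str  # "good" or "bad"

    def __post_init__(self):
        # Reject labels outside the two expected values early.
        if self.label not in VALID_LABELS:
            raise ValueError("label must be 'good' or 'bad'")


def to_target_score(pair):
    """Map a label to a target similarity score (assumed 1.0 / 0.0)."""
    return 1.0 if pair.label == "good" else 0.0


# Invented records, for illustration only.
pairs = [
    LabeledPair("revenue growth", "Revenue grew 12% year on year.", "good"),
    LabeledPair("revenue growth", "The office moved to a new floor.", "bad"),
]
targets = [to_target_score(p) for p in pairs]
print(targets)
```
      </pre>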

      <h2 class="font-semibold text-left text-gray-700 my-5">Other Applications of NLP</h2>
      <p class="my-2">Rating agencies such as Moody's and S&amp;P have also started using NLP models to develop a Credit
        Sentiment Score.</p>
    </div>
  </div>
</template>

<script>
export default {
  name: "ApproachView",
  props: {},
  components: {},
};
</script>
