Resolved: Query on the Project - Create a Q&A Chatbot with LangChain Project

Question

Hi Instructor,

I hope you're doing well.

I'm working on the first step, loading the course transcript. I have written the code up to string_list_concat, but this variable is concatenated in such a way that the MarkdownHeaderTextSplitter cannot split on the headers with one # and two ##, as the entire text ends up in one string. I'm sharing the code below and would appreciate your help with it.

Thank you,

# LangChain Q&A ChatBot Project
from langchain_community.document_loaders import PyPDFLoader

from langchain_text_splitters import (MarkdownHeaderTextSplitter, 
                                      TokenTextSplitter)

from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import HumanMessage, SystemMessage

from langchain_core.prompts import (PromptTemplate,
                                    HumanMessagePromptTemplate, 
                                    ChatPromptTemplate)

from langchain_core.runnables import (RunnablePassthrough, 
                                      RunnableLambda, 
                                      chain)
from langchain_openai import ChatOpenAI

from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_chroma import Chroma 

# Load the Tableau Course Transcript
loader_pdf = PyPDFLoader('./Introduction_to_Tableau.pdf') 
docs_list = loader_pdf.load()
docs_list[0] 

' '.join(docs_list[0].page_content.split()) 

# Concatenate the page contents into a single string
string_list_concat = "".join([" ".join(i.page_content.split()) for i in docs_list]) 

md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on = [("#", "Section Title"),
                                                                ("##", "Lecture Title")]) 
docs_list_md_split = md_splitter.split_text(string_list_concat) 

docs_list_md_split  # Returns an empty list; I believe the concatenation above was not done properly

Answer 1

Hey,

Thank you for reaching out and for engaging with the project!

Please note the argument passed to the join() function when defining the string_list_concat variable:

string_list_concat = "".join([i.page_content for i in docs_list])

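This code extracts the page content of each document in docs_list and joins the pieces into a single string. Unlike your original version, it leaves the newline characters intact, so each header stays on its own line where the splitter can detect it. Applying this change to your code should resolve the issue.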
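
To see why the join() argument matters, here is a minimal sketch that uses only the standard library. The sample pages and the find_header_lines helper are hypothetical stand-ins: the helper only mimics the line-by-line way a Markdown header splitter looks for headers and is not LangChain's implementation.

```python
# Hypothetical sample pages standing in for docs_list[i].page_content
pages = [
    "# Section One\n## Lecture A\nWelcome to the course.\n",
    "## Lecture B\nMore transcript text.\n",
]


def find_header_lines(text):
    """Return lines that start with '# ' or '## ' (illustrative helper only)."""
    return [line for line in text.split("\n") if line.startswith(("# ", "## "))]


# Original approach: split() collapses all whitespace, including newlines
collapsed = "".join([" ".join(page.split()) for page in pages])

# Suggested approach: keep each page's content as-is
preserved = "".join([page for page in pages])

print(find_header_lines(collapsed))
print(find_header_lines(preserved))
```

In the collapsed string the whole transcript sits on a single line, so a line-based check treats everything as one header with no content beneath it. In the preserved string each header is on its own line and is picked up separately.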

Let me know if you need further assistance!

Kind regards,
365 Hristina

Answer 2

Hey Hristina,

Thank you so much for answering my question. However, how do we remove the \n newline characters from a single-string document?

Kindly assist me.

Thank you,
Sharieff

Answer 3

Hey again!

Please refer to the lecture "Indexing: Document Loading with PyPDFLoader" from the "Retrieval Augmented Generation (RAG)" section of the "Build Chat Applications with OpenAI and LangChain" course.

However, please note that removing the newline characters is not part of the task. In the third section of the project titled "Create a Chain to Correct the Course Transcript", you'll be tasked with using an LLM to structure the text appropriately.
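
For reference, here is a minimal sketch of two common ways to strip newline characters from a single string in plain Python. The sample text is hypothetical, and this is not necessarily the approach shown in the lecture.

```python
# Hypothetical sample text containing single and repeated newlines
text = "Tableau is a\ndata visualization\n\ntool."

# Option 1: replace each newline with a space
# (repeated newlines leave repeated spaces behind)
replaced = text.replace("\n", " ")

# Option 2: split on any run of whitespace and rejoin with single spaces
normalized = " ".join(text.split())

print(repr(replaced))
print(repr(normalized))
```

Option 2 also collapses tabs and repeated spaces, which is why it removes the line structure that a header-based splitter relies on.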

Let me know if I can assist further!

Kind regards,
365 Hristina

Answer 4

Got it, thank you so much. I thought removing the newline characters was part of the task, which is what caused the problem.

Once again thank you so much Hristina. My query is now resolved.

Answer 5

Hristina Hristova (Instructor)
Posted on: 29 Jan 2025

Happy to help! Enjoy the project!

Kind regards,
365 Hristina
