
Transforming Unstructured Documents to Standardized Formats with GPT: Building a Resume Parser


Automate day-to-day operations with Generative AI

Among its numerous applications, GPT has become a game-changer in the processing and standardization of unstructured documents.

In this blog post, we'll explore how you can convert unstructured documents, specifically resumes, into a standardized format using GPT.

Background

Resumes come in various shapes and sizes, with no two being exactly alike. This presents a unique challenge for recruiters who need to sift through hundreds or even thousands of resumes to identify suitable candidates.

As you can see, a quick Google search returns resumes in various designs and formats.

This is a well-structured resume, but extracting the text from the PDF file produces unstructured text that loses the original formatting.

The resume above uses a month/year format for dates, but other resumes may use different date formats such as day/month/year or only the year. These variations make parsing resumes challenging, since it is difficult to account for every possible case.
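For contrast, a traditional parser has to enumerate every date format it might encounter. A minimal standard-library sketch (the format list here is illustrative and necessarily incomplete):

```python
from datetime import datetime

# Candidate formats: month/year, day/month/year, year only, "Jan 2020", etc.
FORMATS = ["%m/%Y", "%d/%m/%Y", "%Y", "%b %Y"]

def parse_date(text):
    """Try each known format in turn; return None when nothing matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

print(parse_date("06/2020"))    # matched by %m/%Y
print(parse_date("Fall 20xx"))  # None: no hand-written pattern covers this
```

Every new resume style means another format string, and free-text dates like 'Fall 20xx' defeat this approach entirely, which is exactly the gap GPT fills.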

Code

import io

import PyPDF2

def extract_text_from_binary(file):
    """Extract text from a PDF supplied as raw bytes."""
    pdf_data = io.BytesIO(file)
    reader = PyPDF2.PdfReader(pdf_data)
    text = ""
    # Concatenate the extracted text of every page
    for page in reader.pages:
        text += page.extract_text()
    return text

First, we need to extract the text from the PDF. We can use the PyPDF2 library for this.

To call the OpenAI API, we use LangChain. LangChain is a community-driven framework for developing applications powered by language models. It streamlines development by taking care of tedious tasks under the hood.

from langchain.llms import OpenAIChat
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
from langchain import LLMChain, PromptTemplate

template = """Format the provided resume to this YAML template:
    ---
    name: ''
    phoneNumbers:
    - ''
    websites:
    - ''
    emails:
    - ''
    dateOfBirth: ''
    addresses:
    - street: ''
      city: ''
      state: ''
      zip: ''
      country: ''
    summary: ''
    education:
    - school: ''
      degree: ''
      fieldOfStudy: ''
      startDate: ''
      endDate: ''
    workExperience:
    - company: ''
      position: ''
      startDate: ''
      endDate: ''
    skills:
    - name: ''
    certifications:
    - name: ''

    {chat_history}
    {human_input}"""

prompt = PromptTemplate(
    input_variables=["chat_history", "human_input"],
    template=template,
)

memory = ConversationBufferMemory(memory_key="chat_history")

llm_chain = LLMChain(
    llm=OpenAIChat(model="gpt-3.5-turbo"),
    prompt=prompt,
    verbose=True,
    memory=memory,
)

def format_resume(resume):
    # Send the extracted resume text through the chain
    return llm_chain.predict(human_input=resume)

Here we are asking GPT to structure the data into YAML format. I find YAML easy to read, so I chose it for this task. There have also been reports that special characters such as {} in the prompt do not get processed correctly, which makes brace-heavy formats like JSON awkward to use in templates.
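The brace problem is easy to see with the standard library alone: LangChain's default templates use f-string-style placeholders, so plain `str.format` is a fair stand-in for the substitution step (a sketch; the template strings are made up for illustration):

```python
# A JSON schema inside the template: its braces look like placeholders.
json_template = 'Fill this template: {"name": ""}. Resume: {human_input}'
# The equivalent YAML schema has no braces at all.
yaml_template = "Fill this template:\nname: ''\nResume: {human_input}"

try:
    json_template.format(human_input="...")
except KeyError as exc:
    # The literal {"name": ""} is misread as a substitution field.
    print("JSON template failed:", exc)

print(yaml_template.format(human_input="..."))  # substitutes cleanly
```

This is one practical reason to prefer YAML for the target schema when the prompt itself is a template.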

If you want GPT to follow the provided structure as closely as possible, set the temperature to 0 (e.g. `OpenAIChat(model="gpt-3.5-turbo", temperature=0)`).

I found gpt-3.5-turbo to perform very well and to offer the best value for the cost.

And that's it! Very simple.

Result

---
name: 'IM A . SAMPLE X'
phoneNumbers:
- '308-308-3083'
websites:
- ''
emails:
- 'imasample10@xxxx.net'
dateOfBirth: ''
addresses:
- street: '3083 North South Street, Apt. A -1'
  city: 'Grand Island'
  state: 'Nebraska'
  zip: '68803'
  country: ''
summary: 'Seeking Position in Human/Social Service Administration or related
  field utilizing strong academic background, experience and excellent
  interpersonal skills'
education:
- school: 'Bellevue University'
  degree: 'BS in Human & Social Service Administration'
  fieldOfStudy: ''
  startDate: 'Jan 20xx'
  endDate: ''
- school: 'Central Community College - Hastings Campus'
  degree: 'AAS in Human Services'
  fieldOfStudy: ''
  startDate: 'Dec 19xx'
  endDate: ''
- school: ''
  degree: '75-Hr Basic Nursing Assistant Program'
  fieldOfStudy: ''
  startDate: 'Jan 20xx'
  endDate: ''
workExperience:
- company: 'Greater NE Goodwill Industries'
  position: 'Day Rehabilitation Specialist'
  startDate: 'June 20xx'
  endDate: 'Present'
- company: 'Tiffany Square Care Center'
  position: 'Assistant Receptionist'
  startDate: 'Jan 20xx'
  endDate: 'June 20xx'
- company: 'Central NE Goodwill Industries'
  position: 'Employment Trainer'
  startDate: 'Aug 19xx'
  endDate: 'May 20xx'
- company: 'Crisis Center Inc & Family Violence Coalition'
  position: 'Criminal Justice/Shelter Advocate'
  startDate: 'July 20xx'
  endDate: 'Oct 20xx'
- company: 'Tiffany Square Care Center'
  position: 'Social Services Assistant'
  startDate: 'Jan 20xx'
  endDate: 'Sept 20xx'
skills:
- name: ''
certifications:
- name: ''
communityService:
- organization: 'Women''s Health Services & Resource Center'
  role: 'Volunteer'
  startDate: 'Fall 20xx'
  endDate: 'Present'
  responsibilities:
  - 'Assisted professional staff and participated in one-on-one discussions with women seeking advice on health-related issues'
  - 'Observed group training sessions to develop the skills needed to facilitate groups in the future'

GPT managed to parse the resume with human-level accuracy and precision, even handling the work history perfectly. In the sample PDF we used, GPT was also able to correctly interpret '20xx' as a placeholder year in the dates section, something that is often challenging for computers.

It's worth noting that GPT's flexibility allowed it to create a new field for community service, even though it wasn't part of the provided YAML template.
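Depending on your pipeline, that flexibility may be a feature or a bug. A small sketch using PyYAML to flag top-level keys the model invented beyond the template (the template keys are taken from this post; the sample string is abbreviated for illustration):

```python
import yaml  # PyYAML

# Top-level keys from the YAML template used in the prompt above
TEMPLATE_KEYS = {
    "name", "phoneNumbers", "websites", "emails", "dateOfBirth", "addresses",
    "summary", "education", "workExperience", "skills", "certifications",
}

def extra_fields(yaml_text):
    """Return the top-level keys in the model's output beyond the template."""
    data = yaml.safe_load(yaml_text)
    return set(data) - TEMPLATE_KEYS

sample = "name: 'IM A . SAMPLE X'\ncommunityService:\n- organization: 'Shelter'"
print(extra_fields(sample))  # {'communityService'}
```

You could then decide whether to keep, drop, or re-prompt on the extra fields.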

Achieving this level of flexibility is usually challenging with traditional programming methods. Parsing unstructured documents like resumes can be very costly and time-consuming due to the vast number of patterns that need to be accounted for.

GPT demonstrated that it can be incredibly useful for transforming unstructured data into a structured format, given its low cost, high accuracy, and scalability. I am interested in exploring GPT further for data conversion.
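If a downstream system prefers JSON, the YAML output is easy to re-serialize with PyYAML (a sketch, assuming the model returned valid YAML; the sample string is abbreviated):

```python
import json

import yaml  # PyYAML

def yaml_to_json(yaml_text):
    """Parse the model's YAML output and re-serialize it as JSON."""
    return json.dumps(yaml.safe_load(yaml_text), indent=2)

sample = "name: 'IM A . SAMPLE X'\nskills:\n- name: ''"
print(yaml_to_json(sample))
```

This keeps the brace-free YAML in the prompt while still handing structured JSON to the rest of the stack.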

DEMO

I put together a simple DEMO! I'm using FastAPI for the backend, deployed on Render hosting.

You can upload any PDF and have GPT parse the resume. You will need to wait a few minutes for the PDF to be processed.

Hope you enjoyed it! I will be blogging more about using Language Models for applications so watch this space!

Comments

S: The full code is not available. Please provide the full code.

K: The code provided is not complete. For example, you are extracting the text from the PDF, but you have not included the call to that function in your code. For completeness of the article and for first-time learners, I would suggest sharing the whole code or providing a link to the full, structured code.

Yuvayogi (2y ago): I want the full code for this!
D: Can you please provide the full code for new learners?

K: Excellent article, illustrated in a very simple manner and very practical as well. I liked it.

R: OpenAIChat is deprecated; we need to go with ChatOpenAI:

chatopenai = ChatOpenAI(model_name="gpt-3.5-turbo")
llmchain_chat = LLMChain(llm=chatopenai, prompt=prompt)

R: I'm not sure why you decided to go with the conversation-memory-based approach, but the good news is that it is not required for this use case. Dropping it saves cost by reducing the token count and improves overall performance by asking only for what is required.

R: Please try to flatten the resume input. Converting multi-line text to a single line will save a ton of tokens, and therefore cost, with the OpenAI GPT-based models.

R: The YAML template specified in "template" can be simplified by converting it from multi-line to a single line, say by using online converters. It makes a lot of difference, as each space and newline adds to the token count. You don't want to exceed the token limit or spend money unnecessarily.

R: Great post. One small suggestion: return the response as JSON using PyYAML.
Reo Ogusu (2y ago): Thanks! Yeah, I could try that.
