A Kaggle Competition Approach with Large Language Models
In the LLM 20 Questions Kaggle competition, I dove into solving the 20 Questions game with AI. The rules were simple: build an agent that asks or answers yes-or-no questions to guess a secret keyword within 20 turns using Large Language Models (LLMs). While most competitors went for mixed-strategy solutions, I wanted to see how far prompt engineering with LLMs alone could go.
The LLM Approach
Most top entries employed complex methods, like unigram-based approaches and multi-modal strategies. I deliberately avoided these and focused solely on LLM performance, aiming to mimic human-like logical deduction with pure LLM decision-making. Sticking strictly to LLMs and prompt engineering, my goal was to harness the logical prowess of the models for question-asking and answer deduction without any offline strategies.
My LLM-based strategy:
- Strategic Questioning: Using prompts to guide the AI in asking questions that would efficiently narrow down the keyword categories.
ask = (
    "Now ask the next question ({i}/20). Here are some tips you should keep in mind:\n"
    "- Start with general questions and then move to specific ones.\n"
    "- Be strategic: aim to divide the possibilities roughly in half with each question to maximize efficiency, and do not ask a question if you are quite sure of the answer.\n"
    "- Focus on gathering information; you should not guess the keyword directly.\n"
    "- Reflect on previous answers to avoid repetition and redundant questions.\n"
    "- Use contrasting pairs (e.g., 'Is it big?' vs. 'Is it small?') and don't be afraid to throw in a disjunction (Is it used for transportation OR healthcare?) to eliminate multiple possibilities at once.\n"
    "- Refrain from using phrases like 'such as' or providing examples within your questions, as they can introduce ambiguity in the responses.\n"
    "- Keep it concise: end your query with a question mark (?), no additional commentary needed."
)

ask_3no = (
    ask
    + "\nWARNING: You've hit a streak of 3 'no' answers. It might be time to rethink "
    "your approach. Try asking broader questions or consider a different angle."
)

ask_5no = (
    ask
    + "\nCAUTION: You've hit a streak of 5 'no' answers. You may be on the wrong track. "
    "Take a step back, reassess your assumptions, and try to remove any ambiguity "
    "with a new line of questioning."
)
guess = (
    "It's time to make your guesses ({i}/20). "
    "Based on the clues, provide three educated guesses without additional explanations. "
    "EXAMPLES OF ANSWER: The keyword is red. **tomato** or **red pepper**."
)
- Responsive Answering: The LLM as the responder, ensuring accurate yes-or-no answers with a strong focus on preserving question semantics.
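The answerer side isn't shown elsewhere in this post, so here is a minimal sketch of the role, reusing the same Questioner base class. The prompt wording and the Answerer name are illustrative, not my exact agent code:

answer = (
    "You know the secret keyword: **{keyword}**. "
    "Answer the question below about this keyword truthfully. "
    "Interpret the question exactly as asked, without adding assumptions.\n"
    "Reply with a single word: yes or no."
)

class Answerer(Questioner):
    def result(self, obs, cfg) -> str:
        res = super().result(obs, cfg).lower().strip()
        # Coerce any free-form reply into a strict binary answer; a
        # non yes/no response would end the game on the spot.
        return "yes" if res.startswith("yes") else "no"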
- Evolution Through Prompts: Regularly refining prompts to foster critical thinking in the LLM, ensuring dynamic problem-solving without pre-set data.
class Asker(Questioner):
    temperature: float = 0.6
    ...

    def config_state(self, obs):
        super().config_state(obs)
        i = len(obs.questions) + 1
        # Escalate the prompt when stuck on a streak of "no" answers.
        if obs.answers[-5:].count("no") == 5:
            ask_prompt = prompt.ask_5no.format(i=i)
        elif obs.answers[-3:].count("no") == 3:
            ask_prompt = prompt.ask_3no.format(i=i)
        else:
            ask_prompt = prompt.ask.format(i=i)
        self.update_state("user", ask_prompt)
    ...
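Wired together, these role classes plug into the competition's agent entry point, which the environment calls once per turn with a turnType of ask, guess, or answer. The dispatch below is a sketch, and the Guesser class is an assumed counterpart to Asker:

asker = Asker()
guesser = Guesser()    # assumed counterpart class for the guessing role
answerer = Answerer()  # the responder sketched above

def agent(obs, cfg):
    # Dispatch on the role the environment assigns for this turn.
    if obs.turnType == "ask":
        return asker.result(obs, cfg)
    if obs.turnType == "guess":
        return guesser.result(obs, cfg)
    return answerer.result(obs, cfg)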
Tech Constraints & Challenges
Working within the following constraints shaped my approach:
- Size Limits: Each question capped at 750 characters; guesses at 100 characters.
- Time Boundaries: Agents had 60 seconds per round with an extra 300 seconds as a buffer.
- Hardware: With 100 GB of disk space, 16 GB of RAM, and a T4 GPU, I had to use a smaller model, namely Llama 3.1 8B. This size limitation made LLM-only strategies more challenging compared to heuristic-based solutions.
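To stay inside the size caps, every outgoing message can be clamped defensively before it is returned; a trivial guard (the constant and function names here are mine, for illustration) looks like:

MAX_QUESTION_CHARS = 750  # competition cap on a question
MAX_GUESS_CHARS = 100     # competition cap on a guess

def clamp(text: str, limit: int) -> str:
    # Truncating is safer than submitting an over-length action.
    return text.strip()[:limit]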
Any timeout or non-binary response would end the game. This necessitated meticulous prompt crafting and robust error handling, showcased in this snippet:
import re
import time

def result(self, obs, cfg) -> str:
    start_time = time.time()
    # Retry until 80% of the per-turn time budget is spent.
    while time.time() - start_time < 0.8 * cfg.actTimeout:
        # The model emits candidate guesses separated by "|".
        results = super().result(obs, cfg).split("|")
        for res in results:
            # Normalize: lowercase, trim, drop any leading article.
            res = re.sub(r"^(a|an|the)\s+", "", res.lower().strip())
            if res not in obs.guesses and res.count(" ") < 3:
                return res  # accept the first fresh, short guess
        # Nothing usable: nudge temperature up (capped at 0.2) for variety.
        self.temperature = min(0.2, self.temperature + 0.05)
    return self.default_res()  # out of time budget: safe fallback guess
Here, timing and response constraints forced careful state management and rapid decision-making.
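The default_res() fallback isn't shown above; a plausible minimal version (hypothetical, not my exact implementation) just returns a canned, syntactically valid guess so the turn can never time out:

def default_res(self) -> str:
    # Hypothetical fallback: a short, valid guess is always better than
    # a timeout, which ends the game immediately.
    return "banana"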
Results & Insights
My LLM-only approach, while not at the top of the leaderboard, demonstrated both the potential and the current limitations of LLMs in interactive reasoning. The experience underscored the need for advancements in efficiency and context management. As LLMs improve, I expect LLM-exclusive strategies to become more competitive.
This project highlighted the balance between creative prompt engineering and hard technical constraints on model size, memory, and time, laying the groundwork for future explorations of LLM capabilities.