import React, { useState } from 'react';
import ReactMarkdown from 'react-markdown';
import Header from './components/Header';
import { Helmet } from 'react-helmet';



const TechnicalReportPage = () => {
  const reportContent4 = `
## Part 4 (2024-11-05)

## SWE-Bench

### Results

We achieved 49.2% on SWE-Bench Verified benchmark.

### Execution Environment

The execution environment remains the same as in the previous submission.

### Workflow

In our previous workflow, passing context between agents sometimes led to the loss of essential details. To address this, we combined the localization and fixing steps into a single agent. Additionally, we introduced a basic context management algorithm to streamline operations.

#### Localizer & Fixer agent

This new combined agent has access to three main tools: Edit Code, Create Code, Run Bash.

To prevent the agent from hitting its context window limit, we implemented a "first in, first out" (FIFO) strategy. This approach removes the oldest tool calls and outputs when necessary to maintain context within the limit.

#### Scorer

We retained the Scorer agent from the previous submission, but with a key difference: it no longer votes on solutions. Instead, it now reads all solutions and selects the best one directly.

### Future

We're encouraged by the progress made so far and plan to continue exploring new strategies to enhance the agent's reasoning abilities.`;

  const reportContent3 = `
## Part 3 (2024-10-30)

## SWE-Bench

### Results

We achieved 41.60% on SWE-Bench Verified benchmark.

### Execution Environment

The execution environment remains the same as in the previous submission.

### Workflow

We attempted to improve the validator agents - reproducer, tester, and code reviewer - but couldn't surpass our previous score on the SWE-Bench Verified Benchmark. Therefore, this time, we focused solely on patch generation rather than on patch validation agents.

We are reusing two agents from previous submission: Localizer and Code Editor.

We updated the Fixer agent and added a Scorer Agent.

#### Fixer

This agent has only one tool, unlike previous fixer agents: the "edit code" tool, which forwards edit instructions directly to the Code Editor agent.

The main improvement is that the fixer agent now spends time planning its changes before executing them. It also reviews its actions afterward and can make further edits if necessary.

#### Scorer

The Scorer Agent is assigned with evaluating the samples. The trigger to make agent stop generating samples is when total score of all identical samples stack up with each other and reach the threshold.

Our initial prompts were rated too highly, which led to the threshold being incorrectly triggered. To evaluate the solutions, we experimented with several different criteria and identified those that best fit our task.

We believe that allowing an LLM to assign scores based on criteria that rely on the LLM's own judgment (such as "Understanding the issue") is not the best practice.

Additionally, the maximum number of samples is set to 5. If, at the end of the workflow, there are 2 or more solutions with the same score, the selection will be made by the LLM based on specific criteria.

### Future

We attempted a pass@5 test with all five samples and achieved a 54.2% success rate. Next, we plan to explore alternative methods to enhance the scorer agent.`;

  const reportContent2 = `
## Part 2 (2024-10-07)

## SWE-Bench

### Results

We achieved 31.6% on SWE-Bench Verified benchmark.

- Median token usage: ~370K tokens
- Median execution time: ~5 minutes 33 seconds

### Execution Environment

The execution environment remains the same as in the previous submission.

### Workflow

This time, we're using a completely different approach by giving the agent full control of the workflow.

1. Before reading or writing any code, we use the Censorer agent to remove redundant information from the task. 
2. Next, we try to reproduce the issue using the Reproducer agent, which prepares a script that replicates the problem.
3. Finally, we ask our agent to solve the task using these tools: reproduce (runs the script), review (reviews the code), and fix (identifies the problem, proposes a solution, and creates a patch).

#### Censorer
The Censorer is a specialized agent designed to simplify problem statements by focusing on their core elements. Its main role is to remove unnecessary details that could cause confusion or ambiguity. By filtering out irrelevant content, the Censorer ensures that only the most important information remains, improving the clarity and accuracy of the problem statement.

Additionally, the Censorer eliminates any references to testing procedures or documentation. This targeted exclusion helps keep the problem statement focused solely on the engineering challenge, allowing the next agents to concentrate fully on solving the issue without distractions.

#### Reproducer 

The Reproducer agent aims to replicate an issue by creating a standalone Python script. This process consists of two main parts: script preparation and script execution.

1. Script Preparation. This step occurs before any patch is applied. During this phase, the agent operates in an empty folder and has read-only access to the repository. It is equipped with two tools: execute bash script and edit a file. We instruct the agent to write a script that reproduces the issue and can be run multiple times without changing its behavior. Once the script is ready, we save both the script and the output from its execution.

2. Script Execution. In this phase, the agent has access only to the execute bash script tool. When the agent runs the script, we compare the original output with the current output. If the output remains unchanged, we conclude that the issue has not been fixed. If there is a difference, we ask the agent to draw conclusions based on the outputs.

#### Localizer
The Localizer agent is designed to systematically navigate the project's file tree and pinpoint the code segments most relevant to the problem statement. By using a custom file tree structure, the agent traverses directories and subdirectories to build a contextual understanding of the issue.

The Localizer agent utilizes two key tools for this task:
1. Open Directory: This tool allows the agent to expand directories within the file tree. By accessing deeper files, the agent ensures thorough exploration of the codebase.
2. Analyze File: Instead of having unlimited access to all files, the agent uses this tool to query specific files. This ensures a targeted approach, extracting only relevant information based on the problem at hand.

Through this interaction, the agent gradually builds an understanding of the codebase, staying focused and avoiding information overload. In the end, the Localizer outputs the location of the relevant code segments along with its reasoning, providing a clear and structured method for identifying the root cause of issues in complex codebases.

#### Fixer

The Fixer Agent is tasked with resolving code issues identified by the Localizer agent. It receives the probable location of the problem, the reasoning behind this location, and the original problem statement. 

Using its advanced capabilities, the Fixer agent performs detailed code analysis and modifications through the following tools:
1. Open Directory: Like the Localizer agent, this tool helps the Fixer Agent explore additional directories in the file tree to access files when needed.
2. Analyze Code: This tool extracts specific insights from a targeted file, helping the agent understand the code's structure and logic more deeply.
3. Read File: Unlike Analyze Code, this tool provides the agent with a code skeleton, allowing it to grasp the overall structure without diving into unnecessary details.
4. Open Code: This tool grants the Fixer Agent full access to entire code blocks (functions or classes), complementing the Read File tool by offering complete context when needed.

Since the Fixer Agent is already directed to the likely source of the issue, it can focus on that specific area without needing to navigate the entire repository. After gathering sufficient context, the Fixer Agent generates a new code solution to fix the problem.

This solution is then passed to the Code Editor Agent for integration and replacement in the codebase.

#### Code Editor

The Code Editor agent is designed to create executable patches. It efficiently searches for code segments, replaces them, and resolves issues such as indentation and syntax errors.

To ensure accurate replacements, the agent computes similarity scores to identify matching code segments. It standardizes indentation for consistency and constructs prompts to update the specified code segments. Additionally, the agent manages errors that occur when search strings cannot be found. In our prompts, we emphasize the importance of making minimal changes, which enhances the overall editing experience.

#### Code Reviewer

A specialized Code Reviewer Agent was developed to thoroughly evaluate the logic and functional correctness of code, while ignoring minor issues such as syntax errors or style inconsistencies. This focus allows the agent to concentrate on identifying logical flaws and suboptimal design patterns that impact the code's quality.

To systematize the review process, a custom ranking system was created. This system measures code quality based on predefined criteria, enabling a thorough analysis of its logic, functionality, and resilience. By using this ranking mechanism, the agent can compare different implementations, highlighting both strengths and weaknesses. This structured approach enhances understanding of the code's overall quality, helping make more informed decisions and driving continuous improvement of the codebase.

Additionally, observations showed that when a code submission fails the review twice, the chances of success in later attempts drop significantly. To address this, a dynamic strictness adjustment mechanism was introduced. This system gradually reduces the agent's strictness after repeated failures, striking a balance between ensuring progress and addressing critical issues in the code.

### Future

We noticed that our previous agent handles different types of tasks. By combining the results of both agents, we achieve a 41.2% success rate. In the future, we plan to merge these two approaches.`;

  const reportContent1 = `
## Part 1 (2024-10-01)

## Introduction

At nFactorial, we are building fully autonomous agents to enable the next wave of 1-person software companies.

## SWE-Bench

### Results

We achieved 25.8% on SWE-Bench Verified benchmark.

- Median token usage: 67,504 tokens
- Median execution time: 53 seconds

### Execution Environment

We utilize [SWE-bench's](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md) docker image creation scripts to set up our environment. However, unlike SWE-bench, we do not apply test patches or perform an evaluation in this process. 

All of the results are pass@1. We executed the workflow for each task only once.

The agent does not have access to hints_text, fail-to-pass, pass-to-pass data. We only use repo, instance_id, base_commit, and problem statement.

The agent does not have access to the internet.

### Workflow

To reduce hallucinations and minimize token usage, we have developed a "pipeline" where the agent does not control the workflow. The pipeline consists of three primary phases: localize, fix, and analyze. The localization and fixing steps are repeated twice, and the resulting outputs are passed to the analysis phase, where the agent selects the best solution.

#### Localize

1. Based on the issue description, the agent searches the codebase using up to five keywords. This is achieved via a simple \`grep\` command.
2. The agent selects the 10 most relevant files from the search results.
3. The agent requests code snippets from these relevant files for review.
4. The agent reads the code snippets, with an intentionally extended context - up to 200 lines per snippet.
5. The agent decides which files need modifications and provides reasoning for these choices.

These numbers are based on trial-and-error experimentation across 20 random tests and are not underpinned by any formal analysis.

#### Fix

Informed by the conclusions from the previous step, the agent generates a patch to address the identified issue.

To modify the code, we employ a search-and-replace approach. The agent is tasked with generating a search-and-replace request in a predefined format.

Occasionally, the agent hallucinates and introduces incorrect indentation, either by adding unnecessary spaces or removing required ones. To mitigate this, we implemented a brute force algorithm that tests various indentation combinations, applying the search-and-replace process accordingly. This solution has reduced indentation errors to zero, validated through 20 random tests.

#### Analyzer

From the two generated solutions, the agent selects the optimal one.

Throughout the execution of these steps, we manually manage the context window. We extract the most relevant information from previous messages and create a new prompt that summarizes the key details and actions from earlier stages.

### Future

This report represents our simplest implementation. In the future, we aim to investigate more advanced agent-driven workflows where the agent takes full control of the workflow.

### Acknowledgments

Huge thanks to the SWE-bench team for providing containerized environments and for establishing a benchmark for coding agents.`;

  const reports = [reportContent1, reportContent2, reportContent3, reportContent4];
  const images = ['./graphs/sasuke.png', './graphs/neo.png', './graphs/dicaprio.png', './graphs/itachi.png'];
  const [reportIdx, setReportIdx] = useState(3);

  return (
    <div className="min-h-screen bg-gray-900 text-gray-100 pt-36">
      <Helmet>
  <title>SWE Bench Technical Report - nFactorial AI</title>
  <meta name="description" content="Read our latest technical reports on AI-driven software development. Learn about our progress in SWE-Bench." />
  <meta name="keywords" content="AI, SWE-Bench, nFactorial, technical reports, autonomous agents" />
  <meta name="robots" content="index, follow" />
  <meta property="og:title" content="Technical Report - nFactorial AI" />
  <meta property="og:description" content="Our progress in SWE-Bench benchmarks and autonomous agent development." />
  <meta property="og:image" content="https://nfactorial.dev/Data-Analyst.png" />
  <meta property="og:url" content="https://nfactorial.dev/swe-bench" />
</Helmet>
      <Header />

      <main className="max-w-5xl mx-auto px-4 py-8">
        <div className="flex gap-4 mb-8 flex-wrap">
          {[
            { id: 0, label: 'Part 1 (2024-10-01)' },
            { id: 1, label: 'Part 2 (2024-10-07)' },
            { id: 2, label: 'Part 3 (2024-10-30)' },
            { id: 3, label: 'Part 4 (2024-11-05)' }
          ].map(part => (
            <button
              key={part.id}
              onClick={() => setReportIdx(part.id)}
              className={`px-4 py-2 rounded-lg transition-colors ${
                reportIdx === part.id
                  ? 'bg-blue-600 text-white font-bold'
                  : 'bg-gray-800 hover:bg-gray-700'
              }`}
            >
              {part.label}
            </button>
          ))}
        </div>

        <div className="prose prose-invert max-w-none">
          <ReactMarkdown
            components={{
              h1: ({node, ...props}) => <h1 className="text-3xl font-bold mt-8 mb-4" {...props} />,
              h2: ({node, ...props}) => <h2 className="text-2xl font-bold mt-6 mb-3" {...props} />,
              h3: ({node, ...props}) => <h3 className="text-xl font-semibold mt-5 mb-3" {...props} />,
              h4: ({node, ...props}) => <h4 className="text-lg font-semibold mt-4 mb-2" {...props} />,
              p: ({node, ...props}) => <p className="mb-4" {...props} />,
              ul: ({node, ...props}) => <ul className="list-disc pl-6 mb-4 space-y-1" {...props} />,
              ol: ({node, ...props}) => <ol className="list-decimal pl-6 mb-4 space-y-1" {...props} />,
              li: ({node, ...props}) => <li className="ml-2" {...props} />,
              a: ({node, ...props}) => <a className="text-blue-400 hover:text-blue-300" {...props} />,
              code: ({node, inline, ...props}) => 
                inline ? (
                  <code className="bg-gray-800 px-1.5 py-0.5 rounded text-sm" {...props} />
                ) : (
                  <code className="block bg-gray-800 p-4 rounded-lg my-4 overflow-x-auto" {...props} />
                )
            }}
          >
            {reports[reportIdx]}
          </ReactMarkdown>

          <div className="mt-12 text-center">
            <img
              src={images[reportIdx]}
              alt="Solution Graph"
              className="mx-auto max-h-96 mb-4"
            />
            <p className="text-gray-400 italic">
              Simple representation of the solution with its internal name.
            </p>
          </div>
        </div>
      </main>

      <footer className="border-t border-gray-800 mt-16 py-8 px-4 text-center">
        <p>
          <span className="block text-gray-400 mb-2">Team nFactorial AI</span>
          <span className="font-medium">
            Nurdaulet Bolatov, Batyr Sardarbekov, Alen Abeshov, Arman Suleimenov
          </span>
        </p>
      </footer>
    </div>
  );
};

export default TechnicalReportPage;