Thursday, June 2, 2022

public static List<string> ParseRobotFile(List<string> robotLines)

In C#:

/**
 * @param robotLines The lines of the robots.txt file (only ASCII characters, no newlines)
 * @return All the disallowed url patterns that apply to your search engine.
 */
public static List<string> ParseRobotFile(List<string> robotLines)

I am developing a new search engine called DoeBot and need to implement the parsing of the "robots.txt" files that can be found on some websites. The function receives a list of strings (the lines read from the file); they contain ASCII characters only, and the newline character at the end has already been removed. It must return a list of strings containing all the "Disallow:" url patterns that apply to my search engine.

robots.txt format:

  • "User-agent:" lines start a section. Only the sections named DoeBot and the wildcarded ones ("*") have to be considered; there are never mixes between names and wildcards. A section ends at the next "User-agent:" line.

  • Lines starting with the text "Disallow:" mark a url pattern to disallow. There is always at least one non-space character after the text "Disallow:".

  • All lines not starting with "Disallow:" or "User-agent:" must be ignored.

  • A url pattern can contain any type of character except spaces. There can be spaces at the beginning and end of a line.

Before returning the disallowed url patterns, remove the duplicates and sort them.


Here's a C# implementation of the ParseRobotFile function to handle parsing of robots.txt for the DoeBot search engine, following all the constraints you've described.


✅ Final Code

using System;
using System.Collections.Generic;
using System.Linq;

public class Solution
{
    public static List<string> ParseRobotFile(List<string> robotLines)
    {
        HashSet<string> disallowed = new HashSet<string>();
        bool isRelevantSection = false;

        foreach (string rawLine in robotLines)
        {
            string line = rawLine.Trim(); // remove leading/trailing spaces

            if (line.StartsWith("User-agent:", StringComparison.OrdinalIgnoreCase))
            {
                string agent = line.Substring("User-agent:".Length).Trim();
                isRelevantSection = agent == "DoeBot" || agent == "*";
            }
            else if (isRelevantSection && line.StartsWith("Disallow:", StringComparison.OrdinalIgnoreCase))
            {
                string urlPattern = line.Substring("Disallow:".Length).Trim();
                if (!string.IsNullOrWhiteSpace(urlPattern))
                {
                    disallowed.Add(urlPattern);
                }
            }
            // other lines ignored
        }

        return disallowed.OrderBy(p => p).ToList();
    }
}

🔍 Explanation

  • Sections are defined by User-agent: lines.

  • A section is relevant if its User-agent value is "DoeBot" or "*".

  • Only lines starting with Disallow: inside a relevant section are used.

  • We use a HashSet<string> to remove duplicates.

  • Finally, return the list sorted alphabetically.
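
  • Optional tweak (not required by the exercise): OrderBy(p => p) sorts with the default string comparer, which is culture-sensitive. Since the patterns are plain ASCII this rarely matters, but if you prefer a strictly culture-independent ordering you could pass StringComparer.Ordinal explicitly:

return disallowed.OrderBy(p => p, StringComparer.Ordinal).ToList();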


💡 Example Input:

var lines = new List<string>
{
    "User-agent: *",
    " Disallow: /private ",
    "Disallow: /temp",
    "User-agent: GoogleBot",
    "Disallow: /not-for-google",
    "User-agent: DoeBot",
    "Disallow: /secret",
    "Disallow: /private"
};

💡 Output:

[ "/private", "/secret" ]

