In C#:

/**
 * @param robotLines The lines of the robots.txt file (only ASCII characters, no newlines)
 * @return All the disallowed URL patterns that apply to your search engine.
 */
public static List<string> ParseRobotFile(List<string> robotLines)

I'm developing a new search engine called DoeBot and need to implement parsing of the "robots.txt" files that can be found on some websites. The function receives a list of strings (the lines read from the file); they contain only ASCII characters and the newline character at the end has already been removed. It must return a list of strings containing all the "Disallow:" URL patterns that apply to my search engine.

robots.txt format:
- "User-agent:" lines open a section. Only the sections named DoeBot and the wildcard ones ('*') have to be considered; names and wildcards are never mixed. A section ends at the next "User-agent:" line.
- Lines starting with the text "Disallow:" mark a URL pattern to disallow. There is always at least one non-space character after the text "Disallow:".
- All lines not starting with "Disallow:" or "User-agent:" must be ignored.
- A URL pattern can contain any character except a space.
- There can be spaces at the beginning and end of a line.

Before returning the disallowed URL patterns, remove the duplicates and sort them.
Here's a C# implementation of the ParseRobotFile function that handles parsing of robots.txt files for the DoeBot search engine, following all the constraints you've described.
✅ Final Code
using System;
using System.Collections.Generic;
using System.Linq;
public class Solution
{
    public static List<string> ParseRobotFile(List<string> robotLines)
    {
        HashSet<string> disallowed = new HashSet<string>();
        bool isRelevantSection = false;

        foreach (string rawLine in robotLines)
        {
            string line = rawLine.Trim(); // remove leading/trailing spaces

            if (line.StartsWith("User-agent:", StringComparison.OrdinalIgnoreCase))
            {
                string agent = line.Substring("User-agent:".Length).Trim();
                isRelevantSection = agent == "DoeBot" || agent == "*";
            }
            else if (isRelevantSection && line.StartsWith("Disallow:", StringComparison.OrdinalIgnoreCase))
            {
                string urlPattern = line.Substring("Disallow:".Length).Trim();
                if (!string.IsNullOrWhiteSpace(urlPattern))
                {
                    disallowed.Add(urlPattern);
                }
            }
            // all other lines are ignored
        }

        return disallowed.OrderBy(p => p).ToList();
    }
}
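A note on the StringComparison.OrdinalIgnoreCase checks: the problem statement spells the directives as "User-agent:" and "Disallow:", so a case-sensitive StartsWith would also satisfy it; the case-insensitive match is simply a defensive choice, since real robots.txt files often vary the casing.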
🔍 Explanation
- Sections are defined by "User-agent:" lines.
- The active section applies if the user agent is "DoeBot" or "*".
- Only lines starting with "Disallow:" inside a relevant section are used.
- A HashSet<string> is used to remove duplicates (see the alternative sketch after this list).
- Finally, the list is returned sorted alphabetically.
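If you want the duplicate removal and the sorting handled by a single data structure, a SortedSet<string> works as well. The sketch below is just an alternative formulation of the same parsing logic, not the required solution; the ParseRobotFileAlt class name and the StringComparer.Ordinal comparer are my own choices, not part of the original problem.

using System;
using System.Collections.Generic;
using System.Linq;

public static class ParseRobotFileAlt
{
    // Same parsing rules as ParseRobotFile, but a SortedSet<string> keeps the
    // patterns unique and ordered as they are inserted.
    public static List<string> Parse(List<string> robotLines)
    {
        var disallowed = new SortedSet<string>(StringComparer.Ordinal); // ordinal ordering is an assumption
        bool isRelevantSection = false;

        foreach (string rawLine in robotLines)
        {
            string line = rawLine.Trim();
            if (line.StartsWith("User-agent:", StringComparison.OrdinalIgnoreCase))
            {
                string agent = line.Substring("User-agent:".Length).Trim();
                isRelevantSection = agent == "DoeBot" || agent == "*";
            }
            else if (isRelevantSection && line.StartsWith("Disallow:", StringComparison.OrdinalIgnoreCase))
            {
                disallowed.Add(line.Substring("Disallow:".Length).Trim());
            }
        }

        return disallowed.ToList();
    }
}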
💡 Example Input:
var lines = new List<string>
{
    "User-agent: *",
    " Disallow: /private ",
    "Disallow: /temp",
    "User-agent: GoogleBot",
    "Disallow: /not-for-google",
    "User-agent: DoeBot",
    "Disallow: /secret",
    "Disallow: /private"
};
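Calling the function on that list (assuming the Solution class above is compiled alongside it):

var result = Solution.ParseRobotFile(lines);
Console.WriteLine(string.Join(", ", result)); // prints: /private, /secret, /temp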
💡 Output:
[ "/private", "/secret" ]