Class RobotsTxt

java.lang.Object
org.attribyte.api.http.util.RobotsTxt

public class RobotsTxt extends Object
A parsed robots.txt file.
  • Field Details

    • NO_ROBOTS

      public static final RobotsTxt NO_ROBOTS
  • Constructor Details

    • RobotsTxt

      public RobotsTxt(Reader r, Set<String> agents) throws IOException
      Parse robots.txt from a character stream.
      Parameters:
      r - A reader from which the robots.txt is read.
      agents - The set of user agents that, if listed in the file, should be preserved. The wildcard (*) is always preserved.
      Throws:
      IOException - on input error.
  • Method Details

    • parse

      public static RobotsTxt parse(String host, Client httpClient, String userAgent, Set<String> preserveAgents, org.attribyte.api.Logger logger)
      Fetches and parses the robots.txt file from the standard location (/robots.txt) on the given host.
      Parameters:
      host - The hostname. The URL will be created as [host]/robots.txt.
      httpClient - The HTTP client for making the request.
      userAgent - The User-Agent sent with the request.
      preserveAgents - The set of agents to preserve. Agents not contained in this set will be ignored during parse.
      logger - A logger for errors. May be null. If specified, HTTP errors during parse will be logged at the warn level.
      Returns:
      The parsed robots.txt.
    • isAllowed

      public final boolean isAllowed(String userAgent, String path)
      Determines whether a user agent is allowed to access the specified path.
      Parameters:
      userAgent - The user agent string.
      path - The path to check.
      Returns:
      true if the agent is allowed to access the path; false otherwise.
    • isAllowed

      public final boolean isAllowed(String userAgent, String path, boolean checkWildcard)
      Determines whether a user agent is allowed to access the specified path.

      Technically, the treatment of Allow is not right (http://www.robotstxt.org/wc/norobots-rfc.html): a single list should be processed, matching all records in the order they appear. In practice, however, I have found that people often write rules that don't make sense under that model, such as a disallow-all followed by an allow.

      Parameters:
      userAgent - The user agent string.
      path - The path.
      checkWildcard - Should the wildcard record be checked? (This gives a way to know if a user agent is explicitly disallowed by name.)
      Returns:
      true if the agent is allowed to access the path; false otherwise.
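
For comparison, the strict ordered evaluation described in the note on Allow can be sketched as follows. This is a minimal illustration, not the code used by RobotsTxt: the class OrderedRobotsRules, its nested Rule type, and the use of plain prefix matching are all assumptions made for the example. It shows why a "disallow all, then allow" file behaves unexpectedly when the first matching line wins.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not part of org.attribyte.api.http.util. */
public class OrderedRobotsRules {

    /** One Allow or Disallow line from a single record (hypothetical type). */
    public static final class Rule {
        final boolean allow;
        final String pathPrefix;

        public Rule(boolean allow, String pathPrefix) {
            this.allow = allow;
            this.pathPrefix = pathPrefix;
        }
    }

    /**
     * First-match-wins evaluation: rules are checked in file order and the
     * first rule whose prefix matches the path decides the outcome.
     * If no rule matches, the path is allowed.
     */
    public static boolean isAllowed(List<Rule> rules, String path) {
        for (Rule rule : rules) {
            // Assumption: a rule with an empty value matches nothing.
            if (!rule.pathPrefix.isEmpty() && path.startsWith(rule.pathPrefix)) {
                return rule.allow;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "Disallow: /" followed by "Allow: /public/"
        List<Rule> disallowFirst = Arrays.asList(
                new Rule(false, "/"),
                new Rule(true, "/public/"));

        // Same lines in the opposite order.
        List<Rule> allowFirst = Arrays.asList(
                new Rule(true, "/public/"),
                new Rule(false, "/"));

        // The Allow line is never reached because "Disallow: /" matches first.
        System.out.println(isAllowed(disallowFirst, "/public/index.html")); // prints false

        // With the Allow line first, the same path is permitted.
        System.out.println(isAllowed(allowFirst, "/public/index.html"));    // prints true
    }
}
```

Under strict ordering, the author of the first file almost certainly intended /public/ to be reachable, which is the kind of mismatch the note describes.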