Class HTMLPageParser

java.lang.Object
com.attribyte.parser.page.HTMLPageParser

public class HTMLPageParser extends Object
Extract interesting content from HTML as a Page.
Author:
Matt Hamer
  • Constructor Summary

    Constructors
    Constructor
    Description
    HTMLPageParser()
  • Method Summary

    Modifier and Type
    Method
    Description
    static void
    anchors(org.jsoup.nodes.Document doc, Set<Anchor> anchors)
    Extract unique anchors from a document.
    static void
    audio(org.jsoup.nodes.Document doc, Set<Audio> audios)
    Extract unique audio streams appearing in a document into a set.
    static void
    images(org.jsoup.nodes.Document doc, Set<Image> images)
    Extract unique images appearing in a document into a set.
    static boolean
    isValidAuthor(String author)
    Checks whether an author name appears to be valid.
    static Page
    parse(String content, String defaultCanonicalLink)
    Parse a page from a string.
    static Page
    parse(org.jsoup.nodes.Document doc, String defaultCanonicalLink)
    Parse a page from a Jsoup document.
    static Page
    parseFragment(String content, String defaultCanonicalLink)
    Parse a page from a string containing a fragment of HTML.
    static Page
    parseFromURL(String url, int maxBodySize)
    Parse from a URL.
    static void
    videos(org.jsoup.nodes.Document doc, Set<Video> videos)
    Extract unique videos appearing in a document into a set.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • HTMLPageParser

      public HTMLPageParser()
  • Method Details

    • parseFromURL

      public static Page parseFromURL(String url, int maxBodySize) throws IOException
      Parse from a URL.
      Parameters:
      url - The URL.
      maxBodySize - The maximum allowed body size.
      Returns:
      The page.
      Throws:
      IOException - on parse error.
    • parse

      public static Page parse(String content, String defaultCanonicalLink)
      Parse a page from a string.
      Parameters:
      content - The string to parse.
      defaultCanonicalLink - The default canonical link to use.
      Returns:
      The page.
    • parseFragment

      public static Page parseFragment(String content, String defaultCanonicalLink)
      Parse a page from a string containing a fragment of HTML.
      Parameters:
      content - The string to parse.
      defaultCanonicalLink - The default canonical link to use.
      Returns:
      The page.
    • parse

      public static Page parse(org.jsoup.nodes.Document doc, String defaultCanonicalLink)
      Parse a page from a Jsoup document.
      Parameters:
      doc - The document.
      defaultCanonicalLink - The default canonical link for the page.
      Returns:
      The parsed page.
    • isValidAuthor

      public static boolean isValidAuthor(String author)
      Checks whether an author name appears to be valid.
      Parameters:
      author - The author.
      Returns:
      true if the name appears to be valid.
    • images

      public static void images(org.jsoup.nodes.Document doc, Set<Image> images)
      Extract unique images appearing in a document into a set.
      Parameters:
      doc - The document.
      images - The set to which images are added.
    • videos

      public static void videos(org.jsoup.nodes.Document doc, Set<Video> videos)
      Extract unique videos appearing in a document into a set.
      Parameters:
      doc - The document.
      videos - The set to which videos are added.
    • anchors

      public static void anchors(org.jsoup.nodes.Document doc, Set<Anchor> anchors)
      Extract unique anchors from a document.
      Parameters:
      doc - The document.
      anchors - The set to which anchors are added.
    • audio

      public static void audio(org.jsoup.nodes.Document doc, Set<Audio> audios)
      Extract unique audio streams appearing in a document into a set.
      Parameters:
      doc - The document.
      audios - The set to which audio is added.
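
The four extraction methods above (images, videos, anchors, audio) return void and write their results into a set supplied by the caller. The following is a minimal, JDK-only sketch of that calling pattern, not the library's implementation: the class UniqueLinkSketch, the method links, and the regular expression are hypothetical stand-ins, and plain strings are used in place of the library's Anchor type. It shows that one set can be reused across several documents so that duplicates are dropped across all of them.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names belong to the library.
public class UniqueLinkSketch {

    // Crude stand-in for real HTML parsing: matches double-quoted href values only.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    /**
     * Adds every href value found in the markup to the caller's set.
     * Nothing is returned: the set itself is the output.
     */
    public static void links(String html, Set<String> links) {
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // Set.add ignores values that are already present
        }
    }

    public static void main(String[] args) {
        Set<String> links = new LinkedHashSet<>();

        // First document: "/a" appears twice but is stored once.
        links("<a href=\"/a\">A</a> <a href=\"/b\">B</a> <a href=\"/a\">A again</a>", links);

        // Second document: "/b" was already collected from the first one.
        links("<a href=\"/b\">B</a> <a href=\"/c\">C</a>", links);

        System.out.println(links); // prints [/a, /b, /c]
    }
}
```

A LinkedHashSet is used here so the output keeps first-seen order; a HashSet would work equally well when order does not matter. With the library's own types, uniqueness depends on how Anchor, Image, Video, and Audio define equals and hashCode, which this page does not describe.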