java.lang.Object

com.attribyte.parser.page.HTMLPageParser

public class HTMLPageParser extends Object

Extract interesting content from HTML as a Page.

Author: Matt Hamer

Constructor Summary

- HTMLPageParser()

Method Summary

- static void anchors(org.jsoup.nodes.Document doc, Set<Anchor> anchors)
  Extract unique anchors from a document.
- static void audio(org.jsoup.nodes.Document doc, Set<Audio> audios)
  Extract unique audio streams appearing in a document into a set.
- static void images(org.jsoup.nodes.Document doc, Set<Image> images)
  Extract unique images appearing in a document into a set.
- static boolean isValidAuthor(String author)
  Determine whether the author name appears to be valid.
- static Page parse(String content, String defaultCanonicalLink)
  Parse a page from a string.
- static Page parse(org.jsoup.nodes.Document doc, String defaultCanonicalLink)
  Parse a page from a Jsoup document.
- static Page parseFragment(String content, String defaultCanonicalLink)
  Parse a page from a string containing a fragment of HTML.
- static Page parseFromURL(String url, int maxBodySize)
  Parse a page from a URL.
- static void videos(org.jsoup.nodes.Document doc, Set<Video> videos)
  Extract unique videos appearing in a document into a set.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- HTMLPageParser
  
  public HTMLPageParser()
Method Details
- parseFromURL
  
  public static Page parseFromURL(String url, int maxBodySize) throws IOException
  
  Parse a page from a URL.
  
  Parameters:
  
  url - The URL.
  
  maxBodySize - The maximum allowed body size.
  
  Returns:
  
  The page.
  
  Throws:
  
  IOException - on parse error.
- parse
  
  public static Page parse(String content, String defaultCanonicalLink)
  
  Parse a page from a string.
  
  Parameters:
  
  content - The string to parse.
  
  defaultCanonicalLink - The default canonical link to use.
  
  Returns:
  
  The page.
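  
  To illustrate the defaultCanonicalLink parameter, the stand-alone sketch below shows one plausible fallback rule: use the canonical link declared in the markup, and fall back to the supplied default only when none is declared. This is an assumption about the behavior, not a description of the library's implementation. CanonicalLinkSketch and canonicalOrDefault are hypothetical names, and the regular expressions are a simplification that a real HTML parser would replace.
  
  ```java
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;
  
  // Hypothetical helper for illustration only; not part of HTMLPageParser.
  final class CanonicalLinkSketch {
  
      // Matches any <link ...> tag.
      private static final Pattern LINK_TAG =
              Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
      // Matches rel="canonical" (quoted) inside a tag.
      private static final Pattern REL_CANONICAL =
              Pattern.compile("\\brel\\s*=\\s*[\"']canonical[\"']", Pattern.CASE_INSENSITIVE);
      // Captures the quoted href value inside a tag.
      private static final Pattern HREF =
              Pattern.compile("\\bhref\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);
  
      private CanonicalLinkSketch() {}
  
      /** Returns the declared canonical link, or the default when none is declared (assumed rule). */
      static String canonicalOrDefault(String html, String defaultCanonicalLink) {
          Matcher tags = LINK_TAG.matcher(html);
          while (tags.find()) {
              String tag = tags.group();
              if (REL_CANONICAL.matcher(tag).find()) {
                  Matcher href = HREF.matcher(tag);
                  if (href.find() && !href.group(1).trim().isEmpty()) {
                      return href.group(1).trim();
                  }
              }
          }
          return defaultCanonicalLink;
      }
  
      public static void main(String[] args) {
          String withLink = "<head><link rel=\"canonical\" href=\"https://example.com/a\"></head>";
          System.out.println(canonicalOrDefault(withLink, "https://example.com/default"));
          System.out.println(canonicalOrDefault("<p>text</p>", "https://example.com/default"));
      }
  }
  ```
  
  Under this rule, a page that declares its own canonical link keeps it, while markup without one (a bare fragment, for instance) receives the default.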
- parseFragment
  
  public static Page parseFragment(String content, String defaultCanonicalLink)
  
  Parse a page from a string containing a fragment of HTML.
  
  Parameters:
  
  content - The string to parse.
  
  defaultCanonicalLink - The default canonical link to use.
  
  Returns:
  
  The page.
- parse
  
  public static Page parse(org.jsoup.nodes.Document doc, String defaultCanonicalLink)
  
  Parse a page from a Jsoup document.
  
  Parameters:
  
  doc - The document.
  
  defaultCanonicalLink - The default canonical link for the page.
  
  Returns:
  
  The parsed page.
- isValidAuthor
  
  public static boolean isValidAuthor(String author)
  
  Determine whether the author name appears to be valid.
  
  Parameters:
  
  author - The author.
  
  Returns:
  
  true if the author name appears to be valid; false otherwise.
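  
  Because the signature is static boolean isValidAuthor(String), the method can be passed wherever a Predicate<String> is expected, for example as HTMLPageParser::isValidAuthor. The sketch below shows one way a caller might use such a check to pick the first acceptable name from several candidates. AuthorSelectionSketch and firstValidAuthor are hypothetical names, and the predicate used in main is only a stand-in: the actual validity rules are not documented here.
  
  ```java
  import java.util.Arrays;
  import java.util.List;
  import java.util.Objects;
  import java.util.Optional;
  import java.util.function.Predicate;
  
  // Hypothetical caller-side helper; not part of HTMLPageParser.
  final class AuthorSelectionSketch {
  
      private AuthorSelectionSketch() {}
  
      /** Returns the first non-null, trimmed candidate accepted by the supplied check. */
      static Optional<String> firstValidAuthor(List<String> candidates, Predicate<String> isValid) {
          return candidates.stream()
                  .filter(Objects::nonNull)
                  .map(String::trim)
                  .filter(isValid)
                  .findFirst();
      }
  
      public static void main(String[] args) {
          // Stand-in check for demonstration; real code would pass HTMLPageParser::isValidAuthor.
          Predicate<String> standIn = name -> !name.isEmpty();
          List<String> candidates = Arrays.asList(null, "   ", " Jane Doe ", "John Roe");
          System.out.println(firstValidAuthor(candidates, standIn).orElse("(unknown)")); // prints "Jane Doe"
      }
  }
  ```
  
  Null candidates are filtered before the check because how isValidAuthor treats null input is not stated.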
- images
  
  public static void images(org.jsoup.nodes.Document doc, Set<Image> images)
  
  Extract unique images appearing in a document into a set.
  
  Parameters:
  
  doc - The document.
  
  images - The set to which images are added.
- videos
  
  public static void videos(org.jsoup.nodes.Document doc, Set<Video> videos)
  
  Extract unique videos appearing in a document into a set.
  
  Parameters:
  
  doc - The document.
  
  videos - The set to which videos are added.
- anchors
  
  public static void anchors(org.jsoup.nodes.Document doc, Set<Anchor> anchors)
  
  Extract unique anchors from a document.
  
  Parameters:
  
  doc - The document.
  
  anchors - The set to which anchors are added.
- audio
  
  public static void audio(org.jsoup.nodes.Document doc, Set<Audio> audios)
  
  Extract unique audio streams appearing in a document into a set.
  
  Parameters:
  
  doc - The document.
  
  audios - The set to which audio is added.
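
The images, videos, anchors and audio methods share one calling pattern: they return void and add what they find to a set supplied by the caller, so uniqueness follows from ordinary Set semantics (the element type's equals and hashCode), and one set can accumulate results across several documents. The stand-alone sketch below mimics that pattern with plain strings. UniqueAnchorsSketch and anchorHrefs are hypothetical names, the regular expression is a rough simplification of real HTML parsing, and whether Anchor and the other model classes define equality by URL is an assumption not confirmed by this page.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the "extract into a caller-supplied set" pattern.
final class UniqueAnchorsSketch {

    // Captures the quoted href value of an <a ...> tag.
    private static final Pattern ANCHOR_HREF = Pattern.compile(
            "<a\\b[^>]*\\bhref\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    private UniqueAnchorsSketch() {}

    /** Adds every anchor href found in the markup to the caller's set; duplicates collapse. */
    static void anchorHrefs(String html, Set<String> hrefs) {
        Matcher m = ANCHOR_HREF.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
    }

    public static void main(String[] args) {
        // A LinkedHashSet keeps first-seen order; a HashSet would also deduplicate.
        Set<String> hrefs = new LinkedHashSet<>();
        anchorHrefs("<a href=\"/a\">A</a> <a href=\"/b\">B</a> <a href=\"/a\">A again</a>", hrefs);
        System.out.println(hrefs); // prints [/a, /b]
    }
}
```

Choosing the Set implementation is left to the caller: a LinkedHashSet preserves the order in which items were first seen, while a sorted set would require the element type to be comparable.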
