Package com.attribyte.parser.page
Class HTMLPageParser
java.lang.Object
com.attribyte.parser.page.HTMLPageParser
Extract interesting content from HTML as a Page.
Author:
Matt Hamer

Constructor Summary
Constructors
HTMLPageParser()

Method Summary
Modifier and Type   Method                                                      Description
static void         anchors                                                     Extract unique anchors from a document.
static void         audio                                                       Extract unique audio streams appearing in a document into a set.
static void         images                                                      Extract unique images appearing in a document into a set.
static boolean      isValidAuthor(String author)                                Checks whether the author name appears to be valid.
static Page         parse                                                       Parse a page from a string.
static Page         parse                                                       Parse a page from a Jsoup document.
static Page         parseFragment(String content, String defaultCanonicalLink)  Parse a page from a string containing a fragment of HTML.
static Page         parseFromURL(String url, int maxBodySize)                   Parse from a URL.
static void         videos                                                      Extract unique videos appearing in a document into a set.

Constructor Details
HTMLPageParser
public HTMLPageParser()

Method Details
parseFromURL
Parse from a URL.
Parameters:
url - The URL.
maxBodySize - The maximum allowed body size.
Returns:
The page.
Throws:
IOException - on parse error.

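The maxBodySize parameter caps how much of the response body is accepted; this page does not say whether content beyond the limit is truncated or rejected. The following is a minimal sketch of a size-capped read, assuming truncation and using only the JDK. BoundedBody and readAtMost are hypothetical names for illustration and are not part of HTMLPageParser.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for illustration; not part of HTMLPageParser. */
final class BoundedBody {

    /**
     * Reads at most maxBodySize bytes from the stream and returns them.
     * Bytes past the limit are left unread (truncation is an assumption).
     */
    static byte[] readAtMost(InputStream in, int maxBodySize) throws IOException {
        if (maxBodySize < 0) {
            throw new IllegalArgumentException("maxBodySize must be >= 0");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int remaining = maxBodySize;
        while (remaining > 0) {
            // Never request more than the remaining budget.
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n < 0) {
                break; // end of stream reached before the limit
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }
}
```

A caller that prefers failure over truncation could instead throw an IOException once the budget is exhausted and more data remains.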
parse
Parse a page from a string.
Parameters:
content - The string to parse.
defaultCanonicalLink - The default canonical link to use.
Returns:
The page.

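The role of defaultCanonicalLink can be illustrated with a small fallback routine: use the canonical link declared in the markup when there is one, otherwise use the supplied default. That precedence is an assumption, since this page only says the value is "the default canonical link to use". The sketch below uses JDK regular expressions rather than a real HTML parser, and CanonicalLink and resolve are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of HTMLPageParser. */
final class CanonicalLink {

    // Matches each <link ...> tag; attributes are inspected separately.
    private static final Pattern LINK_TAG =
            Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL_CANONICAL =
            Pattern.compile("\\brel\\s*=\\s*[\"']canonical[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF =
            Pattern.compile("\\bhref\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the href of the first rel="canonical" link found in the markup,
     * or defaultCanonicalLink when none is declared (assumed precedence).
     */
    static String resolve(String html, String defaultCanonicalLink) {
        Matcher tags = LINK_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            if (REL_CANONICAL.matcher(tag).find()) {
                Matcher href = HREF.matcher(tag);
                if (href.find() && !href.group(1).trim().isEmpty()) {
                    return href.group(1).trim();
                }
            }
        }
        return defaultCanonicalLink;
    }
}
```

Regular expressions are only adequate for well-formed, simple markup; a parser such as Jsoup is the more robust choice for real pages.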
parseFragment
Parse a page from a string containing a fragment of HTML.
Parameters:
content - The string to parse.
defaultCanonicalLink - The default canonical link to use.
Returns:
The page.

parse
Parse a page from a Jsoup document.
Parameters:
doc - The document.
defaultCanonicalLink - The default canonical link for the page.
Returns:
The parsed page.

isValidAuthor
Checks whether the author name appears to be valid.
Parameters:
author - The author.
Returns:
true if the author name appears to be valid; false otherwise.

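This page does not state which rules isValidAuthor applies. As a rough illustration of what an "appears valid" check might look like, the sketch below rejects blank values, overly long strings, URLs, email-like values, and strings with no letters. Every rule here, the 64-character limit, and the names AuthorCheck and looksValid are assumptions made for the example, not the library's behavior.

```java
import java.util.Locale;

/** Hypothetical heuristic for illustration; not the rules used by isValidAuthor. */
final class AuthorCheck {

    // Arbitrary upper bound chosen for this example.
    private static final int MAX_LENGTH = 64;

    static boolean looksValid(String author) {
        if (author == null) {
            return false;
        }
        String name = author.trim();
        if (name.isEmpty() || name.length() > MAX_LENGTH) {
            return false;
        }
        String lower = name.toLowerCase(Locale.ROOT);
        // Bylines scraped from pages sometimes contain a URL or an email address.
        if (lower.startsWith("http://") || lower.startsWith("https://") || name.indexOf('@') >= 0) {
            return false;
        }
        // Require at least one letter so that values like "12345" are rejected.
        return name.chars().anyMatch(Character::isLetter);
    }
}
```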
images
Extract unique images appearing in a document into a set.
Parameters:
doc - The document.
images - The set to which images are added.

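The images, videos, anchors, and audio methods share one shape: they return void and add results to a set supplied by the caller, so uniqueness comes from the set itself and one set can accumulate results across several documents. The sketch below shows that pattern with JDK classes only. It works on raw HTML strings and stores src values as strings, whereas the real methods take a Jsoup document and the element type of their sets is not shown on this page; ImageSources and collect are hypothetical names.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of HTMLPageParser. */
final class ImageSources {

    // Captures the quoted src attribute of each <img> tag.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    /**
     * Adds every image source found in the markup to the caller's set.
     * Duplicates are dropped by the set, not by this method.
     */
    static void collect(String html, Set<String> images) {
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            images.add(m.group(1).trim());
        }
    }
}
```

Passing a LinkedHashSet keeps sources in first-seen order, and calling collect repeatedly with the same set merges results from several documents without duplicates.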
videos
Extract unique videos appearing in a document into a set.
Parameters:
doc - The document.
videos - The set to which videos are added.

anchors
Extract unique anchors from a document.
Parameters:
doc - The document.
anchors - The set to which anchors are added.

audio
Extract unique audio streams appearing in a document into a set.
Parameters:
doc - The document.
audios - The set to which audio is added.