Package com.attribyte.parser.page
Class LinkExtractor
java.lang.Object
com.attribyte.parser.page.LinkExtractor
Extract all links found in a document, including images.
-
Field Summary
Fields
final String defaultProtocol
    The default protocol used for links that start with // or have no protocol.
-
Constructor Summary
Constructors
LinkExtractor(String sourceURL)
    Creates a link extractor.
LinkExtractor(String sourceURL, Collection<String> skipLinks)
    Creates a link extractor.
LinkExtractor(String sourceURL, Collection<String> skipLinks, Function<String, String> canonicalizeFn)
    Creates a link extractor.
-
Method Summary
Methods
final LinkExtractor add(url)
    Adds an externally processed URL.
anchors()
    Gets all anchors.
audio()
    Gets all audio.
extractLinks(org.jsoup.nodes.Document doc)
    Extracts all links from a document.
final void ignore(url)
    Adds a URL to ignore.
images()
    Gets all images.
links()
    Gets all links.
toString()
videos()
    Gets all videos.
-
Field Details
-
defaultProtocol
The default protocol used for links that start with // or have no protocol.
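To make the field description concrete, here is a minimal sketch of how a default protocol could be applied to such links. The class `DefaultProtocolSketch`, the helper `applyDefaultProtocol`, and the value `"https"` are all hypothetical; this page does not state the field's value or how `LinkExtractor` applies it internally.

```java
public class DefaultProtocolSketch {

    /**
     * Hypothetical helper, not part of LinkExtractor: prefixes a default
     * protocol onto links that start with "//" or carry no protocol.
     * Assumes host-style links; schemes without "://" (such as mailto:)
     * and relative paths are not handled in this sketch.
     */
    public static String applyDefaultProtocol(String link, String defaultProtocol) {
        if (link.startsWith("//")) {
            // Protocol-relative link: only the scheme and colon are missing.
            return defaultProtocol + ":" + link;
        }
        if (!link.contains("://")) {
            // No protocol at all: add the scheme and the "://" separator.
            return defaultProtocol + "://" + link;
        }
        // The link already names its protocol, so leave it unchanged.
        return link;
    }

    public static void main(String[] args) {
        // "https" is an assumed default, chosen only for illustration.
        System.out.println(applyDefaultProtocol("//cdn.example.com/a.png", "https"));
        System.out.println(applyDefaultProtocol("example.com/page", "https"));
        System.out.println(applyDefaultProtocol("http://example.com/", "https"));
    }
}
```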
-
Constructor Details
-
LinkExtractor
Creates a link extractor.
Parameters:
sourceURL - The source URL.
-
LinkExtractor
Creates a link extractor.
Parameters:
sourceURL - The source URL.
skipLinks - A collection of URLs for links to skip.
-
LinkExtractor
public LinkExtractor(String sourceURL, Collection<String> skipLinks, Function<String, String> canonicalizeFn)
Creates a link extractor.
Parameters:
sourceURL - The source URL.
skipLinks - A collection of URLs for links to skip.
canonicalizeFn - A function that returns the canonical version of a URL.
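As an illustration of what a `canonicalizeFn` argument might look like, the sketch below defines a `Function<String, String>` using only `java.net.URI`. The class name `CanonicalizeSketch` and the specific rules (lower-case the scheme and host, resolve dot segments, drop the fragment and any user info) are assumptions chosen for the example; this page does not say which canonical form the library expects.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.function.Function;

public class CanonicalizeSketch {

    /**
     * Hypothetical canonicalizer: lower-cases the scheme and host, resolves
     * "." and ".." path segments, and drops the fragment and user info.
     * Input that cannot be parsed, or that lacks a scheme or host, is
     * returned unchanged.
     */
    public static final Function<String, String> CANONICALIZE = url -> {
        try {
            URI uri = new URI(url).normalize();
            if (uri.getScheme() == null || uri.getHost() == null) {
                return url;
            }
            StringBuilder buf = new StringBuilder();
            buf.append(uri.getScheme().toLowerCase(Locale.ROOT))
               .append("://")
               .append(uri.getHost().toLowerCase(Locale.ROOT));
            if (uri.getPort() != -1) {
                buf.append(':').append(uri.getPort());
            }
            if (uri.getRawPath() != null) {
                buf.append(uri.getRawPath());
            }
            if (uri.getRawQuery() != null) {
                buf.append('?').append(uri.getRawQuery());
            }
            return buf.toString();
        } catch (URISyntaxException e) {
            // Leave unparseable input untouched rather than failing.
            return url;
        }
    };

    public static void main(String[] args) {
        System.out.println(CANONICALIZE.apply("HTTP://Example.COM/a/../b?x=1#top"));
    }
}
```

A function of this shape matches the `Function<String, String>` parameter type of the three-argument constructor, so it could be supplied there alongside the source URL and the skip collection.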
-
Method Details
-
extractLinks
Extracts all links from a document.
Parameters:
doc - The document.
Returns:
A self-reference.
-
add
Adds an externally processed URL.
Parameters:
url - The URL.
Returns:
A self-reference.
-
ignore
Adds a URL to ignore.
Parameters:
url - The URL to ignore.
-
links
Gets all links.
Returns:
The links.
-
anchors
Gets all anchors.
Returns:
The anchors.
-
images
Gets all images.
Returns:
The images.
-
videos
Gets all videos.
Returns:
The videos.
-
audio
Gets all audio.
Returns:
The audio.
-
toString
-