Class LinkExtractor

java.lang.Object
com.attribyte.parser.page.LinkExtractor

public class LinkExtractor extends Object
Extracts all links found in a document, including images.
  • Field Details

    • defaultProtocol

      public final String defaultProtocol
      The default protocol used for links that start with // or have no protocol.
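As a rough illustration of what a default protocol is for, the sketch below resolves links that start with // or carry no protocol. It is not the library's implementation: the class name DefaultProtocolSketch and the method applyDefaultProtocol are hypothetical, and the protocol is assumed to be a bare scheme name such as "https" (the page does not state the stored format). The check is deliberately simplified and does not handle relative paths or schemes without "://", such as mailto:.

```java
// Hypothetical sketch; none of these names belong to com.attribyte.parser.
class DefaultProtocolSketch {

    /**
     * Applies a default protocol to a link.
     * Assumption: defaultProtocol is a bare scheme name, e.g. "https".
     */
    static String applyDefaultProtocol(String link, String defaultProtocol) {
        if (link.startsWith("//")) {
            // Protocol-relative link: keep the authority and path, add the scheme.
            return defaultProtocol + ":" + link;
        }
        if (!link.contains("://")) {
            // No protocol at all: treat the text as host plus path.
            return defaultProtocol + "://" + link;
        }
        // Already has a protocol: leave it unchanged.
        return link;
    }
}
```

Under these assumptions, applyDefaultProtocol("//cdn.example.com/a.png", "https") yields "https://cdn.example.com/a.png", while a link that already has a protocol is returned as-is.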
  • Constructor Details

    • LinkExtractor

      public LinkExtractor(String sourceURL)
      Creates a link extractor.
      Parameters:
      sourceURL - The source URL.
    • LinkExtractor

      public LinkExtractor(String sourceURL, Collection<String> skipLinks)
      Creates a link extractor that skips the specified links.
      Parameters:
      sourceURL - The source URL.
      skipLinks - A collection of URLs for links to skip.
    • LinkExtractor

      public LinkExtractor(String sourceURL, Collection<String> skipLinks, Function<String,String> canonicalizeFn)
      Creates a link extractor that skips the specified links and canonicalizes URLs with the supplied function.
      Parameters:
      sourceURL - The source URL.
      skipLinks - A collection of URLs for links to skip.
      canonicalizeFn - A function that returns the canonical version of a URL.
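The canonicalizeFn parameter accepts any Function<String,String>. The sketch below shows one possible function of that shape, built only on java.net.URI. What counts as "canonical" is an assumption here (lower-cased scheme and host, fragment removed); the library does not prescribe these rules, and the name CanonicalizeSketch is hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical sketch of a function that could be passed as canonicalizeFn.
class CanonicalizeSketch {

    static final Function<String, String> CANONICALIZE = url -> {
        try {
            URI uri = new URI(url);
            // Leave opaque URIs (e.g. mailto:) and host-less URLs untouched.
            if (uri.isOpaque() || uri.getScheme() == null || uri.getHost() == null) {
                return url;
            }
            String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
            String host = uri.getHost().toLowerCase(Locale.ROOT);
            // Rebuild the URL with a null fragment so "#..." is dropped.
            return new URI(scheme, uri.getUserInfo(), host, uri.getPort(),
                    uri.getPath(), uri.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            // Unparseable input is returned unchanged rather than rejected.
            return url;
        }
    };
}
```

A function like this would be supplied as the third constructor argument, for example new LinkExtractor(sourceURL, skipLinks, CanonicalizeSketch.CANONICALIZE). Returning the input unchanged on failure is a design choice of this sketch, so that malformed links are not silently lost.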
  • Method Details

    • extractLinks

      public LinkExtractor extractLinks(org.jsoup.nodes.Document doc)
      Extracts all links from a document.
      Parameters:
      doc - The document.
      Returns:
      A self-reference.
    • add

      public final LinkExtractor add(String url)
      Adds an externally processed URL.
      Parameters:
      url - The URL.
      Returns:
      A self-reference.
    • ignore

      public final void ignore(String url)
      Adds a URL to ignore.
      Parameters:
      url - The URL to ignore.
    • links

      public List<String> links()
      Gets all links.
      Returns:
      The links.
    • anchors

      public List<Anchor> anchors()
      Gets all anchors.
      Returns:
      The anchors.
    • images

      public List<Image> images()
      Gets all images.
      Returns:
      The images.
    • videos

      public List<Video> videos()
      Gets all videos.
      Returns:
      The videos.
    • audio

      public List<Audio> audio()
      Gets all audio.
      Returns:
      The audio.
    • toString

      public String toString()
      Overrides:
      toString in class Object