ecologylab.net
Class ParsedURL

java.lang.Object
  extended by ecologylab.generic.Debug
      extended by ecologylab.net.ParsedURL
All Implemented Interfaces:
MimeType

public class ParsedURL
extends Debug
implements MimeType

Augments java.net.URL with features for the convenience of network programmers, for manipulating and displaying URLs. It wraps a URL rather than subclassing it, and uses lazy evaluation to minimize storage allocation.

Author:
andruid, eunyee, madhur

Field Summary
protected  java.net.URL directory
          Directory that the document referred to by the URL resides in.
protected  java.lang.String domain
           
protected  java.net.URL hashUrl
          URL with hash, that is, a reference to an anchor within the document.
protected  java.lang.String lc
           
protected  java.lang.String string
          String representation of the URL.
protected  java.lang.String suffix
           
static int TIMEOUT
           
protected  java.net.URL url
          The URL without the hash, that is, with # and anything after it stripped out.
 
Fields inherited from interface ecologylab.net.MimeType
GIF, HTML, JPG, NUM_MEDIA_MIMES, PDF, PNG, RSS, TXT, UNKNOWN_MIME
 
Constructor Summary
ParsedURL(java.io.File file)
          Create a ParsedURL from a file.
ParsedURL(java.net.URL url)
           
 
Method Summary
 PURLConnection connect()
          Create a connection, using the standard timeouts of 23 seconds and the super-basic ConnectionAdapter, which does *nothing special* when encountering directories, redirects, and so on.
 PURLConnection connect(ConnectionHelper connectionHelper)
          Create a connection, using the standard timeouts of 23 seconds.
 PURLConnection connect(ConnectionHelper connectionHelper, int connectionTimeout, int readTimeout)
          Create a connection.
 boolean crawlable()
          Use unsupportedMimes and protocolIsSupported to determine if this is content fit for processing.
static ParsedURL createFromHTML(ParsedURL contextPURL, java.lang.String addressString, boolean fromSearchPage)
          Called while processing (parsing) HTML.
 ParsedURL createFromHTML(java.lang.String addressString)
          Called while processing (parsing) HTML.
 ParsedURL createFromHTML(java.lang.String addressString, boolean fromSearchPage)
          Called while processing (parsing) HTML.
 java.net.URL directory()
          Get the URL for the directory associated with this.
 ParsedURL directoryPURL()
          Form a ParsedURL based on this, if this is a directory.
 java.lang.String directoryString()
           
 java.lang.String domain()
          Uses lazy evaluation to minimize storage allocation.
 boolean equals(java.lang.Object other)
          Return true if the other object is either a ParsedURL or a URL that refers to the same location as this.
 java.io.File file()
           
protected static ParsedURL get(java.net.URL url, java.lang.String addressString)
           
static ParsedURL getAbsolute(java.lang.String webAddr)
          Create a PURL from an absolute address.
static ParsedURL getAbsolute(java.lang.String webAddr, java.lang.String errorDescriptor)
          Create a PURL from an absolute address.
 java.lang.String getName()
          Returns the name of the file or directory denoted by this abstract pathname.
 ParsedURL getRelative(java.lang.String relativeURLPath)
          Form a ParsedURL, based on a relative path, using this as the base.
 ParsedURL getRelative(java.lang.String relativeURLPath, java.lang.String errorDescriptor)
          Form a ParsedURL, based on a relative path, using this as the base.
static ParsedURL getRelative(java.net.URL base, java.lang.String relativeURLPath, java.lang.String errorDescriptor)
          Form a new ParsedURL, relative from a supplied base URL.
 boolean getTimeout()
           
static java.net.URL getURL(java.net.URL base, java.lang.String path, java.lang.String error)
           
 int hashCode()
          Hash this by its URL.
 java.net.URL hashUrl()
           
 boolean hasSuffix(java.lang.String s)
           
 boolean isFile()
          True if this ParsedURL represents an entity on the local file system.
 boolean isHTML()
          Test type of document this refers to.
static boolean isImageSuffix(java.lang.String thatSuffix)
           
 boolean isImg()
           
 boolean isJpeg()
           
 boolean isNoAlpha()
           
 boolean isNotFileOrExists()
           
 boolean isPDF()
          Test type of document this refers to.
 boolean isRSS()
          Test type of document this refers to.
 boolean isUnsupported()
           
 java.lang.String lc()
          Uses lazy evaluation to minimize storage allocation.
 int mediaMimeIndex()
          Get Media MimeType indexes.
 int mimeIndex()
          Get MimeType index by seeing suffix().
 java.lang.String noAnchorNoQueryPageString()
           
 java.lang.String noAnchorPageString()
           
 java.lang.String pathDirectoryString()
           
 boolean protocolIsSupported()
          Check whether the protocol is supported or not.
static boolean protocolIsSupported(java.lang.String protocol)
          Check whether the protocol is supported or not.
 boolean protocolIsUnsupported()
          Check whether the protocol is supported or not.
static boolean protocolIsUnsupported(java.lang.String protocol)
          Check whether the protocol is supported or not.
 void recycle()
          Free all resources associated with this, rendering it no longer usable.
 java.lang.String removePunctuation()
           
 void resetCaches()
          Free some memory resources.
 boolean sameDomain(ParsedURL other)
           
 boolean sameHost(ParsedURL other)
           
 java.lang.String shortString()
          A shorter string for displaying in the modeline for debugging, and in popup messages.
 java.lang.String suffix()
          Uses lazy evaluation to minimize storage allocation.
static java.lang.String suffix(java.lang.String lc)
           
 boolean supportedMime()
           
 java.lang.String toString()
          Uses lazy evaluation to minimize storage allocation.
 ElementState translateFromXML(TranslationSpace translationSpace)
          Use this as the source of the XML to translate.
 java.net.URL url()
          Uses lazy evaluation to minimize storage allocation.
 ParsedURL withArgs(java.lang.String args)
          Form a new ParsedURL from this, and the args passed in.
 
Methods inherited from class ecologylab.generic.Debug
classSimpleName, closeLoggingFile, debug, debug, debug, debug, debugA, debugA, debugA, debugI, debugI, debugI, error, error, getClassName, getClassName, getInteractive, getPackageName, getPackageName, getPackageName, initialize, level, level, level, logToFile, print, print, println, println, println, println, println, println, printlnA, printlnA, printlnA, printlnI, printlnI, printlnI, printlnI, setLoggingFile, show, show, superString, toggleInteractive, toString, warning, warning, weird, weird
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

url

protected java.net.URL url
The URL without the hash, that is, with # and anything after it stripped out.


hashUrl

protected java.net.URL hashUrl
URL with hash, that is, a reference to an anchor within the document.


directory

protected java.net.URL directory
Directory that the document referred to by the URL resides in.


string

protected java.lang.String string
String representation of the URL.


lc

protected java.lang.String lc

suffix

protected java.lang.String suffix

domain

protected java.lang.String domain

TIMEOUT

public static final int TIMEOUT
See Also:
Constant Field Values
Constructor Detail

ParsedURL

public ParsedURL(java.net.URL url)

ParsedURL

public ParsedURL(java.io.File file)
Create a ParsedURL from a file. If the file is a directory, append "/" to the path, so that relative URLs will be formed properly later.

Parameters:
file -
Method Detail

isNotFileOrExists

public boolean isNotFileOrExists()
Returns:
true if this does not refer to a local file, or if it refers to a file that exists.

getAbsolute

public static ParsedURL getAbsolute(java.lang.String webAddr)
Create a PURL from an absolute address. (Do it the quick and dirty way, providing less error handling.) NB: Only call this method if you are *sure* a MalformedURLException would never be produced.


getAbsolute

public static ParsedURL getAbsolute(java.lang.String webAddr,
                                    java.lang.String errorDescriptor)
Create a PURL from an absolute address.

Parameters:
webAddr - url string
errorDescriptor - printed to the trace file if something goes wrong while converting the URL string to a URL.
Returns:
ParsedURL from url string parameter named webAddr, or null if the param is malformed.
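
The null-on-failure contract described above can be sketched with standard library classes only. This is an illustration, not the ParsedURL implementation: the class name AbsoluteParseSketch and the method parseOrNull are hypothetical, and System.err stands in for the trace file.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

class AbsoluteParseSketch {
    // Returns a URL for an absolute address, or null (after logging) when it cannot be parsed.
    static URL parseOrNull(String webAddr, String errorDescriptor) {
        try {
            return new URI(webAddr).toURL();
        } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
            // Hypothetical stand-in for the trace output mentioned in the documentation.
            System.err.println("Cannot parse '" + webAddr + "' (" + errorDescriptor + "): " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrNull("http://example.com/index.html", "demo"));
        System.out.println(parseOrNull("not a url", "demo"));
    }
}
```

Callers of such a method need a null check, which is the trade-off against the single-argument form that assumes the address is well formed.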

getRelative

public final ParsedURL getRelative(java.lang.String relativeURLPath,
                                   java.lang.String errorDescriptor)
Form a ParsedURL, based on a relative path, using this as the base.

Parameters:
relativeURLPath - Path relative to this.
errorDescriptor -
Returns:
New ParsedURL based on this and the relative path.

getRelative

public final ParsedURL getRelative(java.lang.String relativeURLPath)
Form a ParsedURL, based on a relative path, using this as the base.

Parameters:
relativeURLPath - Path relative to this.
Returns:
New ParsedURL based on this and the relative path.
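
As a rough illustration of forming an address from a base and a relative path, the sketch below uses java.net.URI.resolve. It is an assumption that ParsedURL behaves comparably; RelativeResolveSketch, resolveOrNull, and the example.com addresses are hypothetical.

```java
import java.net.URI;

class RelativeResolveSketch {
    // Resolves relativePath against base using standard URI rules; null if either is not valid syntax.
    static String resolveOrNull(String base, String relativePath) {
        try {
            return URI.create(base).resolve(relativePath).toString();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String base = "http://example.com/docs/index.html";
        System.out.println(resolveOrNull(base, "images/logo.png")); // sibling of index.html
        System.out.println(resolveOrNull(base, "../about.html"));   // one directory up
        System.out.println(resolveOrNull(base, "/top.html"));       // relative to the host root
    }
}
```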

getRelative

public static ParsedURL getRelative(java.net.URL base,
                                    java.lang.String relativeURLPath,
                                    java.lang.String errorDescriptor)
Form a new ParsedURL, relative from a supplied base URL.

Parameters:
relativeURLPath -
errorDescriptor -
Returns:
New ParsedURL

translateFromXML

public ElementState translateFromXML(TranslationSpace translationSpace)
                              throws XMLTranslationException
Use this as the source of the XML to translate.

Parameters:
translationSpace - Translations that specify package + class names for translating.
Returns:
ElementState object derived from XML at the InputStream of this.
Throws:
XMLTranslationException

getURL

public static java.net.URL getURL(java.net.URL base,
                                  java.lang.String path,
                                  java.lang.String error)

toString

public java.lang.String toString()
Uses lazy evaluation to minimize storage allocation.

Overrides:
toString in class Debug
Returns:
The URL as a String.

lc

public java.lang.String lc()
Uses lazy evaluation to minimize storage allocation.

Returns:
Lower case rendition of the URL String.
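
The lazy-evaluation pattern mentioned here and for toString(), suffix(), and domain() can be sketched as follows. This is a minimal, single-threaded illustration under assumed names (LazyLowerCaseSketch and its computations counter are hypothetical); it also shows how a cache reset in the style of resetCaches() leaves the object usable.

```java
import java.util.Locale;

class LazyLowerCaseSketch {
    private final String string; // the address as given
    private String lc;           // cached lower-case form; null until first requested
    int computations = 0;        // counts cache fills, for illustration only

    LazyLowerCaseSketch(String string) {
        this.string = string;
    }

    // Computes the lower-case form on first use, then returns the cached value.
    String lc() {
        if (lc == null) {
            lc = string.toLowerCase(Locale.ROOT);
            computations++;
        }
        return lc;
    }

    // Drops the cached value; it is rebuilt by the next call to lc().
    void resetCaches() {
        lc = null;
    }

    public static void main(String[] args) {
        LazyLowerCaseSketch sketch = new LazyLowerCaseSketch("HTTP://Example.COM/Index.HTML");
        System.out.println(sketch.lc());
        System.out.println(sketch.lc());
        System.out.println("cache fills: " + sketch.computations);
    }
}
```

Nothing is allocated for the derived string until a caller asks for it, which is the storage saving the description refers to.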

suffix

public java.lang.String suffix()
Uses lazy evaluation to minimize storage allocation.

Returns:
The suffix of the filename, in lower case.
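
One plausible way to derive a lower-case suffix is to look only at the last path segment, so that dots in the host name or query string are ignored. The sketch below assumes that interpretation; SuffixSketch is a hypothetical name, and returning an empty string when there is no suffix is an assumption, not documented behavior.

```java
import java.net.URI;
import java.util.Locale;

class SuffixSketch {
    // Returns the lower-case text after the last '.' in the final path segment, or "" if there is none.
    static String suffix(String address) {
        String path = URI.create(address).getPath();
        if (path == null) {
            return "";
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            return "";
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(suffix("http://example.com/pics/Photo.JPG"));
        System.out.println(suffix("http://example.com/a.b/page.html?x=y.z"));
    }
}
```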

directoryPURL

public ParsedURL directoryPURL()
Form a ParsedURL based on this, if this is a directory. Otherwise, form the ParsedURL from the parent of this. Process files carefully to propagate their file-ness.

Returns:

directory

public java.net.URL directory()
Get the URL for the directory associated with this. Requires looking for slash at the end, looking for a suffix or arguments. As a result, we sometimes add a slash at the end, sometimes peel off the filename. Result is cached a la lazy evaluation.

Returns:
Directory URL
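
The heuristic described above (check for a trailing slash, then for a suffix or arguments) might look roughly like the following. This is a sketch under stated assumptions, not the actual algorithm: DirectorySketch and directoryOf are hypothetical, only addresses with a host are handled, and caching is omitted.

```java
import java.net.URI;

class DirectorySketch {
    // Returns the directory address for an http-style address that has a host.
    static String directoryOf(String address) {
        URI uri = URI.create(address);
        String path = uri.getPath() == null ? "" : uri.getPath();
        String dirPath;
        if (path.endsWith("/")) {
            dirPath = path; // already a directory
        } else {
            int slash = path.lastIndexOf('/');
            String last = path.substring(slash + 1);
            // A suffix or a query suggests the last segment is a file or script, not a directory.
            boolean looksLikeFile = last.contains(".") || uri.getQuery() != null;
            dirPath = looksLikeFile ? path.substring(0, slash + 1) : path + "/";
        }
        if (dirPath.isEmpty()) {
            dirPath = "/";
        }
        return uri.getScheme() + "://" + uri.getAuthority() + dirPath;
    }

    public static void main(String[] args) {
        System.out.println(directoryOf("http://example.com/docs/index.html")); // peel off the filename
        System.out.println(directoryOf("http://example.com/docs"));            // add a slash
        System.out.println(directoryOf("http://example.com/cgi/search?q=x"));  // arguments imply a script
    }
}
```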

domain

public java.lang.String domain()
Uses lazy evaluation to minimize storage allocation.

Returns:
The domain of the URL.

suffix

public static java.lang.String suffix(java.lang.String lc)
Returns:
The suffix of the filename, in whatever case is found in the input string.

url

public final java.net.URL url()
Uses lazy evaluation to minimize storage allocation.

Returns:
the URL.

hashUrl

public final java.net.URL hashUrl()

noAnchorNoQueryPageString

public java.lang.String noAnchorNoQueryPageString()

noAnchorPageString

public java.lang.String noAnchorPageString()

hasSuffix

public final boolean hasSuffix(java.lang.String s)
Returns:
true if the suffix of this is equal to that of the argument.

createFromHTML

public ParsedURL createFromHTML(java.lang.String addressString)
Called while processing (parsing) HTML. Used to create new ParsedURLs from URL strings found in attributes such as the a element's href and the img element's src.

Also handles some special cases. For example, javascript: URLs are mined for embedded absolute URLs, if possible, and only those embedded URLs are used.

Parameters:
addressString - This may specify a relative or absolute URL.
Returns:
The resulting ParsedURL. It may be null. It will never have protocol javascript:.
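
Mining a javascript: address for an embedded absolute URL could be approximated with a regular expression, as sketched below. The pattern, the class JavascriptHrefSketch, and the method embeddedAddress are all assumptions made for illustration; the real method may use different rules.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JavascriptHrefSketch {
    private static final String PREFIX = "javascript:";

    // Matches an absolute http(s) address up to a quote, whitespace, or closing parenthesis.
    private static final Pattern EMBEDDED = Pattern.compile("https?://[^'\"\\s)]+");

    // For javascript: addresses, returns the first embedded absolute address, or null if none.
    // Any other address is passed through unchanged.
    static String embeddedAddress(String href) {
        if (!href.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
            return href;
        }
        Matcher matcher = EMBEDDED.matcher(href);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(embeddedAddress("javascript:openWin('http://example.com/page.html')"));
        System.out.println(embeddedAddress("javascript:void(0)"));
    }
}
```

Returning null when nothing usable is found matches the documented possibility of a null result, and the returned address never carries the javascript: protocol.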

createFromHTML

public ParsedURL createFromHTML(java.lang.String addressString,
                                boolean fromSearchPage)
Called while processing (parsing) HTML. Used to create new ParsedURLs from URL strings found in attributes such as the a element's href and the img element's src.

Also handles some special cases. For example, javascript: URLs are mined for embedded absolute URLs, if possible, and only those embedded URLs are used.

Parameters:
addressString - This may specify a relative or absolute URL.
fromSearchPage - If false, then add / to the end of the URL if it seems to be a directory.
Returns:
The resulting ParsedURL. It may be null. It will never have protocol javascript:.

get

protected static ParsedURL get(java.net.URL url,
                               java.lang.String addressString)

createFromHTML

public static ParsedURL createFromHTML(ParsedURL contextPURL,
                                       java.lang.String addressString,
                                       boolean fromSearchPage)
Called while processing (parsing) HTML. Used to create new ParsedURLs from URL strings found in attributes such as the a element's href and the img element's src.

Also handles some special cases. For example, javascript: URLs are mined for embedded absolute URLs, if possible, and only those embedded URLs are used.

Parameters:
addressString - This may specify a relative or absolute URL.
fromSearchPage - If false, then add / to the end of the URL if it seems to be a directory.
Returns:
The resulting ParsedURL. It may be null. It will never have protocol javascript:.

removePunctuation

public java.lang.String removePunctuation()
Returns:
A String version of the URL path, in which all punctuation characters have been changed into spaces.
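
A minimal sketch of the transformation, assuming "punctuation" means the POSIX punctuation class that java.util.regex exposes as \p{Punct}. The class name PunctuationSketch is hypothetical, and the real method may treat a different character set as punctuation.

```java
class PunctuationSketch {
    // Replaces every ASCII punctuation character in the path with a single space.
    static String removePunctuation(String path) {
        return path.replaceAll("\\p{Punct}", " ");
    }

    public static void main(String[] args) {
        System.out.println(removePunctuation("/docs/my-file_v2.html"));
    }
}
```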

sameDomain

public boolean sameDomain(ParsedURL other)
Returns:
true if they have same domains. false if they have different domains.

sameHost

public boolean sameHost(ParsedURL other)
Returns:
true if they have same hosts. false if they have different hosts.
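
The difference between comparing hosts and comparing domains can be sketched as below. The definition of "domain" is not given in this documentation; the sketch assumes it means the last two dot-separated labels of the host, which is only an approximation (it is wrong for suffixes such as co.uk). HostDomainSketch and its methods are hypothetical names.

```java
import java.net.URI;
import java.util.Locale;

class HostDomainSketch {
    // Lower-case host name of the address, or "" if it has none.
    static String host(String address) {
        String h = URI.create(address).getHost();
        return h == null ? "" : h.toLowerCase(Locale.ROOT);
    }

    // Assumption: the domain is the last two dot-separated labels of the host.
    static String domain(String address) {
        String host = host(address);
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    static boolean sameHost(String a, String b) {
        return host(a).equals(host(b));
    }

    static boolean sameDomain(String a, String b) {
        return domain(a).equals(domain(b));
    }

    public static void main(String[] args) {
        String a = "http://www.example.edu/a";
        String b = "http://cs.example.edu/b";
        System.out.println("same host: " + sameHost(a, b));
        System.out.println("same domain: " + sameDomain(a, b));
    }
}
```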

crawlable

public boolean crawlable()
Use unsupportedMimes and protocolIsSupported to determine if this is content fit for processing.

Returns:
true if this seems to be a web address we can crawl to (currently, that means HTML).

protocolIsSupported

public boolean protocolIsSupported()
Check whether the protocol is supported or not. Currently, only http and ftp are.


protocolIsSupported

public static boolean protocolIsSupported(java.lang.String protocol)
Check whether the protocol is supported or not. Currently, only http and ftp are.
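
A sketch of the check as described, with only http and ftp accepted. Treating the protocol name case-insensitively and rejecting null are assumptions; ProtocolSketch is a hypothetical class name.

```java
import java.util.Locale;

class ProtocolSketch {
    // True only for the two protocols the documentation names as supported.
    static boolean protocolIsSupported(String protocol) {
        if (protocol == null) {
            return false;
        }
        String p = protocol.toLowerCase(Locale.ROOT);
        return p.equals("http") || p.equals("ftp");
    }

    // The logical complement of protocolIsSupported.
    static boolean protocolIsUnsupported(String protocol) {
        return !protocolIsSupported(protocol);
    }

    public static void main(String[] args) {
        System.out.println(protocolIsSupported("http"));
        System.out.println(protocolIsUnsupported("mailto"));
    }
}
```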


protocolIsUnsupported

public boolean protocolIsUnsupported()
Check whether the protocol is supported or not. Currently, only http and ftp are.


protocolIsUnsupported

public static boolean protocolIsUnsupported(java.lang.String protocol)
Check whether the protocol is supported or not. Currently, only http and ftp are.


isImg

public boolean isImg()
Returns:
true if this is an image file.

isImageSuffix

public static boolean isImageSuffix(java.lang.String thatSuffix)
Parameters:
thatSuffix -
Returns:
true if the suffix passed in is one for an image type that we can handle.

isJpeg

public boolean isJpeg()
Returns:
true if this is a JPEG image file.

isNoAlpha

public boolean isNoAlpha()
Returns:
true if we can tell the image file won't have alpha, just from its suffix. This is currently the case for jpeg and bmp.

isHTML

public boolean isHTML()
Test type of document this refers to.

Returns:
true if this refers to an HTML file

isPDF

public boolean isPDF()
Test type of document this refers to.

Returns:
true if this refers to a PDF file

isRSS

public boolean isRSS()
Test type of document this refers to.

Returns:
true if this refers to an RSS feed

mimeIndex

public int mimeIndex()
Get MimeType index by seeing suffix().


mediaMimeIndex

public int mediaMimeIndex()
Get Media MimeType indexes. Media MimeTypes are currently text and all kinds of images such as JPG, GIF, and PNG.


isUnsupported

public boolean isUnsupported()

supportedMime

public boolean supportedMime()

directoryString

public java.lang.String directoryString()
Returns:
The directory of this, with protocol and host.

pathDirectoryString

public java.lang.String pathDirectoryString()
Returns:
The directory of this, without protocol and host.

equals

public boolean equals(java.lang.Object other)
Return true if the other object is either a ParsedURL or a URL that refers to the same location as this. Note: this is our own implementation. It is *much* faster and slightly less careful than JavaSoft's. Checks port, host, file, protocol, and query. Ignores ref = hash.

Overrides:
equals in class java.lang.Object
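
The speed difference comes from the fact that java.net.URL.equals may resolve host names to IP addresses, which can block on the network. A field-by-field comparison avoids that. The sketch below compares the parts listed above and ignores the ref; UrlEqualitySketch, url, and sameLocation are hypothetical names, and this is not the ParsedURL source.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Objects;

class UrlEqualitySketch {
    // Convenience factory that turns the checked exception into an unchecked one.
    static URL url(String address) {
        try {
            return URI.create(address).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Compares port, host, file (path plus query), and protocol as plain values.
    // The ref (the part after #) is never consulted, and no host lookup is made.
    static boolean sameLocation(URL a, URL b) {
        return a.getPort() == b.getPort()
            && Objects.equals(a.getHost(), b.getHost())
            && Objects.equals(a.getFile(), b.getFile())
            && Objects.equals(a.getProtocol(), b.getProtocol());
    }

    public static void main(String[] args) {
        URL top = url("http://example.com/a.html#top");
        URL end = url("http://example.com/a.html#end");
        System.out.println(sameLocation(top, end));
    }
}
```

The "slightly less careful" part shows up here too: two host names that resolve to the same machine are treated as different locations.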

hashCode

public int hashCode()
Hash this by its URL.

Overrides:
hashCode in class java.lang.Object

shortString

public java.lang.String shortString()
A shorter string for displaying in the modeline for debugging, and in popup messages.


isFile

public boolean isFile()
True if this ParsedURL represents an entity on the local file system.

Returns:
true if this is a local File object.

file

public java.io.File file()
Returns:
The file system object associated with this, if this is an entity on the local file system, or null, otherwise.

withArgs

public ParsedURL withArgs(java.lang.String args)
Form a new ParsedURL from this, and the args passed in. A question mark is appended to the String form of this, and then args are appended.

Parameters:
args -
Returns:
ParsedURL with args after ?

getName

public java.lang.String getName()
Returns the name of the file or directory denoted by this abstract pathname. This is just the last name in the pathname's name sequence. If the pathname's name sequence is empty, then the empty string is returned.

Analogous to File.getName().

Returns:
Name of this, without directory, host, or protocol.
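
A sketch of extracting the last name in the path. Ignoring one trailing slash, so that a directory address yields the directory's own name as File.getName() does, is an assumption about the intended behavior; NameSketch is a hypothetical class name.

```java
import java.net.URI;

class NameSketch {
    // Returns the last name in the path, or "" when the path is empty.
    static String name(String address) {
        String path = URI.create(address).getPath();
        if (path == null) {
            return "";
        }
        // Ignore one trailing slash so that a directory yields its own name.
        if (path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(name("http://example.com/docs/index.html"));
        System.out.println(name("http://example.com/docs/"));
    }
}
```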

connect

public PURLConnection connect()
Create a connection, using the standard timeouts of 23 seconds and the super-basic ConnectionAdapter, which does *nothing special* when encountering directories, redirects, and so on.

Returns:

connect

public PURLConnection connect(ConnectionHelper connectionHelper)
Create a connection, using the standard timeouts of 23 seconds.

Parameters:
connectionHelper -
Returns:

connect

public PURLConnection connect(ConnectionHelper connectionHelper,
                              int connectionTimeout,
                              int readTimeout)
Create a connection.

Parameters:
connectionHelper -
connectionTimeout -
readTimeout -
Returns:
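
For comparison, separate connection and read timeouts can be applied to a plain java.net.URLConnection as sketched below. PURLConnection and ConnectionHelper are not modeled; TimeoutConnectSketch and prepare are hypothetical names, and the assumption that the 23-second default applies to both timeouts is taken from the descriptions above. URLConnection expects milliseconds.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

class TimeoutConnectSketch {
    // 23 seconds, expressed in the milliseconds that URLConnection expects.
    static final int DEFAULT_TIMEOUT_MILLIS = 23_000;

    // Prepares a connection object with the given timeouts. No network traffic
    // happens until the caller connects or asks for a stream.
    static URLConnection prepare(String address, int connectTimeoutMillis, int readTimeoutMillis) {
        try {
            URLConnection connection = URI.create(address).toURL().openConnection();
            connection.setConnectTimeout(connectTimeoutMillis);
            connection.setReadTimeout(readTimeoutMillis);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience overload that applies the default to both timeouts.
    static URLConnection prepare(String address) {
        return prepare(address, DEFAULT_TIMEOUT_MILLIS, DEFAULT_TIMEOUT_MILLIS);
    }

    public static void main(String[] args) {
        URLConnection connection = prepare("http://example.com/");
        System.out.println(connection.getConnectTimeout() + " ms connect, "
            + connection.getReadTimeout() + " ms read");
    }
}
```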

getTimeout

public boolean getTimeout()

resetCaches

public void resetCaches()
Free some memory resources. They can be re-allocated through subsequent lazy evaluation. The object is still fully functional after this call.


recycle

public void recycle()
Free all resources associated with this, rendering it no longer usable.