org.apache.nutch.crawl
Class Generator
java.lang.Object
org.apache.hadoop.conf.Configured
org.apache.nutch.crawl.Generator
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool
public class Generator
- extends org.apache.hadoop.conf.Configured
- implements org.apache.hadoop.util.Tool
Generates a subset of a crawl db to fetch.
Method Summary
org.apache.hadoop.fs.Path generate(org.apache.hadoop.fs.Path dbDir, org.apache.hadoop.fs.Path segments, int numLists, long topN, long curTime)
- Generate fetchlists in a segment.
org.apache.hadoop.fs.Path generate(org.apache.hadoop.fs.Path dbDir, org.apache.hadoop.fs.Path segments, int numLists, long topN, long curTime, boolean filter, boolean force)
- Generate fetchlists in a segment.
static String generateSegmentName()
static void main(String[] args)
- Generate a fetchlist from the crawldb.
int run(String[] args)
Methods inherited from class org.apache.hadoop.conf.Configured
- getConf, setConf
Methods inherited from class java.lang.Object
- clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.hadoop.conf.Configurable
- getConf, setConf
CRAWL_GENERATE_FILTER
public static final String CRAWL_GENERATE_FILTER
- See Also:
- Constant Field Values
GENERATE_MAX_PER_HOST_BY_IP
public static final String GENERATE_MAX_PER_HOST_BY_IP
- See Also:
- Constant Field Values
GENERATE_MAX_PER_HOST
public static final String GENERATE_MAX_PER_HOST
- See Also:
- Constant Field Values
GENERATE_UPDATE_CRAWLDB
public static final String GENERATE_UPDATE_CRAWLDB
- See Also:
- Constant Field Values
CRAWL_TOP_N
public static final String CRAWL_TOP_N
- See Also:
- Constant Field Values
CRAWL_GEN_CUR_TIME
public static final String CRAWL_GEN_CUR_TIME
- See Also:
- Constant Field Values
CRAWL_GEN_DELAY
public static final String CRAWL_GEN_DELAY
- See Also:
- Constant Field Values
LOG
public static final org.apache.commons.logging.Log LOG
Generator
public Generator()
Generator
public Generator(org.apache.hadoop.conf.Configuration conf)
generate
public org.apache.hadoop.fs.Path generate(org.apache.hadoop.fs.Path dbDir,
org.apache.hadoop.fs.Path segments,
int numLists,
long topN,
long curTime)
throws IOException
- Generate fetchlists in a segment. Whether to filter URLs or not is
read from the crawl.generate.filter property in the configuration
files. If the property is not found, the URLs are filtered.
- Parameters:
dbDir - Crawl database directory
segments - Segments directory
numLists - Number of reduce tasks
topN - Number of top URLs to be selected
curTime - Current time in milliseconds
- Returns:
- Path to generated segment, or null if no entries were selected
- Throws:
IOException - When an I/O error occurs
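The default described for crawl.generate.filter (filter unless the property says otherwise) can be illustrated with a small standalone sketch. It uses java.util.Properties as a stand-in for the Hadoop configuration object; the class name FilterDefaultSketch and the helper shouldFilter are hypothetical and are not part of Nutch.

```java
import java.util.Properties;

public class FilterDefaultSketch {
    // Property name taken from the method description on this page.
    static final String FILTER_KEY = "crawl.generate.filter";

    // Hypothetical helper: returns true when the property is absent,
    // mirroring "if the property is not found, the URLs are filtered".
    static boolean shouldFilter(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(FILTER_KEY, "true"));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(shouldFilter(conf)); // property missing

        conf.setProperty(FILTER_KEY, "false");
        System.out.println(shouldFilter(conf)); // property set explicitly
    }
}
```

Callers that need to control filtering directly, rather than through configuration, would presumably use the overload that takes an explicit filter argument.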
generate
public org.apache.hadoop.fs.Path generate(org.apache.hadoop.fs.Path dbDir,
org.apache.hadoop.fs.Path segments,
int numLists,
long topN,
long curTime,
boolean filter,
boolean force)
throws IOException
- Generate fetchlists in a segment.
- Returns:
- Path to generated segment or null if no entries were selected.
- Throws:
IOException
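The topN parameter is documented only as the "number of top URLs to be selected". As a rough illustration of that idea, the sketch below picks the N highest-ranked entries from an in-memory map. It assumes URLs are ranked by a numeric score, which this page does not state; the class TopNSketch, the method selectTopN, and the example scores are all hypothetical, and the real Generator runs as a distributed job rather than over a local map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TopNSketch {
    // Hypothetical helper: returns at most topN URLs, highest score first.
    static List<String> selectTopN(Map<String, Double> scores, long topN) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        // Sort by descending score; break ties by URL so the order is deterministic.
        entries.sort((a, b) -> {
            int byScore = Double.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Double> e : entries) {
            if (selected.size() >= topN) {
                break; // stop once topN entries have been taken
            }
            selected.add(e.getKey());
        }
        return selected;
    }
}
```

If fewer than topN entries are available, the sketch simply returns all of them.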
generateSegmentName
public static String generateSegmentName()
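This page does not describe the format of the name returned by generateSegmentName(). Purely as an illustration, one common way to produce unique, sortable directory names is to format a timestamp. The sketch below assumes that scheme; the class SegmentNameSketch, the method segmentNameFor, and the pattern string are assumptions, not taken from this documentation.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class SegmentNameSketch {
    // Hypothetical helper: formats a time in milliseconds as a 14-digit name.
    // Locale.ROOT keeps the digits ASCII regardless of the default locale.
    static String segmentNameFor(long timeMillis) {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmmss", Locale.ROOT);
        return format.format(new Date(timeMillis));
    }

    public static void main(String[] args) {
        System.out.println(segmentNameFor(System.currentTimeMillis()));
    }
}
```

Names built this way sort lexicographically in chronological order, which is convenient when listing a segments directory.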
main
public static void main(String[] args)
throws Exception
- Generate a fetchlist from the crawldb.
- Throws:
Exception
run
public int run(String[] args)
throws Exception
- Specified by:
run
in interface org.apache.hadoop.util.Tool
- Throws:
Exception
Copyright © 2006 The Apache Software Foundation