Yeah, I guess you're right about what web harvesting is... in fact, I had actually asked for this forum to be put up.
One of the Instructional Designers was asking me about it, and I was fascinated by the topic, so I wanted to see what others had to say about it. I have to admit I don't know much about web harvesting; the little I do know is probably the same as you, Tim.
I see the web harvesting concept as search engine optimization and RSS working closely together...
"Search engines are a big help, but they can do only part of the work, and they are hard-pressed to keep up with daily changes. For all the power of Google and its kin, all that search engines can do is locate information and point to it. They go only two or three levels deep into a Web site to find information and then return URLs. They also find and return meta descriptions and meta keywords embedded in Web pages, but these may well be inaccurate.
Consider that even when you use a search engine to locate data, you still have to do the following tasks to capture the information you need:
Scan the content until you find the information.
Mark the information (usually by highlighting with a mouse).
Switch to another application (such as a spreadsheet, database or word processor).
Paste the information into that application.
A better solution, especially for companies that are aiming to exploit a broad swath of data about markets or competitors, lies with Web harvesting tools."(extracted from:
www.computerworld.com/databasetopics/businessintelligence/story/0,10801,93919,00.html)
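Those four manual steps (scan, mark, switch applications, paste) are exactly what a harvesting script automates. Here is a minimal sketch in Python using only the standard library; the sample HTML, product names, and column headers are all invented for illustration, and in a real harvest the page would come from an HTTP fetch rather than an inline string:

```python
import csv
import io
from html.parser import HTMLParser

# Sample page content -- in a real harvest this would come from
# urllib.request.urlopen(url).read().decode(); the layout and data
# here are made up for illustration.
SAMPLE_HTML = """
<table>
  <tr><td>Acme Widget</td><td>19.99</td></tr>
  <tr><td>Beta Gadget</td><td>24.50</td></tr>
</table>
"""

class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1].append(data.strip())

parser = CellCollector()
parser.feed(SAMPLE_HTML)

# "Paste the information into that application" -- here, a CSV that a
# spreadsheet or database can import directly.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["product", "price"])
writer.writerows(parser.rows)
print(out.getvalue().strip())
```

The scan and mark steps become the parser, and the switch-and-paste steps become the CSV writer, which is the basic shape of most harvesting tools.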
Does anyone know what some of the programs are that support web harvesting, or that actually do web harvesting?
So…what should web harvesting tools contain?
Well, according to SCIP.Online, the next generation of web harvesting tools will contain (or should contain):
"
Format flexibility
Web content is no longer just HTML. It resides in a variety of formats - .pdf, .doc, .xls, etc. Once this content is harvested, users want it in the format of their choice, not the tool vendor’s choice (most vendors are in love with XML). Ideally, a web harvesting tool should extract information from internal and external sources, not just HTML, and transform it into a variety of structured formats, not just XML.
“Deep web” access
Search engines and first generation web harvesting tools can follow static links, but that’s no longer good enough. Many sites now hide information behind dynamic links generated from web forms or embedded in scripting languages. This hidden information is often referred to as the “deep web”. To reach the deep web, tools must automatically discover and fill out web forms and synthesize dynamic links.
Output consolidation
A CI professional may need to collect information from multiple sites, but wants it consolidated into a single report. Taking it a step further, she may want the harvested information incorporated into an existing database or spreadsheet. Web harvesting tools need to be flexible enough to bring information together from multiple harvests and/or multiple sites into a single format or existing template or worksheet.
Logical pinpointing
Most first generation tools require the user to first physically pinpoint (usually with a mouse) the information they want harvested. Then the tool remembers the navigation path and screen position so the extraction is automated for future harvests. It's easy, but useful only to repetitively harvest information from a small, fixed number of unchanging pages.
Logical pinpointing, on the other hand, systematically finds the information by identifying it based on defined criteria (such as a Boolean expression). While it may not be as easy as physical pinpointing, logical pinpointing is very efficient when the information resides on a variable number of changing pages.
Anonymity
Many sites have implemented measures to thwart unwanted visitors. To get around this, some organizations anonymize their digital identities. The same precautions should be taken when using web harvesting tools. Anonymization capabilities should be integrated into the web harvesting tool, not left to the user to figure out.
Performance on demand
Some sites are very slow, particularly during busy periods, so getting all the information needed within a limited time window may be impossible. To address this, a web harvesting tool either needs to provide performance on demand by initiating and managing simultaneous harvests or bail out and recommend or schedule the harvest for a less busy time.
Scheduling
Scheduled harvesting is very useful for two reasons:
Running harvests at night can have information organized and ready for analysis in the morning. The productivity of information analysts should increase if collection time is replaced by analysis time.
Some sites are updated periodically throughout the day and night. If the information is time critical, it should be harvested as close to the time of update as possible so managers can be alerted to changing business conditions.
On some sites information is updated continually throughout the day (i.e. stock quotes) so the tool should constantly keep an eye on the site. On other sites information is updated at appointed times (i.e. power generation metrics) so the tool may be scheduled to harvest within minutes of the scheduled update.
Most CI departments can benefit from a web harvesting tool, but before succumbing to the marketing hype, make sure the features are there to support your requirements and deal with the complexity of today’s modern web sites."
(extracted from:
www.imakenews.com/scip2/e_article000093294.cfm?x=213559,0%252)
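On the “deep web” access point above: programmatically filling out a web form mostly means reproducing the POST request a browser would send when the form is submitted. A rough standard-library sketch, where the endpoint URL and field names are assumptions for illustration, not a real service:

```python
import urllib.parse
import urllib.request

# Hypothetical search form -- the URL and field names are invented;
# a real harvester would first discover them by reading the page's
# <form> markup.
FORM_URL = "https://example.com/search"
form_fields = {"query": "web harvesting", "max_results": "50"}

# Encode the fields exactly as a browser would for a POST submission.
body = urllib.parse.urlencode(form_fields).encode("ascii")
request = urllib.request.Request(FORM_URL, data=body, method="POST")

# urllib.request.urlopen(request) would submit the form and return the
# dynamically generated result page; we stop short of the network call here.
print(request.get_method(), request.full_url)
print(body.decode("ascii"))
```

Discovering the form fields automatically, as the article suggests, is the hard part; submitting them is the easy part shown here.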
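The logical pinpointing idea can be sketched as a predicate applied to every candidate block of harvested text, instead of a remembered mouse position on a fixed page. The snippets and the Boolean criterion below are invented examples:

```python
import re

# Candidate text blocks as they might come out of a batch of harvested
# pages -- invented examples for illustration.
snippets = [
    "Acme Corp announced a merger with Widgets Ltd.",
    "Weather today: sunny with light winds.",
    "Beta Inc reported a merger and an acquisition spree.",
    "Acme Corp opened a new office.",
]

# "Defined criteria (such as a Boolean expression)" -- here, roughly
# the query (merger OR acquisition) AND (Acme OR Beta).
def matches(text):
    topic = re.search(r"\bmerger\b|\bacquisition\b", text, re.I)
    company = re.search(r"\bAcme\b|\bBeta\b", text)
    return bool(topic and company)

hits = [s for s in snippets if matches(s)]
for h in hits:
    print(h)
```

Because the criterion describes the information rather than its location, the same predicate keeps working when pages are added, removed, or reorganized.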
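For the scheduling case where a site updates at appointed times, the core logic is just finding the next update time after now and harvesting shortly after it. A small sketch of that calculation, with update times and lag chosen arbitrarily for illustration:

```python
from datetime import datetime, time, timedelta

# Appointed daily update times (e.g. power generation metrics posted
# at fixed hours) -- invented for illustration.
UPDATE_TIMES = [time(6, 0), time(12, 0), time(18, 0)]
LAG = timedelta(minutes=5)  # harvest a few minutes after each update

def next_harvest(now):
    """Return the next scheduled harvest datetime strictly after `now`."""
    for t in sorted(UPDATE_TIMES):
        candidate = datetime.combine(now.date(), t) + LAG
        if candidate > now:
            return candidate
    # All of today's updates have passed; take tomorrow's first one.
    first = sorted(UPDATE_TIMES)[0]
    return datetime.combine(now.date() + timedelta(days=1), first) + LAG

print(next_harvest(datetime(2024, 5, 1, 13, 30)))
```

The continuously updating case (stock quotes) would instead poll in a loop, but for appointed-time updates this compute-and-sleep approach keeps the harvester off the site except when new data is expected.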
Well, I would love to hear what others think about this concept and its use in education.
Thanx
Javed