Yeah, I guess you're right about what web harvesting is... in fact, I had actually asked for this forum to be put up.
One of the Instructional Designers was asking me about it, and I was fascinated by the topic, so I wanted to see what others had to say about it. I have to admit I don't know much about web harvesting; the little I do know is probably the same as you, Tim.
I see the web harvesting concept as search engine optimization and RSS working closely together...
"Search engines are a big help, but they can do only part of the work, and they are hard-pressed to keep up with daily changes. For all the power of Google and its kin, all that search engines can do is locate information and point to it. They go only two or three levels deep into a Web site to find information and then return URLs. They also find and return meta descriptions and meta keywords embedded in Web pages, but these may well be inaccurate.
Consider that even when you use a search engine to locate data, you still have to do the following tasks to capture the information you need:
Scan the content until you find the information.
Mark the information (usually by highlighting with a mouse).
Switch to another application (such as a spreadsheet, database or word processor).
Paste the information into that application.
A better solution, especially for companies that are aiming to exploit a broad swath of data about markets or competitors, lies with Web harvesting tools."(extracted from:
www.computerworld.com/databasetopics/businessintelligence/story/0,10801,93919,00.html)
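Those four manual steps (scan, mark, switch applications, paste) are exactly what a harvesting script automates. Here is a minimal sketch in Python using only the standard library; the sample HTML, product names, and column headers are all invented for illustration, and in a real harvest the page would come from an HTTP fetch rather than an inline string:

```python
import csv
import io
from html.parser import HTMLParser

# Sample page content -- in a real harvest this would come from
# urllib.request.urlopen(url).read().decode(); the layout and data
# here are made up for illustration.
SAMPLE_HTML = """
<table>
  <tr><td>Acme Widget</td><td>19.99</td></tr>
  <tr><td>Beta Gadget</td><td>24.50</td></tr>
</table>
"""

class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1].append(data.strip())

parser = CellCollector()
parser.feed(SAMPLE_HTML)

# "Paste the information into that application" -- here, a CSV that a
# spreadsheet or database can import directly.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["product", "price"])
writer.writerows(parser.rows)
print(out.getvalue().strip())
```

The scan and mark steps become the parser, and the switch-and-paste steps become the CSV writer, which is the basic shape of most harvesting tools.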
Does anyone know what some of the programs are that support web harvesting, or that actually do web harvesting?
So…what should web harvesting tools contain?
Well, according to SCIP.Online, the next generation of web harvesting tools will contain (or should contain):
"
Format flexibility
Web content is no longer just HTML. It resides in a variety of formats - .pdf, .doc, .xls, etc. Once this content is harvested, users want it in the format of their choice, not the tool vendor’s choice (most vendors are in love with XML). Ideally, a web harvesting tool should extract information from internal and external sources, not just HTML, and transform it into a variety of structured formats, not just XML.
“Deep web” access
Search engines and first generation web harvesting tools can follow static links, but that’s no longer good enough. Many sites now hide information behind dynamic links generated from web forms or embedded in scripting languages. This hidden information is often referred to as the “deep web”. To reach the deep web, tools must automatically discover and fill out web forms and synthesize dynamic links.
Output consolidation
A CI professional may need to collect information from multiple sites, but wants it consolidated into a single report. Taking it a step further, she may want the harvested information incorporated into an existing database or spreadsheet. Web harvesting tools need to be flexible enough to bring information together from multiple harvests and/or multiple sites into a single format or existing template or worksheet.
Logical pinpointing
Most first generation tools require the user to first physically pinpoint (usually with a mouse) the information they want harvested. Then the tool remembers the navigation path and screen position so the extraction is automated for future harvests. It's easy, but useful only to repetitively harvest information from a small, fixed number of unchanging pages.
Logical pinpointing, on the other hand, systematically finds the information by identifying it based on defined criteria (such as a Boolean expression). While it may not be as easy as physical pinpointing, logical pinpointing is very efficient when the information resides on a variable number of changing pages.
Anonymity
Many sites have implemented measures to thwart unwanted visitors. To get around this, some organizations anonymize their digital identities. The same precautions should be taken when using web harvesting tools. Anonymization capabilities should be integrated into the web harvesting tool, not left to the user to figure out.
Performance on demand
Some sites are very slow, particularly during busy periods, so getting all the information needed within a limited time window may be impossible. To address this, a web harvesting tool either needs to provide performance on demand by initiating and managing simultaneous harvests or bail out and recommend or schedule the harvest for a less busy time.
Scheduling
Scheduled harvesting is very useful for two reasons:
Running harvests at night can have information organized and ready for analysis in the morning. The productivity of information analysts should increase if collection time is replaced by analysis time.
Some sites are updated periodically throughout the day and night. If the information is time critical, it should be harvested as close to the time of update as possible so managers can be alerted to changing business conditions.
On some sites information is updated continually throughout the day (i.e. stock quotes) so the tool should constantly keep an eye on the site. On other sites information is updated at appointed times (i.e. power generation metrics) so the tool may be scheduled to harvest within minutes of the scheduled update.
Most CI departments can benefit from a web harvesting tool, but before succumbing to the marketing hype, make sure the features are there to support your requirements and deal with the complexity of today’s modern web sites."
(extracted from:
www.imakenews.com/scip2/e_article000093294.cfm?x=213559,0%252)
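On the “deep web” access point above: programmatically filling out a web form mostly means reproducing the POST request a browser would send when the form is submitted. A rough standard-library sketch, where the endpoint URL and field names are assumptions for illustration, not a real service:

```python
import urllib.parse
import urllib.request

# Hypothetical search form -- the URL and field names are invented;
# a real harvester would first discover them by reading the page's
# <form> markup.
FORM_URL = "https://example.com/search"
form_fields = {"query": "web harvesting", "max_results": "50"}

# Encode the fields exactly as a browser would for a POST submission.
body = urllib.parse.urlencode(form_fields).encode("ascii")
request = urllib.request.Request(FORM_URL, data=body, method="POST")

# urllib.request.urlopen(request) would submit the form and return the
# dynamically generated result page; we stop short of the network call here.
print(request.get_method(), request.full_url)
print(body.decode("ascii"))
```

Discovering the form fields automatically, as the article suggests, is the hard part; submitting them is the easy part shown here.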
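The logical pinpointing idea can be sketched as a predicate applied to every candidate block of harvested text, instead of a remembered mouse position on a fixed page. The snippets and the Boolean criterion below are invented examples:

```python
import re

# Candidate text blocks as they might come out of a batch of harvested
# pages -- invented examples for illustration.
snippets = [
    "Acme Corp announced a merger with Widgets Ltd.",
    "Weather today: sunny with light winds.",
    "Beta Inc reported a merger and an acquisition spree.",
    "Acme Corp opened a new office.",
]

# "Defined criteria (such as a Boolean expression)" -- here, roughly
# the query (merger OR acquisition) AND (Acme OR Beta).
def matches(text):
    topic = re.search(r"\bmerger\b|\bacquisition\b", text, re.I)
    company = re.search(r"\bAcme\b|\bBeta\b", text)
    return bool(topic and company)

hits = [s for s in snippets if matches(s)]
for h in hits:
    print(h)
```

Because the criterion describes the information rather than its location, the same predicate keeps working when pages are added, removed, or reorganized.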
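For the scheduling case where a site updates at appointed times, the core logic is just finding the next update time after now and harvesting shortly after it. A small sketch of that calculation, with update times and lag chosen arbitrarily for illustration:

```python
from datetime import datetime, time, timedelta

# Appointed daily update times (e.g. power generation metrics posted
# at fixed hours) -- invented for illustration.
UPDATE_TIMES = [time(6, 0), time(12, 0), time(18, 0)]
LAG = timedelta(minutes=5)  # harvest a few minutes after each update

def next_harvest(now):
    """Return the next scheduled harvest datetime strictly after `now`."""
    for t in sorted(UPDATE_TIMES):
        candidate = datetime.combine(now.date(), t) + LAG
        if candidate > now:
            return candidate
    # All of today's updates have passed; take tomorrow's first one.
    first = sorted(UPDATE_TIMES)[0]
    return datetime.combine(now.date() + timedelta(days=1), first) + LAG

print(next_harvest(datetime(2024, 5, 1, 13, 30)))
```

The continuously updating case (stock quotes) would instead poll in a loop, but for appointed-time updates this compute-and-sleep approach keeps the harvester off the site except when new data is expected.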
Well, I would love to hear what others think about this concept and its use in education.
Thanx
Javed