From: Nat Friedman
To: Miguel de Icaza, Matt Loper
Subject: Structured data from arbitrary files.
Date: 13 Feb 2003 02:44:38 -0500

My dashboard needs a programmatic way of extracting common structured data fields from arbitrary files.

For a given file, I would like to know a bunch of different information.

Base information:

    * What type it is
    * What icon corresponds to that type
    * Who wrote it, when
    * Who modified it, when
    * What its subject is
    * What keywords are associated with it
    * What its title is
    * If there is a summary, what is it
    * If it has an outline structure, a full extraction of that structure and the heading names
    * If it contains a set of files, what are they?

More advanced, source information:

    * If it was emailed to me, who sent it? What was the subject of the email and who was on the Cc?
    * If I downloaded it, where from?
    * If it has embedded or linked objects, what are they?
    * If I cut-and-pasted from web sites or other documents or emails to create it, where from?

File-specific information:

    * For photographs, all the various EXIF data.
    * For documents, all the text in the documents.
    * For web pages, all the objects it links to
    ...

Some of the basic data is really MIME-type information; that's the minimum I'd need, and it can be gathered with gnome-vfs (which currently has no Mono bindings). The other information is currently not available in any centralized, structured way.

This problem seems to demand a new library with dynamically-loaded backends. It seems like this library would be useful in other contexts as well, e.g. to a file manager or document browser. It's also highly parallelizable, once you build a framework.

I think this could be built easily. The API could be just:

    XML get_file_information (uri)

There would be a set of standard field names for the XML. Per-filetype extensions could be created too.
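As a rough sketch of the shape of that API (in Python purely for illustration, not as a proposal for the implementation language): the backend registry, the `backend` decorator, the toy plain-text backend, and the field names `type`, `modified`, `title`, and `text` are all assumptions made up for this example. A plain dictionary stands in for the dynamically-loaded backends, and only local paths and file:// URIs are handled.

```python
import mimetypes
import os
import xml.etree.ElementTree as ET
from urllib.parse import urlparse
from urllib.request import url2pathname

# Hypothetical registry mapping a MIME type to its backend function.
# A real framework would load these dynamically; a dict stands in here.
BACKENDS = {}


def backend(mime_type):
    """Register the decorated function as the backend for one MIME type."""
    def register(func):
        BACKENDS[mime_type] = func
        return func
    return register


@backend("text/plain")
def plain_text_backend(path):
    """Toy backend: first non-empty line as the title, whole file as text."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        text = handle.read()
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return {"title": lines[0] if lines else "", "text": text}


def get_file_information(uri):
    """Return an XML string describing a file:// URI or a local path."""
    parsed = urlparse(uri)
    path = url2pathname(parsed.path) if parsed.scheme == "file" else uri

    # Base information: guess the MIME type from the file name alone.
    mime_type, _ = mimetypes.guess_type(path)
    mime_type = mime_type or "application/octet-stream"

    root = ET.Element("file", {"uri": uri})
    ET.SubElement(root, "type").text = mime_type
    ET.SubElement(root, "modified").text = str(int(os.path.getmtime(path)))

    # Per-filetype fields come from whichever backend is registered, if any.
    extract = BACKENDS.get(mime_type)
    if extract is not None:
        for field, value in extract(path).items():
            ET.SubElement(root, field).text = value
    return ET.tostring(root, encoding="unicode")
```

Calling `get_file_information("notes.txt")` on an existing text file would return a single `<file>` element with `type` and `modified` children plus whatever the plain-text backend adds. A file type with no registered backend still gets the base fields, which matches the idea that MIME-type information is the minimum.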
The approach would be: build the framework, write a small document listing the standard fields, build support for a core 10 or 15 file types, and then release it. Make the framework automatically download updated backends.

What's hard is that you would want to use existing libraries like libgsf, libole, libpdf (if such a thing exists), and maybe gnome-vfs. So you have dependencies.

Just some thoughts.

Nat