[Typo3-dev] New search engine upcoming
Jörg Roth
info at zeusmedia.de
Sun Jan 18 02:22:04 CET 2004
Hi,
at my company - mediaDIALOG - we have developed a new kind of search
engine for documents and websites. It's called SEMANTIKbrowser and is
currently an official grant-aided European research project.
What does it do?
Well, it is split into two parts: the scanning software, written
entirely in Java, and the frontend, written in PHP.
The scanning tool does the whole indexing part, extracting all
information from the files. Many file types are supported, such as
PDF, RTF and DOC, and more formats will be implemented in the near
future. All necessary information is stored in a MySQL database. It's
also possible to use MS-SQL, PostgreSQL, Oracle and some others. The
actual indexing is done after everything has been scanned. Using a huge
stop-word list (currently there's a German and an English version), the
data is processed based on its semantics and the ontology. After this
process the index tables are filled. The required relational (m2m)
tables are also created automatically.
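To make the indexing step a bit more concrete, here is a minimal sketch in Python (not the actual Java scanner). It assumes a tiny made-up stop-word list, hypothetical table names (document, word, document_word) and an in-memory SQLite database in place of MySQL. The semantic and ontology processing is not covered; it only shows stop-word filtering and filling an m2m table with hit counts.

```python
import re
import sqlite3

# Hypothetical, tiny stop-word list; the real list is described as huge.
STOP_WORDS = {"the", "a", "an", "and", "is", "of", "in", "to", "for"}


def tokenize(text):
    """Lowercase the text and return its words without stop words."""
    words = re.findall(r"[a-zäöüß]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_index(documents):
    """Fill document, word and m2m tables from a {title: text} mapping."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE document (id INTEGER PRIMARY KEY, title TEXT UNIQUE);
        CREATE TABLE word (id INTEGER PRIMARY KEY, term TEXT UNIQUE);
        CREATE TABLE document_word (
            document_id INTEGER REFERENCES document(id),
            word_id INTEGER REFERENCES word(id),
            hits INTEGER NOT NULL,
            PRIMARY KEY (document_id, word_id)
        );
    """)
    for title, text in documents.items():
        doc_id = db.execute(
            "INSERT INTO document (title) VALUES (?)", (title,)
        ).lastrowid
        # Count how often each remaining term occurs in this document.
        counts = {}
        for term in tokenize(text):
            counts[term] = counts.get(term, 0) + 1
        for term, hits in counts.items():
            db.execute("INSERT OR IGNORE INTO word (term) VALUES (?)", (term,))
            word_id = db.execute(
                "SELECT id FROM word WHERE term = ?", (term,)
            ).fetchone()[0]
            db.execute(
                "INSERT INTO document_word VALUES (?, ?, ?)",
                (doc_id, word_id, hits),
            )
    db.commit()
    return db


def hits_for(db, term):
    """Return (title, hits) rows for one term, most hits first."""
    return db.execute("""
        SELECT d.title, dw.hits
        FROM document_word dw
        JOIN document d ON d.id = dw.document_id
        JOIN word w ON w.id = dw.word_id
        WHERE w.term = ?
        ORDER BY dw.hits DESC, d.title
    """, (term,)).fetchall()
```

Calling build_index() with a few titles and texts and then hits_for(db, "search") returns the documents containing that word, ordered by hit count.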
The frontend just provides the search options, but not in the classic
way. It displays a 5 x 5 "Matrix" which holds the near words in an inner
circle and the far words in an outer circle, according to their
dependence on and emphasis relative to the search word. The Matrix can
be extended up to 50 words (you can switch it). On the left side you
will have several options such as a search history, categories and
synonyms. Beneath the Matrix the result list is displayed. The results
are ordered by hits and emphasis. For each result, the document type,
the document title and the first excerpt of the document are shown.
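A rough idea of how the near and far words could be chosen, as a Python sketch. This is only an assumption for illustration: it ranks words by how many documents they share with the search word, which is not necessarily how the SEMANTIKbrowser weighs dependence and emphasis. The split into 8 near and 16 far words is also assumed; it simply matches the two rings around the centre of a 5 x 5 grid.

```python
from collections import Counter


def related_words(documents, search_word):
    """Count in how many documents each word appears next to search_word."""
    counts = Counter()
    for words in documents:
        if search_word in words:
            counts.update(set(words) - {search_word})
    return counts


def build_matrix(documents, search_word, inner=8, outer=16):
    """Split the related words into a near (inner) and far (outer) ring."""
    counts = related_words(documents, search_word)
    # Most shared documents first; alphabetical order breaks ties.
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    return {
        "center": search_word,
        "near": ranked[:inner],
        "far": ranked[inner:inner + outer],
    }
```

Here documents is a list of word lists, one per document, already stripped of stop words.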
How does it work?
You don't need to search for your information the classic way. You
don't even need to provide a search term - the Matrix starts
automatically with the words that are indexed with the most 'hits'. Just
navigate through the Matrix to find the right document. And if you'd
like to get there faster, add a second word and the result list will be
narrowed down.
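The two behaviours described here, starting from the most-hit words and narrowing the list with a second word, can be sketched in Python like this. The index layout ({term: {document: hits}}) and the function names are assumptions for illustration, and the ordering uses hits only, not emphasis.

```python
def top_words(index, limit=25):
    """Words with the most total hits, for when no search term is given."""
    totals = {term: sum(docs.values()) for term, docs in index.items()}
    return sorted(totals, key=lambda term: (-totals[term], term))[:limit]


def search(index, *terms):
    """Documents containing every given term, ordered by summed hits."""
    if not terms:
        return []
    matching = set(index.get(terms[0], {}))
    # Each additional term can only shrink the set of matching documents.
    for term in terms[1:]:
        matching &= set(index.get(term, {}))
    score = {doc: sum(index[t][doc] for t in terms) for doc in matching}
    return sorted(score, key=lambda doc: (-score[doc], doc))
```

With one term, search() returns every document containing it; with two terms, only the documents containing both remain.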
What else?
Well, some of you may know those high-end products like VisualThesaurus
from plumb:design or the AquaBrowser. Our solution won't be that BIG or
GOOD, but the product itself is about 10 times cheaper and just as
reliable.
So we've made the decision to create a fully featured TYPO3-Extension
out of it. There are many details to talk about, but I think this is all
for now. And yes, it will be totally free.
Personal note: Anyone who's working with CM systems (including me) is
missing such an intelligent search engine. Even the big ones like VIP
GAUSS, InterRed, RedDot, SixCMS or Vignette don't provide such
'functionalities'. I think this would be a great benefit for TYPO3 and
anyone who's using it.
Technology information
The backend of TYPO3 will be extended by a new module 'semantikbrowser'
where you can do all the necessary configuration. For the first release
all scanning will be done manually via PHP, but admins will be able to
add a cronjob. Currently we are discussing the release of the scanner
tool in a special free version. For projects with more than 250
pages/documents we highly recommend the use of the scanner tool! It's
lightning fast and CPU/memory consumption is very low.
The frontend will simply be embedded as a plugin type. We don't think it
needs to be its own content element.
The frontend will be fully customizable via HTML template and CSS. We're
thinking about XHTML and are going to test that out!
That's all for now. I just wanted to introduce it. If you have any
questions or comments, please let me know at j.roth at mediadialog.de,
www.mediadialog.de or here at the NG.
Ah, yes. There's currently one publicly available instance of it. Just
visit www.ihk-muenchen.de and switch the dropdown labelled 'Direktsprung'
to 'A-Z Schlagwortsuche'. This version of the SEMANTIKbrowser is an
individual one using the list view option.
On Monday I will post a link to a screenshot (maybe an HTML preview)
where you can take a look at the design and layout.
The current state of the project:
- Planning phase is done (it's always a planning phase, isn't it?)
- Database analysis is done
- Test run using a big customer site is done
Next week we've got a meeting to discuss everything. I'll be setting up
a schedule up to the release of a first public alpha version. This will
be around May or June.
Regards,
Jörg
PS: All additional extensions like news, address, forum and so on should
be considered! Maybe we need some further info about them.