[Typo3-dev] New search engine upcoming

Sun Jan 18 02:22:04 CET 2004

Hi,

At my company, mediaDIALOG, we have developed a new kind of search 
engine for documents and websites. It's called SEMANTIKbrowser and is 
currently an official grant-aided European research project.

What does it do?
Well, it is split into two parts: the scanning software, written 
entirely in Java, and the frontend, written in PHP.
The scanning tool does the whole indexing part, extracting all 
information from the files. Many file types are supported, such as 
PDF, RTF and DOC, and more formats will be implemented in the near 
future. All necessary information is stored in a MySQL database. It's 
also possible to use MS SQL, PostgreSQL, Oracle and some others. The 
actual indexing is done after everything has been scanned. Using a huge 
stop-word list (currently there's a German and an English version), the 
data is processed based on its semantics and the ontology. After this 
process the index tables are filled. The required relational (m2m) 
tables are also created automatically.
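
To illustrate the indexing steps described above, here is a minimal 
Python sketch. It is not the actual scanner (which is written in Java). 
The table names (document, word, document_word), the tiny stop-word set 
and the use of an in-memory SQLite database instead of MySQL are all 
assumptions made for the example, and the semantic/ontology step is 
left out entirely.

```python
import re
import sqlite3

# Tiny illustrative stop-word set; the real list is described as huge.
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "in", "to"}

def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-zäöüß]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_index(docs):
    """Fill hypothetical document, word and m2m tables (SQLite stand-in)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE document (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE word (id INTEGER PRIMARY KEY, term TEXT UNIQUE);
        CREATE TABLE document_word (
            document_id INTEGER, word_id INTEGER, hits INTEGER,
            PRIMARY KEY (document_id, word_id)
        );
    """)
    for title, text in docs.items():
        doc_id = db.execute(
            "INSERT INTO document (title) VALUES (?)", (title,)
        ).lastrowid
        # Count how often each remaining word occurs in this document.
        counts = {}
        for term in tokenize(text):
            counts[term] = counts.get(term, 0) + 1
        for term, hits in counts.items():
            db.execute("INSERT OR IGNORE INTO word (term) VALUES (?)", (term,))
            word_id = db.execute(
                "SELECT id FROM word WHERE term = ?", (term,)
            ).fetchone()[0]
            # One m2m row per (document, word) pair, with its hit count.
            db.execute(
                "INSERT INTO document_word VALUES (?, ?, ?)",
                (doc_id, word_id, hits),
            )
    db.commit()
    return db
```

The m2m table is what lets one word point to many documents and one 
document to many words, with the hit count stored on the relation.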

The frontend just provides the search options, but not in the classic 
way. It displays a 5 x 5 "Matrix" which holds, in an inner circle, the 
words near to the search word and, in an outer circle, the words far 
from it, according to their dependence and emphasis. The Matrix can be 
extended to up to 50 words (you can switch it). On the left side you 
have several options such as a search history, categories and synonyms. 
Beneath the Matrix the result list is displayed. Results are ordered by 
hits and emphasis. Each entry shows the document type, the document 
title and an opening excerpt.
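
As a rough illustration of the near/far split, the sketch below ranks 
words by how many documents they share with the search word and cuts 
the ranking into two rings. How "dependence and emphasis" are really 
computed is not covered here, so plain co-occurrence counting is a 
stand-in, and the ring sizes (8 inner and 16 outer cells around the 
centre of a 5 x 5 grid) as well as the function names are assumptions.

```python
from collections import Counter

def related_words(search_word, docs):
    """Rank words by the number of documents they share with search_word."""
    counts = Counter()
    for words in docs:
        unique = set(words)
        if search_word in unique:
            counts.update(unique - {search_word})
    # Most shared documents first; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked]

def matrix_rings(search_word, docs, inner_size=8, outer_size=16):
    """Split the ranking into an inner (near) and an outer (far) ring."""
    ranked = related_words(search_word, docs)
    inner = ranked[:inner_size]
    outer = ranked[inner_size:inner_size + outer_size]
    return inner, outer
```

Each entry in docs is simply the list of indexed words of one document. 
A larger Matrix would just mean larger ring sizes.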

How does it work?
You don't need to search for your information the classic way. You 
don't even need to provide a search term - the Matrix starts 
automatically with the words that are indexed with the most 'hits'. Just 
navigate through the Matrix to find the right document. And if you 
want to get there faster, add a second word and the result list will be 
narrowed down.
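
The narrowing effect of a second word can be sketched like this. It is 
only an illustration: the index layout (a dict of per-document hit 
counts) and the function name are made up for the example, and ordering 
by summed hits ignores the emphasis weighting.

```python
def search(index, terms):
    """Return titles of documents containing every term, most hits first."""
    results = []
    for title, hits in index.items():
        # A document only qualifies if it contains all search terms.
        if all(term in hits for term in terms):
            total = sum(hits[term] for term in terms)
            results.append((total, title))
    # Highest hit count first; ties are broken by title.
    results.sort(key=lambda r: (-r[0], r[1]))
    return [title for _, title in results]

index = {
    "Doc A": {"search": 3, "engine": 1},
    "Doc B": {"search": 1},
    "Doc C": {"search": 1, "engine": 2},
}
one_word = search(index, ["search"])
two_words = search(index, ["search", "engine"])
```

With one word all three documents match; adding "engine" drops Doc B 
from the list.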

What else?
Well, some of you may know high-end products like VisualThesaurus 
from plumb:design or the AquaBrowser. Our solution won't be that BIG or 
GOOD, but the product itself is about 10 times cheaper and just as 
reliable.

So we've made the decision to create a fully featured TYPO3-Extension 
out of it. There are many details to talk about, but I think this is all 
for now. And yes, it will be totally free.

Personal note: Anyone who's working with CM systems (including me) is 
missing such an intelligent search engine. Even the big ones like VIP 
GAUSS, InterRed, RedDot, SixCMS or Vignette don't provide such 
'functionality'. I think this would be a great benefit for TYPO3 and 
anyone who's using it.

Technology information
The TYPO3 backend will be extended by a new module 'semantikbrowser' 
where you can do all the necessary configuration. For the first release 
all scanning will be done by PHP and started manually, but admins will 
be able to add a cronjob. Currently we are discussing the release of the 
scanner tool in a special free version. For projects with more than 
250 pages/documents we highly recommend the use of the scanner 
tool! It's lightning fast and CPU/memory consumption is very low.
The frontend will simply be embedded as a plugin type. We don't think it 
needs to be its own content element.

The frontend will be fully customizable via HTML template and CSS. We're 
thinking about XHTML and are going to test that out!

That's all for now. I just wanted to introduce it. If you have any 
questions or comments, please let me know at j.roth at mediadialog.de, 
www.mediadialog.de or here in the NG.

Ah, yes. There's currently one publicly available instance of it. Just 
visit www.ihk-muenchen.de and switch the dropdown labeled 'Direktsprung' 
to 'A-Z Schlagwortsuche'. This version of the SEMANTIKbrowser is a 
customized one using the list view option.

On Monday I will post a link to a screenshot (maybe an HTML preview) 
where you can take a look at the design and layout.

The current state of the project:
- Planning phase is done (it's always a planning phase, isn't it?)
- Database analysis is done
- Test run using a big customer site is done

Next week we've got a meeting to discuss everything. I'll be setting up a 
schedule up to the release of a first public alpha version. This will be 
around May or June.

Regards,

Jörg

PS: All additional extensions like news, address, forum and so on should 
be considered! Maybe we need some further information about that.