BlueShoes Application Framework made with PHP http://www.blueshoes.org/



File: C:/usr/local/lib/php/blueshoes-4.2/core/text/Bs_LanguageDetector.class.php
BlueShoes Application Framework - text

Bs_LanguageDetector

Bs_Object
   |
  +-- Bs_LanguageDetector

detects the language of a text (not meant for single words).

 

public class Bs_LanguageDetector extends Bs_Object

detects the language of a text (not meant for single words).
supports en, fr, de and nl.
there are quite a few third-party tools available for detecting languages; they support up to 260 languages. see the list here: http://odur.let.rug.nl/~vannoord/TextCat/competitors.html (or try the odp, yahoo and google, as always...).
dependencies: Bs_String, Bs_Array, Bs_Db

Authors
Version 4.0.$id$
Copyright blueshoes.org

 

Methods inherited from Bs_Object

isex, isexception, tostring, tohtml, persist, unpersist, bs_object, bbsetoutput, bbawake, bbisawake, bbxmsg, bbxfunctionstart, bbxfunctionend, bbxecho, bbxvar, bbxvardump, bbforcetrace, bbbufferstart, bbbufferget, bbbufferendflush, bbbufferendclean

Public Method Summary

mixed

detectWord(string $word, [ string $lang ])

detects the language of a given word.
array

detectText(string $string)

detects the language of a given text.
void

Bs_LanguageDetector()

Warning: documentation is missing.

Private Method Summary

void

_loadDictionary()

loads the dictionary (prepares data) used for and in detectText().

Public Field Summary

array

$wordDictEn

100 most used words in english.
array

$wordDictFr

100 most used words in french.
array

$wordDictDe

100 most used words in german.
array

$wordDictNl

100 most used words in dutch.
array

$wordDict

an alternative structure that combines all the language-dependent $wordDict* arrays in one place.
array

$prefixDict

array with word prefixes for different languages.
array

$suffixDict

array with word suffixes for different languages.
string

$specialChars

special chars to replace in text.

Private Field Summary

object [unknown]

$_bsDb

reference to the globally used db object.
object [unknown]

$_Bs_Array

reference to the globally used Bs_Array class.
object [unknown]

$_Bs_String

reference to the globally used Bs_String class.

Private Constant Summary

BS_LANGUAGEDETECTOR_VERSION >>4.0.$x$<< Warning: documentation is missing.

Public Method Details

detectWord

public mixed detectWord( string $word, [ string $lang ] )

  detects the language of a given word.
the following languages are supported:
1) english
2)
we need more language dictionaries!!
a word may exist in different languages, but the first match returns immediately.
languages are checked in the order listed above.

Parameter
string $word
string $lang = >>null<<
(you can give a vector with possible languages. if you do, only these langs will be checked. for example if you know it must be english or german.)
Returns mixed

(the iso-code(s) of the detected language(s) so it's a string or vector.)

Throws bool FALSE if detection failed.
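
The first-match behaviour described above can be illustrated with a minimal sketch. It is not the BlueShoes implementation: the function name sketchDetectWord, its $dicts parameter and the tiny word lists are hypothetical, and the code is written for current PHP (7.1 or later) rather than the PHP 4 dialect the framework targets.

```php
<?php
// Hypothetical stand-in for detectWord(); not the BlueShoes implementation.
// $dicts maps an ISO code to a word list, in the order languages are checked.
function sketchDetectWord(string $word, array $dicts, ?array $langs = null)
{
    foreach ($dicts as $lang => $words) {
        // Skip languages the caller has excluded.
        if ($langs !== null && !in_array($lang, $langs, true)) {
            continue;
        }
        // The first dictionary that contains the word wins.
        if (isset($words[$word])) {
            return $lang;
        }
    }
    return false; // detection failed
}

// Illustrative excerpts only, not the full 100-word lists.
$dicts = [
    'en' => ['the' => 'en', 'was' => 'en', 'in' => 'en'],
    'de' => ['der' => 'de', 'was' => 'de', 'in' => 'de'],
];

$first  = sketchDetectWord('was', $dicts);         // 'en' is checked first
$german = sketchDetectWord('was', $dicts, ['de']); // restricted to German
$none   = sketchDetectWord('xyz', $dicts);         // in no dictionary
```

Because 'was' appears in both word lists, the result depends entirely on the checking order unless the caller restricts the candidate languages.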

detectText

public array detectText( string $string )

  detects the language of a given text.
the returned value is a hash with the keys:
'lang' => iso-code of the language with the most hits
'hits' => something like array('en'=>0, 'fr'=>0, 'de'=>0)
'reason' => the hits themselves, like array('en'=>array(), 'fr'=>array(), 'de'=>array())

Parameter
string $string
Returns array

(see above)

Throws bool FALSE if detection failed.
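
To make the shape of the returned hash concrete, here is a hedged sketch of a hit-counting detector. The function sketchDetectText, its $dicts parameter and the tie-breaking rule are assumptions made for illustration; the documentation above does not say how the real method tokenises text or resolves ties.

```php
<?php
// Hypothetical sketch of the hit-counting idea; not the BlueShoes code.
function sketchDetectText(string $text, array $dicts)
{
    if ($dicts === []) {
        return false; // nothing to compare against
    }

    $hits   = [];
    $reason = [];
    foreach (array_keys($dicts) as $lang) {
        $hits[$lang]   = 0;
        $reason[$lang] = [];
    }

    // Split on whitespace; real text would also need punctuation stripped.
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        foreach ($dicts as $lang => $dict) {
            if (isset($dict[$word])) {
                $hits[$lang]++;
                $reason[$lang][] = $word;
            }
        }
    }

    if (max($hits) === 0) {
        return false; // detection failed: no word matched any dictionary
    }

    // Assumption: on a tie, the language listed first in $dicts wins.
    $best = array_search(max($hits), $hits, true);
    return ['lang' => $best, 'hits' => $hits, 'reason' => $reason];
}

// Illustrative excerpts only, not the full 100-word lists.
$dicts = [
    'en' => ['the' => 'en', 'in' => 'en', 'is' => 'en', 'and' => 'en'],
    'de' => ['der' => 'de', 'in' => 'de', 'ist' => 'de', 'dem' => 'de',
             'und' => 'de', 'das' => 'de'],
];
$result = sketchDetectText('der Hund ist in dem Haus und das ist gut', $dicts);
```

In this run 'in' counts for both languages, so 'en' collects one hit while 'de' collects seven, and the 'reason' entry lists the matched words per language.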

Bs_LanguageDetector

public void Bs_LanguageDetector( )

 

Warning: documentation is missing.

Returns void


Private Method Details

_loadDictionary

private void _loadDictionary( )

  loads the dictionary (prepares data) used for and in detectText().

Returns void


Public Field Details

$wordDictEn

public array $wordDictEn

>>array( 'the'=>'en', 'of'=>'en', 'to'=>'en', 'and'=>'en', 'a'=>'en', 'in'=>'en', 'for'=>'en', 'is'=>'en', 'The'=>'en', 'that'=>'en', 'on'=>'en', 'said'=>'en', 'with'=>'en', 'be'=>'en', 'was'=>'en', 'by'=>'en', 'as'=>'en', 'are'=>'en', 'at'=>'en', 'from'=>'en', 'it'=>'en', 'has'=>'en', 'an'=>'en', 'have'=>'en', 'will'=>'en', 'or'=>'en', 'its'=>'en', 'he'=>'en', 'not'=>'en', 'were'=>'en', 'which'=>'en', 'this'=>'en', 'but'=>'en', 'can'=>'en', 'more'=>'en', 'his'=>'en', 'been'=>'en', 'would'=>'en', 'about'=>'en', 'their'=>'en', 'also'=>'en', 'they'=>'en', 'million'=>'en', 'had'=>'en', 'than'=>'en', 'up'=>'en', 'who'=>'en', 'In'=>'en', 'one'=>'en', 'you'=>'en', 'new'=>'en', 'A'=>'en', 'I'=>'en', 'other'=>'en', 'year'=>'en', 'all'=>'en', 'two'=>'en', 'S'=>'en', 'But'=>'en', 'It'=>'en', 'company'=>'en', 'into'=>'en', 'U'=>'en', 'Mr.'=>'en', 'system'=>'en', 'some'=>'en', 'when'=>'en', 'out'=>'en', 'last'=>'en', 'only'=>'en', 'after'=>'en', 'first'=>'en', 'time'=>'en', 'says'=>'en', 'He'=>'en', 'years'=>'en', 'market'=>'en', 'no'=>'en', 'over'=>'en', 'we'=>'en', 'could'=>'en', 'if'=>'en', 'people'=>'en', 'percent'=>'en', 'such'=>'en', 'This'=>'en', 'most'=>'en', 'use'=>'en', 'because'=>'en', 'any'=>'en', 'data'=>'en', 'there'=>'en', 'them'=>'en', 'government'=>'en', 'may'=>'en', 'software'=>'en', 'so'=>'en', 'New'=>'en', 'now'=>'en', 'many'=>'en' )<<

100 most used words in english.


$wordDictFr

public array $wordDictFr

>>array( 'de'=>'fr', 'la'=>'fr', 'le'=>'fr', 'et'=>'fr', 'les'=>'fr', 'des'=>'fr', 'en'=>'fr', 'un'=>'fr', 'du'=>'fr', 'une'=>'fr', 'que'=>'fr', 'est'=>'fr', 'pour'=>'fr', 'qui'=>'fr', 'dans'=>'fr', 'a'=>'fr', 'par'=>'fr', 'plus'=>'fr', 'pas'=>'fr', 'au'=>'fr', 'sur'=>'fr', 'ne'=>'fr', 'se'=>'fr', 'Le'=>'fr', 'ce'=>'fr', 'il'=>'fr', 'sont'=>'fr', 'La'=>'fr', 'Les'=>'fr', 'ou'=>'fr', 'avec'=>'fr', 'son'=>'fr', 'Il'=>'fr', 'aux'=>'fr', 'd\'un'=>'fr', 'En'=>'fr', 'cette'=>'fr', 'd\'une'=>'fr', 'ont'=>'fr', 'ses'=>'fr', 'mais'=>'fr', 'comme'=>'fr', 'on'=>'fr', 'tout'=>'fr', 'nous'=>'fr', 'sa'=>'fr', 'Mais'=>'fr', 'fait'=>'fr', 'été'=>'fr', 'aussi'=>'fr', 'leur'=>'fr', 'bien'=>'fr', 'peut'=>'fr', 'ces'=>'fr', 'y'=>'fr', 'deux'=>'fr', 'A'=>'fr', 'ans'=>'fr', 'l'=>'fr', 'encore'=>'fr', 'n\'est'=>'fr', 'marché'=>'fr', 'd'=>'fr', 'Pour'=>'fr', 'donc'=>'fr', 'cours'=>'fr', 'qu\'il'=>'fr', 'moins'=>'fr', 'sans'=>'fr', 'C\'est'=>'fr', 'Et'=>'fr', 'si'=>'fr', 'entre'=>'fr', 'Un'=>'fr', 'Ce'=>'fr', 'faire'=>'fr', 'elle'=>'fr', 'c\'est'=>'fr', 'peu'=>'fr', 'vous'=>'fr', 'Une'=>'fr', 'prix'=>'fr', 'On'=>'fr', 'dont'=>'fr', 'lui'=>'fr', 'également'=>'fr', 'Dans'=>'fr', 'effet'=>'fr', 'pays'=>'fr', 'cas'=>'fr', 'De'=>'fr', 'millions'=>'fr', 'Belgique'=>'fr', 'BEF'=>'fr', 'mois'=>'fr', 'leurs'=>'fr', 'taux'=>'fr', 'années'=>'fr', 'temps'=>'fr', 'groupe'=>'fr' )<<

100 most used words in french.


$wordDictDe

public array $wordDictDe

>>array( 'der'=>'de', 'die'=>'de', 'und'=>'de', 'in'=>'de', 'den'=>'de', 'von'=>'de', 'zu'=>'de', 'das'=>'de', 'mit'=>'de', 'sich'=>'de', 'des'=>'de', 'auf'=>'de', 'für'=>'de', 'ist'=>'de', 'im'=>'de', 'dem'=>'de', 'nicht'=>'de', 'ein'=>'de', 'Die'=>'de', 'eine'=>'de', 'als'=>'de', 'auch'=>'de', 'es'=>'de', 'an'=>'de', 'werden'=>'de', 'aus'=>'de', 'er'=>'de', 'hat'=>'de', 'daß'=>'de', 'sie'=>'de', 'nach'=>'de', 'wird'=>'de', 'bei'=>'de', 'einer'=>'de', 'Der'=>'de', 'um'=>'de', 'am'=>'de', 'sind'=>'de', 'noch'=>'de', 'wie'=>'de', 'einem'=>'de', 'über'=>'de', 'einen'=>'de', 'Das'=>'de', 'so'=>'de', 'Sie'=>'de', 'zum'=>'de', 'war'=>'de', 'haben'=>'de', 'nur'=>'de', 'oder'=>'de', 'aber'=>'de', 'vor'=>'de', 'zur'=>'de', 'bis'=>'de', 'mehr'=>'de', 'durch'=>'de', 'man'=>'de', 'sein'=>'de', 'wurde'=>'de', 'sei'=>'de', 'In'=>'de', 'Prozent'=>'de', 'hatte'=>'de', 'kann'=>'de', 'gegen'=>'de', 'vom'=>'de', 'können'=>'de', 'schon'=>'de', 'wenn'=>'de', 'habe'=>'de', 'seine'=>'de', 'Mark'=>'de', 'ihre'=>'de', 'dann'=>'de', 'unter'=>'de', 'wir'=>'de', 'soll'=>'de', 'ich'=>'de', 'eines'=>'de', 'Es'=>'de', 'Jahr'=>'de', 'zwei'=>'de', 'Jahren'=>'de', 'diese'=>'de', 'dieser'=>'de', 'wieder'=>'de', 'keine'=>'de', 'Uhr'=>'de', 'seiner'=>'de', 'worden'=>'de', 'Und'=>'de', 'will'=>'de', 'zwischen'=>'de', 'Im'=>'de', 'immer'=>'de', 'Millionen'=>'de', 'Ein'=>'de', 'was'=>'de', 'sagte'=>'de' )<<

100 most used words in german.


$wordDictNl

public array $wordDictNl

>>array( 'de'=>'nl', 'van'=>'nl', 'een'=>'nl', 'het'=>'nl', 'en'=>'nl', 'in'=>'nl', 'is'=>'nl', 'dat'=>'nl', 'op'=>'nl', 'te'=>'nl', 'De'=>'nl', 'zijn'=>'nl', 'voor'=>'nl', 'met'=>'nl', 'die'=>'nl', 'niet'=>'nl', 'aan'=>'nl', 'er'=>'nl', 'om'=>'nl', 'Het'=>'nl', 'ook'=>'nl', 'als'=>'nl', 'dan'=>'nl', 'maar'=>'nl', 'bij'=>'nl', 'of'=>'nl', 'uit'=>'nl', 'nog'=>'nl', 'worden'=>'nl', 'door'=>'nl', 'naar'=>'nl', 'heeft'=>'nl', 'tot'=>'nl', 'ze'=>'nl', 'wordt'=>'nl', 'over'=>'nl', 'hij'=>'nl', 'In'=>'nl', 'meer'=>'nl', 'jaar'=>'nl', 'was'=>'nl', 'ik'=>'nl', 'kan'=>'nl', 'je'=>'nl', 'zich'=>'nl', 'al'=>'nl', 'hebben'=>'nl', 'geen'=>'nl', 'hun'=>'nl', 'we'=>'nl', 'wat'=>'nl', 'Een'=>'nl', 'Maar'=>'nl', 'werd'=>'nl', 'moet'=>'nl', 'wel'=>'nl', 'kunnen'=>'nl', 'Dat'=>'nl', 'nu'=>'nl', 'dit'=>'nl', 'deze'=>'nl', 'zal'=>'nl', 'Ik'=>'nl', 'veel'=>'nl', 'zo'=>'nl', 'En'=>'nl', 'andere'=>'nl', 'nieuwe'=>'nl', 'zou'=>'nl', 'twee'=>'nl', 'moeten'=>'nl', 'onder'=>'nl', 'eerste'=>'nl', 'haar'=>'nl', 'Van'=>'nl', 'wil'=>'nl', 'tegen'=>'nl', 'men'=>'nl', 'mensen'=>'nl', 'gaat'=>'nl', 'tussen'=>'nl', 'grote'=>'nl', 'waar'=>'nl', 'goed'=>'nl', 'maken'=>'nl', 'dus'=>'nl', 'alleen'=>'nl', 'Hij'=>'nl', 'Op'=>'nl', 'frank'=>'nl', 'ons'=>'nl', 'u'=>'nl', 'daar'=>'nl', 'na'=>'nl', 'had'=>'nl', 'gaan'=>'nl', 'alle'=>'nl', 'Als'=>'nl', 'Er'=>'nl', 'één'=>'nl' )<<

100 most used words in dutch.


$wordDict

public array $wordDict

>><<

an alternative structure that combines all the language-dependent $wordDict* arrays in one place.

See Also _loadDictionary()
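
The layout of $wordDict is not documented here (its default value is empty and it is filled by _loadDictionary()). One plausible layout, shown purely as an assumption, maps each word to the list of languages that contain it. The function sketchLoadDictionary and the sample arrays are hypothetical.

```php
<?php
// Hypothetical merge; the real layout of $wordDict is not documented.
function sketchLoadDictionary(array ...$perLanguage): array
{
    $combined = [];
    foreach ($perLanguage as $dict) {
        foreach ($dict as $word => $lang) {
            // A word such as 'in' can belong to several languages,
            // so each entry collects a list of ISO codes.
            $combined[$word][] = $lang;
        }
    }
    return $combined;
}

// Illustrative excerpts only, not the full 100-word lists.
$en = ['the' => 'en', 'in' => 'en'];
$de = ['der' => 'de', 'in' => 'de'];
$nl = ['het' => 'nl', 'in' => 'nl'];

$combined = sketchLoadDictionary($en, $de, $nl);
```

A structure like this needs only one lookup per word, instead of one lookup per language dictionary.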

$prefixDict

public array $prefixDict

>>array( 'off' =>'en', 'to' =>'en', 'under'=>'en', 'thou'=>'en', 'mont'=>'fr', 'contr'=>'fr', 'mal' =>'fr', 'ver' =>'de', 'zu' =>'de', 'los' =>'de', 'gut'=>'de' )<<

array with word prefixes for different languages.
don't use 'in', it is german and english, maybe others.

See Also $suffixDict

$suffixDict

public array $suffixDict

>>array( 'son' =>'en', 'day' =>'en', 'ing' =>'en', 'ly' =>'en', 'ght'=>'en', 'ique'=>'fr', 'tude'=>'fr', 'ont' =>'fr', 'nal'=>'fr', 'tung'=>'de', 'heim'=>'de', 'zeug'=>'de' )<<

array with word suffixes for different languages.

See Also $prefixDict
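
How the prefix and suffix tables are applied is not described in this documentation. As a rough sketch under that caveat, a word could be tested against both tables and the matches counted per language. The function sketchAffixHits is hypothetical, and the tables below are excerpts of the defaults listed above.

```php
<?php
// Hypothetical helper; how BlueShoes applies these tables is not documented.
function sketchAffixHits(string $word, array $prefixDict, array $suffixDict): array
{
    $hits = [];
    foreach ($prefixDict as $prefix => $lang) {
        $len = strlen($prefix);
        // Require at least one character after the prefix.
        if (strlen($word) > $len && strncmp($word, $prefix, $len) === 0) {
            $hits[$lang] = ($hits[$lang] ?? 0) + 1;
        }
    }
    foreach ($suffixDict as $suffix => $lang) {
        $len = strlen($suffix);
        // Require at least one character before the suffix.
        if (strlen($word) > $len && substr($word, -$len) === $suffix) {
            $hits[$lang] = ($hits[$lang] ?? 0) + 1;
        }
    }
    return $hits;
}

$prefixDict = ['under' => 'en', 'ver' => 'de', 'zu' => 'de'];
$suffixDict = ['ing' => 'en', 'tung' => 'de', 'ique' => 'fr'];

$a = sketchAffixHits('understanding', $prefixDict, $suffixDict);
$b = sketchAffixHits('verantwortung', $prefixDict, $suffixDict);
```

Here 'understanding' matches the prefix 'under' and the suffix 'ing', giving two hits for 'en'; 'verantwortung' matches 'ver' and 'tung', giving two hits for 'de'.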

$specialChars

public string $specialChars

>>'.,!?"()[]{}!§$%&/*+#'<<

special chars to replace in text.
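
A minimal sketch of how such a character list could be applied before the text is split into words. The function sketchStripSpecialChars is hypothetical, and replacing each character with a space is an assumption, since the documentation does not state what the characters are replaced with. The sketch also assumes UTF-8 input.

```php
<?php
// Hypothetical helper; replacing with a space is an assumption.
function sketchStripSpecialChars(string $text, string $specialChars): string
{
    // Split per UTF-8 character so multi-byte characters such as '§' stay whole.
    $chars = preg_split('//u', $specialChars, -1, PREG_SPLIT_NO_EMPTY);
    $text  = str_replace($chars, ' ', $text);
    // Collapse the runs of whitespace left behind by the replacement.
    return trim(preg_replace('/\s+/u', ' ', $text));
}

$specialChars = '.,!?"()[]{}!§$%&/*+#';
$clean = sketchStripSpecialChars('Hello, world! (test) §5', $specialChars);
```

With the default list above, the sample string is reduced to 'Hello world test 5', which can then be split on spaces.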


Private Field Details

$_bsDb

private object [unknown] $_bsDb

>><<

reference to the globally used db object.


$_Bs_Array

private object [unknown] $_Bs_Array

>><<

reference to the globally used Bs_Array class.


$_Bs_String

private object [unknown] $_Bs_String

>><<

reference to the globally used Bs_String class.


Private Constant Details

BS_LANGUAGEDETECTOR_VERSION

define( BS_LANGUAGEDETECTOR_VERSION, >>4.0.$x$<< )
Case: default: case sensitive




PHPDoc 1.0beta