2025-02-14 16:07:47,840 ocrqa_create_bloom_filter.py:425 INFO: Namespace(input_files=['lex/de/realword_ocr_errors.nw.txt', 'lex/de/ocr_errors.nw.txt', 'lex/de/old_spelling.rw.txt', 'lex/de/modern_spelling.rw.txt', 'lex/de/dewiki.unigram.freq.tsv.bz2'], bloom_path='build.d/fp_prob_0.00001/ocrqa-wp_v1.0.6-de.bloom', fp_probability=1e-05, log_level='INFO', log_file='build.d/fp_prob_0.00001/ocrqa-wp_v1.0.6-de.bloom.log', config=None, min_frequency=2, single_char_min_frequency=20, diagnose_bloom=True)
2025-02-14 16:07:47,840 ocrqa_create_bloom_filter.py:226 INFO: Starting Bloom Filter creation...
2025-02-14 16:07:47,840 ocrqa_create_bloom_filter.py:178 INFO: Processing nonword file: lex/de/realword_ocr_errors.nw.txt
2025-02-14 16:07:47,840 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['negierung']
2025-02-14 16:07:47,840 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['negierungen']
2025-02-14 16:07:47,840 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['nidwaiden']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ölten']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['unterwaiden']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['verlausen']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['vertretet']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:190 INFO: Excluded 7 words that should never be added
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:178 INFO: Processing nonword file: lex/de/ocr_errors.nw.txt
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['0oo']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['@d']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['@e']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['@i']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['@r']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['@t']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['aargan']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['abbin']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['abgefetzt']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['abgereift']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ahresbesoldung']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['aisbann']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['aneh']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['anmeldungstermln']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ariesheim']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['aueh']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ausgefetzt']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['auslände']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bahmen']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bandesblatt']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bebacht']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bechnung']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['befetzt']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['befetzten']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['begelung']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['begierung']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['begierungen']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['begierungsrat']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['behorde']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['behorden']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['berieht']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bersonen']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['besicht']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bestimm']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['betragt']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['betreifend']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['beutscher']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bevision']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bielleicht']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bingier']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bnndesblatt']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bücksicht']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['bundesbehorden']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['diefe']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['diefer']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['dingungen']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['dnrch']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['dnreh']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['dureh']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eahmen']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eappen']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eäte']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eatifikation']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ebenfall']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ebruar']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eechnung']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eecht']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eechte']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eechts']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eegel']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eegelung']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eegierung']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eegierungen']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eegierungsrat']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eeglement']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eeihe']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eente']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eenten']
2025-02-14 16:07:47,841 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eepublik']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eesolution']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eevision']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ehur']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eichter']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eidgenossenschast']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eidgenossische']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eingefetzt']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['einlabung']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eirea']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eiue']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eldg']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['elfaß']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['endlieh']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['engtischen']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eobert']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eolle']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['erbalten']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['erbbeben']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['erfetzt']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['erhallen']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['erlauft']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['erleiben']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['erleibet']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['erseht']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['eücksicht']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['euenburg']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['feiet']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['feinet']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['feite']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['festgefetzt']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['fetze']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['fetzte']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['fetzten']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['fetzung']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['feuersbrunft']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['fiir']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['fipoi']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['fllr']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['fönst']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['fortfetzen']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['fortfetzung']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['fortgefetzt']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['franzofen']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['franzosischen']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['frauken']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['galleu']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['gefetzt']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['gemass']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['gesellschast']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['gesellsehast']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['gewöhn']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['gierung']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['gischen']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['grossere']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['grossern']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['hallung']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['handelsund']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['hauptfache']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['heuligen']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ì000']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['iaht']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['iahte']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['iahten']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['iiber']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['infofern']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['jnni']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['kauton']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['kautone']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['kautons']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['korden']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['kreispostdirektiou']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['leife']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['liier']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['lnzern']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['locamo']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['lostet']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['luzernburg']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['macbonalb']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['mahnahmen']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['mahregeln']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['mährend']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['matznahmen']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['melben']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['melche']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ministet']
2025-02-14 16:07:47,842 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['mitleib']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['moglich']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['möglid']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['moglieh']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['naeh']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['nieht']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['nikiaus']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['noeh']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['nothig']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['nothigen']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ollem']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['poft']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['prankreich']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['rebakteur']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['reiburg']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['reuenburg']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['roieber']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['rovember']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ruffischen']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['schisse']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['schwierigleiten']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['sehengen']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['seihst']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['siud']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['srühern']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['stanben']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['stobt']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['tatfache']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['tatfachen']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['teten']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['thronrebe']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['uater']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['uicht']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['unbein']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['unfete']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['unier']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['unterschieb']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['urfache']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['ürich']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['verfetzt']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['verkau']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['verkauten']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['verlauft']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['verlaus']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['verlehr']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['vorfitz']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['vstrr']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['webet']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['welehe']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['wnrde']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['znm']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:187 INFO: Nonword tokens: ['zurlch']
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:190 INFO: Excluded 213 words that should never be added
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:196 INFO: Processing real-word file: lex/de/old_spelling.rw.txt
2025-02-14 16:07:47,843 ocrqa_create_bloom_filter.py:196 INFO: Processing real-word file: lex/de/modern_spelling.rw.txt
2025-02-14 16:07:47,844 ocrqa_create_bloom_filter.py:135 INFO: Processing frequency file: lex/de/dewiki.unigram.freq.tsv.bz2
2025-02-14 16:08:00,551 ocrqa_create_bloom_filter.py:240 INFO: low_freq_excluded before removing parts from high-frequency words: 3780824
2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:248 INFO: low_freq_excluded after removing parts from high-frequency words: 3288723
2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:252 INFO: Lexical processing complete.
2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - nonwords_read: 213
2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - nonwords_count: 213
2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - realwords_read: 488
2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - realwords_accepted: 488
2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - realwords_nonwords_filtered: 0
2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - freq_words_read: 9199714
2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - words_accepted: 4143119
2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - subwords_accepted: 5029202
2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - subwords_filtered: 1
2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - low_freq_excluded: 3288723
2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - single_char_words_filtered: 6719
2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:256 INFO: - freq_words_filtered: 5049876
2025-02-14 16:08:00,724 ocrqa_create_bloom_filter.py:259 INFO: Estimated word count: 3358453
2025-02-14 16:08:01,663 ocrqa_create_bloom_filter.py:263 INFO: Bloom Filter created and saved to build.d/fp_prob_0.00001/ocrqa-wp_v1.0.6-de.bloom
2025-02-14 16:08:02,322 ocrqa_create_bloom_filter.py:285 INFO: Diagnosis Results:
2025-02-14 16:08:02,323 ocrqa_create_bloom_filter.py:286 INFO: - Excluded words in bloom filter: 0
2025-02-14 16:08:02,323 ocrqa_create_bloom_filter.py:287 INFO: - Known words not in bloom filter: 0
2025-02-14 16:08:02,821 ocrqa_create_bloom_filter.py:294 INFO: - Low-frequency words in bloom filter: 25
2025-02-14 16:08:02,821 ocrqa_create_bloom_filter.py:300 INFO: - Proportion of excluded words in bloom filter: 0.00000000
2025-02-14 16:08:02,821 ocrqa_create_bloom_filter.py:306 INFO: - Proportion of known words not in bloom filter: 0.00000000
2025-02-14 16:08:02,821 ocrqa_create_bloom_filter.py:314 INFO: - Proportion of low-frequency words in bloom filter: 0.00000760