My company, Southdata Ltd., is interested in OCR (Optical Character Recognition). I imagine that many of you translate on the computer, and the first problem you have is getting the paper document in the source language onto a computer so that you can work on it. You could just copy-type it, but if you want to save time and trouble you could have a go at Optical Character Recognition. What the OCR tries to do is to run around the page, find bits of ink and decide what they are. Having decided that what it has seen is an e acute or an umlaut or whatever it is, it then outputs the appropriate computer code.

Let me begin by looking at some of the problems you have with OCR. This is the process of OCR (a lot of people call it scanning, which is only part of it). You start with a piece of paper and you stick it on the scanner, which is a piece of equipment rather like a photocopier in reverse. It scans the paper, but it doesn’t output another bit of paper; it outputs a signal to the computer. The signal goes into the computer as an image. Inside the computer, the Optical Character Recognition software tries to make sense of it. It will put things on the screen to get the operator to help it, and then finally it produces text. The whole process is equivalent to copy typing; it is really nothing more or less than that, just quicker.

What are the problems with OCR? The first and main one is that what the OCR wants is letters that are separate bits of black surrounded by white, which is its definition of a letter. Very few documents are actually printed that clearly. They usually look more like a bit of printing which has been photocopied so often it couldn’t conceivably be read by Optical Character Recognition. But if it says, “Can I buy you a drink?” and if