Data and the Fine Print or: How to Create a Sh*tstorm

Jost Zetzsche takes a revealing look at confidentiality when using online translation services

I like Twitter — it’s a good way to learn what’s happening and at the same time have an additional motivation to process and curate information so that you can share noteworthy articles and information yourself. ‘Twas in that spirit that I shared an article by Matthew Blake about the dangers of lawyers using Google Translate, specifically regarding quality and confidentiality. Not only that, but I even tagged on a “Good read” to my tweet.

It’s true that I hadn’t noticed it was “sponsored content” (though, truth be told, I have ghost-written a number of articles for sponsored content placement, and they were still pretty good, if I do say so myself). Either way, I wasn’t quite prepared for the storm that broke loose, a very small portion of which you can follow on Twitter. I didn’t jump into the assumed controversy with both feet right away, but a few days after the original eruption I revisited the contentious topic (the issue of confidentiality when using services like Google Translate and Microsoft Bing Translator) and was surprised by what I found.

Like most of you, probably, I had always assumed that everything passing through either of those two services would be used by Google or Microsoft. Well, that’s only partially true, and for us in particular it’s important to know exactly at what point the data is being used.

When you go to Google Translate at translate.google.com, it’s exactly as Blake’s article claims. Here is the language from Google’s Terms of Service:

“When you upload, submit, store, send or receive content to or through our Services, you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes we make so that your content works better with our Services), communicate, publish, publicly perform, publicly display and distribute such content.”

Pretty clear-cut and very much along the lines of what we expect: your source content (not your target, unless you use one of the tools on the site to modify the suggested translation) will be used.

Google Translator Toolkit, the minimalistic translation environment tool that Google offers, also uses your content, only here it uses both source and target:

“We may use the content you upload to Google Translator Toolkit to improve Google services pursuant to our Terms of Service [see above]. If you delete your content from Google Translator Toolkit, we will delete the content from our servers and, from that point forward, will not use it for any additional improvements to Google services.”

I’m not a lawyer, but in my mind that last sentence means that while the data is no longer processed once you delete it, whatever was gained from the data while you had it stored with GTT will still be used.

However, once you use the Google Translate API (which we use in most translation environment tools; essentially, anytime we enter an “API key” and have to pay for the use), things are very different. In that case:

“Google does not use the content you translate to train and improve our machine translation engine. In order to improve the quality of machine translation, Google needs parallel text — the content along with the human translation of that content.”

You can read all of the text.

(Now, I’m not sure about the parallel text statement. Statistical machine translation engines typically do use monolingual text alongside parallel text. I also don’t know why they would need the monolingual content of the non-API Google Translate but not this one. But, hey, what do I know, right?)

All this said, Google is assuring us that it will not use any of our data if we pay for the translation service. Did you know that? I didn’t either.

Let’s move on to Microsoft and the data that gets submitted to Microsoft Bing Translator.

Microsoft has all of its terms nicely put together on one page.

“Microsoft Translator does not use the text you submit for translation for any purpose other than to provide and improve Translator, including improvements to the quality and accuracy of translations provided by Translator. (…) The text we use to improve Translator is limited to a sample of not more than 10% of randomly selected, non-consecutive sentences from the text you submit, and we mask or delete numeric strings of characters and email addresses that may be present in the samples of text. The portions of text that we do not use to improve Translator are deleted within 48 hours after they are no longer required to provide your translation. If Translator is embedded within another service or product, we may group together all text samples that come from that service or product, but we do not store them with any identifiers associated with specific users.”
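To make the sampling rule in those terms concrete, here is a minimal Python sketch of what “not more than 10% of randomly selected, non-consecutive sentences” with masked numbers and email addresses could look like. It is only an illustration of the quoted wording, not Microsoft’s actual pipeline: the function names, the regular expressions and the `[NUM]`/`[EMAIL]` placeholders are all assumptions made for this example.

```python
import random
import re

# Deliberately simple, hypothetical patterns; a real system would be more careful.
EMAIL = re.compile(r"\S+@\S+")
NUMBER = re.compile(r"\d+")


def mask(sentence):
    """Replace email addresses and numeric strings with placeholders."""
    sentence = EMAIL.sub("[EMAIL]", sentence)
    return NUMBER.sub("[NUM]", sentence)


def sample_sentences(sentences, max_share=0.10, seed=None):
    """Pick at most max_share of the sentences at random, never two neighbours.

    Returns a dict mapping the original sentence index to the masked sentence.
    """
    rng = random.Random(seed)
    quota = int(len(sentences) * max_share)  # upper bound, rounded down
    candidates = list(range(len(sentences)))
    rng.shuffle(candidates)

    chosen = set()
    for i in candidates:
        if len(chosen) >= quota:
            break
        # Skip a sentence if one of its direct neighbours was already picked.
        if i - 1 in chosen or i + 1 in chosen:
            continue
        chosen.add(i)

    return {i: mask(sentences[i]) for i in sorted(chosen)}


if __name__ == "__main__":
    text = [f"Invoice {n} goes to user{n}@example.com." for n in range(50)]
    for index, sentence in sample_sentences(text, seed=1).items():
        print(index, sentence)
```

Even in this toy version, the point of the terms is visible: no more than a tenth of the text is kept, the kept sentences are never adjacent, and the masked output carries no digits or addresses. Names, places and everything else in the sampled sentences are kept as they are.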

OK, kind of what we thought. And what about Microsoft Translator Hub, the customizable machine translation engine that Microsoft offers?

“The Hub retains and uses submitted documents in full in order to provide your personalized translation system and to improve the Translator service. After you remove a document from your Hub account we may continue to use it for improving the Translator service.”

That’s a little “less generous” than Google — even after you withdraw your documents, they might still continue to be processed.

What’s really interesting is that there is also an exception — just like with Google, if you pay (enough) you can opt out of your data being processed.

“If you subscribe to a monthly volume of 250 million characters or more, you may request to have logging turned off for the text you submit to Microsoft Translator.”

So, if you pay a little more than USD 2,000 per month (USD 2,055 to be exact), you can request that Microsoft not process your data to improve the translation service. (The same terms apply to Microsoft Translator Hub as well.)
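For a sense of scale, the figures above can be turned into a per-unit price. The short Python calculation below uses only the two numbers cited in this article (USD 2,055 per month for 250 million characters); the variable names are made up for illustration, and the prices may well have changed since.

```python
# Figures as cited in the article; treat them as a snapshot, not current pricing.
MONTHLY_FEE_USD = 2055
MONTHLY_CHARACTERS = 250_000_000

# Price per one million characters at the opt-out tier.
usd_per_million = MONTHLY_FEE_USD / (MONTHLY_CHARACTERS / 1_000_000)

# What the opt-out tier costs over a full year.
yearly_fee_usd = MONTHLY_FEE_USD * 12

print(f"USD {usd_per_million:.2f} per million characters")
print(f"USD {yearly_fee_usd:,} per year")
```

In other words, the per-character price is modest; what makes the opt-out expensive is the minimum volume you have to subscribe to.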

So, to summarize, if you don’t pay for either Google’s or Microsoft’s services, your data will be processed. If you pay (in Microsoft’s case: “if you pay a whole lot”), your data will be left alone. That’s at least what the legal language says. And that should have an impact on the ongoing discussions on confidentiality concerns when using generic machine translation services.

And Blake’s article? He was essentially right since he was not talking about professional linguists who would likely be using the API, but about the casual user in the legal field. His concerns about quality are spot on as well.

And as for our not being in a position to have an impact on those matters? After I published an early version of this article in my newsletter, I sent it to one of the people at Microsoft who is responsible for the Microsoft Translator program. His response: “Looks like it is time to revise our behavior one more time.” We can make a difference.

To put this all into perspective, here’s an interesting outlook from a localization manager at a reasonably large IT company with whom I have worked in the past. He wrote to me recently to share his concern about the decreasing quality of translation over the past year, wondering aloud whether generic MT engines like the ones discussed in this article are to blame. When I shared this on Twitter, a deluge of responses suggested that he should find new vendors or that choosing a tool is the responsibility of the individual translator. I agree and I agree. Still, we would be wise to “treasure all these things and turn them over in [our] mind.”

This article was originally published in the ITI Bulletin (May-June 2015), the bi-monthly magazine of the Institute of Translation & Interpreting (www.iti.org.uk).

Author bio

Jost Zetzsche is an English-to-German translator, a localization and translation consultant, and a widely published author on various aspects of translation. He writes regular columns in the ATA Chronicle and the ITI Bulletin; his computer guide for translators, A Translator’s Tool Box for the 21st Century, is now in its tenth edition; and his technical newsletter for translators goes out to more than 10,000 translators. In 2012, Penguin published his co-authored Found in Translation, a book about translation and interpretation for the general public. You can find his website at www.internationalwriters.com and his Twitter handle is @Jeromobot.

September 28, 2015 By Reblog Reblogs

About the author

This post was reblogged from a magazine or journal of a translators' association (or another type of organization). In most cases, the articles we reblog in the Adventures in Technical Translation blog are available only in print format for the organization's members and not online. We believe the content of these articles is very interesting and valuable so we want to make them available to as many people as possible.
