Translate articles easily with Amazon Translate and R
This article is part of a university home assignment. In it, I will show you how to translate articles with the Amazon Translate service and a short R script. If you are a lazy programmer like me who is looking for an easier way to translate long articles, you are in the right place!
First, I will briefly explain the tools we need to set up our workspace. Then I will cover the steps needed to fetch the article we want and translate it.
So, first things first: we need RStudio to write an R script, and we need an AWS (Amazon Web Services) account to use its Amazon Translate service. Yes, just two tools to get the work done!
Now, let’s see what the workflow looks like:
- In AWS, we download an access key as a CSV file so that we can use the Amazon Translate service
- In RStudio, we install the necessary packages and read our access key
- We write an R script that fetches the article we want
- We send a request to the Amazon Translate service, asking it to translate our article
- The response contains the translation of the article, which we simply save as a text file
Yes, as simple as that! If you are excited to know the details, then let’s jump to the implementation.
First step: Access Key. We need this key to access Amazon services from our R code. To get it, log in to the AWS Management Console and navigate to IAM¹, which is located in the Security, Identity, & Compliance section. Then go to Users → click your username → Security Credentials → Create Access Key, and choose Download CSV file. Save this file in the same folder where we will write our R code and rename it accessKeys.CSV. Now we can use this key to authenticate as an AWS user.
This CSV file contains the Access Key ID and the Secret Access Key. Never push it to a git repository unless you want some stranger to take your key and use it, for example, to mine bitcoin at your expense. This scenario is unlikely: when Amazon detects an exposed key, it sends you an “AWS Key exposed” warning email and disables that access key. But still, let’s be on the safe side.
Second step: installing R packages. First, let me show you the code:
To access the Amazon Translate service, we need to install the “aws.translate” package from the cloudyr project. Then we read the access key (the CSV file we obtained from Amazon) with the read.csv() function and assign the result to the keyTable variable (a data frame). The Access Key ID and the Secret Access Key are extracted and assigned to separate variables: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. After that, we activate our keys by setting them as system environment variables. The last line of the code loads the “aws.translate” library.
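The setup described above might look like the following minimal sketch. Several details are assumptions rather than the original code: the column names Access.key.ID and Secret.access.key reflect how read.csv() typically renames the headers of the downloaded file, the region eu-west-1 is only a placeholder, the install command assumes the package is served from the cloudyr drat repository, and load_aws_keys() is a hypothetical helper name.

```r
# One-time installation (assumption: aws.translate comes from the cloudyr
# drat repository rather than CRAN):
# install.packages("aws.translate",
#                  repos = c(cloudyr = "http://cloudyr.github.io/drat",
#                            getOption("repos")))

# Hypothetical helper: read the key file and export the credentials as
# environment variables for the current R session.
load_aws_keys <- function(path = "accessKeys.CSV", region = "eu-west-1") {
  keyTable <- read.csv(path, header = TRUE, stringsAsFactors = FALSE)
  # Assumed column names: read.csv() turns "Access key ID" into Access.key.ID
  AWS_ACCESS_KEY_ID <- as.character(keyTable$Access.key.ID)
  AWS_SECRET_ACCESS_KEY <- as.character(keyTable$Secret.access.key)
  Sys.setenv(
    "AWS_ACCESS_KEY_ID" = AWS_ACCESS_KEY_ID,
    "AWS_SECRET_ACCESS_KEY" = AWS_SECRET_ACCESS_KEY,
    "AWS_DEFAULT_REGION" = region   # placeholder region, adjust to your account
  )
  invisible(TRUE)
}

# Usage, once the key file sits next to the script:
# load_aws_keys("accessKeys.CSV")

# Load the package only if it is installed, so the sketch runs either way.
if (requireNamespace("aws.translate", quietly = TRUE)) library(aws.translate)
```

Keeping the keys in environment variables means they never appear in the script itself, which fits the advice about not committing them to git.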
Third step: get the article. Again, first code → then explanation:
We start by installing and loading the “xml2” and “rvest” packages, which help us scrape the web page and extract the text of the article. We have chosen to translate a Russian article about the COVID-19 vaccine. The URL of this article is assigned to a variable. Next, we pass this variable to the read_html() function, which returns the parsed HTML of the web page. Then we extract the nodes that hold the article text from the whole page, as an XML nodeset. Finally, we obtain the text of the article from these nodes.
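Here is a rough sketch of that scraping step. The article URL and its CSS selector are not reproduced here, so the example parses a small inline HTML string instead of a live page. The helper name get_article_paragraphs() and the default selector "p" are assumptions for illustration only.

```r
# install.packages(c("xml2", "rvest"))   # one-time installation
library(xml2)
library(rvest)

# Hypothetical helper: pull the text of every node matching `css` out of a
# parsed page. The default selector "p" is an assumption; a real news site
# usually needs a more specific one.
get_article_paragraphs <- function(page, css = "p") {
  nodes <- html_nodes(page, css)                 # XML nodeset
  paragraphs <- html_text(nodes, trim = TRUE)    # text of each node
  paragraphs[nzchar(paragraphs)]                 # drop empty paragraphs
}

# Stand-in for the real page so the example runs offline. With a live article
# you would call read_html() on the variable holding the article's URL.
page <- read_html(
  "<html><body><h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
)
article <- get_article_paragraphs(page)
print(article)
```

The result is a character vector with one element per paragraph, which is the shape the translation step expects.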
Fourth and fifth steps: translate the article and save it. To use the translate() function of the “aws.translate” package, we need to specify the following:
- text we want to translate
- from which language
- to which language
We create an empty vector and iterate over the article vector, which contains each paragraph of the original article. We pass each Russian paragraph to translate() as the text to translate, together with the from = “ru” and to = “en” arguments, and add each translated paragraph to our vector.
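The loop described above can be sketched roughly as follows. The helper translate_paragraphs() and its translator argument are hypothetical names introduced for this example. By default the helper calls aws.translate::translate(), which needs valid AWS credentials, so the demo below passes a stand-in function and sends no request.

```r
# Hypothetical helper mirroring the loop described above. `translator` defaults
# to aws.translate::translate(); any function that accepts the same
# (text, from, to) arguments can be passed in for a dry run.
translate_paragraphs <- function(paragraphs, from = "ru", to = "en",
                                 translator = aws.translate::translate) {
  translated <- vector("list", length(paragraphs))   # empty container
  for (i in seq_along(paragraphs)) {
    result <- translator(paragraphs[[i]], from = from, to = to)
    translated[[i]] <- as.character(result)          # keep only the text
  }
  translated
}

# Dry run with a stand-in translator, so no request is sent to AWS:
fake_translator <- function(text, from, to) {
  paste0("[", from, "->", to, "] ", text)
}
result <- translate_paragraphs(c("one", "two"), translator = fake_translator)
print(unlist(result))
```

With real credentials loaded, calling translate_paragraphs(article) without the translator argument would send each paragraph to Amazon Translate.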
The text we pass to translate() should not exceed 5000 characters; otherwise we get an HTTP 404 error message.
The error comes back as an HTTP status code because the R code communicates with AWS through a REST API, which uses the HTTP protocol. In the background, the R session opens an HTTPS connection to one of the AWS endpoints, sends a request to the Amazon Translate service, and receives the result. So, if we want to translate a long article, we should split it into chunks of at most 5000 characters each.
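One possible way to do the splitting is sketched below. The function split_into_chunks() is a hypothetical helper, not part of aws.translate. It breaks the text on whitespace and greedily packs words into chunks, so it also collapses runs of whitespace into single spaces. It counts characters, as described above. If the service measures the limit in bytes instead (worth checking against the Amazon Translate documentation), a smaller max_chars would leave headroom for Cyrillic text.

```r
# Hypothetical helper: split `text` into pieces of at most `max_chars`
# characters, breaking only at whitespace where possible.
split_into_chunks <- function(text, max_chars = 5000) {
  words <- strsplit(text, "\\s+")[[1]]
  words <- words[nzchar(words)]
  chunks <- character(0)
  current <- ""
  for (word in words) {
    # A single word longer than the limit has to be cut mid-word.
    while (nchar(word) > max_chars) {
      if (nzchar(current)) {
        chunks <- c(chunks, current)
        current <- ""
      }
      chunks <- c(chunks, substr(word, 1, max_chars))
      word <- substr(word, max_chars + 1, nchar(word))
    }
    candidate <- if (nzchar(current)) paste(current, word) else word
    if (nchar(candidate) > max_chars) {
      chunks <- c(chunks, current)   # current chunk is full, start a new one
      current <- word
    } else {
      current <- candidate
    }
  }
  if (nzchar(current)) chunks <- c(chunks, current)
  chunks
}

# Small demonstration with a tiny limit so the splitting is visible.
text <- paste(rep("word", 10), collapse = " ")
chunks <- split_into_chunks(text, max_chars = 14)
print(chunks)
```

Each chunk can then be passed to translate() on its own, and the translated pieces pasted back together in order.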
After our vector is populated with the translated paragraphs, we flatten it with the unlist() function. As the last step, we write the result to a text file.
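A minimal sketch of this last step is shown below. The list translated stands in for the output of the translation loop, and the file name translated_article.txt is a placeholder. The file is written to a temporary directory here only so that the example does not touch your working folder.

```r
# `translated` stands in for the list built by the translation loop.
translated <- list("First translated paragraph.", "Second translated paragraph.")

article_en <- unlist(translated)    # list -> character vector

# Placeholder file name; use a path in your project folder in practice.
out_file <- file.path(tempdir(), "translated_article.txt")
writeLines(article_en, out_file)    # one paragraph per line

# Read the file back to confirm what was written.
print(readLines(out_file))
```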
As you can see, we can translate articles easily with a little R code and the Amazon Translate service. This approach is especially useful when we want to systematically translate many articles from many sources.
¹ IAM: Identity and Access Management