FAQ about Computational Linguistics

I’ve compiled a list of questions that I am often asked by linguists about computational linguistics and Natural Language Processing (NLP).

Feel free to contact me if you have other questions, or better answers!

show all

What is computational linguistics? How is it different from natural language processing (NLP)?

There is no clear consensus how the two differ. Others have discussed this. I know at least one person who insists that computational linguistics only refers to computational modeling of linguistic theories, but you are likely to hear "computational linguistics" and "natural language processing (NLP)" used interchangeably. Both involve programming computers to do interesting things with language. Generally speaking, "computational linguistics" is used more by linguists and "NLP" by the computer science side. A class called "Introduction to Computational Linguistics" in a linguistics department and one called "Introduction to NLP" in a computer science department may use the same textbook and cover the same topics, although adapted to different audiences.

How much data is necessary to train a computer to automate tasks in language documentation with reasonable reliability?

That's a good question. There will probably never be a definite, measurable answer. Too many factors are involved. Normally, natural language processing/computational linguistics algorithms are trained on hundreds of thousands or millions of words! The rule of thumb is that the more pre-annotated data a model trains on, the better its results. However, researchers are exploring techniques to improve results when training data is limited to thousands or even hundreds of words. I have seen over 90% accuracy reached with as little as 3000 words to train the model. But a whole lot depends on how consistently the linguists annotated the words!

Instead of attacking the question of "how much data" directly, better questions might be:

Can we achieve results that accelerate a linguist's work?
Can we achieve useful results with models trained on the resources produced by typical field projects, in language documentation and description?
Human-in-the-loop systems create a virtuous cycle of computer and human interaction, could this interaction improve language documentation tasks?
What techniques or pieces of information (computational or linguistic) improve results most dramatically?

The closest answers I've seen to these questions are in Paul Felt's MA thesis where a machine model did automatic prediction and then humans annotators corrected those predictions. The annotators' accuracy and speed started to improve when the model's predictions passed the threshold of 60% accuracy, compared to doing annotation with no help from the model.

What accuracy can a computer achieve when processing language documentation data?

That is almost impossible to answer without knowing what task you want to perform, so I'll talk generally about how we can improve accuracy with limited hand-annotated data. Keep in mind that most computational models train on thousands or millions of words. With this amount of pre-processed data, the models can achieve up to 97% or sometimes even 99% accuracy at tasks like part-of-speech tagging. The amount of annotated data available for most languages is not nearly enough to achieve that without some clever techniques. One technique is data augmentation or data "hallucination". This increases data by automatically creating words that may not actually exist in the language but adhere to its rules. For example, taking an adjective "capable" and attaching a nominalizing sufix "-ness" gives "capableness". This may not be a real English word but it follows English rules of morphology and phonology. Another technique is called human-in-the-loop (or active learning). This is a virtuous cycle that trains a computational model and then humans correct the model's worst errors. The model is then re-trained with those corrections. Hopefully, this second iteration of training greatly improves results. I believe this cycle of analyzing and annotating a small amount of data, training, predicting, correcting, and re-training a model will someday be a vital part of documentary and descriptive linguistic research. Perhaps models trained this way will achieve 99% accuracy without millions of words!

What programming languages do computational linguists use?

Python. Python has a number of code libraries, or packages, specifically for machine learning. Machine learning is a big part of computational linguistics and other fields in AI. Python is one of the easiest programming languages to learn, but once you’ve learned basic programming concepts in another language, it's easy to learn a second language.

I would like to recruit tech-savvy people to my language project. Do you have any recommendations?

It is great to seek collaboration! I was launched into NLP by talking to computational linguists at the International Conference on Language Documentation and Conservation (ICLDC) and I recommend this conference for anyone working in minority languages. You may find like-minded people as you browse its past presentations. The 2019 conference had a technology theme so it's a great place to start.

“The Workshop for the Use of Computational Methods for Endangered Languages” (ComputEL) is another event that brings together linguists, computer scientists, and community members. If you see a project there that sounds like something you want to do, contact the authors. Note that some articles are fairly technical.

If you have an idea for language technology that does not involve specific natural language processing or machine learning skills, you might contact a local hackathon or computer science department to see if you can pitch a project to them.

I'm interested in computational linguistics but I know very little computer science. Where should I start?

Not so many years ago, I was where you are now, very much interested in computational linguistics but a novice in computer science. Fortunately, NLP is an exploding field and getting started should be easier for you.

I was told to find a computer scientist with similar interests and collaborate with them. Collaborations like that do produce great work. For examples, see the automated transcription tool ELPIS or SignStream for linguistic analysis of American Sign Language.

That answer did not satisfy me and I took another path. For me the solution was to dive into programming and practice, practice, practice. Computer programming has a steep, but happily, a short, learning curve. I strongly recommend that you take a bootcamp or classes with an in-person teacher. I took a free class through a non-profit called LaunchCode which is available in some USA cities. After Programming 101, take a class in data structures and algorithms to learn how to think like a programmer.

If you have a basic programming foundation, a degree program might be the most efficient way to get your foot in the job market door.

I'm a college freshman interested in computational linguistics. What should I major in?

Great question! If your school does not have a computational linguistics or NLP major, then I recommend doing a major/minor combination in linguistics and either computer science or data science. You can approach computational linguistics as a linguist with computer science skills or as a computer scientist who works with languages. In computer science, this is commonly known as natural language processing (NLP). I come from the linguistics side and more than once I have regretted that I didn't take any computer science classes in my undergrad. I majored in history and I didn't discover linguistics until after I graduated.

I have an undergraduate major in linguistics and want to study computational linguistics in graduate school. Where should I start?

Check out my recommended "prerequisites".

To start, take an introductory programming class, preferably in Python. Then, take a class in data structures and algorithms. Your life will be easier if you feel comfortable with basic probability and statistics. Beyond an intuitive grasp of linear algebra (vectors and matrices) and some beginner calculus (reading symbols and partial derivatives). I found Khan Academy videos on linear algebra and calculus were sufficient. But I wish I had taken more classes in statistics, or rather, that I had retaken Statistics 101 a few more times :)! Don't let fear of a subject keep you from pursuing this field. Remember, C's get degrees!.

For graduate school, here's a list of linguistics departments in North American univerisities with computational linguistic faculty. You can find general advice for graduate applications on the web. Personally, I don't recommend a PhD to most people. Professional master programs in computational linguistics/NLP is exploding! For example, University of Colorado (my alma mater) has their well-established CLASIC program. The University of Washington has an online program and some of their coureses are offered free through Coursera and edX. At the University of Florida we have plans to develop a master's in computational linguistics as well!

Finally, consider whether you are more interested in computational modeling, that is, computational linguistics in the strict sense, or in data-driven applied work. North American Linguistics departments tend to lean towards one or the other.

Are you aware of any jobs that specialize in NLP for minority languages?

Work on endangered languages is mostly done in academia or by other non-profit institutions. Unfortunately, these jobs are limited. On the other hand, computer science or linguistics university departments are competing against higher-paying industry jobs. So, these positions tend to be less competitive than for other faculty positions.

Languages with small communities are not commercially profitable but more and more companies are working with low-resource languages. For example, both Microsoft and Google work with minority languages in India. (In India, "minority" can still mean 1 million speakers!) These languages are not endangered, but resources for training NLP models are limited. A challenge you solve in industry could, if made available, benefit endangered language communities.