Natural language processing (NLP) is a subfield of artificial intelligence and linguistics. It studies the problems of automated generation and understanding of natural human languages. Natural language generation systems convert information from computer databases into normal-sounding human language, and natural language understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate.
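As a toy illustration of the two directions described above, the following sketch renders a database-style record as a sentence and then maps that sentence back to a structured record. The record fields, the sentence template, and the function names `generate` and `understand` are assumptions made for this example only; real generation and understanding systems are far more elaborate than a single template and a single pattern.

```python
import re

# Hypothetical sentence template and matching pattern (assumptions for this sketch).
TEMPLATE = "The temperature in {city} is {temp} degrees."
PATTERN = re.compile(
    r"The temperature in (?P<city>[A-Za-z ]+) is (?P<temp>-?\d+) degrees\."
)


def generate(record):
    """Generation: turn a structured record into a sentence."""
    return TEMPLATE.format(city=record["city"], temp=record["temp"])


def understand(sentence):
    """Understanding: turn a sentence into a structured record, or None."""
    match = PATTERN.fullmatch(sentence)
    if match is None:
        return None
    return {"city": match.group("city"), "temp": int(match.group("temp"))}


record = {"city": "Oslo", "temp": -3}
sentence = generate(record)
print(sentence)
print(understand(sentence))
```

A sentence that does not fit the pattern yields `None`, which hints at why understanding is generally the harder of the two directions.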
The field is divided into two major categories: speech and text.
The technologies needed for the two are very different: "speech" is normally addressed by electronic or telecommunications engineers, while "text" is more often addressed by computer scientists. Additionally, speech recognition and speech generation are normally built as a layer over a text recognition/generation engine.
To follow a course on NLP, you must know about formal languages and grammars, and be able to program fluently. For the practicals, you must learn a logic programming language; Prolog is recommended. A rule-based programming language can serve as well, although it is perhaps more difficult to master.
If you look at the text above
Furthermore, the text contains more information that we could identify, e.g. a header or links. In general, a document can contain other document elements such as images, audio, video, or interactive elements like forms or geometric constructions that can be manipulated by the reader.
A tree can represent the decomposition of text into substructures. For example, the decomposition of a text into sections forms the top level of an Abstract Syntax Tree (AST).
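Identifying headers and links can be sketched with Python's standard `html.parser` module. The class name `ElementFinder` and the sample markup are assumptions made for this illustration, and the sketch only covers headings and `<a href>` links, not images, audio, or forms.

```python
from html.parser import HTMLParser


class ElementFinder(HTMLParser):
    """Collects heading texts and link targets from an HTML document."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []          # list of (tag, text) pairs
        self.links = []             # list of href values
        self._open_heading = None   # tag name of the heading being read, if any
        self._buffer = []           # text fragments of the open heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._open_heading = tag
            self._buffer = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_data(self, data):
        if self._open_heading is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_heading:
            text = "".join(self._buffer).strip()
            self.headings.append((tag, text))
            self._open_heading = None


# Hypothetical sample document for the sketch.
sample = '<h1>Introduction</h1><p>See <a href="https://example.org">this page</a>.</p>'
finder = ElementFinder()
finder.feed(sample)
finder.close()
print(finder.headings)
print(finder.links)
```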
<h1>Introduction</h1>
Text
<h2>My Subsection</h2>
Text of subsection
\section{Introduction}
Text
\subsection{My Subsection}
Text of subsection
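The decomposition into sections shown in the two markup examples can be sketched as a small tree builder. This is a minimal illustration, not a full parser: it assumes well-formed `<h1>` to `<h6>` headings without attributes, and the names `Section`, `build_tree`, and `show` are hypothetical helpers introduced for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Section:
    level: int
    title: str
    text: str = ""
    children: list = field(default_factory=list)


# Matches a heading such as <h2>Title</h2>; the level is captured as a digit.
HEADING = re.compile(r"<h([1-6])>(.*?)</h\1>", re.DOTALL)


def build_tree(html):
    """Return a root node whose descendants mirror the heading hierarchy."""
    root = Section(level=0, title="document")
    stack = [root]
    # With two capture groups, split yields:
    # text, level, title, text, level, title, text, ...
    parts = HEADING.split(html)
    root.text = parts[0].strip()
    for i in range(1, len(parts), 3):
        level = int(parts[i])
        node = Section(level, parts[i + 1].strip(), parts[i + 2].strip())
        # Climb up until the top of the stack is a shallower section.
        while stack[-1].level >= level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


def show(node, indent=0):
    """Return an indented outline of the tree as a list of lines."""
    lines = ["  " * indent + node.title]
    for child in node.children:
        lines.extend(show(child, indent + 1))
    return lines


doc = "<h1>Introduction</h1> Text <h2>My Subsection </h2> Text of subsection"
tree = build_tree(doc)
print("\n".join(show(tree)))
```

The stack holds the chain of currently open sections, so a new heading is attached to the nearest enclosing section with a smaller level. The same idea would apply to the LaTeX form if `\section` and `\subsection` were mapped to levels 1 and 2.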