Last Updated on August 13, 2024
The average user probably doesn’t spend much time considering how the tech that drives modern life actually works, but we’d like to change that, at least in a small way, with today’s discussion.
Rather than discussing hardware advances, we’ll be taking a look at text query search techniques.
Even if you’re not familiar with this field, you’ve no doubt made use of search techniques of some kind. In computer science, a text query refers to the process of using a text-based search to find relevant instances of similar text within a massive pool of sources.
If Google comes to mind, then you’re on the right track. Google and other popular search engines are easily the best-known examples of text query systems.
Under the hood, these search engines are using advanced techniques to look through massive amounts of data to find and identify a relatively small selection of results that, ideally, are relevant to a user’s search.
Internet search engines are certainly impressive in their own right, but there may have been times when you searched for something using one of these engines and none of the results gave you what you were looking for.
In certain cases, this might be due to the fact that there aren’t any posts online that are especially relevant to your search. But it also might be because the search engine was simply unable to return relevant results based on the text you entered.
Does this mean that existing internet search engines and their underlying tech are of poor quality? No, it doesn’t. For the majority of casual searches, these engines do exactly what they’re supposed to do, site scoring and bias aside.
But search engines, not just those we use online, still have plenty of room for improvement, and that’s precisely the specialty of our guest, Dr. Hua Yan.
With a Ph.D. in Computer Science, Dr. Yan has been researching search techniques for years, specializing in latent semantic indexing, or LSI. LSI is a method for analyzing documents for statistical co-occurrences of specific words in order to give a better idea of the topics of these documents.
We’ll have more details on LSI in the next section, but for the moment, we just want to stress that Dr. Yan is working within LSI to find even better ways to search and pull relevant information from large data sets.
“When we search on the internet, we always want to find the most relevant results. We say search engine X is better than search engine Y because the former seems to consistently bring out search results that we deem more relevant than the latter one does. The motivation behind my LSI research is to try to develop methods that can be used to create a better search engine. This better engine can be trained to be used for any text queries, including internet searches.”
Dr. Yan has published multiple research papers, one of which focuses squarely on improving LSI.
Why LSI matters
But before we get to Dr. Yan’s novel LSI search technique, let’s take a step back and discuss why LSI is so important.
To provide a straightforward explanation of LSI’s value when compared to other search techniques, we turn to Dr. Yan:
“LSI enables us to identify those few gemstones among a large pool of text sources. These small numbers of gems are hard to find using other techniques. However, with LSI, we can easily identify them because LSI employs an advanced mathematical indexing methodology and process to extract ‘hidden’ meaningful clues. We use these extracted hidden clues to match and identify those gems.”
LSI has its limitations, of course, some of which we’ll touch on later, but LSI’s unique value relative to other search techniques makes it an excellent choice for certain applications.
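To make the idea of ‘hidden’ clues a little more concrete, here is a minimal sketch of conventional LSI using NumPy. Everything in it is our own assumption for illustration: the five-document toy corpus, the raw word counts, the choice of k = 2 latent dimensions, and helper names such as to_term_vector and lsi_scores. It is not taken from Dr. Yan’s work.

```python
import numpy as np

# Hypothetical toy corpus: three documents about cars, two about cooking.
docs = [
    "car engine repair",
    "automobile engine maintenance",  # never mentions "car"
    "automobile car dealer",
    "pasta sauce recipe",
    "tomato sauce",
]

vocab = sorted({word for doc in docs for word in doc.split()})
term_row = {word: i for i, word in enumerate(vocab)}


def to_term_vector(text):
    """Count how often each vocabulary word appears in the text."""
    vec = np.zeros(len(vocab))
    for word in text.split():
        if word in term_row:
            vec[term_row[word]] += 1.0
    return vec


# Term-document matrix: one row per term, one column per document.
A = np.column_stack([to_term_vector(doc) for doc in docs])

# Truncated SVD: keep only the k strongest latent dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # assumed value, chosen for this toy corpus
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Each document expressed in the k-dimensional latent space (one row each).
doc_vecs = (s_k[:, None] * Vt_k).T


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def lsi_scores(query):
    """Project the query into the latent space and score every document."""
    q_vec = U_k.T @ to_term_vector(query)
    return [cosine(q_vec, d) for d in doc_vecs]


scores = lsi_scores("car")
for doc, score in zip(docs, scores):
    print(f"{score:6.3f}  {doc}")
```

In this toy setup, the second document never contains the word “car,” yet it should score about as highly as the documents that do, because it shares “engine” and “automobile” with them. A plain keyword match would miss it entirely.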
Within LSI, Dr. Yan’s research has proposed a new rescaling technique that can further enhance the value of LSI.
A novel approach to LSI
Dr. Yan’s journal paper, “Augmenting the power of LSI in text retrieval: Singular value rescaling,” which he authored alongside William I. Grosky and Farshad Fotouhi, was published in Volume 65 of Data & Knowledge Engineering in 2008. In it, he proposed a new rescaling technique: singular value rescaling, or SVR.
The potential of this technique is impressive, to say the least. Here’s Dr. Yan’s breakdown of some of the paper’s most important findings.
“Experiments on a standardized data set confirmed the effectiveness of SVR, showing an improvement ratio of 5.9% over the best conventional LSI query method. I also compared SVR with another scaling technique in text retrieval called iterative residual rescaling (IRR). Experiments show that SVR performs better than IRR.”
So how exactly does SVR manage to improve on other LSI query methods? What are the nuts and bolts of SVR?
Dr. Yan provides a more technical explanation of how SVR performs differently from other techniques below.
“All LSI techniques generate a set of singular values for the data set with respect to a query. All of them, except SVR, treat these singular values as non-variables. In SVR, these singular values go through a transformative process where they are subject to rescaling. Each instance of rescaling produces a new set of transformed singular values, which are subject to re-evaluation. The re-evaluation process ultimately produces a particular set of rescaled singular values with the highest score. And this is the optimum solution that can be used to identify the top few gems among a large pool of text sources.”
In a more basic sense, SVR transforms values to enhance LSI’s inherent ability to bring the most valuable results to the forefront.
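The rescale-and-re-evaluate loop Dr. Yan describes can be sketched roughly as follows. Please treat this as an illustration under stated assumptions rather than the method from the paper: the article does not specify the rescaling function or the scoring measure, so we assume a simple power-law rescaling (each singular value raised to a hypothetical exponent alpha) and score each candidate with average precision against made-up relevance judgments. The matrix, query, and candidate alpha values are likewise invented.

```python
import numpy as np

# Hypothetical term-document counts (rows = terms, columns = documents).
A = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 1, 2, 0, 1],
    [0, 0, 1, 2, 1],
    [0, 0, 0, 1, 2],
    [1, 0, 0, 1, 1],
], dtype=float)

query = np.array([1, 1, 0, 0, 0, 0], dtype=float)  # hypothetical query
relevant = {0, 1}  # hypothetical relevance judgments (document indices)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3  # assumed number of latent dimensions
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]


def rank_documents(alpha):
    """Rank documents after rescaling the singular values as s ** alpha.

    The power-law form is an assumption made for this sketch.
    """
    s_scaled = s_k ** alpha
    doc_vecs = (s_scaled[:, None] * Vt_k).T
    q_vec = U_k.T @ query
    sims = []
    for d in doc_vecs:
        denom = np.linalg.norm(q_vec) * np.linalg.norm(d)
        sims.append(float(q_vec @ d / denom) if denom > 0 else 0.0)
    # Highest similarity first; stable sort keeps ties in index order.
    return [int(i) for i in np.argsort(-np.array(sims), kind="stable")]


def average_precision(ranking, relevant_ids):
    """Mean of precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0


# Re-evaluate each rescaling and keep the one with the highest score.
alphas = [0.5, 1.0, 1.5, 2.0]  # alpha = 1.0 is conventional LSI
results = {a: average_precision(rank_documents(a), relevant) for a in alphas}
best_alpha = max(results, key=results.get)
print(best_alpha, results[best_alpha])
```

Because alpha = 1.0 leaves the singular values untouched, conventional LSI is always one of the candidates, so the selected rescaling can never score worse than it on the judgments used for tuning. Whether that gain carries over to unseen queries is a separate question that a toy example like this cannot answer.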
So now that we know SVR can deliver impressive results, where has it been applied already and where is it likely to be applied in the future?
SVR applications
First, let’s cover the applications where SVR has already been tested. In the journal paper mentioned above, Dr. Yan and his colleagues tested SVR on three different data sets: a Congressional Records data set, a Financial Times data set, and a Federal Register data set.
“In each instance, SVR was shown to produce the best query results compared with other techniques. These three applications indicate that if SVR is used in other occasions of text pools, it will likely also produce good or even optimum results.”
This further supports the current and potential value of SVR. Insightful readers will also notice, however, that these particular data sets, while certainly large, aren’t nearly as large as the data sets associated with internet searches. That brings us to the question of whether SVR, or indeed LSI in a more general sense, could be used within internet search engines.
Adapting for internet searches
Dr. Yan explained that his recent work has focused on two major aspects of LSI-based techniques.
First, he is actively exploring different ways of rescaling singular values within LSI. Second, Dr. Yan is trying to bridge the gap in text pool sizes between conventional LSI and internet searches.
At the moment, LSI techniques just aren’t practical for internet searches, as they’re very computationally expensive.
“In conventional LSI, the pool size is in the hundreds or thousands, while, in the case of an internet search, the pool size is in the millions. LSI and SVR can’t directly process such large pools because of limited computational resources. Therefore, some filtering or preprocessing techniques must be found in order to apply this technique to the area of internet searches.”
So in other words, it is possible that LSI and, more specifically, SVR could be applied to internet searches, but there’s at least one missing piece to the puzzle, and Dr. Yan is dedicated to finding that missing piece.
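To give a feel for what a filtering or preprocessing step might look like, here is one possible sketch: a cheap keyword lookup that shrinks a large pool down to a small candidate set, which an LSI or SVR stage could then rank. This is purely our own illustration, not Dr. Yan’s approach. The function names, the sample documents, and the max_candidates cutoff are all hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index


def candidate_pool(index, query, max_candidates):
    """Return up to max_candidates doc ids, most query-term matches first."""
    counts = defaultdict(int)
    for term in set(query.lower().split()):
        for doc_id in index.get(term, ()):
            counts[doc_id] += 1
    # Sort by number of matched terms (descending), then by document id.
    ranked = sorted(counts, key=lambda d: (-counts[d], d))
    return ranked[:max_candidates]


# Hypothetical pool; a real one would hold millions of documents.
docs = [
    "car engine repair guide",
    "used car dealer listings",
    "pasta sauce recipe",
    "engine maintenance schedule",
    "tomato soup recipe",
]

index = build_inverted_index(docs)
candidates = candidate_pool(index, "car engine", max_candidates=2)
print(candidates)  # only these documents would be passed on to LSI
```

One caveat worth keeping in mind: a purely keyword-based filter can throw away exactly the documents LSI is good at surfacing, namely those that are relevant without sharing the query’s words. Any practical filter would need to account for that trade-off.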
Making the computer more intelligent
Toward the end of our discussion with Dr. Yan, he provided a compelling comparison of LSI and a rather different area of computer science that’s currently seeing a lot of progress: AI.
Dr. Yan explained that both AI and LSI have the ultimate aim of making the computer more intelligent when it performs assigned tasks.
Where these fields split, however, is in how that aim is actually achieved.
“With LSI, it’s the mathematical process carried out in the computer that ultimately produces the most relevant results. In this respect, LSI or SVR is more heavily dependent on underlying mathematical models than other fields of computer science such as AI, which are typically more algorithm-oriented.”
Dr. Yan feels excited to be working on the cutting edge of LSI, and he feels strongly that LSI will continue to be a very fruitful field of computer science.
This brings us to the end of our look at LSI, courtesy of Dr. Yan. Thank you for joining us.