corpus linguistics | Chris Aldrich

Replied to a post by Natalie (hcommons.social)

I started the second week of "Programming 101: An Introduction to #Python for Educators" on #FutureLearn and wrote a small quiz about Arabic verbs: https://github.com/kranatalie/Introduction-to-Python/blob/main/arabic-quiz-bot.py It was fun again and I'm actually a little proud! Would really like to recommend the course again. It's the perfect gentle introduction for me that doesn't overwhelm but still teaches enough to get an idea of what's possible. Looking forward to the final challenge this week: Building commands into your bot. Let's try this! https://www.futurelearn.com/courses/programming-101

@natalie, Thanks for the recommendation, this looks great! It looks like it may be a good companion to the Santa Fe Institute’s (free) Foundations & Applications of Humanities Analytics https://www.complexityexplorer.org/courses/162-foundations-applications-of-humanities-analytics which starts on Jan 17. #DigitalHumanities

Read A Song of Scottish Publishing, 1671-1893 by Shawn (electricarchaeology.ca)

The Scottish National Library has made available a collection of chapbooks printed in Scotland, from 1671 – 1893, on their website here. That’s nearly 11 million words’ worth of material. The booklets cover an enormous variety of subjects. So, what do you do with it? Today, I decided to turn ...

This is more cool than truly useful, but I could see audioizations of data like this being used to surface and recognize patterns that might not otherwise be seen.

👓 Humane Ingenuity 9: GPT-2 and You | Dan Cohen | Buttondown

Read Humane Ingenuity 9: GPT-2 and You by Dan Cohen (buttondown.email)

This newsletter has not been written by a GPT-2 text generator, but you can now find a lot of artificially created text that has been.

For those not familiar with GPT-2, it is, according to its creators OpenAI (a socially conscious artificial intelligence lab overseen by a nonprofit entity), “a large-scale unsupervised language model which generates coherent paragraphs of text.” Think of it as a computer that has consumed so much text that it’s very good at figuring out which words are likely to follow other words, and when strung together, these words create fairly coherent sentences and paragraphs that are plausible continuations of any initial (or “seed”) text.

This isn’t a very difficult problem and the underpinnings of it are well laid out by John R. Pierce in *[An Introduction to Information Theory: Symbols, Signals and Noise](https://amzn.to/32JWDSn)*. In it he has a lot of interesting tidbits about language and structure from an engineering perspective including the reason why crossword puzzles work.
November 13, 2019 at 08:33AM
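As a rough illustration of the "which words are likely to follow other words" idea, here is a minimal sketch of a bigram (first-order Markov) text generator, the kind of word-level approximation to English that Pierce discusses. It is a toy and not how GPT-2 itself works; the corpus string and the function names `build_bigram_model` and `generate` are made up for the example.

```python
import random
from collections import defaultdict


def build_bigram_model(text):
    """Map each word to the list of words observed immediately after it."""
    words = text.split()
    followers = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word].append(next_word)
    return followers


def generate(followers, seed, length=12, rng=None):
    """Extend a seed word by repeatedly sampling one of its observed followers."""
    rng = rng or random.Random()
    output = [seed]
    for _ in range(length - 1):
        candidates = followers.get(output[-1])
        if not candidates:  # dead end: this word was never followed by anything
            break
        output.append(rng.choice(candidates))
    return " ".join(output)


# Hypothetical toy corpus; a real model would be trained on far more text.
corpus = "the cat sat on the mat and the dog sat on the rug"
model = build_bigram_model(corpus)
print(generate(model, "the", length=8, rng=random.Random(0)))
```

Because followers are stored with repetition, a word that follows another more often is proportionally more likely to be sampled, which is the whole trick at this scale.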

The most interesting examples have been the weird ones (cf. HI7), where the language model has been trained on narrower, more colorful sets of texts, and then sparked with creative prompts. Archaeologist Shawn Graham, who is working on a book I’d like to preorder right now, An Enchantment of Digital Archaeology: Raising the Dead with Agent Based Models, Archaeogaming, and Artificial Intelligence, fed GPT-2 the works of the English Egyptologist Flinders Petrie (1853-1942) and then resurrected him at the command line for a conversation about his work. Robin Sloan had similar good fun this summer with a focus on fantasy quests, and helpfully documented how he did it.

Circle back around and read this when it comes out.

Similarly, these other references should be an interesting read as well.
November 13, 2019 at 08:36AM

From this perspective, GPT-2 says less about artificial intelligence and more about how human intelligence is constantly looking for, and accepting of, stereotypical narrative genres, and how our mind always wants to make sense of any text it encounters, no matter how odd. Reflecting on that process can be the source of helpful self-awareness—about our past and present views and inclinations—and also, some significant enjoyment as our minds spin stories well beyond the thrown-together words on a page or screen.

And it’s not just happening with text, but it also happens with speech, as I’ve written before: Complexity isn’t a Vice: 10 Word Answers and Doubletalk in Election 2016. In fact, in that case, looking at transcripts actually helps to reveal that the emperor had no clothes because there’s so much missing from the speech that the text doesn’t have enough space to fill in the gaps the way the live speech did.
November 13, 2019 at 08:43AM

👓 Large Cache of Texts May Offer Insight Into One of Africa’s Oldest Written Languages | Smithsonian Magazine

Read Large Cache of Texts May Offer Insight Into One of Africa's Oldest Written Languages (Smithsonian)

Archaeologists in Sudan have uncovered the largest assemblage of Meroitic inscriptions to date

This is a cool discovery, in great part because their documentation was interesting enough to be able to suggest further locations to check for more archaeological finds. This might also be something one could apply some linguistic analysis and information theory to in an attempt to better pull apart the language and grammar.

h/t to @ArtsJournalNews, bookmarked on April 17, 2018 at 08:16AM

Trove Of Inscriptions In Sub-Saharan Africa’s Oldest Written Language Discovered:

“Archaeologists in Sudan have uncovered a large cache of rare stone inscriptions at the Sedeinga necropolis along the Nile River. The collection of funerary texts are ins… https://t.co/8qb3gkkpsa

— ArtsJournal (@ArtsJournalNews) April 17, 2018

🔖 [1803.09745] English verb regularization in books and tweets | arXiv

Bookmarked [1803.09745] English verb regularization in books and tweets by Tyler J. Gray, Andrew J. Reagan, Peter Sheridan Dodds, Christopher M. Danforth (arxiv.org)

The English language has evolved dramatically throughout its lifespan, to the extent that a modern speaker of Old English would be incomprehensible without translation. One concrete indicator of this process is the movement from irregular to regular (-ed) forms for the past tense of verbs. In this study we quantify the extent of verb regularization using two vastly disparate datasets: (1) Six years of published books scanned by Google (2003--2008), and (2) A decade of social media messages posted to Twitter (2008--2017). We find that the extent of verb regularization is greater on Twitter, taken as a whole, than in English Fiction books. Regularization is also greater for tweets geotagged in the United States relative to American English books, but the opposite is true for tweets geotagged in the United Kingdom relative to British English books. We also find interesting regional variations in regularization across counties in the United States. However, once differences in population are accounted for, we do not identify strong correlations with socio-demographic variables such as education or income.
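The abstract doesn't spell out how regularization is measured, so the following is only a minimal sketch under my own assumption: for each verb, take the share of past-tense tokens that use the regular (-ed) form. The verb list, the sample text, and the function name `regularization_fractions` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, hand-picked (regular, irregular) past-tense pairs.
VERB_FORMS = {
    "burn": ("burned", "burnt"),
    "dream": ("dreamed", "dreamt"),
    "learn": ("learned", "learnt"),
}


def regularization_fractions(text, verb_forms=VERB_FORMS):
    """Return, per verb, regular count / (regular + irregular count)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    fractions = {}
    for verb, (regular, irregular) in verb_forms.items():
        total = counts[regular] + counts[irregular]
        if total:  # skip verbs that never appear in the text
            fractions[verb] = counts[regular] / total
    return fractions


# Made-up sample text standing in for a corpus of books or tweets.
sample = ("She burned the toast. He burnt the letter. "
          "They burned the maps. I dreamt of rain.")
print(regularization_fractions(sample))
```

In the sample, "burn" appears in its regular form in two of three cases and "dream" in none, while "learn" is skipped because it never occurs. Comparing such fractions between two corpora is the general shape of the comparison the abstract describes.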

The emotional arcs of stories are dominated by six basic shapes

Bookmarked The emotional arcs of stories are dominated by six basic shapes (arxiv.org)

Advances in computing power, natural language processing, and digitization of text now make it possible to study a culture's evolution through its texts using a "big data" lens. Our ability to communicate relies in part upon a shared emotional experience, with stories often following distinct emotional trajectories, forming patterns that are meaningful to us. Here, by classifying the emotional arcs for a filtered subset of 1,737 stories from Project Gutenberg's fiction collection, we find a set of six core trajectories which form the building blocks of complex narratives. We strengthen our findings by separately applying optimization, linear decomposition, supervised learning, and unsupervised learning. For each of these six core emotional arcs, we examine the closest characteristic stories in publication today and find that particular emotional arcs enjoy greater success, as measured by downloads.
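To make "emotional arc" a little more concrete, here is a minimal sketch assuming the simplest approach I can think of: slide a fixed-size window across the text and average the valence of whatever lexicon words fall inside it. The abstract doesn't describe the authors' pipeline, so the tiny lexicon, the window sizes, and the function name `emotional_arc` are all assumptions for illustration.

```python
import re

# Hypothetical toy lexicon: word -> valence (positive means happier).
TOY_VALENCE = {
    "happy": 1.0, "love": 1.0, "joy": 1.0,
    "sad": -1.0, "loss": -1.0, "fear": -1.0,
}


def emotional_arc(text, window=4, step=2, lexicon=TOY_VALENCE):
    """Average lexicon valence inside each sliding window of words."""
    words = re.findall(r"[a-z]+", text.lower())
    arc = []
    for start in range(0, max(len(words) - window, 0) + 1, step):
        chunk = words[start:start + window]
        scores = [lexicon[w] for w in chunk if w in lexicon]
        # Windows with no scored words are treated as neutral.
        arc.append(sum(scores) / len(scores) if scores else 0.0)
    return arc


# Made-up sixteen-word "story" with a fall followed by a rise.
story = ("joy and love at home then loss and fear "
         "on the road then happy joy again")
print(emotional_arc(story))
```

The resulting list of window scores is the arc; on a full novel one would use a much larger window and a real sentiment lexicon, and then the arcs of many books could be compared or clustered.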

Webmention + Books = BookMention

Part of my plans to (remotely) devote the weekend to the IndieWeb Summit in Portland were hijacked by the passing of Muhammad Ali. Wait… What?! How does that happen?

A year ago, I started a publishing company and we came out with our first book Amerikan Krazy in late February. The author has a small back catalogue that’s out of print, so in conjunction with his book launch, we’ve been slowly releasing ebook versions of his old titles. Coincidentally, one of them was a fantastic little book about Ali entitled Muhammad Ali Retrospective, so I dropped everything I was doing to get it finished up and out as a quick way of honoring his passing.

But while I was working on some of the minutiae, I’ve been thinking in the back of my mind about the ideas of marginalia, commonplace books, and Amazon’s siloed community of highlights and notes. Is there a decentralized web-based way of creating a construct similar to webmention that will allow all readers worldwide to highlight, mark up and comment across electronic versions of texts so that they can share them in an open manner while still owning all of their own data? And possibly a way to aggregate them at the top for big data studies in the vein of corpus linguistics?

I think there is…

It’ll take some effort, but effort that could have a worthwhile impact.

I have a few potential architectures in mind, but also want to keep online versions of books in the loop as well as potentially efforts like hypothes.is or even the academic portions of Genius.com which do web-based annotation.

If anyone in the IndieWeb, books, or online marginalia worlds has thought about this as well, I’d love to chat.

Some Thoughts on Academic Publishing and “Who’s downloading pirated papers? Everyone” from Science | AAAS

Bookmarked Who's downloading pirated papers? Everyone by John Bohannon (Science | AAAS)

An exclusive look at data from the controversial web site Sci-Hub reveals that the whole world, both poor and rich, is reading pirated research papers.

Sci Hub has been in the news quite a bit over the past half a year and the bookmarked article here gives some interesting statistics. I’ll preface some of the following editorial critique with the fact that I love John Bohannon’s work; I’m glad he’s spent the time to do the research he has. Most of the rest of the critique is aimed at the publishing industry itself.

From a journalistic standpoint, I find it disingenuous that the article didn’t actually hyperlink to Sci Hub. Neither did it link out (or provide a full quote) to Alicia Wise’s Twitter post(s) nor link to her rebuttal list of 20 ways to access their content freely or inexpensively. Of course both of these are editorial related, and perhaps the rebuttal was so flimsy as to be unworthy of a link from such an esteemed publication anyway.

Sadly, Elsevier’s list of 20 ways of free/inexpensive access doesn’t really provide any simple coverage for graduate students or researchers in poorer countries, who are the likeliest group of people using Sci Hub, unless they’re going to fraudulently claim they’re part of a class which they’re not, and is this morally any better than the original theft method? It’s almost assuredly never used by patients, who seem to be covered under one of the options, as the option to do so is painfully undiscoverable past their typical $30/paper paywalls. Not only is their patchwork hodgepodge of free access difficult to discern, but one must keep in mind that this is just one of dozens of publishers a researcher must navigate to find the one thing they’re looking for right now (not to mention the thousands of times they need to do this throughout a year, much less a career).

Consider this experiment, which could be a good follow up to the article: is it easier to find and download a paper by title/author/DOI via Sci Hub (a minute) versus through any of the other publishers’ platforms with a university subscription (several minutes) or without a subscription (an hour or more to days)? Just consider the time it would take to dig up every one of 30 references in an average journal article: maybe just a half an hour via Sci Hub versus the days and/or weeks it would take to jump through the multiple hoops to first discover, read about, and then gain access and then download them from the over 14 providers (and this presumes the others provide some type of “access” like Elsevier).

Those who lived through the Napster revolution in music will realize that the dead simplicity of their system is primarily what helped kill the music business compared to the ecosystem that exists now with easy access through the multiple streaming sites (Spotify, Pandora, etc.) or inexpensive paid options like iTunes. If the publishing business doesn’t want to get completely killed, they’re going to need to create the iTunes of academia. I suspect they’ll have internal bean-counters watching the percentage of the total (now apparently 5%) and will probably only do something before it passes a much larger threshold, though I imagine that they’re really hoping that the number stays stable which signals that they’re not really concerned. They’re far more likely to continue to maintain their status quo practices.

Some of this ease-of-access argument is truly borne out by the statistics of open access papers which are downloaded by Sci Hub–it’s simply easier to both find and download them that way compared to traditional methods; there’s one simple pathway for both discovery and download. Surely the publishers, without colluding, could come up with a standardized method or protocol for finding and accessing their material cheaply and easily?

“Hart-Davidson obtained more than 100 years of biology papers the hard way—legally with the help of the publishers. ‘It took an entire year just to get permission,’ says Thomas Padilla, the MSU librarian who did the negotiating.” John Bohannon in Who’s downloading pirated papers? Everyone

Personally, I use relatively advanced tools like LibX, which happens to be offered by my institution and which I feel isn’t very well known, and it still takes me longer to find and download a paper than it would via Sci Hub. God forbid if some enterprising hacker were to create a LibX community version for Sci Hub. Come to think of it, why haven’t any of the dozens of publishers built and supported simple tools like LibX which make their content easy to access? If we consider the analogy of academic papers to the introduction of machine guns in World War I, why should modern researchers still be using single-load rifles against an enemy that has access to nuclear weaponry?

My last thought here comes on the heels of the two tweets from Alicia Wise mentioned, but not shown in the article:

I’m all for universal access, but not theft! There are lots of legal ways to get access https://t.co/iDZW2XcPhy 1/2 .@mbeisen .@Sci_Hub

— Alicia Wise (@wisealic) March 14, 2016

A digital sub to the NYT $260/person and for all #Elsevier content $215/researcher. Both fantastic value! 2/2 .@mbeisen .@Scihub @nytimes

— Alicia Wise (@wisealic) March 14, 2016

She mentions that the New York Times charges more than Elsevier does for a full subscription. This is tremendously disingenuous as Elsevier is but one of dozens of publishers for which one would have to subscribe to have access to the full panoply of material researchers are typically looking for. Further, neither Elsevier nor their competitors are making their material as easy to find and access as the New York Times does. Neither do they discount access to the point that they attempt to find the subscription point that their users find financially acceptable. Case in point: while I often read the New York Times, I rarely go over their monthly limit of articles to need any type of paid subscription. Solely because they made me an interesting offer to subscribe for 8 weeks for 99 cents, I took them up on it and renewed that deal for another subsequent 8 weeks. Not finding it worth the full $35/month price point, I attempted to cancel. I had to cancel the subscription via phone, but why? The NYT customer rep made me no fewer than 5 different offers at ever decreasing price points–including the 99 cents for 8 weeks which I had been getting!!–to try to keep my subscription. Neither Elsevier nor any of their competitors has ever tried (much less so hard) to earn my business. (I’ll further posit that it’s because it’s easier to fleece at the institutional level with bulk negotiation, a model not too dissimilar to the textbook business pressuring professors on textbook adoption rather than trying to sell directly to the end consumer, the student, which I’ve written about before.)

(Trigger alert: Apophasis to come) And none of this is to mention the quality control that is (or isn’t) put into the journals or papers themselves. Fortunately one needn’t even go further than Bohannon’s other writings like Who’s Afraid of Peer Review? Then there are the hordes of articles on poor research design and misuse of statistical analysis and inability to repeat experiments. Not to give them any ideas, but lately it seems like Elsevier buying the Enquirer and charging $30 per article might not be a bad business decision. Maybe they just don’t want to play second-banana to TMZ?

Interestingly there’s a survey at the end of the article which indicates some additional sources of academic copyright infringement. I do have to wonder how the data for the survey will be used? There’s always the possibility that logged in users will be indicating they’re circumventing copyright and opening themselves up to litigation.

I also found the concept of using the massive data store as a means of applied corpus linguistics for science an entertaining proposition. This type of research could mean great things for science communication in general. I have heard of people attempting to do such meta-analysis to guide the purchase of potential intellectual property for patent trolling as well.

Finally, for those who haven’t done it (ever or recently), I’ll recommend that it’s certainly well worth their time and energy to attend one or more of the many 30-60 minute sessions most academic libraries offer at the beginning of their academic terms to train library users on research tools and methods. You’ll save yourself a huge amount of time.

Global Language Networks

Yesterday I ran across this nice little video explaining some recent research on global language networks. It’s not only interesting in its own right, but is a fantastic example of science communication as well.

I’m interested in some of the information theoretic aspects of this as well as the relation of this to the area of corpus linguistics. I’m also curious if one could build worthwhile datasets like this for the ancient world (cross reference some of the sources I touch on in relation to the Dickinson College Commentaries within Latin Pedagogy and the Digital Humanities) to see what influences different language cultures have had on each other. Perhaps the historical record could help to validate some of the predictions made in relation to the future?

The paper “Global distribution and drivers of language extinction risk” indicates that of all the variables tested, economic growth was most strongly linked to language loss.

This research also has some interesting relation to the concept of “Collective Learning” within the realm of a Big History framework via David Christian, Fred Spier, et al. I’m curious to revisit my hypothesis: Collective learning has potentially been growing at the expense of a shrinking body of diverse language some of which was informed by the work of Jared Diamond.

Some of the discussion in the video is reminiscent to me of some of the work Stuart Kauffman lays out in At Home in the Universe: The Search for the Laws of Self-Organization and Complexity (Oxford, 1995), particularly in chapter 3, in which Kauffman discusses the networks of life. The analogy of this to the networks of language here indicates to me that some of Cesar Hidalgo’s recent work in Why Information Grows: The Evolution of Order, From Atoms to Economies (MIT Press, 2015) is even more interesting in helping to show the true value of links between people and firms (information sources which he measures as personbytes and firmbytes) within economies.

Finally, I can also only think about how this research may help to temper some of the xenophobic discussion that occurs in American political life with respect to fears relating to Mexican immigration issues as well as the position of China in the world economy.

Those intrigued by the video may find the website set up by the researchers very interesting. It contains links to the full paper as well as visualizations and links to the data used.

Abstract

Languages vary enormously in global importance because of historical, demographic, political, and technological forces. However, beyond simple measures of population and economic power, there has been no rigorous quantitative way to define the global influence of languages. Here we use the structure of the networks connecting multilingual speakers and translated texts, as expressed in book translations, multiple language editions of Wikipedia, and Twitter, to provide a concept of language importance that goes beyond simple economic or demographic measures. We find that the structure of these three global language networks (GLNs) is centered on English as a global hub and around a handful of intermediate hub languages, which include Spanish, German, French, Russian, Portuguese, and Chinese. We validate the measure of a language’s centrality in the three GLNs by showing that it exhibits a strong correlation with two independent measures of the number of famous people born in the countries associated with that language. These results suggest that the position of a language in the GLN contributes to the visibility of its speakers and the global popularity of the cultural content they produce.

Citation: Ronen S, Goncalves B, Hu KZ, Vespignani A, Pinker S, Hidalgo CA
Links that speak: the global language network and its association with global fame, Proceedings of the National Academy of Sciences (PNAS) (2014), 10.1073/pnas.1410931111

“A language like Dutch — spoken by 27 million people — can be a disproportionately large conduit, compared with a language like Arabic, which has a whopping 530 million native and second-language speakers,” Science reports. “This is because the Dutch are very multilingual and very online.”
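For those curious how a "centrality" of a language in such a network might be computed, here is a minimal sketch using eigenvector centrality via power iteration. The abstract above doesn't name the measure the authors used, so that choice is my assumption, and the edge weights below are invented purely for illustration (they are not the paper's data). The function name `eigenvector_centrality` is likewise just for this example.

```python
def eigenvector_centrality(edges, iterations=100):
    """Power iteration on an undirected weighted graph {(a, b): weight}."""
    nodes = sorted({name for pair in edges for name in pair})
    neighbors = {name: {} for name in nodes}
    for (a, b), weight in edges.items():
        neighbors[a][b] = weight
        neighbors[b][a] = weight

    scores = {name: 1.0 for name in nodes}
    for _ in range(iterations):
        # Each node keeps its own score (an implicit self-loop) and adds its
        # neighbors' weighted scores; the self-loop leaves the ranking
        # unchanged but stops the iteration oscillating on bipartite graphs.
        updated = {
            name: scores[name]
            + sum(w * scores[other] for other, w in neighbors[name].items())
            for name in nodes
        }
        largest = max(updated.values())
        scores = {name: value / largest for name, value in updated.items()}
    return scores


# Hypothetical co-occurrence weights (e.g. translations or bilingual users).
toy_network = {
    ("English", "Spanish"): 3,
    ("English", "German"): 3,
    ("English", "Dutch"): 2,
    ("Dutch", "German"): 1,
    ("Spanish", "Portuguese"): 1,
}
centrality = eigenvector_centrality(toy_network)
for language in sorted(centrality, key=centrality.get, reverse=True):
    print(f"{language:<12}{centrality[language]:.3f}")
```

A language scores highly here when it is strongly linked to other well-linked languages, which is one way to capture the idea that a well-connected language can matter more than its raw number of speakers would suggest.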
