Previous discussion was archived at User talk:Deansfa/Archive 1 on 2016-12-09.

Elizium23 (talkcontribs)

Greetings,

Please see Help talk:Label. Also, the Phabricator task:

Now, these discussions say that a blank label is supposed to fall back to the English one. Does it not do this for you? That would be a bug.

If we set the "mul" language label then that should suffice, rather than spamming it to hundreds of non-English labels.

Deansfa (talkcontribs)

I often (almost always) use "mul" for family name items. But the gadget I use to fill in the labels sets all the Latin-alphabet labels anyway (here). I believe it's a Wikidata gadget. Maybe they should remove it then.

Deansfa (talkcontribs)

It is always like this with English speakers: English is the default, and deal with it. I'm sorry, but not everyone speaks English and not everyone sets their language to English; this is a multilingual project. Lots of contributors, bots, etc. add those labels in their language. It's a common practice. I have really had enough.

Elizium23 (talkcontribs)

"Adding the label in their language" is like Pope Francis (Q450675). Can you see how those labels are translated into individual languages, rather than being copy-pastes of the native name (his native name is not English, by the way.)

Deansfa (talkcontribs)

When we start removing labels from big items like Barack Obama (Q76) because there is a system that automatically sets the label in French and other languages, then I will be happy not to set it for other languages. But as of today, every item about a person on Wikipedia has labels in multiple languages. This is not spamming.

Elizium23 (talkcontribs)

I look forward to removing these en masse. Because it is a "kludge", as we call it. You're relying on a hack to get around some software bug, some issue you see in SPARQL that isn't respecting the fallback chain of translations.

The adding of many copy-paste labels is spam, and it burdens the system, and it will slow down your queries, I guarantee it.

This is not a universal practice, and we have no need to imitate other editors who are disregarding the proper application of non-native labels and languages.

Elizium23 (talkcontribs)

The language I speak is irrelevant. We are discussing an item whose name is in the English language. The person is an American with an English name. You have no need to put his French name in English, or German or Portuguese or his Asturian name, because the fallback and the translatewiki.net chains take care of this automatically.

Deansfa (talkcontribs)

Irrelevant when all this conversation is in English just for you to understand? Is it a joke? How insulting it is.

Just one example: when you work with SPARQL (and we do that a lot on the French Wikipedia, where we use SPARQL to generate lists) and you want the results displayed in a given language, the label will be blank if it has not been set in that language. There are lots of use cases where it won't show up. I will start a thread on the French chat because I want it set in French at least. There's no way you will dictate what is displayed in my native language.

Elizium23 (talkcontribs)

You can write in French, I don't mind; I am not a language snob and I can understand you just fine.

Elizium23 (talkcontribs)

You're proposing a massive maintenance burden. If you want to copy-paste hundreds, thousands of labels, then how to keep them synchronized? How to ensure they are identical? You can't, really. You're creating debt, debt for future editors to pick through them and maintain hundreds of copy-pasted spam labels that aren't actually in those languages, just to hack and work around current bugs.

Deansfa (talkcontribs)

I'm not sure what your point is. We use those labels on the French Wikipedia in some templates; for example, the author of a book pulled from Wikidata may not display in French if its label is not set in French. My example of the SPARQL lists is another good one.

The massive burden and the "spam" (your expression, not mine) are your edits that revert the edits of lots of other people.

Elizium23 (talkcontribs)

So show me an example where a label does not fall back to a default, but displays as blank? Can you demonstrate this defect for us please? Because I just wrote a SPARQL query, based on an example provided, and it had no trouble with a chain of specified languages. Please show us how Wikidata is falling short here. I want to see it.

Deansfa (talkcontribs)

Yes, English has to be the default. Why is that?

The irony is that the query I can provide about French items will show up in French because I set the labels of those items in French.

About your attitude, and your plan to move forward with this: I see hundreds of bot edits that do the exact opposite of what you do. Maybe there should be some coordination, because otherwise all of those edits, including yours, are spam at this point.

Elizium23 (talkcontribs)

No, English doesn't need to be the default. Are you not aware of the configurable fallback chains? You can tell Wikidata which languages you understand, and want to see, and Wikidata will present pages in those languages, with a preference order. Have you not configured this for yourself?

Deansfa (talkcontribs)

OK, but Elizium, if I don't set English in this query as the default (or the fallback, as you say), some items will show up as QXXXXX. I have to specify English as the fallback. That is surely fine for you, but I'm sorry, shouldn't this project emancipate itself from English at some point?

And what do you propose (real question) for, let's say, an item about a French person? Do we put the label just in English? Just in French? In both English and French?
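
For readers following the fallback argument, here is a minimal sketch of the idea being debated: try each language in a preference chain and show the bare identifier only when nothing matches. This is not Wikidata's actual implementation; the function name, the sample labels, and the placeholder identifier `Q_EXAMPLE` are all hypothetical.

```python
from typing import Mapping, Sequence


def resolve_label(qid: str, labels: Mapping[str, str], chain: Sequence[str]) -> str:
    """Return the first non-empty label found along the fallback chain."""
    for lang in chain:
        label = labels.get(lang, "").strip()
        if label:
            return label
    # No language in the chain has a label: show the bare identifier,
    # which is the "QXXXXX" behaviour described in the discussion.
    return qid


# Hypothetical item that only carries a "mul" label.
sample_labels = {"mul": "Simmons"}

with_fallback = resolve_label("Q_EXAMPLE", sample_labels, ["fr", "mul", "en"])
without_fallback = resolve_label("Q_EXAMPLE", sample_labels, ["fr"])
print(with_fallback, without_fallback)
```

With "mul" in the chain the label is found; with French alone the caller gets the identifier back, which is the situation Deansfa describes.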

VIGNERON (talkcontribs)
Elizium23 (talkcontribs)

@VIGNERON, is that not what I proposed in my first post in this discussion???

VIGNERON (talkcontribs)

It was not really clear. Good, so you will wait, not remove labels anymore nor revert the last edits on Richard Simmons (Q498019) then, right?

Elizium23 (talkcontribs)

Personally, I would hope that both sides of this would hold off, but y'all seem convinced that this is currently the right thing to do, so, who am I to judge?

VIGNERON (talkcontribs)

Yes, the current system is suboptimal and, as the transition is soon, yes ideally people and bots should stop.

But the future system is not there yet, so applying a non-existent system is (at best) premature. And even when the new system is in place, the change will take quite some time; reverting without *a lot* of pedagogy will be as useless as tilting at windmills.

Reply to "Non-English labels"
Richard Arthur Norton (1958- ) (talkcontribs)

Is there any way you can automate creating entries for all the obits in the "Overlooked No More" series in the NYT? They are obits for people who have been overlooked by history.

Deansfa (talkcontribs)

Hi, we should already have all the "Overlooked No More" obituaries whose publication date is before 2024. I will wait until the end of the year to see how we can get the obituaries from 2024. Do you have an example of one from before 2024 that we missed?

Deansfa (talkcontribs)
Richard Arthur Norton (1958- ) (talkcontribs)

That is an amazing collection you have accumulated.

Reply to "Overlooked No More"

You were one of only three people to respond to the question

Richard Arthur Norton (1958- ) (talkcontribs)

You were one of only three people to respond to the question of whether people with Wikidata entries only need to be described "using serious and publicly available references" or if they need to meet the Wikipedia notability standard and need to be famous. The question has come up again at Wikidata:Requests for deletions#Q125118469

Reply to "You were one of only three people to respond to the question"
Moumou82 (talkcontribs)
Deansfa (talkcontribs)

Hello Moumou, I will do that this evening (EST) or tomorrow. I'll let you know when it's done.

Deansfa (talkcontribs)

Hello again, I have filled in the property for all the items.

Important detail: in my script I did not request the new URLs to check whether they correspond to a page (because the HTML page is hard to parse, harder than on some sites), so some URLs may land on a "Page Not Found". But in the vast majority of cases, I think it is fine.

I can probably fix this once I manage to parse the page correctly. I'll see whether I have time this weekend. Have a good end of the week.

SPARQL query to see the items that lack the new property: Result, Query

Moumou82 (talkcontribs)

Thank you very much!

I think the "Page Not Found" cases could occur with compound names where the site's version differs from Wikidata's, as here.

Reply to "Akadem"
Richard Arthur Norton (1958- ) (talkcontribs)

Do you have the ability to import all the obituaries from the New York Times archive?

Deansfa (talkcontribs)

It depends. It's always more difficult for old articles, like the ones you have to read in the "Times Machine".


For more "recent" articles, just looking quickly, it seems for 2018 and after, we can access the obituaries this way:

And we can iterate over the dates and probably get all the obituaries this way (calling the API, etc).


For obituaries previous to 2018, it seems it's there:

Same thing, I can probably loop over the dates in the URL and get the obituaries this way.
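
A minimal sketch of what looping over the dates could look like, assuming the listing URL embeds the date as year/month/day. The real URL pattern is not reproduced in this thread, so `URL_TEMPLATE` below is a placeholder on example.org and `daily_urls` is a hypothetical helper.

```python
from datetime import date, timedelta
from typing import Iterator

# Placeholder only: the actual URL pattern is not given in this discussion.
URL_TEMPLATE = "https://example.org/obituaries/{year}/{month:02d}/{day:02d}"


def daily_urls(start: date, end: date, template: str = URL_TEMPLATE) -> Iterator[str]:
    """Yield one URL per calendar day from start to end, both inclusive."""
    current = start
    while current <= end:
        yield template.format(year=current.year, month=current.month, day=current.day)
        current += timedelta(days=1)


# Example range crossing a month boundary.
urls = list(daily_urls(date(2018, 1, 30), date(2018, 2, 2)))
for url in urls:
    print(url)
```

Each generated URL would then be fetched and parsed; that step is left out because it depends on the site's actual response format.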


Previous to 2006, I need to do more research.


I can definitely do it, though it may take some work. Would you be interested?

Richard Arthur Norton (1958- ) (talkcontribs)

I will help any way I can; you just need to tell me what to do. Ancestry and FamilySearch have done a great job identifying obituaries at newspapers.com (Ancestry) and GenealogyBank (FamilySearch) and automatically assigning them to the entry for the deceased. They recently ran their program again looking for marriage announcements.

Deansfa (talkcontribs)

Hi Richard,

I started importing New York Times obituaries into Wikidata! I'm very happy with the result so far. I have finished the year 2022, and in the coming days and weeks I plan to do the previous years (2021, 2020, etc.).

Here is a SPARQL query to see all the obituaries for 2022:

I will improve the query to get the person's date of death and to compute the difference between the publication date and the date of death (so we can track the discrepancies).

As you can see, so far fewer than a dozen are associated with their main subject. Tonight I will run a script to do the association (matching on the name and the date of death), which will probably work in the majority of cases. The rest will probably have to be done manually. We'll see.
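
As an illustration only, matching on name and date of death could be sketched like this. It is not the script actually used: the helper names, the 60-day window, and the sample records are assumptions made for the example.

```python
import unicodedata
from datetime import date
from typing import Mapping, Optional, Tuple


def normalize_name(name: str) -> str:
    """Lowercase, strip accents, and collapse whitespace so names compare loosely."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.casefold().split())


def match_subject(
    obit_name: str,
    published: date,
    people: Mapping[str, Tuple[str, date]],
    max_days: int = 60,  # assumed window between death and publication
) -> Optional[str]:
    """Return the single person whose name matches and who died shortly before publication."""
    wanted = normalize_name(obit_name)
    candidates = [
        qid
        for qid, (name, died) in people.items()
        if normalize_name(name) == wanted and 0 <= (published - died).days <= max_days
    ]
    # Ambiguous or missing matches are left for manual review.
    return candidates[0] if len(candidates) == 1 else None


# Hypothetical records: identifier -> (label, date of death).
people = {
    "Q_A": ("José Example", date(2022, 3, 1)),
    "Q_B": ("Jose Example", date(1990, 1, 1)),
}
matched = match_subject("jose  example", date(2022, 3, 5), people)
print(matched)
```

Returning nothing when there are zero or several candidates keeps the automatic pass conservative. A fixed window would also skip obituaries published long after the death, so those would fall into the manual remainder.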

I'll keep you updated.

Richard Arthur Norton (1958- ) (talkcontribs)

Excellent, you should make this into a talk for the next Wikimania event. Have you given a talk before? Do you think you can automate creation of a Wikidata entry for the decedent, when we have no entry? I especially love the New York Times "Overlooked" series. I will try and add in Familysearch IDs for the people. One more thing to automate is the backlink: "described_by_source=". See: Q115213783

Richard Arthur Norton (1958- ) (talkcontribs)

If you go back far enough in the archive, the titles are in ALL CAPS. Are you going to convert them to sentence case? I see that for some journals we have the article title in ALL CAPS, and it is very distracting.

Deansfa (talkcontribs)

Hi Richard, thank you for bringing up that point. I will try to fix the capitalization of the title if that happens. But I won't go back that far: I can only process obituaries back to 2006 (right now, we have all obituaries from 2012 to 2022).

Here is a query that shows the number of obituaries by year (the number is high in 2020 because of COVID):

I'm not sure how to find obituaries from before 2006.

Reply to "New York Times archive"
Richard Arthur Norton (1958- ) (talkcontribs)

I have asked at Project Chat but there is no consensus. Take a look at Q90922820, there are two places for the image of the new article, which one should we standardize on? I value your opinion, since you are now the newspaper guy.

Reply to "Images of new articles"
Richard Arthur Norton (1958- ) (talkcontribs)

At John Fred Pierson (1839-1932) obituary (Q112567088) you attributed a 1932 obituary to an author born in 1947. I do not see any author attribution in the original article; where is the data coming from? The same goes for {{Q|Q105762166}}: the obituary is from 1966, the author started work at the NYT in 2000, and there is no byline in the article because it came from the Associated Press. I think your bot is off a bit. Does this mean your word count may be incorrect? At Q104907766, where we have the full text, the word count is 245, not the 1,091 you added.

Deansfa (talkcontribs)

Hi Richard, thank you for reporting the error. I'm sorry about that.

Just for info, I get the data from a New York Times API. Let's take an example: for the article When Progressives Embrace Hate (50317550), the API response is here: and part of the response looks like this: {"wordCount":1826,"id":100000005322886,"publishedDate":1501597856000,"publishedTimestamp":1501597856000,.. }
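
A small sketch of reading such a response, limited to the four fields quoted above (the rest of the payload is not shown here and is not guessed at). It assumes the timestamps are milliseconds since the Unix epoch, which their magnitude suggests but the thread does not confirm.

```python
import json
from datetime import datetime, timezone

# Only the fields quoted in the message above; the real response is longer.
raw = (
    '{"wordCount":1826,"id":100000005322886,'
    '"publishedDate":1501597856000,"publishedTimestamp":1501597856000}'
)

record = json.loads(raw)

# Assumption: the value is in milliseconds, so divide by 1000 to get seconds.
published = datetime.fromtimestamp(record["publishedDate"] / 1000, tz=timezone.utc)
word_count = record["wordCount"]

print(published.date().isoformat(), word_count)
```

Comparing the decoded publication date with the article's known date is a quick sanity check that a record really belongs to the item it is being written to.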


I now see what the issue is in my code. I will rerun it, see where I made errors, and fix them. Give me a couple of days. I'll keep you updated.


Deans

Richard Arthur Norton (1958- ) (talkcontribs)

All good stuff, once working properly! The NYT archive is an amazing source; IBM Watson used Wikipedia and the NYT archive as its main sources of information to win Jeopardy. At one time reCAPTCHA was being used to transcribe articles. https://www.nytimes.com/2011/03/29/science/29recaptcha.html

I wish the Associated Press had an ASCII archive of all their articles.

Reply to "{{Q|Q112567088}}"

publication date as a qualifier

Billinghurst (talkcontribs)

Hi. With reference to special:diff/1904523818 and the other like edits, the publication date and other aspects of article are best applied as qualifiers of the publication, rather than directly to the item. This is most notable when people ridiculously apply a page number directly to an item, rather than to the publication (Help:Qualifiers). You truly see the value of this approach when copying and moving these parts of an item.

Deansfa (talkcontribs)

I don't disagree, but no one follows this practice, so when you query articles by date, you miss those (or you need a very verbose query to handle this use case).

Deansfa (talkcontribs)

I hadn't noticed that you removed the date. You should not revert the change; the date can stand as a qualifier AND directly on the item. There are more than 4,000 NYT articles, and I don't see why these 40 articles should differ from the rest. It makes querying articles by date very challenging.

Reply to "publication date as a qualifier"
Thelemic Magick (talkcontribs)

Hi Deansfa,


is this a mockup for importing all WSJ articles as Wikidata items?

Deansfa (talkcontribs)

Hey Thelemic, I don't plan to import all WSJ items (at least not if it's not needed), so it's not really a mockup.

I definitely did some test imports in the past, but they were limited to one author. I currently import some articles from time to time, generally because I use them as references in Wikipedia articles (using templates like Template:Cite Q on wp:en or Template:Bibliographie on wp:fr).

If there's a need, I can definitely do bigger imports (by issue, by author, etc.). I'm currently working on something similar for the NYTimes and Bloomberg. I think there are lots of possibilities for the future.

Reply to "WSJ articles project"
FeralOink (talkcontribs)

Hello Deansfa.

I noticed that you created a Wikidata entry for a Wall Street Journal article that is referenced in the en Wikipedia article for Alameda Research. Could you check the Wikidata entry for that WSJ article? There is an error in one of the fields. It is because the WSJ doesn't use a consistent article URL naming convention. I don't recall what error is thrown in Wikidata, but you'll see it toward the end of the entry, where it is denoted with a circled question mark. Could you correct that please? I don't know how to do that.

Deansfa (talkcontribs)

Hi FeralOink, thank you for the message. So two WSJ-WD articles are referenced in the Alameda Research article: Alameda, FTX Executives Are Said to Have Known FTX Was Using Customer Funds (Q115184709) and Binance Walks Away From Deal to Rescue FTX (Q115184738). I don't think it creates any issue in the article, but you're right, there are some "exceptions"/"errors" thrown in the Wikidata element (around article ID (P2322)).

It's because for each of the WSJ articles I set an article ID (which is unique per article; it's an exposed and documented attribute) with the property article ID (P2322). The problem is that this property accepts only alphanumeric characters, while WSJ article IDs can look like "WP-WSJ-0000344745" (with dashes). The format of this ID has actually changed slightly over time (it was like /SB[0-9]+/ for most of the time).

I'm planning to create a property for WSJ articles ID, so I won't have to use article ID (P2322) in the future (and it won't throw errors). One temporary fix is to add dashes as an accepted character for article ID (P2322).
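
To make the constraint problem concrete, here is a rough sketch. The two patterns are assumptions written for illustration, not the actual format constraint stored on article ID (P2322), and "SB123" is a made-up value in the older /SB[0-9]+/ style.

```python
import re

# Assumed "alphanumeric only" rule, standing in for the current constraint.
ALNUM_ONLY = re.compile(r"[A-Za-z0-9]+")
# Assumed relaxed rule that also accepts dashes (the temporary fix).
ALNUM_OR_DASH = re.compile(r"[A-Za-z0-9-]+")


def is_valid(pattern: re.Pattern, value: str) -> bool:
    """True when the whole value matches the pattern."""
    return pattern.fullmatch(value) is not None


dashed_id = "WP-WSJ-0000344745"  # example ID from the discussion
old_style_id = "SB123"  # hypothetical value in the older style

print(is_valid(ALNUM_ONLY, dashed_id))
print(is_valid(ALNUM_OR_DASH, dashed_id))
print(is_valid(ALNUM_ONLY, old_style_id))
```

Under these assumed patterns the dashed ID only passes once dashes are allowed, while the older style passes either way, which matches the errors seen only on some items.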

FeralOink (talkcontribs)

Okay! That sounds like a good way to deal with it. I am a WSJ subscriber, so I know their URL conventions changed about 3 years ago and have some weird variations like the dashes but only sometimes. Thank you so much for looking into it. I saw the notes about regex but that is beyond what I felt like considering ;o)

Deansfa (talkcontribs)
Reply to "WSJ article re FTX and Alameda"