Dealing with web addresses that use a delimiter other than ?
I've been removing tracking IDs from website addresses that users post to my message boards and classifieds, which has admittedly gotten WAY more complicated than I intended. But I've recently run across a new one, so I'm curious how you guys and gals would suggest dealing with it.
In this example, the link looked like:
https://example.com/foo/bar|pcrid|391022977133|pkw||pmt||pdv|m|slid||product||pgrid|78378217177|ptaid||&pgrid=78378217177&ptaid=&source=WFP2019-DD-NATL-GD-US-BCON&subsource=78378217177---391022977133&refcode=WFP2019-DD-NATL-GD-US-BCON&refcode2=78378217177---391022977133&utm_source=Google&utm_campaign=WFP2019-DD-NATL-GD-US-BCON&utm_term=-391022977133&utm_medium=Display&gclid=EAIaIQobChMIpMLsjry95QIVQqFRCh3slA_eEAEYASAAEgLLc_D_BwE
I use Perl's URI::Find to find links in the text and convert each one to a ... tag, but it doesn't recognize the | delimiter, so only the part before the | becomes the link and I end up with:
https://example.com/foo/bar|pcrid|391022977133|pkw||pmt||pdv|m|slid||product||pgrid|78378217177|ptaid||&pgrid=78378217177&ptaid=&source=WFP2019-DD-NATL-GD-US-BCON&subsource=78378217177---391022977133&refcode=WFP2019-DD-NATL-GD-US-BCON&refcode2=78378217177---391022977133&utm_source=Google&utm_campaign=WFP2019-DD-NATL-GD-US-BCON&utm_term=-391022977133&utm_medium=Display&gclid=EAIaIQobChMIpMLsjry95QIVQqFRCh3slA_eEAEYASAAEgLLc_D_BwE
And since the rest of that isn't recognized as part of the link, my system doesn't remove any of those parameters, including the parts that are actually delimited by & (and it would usually remove all of them).
I'm kind of at a loss on how to handle this one. I could use a regex to find http, followed by anything that's not a space, until it gets to a |, and then remove everything after and including that |. That's a bit dangerous since someone could realistically use a | in the parameter value that I wouldn't want to remove, though.
I guess that the regex would look something like:
$text =~ s#\b(https?://[^\s])\|[^\s]*\b#$1#i;
What do you all think?
I say: Crikey.
[^\s] == \S unless there's something I am overlooking.
Did you mean [^\s]+ (i.e. \S+) ? That was my own question mark, but in fact you'd need to say
\S+?\|
in order to stop as soon as possible, i.e. before the first | character if there's more than one of them.
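For what it's worth, here is a minimal sketch of the substitution with that non-greedy quantifier in place. The sub name strip_after_pipe is made up for illustration, the sample URL is a shortened version of the one in the first post, and I've dropped the trailing \b and added /g so that every URL in the text is handled. It still has the weakness already mentioned: a legitimate | inside a parameter value gets cut off too.

```perl
use strict;
use warnings;

# Hypothetical helper (not from URI::Find): cut each URL at its first "|"
# and drop the rest of that whitespace-delimited token.
sub strip_after_pipe {
    my ($text) = @_;
    # \S+? is non-greedy, so the capture ends just before the first "|";
    # \S* then swallows everything up to the next whitespace.
    $text =~ s{\b(https?://\S+?)\|\S*}{$1}gi;
    return $text;
}

my $in = 'See https://example.com/foo/bar|pcrid|391022977133|pkw||&utm_source=Google now';
print strip_after_pipe($in), "\n";
# prints: See https://example.com/foo/bar now
```

URLs that contain no | are left alone, because \S+? cannot cross whitespace to reach a | somewhere later in the text.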
Seems like you'd want something like
[\w/.,~-]+
for the path part, to constrain it to things that can reasonably occur. If you could be certain that the | is the only no-no that will ever show up, the pattern would be more like
[^\s?|]+([?|]blahblah)?
where ? and | don't need to be escaped inside a character class, i.e. square brackets (but it does no harm if you do escape them).
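As a rough sketch of that second pattern in use, assuming ? and | are the only delimiters that can start the tail: the sub name split_url is hypothetical, and \S* stands in for the "blahblah" part. It splits a URL into the part before the first ? or | and whatever follows, so the tail could then be handed to the existing parameter-removal code.

```perl
use strict;
use warnings;

# Hypothetical helper: split a URL into the part before the first "?" or "|"
# and the tail after it (empty string when there is no tail).
sub split_url {
    my ($url) = @_;
    # [^\s?|]+ stops at the first "?" or "|"; the optional group captures the rest.
    if ($url =~ m{\A(https?://[^\s?|]+)(?:[?|](\S*))?\z}i) {
        return ($1, defined $2 ? $2 : '');
    }
    return;    # not a URL this pattern recognizes
}

my ($base, $tail) = split_url('https://example.com/foo/bar|pcrid|391022977133|pkw||&utm_source=Google');
print "$base\n";    # https://example.com/foo/bar
print "$tail\n";    # pcrid|391022977133|pkw||&utm_source=Google
```

An ordinary URL such as https://example.com/x?y=1 comes back as the base https://example.com/x with the tail y=1, so both delimiters go through the same code path.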
I suppose you have already considered the possibility of telling your code to disregard (not make links from) URLs that don't follow the rules :( Your forum members really are an unruly bunch, aren't they?
Disclaimer: I am about to disappear for at least 24 hours, possibly longer (thank you very much, PG&E), so if I said something hopelessly misleading you will have to remain misled.
Yikes! 24 hours without lucy24!
(get those folks out there to solve the power problems!)